DGR-MAE: Posterior Semantic Correction Masked Autoencoder for Fine-Grained Aircraft Recognition Under Cloud Occlusion

Liu, Cong; Gao, Quanwei; Song, Chenxi; Ouyang, Bo; Wang, Ruyu; Fan, Hongtao

doi:10.3390/rs18111852

by Cong Liu ¹, Quanwei Gao ², Chenxi Song ², Bo Ouyang ¹, Ruyu Wang ¹,* and Hongtao Fan

¹ College of Science, Northwest A&F University, Yangling 712100, China

² College of Information Engineering, Northwest A&F University, Yangling 712100, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(11), 1852; https://doi.org/10.3390/rs18111852

Submission received: 10 May 2026 / Revised: 31 May 2026 / Accepted: 3 June 2026 / Published: 4 June 2026

(This article belongs to the Section Remote Sensing Image Processing)

Highlights

What are the main findings?

A Differentiated Guided Reconstruction Masked Autoencoder (DGR-MAE) is proposed for cloud-occluded aircraft recognition by incorporating global attention scoring and posterior semantic correction into masked image modeling.
DGR-MAE achieves the best performance among the evaluated self-supervised learning methods on the ASRAir benchmark, reaching 74.28% Top-1 accuracy without increasing inference complexity.

What are the implications of the main findings?

Robust semantic representation learning can improve recognition robustness under cloud-induced information loss by directly learning from incomplete observations.
The findings highlight the potential of representation learning-based approaches for remote sensing target recognition under cloud occlusion.

Abstract

Fine-grained aircraft recognition in optical remote sensing imagery remains highly challenging under cloud occlusion, as visibility degradation causes structural information loss and weakens discriminative representation learning. To address this issue, we propose DGR-MAE, a teacher–student masked image modeling framework based on posterior semantic correction for robust representation learning under incomplete observations. Unlike existing semantic-guided masking methods that modify token visibility during input construction, DGR-MAE preserves high-ratio stochastic masking in the student branch and introduces semantic correction after visibility degradation through teacher-guided differential reconstruction. Specifically, a semantic-aware teacher branch estimates patch-level importance to partition masked regions into semantic-critical and non-critical subsets, enabling region-dependent reconstruction prioritization. A collaborative feature refinement mechanism is further incorporated to enhance contextual consistency and structural reasoning during pretraining. To support controlled evaluation, we construct the ASRAir benchmark with hierarchical cloud occlusion levels. Experimental results show that DGR-MAE achieves 74.28% Top-1 accuracy on ASRAir-Occ, the best Top-1 performance among the representative self-supervised baselines evaluated, while maintaining competitive Top-5 accuracy. In particular, it demonstrates substantially improved robustness under moderate-to-severe cloud occlusion, validating the effectiveness of posterior semantic correction for remote sensing representation learning under visibility degradation.

Keywords:

remote sensing imagery; fine-grained aircraft recognition; cloud occlusion; masked image modeling; posterior semantic correction

1. Introduction

Fine-grained aircraft recognition is a representative fine-grained visual classification task and plays an important role in both military applications [1] and civilian scenarios [2,3,4]. With the rapid development of remote sensing imaging technologies and large-scale data acquisition, the spatial resolution and data scale of optical remote sensing imagery have continuously improved, enabling clearer characterization of aircraft appearance and structural details. However, cloud occlusion remains one of the major challenges limiting the performance of optical remote sensing interpretation. Previous studies have shown that approximately 60% of the Earth’s surface is persistently covered by clouds, with an average cloud coverage of about 54% over land areas [5]. In practical observation scenarios, aircraft targets are often affected by varying degrees of cloud occlusion. Thin clouds usually reduce image contrast and weaken edge and texture information, while thick clouds may cause severe structural degradation and missing target regions, thereby significantly increasing the difficulty of fine-grained recognition.

From the perspective of visual representation learning, cloud occlusion essentially leads to partial observation loss and structural information degradation. When key aircraft components, such as wings, fuselage contours, and tail structures, are occluded, discriminative details and geometric structures become incomplete, resulting in degraded modeling capability for overall morphology and fine-grained category distinctions. Such incomplete observations caused by cloud occlusion can be naturally viewed as a structural missing-information problem in representation learning. To address this issue, Masked Image Modeling (MIM) has emerged as a representative self-supervised learning paradigm [6,7,8,9,10,11,12]. By randomly masking image patches and reconstructing missing regions, MIM encourages models to infer absent content from contextual information, thereby strengthening both local texture representation and global semantic dependency modeling. Among existing approaches, Masked Autoencoder (MAE) [8] adopts a high-ratio random masking strategy and asymmetric encoder–decoder architecture, enabling the encoder to learn structurally aware representations under sparse observations. Although MAE is not specifically designed for cloud occlusion, its incomplete-observation learning paradigm exhibits strong consistency with the stochastic information loss introduced by cloud interference. Nevertheless, effectively modeling semantic uncertainty under cloud occlusion remains challenging, since discriminative target structures may themselves be partially missing and cannot be reliably handled by purely random masking or prior semantic preservation strategies.

Recent advances in semantic-guided masked image modeling have demonstrated that introducing semantic priors can improve reconstruction efficiency and representation quality. However, most existing methods inject semantic guidance before masking by preferentially preserving semantically informative regions during encoder input construction. Such strategies implicitly rely on a prior-visibility assumption, namely that semantically critical regions remain observable and can therefore be explicitly retained. This assumption becomes unreliable in cloud-occluded remote sensing scenarios, where cloud coverage is stochastic and physically independent of semantic content. As a result, highly discriminative aircraft structures may themselves be partially or entirely obscured, making prior semantic preservation suboptimal for realistic incomplete-observation learning.

To address this challenge, this paper proposes DGR-MAE, a teacher–student collaborative masked autoencoding framework based on a posterior semantic reconstruction paradigm. Different from existing semantic-guided masking strategies, DGR-MAE preserves the standard high-ratio random masking process in the student branch and introduces semantic guidance only after masking through teacher-estimated semantic partitioning of masked regions. Specifically, the teacher branch identifies semantically informative structural regions and uses them to impose differential reconstruction constraints on masked student tokens according to semantic importance. In this way, semantic correction is performed after visibility uncertainty has already been instantiated, enabling robust representation learning under incomplete observations while avoiding strong observability assumptions. To further enhance semantic consistency between the two branches, a representation-level alignment objective is introduced to facilitate collaborative optimization in both reconstruction and feature spaces.

The main contributions of this paper are summarized as follows:

We propose a Differentiated Guided Reconstruction Masked Autoencoder (DGR-MAE) framework for fine-grained aircraft recognition under cloud occlusion. Through a teacher–student collaborative learning paradigm, the proposed framework formulates representation learning under incomplete observations as a posterior semantic reconstruction process.
We introduce a posterior semantic correction-driven differentiated reconstruction strategy, which applies region-dependent reconstruction constraints to different masked regions according to semantic importance estimated by the teacher network, thereby enhancing the representation capability of critical structural features.
We construct the ASRAir benchmark for cloud-occluded aircraft recognition, consisting of a cloud-free training subset, a cloud-occluded evaluation subset, and a multi-level occlusion severity subset, enabling systematic assessment of model robustness under varying cloud interference conditions.

The remainder of this paper is organized as follows. Section 2 reviews related work on fine-grained aircraft recognition and self-supervised learning (SSL), including MIM and cloud-occluded remote sensing imagery. Section 3 presents the proposed DGR-MAE framework in detail, including the teacher–student architecture and the posterior semantic reconstruction mechanism. Section 4 describes the experimental settings and reports comprehensive results, including ablation studies, to evaluate the effectiveness of the proposed method. Section 5 discusses the advantages, limitations, and future work of the proposed method. Finally, Section 6 concludes the paper and discusses future research directions.

2. Related Work

To better understand the performance of existing methods in optical remote sensing aircraft recognition and self-supervised visual representation learning, as well as their limitations under cloud occlusion conditions, this section provides a systematic review of related work. The discussion is organized into four aspects: aircraft recognition in optical remote sensing imagery, self-supervised learning and masked image modeling, cloud occlusion in remote sensing representation learning, and a comparison with related semantic-guided masked image modeling methods.

2.1. Aircraft Recognition in Optical Remote Sensing Imagery

Early studies on aircraft recognition in optical remote sensing imagery mainly relied on hand-crafted low-level visual features combined with traditional classifiers. However, such methods heavily depend on expert knowledge and exhibit limited generalization capability, leading to unstable performance in practical applications [1]. With the development of deep learning, convolutional neural networks (CNNs) have been widely introduced into this task, significantly improving recognition performance through end-to-end feature learning. For example, Guan et al. [13] proposed FS2ANet, which enhances fine-grained aircraft recognition through multi-scale feature fusion and attention mechanisms. Hu et al. [14] further demonstrated the effectiveness of CNN-based unified detection and recognition frameworks in handling scale variations, pose changes, and complex backgrounds. Moreover, related studies in remote sensing target analysis have further demonstrated the effectiveness of multi-scale feature fusion, rotation-invariant representation learning, and modality compensation strategies for improving robustness under complex observation conditions [15,16,17].

However, CNN-based methods are inherently limited by local receptive fields and fail to fully capture global structural relationships. To address this limitation, Vision Transformer (ViT) [18] has been introduced into remote sensing image analysis. Bazi et al. [19] showed that ViT can effectively model global contextual information for remote sensing classification tasks. Subsequently, He et al. [20] proposed an adaptive self-supervised transformer, further improving representation learning in remote sensing imagery. Jiang et al. [21] introduced learnable meta tokens to enhance efficiency and representation capability. These studies demonstrate that transformer-based methods provide a promising paradigm for modeling structural relationships in aircraft recognition tasks.

Nevertheless, most existing methods are trained and evaluated under relatively clean imaging conditions with complete structural information [3]. Under cloud occlusion, critical aircraft components may be partially missing or degraded, leading to significant performance deterioration.

2.2. Self-Supervised Learning and Masked Image Modeling

Self-supervised learning (SSL) has become a prominent research direction in remote sensing [22,23]. Among SSL methods, masked image modeling (MIM) has emerged as a representative generative paradigm. By reconstructing masked input regions, MIM pretrains Vision Transformers (ViTs), with representative methods including BEiT [6], MaskFeat [7], MAE [8], SimMIM [9], PeCo [10], and MVP [11].

Among them, MAE [8] has gained significant attention due to its high masking ratio and asymmetric encoder–decoder architecture, achieving strong performance while maintaining computational efficiency. Inspired by MAE, many remote sensing studies have extended masked modeling techniques. For instance, SatViT [24] demonstrates the effectiveness of transformer-based SSL in earth observation tasks. FGMAE [25] enhances feature learning through guided masking. Cross-scale MAE [26] explicitly models multi-scale information, while SS-MAE [22] incorporates spatial-spectral dependencies for multisource data modeling. SatMAE [27] and Scale-MAE [23] further extend MAE to multi-spectral and scale-aware scenarios.

Overall, existing MIM-based methods primarily rely on content-agnostic random masking strategies, which limits their robustness under complex real-world conditions such as cloud occlusion.

2.3. Cloud Occlusion in Remote Sensing Representation Learning

Remote sensing imagery is highly affected by atmospheric conditions, especially cloud occlusion, which significantly reduces image quality and information completeness. Existing studies on cloud removal mainly fall into two categories: information-driven methods and generation-based methods.

Information-driven methods leverage multi-temporal or multi-source data to reconstruct occluded regions [28,29,30,31]. Generation-based methods focus on learning direct mappings for cloud removal or background reconstruction from single or multi-image inputs [32,33,34,35].

Although these methods achieve promising results, they primarily focus on pixel-level or structure-level restoration. However, due to the inherent randomness and spatial uncertainty of cloud coverage in real-world scenarios, fully recovering occluded regions is often infeasible. Therefore, this paper reconsiders cloud occlusion from a representation learning perspective, aiming to learn robust and discriminative features under incomplete observations rather than explicitly restoring missing information.

2.4. Comparison with Related Semantic-Guided MIM Methods

Recent advances in masked image modeling have explored semantic-aware masking and teacher-guided representation learning to improve reconstruction efficiency and representation discriminability. Representative methods such as SemMAE [12] and AttMask [36] estimate token importance prior to masking and preferentially preserve semantically informative regions during encoder input construction, while teacher–student-based approaches such as DMAE [37] enhance student representation learning through feature-level knowledge transfer. Although effective, these methods generally rely on an implicit prior-visibility assumption, namely that semantically critical regions remain observable and can therefore be preferentially retained during encoding.

However, this assumption becomes less reliable in cloud-occluded remote sensing scenarios. Since cloud coverage is stochastic and physically independent of semantic content, highly discriminative target structures may themselves be partially or entirely obscured. Under such conditions, enforcing semantic-aware visibility priors may introduce observation bias and reduce robustness to realistic incomplete observations.

Different from these approaches, the proposed DGR-MAE preserves the standard high-ratio random masking strategy in the student branch and introduces semantic guidance only after masking through teacher-estimated semantic partitioning of masked regions. This establishes a posterior semantic reconstruction mechanism rather than a prior semantic masking strategy. By introducing semantic correction after visibility uncertainty has already been instantiated, DGR-MAE enables robust representation learning under incomplete observations while avoiding the observability assumptions required by prior semantic-guided masking methods, making it particularly suitable for cloud-occluded remote sensing representation learning.

3. Methodology

3.1. Overall Framework

As shown in Figure 1, the proposed DGR-MAE is formulated as a teacher–student collaborative masked autoencoding framework under a posterior semantic reconstruction paradigm for cloud-occluded representation learning. Unlike prior semantic-guided masked image modeling methods that implicitly assume semantic reasoning can be performed under uncorrupted or weakly corrupted observations, DGR-MAE explicitly formulates representation learning as a conditional inference problem under stochastic visibility degradation.

In particular, instead of modeling semantic guidance as a function of the input image alone, we consider semantic reasoning under both image content and its visibility condition. This leads to a posterior semantic formulation:

$$P(S \mid X, M), \tag{1}$$

where X denotes the input image and M represents a stochastic visibility realization induced by masking. In contrast, prior methods typically approximate:

$$P(S \mid X), \tag{2}$$

which assumes that semantic importance can be inferred independently of visibility corruption. This difference fundamentally shifts semantic guidance from a pre-visibility intervention to a visibility-conditioned inference process.

To instantiate this formulation, DGR-MAE preserves standard high-ratio random masking in the student branch, while semantic guidance is introduced only after the visibility uncertainty induced by masking has been realized. This design enables semantic correction under incomplete observations while avoiding the strong observability assumptions required by prior semantic masking strategies.

Specifically, both the teacher and student branches operate on the same input image, which ensures semantic consistency at the image level and avoids multi-view augmentation discrepancies. However, they do not share identical masking realizations. Instead, the student branch adopts a stochastic masking process $M^{(s)}$, while the teacher branch constructs a semantically induced visibility structure $M^{(t)}$ based on patch-level importance estimation. In this setting, M is interpreted as a visibility-dependent random variable rather than a shared token-level mask.

Conditioned on their respective visibility realizations, the teacher branch re-evaluates semantic importance over image regions and extracts structural priors from informative visible tokens, while the student branch learns robust latent representations under severe partial-observation constraints. Importantly, semantic reasoning is performed under correlated but distinct visibility realizations, rather than under an identical masking configuration.

Based on this formulation, the teacher induces a semantic partition over masked student tokens by jointly considering semantic relevance and the stochastic visibility realization, resulting in semantic-critical and non-critical subsets. This partition enables region-dependent reconstruction supervision, transforming the learning objective from uniform masked reconstruction into posterior semantic correction under visibility-conditioned uncertainty. Meanwhile, representation-level teacher–student alignment further enforces consistency between representations learned under shared semantic structure but heterogeneous visibility conditions. Overall, DGR-MAE reframes representation learning from deterministic reconstruction into conditional learning under stochastic visibility degradation, enabling robust and discriminative feature learning for cloud-occluded aircraft recognition.

3.2. Semantic Importance-Aware Differential Visibility Encoding

Within the proposed framework, both branches operate on the same input image to ensure semantic consistency at the data level and avoid discrepancies introduced by multi-view augmentation. However, they are exposed to distinct visibility realizations, leading to heterogeneous observation conditions for representation learning.

Specifically, the teacher branch constructs a semantically structured visibility pattern, while the student branch is subjected to stochastic visibility degradation through random masking. Importantly, these two visibility patterns are independently generated and do not share identical masking realizations.

Given an input image I, it is first partitioned into N non-overlapping patches:

$$X = \{x_i\}_{i=1}^{N}, \tag{3}$$

where N denotes the total number of patches.

A semantic scoring function is introduced in the teacher branch to estimate patch-level importance. We instantiate this scoring function using a lightweight Vision Transformer-based architecture, termed Global Attention Scoring (GAS).

Given the input patch set $X = \{x_i\}_{i=1}^{N}$, we first obtain patch embeddings:

$$z^{(0)} = \mathrm{PatchEmbed}(X). \tag{4}$$

The embeddings are then processed by a Transformer encoder to model global context:

$$z^{(L)} = \mathrm{Transformer}(z^{(0)}), \tag{5}$$

followed by a residual MLP refinement:

$$\tilde{z} = z^{(L)} + \mathrm{MLP}(z^{(L)}). \tag{6}$$

Finally, patch-level importance scores are predicted via a linear projection:

$$s_i = W_s \tilde{z}_i + b_s, \quad i = 1, \dots, N. \tag{7}$$

We denote this process as:

$$\mathrm{GAS}(X) = \{s_i\}_{i=1}^{N}. \tag{8}$$

The GAS module is trained jointly with the overall framework under the reconstruction objective of the student branch. Importantly, it is not updated via exponential moving average (EMA), but optimized through end-to-end gradient backpropagation from the reconstruction loss. Specifically, the parameters of GAS are included in the trainable optimization set, and the predicted importance scores participate in the masking operation and further affect the reconstruction loss computation. Consequently, gradients can be propagated back to the GAS module through the computational graph, enabling direct optimization of semantic importance estimation under the reconstruction objective.

Based on these scores, the teacher constructs a semantically guided visible token set:

$$V^{(t)} = \mathrm{TopK}(\{s_i\}_{i=1}^{N}), \tag{9}$$

with $K = \lfloor N/4 \rfloor$, and the corresponding masked set is:

$$M^{(t)} = \{1, 2, \dots, N\} \setminus V^{(t)}. \tag{10}$$

In contrast, the student branch adopts a stochastic masking strategy independent of semantic scores:

$$V^{(s)} \subseteq \{1, 2, \dots, N\}, \quad |V^{(s)}| = \lfloor (1 - r_{\mathrm{mask}}) N \rfloor, \tag{11}$$

where $r_{\mathrm{mask}} = 0.75$, and the corresponding masked set is:

$$M^{(s)} = \{1, 2, \dots, N\} \setminus V^{(s)}. \tag{12}$$

Unlike prior semantic-guided masking strategies that perform importance-aware token selection prior to corruption, the proposed formulation explicitly decouples semantic estimation from stochastic visibility generation. The teacher branch produces a semantically structured visibility pattern conditioned on learned importance, whereas the student branch observes a randomly corrupted version of the same input without semantic bias.

This asymmetric visibility design establishes a cross-visibility learning paradigm, where semantic structure inferred from a semantically organized view is transferred to a stochastically corrupted view. Consequently, representation learning is driven by alignment between semantically structured and randomly degraded observations, enabling robust feature learning under severe cloud-induced information loss.
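The two visibility patterns of Eqs. (9)–(12) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the patch count `N = 196` (a 14 × 14 grid), the Gaussian stand-in for the GAS scores, and the helper names `teacher_visible` and `student_visible` are assumptions made for illustration only.

```python
import math
import random


def teacher_visible(scores, k):
    """Indices of the k highest-scoring patches (TopK in Eq. 9)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def student_visible(n, mask_ratio, rng):
    """Uniformly random visible subset of size floor((1 - r_mask) * n) (Eq. 11)."""
    n_keep = math.floor((1 - mask_ratio) * n)
    return set(rng.sample(range(n), n_keep))


rng = random.Random(0)
N = 196  # assumed patch count (14 x 14 grid); not stated in the text
scores = [rng.gauss(0.0, 1.0) for _ in range(N)]  # hypothetical stand-in for GAS outputs

all_idx = set(range(N))
V_t = teacher_visible(scores, N // 4)  # K = floor(N / 4)
M_t = all_idx - V_t                    # Eq. 10
V_s = student_visible(N, 0.75, rng)    # r_mask = 0.75, independent of the scores
M_s = all_idx - V_s                    # Eq. 12

print(len(V_t), len(M_t), len(V_s), len(M_s))
```

Under these assumptions both branches keep 49 of 196 patches and mask 147, but the teacher keeps the highest-scoring patches while the student keeps a uniformly random subset.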

3.3. Semantic-Guided Differential Reconstruction Mechanism

Building upon the asymmetric visibility encoding, we formulate reconstruction as a conditional recovery problem under stochastic visibility degradation. Instead of uniformly treating all masked tokens as independent reconstruction targets, we explicitly condition reconstruction supervision on both semantic structure inferred from the teacher branch and the realized visibility configurations. This enables adaptive recovery under visibility-conditioned uncertainty rather than under a fixed semantic prior.

Importantly, although teacher and student branches operate under different visibility realizations, their token indices are defined over the same underlying patch space, enabling alignment in a shared structural domain.

Given the teacher-visible token set $V^{(t)}$, the teacher masked set $M^{(t)}$, and the student masked set $M^{(s)}$, we define a semantic partition over student masked tokens induced jointly by semantic relevance and heterogeneous visibility conditions.

The semantically critical masked subset is defined as:

$$M_c = V^{(t)} \cap M^{(s)}, \tag{13}$$

which corresponds to tokens that are semantically informative under the teacher’s visibility realization but are simultaneously occluded under the student’s stochastic masking process. These regions represent high-uncertainty semantic losses requiring stronger reconstruction constraints.

The semantically non-critical subset is defined as:

$$M_n = M^{(t)} \cap M^{(s)}, \tag{14}$$

which corresponds to tokens that are considered less informative under the teacher’s visibility structure and are also masked in the student branch.

Accordingly, the student masked set is decomposed as:

$$M^{(s)} = M_c \cup M_n, \quad M_c \cap M_n = \varnothing. \tag{15}$$

This formulation transforms standard masked reconstruction into a posterior semantic allocation process under heterogeneous visibility realizations. Rather than enforcing uniform reconstruction over all masked regions, DGR-MAE adaptively redistributes supervision according to the interaction between semantic structure and stochastic masking, thereby enabling selective recovery of structurally informative regions while preserving contextual consistency across non-critical regions. This conditional reconstruction mechanism distinguishes DGR-MAE from prior semantic-guided masked image modeling approaches.
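The partition in Eqs. (13)–(15) reduces to two set intersections over the shared patch index space. The sketch below is illustrative only: the index sets are random placeholders rather than outputs of a trained teacher, and `partition_masked` and `N = 196` are hypothetical choices.

```python
import random


def partition_masked(V_t, M_t, M_s):
    """Split the student masked set into critical and non-critical subsets."""
    M_c = V_t & M_s  # Eq. 13: teacher-visible but hidden from the student
    M_n = M_t & M_s  # Eq. 14: masked in both branches
    return M_c, M_n


rng = random.Random(1)
N = 196  # assumed patch count
all_idx = set(range(N))

V_t = set(rng.sample(range(N), N // 4))  # placeholder for TopK of GAS scores
M_t = all_idx - V_t
V_s = set(rng.sample(range(N), N // 4))  # student keeps 25% at r_mask = 0.75
M_s = all_idx - V_s

M_c, M_n = partition_masked(V_t, M_t, M_s)

# Eq. 15: the two subsets are disjoint and together cover the student masked set.
assert M_c | M_n == M_s
assert M_c.isdisjoint(M_n)
print(len(M_c), len(M_n))
```

If the two masks are drawn independently, as in this placeholder setup, the expected size of the critical subset is K · |M^(s)| / N = 49 · 147 / 196 ≈ 36.75, so roughly a quarter of the student's masked tokens would receive the larger reconstruction weight.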

3.4. Semantic-Guided Weighted Reconstruction and Teacher–Student Joint Optimization

The semantic partitioning defined above induces a conditional optimization objective under heterogeneous visibility realizations, where reconstruction and representation alignment are jointly optimized in a visibility-conditioned semantic space. Instead of treating all masked tokens as uniformly independent prediction targets, the learning objective explicitly adapts supervision according to both semantic structure inferred from the teacher branch and the realized masking configuration.

The overall training objective is formulated as:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{rec}} + \lambda_2 \mathcal{L}_{\mathrm{align}}, \tag{16}$$

where $\lambda_1 = 0.9$ and $\lambda_2 = 0.1$ balance reconstruction fidelity and representation-level consistency.

3.4.1. Semantic-Guided Differential Weighted Reconstruction Loss

Given the input patch sequence $\{x_i\}_{i=1}^{N}$, the decoder predicts reconstructed patches $\hat{x}_i$, and reconstruction error is measured via:

$$l_i = \lVert \hat{x}_i - x_i \rVert^2. \tag{17}$$

In contrast to standard MAE formulations that assume uniform reconstruction over masked regions, we perform reconstruction under a visibility-conditioned semantic weighting scheme. Each masked token is assigned a weight derived from the interaction between teacher-inferred semantic importance and the student masking realization. Specifically, the reconstruction weight is defined as:

$$w_i = \begin{cases} a, & i \in M_c, \\ b, & i \in M_n, \end{cases} \tag{18}$$

where a and b are predefined weighting hyperparameters satisfying a > b, indicating that semantically critical masked regions receive stronger reconstruction supervision than semantically non-critical regions. In this work, the default configuration is set to a = 0.6 and b = 0.4, whose effectiveness has been validated through the ablation study. The reconstruction loss is therefore expressed as:

$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{|M^{(s)}|} \sum_{i \in M^{(s)}} w_i \cdot l_i. \tag{19}$$

This formulation transforms standard masked reconstruction into a posterior semantic recovery process under stochastic visibility uncertainty, where supervision is adaptively redistributed over semantically differentiated masked regions rather than uniformly applied.
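A small worked example may help make Eqs. (17)–(19) concrete. The sketch below uses toy two-value "patches" and a hypothetical helper `weighted_reconstruction_loss`; only the weights a = 0.6 and b = 0.4 come from the text, and the per-patch error is the squared L2 norm exactly as written in Eq. (17).

```python
def weighted_reconstruction_loss(pred, target, M_c, M_n, a=0.6, b=0.4):
    """Average of w_i * ||pred_i - target_i||^2 over the student masked set."""
    masked = sorted(M_c | M_n)
    if not masked:
        raise ValueError("the student masked set is empty")
    total = 0.0
    for i in masked:
        l_i = sum((p - t) ** 2 for p, t in zip(pred[i], target[i]))  # Eq. 17
        w_i = a if i in M_c else b                                   # Eq. 18
        total += w_i * l_i
    return total / len(masked)                                       # Eq. 19


# Toy data: four patches of two values each; patch 0 is visible to the student.
target = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
pred = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
M_c = {1}     # semantically critical masked patch
M_n = {2, 3}  # non-critical masked patches

loss = weighted_reconstruction_loss(pred, target, M_c, M_n)
print(round(loss, 4))  # 0.7333
```

Here the per-patch errors are 1, 4, and 0, so the loss is (0.6 · 1 + 0.4 · 4 + 0.4 · 0) / 3 ≈ 0.7333. The same error on a critical patch contributes 1.5 times as much as on a non-critical one.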

3.4.2. Representation-Level Teacher–Student Alignment Loss

To further regularize feature learning under heterogeneous visibility conditions, we enforce representation-level consistency between teacher and student encoders. Unlike reconstruction, which operates at the patch level, this objective operates in the global representation space to ensure semantic invariance across different visibility realizations.

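After projection and normalization, the student and teacher representations are given by:

$$z_s = \frac{g_s(f_s)}{\lVert g_s(f_s) \rVert}, \quad z_t = \frac{g_t(f_t)}{\lVert g_t(f_t) \rVert}, \tag{20}$$

where $f_s$ and $f_t$ denote the student and teacher encoder features, and $g_s$ and $g_t$ denote the corresponding projection heads.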

A feature center is introduced to stabilize optimization dynamics:

$$z_t' = z_t - c, \tag{21}$$

where c is updated via exponential moving average during training.

We note that, instead of directly matching raw features, we model representation consistency in a probability space to improve robustness under visibility-induced noise. Specifically, the representations are mapped into temperature-scaled distributions:

$$p^{(t)} = \mathrm{softmax}(z_t' / T), \quad q^{(s)} = \mathrm{softmax}(z_s / T), \tag{22}$$

with temperature T = 0.07, and the alignment objective is defined as:

$$\mathcal{L}_{\mathrm{align}} = -\sum_{j=1}^{d} p_j^{(t)} \log q_j^{(s)}, \tag{23}$$

which enforces consistency between representations learned under different visibility realizations while preserving shared semantic structure. The teacher parameters are updated via exponential moving average:

$$\theta_t \leftarrow m \theta_t + (1 - m) \theta_s, \tag{24}$$

where m = 0.9 controls the momentum update. Overall, this joint optimization framework enables DGR-MAE to learn robust representations by jointly modeling reconstruction and representation alignment under heterogeneous visibility realizations, rather than optimizing each objective independently.
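The alignment objective and momentum update of Eqs. (20)–(24) can be sketched in plain Python as follows. This is an illustrative reading rather than the authors' code: the inputs are assumed to be already-projected features g(f), the center `c` is passed in as a fixed vector because its update rule is not given in the text, and all function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Divide a vector by its Euclidean norm (Eq. 20)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(v, temperature):
    """Numerically stable log of the temperature-scaled softmax."""
    scaled = [x / temperature for x in v]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def alignment_loss(student_feat, teacher_feat, center, temperature=0.07):
    """Cross-entropy between teacher and student distributions (Eqs. 21-23)."""
    z_s = l2_normalize(student_feat)
    z_t = [x - c for x, c in zip(l2_normalize(teacher_feat), center)]  # Eq. 21
    p_t = [math.exp(lp) for lp in log_softmax(z_t, temperature)]       # Eq. 22
    log_q_s = log_softmax(z_s, temperature)
    return -sum(p * lq for p, lq in zip(p_t, log_q_s))                 # Eq. 23


def ema_update(theta_t, theta_s, m=0.9):
    """Momentum update of the teacher parameters (Eq. 24)."""
    return [m * t + (1 - m) * s for t, s in zip(theta_t, theta_s)]


center = [0.0, 0.0, 0.0]
matched = alignment_loss([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], center)
mismatched = alignment_loss([0.0, 1.0, 0.0], [1.0, 0.0, 0.0], center)
print(matched < mismatched)  # True
```

With T = 0.07 the distributions are sharply peaked, so a student feature pointing along a different axis from the teacher's incurs a much larger loss than a matching one.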

4. Experiments

4.1. Experimental Settings

All experiments are conducted on a single NVIDIA RTX 5090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 64 GB memory. Downstream evaluation is performed using ViT-Base fine-tuned on ASRAir-Occ and tested under the ASRAir-Sev evaluation protocol to assess robustness under increasing occlusion severity. ASRAir-Sev is constructed by partitioning ASRAir-Occ into severity levels according to the image-level occlusion ratio. All occluded subsets inherit the original train/validation splits defined on ASRAir-Clean, ensuring no data leakage.

The implementation is based on PyTorch (v2.7.1, https://pytorch.org, accessed on 31 May 2026) and timm (v0.4.12, https://github.com/huggingface/pytorch-image-models, accessed on 31 May 2026).

4.2. Dataset Construction

We construct a high-resolution optical remote sensing benchmark named ASRAir for systematically evaluating representation robustness under stochastic visibility degradation caused by cloud occlusion. The dataset consists of three subsets, namely ASRAir-Clean, ASRAir-Occ, and ASRAir-Sev, as shown in Figure 2.

4.2.1. ASRAir-Clean

ASRAir-Clean contains cloud-free aircraft images and serves as the source dataset for controlled occlusion synthesis. It includes 24,189 images across 50 aircraft categories, with 18,823 training samples and 5366 validation samples. The category distribution is moderately imbalanced, with per-class sample counts ranging from 350 to 953. All categories are fully annotated and clearly separable under clear-sky conditions, ensuring reliable semantic supervision, as shown in Figure 3.

4.2.2. ASRAir-Occ Construction

ASRAir-Occ is generated by applying cloud synthesis to ASRAir-Clean while preserving semantic labels. Cloud occlusion is modeled via an alpha blending formulation:

I_{occ}(x) = (1 - α(x)) I_{clean}(x) + α(x) I_{cloud}(x),

(25)

where I_{clean} denotes the original image, I_{cloud} is sampled from a cloud texture pool, and α(x) \in [0, 1] represents spatially varying cloud opacity.
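The blending rule in Equation (25) can be sketched pixel by pixel as follows. The function name and the use of nested lists for single-channel images are assumptions for illustration; the cloud texture pool and the procedure for generating α(x) are not reproduced here.

```python
def blend(clean, cloud, alpha):
    """Apply Eq. (25) pixel-wise to equally sized 2-D grids (lists of rows).

    alpha values are expected to lie in [0, 1]: 0 keeps the clean pixel,
    1 replaces it with the cloud pixel.
    """
    return [
        [(1.0 - a) * c + a * k for c, k, a in zip(c_row, k_row, a_row)]
        for c_row, k_row, a_row in zip(clean, cloud, alpha)
    ]
```

For example, a clean pixel of 100 and a cloud pixel of 200 blended with α = 0.5 gives 150, the midpoint of the two intensities.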

This construction follows a controlled robustness evaluation protocol rather than a strict physical atmospheric simulation, motivated by the difficulty of obtaining large-scale real cloud-occluded aircraft datasets with reliable fine-grained annotations. In real remote sensing scenarios, severe cloud coverage may obscure aircraft structures, making accurate labeling and visibility control unreliable. Therefore, we adopt a controlled protocol that ensures (i) consistent annotation under occlusion, (ii) controllable degradation intensity, and (iii) systematic evaluation under progressive visibility reduction.

ASRAir-Occ contains 5773 training images and 1668 validation images, providing supervision under occluded conditions. Compared with ASRAir-Clean, it exhibits a more balanced class distribution, as illustrated in Figure 4.

To further characterize the synthesized occlusion process, we analyze the statistical properties of the generated data, as shown in Figure 5. The alpha blending coefficient, scaled to a 0–99 integer range for statistical analysis, spans the full range with a mean of 48.0 and a standard deviation of 34.7. Image brightness ranges from 0 to 255 with a mean of 173.9. A positive correlation is observed between alpha values and brightness in the synthesized data.

4.2.3. ASRAir-Sev

To evaluate robustness under progressively increasing occlusion severity, we construct ASRAir-Sev based on the image-level occlusion ratio ρ, defined as the spatial mean of the blending coefficient map α(x), where α(x) is the pixel-wise blending coefficient inherited from the synthesis process:

ρ = E_{x} [α(x)], ρ \in [0, 1].

(26)

The severity level is obtained by quantizing ρ into 10 uniform bins:

L = ⌊10 ρ⌋ .

(27)

This formulation discretizes continuous cloud coverage into severity levels, where higher levels correspond to stronger occlusion and reduced visibility of discriminative regions. The resulting distribution is shown in Figure 6, which reports the number of samples per level (Level 1–10) and the corresponding proportion distribution.
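A minimal sketch of Equations (26) and (27), assuming the blending map is available as a 2-D grid of values in [0, 1] (the function names are hypothetical). Note that ⌊10ρ⌋ evaluates to an integer from 0 to 9 for ρ < 1 and to 10 only at ρ = 1; how these values correspond to the "Level 1–10" labels is not spelled out in this section, so the sketch returns the raw quantized value.

```python
import math

def occlusion_ratio(alpha):
    """Eq. (26): spatial mean of the blending coefficient map."""
    values = [a for row in alpha for a in row]
    return sum(values) / len(values)

def severity_level(rho):
    """Eq. (27): quantize rho with floor(10 * rho)."""
    return math.floor(10 * rho)
```

For instance, a map whose coefficients average 0.5 yields ρ = 0.5 and a quantized value of 5.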

4.2.4. Dataset Splitting Protocol

To ensure experimental rigor and prevent data leakage, train-validation splits are performed on ASRAir-Clean prior to cloud synthesis. All occluded subsets are generated from disjoint splits, ensuring no overlap between training and evaluation data.

4.3. Comparison with State-of-the-Art Methods

In this section, we compare DGR-MAE with representative vision models and self-supervised pretraining strategies on downstream classification tasks using the ASRAir-Occ benchmark. Following the experimental protocol in [8], all methods are evaluated under a unified fine-tuning setting. To assess robustness under cloud occlusion, we report Top-1 and Top-5 classification accuracy under consistent training configurations. Table 1 summarizes the parameter size, FLOPs, inference time, and downstream classification performance of different methods.
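For reference, Top-k accuracy counts a sample as correct when its true label is among the k highest-scoring classes. The sketch below uses the standard definition with a hypothetical function name; it is not taken from the authors' evaluation code.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k top-scoring classes.

    scores: list of per-class score lists, one per sample.
    labels: list of integer class indices, one per sample.
    """
    hits = 0
    for row, y in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += y in ranked[:k]
    return hits / len(labels)
```

Top-1 is the special case k = 1, and Top-5 relaxes the criterion so that confusions among a few similar categories still count as hits.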

Overall, DGR-MAE achieves the highest Top-1 accuracy of 74.28%. Compared with MAE, BEiT, and CAE, DGR-MAE improves Top-1 accuracy by 1.20, 1.14, and 1.38 percentage points, respectively, validating the effectiveness of the proposed posterior semantic reconstruction paradigm for cloud-occluded representation learning.

Although all methods are fine-tuned with the same ViT-Base backbone, the initial representation spaces produced by different pre-training strategies differ noticeably. Through posterior semantic correction, DGR-MAE emphasizes consistent recovery of critical semantic structures during pre-training, which makes it easier to learn stable and highly separable features during fine-tuning and yields a clearer advantage in Top-1 accuracy. At the same time, the pre-training strategy appears to favor robust representation recovery under incomplete observations, and candidate coverage among semantically similar categories does not improve proportionally. Consequently, the gain in Top-5 accuracy is more limited than the gain in Top-1 accuracy.

The results further show that from-scratch supervised methods, such as DeiT and ViT, achieve comparatively lower performance, suggesting that direct supervised optimization is insufficient for learning robust representations under severe visibility degradation. Contrastive learning methods, including iBOT and MoCo v3, also underperform compared with masked image modeling approaches, indicating that robustness under cloud occlusion depends not only on global semantic consistency, but also on explicitly modeling stochastic visibility uncertainty during pretraining.

Among masked image modeling methods, DGR-MAE consistently achieves superior Top-1 performance. Compared with existing semantic-guided or teacher–student masked modeling approaches, the results suggest that maintaining stochastic visibility uncertainty during masking while performing semantic-aware reconstruction is beneficial for robust representation learning under cloud occlusion. This observation further demonstrates the effectiveness of the proposed optimization strategy under incomplete observation conditions.

It should be noted that all compared methods are first pre-trained using their respective learning strategies and then fine-tuned under the same ViT-Base backbone in downstream tasks. Therefore, the observed performance differences primarily reflect the quality of the learned representations rather than differences in downstream network architectures or inference settings.

4.4. Performance on the Hierarchical Occlusion-Level Dataset

To further evaluate the robustness of different methods under varying occlusion intensities, we conduct comprehensive experiments on the ASRAir-Sev benchmark. Table 2 reports the Top-1 classification accuracy of all methods across different occlusion levels.

As the occlusion severity increases from Level 1 to Level 10, cloud coverage progressively becomes the dominant factor degrading observable semantic structure, leading to consistent performance degradation across all methods. The degradation rates, however, reveal each model’s ability to preserve discriminative representation learning under progressively increasing visibility uncertainty.

As shown in Figure 7, performance differences among methods remain relatively small under low-occlusion conditions (Level 1–Level 3), where most structural information is still preserved. This suggests that when sufficient visual evidence is available, conventional reconstruction objectives remain adequate for representation learning.

However, in the moderate-to-severe occlusion regime (Level 4–Level 8), the performance gap becomes substantially larger. This stage corresponds to the onset of semantic observability collapse, where the advantage of posterior semantic correction becomes increasingly evident. For example, at Level 8, MAE and MCMAE drop to 77.30% and 71.36%, respectively, while DGR-MAE maintains 86.20%, indicating that semantic correction after stochastic visibility loss is more effective than prior semantic intervention.

Under extreme occlusion (Level 9–Level 10), all methods converge toward lower accuracy as discriminative structures become largely unavailable. Nevertheless, DGR-MAE consistently maintains the best overall performance (67.28%) and exhibits the slowest degradation trend, providing empirical evidence that posterior semantic reconstruction is particularly effective under severe cloud-induced incomplete observations.

4.5. Visualization Analysis

To further validate the effectiveness of the proposed method from a representation learning perspective, we conduct comprehensive visualization analyses from three complementary aspects: the pre-training stage, the fine-tuning stage, and feature distribution. Specifically, we present attention heatmaps, mask–reconstruction visualizations, confusion matrices, and t-SNE feature embeddings. These qualitative results provide an intuitive understanding of the model’s attention behavior, structural modeling capability, and discriminative representation learning under cloud-occluded conditions.

4.5.1. Pre-Training Stage Visualization

As shown in Figure 8, clear differences emerge in how different methods respond to incomplete visual observations during pre-training. Baseline models such as ViT, as well as representative self-supervised methods such as iBOT, are able to roughly localize target regions; however, their attention responses remain relatively diffuse, with substantial activations appearing in background areas and object boundaries, indicating limited structural discrimination capability.

Masked image modeling methods such as MAE, CrossMAE, and MCMAE exhibit improved object-level focus. Nevertheless, their attention distributions tend to be overly smooth and often fail to precisely emphasize semantically critical structural components, such as wing–fuselage junctions.

In contrast, DGR-MAE produces more concentrated and structurally coherent attention responses. High-activation regions align closely with semantically critical aircraft structures while effectively suppressing background interference. This qualitative evidence supports the central hypothesis of this work: posterior semantic correction after stochastic visibility loss promotes more precise structural awareness than prior visibility intervention, leading to stronger discriminative representation learning during pre-training.

4.5.2. Masking–Reconstruction Process Visualization

The masking–reconstruction results further illustrate the learning dynamics of DGR-MAE during pre-training. As shown in Figure 9, compared with MAE and CrossMAE, DGR-MAE produces more structurally coherent reconstructions, particularly in preserving aircraft contours and semantically critical components.

This improvement can be attributed to the proposed posterior semantic correction mechanism. Specifically, after stochastic masking, reconstruction priorities are reweighted according to semantic importance, enabling the model to focus more on semantically informative yet unobserved regions. Consequently, the reconstructed outputs exhibit clearer structural continuity and more complete recovery of key aircraft components.

Importantly, this reconstruction process is not designed as a free-form generative inpainting task. Instead, it is strictly constrained by pixel-level supervision on visible regions and semantic guidance from the teacher branch. Therefore, the model cannot arbitrarily synthesize novel or physically inconsistent aircraft parts in occluded regions. The posterior semantic correction only modulates the weighting of reconstruction loss over masked tokens, while preserving faithful reconstruction constraints on observed pixels. This ensures that structural consistency is learned as a constraint-aware representation objective rather than unconstrained image hallucination.

Moreover, the reconstruction patterns indicate a stronger preference for structural consistency modeling rather than uniform pixel-wise reconstruction error minimization. This suggests that DGR-MAE enhances the modeling of global semantic organization and inter-part structural relationships during pre-training, which helps explain its superior robustness under cloud-induced incomplete observations. It should be noted that this stage of pre-training is conducted on the clean ASRAir-Clean dataset; therefore, the model primarily learns stable structural representations rather than reconstruction biases induced by complex cloud-related noise.

4.5.3. Fine-Tuning Stage Visualization

During fine-tuning under cloud-occluded conditions, performance differences among methods become more pronounced, as illustrated in Figure 10. Under varying occlusion levels, ViT and several contrastive learning-based methods show less stable attention patterns, with responses occasionally drifting toward high-contrast cloud boundaries or background regions, which may interfere with structural localization.

Although MAE-based methods improve robustness to some extent, their attention distributions remain relatively diffuse under severe occlusion, with partial activations on cloud structures or irrelevant background regions. In comparison, DGR-MAE exhibits more concentrated attention on target-related regions across different occlusion levels.

Even under heavy occlusion, the model tends to preserve responses on partially observable structural cues such as wing fragments and contour regions. This behavior is consistent with the representations learned during pretraining, where global semantic dependencies are modeled under high-ratio masking and posterior semantic correction.

These visual patterns provide qualitative support for the effectiveness of posterior semantic correction under progressive visibility degradation. The observations are consistent with the quantitative improvements in the Level 6–Level 8 regimes. All visualizations are obtained from randomly sampled test set images without manual selection.

4.5.4. Confusion Matrix Visualization

To further examine category-level discriminative behavior under incomplete observations, we visualize the confusion matrices of all compared methods on the ASRAir-Sev evaluation subset. As shown in Figure 11, clear differences emerge in class separability across methods, particularly for aircraft categories with similar structural characteristics.

Baseline models such as ViT and DeiT exhibit substantial inter-class confusion, reflected by broadly distributed off-diagonal responses. Although self-supervised pretraining methods such as MAE and CAE improve diagonal concentration to some extent, noticeable ambiguity remains under severe occlusion.

In contrast, DGR-MAE exhibits a more concentrated diagonal response pattern with reduced off-diagonal confusion. Even for categories with highly similar structural configurations, the model preserves clearer decision boundaries. These observations suggest that posterior semantic correction helps maintain semantic separability under cloud-induced information loss, thereby improving category-level discrimination under incomplete visual observations.

4.5.5. t-SNE Feature Distribution Visualization

To further analyze the feature representations learned by different models, we employ t-SNE to project high-dimensional embeddings into a two-dimensional space. As shown in Figure 12, substantial differences can be observed in intra-class compactness and inter-class separability across methods.

ViT and several contrastive learning-based methods exhibit relatively scattered feature distributions with considerable overlap among categories, indicating limited robustness under incomplete observations. Although MAE and its variants improve cluster compactness to some extent, class boundaries remain partially ambiguous under cloud occlusion.

In contrast, DGR-MAE produces a more compact and well-separated feature distribution. Samples from the same category form tighter clusters, while inter-class separation becomes more distinct, resulting in clearer decision boundaries. This suggests that posterior semantic correction helps preserve semantic separability in feature space under visibility degradation, which is consistent with the quantitative robustness improvements reported in previous sections.

To further reduce the subjectivity of t-SNE visualization, we additionally evaluate feature distribution quality using clustering metrics, including the Silhouette Coefficient, Calinski–Harabasz Index (CHI), and Davies–Bouldin Index (DBI). As shown by the quantitative results, DGR-MAE achieves the highest Silhouette Coefficient (0.3187) and CHI value (46.2574), outperforming MAE (0.2792/42.4659), MCMAE (0.2720/42.3715), and CAE (0.2306/32.7084). Meanwhile, DGR-MAE obtains the lowest DBI value (1.6401), compared with MAE (1.7112), MCMAE (1.7217), CAE (2.0470), and ViT (3.5767). These results quantitatively demonstrate that DGR-MAE learns more compact intra-class distributions and clearer inter-class boundaries under cloud-occluded conditions, which is consistent with the t-SNE visualization observations.
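To make the first of these metrics concrete, the sketch below implements the textbook definition of the Silhouette Coefficient with Euclidean distance in plain Python. It is a reference definition under stated assumptions (at least two classes, no pair of identical points in different classes), not the tooling used to produce the reported values, which this section does not specify.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all samples (Euclidean distance).

    a = mean distance to the other members of the sample's own class,
    b = smallest mean distance to the members of any other class,
    s = (b - a) / max(a, b); singleton classes contribute s = 0.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # The distance from p to itself is 0, so divide by len(own) - 1.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Values closer to 1 indicate compact, well-separated classes, while values near or below 0 indicate overlapping ones, which is why a higher Silhouette Coefficient is read as better feature separability.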

4.6. Ablation Study

To further evaluate the contribution of each design component in DGR-MAE, we conduct systematic ablation experiments on the ASRAir-Occ benchmark. The analysis covers several key factors, including pretraining data selection, the posterior semantic correction mechanism, student masking strategy, collaborative feature refinement during pretraining, and differential reconstruction weighting. Unless otherwise specified, all experiments follow the same fine-tuning protocol for fair comparison.

4.6.1. Ablation on Pretraining Dataset

To evaluate the effect of pretraining data distribution, DGR-MAE is pretrained on the clean subset ASRAir-Clean and the cloud-occluded subset ASRAir-Occ, respectively, followed by fine-tuning and evaluation on ASRAir-Occ. The results are summarized in Table 3.

Pretraining on the clean ASRAir-Clean subset achieves better performance than pretraining directly on ASRAir-Occ (74.28% vs. 73.26%), indicating that structurally complete observations provide a more reliable basis for learning stable semantic and geometric priors. A plausible explanation is that cloud-contaminated samples introduce visibility ambiguity during pretraining, which may interfere with the establishment of consistent structural representations, whereas clean observations allow the model to first learn more complete structural and semantic patterns before adapting to occluded conditions during downstream fine-tuning.

Nevertheless, the model pretrained solely on ASRAir-Occ still achieves competitive performance, reaching 73.26% Top-1 accuracy after downstream fine-tuning. This result suggests that the proposed framework retains relatively strong representation learning capability even under incomplete observation conditions. A possible reason is that the posterior semantic correction mechanism enables the model to selectively emphasize structurally important regions during reconstruction, thereby alleviating the adverse effects caused by visibility degradation. In addition, the teacher-guided semantic redistribution strategy further improves the robustness of representation learning when only occluded data are available during pretraining.

4.6.2. Ablation on Teacher Branch

To evaluate the contribution of the semantic estimation branch, we compare DGR-MAE with and without the teacher branch during pretraining. The corresponding results are reported in Table 4.

Removing the teacher branch reduces Top-1 accuracy from 74.28% to 72.84%, indicating that explicit semantic importance estimation provides a measurable performance benefit. This improvement is not merely due to additional supervision; rather, the teacher branch functions as the key component for posterior semantic correction by estimating token-level semantic importance under the shared input observation. These estimates enable the partitioning of masked regions into semantic-critical and non-critical subsets, allowing reconstruction emphasis to be adaptively reallocated after stochastic visibility loss. Without this semantic estimation process, reconstruction degenerates into uniform optimization over masked regions, making it difficult to prioritize structurally informative components. The observed performance drop therefore directly validates the importance of semantic-aware posterior correction for robust representation learning under cloud-occluded conditions.

4.6.3. Ablation on Student Masking Strategy

To investigate the effect of student-side masking patterns during pretraining, we compare three masking strategies: Random, Grid, and Block masking. The corresponding results are presented in Table 5.

Among the evaluated strategies, Random masking achieves the best performance (74.28%), outperforming Grid masking (73.08%) and Block masking (72.54%), indicating that masking pattern design plays an important role in representation robustness. A plausible explanation is that Random masking introduces greater spatial diversity and visibility uncertainty, forcing the model to infer missing content under heterogeneous observation patterns, whereas Grid and Block masking impose more regular visibility structures that simplify reconstruction. More importantly, this result directly supports the central design principle of DGR-MAE: since posterior semantic correction operates after stochastic information loss, its effectiveness depends on preserving masking uncertainty during pretraining. This also explains why DGR-MAE maintains high-ratio random masking rather than introducing prior semantic intervention at the masking stage.

4.6.4. Ablation on Mask Ratio

To further investigate the effect of masking intensity during pretraining, we evaluate DGR-MAE under different mask ratio settings while keeping all other training configurations unchanged. The teacher branch uses the same mask ratio as the student branch to eliminate potential confounding factors arising from inconsistent observation difficulty. Following the standard pretraining protocol, all models are pretrained on the clean ASRAir-Clean dataset and subsequently fine-tuned and evaluated on ASRAir-Occ.

The results are summarized in Table 6. A clear non-monotonic trend can be observed as the mask ratio increases. Specifically, the highest performance is obtained when the mask ratio is set to 0.75, yielding a Top-1 accuracy of 74.28%.

When the mask ratio is relatively low (0.25–0.50), a substantial amount of visual information remains observable during reconstruction. Under this setting, the teacher branch also retains more texture-dominant regions, including certain background areas that contain local texture patterns but are not semantically relevant to the aircraft target. Consequently, reconstruction supervision may be partially allocated to non-critical regions, reducing the model’s incentive to learn long-range structural dependencies and semantic completion capabilities.

In contrast, when the mask ratio becomes excessively high (0.85), the amount of observable information becomes insufficient for reliable contextual reasoning. Since aircraft targets in remote sensing imagery often occupy only a limited spatial region, excessive masking significantly increases reconstruction difficulty and weakens the stability of semantic representation learning.

Therefore, a mask ratio of 0.75 achieves the best balance between preserving sufficient contextual information and facilitating robust semantic inference. These findings support the use of a high-ratio random masking strategy in DGR-MAE and further demonstrate the effectiveness of posterior semantic correction under incomplete observation conditions.
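A minimal sketch of high-ratio random masking, assuming 196 patch tokens per image (a 224 × 224 input with 16 × 16 patches, which is a common ViT-Base setting but is not stated in this section). The function name and the boolean-list representation are illustrative.

```python
import random

def random_mask(num_patches, mask_ratio=0.75, seed=None):
    """Return a boolean list where True marks a masked patch.

    Patches are chosen uniformly at random without replacement.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]
```

With 196 patches and a ratio of 0.75, 147 patches are masked and 49 remain visible to the encoder.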

4.6.5. Ablation on Collaborative Feature Refinement Module (C2CF)

To assess the contribution of the collaborative feature refinement module (C2CF) during pretraining, we conduct ablation experiments on its two key components: Depthwise Convolution and Cross-Attention. The results are summarized in Table 7. The full configuration, which jointly incorporates both components, achieves the best performance (74.28%). Removing either component leads to a measurable performance drop, while removing both results in the lowest accuracy.

These results suggest that the two components provide complementary refinement effects during representation learning. Depthwise Convolution enhances local structural interactions, while Cross-Attention promotes global contextual aggregation across token representations. Their joint integration therefore improves feature consistency across different spatial scales.

It is important to note that C2CF serves as an auxiliary refinement module rather than the primary source of performance gain. Since all downstream evaluations are conducted using the same standard ViT-Base fine-tuning architecture, the observed improvements indicate that C2CF mainly facilitates more stable representation initialization during pretraining. The core robustness advantage of DGR-MAE still originates from the proposed posterior semantic correction mechanism, while C2CF provides complementary local-global feature refinement.

4.6.6. Ablation on Mask Reweighting Parameters

To investigate the influence of differential reconstruction weighting, we evaluate several weighting configurations for semantic-critical and non-critical masked regions. The results are summarized in Table 8.

The weighting ratio exhibits a clear non-monotonic effect on downstream performance. When equal weights are assigned (0.5/0.5), reconstruction reduces to uniform optimization, limiting the model’s ability to prioritize semantically informative regions. Introducing a moderate semantic bias improves recognition accuracy, indicating that posterior semantic correction benefits from selectively emphasizing structurally critical components. However, excessively increasing the weighting disparity (e.g., 0.7/0.3 and 0.8/0.2) leads to performance degradation, suggesting that overemphasis on salient regions suppresses contextual supervision and weakens global structural reasoning. Among all evaluated settings, the 0.6/0.4 configuration achieves the highest Top-1 accuracy of 74.28%, demonstrating that effective posterior semantic correction requires calibrated rather than aggressive semantic reallocation to balance structural discrimination with contextual completeness under cloud-induced incomplete observations.
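The effect of the weighting ratio can be illustrated with a small sketch. The normalized per-token weighting used here is an assumption chosen so that equal weights reduce exactly to a uniform mean over masked tokens, as described above; the exact loss used in DGR-MAE is defined in the method section and may differ in form. The function and argument names are hypothetical.

```python
def weighted_reconstruction_loss(errors, is_critical, w_crit=0.6, w_non=0.4):
    """Weighted mean of per-token reconstruction errors over masked tokens.

    errors:      reconstruction error of each masked token.
    is_critical: True where the teacher marks the token as semantic-critical.
    """
    weights = [w_crit if c else w_non for c in is_critical]
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)
```

With errors of 3.0 on a critical token and 1.0 on a non-critical token, equal weights give 2.0, whereas the 0.6/0.4 setting gives 2.2, so errors on semantic-critical regions contribute more to the objective.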

5. Discussion

5.1. Advantages over Traditional Cloud Removal-Based Recognition

In cloud-occluded remote sensing target recognition, traditional approaches commonly adopt a two-stage processing paradigm, in which cloud removal is first performed to reconstruct a visually complete image, followed by downstream target recognition. Although intuitive, this paradigm implicitly assumes that information hidden by cloud occlusion can be accurately recovered before semantic analysis. However, under severe cloud occlusion, this assumption becomes increasingly difficult to satisfy because visibility degradation is inherently stochastic and irreversible.

First, traditional two-stage frameworks suffer from the problem of cascaded error accumulation. In such pipelines, the recognition model operates not on the original observations but on reconstructed images generated by the cloud removal stage. Consequently, reconstruction artifacts, structural distortions, and texture deviations introduced during cloud removal may propagate into subsequent feature extraction and classification processes, where they can be further amplified during representation learning and decision making, ultimately degrading recognition performance. This issue is particularly critical for fine-grained aircraft recognition, where category discrimination often depends on subtle structural differences, such as wing configurations, fuselage geometry, and tail structures. Even minor reconstruction errors may lead to deviations in discriminative features and consequently affect the final classification results.

More importantly, cloud removal itself is fundamentally constrained by the theoretical limits of information recovery. When key aircraft structures are heavily or completely occluded, the corresponding visual information is no longer observable in the input image. Under such circumstances, cloud removal cannot deterministically recover the true hidden content but can only generate a plausible estimation based on visible observations and learned data priors. The same visible observation may correspond to multiple reasonable hidden interpretations, making the reconstruction result inherently non-unique and preventing any guarantee of consistency with the actual scene. This uncertainty poses a significant challenge for recognition tasks. Even when reconstructed images appear visually plausible, they may still fail to preserve the critical structural cues required for category discrimination. Therefore, improvements in image reconstruction quality do not necessarily translate into corresponding gains in recognition performance.

From the perspective of task objectives, cloud removal and target recognition focus on fundamentally different optimization goals. The former primarily emphasizes visual restoration quality at the image level, whereas the latter is concerned with preserving and extracting semantically discriminative information for classification. Unlike traditional cloud removal methods, DGR-MAE does not formulate cloud occlusion as an explicit image reconstruction problem. Instead, it learns more stable structural and semantic representations through self-supervised pretraining. Its primary objective is not to recover the specific visual content hidden by occlusion but to establish reliable structural dependencies and semantic associations under incomplete observations, thereby improving the adaptability of downstream recognition models to missing information.

It is worth noting that although the best performance is achieved when pretraining on ASRAir-Clean, the ablation studies demonstrate that DGR-MAE still achieves a Top-1 accuracy of 73.26% when pretrained directly on the cloud-occluded dataset ASRAir-Occ, which is only 1.02 percentage points lower than the best-performing configuration. This observation suggests that the effectiveness of the proposed framework does not rely solely on cloud-free data. Instead, its performance gains are largely attributed to the robust representation learning capability introduced by the posterior semantic correction mechanism. Rather than attempting to reconstruct the pixel values of unobservable regions, DGR-MAE focuses on learning semantic representations that remain stable under varying visibility conditions, thereby providing stronger adaptability and generalization under stochastic cloud occlusion.

In addition, traditional cloud removal–recognition pipelines generally require an additional image restoration network during inference, resulting in increased system complexity and computational overhead. In contrast, DGR-MAE introduces its additional learning mechanisms only during the pretraining stage. During downstream deployment, it retains the same inference architecture as the standard ViT-Base model and therefore incurs no additional inference cost.

Overall, the findings of this study suggest that, in cloud-occluded remote sensing target recognition, directly enhancing representation robustness under incomplete observations constitutes a promising research direction compared with conventional two-stage pipelines that rely on explicit image restoration. By learning more stable structural and semantic representations, DGR-MAE improves the model’s robustness to information loss caused by cloud occlusion and provides an alternative perspective to traditional cloud removal-based paradigms for remote sensing target recognition under adverse visibility conditions.

5.2. Limitations and Future Work

Despite the encouraging results, several limitations of the present study should be acknowledged.

First, the current work primarily focuses on cloud-induced occlusion in aircraft recognition. Although cloud coverage represents one of the most common visibility degradation factors in optical remote sensing imagery, other atmospheric disturbances, such as cloud shadows, haze, and smoke, may exhibit different spatial characteristics and degradation patterns. The effectiveness of the proposed framework under these more diverse visibility conditions remains to be further investigated.

Second, the proposed DGR-MAE is evaluated mainly on the fine-grained aircraft recognition task. While the experimental results demonstrate its effectiveness for cloud-occluded aircraft classification, its generalization capability to other remote sensing targets and downstream tasks, such as object detection, scene classification, and semantic segmentation, has not yet been systematically explored.

Third, the random masking strategy adopted during pretraining serves as a generic simulation of information loss but does not fully reflect the spatial continuity and structural characteristics of real cloud occlusions. Future research may incorporate cloud-aware masking strategies or physically motivated occlusion priors to better bridge the gap between self-supervised pretraining and real-world visibility degradation scenarios.
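
The contrast drawn here between random masking and spatially continuous occlusion can be illustrated on a patch grid. The following is a minimal sketch, not the authors' implementation: `random_mask` draws patches uniformly at random, while `blob_mask` is a hypothetical region-growing stand-in for a cloud-like contiguous mask. The 14×14 grid (a 224×224 image with 16×16 patches) is an assumption; 0.75 is the mask ratio listed in Table 6.

```python
import random


def random_mask(grid, ratio, rng):
    """Mask round(ratio * grid * grid) patches chosen uniformly at random."""
    n = grid * grid
    return set(rng.sample(range(n), round(ratio * n)))


def neighbours(idx, grid):
    """Yield the 4-connected neighbours of a flat patch index."""
    r, c = divmod(idx, grid)
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= r + dr < grid and 0 <= c + dc < grid:
            yield (r + dr) * grid + (c + dc)


def blob_mask(grid, ratio, rng):
    """Grow one contiguous region from a random seed (hypothetical cloud-like mask)."""
    target = round(ratio * grid * grid)
    if target == 0:
        return set()
    seed = rng.randrange(grid * grid)
    masked, frontier = {seed}, [seed]
    while len(masked) < target:
        idx = rng.choice(frontier)
        free = [j for j in neighbours(idx, grid) if j not in masked]
        if not free:
            frontier.remove(idx)  # fully enclosed, cannot grow from here
            continue
        new = rng.choice(free)
        masked.add(new)
        frontier.append(new)
    return masked


def components(mask, grid):
    """Count 4-connected components of a set of masked patch indices."""
    seen, count = set(), 0
    for start in mask:
        if start in seen:
            continue
        count += 1
        stack = [start]
        seen.add(start)
        while stack:
            for j in neighbours(stack.pop(), grid):
                if j in mask and j not in seen:
                    seen.add(j)
                    stack.append(j)
    return count


rng = random.Random(0)
rand_m = random_mask(14, 0.75, rng)
blob_m = blob_mask(14, 0.75, rng)
print(len(rand_m), len(blob_m), components(blob_m, 14))
```

Both masks hide the same number of patches, but the region-grown one always forms a single connected area, which is closer to the spatial continuity the paragraph attributes to real clouds. Any actual cloud-aware prior would need to be validated against real occlusion statistics.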

Overall, these limitations provide several directions for future research. Extending the proposed framework to more diverse degradation conditions, target categories, and remote sensing tasks may further improve its applicability and robustness in practical Earth observation applications.

6. Conclusions

This paper addresses the problem of representation degradation caused by cloud-induced incomplete observations in fine-grained aircraft recognition from optical remote sensing imagery. To this end, we propose DGR-MAE, a teacher–student masked image modeling framework based on posterior semantic correction. Unlike prior semantic-guided masking methods that alter token visibility during input construction, DGR-MAE preserves high-ratio stochastic masking and introduces semantic correction after visibility degradation by reallocating reconstruction emphasis according to teacher-estimated semantic importance. This design enables the model to recover structurally critical information while maintaining contextual completeness under partial observation conditions.
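
The reallocation of reconstruction emphasis can be sketched as a two-term weighted loss. This is a simplified illustration under stated assumptions rather than the loss defined in Section 3.4.1: the top-fraction split rule, the `top_frac` parameter, and both function names are hypothetical, and only the weights a = 0.6 and b = 0.4 are taken from the best-performing row of Table 8.

```python
def split_by_importance(scores, masked, top_frac=0.5):
    """Split masked patch indices into (critical, non_critical) by teacher score.

    top_frac is a hypothetical knob for illustration, not a value from the paper.
    """
    ranked = sorted(masked, key=lambda i: scores[i], reverse=True)
    k = round(top_frac * len(ranked))
    return set(ranked[:k]), set(ranked[k:])


def weighted_reconstruction_loss(errors, critical, non_critical, a=0.6, b=0.4):
    """Combine mean per-patch errors of the two subsets with weights a and b."""
    def mean(indices):
        return sum(errors[i] for i in indices) / len(indices) if indices else 0.0
    return a * mean(critical) + b * mean(non_critical)


# Toy example: six patches, four of them masked for the student.
scores = [0.9, 0.1, 0.7, 0.2, 0.4, 0.8]   # assumed teacher importance per patch
errors = [0.5, 0.2, 0.4, 0.1, 0.3, 0.6]   # assumed per-patch reconstruction error
masked = [0, 1, 3, 5]

critical, rest = split_by_importance(scores, masked)
loss = weighted_reconstruction_loss(errors, critical, rest)
print(sorted(critical), sorted(rest), round(loss, 4))
```

With a > b, errors on the patches the teacher ranks as important contribute more to the objective, while the remaining masked patches still carry weight, which is one way to read "reallocating reconstruction emphasis" without changing which tokens the student sees.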

Experiments conducted on the ASRAir-Clean, ASRAir-Occ, and ASRAir-Sev benchmarks demonstrate that DGR-MAE achieves 74.28% Top-1 accuracy on ASRAir-Occ and 86.20% Top-1 accuracy at occlusion Level 8 of ASRAir-Sev (Table 2). Accuracy nonetheless falls to 46.28% at Level 9 and 11.07% at Level 10, so the most severe occlusion levels remain difficult for every compared method. Across all ten levels, DGR-MAE attains the highest or tied-highest Top-1 accuracy in Table 2, which supports the effectiveness of the proposed method under challenging visibility conditions.
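
The Table 2 values behind this summary can be compared directly. The short script below is only a convenience check: the numbers are transcribed from Table 2 for Levels 8 to 10, and `margins` is a hypothetical helper that reports the gap between DGR-MAE and the strongest baseline at each level.

```python
# Top-1 accuracy (%) at Levels 8, 9, and 10, transcribed from Table 2.
baselines = {
    "DeiT": (75.67, 40.13, 9.76),
    "ViT": (55.64, 27.02, 6.59),
    "AttMask": (70.03, 31.23, 8.22),
    "MoCo v3": (65.88, 27.83, 6.02),
    "iBOT": (41.25, 18.93, 6.51),
    "BEiT": (30.42, 13.43, 5.70),
    "MAE": (77.30, 32.69, 10.25),
    "SimMIM": (37.09, 15.86, 6.51),
    "CrossMAE": (60.24, 25.24, 8.38),
    "MCMAE": (71.36, 29.77, 8.38),
    "MixMIM": (56.82, 20.55, 7.73),
    "BEiT V2": (49.41, 19.74, 8.46),
    "CAE": (80.86, 40.29, 8.79),
    "DMAE": (47.77, 14.72, 6.02),
}
dgr_mae = (86.20, 46.28, 11.07)


def margins(ours, others):
    """Return (best baseline name, gap in percentage points) for each level."""
    result = []
    for j, acc in enumerate(ours):
        name, row = max(others.items(), key=lambda kv: kv[1][j])
        result.append((name, round(acc - row[j], 2)))
    return result


for level, (name, gap) in zip((8, 9, 10), margins(dgr_mae, baselines)):
    print(f"Level {level}: +{gap:.2f} points over {name}")
```

Under this transcription the margin over the strongest baseline is 5.34 points at Level 8 and 5.99 points at Level 9, narrowing to 0.82 points at Level 10, where no method exceeds roughly 11%.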

Overall, this work provides a new perspective for remote sensing representation learning under visibility degradation by reformulating cloud robustness as a semantic reconstruction prioritization problem. While the current framework relies on attention-based semantic importance estimation, future work will explore more robust semantic uncertainty modeling and adaptive correction strategies, as well as extend the framework to broader remote sensing scenarios involving real atmospheric interference and multimodal observations.

Author Contributions

Conceptualization, C.S.; Methodology, C.L.; Software, C.S.; Validation, C.L. and H.F.; Formal Analysis, C.S.; Investigation, Q.G.; Resources, Q.G.; Data Curation, B.O.; Writing—Original Draft Preparation, C.L. and R.W.; Writing—Review & Editing, Q.G., H.F. and all authors; Visualization, C.S. and R.W.; Supervision, H.F.; Project Administration, Q.G. and H.F.; Funding Acquisition, R.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by Northwest A&F University.

Data Availability Statement

The datasets generated and used in this study, including ASRAir-Clean, ASRAir-Occ, and ASRAir-Sev, were constructed for optical remote sensing aircraft fine-grained recognition under cloud occlusion scenarios. The datasets are available from the corresponding author upon reasonable request.

Acknowledgments

We thank the HPC platform of NWAFU for providing computational resources.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhang, Y.; Sun, H.; Zuo, J.; Wang, H.; Xu, G.; Sun, X. Aircraft type recognition in remote sensing images based on feature learning with conditional generative adversarial networks. Remote Sens. 2018, 10, 1123.
2. Chen, J.; Zhang, B.; Wang, C. Backscattering feature analysis and recognition of civilian aircraft in TerraSAR-X images. IEEE Geosci. Remote Sens. Lett. 2014, 12, 796–800.
3. Zhao, A.; Fu, K.; Wang, S.; Zuo, J.; Zhang, Y.; Hu, Y.; Wang, H. Aircraft recognition based on landmark detection in remote sensing images. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1413–1417.
4. Zuo, J.; Xu, G.; Fu, K.; Sun, X.; Sun, H. Aircraft type recognition based on segmentation with deep convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 282–286.
5. Ding, Y.; Liu, Q.; Lao, P.; Li, M.; Li, Y.; Zheng, Q.; Peng, Y. Spatial distributions of cloud occurrences in terms of volume fraction as inferred from CloudSat and CALIPSO. Remote Sens. 2023, 15, 3978.
6. Bao, H.; Dong, L.; Piao, S.; Wei, F. BEiT: BERT pre-training of image transformers. arXiv 2021, arXiv:2106.08254.
7. Wei, C.; Fan, H.; Xie, S.; Wu, C.Y.; Yuille, A.; Feichtenhofer, C. Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 14668–14678.
8. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 16000–16009.
9. Xie, Z.; Zhang, Z.; Cao, Y.; Lin, Y.; Bao, J.; Yao, Z.; Dai, Q.; Hu, H. SimMIM: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 9653–9663.
10. Dong, X.; Bao, J.; Zhang, T.; Chen, D.; Zhang, W.; Yuan, L.; Chen, D.; Wen, F.; Yu, N.; Guo, B. PeCo: Perceptual codebook for BERT pre-training of vision transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 552–560.
11. Wei, L.; Xie, L.; Zhou, W.; Li, H.; Tian, Q. MVP: Multimodality-guided visual pre-training. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 337–353.
12. Li, G.; Zheng, H.; Liu, D.; Wang, C.; Su, B.; Zheng, C. SemMAE: Semantic-guided masking for learning masked autoencoders. Adv. Neural Inf. Process. Syst. 2022, 35, 14290–14302.
13. Guan, Q.; Liu, Y.; Chen, L.; Zhao, S.; Li, G. Aircraft detection and fine-grained recognition based on high-resolution remote sensing images. Electronics 2023, 12, 3146.
14. Hu, Q.; Li, R.; Xu, Y.; Pan, C.; Niu, C.; Liu, W. Toward aircraft detection and fine-grained recognition from remote sensing images. J. Appl. Remote Sens. 2022, 16, 024516.
15. Ai, J.; Tian, R.; Luo, Q.; Jin, J.; Tang, B. Multi-scale rotation-invariant Haar-like feature integrated CNN-based ship detection algorithm of multiple-target environment in SAR imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 10070–10087.
16. Ai, J.; Mao, Y.; Luo, Q.; Jia, L.; Xing, M. SAR target classification using the multikernel-size feature fusion-based convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5214313.
17. Xue, W.; Ai, J.; Zhu, Y.; Sun, X.; Zhang, Y.; Gao, G. LMCNet: Lightweight Modality Compensation Network via Knowledge Distillation for Salient Ship Detection under Missing Modality Conditions. IEEE Trans. Aerosp. Electron. Syst. 2026, 62, 6547–6560.
18. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
19. Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision transformers for remote sensing image classification. Remote Sens. 2021, 13, 516.
20. He, Q.; Sun, X.; Yan, Z.; Wang, B.; Zhu, Z.; Diao, W.; Yang, M.Y. AST: Adaptive self-supervised transformer for optical remote sensing representation. ISPRS J. Photogramm. Remote Sens. 2023, 200, 41–54.
21. Jiang, W.; Zhang, J.; Wang, D.; Zhang, Q.; Wang, Z.; Du, B. LeMeViT: Efficient vision transformer with learnable meta tokens for remote sensing image interpretation. arXiv 2024, arXiv:2405.09789.
22. Lin, J.; Gao, F.; Shi, X.; Dong, J.; Du, Q. SS-MAE: Spatial–spectral masked autoencoder for multisource remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5531614.
23. Reed, C.J.; Gupta, R.; Li, S.; Brockman, S.; Funk, C.; Clipp, B.; Keutzer, K.; Candido, S.; Uyttendaele, M.; Darrell, T. Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4088–4099.
24. Fuller, A.; Millard, K.; Green, J.R. SatViT: Pretraining transformers for earth observation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 3513205.
25. Wang, Y.; Hernández, H.H.; Albrecht, C.M.; Zhu, X.X. Feature guided masked autoencoder for self-supervised learning in remote sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 18, 321–336.
26. Tang, M.; Cozma, A.; Georgiou, K.; Qi, H. Cross-scale MAE: A tale of multiscale exploitation in remote sensing. Adv. Neural Inf. Process. Syst. 2023, 36, 20054–20066.
27. Cong, Y.; Khanna, S.; Meng, C.; Liu, P.; Rozi, E.; He, Y.; Burke, M.; Lobell, D.; Ermon, S. SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Adv. Neural Inf. Process. Syst. 2022, 35, 197–211.
28. Huang, Y.; Sun, Y.; Zheng, H. Regression-Based and Isophote-Constrained Cloud Removal for Time-Series Remote Sensing Imagery. In Proceedings of the 2025 11th International Conference on Computing and Artificial Intelligence (ICCAI); IEEE: Piscataway, NJ, USA, 2025; pp. 108–112.
29. Duan, C.; Belgiu, M.; Stein, A. Feature enhancement network for cloud removal in optical images by fusing with SAR images. Int. J. Remote Sens. 2024, 45, 51–67.
30. Zhou, X.; Fang, Q.; Gong, X.; Yang, S.; Lu, T.; Wan, Y.; Ma, A.; Zhong, Y. AFR-CR: An Adaptive Frequency Domain Feature Reconstruction-Based Method for Cloud Removal via SAR-Assisted Remote Sensing Image Fusion. Remote Sens. 2026, 18, 201.
31. Cao, L.; Pan, J.; Xu, J.; Chen, T.; Yuan, Q.; Sang, J. A global-local interaction and conditional consistency constrained diffusion model for SAR-guided optical image cloud removal. Int. J. Appl. Earth Obs. Geoinf. 2026, 146, 105013.
32. Guo, Y.; He, W.; Hu, T.; Bejo, S.K.; Zhang, H. Cloud Meets Diffusion: Progressive Cloud Removal for Optical Remote Sensing Images. In Proceedings of the IGARSS 2025 - 2025 IEEE International Geoscience and Remote Sensing Symposium; IEEE: Piscataway, NJ, USA, 2025; pp. 6123–6127.
33. Gao, J.; Yuan, Q.; Li, J.; Zhang, H.; Su, X. Cloud removal with fusion of high resolution optical and SAR images using generative adversarial networks. Remote Sens. 2020, 12, 191.
34. Jiang, B.; Li, X.; Chong, H.; Wu, Y.; Li, Y.; Jia, J.; Wang, S.; Wang, J.; Chen, X. A deep-learning reconstruction method for remote sensing images with large thick cloud cover. Int. J. Appl. Earth Obs. Geoinf. 2022, 115, 103079.
35. Stucker, C.; Garnot, V.S.F.; Schindler, K. U-TILISE: A sequence-to-sequence model for cloud removal in optical satellite time series. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5408716.
36. Kakogeorgiou, I.; Gidaris, S.; Psomas, B.; Avrithis, Y.; Bursuc, A.; Karantzalos, K.; Komodakis, N. What to hide from your students: Attention-guided masked image modeling. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 300–318.
37. Bai, Y.; Wang, Z.; Xiao, J.; Wei, C.; Wang, H.; Yuille, A.L.; Zhou, Y.; Xie, C. Masked autoencoders enable efficient knowledge distillers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 24256–24265.
38. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning; PMLR: New York, NY, USA, 2021; pp. 10347–10357.
39. Chen, X.; Xie, S.; He, K. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9640–9649.
40. Zhou, J.; Wei, C.; Wang, H.; Shen, W.; Xie, C.; Yuille, A.; Kong, T. iBOT: Image BERT pre-training with online tokenizer. arXiv 2021, arXiv:2111.07832.
41. Gao, P.; Ma, T.; Li, H.; Lin, Z.; Dai, J.; Qiao, Y. MCMAE: Masked convolution meets masked autoencoders. Adv. Neural Inf. Process. Syst. 2022, 35, 35632–35644.
42. Liu, J.; Huang, X.; Zheng, J.; Liu, Y.; Li, H. MixMAE: Mixed and masked autoencoder for efficient pretraining of hierarchical vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6252–6261.
43. Peng, Z.; Dong, L.; Bao, H.; Ye, Q.; Wei, F. BEiT v2: Masked image modeling with vector-quantized visual tokenizers. arXiv 2022, arXiv:2208.06366.
44. Chen, X.; Ding, M.; Wang, X.; Xin, Y.; Mo, S.; Wang, Y.; Han, S.; Luo, P.; Zeng, G.; Wang, J. Context autoencoder for self-supervised representation learning. Int. J. Comput. Vis. 2024, 132, 208–223.

Figure 1. The overall architecture of the proposed DGR-MAE framework. The model adopts a teacher-student dual-branch structure with posterior semantic-guided differential reconstruction for cloud-occluded aircraft recognition. Different colored arrows denote the mask generation and information flow associated with the teacher and student branches, respectively.

Figure 2. Overview of the proposed ASRAir benchmark. (a) ASRAir-Clean: cloud-free aircraft subset; (b) ASRAir-Occ: cloud-occluded robustness subset; (c) ASRAir-Sev: occlusion-severity evaluation subset stratified into 10 cloud coverage levels.

Figure 3. Class distribution comparison across ASRAir subsets. (a) ASRAir-Clean; (b) ASRAir-Occ; (c) ASRAir-Sev; (d) box plot summarizing the distribution statistics.

Figure 4. Per-class image distribution comparison between ASRAir-Clean and ASRAir-Occ.

Figure 5. Cloud synthesis statistics. (a) Alpha distribution. (b) Brightness distribution. (c) Correlation between alpha and brightness. (d) Box plot of alpha values.

Figure 6. Cloud occlusion level distribution in ASRAir-Sev. (a) Number of samples per level. (b) Proportion distribution across severity levels.

Figure 7. Top-1 classification performance comparison of representative vision models (ViT, iBOT, BEiT, MAE, MCMAE, and the proposed DGR-MAE) across different cloud occlusion levels (Level 1–Level 10) on the ASRAir-Sev dataset.

Figure 8. Visualization of attention maps during the pre-training stage. Columns correspond to different model categories, including ViT trained from scratch, iBOT based on contrastive self-supervised learning, masked image modeling methods (MAE, CrossMAE, MCMAE, and DMAE), and the proposed DGR-MAE. Warmer colors indicate higher attention responses, while cooler colors indicate lower attention responses.

Figure 9. Visualization of the masking–reconstruction process during the pre-training stage. The student branch generates masked views via random masking, while the teacher branch produces attention-guided masked views. Reconstruction results from both branches are compared with MAE and CrossMAE. The proposed DGR-MAE achieves more accurate reconstruction in teacher-attended regions, demonstrating improved structural recovery under missing information conditions. The semi-transparent orange overlay indicates semantically important regions identified and preserved by the teacher branch through the attention-guided masking process.

Figure 10. Visualization of attention maps during the fine-tuning stage under cloud occlusion, with different methods arranged from left to right as ViT, iBOT, MAE, CrossMAE, MCMAE, DMAE, and the proposed DGR-MAE. Warmer colors indicate higher attention responses, while cooler colors indicate lower attention responses.

Figure 11. Comprehensive confusion matrix visualization of all compared methods listed in Table 1 on the ASRAir-Sev evaluation subset, covering three representative paradigms: from-scratch supervised training, contrastive self-supervised learning, and masked image modeling.

Figure 12. t-SNE visualization of learned feature embeddings across different methods for multi-scale aircraft recognition under cloud-occluded remote sensing conditions, illustrating intra-class compactness and inter-class separability.

Table 1. Performance comparison of representative self-supervised vision models for cloud-occluded aircraft recognition on the ASRAir-Occ benchmark under the fine-tuning protocol. All methods are evaluated using the same ViT-Base backbone, and computational cost metrics (Params, FLOPs, and inference time) are reported under identical model architecture and inference settings.

Method	Params (M)	FLOPs (G)	Inference (ms)	Top-1 (%)	Top-5 (%)
From-scratch training
DeiT [38]	86.6	17.6	1.76	66.73	79.74
ViT [18]	86.6	17.6	1.76	59.53	73.26
Contrastive learning methods
AttMask [36]	86.6	17.6	1.76	65.95	78.48
MoCo v3 [39]	86.6	17.6	1.76	63.73	79.02
iBOT [40]	86.6	17.6	1.76	50.42	74.22
Masked image modeling methods
BEiT [6]	86.6	17.6	1.76	73.14	82.25
MAE [8]	86.6	17.6	1.76	73.08	81.95
SimMIM [9]	86.6	17.6	1.76	47.00	73.68
CrossMAE [26]	86.6	17.6	1.76	65.17	78.96
MCMAE [41]	86.6	17.6	1.76	73.02	83.03
MixMIM [42]	86.6	17.6	1.76	65.17	79.44
BEiT V2 [43]	86.6	17.6	1.76	71.04	80.94
CAE [44]	86.6	17.6	1.76	72.90	83.93
DMAE [37]	86.6	17.6	1.76	60.85	77.10
DGR-MAE (Ours)	86.6	17.6	1.76	74.28	82.79

Table 2. Analysis of model robustness under progressive cloud occlusion: Top-1 accuracy (%) comparison on the ASRAir-Sev dataset.

Method	Level 1	Level 2	Level 3	Level 4	Level 5	Level 6	Level 7	Level 8	Level 9	Level 10	All
From-scratch training
DeiT [38]	96.77	97.10	96.77	98.71	99.35	97.10	93.87	75.67	40.13	9.76	63.63
ViT [18]	95.81	97.74	97.10	95.48	98.39	93.87	86.13	55.64	27.02	6.59	57.19
Contrastive learning methods
AttMask [36]	96.13	97.10	97.42	97.10	99.68	97.10	91.61	70.03	31.23	8.22	61.01
MoCo v3 [39]	94.19	95.81	96.77	96.45	97.10	95.81	88.71	65.88	27.83	6.02	58.64
iBOT [40]	86.13	91.61	89.68	86.13	87.10	81.94	69.03	41.25	18.93	6.51	49.22
Masked image modeling methods
BEiT [6]	96.13	93.23	94.52	90.00	93.23	85.48	67.42	30.42	13.43	5.70	48.60
MAE [8]	98.39	97.74	99.03	99.35	100.00	98.06	95.16	77.30	32.69	10.25	63.55
SimMIM [9]	68.39	77.42	80.97	75.48	74.52	72.90	57.42	37.09	15.86	6.51	42.63
CrossMAE [26]	95.81	96.45	96.77	97.10	98.71	96.77	89.03	60.24	25.24	8.38	58.49
MCMAE [41]	97.42	98.71	98.39	98.06	98.71	98.39	93.55	71.36	29.77	8.38	61.52
MixMIM [42]	96.45	96.77	97.42	97.10	97.74	96.77	96.13	56.82	20.55	7.73	57.07
BEiT V2 [43]	97.74	98.71	98.06	97.74	98.06	96.77	87.42	49.41	19.74	8.46	56.49
CAE [44]	98.39	98.06	98.39	99.35	100.00	97.74	95.81	80.86	40.29	8.79	64.68
DMAE [37]	90.32	93.55	95.48	93.55	93.55	90.97	78.71	47.77	14.72	6.02	52.42
DGR-MAE (Ours)	99.03	99.35	99.35	99.35	100.00	99.68	97.74	86.20	46.28	11.07	67.28

Table 3. Ablation study on pretraining dataset.

Pretrain	Finetune	Top-1 (%)
ASRAir-Clean	ASRAir-Occ	74.28
ASRAir-Occ	ASRAir-Occ	73.26

Table 4. Ablation study on teacher branch.

Teacher Branch	Top-1 (%)
w/o teacher	72.84
w/ teacher	74.28

Table 5. Ablation study on student masking strategy.

Masking Strategy	Top-1 (%)
Random	74.28
Grid	73.08
Block	72.54

Table 6. Ablation study on different mask ratios.

Mask Ratio	Description	Top-1 (%)
0.25	Low masking intensity	72.90
0.50	Moderate masking intensity	73.08
0.75	Standard masking intensity	74.28
0.80	High masking intensity	73.50
0.85	Very high masking intensity	72.66

Table 7. Ablation study on C2CF module components. ✓ indicates the component is used, and × indicates it is not used.

Depthwise Conv	Cross-Attn	Top-1 (%)
✓	✓	74.28
✓	×	72.72
×	✓	73.08
×	×	72.72

Table 8. Ablation study on mask reweighting strategy.

a	b	Description	Top-1 (%)
0.5	0.5	Balanced weighting	73.02
0.6	0.4	Focus on key regions	74.28
0.7	0.3	Emphasize key regions	73.14
0.8	0.2	Over-emphasize key regions	73.08
0.9	0.1	Extreme emphasis on key regions	73.56

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

