Explainable AI in Rotorcraft Aerodynamics: Autonomous Discovery and Dynamic Tracking of Vortex Ring State Mechanisms via Vision Transformers

Zhou, Xiang; Sun, Jiawei; Zhao, Jiannan; Shuang, Feng

doi:10.3390/aerospace13070590


Guangxi Key Laboratory of Intelligent Control and Maintenance of Power Equipment, School of Electrical Engineering, Guangxi University, No. 100, Daxue East Road, Nanning 530004, China


Authors to whom correspondence should be addressed.

Aerospace 2026, 13(7), 590; https://doi.org/10.3390/aerospace13070590

Submission received: 26 May 2026 / Revised: 23 June 2026 / Accepted: 26 June 2026 / Published: 30 June 2026

(This article belongs to the Special Issue Machine Learning for Aerodynamic Analysis and Optimization)


Abstract

The Vortex Ring State (VRS) is a critical aerodynamic hazard for rotorcraft, characterized by highly unsteady fluid–structure interactions and severe low-frequency vibrations. While data-driven deep learning models have shown promise in aviation state monitoring, their inherent “black-box” nature conflicts with the stringent interpretability requirements of airworthiness certification. To address this, we propose an “AI for Science” paradigm, investigating whether advanced Vision Transformers (ViT) can autonomously discover underlying aerodynamic mechanisms without human physical priors. First, to ensure data fidelity, flight test datasets of a coaxial unmanned aerial vehicle were labeled using cross-validation against high-fidelity Computational Fluid Dynamics (CFD) simulations and wind tunnel tests. One-dimensional vibration signals were then transformed into two-dimensional Continuous Wavelet Transform (CWT) spectrograms. By employing target-layer Gradient-weighted Class Activation Mapping (Grad-CAM), we conducted a systematic comparison between a traditional Convolutional Neural Network (ResNet50) and ViT. The results demonstrate that while CNNs suffer from diffuse attention caused by high-frequency noise, the frozen-backbone ViT model achieves a physically interpretable accuracy of $93.24\%$ while autonomously locking its global attention onto a horizontal feature band centered at $41.7\ \mathrm{Hz}$. Crucially, this autonomously discovered feature aligns with the theoretically derived once-per-revolution (1P) fundamental frequency of the rotor’s flap-lag coupling response under VRS aerodynamic turbulence. This research provides direct visual evidence bridging black-box AI decisions with classical fluid mechanics, proposing a “Mechanism-Guided Verification” framework that offers a trustworthy pathway for the future certification of AI in safety-critical aerospace systems.

Keywords: Vortex Ring State (VRS); coaxial rotorcraft; Explainable AI (XAI); Vision Transformer (ViT); aerodynamics; 1P Frequency; fluid–structure interaction

1. Introduction

The VRS remains one of the most hazardous aerodynamic phenomena in rotorcraft flight, particularly during vertical or steep descents where the descent rate approaches the rotor-induced velocity [1]. In this critical state, the rotor becomes trapped within its own re-circulating downwash, leading to the formation of a large-scale, unsteady toroidal vortex structure around the rotor disk [2,3]. This flow regime is characterized by severe aerodynamic unsteadiness, resulting in a sudden loss of thrust, diminished control authority, and violent aircraft buffeting. For coaxial rotorcraft, the complexity is further intensified by the strong aerodynamic interference between the upper and lower rotors [4,5]. The impingement of the upper rotor’s wake onto the lower disk induces drastic, asymmetric load variations, which not only accelerates the onset of VRS but also excites complex, low-frequency structural vibrations [6].

Despite the severity of VRS, identifying its onset remains a formidable challenge because it requires the aircraft to infer external, large-scale flow-field states using only internal, onboard proprioceptive sensors. Traditional model-based methods, while theoretically sound, often struggle to capture the highly non-linear and separated flow dynamics of VRS in real-time due to prohibitive computational costs or oversimplified assumptions [1]. Conversely, signal-processing-based approaches rely heavily on manual feature engineering, where domain experts must identify specific frequency bands or statistical anomalies [7,8]. Such methods often lack the robustness required for diverse flight missions and are prone to false alarms in the presence of non-stationary flight noise.

In recent years, the rapid advancement of deep learning has offered a powerful alternative, with architectures like Convolutional Neural Networks (CNNs) excelling at automatically extracting complex patterns from high-frequency sensor data [9]. However, as AI transitions from experimental research toward deployment in safety-critical aviation systems, a fundamental “trust gap” has emerged. The inherent “black-box” nature of deep learning models poses a significant barrier to airworthiness certification. Regulatory bodies, such as the European Union Aviation Safety Agency (EASA) and the Federal Aviation Administration (FAA), increasingly demand not only high predictive accuracy but also rigorous evidence of “learning the right physics” [10,11]. Without explicit interpretability, it is impossible to verify whether a model’s identification of VRS is based on authentic aerodynamic signatures—such as periodic load pulsations—or merely on spurious correlations within the training dataset [12].

To address these challenges, this study pioneers an “AI for Science” paradigm, explicitly decoupling the engineering objective (robust VRS classification) from the scientific inquiry (autonomous identification of fluid–structure interaction signatures). Rather than proposing a completely novel mathematical neural architecture from scratch, the core innovation of this work lies in the methodological synthesis and architectural refinement for aerospace certification. By transforming multi-sensor vibration streams into time–frequency spectrograms [13] and employing explainable AI (XAI) techniques like Grad-CAM [14], we provide a verifiable, mechanism-guided link between data-driven decisions and established fluid mechanics. The main contributions are summarized as follows:

Mechanism-Guided Verification (MGV) Framework: We pioneer a rigorous auditing framework tailored for rotorcraft telemetry, transforming standard black-box classifiers into interpretable aerodynamic probes using continuous wavelet transforms and gradient-based attribution.
Autonomous Physical Proxy Alignment: We show that the ViT [15] locks its global attention onto the theoretically derived $41.7\ \mathrm{Hz}$ (1P) frequency band. This demonstrates that the model autonomously aligns with established structural-dynamic proxies without any human-injected physical equations in the loss function.
Architectural Regularization via Backbone Freezing: We systematically expose the “shortcut learning” vulnerabilities of fully fine-tuned ViTs. By conducting formal architectural ablation studies, we demonstrate that a frozen-backbone constraint fundamentally forces the network to abandon high-frequency noise and maintain physically grounded feature extraction under imbalanced operational datasets.

2. Related Works

2.1. Vortex Ring State and Proprioceptive Aerodynamic Sensing

The accurate and timely identification of the VRS is foundational to rotorcraft flight safety, particularly for coaxial unmanned aerial vehicles (UAVs) governed by complex aerodynamic interferences [4,5]. Traditional flight diagnostics primarily rely on model-based estimators or classical signal-processing techniques applied to high-frequency structural vibrations [7,8]. Model-based formulations, such as Pitt-Peters or dynamic inflow updates, provide solid mathematical foundations but frequently struggle to replicate, in real time, the severe, unsteady thrust loss and flow separations inherent to deep VRS regimes [1]. Conversely, non-stationary signal analysis methods, including the Short-Time Fourier Transform (STFT) and CWT, have been widely used to translate raw, multi-axis accelerometer data into interpretable time–frequency representations [13].

With the rapid emergence of deep learning, spatial feature extractors like CNNs have successfully automated aerodynamic state monitoring by treating these 2D spectrograms as visual textures [9]. However, these frameworks operate strictly under a supervised pattern-matching paradigm. While they capture high statistical correlation within specific flight test profiles, they bypass the underlying fluid–structure interaction principles, treating the physical characteristics of the recirculating wake as an unaudited statistical black box [3].

2.2. The Trust Gap and XAI in Airworthiness Certification

As artificial intelligence transitions from experimental flight testing to safety-critical aviation deployment, regulatory frameworks have encountered a profound “trust gap” [12]. Leading regulatory bodies, including the European Union Aviation Safety Agency (EASA) and the Federal Aviation Administration (FAA), have established stringent conceptual guidelines demanding that machine learning components in safety-critical domains exhibit explicit auditability and eliminate “shortcut learning” risks [16]. To bridge this validation barrier, XAI attribution methods have been introduced into aerospace engineering. However, the selection of an appropriate XAI algorithm is heavily contingent upon the physical nature of the input data. While perturbation-based attribution algorithms, such as SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME), provide valuable feature importance rankings in conventional computer vision tasks, they are ill-suited for analyzing time–frequency transformations. These perturbation methods disrupt the continuous 2D spatial geometry of CWT spectrograms by randomly masking local superpixels, thereby destroying the temporal continuity of aerodynamic vibration signatures.

In contrast, gradient-based methods, predominantly visual backpropagation probes like Gradient-weighted Class Activation Mapping (Grad-CAM), are deliberately selected for this study [14]. By backpropagating class-score gradients to the final feature layer rather than perturbing the input, Grad-CAM preserves the macroscopic, time-spanning horizontal structures inherent in the time–frequency domain. Early integrations of XAI in aerospace predominantly focused on validating defect localization or sensor anomaly isolation, treating XAI as a qualitative “sanity check” to visualize whether an image classifier points at a generic region of interest. However, the geometric preservation afforded by Grad-CAM is essential for identifying sustained physical resonance bands (e.g., the 1P fundamental frequency). This capability enables a largely unexplored shift: using XAI quantitatively as a “scientific discovery probe” to verify whether an AI model’s internal decision boundaries are governed by continuous fluid–structure interaction laws and explicit aerodynamic instabilities [6].

2.3. Architectural Paradigms in Scientific ML: CNNs vs. Vision Transformers

The mathematical formulation of the neural architecture itself fundamentally governs how physical dimensions are synthesized from empirical data. CNNs inherit the rigid inductive biases of translation invariance and localized receptive fields [9]. While this localized spatial scanning excels at picking up edge textures and pixel clusters, it imposes severe limitations when analyzing scientific spectrograms where the key physical mechanism manifests as a persistent, global frequency resonance spanning long temporal horizons [15].

The ViT architecture completely discards localized spatial constraints, utilizing Multi-Head Self-Attention (MHSA) to establish global dependencies across non-overlapping image patches from the foundational layer [15]. Recently, ViT architectures, often coupled with time–frequency transformations, have been increasingly adopted for rotating machinery fault identification and structural health monitoring. While these existing publications convincingly demonstrate the superior capacity of ViTs to extract discriminative features from spectrograms compared to conventional CNNs, a critical methodological gap persists.

Existing aerospace and diagnostic research has predominantly treated the ViT as a high-capacity, static black-box classifier. The primary focus of previous achievements remains heavily anchored on optimizing statistical classification metrics (e.g., overall accuracy) on static fault datasets, utilizing XAI merely as qualitative “sanity checks” for localized component faults. These prior works fail to exploit the ViT’s global self-attention mechanism as a dynamic probe for non-stationary aerodynamic evolution. Specifically, prior literature has overlooked the structural harmony between the ViT’s global macro-structural capture and the continuous, time-spanning nature of fundamental physical harmonics. This oversight leaves the architectural potential to track dynamic physical phenomena—such as the frequency migration of the 1P aerodynamic signature during non-stationary VRS encounters—largely unexplored.

This paper directly addresses this critical void by systematically establishing an intrinsic causal chain: connecting the physical periodic nature of VRS vibrations, the regulatory necessity for physical XAI verification, and the ViT’s unique capacity for global macro-structural capture. By uniquely leveraging Grad-CAM to dynamically track the migration of the 1P frequency across varying descent velocities, we elevate XAI from a static visualization utility to a quantitative, fluid-dynamics verification mechanism. Consequently, the current study repositions the ViT not merely as a diagnostic classifier, but as an interpretable, autonomous mechanism discovery vehicle. This synthesis explicitly bridges the gap between deep learning architectures and rotorcraft aerodynamics, serving as the foundational rationale for our proposed Mechanism-Guided Verification (MGV) framework.

3. Materials and Methods

To transition from purely black-box statistical learning to physics-informed aerodynamic evaluation, this study formally introduces the Mechanism-Guided Verification (MGV) framework. Rather than presenting the data processing techniques and neural network architectures as isolated algorithms, we integrate them into a cohesive, three-step operational routing. This overarching framework serves as the structural roadmap for the methodology detailed in the subsequent subsections:

1. Proprioceptive Space Mapping (detailed in Section 3.1 and Section 3.2): The acquisition of raw 1D inertial measurements bounded by high-fidelity CFD labeling, followed by Continuous Wavelet Transformations to preserve non-stationary structural-dynamic continuity in a 2D visual space.
2. Architectural Regularization (detailed in Section 3.3): The deployment of Vision Transformers under strict backbone-freezing constraints to actively suppress high-frequency instrumentation noise and enforce macro-structural visual tracking.
3. Quantitative Proxy Auditing (detailed in Section 3.3): The derivation of the Attention Focus Ratio (AFR) using target-layer Grad-CAM, enabling continuous mathematical auditing against the theoretical physical proxy mask ($M_{ROI}$) derived directly from mechanical telemetry.

3.1. Dataset and Physics-Grounded Experimental Platform

3.1.1. Experimental UAV Platform and 1P Frequency Derivation

The experimental data were collected using a custom-built coaxial twin-rotor UAV platform [17]. The aircraft, with a maximum takeoff weight of $10\ \mathrm{kg}$, features two counter-rotating rotors driven by high-torque brushless motors. A critical physical anchor for this study is the rotor’s nominal operating speed, maintained at $2500\ \mathrm{RPM}$ during the transition to VRS.

In rotorcraft dynamics, the once-per-revolution (1P) frequency is the fundamental harmonic of the periodic aerodynamic loads [6]. For our platform, the theoretical 1P frequency is derived as:

$$f_{1P} = \frac{2500\ \mathrm{RPM}}{60\ \mathrm{s/min}} \approx 41.67\ \mathrm{Hz} \tag{1}$$
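As a quick numerical check of Equation (1), the conversion from rotor speed to 1P frequency can be sketched as follows. The function names are illustrative and not taken from the original work; the only input is the 2500 RPM nominal rotor speed stated above.

```python
def one_per_rev_hz(rotor_rpm: float) -> float:
    """Return the once-per-revolution (1P) frequency in Hz for a rotor speed in RPM."""
    return rotor_rpm / 60.0


def revolution_period_s(rotor_rpm: float) -> float:
    """Return the duration of one full rotor revolution in seconds."""
    return 60.0 / rotor_rpm


# Nominal rotor speed stated in the text
f_1p = one_per_rev_hz(2500.0)         # about 41.67 Hz
t_rev = revolution_period_s(2500.0)   # 0.024 s per revolution
```

The same relation gives the revolution period, which is useful when relating sampling rates or solver time steps to rotor azimuth.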

Under normal flight conditions, structural vibrations localized at $41.7\ \mathrm{Hz}$ are minimal. However, as the rotor enters the Vortex Ring State, the blades interact with highly unsteady, re-circulating tip vortices [1]. The theoretical and physical imperative linking VRS onset exclusively to the intensification of the 1P structural harmonic lies in the inherent azimuthal asymmetry of this flow field. During vertical descent into VRS, the toroidal vortex ring undergoes severe non-stationary deformations, bursting, and unstable vortex wandering across the rotor disk.

As an individual rotor blade traverses a full $360^{\circ}$ azimuthal cycle, it experiences a highly non-uniform induced velocity field. This subjects the blade airfoil to drastic, periodic pulsations in its local effective angle of attack and aerodynamic lift once per revolution. This asymmetric downwash distribution acts as a powerful, deterministic periodic forcing function synchronized with the rotor’s rotational frequency ($f_{1P}$), which heavily excites the fundamental elastic “flap-lag” coupling modes of the blades [1,18]. These fluctuating aerodynamic loads are continuously synthesized and transmitted through the rotor hub directly into the airframe as high-energy, discrete 1P structural vibrations [19].

Consequently, the localized $41.7\ \mathrm{Hz}$ spectral band transcends generic aerodynamic turbulence; it constitutes the definitive physical signature and unique aerodynamic fingerprint of deep VRS development. This explicit causality provides the foundational “physical truth” that our Mechanism-Guided Verification (MGV) framework expects an intelligent system to autonomously discover and extract [6].

To capture high-frequency dynamic responses, the UAV is equipped with a 3-axis accelerometer and gyroscope (1200 Hz), a barometer (100 Hz), and a GPS module (10 Hz). Detailed hardware specifications are summarized in Table 1.

The statistical analysis of the processed raw flight dataset is shown in Table 2. Notably, VRS samples constitute a very small portion of the entire dataset, not exceeding $8\%$ in any mission, which indicates an extreme class imbalance in the data distribution. To counteract this imbalance during training, the proposed deep learning solution employs strategies such as non-uniform weighting to ensure the model learns sufficiently from the critical minority samples, thereby enhancing identification robustness and accuracy.

3.1.2. High-Fidelity Data Labeling via CFD and Wind Tunnel Cross-Validation

To ensure that the “black-box” model is trained on authentic physics rather than spurious noise, a rigorous multi-source labeling protocol was implemented [3].

1. Initial GPS Screening: Flight segments with high descent rates ($V_{z} > 8\ \mathrm{m/s}$) and low forward speeds were isolated.
2. CFD Verification: The boundary conditions of these segments were replicated in high-fidelity, unsteady Computational Fluid Dynamics (CFD) simulations using ANSYS R2020 Fluent [5]. Only segments where the flow field clearly exhibited the toroidal recirculation characteristic of VRS (as shown in Figure 1) were considered for the VRS label [3].
3. Wind Tunnel Calibration: Aerodynamic load pulsations measured in a controlled wind tunnel environment were used to calibrate the CFD model, ensuring the simulation accurately reproduced the 1P load variations encountered in real flight [20].

To resolve the macroscopic development of the recirculating toroidal flow field, three-dimensional Unsteady Reynolds-Averaged Navier–Stokes (URANS) equations were solved using ANSYS Fluent [5]. The closure of the governing equations was achieved using the shear stress transport (SST) $k$-$\omega$ turbulence model, with the near-wall resolution maintained at $y^{+} \approx 1.14$ to adequately capture flow separation characteristics. Pressure-velocity coupling used the SIMPLE scheme, convective terms were discretized with Second Order Upwind formulations, and temporal advancement employed a Second Order Implicit scheme.

Crucially, to ensure maximum fidelity to the actual flight envelope, the inlet boundary conditions were dynamically driven by User-Defined Functions (UDFs) mapping the precise descent velocity time-series extracted directly from the UAV flight logs. Given that the primary engineering objective of the CFD phase was to establish macroscopic binary labels (VRS vs. Nominal) based on low-frequency flow topology evolution rather than extracting high-frequency fluid–structure interaction loads, a Moving Reference Frame (MRF) approach was adopted for rotor kinematics with a temporal step size of $0.012\ \mathrm{s}$ (equivalent to approximately half a rotor revolution). The total grid cell count was finalized at approximately $13.62$ million. Each simulation required around 60 h of wall-clock time utilizing 128 CPU cores on a high-performance computing cluster. A mesh independence validation confirmed that variations in grid density ($\pm 30\%$) induced negligible topological shifts in the primary toroidal vortex structures, supporting the robustness of the derived ground-truth labels.

Figure 1 displays preliminary screened high descent rate intervals and their corresponding CFD flow field configurations. The curve on the left shows the vertical descent velocity time series recorded by the UAV’s high-precision GPS. Shaded regions indicate the high descent rate intervals preliminarily screened based on a vertical velocity threshold, which may contain VRS events. The subplots on the right show the CFD simulation flow field configurations corresponding to several typical moments within these intervals. Clear observation of the rotor-induced recirculation flow via the streamlines confirms high consistency with the typical VRS flow structure characteristics, providing physical validation for subsequent data labeling.

Figure 1. CFD simulation framework and data labeling process. (a) The data segment for labeling with CFD simulation (The purple line segments represent the parts labeled as VRS.), calculated by the authors based on the tested coaxial UAV platform. (b) The hybrid mesh topology around the coaxial rotors and streamline configurations illustrating toroidal recirculation characteristic of the VRS.

This cross-validation ensures that the ground truth labels represent the absolute physical state of the flow field [3], providing a solid foundation for evaluating the AI’s “scientific discovery” capability.

To rigorously define the epistemological boundary of “no human physical priors” highlighted in this study, it must be emphasized that the high-fidelity physics derived from the CFD simulations and wind tunnel tests are strictly utilized externally to assign discrete, binary ground-truth labels ($Y \in \{0, 1\}$). The neural network architecture, including its objective loss function, remains entirely agnostic to fluid-dynamic equations, kinematic constraints, or pre-defined structural frequency boundaries. Consequently, the Vision Transformer’s eventual alignment with the 1P frequency band is an autonomous feature-mining outcome, entirely free from manual physical prior injection during the machine learning optimization process.

3.2. Time-Frequency Representation via CWT

To bridge the gap between 1D temporal sensor signals and 2D visual patterns, we employ the CWT method. Unlike the Short-Time Fourier Transform (STFT), CWT provides a multi-resolution analysis that is particularly suited for non-stationary aircraft vibrations [21,22]. Using the Complex Morlet wavelet as the mother function [23], we compute the Euclidean norm (magnitude) of the 3-axis accelerometer data and transform this into high-resolution spectrograms. This transformation preserves both the transient energy bursts of VRS and the sustained structural resonance at the 1P frequency, converting a “hidden” aerodynamic mechanism into an “explicit” visual texture [13]. The sample figures of CWT in different stages are illustrated in Figure 2. A representative segment of the raw multi-sensor telemetry data and the corresponding descent trajectory during the flight test are illustrated in Figure 3.

To explicitly justify the selection of CWT over traditional time–frequency methods (such as the STFT), the non-stationary physical manifestations of VRS must be considered. STFT is mathematically constrained by a fixed window size (the Gabor limit), forcing a rigid trade-off between temporal and spectral resolution. This limitation renders it fundamentally inadequate for VRS aerodynamic sensing, which features both abrupt, transient high-frequency energy bursts (e.g., localized vortex tearing) and sustained, low-frequency structural resonances (e.g., the 1P flap-lag harmonic). The CWT overcomes this stricture by employing a dynamically scaling wavelet function, providing high temporal resolution for sudden transients and high spectral resolution for sustained low frequencies. This continuous multi-resolution capability is mechanically indispensable for generating the unbroken horizontal topological structures required for the Vision Transformer’s global attention to successfully lock onto the 1P physical proxy.
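The transform described above can be illustrated with a dependency-free sketch of a complex Morlet CWT applied to a synthetic 1P tone. This is not the authors' implementation: the wavelet bandwidth and center parameters, the truncation support, the analysis frequencies, and the helper names are all assumptions chosen for illustration. In practice a library such as PyWavelets or a NumPy FFT-based convolution would be used instead of these explicit loops.

```python
import cmath
import math


def accel_magnitude(ax, ay, az):
    """Euclidean norm of 3-axis accelerometer samples, one value per time step."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(ax, ay, az)]


def morlet(t, bandwidth=1.5, center=1.0):
    """Complex Morlet mother wavelet at dimensionless time t (parameters assumed)."""
    norm = 1.0 / math.sqrt(math.pi * bandwidth)
    return norm * cmath.exp(2j * math.pi * center * t) * math.exp(-t * t / bandwidth)


def cwt_magnitude(signal, fs, freqs, bandwidth=1.5, center=1.0, support=3.0):
    """Return |CWT| as one row per analysis frequency, one column per sample."""
    n = len(signal)
    rows = []
    for f in freqs:
        scale = center * fs / f                  # scale (in samples) that targets f Hz
        half = int(math.ceil(support * scale))   # truncate the Gaussian envelope
        kernel = [morlet(k / scale, bandwidth, center).conjugate() / math.sqrt(scale)
                  for k in range(-half, half + 1)]
        row = []
        for i in range(n):
            acc = 0j
            for j, w in enumerate(kernel):
                m = i + j - half                 # sample index under kernel tap j
                if 0 <= m < n:
                    acc += signal[m] * w
            row.append(abs(acc))
        rows.append(row)
    return rows


fs = 1200.0                    # accelerometer sampling rate stated in the text
f_1p = 2500.0 / 60.0           # 1P frequency from the nominal rotor speed
# Synthetic stand-in for a vibration magnitude signal: a pure tone at 1P
signal = [math.sin(2.0 * math.pi * f_1p * k / fs) for k in range(600)]
freqs = [20.0, f_1p, 80.0]     # illustrative analysis frequencies
spec = cwt_magnitude(signal, fs, freqs)
```

Because the scale shrinks as the analysis frequency rises, the kernel is long at low frequencies (fine spectral resolution) and short at high frequencies (fine temporal resolution), which is the multi-resolution property the paragraph above relies on. A sustained tone therefore appears as a continuous horizontal ridge in the row nearest its frequency.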

3.2.1. Engineering Causality Constraint

To ensure the algorithm can perform zero-lookahead, unidirectional real-time inference within a flight control system, we enforce a strict engineering causality constraint. A sliding window of 5 seconds is employed, where an asymmetric Hanning boundary weighting is applied to unequivocally sever the reliance on any forward-looking samples.
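One way to realize such a zero-lookahead window is sketched below. The text does not specify the exact shape of the asymmetric Hanning weighting, so this sketch assumes a rising half-Hann taper on the oldest edge only, with unit weight at the newest sample; the 1 s taper length and the function name are hypothetical.

```python
import math


def causal_window(history, fs, window_s=5.0, taper_s=1.0):
    """Return the weighted trailing window that ends at the newest sample.

    Only samples at or before the current instant are used (zero lookahead).
    The taper shape and taper_s are assumptions, not taken from the source.
    """
    n = int(round(window_s * fs))
    n_taper = min(int(round(taper_s * fs)), n)
    frame = list(history[-n:])                   # trailing samples only
    if len(frame) < n:                           # left-pad until enough history exists
        frame = [0.0] * (n - len(frame)) + frame
    weights = [1.0] * n
    for k in range(n_taper):
        # Rising half of a Hann window applied to the oldest edge only
        weights[k] = 0.5 * (1.0 - math.cos(math.pi * k / n_taper))
    return [w * x for w, x in zip(weights, frame)]


# 7000 past samples at 1200 Hz; the window keeps the most recent 6000 (5 s)
frame = causal_window([1.0] * 7000, fs=1200.0)
```

Leaving the newest edge untapered keeps the most recent samples at full weight, at the cost of a sharper boundary there; a symmetric Hann window would instead attenuate exactly the samples a real-time detector cares about most.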

3.2.2. Small-Sample Modeling and Class Imbalance Mitigation

Given the extreme class imbalance inherent to the operational envelope (where VRS positive samples constitute a mere $6.22\%$ of the total dataset), conventional optimization strategies rapidly degenerate into majority-class predictors. To provide a mathematically robust solution for small-sample modeling, aggressive class-weighted sampling was implemented during the mini-batch data loading phase. Furthermore, the optimization landscape was regularized using a Focal Loss penalty mechanism [24]. By configuring the focusing parameter to $\gamma = 2.0$, the loss function dynamically scales the gradients based on prediction confidence. This mechanism heavily penalizes misclassifications of the hard-to-predict minority VRS samples while actively down-weighting the easily classified nominal flight data, effectively preventing representation collapse.
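The effect of the focusing parameter can be seen in a minimal per-sample sketch of the binary focal loss with $\gamma = 2.0$. The function names are illustrative, and no class-balancing $\alpha$ term is included because the text does not mention one.

```python
import math


def cross_entropy(p_vrs, label, eps=1e-12):
    """Binary cross-entropy for one sample; p_vrs is the predicted VRS probability."""
    p_t = p_vrs if label == 1 else 1.0 - p_vrs   # probability given to the true class
    return -math.log(min(max(p_t, eps), 1.0))


def binary_focal_loss(p_vrs, label, gamma=2.0, eps=1e-12):
    """Focal loss: cross-entropy scaled by the modulating factor (1 - p_t) ** gamma."""
    p_t = p_vrs if label == 1 else 1.0 - p_vrs
    p_t = min(max(p_t, eps), 1.0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Easy nominal sample (true class predicted with probability 0.95)
easy_ratio = binary_focal_loss(0.05, 0) / cross_entropy(0.05, 0)
# Hard VRS sample (true class predicted with probability 0.20)
hard_ratio = binary_focal_loss(0.20, 1) / cross_entropy(0.20, 1)
```

With $\gamma = 2.0$, the confidently classified nominal sample retains only $(1 - 0.95)^2 = 0.0025$ of its cross-entropy loss, while the poorly classified VRS sample retains $(1 - 0.20)^2 = 0.64$, so gradient contributions shift toward the hard minority cases.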

To ensure strict reproducibility and address the challenges of small-sample modeling, the complete database statistics and training configurations are formalized in Table 3. The experimental campaign comprised 5 independent flight sorties, generating a total of $19{,}624$ discrete CWT spectrogram frames (normalized to $224 \times 224 \times 3$). Bounded by the operational envelope, true VRS instances constitute 1229 frames (yielding an extreme minority class ratio of $6.26\%$), while nominal flight conditions comprise $18{,}403$ frames.
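Class-weighted sampling of the kind described in this subsection can be sketched with inverse-frequency weights, so that each class is drawn with roughly equal probability. The frame counts are the ones quoted above; using whole-dataset counts rather than training-split counts, the sampling-with-replacement scheme, and the helper name are assumptions for illustration.

```python
import random

N_VRS = 1229         # VRS frames, from the text
N_NOMINAL = 18403    # nominal frames, from the text

vrs_ratio = N_VRS / (N_VRS + N_NOMINAL)   # minority class share

# Inverse-frequency weight per sample: both classes get equal total probability
CLASS_WEIGHT = {1: 1.0 / N_VRS, 0: 1.0 / N_NOMINAL}


def sample_batch(labels, batch_size=32, seed=0):
    """Draw mini-batch indices with replacement using class-balanced weights."""
    rng = random.Random(seed)
    weights = [CLASS_WEIGHT[y] for y in labels]
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)


labels = [1] * N_VRS + [0] * N_NOMINAL
batch = sample_batch(labels, batch_size=4000)
vrs_fraction = sum(labels[i] for i in batch) / len(batch)   # close to 0.5
```

Under this weighting each VRS frame is about 15 times more likely to be drawn than a nominal frame, so minority frames repeat often within an epoch; that repetition is one reason such sampling is usually paired with a regularizing loss.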

To strictly prevent data leakage, the dataset was partitioned longitudinally by individual flight sorties rather than by randomized frame shuffling. As detailed in Table 3, Flights 1, 3 and 4 constructed the training set, Flight 2 served as the validation set, and Flight 5 was strictly reserved as the unseen independent test set. During the training phase, both the ResNet50 and ViT baseline models were optimized using the AdamW optimizer with an initial learning rate of $1 \times 10^{-4}$, a cosine learning rate decay scheduler, and a batch size of 32. The models were trained for 50 epochs on an NVIDIA RTX 4090 GPU.
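The sortie-level partition and the learning rate schedule can be sketched as follows. The flight assignments, base learning rate, and epoch count come from the text; the data layout as `(flight_id, frame)` pairs, the per-epoch stepping, and the decay to zero without warm-up are assumptions.

```python
import math

TRAIN_FLIGHTS = {1, 3, 4}
VAL_FLIGHTS = {2}
TEST_FLIGHTS = {5}


def split_by_sortie(frames):
    """Partition (flight_id, frame) pairs so that no sortie spans two subsets."""
    split = {"train": [], "val": [], "test": []}
    for flight_id, frame in frames:
        if flight_id in TRAIN_FLIGHTS:
            split["train"].append(frame)
        elif flight_id in VAL_FLIGHTS:
            split["val"].append(frame)
        elif flight_id in TEST_FLIGHTS:
            split["test"].append(frame)
        else:
            raise ValueError(f"unknown flight id: {flight_id}")
    return split


def cosine_lr(epoch, total_epochs=50, base_lr=1e-4):
    """Cosine decay from base_lr at epoch 0 to zero at total_epochs (assumed form)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


# Placeholder frame identifiers stand in for spectrogram tensors
demo = split_by_sortie([(1, "a"), (2, "b"), (5, "c"), (3, "d")])
```

Splitting by sortie matters because consecutive sliding-window frames overlap heavily in time; shuffling frames across subsets would place near-duplicates of training frames in the test set and inflate the reported accuracy.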

3.3. Explainable AI Framework: Architectures for Mechanism Discovery

In this study, we deliberately compare two distinct AI philosophies to determine which is better suited for “scientific discovery” in aerodynamics.

3.3.1. Convolutional Neural Networks (ResNet50)

As a baseline, we employ ResNet50, which utilizes residual connections to learn deep hierarchical features [9,25]. However, CNNs are inherently biased toward local receptive fields. In the context of a CWT spectrogram, a CNN focuses on local “pixel clusters”, which may lead it to overfit on high-frequency noise or localized artifacts rather than capturing the persistent, global frequency bands that define the 1P mechanism [15].

3.3.2. ViT and Global Self-Attention

The ViT represents a shift toward global context modeling [26]. By dividing the spectrogram into non-overlapping patches and applying Multi-Head Self-Attention, ViT can establish dependencies between distant time–frequency regions from the very first layer [15]. For aerodynamic sensing, this “global-first” approach is hypothesized to be superior for locking onto the 1P frequency band, which manifests as a persistent horizontal structure across the entire temporal axis of the spectrogram [15].

3.3.3. Explainability Probe: Target-Layer Grad-CAM and Spatial Reconstruction

To extract physically auditable evidence, we deploy Gradient-weighted Class Activation Mapping (Grad-CAM) [14]. To apply this gradient-based attribution to the non-convolutional Vision Transformer, we implement a 2D spatial reconstruction transform ($T_{reshape}$) at the final layer block before the classification head. Let $X \in \mathbb{R}^{(N+1) \times D}$ be the output sequence of the final Transformer encoder layer, where $N = 196$ represents the number of flattened patch tokens and $D$ is the embedding dimension. The prepended class token ($X_{0} \in \mathbb{R}^{D}$) is detached. The remaining 1D sequence of patch tokens ($X_{1:N} \in \mathbb{R}^{N \times D}$) is then geometrically reshaped back into a 2D spatial grid pseudo-feature map $F \in \mathbb{R}^{14 \times 14 \times D}$:

$$F = T_{reshape}(X_{1:N}) \tag{2}$$

Gradients of the target VRS class score are backpropagated directly to this reconstructed pseudo-feature map $F$, and a gradient-weighted sum across channels, once upsampled, yields the final $224 \times 224$ heatmap. This transformation ensures full reproducibility and allows standard visual backpropagation to project global attention weights back onto the physical frequency and temporal axes.
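The reshape in Equation (2) and the subsequent channel weighting can be sketched in plain Python on nested lists. This is an illustration rather than the authors' code: it assumes raster-scan (row-major) patch ordering, takes the gradients as a precomputed input instead of obtaining them from an autograd framework, uses the standard Grad-CAM channel weights (spatially averaged gradients) followed by a ReLU, and upsamples by nearest-neighbor replication, which the text does not specify.

```python
def tokens_to_grid(tokens, grid=14):
    """T_reshape: drop the class token X_0 and arrange X_{1:N} as grid x grid x D."""
    patches = tokens[1:]
    if len(patches) != grid * grid:
        raise ValueError("expected grid * grid patch tokens after the class token")
    return [patches[r * grid:(r + 1) * grid] for r in range(grid)]


def grad_cam(feature_grid, grad_grid):
    """Grad-CAM on the pseudo-feature map F: ReLU(sum_k alpha_k * F[:, :, k])."""
    rows, cols = len(feature_grid), len(feature_grid[0])
    dim = len(feature_grid[0][0])
    # alpha_k: class-score gradient for channel k averaged over all grid cells
    alpha = [sum(grad_grid[r][c][k] for r in range(rows) for c in range(cols))
             / (rows * cols) for k in range(dim)]
    cam = [[max(0.0, sum(alpha[k] * feature_grid[r][c][k] for k in range(dim)))
            for c in range(cols)] for r in range(rows)]
    peak = max(max(row) for row in cam)
    # Normalize to [0, 1]; an all-zero map stays zero
    return [[v / peak if peak > 0 else 0.0 for v in row] for row in cam]


def upsample_nearest(cam, factor=16):
    """Replicate each grid cell factor x factor times (14 x 14 -> 224 x 224)."""
    return [[cam[r // factor][c // factor] for c in range(len(cam[0]) * factor)]
            for r in range(len(cam) * factor)]


# Toy example with D = 2: only grid row 5 carries activation in channel 0
features = [[9.0, 9.0]] + [[1.0, 0.0] if 5 * 14 <= n < 6 * 14 else [0.0, 0.0]
                           for n in range(196)]
grads = [[0.0, 0.0]] + [[1.0, 0.0] for _ in range(196)]
cam = grad_cam(tokens_to_grid(features), tokens_to_grid(grads))
heatmap = upsample_nearest(cam)
```

In the toy example the heatmap is a single horizontal band spanning the full width, the same geometry a sustained frequency component would produce on a spectrogram whose vertical axis is frequency.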

To mathematically formalize the focus degree of the explainable probe within the physical true frequency band, we define the Attention Focus Ratio (AFR). Let

H \in R^{H \times W}

be the generated Grad-CAM heatmap normalized to

[0, 1]

. To eliminate background gradient dilution, a threshold filtration is applied such that

{\tilde{H}}_{i, j} = H_{i, j}

if

H_{i, j} > 0.5

, and

{\tilde{H}}_{i, j} = 0

otherwise. Let

M_{R O I} \in {0, 1}^{H \times W}

be the binary region-of-interest mask corresponding to the

30 - 45 Hz

aerodynamic frequency strip derived from the logspace mapping. The selection of the

30 - 45 Hz

band is rigorously grounded in physical telemetry, determined entirely a priori and independently of the ViT’s observed attention maps. Based on the fixed operational rotor speed of

2500 RPM

, the theoretical fundamental flap-lag (1P) frequency is established at approximately

41.7 Hz

. The defined region-of-interest boundary (

30 - 45 Hz

band) is mathematically necessary to account for mechanical governor lag, structural aerodynamic damping, and the inherent frequency bin smearing caused by the wavelet transform’s finite time–frequency resolution. The AFR is mathematically formulated as:

AFR = \frac{\sum_{i = 1}^{H} \sum_{j = 1}^{W} {\tilde{H}}_{i, j} \cdot M_{ROI} (i, j)}{\sum_{i = 1}^{H} \sum_{j = 1}^{W} {\tilde{H}}_{i, j} + ϵ}

(3)

where

ϵ = 10^{- 8}

is an infinitesimal regularizer introduced to prevent division-by-zero anomalies under completely diffuse gradient fields.
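As an illustration of Equation (3), the following sketch builds a band mask on a log-spaced frequency axis and evaluates the AFR with the 0.5 threshold. The helper names, the axis limits `f_min` and `f_max`, and the convention that row 0 holds the highest frequency are assumptions made for the example; the actual logspace mapping is not reproduced in this section.

```python
import math


def log_frequency_axis(f_min, f_max, n_rows):
    """Center frequency of each image row, log-spaced, row 0 = f_max (assumed)."""
    step = (math.log10(f_max) - math.log10(f_min)) / (n_rows - 1)
    return [10 ** (math.log10(f_max) - r * step) for r in range(n_rows)]


def roi_mask(freqs, n_cols, band=(30.0, 45.0)):
    """Binary mask M_ROI: 1 on rows whose frequency lies inside the band."""
    lo, hi = band
    return [[1 if lo <= f <= hi else 0] * n_cols for f in freqs]


def attention_focus_ratio(heatmap, mask, threshold=0.5, eps=1e-8):
    """AFR = sum(H_tilde * M_ROI) / (sum(H_tilde) + eps), Equation (3)."""
    num = 0.0
    den = 0.0
    for h_row, m_row in zip(heatmap, mask):
        for h, m in zip(h_row, m_row):
            h_tilde = h if h > threshold else 0.0  # threshold filtration
            num += h_tilde * m
            den += h_tilde
    return num / (den + eps)
```

With a spatially uniform heatmap that passes the threshold everywhere, this function returns the fraction of the image covered by the mask, which is one way to read the spatial random baseline quoted later in the paper.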

As a boundary-sensitivity check, shifting the AFR integration window from

30 - 45 Hz

to

35 - 48 Hz

yields an expected edge-shift deviation of less than

1.4 %

. This minimal variance preserves the strict statistical order-of-magnitude divergence between the physically grounded frozen backbone (

16.82 %

) and the collapsed fine-tuned model (

0.40 %

), confirming that the evaluation framework is highly robust to predefined masking thresholds.

4. Results and Explainability Analysis

4.1. Quantitative Performance: The Architecture Bottleneck

To answer the fundamental question of whether AI can autonomously capture aerodynamic truths, we evaluated multiple model architectures on an independent test dataset (split by flight sortie to ensure rigorous generalization). The input consisted solely of single-channel CWT spectrograms derived from the magnitude of the 3-axis accelerometer data, ensuring identical feature exposure across all comparative architectures.

A critical challenge in aviation fault detection is the extreme class imbalance; in our test sortie, VRS events constituted less than

6 %

of the total flight time. To prevent the models from degenerating into majority-class predictors, class-weighted sampling was strictly enforced during training. Table 4 presents the comprehensive performance comparison across various mainstream architectures.
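One common way to realize class-weighted sampling is to draw each training frame with probability inversely proportional to its class frequency. The sketch below assumes that scheme; the function names are hypothetical and the exact sampler used in the study is not specified here.

```python
import random
from collections import Counter


def class_balanced_weights(labels):
    """Per-sample weight 1 / (count of that sample's class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def sample_epoch(labels, n_samples, seed=0):
    """Draw sample indices with replacement using the balanced weights."""
    rng = random.Random(seed)
    weights = class_balanced_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=n_samples)


# Example: 6 % VRS frames (label 1) among nominal frames (label 0).
labels = [0] * 94 + [1] * 6
indices = sample_epoch(labels, n_samples=10000)
vrs_share = sum(labels[i] for i in indices) / len(indices)
```

Because both classes carry the same total weight, roughly half of the drawn frames are VRS frames even though they make up only 6 % of the data.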

The empirical results reveal a profound architectural bottleneck inherent to CNNs. While all models achieve deceptively high overall accuracies (exceeding

92 %

) due to the dominance of nominal flight data, their capability to reliably detect VRS is highly differentiated. The CNN architectures, ranging from lightweight (MobileNet-V2, EfficientNet-B0) to deep models (VGG16, ResNet50), systematically fail to maintain a robust balance between Precision and Recall. For instance, while EfficientNet-B0 achieves high precision, its unacceptable recall (

18.69 %

) indicates that it misses over

81 %

of actual hazardous events, rendering it unviable for safety-critical applications.
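The gap between overall accuracy and VRS detection quality follows directly from the metric definitions. The sketch below uses hypothetical confusion counts (not values from Table 4) to show that a classifier which never predicts VRS still reaches 94 % accuracy when VRS makes up 6 % of the frames.

```python
def vrs_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, miss rate and accuracy for the VRS class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "miss_rate": 1.0 - recall,  # share of true VRS frames that go undetected
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical majority-class predictor on 1000 frames with 60 VRS frames.
always_nominal = vrs_metrics(tp=0, fp=0, fn=60, tn=940)
```

The same relation explains the EfficientNet-B0 figure: a recall of 18.69 % corresponds to a miss rate of 81.31 %.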

Conversely, the ViT-B/16 demonstrates a systemic superiority, achieving a VRS F1-score of

58.47 %

, surpassing the best-performing CNN by over 21 percentage points. We hypothesize that this performance gap is not merely a statistical artifact, but a direct consequence of the architectural philosophy. CNNs, bounded by localized receptive fields, struggle to perceive structural, wide-spanning frequency features within the CWT spectrogram. The ViT, leveraging its global self-attention mechanism, inherently excels at capturing persistent, non-local frequency bands. To rigorously validate this physical hypothesis, we must transition from statistical metrics to XAI probes.

4.2. Robustness Against Synthetic Noise: A Test of Physical Grounding

While the baseline performance on clean data (Section 4.1) establishes ViT’s superiority, a more rigorous test of whether a model has learned fundamental physics—rather than merely memorizing dataset-specific background noise—is its robustness against signal corruption. In real-world aviation environments, sensor data is frequently contaminated by electromagnetic interference, mechanical resonance, and sensor aging.

To simulate these harsh operational conditions and probe the models’ internal reliance on physical features, we conducted a synthetic noise robustness test. We injected zero-mean Gaussian white noise with increasing standard deviations (

σ \in \{0, 0.05, 0.1, 0.2, 0.3\}

) directly into the normalized CWT spectrograms of the unseen test set, evaluating the degradation of the VRS F1-score for both ViT-B/16 (Frozen Backbone) and the CNN baseline (ResNet50).
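The noise injection itself is a single additive step per pixel. The following sketch assumes that the noisy spectrogram is passed on without clipping or renormalization, which the text does not state; `add_gaussian_noise` is a hypothetical helper name.

```python
import random

# Noise levels evaluated in the robustness test.
SIGMAS = [0.0, 0.05, 0.1, 0.2, 0.3]


def add_gaussian_noise(spectrogram, sigma, seed=0):
    """Add zero-mean Gaussian white noise to a normalized 2D spectrogram.

    No clipping is applied afterwards (an assumption of this sketch).
    """
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in spectrogram]


# Usage: one corrupted copy of a (dummy) test spectrogram per noise level.
clean = [[0.5] * 8 for _ in range(8)]
corrupted = {sigma: add_gaussian_noise(clean, sigma) for sigma in SIGMAS}
```

Each corrupted copy would then be classified by the frozen-backbone ViT and by ResNet50, and the VRS F1-score recomputed at every noise level.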

As illustrated in Figure 4, the performance trajectories of the two architectures diverge dramatically under noisy conditions. The CNN architecture exhibits a catastrophic degradation; its VRS F1-score plummets sharply even at minimal noise levels (

σ = 0.05

). This vulnerability indicates that the CNN’s decision boundary relies heavily on fragile, localized high-frequency textures or spurious background correlations, which are readily masked by Gaussian noise.

In stark contrast, the ViT demonstrates exceptional resilience. Its performance curve degrades smoothly and maintains a significant margin of superiority across all noise intensity levels. This robust behavior strongly supports our hypothesis: because the ViT utilizes global self-attention, it bases its decisions on the macro-structural features of the spectrogram—specifically, the persistent horizontal energy band of the 1P frequency. Because this low-frequency aerodynamic resonance is structurally coherent across the time axis, it remains perceptible to the ViT’s global attention mechanism even when heavily overlaid with random synthetic noise.

4.3. Visual Evidentiary Comparison: Local Convolution vs. Global Attention

To uncover the internal geometric constraints that led to the statistical performance gaps observed in Section 4.1 and Section 4.2, we deployed Target-Layer Grad-CAM to expose the decision-making foundations of each architecture. Figure 5 presents a strict side-by-side architectural visualization using an identical, fully developed VRS sample (

V_{z} = 9.12 m / s

) evaluated across three distinct model configurations.

To prevent color-blending artifacts and preserve the physical textures of the underlying flow field, a luminance-based overlay strategy was adopted for visualization. Instead of applying a conventional pseudo-color colormap which could obscure structural details, the normalized grayscale Grad-CAM attributions are applied strictly as a luminance (brightness) mask over the original CWT spectrogram. Consequently, regions with high attention gradients appear structurally illuminated, while non-contributing background noise is naturally darkened, ensuring unambiguous visual identification of the target frequencies.
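The luminance-based overlay amounts to an element-wise product between the spectrogram and the normalized attribution map. The sketch below is one plausible reading of that description; the `floor` parameter, which keeps a faint view of low-attention regions, is an addition for illustration and is not mentioned in the text.

```python
def luminance_overlay(spectrogram, cam, floor=0.0):
    """Use the Grad-CAM map as a brightness mask over the spectrogram.

    spectrogram, cam: 2D lists of equal shape with values in [0, 1].
    floor: minimum brightness kept where attention is zero (hypothetical).
    """
    return [[s * (floor + (1.0 - floor) * c) for s, c in zip(s_row, c_row)]
            for s_row, c_row in zip(spectrogram, cam)]
```

With `floor=0.0`, pixels that receive no attention are rendered black, while pixels with full attention keep their original spectrogram intensity.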

As visualized in Figure 5A, the traditional CNN architecture (ResNet50) exhibits a severely diffuse attention pattern. Bounded by the inductive bias of localized receptive fields, the convolutional kernels operate as local pixel-cluster aggregators. Consequently, the model’s high-activation gradients are erratically scattered across the upper and lower boundaries of the CWT spectrogram, chasing high-frequency transient maneuvers and non-physical background instrumentation noise. While this localized feature matching manages to output a correct classification token, the underlying reasoning is aerodynamically vacuous.

Most illuminatingly, Figure 5B captures the visual failure mode of the fully fine-tuned Vision Transformer. Although this variant achieved the highest VRS F1-score (

F_{1} = 58.47 %

), its unconstrained end-to-end optimization triggered severe “shortcut learning”. Without architectural boundaries, the multi-head self-attention layers overfitted to specific localized high-energy noise fragments, totally dissolving the pre-trained macro-structural context. The heatmap manifests as an unaligned, chaotic hotspot patch, visually confirming that high statistical accuracy can be achieved for entirely uncertifiable reasons.

In sharp contrast, the frozen-backbone ViT (Figure 5C) demonstrates an autonomous convergence toward physical law. Without any human domain guidance or physical priors injected into its loss function, the model’s global attention mechanism ignores the surrounding flight noise and concentrates its gradient weight into a single, tightly focused horizontal bright band. Critically, this continuous energy tract spans the temporal axis and is closely aligned with the theoretically derived

41.7 Hz

(1P) fundamental flap-lag coupling harmonic. This contrast provides visual evidence that freezing the backbone forces the transformer to act as a pure structural feature extractor, successfully bridging raw proprioceptive data with classical fluid mechanics.

4.4. Dynamic Aerodynamic Tracking: Attention Evolution Across Descent Velocities

To mathematically and physically disqualify the hypothesis of “shortcut learning”—where a model might merely memorize coincidental background noise—we conducted a multi-stage aerodynamic evolution analysis using our frozen-backbone ViT (the physically constrained optimization model). The VRS is not a static event; it is a transient, highly non-linear aerodynamic boundary that intensifies and evolves dynamically with the descent velocity (

V_{z}

). Figure 6 displays the sequential transformation of the CWT spectrogram alongside its corresponding quantified Grad-CAM energy distribution across three critical operational phases: the Early Entry Phase (

V_{z} = 7.08 m / s

), the Fully Developed Phase (

V_{z} = 9.12 m / s

), and the Extreme Buffet Phase (

V_{z} = 12.42 m / s

).

At the inception of VRS, or the Early Entry Phase (

V_{z} = 7.08 m / s

), the recirculating toroidal wake is freshly generating, inducing localized tip vortex impingements. As shown in the top row of Figure 6, the model confidently identifies the state (

p = 0.719

), and the quantitative attention focus ratio (AFR) within the

30 - 45 Hz

fundamental band registers at

25.7 %

. The structural vibrations are beginning to gather around the 1P frequency, but the flow field remains partially disorganized.

As the aircraft transitions into the Fully Developed Phase (

V_{z} = 9.12 m / s

), strong flap-lag aerodynamic coupling occurs due to the highly asymmetric load variations across the rotor blade cycle. Crucially, the ViT automatically captures this physical manifestation. Its global attention collapses into a perfectly horizontal, hyper-focused energy strip tightly bound to the

41.7 Hz

theoretical 1P line. The quantified frequency spectrum reveals that the

30 - 45 Hz

band now captures

44.6 %

of the absolute decision-making gradients. This demonstrates that the frozen backbone preserves an immaculate physical feature extractor, prioritizing the theoretical aerodynamic truth over transient noise.

Most compellingly, as the flight deepens into the Extreme Buffet Phase (

V_{z} = 12.42 m / s

), the aerodynamic regime approaches the windmill brake state boundary, where large-scale, low-frequency chaotic flow separations dominate the airframe. The quantitative XAI metrics capture a profound physical synchronicity: the model’s fundamental 1P AFR drops sharply to

14.8 %

, while its global attention adaptively shifts downward, allocating a major energy proportion into the low-frequency (

3 - 10 Hz

) buffeting band.
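The migration of attention between the 1P band and the buffeting band can be quantified by summing heatmap energy per frequency band. The sketch below is a simplified, unthresholded variant of the AFR for several bands at once; the function name and the per-row frequency list are assumptions, and the numbers it produces are not comparable one-to-one with the AFR values reported above.

```python
def band_energy_fractions(heatmap, row_freqs, bands):
    """Fraction of total heatmap energy falling in each named frequency band.

    heatmap: 2D list (rows = frequency, columns = time).
    row_freqs: frequency in Hz represented by each row.
    bands: dict mapping a band name to a (low_hz, high_hz) tuple.
    """
    total = sum(sum(row) for row in heatmap)
    fractions = {}
    for name, (lo, hi) in bands.items():
        energy = sum(sum(row) for row, f in zip(heatmap, row_freqs)
                     if lo <= f <= hi)
        fractions[name] = energy / total if total > 0 else 0.0
    return fractions


BANDS = {"1P": (30.0, 45.0), "buffet": (3.0, 10.0)}
```

Evaluating this function on the heatmaps of the three descent phases would give one pair of fractions per phase, making the downward shift of attention directly comparable across velocities.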

This synchronized migration between data-driven attention focus and classical fluid–structure transformation provides unequivocal visual and statistical proof. The Vision Transformer, when constrained by a frozen pre-trained backbone, does not merely memorize a static

41.7 Hz

pattern; it autonomously acquires and tracks genuine aerodynamic principles. This dynamic tracking capability firmly validates our proposed Mechanism-Guided Verification (MGV) framework, proving that AI can serve as a trustworthy, transparent probe for complex aerodynamic states.

4.5. Ablation Studies: Effects of Architectural Regularization and Layer-Specific Fine-Tuning

To comprehensively address the sensitivity of the model’s physical interpretability to its architectural constraints, a targeted ablation study was conducted. Unlike natural images where shallow layers extract generic visual edges and deep layers construct semantic logic, CWT spectrograms encode critical physical frequencies directly within their low-level textural topology. Therefore, we evaluated four distinct transfer-learning configurations to map the accuracy-interpretability trade-off: an unconstrained Fully Fine-Tuned ViT (tuning all layers

0 - 11

), a Bottom-Open variant (tuning low-level feature extractors, layers

0 - 2

), a Top-Open variant (tuning high-level semantic layers

9 - 11

), and the strictly Regularized Fully Frozen ViT. Table 5 details the joint variance of the classification performance (

F_{1}

-score) and the physical grounding metric (Mean AFR) across all 305 true VRS positive frames.
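The four transfer-learning configurations differ only in which transformer blocks receive gradient updates. The sketch below encodes that selection as a lookup on parameter names. The naming scheme (`blocks.<i>.` for encoder blocks, `head.` for the classifier) and the treatment of the patch and positional embeddings are assumptions; only the block ranges come from the text.

```python
# Encoder blocks that are tuned in each ablation configuration.
ABLATION_CONFIGS = {
    "fully_fine_tuned": range(0, 12),  # layers 0-11
    "bottom_open": range(0, 3),        # layers 0-2
    "top_open": range(9, 12),          # layers 9-11
    "fully_frozen": range(0),          # no encoder block is tuned
}


def is_trainable(param_name, config):
    """Decide whether a parameter is updated under the given configuration."""
    if param_name.startswith("head."):
        return True  # classification head is trained in every configuration
    parts = param_name.split(".")
    if parts[0] == "blocks" and len(parts) > 1 and parts[1].isdigit():
        return int(parts[1]) in ABLATION_CONFIGS[config]
    # Embeddings and final norm: tuned only in full fine-tuning (assumed).
    return config == "fully_fine_tuned"
```

In a deep-learning framework, the result of `is_trainable` would typically be written to each parameter's gradient flag before the optimizer is constructed.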

As illustrated in Table 5, a profound, domain-specific physical dichotomy emerges. The unconstrained model achieves the highest statistical

F_{1}

-score (

58.47 %

) but experiences catastrophic representation collapse (Mean AFR

\approx 0.40 %

, far below the

13.39 %

spatial random baseline). Notably, the Top-Open configuration, which follows standard transfer-learning protocols by tuning late-stage semantic layers, fails to improve classification (

F_{1} = 35.91 %

) because the frozen shallow layers cannot adapt to CWT-specific textures; however, it retains comparatively high physical alignment (AFR

= 10.06 %

).

In stark contrast, the Bottom-Open configuration significantly boosts the

F_{1}

-score to

54.25 %

by allowing early block adaptation to aeroelastic spectrogram textures, while still rescuing a meaningful portion of physical interpretability (AFR

= 6.05 %

). This validates the hypothesis that frequency evidence in CWT is primarily anchored in low-level feature representations. Nevertheless, only the Fully Frozen configuration secures maximal physical alignment (

16.82 %

). The ablation firmly proves that when strict adherence to a physical proxy (1P frequency) is the primary certification objective for aviation AI, deep structural regularization is absolutely mandatory to prevent high-frequency shortcut learning.

5. Discussion

5.1. The Accuracy Trap and the Necessity of Mechanism-Guided Verification

An apparent contradiction emerged during our systematic exploration. When we allowed the Vision Transformer to undergo full end-to-end fine-tuning without backbone constraints, the model achieved the highest VRS F1-score (

58.47 %

) on the highly imbalanced test set. However, XAI probing revealed that this fully fine-tuned model suffered from a severe representation collapse, manifesting as a total loss of its global structural attention. Its attention map devolved into a diffuse pattern, heavily reliant on spurious local correlations and high-frequency background noise rather than the

41.7 Hz

physical resonance. It achieved high accuracy through “shortcut learning”.

To disqualify the possibility of this being a localized anomaly of a single sample, a comprehensive statistical validation was conducted by computing the mean AFR across all 305 true VRS positive frames in the independent test set (Flight 5). The fully fine-tuned ViT exhibited a catastrophic mean AFR of only

0.40 \pm 0.72 %

, which falls drastically below the randomized spatial area baseline (

13.39 %

), indicating that its unconstrained multi-head self-attention layers systematically overfitted to fragile high-frequency instrumentation noise fragments. Conversely, the frozen-backbone regularization configuration yielded a robust, physically grounded mean AFR of

16.82 \pm 7.58 %

across the entire test population.

By deliberately freezing the pre-trained backbone and restricting training to the classification head (yielding a lower, but robust F1-score of

36.40 %

), the model was forced to seek macroscopic, structurally persistent features. As demonstrated in Section 4.4, this mathematically constrained “Scientist” model autonomously locked onto the physically derived 1P frequency and dynamically tracked its evolution.

It is crucial to acknowledge the inherent accuracy-interpretability trade-off demonstrated by these results. While the frozen-backbone configuration successfully secures physically grounded representations, its

F_{1}

-score (

36.40 %

) is significantly lower than the fully fine-tuned unconstrained variant (

58.47 %

), reflecting a diminished representational capacity when the deep feature space is locked. To overcome this limitation, future iterations of the Mechanism-Guided Verification framework should evolve from post hoc auditing to proactive physics-informed training. Rather than rigidly freezing the backbone, the computed Attention Focus Ratio (AFR) can be mathematically integrated into the optimization objective as a differentiable penalty term. By jointly minimizing the classification cross-entropy and maximizing the AFR-based physical alignment loss, the network could theoretically achieve high classification performance while strictly respecting the defined aerodynamic frequency boundaries.
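One possible form of the proposed joint objective is the cross-entropy plus a weighted penalty on the attention that falls outside the region of interest. The sketch below only evaluates such an objective for given numbers. It replaces the hard 0.5 threshold with an unthresholded ratio so that the term would be differentiable in an autodiff framework, and the weight `lam` is a hypothetical hyperparameter; none of this has been validated in the study.

```python
def soft_afr(heatmap, mask, eps=1e-8):
    """AFR without the hard threshold (assumed differentiable surrogate)."""
    num = sum(h * m for h_row, m_row in zip(heatmap, mask)
              for h, m in zip(h_row, m_row))
    den = sum(h for h_row in heatmap for h in h_row)
    return num / (den + eps)


def mgv_objective(cross_entropy, heatmap, mask, lam=0.1):
    """Cross-entropy plus lam * (1 - soft AFR); lower is better."""
    return cross_entropy + lam * (1.0 - soft_afr(heatmap, mask))
```

When all attention lies inside the mask the penalty vanishes and the objective reduces to the cross-entropy; when none does, the full weight `lam` is added.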

This juxtaposition exposes a critical vulnerability in the current airworthiness certification paradigm, which predominantly relies on statistical metrics (e.g., Accuracy, F₁-score) derived from independent test sets [16]. In safety-critical aerospace applications, an AI model that achieves high scores by memorizing non-physical noise is fundamentally unsafe; it will fail unpredictably under novel turbulent conditions [12].

Based on the quantitative interpretability demonstrated in this study, the obtained results provide strong technical grounds for recommending the inclusion of Mechanism-Guided Verification (MGV) methodologies into future AI certification frameworks. Rather than relying solely on black-box statistical metrics, regulatory authorities could consider evaluating the Attention Focus Ratio (AFR) as a supplementary, physics-informed auditing tool to ensure that deep learning algorithms deployed in flight-critical systems inherently align with established fluid-dynamic and structural realities.

5.2. Disentangling Physical Grounding from Geometrical Bias

A critical methodological question raised during peer evaluation is whether the model’s attention shift reflects a genuine distillation of aerodynamic regimes, or merely a trivial statistical response to gradient distributions driven by the physical weakening of the 1P resonance band. If the frozen-backbone ViT operated strictly as a rigid “horizontal line detector” or overfitted to ImageNet’s linear textures, the attenuation of the 1P fundamental frequency would logically cause a catastrophic collapse of its classification confidence or disperse its attention into random background noise.

However, as empirical evidence in Section 4.4 demonstrates, during the Extreme Buffet Phase (

V_{z} = 12.42 m / s

), the model seamlessly relinquishes its horizontal focus and dynamically reallocates its attention gradients to the chaotic, non-linear morphological blobs in the

3 - 10 Hz

regime, maintaining a robust classification probability (

p = 0.6347

). This adaptive transition indicates that the model tracks the fluid–structure interaction’s energy migration rather than clinging to geometric templates.

Nonetheless, we strictly acknowledge the limitation that post hoc visual attributions cannot entirely isolate implicit network biases. To fully mathematically decouple architectural priors from physical reasoning, future investigations will incorporate adversarial ablation paradigms, such as randomized frequency-axis shuffling and physical masking counter-tests, to further formalize the boundaries of certifiable scientific machine learning.

5.3. Remark on Conservation Laws and Physical Proxies

A fundamental limitation of applying deep learning to fluid-dynamic environments is the risk of models achieving high accuracy while silently violating underlying conservation laws (e.g., mass and momentum conservation). It must be explicitly clarified that the proposed ViT architecture processes 2D CWT spectrograms derived from 1D structural vibrations, rather than predicting 3D spatial flow fields (velocity and pressure distributions). Consequently, directly computing Navier–Stokes residuals on the network’s output is mathematically infeasible.

To bridge this gap, the Mechanism-Guided Verification framework employs the aeroelastic structural response—specifically, the

41.7 Hz

(1P) frequency band—as a rigorous physical proxy constraint. While the model does not explicitly solve volumetric fluid conservation equations, its learned classification boundary respects the classical structural-dynamic symptoms induced by those very fluid mechanics. By statistically verifying that the model’s global attention aligns with this established aerodynamic signature, we provide a pragmatic, verifiable pathway to physical grounding that bypasses the computational impossibility of real-time spatial residual calculation.

5.4. Methodological Generalization to Variable-RPM Platforms

Having formally established the three-step Mechanism-Guided Verification (MGV) framework in Section 3, a critical operational consideration for future deployment is the framework’s generalization across platforms with variable rotor speeds (Variable RPM).

While the current study and its corresponding baseline

M_{ROI}

utilize a fixed operational condition (

2500 RPM

), the static predefined

30 - 45 Hz

auditing mask can be mathematically adapted for variable-speed rotorcraft. Through a conditional token integration layer, the instantaneous tachometer frequency

f_{1P}(t)

would be fed into the ViT’s Multi-Head Self-Attention blocks as a dynamic coordinate scaling prior. This extension would allow the

M_{ROI}

tracking boundary to shift dynamically in real-time alongside the mechanical speed governor, ensuring persistent physical alignment and quantitative proxy auditing without necessitating the computationally expensive retraining of the core visual feature weights. This dynamic mask adaptation remains a theoretical extension of the MGV framework; experimental validation on variable-RPM flight test campaigns is reserved for future work.
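The simplest way to move the auditing band with rotor speed is to scale its edges in proportion to the instantaneous 1P frequency. The sketch below assumes that proportional rule, which is not stated in the text, and covers only the mask boundaries; the conditional token integration layer itself is not modeled.

```python
NOMINAL_RPM = 2500.0
NOMINAL_BAND_HZ = (30.0, 45.0)  # auditing band at the nominal rotor speed


def one_per_rev_hz(rpm):
    """1P frequency in Hz: one cycle per rotor revolution."""
    return rpm / 60.0


def scaled_band(rpm):
    """ROI band edges scaled proportionally to rotor speed (assumed rule)."""
    ratio = rpm / NOMINAL_RPM
    return (NOMINAL_BAND_HZ[0] * ratio, NOMINAL_BAND_HZ[1] * ratio)
```

At 2500 RPM this reproduces the 41.7 Hz line inside the 30 to 45 Hz band; at 3000 RPM the 1P line moves to 50 Hz and the band to 36 to 54 Hz.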

5.5. Traditional Baselines and Computational Cost Analysis

A fundamental inquiry addresses whether a simple classical signal-processing baseline—such as applying a hard energy threshold on the 1P frequency band—could achieve comparable VRS detection performance with negligible computational overhead. In traditional Health and Usage Monitoring Systems (HUMS), univariate condition indicators utilizing hard amplitude thresholds frequently suffer from unacceptably high false-alarm rates [27]. While computationally trivial, extracting raw vibration amplitude at

41.7 Hz

inherently conflates aerodynamic anomalies with pilot-induced dynamics. In real-world operations, nominal aggressive maneuvers (e.g., steep pitch/roll commands) and turbulent wind gusts routinely alter the rotor’s operating conditions, intensely exciting the fundamental flap-lag structural modes and causing transient spikes in 1P energy [28]. The proposed ViT architecture transcends simple amplitude thresholding by globally capturing the 2D morphological evolution of the CWT spectrogram, differentiating between the transient spikes of control maneuvers and the sustained, chaotic frequency smearing characteristic of vortex ring interactions. Furthermore, ResNet50 was specifically selected as the primary deep learning baseline because it represents the most widely adopted and mathematically robust benchmark in industrial computer vision, allowing for a rigorous, fair comparison between standard localized convolutions and global self-attention mechanisms.
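For reference, the classical baseline discussed above can be sketched as a single-bin spectral power estimate followed by a hard threshold. The example uses the Goertzel algorithm; the sampling rate, window length, and threshold value are hypothetical, and the target frequency is rounded to the nearest DFT bin of the window.

```python
import math


def goertzel_power(samples, sample_rate_hz, target_hz):
    """Squared DFT magnitude at the bin closest to target_hz (Goertzel)."""
    n = len(samples)
    k = round(n * target_hz / sample_rate_hz)  # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def threshold_alarm(samples, sample_rate_hz, threshold, target_hz=41.7):
    """Hard-threshold condition indicator on 1P band power."""
    return goertzel_power(samples, sample_rate_hz, target_hz) > threshold
```

Such an indicator fires on any sufficiently strong excitation near 1P, whatever its cause, which is the false-alarm mechanism the paragraph describes.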

Regarding computational cost, the proposed framework leverages an asymmetric offline-online paradigm. The high-fidelity CFD simulations are strictly offline, demanding extreme computational resources (approximately 60 hours per segment on 128 CPU cores) to establish irrefutable physical ground-truth labels. In stark contrast, the online phase utilizes the pre-trained ViT for dynamic inference. Evaluated on a desktop-class NVIDIA RTX 3060 GPU, the single-frame forward pass for the frozen-backbone ViT requires approximately

12 - 15 ms

(>60 FPS), well within the real-time threshold of typical rotorcraft flight control loops (

10 - 50 Hz

).
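The real-time claim can be checked with simple arithmetic on the reported latency. The sketch below considers the forward pass only; CWT preprocessing and Grad-CAM computation would add to the per-frame time and are not included.

```python
def frames_per_second(latency_ms):
    """Throughput implied by a per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


def fits_control_loop(latency_ms, loop_hz):
    """True if one inference completes within one control-loop period."""
    return latency_ms <= 1000.0 / loop_hz


# Reported forward-pass latency range and the fastest quoted control loop.
worst_case_fps = frames_per_second(15.0)        # about 66.7 FPS
within_budget = fits_control_loop(15.0, 50.0)   # 15 ms versus a 20 ms period
```

Even the slower end of the 12 to 15 ms range stays above 60 FPS and below the 20 ms period of a 50 Hz loop.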

While current inference was executed on a desktop architecture for algorithmic validation, deploying standard ViT models to edge-computing hardware (e.g., NVIDIA Jetson AGX Orin) is highly feasible. By applying post-training quantization (INT8) and TensorRT operator fusion, the memory bandwidth bottlenecks of edge devices can be significantly mitigated, ensuring that the MGV framework maintains real-time predictive capabilities aboard resource-constrained UAV platforms without sacrificing its interpretable physical grounding.

6. Conclusions

This study addressed the critical challenge of identifying the highly unsteady Vortex Ring State (VRS) using only internal proprioceptive sensors, shifting the paradigm from building a mere “black-box classifier” to developing a transparent “scientific discovery tool”. By transforming one-dimensional vibration streams into two-dimensional CWT spectrograms [13], we framed aerodynamic state sensing as a visual pattern recognition task governed by the newly formalized Mechanism-Guided Verification (MGV) framework.

The empirical results and systematic ablation studies yield three critical conclusions. First, the explainable probe reveals a dual-stage attention paradigm inherent to the regularized Vision Transformer: it autonomously locks onto the

41.7 Hz

(1P) fundamental frequency during the fully developed VRS phase, and dynamically migrates its focus toward low-frequency (

3 - 10 Hz

) buffeting structures as the aircraft enters extreme descent velocities. Second, the formal layer-specific ablation study demonstrates that backbone freezing acts as a mandatory architectural regularizer; it forces the ViT to abandon the high-frequency “shortcut learning” that plagues the unconstrained model, ensuring physically grounded feature extraction. In contrast, standard CNNs systematically exhibit diffuse attention scattered across non-physical instrumentation noise.

Finally, the MGV framework provides a pragmatic pathway for airworthiness certification by verifying physical grounding through established aeroelastic proxies (the 1P structural response), effectively bypassing the prohibitive computational overhead associated with real-time first-principles aerodynamic modeling and volumetric flow simulations. While the current quantitative auditing mask (

M_{ROI}

) is anchored to a fixed

2500 RPM

condition, the framework is mathematically designed to generalize to variable-RPM platforms via dynamic tachometer integration, as discussed in Section 5.4. By successfully bridging data-driven decisions with classical fluid mechanics, this research lays a crucial, physically auditable foundation for the deployment of artificial intelligence in advanced, safety-critical aerospace systems.

Author Contributions

Conceptualization, X.Z. and F.S.; methodology, X.Z. and J.S.; software, X.Z.; validation, X.Z., J.S., and J.Z.; formal analysis, X.Z.; investigation, X.Z.; resources, F.S.; data curation, X.Z.; writing—original draft preparation, X.Z.; writing—review and editing, J.Z. and F.S.; visualization, X.Z.; supervision, F.S.; project administration, F.S.; funding acquisition, F.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62206065, and the Young Elite Scientists Sponsorship Program by GXAST (2025YESSGX158).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to institutional restrictions and ongoing research.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

1P	First harmonic fundamental frequency
CFD	Computational Fluid Dynamics
CNN	Convolutional Neural Network
CWT	Continuous Wavelet Transform
GPS	Global Positioning System
Grad-CAM	Gradient-weighted Class Activation Mapping
IMU	Inertial Measurement Unit
RPM	Revolutions Per Minute
UAV	Unmanned Aerial Vehicle
ViT	Vision Transformer
VRS	Vortex Ring State

References

Johnson, W. Model for Vortex Ring State Influence on Rotorcraft Flight Dynamics. Technical Report, NASA, 2005. Available online: https://rotorcraft.arc.nasa.gov/Publications/files/Johnson_TP-2005-213477.pdf (accessed on 24 June 2026).
Pickles, D.J.; Green, R.B.; Busse, A. The Vortex Ring State of a Rotor and Its Comparison with the Collapse of an Annular Jet in Counterflow. Phys. Fluids 2023, 35, 044103.
Zeng, L.; Gao, Z.; Xu, T.; Wang, L. Mechanism Analysis of the Vortex Ring and Its Effects on an Axial Descending Rotor. Int. J. Fluid Eng. 2024, 1, 23101.
Barbely, N.L.; Komerath, N.M.; Novak, L.A. A Study of Coaxial Rotor Performance and Flow Field Characteristics. NASA Ames Research Center, Vertical Lift Research Center of Excellence (VLRCOE), 2016. In Proceedings of the AHS Technical Meeting on Aeromechanics Design for Vertical Lift, San Francisco, CA, USA, 20–22 January 2016.
Konstantinov, S.G.; Ignatkin, Y.M.; Makeev, P.V.; Nikitin, S.O. Comparative Study of Coaxial Main Rotor Aerodynamics in the Hover with the Usage of Two Methods of Computational Fluid Dynamics. J. Aerosp. Technol. Manag. 2021, 13, e1821.
Li, Y.; Li, X.; Wu, H.; Zhou, P.; Zhang, X.; Zhong, S. Experimental and Numerical Investigations on Rotor Noise in Axial Descending Flight. Phys. Rev. Fluids 2023, 8, 094803.
Giurgiutiu, V.; Cuc, A.; Goodman, P. Review of Vibration-Based Helicopters Health and Usage Monitoring Methods. In Proceedings of the Meeting of the Society for Machinery Failure Prevention Technology, Virginia Beach, VA, USA, 2–5 April 2001.
Pawar, P.M.; Ganguli, R. Helicopter Rotor Health Monitoring-a Review. Proc. Inst. Mech. Eng. Part G. J. Aerosp. Eng. 2007, 221, 631–647.
Zhao, J.; Wang, W.; Huang, J.; Ma, X. A Comprehensive Review of Deep Learning-Based Fault Diagnosis Approaches for Rolling Bearings: Advancements and Challenges. AIP Adv. 2025, 15, 020702.
European Union Aviation Safety Agency (EASA). EASA Artificial Intelligence Roadmap 2.0: Human-Centric AI in Aviation; Roadmap, EASA: Cologne, Germany, 2023.
Federal Aviation Administration (FAA). Review of Machine Learning Application in Aviation; Technical Report DOT/FAA/TC-21/50; U.S. Department of Transportation, Federal Aviation Administration: Washington, DC, USA, 2021.
Sutthithatip, S.; Perinpanayagam, S.; Aslam, S.; Wileman, A. Explainable AI in Aerospace for Enhanced System Performance. In Proceedings of the 2021 IEEE/AIAA 40th Digital Avionics Systems Conference (DASC), San Antonio, TX, USA, 3–7 October 2021; pp. 1–7.
Yang, C.L.; Chen, Z.X.; Yang, C.Y. Sensor Classification Using Convolutional Neural Network by Encoding Multivariate Time Series as Two-Dimensional Colored Images. Sensors 2019, 20, 168.
O’Sullivan, C. Grad-CAM for Explaining Computer Vision Models. A Data Odyssey. 2025. Available online: https://adataodyssey.com/grad-cam/ (accessed on 24 June 2026).
Maurício, J.; Domingues, I.; Bernardino, J. Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review. Appl. Sci. 2023, 13, 5521.
KN, K.; Perrusquia, A.; Tsourdos, A.; Ignatyev, D. Integrating Explainable AI into Two-Tier ML Models for Trustworthy Aircraft Landing Gear Fault Diagnosis. In Proceedings of the AIAA SCITECH 2025 Forum, Orlando, FL, USA, 6–10 January 2025; p. 1928.
Sun, J.; Zhou, X.; Ban, T.; Zhao, J.; Shuang, F. Energy Consumption Modelling of Coaxial-Rotor in Vortex Ring State for Controllable High-speed Descending. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 3803–3809.
Leishmann, J.G. Principles of Helicopter Aerodynamics; Cambridge University Press: Cambridge, UK, 2006. [Google Scholar]
Leishman, J.G. Rotorcraft Aerodynamics. In Encyclopedia of Aerospace Engineering; Wiley: Hoboken, NJ, USA, 2010. [Google Scholar]
Marshall, M.; Tang, E.; Cornelius, J.; Ruiz, F.; Schmitz, S. Performance of the Dragonfly Lander’s Coaxial Rotor in Vortex Ring State. In Proceedings of the AIAA Scitech 2024 Forum, Orlando, FL, USA, 8–12 January 2024; p. 0247. [Google Scholar]
Sadowsky, J. The Continuous Wavelet Transform: A Tool for Signal Investigation and Understanding. Johns Hopkins APL Tech. Dig. 1994, 15, 306–318. Available online: https://www.jhuapl.edu/Content/techdigest/pdf/V15-N04/15-04-Sadowsky.pdf (accessed on 24 June 2026).
PyWavelets Developers. Continuous Wavelet Transform (CWT)—PyWavelets Documentation. Available online: https://pywavelets.readthedocs.io/en/latest/ref/cwt.html (accessed on 24 June 2026).
Weisang GmbH. Continuous Wavelet Transform (CWT)—FlexPro Documentation. Available online: https://www.weisang.com/en/support/know/flexpro-documentation/additional-topics/continuous-wavelet-transformation-cwt/ (accessed on 24 June 2026).
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
Wang, H.; Wang, H.; Tang, X. A Review of Deep Learning in Rotating Machinery Fault Diagnosis and Its Prospects for Port Applications. Appl. Sci. 2025, 15, 11303.
Naman, A.; Zhang, G. FAST: Fast Audio Spectrogram Transformer. arXiv 2025.
Leoni, J.; Tanelli, M.; Palman, A. A new comprehensive monitoring and diagnostic approach for early detection of mechanical degradation in helicopter transmission systems. Expert Syst. Appl. 2022, 210, 118412.
Bechhoefer, E.; Bernhard, D. Signal detection theory applied to helicopter transmission diagnostic thresholds. In Proceedings of the American Helicopter Society 63rd Annual Forum, Virginia Beach, VA, USA, 1–3 May 2007.

Figure 2. CWT time–frequency plot example. (a) Hovering stage; (b) high-speed descent.

Figure 3. (a) A segment of the time series curves from a flight test; (b) the descent path.

Figure 4. VRS F1-score degradation curves of ViT-B/16 (Frozen Backbone) and ResNet50 under increasing levels of synthetic Gaussian noise ($\sigma$). The ViT exhibits resilient degradation, indicating a reliance on stable macro-structural physical features rather than fragile local textures.
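
The noise sweep behind this figure can be illustrated with a minimal sketch. It assumes zero-mean additive Gaussian noise with standard deviation $\sigma$ applied to a one-dimensional signal. This section does not state whether the noise was injected into the raw vibration signal or into the CWT spectrogram, nor which $\sigma$ levels were used, so the function name `add_gaussian_noise` and the levels below are illustrative only.

```python
import random
import statistics

def add_gaussian_noise(signal, sigma, seed=0):
    """Return a copy of `signal` with zero-mean Gaussian noise of std `sigma` added."""
    rng = random.Random(seed)  # fixed seed so a sweep is repeatable
    return [x + rng.gauss(0.0, sigma) for x in signal]

# Hypothetical sweep: these sigma levels are placeholders, not the paper's values.
clean = [0.0] * 10_000
for sigma in (0.0, 0.1, 0.5):
    noisy = add_gaussian_noise(clean, sigma)
    # The empirical standard deviation should sit close to the requested sigma.
    print(sigma, round(statistics.pstdev(noisy), 3))
```

In a robustness test of this kind, each noisy copy would be passed through the same preprocessing and classifier as the clean data, and the VRS F1-score recorded per noise level.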

Figure 5. Visual explanation comparison on an identical VRS sample ($V_z = 9.12$ m/s). The Grad-CAM attribution is applied as a luminance mask over the original CWT spectrogram to prevent color contamination and preserve physical texture clarity. (A) ResNet50 exhibits diffuse attention scattered across high-frequency noise. (B) Fully fine-tuned ViT reveals shortcut learning, focusing on arbitrary localized patches to achieve high scores. (C) Frozen-backbone ViT displays physical grounding, compressing its global attention exclusively onto the theoretical 41.7 Hz 1P fundamental frequency band.

Figure 6. Multi-stage aerodynamic evolution of VRS captured by the ViT. From top to bottom: Early Entry Phase ($V_z = 7.08$ m/s), Fully Developed Phase ($V_z = 9.12$ m/s), and Extreme Buffet Phase ($V_z = 12.42$ m/s). The model’s attention adaptively shifts from emerging 1P vibrations (25.7% AFR), to hyper-focused 1P resonance (44.6% AFR), and finally to low-frequency buffeting (14.8% AFR), perfectly mirroring classical fluid dynamics.

Table 1. Sensor Parameters of the UAV Flight Control System.

Sensor Type	Freq. (Hz)	Main Performance
Accelerometer	1200	Range: 200 G
		Zero-bias Stability: 0.8 mg
		Bandwidth: 200 Hz
Gyroscope	1200	Range: 2000° s⁻¹
		Zero-bias Stability: 16° h⁻¹
		Bandwidth: 200 Hz
Barometer	100	Working Pressure: 100–1300 hPa
		Resolution: ±0.002 hPa
		Relative Accuracy: ±0.06 hPa
GPS	10	Horizontal Accuracy: 0.2 m
		Elevation Accuracy: 0.5 m

Table 2. Raw Flight Data Statistics.

Flight No.	Total IMU Samples	VRS Samples	Percentage of VRS
1	164,419	9855	5.99%
2	202,515	10,638	5.25%
3	224,192	11,515	5.14%
4	328,621	25,563	7.78%
5	324,447	17,824	5.49%

Table 3. CWT Time–Frequency Map Data Statistics and Dataset Split.

Flight No.	Total CWT Maps	VRS-Marked Maps	VRS Percentage	Dataset Split
1	2595	167	6.44%	Training
2	3115	169	5.43%	Validation
3	3574	188	5.26%	Training
4	5034	400	7.95%	Training
5	5306	305	5.74%	Independent Test
Total	19,624	1229	6.26%	–
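
To make the imbalance in this split concrete, the sketch below pools the three training flights and derives per-class weights. The counts are copied from Table 3. The weighting formula is the common "balanced" heuristic, n_samples / (n_classes × n_class_samples), which is an assumption here: the paper states that class-weighted optimization was used but this section does not give the formula. The helper names are hypothetical.

```python
# Counts copied from Table 3: flight -> (total CWT maps, VRS-marked maps, split)
TABLE3 = {
    1: (2595, 167, "Training"),
    2: (3115, 169, "Validation"),
    3: (3574, 188, "Training"),
    4: (5034, 400, "Training"),
    5: (5306, 305, "Independent Test"),
}

def split_counts(split):
    """Pool total and VRS map counts over all flights assigned to `split`."""
    total = sum(t for t, _, s in TABLE3.values() if s == split)
    vrs = sum(v for _, v, s in TABLE3.values() if s == split)
    return total, vrs

def inverse_frequency_weights(total, vrs):
    """Assumed 'balanced' heuristic: n_samples / (n_classes * n_class_samples)."""
    nominal = total - vrs
    return {"nominal": total / (2 * nominal), "vrs": total / (2 * vrs)}

train_total, train_vrs = split_counts("Training")
weights = inverse_frequency_weights(train_total, train_vrs)
print(train_total, train_vrs, round(100 * train_vrs / train_total, 2))
print({name: round(w, 3) for name, w in weights.items()})
```

Under this heuristic the minority VRS class receives a weight roughly an order of magnitude larger than the nominal class, which is the effect class weighting is meant to have on a split where VRS maps make up only a few percent of the data.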

Table 4. Performance Comparison of Single-Frame Models on the Unseen Test Set under Class-Weighted and Focal Constraints.

Model Configuration	Architecture Type	Overall Acc	VRS Precision	VRS Recall	VRS F1-Score
MobileNet-V2	Lightweight CNN	92.44%	7.29%	2.30%	3.49%
EfficientNet-B0	Lightweight CNN	94.98%	86.36%	18.69%	30.73%
VGG16	Deep CNN	94.45%	57.24%	27.21%	36.89%
ResNet50	Deep CNN	93.77%	46.53%	30.82%	37.08%
ViT-B/16 (Full Fine-Tuning)	Vision Transformer	95.64%	67.67%	51.48%	58.47%
ViT-B/16 (Frozen Backbone) *	Vision Transformer	93.24%	41.42%	32.46%	36.40%

* Note: All baseline configurations employ a WeightedRandomSampler and class-weighted optimization to counteract the extreme minority-class imbalance. The frozen-backbone ViT variant additionally replaces the standard cross-entropy loss with Focal Loss ($\gamma = 2.0$) to suppress overwhelming gradients from easy nominal flight samples, cleanly bounding the frozen representation space. Due to computational constraints, the baseline $F_1$-scores are reported from a single run on the fixed independent test set; establishing cross-validation variance across multiple seed initializations remains a limitation to be addressed in future work.
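
The down-weighting effect of Focal Loss at $\gamma = 2.0$ can be seen in a small per-sample sketch. It follows the form given by Lin et al. (2017), $FL = -\alpha (1 - p_t)^{\gamma} \log p_t$, where $p_t$ is the predicted probability of the true class. The balancing factor $\alpha$ used by the authors is not stated in this section, so `alpha=1.0` below is an assumption, and the function is an illustration rather than the authors' training code.

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample, given the predicted probability of its true class.

    FL = -alpha * (1 - p_true)**gamma * log(p_true)
    With gamma = 0 and alpha = 1 this reduces to ordinary cross-entropy.
    alpha = 1.0 is an assumed default, not a value taken from the paper.
    """
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)

# Easy (p=0.95), moderate (p=0.60), and hard (p=0.10) samples.
for p in (0.95, 0.60, 0.10):
    ce = focal_loss(p, gamma=0.0)   # plain cross-entropy
    fl = focal_loss(p, gamma=2.0)   # focal loss as configured in the note above
    print(f"p_true={p:.2f}  CE={ce:.4f}  focal={fl:.4f}  ratio={fl / ce:.4f}")
```

The ratio of focal loss to cross-entropy is $(1 - p_t)^{\gamma}$, so confidently classified nominal samples contribute almost nothing while hard samples keep most of their loss.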

Table 5. Ablation study evaluating the effects of layer-specific fine-tuning on classification performance ($F_1$-score) and physical proxy alignment (Mean AFR) across 305 independent test samples.

Ablation Configuration	Trained Encoder Layers	VRS F₁-Score (%)	Mean AFR (%)
Fully Fine-Tuned	Layers 0–11	58.47	0.40 ± 0.72
Bottom-Open Regularization	Layers 0–2	54.25	6.05 ± 2.38
Top-Open Regularization	Layers 9–11	35.91	10.06 ± 6.22
Fully Frozen	None (Head only)	36.40	16.82 ± 7.58
Baseline Random Spatial ROI Area: 13.39%
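
One way to read the Mean AFR column against the random-ROI baseline is sketched below. It assumes that AFR measures the share of total attribution mass falling inside a frequency-band region of interest. That reading is an assumption: the paper's own definition of AFR is not restated in this section, and the function name, toy maps, and band indices are hypothetical.

```python
def attribution_fraction_in_band(heatmap, row_lo, row_hi):
    """Share of total non-negative attribution lying in rows [row_lo, row_hi).

    `heatmap` is a list of rows (frequency bins), each a list of values over time.
    This is an assumed stand-in for the paper's AFR, not its actual definition.
    """
    total = sum(max(v, 0.0) for row in heatmap for v in row)
    if total == 0.0:
        return 0.0
    inside = sum(max(v, 0.0) for row in heatmap[row_lo:row_hi] for v in row)
    return inside / total

# Toy 8 x 4 maps (rows = frequency bins, columns = time steps); band = rows 2..3.
uniform = [[1.0] * 4 for _ in range(8)]
focused = [[0.0] * 4 for _ in range(8)]
focused[2] = [1.0] * 4   # most attribution inside the band
focused[5] = [0.25] * 4  # a little leakage outside it

print(attribution_fraction_in_band(uniform, 2, 4))  # equals the band's area share
print(attribution_fraction_in_band(focused, 2, 4))
```

Under this reading, a spatially uniform attribution map scores exactly the region's area share, which would explain why the table quotes the ROI area as the random baseline: values above it indicate attention concentrated on the band, and values below it indicate attention placed elsewhere.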

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhou, X.; Sun, J.; Zhao, J.; Shuang, F. Explainable AI in Rotorcraft Aerodynamics: Autonomous Discovery and Dynamic Tracking of Vortex Ring State Mechanisms via Vision Transformers. Aerospace 2026, 13, 590. https://doi.org/10.3390/aerospace13070590
