1. Introduction
The vortex ring state (VRS) remains one of the most hazardous aerodynamic phenomena in rotorcraft flight, particularly during vertical or steep descents where the descent rate approaches the rotor-induced velocity [1]. In this critical state, the rotor becomes trapped within its own recirculating downwash, leading to the formation of a large-scale, unsteady toroidal vortex structure around the rotor disk [2,3]. This flow regime is characterized by severe aerodynamic unsteadiness, resulting in a sudden loss of thrust, diminished control authority, and violent aircraft buffeting. For coaxial rotorcraft, the complexity is further intensified by the strong aerodynamic interference between the upper and lower rotors [4,5]. The impingement of the upper rotor’s wake onto the lower disk induces drastic, asymmetric load variations, which not only accelerate the onset of VRS but also excite complex, low-frequency structural vibrations [6].
Despite the severity of VRS, identifying its onset remains a formidable challenge because it requires the aircraft to infer external, large-scale flow-field states using only internal, onboard proprioceptive sensors. Traditional model-based methods, while theoretically sound, often struggle to capture the highly non-linear and separated flow dynamics of VRS in real time due to prohibitive computational costs or oversimplified assumptions [1]. Conversely, signal-processing-based approaches rely heavily on manual feature engineering, where domain experts must identify specific frequency bands or statistical anomalies [7,8]. Such methods often lack the robustness required for diverse flight missions and are prone to false alarms in the presence of non-stationary flight noise.
In recent years, the rapid advancement of deep learning has offered a powerful alternative, with architectures like Convolutional Neural Networks (CNNs) excelling at automatically extracting complex patterns from high-frequency sensor data [9]. However, as AI transitions from experimental research toward deployment in safety-critical aviation systems, a fundamental “trust gap” has emerged. The inherent “black-box” nature of deep learning models poses a significant barrier to airworthiness certification. Regulatory bodies, such as the European Union Aviation Safety Agency (EASA) and the Federal Aviation Administration (FAA), increasingly demand not only high predictive accuracy but also rigorous evidence of “learning the right physics” [10,11]. Without explicit interpretability, it is impossible to verify whether a model’s identification of VRS is based on authentic aerodynamic signatures (such as periodic load pulsations) or merely on spurious correlations within the training dataset [12].
To address these challenges, this study pioneers an “AI for Science” paradigm, explicitly decoupling the engineering objective (robust VRS classification) from the scientific inquiry (autonomous identification of fluid–structure interaction signatures). Rather than proposing a completely novel mathematical neural architecture from scratch, the core innovation of this work lies in the methodological synthesis and architectural refinement for aerospace certification. By transforming multi-sensor vibration streams into time–frequency spectrograms [13] and employing explainable AI (XAI) techniques like Grad-CAM [14], we provide a verifiable, mechanism-guided link between data-driven decisions and established fluid mechanics. The main contributions are summarized as follows:
Mechanism-Guided Verification (MGV) Framework: We pioneer a rigorous auditing framework tailored for rotorcraft telemetry, transforming standard black-box classifiers into interpretable aerodynamic probes using continuous wavelet transforms and gradient-based attribution.
Autonomous Physical Proxy Alignment: We show that the Vision Transformer (ViT) [15] locks its global attention onto the theoretically derived (1P) frequency band. This demonstrates that the model autonomously aligns with established structural-dynamic proxies without any human-injected physical equations in the loss function.
Architectural Regularization via Backbone Freezing: We systematically expose the “shortcut learning” vulnerabilities of fully fine-tuned ViTs. By conducting formal architectural ablation studies, we demonstrate that a frozen-backbone constraint fundamentally forces the network to abandon high-frequency noise and maintain physically grounded feature extraction under imbalanced operational datasets.
2. Related Works
2.1. Vortex Ring State and Proprioceptive Aerodynamic Sensing
The accurate and timely identification of the VRS is foundational to rotorcraft flight safety, particularly for coaxial unmanned aerial vehicles (UAVs) governed by complex aerodynamic interferences [4,5]. Traditional flight diagnostics primarily rely on model-based estimators or classical signal-processing techniques applied to high-frequency structural vibrations [7,8]. Model-based formulations, such as Pitt-Peters or dynamic inflow updates, provide solid mathematical foundations but frequently struggle to replicate the severe, unsteady thrust loss and flow separations inherent to deep VRS regimes in real time [1]. Conversely, non-stationary signal analysis methods, including the Short-Time Fourier Transform (STFT) and the Continuous Wavelet Transform (CWT), have been widely successful in translating raw, multi-axis accelerometer data into interpretable time–frequency representations [13].
With the rapid emergence of deep learning, spatial feature extractors like CNNs have successfully automated aerodynamic state monitoring by treating these 2D spectrograms as visual textures [9]. However, these frameworks operate strictly under a supervised pattern-matching paradigm. While they capture high statistical correlation within specific flight test profiles, they bypass the underlying fluid–structure interaction principles, treating the physical characteristics of the recirculating wake as an unaudited statistical black box [3].
2.2. The Trust Gap and XAI in Airworthiness Certification
As artificial intelligence transitions from experimental flight testing to safety-critical aviation deployment, regulatory frameworks have encountered a profound “trust gap” [12]. Leading regulatory bodies, including the European Union Aviation Safety Agency (EASA) and the Federal Aviation Administration (FAA), have established stringent conceptual guidelines requiring that machine learning components in safety-critical domains exhibit explicit auditability and eliminate “shortcut learning” risks [16]. To bridge this validation barrier, XAI attribution methods have been introduced into aerospace engineering. However, the selection of an appropriate XAI algorithm is heavily contingent upon the physical nature of the input data. While perturbation-based attribution algorithms, such as SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME), provide valuable feature importance rankings in conventional computer vision tasks, they are fundamentally ill-suited for analyzing time–frequency transformations. These perturbation methods inherently disrupt the continuous 2D spatial geometry of CWT spectrograms by randomly masking local superpixels, thereby destroying the temporal continuity of aerodynamic vibration signatures.
In contrast, gradient-based methods, predominantly visual backpropagation probes like Gradient-weighted Class Activation Mapping (Grad-CAM), are deliberately selected for this study [14]. By backpropagating topological activations from the final convolutional layers, Grad-CAM preserves the macroscopic, time-spanning horizontal structures inherent in the time–frequency domain. Early integrations of XAI in aerospace predominantly focused on validating defect localization or sensor anomaly isolation, primarily treating XAI as a qualitative “sanity check” to visualize whether an image classifier points at a generic region of interest. However, the geometric preservation afforded by Grad-CAM is mechanically essential for identifying sustained physical resonance bands (e.g., the 1P fundamental frequency). This capability facilitates an unaddressed paradigm shift: utilizing XAI quantitatively as a “scientific discovery probe” to verify whether an AI model’s internal decision-making boundaries are directly governed by continuous fluid–structure interaction laws and explicit aerodynamic instabilities [6].
2.3. Architectural Paradigms in Scientific ML: CNNs vs. Vision Transformers
The mathematical formulation of the neural architecture itself fundamentally governs how physical dimensions are synthesized from empirical data. CNNs inherit the rigid inductive biases of translation invariance and localized receptive fields [9]. While this localized spatial scanning excels at picking up edge textures and pixel clusters, it imposes severe limitations when analyzing scientific spectrograms where the key physical mechanism manifests as a persistent, global frequency resonance spanning long temporal horizons [15].
The ViT architecture completely discards localized spatial constraints, utilizing Multi-Head Self-Attention (MHSA) to establish global dependencies across non-overlapping image patches from the foundational layer [15]. Recently, ViT architectures, often coupled with time–frequency transformations, have been increasingly adopted for rotating machinery fault identification and structural health monitoring. While these existing publications convincingly demonstrate the superior capacity of ViTs to extract discriminative features from spectrograms compared to conventional CNNs, a critical methodological gap persists.
Existing aerospace and diagnostic research has predominantly treated the ViT as a high-capacity, static black-box classifier. The primary focus of previous achievements remains heavily anchored on optimizing statistical classification metrics (e.g., overall accuracy) on static fault datasets, utilizing XAI merely as qualitative “sanity checks” for localized component faults. These prior works fail to exploit the ViT’s global self-attention mechanism as a dynamic probe for non-stationary aerodynamic evolution. Specifically, prior literature has overlooked the structural harmony between the ViT’s global macro-structural capture and the continuous, time-spanning nature of fundamental physical harmonics. This oversight leaves the architectural potential to track dynamic physical phenomena—such as the frequency migration of the 1P aerodynamic signature during non-stationary VRS encounters—largely unexplored.
This paper directly addresses this critical void by systematically establishing an intrinsic causal chain: connecting the physical periodic nature of VRS vibrations, the regulatory necessity for physical XAI verification, and the ViT’s unique capacity for global macro-structural capture. By uniquely leveraging Grad-CAM to dynamically track the migration of the 1P frequency across varying descent velocities, we elevate XAI from a static visualization utility to a quantitative, fluid-dynamics verification mechanism. Consequently, the current study repositions the ViT not merely as a diagnostic classifier, but as an interpretable, autonomous mechanism discovery vehicle. This synthesis explicitly bridges the gap between deep learning architectures and rotorcraft aerodynamics, serving as the foundational rationale for our proposed Mechanism-Guided Verification (MGV) framework.
3. Materials and Methods
To transition from purely black-box statistical learning to physics-informed aerodynamic evaluation, this study formally introduces the Mechanism-Guided Verification (MGV) framework. Rather than presenting the data processing techniques and neural network architectures as isolated algorithms, we integrate them into a cohesive, three-step operational routing. This overarching framework serves as the structural roadmap for the methodology detailed in the subsequent subsections:
1. Proprioceptive Space Mapping (detailed in Section 3.1 and Section 3.2): The acquisition of raw 1D inertial measurements bounded by high-fidelity CFD labeling, followed by Continuous Wavelet Transformations to preserve non-stationary structural-dynamic continuity in a 2D visual space.
2. Architectural Regularization (detailed in Section 3.3): The deployment of Vision Transformers under strict backbone-freezing constraints to actively suppress high-frequency instrumentation noise and enforce macro-structural visual tracking.
3. Quantitative Proxy Auditing (detailed in Section 3.3): The derivation of the Attention Focus Ratio (AFR) using target-layer Grad-CAM, enabling continuous mathematical auditing against the theoretical physical proxy mask (
) derived directly from mechanical telemetry.
3.1. Dataset and Physics-Grounded Experimental Platform
3.1.1. Experimental UAV Platform and 1P Frequency Derivation
The experimental data were collected using a custom-built coaxial twin-rotor UAV platform [
17]. The aircraft, with a maximum takeoff weight of
, features two counter-rotating rotors driven by high-torque brushless motors. A critical physical anchor for this study is the rotor’s nominal operating speed, maintained at
during the transition to VRS.
In rotorcraft dynamics, the once-per-revolution (1P) frequency is the fundamental harmonic of the periodic aerodynamic loads [
6]. For our platform, the theoretical 1P frequency is derived as:
Under normal flight conditions, structural vibrations localized at
are minimal. However, as the rotor enters the VRS, the blades interact with highly unsteady, recirculating tip vortices [1]. The theoretical and physical basis for linking VRS onset exclusively to the intensification of the 1P structural harmonic lies in the inherent azimuthal asymmetry of this flow field. During vertical descent into VRS, the toroidal vortex ring undergoes severe non-stationary deformations, bursting, and unstable vortex wandering across the rotor disk.
As an individual rotor blade traverses a full
azimuthal cycle, it experiences a highly non-uniform induced velocity field. This subjects the blade airfoil to drastic, periodic pulsations in its local effective angle of attack and aerodynamic lift once per revolution. This asymmetric downwash distribution acts as a powerful, deterministic periodic forcing function synchronized with the rotor’s rotational frequency (
), which heavily excites the fundamental elastic “flap-lag” coupling modes of the blades [1,18]. These fluctuating aerodynamic loads are continuously synthesized and transmitted through the rotor hub directly into the airframe as high-energy, discrete 1P structural vibrations [19].
Consequently, the localized
spectral band transcends generic aerodynamic turbulence; it constitutes the definitive physical signature and unique aerodynamic fingerprint of deep VRS development. This explicit causality provides the foundational “physical truth” that our Mechanism-Guided Verification (MGV) framework expects an intelligent system to autonomously discover and extract [
6].
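The relation behind the 1P derivation is the standard conversion from rotor speed to rotational frequency: one revolution per second corresponds to 1 Hz, so a speed given in revolutions per minute is divided by 60. A minimal sketch follows; the function names and the example rotor speed are hypothetical and are not the platform's values, which are not reproduced here.

```python
def one_per_rev_hz(rotor_rpm: float) -> float:
    """Return the once-per-revolution (1P) frequency in Hz for a rotor speed in RPM."""
    if rotor_rpm <= 0:
        raise ValueError("rotor speed must be positive")
    return rotor_rpm / 60.0


def harmonic_hz(rotor_rpm: float, n: int) -> float:
    """Return the n-per-revolution (nP) harmonic frequency in Hz."""
    return n * one_per_rev_hz(rotor_rpm)


# Hypothetical rotor speed, used only to exercise the functions.
example_rpm = 1800.0
example_1p_hz = one_per_rev_hz(example_rpm)    # 1800 / 60 = 30.0 Hz
example_2p_hz = harmonic_hz(example_rpm, 2)    # 2 * 30 = 60.0 Hz
```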
To capture high-frequency dynamic responses, the UAV is equipped with a 3-axis accelerometer and gyroscope (1200 Hz), a barometer (100 Hz), and a GPS module (10 Hz). Detailed hardware specifications are summarized in
Table 1.
The statistical analysis of the processed raw flight dataset is shown in
Table 2. It is notable that VRS samples constitute a very small portion of the entire dataset, not exceeding
in any mission. This indicates the challenge of extreme class imbalance in the data distribution. To effectively counteract this inherent flaw during training, the proposed deep learning solution employs strategies such as non-uniform weighting to ensure the model learns sufficiently from minority critical samples, thereby enhancing identification robustness and accuracy.
3.1.2. High-Fidelity Data Labeling via CFD and Wind Tunnel Cross-Validation
To ensure that the “black-box” model is trained on authentic physics rather than spurious noise, a rigorous multi-source labeling protocol was implemented [3].
1. Initial GPS Screening: Flight segments with high descent rates () and low forward speeds were isolated.
2. CFD Verification: The boundary conditions of these segments were replicated in high-fidelity, unsteady Computational Fluid Dynamics (CFD) simulations using ANSYS R2020 Fluent [5]. Only segments where the flow field clearly exhibited the toroidal recirculation characteristic of VRS (as shown in Figure 1) were considered for the VRS label [3].
3. Wind Tunnel Calibration: Aerodynamic load pulsations measured in a controlled wind tunnel environment were used to calibrate the CFD model, ensuring the simulation accurately reproduced the 1P load variations encountered in real flight [20].
To resolve the macroscopic development of the recirculating toroidal flow field, three-dimensional Unsteady Reynolds-Averaged Navier–Stokes (URANS) equations were solved utilizing ANSYS Fluent [5]. The closure of the governing equations was achieved using the shear stress transport (SST) k–ω turbulence model, with the near-wall resolution rigorously maintained at
to adequately capture flow separation characteristics. Pressure–velocity coupling used the SIMPLE scheme, convective terms were discretized with Second Order Upwind formulations, and temporal advancement employed a Second Order Implicit scheme.
Crucially, to ensure maximum fidelity to the actual flight envelope, the inlet boundary conditions were dynamically driven by User-Defined Functions (UDFs) mapping the precise descent velocity time-series extracted directly from the UAV flight logs. Given that the primary engineering objective of the CFD phase was to establish macroscopic binary labels (VRS vs. Nominal) based on low-frequency flow topology evolution rather than extracting high-frequency fluid–structure interaction loads, a Moving Reference Frame (MRF) approach was adopted for rotor kinematics with a temporal step size of (equivalent to approximately a half rotor revolution). The total grid cell count was finalized at approximately million. Each simulation required around 60 h of wall-clock time utilizing 128 CPU cores on a high-performance computing cluster. A mesh independence validation confirmed that variations in grid density () induced negligible topological shifts in the primary toroidal vortex structures, thereby guaranteeing the robustness of the derived ground-truth labels.
Figure 1 displays preliminary screened high descent rate intervals and their corresponding CFD flow field configurations. The curve on the left shows the vertical descent velocity time series recorded by the UAV’s high-precision GPS. Shaded regions indicate the high descent rate intervals preliminarily screened based on a vertical velocity threshold, which may contain VRS events. The subplots on the right show the CFD simulation flow field configurations corresponding to several typical moments within these intervals. Clear observation of the rotor-induced recirculation flow via the streamlines confirms high consistency with the typical VRS flow structure characteristics, providing physical validation for subsequent data labeling.
Figure 1.
CFD simulation framework and data labeling process. (a) The data segment for labeling with CFD simulation (The purple line segments represent the parts labeled as VRS.), calculated by the authors based on the tested coaxial UAV platform. (b) The hybrid mesh topology around the coaxial rotors and streamline configurations illustrating toroidal recirculation characteristic of the VRS.
This cross-validation ensures that the ground truth labels represent the absolute physical state of the flow field [
3], providing a solid foundation for evaluating the AI’s “scientific discovery” capability.
To rigorously define the epistemological boundary of “no human physical priors” highlighted in this study, it must be emphasized that the high-fidelity physics derived from the CFD simulations and wind tunnel tests are strictly utilized externally to assign discrete, binary ground-truth labels (). The neural network architecture, including its objective loss function, remains entirely agnostic to fluid-dynamic equations, kinematic constraints, or pre-defined structural frequency boundaries. Consequently, the Vision Transformer’s eventual alignment with the 1P frequency band is an autonomous feature-mining outcome, entirely free from manual physical prior injection during the machine learning optimization process.
3.2. Time-Frequency Representation via CWT
To bridge the gap between 1D temporal sensor signals and 2D visual patterns, we employ the CWT method. Unlike the Short-Time Fourier Transform (STFT), CWT provides a multi-resolution analysis that is particularly suited for non-stationary aircraft vibrations [21,22]. Using the Complex Morlet wavelet as the mother function [23], we compute the Euclidean norm (magnitude) of the 3-axis accelerometer data and transform this into high-resolution spectrograms. This transformation preserves both the transient energy bursts of VRS and the sustained structural resonance at the 1P frequency, converting a “hidden” aerodynamic mechanism into an “explicit” visual texture [13]. The sample figures of CWT in different stages are illustrated in
Figure 2. A representative segment of the raw multi-sensor telemetry data and the corresponding descent trajectory during the flight test are illustrated in
Figure 3.
To explicitly justify the selection of CWT over traditional time–frequency methods (such as the STFT), the non-stationary physical manifestations of VRS must be considered. STFT uses a single fixed window size, so the trade-off between temporal and spectral resolution imposed by the Gabor limit is frozen at one setting for all frequencies. This limitation renders it fundamentally inadequate for VRS aerodynamic sensing, which features both abrupt, transient high-frequency energy bursts (e.g., localized vortex tearing) and sustained, low-frequency structural resonances (e.g., the 1P flap-lag harmonic). The CWT overcomes this stricture by employing a dynamically scaling wavelet function, providing high temporal resolution for sudden transients and high spectral resolution for sustained low frequencies. This continuous multi-resolution capability is mechanically indispensable for generating the unbroken horizontal topological structures required for the Vision Transformer’s global attention to successfully lock onto the 1P physical proxy.
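To make the transformation concrete, the following is a minimal NumPy-only sketch of a complex Morlet scalogram computed from the accelerometer magnitude. It is not the authors' implementation: the function names, the `n_cycles` parameter, the mean-removal step, and the normalization are assumptions of this sketch, and the analysis frequencies are supplied by the caller.

```python
import numpy as np


def acceleration_magnitude(acc_xyz):
    """Euclidean norm of an (n_samples, 3) accelerometer array, mean-removed."""
    magnitude = np.linalg.norm(np.asarray(acc_xyz, dtype=float), axis=1)
    # Mean removal is an assumption of this sketch: it suppresses the gravity
    # offset so that it does not leak into the low-frequency rows.
    return magnitude - magnitude.mean()


def morlet_scalogram(signal, fs, freqs_hz, n_cycles=6.0):
    """Return |W(f, t)| with shape (len(freqs_hz), len(signal)).

    Each row is the signal convolved with a complex Morlet wavelet centred
    on one analysis frequency. The Gaussian envelope spans roughly
    `n_cycles` oscillations, so its duration shrinks as frequency grows.
    """
    signal = np.asarray(signal, dtype=float)
    rows = []
    for f in freqs_hz:
        sigma_t = n_cycles / (2.0 * np.pi * f)        # envelope std in seconds
        half = int(np.ceil(4.0 * sigma_t * fs))       # support of +/- 4 sigma
        if 2 * half + 1 > signal.size:
            raise ValueError("signal too short for the lowest analysis frequency")
        t = np.arange(-half, half + 1) / fs
        envelope = np.exp(-t**2 / (2.0 * sigma_t**2))
        wavelet = envelope * np.exp(2j * np.pi * f * t)
        wavelet /= envelope.sum()                     # unit-sum envelope
        rows.append(np.abs(np.convolve(signal, wavelet, mode="same")))
    return np.vstack(rows)
```

With this normalization a steady sinusoid of amplitude A produces a magnitude of about A/2 in the row matching its frequency, so a sustained harmonic appears as a horizontal band. A logarithmically spaced frequency axis can be built with `np.geomspace`.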
3.2.1. Engineering Causality Constraint
To ensure the algorithm can perform zero-lookahead, unidirectional real-time inference within a flight control system, we enforce a strict engineering causality constraint. A sliding window of 5 seconds is employed, where an asymmetric Hanning boundary weighting is applied to unequivocally sever the reliance on any forward-looking samples.
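One way to read this constraint is sketched below: each analysis window ends at the current sample, and only the oldest part of the window is tapered, so no future sample is ever touched. The one-sided taper shape, the `taper_fraction` parameter, and the function name are assumptions of this sketch; the exact boundary weighting used in the study is not specified here.

```python
import numpy as np


def causal_window(signal, end_index, fs, window_s=5.0, taper_fraction=0.25):
    """Return the window of length `window_s` that ends at `end_index` (inclusive).

    Only samples at or before `end_index` are read. The oldest
    `taper_fraction` of the window is weighted by the rising half of a Hann
    window; the newest samples keep unit weight.
    """
    n = int(round(window_s * fs))
    start = end_index - n + 1
    if start < 0:
        raise ValueError("not enough history for a full window")
    segment = np.asarray(signal[start:end_index + 1], dtype=float)
    weights = np.ones(n)
    n_taper = int(taper_fraction * n)
    if n_taper > 0:
        # First half of a symmetric Hann window: starts at 0 and rises.
        weights[:n_taper] = np.hanning(2 * n_taper)[:n_taper]
    return segment * weights
```

Because the slice stops at `end_index`, changing any later sample leaves the output unchanged, which is the property a zero-lookahead pipeline needs.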
3.2.2. Small-Sample Modeling and Class Imbalance Mitigation
Given the extreme class imbalance inherent to the operational envelope (where VRS positive samples constitute a mere
of the total dataset), conventional optimization strategies rapidly degenerate into majority-class predictors. To provide a mathematically robust solution for small-sample modeling, aggressive class-weighted sampling was implemented during the mini-batch data loading phase. Furthermore, the optimization landscape was regularized using a Focal Loss penalty mechanism [
24]. By configuring the focusing parameter to
, the loss function dynamically scales the gradients based on prediction confidence. This mechanism heavily penalizes misclassifications of the hard-to-predict minority VRS samples while actively down-weighting the easily classified nominal flight data, effectively preventing representation collapse.
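The down-weighting behaviour described above can be illustrated with a minimal NumPy sketch of the binary focal loss. The function name, the optional `alpha` balancing term, and the numerical clipping constant are assumptions of this sketch; the focusing parameter is passed in explicitly rather than fixed.

```python
import numpy as np


def binary_focal_loss(probs, targets, gamma, alpha=None, eps=1e-7):
    """Mean binary focal loss.

    probs   : predicted probability of the positive (VRS) class
    targets : 0/1 ground-truth labels
    gamma   : focusing parameter; gamma = 0 recovers plain cross-entropy
    alpha   : optional weight for the positive class (1 - alpha for negatives)
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets)
    # p_t is the probability the model assigned to the true class.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    # (1 - p_t)^gamma shrinks the loss of confidently correct samples.
    loss = -((1.0 - p_t) ** gamma) * np.log(p_t)
    if alpha is not None:
        loss = np.where(targets == 1, alpha, 1.0 - alpha) * loss
    return float(loss.mean())
```

For example, with `gamma=2.0` (a hypothetical setting used only for illustration) a positive sample predicted at 0.95 contributes several orders of magnitude less loss than one predicted at 0.05.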
To ensure strict reproducibility and address the challenges of small-sample modeling, the complete database statistics and training configurations are formalized in
Table 3. The experimental campaign comprised 5 independent flight sorties, generating a total of
discrete CWT spectrogram frames (normalized to
). Bounded by the operational envelope, true VRS instances constitute 1229 frames (yielding an extreme minority class ratio of
), while nominal flight conditions comprise
frames.
To strictly prevent data leakage, the dataset was partitioned longitudinally by individual flight sorties rather than by randomized frame shuffling. As detailed in
Table 3, Flights 1, 3 and 4 constructed the training set, Flight 2 served as the validation set, and Flight 5 was strictly reserved as the unseen independent test set. During the training phase, both the ResNet50 and ViT baseline models were optimized using the AdamW optimizer with an initial learning rate of
, a cosine learning rate decay scheduler, and a batch size of 32. The models were trained for 50 epochs on an NVIDIA RTX 4090 GPU.
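The sortie-level partition and the class-weighted sampling can be sketched as follows. The flight assignments mirror the split stated above; the function names and the inverse-frequency weighting rule are assumptions of this sketch, since the exact weighting scheme is not specified.

```python
import numpy as np

# Flight-to-split assignment as stated in the text.
TRAIN_FLIGHTS = (1, 3, 4)
VAL_FLIGHTS = (2,)
TEST_FLIGHTS = (5,)


def split_by_flight(flight_ids):
    """Return frame indices per split, keeping every sortie in one split only."""
    flight_ids = np.asarray(flight_ids)
    return {
        "train": np.flatnonzero(np.isin(flight_ids, TRAIN_FLIGHTS)),
        "val": np.flatnonzero(np.isin(flight_ids, VAL_FLIGHTS)),
        "test": np.flatnonzero(np.isin(flight_ids, TEST_FLIGHTS)),
    }


def inverse_frequency_weights(labels):
    """Per-sample sampling weights so each class has equal total weight.

    Inverse class frequency is an assumed weighting rule for this sketch.
    """
    labels = np.asarray(labels, dtype=int)
    counts = np.bincount(labels, minlength=2)
    if np.any(counts == 0):
        raise ValueError("every class needs at least one sample")
    return (1.0 / counts)[labels]
```

The returned weights can be passed to any weighted sampler; because frames from one sortie never appear in two splits, temporally adjacent (and therefore highly correlated) frames cannot leak from training into evaluation.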
3.3. Explainable AI Framework: Architectures for Mechanism Discovery
In this study, we deliberately compare two distinct AI philosophies to determine which is better suited for “scientific discovery” in aerodynamics.
3.3.1. Convolutional Neural Networks (ResNet50)
As a baseline, we employ ResNet50, which utilizes residual connections to learn deep hierarchical features [9,25]. However, CNNs are inherently biased toward local receptive fields. In the context of a CWT spectrogram, a CNN focuses on local “pixel clusters”, which may lead it to overfit on high-frequency noise or localized artifacts rather than capturing the persistent, global frequency bands that define the 1P mechanism [15].
3.3.2. ViT and Global Self-Attention
The ViT represents a shift toward global context modeling [26]. By dividing the spectrogram into non-overlapping patches and applying Multi-Head Self-Attention, ViT can establish dependencies between distant time–frequency regions from the very first layer [15]. For aerodynamic sensing, this “global-first” approach is hypothesized to be superior for locking onto the 1P frequency band, which manifests as a persistent horizontal structure across the entire temporal axis of the spectrogram [15].
3.3.3. Explainability Probe: Target-Layer Grad-CAM and Spatial Reconstruction
To extract physically auditable evidence, we deploy Gradient-weighted Class Activation Mapping (Grad-CAM) [14]. To apply this gradient-based attribution to the non-convolutional Vision Transformer, we strictly implement a 2D spatial reconstruction transform (
) at the final layer block before the classification head. Let
be the output sequence of the final Transformer encoder layer, where
represents the number of flattened patch tokens and
D is the embedding dimension. The prepended class token (
) is detached. The remaining 1D sequence of patch tokens (
) is then geometrically reshaped back into a 2D spatial grid pseudo-feature map
:
Gradients of the target VRS class score are backpropagated directly to this reconstructed pseudo-feature map F, performing weighted sum accumulation across channels to yield the final heatmap. This mathematical transformation ensures full reproducibility and allows standard visual backpropagation to project global attention weights back onto the physical frequency and temporal axes.
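The reconstruction and heatmap steps can be sketched with plain arrays. This is an illustration under stated assumptions, not the authors' code: the class token is assumed to sit at index 0, patch tokens are assumed to be in row-major order, the grid size is supplied by the caller, and the gradients with respect to the reshaped map are assumed to have been obtained from an automatic-differentiation framework.

```python
import numpy as np


def tokens_to_grid(tokens, grid_h, grid_w):
    """Drop the class token and reshape patch tokens to (grid_h, grid_w, D).

    tokens : array of shape (1 + grid_h * grid_w, D), class token first
             (assumed ordering for this sketch).
    """
    tokens = np.asarray(tokens)
    patch_tokens = tokens[1:]
    if patch_tokens.shape[0] != grid_h * grid_w:
        raise ValueError("token count does not match the requested grid")
    return patch_tokens.reshape(grid_h, grid_w, -1)


def grad_cam(feature_map, gradients):
    """Grad-CAM heatmap from a (H, W, D) feature map and matching gradients.

    Channel weights are the spatial mean of the gradients; the heatmap is
    the ReLU of the channel-weighted sum, scaled to a maximum of 1.
    """
    feature_map = np.asarray(feature_map, dtype=float)
    gradients = np.asarray(gradients, dtype=float)
    weights = gradients.mean(axis=(0, 1))                      # shape (D,)
    cam = np.maximum((feature_map * weights).sum(axis=-1), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

The resulting low-resolution grid is normally upsampled to the spectrogram size before it is compared against frequency and time axes.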
To mathematically formalize the focus degree of the explainable probe within the physical true frequency band, we define the Attention Focus Ratio (AFR). Let
be the generated Grad-CAM heatmap normalized to
. To eliminate background gradient dilution, a threshold filtration is applied such that
if
, and
otherwise. Let
be the binary region-of-interest mask corresponding to the
aerodynamic frequency strip derived from the logspace mapping. The selection of the
band is rigorously grounded in physical telemetry, determined entirely
a priori and independently of the ViT’s observed attention maps. Based on the fixed operational rotor speed of
, the theoretical fundamental flap-lag (1P) frequency is established at approximately
. The defined region-of-interest boundary (
band) is mathematically necessary to account for mechanical governor lag, structural aerodynamic damping, and the inherent frequency bin smearing caused by the wavelet transform’s finite time–frequency resolution. The AFR is mathematically formulated as:
where
is an infinitesimal regularizer introduced to prevent division-by-zero anomalies under completely diffuse gradient fields.
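A minimal sketch of the AFR computation is given below, assuming the ratio takes the form of thresholded heatmap energy inside the 1P band divided by total thresholded energy plus the regularizer. The function names, the direction of the threshold comparison, and the relative half-width parameterization of the band are assumptions of this sketch; the threshold and band limits are passed in rather than fixed.

```python
import numpy as np


def band_mask(freqs_hz, n_time, centre_hz, rel_halfwidth):
    """Binary mask selecting rows within centre_hz * (1 +/- rel_halfwidth)."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    in_band = np.abs(freqs_hz - centre_hz) <= rel_halfwidth * centre_hz
    # Repeat the row selection across the whole time axis.
    return np.repeat(in_band[:, None], n_time, axis=1)


def attention_focus_ratio(heatmap, roi_mask, threshold, eps=1e-8):
    """Share of thresholded heatmap energy that falls inside the ROI mask.

    heatmap   : Grad-CAM map normalized to [0, 1], shape (n_freq, n_time)
    roi_mask  : boolean mask of the same shape
    threshold : values below it are zeroed to limit background dilution
    eps       : small constant preventing division by zero
    """
    heatmap = np.asarray(heatmap, dtype=float)
    roi_mask = np.asarray(roi_mask, dtype=bool)
    if heatmap.shape != roi_mask.shape:
        raise ValueError("heatmap and mask must have the same shape")
    filtered = np.where(heatmap >= threshold, heatmap, 0.0)
    return float(filtered[roi_mask].sum() / (filtered.sum() + eps))
```

Under this form the ratio lies between 0 and 1: a value near 1 means almost all retained attention sits in the predefined band, while a fully suppressed heatmap returns 0 instead of raising a division error.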
To quantitatively evaluate boundary sensitivity, expanding the theoretical AFR integration window from to a shifted band yields an expected edge-shift anomaly of less than . This minimal variance preserves the strict statistical order-of-magnitude divergence between the physically grounded frozen backbone () and the collapsed fine-tuned model (), confirming that the evaluation framework is highly robust to predefined masking thresholds.
4. Results and Explainability Analysis
4.1. Quantitative Performance: The Architecture Bottleneck
To answer the fundamental question of whether AI can autonomously capture aerodynamic truths, we evaluated multiple model architectures on an independent test dataset (split by flight sortie to ensure rigorous generalization). The input consisted solely of single-channel CWT spectrograms derived from the magnitude of the 3-axis accelerometer data, ensuring identical feature exposure across all comparative architectures.
A critical challenge in aviation fault detection is the extreme class imbalance; in our test sortie, VRS events constituted less than
of the total flight time. To prevent the models from degenerating into majority-class predictors, class-weighted sampling was strictly enforced during training.
Table 4 presents the comprehensive performance comparison across various mainstream architectures.
The empirical results reveal a profound architectural bottleneck inherent to CNNs. While all models achieve deceptively high overall accuracies (exceeding ) due to the dominance of nominal flight data, their capability to reliably detect VRS is highly differentiated. The CNN architectures, ranging from lightweight (MobileNet-V2, EfficientNet-B0) to deep models (VGG16, ResNet50), systematically fail to maintain a robust balance between Precision and Recall. For instance, while EfficientNet-B0 achieves high precision, its unacceptable recall () indicates that it misses over of actual hazardous events, rendering it unviable for safety-critical applications.
Conversely, the ViT-B/16 demonstrates a systemic superiority, achieving a VRS F1-score of , surpassing the best-performing CNN by over 21 percentage points. We hypothesize that this performance gap is not merely a statistical artifact, but a direct consequence of the architectural philosophy. CNNs, bounded by localized receptive fields, struggle to perceive structural, wide-spanning frequency features within the CWT spectrogram. The ViT, leveraging its global self-attention mechanism, inherently excels at capturing persistent, non-local frequency bands. To rigorously validate this physical hypothesis, we must transition from statistical metrics to XAI probes.
4.2. Robustness Against Synthetic Noise: A Test of Physical Grounding
While the baseline performance on clean data (
Section 4.1) establishes ViT’s superiority, a more rigorous test of whether a model has learned fundamental physics—rather than merely memorizing dataset-specific background noise—is its robustness against signal corruption. In real-world aviation environments, sensor data is frequently contaminated by electromagnetic interference, mechanical resonance, and sensor aging.
To simulate these harsh operational conditions and probe the models’ internal reliance on physical features, we conducted a synthetic noise robustness test. We injected zero-mean Gaussian white noise with increasing standard deviations () directly into the normalized CWT spectrograms of the unseen test set, evaluating the degradation of the VRS F1-score for both ViT-B/16 (Frozen Backbone) and the CNN baseline (ResNet50).
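The corruption step can be sketched as follows. This is a minimal illustration, assuming the spectrogram is stored as a list of rows of normalized values; the function name, the fixed seed, and the absence of any clipping after noise injection are assumptions, since those details are not specified here.

```python
import random


def add_gaussian_noise(spectrogram, sigma, seed=0):
    """Add zero-mean Gaussian white noise with standard deviation sigma to every pixel."""
    rng = random.Random(seed)
    return [[value + rng.gauss(0.0, sigma) for value in row] for row in spectrogram]


# Hypothetical 64 x 64 normalized spectrogram filled with a constant level.
clean = [[0.5] * 64 for _ in range(64)]
noisy = add_gaussian_noise(clean, sigma=0.1)
residual = [n - c for n_row, c_row in zip(noisy, clean) for n, c in zip(n_row, c_row)]
mean = sum(residual) / len(residual)
std = (sum((r - mean) ** 2 for r in residual) / len(residual)) ** 0.5
print(f"residual mean={mean:.4f}, std={std:.4f}")
```

Sweeping `sigma` over the tested noise levels and re-evaluating the VRS F1-score at each level would produce a degradation curve of the kind the robustness test compares.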
As illustrated in Figure 4, the performance trajectories of the two architectures diverge dramatically under noisy conditions. The CNN architecture exhibits a catastrophic degradation; its VRS F1-score plummets sharply even at minimal noise levels (). This vulnerability suggests that the CNN’s decision boundary relies heavily on fragile, localized high-frequency textures or spurious background correlations, which are readily masked by Gaussian noise.
In stark contrast, the ViT demonstrates exceptional resilience. Its performance curve degrades smoothly and maintains a significant margin of superiority across all noise intensity levels. This robust behavior strongly supports our hypothesis: because the ViT utilizes global self-attention, it bases its decisions on the macro-structural features of the spectrogram—specifically, the persistent horizontal energy band of the 1P frequency. Because this low-frequency aerodynamic resonance is structurally coherent across the time axis, it remains perceptible to the ViT’s global attention mechanism even when heavily overlaid with random synthetic noise.
4.3. Visual Evidentiary Comparison: Local Convolution vs. Global Attention
To uncover the internal geometric constraints that led to the statistical performance gaps observed in Section 4.1 and Section 4.2, we deployed Target-Layer Grad-CAM to expose the decision-making foundations of each architecture.
Figure 5 presents a strict side-by-side architectural visualization using an identical, fully developed VRS sample () evaluated across three distinct model configurations.
To prevent color-blending artifacts and preserve the physical textures of the underlying flow field, a luminance-based overlay strategy was adopted for visualization. Instead of applying a conventional pseudo-color colormap which could obscure structural details, the normalized grayscale Grad-CAM attributions are applied strictly as a luminance (brightness) mask over the original CWT spectrogram. Consequently, regions with high attention gradients appear structurally illuminated, while non-contributing background noise is naturally darkened, ensuring unambiguous visual identification of the target frequencies.
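A luminance-style overlay of this kind can be sketched as a pixel-wise brightness scaling. The code below is an illustration under stated assumptions: both arrays share the same grid, the Grad-CAM map is min-max normalized, and a hypothetical `floor` parameter keeps low-attribution regions dimly visible rather than fully black.

```python
def min_max_normalize(grid):
    """Rescale a 2D grid to [0, 1]; a constant grid maps to all zeros."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / (hi - lo) for v in row] for row in grid]


def luminance_overlay(spectrogram, cam, floor=0.2):
    """Scale each spectrogram pixel by its normalized attribution (brightness mask)."""
    mask = min_max_normalize(cam)
    return [
        [s * (floor + (1.0 - floor) * m) for s, m in zip(s_row, m_row)]
        for s_row, m_row in zip(spectrogram, mask)
    ]


# Hypothetical 2 x 2 example: uniform spectrogram, varying attribution.
spec = [[1.0, 1.0], [1.0, 1.0]]
cam = [[0.0, 1.0], [0.5, 0.25]]
print(luminance_overlay(spec, cam))
```

Because the mask only scales brightness, the spectrogram's own texture is preserved in the highlighted regions, which is the stated motivation for avoiding a pseudo-color colormap.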
As visualized in Figure 5A, the traditional CNN architecture (ResNet50) exhibits a severely diffuse attention pattern. Bounded by the inductive bias of localized receptive fields, the convolutional kernels operate as local pixel-cluster aggregators. Consequently, the model’s high-activation gradients are erratically scattered across the upper and lower boundaries of the CWT spectrogram, chasing high-frequency transient maneuvers and non-physical background instrumentation noise. While this localized feature matching manages to output a correct classification token, the underlying reasoning is aerodynamically vacuous.
Most illuminatingly, Figure 5B captures the visual failure mode of the fully fine-tuned Vision Transformer. Although this variant achieved the highest historical validation scores (), its unconstrained end-to-end optimization triggered severe “shortcut learning”. Without architectural boundaries, the multi-head self-attention layers overfitted to specific localized high-energy noise fragments, totally dissolving the pre-trained macro-structural context. The heatmap manifests as an unaligned, chaotic hotspot patch, visually confirming that high statistical accuracy can be achieved for entirely uncertifiable reasons.
In sharp contrast, the frozen-backbone ViT (Figure 5C) converges toward the physically expected signature. Without any human domain guidance or physical priors injected into its loss function, the model’s global attention mechanism ignores the surrounding flight noise and concentrates its gradient weight into a single, tightly focused horizontal bright band. Critically, this continuous energy tract spans the temporal axis and is aligned with the theoretically derived (1P) fundamental flap-lag coupling harmonic. This contrast provides visual evidence that freezing the backbone forces the transformer to act as a structural texture extractor, bridging raw proprioceptive data with classical fluid mechanics.
4.4. Dynamic Aerodynamic Tracking: Attention Evolution Across Descent Velocities
To rule out the hypothesis of “shortcut learning”, in which a model might merely memorize coincidental background noise, we conducted a multi-stage aerodynamic evolution analysis using our frozen-backbone ViT (the physically constrained optimization model). The VRS is not a static event; it is a transient, highly non-linear aerodynamic boundary that intensifies and evolves dynamically with the descent velocity ().
Figure 6 displays the sequential transformation of the CWT spectrogram alongside its corresponding quantified Grad-CAM energy distribution across three critical operational phases: the Early Entry Phase (), the Fully Developed Phase (), and the Extreme Buffet Phase ().
At the inception of VRS, or the Early Entry Phase (), the recirculating toroidal wake is newly forming, inducing localized tip vortex impingements. As shown in the top row of Figure 6, the model confidently identifies the state (), and the quantitative attention focus ratio (AFR) within the fundamental band registers at . The structural vibrations are beginning to gather around the 1P frequency, but the flow field remains partially disorganized.
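For readers who want to reproduce the band-wise quantity, the AFR can be read as the share of total attribution that falls inside a frequency band. The sketch below assumes that reading (the formal definition belongs to the methodology section), a non-negative Grad-CAM map whose rows correspond to frequency bins, and hypothetical band edges and frequencies.

```python
def attention_focus_ratio(cam, freqs_hz, band_hz):
    """Share of total attribution lying in rows whose frequency is inside band_hz.

    cam      : 2D list, rows = frequency bins, columns = time steps (non-negative)
    freqs_hz : centre frequency of each row
    band_hz  : (low, high) band edges, inclusive
    """
    low, high = band_hz
    total = sum(sum(row) for row in cam)
    if total <= 0:
        return 0.0
    in_band = sum(sum(row) for row, f in zip(cam, freqs_hz) if low <= f <= high)
    return in_band / total


# Hypothetical 4-bin map with most attribution on the 5 Hz row.
cam = [[0.10, 0.10], [0.80, 0.80], [0.05, 0.05], [0.05, 0.05]]
freqs_hz = [2.0, 5.0, 10.0, 20.0]
print(attention_focus_ratio(cam, freqs_hz, band_hz=(4.0, 6.0)))
```

Evaluating the same function with a second, lower band gives the complementary share used to describe the attention shift toward buffeting frequencies.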
As the aircraft transitions into the Fully Developed Phase (), strong flap-lag aerodynamic coupling occurs due to the highly asymmetric load variations across the rotor blade cycle. Crucially, the ViT automatically captures this physical manifestation. Its global attention collapses into a perfectly horizontal, hyper-focused energy strip tightly bound to the theoretical 1P line. The quantified frequency spectrum reveals that the band now captures of the absolute decision-making gradients. This demonstrates that the frozen backbone preserves an immaculate physical feature extractor, prioritizing the theoretical aerodynamic truth over transient noise.
Most compellingly, as the flight deepens into the Extreme Buffet Phase (), the aerodynamic regime approaches the windmill brake state boundary, where large-scale, low-frequency chaotic flow separations dominate the airframe. The quantitative XAI metrics capture a profound physical synchronicity: the model’s fundamental 1P AFR drops sharply to , while its global attention adaptively shifts downward, allocating a major energy proportion into the low-frequency () buffeting band.
This synchronized migration between data-driven attention focus and classical fluid–structure transformation provides visual and statistical evidence that the Vision Transformer, when constrained by a frozen pre-trained backbone, does not merely memorize a static pattern but tracks genuine aerodynamic behavior. This dynamic tracking capability supports our proposed Mechanism-Guided Verification (MGV) framework and indicates that AI can serve as a trustworthy, transparent probe for complex aerodynamic states.
4.5. Ablation Studies: Effects of Architectural Regularization and Layer-Specific Fine-Tuning
To comprehensively address the sensitivity of the model’s physical interpretability to its architectural constraints, a targeted ablation study was conducted. Unlike natural images where shallow layers extract generic visual edges and deep layers construct semantic logic, CWT spectrograms encode critical physical frequencies directly within their low-level textural topology. Therefore, we evaluated four distinct transfer-learning configurations to map the accuracy-interpretability trade-off: an unconstrained Fully Fine-Tuned ViT (tuning all layers
), a Bottom-Open variant (tuning low-level feature extractors, layers
), a Top-Open variant (tuning high-level semantic layers
), and the strictly Regularized Fully Frozen ViT.
Table 5 details the joint variation of the classification performance (F1-score) and the physical grounding metric (Mean AFR) across all 305 true VRS positive frames.
As shown in Table 5, a profound, domain-specific physical dichotomy emerges. The unconstrained model achieves the highest statistical F1-score () but experiences catastrophic representation collapse (Mean AFR , far below the spatial random baseline). Notably, the Top-Open configuration, which follows standard transfer-learning protocols by tuning late-stage semantic layers, fails to improve classification () because the frozen shallow layers cannot adapt to CWT-specific textures; however, it retains a high physical alignment (AFR ).
In stark contrast, the Bottom-Open configuration significantly boosts the F1-score to by allowing early block adaptation to aeroelastic spectrogram textures, while still rescuing a meaningful portion of physical interpretability (AFR ). This supports the hypothesis that frequency evidence in CWT is primarily anchored in low-level feature representations. Nevertheless, only the Fully Frozen configuration secures maximal physical alignment (). The ablation indicates that when strict adherence to a physical proxy (1P frequency) is the primary certification objective for aviation AI, deep structural regularization is needed to prevent high-frequency shortcut learning.
5. Discussion
5.1. The Accuracy Trap and the Necessity of Mechanism-Guided Verification
An apparent contradiction emerged during our systematic exploration. When we allowed the Vision Transformer to undergo full end-to-end fine-tuning without backbone constraints, the model achieved the highest VRS F1-score () on the highly imbalanced test set. However, XAI probing revealed that this fully fine-tuned model suffered from a severe representation collapse, manifesting as a total loss of its global structural attention. Its attention map devolved into a diffuse pattern, heavily reliant on spurious local correlations and high-frequency background noise rather than the physical resonance. In other words, it achieved high accuracy through “shortcut learning”.
To rule out the possibility that this was an anomaly of a single sample, a comprehensive statistical validation was conducted by computing the mean AFR across all 305 true VRS positive frames in the independent test set (Flight 5). The fully fine-tuned ViT exhibited a catastrophic mean AFR of only , which falls drastically below the randomized spatial area baseline (), indicating that its unconstrained multi-head self-attention layers systematically overfitted to fragile high-frequency instrumentation noise fragments. In contrast, the frozen-backbone regularization configuration yielded a robust, physically grounded mean AFR of across the entire test population.
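The comparison against a spatial baseline can be sketched as follows. This illustration assumes that the randomized spatial area baseline equals the fraction of the spectrogram area covered by the auditing band, which is the AFR a spatially uniform attribution map would score; the function names and all numeric values are hypothetical.

```python
def area_baseline(freqs_hz, band_hz):
    """AFR expected from uniform attribution: fraction of rows inside the band."""
    low, high = band_hz
    rows_in_band = sum(1 for f in freqs_hz if low <= f <= high)
    return rows_in_band / len(freqs_hz)


def audit_population(afr_per_frame, baseline):
    """Summarize per-frame AFR values and compare their mean with the baseline."""
    mean_afr = sum(afr_per_frame) / len(afr_per_frame)
    return {
        "mean_afr": mean_afr,
        "baseline": baseline,
        "above_baseline": mean_afr > baseline,
    }


# Hypothetical inputs: a 4-bin frequency axis and two sets of per-frame AFR values.
baseline = area_baseline([2.0, 5.0, 10.0, 20.0], band_hz=(4.0, 6.0))
print(audit_population([0.10, 0.20], baseline))
print(audit_population([0.70, 0.90], baseline))
```

A mean AFR below this baseline means the model attends to the band less than chance would, which is the condition described above for the fully fine-tuned variant.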
By deliberately freezing the pre-trained backbone and restricting training to the classification head (yielding a lower, but robust F1-score of ), the model was forced to seek macroscopic, structurally persistent features. As demonstrated in Section 4.4, this mathematically constrained “Scientist” model autonomously locked onto the physically derived 1P frequency and dynamically tracked its evolution.
It is crucial to acknowledge the inherent accuracy-interpretability trade-off demonstrated by these results. While the frozen-backbone configuration successfully secures physically grounded representations, its F1-score () is significantly lower than that of the fully fine-tuned unconstrained variant (), reflecting a diminished representational capacity when the deep feature space is locked. To overcome this limitation, future iterations of the Mechanism-Guided Verification framework should evolve from post hoc auditing to proactive physics-informed training. Rather than rigidly freezing the backbone, the computed Attention Focus Ratio (AFR) can be integrated into the optimization objective as a differentiable penalty term. By jointly minimizing the classification cross-entropy and an AFR-based alignment penalty (equivalently, maximizing the AFR), the network could in principle achieve high classification performance while respecting the defined aerodynamic frequency boundaries.
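One possible form of such an objective is written out below as a sketch, not as a formulation taken from this work. All symbols are assumptions introduced for illustration: $\lambda$ is a penalty weight, $M$ is a binary mask marking the auditing band on the spectrogram grid, $A$ is a non-negative attribution map, and $\epsilon$ is a small stabilizing constant.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{CE}} + \lambda \left( 1 - \mathrm{AFR} \right),
\qquad
\mathrm{AFR}
  = \frac{\sum_{i,j} M_{ij}\, A_{ij}}{\sum_{i,j} A_{ij} + \epsilon}
```

Since $0 \le \mathrm{AFR} \le 1$ for non-negative $A$, the penalty vanishes only when all attribution lies inside the band. Training with this term would require $A$ to be differentiable with respect to the network weights, which is an additional assumption for gradient-based attributions.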
This juxtaposition exposes a critical vulnerability in the current airworthiness certification paradigm, which predominantly relies on statistical metrics (e.g., Accuracy, F1-score) derived from independent test sets [16]. In safety-critical aerospace applications, an AI model that achieves high scores by memorizing non-physical noise is fundamentally unsafe; it will fail unpredictably under novel turbulent conditions [12].
Based on the quantitative interpretability demonstrated in this study, the obtained results provide strong technical grounds for recommending the inclusion of Mechanism-Guided Verification (MGV) methodologies into future AI certification frameworks. Rather than relying solely on black-box statistical metrics, regulatory authorities could consider evaluating the Attention Focus Ratio (AFR) as a supplementary, physics-informed auditing tool to ensure that deep learning algorithms deployed in flight-critical systems inherently align with established fluid-dynamic and structural realities.
5.2. Disentangling Physical Grounding from Geometrical Bias
A critical methodological question raised during peer evaluation is whether the model’s attention shift reflects a genuine distillation of aerodynamic regimes, or merely a trivial statistical response to gradient distributions driven by the physical weakening of the 1P resonance band. If the frozen-backbone ViT operated strictly as a rigid “horizontal line detector” or overfitted to ImageNet’s linear textures, the attenuation of the 1P fundamental frequency would logically cause a catastrophic collapse of its classification confidence or disperse its attention into random background noise.
However, as empirical evidence in Section 4.4 demonstrates, during the Extreme Buffet Phase (), the model seamlessly relinquishes its horizontal focus and dynamically reallocates its attention gradients to the chaotic, non-linear morphological blobs in the regime, maintaining a robust classification probability (). This adaptive transition indicates that the model tracks the fluid–structure interaction’s energy migration rather than clinging to geometric templates.
Nonetheless, we acknowledge that post hoc visual attributions cannot entirely isolate implicit network biases. To decouple architectural priors from physical reasoning more rigorously, future investigations will incorporate adversarial ablation paradigms, such as randomized frequency-axis shuffling and physical masking counter-tests, to further formalize the boundaries of certifiable scientific machine learning.
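A frequency-axis shuffling counter-test of the kind mentioned above could start from a transformation like the following. This is a sketch under assumptions: rows of the spectrogram are taken to be frequency bins, and the function name and seed handling are hypothetical.

```python
import random


def shuffle_frequency_axis(spectrogram, seed=0):
    """Randomly permute frequency rows while leaving each row's time series intact.

    Returns the shuffled spectrogram and the permutation used, so that the
    auditing band can be mapped to its new row positions if needed.
    """
    rng = random.Random(seed)
    order = list(range(len(spectrogram)))
    rng.shuffle(order)
    shuffled = [list(spectrogram[i]) for i in order]
    return shuffled, order


# Hypothetical 4-bin spectrogram with an energetic second row.
spec = [[0.1, 0.1, 0.1], [0.9, 0.9, 0.9], [0.2, 0.2, 0.2], [0.3, 0.3, 0.3]]
shuffled, order = shuffle_frequency_axis(spec, seed=42)
print(order)
```

If a model's confidence and attention follow the energetic row to its new position, it is responding to a horizontal texture rather than to the physical frequency; if they do not, the frequency location itself carries the evidence.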
5.3. Remark on Conservation Laws and Physical Proxies
A fundamental limitation of applying deep learning to fluid-dynamic environments is the risk of models achieving high accuracy while silently violating underlying conservation laws (e.g., mass and momentum conservation). It must be explicitly clarified that the proposed ViT architecture processes 2D CWT spectrograms derived from 1D structural vibrations, rather than predicting 3D spatial flow fields (velocity and pressure distributions). Consequently, directly computing Navier–Stokes residuals on the network’s output is mathematically infeasible.
To bridge this gap, the Mechanism-Guided Verification framework employs the aeroelastic structural response—specifically, the (1P) frequency band—as a rigorous physical proxy constraint. While the model does not explicitly solve volumetric fluid conservation equations, its learned classification boundary respects the classical structural-dynamic symptoms induced by those very fluid mechanics. By statistically verifying that the model’s global attention aligns with this established aerodynamic signature, we provide a pragmatic, verifiable pathway to physical grounding that bypasses the computational impossibility of real-time spatial residual calculation.
5.4. Methodological Generalization to Variable-RPM Platforms
Having formally established the three-step Mechanism-Guided Verification (MGV) framework in Section 3, a critical operational consideration for future deployment is the framework’s generalization across platforms with variable rotor speeds (Variable RPM).
While the current study and its corresponding baseline utilize a fixed operational condition (), the static predefined auditing mask can be adapted for variable-speed rotorcraft. By establishing a conditional token integration layer, the instantaneous tachometer frequency could be fed into the ViT’s Multi-Head Self-Attention blocks as a dynamic coordinate scaling prior. This would allow the tracking boundary to shift in real time alongside the mechanical speed governor, ensuring persistent physical alignment and quantitative proxy auditing without computationally expensive retraining of the core visual feature weights. This dynamic mask adaptation remains a theoretical extension of the MGV framework; experimental validation on variable-RPM flight test campaigns is reserved for future work.
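The mask-adaptation part of this idea can be illustrated independently of the network. The sketch below covers only the auditing mask, not the conditional token layer, and relies on the fact that the once-per-revolution (1P) frequency in hertz is the rotor speed in RPM divided by 60. The relative band half-width and the frequency axis are hypothetical.

```python
def one_per_rev_hz(rpm):
    """1P frequency: one cycle per rotor revolution, in hertz."""
    return rpm / 60.0


def band_for_rpm(rpm, rel_half_width=0.1):
    """Auditing band centred on 1P, widened by a relative tolerance (assumed)."""
    f_1p = one_per_rev_hz(rpm)
    return (f_1p * (1.0 - rel_half_width), f_1p * (1.0 + rel_half_width))


def band_rows(freqs_hz, band_hz):
    """Indices of spectrogram rows whose centre frequency lies inside the band."""
    low, high = band_hz
    return [i for i, f in enumerate(freqs_hz) if low <= f <= high]


# Hypothetical frequency axis of 1..10 Hz and two tachometer readings.
freqs_hz = [float(f) for f in range(1, 11)]
for rpm in (300.0, 360.0):
    print(rpm, band_rows(freqs_hz, band_for_rpm(rpm)))
```

As the tachometer reading changes, the selected rows move with the 1P line, so the same AFR audit can be applied without redefining the mask by hand.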
5.5. Traditional Baselines and Computational Cost Analysis
A fundamental inquiry addresses whether a simple classical signal-processing baseline, such as applying a hard energy threshold on the 1P frequency band, could achieve comparable VRS detection performance with negligible computational overhead. In traditional Health and Usage Monitoring Systems (HUMS), univariate condition indicators utilizing hard amplitude thresholds frequently suffer from unacceptably high false-alarm rates [27]. While computationally trivial, extracting raw vibration amplitude at inherently conflates aerodynamic anomalies with pilot-induced dynamics. In real-world operations, nominal aggressive maneuvers (e.g., steep pitch/roll commands) and turbulent wind gusts routinely alter the rotor’s operating conditions, intensely exciting the fundamental flap-lag structural modes and causing transient spikes in 1P energy [28]. The proposed ViT architecture goes beyond simple amplitude thresholding by globally capturing the 2D morphological evolution of the CWT spectrogram, differentiating between the transient spikes of control maneuvers and the sustained, chaotic frequency smearing characteristic of vortex ring interactions. Furthermore, ResNet50 was specifically selected as the primary deep learning baseline because it is a widely adopted and well-characterized benchmark in industrial computer vision, allowing for a rigorous, fair comparison between standard localized convolutions and global self-attention mechanisms.
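The weakness of a hard threshold can be seen in a toy version of the classical baseline. The sketch below is illustrative only: the band edges, threshold, and both synthetic signals are hypothetical, and band energy is taken as the sum of squared magnitudes over the in-band rows at each time step.

```python
def band_energy(spectrogram, freqs_hz, band_hz):
    """Per-time-step energy (sum of squared magnitudes) inside the band."""
    low, high = band_hz
    rows = [row for row, f in zip(spectrogram, freqs_hz) if low <= f <= high]
    n_steps = len(spectrogram[0])
    return [sum(row[t] ** 2 for row in rows) for t in range(n_steps)]


def threshold_alarm(energy, threshold):
    """Hard-threshold detector: alarm whenever band energy exceeds the threshold."""
    return [e > threshold for e in energy]


freqs_hz = [2.0, 5.0, 10.0]
quiet = [0.1] * 6
# Hypothetical maneuver: a single transient spike on the 5 Hz row.
maneuver = [quiet, [0.1, 0.1, 2.0, 0.1, 0.1, 0.1], quiet]
# Hypothetical VRS: sustained elevation on the same row.
vrs = [quiet, [0.1, 1.5, 1.5, 1.5, 1.5, 1.5], quiet]

for name, spec in (("maneuver", maneuver), ("vrs", vrs)):
    alarms = threshold_alarm(band_energy(spec, freqs_hz, (4.0, 6.0)), threshold=1.0)
    print(name, alarms)
```

Both signals trip the alarm, so the per-sample threshold cannot separate a transient maneuver from sustained VRS without extra logic such as persistence counters, which is the limitation the 2D spectrogram approach is meant to address.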
Regarding computational cost, the proposed framework leverages an asymmetric offline-online paradigm. The high-fidelity CFD simulations are strictly offline, demanding extreme computational resources (approximately 60 hours per segment on 128 CPU cores) to establish irrefutable physical ground-truth labels. In stark contrast, the online phase utilizes the pre-trained ViT for dynamic inference. Evaluated on a desktop-class NVIDIA RTX 3060 GPU, the single-frame forward pass for the frozen-backbone ViT requires approximately (>60 FPS), well within the real-time threshold of typical rotorcraft flight control loops ().
While current inference was executed on a desktop architecture for algorithmic validation, deploying standard ViT models to edge-computing hardware (e.g., NVIDIA Jetson AGX Orin) is highly feasible. By applying post-training quantization (INT8) and TensorRT operator fusion, the memory bandwidth bottlenecks of edge devices can be significantly mitigated, ensuring that the MGV framework maintains real-time predictive capabilities aboard resource-constrained UAV platforms without sacrificing its interpretable physical grounding.
6. Conclusions
This study addressed the critical challenge of identifying the highly unsteady Vortex Ring State (VRS) using only internal proprioceptive sensors, shifting the paradigm from building a mere “black-box classifier” to developing a transparent “scientific discovery tool”. By transforming one-dimensional vibration streams into two-dimensional CWT spectrograms [13], we framed aerodynamic state sensing as a visual pattern recognition task governed by the newly formalized Mechanism-Guided Verification (MGV) framework.
The empirical results and systematic ablation studies yield three critical conclusions. First, the explainable probe reveals a dual-stage attention paradigm inherent to the regularized Vision Transformer: it autonomously locks onto the (1P) fundamental frequency during the fully developed VRS phase, and dynamically migrates its focus toward low-frequency () buffeting structures as the aircraft enters extreme descent velocities. Second, the formal layer-specific ablation study demonstrates that backbone freezing acts as a mandatory architectural regularizer; it forces the ViT to abandon the high-frequency “shortcut learning” that plagues the unconstrained model, ensuring physically grounded feature extraction. In contrast, standard CNNs systematically exhibit diffuse attention scattered across non-physical instrumentation noise.
Finally, the MGV framework provides a pragmatic pathway for airworthiness certification by verifying physical grounding through established aeroelastic proxies (the 1P structural response), effectively bypassing the prohibitive computational overhead associated with real-time first-principles aerodynamic modeling and volumetric flow simulations. While the current quantitative auditing mask () is anchored to a fixed condition, the framework is designed to generalize to variable-RPM platforms via dynamic tachometer integration, as discussed in Section 5.4. By bridging data-driven decisions with classical fluid mechanics, this research lays a crucial, physically auditable foundation for the deployment of artificial intelligence in advanced, safety-critical aerospace systems.