The evaluation methodology in the target domain differs fundamentally from that in the source domain. Because the Stanford dataset lacks ground-truth anomaly labels, traditional classification metrics (Precision, Recall, F-β) cannot be computed. Instead, we evaluate the framework’s ability to rank flight maneuvers according to their mechanical complexity, measured by the monotonic relationship between the model’s predicted Normalized Anomaly Index (NAI) and the physics-based Control Effort Score (CES).
This evaluation approach rests on the assumption that more aggressive flight maneuvers induce higher mechanical stress and stronger vibration anomalies, which the model should register as elevated anomaly scores. A strong positive correlation between NAI and CES would indicate successful knowledge transfer: the model trained on the source domain (Airbus) can meaningfully assess mechanical stress in the target domain (Stanford) without any target-specific training.
3.2.2. Normalized Anomaly Index (NAI)
The Normalized Anomaly Index used for target domain evaluation is not a single metric but a final ensemble Index created by fusing 14 individual component scores from the model. The reason for using the term Index here is to clarify that this value is not an ‘Evaluation Score’ like F1 Score, but rather a ‘Measurement’ indicating a relative level calculated by synthesizing 14 components.
The 14 weights ($w_i$) presented in Table 18 were not set manually. They were found automatically by Differential Evolution, a global optimization algorithm. The objective of the optimization was to find the weights for which the Raw Anomaly Index rankings of the 15,517 samples align as closely as possible with the CES rankings. In other words, the goal was to find the weight combination ($w_i$) that maximizes the Spearman correlation coefficient (ρ) evaluated in Section 3.2.3. Through this optimization, components that contribute strongly to explaining target-domain difficulty (e.g., Pattern4) automatically receive high weights, while components that contribute little (e.g., Prediction) automatically receive low weights.
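The weight search described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the function name `fit_fusion_weights`, the weight bounds of [0, 1], the iteration budget, and the three-component toy data are all assumptions introduced here. It uses SciPy's `differential_evolution` and `spearmanr`.

```python
import numpy as np
from scipy.optimize import differential_evolution
from scipy.stats import spearmanr


def fit_fusion_weights(scores, ces, seed=0, maxiter=50):
    """Search for weights that maximize Spearman rho between scores @ w and CES.

    scores: array of shape (n_samples, n_components)
    ces:    array of shape (n_samples,)
    The [0, 1] bounds are an assumption; the paper does not state them.
    """
    n_components = scores.shape[1]

    def negative_rho(weights):
        fused = scores @ weights
        if np.ptp(fused) == 0.0:  # constant output: correlation is undefined
            return 1.0
        rho, _ = spearmanr(fused, ces)
        return -rho  # the optimizer minimizes, so negate

    bounds = [(0.0, 1.0)] * n_components
    # polish=False: the rank-based objective is piecewise constant,
    # so a gradient-based polishing step would not help.
    result = differential_evolution(
        negative_rho, bounds, seed=seed, maxiter=maxiter, tol=1e-6, polish=False
    )
    return result.x, -result.fun


# Synthetic stand-in data: one informative component and two noise components.
rng = np.random.default_rng(0)
ces = rng.uniform(1, 10, size=200)
scores = np.column_stack([
    ces + rng.normal(0, 1.0, size=200),
    rng.normal(size=200),
    rng.normal(size=200),
])
weights, rho = fit_fusion_weights(scores, ces)
```

In this toy setting the informative component should receive the largest weight, mirroring the way high-contribution components are described as receiving high weights.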
Group A (Model-direct Analysis): These values are calculated directly from the output heads of the transformer model trained in Stage 1 (Airbus), e.g., prediction error, reconstruction error, and Anomaly Head prediction values. They are independent of the CORAL adaptation in Section 3.2.1.
Group B (Embedding-based Analysis): These scores are obtained by converting the ‘Sequence Representation’ extracted from the Stage 1 model into ‘Aligned Embeddings’ through CORAL (Section 3.2.1) and then analyzing the aligned embeddings with traditional unsupervised learning models such as GMM, K-means, and LOF.
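To make the alignment step behind Group B concrete, the sketch below shows a generic CORAL-style transform in NumPy: one set of embeddings is whitened with its own covariance and re-colored with the covariance of the other set. It is an illustration under stated assumptions, not the procedure of Section 3.2.1: the direction of alignment, the small ridge term, the explicit mean shift, and the names `coral_align` and `sym_matrix_power` are all introduced here.

```python
import numpy as np


def sym_matrix_power(mat, power):
    """Power of a symmetric positive-definite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.clip(eigvals, 1e-12, None)  # guard against tiny negatives
    return (eigvecs * eigvals ** power) @ eigvecs.T


def coral_align(source, target, ridge=1e-6):
    """Map `source` embeddings so their mean and covariance match `target`.

    The ridge term and the mean shift are illustrative choices (assumptions).
    """
    dim = source.shape[1]
    cov_s = np.cov(source, rowvar=False) + ridge * np.eye(dim)
    cov_t = np.cov(target, rowvar=False) + ridge * np.eye(dim)
    whitened = (source - source.mean(axis=0)) @ sym_matrix_power(cov_s, -0.5)
    return whitened @ sym_matrix_power(cov_t, 0.5) + target.mean(axis=0)


# Synthetic embeddings with deliberately different statistics.
rng = np.random.default_rng(3)
source_emb = rng.normal(size=(500, 4)) * np.array([1.0, 2.0, 0.5, 3.0])
target_emb = rng.normal(size=(400, 4)) @ rng.normal(size=(4, 4)) + 5.0
aligned = coral_align(source_emb, target_emb)
```

After alignment, the second-order statistics of `aligned` match those of `target_emb`, which is the property that lets outlier detectors such as GMM or LOF operate in a shared space.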
The optimized weights ($w_i$) in Table 18 and Figure 3 clearly demonstrate the success factors of the proposed 2-Flow ensemble framework.
First, the most decisive weights came from temporal patterns learned directly by the transformer. The components with the highest weights were Pattern4 ($w_i$ = 0.2906) and Temporal ($w_i$ = 0.2137), which together accounted for just over 50% of the total weight. Both components belong to Group A (Model-direct Analysis). This suggests that the ‘fundamental temporal patterns of vibration signals’ learned by the transformer through multi-task learning during Stage 1 (Airbus) pre-training were the strongest predictors of maneuver complexity in the target domain (Stanford), despite its very different scale.
Second, the ‘geometric characteristics’ of the CORAL-aligned embedding space served as powerful auxiliary indicators. The components receiving the next highest weights after Pattern4 and Temporal were Elliptic ($w_i$ = 0.1330) and LOF ($w_i$ = 0.1010) from Group B (Embedding-based Analysis). This suggests that the CORAL domain adaptation performed in Section 3.2.1 was successful. That is, as a result of statistically aligning the embeddings of both domains, detecting ‘geometric outliers’ that deviate from the main cluster of normal data (Elliptic, LOF) served as a highly significant indicator for identifying anomalies.
Third, the auxiliary losses from Stage 1 received low weights in the final score fusion. The Prediction ($w_i$ = 0.0080) and Reconstruction ($w_i$ = 0.0279) components received the lowest weights. This suggests that, rather than contributing directly to the final score, these heads mainly played an auxiliary role during Stage 1 pre-training, helping the model learn richer features such as the Pattern heads and the Sequence Representation.
In conclusion, the high performance (ρ = 0.903) of the proposed framework does not come from a single approach but from fusing two heterogeneous information flows: (A) temporal patterns directly learned by the transformer and (B) geometric outlier analysis of the CORAL-aligned embedding space. The 14 individual component scores ($s_i$) are combined with the optimal weights ($w_i$) defined in Table 18 to calculate the Raw Anomaly Index (RAI) through the following weighted sum:
$$\mathrm{RAI} = \sum_{i=1}^{14} w_i \, s_i$$
The weights were optimized using Differential Evolution, as shown in Table 19 and Table 20. The measured RAI fell in an arbitrary range between 0.4443 and 0.5710.
The algorithm converged consistently across 5 independent runs with different seeds, achieving ρ = 0.903 ± 0.002. The low variance (coefficient of variation < 5% for high-weight components) indicates that the optimization is stable. The objective function was to maximize the Spearman correlation coefficient between CES and the weighted sum of component scores: $\max_{w_1, \ldots, w_{14}} \rho\left(\mathrm{CES}, \sum_{i} w_i s_i\right)$.
Several clarifications regarding the NAI computation are warranted. First, the component weights $w_i$ in Table 18 do not sum exactly to 1 ($\sum_i w_i = 1.0001$). This is intentional: since Spearman correlation is rank-based and depends only on the rank order of values rather than their absolute scale, the sum of weights does not affect the correlation. Constraining $\sum_i w_i = 1$ would reduce the search space and potentially degrade performance. For interpretability, weights can be normalized post hoc ($w_i' = w_i / \sum_j w_j$), with the normalized weights differing by <0.3% from the original values.
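The scale-invariance argument above can be checked numerically. The snippet below is a small demonstration on random data (the arrays and seed are illustrative assumptions, not values from the paper): dividing all weights by their sum rescales the fused score by a positive constant, which leaves the ranks, and therefore Spearman's ρ, unchanged.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
scores = rng.normal(size=(100, 14))     # synthetic component scores
ces = rng.uniform(1, 10, size=100)      # synthetic CES values
w = rng.uniform(0, 1, size=14)          # arbitrary non-negative weights
w_norm = w / w.sum()                    # post hoc normalization to sum 1

rho_raw, _ = spearmanr(scores @ w, ces)
rho_norm, _ = spearmanr(scores @ w_norm, ces)
# rho_raw and rho_norm agree because ranks are invariant to positive scaling.
```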
Second, before computing the weighted sum, each component score $s_i$ is z-score standardized:
$$z_i = \frac{s_i - \mu_i}{\sigma_i}$$
where $\mu_i$ and $\sigma_i$ are the mean and standard deviation computed over all 15,517 Stanford samples. This standardization ensures that all components have mean ≈ 0 and std ≈ 1, enabling fair comparison in the weighted combination regardless of their original scales (e.g., prediction error in [0.01, 0.08] vs. LOF score in [0.8, 3.2]).
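A minimal sketch of this per-component standardization is given below. The function name `zscore_columns`, the guard for zero-variance columns, and the synthetic data are assumptions made for illustration; the two synthetic columns merely mimic the example ranges quoted above.

```python
import numpy as np


def zscore_columns(scores):
    """Standardize each column (component) to zero mean and unit std."""
    mu = scores.mean(axis=0)
    sigma = scores.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # avoid division by zero
    return (scores - mu) / sigma


# Two synthetic components on very different scales.
rng = np.random.default_rng(2)
raw = np.column_stack([
    rng.uniform(0.01, 0.08, size=1000),  # e.g. a prediction-error-like range
    rng.uniform(0.8, 3.2, size=1000),    # e.g. a LOF-like range
])
standardized = zscore_columns(raw)
```

After this step both columns contribute on a comparable scale, so the optimized weights reflect relevance rather than the units of each component.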
Third, the linear combination approach is primarily empirical but grounded in ensemble learning theory: combining diverse predictors reduces variance and improves robustness. Empirical validation showed that optimized weights (ρ = 0.903) significantly outperformed equal weights (ρ = 0.756) and heuristic weights (ρ = 0.834).
Since the absolute values of the Raw Anomaly Index (RAI) are difficult to interpret intuitively, Min-Max Scaling was applied to map them to values between 0 (minimum anomaly) and 1 (maximum anomaly). The scaled value is defined as the Normalized Anomaly Index (NAI):
$$\mathrm{NAI} = \frac{\mathrm{RAI} - \mathrm{Index}_{\min}}{\mathrm{Index}_{\max} - \mathrm{Index}_{\min}}$$
Here, $\mathrm{Index}_{\min}$ (0.4443) and $\mathrm{Index}_{\max}$ (0.5710) are the minimum and maximum values of RAI observed across the 20 maneuvers.
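As a quick worked example of the Min-Max Scaling step, the helper below maps a Raw Anomaly Index value onto the 0 to 1 range using the reported minimum (0.4443) and maximum (0.5710). The function name `normalize_index` is hypothetical, and the midpoint input is only an illustrative value, not a measured RAI.

```python
def normalize_index(rai, index_min=0.4443, index_max=0.5710):
    """Min-max scale a Raw Anomaly Index value to the range [0, 1]."""
    return (rai - index_min) / (index_max - index_min)


nai_lowest = normalize_index(0.4443)    # the smallest observed RAI maps to 0
nai_highest = normalize_index(0.5710)   # the largest observed RAI maps to 1
nai_midpoint = normalize_index((0.4443 + 0.5710) / 2)  # halfway maps to 0.5
```

Because the transform is monotonic, it changes only the scale of the index and has no effect on rank-based statistics such as Spearman's ρ.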
Figure 4 summarizes the Normalized Anomaly Index (NAI) values measured by the model for each of the 20 flight maneuvers. The values range from 0 (Forward Sideways flight, the lowest) to 1 (Freestyle aggressive, the highest), so the relative differences between flight maneuvers can be read directly. Higher values indicate higher anomaly levels predicted by the model.
Notably, the distribution of NAI values reveals a clear correlation between flight aggressiveness and predicted anomaly levels. Stabilized flight maneuvers such as Forward, Sideways, and Hover exhibited consistently lower NAI values, indicating that the model correctly identified these as baseline normal operating conditions. In contrast, dynamic maneuvers involving rapid attitude changes, such as Freestyle variations and Tic-Toc, demonstrated progressively higher anomaly indices. This gradient pattern across the maneuver spectrum confirms that the framework effectively learned to associate control input intensity with corresponding vibration signatures, even when transferring knowledge across vastly different aircraft scales. The fact that these distinctions emerged without explicit supervision in the target domain underscores the practical value of the proposed approach for real-world helicopter maintenance scenarios where labeled anomaly data are rarely available.
3.2.3. Correlation Analysis of CES and Normalized Anomaly Index (NAI)
To quantitatively validate this observed pattern, we compared the CES described in Section 2 with the NAI values. Specifically, we analyzed the correlation between CES, which represents the physical complexity of the 20 flight maneuvers, and the Normalized Anomaly Index (NAI) predicted by the model fully trained through the framework. The analysis yielded a Spearman correlation coefficient ρ of 0.903 (p-value < 0.001), indicating a very strong positive correlation, as shown in Table 21. This indicates that the model trained on the source domain detected vibration patterns related to the physical complexity (difficulty) of maneuvers even in a target domain of completely different scale.
The Kendall tau (τ) value, which measures rank agreement, was 0.768 (p-value < 0.001), indicating that the model distinguished the relative difficulty order between maneuvers with high accuracy. Additionally, comparing the score distributions of the Hard group (Difficulty 6–10) and the Easy group (Difficulty 1–5) gave a Cliff’s Delta (δ) of 0.920, confirming that the two groups were clearly separated, as summarized in the statistical significance tests (see Table 22).
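For readers who want to reproduce this kind of rank-agreement and effect-size analysis, a minimal sketch follows. The `toy_nai` values are invented for illustration and are not the paper's measurements; `cliffs_delta` is a straightforward pairwise implementation written here, while `kendalltau` comes from SciPy.

```python
import numpy as np
from scipy.stats import kendalltau


def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    return (np.sum(x > y) - np.sum(x < y)) / (x.size * y.size)


# Toy data (assumption): one maneuver per difficulty rank, two rank swaps.
difficulty = np.arange(1, 11)
toy_nai = np.array([0.00, 0.10, 0.30, 0.20, 0.40, 0.50, 0.60, 0.80, 0.70, 1.00])

tau, p_value = kendalltau(difficulty, toy_nai)

hard = toy_nai[difficulty >= 6]   # Hard group: Difficulty 6-10
easy = toy_nai[difficulty <= 5]   # Easy group: Difficulty 1-5
delta = cliffs_delta(hard, easy)
```

Cliff's delta ranges from -1 to 1, with 1 meaning every Hard-group score exceeds every Easy-group score, so a value close to 1 corresponds to nearly complete separation of the two groups.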
Overall, as the CES Difficulty Rank (1–10) increased, the model’s Normalized Anomaly Index (NAI) showed a strong monotonic increasing trend, rising from 0.0000 (Forward sideways flight) to 1.0000 (Freestyle aggressive). This relationship is not perfectly monotonic, however. For example, the difficulty 7 maneuvers Circles (0.2644) and Orientation sweeps (0.2778) recorded lower values than the difficulty 5 maneuver Dodging demos2 (0.5099). Likewise, difficulty 10 Chaos (0.7229) was lower than difficulty 9 Tictocs (0.7743). These minor discrepancies arguably support the validity of the framework. While CES measures difficulty based on pilot control inputs (Control Effort) in the Stanford dataset, the proposed model detects anomalies based on features such as vibration patterns learned by the multi-task transformer model from the source domain (Airbus). The fact that two metrics with different physical bases correlate strongly (ρ = 0.903) while still showing such minor differences suggests that the model has not simply overfit to CES itself, but has learned the same concept of ‘maneuver complexity’ from independent vibration evidence.
The scatter plot in Figure 5 shows the strong positive monotonic relationship (ρ = 0.903, p-value < 0.001) between the model-predicted NAI and the physics-based maneuver complexity CES, demonstrating the validity of label-free zero-shot transfer learning.
The Spearman correlation of ρ = 0.903 with p < 0.001 indicates a very strong positive correlation that is highly unlikely to occur by chance. The bootstrap 95% confidence interval [0.856, 0.943] does not include zero, confirming the reliability of the correlation estimate. The Kendall τ = 0.768 (p < 0.001) indicates strong concordance between CES and NAI rankings. Cliff’s Delta δ = 0.920 represents a large effect size, indicating that the Hard maneuver group (CES Difficulty 6–10) and Easy maneuver group (CES Difficulty 1–5) are statistically well-separated by the model’s predictions. Additionally, we performed a permutation test by randomly shuffling the CES-NAI pairings 1000 times. None of the permuted correlations exceeded the observed ρ = 0.903, yielding p < 0.001. This confirms that the observed correlation reflects genuine knowledge transfer rather than spurious associations.
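The bootstrap interval and permutation test described above can be sketched as follows. This is an illustrative implementation under assumptions: a percentile bootstrap, an add-one correction for the permutation p-value, and synthetic paired data standing in for the CES and NAI values. The function names `bootstrap_ci` and `permutation_pvalue` are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr


def bootstrap_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for Spearman's rho."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    rhos = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))  # resample pairs
        if np.ptp(x[idx]) == 0 or np.ptp(y[idx]) == 0:
            continue  # skip degenerate resamples with undefined rho
        rho, _ = spearmanr(x[idx], y[idx])
        rhos.append(rho)
    return np.quantile(rhos, [alpha / 2, 1 - alpha / 2])


def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """One-sided permutation p-value for a positive Spearman correlation."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    observed, _ = spearmanr(x, y)
    exceed = 0
    for _ in range(n_perm):
        rho, _ = spearmanr(x, rng.permutation(y))  # break the pairing
        exceed += rho >= observed
    return (exceed + 1) / (n_perm + 1)  # add-one correction


# Synthetic paired data (assumption): 20 items with a strong monotonic trend.
rng = np.random.default_rng(5)
x_demo = np.arange(20, dtype=float)
y_demo = x_demo + rng.normal(0, 1.0, size=20)
ci_low, ci_high = bootstrap_ci(x_demo, y_demo)
p_perm = permutation_pvalue(x_demo, y_demo)
```

With the add-one correction, zero exceedances in 1000 permutations gives p = 1/1001 ≈ 0.000999, which is consistent with the reported p < 0.001.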
Figure 6 provides a three-dimensional view of how the framework proposed in this study operates. Each of the three axes carries different meanings, and through their relationships, we can deeply understand the framework’s performance.
The X-axis CES (Control Effort Score) Difficulty Rank represents the objective difficulty of helicopter control. This value was calculated by combining volatility and aggressiveness of control inputs, ranging from 1 to 10, with higher values indicating more difficult flight maneuvers. The four colored regions (Easy, Moderate, Hard, Extreme) shown on the base plane visually distinguish these difficulty levels.
The Y-axis Pattern4 Component Score is a particularly interesting discovery. The fact that Pattern4 received the highest weight of 29.06% among the model’s 14 detection components means that this component captures the most important temporal pattern in helicopter anomaly detection. The graph shows a clear trend of Pattern4 score increasing with CES difficulty. Starting from low activation of 0.182 in Dodging Demos 1, it reaches almost maximum activation of 0.924 in Freestyle Aggressive. This shows that the model detects specific temporal patterns more strongly in complex flight maneuvers.
The Z-axis NAI (Normalized Anomaly Index) is the final combined anomaly value of 14 components measured by the model, normalized between 0 and 1. Notably, NAI shows a very strong positive correlation (Spearman ρ = 0.903) with CES Difficulty Rank. This means that knowledge learned from Airbus helicopter data was successfully transferred to Stanford RC helicopter data. It is particularly important that such high correlation was achieved without additional target domain learning.
The selection of four representative flight maneuvers is also meaningful. By selecting one from each difficulty group, the entire difficulty spectrum was evenly represented. Particularly, comparing Flips Loops (NAI = 0.623) and Turn Demos 3 (NAI = 0.647), while CES ranks differ at 8 and 6, NAI shows similar levels. However, Pattern4 scores show clear differences of 0.687 and 0.453, suggesting that Pattern4 independently captures specific temporal characteristics of flight maneuvers rather than having a simple linear relationship with NAI.
The translucent surface represents a manifold indicating the learned relationship between the three dimensions. The fact that this surface aligns well with data points shows that the model effectively learned the complex nonlinear relationships between the three variables. The trajectory of four points connected by red lines clearly shows the trend of both Pattern4 and NAI increasing with difficulty, demonstrating the framework’s consistency and stability.
The particular value of this visualization in the paper lies in visually decomposing how a specific component (Pattern4) operates to achieve high performance. This is an important contribution in providing interpretability for the internal workings of deep learning models that could be considered black boxes.
To further support the interpretability of Pattern 4’s role in capturing non-stationary signal characteristics, Figure 7 and Figure 8 present a comparative time–frequency and spectral analysis between simple and aggressive flight maneuvers. A systematic frequency analysis was conducted to identify bands with the largest spectral differences, revealing three prominent bands: Band A (80–90 Hz), Band B (95–110 Hz), and Band C (115–130 Hz).
The spectrograms in Figure 7 illustrate non-stationary characteristics that are consistent with those attributed to Pattern 4. Forward Sideways (panel a), representing the easiest maneuver (CES Difficulty Rank 1), exhibits consistent spectral content across all highlighted bands over time, characteristic of stable flight dynamics. In contrast, Freestyle Aggressive (panel b), representing the most difficult maneuver (CES Difficulty Rank 10), shows substantial time-varying energy distribution with transient bursts particularly visible in the highlighted bands.
Figure 8 quantifies these differences through power spectral density (PSD) comparison. Band A (80–90 Hz) shows the largest difference of +6.3 dB higher energy (≈4.3× power), Band B (95–110 Hz) shows +4.9 dB (≈3.1× power), and Band C (115–130 Hz) shows +4.7 dB (≈3.0× power) for aggressive maneuvers. All three bands demonstrate substantially elevated energy levels during aggressive flight, with differences ranging from approximately 3× to 4× power increase.
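The band-wise comparison above can be illustrated with a short sketch that estimates band power from a Welch PSD and converts between decibels and power ratios. The sampling rate of 1000 Hz, the segment length, the synthetic sine-plus-noise signals, and the function names are assumptions made for illustration; they are not taken from the Stanford dataset.

```python
import numpy as np
from scipy.signal import welch


def band_power(signal, fs, band, nperseg=1024):
    """Integrate the Welch PSD estimate over a frequency band (low, high)."""
    freqs, psd = welch(signal, fs=fs, nperseg=nperseg)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return psd[mask].sum() * (freqs[1] - freqs[0])


def band_difference_db(sig_a, sig_b, fs, band):
    """Band power of sig_b relative to sig_a, in decibels."""
    return 10.0 * np.log10(band_power(sig_b, fs, band) / band_power(sig_a, fs, band))


def db_to_power_ratio(db):
    """Convert a decibel difference to a linear power ratio."""
    return 10.0 ** (db / 10.0)


# Synthetic signals: an 85 Hz tone whose amplitude doubles (4x power).
fs = 1000.0  # assumed sampling rate in Hz
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(4)
calm = 1.0 * np.sin(2 * np.pi * 85 * t) + 0.01 * rng.normal(size=t.size)
aggressive = 2.0 * np.sin(2 * np.pi * 85 * t) + 0.01 * rng.normal(size=t.size)

diff_db = band_difference_db(calm, aggressive, fs, (80.0, 90.0))
ratios = [round(db_to_power_ratio(d), 1) for d in (6.3, 4.9, 4.7)]
```

The conversion 10^(dB/10) reproduces the power ratios quoted in the text: 6.3 dB, 4.9 dB, and 4.7 dB correspond to roughly 4.3×, 3.1×, and 3.0× power.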
The transient energy bursts observed in these frequency bands are consistent with the non-stationary components that Pattern 4 is designed to capture through attention-based temporal pattern recognition. Taken together, the time–frequency and PSD analyses are consistent with Pattern 4’s high weight assignment ($w_4$ = 0.2906) in the NAI computation, suggesting that the learned attention patterns are physically meaningful and aligned with vibration-characteristic differences between flight conditions.
From a physical perspective, the 80–130 Hz frequency range highlighted in this analysis is plausible for small-scale rotorcraft dynamics because higher rotor speeds shift prominent rotor-related harmonics into the mid-frequency band. The Stanford RC helicopter used in this study operates at approximately 1500–2000 RPM (≈25–33 Hz in 1/rev), so the 80–130 Hz region is more consistent with higher-order rotor harmonics (e.g., ≈3–5/rev) and/or airframe/drive-train structural modes that can be excited by rotor forcing. During aggressive maneuvers such as Freestyle Aggressive, rapid cyclic and collective pitch changes modulate blade loading and hub forces, producing non-stationary, broadband energy increases around these harmonics and nearby modes.
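The arithmetic behind this harmonic argument can be made explicit with a few lines of code. The helper names below are hypothetical, and the calculation assumes only the rotor speeds and frequency range quoted above.

```python
def rev_frequency_hz(rpm):
    """Once-per-revolution (1/rev) frequency in Hz for a given rotor speed."""
    return rpm / 60.0


def harmonics_in_band(rpm, band_hz):
    """Integer rotor harmonics k such that k * (1/rev) lies inside the band."""
    low, high = band_hz
    f_rev = rev_frequency_hz(rpm)
    max_k = int(high / f_rev) + 1
    return [k for k in range(1, max_k + 1) if low <= k * f_rev <= high]


f_slow = rev_frequency_hz(1500)   # 1/rev at the lower rotor speed
f_fast = rev_frequency_hz(2000)   # 1/rev at the higher rotor speed
orders_slow = harmonics_in_band(1500, (80.0, 130.0))
orders_fast = harmonics_in_band(2000, (80.0, 130.0))
```

At 1500 RPM the 4/rev and 5/rev harmonics (100 Hz and 125 Hz) fall inside 80–130 Hz, and at 2000 RPM the 3/rev harmonic (about 100 Hz) does, which matches the ≈3–5/rev range stated in the text.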
The distinct spectral signatures observed between simple and aggressive maneuvers also reflect fundamental differences in flight-dynamics stability. Forward Sideways flight, despite involving lateral movement, maintains relatively constant rotor-disk loading and predictable aerodynamic conditions, resulting in the stable spectral patterns visible in Figure 7a. In contrast, Freestyle Aggressive involves continuous attitude changes, rapid accelerations, and dynamic thrust vectoring that create time-varying aerodynamic forces. These forces propagate through the rotor system and airframe structure, generating the transient vibration bursts visible in Figure 7b. In addition, the observation that Pattern 4, learned from Airbus helicopter data, highlights similar transient/non-stationary characteristics in a geometrically different RC helicopter provides supporting evidence for the cross-scale generalization capability of the proposed framework.