Review Reports - Intent-Aware CNN–Informer for Long-Horizon Trajectory Prediction of Cross-Domain Unmanned Aerial Vehicles in Constrained Environments

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

I believe this manuscript shows significant value for a first submission. However, it still has some potentially serious flaws. The questions below must be answered, and the corresponding issues fixed, carefully.

1. The claim that DBL parameters are "physically interpretable" is overstated without a rigorous sensitivity analysis linking them to actual control surface deflections. You need to provide quantitative evidence of how these parameters map to physical actuation limits. This hand-waving interpretation weakens the methodological foundation significantly.

2. Your intent features assume perfect knowledge of no-fly zone boundaries and destinations, yet you never discuss how errors in these priors affect prediction robustness. Show me the degradation curves when destination coordinates are corrupted by noise. Without this, your approach has little operational value.

3. The CNN-Informer hybrid is presented as novel, but similar cascaded architectures have been published in trajectory forecasting since 2021. You must clarify what specific architectural innovation distinguishes your work from existing CNN-Transformer hybrids. The current description reads like a routine engineering application.

4. Training and testing on trajectories generated from the same optimized library introduces dangerous circular reasoning. Validate your model on trajectories with completely different initial condition distributions, not just perturbed within narrow ranges. The current experimental design is fundamentally flawed for generalization claims.

5. Your ablation study only removes entire feature groups but never isolates interaction effects between control parameters and intent features. Design additional experiments where you ablate single features while keeping others fixed. The current analysis tells us nothing about feature redundancy.

6. The Pearson correlation analysis is misused for time series data where autocorrelation violates independence assumptions. Apply partial correlation or mutual information instead. Your current Figure 5 is statistically invalid for sequential observations.

7. You claim a 17.2% error reduction over Informer, but this is meaningless without reporting confidence intervals or statistical significance tests. Perform paired bootstrap tests across multiple random seeds. A single percentage without variance bounds is unacceptable for a flagship result.

8. The CS-UKF tracking method is mentioned without justification for why it was chosen over simpler or more modern alternatives. Justify this choice with computational or accuracy comparisons. Otherwise, it appears arbitrary.

9. Your dataset contains only 658 base trajectories, which is grossly insufficient for training a transformer-scale model with 512 hidden dimensions. Calculate the actual sample-to-parameter ratio and discuss overfitting risks. You are likely memorizing trajectories rather than learning dynamics.

10. Sliding window segmentation from only 658 trajectories does not create independent samples. Report the effective degrees of freedom after accounting for temporal autocorrelation. Your training set size claim of 21700 is deceptive.

11. The early stopping patience of 40 epochs is excessively large for a dataset of this size. Demonstrate that validation loss plateaus well before this threshold. This suggests you are unnecessarily risking overfitting.

12. You use MAE/MSE loss which averages over trajectory errors, but this ignores the asymmetric cost of different error directions for collision avoidance. Justify your loss function choice in the context of safety-critical applications. Standard regression losses are often inappropriate here.

13. The comparison with SSD-LSTM is unfair because that model was designed for different trajectory types and input representations. Re-implement SSD-LSTM with your feature set and tune it properly. The current comparison stacks the deck in your favor.

14. Your terminal error EF is defined at a fixed prediction horizon, but trajectory prediction performance is highly sensitive to horizon length. Report errors as functions of prediction time, not just single points. A single number hides when your model actually fails.

15. The intent features dn, Δψ, and vc are geometrically derived but you never prove they are sufficient for intent representation. Provide a theoretical argument or empirical evidence that no other intent-relevant quantity is needed. This is an arbitrary feature selection.

16. You claim the model handles "abrupt reversal" of bank angle, but Figure 9 shows your predictions still lag during transients. Quantify the delay in response using cross-correlation analysis. Smoothing does not equal accurate prediction.

17. The control-affine transformation assumes perfect knowledge of aerodynamic coefficients CD0 and K, which are notoriously uncertain in real flight. Test your method with ±20% perturbations in these coefficients. Your current results assume an unrealistically perfect model.

18. Your encoder-decoder generates control parameters first, then integrates to positions, but this compounds errors. Analyze how control prediction errors propagate through integration. The indirect prediction path is a structural weakness.

19. The no-fly zone model uses infinite-height cylinders, which is physically unrealistic for most UAV operations. Discuss how your method would extend to altitude-dependent restrictions. This simplification limits real-world applicability.

20. You never compare against a pure physics-based numerical integration baseline. Add a simple Runge-Kutta propagator using nominal aerodynamics to establish a lower performance bound. Without this, we cannot tell if your deep learning is actually adding value.

21. The training/validation/test split is described only as "in proportion" without specifying the exact ratio per trajectory. Report the exact numbers and ensure no trajectory segments appear in multiple splits. Current description is ambiguous.

22. Your feature normalization uses extrema from the training set only, but you never check if test set values fall outside these ranges. Report the percentage of test samples that required extrapolation. This is a basic robustness check you missed.

23. The SE channel attention module is mentioned but its contribution is never ablated. Remove it and report performance changes. You cannot claim it helps without evidence.

24. You claim the model captures "long-range dependencies" but your maximum prediction horizon is only 200 seconds of 1Hz data. Demonstrate performance on 10x longer sequences or retract the claim. 200 steps is not long-range for transformer architectures.

25. The initial state variations (r0: 70-72km, v0: 5500-6200 m/s) are extremely narrow compared to real operational envelopes. Expand your training distribution or explicitly state this limitation. Current generalization claims are unsupported.

26. Your objective function J includes terminal position error but you never verify that optimized trajectories actually satisfy no-fly zone constraints. Post-optimization constraint violation checking is mandatory. The current trajectory generation may produce invalid data.

27. The relative closing velocity vc becomes positive only when heading toward destination, but you never address sign conventions in your feature analysis. Define clearly what positive and negative vc physically mean. This ambiguity affects reproducibility.

28. You have six authors but the writing contains multiple redundant passages, especially in the introduction and feature analysis sections. Consolidate repetitive descriptions of the same contributions. The paper length could be cut by 30% without loss.

29. The comparison with iTransformer and DLinear uses default hyperparameters without tuning for this specific task. Conduct proper hyperparameter sweeps for all baselines. Unfair tuning is a common way to inflate performance claims.

30. Your method requires simultaneous knowledge of all no-fly zones, but in real scenarios zones may be discovered sequentially. Test a streaming setting where zones appear over time. Current batch assumption is unrealistic.

31. The correlation analysis reveals that Δψ and vc are nearly perfectly correlated (0.9638), suggesting redundancy. Remove one of them and re-evaluate. Your feature set likely contains unnecessary dimensions.

32. You never validate that the DBL parameters are actually easier to learn than raw angle of attack and bank angle. Train a baseline that predicts α and σ directly. Without this comparison, the DBL advantage is unsubstantiated.

33. The Earth radius Rc appears in Eq (21) but you never specify its numerical value or whether you used a spherical or ellipsoidal model. Report your exact geodetic assumptions for reproducibility.

34. Your loss curves in Figure 11 show Transformer converging to a lower validation loss than your method at early epochs. Explain why your method eventually overtakes but starts slower. Initial convergence behavior reveals architectural limitations.

35. The phrase "cross-domain unmanned aerial vehicle" is never properly defined in terms of what domains are crossed (air/space/water?). Provide a precise operational definition or use a standard term. This jargon is unclear.

36. You cite reference [36] for intent inference but this is "in Chinese" and potentially inaccessible. Either summarize the relevant method or remove the dependency. Your paper should be self-contained.

37. The maximum prediction error EM is nearly identical to EF for most models, indicating that the worst error always occurs at the final timestep. Analyze why errors do not peak earlier. This suggests your models are drifting systematically.

38. You use GELU activation but never justify why it was chosen over ReLU or Swish. Perform an ablation on activation functions. This is a standard hyperparameter you ignored.

39. The trajectory tracking starts at 200 seconds for all samples, meaning early flight dynamics are completely excluded. Validate that prediction performance does not depend on this arbitrary truncation point. Your results may be phase-specific.

40. You claim the method is "transferable" to other UAVs but provide zero evidence on different platforms. Test on at least one other vehicle model with different aerodynamics. Current claims of generality are aspirational.

41. The dropout rate of 0.05 is extremely low for a model with 512-dim hidden states. Sweep higher dropout values to check for overfitting. You are likely under-regularizing.

42. Your encoder uses three ProbSparse layers but you never ablate the depth. Test with 1, 2, 4, and 6 layers. The current choice appears arbitrary without scaling analysis.

43. The relative closing velocity vc is computed purely kinematically but should also depend on thrust and drag. Reformulate vc to incorporate energy state or remove the misleading name. Current definition misses key physics.

44. You mention "gradient vanishing problem" but your network is only a few layers deep with skip connections. Compute the actual gradient norms during training. This is not a real issue for your architecture.

45. The test set size of 1500 samples is small for evaluating transformer models. Report confidence intervals using bootstrapping over at least 1000 resamples. Point estimates on 1500 samples are unstable.

46. Your conclusion claims the method is validated on a "representative CDUAV scenario" but you only used one aerodynamic model. Test on multiple vehicle configurations with different lift-to-drag ratios. One scenario proves nothing about representativeness.

47. The current manuscript suffers from an insufficient literature review that fails to establish its novelty against recent relevant advances. It ignores several key contemporary studies published in 2024-2025. Including the following works on cross-domain modeling and prediction would be beneficial: "Physics-inspired time-frequency feature extraction and lightweight neural network for power quality disturbance classification"; "Short-term power load forecast using OOA optimized bidirectional long short-term memory network with spectral attention for the frequency domain".

48. The data availability statement says data is "included in the article" but no actual data or code link is provided. Release the trajectory dataset and implementation code for reproducibility. Without open data, your results cannot be verified.

49. Provide all boundary conditions for model development.
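
As a rough illustration of the check requested in comments 7 and 45, the sketch below computes a percentile bootstrap confidence interval for the relative error reduction of one model over another, resampling paired per-sample errors. The function name `paired_bootstrap_ci`, its parameters, and the synthetic error values are assumptions made for illustration; they are not the manuscript's data or the authors' procedure.

```python
import random
import statistics


def paired_bootstrap_ci(errors_a, errors_b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the relative error reduction of B over A.

    errors_a[i] and errors_b[i] must be the errors of the two models on the
    same test sample i; resampling indices jointly keeps the pairing intact.
    Returns (point_estimate, lower, upper), all in percent.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("paired samples must have equal length")
    rng = random.Random(seed)
    n = len(errors_a)
    reductions = []
    for _ in range(n_resamples):
        # Draw n test-sample indices with replacement.
        idx = rng.choices(range(n), k=n)
        mean_a = statistics.fmean([errors_a[i] for i in idx])
        mean_b = statistics.fmean([errors_b[i] for i in idx])
        reductions.append(100.0 * (mean_a - mean_b) / mean_a)
    reductions.sort()
    lower = reductions[int((alpha / 2) * n_resamples)]
    upper = reductions[int((1 - alpha / 2) * n_resamples) - 1]
    full_a = statistics.fmean(errors_a)
    full_b = statistics.fmean(errors_b)
    point = 100.0 * (full_a - full_b) / full_a
    return point, lower, upper


# Synthetic stand-in errors (hypothetical, NOT results from the manuscript).
demo_rng = random.Random(42)
baseline_errors = [demo_rng.uniform(1.0, 5.0) for _ in range(1500)]
proposed_errors = [0.8 * e + demo_rng.gauss(0.0, 0.1) for e in baseline_errors]

point, lower, upper = paired_bootstrap_ci(baseline_errors, proposed_errors)
print(f"error reduction = {point:.1f}% (95% CI {lower:.1f}% to {upper:.1f}%)")
```

To cover several random seeds, one option would be to repeat this per seed and report the spread of the resulting intervals. Note that windows cut from the same trajectory are correlated, so resampling whole trajectories rather than individual windows may be the more defensible choice.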

Author Response

Dear Reviewer,

Given that the revisions involve a substantial amount of textual and graphical content, we have not used the journal’s standard revision template. However, we have clearly indicated all specific changes and their locations in the attached document, and have provided careful, point-by-point responses to each of your comments. We respectfully ask for your kind consideration.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Major Comments:

1 - 3.1 Construction of the Trajectory Dataset (lines 371-381):
Additional discussion of how these parameters were chosen or optimized would be beneficial: whether they were empirically selected, analytically derived, or tuned through experimentation, and how sensitive the trajectories are to these choices.

2 - Long-Horizon Prediction Robustness under Maneuvers
The proposed CNN-Informer framework is designed for long-horizon trajectory prediction in constrained environments. However, trajectory-prediction uncertainty typically accumulates significantly over longer horizons, especially during abrupt maneuvering phases. It would be valuable to discuss how the model behaves under highly dynamic or maneuvering conditions (see suggestion section).

3 - Prediction Error Magnitude and Interpretation (Figure 10 and Table 1)
The reported trajectory-prediction errors remain on the order of several kilometers for long-horizon prediction tasks. It would be helpful to provide additional operational interpretation regarding whether these error magnitudes are acceptable for the intended application, and how prediction uncertainty would affect planning or collision-avoidance systems.

4 - Prediction Error Magnitude and Evaluation Metrics (Table 1)

In Table 1, a comparison of the Informer-based methods is shown. This evaluation would be strengthened by including at least one comparison with a classical non-machine-learning trajectory prediction or estimation method (KF, IMM, or other traditional maneuvering-target tracking frameworks) to provide additional context regarding the advantages and trade-offs of the proposed deep-learning architecture.

5 – Operational Integration and Airspace Context

The manuscript focuses on trajectory prediction performance. However, practical UAV deployment often requires integration with broader autonomous navigation and airspace-management systems. A brief discussion of how the framework could integrate with UTM services or airspace coordination architectures would improve the real-world applicability of the research.
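
To make Major Comments 2 and 4 more concrete, the following is a minimal sketch of how a non-learning reference predictor and a per-horizon error curve could be reported. It uses plain constant-velocity extrapolation, which is far simpler than the KF or IMM methods named above and serves only as a stand-in. The function names and the synthetic turning track are assumptions for illustration.

```python
import math


def constant_velocity_forecast(history, horizon, dt=1.0):
    """Extrapolate the last observed velocity for `horizon` steps.

    history: list of (x, y, z) positions sampled every `dt` seconds.
    Returns a list of `horizon` predicted (x, y, z) positions.
    """
    if len(history) < 2:
        raise ValueError("need at least two observations")
    (x0, y0, z0), (x1, y1, z1) = history[-2], history[-1]
    # Finite-difference velocity from the two most recent samples.
    vx, vy, vz = (x1 - x0) / dt, (y1 - y0) / dt, (z1 - z0) / dt
    return [
        (x1 + vx * k * dt, y1 + vy * k * dt, z1 + vz * k * dt)
        for k in range(1, horizon + 1)
    ]


def error_by_horizon(predicted, truth):
    """Euclidean position error at each prediction step."""
    return [math.dist(p, t) for p, t in zip(predicted, truth)]


# Synthetic track: a steady turn of radius 1000 at 0.01 rad/s
# (hypothetical data in arbitrary length units, sampled at 1 Hz).
omega, radius = 0.01, 1000.0
track = [
    (radius * math.sin(omega * t), radius * (1.0 - math.cos(omega * t)), 0.0)
    for t in range(30)
]
history, future = track[:10], track[10:]

errors = error_by_horizon(constant_velocity_forecast(history, len(future)), future)
for step in (1, 5, 10, 20):
    print(f"horizon {step:2d} s: error {errors[step - 1]:.2f}")
```

Reporting the learned model's error on the same horizon axis would show where, and by how much, it improves on such a reference during maneuvers.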

Minor Comments and Suggestions

1 – Code Availability and Reproducibility:

It is not clear whether the implementation code or datasets used for evaluation are publicly available. Providing access to the source code, trained models, or detailed implementation resources would significantly improve reproducibility and facilitate adoption by the research community.

2 – Related Work (lines 85–97):
In the discussion regarding model-based estimation approaches, the authors may consider including additional references related to the maneuvering target tracking problem, particularly works involving adaptive and multiple-model estimation frameworks under uncertain or time-varying dynamics. Suggested references include:

X.R. Li and V.P. Jilkov, “Survey of Maneuvering Target Tracking. Part V: Multiple-Model Methods,” IEEE Transactions on Aerospace and Electronic Systems, 41(4):1255–1321, 2005.

Canolla, A., Jamoom, M. B., and Pervan, B., “Interactive Multiple Model Sensor Analysis for Unmanned Aircraft Systems (UAS) Detect and Avoid (DAA),” 2018 IEEE/ION Position, Location and Navigation Symposium (PLANS), Monterey, CA, USA, 2018, pp. 757–766.

Bar-Shalom, Y., Li, X. R., and Kirubarajan, T., Estimation with Applications to Tracking and Navigation: Theory, Algorithms and Software, Wiley-Interscience, 2001.

These references may provide additional context regarding adaptive estimation and multiple-model tracking strategies for dynamic target motion and uncertain operating conditions.

3 - Model Training Hyperparameters (Section 4.2):

The manuscript describes the architecture and training configuration in detail, including the number of attention layers, hidden dimensions, learning rate scheduling, and early-stopping strategy. However, it would be beneficial to include a discussion on how these hyperparameters were selected or evaluated.

Author Response

Dear Reviewer,

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The intention-aware CNN-Informer platform proposed by the authors in the reviewed article addresses the challenge of achieving high accuracy in predicting the flight trajectories of unmanned aerial vehicles in complex environments with limited manoeuvring freedom. The authors describe a solution in which, by combining physically interpretable control parameters with continuous intention features describing the avoidance of no-fly zones and goal-oriented movement, the model achieves results that surpass existing methods. The paper presents the results of experiments conducted on a dataset concerning unmanned aerial vehicles from various fields. The authors demonstrate that this approach reduces the mean prediction error by 17.2% compared to the baseline Informer model and significantly improves the final and maximum prediction accuracy.
The fundamental innovation of this framework lies in a three-part strategy: reformulating vehicle dynamics in a control-consistent manner, which simplifies the mapping of hidden control effects; constructing three-dimensional continuous intention features (tangential distance from no-fly zones, course error angle, and relative approach speed); and adopting a hybrid network architecture. This architecture utilises a CNN front-end to extract local manoeuvre patterns and an Informer encoder to capture long-term temporal dependencies. By integrating these components, the model effectively reproduces the system’s one-step conditional sufficiency, providing a more reliable predictive solution than general time series models.
These results demonstrate that incorporating vehicle dynamics and mission-related intentions into deep sequential models can significantly enhance reliability under conditions of partial observability. Although the study currently relies on offline-optimised simulation data and prior knowledge of constraints, it establishes a methodology that can be transferred to behaviour-aware forecasting and autonomous decision-making across a wider range of drone operations. Future research will aim to extend this framework towards probabilistic inference of intentions and uncertainty-aware forecasting in real-world scenarios.
Although the proposed CNN-Informer framework demonstrates a robust integration of physical dynamics and mission intentions, in my view several key areas require further development to enhance its practical utility. It seems to me that the main issue is the model’s heavy reliance on prior knowledge, specifically on predefined no-fly zone locations and potential destinations. It should be noted that in real-world operational environments, such information is often incomplete, uncertain or subject to sudden changes. Consequently, the authors should address the question of how the system would adapt to unknown obstacles or dynamic mission changes. Furthermore, as the dataset is derived entirely from offline-optimised simulation trajectories, it appears to lack the stochastic complexity characteristic of real-world flight, such as wind disturbances and sensor noise. Validation based on empirical flight data is necessary to confirm the reliability of the framework under practical conditions.
A significant limitation of the current framework is the generation of strictly deterministic forecasts. For safety-critical autonomous systems, the lack of uncertainty quantification or confidence intervals hinders effective risk assessment and decision-making. Furthermore, although the self-monitoring mechanism was designed with performance in mind, the reviewed article, in my opinion, lacks a more detailed analysis of inference latency and computational load on the embedded hardware. Such data is essential to confirm the model’s feasibility for real-time navigation and autonomous operations.
In summary, the article is interesting and brings new insights to the field of research; nevertheless, in my opinion, it requires minor revisions prior to publication.

Author Response

Dear Reviewer,

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The manuscript presents a well-organized study on long-horizon trajectory prediction for UAVs operating in constrained environments. The proposed intent-aware CNN-Informer framework effectively combines physically interpretable DBL control parameters with continuous intent-aware features, and the experimental results demonstrate clear performance improvements over several representative baseline models. The paper is generally well written, and the contribution is meaningful for the UAV trajectory prediction community. Therefore, I recommend acceptance after minor revisions.

Comment 1

A short paragraph summarizing the unique novelty compared with recent Transformer-based UAV prediction studies would strengthen the contribution.

Comment 2

Although the simulation setup is carefully designed, the manuscript would benefit from additional discussion regarding the applicability of the proposed method to real-world UAV datasets or noisy sensing environments. The authors may briefly discuss possible transferability and domain adaptation considerations.

Comment 3

It would be helpful to provide additional information about the computational cost, training time, or inference latency of the proposed model, which integrates CNN and Informer modules.
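
One possible way to obtain the latency figure requested here is sketched below: time repeated calls to the predictor after a warm-up phase and report the mean, median, and 95th percentile in milliseconds. The helper `measure_latency_ms` and the stand-in `dummy_predict` are hypothetical; the actual CNN-Informer forward pass would replace the stand-in.

```python
import statistics
import time


def measure_latency_ms(predict, sample, warmup=10, repeats=100):
    """Time `predict(sample)` and summarise the latency in milliseconds."""
    # Warm-up calls are discarded so one-off setup costs do not skew the result.
    for _ in range(warmup):
        predict(sample)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return {
        "mean_ms": statistics.fmean(timings),
        "median_ms": statistics.median(timings),
        "p95_ms": timings[int(0.95 * (len(timings) - 1))],
    }


def dummy_predict(window):
    """Stand-in for a model forward pass (assumption): sums each time step."""
    return [sum(step) for step in window]


# Hypothetical input window: 200 time steps with 8 features each.
window = [[float(t + f) for f in range(8)] for t in range(200)]
stats = measure_latency_ms(dummy_predict, window)
print({name: round(value, 4) for name, value in stats.items()})
```

The numbers only mean something on the target onboard hardware. If the model runs on a GPU, the device would also need to be synchronised before each clock reading.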

Comment 4

A clearer percentage improvement table or discussion would improve readability.

Comment 5

There are several minor notation and formatting inconsistencies throughout the manuscript. For example, some symbols appear with inconsistent spacing or formatting in equations and surrounding text. A careful proofreading of mathematical notation is recommended.

Comment 6

Several sentences could be improved for readability and conciseness. For example:

Page 8: “Therefore, when predicting the flight trajectory, Therefore, trajectory prediction should jointly consider...” contains a duplicated expression.
Some long paragraphs in Sections 3.2 and 3.3 may benefit from sentence simplification.

Comment 7

Figures 6–8 are informative, but some labels and annotations appear relatively small and difficult to read. Increasing font size and improving visual contrast would improve presentation quality.

Author Response

Dear Reviewer,

Author Response File: Author Response.pdf

Reviewer 5 Report

Comments and Suggestions for Authors

The extraction of intent features currently depends on precise prior knowledge of candidate destinations and NFZ locations. Please add a sensitivity analysis evaluating how the prediction accuracy degrades if NFZ boundaries or intended destinations are uncertain or noisy.
The trajectory dataset was generated offline using sequential convex optimization. Since practical UAV tracking involves sensor noise, please test the model's robustness by injecting varying levels of Gaussian noise into the historical observation sequence during the testing phase.
The model currently outputs deterministic trajectory predictions without quantifying predictive uncertainty, which is a noted limitation. Please add a brief discussion in the conclusion on how the framework could be extended to output probabilistic distributions to provide confidence intervals for long-horizon forecasts.
Executing a hybrid CNN-Informer network on a resource-constrained UAV onboard computer poses computational challenges. Please provide a quantitative metric (e.g., average inference time in milliseconds) to validate that the model can be executed fast enough to support real-time situational awareness and guidance.
To enrich the related work part, please consider the following work:
- "Deep Reinforcement Learning for Resource Sharing and UAV Trajectory Optimization in Multi-Operator UAV-Assisted Wireless Networks," in IEEE Transactions on Vehicular Technology
- "AV-DTEC: Self-Supervised Audio–Visual Fusion for Drone 3-D Trajectory Estimation and Classification," in IEEE Sensors Journal
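
The noise-injection test requested above could be organised along the following lines: perturb each historical observation window with zero-mean Gaussian noise at several standard deviations and record the mean absolute error per level. This is only a sketch under stated assumptions. `robustness_sweep`, the scalar-output `linear_extrapolate` stand-in, and the synthetic windows are hypothetical and do not reproduce the manuscript's model or data.

```python
import random


def add_gaussian_noise(window, sigma, rng):
    """Return a copy of `window` with N(0, sigma^2) noise added to every value."""
    return [[value + rng.gauss(0.0, sigma) for value in step] for step in window]


def robustness_sweep(predict, windows, targets, sigmas, seed=0):
    """Mean absolute error of `predict` on noisy inputs, one entry per sigma."""
    rng = random.Random(seed)
    results = {}
    for sigma in sigmas:
        abs_errors = []
        for window, target in zip(windows, targets):
            noisy = add_gaussian_noise(window, sigma, rng)
            abs_errors.append(abs(predict(noisy) - target))
        results[sigma] = sum(abs_errors) / len(abs_errors)
    return results


def linear_extrapolate(window):
    """Stand-in predictor (assumption): one-step linear extrapolation."""
    return 2.0 * window[-1][0] - window[-2][0]


# Synthetic windows of a linearly growing quantity (hypothetical data).
windows = [[[float(start + 3 * t)] for t in range(10)] for start in range(50)]
targets = [float(start + 30) for start in range(50)]

curve = robustness_sweep(linear_extrapolate, windows, targets, [0.0, 1.0, 10.0])
for sigma, mae in curve.items():
    print(f"sigma = {sigma:5.1f}: MAE = {mae:.3f}")
```

The same loop structure would apply to the prior-knowledge sensitivity analysis, with the noise added to the NFZ or destination coordinates before the intent features are computed rather than to the observations.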

Author Response

Dear Reviewer,

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Reviewer 2 Report

Comments and Suggestions for Authors

Thank you for your thorough revisions and detailed responses to the reviewer comments. The manuscript has been improved, and the concerns raised during the review process have been adequately addressed.

I have no further comments.

Best regards,