Article
Peer-Review Record

BEHAVE-UAV: A Behaviour-Aware Synthetic Data Pipeline for Wildlife Detection from UAV Imagery

by Larisa Taskina *, Kirill Vorobyev, Leonid Abakumov and Timofey Kazarkin
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4:
Reviewer 5: Anonymous
Submission received: 6 November 2025 / Revised: 11 December 2025 / Accepted: 16 December 2025 / Published: 4 January 2026
(This article belongs to the Section Drones in Ecology)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

- The authors need to delete sentences of "This is an optional section ...... to 2 bullet points." among line 8 and line 12.

- The authors should ensure that names of components among lines 180 and 182 are consistent with the corresponding names of components in titles of 3.1, 3.2, and 3.3.

- The authors need to add an overall title for title of Figure 1 in line 228.

- The authors need to adjust the format for authors and affiliations of authors among line 4 and line 6 according to format requirements of journal Drones.

- The authors should whether sentence of "Contributions. Our work makes the following contributions: packet:" in line 85, and sentence of "We ask whether behavior-aware synthetic data improve UAV wildlife detection at high altitudes" in line 252 are all suitable.

- The authors should better reorganize structure of the manuscript by referring to structure of MSMT-RTDETR.

  • The authors should better merge contents of 2. Related works into section of 1. Introduction, since some of contents such as contributions of manuscript among line 171 and line 175 in section of 2. Related works may overlap with corresponding contents in section of 1. Introduction.
  • The authors should substitute title of 2. Materials and Methods for title of Synthetic Dataset Generation Methodology. In addition, the author should better give a name such as XXX for proposed synthetic dataset generation methodology. Then, the authors should better add new title of 2.1. XXX below title of 2. Materials and Methods. Besides, the authors should better add new title of 2.1.1. Pipeline of XXX below title of 2.1. XXX. Furthermore, the authors should substitute 3.1 Environment and domain randomization for 2.1.2. Environment and Domain Randomization, 3.2 Biological Agent Class and Behaviors for 2.1.3. Biological Agent Class and Behaviors, and 3.3 UAV platform and Automatic Annotation for 2.1.4. UAV Platform and Automatic Annotation.
  • The authors should delete title of Experiments in line 250, title of 4.2 Training regimes in line 265, and substitute 2.1.1. Training Settings for 4.1 Experimental Setup. Then, the authors need to add a new title of 2.2. Experimental Settings above title of 2.1.1. Training Settings, a new title of 2.1.2. Evaluation Metrics above title of 5. Results in line 278.
  • The authors should substitute title of Experimental Results and Analysis for title of 5. Results, title of 3.1. Comparison of Detection Performance for title of 5.1 Main Detection Performance, title of 3.2. Scale-Aware Behavior and Distribution Shift for title of 5.2 Scale-Aware Behavior and Distribution Shift, title of 3.3. Visualization for title of 5.3 Analysis on Real UAV images, and title of 4. Discussion for title of 6. Discussion.

- The authors need to append a new Figure for Pipeline of XXX such as Figure 2 of MSMT-RTDETR in section of 2.1.1. Pipeline of XXX. Please add a subfigure (A) of Overall pipeline, three subfigures of (B), (C) and (D) of sub-pipelines for three components of XXX, and as many images as possible to the new Figure.

- The authors need to move contents of "Metrics follow the Ultralytics protocol: mAP@[0.5:0.9] (average AP over IoU thresholds 0.50–0.90), along with precision and recall aggregated over all test images." among line 285 and line 286, contents of "Following the COCO ...... the real test set." among line 352 and line 360, and any other contents related to evaluation metrics if applicable, into section of 2.1.2. Evaluation Metrics.

- The authors should mention in section of 1. Introduction that there are two categories of object detection approaches current available which are CNN based detection approaches and Transformer based detection approaches by referring to corresponding contents in section of 1. Introduction of MSMT-RTDETR. Besides, the authors need to add an analysis of reasons or motivations for using YOLOv8s in section of 1. Introduction.

- The authors need better to adopt at least two YOLO approaches such as YOLOv8s and YOLO11n and two Transformer approaches such as DETR and RT-DETR in both sections of 3.1. Comparison of Detection Performance and 3.3. Visualization. Besides, the authors should better supplement confusion matrices into section of 3.1. Comparison of Detection Performance, and heatmaps into section of 3.3. Visualization.

- The authors should further check whether they should move both Figure 1 and Figure 2 into section of 3. Experimental Results and Analysis.

- The authors may need to replace the title of the manuscript by a new title similar as "XXX: A ...... pipeline for ......".

- The authors need to replace both YOLOv8s and YOLOv8-s by YOLOv8s in the manuscript.

Comments on the Quality of English Language

The English of the manuscript could be further polished.

Author Response

We sincerely thank the reviewer for the detailed and constructive comments. We have carefully revised the manuscript accordingly. Below we address each point in turn.

Comment 1: “The authors need to delete sentences of ‘This is an optional section ...... to 2 bullet points.’ among line 8 and line 12.”

Response 1: We agree. These template sentences have been removed from the front matter. The manuscript now begins directly with the article title, author list, affiliations, and the “Highlights” section, in accordance with the Drones template. All such placeholder text is deleted and the changes are marked in red in the revised file.

 

Comment 2: “The authors should ensure that names of components among lines 180 and 182 are consistent with the corresponding names of components in titles of 3.1, 3.2, and 3.3.”

Response 2: Thank you for pointing this out. We have standardised the naming of the three main components across the text and figures. Throughout Section 2.1 and Figure 1 we now consistently use:
“Environment and domain randomization”, “Biological agents and behaviors”, and “UAV platform and automatic annotation”. These names are used consistently in the Methods and referenced in the Results section (Sections 3.1–3.4). The corresponding titles and figure labels have been harmonised.

 

Comment 3: “The authors need to add an overall title for title of Figure 1 in line 228.”

Response 3: We agree. Figure 1 now has an explicit overall title: Figure 1. Overview of the BEHAVE-UAV behavior-aware synthetic data pipeline. The subpanels (a–d) are described under this unified caption, as shown in Section 2.1 (Figure 1).

 

Comment 4: “The authors need to adjust the format for authors and affiliations of authors among line 4 and line 6 according to format requirements of journal Drones.”

Response 4: We have reformatted the author and affiliation block to follow the Drones style. Affiliations are now numbered and listed as: 

1 Samara National Research University, 34 Moskovskoye shosse, 443086 Samara, Russia; …

with a single “*Correspondence: …” line and without template placeholders. The layout now matches the MDPI guidelines.

 

Comment 5: “The authors should whether sentence of ‘Contributions. Our work makes the following contributions: packet:’ in line 85, and sentence of ‘We ask whether behavior-aware synthetic data improve UAV wildlife detection at high altitudes’ in line 252 are all suitable.”

Response 5: We appreciate this remark and agree that the original phrasing was awkward and informal. We have removed the stray word “packet” from the contribution heading and fully rephrased the contribution block. At the end of the Introduction, we now use the neutral heading:

- “Within this scope, our work makes the following contributions:”, followed by a concise four-point contribution list (Section 1.3).

The informal sentence “We ask whether behavior-aware synthetic data improve UAV wildlife detection at high altitudes” has been removed. Instead, we explicitly state three research questions RQ1–RQ3:

RQ1: To what extent can behavior-aware synthetic data, generated at high UAV altitudes, support wildlife detection compared with training solely on real imagery?
RQ2: …
RQ3: …

These revisions clarify the study aims and separate research questions from the contribution list.

 

Comment 6: “The authors should better reorganize structure of the manuscript by referring to structure of MSMT-RTDETR.”

Response 6: We have substantially reorganised the manuscript structure inspired by the MSMT-RTDETR layout, while staying consistent with the Drones template:

- The methods are now grouped under Section 2. Materials and Methods.

- The synthetic pipeline is presented as Section 2.1. BEHAVE-UAV synthetic pipeline with subsections 2.1.1–2.1.5.

- The real and synthetic datasets, baseline, metrics, and training regimes are covered in Sections 2.2–2.6.

- Results, Discussion, and Conclusions now follow as Sections 3, 4, and 5, respectively.

This reorganisation makes the flow closer to MSMT-RTDETR and improves readability.

 

Comment 7: “The authors should better merge contents of 2. Related works into section of 1. Introduction, since some of contents such as contributions of manuscript among line 171 and line 175 in section of 2. Related works may overlap with corresponding contents in section of 1. Introduction.”

Response 7: We agree that the previous version had undesirable overlap between the Introduction and Related Work, especially where contributions were mentioned twice.

In the revised manuscript:

- The contributions now appear only once, at the end of Section 1.3 (“Research questions and contributions”).

- The research gaps are summarised in Section 1.2, without re-listing the contributions.

- Section 2. Related Work has been tightened to focus purely on prior literature, without restating our contributions.

We retain a separate Related Work section (Section 2) to avoid an excessively long Introduction and to align with the structure of many Drones articles, but the previous duplication has been removed.

 

Comment 8:
“The authors should substitute title of 2. Materials and Methods for title of Synthetic Dataset Generation Methodology. In addition, the author should better give a name such as XXX for proposed synthetic dataset generation methodology. Then, the authors should better add new title of 2.1. XXX below title of 2. Materials and Methods. Besides, the authors should better add new title of 2.1.1. Pipeline of XXX below title of 2.1. XXX. Furthermore, the authors should substitute 3.1 Environment and domain randomization for 2.1.2. Environment and Domain Randomization, 3.2 Biological Agent Class and Behaviors for 2.1.3. Biological Agent Class and Behaviors, and 3.3 UAV platform and Automatic Annotation for 2.1.4. UAV Platform and Automatic Annotation.”

Response 8:
We appreciate this detailed structural suggestion and have followed it closely while adopting a concise naming scheme:

The former “Synthetic Dataset Generation Methodology” is now Section 2. Materials and Methods.

We introduce and name our pipeline “BEHAVE-UAV” and add:

2.1. BEHAVE-UAV synthetic pipeline
2.1.1. Overall pipeline
2.1.2. Environment and domain randomization
2.1.3. Biological agents and behaviors
2.1.4. UAV platform and automatic annotation
2.1.5. Quality-assurance filtering

These subsections correspond to the components mentioned in your comment and bring the structure much closer to the requested form.

 

Comment 9:
“The authors should delete title of Experiments in line 250, title of 4.2 Training regimes in line 265, and substitute 2.1.1. Training Settings for 4.1 Experimental Setup. Then, the authors need to add a new title of 2.2. Experimental Settings above title of 2.1.1. Training Settings, a new title of 2.1.2. Evaluation Metrics above title of 5. Results in line 278.”

Response 9:
We have reorganised the experimental description as follows:

The separate “Experiments” heading has been removed.

We now provide:

2.4. COCO-pretrained baseline
2.5. Evaluation metrics
2.6. Training regimes and experimental design

Training details (previously under “4.1 Experimental Setup” and “4.2 Training regimes”) are now grouped in Section 2.6.

All evaluation metrics, including the scale-aware AP definitions, are now clearly defined in Section 2.5 (see also Response 12).

Thus, experimental settings and metrics are fully integrated into the Methods section.

 

Comment 10:
“The authors should substitute title of Experimental Results and Analysis for title of 5. Results, title of 3.1. Comparison of Detection Performance for title of 5.1 Main Detection Performance, title of 3.2. Scale-Aware Behavior and Distribution Shift for title of 5.2 Scale-Aware Behavior and Distribution Shift, title of 3.3. Visualization for title of 5.3 Analysis on Real UAV images, and title of 4. Discussion for title of 6. Discussion.”

Response 10:
We have adopted the spirit of this suggestion while simplifying the wording and numbering in line with the Drones format:

The former “Results” section has been reorganised as Section 3. Results, with subsections:

3.1. Baseline detection performance
3.2. Effect of real-data fraction after synthetic pre-training
3.3. Scale-aware behavior and distribution shift
3.4. Qualitative analysis on real UAV images

The discussion of qualitative visualisations now appears under 3.4. Qualitative analysis on real UAV images, separating it from the quantitative metrics in 3.1–3.3.

The Discussion remains a dedicated Section 4. Discussion, and Conclusions are in Section 5.

This maintains a clear progression (Methods - Results - Discussion - Conclusions) while making each subsection title more descriptive of its content.

 

Comment 11:
“The authors need to append a new Figure for Pipeline of XXX such as Figure 2 of MSMT-RTDETR in section of 2.1.1. Pipeline of XXX. Please add a subfigure (A) of Overall pipeline, three subfigures of (B), (C) and (D) of sub-pipelines for three components of XXX, and as many images as possible to the new Figure.”

Response 11:
We fully agree that a pipeline figure is valuable. We have added a new multi-panel Figure 1 in Section 2.1 that summarises the BEHAVE-UAV pipeline:

Panel (a) shows the overall data flow from procedurally generated environments and behavioural agents through UAV flights to QA-filtered synthetic data.

Panels (b), (c), and (d) detail the three subsystems: environment and domain randomization, behavioral deer agents, and the UAV platform with automatic annotation export.

This closely follows the reviewer’s suggestion and greatly clarifies the methodology.

 

Comment 12:
“The authors need to move contents of ‘Metrics follow the Ultralytics protocol: mAP@[0.5:0.9] (average AP over IoU thresholds 0.50–0.90), along with precision and recall aggregated over all test images.’ among line 285 and line 286, contents of ‘Following the COCO ...... the real test set.’ among line 352 and line 360, and any other contents related to evaluation metrics if applicable, into section of 2.1.2. Evaluation Metrics.”

Response 12:
We agree that all evaluation metrics should be defined in one place. In the revised manuscript:

- We have created Section 2.5. Evaluation metrics, where we clearly define precision, recall, AP@0.5, mAP, and the scale-aware APsmall, APmedium and APlarge.

- The previously scattered metric descriptions from the Results section have been moved into Section 2.5 and removed from their original locations.

- The Results section now only interprets these metrics and refers back to Section 2.5 for definitions.

- This addresses the reviewer’s concern about redundancy and clarity.
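For concreteness, the per-image matching logic underlying the precision and recall definitions in Section 2.5 can be sketched as follows. This is an illustrative simplification, not the Ultralytics implementation; the function names and the greedy confidence-ordered matching rule are our own:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision_recall(preds, gts, iou_thr=0.5):
    """preds: list of (confidence, box); gts: list of boxes.
    Greedy matching in descending confidence order;
    each ground-truth box can be matched at most once."""
    matched, tp = set(), 0
    for conf, box in sorted(preds, key=lambda p: -p[0]):
        best, best_iou = None, iou_thr
        for j, gt in enumerate(gts):
            if j not in matched and iou(box, gt) >= best_iou:
                best, best_iou = j, iou(box, gt)
        if best is not None:
            matched.add(best)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```

Averaging precision over recall levels and over the IoU thresholds 0.50–0.95 then yields AP@0.5 and mAP, as defined in Section 2.5.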

 

Comment 13:
“The authors should mention in section of 1. Introduction that there are two categories of object detection approaches current available which are CNN based detection approaches and Transformer based detection approaches by referring to corresponding contents in section of 1. Introduction of MSMT-RTDETR. Besides, the authors need to add an analysis of reasons or motivations for using YOLOv8s in section of 1. Introduction.”

Response 13:
We appreciate this suggestion and have incorporated it into the Introduction:

We now explicitly describe the two main families of detectors used in UAV imagery:

“Modern object detectors for UAV imagery broadly fall into two families: (i) CNN-based one- and two-stage architectures, such as Faster R-CNN and YOLO variants; and (ii) more recent transformer-based detectors that leverage self-attention for global context.”

We then motivate our focus on YOLOv8s:

“In this work we focus on a widely adopted CNN-based one-stage detector, YOLOv8s, which offers a good compromise between accuracy and computational cost on resource-constrained UAV platforms and is available with open-source implementation and MS COCO-pretrained weights.”

These additions (Section 1.1–1.2) provide the requested architectural context and justification.

 

Comment 14:
“The authors need better to adopt at least two YOLO approaches such as YOLOv8s and YOLO11n and two Transformer approaches such as DETR and RT-DETR in both sections of 3.1. Comparison of Detection Performance and 3.3. Visualization. Besides, the authors should better supplement confusion matrices into section of 3.1. Comparison of Detection Performance, and heatmaps into section of 3.3. Visualization.”

Response 14:
We thank the reviewer for this recommendation to broaden the experimental scope.

We have extended our experiments to include two convolutional YOLO architectures (YOLOv8s and YOLO11n) and a transformer-based detector (RT-DETR-L). Their performance on the Rucervus test set is summarised in the new Table 3, which reports mAP@[0.5:0.95], precision and recall for real-only and synthetic-pretrained regimes at 1280 × 1280 px.

To better illustrate detector behaviour, we have added Grad-CAM heatmaps for real UAV imagery in Figure 6, comparing real-only and synthetic-pretrained models (YOLOv8s and YOLO11n). This directly addresses the suggestion to include heatmaps in the visual analysis.

We also computed confusion matrices for the main regimes to verify that the improvements are driven primarily by reduced false negatives on small and partially occluded deer. To avoid overloading the main text with large tables, we summarise these observations qualitatively in Sections 3.1 and 3.4 (higher recall on small and occluded animals) and would be happy to provide full confusion matrices as supplementary material if requested by the editor.

Due to computational constraints and the limited scope of a single manuscript, we did not retain a full second transformer architecture (DETR) in the final comparison plots, but the added YOLO11n and RT-DETR-L already demonstrate that our conclusions are consistent across convolutional and transformer-based detectors.

 

Comment 15:
“The authors should further check whether they should move both Figure 1 and Figure 2 into section of 3. Experimental Results and Analysis.”

Response 15:
We carefully considered the placement of the figures:

The new Figure 1 is a methodological pipeline figure and is therefore kept in Section 2.1 (Materials and Methods), where it supports the description of BEHAVE-UAV.

The synthetic imagery and label examples are now Figure 3 in Section 2.2, illustrating the synthetic dataset, which also belongs naturally to the Methods.

Figures illustrating quantitative results and qualitative detections on real UAV imagery (Figures 4–6) are placed in Section 3. Results. We believe this distribution maintains a clear distinction between methodology and results while addressing the spirit of the reviewer’s comment.

 

Comment 16:
“The authors may need to replace the title of the manuscript by a new title similar as ‘XXX: A …… pipeline for ……’.”

Response 16:
We agree that a pipeline-style title better reflects the main contribution. Following your advice, we have renamed the paper to:

“BEHAVE-UAV: A behavior-aware synthetic data pipeline for wildlife detection from UAV imagery”

This title introduces the pipeline name and follows the recommended “XXX: A … pipeline for …” structure.

 

Comment 17:
“The authors need to replace both YOLOv8s and YOLOv8-s by YOLOv8s in the manuscript.”

Response 17:
Thank you for noticing this inconsistency. We have standardised the notation to “YOLOv8s” throughout the manuscript, including the Abstract, Introduction, Methods, Results, figures, and tables. All occurrences of “YOLOv8-s” have been corrected.

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript provides a coherent and technically sophisticated analysis of the generation of synthetic data for object detection in ecological unmanned aerial vehicle (UAV) applications. Its structure is clear and consistent with the standards of scientific journals. The research objective is clearly defined, and the introduction effectively contextualises the work within the existing literature by highlighting the key limitations of previous approaches. The literature review is up to date, including a number of items from the last five years, thereby increasing the study's substantive value. The citations do not give the impression of excessive self-citation, and the selected source material is relevant to the topic. The experimental design is logical and consistent with the hypothesis regarding the effectiveness of pre-training on synthetic data. The detailed description of the UE5 environment, animal agents, and annotation generation ensures high research replicability, a significant strength of the study. The selection of metrics (mAP@[0.5:0.95], precision and recall) is consistent with object detection research practices, allowing comparability with other systems. The results are comprehensive, consistently interpreted, and linked to distribution shift analysis, demonstrating the authors' awareness of the limitations of synthetic datasets. The analysis of object scale is particularly valuable, as it accurately identifies the mismatch in object size between the synthetic and real domains. The visual elements (i.e. synthetic images, segmentations and prediction comparisons) are clear and directly support the experimental conclusions. The tables present the results concisely and correctly, although their interpretation could delve deeper into the context of model errors. The final conclusions are consistent with the evidence presented and do not exceed the scope of the obtained data. 
The discussion section is well-considered and suggests avenues for future development, such as extending species classes and adding thermal modality. The ethical and data availability statements are comprehensive, and the synthetic dataset repository is presented in such a way that the research can be replicated. However, the manuscript could be improved in a few places in terms of conciseness and linguistic precision. In particular, some passages repeat theses from the abstract or discussion, which slightly distorts the proportions of the section. The overall linguistic quality is satisfactory, although the text occasionally loses clarity due to overly long and complex sentences. This work makes a significant contribution to synthetic data generation for ecological UAV applications. The results are particularly relevant to practitioners in terms of reducing annotation costs and optimising the training of detection models. After minor editorial revisions, the article can be considered valuable and suitable for publication.

Specific comments for improvement to enhance the scientific quality of the article:

Lines 85–88: the term ‘packet’ in the list heading is ambiguous and stylistically inappropriate — it should be removed or replaced with an academic term.

Lines 128–132: the description of ‘behaviourally parameterised animal agents’ is too brief in relation to the subsequent section 3.2 — it is worth specifying which parameters are key.

Lines 197–203: the section on agent properties should include a bibliographic reference to crowdsimulation methods or motion models, e.g. Reynolds (1987).

Lines 241–246: the QA filter is only mentioned briefly — it would be advisable to specify the threshold values and the reasons for their selection.

Lines 337–341: the description of mAP changes for individual percentages of the real set could be presented more concisely, e.g. as a sentence summarising the trend.

Lines 431–433: the discussion on scale mismatch should indicate whether scale normalisation mechanisms are planned in the pipeline — currently, the information is incomplete.

Author Response

We sincerely thank the reviewer for the detailed and constructive comments. We have carefully revised the manuscript accordingly. Below we address each point in turn.

 

Comment 1:
“However, the manuscript could be improved in a few places in terms of conciseness and linguistic precision. In particular, some passages repeat theses from the abstract or discussion, which slightly distorts the proportions of the section. The overall linguistic quality is satisfactory, although the text occasionally loses clarity due to overly long and complex sentences.”

Response 1:
We thank the reviewer for this careful assessment. In the revised manuscript we have (1) rewritten the Abstract to be more focused and less numerically overloaded, (2) shortened several long sentences in the Introduction and Results, and (3) removed repeated formulations that duplicated the Abstract or Discussion (particularly around data scarcity and the main findings). We will perform a final language pass at the proof stage to further polish style and clarity.

 

Comment 2:
“Lines 85–88: the term ‘packet’ in the list heading is ambiguous and stylistically inappropriate — it should be removed or replaced with an academic term.”

Response 2:
We agree and have removed the word “packet” from the contribution heading and rewritten the lead-in in a standard academic form.
Revised text:
“Within this scope, our work makes the following contributions:”, followed by a concise four-point bullet list (Section 1.3).

 

Comment 3:
“Lines 128–132: the description of ‘behaviourally parameterised animal agents’ is too brief in relation to the subsequent section 3.2 — it is worth specifying which parameters are key.”

Response 3:
We appreciate this suggestion. In the revised Methods we now explicitly enumerate the main parameter groups for each animal agent in Section 2.1.3 “Biological agents and behaviors”.

Revised text:

Each animal agent is represented by a template that exposes tunable properties in four groups: Appearance (mesh and texture variants), Kinematics (base speed, perception radius, maximum acceleration), Social behavior (interaction rules approximating herd dynamics), and High-level states (finite state machine with idle, walk, run and context-dependent transitions).

This makes clear which parameters control behaviour and how they relate to the later analysis.
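As an illustration of how such a template can be organised, the four parameter groups may be expressed as one configuration object with a small state machine on top. All field names, default values, and events below are hypothetical placeholders, not the actual BEHAVE-UAV parameters:

```python
from dataclasses import dataclass

@dataclass
class AgentTemplate:
    # Appearance: which mesh/texture variant to spawn
    mesh_variant: int = 0
    texture_variant: int = 0
    # Kinematics (placeholder values)
    base_speed: float = 1.5          # m/s
    perception_radius: float = 25.0  # m
    max_acceleration: float = 2.0    # m/s^2
    # Social behavior: weights of local herd-interaction rules
    cohesion_weight: float = 1.0
    separation_weight: float = 1.5
    alignment_weight: float = 0.8

# High-level states: a minimal finite state machine with
# context-dependent transitions (event names are illustrative).
TRANSITIONS = {
    ("idle", "herd_moves"): "walk",
    ("walk", "herd_stops"): "idle",
    ("idle", "disturbance"): "run",
    ("walk", "disturbance"): "run",
    ("run", "threat_gone"): "walk",
}

def next_state(state, event):
    """Remain in the current state if no transition is defined."""
    return TRANSITIONS.get((state, event), state)
```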

 

Comment 4:
“Lines 197–203: the section on agent properties should include a bibliographic reference to crowdsimulation methods or motion models, e.g. Reynolds (1987).”

Response 4:
We agree and have added an explicit link to classical work on flocking and crowd simulation in the Related Work section. Section 1.2.2 now cites Reynolds’ boids model and later social-force–type approaches as the conceptual basis for our group dynamics.

Revised text:

Classic work on flocking and crowd simulation showed that realistic collective motion can emerge from local interaction rules, exemplified by Reynolds’ distributed behavioral model and later social-force type models [18].

This is then connected to our parameterisation of herd behaviour in Section 2.1.3.
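To make the connection concrete, a minimal Reynolds-style update for 2-D agents might look like the sketch below. The rule weights and neighbourhood radius are toy values chosen for illustration, not the parameters used for our herds:

```python
def boids_step(positions, velocities, r=5.0,
               w_coh=0.01, w_sep=0.05, w_ali=0.05):
    """One update of Reynolds' three local rules for 2-D point agents."""
    new_vel = []
    for i, (p, v) in enumerate(zip(positions, velocities)):
        neigh = [j for j, q in enumerate(positions)
                 if j != i and (q[0] - p[0])**2 + (q[1] - p[1])**2 < r * r]
        if not neigh:
            new_vel.append(v)
            continue
        n = len(neigh)
        # Cohesion: steer toward the local centroid
        cx = sum(positions[j][0] for j in neigh) / n - p[0]
        cy = sum(positions[j][1] for j in neigh) / n - p[1]
        # Separation: steer away from close neighbours
        sx = sum(p[0] - positions[j][0] for j in neigh)
        sy = sum(p[1] - positions[j][1] for j in neigh)
        # Alignment: match the neighbours' mean velocity
        ax = sum(velocities[j][0] for j in neigh) / n - v[0]
        ay = sum(velocities[j][1] for j in neigh) / n - v[1]
        new_vel.append((v[0] + w_coh * cx + w_sep * sx + w_ali * ax,
                        v[1] + w_coh * cy + w_sep * sy + w_ali * ay))
    positions = [(p[0] + nv[0], p[1] + nv[1])
                 for p, nv in zip(positions, new_vel)]
    return positions, new_vel
```

In our pipeline the analogous rule weights are exposed per agent template, as described in Section 2.1.3.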

 

Comment 5:
“Lines 241–246: the QA filter is only mentioned briefly — it would be advisable to specify the threshold values and the reasons for their selection.”

Response 5:
We agree that the QA filter required a clearer definition. In the revised manuscript we introduce a dedicated subsection 2.1.5 “Quality-assurance filtering”, which explicitly states both the conditions and the numeric threshold used.

Revised text:

Concretely, we discard any frame that satisfies one of the following conditions: no valid animal instances; all bounding boxes degenerate; all visible animals heavily truncated at borders; or the median bounding-box area Ā falls below a small fraction of the image area, Ā < τ·W·H. In our implementation we set τ = 10⁻³, so frames in which all animals occupy less than 0.1% of the image area are discarded as unlikely to provide stable gradients for training.

We also state that these checks are applied as inexpensive post-processing on the exported labels and briefly motivate the choice of τ.
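The filter amounts to a few inexpensive checks on each frame's exported labels; the sketch below illustrates the logic, with the border-truncation test omitted for brevity (our exact implementation may differ in that detail):

```python
TAU = 1e-3  # median box area must exceed this fraction of the image area

def keep_frame(boxes, img_w, img_h, tau=TAU, min_side=2):
    """QA filter for one frame of (x1, y1, x2, y2) pixel boxes."""
    if not boxes:
        return False  # no valid animal instances
    valid = [b for b in boxes
             if b[2] - b[0] >= min_side and b[3] - b[1] >= min_side]
    if not valid:
        return False  # all bounding boxes degenerate
    areas = sorted((b[2] - b[0]) * (b[3] - b[1]) for b in valid)
    n = len(areas)
    median = (areas[n // 2] if n % 2
              else 0.5 * (areas[n // 2 - 1] + areas[n // 2]))
    # Median-area rule: discard frames whose animals are vanishingly small
    return median >= tau * img_w * img_h
```

With τ = 10⁻³, a frame from a 4000 × 3000 px sensor is kept only if the median box covers at least 12,000 px².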

 

Comment 6:
“Lines 337–341: the description of mAP changes for individual percentages of the real set could be presented more concisely, e.g. as a sentence summarising the trend.”

Response 6:
Thank you for this suggestion. We have rewritten the paragraph describing the fractional fine-tuning experiment to emphasise the trend rather than listing every configuration.

Revised text:

When only a small portion of the real training images is available (around 10–30%), performance remains modest (mAP < 0.35 and limited recall). The most substantial gains occur as the fraction increases to about one half of the real training set: at 50% real data mAP reaches 0.41 with precision 0.81 and recall 0.69. Beyond roughly 60–80% real data the curve flattens, and even using the full training set (100%) does not yield a noticeable further increase.

The following sentence then compares this configuration directly to the real-only baseline, summarising the main conclusion without repeating all intermediate numbers.

 

Comment 7:
“Lines 431–433: the discussion on scale mismatch should indicate whether scale normalisation mechanisms are planned in the pipeline — currently, the information is incomplete.”

Response 7:
We fully agree that the Discussion should indicate concrete directions for addressing scale mismatch. In the revised Discussion / Future Work we explicitly outline planned scale-normalisation strategies.

Revised text:

On the methodological side, we intend to explore more explicit scale-normalisation mechanisms—for example, tile-based training, adaptive cropping, and scale-mix augmentations—as well as ablations between static and behavior-driven scenes to better quantify the contribution of agent dynamics.

This explicitly connects the observed scale shift in Section 3.3 with our planned extensions of the pipeline.
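As one example of what tile-based training would involve, crop-window generation can be sketched as follows; the tile size and overlap are illustrative defaults, not settings we have already fixed:

```python
def tile_image(width, height, tile=640, overlap=0.2):
    """Crop windows (x1, y1, x2, y2) covering an image with overlap.
    Assumes the image is at least `tile` pixels in each dimension."""
    stride = int(tile * (1 - overlap))
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    if xs[-1] + tile < width:   # ensure the right edge is covered
        xs.append(width - tile)
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    if ys[-1] + tile < height:  # ensure the bottom edge is covered
        ys.append(height - tile)
    return [(x, y, x + tile, y + tile) for y in ys for x in xs]
```

Training (or inference) on such crops enlarges small animals relative to the input resolution; at inference time, per-tile detections would be mapped back to image coordinates and merged with non-maximum suppression.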

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript investigates a synthetic data pipeline for wildlife detection using UAV imaging based on Unreal Engine 5 and evaluates its effectiveness in combination with real data for training YOLOv8s models. Positive aspects include the high degree of automation and the demonstration that a substantial portion of the detection performance of real models can be reproduced. Nevertheless, the manuscript requires a major revision, as key methodological details are missing and both structure and content need improvement. Formal errors are also present. Please find my detailed comments below.

Abstract: The opening could be more contextual and less technical (especially the first 1–2 sentences).

Introduction: Please explicitly state the research questions (RQs) you aim to answer. You list the contributions twice (also in Section 2.4); I suggest stating contributions once and the RQs separately.

YOLOv8 should be explicitly referenced the first time the network is mentioned in the text.

Section 2.2 (Behaviour Modelling) is underdeveloped relative to the importance of the topic for the paper. Neither modelling approaches nor validation strategies are explained in detail, and there is a lack of references to literature on animal movement.

Section 2.3 (UAV-Centric Training Datasets) is abrupt and too superficial. Key datasets like DetReIDX and RCSD-UAV should be better contextualised so the reader understands their relevance.

Section 3.1 is too brief; concrete parameter values and examples for variation are missing, or at least should be referenced elsewhere in the manuscript.

The authors work exclusively with the class deer, which limits generalisability. However, this is acknowledged in the Discussion and should be viewed positively as an honest limitation.

Discussion should be extended to include a critical reflection of the results and how they compare with relevant external scientific literature. Currently, the section does not cite a single external source.

Incorrect section numbering: Section 5 is labelled both as Results and Conclusions, while Discussion is numbered 6. Please revise for consistency.

A direct comparison of YOLO training on static versus behaviourally-driven scenes is missing and would significantly strengthen the claims about behavioural realism.

Dataset composition per test for the real-world data in Section 3.3 is unclear. 

I recommend including a table summarising all relevant training hyperparameters, including learning rate, batch size, optimiser, and scheduler settings.

Author Response

Comments 1:
“Abstract: The opening could be more contextual and less technical (especially the first 1–2 sentences).”

Response 1:
Thank you for this suggestion. We have completely rewritten the Abstract. The new version starts from the ecological UAV monitoring problem and data scarcity, then briefly introduces the behaviour-aware UE5 pipeline and only afterwards summarises the experimental design and key findings in a less technical, more accessible way. The revised Abstract is marked in red in the manuscript.

 

Comments 2:
“Introduction: Please explicitly state the research questions (RQs) you aim to answer. You list the contributions twice (also in Section 2.4); I suggest stating contributions once and the RQs separately.”

Response 2:
We agree. At the end of the Introduction we now explicitly list three research questions (RQ1–RQ3), followed by a single consolidated contribution list. The duplicated contributions in the former Section 2.4 have been removed; that section now only summarises research gaps. Thus, research gaps, RQs, and contributions each appear once in clearly separated paragraphs (Section 1.3).

 

Comments 3:
“YOLOv8 should be explicitly referenced the first time the network is mentioned in the text.”

Response 3:
We have added an explicit citation for YOLOv8s at its first mention in the Introduction, referring to the official Ultralytics/COCO implementation. This makes the architectural choice clearly citable for the reader.

 

Comments 4:
“Related Work: The section could be better focused on behaviour modelling and UAV datasets for long-range monitoring; some parts are generic while important ecological aspects are underdeveloped.”

Response 4:
We appreciate this comment. We have refocused the Related Work in two ways:

Section 1.2.2 (“Behavior modeling and animal movement”) now gives a more detailed and referenced discussion of flocking, crowd models, and agent-based models for ungulates, linking these directly to our parameterisation of group dynamics.

Section 1.2.3 (“UAV datasets for object detection and long-range monitoring”) has been expanded to discuss DetReIDX, RCSD-UAV and related datasets, with explicit comparison to our long-range wildlife monitoring scenario.

These changes sharpen the focus on behaviour and UAV ecology and reduce generic background.

 

Comments 5:
“The description of the behaviour model and agent parameters is too brief; please provide more detail on how group dynamics are implemented.”

Response 5:
We agree. Section 2.1.3 (“Biological agents and behaviors”) has been expanded. We now (1) distinguish four parameter groups (appearance, kinematics, social behaviour, high-level states), (2) describe the herd controller that coordinates up to 15 agents via short-horizon path planning and social-force–style interactions, and (3) explain how these settings lead to realistic local densities, occlusions and trajectories.
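As a rough illustration of the social-force-style coordination described above, the following is a minimal sketch only, not the UE5 herd controller itself; every gain, radius, and speed value is hypothetical.

```python
import numpy as np

def herd_step(pos, vel, goal, dt=0.1, sep_radius=2.0,
              k_goal=0.5, k_sep=1.5, k_align=0.3, v_max=3.0):
    """One social-force-style update for a herd of agents.

    pos, vel: (N, 2) positions and velocities; goal: (2,) shared
    short-horizon waypoint. Gains, radii, and the speed cap are
    hypothetical values, not the paper's UE5 parameters.
    """
    force = k_goal * (goal - pos)                  # attraction to waypoint
    force += k_align * (vel.mean(axis=0) - vel)    # velocity alignment
    for i in range(len(pos)):                      # pairwise separation
        d = pos[i] - pos
        dist = np.linalg.norm(d, axis=1)
        near = (dist > 0) & (dist < sep_radius)
        if near.any():
            force[i] += k_sep * (d[near] / dist[near, None] ** 2).sum(axis=0)
    vel = vel + dt * force
    speed = np.linalg.norm(vel, axis=1, keepdims=True)
    vel = np.where(speed > v_max, vel * v_max / speed, vel)  # clamp speed
    return pos + dt * vel, vel
```

The sketch only shows how attraction, alignment, and separation terms combine into group motion; in the actual pipeline such forces would steer rendered UE5 agents along short-horizon paths.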

 

Comments 6:
“The real dataset section should describe the Rucervus dataset more concretely (origin, number of images, UAV vs handheld, splits).”

Response 6:
We have fully rewritten Section 2.3 (“Real dataset”). It now specifies the total number of images (8210), the split between UAV and handheld imagery, the exact train/validation/test counts (6198/1325/687), the UAV platforms used, and the fact that we follow the official partition defined in the original Rucervus study. This makes the dataset description self-contained.

 

Comments 7:
“It would be helpful to have all metric definitions in one place, particularly the scale-aware AP_small, AP_medium, AP_large.”

Response 7:
We agree. All metric definitions have been consolidated into a new Section 2.5 (“Evaluation metrics”). There we define precision, recall, AP@0.5, mAP, and the scale-aware AP_small, AP_medium and AP_large based on normalized bounding-box area. The Results section now only interprets these metrics and refers back to Section 2.5 for formal definitions.

 

Comments 8:
“The numbering and labelling of sections and subsections (Results, Discussion, Conclusions) are slightly confusing in the current version.”

Response 8:
Thank you for pointing this out. We have cleaned up the section structure so that the paper now follows a simple sequence: 1. Introduction, 2. Materials and Methods, 3. Results, 4. Discussion, 5. Conclusions. Inconsistencies in the previous numbering (e.g., combined “Results and Conclusions” and a separate “Discussion 6”) have been removed. All internal references were updated accordingly.

 

Comments 9:
“Discussion should engage more with external literature rather than only restating your own results.”

Response 9:
We agree. The Discussion has been revised to more explicitly relate our findings to existing work. In particular, we now compare BEHAVE-UAV to UE-based synthetic pipelines such as replicAnt and Unity-based perception frameworks, and we discuss how our behaviour-aware, UAV-accurate setup complements these efforts. We also relate our observations on synthetic pre-training to prior results on transfer learning in UAV and ecological settings. These additions are highlighted in Section 4.

 

Comments 10:
“You briefly mention scale mismatch and behaviour-driven scenes; a short outlook on experiments with static vs. behaviour-driven synthetic data would be useful.”

Response 10:
We appreciate this forward-looking suggestion. In the revised Future Work paragraph, we now explicitly state that we plan to perform ablation studies comparing static asset placement against behaviour-driven agents, and to combine this with scale-normalisation strategies (tile-based training, scale-mix augmentations). We clarified that such experiments are beyond the scope of the current submission but are a natural next step.

 

Reviewer 4 Report

Comments and Suggestions for Authors
  1. You lost A in LINE 356, eq(2)

S_small = { < 32² },

  2. About the training process from LINE 265: A key concern in this section is that the six training regimes are described as a list of configurations rather than as a clearly motivated and systematically controlled experimental design. To strengthen the scientific rigor, the authors should clarify the rationale behind comparing these specific regimes, especially regarding the interplay between data source (synthetic vs. real), initialization strategy (COCO-pretrained vs. synthetic-pretrained), and input resolution (640 vs. 1280). Important details needed for reproducibility—such as the size and characteristics of the synthetic dataset, the composition and annotation quality of the real Rucervus dataset, augmentation settings, batch size, and the exact early-stopping criteria—are currently missing and should be specified. In addition, the “fraction sweep” in regime 6 requires a clearer methodological explanation, including the sampling strategy, the specific percentage steps used, and whether multiple runs were averaged to account for variance. Finally, because multiple variables change across regimes, the authors should more explicitly justify how the design isolates the effects of resolution, pretraining source, and data type, or consider adding baselines that allow these factors to be disentangled more cleanly.
  3. Please provide a dedicated section that offers a comprehensive and technically rigorous description of the COCO-pretrained model used in this study. This section should clearly articulate the pretraining process, the characteristics and scale of the COCO dataset, the specific YOLO variant and weights adopted, and the implications of using COCO-pretrained initialization for downstream UAV-related tasks. Additionally, discuss how COCO pretraining influences feature representation, generalization, and domain transfer, and explain why this initialization serves as an appropriate baseline for the comparative training regimes evaluated in the experiment.
  4. You have also not provided a detailed description of the Rucervus dataset. For clarity and reproducibility, the manuscript should include a dedicated subsection that outlines the dataset’s origin, collection methodology, annotation protocol, class definitions, data volume, train–validation–test splits, and any domain characteristics relevant to UAV-based detection tasks. Without this information, it is difficult for readers and reviewers to assess the dataset’s suitability, understand potential domain gaps, or evaluate the validity of the experimental comparisons.
  5. From LINE 386:

The qualitative analysis presented here is particularly strong and provides compelling evidence in support of the study’s central claims. The authors effectively demonstrate that the synthetic-initialized model not only outperforms the real-only baseline on small and distant targets—an especially challenging regime for aerial wildlife detection—but does so in a manner that is consistent with the quantitative AP_small improvements reported earlier. The nuanced observations across varying scene conditions (small-scale targets, partial occlusion, close-range imagery, and background confounders) reveal a careful and thorough evaluation, showing that performance gains are not limited to isolated cases but generalize across typical UAV scenes. Notably, the model’s confident detections of unlabelled but visually plausible small deer highlight both the robustness of the synthetic pretraining and the presence of annotation gaps in the real dataset. This finding is scientifically valuable: it suggests that the reported metrics are conservative and that the proposed approach may in fact be even more effective in real-world deployment than the benchmark numbers imply. Overall, this section significantly strengthens the manuscript by showing clear, interpretable, and practically meaningful benefits of synthetic pretraining, while simultaneously identifying opportunities for dataset refinement and future semi-supervised improvements.

Comments on the Quality of English Language

It is OK.

Author Response

We sincerely thank the reviewer for the detailed and constructive comments. We have carefully revised the manuscript accordingly. Below we address each point in turn.

 

Comments 1:
“You lost A in LINE 356, eq(2) S_small = { < 32² }. Please check the formal definition of the scale bins and ensure consistency with the code.”

Response 1:
We thank the reviewer for catching this issue. In the revised manuscript we explicitly introduce the normalized box area and correct the scale-bin definitions in Section 2.5.

We also clarify that these thresholds are analogous in spirit to the COCO small/medium/large split but defined in normalized area, which matches the evaluation code used to compute AP_small, AP_medium and AP_large.

 

Comments 2:
“Please ensure that the way you compute mAP, AP@0.5 and the scale-aware AP is consistent and clearly defined (Ultralytics vs COCO definition).”

Response 2:
We agree. Section 2.5 (“Evaluation metrics”) now explicitly states that we follow the Ultralytics implementation:

AP@0.5 is computed at IoU = 0.5, and mAP@[0.5:0.95] is the mean AP over IoU thresholds 0.5–0.95 for our single deer class. Scale-aware AP_small, AP_medium and AP_large are obtained by restricting predictions and ground truth to the respective area subsets and applying the same matching and averaging protocol.

All references to mAP and AP in the Results now point back to this unified definition.
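For concreteness, a minimal sketch of how such scale binning could be applied to normalized YOLO boxes follows. The 32² and 96² pixel thresholds and the 640 px reference resolution follow the COCO convention and are assumptions here; the paper defines its own normalized thresholds in Section 2.5.

```python
def scale_bin(w_norm, h_norm, ref_w=640, ref_h=640):
    """Map a normalized YOLO box to a COCO-style scale bin.

    w_norm, h_norm are box width/height as fractions of the image.
    The 32^2 and 96^2 pixel thresholds and the 640 px reference
    resolution follow the COCO convention; the paper's exact
    normalized thresholds may differ, so these are assumptions.
    """
    area = (w_norm * ref_w) * (h_norm * ref_h)  # pixel area at ref size
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```

Each AP variant then restricts matching to boxes in the corresponding bin while keeping the same IoU thresholds and averaging protocol.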

 

Comments 3:
“A key concern in this section is that the six training regimes are described as a list of configurations rather than as a clearly motivated and systematically controlled experimental design…”

Response 3:
We appreciate this important remark. We have restructured the description of our experiments in a new subsection 2.6 “Training regimes and experimental design”:

We now explicitly separate the factors under study—data source (synthetic vs. real), initialization (COCO-pretrained vs. synthetic-pretrained), and input resolution (640 vs. 1280 px)—and explain how the six base regimes form a controlled comparison across these factors. For the real-data fraction sweep (10–100%), we describe the stratified sampling strategy, percentage steps, and fixed random seed used.

This makes the experimental design more transparent and reproducible, rather than a mere list of configurations.
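A seeded, nested fraction sweep of the kind described above could be drawn as in the sketch below. The fractions and seed are illustrative, and the paper's sampling is additionally stratified, which this simple shuffle omits.

```python
import random

def fraction_subsets(image_ids, fractions=(0.1, 0.3, 0.5, 0.8, 1.0), seed=42):
    """Nested random subsets of the real training images.

    One shuffle under a fixed seed makes the subsets nested (the 50%
    set contains the 30% set), so points along the sweep differ only
    by added data. Fractions and seed are illustrative; the paper's
    sampling is additionally stratified.
    """
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    return {f: ids[: max(1, round(f * len(ids)))] for f in fractions}
```

Nesting the subsets keeps the learning curve monotone in the data actually seen, which makes the flattening beyond roughly 60–80% real data easier to interpret.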

 

Comments 4:
“Please provide a dedicated section that offers a comprehensive and technically rigorous description of the COCO-pretrained model used in this study.”

Response 4:
We agree that this needed clarification. We have added a dedicated subsection 2.4 “COCO-pretrained baseline”, where we:

state that all real-only baselines use the Ultralytics YOLOv8s detector initialised from MS COCO 2017 pretrained weights; briefly describe the backbone–neck–head structure and anchor-free detection head; and explain why YOLOv8s is a suitable baseline for UAV wildlife detection in terms of accuracy–speed trade-off and community adoption.

We also make clear which regimes use this COCO-pretraining versus additional synthetic pretraining.

 

Comments 5:
“You have also not provided a detailed description of the Rucervus dataset.”

Response 5:
We fully agree. Section 2.3 “Real dataset (Rucervus)” has been rewritten to provide a self-contained description:

We now report the total number of images (8210), the split between UAV and handheld imagery, the official train/validation/test partition (6198/1325/687), the UAV platforms used, the annotation format, and the typical viewing conditions (altitudes, habitats, occlusions).

We also emphasise that we follow the original Rucervus split protocol and explain why this dataset is an appropriate real-world target for our synthetic-to-real transfer study.

 

Comments 6:
“The qualitative analysis presented here is particularly strong and provides compelling evidence for the claims about small-object detection and missing labels.”

Response 6:
We thank the reviewer for this positive feedback. In the revision we have tightened the link between the qualitative and quantitative results:

In Section 3.4 we now explicitly relate examples of small and partially occluded deer detected by synthetic-pretrained models to the improved AP_small values, and we highlight cases where the model finds plausible deer that are missing in the ground-truth labels, reinforcing the discussion of label noise in the real dataset.

We appreciate the reviewer’s recognition of this aspect of the work.

Reviewer 5 Report

Comments and Suggestions for Authors

The manuscript sits at the intersection of computer vision, ecological monitoring, and simulation-based data generation, which are becoming relevant nowadays. It is technically solid, direct, and empirically well grounded. The authors propose a novel UE5-based pipeline that integrates behavior-aware agents, UAV-accurate imaging geometry, and automatic YOLO-ready annotations. The set of experiments is extensive and generally convincing, especially the fractional fine-tuning results that quantify how much real data is actually needed after synthetic pretraining.

Overall, the paper offers a solid contribution to ecological UAV monitoring. The manuscript is almost publishable as is, but it would benefit from some tightening, a clearer framing of originality, and a more explicit discussion of limitations. Below I give detailed comments.

1. Right now, the introduction and related work sections are quite long, but the exact novel contributions are not completely clear. The pipeline is strong, but several parts overlap with earlier work (replicAnt, Unity Perception, synthetic UAV datasets). The manuscript could benefit from a clearer comparison with them.

2. You rely heavily on the Rucervus dataset, but the manuscript gives only minimal characterisation. Readers who do not know this dataset may struggle. Add some extra information, like the altitude range of the real dataset, so the reader has an idea.

3. You mention three limitations: species breadth, RGB modality, extreme weather approximations. Could you explain them more? Or could you cite more examples of limitations? I think, for example, that Unreal-based agents still cannot fully reproduce natural gait or fur reflectance.

4. The abstract needs to be redone. It is too technical and misses the point of giving an idea of what is treated in the text.

Author Response

Comment 1.
“Right now, the introduction and related work sections are quite long, but the exact novel contributions are not completely clear. The pipeline is strong, but several parts overlap with earlier work (replicAnt, Unity Perception, synthetic UAV datasets). The manuscript could benefit from a clearer comparison with them.”

Response 1.
We thank the reviewer for this important remark on framing originality. In the revised manuscript, we have (i) tightened the Introduction and Related Work by removing repeated statements, and (ii) added a dedicated subsection “1.2.4 Research gaps in existing synthetic pipelines” where replicAnt and Unity Perception are discussed explicitly and contrasted with our setting. We now make clear that our novelty lies in an end-to-end, behaviour-aware UE5 pipeline tailored to high-altitude UAV wildlife detection, with UAV-accurate imaging, detector- and tracker-ready labels, and a controlled transfer study to an independent real deer dataset. Finally, Section 1.3 Research questions and contributions has been rewritten so that each contribution is explicitly linked to one of these identified gaps, making the unique aspects of BEHAVE-UAV more visible.

 

Comment 2.
“You rely heavily on the Rucervus dataset, but the manuscript gives only minimal characterisation. Readers who do not know this dataset may struggle. Add some extra information, like the altitude range of the real dataset, so the reader has an idea.”

Response 2.
We appreciate this suggestion. Section 2.3 Real dataset has been substantially expanded: we now report the exact dataset size (8210 images), the split between UAV and handheld images, and the official train/val/test partition used in our experiments. We also describe the UAV platforms (DJI Mavic 2 Zoom, Mavic 2 Enterprise, Mavic Pro), the variety of habitats, and the typical altitude range and viewpoints, so that readers unfamiliar with Rucervus have a concrete sense of the imaging conditions and scale distribution.

 

Comment 3.
“You mention three limitations: species breadth, RGB modality, extreme weather approximations. Could you explain them more? Or could you cite more examples of limitations? I think, for example, that Unreal-based agents still cannot fully reproduce natural gait or fur reflectance.”

Response 3.
We agree that the limitations section was too brief. In the revised Discussion/Conclusions, we expanded the limitations paragraph to (a) emphasise the current restriction to deer-like ungulates, (b) explain the consequences of using RGB-only rendering for nocturnal and canopy-occluded scenarios, and (c) clarify that extreme weather and seasonal variation are approximated via style randomisation rather than full physical modelling. Following the reviewer’s example, we also added a new point noting that Unreal-based agents still cannot fully reproduce natural gait, fine-scale fur reflectance, and rare edge cases (e.g., interactions with infrastructure). We explicitly state that, despite these limitations, style randomisation and short fine-tuning on local real data are needed when transferring to new regions or species.

 

Comment 4.
“The abstract needs to be redone. It is too technical and misses the point of giving an idea of what is treated in the text.”

Response 4.
We fully agree and have rewritten the Abstract from scratch. The new version starts from the ecological UAV monitoring problem and the challenge of scarce labelled data, briefly introduces the behaviour-aware UE5 pipeline, and then summarises the experimental design (six training regimes, two resolutions, fractional fine-tuning on a real deer dataset) using only a small number of representative results. It ends with a clear practical takeaway about how high-resolution synthetic pre-training plus partial real fine-tuning can reduce annotation effort for UAV wildlife detection. We believe this makes the abstract less technical and more informative about what the paper does and why it matters.

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

Dear Editor,

Most revisions have been implemented satisfactorily. Only one point remains unresolved. The Discussion still lacks explicit citations to the external literature mentioned (e.g., replicAnt, Unity Perception, transfer-learning studies). I therefore recommend minor revisions, limited to adding the necessary references and integrating them into the argumentation.

Best regards

Author Response

Dear Reviewer,
We sincerely thank you for the detailed and constructive comments. We have carefully revised the manuscript accordingly. 

Comment:

“The Discussion still lacks explicit citations to the external literature mentioned (e.g., replicAnt, Unity Perception, transfer-learning studies).”

Response:

In the revised Discussion we have (1) added explicit citations to replicAnt and Unity Perception when comparing BEHAVE-UAV to existing UE-based synthetic pipelines [12,13], and (2) linked our findings on sample-efficient synthetic pre-training plus fractional real fine-tuning to prior work on synthetic-to-real transfer and domain randomization [11,15,24]. We also added corresponding citations in the Conclusions when contrasting our pipeline with earlier UE-based platforms. These changes integrate the relevant external literature into the argumentation more explicitly.
