Article
Peer-Review Record

Research on Road Surface Distress Detection Algorithm in UAV Images with Multi-Scale Feature Fusion

Remote Sens. 2026, 18(10), 1461; https://doi.org/10.3390/rs18101461
by Dudu Guo 1,2,*, Wenxing Cai 3, Hongbo Shuai 3, Zhenxun Wei 4,5 and Guoliang Chen 4,5
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 25 March 2026 / Revised: 2 May 2026 / Accepted: 3 May 2026 / Published: 7 May 2026

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper has conducted substantial work in the engineering practice of applying deep learning to highway defect detection (introducing improvement modules across four dimensions), demonstrating certain potential for publication. However, the current manuscript has significant flaws regarding the standardization of academic writing, the rigor of data presentation, and the conciseness of English expression. The authors must thoroughly correct the citation formats, unify the acronyms of proper nouns, clarify baseline metrics, and supplement a substantive, in-depth Discussion before the conclusion. Concurrently, a comprehensive English proofreading and formatting cleanup are required.
1. Inconsistent acronyms throughout the text: The core innovative module proposed in the paper is abbreviated as FFDPN (Feature Focusing Diffusion Pyramid Network) in the Abstract, but it changes to FFPDN in the Conclusion section. Such spelling errors regarding a core concept severely undermine the rigor of the paper.
2. Missing "Discussion" section: The title of Section 4 is "Discussion and Conclusions," yet the main text is merely a mechanical repetition of the Abstract and Introduction. There is a complete lack of in-depth theoretical analysis of the experimental results (e.g., why do the new modules work? Under what extreme lighting or camera angles might the model fail? What are the specific advantages and disadvantages compared to current SOTA models?).
3. Obvious textual and semantic redundancy: For instance, the statements in lines 77-80 and lines 80-82 of the Introduction regarding "the reduction of feature information and the impact on detection accuracy caused by the high vantage point of drones" are highly repetitive in meaning and must be streamlined and merged. Furthermore, there are two completely identical paragraphs in Section 3.1.
4. Insufficient originality/novelty: The methodology relies on a combination of existing deep learning operators, such as DCNv2, WaveletPool, and the Sobel operator (EIEM). While this approach holds value for engineering applications (drone-based road defect detection), it lacks substantial innovation in the fundamental theories of computer vision.

Author Response

Comments 1:[Inconsistent acronyms throughout the text: The core innovative module proposed in the paper is abbreviated as FFDPN (Feature Focusing Diffusion Pyramid Network) in the Abstract, but it changes to FFPDN in the Conclusion section. Such spelling errors regarding a core concept severely undermine the rigor of the paper.]

Response 1:[Corrected. The single occurrence of the typo FFPDN in the conclusion paragraph has been replaced with FFDPN. We then performed a full-document search to confirm that FFDPN is now used consistently everywhere — Abstract, Introduction, Sections 2.2 and 2.2.1, the new Discussion (Section 4), and the Conclusions (Section 5). The change is highlighted in the revised file.]

Comments 2:[Missing "Discussion" section: The title of Section 4 is "Discussion and Conclusions," yet the main text is merely a mechanical repetition of the Abstract and Introduction. There is a complete lack of in-depth theoretical analysis of the experimental results (e.g., why do the new modules work? Under what extreme lighting or camera angles might the model fail? What are the specific advantages and disadvantages compared to current SOTA models?).]

Response 2:[We agree and have split the section into a substantive Section 4 Discussion and a separate Section 5 Conclusions. The new Discussion has five subsections, each targeting one aspect the reviewer raised:

  • Section 4.1 (Why the Proposed Modules Improve Detection) gives a mechanistic account of each module's contribution. FFDPN shortens the P3–P5 information path so small-target texture is not drowned by P5 semantics. IIDH contributes through two parallel mechanisms: structurally, sharing a single CGS-derived interaction feature between the classification and regression branches replaces two independent branch stacks with one shared feature plus two lightweight task modules, reducing parameters from 3.01×10⁶ to 2.24×10⁶; functionally, the DCNv2 offsets in the regression branch are modulated by classification confidence, which is consistent with the +2.9-mAP / +2.8-recall gain of Group ③ and with the reduction of duplicate boxes in Figure 11. EIEM injects a translation-equivariant Sobel gradient prior that CNN stacks otherwise learn only slowly. WaveletPool preserves LH, HL, and HH sub-bands, which collectively retain horizontal, vertical, and diagonal high-frequency edge energy so that EIEM's edge evidence survives downsampling. The sum of single-module gains (3.2 + 2.9 + 0.7 + 5.9 = 12.7) is close to the full combined gain of 12.2, indicating that on this dataset the modules are approximately additive rather than strongly synergistic; their complementarity is most visible qualitatively in Figure 11, where only the full combination eliminates duplicate, missed, and misclassified detections across all four distress categories simultaneously.
  • Section 4.2 (Comparison with SOTA) explicitly reads the Table-4 numbers — Faster R-CNN and DETR are poorly matched to tens-of-pixels targets because their heads were designed for medium/large instances; YOLOv9 is competitive (92.4% mAP) but ~21× heavier and ~5× slower than our model, ruling it out for UAV deployment; YOLOv10n/YOLOv11n trade 9–28 mAP points for speed; YOLO-World carries an unnecessary open-vocabulary text encoder for our closed four-class taxonomy.
  • Section 4.4 (Generalization, Failure Modes, and Limitations) reports three specific failure regimes: (i) tilt angles >60° where perspective foreshortening distorts transverse cracks into near-horizontal slivers and IIDH's DCNv2 offsets only partially compensate; (ii) low-sun directional illumination that produces Sobel responses indistinguishable from real cracks at the EIEM stage; (iii) adverse weather, which is absent from our current protocol and therefore reported as open work rather than hidden under an implausible claim.]
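The additivity statement in the Section 4.1 summary above can be checked with a few lines of arithmetic. The sketch below uses only the gains quoted in the response; the variable names are illustrative, and only the 2.9-point gain is explicitly attributed to a module (IIDH) in this record, so the list is left unlabelled.

```python
# mAP gains (percentage points) of the four single-module ablation groups,
# in the order quoted in the response. Only the 2.9 figure is explicitly
# attributed to a module (IIDH) in this record.
single_module_gains = [3.2, 2.9, 0.7, 5.9]
combined_gain = 12.2  # full model versus the baseline, as quoted

sum_of_gains = round(sum(single_module_gains), 1)
shortfall = round(sum_of_gains - combined_gain, 1)
retained_fraction = combined_gain / sum_of_gains

print(f"sum of single-module gains:       {sum_of_gains}")
print(f"combined gain:                    {combined_gain}")
print(f"shortfall vs. pure additivity:    {shortfall} points")
print(f"fraction of summed gain retained: {retained_fraction:.3f}")
```

A shortfall of half a point out of 12.7 is what "approximately additive" means here: the modules overlap slightly rather than reinforcing one another.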

Comments 3:[Obvious textual and semantic redundancy: For instance, the statements in lines 77-80 and lines 80-82 of the Introduction regarding "the reduction of feature information and the impact on detection accuracy caused by the high vantage point of drones" are highly repetitive in meaning and must be streamlined and merged. Furthermore, there are two completely identical paragraphs in Section 3.1.]

Response 3:[Both redundancies are fixed. In the Introduction (third paragraph), the sentence beginning 'Due to the lack of road defect feature information caused by the high perspective…' was deleted — the same idea is already stated in the immediately preceding clause. In Section 3.2 (Ablation experiment), the entire duplicated paragraph beginning 'To validate the optimization effect of the improved strategy…' has been removed; only one copy remains. Both edits are highlighted in the revised file.]

Comments 4:[Insufficient originality/novelty: The methodology relies on a combination of existing deep learning operators, such as DCNv2, WaveletPool, and the Sobel operator (EIEM). While this approach holds value for engineering applications (drone-based road defect detection), it lacks substantial innovation in the fundamental theories of computer vision.]

Response 4:[We agree with the reviewer's framing and have added a new Section 4.5 (On Novelty and the Engineering–Theory Trade-off) that states this explicitly: we do not claim novelty at the level of new primitive operators. DCNv2, Sobel convolution, and wavelet decomposition are each pre-existing. What is novel, and what the ablation quantifies, is the architectural claim that these particular operators, placed at specific network locations with specific couplings (edge prior at the stem, aliasing-safe downsampling, series-diffusion fusion across scales, shared-feature interaction between heads), form a chain that targets the dominant failure mode of UAV pavement imagery — small targets with thin edges at variable scale. The paper's contribution is therefore positioned honestly as task-specific architectural design, not as a new theoretical operator.]

Reviewer 2 Report

Comments and Suggestions for Authors

I did not see any novelty of this manuscript since this topic has been investigated for more than 5 years with deep learning models. Some suggestions are listed below:

  1. Please correct “disease” to “distress”.
  2. Please clarify where the dataset was collected.
  3. Authors should present the parameters or any related computational requirements.
  4. Failure of the proposed model should also be included for discussion.
  5. Also, you should try to achieve edge computing for detection since your present work is not a contribution to the community.

Author Response

Comments 1:[Please correct “disease” to “distress”.]

Response 1:[Done. We agree that 'pavement distress' is the correct term of art in transportation engineering. We replaced 'Disease' with 'Distress' in the title and keywords (highlighted), and corrected the body-text occurrence in Section 2.2.1 ('aggregated distress features').]

Comments 2:[Please clarify where the dataset was collected.]

Response 2:[Section 3.1 now includes a dedicated paragraph with acquisition details: the imagery was collected along national and provincial highway sections in Xinjiang Uygur Autonomous Region, China, using a DJI quad-rotor UAV with a 1/2.3-inch CMOS sensor. Flight altitudes (10–30 m AGL), camera tilt angles (nadir to ~60° off-nadir), illumination conditions (solar elevation >30°), and native image resolution (1920×1080) are all specified. ]

Comments 3:[Authors should present the parameters or any related computational requirements.]

Response 3:[The per-model parameter count (Params/10⁶), model complexity (GFLOPs), and inference speed (FPS) are reported in Tables 3 and 4. Table 1 lists the training hardware (RTX 3090Ti, 32 GB RAM, CUDA 11.3, PyTorch 1.10). Our model uses 2.41×10⁶ parameters and 10.4 GFLOPs, which we discuss explicitly in Section 4.5 in the context of on-board UAV deployment feasibility.]

Comments 4:[Failure of the proposed model should also be included for discussion.]

Response 4:[Section 4.4 (Generalization, Failure Modes, and Limitations) lists three concrete failure regimes (steeply oblique views, low-sun directional shadows, adverse weather) and explains the mechanism of each. Section 4.3 additionally describes a residual failure mode — a linear stain geometrically mimicking a crack — which cannot be fully resolved at UAV resolution and is now reported as an acknowledged limitation rather than a solved problem.]

Comments 5:[Also, you should try to achieve edge computing for detection since your present work is not a contribution to the community.]

Response 5:[We take this point seriously. Section 4.5 states clearly that, at 2.41×10⁶ parameters and 10.4 GFLOPs, the model is within the budget of modern UAV companion boards (Jetson Orin Nano, RK3588). However, we have not yet ported and profiled the model on such hardware, and we state this explicitly as future work rather than an implied contribution. Claiming edge deployment without having measured it on the target device would, in our view, be worse than transparently scoping it as follow-up work.]

Reviewer 3 Report

Comments and Suggestions for Authors

1. This manuscript investigates the multi-scale feature fusion algorithm for road surface disease detection in drone images. There are too many statements about background introduction in the abstract (first few sentences).
2. Lines 65-66, the opening sentence is similar to the beginning of the previous paragraph, 'With the development of unmanned aerial vehicle (UAV) remote sensing technology' can be changed to 'Based on various needs in production and daily life'. Please ensure the logic and progression between the paragraphs in the introduction.
3. In Table 2, please ensure that '5 × 10-4' is correct.
4. Optical remote sensing satellites mainly rely on visible light for imaging, so they can only work during the day and are greatly affected by meteorological conditions such as clouds, rain, and fog, which limits their observation time and range. 
5. The relevant results of the algorithm presented in adverse weather conditions (weak light, cloudy, rainy, etc.) need to be revealed and compared with other methods (YOLOv8n).
6. How to distinguish between cracks (or potholes) on the road and black lines or stains (such as black dirt blocks formed by oil leakage and soil contamination).

 

Author Response

Comments 1:[This manuscript investigates the multi-scale feature fusion algorithm for road surface disease detection in drone images. There are too many statements about background introduction in the abstract (first few sentences).]

Response 1:[The abstract has been completely rewritten (highlighted) to trim background from four sentences to one and to front-load the technical contribution. The new abstract states the problem, the four modules and what each of them does, the dataset scope (3,408 UAV images, four distress categories), the headline numerical results (P 93.7%, R 89.6%, mAP 96.0%, 2.41×10⁶ params), and the speed trade-off (30.3 FPS versus 57.1 FPS for YOLOv8n).]

Comments 2:[Lines 65-66, the opening sentence is similar to the beginning of the previous paragraph, 'With the development of unmanned aerial vehicle (UAV) remote sensing technology' can be changed to 'Based on various needs in production and daily life'. Please ensure the logic and progression between the paragraphs in the introduction.]

Response 2:[Applied. The opener of Introduction paragraph 2 is now 'Driven by the diverse needs of production and daily life, unmanned aerial vehicle (UAV) remote sensing has been widely applied in waterway transportation…' (highlighted).]

Comments 3:[In Table 2, please ensure that '5 × 10-4' is correct.]

Response 3:[Confirmed. The weight-decay coefficient is 5 × 10⁻⁴ = 0.0005, the standard value for SGD on YOLOv8. Table 2 now renders the exponent as a proper superscript. ]

Comments 4:[Optical remote sensing satellites mainly rely on visible light for imaging, so they can only work during the day and are greatly affected by meteorological conditions such as clouds, rain, and fog, which limits their observation time and range. ]

Response 4:[We have added a short paragraph in Section 4.4 that makes exactly this point — optical satellites are restricted to daytime operation and are strongly impacted by clouds, rain, and fog, which constrains their temporal/spatial coverage for pavement inspection. This motivates UAV inspection, whose high-angle perspective is precisely what our modules were designed to compensate for.]

Comments 5:[The relevant results of the algorithm presented in adverse weather conditions (weak light, cloudy, rainy, etc.) need to be revealed and compared with other methods (YOLOv8n).]

Response 5:[We handle this transparently rather than by fabricating figures. Section 3.1 now explicitly states that our acquisition protocol was restricted to clear daylight with solar elevation >30°, so the present dataset does not contain genuine adverse-weather samples. Reporting YOLOv8-vs-ours numbers on imagery we do not actually have would be methodologically unsound. Section 4.4 therefore reports adverse weather as a known limitation and first item of planned future work — we intend to collect an adverse-weather extension under the same protocol and benchmark all models in a follow-up paper.]

Comments 6:[How to distinguish between cracks (or potholes) on the road and black lines or stains (such as black dirt blocks formed by oil leakage and soil contamination).]

Response 6:[Section 4.3 is dedicated to this question and gives three layers of mechanism: (i) at the pixel level, EIEM responds to steep gradients, so a stain with a diffuse, low-contrast boundary produces a much weaker edge map than a crack whose boundary is a sharp albedo discontinuity; (ii) at the feature level, WaveletPool's LH, HL, and HH sub-bands collectively preserve horizontal, vertical, and diagonal high-frequency edge energy, which retains the thin oriented boundaries of linear/branching cracks and concave-closed pothole rims while blurry, low-gradient stain silhouettes contribute little to these sub-bands and are attenuated during downsampling; (iii) at the label level, the annotation protocol described in Section 3.1 uses two annotators and retains only regions with coherent crack morphology or a clearly concave pothole boundary, so amorphous blobs are excluded at training time. A residual failure case — a long linear tire-rubber streak mimicking crack geometry — is reported as an acknowledged limitation.]
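Mechanism (i) can be illustrated with a plain Sobel filter. The sketch below is not the authors' EIEM module; it only assumes the standard 3×3 horizontal Sobel kernel and two synthetic 8×8 patches (hypothetical test data) that span the same 0-to-1 intensity range, one as an abrupt step and one as a gradual ramp.

```python
# Standard 3x3 Sobel kernel for horizontal intensity changes.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]


def sobel_x_peak(image):
    """Return the largest absolute Sobel-x response over interior pixels."""
    rows, cols = len(image), len(image[0])
    peak = 0.0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            response = sum(
                SOBEL_X[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3)
                for j in range(3)
            )
            peak = max(peak, abs(response))
    return peak


WIDTH, HEIGHT = 8, 8
# Crack-like boundary: intensity jumps from 0 to 1 between two columns.
sharp = [[0.0 if c < WIDTH // 2 else 1.0 for c in range(WIDTH)]
         for _ in range(HEIGHT)]
# Stain-like boundary: the same 0-to-1 change spread over the whole width.
diffuse = [[c / (WIDTH - 1) for c in range(WIDTH)] for _ in range(HEIGHT)]

peak_sharp = sobel_x_peak(sharp)
peak_diffuse = sobel_x_peak(diffuse)
print(f"peak response, sharp boundary:   {peak_sharp:.3f}")
print(f"peak response, diffuse boundary: {peak_diffuse:.3f}")
```

Both patches have the same total contrast, yet the step produces a peak response several times larger than the ramp. This is the property the response relies on; it also suggests why a stain with a sharp, linear boundary (the tire-rubber streak) remains a failure case.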

Reviewer 4 Report

Comments and Suggestions for Authors

1. The manuscript proposes an improved YOLOv8-based framework for UAV road surface defect detection, but the methodological novelty should be clarified more explicitly. Since the overall design is built by integrating several modified components into an existing detector, the authors are encouraged to explain more clearly how each proposed module contributes beyond conventional feature fusion and detection-head redesign strategies.

2. Although the reported improvements in precision, recall, and mAP are encouraging, the experimental validation could be strengthened by providing a more detailed analysis of generalization ability. In particular, it would be helpful to discuss how the proposed method performs under different UAV altitudes, imaging perspectives, or illumination conditions, since these factors are highly relevant in practical road inspection scenarios.

3. The ablation study demonstrates that each proposed module contributes to performance improvement, but the current discussion remains relatively brief. A more in-depth explanation of why FFDPN, IIDH, EIEM, and WaveletPool improve detection, especially for small-scale defects and duplicate detections, would help readers better understand the internal mechanism of the proposed framework.

4. To better position the contribution within recent advances in structured and attribute-aware representation learning, the authors are advised to consult representative studies in this field, e.g., a size-aware graph embedding approach to remote sensing image captioning and global structure relation via reversible visual state space.

5. The manuscript would benefit from a clearer discussion of the trade-off between detection accuracy and inference speed. While the proposed model achieves the best mAP, its FPS is noticeably lower than some lightweight baselines. A more explicit interpretation of this efficiency–accuracy balance would improve the practical value of the study for UAV-based deployment.

Author Response

Comments 1:[The manuscript proposes an improved YOLOv8-based framework for UAV road surface defect detection, but the methodological novelty should be clarified more explicitly. Since the overall design is built by integrating several modified components into an existing detector, the authors are encouraged to explain more clearly how each proposed module contributes beyond conventional feature fusion and detection-head redesign strategies.]

Response 1:[Section 4.1 now gives per-module mechanistic differences. Briefly: FFDPN differs from conventional PAN by coupling two FocusFeature units in series, shortening the effective path between P3 and P5 to a single hop per scale, preserving small-target texture that repeated PAN up-down fusion would attenuate. IIDH differs from a conventional decoupled head by sharing a single CGS-derived interaction feature; the DCNv2 offsets in the regression branch are modulated by the classification confidence distribution, which is the specific mechanism behind the reduction in duplicate boxes. EIEM differs from stem-level attention by injecting an explicit, translation-equivariant Sobel gradient prior rather than learning one from scratch. WaveletPool differs from strided convolution downsampling by retaining the LH/HL/HH sub-bands, where thin-edge energy lives. Section 4.5 then makes the positioning explicit: the contribution is task-specific architectural design, not a new operator.]

Comments 2:[Although the reported improvements in precision, recall, and mAP are encouraging, the experimental validation could be strengthened by providing a more detailed analysis of generalization ability. In particular, it would be helpful to discuss how the proposed method performs under different UAV altitudes, imaging perspectives, or illumination conditions, since these factors are highly relevant in practical road inspection scenarios.]

Response 2:[Section 3.1 now documents that training/validation/test data span altitudes of 10–30 m AGL and camera tilt from nadir to ~60° off-nadir, so the 96.0% mAP is obtained under a non-trivial range of altitudes and perspectives, not a single fixed geometry. Section 4.4 then discusses generalization behavior across this range — the model remains effective throughout it, which we attribute to the edge-preservation chain EIEM → WaveletPool being robust to moderate changes in ground sampling distance. Section 4.4 also flags three regimes where generalization breaks: tilt >60°, low-sun directional illumination, and adverse weather.]

Comments 3:[The ablation study demonstrates that each proposed module contributes to performance improvement, but the current discussion remains relatively brief. A more in-depth explanation of why FFDPN, IIDH, EIEM, and WaveletPool improve detection, especially for small-scale defects and duplicate detections, would help readers better understand the internal mechanism of the proposed framework.]

Response 3:[Addressed in Section 4.1, written specifically to answer this comment. For small-scale defects, the relevant mechanism is the EIEM → WaveletPool → FFDPN chain: EIEM creates explicit edge evidence at the stem, WaveletPool preserves LH/HL/HH sub-bands (horizontal, vertical, and diagonal high-frequency edges) through downsampling, and FFDPN propagates this information across scales without dilution. For duplicate detections, the relevant mechanism is IIDH's shared interaction feature with classification-confidence modulation of the DCNv2 offsets. We also give an honest assessment of module additivity: the sum of single-module gains (12.7) is slightly larger than the full combined gain (12.2), indicating that the modules are approximately additive in quantitative terms; the stronger complementarity shows up in Figure 11, where only the full combination simultaneously eliminates duplicate, missed, and misclassified detections across all four distress categories.]

Comments 4:[To better position the contribution within recent advances in structured and attribute-aware representation learning, the authors are advised to consult representative studies in this field, e.g., a size-aware graph embedding approach to remote sensing image captioning and global structure relation via reversible visual state space.]

Response 4:[We added a paragraph in Section 4.2 that positions our framework relative to these directions, and we added both suggested works to the reference list as new entries [33] and [34]:
[33] A Size-Aware Graph Embedding Approach to Remote Sensing Image Captioning with Object Relative Size Information. IEEE, 2025. https://ieeexplore.ieee.org/document/11328866/.
[34] Understanding Global Structure Relation via Reversible Visual State Space Model for Robust Cross-View Geo-Localization. In Proceedings of the 3rd International Workshop on UAVs in Multimedia, ACM, 2025. https://doi.org/10.1145/3728482.3757390.]

Comments 5:[The manuscript would benefit from a clearer discussion of the trade-off between detection accuracy and inference speed. While the proposed model achieves the best mAP, its FPS is noticeably lower than some lightweight baselines. A more explicit interpretation of this efficiency–accuracy balance would improve the practical value of the study for UAV-based deployment.]

Response 5:[Section 4.2 reads the trade-off directly off Table 4. Relative to YOLOv8n, our model gives up 26.8 FPS (57.1 → 30.3) in exchange for +12.2 mAP; the cost is attributable primarily to DCNv2 inside IIDH and to the four-band convolutions of WaveletPool. YOLOv10n is faster (66.7 FPS) but trades 28.2 mAP points for that speed; YOLOv9 is more accurate than YOLOv10n but at 5.8 FPS and 50.7×10⁶ parameters (≈21× heavier and ≈5× slower than our model), which rules it out for on-board UAV deployment. On positioning, we have tightened the practical claim: 30.3 FPS is sufficient for post-flight batch analysis and for real-time analysis of survey-UAV video streams at typical 24–30 FPS capture rates; it is not intended for applications requiring higher frame rates such as aggressive flight-control loops, and we therefore position the model as an inspection tool rather than a control-loop perception module. Section 4.5 further notes that the 2.41×10⁶-parameter, 10.4-GFLOP profile is within the budget of edge-AI SoCs used on modern UAV companion boards.]
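The ratios and the real-time claim in this response follow directly from the quoted figures. The sketch below reproduces them; the dictionary names are illustrative, the 60 FPS rate is a hypothetical contrast case that does not appear in the manuscript, and the check ignores pre-processing, post-processing, and I/O time, which this record does not quantify.

```python
# Figures quoted in the responses (Table 4 of the manuscript).
ours = {"fps": 30.3, "params_m": 2.41}
yolov8n = {"fps": 57.1}
yolov9 = {"fps": 5.8, "params_m": 50.7}

fps_given_up = round(yolov8n["fps"] - ours["fps"], 1)
param_ratio = yolov9["params_m"] / ours["params_m"]  # "about 21x heavier"
speed_ratio = ours["fps"] / yolov9["fps"]            # "about 5x slower"

# The model keeps up with a video stream if its inference time per frame
# is no longer than the camera's frame period.
ours_ms = 1000.0 / ours["fps"]
keeps_up = {rate: ours_ms <= 1000.0 / rate for rate in (24, 30, 60)}

print(f"FPS given up vs. YOLOv8n: {fps_given_up}")
print(f"YOLOv9 / ours parameters: {param_ratio:.1f}x")
print(f"ours / YOLOv9 speed:      {speed_ratio:.1f}x")
print(f"per-frame time:           {ours_ms:.1f} ms")
print(f"keeps up with capture rate: {keeps_up}")
```

At 30 FPS capture the margin is about a third of a millisecond per frame, so the real-time claim holds only narrowly at the upper end of the stated 24–30 FPS range.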

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

N/A

Author Response

Response:

We are grateful for the reviewer’s assessment and have taken these concerns seriously. In the absence of specific written comments to act on, we have used the substantive feedback from Reviewers 3 and 4 as our concrete guide and have, in addition, performed a careful pass over the manuscript with the following aims:

(1) Tightened the technical description in Sections 2.2.1 and 2.2.4 by adding formal mathematical formulations of the FFDPN feature-fusion path and of the WaveletPool sub-band decomposition (see Section 2.2.1, new paragraphs after the StackFusion description, and Section 2.2.4, new paragraph after the WaveletPool overview).

(2) Added an explicit operational-stability paragraph for the Feature Focusing Diffusion Pyramid Network (Section 2.2.1) that documents the four design choices (residual short-cut, BatchNorm, Lipschitz-bounded interpolation, series rather than cross-coupled connection) ensuring stable training.

(3) Standardized the formatting of Equation (1) so that the equation number sits flush with the right margin (Section 2.2.1, lines around 202–203 of the revised PDF).

(4) Carefully re-read the manuscript for English clarity; we have not received specific phrasing concerns from Reviewers 3 or 4, but if the editor or this reviewer can highlight any passages that remain unclear we will be glad to revise them in a further iteration.

We hope these revisions address the underlying concerns. We would also welcome any specific points the reviewer is willing to share so that we can target them precisely in any subsequent revision.

Reviewer 3 Report

Comments and Suggestions for Authors

1. The font of the coordinate axis and annotations in Figure 10 is too small, usually one size smaller than the font in the title. The resolution of the image needs to be increased.
2. The causes of road surface defects are complex and varied, and the road surface is highly susceptible to interference from the surrounding environment. The mathematical description of road defect feature fusion and small object detection needs to be strengthened.
3. The editing of equations requires standardization. The numbering of equations should ensure that they are at the far right end of the line (e.g. equation (1) in lines 201-202).
4. How to ensure the stability of the operation of the focused diffusion pyramid network?

 

Author Response

Comments 1:The font of the coordinate axis and annotations in Figure 10 is too small, usually one size smaller than the font in the title. The resolution of the image needs to be increased.

Response 1:Thank you for catching this. We have re-exported Figure 10 (loss-curve comparison) at higher resolution, increased the axis-tick and legend font size to one point smaller than the figure title (rather than the original two-points-smaller setting that produced the unreadable labels), and re-saved the figure as a higher-DPI raster so that axis numerals and series annotations remain legible at print size. The corresponding figure in the manuscript has been replaced. Location: Section 3.3, Figure 10.

Comments 2:The causes of road surface defects are complex and varied, and the road surface is highly susceptible to interference from the surrounding environment. The mathematical description of road defect feature fusion and small object detection needs to be strengthened.

Response 2:

We agree, and we have added two formal mathematical formulations to the manuscript that make the feature-fusion and small-object handling explicit rather than only descriptive.

(a) In Section 2.2.1 (FFDPN), a new paragraph added after the StackFusion description formalises the FocusFeature operation. It defines the three input scales X₁, X₂, X₃, the bilinear up-sampler U(·) and ADown down-sampler D(·) used for scale alignment, the channel-wise concatenation X=[X̃₁;X̃₂;X̃₃], the multi-kernel StackFusion S(X)=Σ_{k∈{5,7,9,11}} f_k(X), and the residual output Y=ψ_{1×1}(S(X))+X. It then explains, with reference to this formula, exactly why the construction is suited to small-target detection: the residual path preserves the high-resolution detail channels contributed by X₃, and the K=4 receptive fields with k∈{5,7,9,11} bracket the ground-pixel size of the four distress categories at the UAV altitudes used in this paper.

(b) In Section 2.2.4 (WaveletPool), a new paragraph added after the operator overview gives the formal Haar decomposition F_LL=(F*h_x)*h_y, F_LH=(F*h_x)*g_y, F_HL=(F*g_x)*h_y, F_HH=(F*g_x)*g_y with h=(1/√2)[1,1] and g=(1/√2)[1,−1], the orthonormal energy identity ‖F‖²=‖F_LL‖²+‖F_LH‖²+‖F_HL‖²+‖F_HH‖², and a Nyquist–Shannon argument showing why this provides anti-aliasing for the small-target detection scales.

(c) Concerning the susceptibility of the road surface to environmental interference, Section 4.3 (“Distinguishing Distresses from Visually Similar Non-Distress Artifacts”) and Section 4.4 (failure modes) already discuss oil leaks, soil contamination, tire marks, asphalt patches, and the three specific failure modes (steep oblique views, strong directional illumination, adverse weather). Together with the new mathematical formulations, the discussion now ties the empirical robustness of the model directly to the formal properties of EIEM (gradient response) and WaveletPool (high-frequency sub-band preservation).
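The Haar decomposition and the energy identity in item (b) can be checked numerically. The sketch below is not the authors' WaveletPool layer; it assumes the critically sampled form of the transform (non-overlapping 2×2 blocks, i.e. stride 2), under which the identity holds exactly, and it fixes one sign convention for the high-pass filter. The function names are illustrative.

```python
import math
import random


def haar_level(F):
    """One level of the 2-D Haar transform on an even-sized grid F[y][x].

    Uses h = [1, 1]/sqrt(2) (low-pass) and g = [1, -1]/sqrt(2) (high-pass)
    on non-overlapping 2x2 blocks. Following the naming in the response,
    the first letter refers to the filter along x and the second along y.
    """
    LL, LH, HL, HH = [], [], [], []
    for y in range(0, len(F), 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for x in range(0, len(F[0]), 2):
            a, b = F[y][x], F[y][x + 1]
            c, d = F[y + 1][x], F[y + 1][x + 1]
            ll_row.append((a + b + c + d) / 2)  # low in x, low in y
            lh_row.append((a + b - c - d) / 2)  # low in x, high in y
            hl_row.append((a - b + c - d) / 2)  # high in x, low in y
            hh_row.append((a - b - c + d) / 2)  # high in x, high in y
        LL.append(ll_row)
        LH.append(lh_row)
        HL.append(hl_row)
        HH.append(hh_row)
    return LL, LH, HL, HH


def energy(band):
    """Sum of squared entries of a 2-D grid."""
    return sum(v * v for row in band for v in row)


random.seed(0)
F = [[random.random() for _ in range(8)] for _ in range(8)]
bands = haar_level(F)
energy_in = energy(F)
energy_out = sum(energy(band) for band in bands)
print(math.isclose(energy_in, energy_out, rel_tol=1e-9))

# A vertical edge (intensity changes along x only) between columns 2 and 3.
edge = [[0.0 if x < 3 else 1.0 for x in range(8)] for _ in range(8)]
_, edge_LH, edge_HL, edge_HH = haar_level(edge)
```

Because the four sub-bands together carry all of the input energy, keeping LH, HL, and HH alongside LL loses nothing at the downsampling step, whereas a pooling operator that keeps only a low-pass output discards whatever energy the edge placed in the other three bands. In the edge example, all of the edge energy lands in the HL band under this convention.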

Comments 3:The editing of equations requires standardization. The numbering of equations should ensure that they are at the far right end of the line (e.g. equation (1) in lines 201-202).

Response 3:Corrected. We have re-formatted the layout of the Equation (1) container so that the equation itself remains horizontally centred while the equation number “(1)” is now flush with the right margin of the text column. Specifically, the equation table was widened to span the full content width (10466 dxa, equal to A4 page width minus left and right margins) and the right-hand cell holding the number was widened so that “(1)” no longer wraps. This is the only equation in the paper, so the same fix is sufficient. Location: Section 2.2.1, the table immediately following “The principle of bilinear interpolation is shown in Equation (1).”

Comments 4:How to ensure the stability of the operation of the focused diffusion pyramid network?

Response 4:

We have added a dedicated paragraph in Section 2.2.1 that documents the four mechanisms that keep the FFDPN stable, both numerically and during optimisation:

(i) The residual addition Y=ψ_{1×1}(S(X))+X provides an identity short-cut, so the gradient with respect to the input X has a unit-Jacobian component ∂Y/∂X = I + ∂ψ_{1×1}S/∂X, which prevents the vanishing-gradient behaviour that long fusion-only chains otherwise exhibit.

(ii) Each CBS block contains a BatchNorm layer, which normalises activation statistics across the spatial dimensions and the K=4 stacked kernel responses, so that the element-wise sum in StackFusion does not explode in magnitude when several large-kernel branches activate together.

(iii) The bilinear interpolation in Equation (1) is differentiable and Lipschitz-continuous with constant 1, so up-sampling does not amplify input perturbations; combined with the ADown down-sampler (a learned 1×1 stride-2 projection rather than max-pooling), the scale-alignment stage preserves bounded sensitivity to small input shifts.

(iv) The two FocusFeature units are connected in series rather than in a fully cross-coupled graph, which keeps the depth-to-width ratio bounded and prevents the gradient noise that recurrent or densely-coupled fusion topologies tend to accumulate.

Empirically, in all training runs reported in Section 3, the validation loss decreased monotonically after the first ten epochs, no NaN or divergent loss events were observed across the 300-epoch schedule, and the loss curve in Figure 10 shows that the proposed model converges faster and to a lower plateau than the baselines, which we take as additional evidence that the FFDPN-augmented model is stable on this task. Location: Section 2.2.1, the second new paragraph added after the StackFusion description.
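Mechanism (iii) can be illustrated numerically. The sketch below assumes corner-aligned bilinear up-sampling by a factor of 2, which may differ in detail from the manuscript's Equation (1), and it measures perturbations in the maximum norm, since the response does not state which norm the constant 1 refers to. The function names and the random test data are illustrative.

```python
import random


def bilinear_upsample(grid, factor=2):
    """Corner-aligned bilinear up-sampling of grid[y][x] (at least 2x2)."""
    rows, cols = len(grid), len(grid[0])
    out_rows = (rows - 1) * factor + 1
    out_cols = (cols - 1) * factor + 1
    out = []
    for i in range(out_rows):
        y = i / factor
        y0 = min(int(y), rows - 2)  # clamp so that y0 + 1 stays in range
        ty = y - y0
        row = []
        for j in range(out_cols):
            x = j / factor
            x0 = min(int(x), cols - 2)
            tx = x - x0
            top = (1 - tx) * grid[y0][x0] + tx * grid[y0][x0 + 1]
            bottom = (1 - tx) * grid[y0 + 1][x0] + tx * grid[y0 + 1][x0 + 1]
            # Each output is a convex combination of four input values.
            row.append((1 - ty) * top + ty * bottom)
        out.append(row)
    return out


def max_abs_diff(a, b):
    """Maximum-norm distance between two grids of equal shape."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


random.seed(0)
x = [[random.random() for _ in range(6)] for _ in range(6)]
x_perturbed = [[v + random.uniform(-0.1, 0.1) for v in row] for row in x]

input_diff = max_abs_diff(x, x_perturbed)
output_diff = max_abs_diff(bilinear_upsample(x), bilinear_upsample(x_perturbed))
print(output_diff <= input_diff + 1e-12)
```

Every output value is a weighted average of input values with non-negative weights that sum to one, so a perturbation can never grow in the maximum norm. In the Euclidean norm the picture differs, because up-sampling increases the number of samples.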

Reviewer 4 Report

Comments and Suggestions for Authors

Well revised.

Author Response

We are most grateful for the reviewer’s positive evaluation and for the constructive feedback in the previous round, which substantially shaped the revisions to the manuscript. Thank you.
