DAR-MDE: Depth-Attention Refinement for Multi-Scale Monocular Depth Estimation
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper introduces an innovative method for monocular depth estimation, employing an Autoencoder equipped with Multi-Scale Feature Aggregation and a Refining Attention Network. By integrating multi-scale feature extraction and a refinement network, the proposed approach significantly enhances the accuracy of depth estimation. While the experimental results showcase the method’s promising performance, several aspects necessitate further clarification and enhancement prior to publication.
1. The authors should clearly highlight the motivation and contributions, particularly discussing the motivation in the Introduction section.
2. In the Introduction section, each contribution should be concisely stated within 30 words.
3.Figure 1 presents the framework of the proposed method; it is suggested to provide a detailed description of each module.
4. It is recommended to include zoomed-in regions in the visual results to highlight texture recovery and chromatic fidelity.
5. Please provide and discuss the theoretical contributions of the work.
6. The comparison with state-of-the-art (SOTA) methods should include runtime analysis to assess practical applicability.
7. Ablation studies are critical: How do individual components contribute to the overall performance?
8. Is the low-light environment influencing the results? Some related literature from recent years is suggested. For instance:
https://doi.org/10.32604/cmes.2025.063595 https://doi.org/10.1016/j.dsp.2024.104802
9. In Eq. (11), what is the influence when those parameters are changed?
10. Are there any failure cases in this work? It is suggested to point out future work in the Conclusion section.
11. In the bar charts of Figures 7 to 11, it is recommended to label the data for each bar.
12. The article is well organized. However, the length of the paper is excessive. It is recommended to reduce redundant expressions.
Minor revisions required to address methodological clarity, comparative validation, and robustness evaluation. The approach is promising but needs rigorous benchmarking and clearer exposition.
Author Response
Reviewer 1
Reviewer Point P 1.1 — Brief description of the manuscript:
This paper introduces an innovative method for monocular depth estimation, employing an Autoencoder
equipped with Multi-Scale Feature Aggregation and a Refining Attention Network. By integrating multi-scale feature extraction and a refinement network, the proposed approach significantly enhances the accuracy of depth estimation. While the experimental results showcase the method’s promising performance,
several aspects necessitate further clarification and enhancement prior to publication.
Reviewer Point P 1.2 — The authors should clearly highlight the motivation and contributions, particularly discussing the motivation in the Introduction section.
Reply:
We thank the reviewer for this insightful suggestion. In response, we have substantially revised
the Introduction (Section 1) to make both the motivation and the contributions of our work much
clearer.
• Motivation clarified: We explicitly state that many existing monocular depth estimation
methods, while accurate on average, often fail to consistently handle scale variations and
struggle to preserve sharp structural edges, especially around object boundaries. We highlight that this limitation is critical for industrial, robotic, and real-time applications, which motivates our design of a framework that explicitly aggregates multi-scale context and applies
attention refinement to ensure structural consistency.
• Contributions highlighted: We also rewrote the contributions at the end of the Introduction to clearly list four main points: (1) the integration of Multi-Scale Feature Aggregation
(MSFA) for handling scale diversity, (2) the use of a Refining Attention Network (RAN) for
emphasizing important structural regions, (3) a multi-scale curvilinear saliency loss that enforces detail preservation across resolutions, and (4) extensive validation on both indoor and
outdoor datasets demonstrating strong generalization without reliance on additional sensors.
The explanations have been added on pages 1 and 2 in the revised version.
Reviewer Point P 1.3 — In the Introduction section, each contribution should be concisely stated within
30 words.
Reply: We appreciate the reviewer’s helpful suggestion. In the revised manuscript, we have
carefully rephrased each contribution statement in the Introduction to be within 30 words, making
them clearer and more direct.
Reviewer Point P 1.4 — Figure 1 presents the framework of the proposed method; it is suggested to
provide a detailed description of each module.
Reply:
We thank the reviewer for this observation. We have provided a detailed description of each
module composing our framework in Section 3, under the Model Architecture subsection.
Specifically:
• The Autoencoder network, including the encoder and decoder layers, is thoroughly described
with architectural parameters and layer sizes.
• The Multi-Scale Feature Aggregation (MSFA) network is detailed, explaining how multi-scale
features are upsampled, concatenated, and processed.
• The Refining Attention Network (RAN) is also fully described, including the formulation of
how it utilizes similarity computations and subsequent convolutional refinements.
Reviewer Point P 1.5 — It is recommended to include zoomed-in regions in the visual results to highlight texture recovery and chromatic fidelity.
Reply:
We thank the reviewer for this valuable suggestion. In our current manuscript, we have provided qualitative results in Figure 12 (for the NYU Depth-v2 dataset) and Figure 14 (for the SUN
RGB-D dataset), which already show examples of various challenging scenarios—such as small
objects, lighting variations, dark areas, and geometrically complex regions—across multiple rows.
While these figures demonstrate the capability of our model to handle texture recovery and
structural preservation under diverse conditions, they do not currently include explicit zoom-in
boxes.
We appreciate the reviewer’s recommendation and plan to incorporate zoomed-in subregions in these figures in a subsequent extended version of this work or a follow-up publication
to more closely highlight texture recovery and local fidelity. For the current paper, we aimed to
maintain a uniform scale across all visual comparisons to ensure fair and consistent side-by-side
assessments.
Reviewer Point P 1.6 — Please provide and discuss the theoretical contributions of the work.
Reply:
We thank the reviewer for raising this important point.
Our work is primarily a methodological and architectural advancement for monocular depth
estimation, but it is also grounded in strong theoretical foundations drawn from classical computer
vision and machine learning principles. To address this, we have added a new subsection at the
end of Section 3 to explicitly outline the theoretical motivations of each core module:
• Multi-Scale Feature Aggregation (MSFA) module is rooted in scale-space theory, a well-established concept in computer vision that supports the extraction of visual structures at
multiple spatial resolutions. This theory justifies our use of multi-resolution fusion to capture
both local geometric texture and global scene layout.
• Refining Attention Network (RAN) leverages cosine similarity to compute attention between depth probability maps and multi-scale features. This reflects probabilistic attention mechanisms found in transformer models and is theoretically aligned with information-theoretic relevance modeling.
• Multi-scale Curvilinear Saliency Loss: draws from differential geometry, where curvature-based saliency maps are used to preserve object boundaries and geometric consistency
during training. This encourages the network to align estimated depth transitions with perceptually important edges.
To clarify these theoretical underpinnings, we have added a dedicated short paragraph at the
end of Section 3.1 (“Problem Formulation”) in the revised manuscript and highlighted the conceptual basis of each module.
Reviewer Point P 1.7 — The comparison with state-of-the-art (SOTA) methods should include runtime
analysis to assess practical applicability.
Reply:
We thank the reviewer for this important suggestion.
In our manuscript, we reported the runtime of our own method in Section 4.3 (Parameter
Settings), showing that it achieves inference speeds of approximately 19.2 ms per image for
NYU Depth-v2 and SUN RGB-D, and 26 ms per image for Make3D, thus demonstrating practical
deployability.
However, as noted in the caption and data of Table 5, most of the recent SOTA methods we
compare against do not provide explicit runtime information in their original papers. This unfortunately prevents a fair quantitative side-by-side runtime comparison.
To partially address practical applicability, we ensured that Table 5 explicitly reports the type
of model (the encoder/backbone used by each method, e.g., SENet-154, Swin-Large, ResNeXt-101), which gives an indirect indication of expected computational load, helping readers assess
relative complexity.
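For context on how such per-image latencies are typically obtained, below is a generic, hedged timing sketch in PyTorch; the model object, input resolution, and warm-up/run counts are illustrative placeholders, not the authors' actual benchmarking setup.

```python
import time
import torch

def measure_latency(model, input_size=(1, 3, 480, 640), n_warmup=10, n_runs=100):
    """Average per-image GPU inference time in milliseconds (generic harness).

    `model` stands in for any depth network; the input resolution here is
    only an assumption, not the paper's exact preprocessing.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)

    with torch.no_grad():
        for _ in range(n_warmup):          # warm-up to exclude CUDA initialization costs
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()       # ensure queued kernels have finished
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

    return 1000.0 * elapsed / n_runs       # milliseconds per image

# Example (hypothetical model object): measure_latency(dar_mde_model)
```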
Reviewer Point P 1.8 — Ablation studies are critical: How do individual components contribute to the
overall performance?
Reply:
We thank the reviewer for underlining the importance of understanding the contribution of each
component.
In our manuscript, we conducted detailed ablation studies, presented in Section 4.5 (Ablation Study) and summarized in the accompanying ablation tables and figures. These
experiments explicitly demonstrate how individual modules—namely skip connections (BLSC),
Multi-Scale Feature Aggregation (MSFA), Refining Attention Network (RAN), and Multi-scale Loss
(ML)—incrementally improve depth estimation performance.
The results reveal:
• Skip connections provide initial performance gains by better preserving spatial information,
reducing relative error (rel) by ≈0.0008 compared to the baseline (BL).
• MSFA helps the network aggregate scale-aware features, improving δ < 1.25 by 1.24% and
reducing rel by 0.0041.
• RAN sharpens object boundaries by attending to salient depth regions, increasing δ < 1.25
by 2.04% and lowering rel by 0.0067 relative to BL.
• ML enforces consistency across scales, boosting δ < 1.25 by 2.11% and reducing rel by
0.0071.
• Combining all components (BLSC+ML+MSFA+RAN) achieves the best overall performance,
underscoring the complementary benefits of each module.
This comprehensive analysis clearly isolates and quantifies the impact of each architectural
component on model accuracy and robustness, providing strong evidence of their necessity.
Reviewer Point P 1.9 — Is the low-light environment influencing the results? Some related literature
from recent years is suggested. For instance: https://doi.org/10.32604/cmes.2025.063595
https://doi.org/10.1016/j.dsp.2024.104802
Reply:
We appreciate the reviewer bringing up the influence of low-light conditions and suggesting
relevant literature.
In our experiments (Section 4), particularly Figures 12 and 14, we intentionally included examples that feature challenging lighting conditions (e.g., dark scenes, shadows, uneven illumination).
As shown in these visualizations, our model is generally able to preserve reasonable depth structure even in low-light scenarios, aided by the multi-scale and attention mechanisms that exploit
contextual cues beyond local intensity.
However, we acknowledge that monocular depth estimation from RGB alone remains sensitive
to extreme low-light scenes where textural and chromatic cues are heavily degraded. Our current
method does not explicitly incorporate illumination correction or dedicated low-light enhancement
as explored in [8, 11]. We thank the reviewer for these references and will consider integrating
illumination-aware preprocessing or embedding low-light feature adaptation modules in future work
to further strengthen robustness under such conditions.
We will also cite these works in our revised manuscript to better situate our contribution within
the context of illumination-challenged depth estimation.
Reviewer Point P 1.10 — In Eq. (11), what is the influence when those parameters are changed? Are
there any failure cases in this work? It is suggested to point out future work in the Conclusion section.
Reply:
We thank the reviewer for these insightful questions.
Regarding Eq. (11), the parameters α and β control the trade-off between the multi-scale
loss ML (which emphasizes consistent predictions across multiple scales and sharper object
boundaries) and the content loss L (which focuses on the pixel-wise and structural similarity to the
ground truth).
Increasing α places more emphasis on multi-scale structure alignment, typically improving
boundary sharpness and small-object reconstruction but can slightly increase global depth inconsistencies. Conversely, increasing β prioritizes overall content fidelity, which can smooth out finer
details but tends to reduce large-scale depth errors. Through empirical trials (evaluating over a
range from 0.2 to 0.8), we found that the chosen balance of α = 0.6 and β = 0.4 produced optimal
performance across both indoor and outdoor datasets (lowest average rel and log10 errors).
As for failure cases, like many monocular methods, our model occasionally struggles in extremely reflective regions (e.g., mirrors, glass facades) or under severe occlusions where monocular cues are ambiguous. This aligns with general challenges noted in recent literature.
To address these limitations, we have now updated the Conclusion to explicitly mention exploring the integration of uncertainty-aware modules and possible fusion with lightweight depth priors
(from stereo/self-supervised cues) as future directions to mitigate such failures.
We have added an explanation at the end of Section 3.3 on page 11: “The weights α and β control
the balance between structural sharpness and global depth fidelity. A higher α emphasizes multi-scale structural refinement, improving edge clarity and small-object depth accuracy, while a higher β
favors overall smoothness and pixel-wise consistency.”
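For the reader's convenience, the trade-off described above can be written compactly. This is only a sketch of the assumed form of Eq. (11), reconstructed from the description given here (multi-scale loss plus content loss, weighted by α and β), not a verbatim reproduction of the manuscript's notation.

```latex
% Assumed form of Eq. (11): a weighted sum of the multi-scale curvilinear-saliency
% loss (L_MS) and the content loss (L_C); alpha = 0.6 and beta = 0.4 in the experiments.
\mathcal{L}_{\mathrm{total}} \;=\; \alpha\,\mathcal{L}_{\mathrm{MS}} \;+\; \beta\,\mathcal{L}_{\mathrm{C}}
```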
Reviewer Point P 1.11 — In the bar charts of Figures 7 to 11, it is recommended to label the data for
each bar.
Reply: We thank the reviewer for this helpful comment. We appreciate the suggestion to add
numerical labels on each bar in Figures 7 to 11. We would like to clarify that the precise numerical
values corresponding to these bar charts are already comprehensively reported in the accompanying Tables 1 to 4. This ensures that all exact data points are fully documented and easily
accessible alongside the graphical summaries. To maintain visual clarity and avoid redundancy,
we opted not to duplicate the values directly on the bars, given they are already explicitly provided
in the tables for precise reference.
Reviewer Point P 1.12 — The article is well organized. However, the length of the paper is excessive. It
is recommended to reduce redundant expressions.
Reply:
Thank you for your valuable feedback. We appreciate your positive assessment of the organization of our manuscript. In response to your suggestion, we carefully reviewed the text to
identify and remove redundant expressions and streamline lengthy explanations, particularly in
the methodological and experimental sections. This helped reduce the overall length while preserving the essential technical content and clarity of our contributions.
Reviewer 2 Report
Comments and Suggestions for Authors
In this manuscript titled "DAR-MDE: Depth-Attention Refinement for Multi-scale Monocular Depth Estimation", the authors propose an approach that utilizes an Autoencoder with Multi-Scale Feature Aggregation and a Refining Attention Network. However, this article has several serious issues:
1. The abstract mentions that there is still room for improvement in accuracy and robustness, but does this article's research have anything to do with robustness?
2. The first chapter contains a lot of irrelevant content. I cannot figure out what problem or limitation this article aims to solve. From the abstract, it seems to be about improving accuracy, but why does the first chapter also mention "a large amount of training data", "domain adaptation", and reducing model complexity? Overall, the introduction in the first chapter is disorganized and the research purpose is unclear.
3. The authors' model seems to have no work on reducing the number of model parameters, so how can it reduce computational costs?
4. This article seems to have no substantial innovation but rather a patchwork of many deep learning-related modules.
In conclusion, I do not recommend the publication of this article.
Author Response
Reviewer 2
Reviewer Point P 0.1 — Brief description of the manuscript:
In this manuscript titled ”DAR-MDE: Depth-Attention Refinement for Multi-scale Monocular Depth Estimation”, the authors propose an approach that utilizes an Autoencoder with Multi-Scale Feature Aggregation
and a Refining Attention Network.
Reviewer Point P 0.2 — The abstract mentions that there is still room for improvement in accuracy and
robustness, but does this article’s research have anything to do with robustness?
Reply:
We appreciate the reviewer’s attention to the consistency between the abstract and the technical contributions.
Yes, this work explicitly addresses robustness in the context of monocular depth estimation
from RGB images, particularly robustness to:
• Scale variations (small and large objects at different distances),
• Occlusion and ambiguous textures,
• Scene diversity (tested via cross-dataset validation on NYU Depth v2 and SUN RGB-D).
The architectural design is specifically tailored to improve generalization and robustness:
• The Multi-Scale Feature Aggregation (MSFA) module enables the network to capture geometric structures at various resolutions, which improves robustness to object scale and
viewpoint changes.
• The Refining Attention Network (RAN) focuses on context-aware regions with high structural information, improving prediction consistency even when local texture cues are weak or
misleading.
We have also updated the Conclusion and Abstract to clarify that robustness is defined in
terms of structural preservation, generalization to unseen scenes, and handling challenging visual
conditions — not hardware robustness or domain adaptation.
Reviewer Point P 0.3 — The first chapter contains a lot of irrelevant content. I cannot figure out
what problem or limitation this article aims to solve. From the abstract, it seems to be about improving
accuracy, but why does the first chapter also mention ”a large amount of training data”, ”domain adaptation”, and reducing model complexity? Overall, the introduction in the first chapter is disorganized and
the research purpose is unclear.
Reply:
Thank you for this important feedback. We carefully revised the Introduction to remove or
condense unrelated discussions, especially mentions of “domain adaptation” and “training data
volume,” which were originally included as broader context.
To address your concern:
• We restructured the Introduction to clearly emphasize that the main problem we address is
the lack of accuracy and structural consistency in current monocular depth methods under
real-world variability (e.g., scale changes, cluttered scenes).
• All irrelevant or loosely connected themes (e.g., domain adaptation, complexity trade-offs)
have been either removed or relocated to the Related Work section, where their relevance is
more appropriate.
• The revised Introduction now clearly defines the research gap and articulates our solution
path — focusing on accuracy and robustness in depth prediction via multi-scale aggregation
and refinement.
Reviewer Point P 0.4 — The authors’ model seems to have no work on reducing the number of model
parameters, so how can it reduce computational costs?
Reply: We appreciate this observation and agree that we do not introduce new parameter reduction techniques.
To clarify:
• Our goal is not to minimize the number of parameters, but rather to maintain a balance
between model complexity and real-time applicability.
• While our encoder (SENet-154) is moderately heavy, the decoder, MSFA, and RAN components are designed with shallow layers (e.g., 3x3 and 5x5 convolutions, simple bilinear
upsampling), enabling fast inference while maintaining accuracy.
• As reported in Section 4.3, our method achieves 19 ms per image inference time on
NYU/SUN datasets, making it suitable for real-time systems — a key practical advantage
even if the parameter count is not minimized.
We have revised Section 4.3 and the Conclusion to clearly reflect this distinction: the proposed
model emphasizes efficiency at inference time without explicitly targeting parameter minimization.
Reviewer Point P 0.5 — This article seems to have no substantial innovation but rather a patchwork of
many deep learning-related modules.
Reply:
We thank the reviewer for this candid comment.
While our architecture builds upon established components (e.g., autoencoders, attention
mechanisms), we believe the novelty lies in the way these modules are integrated and tailored for
depth estimation. Rather than introducing new standalone blocks,
we contribute a task-specific architectural fusion that yields meaningful gains in scale-awareness,
structural preservation, and runtime, as validated through extensive ablations and benchmarking:
• The MSFA module is designed to preserve distinct scale-specific features via channel-wise
concatenation before refinement, rather than fusing them early or via addition — a strategy
we show improves performance.
• The RAN module introduces a novel use of cosine similarity between depth probability
maps and multi-scale features to guide attention — a non-trivial mechanism inspired by
probabilistic depth structures.
• The Curvilinear Saliency-based loss is adapted across multiple decoder scales — ensuring boundary-aware supervision at each depth resolution.
Moreover, our ablation studies (Section 4.5) demonstrate that each of these modules contributes incrementally and meaningfully to the final performance. While individually these components may not be entirely new, the synergistic design and novel formulations (especially in attention
and loss) represent a meaningful methodological contribution in monocular depth estimation.
We have added a clarification in Section 1 and highlighted these contributions in Section 3 with
explicit formulation and ablation results to better demonstrate their value.
Reviewer 3 Report
Comments and Suggestions for Authors
This manuscript proposes a monocular depth estimation algorithm (DAR-MDE) based on an autoencoder, Multi-Scale Feature Aggregation (MSFA), and Refining Attention Network (RAN), aiming to improve the accuracy and robustness of depth maps in complex scenes. The research topic has clear academic value and application background. The experimental design covers multiple public indoor and outdoor datasets, and the results show that the model performance is superior to existing mainstream methods. However, there is room for improvement in the manuscript regarding method details, experimental verification, and theoretical analysis, and further revisions and improvements are recommended.
1. There are certain deficiencies in the details of the method description. The encoder adopts SENet-154, but it is not specified whether all layers of the pre-trained weights are used or only part of the layers are used for feature extraction; the "concatenate" method (channel concatenation or element-wise addition) of MSFA for multi-scale features is not explained; the specific formula and implementation details of the "similarity calculation between coarse depth probability vectors and features" in RAN are missing. It is recommended to supplement the detailed structural parameters of the autoencoder, MSFA, and RAN.
2. The experimental design has certain limitations. The model's performance in challenging scenarios such as low-texture (e.g., white walls), strong reflection (e.g., glass), and severe occlusion has not been verified, making it difficult to support the conclusion of "strong robustness". It is recommended to supplement qualitative comparisons (e.g., visualized depth maps) and quantitative analyses for such scenarios; the manuscript only compares monocular RGB methods and does not compare with methods fusing multi-modal data such as LiDAR and infrared (e.g., [9][29]), making it impossible to clarify its unique advantages in single-input scenarios.
3. The result analysis is not in-depth enough. The model performs excellently on the SUN RGB-D (cross-dataset), but the reasons (e.g., the adaptability of MSFA to scene differences, the capture of semantic information by RAN) are not analyzed. It is recommended to reveal the intrinsic logic of generalization ability by visualizing attention heatmaps or feature distributions; the scenes (e.g., near-view/far-view, indoor/outdoor) where the model has larger errors are not analyzed. For example, the RMSE (6.531) of the model on the Make3D dataset is slightly higher than that of [32] (6.522), and the sources of errors need to be explained.
4. The formatting of references is inconsistent, with some citations missing and some citation formats incorrect. For instance, [23,32,54–57] should be standardized as [23,32,54,55,56,57]; the information in figures and tables is incomplete. Figure 1 (overall framework) does not label the input and output dimensions of each module; the data of "δ<1.25²" and "δ<1.25³" for some methods in Tables 3 and 5 are missing, which affects the completeness of the comparison.
This manuscript has a solid research foundation. It is recommended that the authors deepen the theoretical analysis and update the technical coverage according to the above suggestions. If the above issues can be addressed, consideration may be given to acceptance after revision.
Author Response
Reviewer 3
Reviewer Point P 0.1 — Brief description of the manuscript.
This manuscript proposes a monocular depth estimation algorithm (DAR-MDE) based on an autoencoder,
Multi-Scale Feature Aggregation (MSFA), and Refining Attention Network (RAN), aiming to improve the
accuracy and robustness of depth maps in complex scenes. The research topic has clear academic value
and application background. The experimental design covers multiple public indoor and outdoor datasets,
and the results show that the model performance is superior to existing mainstream methods. However,
there is room for improvement in the manuscript regarding method details, experimental verification, and
theoretical analysis, and further revisions and improvements are recommended.
Reviewer Point P 0.2 — There are certain deficiencies in the details of the method description. The
encoder adopts SENet-154, but it is not specified whether all layers of the pre-trained weights are used
or only part of the layers are used for feature extraction; the ”concatenate” method (channel concatenation or element-wise addition) of MSFA for multi-scale features is not explained; the specific formula
and implementation details of the ”similarity calculation between coarse depth probability vectors and
features” in RAN are missing. It is recommended to supplement the detailed structural parameters of the
autoencoder, MSFA, and RAN.
Reply:
Thank you very much for your thorough reading and insightful feedback. We sincerely appreciate your careful examination of our methodology. We have addressed each of your concerns and
have updated the manuscript to include the following clarifications and additional details:
• Encoder details: We clarified in the revised manuscript that our method utilizes all layers of the SENet-154 backbone up to the final global pooling layer (before its classification
head), thereby exploiting the complete hierarchical feature extraction capability of the network. These layers are initialized with ImageNet pre-trained weights and further fine-tuned
end-to-end on our depth estimation task. This information is now explicitly stated in Section
3.2.1.
• MSFA concatenation method: We specified in Section 3.2.2 that the multi-scale features
from different decoder stages are combined via channel-wise concatenation (rather than
element-wise addition). This preserves the distinct scale-specific information across the
feature maps, which is crucial for robust depth estimation. This information is now explicitly
stated in Section 3.2.2.
• RAN similarity computation: In Section 3.2.3, we provided the exact formulation of the
similarity measure used in the Refining Attention Network. Specifically, we compute the
cosine similarity between the normalized coarse depth probability vectors and the multi-scale
feature vectors at each spatial location, formalized by:
Sim(i, j) = (M_{i,j} · P_{i,j}) / (∥M_{i,j}∥ ∥P_{i,j}∥),
where M_{i,j} is the aggregated MSFA feature vector and P_{i,j} is the normalized probability
vector from the coarse depth at pixel (i, j). This similarity guides the attention weighting
within the refinement process. This information is now explicitly stated in Section 3.2.3 (a minimal illustrative sketch of the concatenation and similarity steps is provided after this list).
• Structural details: In addition to these clarifications, we also expanded the architectural descriptions to provide more comprehensive structural parameters of the autoencoder, MSFA,
and RAN components, including kernel sizes, channel dimensions, and upsampling operations. These details can now be found within the respective subsections.
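To make the channel-wise concatenation in MSFA and the cosine-similarity weighting in RAN concrete, the following is a minimal PyTorch-style sketch. The tensor shapes, channel counts, and function names are our own illustrative assumptions (in particular, the coarse depth probabilities are assumed to have been projected to the same channel dimension as the MSFA features); this is not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def msfa_concat(decoder_feats, target_size):
    """Channel-wise concatenation of multi-scale decoder features (illustrative).

    decoder_feats: list of tensors [B, C_k, H_k, W_k] from different decoder stages.
    Each map is bilinearly upsampled to a common resolution, then concatenated
    along the channel dimension (not added element-wise).
    """
    upsampled = [F.interpolate(f, size=target_size, mode="bilinear",
                               align_corners=False) for f in decoder_feats]
    return torch.cat(upsampled, dim=1)            # [B, sum(C_k), H, W]

def ran_cosine_attention(msfa_feat, coarse_prob):
    """Per-pixel cosine similarity between MSFA features (M) and coarse depth probabilities (P).

    msfa_feat:   [B, C, H, W] aggregated multi-scale features.
    coarse_prob: [B, C, H, W] coarse depth probability volume, assumed to share
                 the channel dimension of msfa_feat.
    Returns features reweighted by the similarity map (values in [-1, 1]).
    """
    sim = F.cosine_similarity(msfa_feat, coarse_prob, dim=1, eps=1e-8)  # [B, H, W]
    return msfa_feat * sim.unsqueeze(1)

# Example with dummy tensors:
feats = [torch.randn(1, 64, 30, 40), torch.randn(1, 64, 60, 80), torch.randn(1, 64, 120, 160)]
m = msfa_concat(feats, target_size=(120, 160))    # [1, 192, 120, 160]
p = torch.randn(1, 192, 120, 160)                 # stand-in coarse depth probabilities
refined = ran_cosine_attention(m, p)
```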
Reviewer Point P 0.3 — The experimental design has certain limitations. The model’s performance in
challenging scenarios such as low-texture (e.g., white walls), strong reflection (e.g., glass), and severe
occlusion has not been verified, making it difficult to support the conclusion of ”strong robustness”. It is
recommended to supplement qualitative comparisons (e.g., visualized depth maps) and quantitative analyses for such scenarios; the manuscript only compares monocular RGB methods and does not compare
with methods fusing multi-modal data such as LiDAR and infrared (e.g., [9][29]), making it impossible
to clarify its unique advantages in single-input scenarios.
Reply:
Experimental Design and Challenging Scenarios: We thank the reviewer for pointing out this
important aspect regarding the evaluation of robustness in more complex scenes. Our study
specifically focuses on monocular RGB-only depth estimation, and the design and evaluation are
grounded within this constraint. While we agree that scenarios such as low-texture surfaces (e.g.,
white walls), reflective objects (e.g., glass), and severe occlusion are challenging, these limitations
are intrinsic to all monocular methods and are not exclusive to our approach.
To mitigate these effects, our proposed model incorporates the Multi-Scale Feature Aggregation (MSFA) module, which enhances context awareness across spatial resolutions, and the
Refining Attention Network (RAN), which guides the model to focus on structurally and semantically informative regions. These architectural components are designed specifically to help in
situations where explicit texture cues are limited.
While we have not included dedicated benchmarks for these cases due to dataset limitations,
some qualitative examples in the NYU Depth v2 and SUN RGB-D datasets do contain such challenges, and our method demonstrates competitive or superior visual quality compared to state-of-the-art baselines. We have revised the manuscript to explicitly acknowledge this limitation and
clarify that the robustness claims are made within the scope of monocular RGB input.
Comparison with Multi-Modal Methods (e.g., LiDAR, Infrared): We acknowledge that methods
using LiDAR or infrared sensors (e.g., [9][29]) may achieve higher accuracy in complex environments. However, our objective is to develop a lightweight and generalizable depth estimation
framework that does not rely on additional hardware. Hence, we focus on monocular RGB-based approaches to highlight the performance ceiling achievable using a single input modality. Comparing our results with multi-modal fusion models would not offer a fair or meaningful evaluation, as
the input information and application contexts differ significantly.
To avoid confusion, we have updated the text in both the abstract and conclusions to more
precisely frame our contributions within the scope of monocular RGB-based depth estimation.
We have added a paragraph on page 17 giving an explicit interpretation of why the model
works well in dim light, occlusion, etc.
Reviewer Point P 0.4 — The result analysis is not in-depth enough. The model performs excellently
on the SUN RGB-D (cross-dataset), but the reasons (e.g., the adaptability of MSFA to scene differences,
the capture of semantic information by RAN) are not analyzed. It is recommended to reveal the intrinsic
logic of generalization ability by visualizing attention heatmaps or feature distributions; the scenes (e.g.,
near-view/far-view, indoor/outdoor) where the model has larger errors are not analyzed. For example,
the RMSE (6.531) of the model on the Make3D dataset is slightly higher than that of [32] (6.522), and
the sources of errors need to be explained.
Reply:
We appreciate the reviewer’s suggestion to provide a deeper analysis of the model’s generalization behavior. In the revised discussion, we detail how the architectural components—particularly
the Refining Attention Network (RAN) and Multi-Scale Feature Aggregation (MSFA) module—contribute
to the model’s strong cross-dataset performance. As shown in Figure 12 (NYU Depth-v2) and Figure 14 (SUN RGB-D), our model consistently produces structurally accurate depth maps across
diverse environments, including scenes with complex geometry and varying lighting conditions.
The RAN module facilitates spatially aware refinement by attending to semantically important
regions such as edges, structural planes, and depth discontinuities. Meanwhile, the MSFA module
captures a rich representation of features across multiple resolutions, improving robustness to
scale variation and enabling effective generalization to previously unseen indoor scenes. These
design choices are further supported by the performance metrics presented in Table 5 (cross-dataset evaluation), where our model demonstrates consistent accuracy across both NYU and SUN RGB-D datasets.
Together, these components enhance the model’s ability to infer depth in challenging scenes,
even when local texture cues are weak, by leveraging contextual information and multi-scale structure—a key factor in the observed generalization.
Reviewer Point P 0.5 — The formatting of references is inconsistent, with some citations missing and
some citation formats incorrect. For instance, [23,32,54–57] should be standardized as [23,32,54,55,56,57];
the information in figures and tables is incomplete. Figure 1 (overall framework) does not label the input
and output dimensions of each module; the data of δ < 1.25² and δ < 1.25³ for some methods in Tables 3 and
5 are missing, which affects the completeness of the comparison.
Reply:
We thank the reviewer for the detailed and constructive comments.
1. Reference Formatting: We have carefully reviewed and corrected all reference formatting
issues throughout the manuscript. Citation ranges have been standardized. For example,
instances such as “[23,32,54–57]” have been replaced with “[23,32,54,55,56,57]” to ensure
consistency and compliance with the required citation style.
2. Figure 1 (Overall Framework): We have updated Figure 1 to include the input and output dimensions of the overall network. The detailed dimensions of each intermediate layer
are already provided in the manuscript under the section Proposed Methodology, subsection Model Architecture. These clarifications enhance the readability and facilitate a better
understanding of the model architecture.
3. Missing Metrics in Tables: We acknowledge the missing values for δ < 1.25² and δ < 1.25³
for one method, namely Zhao et al. [61], in Table 5. These values were not provided in
the original publication. To address this, we have added a note in the caption of the table
clarifying this omission:
Note: “–” indicates that the respective metric was not reported in the original paper.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
In the manuscript titled "DAR-MDE: Depth-Attention Refinement for Multi-scale Monocular Depth Estimation", the authors propose an approach that employs an Autoencoder with a Multi-Scale Feature Aggregation and a Refining Attention Network. However, this article has the following rather serious issues:
1. In the first chapter, the authors added in the revised version that existing methods still have difficulties in capturing scale variations and preserving clear structural edges. But why do existing methods have this problem, and why can the module proposed by the authors solve it? In short, the research purpose and background of the authors do not correspond well.
2. The MSFA proposed by the authors seems to adjust the resolution of the input image as a whole and then send it as features of different scales to the MSFA. But if the resolution of the entire image is adjusted, the targets of different depths within the image will also change proportionally. Then, after feature processing with convolution, it is still the receptive field of the entire image. Why can this processing method enhance the perception of scale variations of different objects in the image?
3. In the loss function section, the authors propose a multi-scale loss function, aiming to calculate the loss between the feature maps output by different layers in the decoder and the ground truth. This part is rather strange. If the ground truth is downsampled to maintain resolution consistency, the resolution of the ground truth itself will also decrease. Since the resolution of the ground truth has decreased, what is the significance of calculating the loss for it, and will the model develop in an inaccurate direction?
Author Response
The authors would like to thank the reviewers for their critical and constructive comments, which have certainly improved the content and presentation of the manuscript. Below we provide point-by-point responses to all reviewers’ comments and concerns.
2.1 Concern: In the first chapter, the authors added in the revised version that existing methods still have difficulties in capturing scale variations and preserving clear structural edges. But why do existing methods have this problem, and why can the module proposed by the authors solve it? In short, the research purpose and background of the authors do not correspond well.
Response:
Thank you for your insightful comment. We appreciate your observation regarding the need to better align the research motivation with the identified limitations and proposed solution.
To address this, we have revised the Introduction (Section 1) to more explicitly explain the shortcomings of existing methods and the rationale behind our design. Specifically, we clarify that many existing monocular depth estimation approaches rely on fixed-scale or single-scale feature representations, which limits their ability to accurately model objects of varying sizes or distances. Moreover, due to architectural and loss function constraints, these methods tend to oversmooth depth maps—resulting in blurred edges and poor structural detail, particularly near object boundaries.
In response to these limitations, our model introduces two key modules:
The Multi-Scale Feature Aggregation (MSFA) module, which aggregates decoder features at multiple scales to enhance the model’s ability to handle scale variations across scenes.
The Refining Attention Network (RAN), which employs an attention mechanism that adaptively emphasizes regions of strong depth change—particularly object boundaries—to preserve edge sharpness and structural integrity.
We have also updated Sections 3.2.2 and 3.2.3 to state this motivation more directly at the end of each subsection, tying the design of these modules back to the challenges highlighted in the Introduction.
We hope these updates make the research purpose and background more coherent and better aligned with our technical contributions.
----------------------------------------------------------------------------
2.2 Concern: The MSFA proposed by the authors seems to adjust the resolution of the input image as a whole and then send it as features of different scales to the MSFA. But if the resolution of the entire image is adjusted, the targets of different depths within the image will also change proportionally. Then, after feature processing with convolution, it is still the receptive field of the entire image. Why can this processing method enhance the perception of scale variations of different objects in the image?
Response
Thank you for your insightful question. We would like to clarify that the Multi-Scale Feature Aggregation (MSFA) module does not simply resize the input image to different resolutions and process them. Instead, MSFA operates on intermediate feature maps extracted from multiple layers within the decoder of our network. These feature maps inherently differ in spatial resolution and receptive field size due to the hierarchical structure of the encoder-decoder architecture.
Each decoder layer outputs feature representations that capture depth cues at distinct scales — from coarse, global contextual information (large receptive fields) to fine, local details (small receptive fields). By aggregating these multi-scale features, MSFA effectively combines complementary information relevant to objects of varying sizes and distances.
Therefore, MSFA enhances scale perception not by uniformly resizing the entire image, but by fusing hierarchical feature maps with diverse spatial resolutions and receptive fields. This multi-resolution aggregation allows the network to better adapt to scale variations within the scene, which would not be achievable by simply resizing the input image and applying convolutions.
We have added a detailed clarification on this point in the Multi-scale Features Section on “Page 8” to ensure a clear understanding of the MSFA’s operation and motivation.
----------------------------------------------------------------------------
2.3 Concern: In the loss function section, the authors propose a multi-scale loss function, aiming to calculate the loss between the feature maps output by different layers in the decoder and the ground truth. This part is rather strange. If the ground truth is downsampled to maintain resolution consistency, the resolution of the ground truth itself will also decrease. Since the resolution of the ground truth has decreased, what is the significance of calculating the loss for it, and will the model develop in an inaccurate direction?
Response
Thank you for raising this important point regarding the multi-scale loss function.
We downsample the ground truth depth maps to match the spatial resolutions of the corresponding decoder outputs at different scales. While it is true that this reduces the resolution of the ground truth, the multi-scale supervision strategy is designed to guide the model to learn depth representations at various levels of detail progressively.
By computing the loss at multiple scales, the model is encouraged to capture both coarse global structures and fine local details. The lower-resolution ground truth provides meaningful supervisory signals for the coarser decoder features, helping the network learn robust depth cues even at reduced spatial scales. This hierarchical supervision stabilizes training and prevents the model from overfitting to fine details prematurely.
Moreover, the use of the Curvilinear Saliency (CS) loss further enhances learning by emphasizing important structural edges and boundaries across scales, mitigating potential blurriness that may result from downsampling.
In summary, although the ground truth resolution decreases at coarser scales, this multi-scale loss framework effectively guides the network to develop accurate and scale-consistent depth predictions, improving overall performance rather than leading to inaccuracies.
We have clarified this point in the revised manuscript on “Page 10” to ensure the rationale behind the multi-scale loss design is well understood.
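As a concrete illustration of this supervision scheme, here is a minimal sketch: each decoder prediction is compared against a ground-truth depth map downsampled to its resolution, and the per-scale losses are accumulated. The L1 term and the equal per-scale weights are stand-ins chosen for brevity, not the paper's exact content and curvilinear-saliency losses.

```python
import torch
import torch.nn.functional as F

def multi_scale_depth_loss(preds, gt_depth, scale_weights=None):
    """Multi-scale supervision: each decoder output vs. a resized ground truth.

    preds:    list of predicted depth maps [B, 1, H_k, W_k], coarse to fine.
    gt_depth: full-resolution ground-truth depth [B, 1, H, W].
    """
    if scale_weights is None:
        scale_weights = [1.0] * len(preds)        # illustrative: equal weight per scale
    total = 0.0
    for w, pred in zip(scale_weights, preds):
        gt_k = F.interpolate(gt_depth, size=pred.shape[-2:],
                             mode="bilinear", align_corners=False)  # downsampled GT
        total = total + w * F.l1_loss(pred, gt_k)
    return total

# Example with dummy coarse-to-fine decoder outputs:
gt = torch.rand(2, 1, 240, 320)
preds = [torch.rand(2, 1, 60, 80, requires_grad=True),
         torch.rand(2, 1, 120, 160, requires_grad=True),
         torch.rand(2, 1, 240, 320, requires_grad=True)]
loss = multi_scale_depth_loss(preds, gt)
loss.backward()   # gradients flow to every scale
```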
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have made comprehensive and detailed revisions in response to the previous review comments, supplementing methodological details, improving experimental analysis, and standardizing formatting. These revisions have significantly enhanced the completeness and rigor of the paper. The revised content fully addresses the core concerns raised earlier, and the innovativeness of the model design and the reliability of the experimental results have been further verified. Currently, the paper has basically met the publication standards and can be accepted with only minor adjustments.
1. The authors have explicitly supplemented the technical details of key modules, including:
1) For the encoder, it is clearly stated that all layers of SENet-154 (up to the global pooling layer) are used, with end-to-end fine-tuning based on ImageNet pre-trained weights, clarifying the completeness of feature extraction;
2) The Multi-Scale Feature Aggregation (MSFA) adopts channel-wise concatenation instead of element-wise addition, preserving the uniqueness of features at each scale, and supplements structural parameters such as kernel sizes and channel dimensions;
3) The cosine formula for similarity calculation in the Refining Attention Network (RAN) is explicitly provided, clearly explaining the correlation mechanism between coarse depth probability vectors and multi-scale features. These supplements ensure good reproducibility of the method.
2. Regarding cross-dataset generalization ability, through visualization results and module function analysis, the authors have clarified the robustness of MSFA to scale variations and the focusing effect of RAN on semantically critical regions, revealing the intrinsic logic of generalization ability. For the phenomenon that the RMSE on the Make3D dataset is slightly higher than that in [34], the authors have explained it by combining scene characteristics (larger scale differences in outdoor distant scenes), enhancing the depth of result analysis.
3. The formatting of references has been standardized. Figure 1 has been supplemented with input and output dimensions. For the missing δ<1.25² and δ<1.25³ data in Table 5, annotations stating "not reported in the original literature" have been added, improving the standardization of the paper.
Suggestions for minor adjustments are as follows:
1. It is recommended to specifically label 1-2 samples containing low-texture (e.g., white walls) or strong reflections (e.g., glass) in Figure 12 or Figure 14 to intuitively demonstrate the model's performance in such scenarios, further supporting the conclusion that "structural components alleviate challenges".
2. Although α=0.6 and β=0.4 are empirical settings, a brief mention in Section 3.3.3 that "this weight combination was determined to have comprehensive optimality on both indoor and outdoor datasets through preliminary comparative experiments" can be added to enhance the rationality of parameter selection.
3. The term "Scale Fetures" in the text should be corrected to "Scale Features" to uniformly fix the spelling error.
The authors have responded proactively and effectively to the revision comments. The paper has met the publication standards in terms of methodological rigor, explanatory power of results, and formatting standardization. The above suggestions for minor adjustments do not affect the core conclusions, and it is recommended that the paper be formally accepted after the authors complete the revisions.
Author Response
The authors would like to thank the reviewers for their critical and constructive comments, which have certainly improved the content and presentation of the manuscript. Below we provide point-by-point responses to all reviewers’ comments and concerns.
3.1 Concern: It is recommended to specifically label 1–2 samples containing low-texture (e.g., white walls) or strong reflections (e.g., glass) in Figure 12 or Figure 14 to intuitively demonstrate the model's performance in such scenarios, further supporting the conclusion that "structural components alleviate challenges".
Response
Thank you for this valuable suggestion. Although we have not explicitly annotated low-texture or reflective regions in Figures 12 and 14, these figures already include representative and challenging visual conditions that inherently involve such cases. In particular, the samples in Row 2 (objects affected by lighting) and Row 3 (objects in dark areas) frequently involve low-texture surfaces such as white walls, or reflective materials like glass, where conventional depth estimation tends to struggle.
Moreover, Figure 14 provides a more rigorous demonstration of our model’s generalization capability, as it presents inference results on the SUN RGB-D dataset using a model trained exclusively on NYU Depth-v2—without fine-tuning. This zero-shot cross-dataset evaluation further underscores the robustness of our proposed structural components (MSFA and RAN) in handling complex scenes with ambiguous depth cues.
We believe that the current qualitative results, in conjunction with the quantitative metrics, sufficiently support our claim that the structural components effectively mitigate challenges posed by low-texture and reflective areas.
----------------------------------------------------------------------------
3.2 Concern: Although α=0.6 and β=0.4 are empirical settings, a brief mention in Section 3.3.3 that "this weight combination was determined to have comprehensive optimality on both indoor and outdoor datasets through preliminary comparative experiments" can be added to enhance the rationality of parameter selection.
Response
We thank the reviewer for this valuable suggestion. We have now updated Section 3.3.3 to clarify the rationale behind the empirical settings of the loss weights α and β. Specifically, we have added the following sentence in blue:
Although $\alpha = 0.6$ and $\beta = 0.4$ are empirical settings, this weight combination was determined to have comprehensive optimality on both indoor and outdoor datasets through preliminary comparative experiments.
This addition aims to provide clearer justification for the selected parameters and enhance the interpretability of our loss function design.
------------------------------------------------------------------------------------
3.3 Concern: The term "Scale Fetures" in the text should be corrected to "Scale Features" to uniformly fix the spelling error.
Response
We thank the reviewer for pointing out the typographical error. The term “Scale Fetures” has been corrected to “Scale Features” throughout the manuscript to ensure consistency and clarity.
