Article
Peer-Review Record

High-Precision Depth Estimation Networks Using Low-Resolution Depth and RGB Image Sensors for Low-Cost Mixed Reality Glasses

Appl. Sci. 2025, 15(11), 6169; https://doi.org/10.3390/app15116169
by Wei-Jong Yang 1, Hsuan Tsai 2 and Din-Yuen Chan 3,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 4 April 2025 / Revised: 27 May 2025 / Accepted: 28 May 2025 / Published: 30 May 2025

Round 1

Reviewer 1 Report (New Reviewer)

Comments and Suggestions for Authors

The paper proposes an accurate, lightweight depth estimation network designed for mixed reality (MR) devices with limited computational resources, such as smart glasses. The network combines a dual-path autoencoder with an adaptive bin-based depth estimator, enabling the fusion of high-resolution RGB images with low-resolution depth sensor data. The work is interesting, but the following improvements are recommended:


1. The scope of the paper is focused on a final product represented by smart glasses. However, since the system is efficient and fast, it could have broader applicability in real-time systems such as motorcycle helmets.

2. Some references are out of order, and the formatting of citations appears inconsistent throughout the manuscript. It is recommended that a single citation style be adopted for reader clarity and to aid the reviewer.

3. Some grammatical issues affect clarity due to ambiguity. For example, on page 2, last paragraph, third paragraph, third line, the expression “consumes more computation resource” could be improved by specifying the amount of resources. In the same paragraph, the final sentence should also quantify how much the error rate is reduced by using the more complex upsampling block versus the simpler one. Additionally, there are instances of impersonal or informal phrasing, such as the expression “more and more” found on page 7, second-to-last paragraph, last line. Lastly, numeric expressions should follow a consistent format—some large values are abbreviated (e.g., "12K") while others are written in full, which should be standardized.

4. Reference [14] on page 3, first paragraph, line 6, should be revised. This work refers explicitly to a self-supervised method, so stating "unsupervised" without specifying that it is self-supervised may cause ambiguity, as not all unsupervised methods rely on pretext tasks or derived signals. Moreover, the final three lines of the same paragraph are misplaced and would be more appropriate in the Introduction section.

5. Images, tables, captions, and titles are formatted inconsistently across the paper. It is recommended that a uniform, preferably centered, format be adopted. Some figure captions contain excessive information; such content would be better placed in the corresponding descriptive paragraphs. This applies particularly to Figure 1, Figure 3, Figure 4, and Figure 5. Finally, Figure 8 suffers from low resolution and should be improved.
6. Equations should follow a consistent order and notation standard. For example, in the expression (H, W, (C/2)-1) associated with Figure 5, the format used in the figure differs from that in the paragraph. It is also preferable to use fractional notation rather than symbols like “/” in mathematical expressions to avoid confusion—for instance, in Equation 3. Furthermore, each equation should be presented independently; Equation 7, for example, improperly includes two distinct expressions under a single equation number.

Comments on the Quality of English Language

Some grammatical issues affect clarity due to ambiguity.

Author Response

Responses to Comments of Editor and Reviewer I

for High-Precision Depth Estimation Networks Using Low-Resolution Depth and RGB Image Sensors for Low-Cost MR Glasses, No. applsci03597461

 

 General Responses:

We sincerely thank the Editor and Reviewers for taking the time to review our manuscript and for their very helpful comments. In the revised manuscript, we have addressed all of the comments and responded to each of them. Please find the detailed responses below; the corresponding revisions/corrections are highlighted in the re-submitted manuscript file with change tracking, and all refined, modified, and reorganized parts are marked in red.

 

 

Responses to Reviewer I

_____________________________________________________________________

The paper proposes an accurate, lightweight depth estimation network designed for mixed reality (MR) devices with limited computational resources, such as smart glasses. The network combines a dual-path autoencoder with an adaptive bin-based depth estimator, enabling the fusion of high-resolution RGB images with low-resolution depth sensor data. The work is interesting, but the following improvements are recommended:

Comments 1: The scope of the paper is focused on a final product represented by smart glasses. However, since the system is efficient and fast, it could have broader applicability in real-time systems such as motorcycle helmets.

 

Response 1: Thank you for pointing this out. The usefulness of the proposed network, implemented on Jorjin J7EF Plus AR glasses, has been verified by physicians affiliated with National Cheng Kung University Hospital. This is stated at the end of the Discussion section: “It is noted that our proposed network has been acknowledged by physicians affiliated with National Cheng Kung University Hospital for facilitating surgical operations and surgical education/training.” Hence, we believe that our proposed network can serve as an applicable edge-computing apparatus in many other popular systems, such as the detectors of self-driving vehicles and low-altitude drones, using LR depth for real-time high-precision depth completion.

 

Comment 2. Some references are out of order, and the formatting of citations appears inconsistent throughout the manuscript. It is recommended that a single citation style be adopted for reader clarity and to aid the reviewer.

Response 2: Thank you for pointing out this problem. We have unified the formats of the journal and conference papers listed in the References.

 

Comment 3. Some grammatical issues affect clarity due to ambiguity. For example, on page 2, last paragraph, third paragraph, third line, the expression “consumes more computation resource” could be improved by specifying the amount of resources. In the same paragraph, the final sentence should also quantify how much the error rate is reduced by using the more complex upsampling block versus the simpler one. Additionally, there are instances of impersonal or informal phrasing, such as the expression “more and more” found on page 7, second-to-last paragraph, last line. Lastly, numeric expressions should follow a consistent format—some large values are abbreviated (e.g., "12K") while others are written in full, which should be standardized.

Response 3: Thank you very much for pointing out these mistakes. The revised version has been refined to improve the English presentation and resolve the grammatical issues.

Comment 4. Reference [14] on page 3, first paragraph, line 6, should be revised. This work refers explicitly to a self-supervised method, so stating "unsupervised" without specifying that it is self-supervised may cause ambiguity, as not all unsupervised methods rely on pretext tasks or derived signals. Moreover, the final three lines of the same paragraph are misplaced and would be more appropriate in the Introduction section.

Response 4: Thank you for pointing out the need for this rectification. We have added the rectified statement for [14], and refined and moved the final three lines of that paragraph to the Introduction. The Related Work section now reads: “Supervised learning networks [8-11], …, self-supervised networks can exploit the support of a low-dimensional (LD) depth sensor or pose network [14] to improve the accuracy of monocular depth estimation. For training the self-supervised network, stereo images can be emulated from monocular sequences by training a second pose-estimation network to predict the relative poses applied in the projection function [14]. …. Our proposed network, which belongs to the self-supervised approaches, can use a very lightweight model for fast inference of high-precision monocular depths.” The Introduction now reads: “…, we design the bin-based depth estimation network with variant decoders. It can create a high-precision depth map from the RGB image together with only a low-quality sparse depth map. By our design tailored to bin-based networks, the simplest model of the proposed network can be very lightweight.”

Comment 5. Images, tables, captions, and titles are formatted inconsistently across the paper. It is recommended that a uniform, preferably centered, format be adopted. Some figure captions contain excessive information; such content would be better placed in the corresponding descriptive paragraphs. This applies particularly to Figure 1, Figure 3, Figure 4, and Figure 5. Finally, Figure 8 suffers from low resolution and should be improved.
Comment 6. Equations should follow a consistent order and notation standard. For example, in the expression (H, W, (C/2)-1) associated with Figure 5, the format used in the figure differs from that in the paragraph. It is also preferable to use fractional notation rather than symbols like “/” in mathematical expressions to avoid confusion—for instance, in Equation 3. Furthermore, each equation should be presented independently; Equation 7, for example, improperly includes two distinct expressions under a single equation number.


Response 5. Thank you. In response to these comments, three actions were taken:

(a) The captions and illustrations of Fig. 1, Fig. 3, Fig. 4, and Fig. 5 have been properly shortened and refined by moving some of their content into the corresponding descriptive paragraphs.

(b) All the equations have been checked and corrected.

(c) The ambiguous parts have been clarified by clearly re-defining the critical variables and vectors, e.g., “In our proposed network, the division of bins is hereafter named the bin-division vector, which contains the consecutive bins, where a bin records the granularity (range) of a depth fragment.” on page 4. With the revisions in (a)-(c), the grammatical issues have been resolved.

 

Author Response File: Author Response.pdf

Reviewer 2 Report (New Reviewer)

Comments and Suggestions for Authors

In this manuscript, the authors proposed a lightweight depth estimation network that combines a dual-path autoencoder and an adaptive dynamic range bin estimator to generate high-precision dense depth maps using a high-resolution RGB image and a low-resolution depth sensor. The proposed work is well-aligned with the constraints of resource-limited MR glasses and demonstrates strong potential for real-time applications. However, several concerns regarding the methodology, structure, and technical clarity should be addressed:

  • The integration of the dual-path encoder is intuitive, but the actual mechanism of feature fusion in the multilevel shared decoder is underexplained. Specifically, how the skip connections are handled when combining features from RGB and depth branches could benefit from explicit notation or pseudocode.

  • The AdaDRBins module, while innovative, presents a rather dense explanation. The role and interaction of the multiple MLP blocks (INIT, Sn, Bn) could be better illustrated through a unifying diagram or a simplified example that traces how a sample bin evolves through the layers.

  • There is insufficient analysis regarding the selection of bin granularity and its impact on performance. For instance, why 32 bins were optimal in AdaDRBins versus 256 in AdaBins is not experimentally justified.

  • While the proposed SimpleUp and UpCSPN-k upsampling strategies are compared in terms of efficiency and performance, the criteria for selecting one over the other in a given application context are not clearly discussed. A design guideline or heuristic could be useful.

  • The loss function design inherits components from AdaBins, but its justification in the new context of AdaDRBins is weak. For example, it is not evident why a fixed α = 10 and β = 0.5 are ideal, especially when the regression and binning outputs vary across decoder types.

  • The LR depth generation strategy (median pooling + optional filtering) for simulating ToF sensor data is reasonable but lacks validation. A comparison against real-world low-resolution depth maps from ToF sensors would help confirm the validity of this simulation approach.

  • The experimental section is comprehensive, but the ablation studies do not fully explore intermediate configurations (e.g., UpCSPN without bin estimation or AdaDRBins with ResNet encoders). This limits insight into the modular benefits.

  • The authors mention deploying the model on Jorjin MR glasses and Unity with Barracuda, but detailed performance metrics such as model loading time, memory footprint, and frame drop rate on the target hardware are missing.

  • The manuscript could benefit from a clearer comparison with recent works such as ZoeDepth or other transformer-based monocular estimation models that are efficient and show strong zero-shot capabilities.

Author Response

Responses to Comments of Editor and Reviewer II

for High-Precision Depth Estimation Networks Using Low-Resolution Depth and RGB Image Sensors for Low-Cost MR Glasses, No. applsci03597461

 

 General Responses:

We sincerely thank the Editor and Reviewers for taking the time to review our manuscript and for their very helpful comments. In the revised manuscript, we have addressed all of the comments and responded to each of them. Please find the detailed responses below; the corresponding revisions/corrections are highlighted in the re-submitted manuscript file with change tracking, and all refined, modified, and reorganized parts are marked in red.

 

Responses to Reviewer II

___________________________________________________________________________

In this manuscript, the authors proposed a lightweight depth estimation network that combines a dual-path autoencoder and an adaptive dynamic range bin estimator to generate high-precision dense depth maps using a high-resolution RGB image and a low-resolution depth sensor. The proposed work is well-aligned with the constraints of resource-limited MR glasses and demonstrates strong potential for real-time applications. However, several concerns regarding the methodology, structure, and technical clarity should be addressed:

Comment 1. The integration of the dual-path encoder is intuitive, but the actual mechanism of feature fusion in the multilevel shared decoder is under explained. Specifically, how the skip connections are handled when combining features from RGB and depth branches could benefit from explicit notation or pseudocode.

Response 1. Thank you very much. The combination of RGB and depth features is realized by element-wise addition. The statement “At the beginning of SimpleUp and UpCSPN-k, element-wise addition is applied to blend the RGB features and depth features.” has been added above Fig. 4 on page 6. To further illustrate the effectiveness of blending RGB and depth features by element-wise addition, the statement “It is noted that, instead of general convolution and concatenation, the element-wise addition and depthwise separable convolution in SimpleUp and UpCSPN-k can efficiently fuse the RGB features and depth features without performance degradation, as confirmed by extensively testing possible structural variants.” has been added at the end of Section 3.2 on page 7.
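The element-wise fusion described in this response can be sketched as follows. This is a minimal illustration, not the authors' implementation; the channel count and spatial size are assumed for demonstration only.

```python
import numpy as np

# Hypothetical feature maps from the RGB and depth encoder branches,
# shaped (channels, height, width); real sizes depend on the decoder level.
rgb_feat = np.random.rand(32, 28, 28).astype(np.float32)
depth_feat = np.random.rand(32, 28, 28).astype(np.float32)

# Element-wise addition blends the two branches without increasing the
# channel count, unlike concatenation, which would double it.
fused = rgb_feat + depth_feat

assert fused.shape == rgb_feat.shape  # channel count is unchanged
```

Keeping the channel count fixed is what lets the subsequent depthwise separable convolution stay cheap, which is consistent with the efficiency claim in the response.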

 

Comment 2. The AdaDRBins module, while innovative, presents a rather dense explanation. The role and interaction of the multiple MLP blocks (INIT, Sn, Bn) could be better illustrated through a unifying diagram or a simplified example that traces how a sample bin evolves through the layers.

Response 2. Thank you very much. We have added a figure of all the MLP blocks in Fig. 7 to show their architectural details.

Comment 3. There is insufficient analysis regarding the selection of bin granularity and its impact on performance. For instance, why 32 bins were optimal in AdaDRBins versus 256 in AdaBins is not experimentally justified.

Response 3. Thank you very much. In the Experimental Results section, we added the explanation: “Through our experiments, the depth-completion accuracy with 32 bins is very close to that with 64, 128, and 256 bins for 224×224 depth maps when adopting AdaDRBins in the proposed architecture. This means that the use of 32 bins provided by AdaDRBins, i.e., 32-bin AdaDRBins, is the most appropriate choice in view of the model complexity of our proposed models.” In addition, we added Table 6 to compare the computational complexities of 32-bin AdaDRBins and 256-bin AdaBins [12] as used in the proposed models.
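To make the bin-granularity discussion concrete, the following sketch shows how a final per-pixel depth is recovered from adaptive bins in the AdaBins style (a probability-weighted sum of bin centers). The 32-bin setting and the depth range are assumptions for illustration; this is not the AdaDRBins code itself.

```python
import numpy as np

def depth_from_bins(bin_widths, probs, d_min=0.001, d_max=10.0):
    """AdaBins-style depth readout: normalize the predicted bin widths so
    they partition [d_min, d_max], then take a probability-weighted sum of
    the bin centers. Illustrative only, not the authors' exact code."""
    w = bin_widths / bin_widths.sum()
    edges = d_min + (d_max - d_min) * np.cumsum(np.concatenate([[0.0], w]))
    centers = 0.5 * (edges[:-1] + edges[1:])   # one center per bin
    return probs @ centers                     # one depth value per pixel

n_bins = 32                                    # the 32-bin AdaDRBins setting
widths = np.random.rand(n_bins) + 0.1          # hypothetical predicted widths
probs = np.random.rand(5, n_bins)              # 5 example pixels
probs /= probs.sum(axis=1, keepdims=True)      # softmax-like per-pixel weights
depths = depth_from_bins(widths, probs)
assert depths.shape == (5,)
```

Fewer bins mean fewer centers to predict and combine, which is why reducing 256 bins to 32 cuts the module's complexity while the accuracy stays close, as Table 6 in the revision is said to quantify.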

 

Comment 4. While the proposed SimpleUp and UpCSPN-k upsampling strategies are compared in terms of efficiency and performance, the criteria for selecting one over the other in a given application context are not clearly discussed. A design guideline or heuristic could be useful.

Response 4. Thank you for the comment. We added the statement “Observing Tables 5, 6, and 8, the proposed model using DMobileNet_s, SimpleUp, and 32-bin AdaDRBins is the best choice for real-time low-cost MR systems in terms of the efficiency-accuracy trade-off.” to the Discussion section; Table 6 is newly added in the revised manuscript.

Comment 5. The loss function design inherits components from AdaBins, but its justification in the new context of AdaDRBins is weak. For example, it is not evident why a fixed α = 10 and β = 0.5 are ideal, especially when the regression and binning outputs vary across decoder types.

Response 5. Thank you for this comment. Indeed, because the regression and binning outputs vary across decoder types, an analytical treatment of the loss weights would be complicated. Instead of such a troublesome analysis, the results of extensive experiments can convince other researchers of this setting when retraining AdaDRBins. Hence, we added “By extensive examinations on AdaDRBins, the scaling factors b = 10 and m = 0.5 are selected because they attain the higher accuracy of depth estimation on average and achieve more stable training for our network than the other tested values of b and m.” on page 11. Because α has already been used in (5), we changed the symbols α and β to b and m, respectively.
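For readers unfamiliar with the AdaBins-inherited loss being discussed, a minimal sketch of its structure follows. The scale-invariant log pixel term and the 0.5-weighted bin term mirror the factors debated above; the variance-focus value and the zeroed-out bin term are assumptions for illustration, not the manuscript's exact loss.

```python
import numpy as np

def silog_loss(pred, gt, variance_focus=0.85, scale=10.0):
    """Scale-invariant log loss in the style of AdaBins' pixel term.
    The variance_focus and the scale factor of 10 are illustrative."""
    g = np.log(pred) - np.log(gt)
    # mean(g^2) >= mean(g)^2, so the radicand is non-negative.
    return scale * np.sqrt(np.mean(g**2) - variance_focus * np.mean(g)**2)

pred = np.random.rand(100) + 0.5   # hypothetical predicted depths
gt = np.random.rand(100) + 0.5     # hypothetical ground-truth depths
pixel_term = silog_loss(pred, gt)
bin_term = 0.0                     # bin-center chamfer term, omitted here
total = pixel_term + 0.5 * bin_term  # 0.5 weight on the bin term
assert total >= 0.0
```

The point of the response is that these two scalar weights were fixed empirically rather than derived, which this decomposition makes easy to see.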

Comment 6. The LR depth generation strategy (median pooling + optional filtering) for simulating ToF sensor data is reasonable but lacks validation. A comparison against real-world low-resolution depth maps from ToF sensors would help confirm the validity of this simulation approach.

Response 6. Thank you for this comment. Targeting the practical scenario of low-cost AR glasses, we focus on designing an efficient depth-completion network that achieves the optimal fusion of a low-resolution, low-quality (LRLQ) depth map and an RGB image to produce an accurate high-precision depth map. Because the LRLQ depth maps captured by the ToF camera of the Jorjin J7EF Plus AR glasses are very blurry, this challenge is no easier than predicting a high-dimensional depth map from a sparse LiDAR depth map. Hence, we believe that our network can also be effective for estimating high-dimensional depth maps from sparse depth maps after increasing the upsampling layers and widening the AdaDRBins structure; the modules of AdaDRBins are detailed in the newly added Fig. 7. The blurriness of the LRLQ depth maps targeted in this study can be observed in Fig. 8, Fig. 10, and Fig. 11.
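The median-pooling simulation of a ToF sensor that the reviewer refers to can be sketched as follows. The block size and map resolution are assumptions for illustration; the manuscript's actual pipeline may add optional filtering on top of this.

```python
import numpy as np

def median_pool(depth, k):
    """Downsample a dense depth map by taking the median of each k-by-k
    block -- a sketch of the LR-depth generation strategy, not the
    authors' exact code."""
    h, w = depth.shape
    h, w = h - h % k, w - w % k                    # crop to a multiple of k
    blocks = depth[:h, :w].reshape(h // k, k, w // k, k)
    return np.median(blocks, axis=(1, 3))

hr = np.random.rand(224, 224).astype(np.float32)   # hypothetical dense map
lr = median_pool(hr, 16)                           # 14x14 simulated LR depth
assert lr.shape == (14, 14)
```

Median pooling is robust to per-pixel outliers in the dense map, which is one plausible reason it was chosen over mean pooling for emulating a noisy low-resolution sensor.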

Comment 7.  The experimental section is comprehensive, but the ablation studies do not fully explore intermediate configurations (e.g., UpCSPN without bin estimation or AdaDRBins with ResNet encoders). This limits insight into the modular benefits.

Response 7. Thank you for this comment. Accordingly, we added Table 6 to compare the computational complexities of 32-bin AdaDRBins and 256-bin AdaBins [12] as used in the proposed models. And, to more fully explore the intermediate configurations, we added Fig. 7 to show the detailed structure and signal flow of the feature-embedding MLPs (Ek), the bins-initialization MLP (INIT), the bins-splitter MLPs (Sk), the bias-prediction MLPs (Bk), the hybrid regression layer (HREG), and the weighted-sum linear combination (LCOMB).

Comment 8. The authors mention deploying the model on Jorjin MR glasses and Unity with Barracuda, but detailed performance metrics such as model loading time, memory footprint, and frame drop rate on the target hardware are missing.

Response 8. Thank you for this comment. As shown in Table 6 and Table 8, our medium-complexity model, which adopts SimpleUp and AdaDRBins, can generate high-precision 224×224 depth maps at 560 fps, and the complex model, which adopts UpCSPN-7 and AdaDRBins, can generate them at 310 fps. Although this inference was run on a single NVIDIA GeForce RTX 4090 GPU with 24 GB of memory and an Intel i7-13700K CPU, Table 5 shows that the complete model using DMobileNet_s, SimpleUp, and AdaDRBins has only 2.28 M weights and 1.150 G MACs. Because most of our proposed models are very lightweight, low loading time and a small memory footprint without frame drops can be attained on the target hardware.

 

 

Comment 9. The manuscript could benefit from a clearer comparison with recent works such as ZoeDepth or other transformer-based monocular estimation models that are efficient and show strong zero-shot capabilities.

Response 9. Thank you for this comment. The ZoeDepth network is a very computation-heavy depth estimator; the computational complexity of our proposed models is several orders of magnitude lower, making them a highly cost-effective approach. Hence, we only briefly describe the ZoeDepth network [29]: “The ZoeDepth network [29] exploits a hierarchical-level bins-fused decoder and dataset-dependent multiple configurations for pre-training and metric fine-tuning to obtain high-quality, high-resolution depth estimation. It belongs to the computation-heavy depth estimators. In contrast, our proposed models are cost-effective approaches.” at the end of Related Work.

 

Author Response File: Author Response.pdf

Reviewer 3 Report (New Reviewer)

Comments and Suggestions for Authors

The authors propose a competitive high-precision depth estimation network, which integrates a dual-path autoencoder and an adaptive bin depth estimator together. The following suggestions and comments should be considered in order to improve the manuscript.

Major suggestions and comments

**The weakness of the manuscript is that relevant results were not highlighted in the Abstract, Discussion, and Conclusion sections. For example, Table 5 shows superiority in terms of Weights (M), MAC (G), and RMSE; however, these were not expressed in the Conclusions or Abstract sections. Potential readers pay attention to these sections.

**The manuscript has a weak Discussion section; the authors cite only one paper [36]. To enhance this section, the authors could compare their results with the current literature.
**The Conclusions section must highlight the main contribution of the paper. This section should be organized and improved.

Minor comments

Please use ‘’ToF’’ instead of ‘’TOF’’ and define it in line 44.
The authors cited tables in two different ways, for example, Tab. 1, Tab. 2, Tab. 6, Table 3, Table 5-7. The authors need to choose one way.

Author Response

Responses to Comments of Editor and Reviewer III

for High-Precision Depth Estimation Networks Using Low-Resolution Depth and RGB Image Sensors for Low-Cost MR Glasses, No. applsci03597461

 

 General Responses:

We sincerely thank the Editor and Reviewers for taking the time to review our manuscript and for their very helpful comments. In the revised manuscript, we have addressed all of the comments and responded to each of them. Please find the detailed responses below; the corresponding revisions/corrections are highlighted in the re-submitted manuscript file with change tracking, and all refined, modified, and reorganized parts are marked in red.

 

Responses to Reviewer III

____________________________________________________________________

The authors propose a competitive high-precision depth estimation network, which integrates a dual-path autoencoder and an adaptive bin depth estimator together. The following suggestions and comments should be considered in order to improve the manuscript.

Major suggestions and comments

Comment 1. The weakness of the manuscript is that relevant results were not highlighted in the Abstract, Discussion, and Conclusion sections. For example, Table 5 shows superiority in terms of Weights (M), MAC (G), and RMSE; however, these were not expressed in the Conclusions or Abstract sections. Potential readers pay attention to these sections.

Response 1. Thank you for this reminder. The statement “The simplest model of the proposed network has only 2.28 M weights and 1.150 G MACs. It is so lightweight that this model can easily be implemented on the platform of low-cost MR glasses to support hand-gesture controls realized on edge devices in real time.” has been added to the Abstract. Another statement, “In the proposed network, the most lightweight model, comprising MobileNet_s, SimpleUp, and AdaDRBins, needs only 2.28 M weights and 1.150 G MACs.”, has been added to the Conclusions section.

Comment 2. The manuscript had a poor discussion section, the authors cite only one paper [36]. To enhance this section the authors could compare their results with current literature.

Response 2. Thank you for pointing this issue out. The statement and cited paper [36] were indeed unsuitable for discussing our proposed network. Hence, in the Discussion section, the statement is replaced by the following paragraph: “As a result, the proposed network has been acknowledged by physicians affiliated with the College of Medicine at National Cheng Kung University for facilitating surgical operations and surgical education/training. Hence, we believe that our proposed network can serve as an applicable edge-computing apparatus for many other popular systems, such as the detectors of self-driving vehicles and low-altitude drones, using LR depth for real-time high-precision depth completion.” Paper [36] has been removed from the reference list.

 
Comment 3. Conclusions section must highlight the main contribution of the paper. This section should be organized and improved.

Response 3. Thank you for pointing out this deficiency. The Conclusions have been completely refined to highlight the main contributions in the revised version.

 

Comment 4. Minor comments: please use ‘’ToF’’ instead of ‘’TOF’’ and define it in line 44. The authors cited tables in two different ways, for example, Tab. 1, Tab. 2, Tab. 6, Table 3 , Table 5-7 . The authors need to choose one way.

Response 4. Thank you for pointing this out. All inconsistencies, including “TOF” vs. “ToF” and the table notations, have been corrected to consistent terms in the revised article.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report (New Reviewer)

Comments and Suggestions for Authors

All comments have been taken into consideration in this version of the manuscript.

Reviewer 3 Report (New Reviewer)

Comments and Suggestions for Authors

The authors improved the manuscript by addressing major and minor suggestions. The Abstract, Discussion, and Conclusion sections now highlight relevant results. These changes demonstrate the scalability of the proposed network, and the work's main contribution is now clearer.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper deals with an exciting topic. The article has been read carefully, and some minor issues have been highlighted in order to be considered by the author(s).

(1) The paper lacks an experimental comparison with existing depth estimation methods. A direct evaluation against state-of-the-art approaches is necessary to validate the effectiveness of the proposed model. Performance metrics such as accuracy, inference time, and computational efficiency should be compared to demonstrate the advantages of the proposed method.

(2) The computational efficiency of the proposed method needs further investigation. Given the constraints of MR glasses, a detailed experimental evaluation of time complexity (e.g., inference speed) and space complexity (e.g., memory consumption) is essential. This would provide insights into the feasibility of real-time deployment on resource-limited devices.

(3) The paper should include an ablation study to analyze the contributions of individual components within the proposed model, such as the dual-path autoencoder and the adaptive bin depth estimator. This analysis would help in understanding their respective impacts on the overall performance and justify their inclusion in the model architecture.

(4) It would be beneficial to briefly introduce related research on AI security (adversarial example) from a security perspective, such as the papers available at https://doi.org/10.1155/2024/1124598

Author Response

 

Responses to Reviewers’ Comments

The authors deeply appreciate all the comments raised by the Editor and Reviewers to improve the readability of our paper. We have fully followed these valuable comments in revising the paper; the responses, written in blue text after each comment, are summarized as follows.

Responses to Reviewer 1’s Comments:

This paper deals with an exciting topic. The article has been read carefully, and some minor issues have been highlighted in order to be considered by the author(s).

Response:

Thanks for your positive comments and efforts to point out the minor issues for us to further improve the manuscript.

__________________________________________________________________

(1) The paper lacks an experimental comparison with existing depth estimation methods. A direct evaluation against state-of-the-art approaches is necessary to validate the effectiveness of the proposed model. Performance metrics such as accuracy, inference time, and computational efficiency should be compared to demonstrate the advantages of the proposed method.

Response:

Thanks for your kind comments. Since the goal of this study is to design a lightweight depth estimation network with sufficient precision, we chose MonoDepth [13] and FastDepth [10] for comparison. In the revised manuscript, we added Table 7, which shows the prediction performance in RMSE and the complexity in the number of multiply-accumulate operations (MACs). For 224×224 RGB input images, MonoDepth using ResNet-50 [13] and the fastest version of FastDepth [10] require about 4.8 G MACs and 0.74 G MACs, respectively. In our proposed network, the MobileNet encoder, comprising six depthwise separable convolution layers and one pointwise convolution, requires about 0.185 G MACs in total. As for the MACs of the various proposed decoders listed in Table 6, we chose three models, using the SimpleUp-only decoder, the SimpleUp+AdaDRBins decoder, and the UpCSPN-7+AdaDRBins decoder, to display their computational complexities in order. These three models demand 0.675 (i.e., 0.185+0.49), 1.335 (i.e., 0.185+1.15), and 9.445 (i.e., 0.185+9.26) G MACs, respectively. Not only is the model adopting the SimpleUp-only decoder lighter than the fastest FastDepth model in terms of MACs, but it also achieves a lower RMSE of 0.233, compared with the 0.599 RMSE obtained by the latter [10], on the NYU Depth V2 dataset.
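The encoder-plus-decoder totals quoted in this response can be checked with a few lines of arithmetic; the figures below simply reproduce the numbers stated above.

```python
# Total MACs (in G) for the three reported models: the shared 0.185 G
# encoder plus each decoder's cost, as stated in the response.
encoder = 0.185
decoders = {"SimpleUp only": 0.49,
            "SimpleUp+AdaDRBins": 1.15,
            "UpCSPN-7+AdaDRBins": 9.26}
totals = {name: round(encoder + macs, 3) for name, macs in decoders.items()}

assert totals == {"SimpleUp only": 0.675,
                  "SimpleUp+AdaDRBins": 1.335,
                  "UpCSPN-7+AdaDRBins": 9.445}
# The lightest model (0.675 G) is indeed below FastDepth's 0.74 G.
assert totals["SimpleUp only"] < 0.74
```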

__________________________________________________________________

 

 (2) The computational efficiency of the proposed method needs further investigation. Given the constraints of MR glasses, a detailed experimental evaluation of time complexity (e.g., inference speed) and space complexity (e.g., memory consumption) is essential. This would provide insights into the feasibility of real-time deployment on resource-limited devices.

Response:

Thanks for your kind comments. In our experiments, the fastest proposed model, which adopts the SimpleUp-only decoder, requires merely 0.675 G MACs and attains an inference speed of 1170 FPS on a single NVIDIA GeForce RTX 4090 GPU with 24 GB memory and an Intel i7-13700K CPU. Moreover, in our trials, this model can produce real-time, high-precision dense depth maps of a moving hand on Jorjin J7EF devices from the very sparse depths captured by the Jorjin MR glasses. The associated video demo is available at https://youtu.be/XFgtftdPQls.
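A minimal timing harness of the following shape is one way such an FPS figure can be measured; the lambda workload is a stand-in for the real network's forward pass, which is not shown here, and on a GPU the device must be synchronized before reading the clock:

```python
import time

def measure_fps(infer, n_warmup=10, n_runs=100):
    """Average throughput of a callable `infer`, in calls (frames) per second."""
    for _ in range(n_warmup):          # warm up caches / lazy initialization
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()                        # on GPU: synchronize before stopping the clock
    elapsed = time.perf_counter() - start
    return n_runs / elapsed

# Stand-in workload instead of the actual depth network's forward pass:
fps = measure_fps(lambda: sum(range(1000)))
print(f"{fps:.0f} FPS")
```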

__________________________________________________________________

 

(3) The paper should include an ablation study to analyze the contributions of individual components within the proposed model, such as the dual-path autoencoder and the adaptive bin depth estimator. This analysis would help in understanding their respective impacts on the overall performance and justify their inclusion in the model architecture.

Response:

Thanks for your kind comments. With the encoder fixed, the MACs and prediction performances of the various proposed decoders are listed in Tables 5 and 6. Because models with the different combinations of the proposed components, such as SimpleUp, UpCSPN-k, and AdaDRBins, are all included in Tables 5 and 6, we believe these tables can be regarded as the ablation study.

__________________________________________________________________

(4) It would be beneficial to briefly introduce related research on AI security (adversarial example) from a security perspective, such as the papers available at https://doi.org/10.1155/2024/1124598

Response:

Thanks. Indeed, AI security is important for smart MR glasses when they are involved in security-sensitive applications, such as medical surgery and military security checks. The proposed networks could consider defenses against camouflage attacks that use adversarial examples [26]. In the Discussion, we added this important security issue as possible future work.

Note that, basically, the proposed network with the SimpleUp decoder can be built into the Jorjin J7EF Plus MR glasses for edge computing without the aid of an outside server, making the AI security self-contained.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

I would like to thank the authors for their work on the paper titled "Precision Depth Estimation Networks Using Low-Resolution Depth and RGB Image Sensors for Low-Cost MR Glasses." The idea behind the paper is original and promising, but significant improvements are needed before it can be considered for publication. I recommend the authors follow these remarks to enhance the paper:

  • Introduction: The introduction is too short and does not clearly explain the problem and motivation. It should be expanded to at least two pages to provide a better explanation of the research context and significance.
  • Language & Readability: The English should be improved. Examples:
    • "Mixture Reality" should be "Mixed Reality" in the title and keywords.
    • "Be hardly implemented" → "Are difficult to implement."
    • "Make the finally generated depth maps be more accurate" → "Make the final depth maps more accurate."
    • Some sentences are too long and complex. Shorter, clearer sentences would improve readability.
  • Figures & Captions:
    • Figure 1 (Framework): Ensure all components are clearly labeled, and add a brief explanation in the caption.
    • Figure 8 (Qualitative Results): Add zoomed-in sections to highlight differences and ensure consistent brightness and contrast.
    • Figure 10 (Implementation on MR Glasses): Clearly label the images (8×8 depth, predicted depth, RGB) and provide a short caption explaining how the results were generated.
  • Methodology:
    • Justify the choice of Adaptive Bins Estimation by comparing it with other binning strategies.
    • The model is described as lightweight, but a comparison with FastDepth, MonoDepth, and other methods in terms of FLOPs or memory usage would add value.
    • Clarify whether the batch size of 16 was chosen due to hardware constraints or experimental tuning.
  • Originality & Contribution: Clearly state how the proposed method differs from existing depth estimation networks. Emphasize the significance of using low-resolution depth maps and their impact on MR applications.
  • Results & Discussion:
    • The RMSE values in Table 4 are lower than existing methods, which is good. However, a statistical significance test (e.g., paired t-test) would reinforce the claim of improvement.
    • Discuss potential failure cases (e.g., depth errors in low-light conditions, fast-moving objects) and possible solutions.
    • Since Swin Transformer-based methods have been successful in depth estimation, a brief discussion on why they were not considered would enhance the paper.
  • Additional Questions for the Authors:
    • How does the model perform in varying lighting conditions?
    • Does the RGB input significantly impact performance in different environments?
    • Is there an attempt to reduce dependency on RGB images? How does the model perform with degraded RGB input?
    • Have the authors considered knowledge distillation to further compress the model for real-time inference?

Author Response

 Responses to Reviewers’ Comments

The authors deeply appreciate all the comments raised by the Editor and Reviewers to improve the readability of our paper. We have followed these valuable comments in revising the paper; the responses after each comment are written in blue text and summarized as follows.

Responses to Reviewer 2’s Comments:

I would like to thank the authors for their work on the paper titled "Precision Depth Estimation Networks Using Low-Resolution Depth and RGB Image Sensors for Low-Cost MR Glasses." The idea behind the paper is original and promising, but significant improvements are needed before it can be considered for publication. I recommend the authors follow these remarks to enhance the paper:

Response:

Thanks for your positive comments and efforts to point out the minor issues for us to further improve the manuscript.

__________________________________________________________________

 

Introduction: The introduction is too short and does not clearly explain the problem and motivation. It should be expanded to at least two pages to provide a better explanation of the research context and significance.

Response:

Thanks for the suggestion. In the revised manuscript, we added two paragraphs to extend the Introduction by describing the classes (kinds and types) of monocular depth estimation networks as supplementary background, which further confirms the significance of the proposed networks in practical MR applications.

__________________________________________________________________

 

Language & Readability: The English should be improved. Examples:

  • "Mixture Reality" should be "Mixed Reality" in the title and keywords.
  • "Be hardly implemented" → "Are difficult to implement."
  • "Make the finally generated depth maps be more accurate" → "Make the final depth maps more accurate."
  • Some sentences are too long and complex. Shorter, clearer sentences would improve readability.

Response:

Thanks for your efforts to point out the errors. We corrected all of them and refined the corresponding sentences with the help of an English tutor. We have also tried our best to find and fix other errors.

__________________________________________________________________

 

Figures & Captions:

Figure 1 (Framework): Ensure all components are clearly labeled, and add a brief explanation in the caption. Figure 8 (Qualitative Results): Add zoomed-in sections to highlight differences and ensure consistent brightness and contrast.

Response:

Thanks a lot. We refined the figures and clarified the captions according to these comments. The depth maps in Figure 8 directly exhibit the experimental results via popular coloring software without any image enhancement. Thus, to clarify the differences among the depth maps while preserving the original appearance of the outcomes, we only added white dashed-line rectangles in Figure 8 of the revised manuscript. The areas bounded by the white dashed-line rectangles in the maps produced by the proposed model using the “UpCSPN-7+AdaDRBins” decoder are visually superior to those of the other models. We also added a statement at the end of this paragraph: “Particularly, observing the 2nd and 4th rows of Figure 8, the rectangle-highlighted depths produced by the proposed network using the ‘UpCSPN-7+AdaDRBins’ decoder can be even better than the corresponding depths in the GT maps in terms of rational visual smoothness.”

__________________________________________________________________

 

Figure 10 (Implementation on MR Glasses): Clearly label the images (8×8 depth, predicted depth, RGB) and provide a short caption explaining how the results were generated.

Response:

Thanks. According to this comment, we added more implementation details on Page 14 in Section 4.3. In fact, labeling the 8×8 depth maps is not necessary for training, because the calculation of the depth-estimation loss directly exploits the high-resolution GT depth maps provided by the NYU Depth V2 and EgoGesture datasets. Those high-resolution GT depth maps are downsampled to produce the sensor-like 8×8 depth maps for training our networks.

Note that, since the new section includes a new figure, the original Figure 10 becomes Figure 11.
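The GT downsampling step above can be sketched as follows; block-average pooling is used here as one plausible operator, since this response does not restate the manuscript's exact downsampling method:

```python
def downsample_depth(depth, out_h=8, out_w=8):
    """Block-average a dense H×W depth map (list of rows) down to out_h×out_w."""
    h, w = len(depth), len(depth[0])
    bh, bw = h // out_h, w // out_w  # assumes H and W are divisible by out_h, out_w
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            block = [depth[i * bh + y][j * bw + x]
                     for y in range(bh) for x in range(bw)]
            row.append(sum(block) / len(block))  # mean depth of the block
        out.append(row)
    return out

# A constant 224×224 GT map stays constant after pooling to 8×8:
dense = [[2.0] * 224 for _ in range(224)]
sparse = downsample_depth(dense)
```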

__________________________________________________________________

 

Methodology:

Justify the choice of Adaptive Bins Estimation by comparing it with other binning strategies. The model is described as lightweight, but a comparison with FastDepth, MonoDepth, and other methods in terms of FLOPs or memory usage would add value.

Response:

Thanks for your kind comments. Since the goal of this study is to design a lightweight depth estimation network with sufficient precision, we chose MonoDepth [13] and FastDepth [10] for comparison. In the revised manuscript, we added Table 7, which reports the prediction performance in RMSE and the complexity in multiply-accumulate operations (MACs). For 224×224 RGB input images, MonoDepth using ResNet-50 [13] and the fastest version of FastDepth [10] require about 4.8 G MACs and 0.74 G MACs, respectively. In our proposed network, the MobileNet encoder, consisting of 6 depthwise separable convolution layers and one pointwise convolution, requires about 0.185 G MACs in total. The three proposed decoders listed in Tables 5 and 6, i.e., the SimpleUp-only, SimpleUp+AdaDRBins, and UpCSPN-7+AdaDRBins decoders, demand 0.675 (0.185+0.49), 1.335 (0.185+1.15), and 9.445 (0.185+9.26) G MACs, respectively. The model adopting the SimpleUp-only decoder is not only lighter than the fastest FastDepth model in terms of MACs, but it also achieves a lower RMSE (0.233 versus the 0.599 obtained by the latter [10]) on the NYU Depth V2 dataset. Note that, since a duplicated Table 3 index was corrected by incrementing it by 1, the indices of the tables from the original Table 5 onward are incremented by 1.

------------------------------------------------------------------------------------------------------

Clarify whether the batch size of 16 was chosen due to hardware constraints or experimental tuning.

Response:

Thanks. Yes, a batch size of 16 is the maximum feasible for training on a GeForce RTX 4090 GPU with 24 GB memory and an Intel i7-13700K CPU in our laboratory. In general, if better hardware is available, the maximum batch size can be increased to attain better performance.

----------------------------------------------------------------------------------------------------- 

Originality & Contribution: Clearly state how the proposed method differs from existing depth estimation networks. Emphasize the significance of using low-resolution depth maps and their impact on MR applications.

Response:

Thanks for your efforts to point out the errors. We corrected all of them and refined the corresponding sentences with the help of an English tutor.

-----------------------------------------------------------------------------------------------------

  • Results & Discussion:

The RMSE values in Table 6 are lower than existing methods, which is good. However, a statistical significance test (e.g., paired t-test) would reinforce the claim of improvement.

Response:

Thanks. In this study, the models of the proposed network can be divided into five groups according to their decoders: SimpleUp only, UpCSPN-k only, SimpleUp with AdaBins, SimpleUp with AdaDRBins, and UpCSPN-k with AdaDRBins. Observing Table 6, the RMSEs of models in different groups show salient differences, while the RMSEs of models within the same group are close. Moreover, within the second group (UpCSPN-k only), the best model uses the UpCSPN-5-only decoder, which even outperforms the UpCSPN-7-only decoder. This fact provides evidence, for deep learning investigations, that stacking more layers does not guarantee better performance. Again, we can claim that a better, smarter, and more cautious structural design matters more than using a larger number of kernels, i.e., more computing power. This short conclusion was added to the Discussion section. The related comparison data are tabulated in a new table, Table 7. Note that MonoDepth was tested on the KITTI dataset, where it attains an RMSE of 4.392 on KITTI's monotonous outdoor scenes. Thus, for the more complex indoor scenes of the NYU Depth V2 dataset, the RMSE of MonoDepth should be larger than 4.392, so it is denoted as “>4.392” in Table 7.
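The paired t-test the reviewer suggests could be computed over per-image RMSEs as sketched below, using only the standard library; the sample values are purely illustrative, not measured results:

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two matched samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical per-image RMSEs for two model variants (illustrative only):
rmse_model_a = [0.231, 0.240, 0.228, 0.236, 0.233]
rmse_model_b = [0.598, 0.610, 0.590, 0.605, 0.599]
t, dof = paired_t(rmse_model_a, rmse_model_b)
print(f"t = {t:.2f} with {dof} degrees of freedom")
```

In practice the p-value would then be read from the t distribution with `dof` degrees of freedom (e.g., via `scipy.stats.ttest_rel` when SciPy is available).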

     __________________________________________________________________

Discuss potential failure cases (e.g., depth errors in low-light conditions, fast-moving objects) and possible solutions.

Response:

Thanks. Because the depths captured by the TOF depth camera in the MR glasses are very sparse, the depths of some fingers may be entirely lost at certain moments of hand-moving actions. The high-resolution depth predicted by our proposed networks from such a low-resolution depth map can then hardly support accurate estimation of 3D hand gestures. Specifically, the predicted high-resolution depths at finger pixels may not suffice for recognizing subtle 3D hand gestures. This statement was added to the Discussion section.

__________________________________________________________________

 

Since Swin Transformer-based methods have been successful in depth estimation, a brief discussion on why they were not considered would enhance the paper.

Response:

Thanks a lot. Because the depths captured by the TOF depth camera in the MR glasses are very sparse, the depths of some fingers may be entirely lost at certain moments of hand-moving actions, so the predicted high-resolution depths may not suffice for recognizing subtle 3D hand gestures. Conceptually, a Swin Transformer backbone offers better global extraction of spatial features than the more local extraction of a CNN backbone. As a result, a lightweight Swin Transformer approach often faces a higher barrier to predicting acceptable high-precision dense depths from sparse depths than a lightweight CNN backbone does. Moreover, due to its high complexity, the Swin Transformer backbone is not suitable for the use case of this study. This statement was also added to the Discussion section.

__________________________________________________________________

Additional Questions for the Authors:

  1. How does the model perform in varying lighting conditions?

Response:

Thanks a lot. Yes, the proposed models perform well under varying lighting conditions; the model with the SimpleUp-only decoder already achieves zero-shot inference. The link https://youtu.be/XFgtftdPQls demonstrates a practical video fragment of the downstream depth estimation task. The model with the SimpleUp-only decoder, trained on the NYU Depth V2 and EgoGesture datasets, achieves depth estimation inference for a real moving hand with dynamic gestures. The middle sequence of depth maps in the exhibited triple-image video shows the inferred high-precision dense depth maps.

 

  2. Does the RGB input significantly impact performance in different environments?

Response:

Thanks. In our experiments, the proposed preprocessing, with RGB-image cropping and alignment between the cropped RGB image and the low-resolution depth maps, makes our networks stable across different environments without adverse RGB impact, even though the RGB images captured by the MR glasses are somewhat blurred, as demonstrated at https://youtu.be/XFgtftdPQls.

 

  3. Is there an attempt to reduce dependency on RGB images? How does the model perform with degraded RGB input?

Response:

Thanks. Using only 224×224 RGB input images, our experiments have already demonstrated the feasibility of all proposed models, which achieve high-precision depth estimation even in challenging cases (https://youtu.be/XFgtftdPQls) where the high-resolution depths of narrow hand fingers can still be recovered. Hence, if the application scenario is fixed, we believe our networks can exploit RGB images of resolution lower than 224×224 to achieve high-resolution depth restoration from the very sparse depths available. Thus, the proposed networks can reduce their dependency on the resolution and quality of the RGB images.

 

  4. Have the authors considered knowledge distillation to further compress the model for real-time inference?

Response:

Thanks for this question. In this study, we follow the rule of gradually increasing the stacked components and layers when developing our networks. Hence, observing Table 5, the model with the SimpleUp+AdaBins decoder can play the teacher role for the model with the SimpleUp-only decoder. Similarly, the model with the UpCSPN-l+AdaDRBins decoder can play the teacher role for the model with the UpCSPN-m+AdaDRBins decoder for l > m. Following these examples, a teacher-and-student structure can easily be obtained by reducing the layers of a complex model to form the student, further compressing that model through distillation. This statement was also added to the Discussion section, and we again thank the reviewer for this comment, which enriches the Discussion.
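The teacher-student scheme described above could use a standard distillation loss for depth regression, blending the supervised term with a term imitating the teacher; this is a generic sketch with illustrative values, not the authors' implementation:

```python
def mse(pred, target):
    """Mean squared error between two flat depth vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def distillation_loss(student, teacher, ground_truth, alpha=0.5):
    """Blend the supervised depth loss with a teacher-imitation term."""
    return alpha * mse(student, ground_truth) + (1 - alpha) * mse(student, teacher)

# Illustrative flattened depth predictions:
student = [1.0, 2.0, 3.0]
teacher = [1.1, 2.1, 2.9]
gt      = [1.0, 2.0, 3.0]
loss = distillation_loss(student, teacher, gt)
# The supervised term is 0 here, so loss equals half the student-teacher MSE
```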

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

I recommend acceptance.

Comments on the Quality of English Language

N/A
