Peer-Review Record

An Unsupervised Depth-Estimation Model for Monocular Images Based on Perceptual Image Error Assessment

Appl. Sci. 2022, 12(17), 8829; https://doi.org/10.3390/app12178829
by Hyeseung Park 1 and Seungchul Park 2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4:
Submission received: 1 August 2022 / Revised: 25 August 2022 / Accepted: 30 August 2022 / Published: 2 September 2022
(This article belongs to the Special Issue AI-Based Image Processing)

Round 1

Reviewer 1 Report

This paper studies an unsupervised depth estimation model for monocular images based on perceptual image error assessment. The topic is interesting and here are some comments:

1. Some typical equipment pictures can be added to the introduction. 

2. The authors' ‘ours’ label in Figure 1 should be aligned with the other text.

3. In Table 1, the data can be visualized as graphs.

4. In 4.2.1, you could list the data in tabular form and highlight the important data for specific analysis.

5. In Figure 3, after the comparison, the specific comparison details can be displayed in a chart combined with the relevant data.

6. The data in Table 2 would best be displayed in a chart and analyzed in different colors for better visualization.

7. The shortcomings of this method and future research directions can be proposed in the conclusion.

8. Other unsupervised image-processing methods should also be reviewed in the introduction section, such as the active contour model: "A hybrid active contour model based on pre-fitting energy and adaptive functions for fast image segmentation," Pattern Recognition Letters; and "An active contour model driven by adaptive local pre-fitting energy function based on Jeffreys divergence for image segmentation," Expert Systems With Applications.

Author Response

First of all, we are very grateful for your thoughtful comments. Thanks to you, our work has improved a lot. Below are our responses to each of your comments.

  1. Some typical equipment pictures can be added to the introduction.
  • This study uses the public KITTI and CityScapes datasets for training and testing our neural network, which is based on the PyTorch deep-learning framework. We are thankful for your suggestion, but please understand that it is difficult to find specific equipment relevant to this study of monocular depth prediction based on perceptual image error assessment.
  2. The authors' ‘ours’ label in Figure 1 should be aligned with the other text.
  • Figure 1 was a sample to show readers that our model visually outperformed other studies. However, we deleted Figure 1, judging that its content is conveyed more clearly, and better aligned with the rest of the text, by Figure 2 and the corresponding paragraphs of the experimental results section.
  3. In Table 1, the data can be visualized as graphs.
  • We agree that a graph would generally make it easier for readers to analyze the data, but the data distribution in Table 1 (Table 2 in the current version) is not suitable for graphical representation, since its graph version would be too complicated. Instead, as you suggested below, we colored the key data text for easy visual identification. Thank you very much for your understanding.
  4. In 4.2.1, you could list the data in tabular form and highlight the important data for specific analysis.
  • As you suggested, we have highlighted the key data in Table 2 in red and blue. The data became easier to read thanks to your suggestion.
  5. In Figure 3, after the comparison, the specific comparison details can be displayed in a chart combined with the relevant data.
  • We used Figure 3 for a qualitative comparison, so please understand that it is difficult to display the comparison details in chart form.
  6. The data in Table 2 would best be displayed in a chart and analyzed in different colors for better visualization.
  • The data distribution in Table 2 (Table 3 in the current version) needs to stay aligned with Table 1 (current Table 2) because it presents an ablation-study result. Therefore, please understand that we keep the same tabular representation as in Table 1.
  7. The shortcomings of this method and future research directions can be proposed in the conclusion.
  • As you suggested, we’ve added the shortcomings of this method and future research directions at the end of the conclusion section.
  8. Other unsupervised image-processing methods should also be reviewed in the introduction section, such as the active contour model: "A hybrid active contour model based on pre-fitting energy and adaptive functions for fast image segmentation," Pattern Recognition Letters; and "An active contour model driven by adaptive local pre-fitting energy function based on Jeffreys divergence for image segmentation," Expert Systems With Applications.
  • The proposal that the contour model could be used for monocular depth prediction is very interesting. However, since our paper compares results on the KITTI Eigen split with other studies that use the KITTI Eigen split as training data, an additional contour-learning network seems to be out of the scope of our current work. In a future study, we will positively review applying your suggestions to depth prediction. Thank you so much for your review and valuable suggestions.

Reviewer 2 Report

Paper can be accepted and is substantiated with good results. Needs minor editing.

Author Response

First of all, we are very grateful for your thoughtful comments. It's great to get a good review from an expert. In response to the opinions of several reviewers, including your comments below, we have made improvements such as editing sentences and providing an additional table in our work.

"Paper be accepted and is substantiated with good results. Need minor editing."

Reviewer 3 Report

This manuscript is concerned with computer vision aspects related to monocular depth estimation, which is a recent active area of research. It attempts to address some shortcomings in the current depth estimation approaches that depend on, e.g., LIDAR sensors and/or stereo images. The authors suggest a ResNet50-based model for predicting pixel-level 3D-depth maps for monocular images and train it using their proposed loss function in an unsupervised fashion. The main contribution I can see revolves around their proposed loss function. Overall, I've found that the presented contribution is sufficient to be accepted for publication. However, I'd highly recommend proofreading many parts of the manuscript. I'll try to touch on some concerns regarding the suggested solution first and then give a few additional thoughts to the authors to consider.

 

1- (Major - Sections 1 & 3): I have a completely different opinion on the following statements and respectfully disagree.

* Line 61 "Even if a student has the capacity to learn, it is difficult to educate him if the teacher’s skills are poor. In the same way, regardless of the sophistication of the design of a deep-learning network, it is ineffective if the network does not learn as intended. Therefore, a deep-learning network requires a loss function that can train the network as intended." 

* Line 71 "However, regardless of the quality of the network, it cannot accurately predict the depth if the loss functions for image synthesis are not well designed."

* Line 164 "As previously referenced, a student’s capacity to learn and a teacher’s educational skills could represent a learning network and loss functions, respectively, in the concept of machine learning. (in addition, textbooks could correspond to datasets.) A good loss function can effectively support learning goals. Consequently, we used a high-performance loss function to train our network. Note that we trained a relatively simple network to highlight the effect of the loss function."

 

- First: these statements mix up different ML terminologies (model capacity, the quality of NN/prediction vs. loss functions) in a confusing way! Model capacity w.r.t. memorization/generalization over the target datasets depends principally on the architecture, and this has nothing to do with loss functions. Model capacity can be estimated theoretically by, e.g., the VC dimension. The quality of model prediction depends on many factors, one of which is the dataset's quality (garbage in, garbage out). The speed at which a model learns (i.e., memorizes/generalizes the training data), however, depends on many optimization factors, one of which is the ((adaptive loss function)).

- Second: I didn't understand what the authors were trying to say regarding student and teacher learning networks and linking them with loss functions. Student-teacher learning is a well-known knowledge-distillation method in transfer learning, typically performed using, e.g., two networks in a reinforcement-learning fashion. I have no clue how the authors likened such a transfer-learning method to their approach, which consists of a single model! Again, a clever loss function is associated with the speed/effectiveness of the learning process only.

I strongly suggest rephrasing what has been said in these statements correctly.

 

2- (Major - Section 2: Related work) To better present the originality of the work, and since the proposed loss function is the main contribution, please review the existing training loss functions for neural depth estimation in more detail. Besides, please summarize the performance of the current works (discussed already in Sections 2.2 and 2.3) in a proper table for readability. 

 

3- (Major - Section 3.2: Training Loss) 

- The coefficients (alpha and beta) in Eq. (7) are not discussed! Are they pre-defined regularizers or just additional trainable (scalar) parameters?

- I have probably missed a formula that combines the LR-consistency loss with the smoothness loss function, as depicted in Fig. 2!

- Generally speaking, growing a polynomial loss function with extra coefficient parameters may make the learning network more complicated to train and cause more computational time to execute. Hence, in the ablation study, it would be interesting to see the effect of the proposed loss function on the training time. (This point is minor to me, and the authors may consider it future work.)

 

4 - (Major - Figure 2): Please add a figure legend that describes the mentioned abbreviations (disparity d, image I, consistency con, coarse c, medium m, fine-view f, etc.). It made me read the text a few times to figure out some of these abbreviations!

 

5- (Minor - Section 3, line 163): I'd suggest dropping the phrase "a well-designed" and probably say .. (This section also details the rationale of our introduced/defined loss function ..)

 

6- (Minor) Why is a recent benchmarking dataset (i.e., DIODE) not considered for testing the generalization?

 

7- (Minor) Why do you describe your loss function as high-performance? It looks weird! Either drop "high-performance" or rephrase it with a correct adj (e.g., effective, robust...).

 

8- (Minor) You may drop the unnecessary pre-2006 references such as 1,2, and 6.

 

9- (Minor - Language, and typos):

- There is much mistaken use of the past tense when discussing facts about the approach, especially in the intro; see the following examples:

 (1) "our network was a simple ResNet50-based" => our network is a simple ResNet50-based

 (2) "Our network was inspired by" => Our network is inspired by

 (3) "The proposed network generated a disparity" => The proposed network generates a disparity 

And many more to fix.

 

- Abstract:

"To be precise, it output" => it outputs

"through pairwise preference model"  => through a pairwise preference model

-Intro.

"However, it can be not easy to" => However, it cannot be easy to

-Conclusion

"between the reconstructed image and the target image" => between the reconstructed and target images.

"difference between a distorted image and a real image"=> difference between distorted and real images.

"For fair comparison" => For a fair comparison

 

 

 

Author Response

First of all, we are very grateful for your thoughtful comments. Thanks to you, our work has improved a lot. Below are responses to each of your comments.

1- (Major - Sections 1 & 3): I have a completely different opinion on the following statements and respectfully disagree.

* Line 61 "Even if a student has the capacity to learn, it is difficult to educate him if the teacher’s skills are poor. In the same way, regardless of the sophistication of the design of a deep-learning network, it is ineffective if the network does not learn as intended. Therefore, a deep-learning network requires a loss function that can train the network as intended."

* Line 71 "However, regardless of the quality of the network, it cannot accurately predict the depth if the loss functions for image synthesis are not well designed."

* Line 164 "As previously referenced, a student’s capacity to learn and a teacher’s educational skills could represent a learning network and loss functions, respectively, in the concept of machine learning. (in addition, textbooks could correspond to datasets.) A good loss function can effectively support learning goals. Consequently, we used a high-performance loss function to train our network. Note that we trained a relatively simple network to highlight the effect of the loss function."

- First: these statements mix up different ML terminologies (model capacity, the quality of NN/prediction vs. loss functions) in a confusing way! Model capacity w.r.t. memorization/generalization over the target datasets depends principally on the architecture, and this has nothing to do with loss functions. Model capacity can be estimated theoretically by, e.g., the VC dimension. The quality of model prediction depends on many factors, one of which is the dataset's quality (garbage in, garbage out). The speed at which a model learns (i.e., memorizes/generalizes the training data), however, depends on many optimization factors, one of which is the ((adaptive loss function)).

- Second: I didn't understand what the authors were trying to say regarding student and teacher learning networks and linking them with loss functions. Student-teacher learning is a well-known knowledge-distillation method in transfer learning, typically performed using, e.g., two networks in a reinforcement-learning fashion. I have no clue how the authors likened such a transfer-learning method to their approach, which consists of a single model! Again, a clever loss function is associated with the speed/effectiveness of the learning process only.

I strongly suggest rephrasing what has been said in these statements correctly.

  • We really appreciate your careful review and valuable suggestions. We've reviewed your expert suggestions and rephrased the problematic sentences. We've changed the tone to say that a good loss function can effectively contribute to improving network learning performance (by reducing network misunderstanding), and we removed the 'teacher-student learning' part that could confuse readers. Thanks to your comments, the paper is now written more clearly.

2- (Major - Section 2: Related work) To better present the originality of the work, and since the proposed loss function is the main contribution, please review the existing training loss functions for neural depth estimation in more detail. Besides, please summarize the performance of the current works (discussed already in Sections 2.2 and 2.3) in a proper table for readability.

  • As you suggested, we’ve added 'Table 1: Image reconstruction loss of each unsupervised learning-based depth-prediction study' to Section 2. The paper is more readable thanks to your suggestion.

3- (Major - Section 3.2: Training Loss)

- The coefficients (alpha and beta) in Eq. (7) are not discussed! Are they pre-defined regularizers or just additional trainable (scalar) parameters?

  • The coefficients you mentioned are pre-defined regularizers. To avoid misunderstanding by readers, we’ve added the sentence '$\alpha$ in Equation (7) represents the weight ratio of the $L^{L1}$ loss and the $L^{PieAPP}$ loss within the total image reconstruction loss, and it is optimally pre-determined through experiments.'
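
For readers of this record, a plausible form of the weighting described above is the following sketch (an assumption based on the response and on the reviewer's mention of two coefficients, not the manuscript's exact Equation (7)):

    $L^{rec} = \alpha \cdot L^{L1} + \beta \cdot L^{PieAPP}$

where $\alpha$ and $\beta$ are fixed before training (e.g., $\beta = 1 - \alpha$ if they express a ratio), so they add no trainable parameters.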

- I have probably missed a formula that combines the LR-consistency loss with the smoothness loss function, as depicted in Fig. 2!

  • Formulas (8)–(13) in the text represent the left-right consistency loss and the smoothness loss. The figure only shows which components are involved in these two losses.
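
For orientation, this family of unsupervised depth models (following Godard et al., CVPR 2017) typically defines the two losses roughly as follows; this is a sketch with assumed notation, not the manuscript's exact Equations (8)–(13):

    $L^{lr} = \frac{1}{N} \sum_{i,j} \left| d^{l}_{i,j} - d^{r}_{i,\, j + d^{l}_{i,j}} \right|$   (each left-view disparity should agree with the right-view disparity it points to)

    $L^{smooth} = \frac{1}{N} \sum_{i,j} \left| \partial_x d_{i,j} \right| e^{-\left\| \partial_x I_{i,j} \right\|} + \left| \partial_y d_{i,j} \right| e^{-\left\| \partial_y I_{i,j} \right\|}$   (edge-aware smoothness: disparity gradients are penalized except across image edges)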

- Generally speaking, growing a polynomial loss function with extra coefficient parameters may make the learning network more complicated to train and cause more computational time to execute. Hence, in the ablation study, it would be interesting to see the effect of the proposed loss function on the training time. (This point is minor to me, and the authors may consider it future work.)

4 - (Major - Figure 2): Please add a figure legend that describes the mentioned abbreviations (disparity d, image I, consistency con, coarse c, medium m, fine-view f, etc.). It made me read the text a few times to figure out some of these abbreviations!

  • As you mentioned, we’ve added the abbreviations to the figure caption.

5- (Minor - Section 3, line 163): I'd suggest dropping the phrase "a well-designed" and probably say .. (This section also details the rationale of our introduced/defined loss function ..)

  • As you suggested, we’ve restructured the problematic sentences. Thank you so much.

6- (Minor) Why is a recent benchmarking dataset (i.e., DIODE) not considered for testing the generalization?

  • Since we focus on the fact that this study can replace the LiDAR sensor used in autonomous driving, which mainly operates in outdoor environments, we did not separately consider datasets that provide depth for indoor scenes. (For the same reason, we did not test generalization performance on datasets such as NYU v2 and Make3D, and instead used datasets more suited to the original purpose, such as KITTI and CityScapes.) We will positively consider presenting test results on the latest datasets, such as DIODE, in future studies. Thank you so much for your understanding.

7- (Minor) Why do you describe your loss function as high-performance? It looks weird! Either drop "high-performance" or rephrase it with a correct adj (e.g., effective, robust...).

  • As you suggested, we’ve restructured the problematic sentences. Thank you so much.

8- (Minor) You may drop the unnecessary pre-2006 references such as 1,2, and 6.

  • As you suggested, we removed references [1] and [2]. However, SSIM [6] is a representative traditional computer-vision-based IQA algorithm and one of the comparison groups needed to highlight the superiority of the perceptual IQA model presented in this study, so please understand that we decided to keep that reference.

9- (Minor - Language, and typos):

- There is much mistaken use of the past tense when discussing facts about the approach, especially in the intro; see the following examples:

(1) "our network was a simple ResNet50-based" => our network is a simple ResNet50-based

(2) "Our network was inspired by" => Our network is inspired by

(3) "The proposed network generated a disparity" => The proposed network generates a disparity

And many more to fix.

- Abstract:

"To be precise, it output" => it outputs

"through pairwise preference model" => through a pairwise preference model

-Intro.

"However, it can be not easy to" => However, it cannot be easy to

-Conclusion

"between the reconstructed image and the target image" => between the reconstructed and target images.

"difference between a distorted image and a real image"=> difference between distorted and real images.

"For fair comparison" => For a fair comparison

  • As you suggested, we’ve restructured the problematic sentences. Thank you so much.

Reviewer 4 Report

This paper presents an unsupervised learning approach based on perceptual image error assessment to predict pixel-level depth maps for monocular camera images. The authors designed a learning model that integrates a simple deep-learning network and a high-performance loss function that mimics human cognitive abilities. The authors trained their model to synthesize, from an input image (left image), an image as close as possible to the corresponding image from the other viewpoint (right image). However, the description of the innovation points is insufficient to reflect the innovation of this paper.

Q1: Abstract

It is suggested that the author add a challenging description of the work in this paper to reflect the necessity and innovation of this work.

Q2: Abstract

What are the cutting-edge methods for predicting pixel depth maps of monocular camera images, and what are the drawbacks of these methods?

Q3: Introduction

It is suggested that the author mainly describe the innovation points in this part and add the organizational structure of this paper.

Q4: Introduction

Fig. 1 shows experimental results of this paper; it is suggested that it be moved to the experimental results section.

Q5: Introduction

Fig. 1 shows the results compared with other methods. What are these other methods, and why were they chosen?

Q6: Related Work

What are the advantages and disadvantages of the solutions mentioned in Sections 2.2 and 2.3? Adding these descriptions would better convey the novelty of the work.

Q7: Experiments

What is the principle of parameter setting in Section 4.1.2?

Q8: Conclusion

What are the drawbacks of the approach mentioned in this article? What is the future direction of the work?

Author Response

First of all, we are very grateful for your thoughtful comments. Thanks to you, our work has improved a lot. Below are responses to each of your comments.

Q1: Abstract

It is suggested that the author add a challenging description of the work in this paper to reflect the necessity and innovation of this work.

  • As you suggested, we have rewritten the abstract to reflect the necessity and innovation of our work more clearly. At the end of the abstract, we included the following paragraph: “In recent related studies, the photometric difference has been calculated through simple methods such as L1 and L2 loss, or by combining one of these with a traditional computer vision-based hand-coded image quality assessment algorithm such as SSIM. However, these methods have limitations in modeling various patterns at the level of the human visual system. Therefore, the proposed model uses a pre-trained perceptual image quality assessment model that effectively mimics human perception mechanisms to measure the quality of distorted images as the image reconstruction loss. In order to highlight the performance of the proposed loss function, a simple ResNet50-based network is adopted in our model.”
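
To make the contrast in that paragraph concrete, below is a minimal PyTorch sketch of the two photometric-loss styles it describes: the conventional SSIM+L1 mix versus blending L1 with a pre-trained perceptual IQA model. The function names, the default weight, and the perceptual_model interface are illustrative assumptions, not the authors' exact implementation.

    import torch
    import torch.nn.functional as F

    def ssim_distance(x, y):
        # Simplified SSIM over 3x3 average-pooled local statistics, mapped
        # to a [0, 1] distance; a form widely used in unsupervised depth work.
        C1, C2 = 0.01 ** 2, 0.03 ** 2
        mu_x = F.avg_pool2d(x, 3, 1, padding=1)
        mu_y = F.avg_pool2d(y, 3, 1, padding=1)
        sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
        sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
        sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
        num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
        den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
        return torch.clamp((1 - num / den) / 2, 0, 1)

    def reconstruction_loss(recon, target, alpha=0.85, perceptual_model=None):
        # Conventional baseline: weighted SSIM + L1 photometric difference.
        # Perceptual variant: swap the hand-coded SSIM term for the score of
        # a pre-trained IQA network (PieAPP-style); perceptual_model is a
        # hypothetical callable mapping (recon, target) to an error tensor.
        l1 = (recon - target).abs().mean()
        if perceptual_model is None:
            return alpha * ssim_distance(recon, target).mean() + (1 - alpha) * l1
        return alpha * perceptual_model(recon, target).mean() + (1 - alpha) * l1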

Q2: Abstract

What are the cutting-edge methods for predicting pixel depth maps of monocular camera images, and what are the drawbacks of these methods?

  • The related cutting-edge methods and their drawbacks are now explained more clearly in the introduction and related-work sections.

Q3: Introduction

It is suggested that the author mainly describe the innovation points in this part and add the organizational structure of this paper.

  • As you suggested, the innovation points are described more clearly in both the middle and end parts of the Introduction, and the organizational structure of the paper is presented at the end of the Introduction section.

Q4: Introduction

Fig.1 shows the experimental results of this paper, which is suggested to be moved to the section of experimental results.

  • Figure 1 was a sample to show readers that our model visually outperformed other studies. However, we deleted Figure 1, judging that its content is conveyed more clearly by Figure 2 and the corresponding text of the experimental results section.

Q5: Introduction

Fig.1 shows the results compared with other methods. What do these other methods mean? Why choose these methods?

  • We deleted Figure 1, judging that its content is conveyed more clearly by Figure 2 of the experimental results section. The related methods and the background of their selection are described in detail in the related work and experiments sections.

Q6: Related Work

What are the advantages and disadvantages of the solutions mentioned in Sections 2.2 and 2.3? Adding these descriptions is more conducive to creativity.

  • As you suggested, we have added the advantages and disadvantages of the solutions in Sections 2.1, 2.2, and 2.3 at the ends of the corresponding subsections.

Q7: Experiments

What is the principle of parameter setting in Section 4.1.2?

  • Following the general principles of hyperparameter setting, we repeatedly evaluated the network's accuracy on randomly sampled validation data to find the optimal values.
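
A minimal sketch of the procedure described above, assuming a simple random search; train_fn, eval_fn, and search_space are hypothetical placeholders, since the response does not give the actual tuning code:

    import random

    def tune_hyperparameters(train_fn, eval_fn, search_space, trials=20):
        # Repeatedly sample a hyperparameter setting, train the network, and
        # evaluate its accuracy on a randomly sampled validation split,
        # keeping the best-performing setting.
        best_params, best_acc = None, float("-inf")
        for _ in range(trials):
            params = {name: random.choice(values)
                      for name, values in search_space.items()}
            model = train_fn(**params)
            acc = eval_fn(model)  # accuracy on a random validation sample
            if acc > best_acc:
                best_params, best_acc = params, acc
        return best_params, best_acc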

Q8: Conclusion

What are the drawbacks of the approach mentioned in this article? What is the future direction of work?

  • As you suggested, we’ve added the shortcomings of this method and future research directions at the end of the conclusion section. Thank you so much for your careful review and valuable suggestions. Thanks to your comments, the paper has become more focused and readable.

Round 2

Reviewer 1 Report

The authors have addressed all my comments.

Reviewer 3 Report

The authors have improved their manuscript and implemented my comments to a large extent professionally, so I suggest publishing the paper.
