Superb Monocular Depth Estimation Based on Transfer Learning and Surface Normal Guidance

Accurately sensing the surrounding 3D scene is indispensable for drones or robots to execute path planning and navigation. In this paper, a novel monocular depth estimation method is proposed that first uses a lightweight Convolutional Neural Network (CNN) for coarse depth prediction and then refines the coarse depth images under surface normal guidance. Specifically, the coarse depth prediction network is designed as a pre-trained encoder–decoder architecture for describing the 3D structure. For surface normal estimation, the deep learning network is designed as a two-stream encoder–decoder structure, which hierarchically merges red-green-blue-depth (RGB-D) images to capture more accurate geometric boundaries. Relying on fewer network parameters and a simpler learning structure, our method produces more detailed depth maps than existing state-of-the-art approaches. Moreover, 3D point cloud maps reconstructed from the predicted depth images confirm that our framework can be conveniently adopted as a component of a monocular simultaneous localization and mapping (SLAM) paradigm.


Background
Compared with depth estimation methods relying on laser rangefinders or other optical instruments, the computer vision method does not require expensive optical equipment and repeated lens calibration. Therefore, image-based depth prediction has been extensively studied and widely applied to 3D scene understanding tasks, such as structure from motion (SFM) [1,2], simultaneous localization and mapping (SLAM) [3,4], 3D object detection [5], etc.
The computer vision method, i.e., image-based depth estimation, defines image depth as the distance from the object point corresponding to each pixel to the camera and exploits image cues such as linear perspective, focus, occlusion, texture, shadow, and gradient for calculation. All image-based methods can be grouped into two classes: stereo vision methods and monocular methods. Stereo vision methods depend heavily on natural light to collect images and are sensitive to changes in illumination angle and intensity; differences in matching the two images lead to considerable variation across matching algorithms. Compared with stereo vision methods, monocular vision systems [6][7][8][9] rarely encounter these problems. Our first contribution is a network whose Densenet-121 backbone affords real-time coarse depth estimation and depth refinement on a single NVIDIA Tesla M40 (Santa Clara, CA, USA) with 12 GB of memory. Benefiting from the guidance of surface normal maps, the Densenet-121-based network obtains better depth maps than SOTA practices [6,7,9]. As shown in Figure 1, our network outperforms SOTA depth estimation on the NYU [14] dataset and produces higher-resolution results that capture object boundaries more faithfully.
Our second contribution is an RGB-D surface normal network, which effectively captures the geometric relationships between RGB and depth images. Different from previous frameworks [21,22], we propose a fusion network leveraging both RGB and coarse depth prediction images instead of RGB images only. Images from different domains complement each other in the surface normal estimation process, i.e., coarse depth images from the CDE network help enhance geometric details, and RGB images make up for the missing depth pixels. As shown in Figure 2, we achieve better surface normal maps than Qi et al. [10] with the RGB-D surface normal network. Moreover, as shown in Figure 1, with the geometric guidance of surface normals, depth maps with distinct boundaries are acquired.

Figure 1. For all depth maps in our work, the image color encodes distance, from nearest to farthest, according to the color bar above: ground-truth depth maps from NYU [14], (c) depth maps from the state-of-the-art (SOTA) practice [7], (d) depth maps from our depth prediction network.

Related Work
Stereo depth estimation can be regarded as a well-posed process once the problems of occlusion and depth discontinuity are ignored. Moreover, depth estimation methods based on stereo vision have achieved even more accurate and robust depth maps than RGB-D sensors [14,15]. Meanwhile, more precise depth features from our monocular method can also contribute to better multi-view stereo reconstruction. Monocular depth estimation has been considered by plenty of researchers [23][24][25][26][27][28][29], who usually define the estimation as a regression of the depth map from a single RGB image. Eigen et al. [21] introduced the application of CNNs to monocular depth estimation, which inspired researchers to explore methods based on deep learning. At present, deep learning methods play a leading role in monocular depth estimation. Generally, deep learning methods can be divided into supervised, self-supervised, and unsupervised approaches. Supervised monocular depth methods have achieved great breakthroughs relying on well-annotated ground-truth images offered by datasets [14][15][16]. For example, Liu et al. [28] combined CNNs with a CRF to learn super-pixel-wise connections between depth and RGB images. Different from supervised methods, self-supervised methods usually set up a separate camera pose estimation network [26,27] or jointly calculate optical flow and camera pose [30]. Unsupervised practices learn scene depth image synthesis [10] or ego-motion in monocular videos without using ground truth data [24,25]. Similar to recent SOTA methods [2,7], the supervised learning approach was selected in this paper.
Multi-task/cross-task learning is designed based on intrinsic connections among physical elements of the selected research targets. Some recent works attempt to investigate the sharing of image features between different tasks [31][32][33]. Jiao et al. [34] jointly trained semantic labeling and depth estimation in their encoder-decoder network architecture. Zhang et al. [11] proposed a pattern-affinity propagation method for jointly predicting depth, surface normal, and semantic segmentation. In our case, the proposed network first jointly predicts depth and surface normal and then takes advantage of surface normal maps to refine the predicted coarse depth maps.
Surface normal guidance has been introduced by previous studies [10,19,20], which employed surface normal maps as 3D cues for improving the geometric quality of monocular depth images. Qi et al. [10] jointly calculated depth and surface normal from a single image, making the final estimation geometrically more precise. In the work of Zeng et al. [19], a skip-connected architecture was proposed to fuse features from different layers for surface normal estimation. A novel 3D geometric feature, the virtual normal, was proposed by Yin et al. [20] to refine predicted depth maps. Surface normal estimation is adopted in this paper to calculate the angular difference between predicted depth images and ground-truth maps, thereby applying a geometric constraint to the depth images.
A transfer-learning-based deep framework was adopted by Zhang et al. [11] to obtain state-of-the-art semantic segmentation maps. We adopt a pre-trained Densenet-121 model [5] as the backbone of the coarse depth estimation network for depth feature extraction. Our method benefits from the application of transfer learning, where we take advantage of encoders originally designed for 3D object detection by Alhashim [6].
Encoder-decoder networks have been widely adopted in various computer vision tasks such as image classification, image segmentation, and 3D object detection. In recent years, such architectures have made significant contributions to both supervised and unsupervised practices of monocular depth estimation [3,30]. We devised a compendious but effective symmetrical encoder-decoder structure with skip connections. Repeated experiments indicated that our encoder-decoder network with a simple structure can outperform SOTA depth synthesis based on more complicated deep learning architectures [7,9].

Our Method
This section presents our method for monocular depth estimation. First, the general deep learning framework is introduced. Then, rational loss functions for the overall training process are defined. Finally, we discuss the practice of 3D point cloud map reconstruction.

Framework Overview
As shown in Figure 3, the whole deep learning structure consists of three parts: a coarse depth estimation network, an RGB-D surface normal network, and a refinement network.
The coarse depth estimation (CDE) network leverages an efficient encoder-decoder network architecture. Coarse depth images generated from the CDE network are then fed to the RGB-D surface normal network for RGB-D fusion. Moreover, the coarse depth images are also converted to coarse surface normal maps based on a fixed-weight network [4]. For convenience, we describe a single RGB input image as I C , a single in-painted ground truth depth image produced following Levin [35] as D GT , an output coarse depth map as D * , and a coarse surface normal map recovered from D * as N * .
The RGB-D surface normal (RSN) network was designed to obtain accurate surface normal maps, which serve to refine the coarse depth maps (D * ). As shown in Figure 3, the RSN network can be divided into two streams: the RGB stream and the depth stream. The latter can be further divided into the depth branch and the confidence map branch. For the general architecture of the RSN network, we define a single RGB input as I F , a corresponding coarse depth input as D * , and a surface normal output map as N F .
The RGB stream and depth branch in Figure 3 operate separately to generate RGB features R 1 ∼ R 4 and relative sensor depth features D 1 ∼ D 4 with hierarchical resolutions. Then, the two branches cooperate to combine and fuse the features from each branch.

Coarse Depth Estimation (CDE) Network
As shown in Figure 5, the coarse image D * is then converted to a coarse surface normal image (N * ) based on the least squares algorithm [37], and the inference network is simply a fixed-weight network [11]. The ground truth depth images (D GT ) used for training the CDE network were produced following Levin et al. [35]. This method helps fill pixel holes in sensor depth images from NYU-Depth-V2 [14]. However, it cannot eliminate the side effects that lost pixels have on the accuracy of ground truth depth images and the corresponding coarse depth images (D * ). Therefore, a confidence map network branch was set up following the method proposed by Zeng et al. [11], which generates confidence maps indicating whether pixel holes cause side effects on D * . Confidence maps [19] of the depth image were produced by combining mask images [21] (I M ) with the relative coarse depth images (D * ) and were denoted as C 1 ∼ C 4 according to resolution.
The refinement network actually serves as a convolution kernel function, which optimizes the coarse depth maps from the CDE network under the guidance of surface normal maps from the RSN network. Finally, with the aid of the refinement network, a highly accurate depth map D is generated. Figure 4 shows the detailed structure of the encoder-decoder network for obtaining coarse depth maps (D * ). For the encoder, the raw RGB image (I C ) is encoded into a feature vector by a Densenet-121 [5] model pre-trained on ImageNet [36]. The feature vector is then transmitted to a sequence of up-sampling layers to produce D * at half the resolution of I C . For the decoding operation, the decoder network consists of four up-sampling units (BU 1 and USB 1 ∼ USB 3 ) and relative concatenation (⊕) skip-connections. In the decoding layers, the 2× bi-linear interpolation proposed by Alhashim [6] is adopted as the up-sampling method.
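The decoder's core operation, 2× bilinear up-sampling followed by concatenation with an encoder skip feature, can be illustrated with a minimal NumPy sketch. The function names and the (C, H, W) layout are our own; the actual network applies learned convolutions after each concatenation, which are omitted here.

```python
import numpy as np

def bilinear_upsample_2x(feat):
    """2x bilinear up-sampling of a feature map shaped (C, H, W)."""
    c, h, w = feat.shape
    out_h, out_w = 2 * h, 2 * w
    # Sample positions in the input grid (align_corners=False convention).
    ys = (np.arange(out_h) + 0.5) / 2 - 0.5
    xs = (np.arange(out_w) + 0.5) / 2 - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[None, :, None]
    wx = np.clip(xs - x0, 0, 1)[None, None, :]
    top = feat[:, y0][:, :, x0] * (1 - wx) + feat[:, y0][:, :, x1] * wx
    bot = feat[:, y1][:, :, x0] * (1 - wx) + feat[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def upsample_block(decoder_feat, skip_feat):
    """Up-sample decoder features 2x and concatenate the encoder skip features."""
    up = bilinear_upsample_2x(decoder_feat)
    return np.concatenate([up, skip_feat], axis=0)
```

Each USB unit in Figure 4 would then reduce the concatenated channels back down with convolutions before the next up-sampling step.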

RGB-D Surface Normal (RSN) Network
Both the RGB stream and the depth branch leverage an encoder-decoder network architecture. The two branches employ a pre-trained Densenet-121 model as the encoding backbone [26], whose detailed structure is illustrated in the upper row of Figure 6. Generally, the encoder consists of the same convolution layers as the raw Densenet-121 [5] encoder, except for the last convolution blocks. The number of channels was reduced from 1024 to 512 via bottleneck layers, aiming to remove redundant parameters. A symmetric decoder equipped with concatenation connections to the refitted encoder was designed. Multi-scale up-sampling layers were introduced to the decoder, which enables RGB-D images to fuse at different scales. Moreover, common pooling masks were set to let the network learn more image features.
Different from the CDE Network, the RSN Network leverages coarse depth images (D * ) instead of in-painted [35] ground truth depth images as the depth input. The pre-trained RSN Network can thus be applied to estimate surface normal maps from custom images.
As shown in Figure 3, pixel holes in mask images (I M ) suggest that there are many missing pixels in ground truth depth images, which inevitably introduces deviation into supervised learning. Therefore, we adopted a multi-layer convolution network (CB2) for producing confidence maps C l [19] of the input depth images. Here, l stands for the scale of the images, i.e., if the resolution of a 2D image is denoted as H × W, then the corresponding l is defined as {(l, H × W)} = {(1, 40 × 30); (2, 80 × 60); (3, 160 × 120); (4, 320 × 240)}. The detailed structure of CB2 is shown in Figure 6b. The depth branch also adopts Densenet-121-based encoding layers, and the fusion calculation takes place on the decoder side. As shown in Figure 6a, the depth features D * l are passed into the fusion module at each scale l and re-weighted with the confidence maps C 1 ∼ C 4 . The re-weighted D * l are then concatenated (⊗) with color features of the same resolution and transmitted to a de-convolution layer to produce surface maps based on RGB-D fusion. Finally, the convolution block (CB4) of the RSN Network generates the surface map N * . The RGB-D fusion algorithm [19] can be expressed as Equation (1).
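Since Equation (1) itself is not reproduced in this excerpt, the following is only a plausible NumPy sketch of the confidence-weighted fusion step at one scale; the function name and the exact re-weighting form are our assumptions.

```python
import numpy as np

def fuse_rgbd(rgb_feat, depth_feat, confidence):
    """Confidence-weighted RGB-D fusion at one scale l.

    rgb_feat:   (C_r, H, W) color features R_l
    depth_feat: (C_d, H, W) coarse depth features D*_l
    confidence: (1, H, W)  confidence map C_l in [0, 1]

    Depth features are re-weighted pixel-wise by the confidence map
    (suppressing unreliable pixels caused by depth holes), then
    concatenated with the color features of the same resolution.
    """
    reweighted = depth_feat * confidence
    return np.concatenate([rgb_feat, reweighted], axis=0)
```

In the actual network, the concatenated tensor is passed to a de-convolution layer, which is omitted here.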

The decoder layers of the RSN network were also designed based on the Densenet-121 model. Different from the CDE network, up-projection units [17] were employed in both streams instead of bilinear interpolation to boost the surface normal estimation process. The detailed structure of the up-projection units is shown in the upper row of Figure 6b.

Refinement Network
For a random pixel i from the coarse depth images, we denote (h i , w i ) as the location of pixel i in 2D space and (x i , y i , d * i ) as the coordinates of the corresponding 3D point, where d * i represents the coarse depth value. Similarly, the surface normal can be denoted as (n x i , n y i , n d * i ). Then, a tangent plane p i [11] can be defined according to Equation (2).
A small 3D neighborhood (H i ) of pixel i was defined in previous studies [10,38]; for a random pixel j ∈ H i , its depth value is d j . The depth prediction value of pixel i can be computed according to the pinhole camera model, where f x and f y represent the focal lengths in the x and y directions, respectively; C x and C y are the coordinates of the lens principal point; and K represents the kernel operation. Then, in order to refine the depth value of pixel i, we applied kernel regression [10] over all pixels in H i with the linear kernel K(n i , n j ) = n j T n i , where d i is the refined depth value. According to the linear kernel algorithm, the angular error between surface normals n i and n j decides whether pixels i and j lie in the same tangent plane p i . Therefore, a smaller angular error contributes to a more accurate depth estimate d ji .
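The tangent-plane propagation and linear-kernel averaging described above can be sketched in NumPy as follows. Since the paper's intermediate equations are not reproduced in this excerpt, the helper names and the exact form of the per-neighbor proposal d_ji (intersecting pixel i's viewing ray with neighbor j's tangent plane) are our reconstruction.

```python
import numpy as np

def ray(u, v, fx, fy, cx, cy):
    """Back-projection ray direction of pixel (u, v) under the pinhole model."""
    return np.array([(u - cx) / fx, (v - cy) / fy, 1.0])

def refine_depth(i_uv, n_i, neighbors, fx, fy, cx, cy):
    """Kernel-regression depth refinement for one pixel i.

    neighbors: list of (u_j, v_j, d_j, n_j) for pixels j in the
    neighborhood H_i. Each neighbor proposes a depth d_ji that places
    pixel i on j's tangent plane; proposals are averaged with the
    linear kernel K(n_i, n_j) = n_j^T n_i.
    """
    r_i = ray(*i_uv, fx, fy, cx, cy)
    num, den = 0.0, 0.0
    for u_j, v_j, d_j, n_j in neighbors:
        x_j = d_j * ray(u_j, v_j, fx, fy, cx, cy)   # 3D point of pixel j
        d_ji = np.dot(n_j, x_j) / np.dot(n_j, r_i)  # tangent-plane proposal
        k = max(np.dot(n_j, n_i), 0.0)              # linear kernel weight
        num += k * d_ji
        den += k
    return num / den if den > 0 else None
```

For a perfectly planar patch, every neighbor's tangent plane yields the same proposal, so the refined depth is exact; as angular errors grow, misaligned neighbors receive smaller kernel weights.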

Loss Function
This section presents loss functions for the regression problems existing in depth and surface normal estimation. The loss functions evaluate the difference between the ground truth images and the prediction images generated by the deep learning network. For a random pixel i, we define the coarse depth map, refined depth map, and ground-truth depth map as d * i , d i , and d gt i , respectively. Similarly, we denote the coarse surface normal map, RGB-D fusion surface normal, and ground-truth surface normal as n * i , n F i , and n gt i , respectively. The total number of pixels is M, and the loss function for depth values is defined as Equation (6).
As shown in Equation (6), the loss function l depth is the point-wise loss defined on depth values. On the one hand, l depth computes the sum of the L2 norms of the error vectors between coarse depth maps d * i and ground truth images d gt i . On the other hand, l depth calculates the difference between refined depth maps d i and ground truth images d gt i over all pixels. The loss for the surface normal training process is defined as Equation (7), where l in Equation (7) stands for the scale value [19] of surface normal features, and ω 1 , ω 2 , and µ l stand for the weight parameters of the loss functions. The loss function l normal computes the L2 norm of the angular errors of the orientation vectors for each pixel. On the one hand, l normal computes the L2 norm of the angular difference between coarse surface normal maps n * i and ground-truth normal maps n gt i . On the other hand, l normal calculates the divergence between fusion normal maps n F l i and ground-truth normal maps n gt i . There are many step edges, such as texture crosshatches or object boundaries [39], in natural RGB images, which inevitably interfere with the accuracy of depth estimation. It is necessary to prevent such interference so as to avoid distorted or blurry results around step edges. The l grad function defined by Hu et al. [9] was adopted in this paper in order to penalize depth errors between neighboring pixels, where τ i = d * i − d gt i , and ∇ x (τ i ) and ∇ y (τ i ) represent the spatial derivatives of τ i along the x and y directions, respectively. F(x) stands for a logarithmic function of depth errors defined by Hu [30].
In conclusion, we define a total loss denoted as l total : l total = λ 1 l depth + λ 2 l grad + λ 3 l normal (9) where l depth represents the pixel-wise loss for depth values, l grad represents the loss function for step edges, and l normal represents the loss function for surface normals. λ 1 , λ 2 , and λ 3 stand for initialized weights that balance the effects of back propagation.
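A simplified NumPy sketch of the three loss terms and their weighted sum follows. This is an illustrative approximation, not the paper's precise Equations (6)-(9): the per-scale weights ω 1 , ω 2 , and µ l and the exact gradient penalty F are omitted.

```python
import numpy as np

def depth_loss(d_coarse, d_refined, d_gt):
    """Point-wise depth loss: L2 error of both coarse and refined maps."""
    m = d_gt.size
    return (np.sum((d_coarse - d_gt) ** 2) + np.sum((d_refined - d_gt) ** 2)) / m

def grad_loss(d, d_gt):
    """Penalize depth-error gradients around step edges (simplified l_grad)."""
    tau = d - d_gt
    gx = np.diff(tau, axis=1)
    gy = np.diff(tau, axis=0)
    return (np.sum(np.abs(gx)) + np.sum(np.abs(gy))) / d_gt.size

def normal_loss(n_pred, n_gt):
    """Angular error between unit normal maps shaped (H, W, 3)."""
    cos = np.clip(np.sum(n_pred * n_gt, axis=-1), -1.0, 1.0)
    return np.mean(1.0 - cos)

def total_loss(d_coarse, d_refined, d_gt, n_pred, n_gt, lambdas=(1.0, 1.0, 1.0)):
    """l_total = lambda1 * l_depth + lambda2 * l_grad + lambda3 * l_normal."""
    l1, l2, l3 = lambdas
    return (l1 * depth_loss(d_coarse, d_refined, d_gt)
            + l2 * grad_loss(d_refined, d_gt)
            + l3 * normal_loss(n_pred, n_gt))
```

A perfect prediction drives every term to zero, so the weights λ only matter for balancing gradients during back propagation.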

Recovering 3D Features from Estimated Depth
As is often the case [10,21], 3D space geometric constraints contribute to the capability of the deep network in terms of depth estimation and the corresponding 3D point cloud reconstruction. Therefore, the 3D point cloud maps shown in Figure 7 were reconstructed following Li et al. [40] to visualize the quality of depth maps predicted by our scheme.
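Back-projecting a predicted depth map into a 3D point cloud under the pinhole camera model can be sketched as follows (the function name is ours):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into an (H*W, 3) point cloud.

    fx, fy: focal lengths; cx, cy: principal point coordinates.
    Each pixel (u, v) with depth d maps to
    ((u - cx) * d / fx, (v - cy) * d / fy, d).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```

The resulting point set can be rendered directly to compare the geometric fidelity of different depth estimators.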
We compare the 3D point-cloud maps produced by our Densenet-121-based scheme with those of Dense-depth [6], which is based on Densenet-169 and does not introduce geometric constraints. As shown in Figures 7 and 8, our scheme outperforms Dense-depth [6] in terms of depth estimation results and the average quality of the corresponding 3D point-cloud maps. Therefore, the guidance of surface normals clearly improves the quality of depth maps in terms of 3D features. What is more, better 3D point cloud maps can be recovered from depth maps constrained by geometric features. As a result, the quality of point-cloud maps should also be adopted as a fundamental metric for evaluating the accuracy of depth estimation.


Benchmark Performance Comparison
We select the SOTA practice Dense-depth [6] for comparison, which adopts Densenet-169 [5] as the encoder of its deep network. The pre-trained model functions by extracting depth features from the RGB image input. As shown in Figure 8, Dense-depth outperforms the coarse depth estimation network based on Densenet-121 [5] proposed in this paper, but performs worse than our entire depth estimation scheme containing the surface normal guidance. Some depth estimation samples are shown in Figure 9, and more depth maps from our framework are listed in Appendix A. All images are colorized for better visualization. Table 1 lists depth estimation results generated by the proposed method and previous leading methods [6,7,9,21]. The quality of depth maps can be evaluated according to Equations (10)-(13).
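Although Equations (10)-(13) are not reproduced in this excerpt, the standard monocular depth metrics they conventionally denote (absolute relative error, RMSE, log10 error, and δ-threshold accuracy) can be computed as follows; the function name is ours.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth evaluation metrics.

    pred, gt: positive depth arrays of identical shape.
    Returns (abs_rel, rmse, log10, [delta<1.25, delta<1.25^2, delta<1.25^3]).
    """
    abs_rel = np.mean(np.abs(pred - gt) / gt)            # absolute relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))            # root mean squared error
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))
    ratio = np.maximum(pred / gt, gt / pred)             # symmetric ratio
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, rmse, log10, deltas
```

Lower abs_rel, rmse, and log10 indicate better accuracy, while higher δ percentages indicate more pixels within the tolerated ratio of ground truth.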
According to the results shown in Table 1 and Figure 8, the refined depth images have better geometric quality than those of Dense-depth, which proves the necessity of surface normal guidance in monocular depth prediction tasks. Table 2 compares surface normal calculation results based on RGB-image methods [33,37], the surface-normal-with-depth-consistency method [10], and the RGB-D fusion method (ours). From Table 2, it can be seen that the surface normal scheme leveraging both depth and RGB image features is on average superior to schemes that employ depth images or RGB images only.

Experiments
In this section, we describe the implementation details of our experiments, evaluate the performance of our depth estimation scheme on NYU-Depth-V2 [14], and compare the prediction results against existing state-of-the-art (SOTA) methods. Moreover, we present the results of ablation experiments to analyze the influence of the different parts of our proposed method.

Dataset
The NYU-Depth-V2 dataset [14] contains 407 K frames taken from 464 different indoor scenes, which were split into 249 training scenes and 215 testing scenes. Specifically, 1449 RGB images were accurately labeled with depth images, of which 654 are annotated for the testing phase and the others for the training phase. All images were collected from videos captured by a Kinect RGB-D sensor produced by Microsoft in Redmond, WA, USA. NYU-Depth-V2 widely serves as a training dataset for supervised monocular depth prediction due to its accurate ground-truth (GT) depth labels and abundant image samples.
In some previous studies [6,9,10,20], depth estimation networks were trained on subsets sampled from the NYU dataset [14]. In practice [10,20], 30 K training frames were sampled from the raw NYU dataset [14]. Even with fewer training frames, well-designed deep learning schemes [4,6] outperformed SOTA practices [7,21,41] that were trained on the entire NYU dataset. Moreover, the practice by Hu et al. [9] showed that models trained on subsets with more frames perform slightly better, but the gains in accuracy did not justify the lower learning efficiency and higher system latency. Therefore, instead of the official split image set, a subset with 30 K frames was used in this paper. All frames were randomly sampled from the 249 training scenes.
The NYU dataset [14] does not provide ground-truth surface normal maps (N_GT); previous studies [10,21] computed N_GT from in-painted depth images, and thus the quality of the produced N_GT depended on the in-painting algorithm proposed by Levin [35]. Instead of the in-painting method, the method proposed by Hickson et al. [42] was leveraged for obtaining N_GT in this paper. As for the confidence map network, we utilized the accurate binary mask images (I_M) offered by Eigen et al. [21].
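For intuition, a simplified way to derive surface normals from a depth map, not Hickson et al.'s exact algorithm [42] but a common gradient-based approximation, is the following numpy sketch (the scaled focal lengths fx, fy are illustrative assumptions):

```python
import numpy as np

def normals_from_depth(depth, fx=1.0, fy=1.0):
    """Approximate per-pixel surface normals from a depth map via central
    differences: cross the two image-plane tangent vectors and normalise.
    The resulting normal is proportional to (-fx*dz/du, -fy*dz/dv, 1)."""
    dz_dv, dz_du = np.gradient(depth)  # gradients along rows (v) and columns (u)
    nx = -dz_du * fx
    ny = -dz_dv * fy
    nz = np.ones_like(depth)
    n = np.stack([nx, ny, nz], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```

A planar, fronto-parallel depth region yields normals pointing straight at the camera, which matches the expected behaviour.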

Implementation Details
We implemented both the coarse depth estimation (CDE) network and the RGB-D surface normal network (RSN) using the deep learning platform PyTorch [43], operating on a Tesla M40 GPU produced by NVIDIA in Santa Clara, CA, USA, with 12 GB of memory.
The encoder of the CDE network was designed based on the pre-trained Densenet-121 model, which provides the weights for initialization. Specifically, the last classification layers of the Densenet-121 model were removed.
The Adam optimizer [44] was adopted for training the CDE network, and the learning rate was set to 0.0001. The raw RGB images with resolution H × W were down-sampled to H/2 × W/2 to speed up training and to fit the size of the output depth images. We conducted the coarse depth estimation training phase with a batch size of six for 20 epochs. A small subset with 654 samples [14] was employed for testing the performance of the CDE network, and the batch size for the testing phase was set to 32. The network finally produced depth images with resolution H/2 × W/2 and an error evaluation index for supervising the training algorithm.
In order to avoid over-fitting, four augmentation methods were employed following Cubuk et al. [45] for the depth estimation training phase: (1) The horizontal mirroring operation was applied to both RGB and depth images with a probability of 25%.
(2) Large rotations introduce invalid data in the GT depth images [6], so input images were rotated by small angles, ranging from −2 to 2 degrees, with a probability of 25%.
(3) The contrast and brightness values of the input RGB images were randomly scaled by (0.6, 1.2) with a probability of 50%.
(4) Both RGB and depth images were randomly resized to 320 × 256 with a probability of 50%.

The training process for surface normal estimation was also performed on NYU-Depth-V2 [14], and the number of training epochs was also set to 20 with a batch size of six. The testing image set for the RSN network consists of 654 sample RGB images and the corresponding ground-truth surface normal images produced following the practice of Hickson et al. [42]. For the optimizer, Adam [44] was selected with an initial learning rate of 1 × 10⁻⁴, initial parameters β1 = 0.9, β2 = 0.999, and a weight decay of 1 × 10⁻⁴. The RSN network also produced surface normal images with resolution H/2 × W/2.
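The four augmentations above can be sketched roughly as follows. This is a hypothetical numpy sketch with the probabilities from the text; the rotation step is only noted, since a real pipeline would delegate it to an image library.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(rgb, depth):
    """Sketch of the four training augmentations (rgb in [0, 1], H x W x 3)."""
    # (1) horizontal mirroring with probability 25%
    if rng.random() < 0.25:
        rgb, depth = rgb[:, ::-1], depth[:, ::-1]
    # (2) slight rotation in [-2, 2] degrees with probability 25%
    if rng.random() < 0.25:
        angle = rng.uniform(-2.0, 2.0)
        # a real pipeline would rotate both rgb and depth by `angle` here
    # (3) random contrast/brightness scaling in (0.6, 1.2) with probability 50%
    if rng.random() < 0.5:
        rgb = np.clip(rgb * rng.uniform(0.6, 1.2), 0.0, 1.0)
    # (4) random resize to 320 x 256 with probability 50% (nearest-neighbour here)
    if rng.random() < 0.5:
        ys = np.linspace(0, rgb.shape[0] - 1, 256).astype(int)
        xs = np.linspace(0, rgb.shape[1] - 1, 320).astype(int)
        rgb, depth = rgb[np.ix_(ys, xs)], depth[np.ix_(ys, xs)]
    return rgb, depth
```

Applying the same geometric operations to both the RGB image and the depth map keeps the pixel-wise correspondence required for supervised training.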
The refinement network functioned as the convolution kernel function in [10] and therefore does not demand any training process.
In all training experiments, the weight ω1 for l_depth was set to 0.5 to balance the importance of coarse depth estimation and depth refinement. The weight ω2 for l_grad was set to 0.5 to balance the influence of the two terms. As is often the case [6,7,9,10,20], the hyper-parameters were empirically set to reasonable values for the loss functions. In our study, parameter ω2 was set to 0.5 according to validations on an officially annotated subset with 1449 images. The weight µ_l for l_normal was set as µ_l = 0.2l, where l = {1, 2, 3, 4}. The parameter l was defined in Section 3.4 and stands for the scale of the images fed to the RSN network.
As shown in Equations (6)-(8), l_depth calculates the depth value error for each pixel, l_normal calculates the angular cosine error for each pixel, and l_grad calculates the gradient error in the log domain. Therefore, the value of l_depth will be considerably larger than l_normal and l_grad. To mitigate this effect, the parameter λ1 is set to a small value. According to test results on the annotated small subset, the parameters {λ1, λ2, λ3} were set to {0.2, 1, 1}.
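The weighted combination of the three loss terms might be written as in the numpy sketch below. The term formulations here follow common practice and only approximate the paper's Equations (6)-(8); the function name and signature are illustrative.

```python
import numpy as np

def combined_loss(d_pred, d_gt, n_pred=None, n_gt=None,
                  lambdas=(0.2, 1.0, 1.0)):
    """Illustrative weighted sum of a per-pixel depth error (l_depth),
    a log-domain gradient error (l_grad), and an angular cosine error
    on surface normals (l_normal), with weights {0.2, 1, 1}."""
    l1, l2, l3 = lambdas
    l_depth = np.mean(np.abs(d_pred - d_gt))                     # depth term
    gy_p, gx_p = np.gradient(np.log(d_pred))
    gy_g, gx_g = np.gradient(np.log(d_gt))
    l_grad = np.mean(np.abs(gx_p - gx_g) + np.abs(gy_p - gy_g))  # gradient term
    if n_pred is not None:
        cos = np.sum(n_pred * n_gt, axis=-1)                     # unit normals assumed
        l_normal = np.mean(1.0 - cos)                            # normal term
    else:
        l_normal = 0.0
    return l1 * l_depth + l2 * l_grad + l3 * l_normal
```

The small weight λ1 = 0.2 on l_depth reflects the scale imbalance discussed above: raw depth errors in metres dwarf the cosine and log-gradient terms.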
As training proceeds, the model gradually converges. The total number of training parameters for the entire network based on the Densenet-121 model was approximately 32.4 M. Training was performed for 600 K iterations on NYU-Depth-V2 [14], which took 18 h to finish. Over repeated training iterations, the values of the loss functions l_grad, l_normal, and l_depth converged toward zero.

Evaluation Criteria
For quantitative evaluation, three error metrics from the previous work of Eigen et al. [46] were used: absolute relative error (AbsRel), root mean squared error (RMSE), and average log10 error (E_log10). Moreover, threshold accuracy (T_re) was selected as the accuracy metric. Denoting D_i^GT as the ground-truth depth corresponding to pixel i, D_i as the corresponding estimated depth, and S as the total number of pixels with valid ground-truth values, the metrics are defined as:

AbsRel = (1/S) Σ_i |D_i − D_i^GT| / D_i^GT (10)

RMSE = sqrt((1/S) Σ_i (D_i − D_i^GT)²) (11)

E_log10 = (1/S) Σ_i |log10(D_i) − log10(D_i^GT)| (12)

T_re = percentage of pixels for which max(D_i / D_i^GT, D_i^GT / D_i) < threshold (13)

Here, the three thresholds (δ, δ², δ³) are set as (1.25, 1.25², 1.25³) according to conventional works [2,46].
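These four metrics can be computed directly from a pair of depth maps, for example with the following numpy sketch (function name illustrative):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics [46], matching Equations (10)-(13):
    AbsRel, RMSE, average log10 error, and threshold accuracies delta^t."""
    pred, gt = pred.ravel(), gt.ravel()
    valid = gt > 0                      # only pixels with valid ground truth
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** t) for t in (1, 2, 3)]
    return abs_rel, rmse, log10, deltas
```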
Three error metrics [21,27] were used for evaluating surface normal maps in this paper: mean angle error (Mean), median angle error (Median), and root mean square error (RMSE). Moreover, three thresholds (11.25°, 22.5°, 30°) [21] were used for calculating the fraction of pixels whose angular error falls below each threshold.
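A possible implementation of these angular metrics, assuming unit-length normal maps of shape H × W × 3, is:

```python
import numpy as np

def normal_metrics(pred, gt):
    """Per-pixel angular error (degrees) between two unit-normal maps:
    mean, median, RMSE, and the fraction below 11.25, 22.5, 30 degrees."""
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)  # dot product per pixel
    ang = np.degrees(np.arccos(cos)).ravel()
    within = [np.mean(ang < t) for t in (11.25, 22.5, 30.0)]
    return ang.mean(), np.median(ang), np.sqrt(np.mean(ang ** 2)), within
```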

Benchmark Performance Comparison
We select the SOTA practice Dense-depth [6] for comparison, which adopted Densenet-169 [5] as the encoder of its deep network. The pre-trained model functioned by extracting depth features from the RGB image input. As shown in Figure 8, Dense-depth outperforms the coarse depth estimation network based on Densenet-121 [5] proposed in this paper, while it performs worse than our entire depth estimation scheme containing the surface normal guidance. Some depth estimation samples are shown in Figure 9, and more depth maps from our framework are listed in Appendix A. All images are colorized for better visualization.

Table 1 lists the depth estimation results generated by the proposed method and previous works [6,7,9,21]. The quality of the depth maps can be evaluated according to Equations (10)-(13). According to the results shown in Table 1 and Figure 8, the refined depth images have better geometric quality than those of Dense-depth, which proves the necessity of surface normal guidance in monocular depth prediction tasks.

Table 2 compares the surface normal calculation results of RGB-image-based methods [33,37], the surface-normal-with-depth-consistency method [10], and the RGB-D fusion method (ours). From Table 2, it can be seen that the surface normal scheme leveraging both depth and RGB image features is on average superior to the schemes that employ depth images or RGB images only. Instead of directly comparing errors between pixel values, Table 2 compares the angular difference between orientation vectors for each pixel. The quantitative performance of previous research [10,22,37] is cited from the original papers. As listed in Table 2, with the aid of high-order geometric RGB-D fusion, the surface normal maps from our method outperform Geo-Net [10]. From Table 2 and Figure 8, we can conclude that the method proposed in this paper recovers better shape from RGB images, which contributes more accurate geometric details to the depth images.

Table 3 compares the computational efficiency of the proposed depth estimation algorithm with that of state-of-the-art (SOTA) methods [6,7,9]. As seen in Table 3, our model achieves SOTA results on the RMSE metric. What is more, our model requires fewer training parameters, fewer training iterations, and fewer image samples for the same number of training epochs. Furthermore, our model consumed less training time while operating on similar platforms. Table 3. Comparison of computational efficiency and performance on the NYU dataset [14]. "Parms." stands for training parameters, and "Iters." represents the number of training iterations.

Method | RMSE | Frames | Epochs | Training Time (h) | Iters. | Inference Time (s) | Parms.
Fu [7] | 0.509 | 120K | - | - | 3M | - | 110M
Alhashim [6] | 0

In addition, we tested our pre-trained model and the testing models released in [6,9] on a Tesla M40 produced by NVIDIA in Santa Clara, CA, USA, with a single 12 GB memory. As shown in Table 3, our model achieved lower latency and higher accuracy.
All the data shown in Table 3 were cited from the original papers and the corresponding released models [6,9].
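Per-image inference latency of the kind reported in Table 3 can be measured along these lines. This is a generic timing sketch: `model_fn` is a placeholder for any prediction function, and on a GPU an explicit synchronisation call would be needed before each timestamp.

```python
import time

def mean_inference_time(model_fn, inputs, warmup=2, runs=10):
    """Average wall-clock time per call of model_fn over `runs` passes,
    after a few warm-up calls to exclude one-off initialisation costs."""
    for x in inputs[:warmup]:
        model_fn(x)
    start = time.perf_counter()
    for _ in range(runs):
        for x in inputs:
            model_fn(x)
    return (time.perf_counter() - start) / (runs * len(inputs))
```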

Ablation Study
In this section, ablation studies are performed to verify the contribution of each part of the proposed architecture to depth estimation performance.
In this experiment, the Densenet-121 model was substituted with a pre-trained Densenet-161 in both the coarse depth estimation (CDE) network and the RGB-D surface normal (RSN) network. As shown in Table 4, the network based on the Densenet-161 model [5] outperforms that based on the Densenet-121 model [5] in terms of quantitative accuracy metrics. However, according to the training parameters listed in Table 4, the growth of the encoding layers introduces superfluous training costs. Furthermore, as shown in Table 4, with the geometric constraints from surface normal maps, the CDE network based on the Densenet-121 model outperforms Dense-depth [6], whose encoder was designed based on the deeper Densenet-169 model.
In this experiment, Mobilenet-V2 [47] was adopted as the backbone of depth prediction and surface normal estimation, which uses fewer training weight parameters and has lower computational complexity than the Densenet-121 model. As shown in Table 5, the network based on Mobilenet-V2 required notably less training cost and thus achieved lower latency. However, the Mobilenet-V2-based deep learning scheme also produced worse depth maps than the Densenet-121 model [5]. Therefore, Mobilenet-V2 [47] based networks can be conveniently embedded into simple mobile platforms such as a mobile phone or a lite drone. The detailed structure of the coarse depth estimation network based on the Mobilenet-V2 [47] model is shown in Appendix B.

In this experiment, the up-projection units in the decoder layers of the RGB-D surface normal (RSN) network were substituted with 2× bilinear up-sampling and the up-and-down-projection unit [48], respectively, for ablative comparison.
The up-projection unit [17] was designed for embedded platforms. As shown in Figure 10, it achieves the lowest latency in this experiment. Therefore, it is suitable as the up-sampling strategy of the RSN network for boosting the surface normal inference process. The up-and-down-projection unit applies super-resolution (SR) techniques to up-scale the input image to a higher resolution. As shown in Figure 10, the absolute relative error (AbsRel) suggests that the refined depth maps obtained by the up-and-down-projection unit slightly outperform 2× bilinear interpolation [6]. However, when using the up-and-down-projection unit [48] in the decoder layers, we found that the gains in performance did not justify the slower learning time and the extra GPU memory required. Therefore, the 2× bilinear up-sampling proposed by Alhashim et al. [6] functioned as the up-sampling method in our coarse depth estimation network. Figure 10 compares 2× bilinear interpolation [6], up-and-down projection [48], and up-projection [17].
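For reference, 2× bilinear up-sampling of the kind adopted from [6] can be written out explicitly. This is a single-channel numpy sketch with corner-aligned sampling; deep learning frameworks provide the same operation as a built-in (e.g. bilinear interpolation layers).

```python
import numpy as np

def upsample2x_bilinear(img):
    """2x bilinear up-sampling of a single-channel map: sample a grid twice
    as dense and blend the four nearest input pixels with linear weights."""
    h, w = img.shape
    ys = np.linspace(0, h - 1, 2 * h)      # target rows in source coordinates
    xs = np.linspace(0, w - 1, 2 * w)      # target cols in source coordinates
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy
```

Because it has no learnable parameters, this up-sampling path adds no training cost, which is the trade-off discussed above against the up-and-down-projection unit.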

Custom Results
To verify the deep learning network proposed herein, we took videos of different indoor scenes with the monocular camera of a smartphone. Then, we randomly captured RGB images from the videos and resized them to a resolution of 640 × 480 for depth estimation.
As shown in Figure 11, the depth maps predicted by our network exhibit distinct boundaries and robust geometric shapes.
Figure 11. Refined depth images generated from custom images (Densenet-161 model).

Conclusions
In this work, we designed a lighter-weight encoder-decoder deep learning network for depth estimation from monocular RGB images. The encoder layers were designed based on a pre-trained deep learning model originally intended for image classification. Experiments proved that an effective encoder based on transfer learning and geometric guidance outperforms previous methods [7,49] employing complex feature extraction layers. Ablation studies suggested that employing different pre-trained models enables our networks to adapt to different platforms. For example, Mobilenet-V2 [47] can be a suitable model for simple mobile platforms such as smartphones and light micro drones; the Densenet-121 model can be used for mobile platforms equipped with a powerful GPU device, while a denser encoder based on Densenet-161 can only be applied to platforms that can afford expensive training costs.
What is more, surface normal estimation was introduced to improve the quality of the depth images. Benefiting from the geometric guidance offered by surface normal maps, our network achieved obviously better depth images on the benchmark NYU-Depth-V2 [14] than state-of-the-art methods [4,7,9]. With the geometric constraints from surface normal maps, superb 3D point-cloud maps were reconstructed from the refined depth images.
Because this work greatly benefits from surface normal estimation, we believe that many other geometric features could also be used to improve monocular depth estimation. Therefore, we will further study the effects of geometric features such as 3D object boundaries, semantic labels, and image defocus.

Figure A3. Point-cloud maps reconstructed from custom images.