Multi-Scale Spatio-Temporal Feature Extraction and Depth Estimation from Sequences by Ordinal Classification.

Depth estimation is a key problem in 3D computer vision and has a wide variety of applications. In this paper, we explore whether a deep learning network can predict depth maps accurately by learning multi-scale spatio-temporal features from sequences and by recasting depth estimation from a regression task to an ordinal classification task. We design an encoder-decoder network with several multi-scale strategies to improve its performance and extract spatio-temporal features with ConvLSTM. Our experiments show that the proposed method improves error metrics by almost 10% and accuracy metrics by up to 2%. The results also indicate that extracting spatio-temporal features can dramatically improve performance on the depth estimation task. We plan to extend this work in a self-supervised manner to remove the dependence on large-scale labeled data.


Introduction
Depth estimation [1][2][3][4][5][6][7] is a longstanding and fundamental task in 3D computer vision and enables a wide variety of applications, e.g., autonomous driving [8][9][10], Augmented Reality (AR) and Virtual Reality (VR) [11,12], Simultaneous Localization And Mapping (SLAM) [13][14][15][16][17][18], 2D-3D video conversion [19] and 3D scene understanding [20,21]. Most depth estimation methods fall into three categories: Monocular Depth Estimation, Stereo Depth Estimation and Depth from Motion (Sequence). Monocular Depth Estimation [22][23][24][25][26][27][28] infers depth information from single RGB images and has been shown to be an ill-posed problem: in most cases, several possible outputs correspond to a given input image, and the problem can be seen as selecting the most plausible one among them [29]. Stereo Depth Estimation [30][31][32] needs specific devices to capture stereo images and provides many more cues for estimating depth than Monocular Depth Estimation. Depth from Motion or Sequence [33][34][35][36] predicts the depth of each pixel by taking successive frames into account, which is the most common situation in daily life and in many applications. Most recently proposed methods focus on Monocular Depth Estimation, as it is the more difficult problem academically. However, such methods ignore motion, one of the most important cues for determining depth in the human visual system, and in most applications the input comes as a sequence.
In this paper, we concentrate on depth estimation from monocular sequences captured by a single moving camera. This choice is motivated by the higher efficiency of monocular systems compared with other approaches. Another reason is that the processing of this three-dimensional spatio-temporal signal is also a key problem in Signal Processing and Machine Learning [37,38].
Before the age of Deep Learning, most methods [13,16,39] extracted and matched local features [40,41] from RGB images between neighbouring frames and obtained the camera pose and depth information following the theory of Multiple View Geometry [42]. Recently, Deep Learning [43] has gradually been applied to this field. Reference [44] regards depth estimation as a pixel-wise regression problem and designs a multi-scale coarse-to-fine deep neural network to handle it. References [25,45] design networks to estimate the pose and depth from sequences in an unsupervised manner, but they take only single RGB images as input when predicting the depth, which is used as an intermediate result in their loss functions. Reference [46] predicts depth from video and extracts spatio-temporal features with ConvLSTM.
Most existing methods fail to extract the spatio-temporal features embedded in the sequences or do not make good use of them [46]. Moreover, it is observed that the uncertainty in depth prediction increases along with the underlying ground-truth depth, which indicates that it would be better to allow a relatively larger error when predicting a larger depth value, so that large depth values do not have an over-strengthened influence on the training process [47].
In this paper, we design a deep neural network to estimate the depth from sequences and our contributions are as follows:

1. We design an encoder-decoder neural network with ConvLSTM to extract the spatio-temporal features from sequences.
2. We combine several multi-scale strategies [48,49] to achieve an accurate high-resolution estimation.
3. Experimental results show that our network improves the performance of depth prediction.
4. Experimental results show that spatio-temporal features play an important role in depth prediction.

This paper is organized as follows: In Section 2, we briefly review some related work. In Section 3, our proposed method is introduced in detail. The experimental results can be found in Section 4. Finally, Section 5 concludes this paper.

ConvLSTM
Long Short-Term Memory (LSTM) [50] is one of the best-known building blocks of Recurrent Neural Networks (RNNs). It has proven stable and powerful for modeling long-range dependencies in sequences.
LSTM, as shown in Figure 1, has a memory cell $C_t$ which essentially acts as an accumulator of the state information. The cell is accessed, written and cleared by several self-parameterized controlling gates. Every time a new input comes, its information is accumulated into the cell if the input gate $i_t$ is activated. In addition, the past cell status $C_{t-1}$ can be "forgotten" in this process if the forget gate $f_t$ is on. Whether the latest cell output $C_t$ is propagated to the final state $h_t$ is further controlled by the output gate $o_t$. In the key equations of LSTM (Equation (1)), * denotes matrix multiplication.
The major drawback of traditional LSTM in handling spatio-temporal data is its use of full connections in the input-to-state and state-to-state transitions, in which no spatial information is encoded [51]. ConvLSTM makes all the inputs $x_1, \cdots, x_t$, cell outputs $C_1, \cdots, C_t$, hidden states $h_1, \cdots, h_t$, and gates $i_t$, $f_t$, $o_t$ 3D tensors whose last two dimensions are spatial dimensions (rows and columns). ConvLSTM can therefore take RGB images or feature maps from a convolutional neural network as input. The key equations of ConvLSTM are the same as Equation (1), except that * denotes the convolution operator.
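The gate mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the network used in this paper: it assumes the common LSTM formulation without peephole connections, uses a naive "same"-padded convolution, and all names (`conv2d_same`, `convlstm_step`, the `params` layout) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, w):
    """Zero-padded 'same' 2D convolution (cross-correlation, as in most
    deep learning libraries). x: (C_in, H, W); w: (C_out, C_in, k, k), k odd."""
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    _, H, W = x.shape
    out = np.zeros((c_out, H, W))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + H, dx:dx + W]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out

def convlstm_step(x_t, h_prev, c_prev, params):
    """One ConvLSTM time step (assumed form, no peephole terms):
        i_t = sigmoid(W_xi * x_t + W_hi * h_{t-1} + b_i)
        f_t = sigmoid(W_xf * x_t + W_hf * h_{t-1} + b_f)
        o_t = sigmoid(W_xo * x_t + W_ho * h_{t-1} + b_o)
        C_t = f_t . C_{t-1} + i_t . tanh(W_xg * x_t + W_hg * h_{t-1} + b_g)
        h_t = o_t . tanh(C_t)
    where * is convolution and . is the element-wise product.
    params[g] = (W_x, W_h, b) for g in {"i", "f", "o", "g"}."""
    pre = {}
    for g in ("i", "f", "o", "g"):
        w_x, w_h, b = params[g]
        pre[g] = conv2d_same(x_t, w_x) + conv2d_same(h_prev, w_h) + b[:, None, None]
    i_t, f_t, o_t = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    c_t = f_t * c_prev + i_t * np.tanh(pre["g"])  # forget old state, add new
    h_t = o_t * np.tanh(c_t)                      # gated output
    return h_t, c_t
```

Calling `convlstm_step` once per frame, while carrying `h_t` and `c_t` forward, processes a sequence; the hidden state keeps the spatial layout of the input, which is the property that distinguishes ConvLSTM from a fully connected LSTM.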

Ordinal Classification
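When a variable is ordinal [52,53], its categories can be ranked from low to high, but the distances between adjacent categories are unknown. For example, if someone asks whether you strongly agree, agree, have no opinion, disagree or strongly disagree with an idea, your answer is ordinal. Figure 2 shows a concrete example with four categories. We denote by $x$ the features, by $\beta$ the learnable weights and by $\epsilon$ the random noise, and we define the latent variable $y^* = x^\top \beta + \epsilon$. The label $y$ of this ordinal classification problem with $m$ categories is defined as follows:

$$y = \begin{cases} 1, & y^* \le t_1 \\ 2, & t_1 < y^* \le t_2 \\ \;\vdots \\ m, & y^* > t_{m-1} \end{cases}$$

Here $t_1, \ldots, t_{m-1}$ are unknown thresholds that have to be estimated from the data (together with the coefficient vector $\beta$). In this formulation the vector $\beta$ does not contain an intercept $\beta_0$. Alternatively, we could include $\beta_0$ and fix one of the thresholds, e.g., we could fix $t_1$ to zero. Let us first derive the formula for the probability that $y = 1$. We observe $y = 1$ when $y^* \le t_1$, so

$$P(y = 1 \mid x) = P(x^\top \beta + \epsilon \le t_1) = F(t_1 - x^\top \beta).$$

For the four categories of Figure 2, with $t_0 = -\infty$ and $t_4 = \infty$, we find that

$$P(y = j \mid x) = F(t_j - x^\top \beta) - F(t_{j-1} - x^\top \beta), \quad j = 1, \ldots, 4,$$

where $F(\cdot)$ is the cumulative distribution function of $\epsilon$.
We can write the log-likelihood function over $n$ samples $(x_i, y_i)$ as

$$\log L(\beta, t) = \sum_{i=1}^{n} \sum_{j=1}^{m} I(y_i = j) \log \left[ F(t_j - x_i^\top \beta) - F(t_{j-1} - x_i^\top \beta) \right].$$

This expression can be maximized with numerical methods to estimate $\beta$ and the thresholds.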
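As an illustration of the threshold model above, the following sketch computes the category probabilities and the log-likelihood. It assumes a logistic distribution for the noise (the text leaves $F$ unspecified), and the function names are hypothetical.

```python
import numpy as np

def logistic_cdf(z):
    # Assumed noise distribution; any other CDF could be passed in instead.
    return 1.0 / (1.0 + np.exp(-z))

def category_probs(x, beta, thresholds, cdf=logistic_cdf):
    """Return [P(y=1|x), ..., P(y=m|x)] for m-1 increasing thresholds."""
    score = float(np.dot(x, beta))
    # Pad with t_0 = -inf and t_m = +inf so that every category is a difference
    # of two CDF values.
    t = np.concatenate(([-np.inf], np.asarray(thresholds, dtype=float), [np.inf]))
    cumulative = cdf(t - score)   # F(t_j - x^T beta) for j = 0..m
    return np.diff(cumulative)    # F(t_j - .) - F(t_{j-1} - .)

def log_likelihood(X, y, beta, thresholds):
    """Sum of log P(y_i | x_i); labels y_i are 1-based category indices."""
    return sum(np.log(category_probs(x, beta, thresholds)[label - 1])
               for x, label in zip(X, y))
```

In practice this log-likelihood would be handed to a numerical optimizer, with the thresholds constrained to remain increasing.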

Motivation
We describe the motivations of this paper from three aspects.

Depth from Sequences
We find that most applications [8,11,14] of depth estimation need to predict the depth of every frame in a sequence. It is also known that successive frames in a sequence are highly correlated and carry rich motion information, which is believed to be important for predicting depth [46]. It is therefore a natural choice to train and test our model on sequences.

ConvLSTM
There are three main methods to extract spatio-temporal features: CNN-RNN [54], 3D CNN [55] and ConvLSTM [51]. CNN-RNN first extracts spatial features with a CNN and then sends the results to an RNN to extract temporal features. It extracts the two kinds of features separately and works very well in some applications such as image captioning [54]. 3D CNN extracts spatio-temporal features simultaneously, but it is hard to train [55]. ConvLSTM introduces convolution into the traditional LSTM and takes images as input, so both spatial and temporal features are extracted within this unit.

Ordinal Classification
As mentioned above, we are more confident about small depth, and should allow a relatively larger error when predicting a larger depth value. Most existing depth prediction deep learning networks output inverse depth or depth in log-space to solve this problem, and [47] shows that ordinal classification is another choice and achieves better performance. So we follow the idea in [47] and recast the depth estimation as an ordinal classification task.

Overview
The architecture of our proposed network is shown in Figure 3. It is an encoder-decoder [56] architecture. The encoder consists of three ResNet [57] Bottleneck layers and ConvLSTM layers, which extract the spatio-temporal features of the input images. The decoder restores the original resolution of the images with three Convolution layers and DeConvolution [58] layers. The ordinal classification layers are attached to the DeConvolution layers to recover the depth map. The structure of our network is listed in Table 1. Conv(3 × 3 × 32) and ConvLSTM(3 × 3 × 32) mean that the kernel size is 3 × 3, the number of kernels is 32, and the stride takes its default value of 1. DeConv(3 × 3 × 128.2) means that the kernel size is 3 × 3, the number of kernels is 128, and the stride is 2.
In the training stage, our network takes sequences as input (every sequence is made up of three frames in our experiments) and the ground-truth depth maps of the corresponding frames as supervision. The encoder extracts the embedded spatio-temporal features of the input and obtains its feature map at 1/8 of the original resolution (Conv3_2). The decoder then recovers the depth map from feature maps at different scales and compares the estimated depth map to the ground truth to obtain the error for back-propagation.
In the test stage, we just remove the Ordinal classification layers at ord1 and ord2 to estimate the depth map.
Details of our network are described in the following sections.

Multi-Scale Strategies
Pooling [43] is an important part of deep convolutional neural networks. It reduces the complexity of the network and makes the model invariant to some transformations. It is widely used in classification [57,59], because classification needs to infer global information from the input, whereas depth estimation recovers local information from the input. Repeated spatial pooling layers (usually with a stride of 2) quickly reduce the spatial resolution of the feature maps [47], which is considered harmful to performance on this task.
However, it is difficult to totally remove the pooling layers from the network. Hence, we adopt a multi-scale estimation to handle this problem.

Location of Pooling Layer
In practice, it is difficult to remove all pooling layers from the network; otherwise, the number of parameters of the network would explode. We therefore place them just after the ConvLSTM layers, as shown in Table 1. First, the ResNet Bottleneck extracts spatial features at the same resolution as its input and sends them to the following ConvLSTM layer. Then, we perform max pooling on the feature map from the ConvLSTM to decrease its spatial resolution. Although the spatial resolution is decreased, the important local information has already been encoded, extracted and stored in the preceding ConvLSTM.

Skip-Connection
In our network, there are three skip-connections [19] across the encoder and decoder that directly fuse high-resolution features from the encoder into the decoder. This is a common strategy for improving performance. Note that the width and height of the input should be divisible by eight; otherwise, an error will occur in the decoder when concatenating feature maps. For instance, if the size of the input is 254 × 254 × 3, then the output sizes of Conv1_2, Conv2_2 and Conv3_2 are 127 × 127 × 64, 63 × 63 × 128 and 31 × 31 × 256, respectively. The output size of UpConv4_2 and UpConv5_1 is 62 × 62 × 128. Conv5_2 concatenates the outputs of Conv2_2 and UpConv5_1 and takes them as input, so an error occurs because their spatial sizes (63 × 63 versus 62 × 62) differ.
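The divisibility requirement can be checked with simple size arithmetic. The sketch below is a simplified model rather than the actual layer graph of Table 1: it assumes that each stride-2 pooling floors odd sizes, that each stride-2 DeConvolution exactly doubles the size, and that a decoder feature map is fused with the encoder feature map of the same nominal scale. The function name is hypothetical.

```python
def skip_connection_sizes(size, levels=3):
    """Return (decoder_size, encoder_size) pairs, coarsest scale first.

    size   : input width or height in pixels
    levels : number of stride-2 poolings in the encoder (3 gives 1/8 resolution)
    """
    encoder = [size]
    for _ in range(levels):
        encoder.append(encoder[-1] // 2)   # pooling floors odd sizes
    pairs = []
    current = encoder[-1]                  # bottleneck resolution
    for level in range(levels - 1, -1, -1):
        current *= 2                       # stride-2 upsampling doubles the size
        pairs.append((current, encoder[level]))
    return pairs

print(skip_connection_sizes(254))  # [(62, 63), (124, 127), (248, 254)]
print(skip_connection_sizes(256))  # [(64, 64), (128, 128), (256, 256)]
```

With an input of 254 pixels every pair disagrees, starting with the 62 versus 63 mismatch from the example above, whereas any size divisible by eight yields matching pairs at all three scales.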

Multi-Scale Estimation
In the training stage, our model outputs three depth maps for each input frame, at 1/4 (ord1), 1/2 (ord2) and the original (ord3) resolution. They are compared to the correspondingly resized ground-truth depth maps to calculate the error. This multi-scale estimation [44] forces the decoder of our model to recover the depth map progressively, refining the estimation from low resolution to high resolution.

Spacing-Increasing Discretization
Most existing methods regard the depth estimation as a pixel-wise regression problem. However, few methods output the depth directly, because it is well-known that the uncertainty in depth prediction increases along with the underlying ground-truth depth, which indicates that it would be better to allow a relatively larger error when predicting a larger depth value to avoid over-strengthened influence of large depth values on the training process [47]. A common solution is performing the regression in log space, but the results are still unsatisfactory [47].
Another idea is to recast depth estimation as an ordinal classification task and to quantize the continuous depth values into several discrete labels. However, as the depth value grows, the confidence of its estimate drops dramatically, which means that the estimation error of larger depth values is generally larger. Hence, a uniform discretization strategy would induce an over-strengthened loss for large depth values. We adopt the Spacing-Increasing Discretization proposed in [47], which uniformly discretizes a given depth interval in log space to down-weight the training losses in regions with large depth values, so that our depth estimation network can predict relatively small and medium depths more accurately and still estimate large depth values reasonably.
Assume that a depth interval $[a, b]$ needs to be discretized into $K$ sub-intervals.
The array $\{s_i\}_{i=0}^{K}$ of thresholds in log space is an arithmetic sequence, and Figure 4 shows an example of Spacing-Increasing Discretization.
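A minimal sketch of this discretization follows, under the assumption that $s_i = \log a + i(\log b - \log a)/K$ and that the depth thresholds are $t_i = e^{s_i}$, which is our reading of the strategy in [47]. Because the logarithm requires $a > 0$, the example shifts the interval $[0, 80]$ to $[1, 81]$; this shift is an assumption of the sketch. The function names are hypothetical.

```python
import numpy as np

def sid_thresholds(a, b, K):
    """K + 1 thresholds t_0 < ... < t_K that are uniformly spaced in log space."""
    if a <= 0 or b <= a:
        raise ValueError("need 0 < a < b; shift the interval if a == 0")
    s = np.linspace(np.log(a), np.log(b), K + 1)  # arithmetic sequence s_i
    return np.exp(s)                              # depth thresholds t_i

def depth_to_label(depth, t):
    """Index k of the sub-interval [t_k, t_{k+1}) containing each depth value."""
    k = np.searchsorted(t, depth, side="right") - 1
    return np.clip(k, 0, len(t) - 2)              # keep labels in 0..K-1

# Example: depths in [0, 80] m shifted by 1 m (an assumption), K = 70 bins.
t = sid_thresholds(1.0, 81.0, 70)
labels = depth_to_label(np.array([2.0, 10.0, 60.0]) + 1.0, t)
```

The bin widths `np.diff(t)` grow with depth, so a fixed error in metres crosses fewer bin boundaries far from the camera than close to it.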

Ordinal Classification
After obtaining the discrete depth values, it is straightforward to turn the standard regression problem into an ordinal classification [47] problem. Let $X_i = \varphi(I)$ denote the output of the DeConvolution layer at scale $i$ ($i = 1$ for UpConv4_2, $i = 2$ for UpConv5_2, $i = 3$ for UpConv6_2). $Y_i = \Psi(X_i, \Theta_i)$ denotes the output of the ordinal classification layer given $X_i$ with parameters $\Theta_i$ at scale $i$; it has size $W_i \times H_i \times 2K$, where $W_i$ and $H_i$ are the width and height at scale $i$ and the depth values are discretized into $K$ sub-intervals.
The loss function at the $i$-th scale is formulated in terms of the following quantities: $l_{(w,h)}$ is the ground-truth depth label of the pixel located at $(w, h)$, and $\hat{l}_{(w,h)}$ is the estimated depth label of the pixel located at $(w, h)$. Equation (8) is the log-likelihood function at $(w, h)$. $P^{i,k}_{(w,h)}$ is calculated from $x^i_{(w,h)} \in X_i$. In the inference phase, after obtaining the ordinal labels for each pixel of the image, the predicted depth value is calculated from the discretized depth values $t_i$ shown in Figure 4 and the indicator function $I(\cdot)$, for which $I(\text{true}) = 1$ and $I(\text{false}) = 0$ (we omit the superscript $i$ of $P^{i,k}_{(w,h)}$ for simplicity).
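For concreteness, the sketch below follows the ordinal loss and decoding rule of [47] as we understand them; it is an assumption that the formulation at each scale of our network takes exactly this form. Channels $2k$ and $2k+1$ of the $2K$-channel output are treated as a two-way softmax giving $P^k = P(\hat{l} > k)$, the label is the number of $k$ with $P^k \ge 0.5$, and the depth is the midpoint of the selected sub-interval. All function names are hypothetical.

```python
import numpy as np

def pairwise_probs(logits):
    """logits: (H, W, 2K) -> P_k = P(label > k) of shape (H, W, K).

    P_k = exp(y_{2k+1}) / (exp(y_{2k}) + exp(y_{2k+1})), i.e. a 2-way softmax.
    """
    H, W, C = logits.shape
    pairs = logits.reshape(H, W, C // 2, 2)
    return 1.0 / (1.0 + np.exp(pairs[..., 0] - pairs[..., 1]))

def ordinal_loss(logits, labels, eps=1e-12):
    """Mean negative log-likelihood over pixels.

    Per pixel: sum_{k < l} log P_k + sum_{k >= l} log(1 - P_k),
    where l is the ground-truth label in 0..K-1 (labels has shape (H, W)).
    """
    P = pairwise_probs(logits)
    k = np.arange(P.shape[-1])
    below = k[None, None, :] < labels[..., None]   # True where k < l(w, h)
    ll = np.where(below, np.log(P + eps), np.log(1.0 - P + eps)).sum(axis=-1)
    return -ll.mean()

def decode_depth(logits, t):
    """Predicted depth from thresholds t_0..t_K (array of length K + 1)."""
    P = pairwise_probs(logits)
    label = (P >= 0.5).sum(axis=-1)                # count of I(P_k >= 0.5)
    label = np.minimum(label, P.shape[-1] - 1)     # clip to the last bin (assumption)
    return 0.5 * (t[label] + t[label + 1])         # midpoint of the chosen bin
```

Each of the $K$ binary decisions contributes to the loss, so a prediction that lands several bins away from the ground truth is penalized more than one that is off by a single bin.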

Loss Function
The final loss function of our model is the weighted sum of the loss functions at the different scales.
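Written out, the weighted sum over the three scales (ord1, ord2 and ord3) takes the following form. The symbol $\lambda_i$ for the weight of scale $i$ is introduced here only for illustration, and no particular weight values are implied.

```latex
\mathcal{L} = \sum_{i=1}^{3} \lambda_i \, \mathcal{L}_i
```

Here $\mathcal{L}_i$ denotes the ordinal classification loss computed at scale $i$ against the correspondingly resized ground-truth depth map.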

Experimental Results
Our model was trained and evaluated on the KITTI dataset [60]. The KITTI dataset consisted of video sequences of outdoor scenes along with their corresponding depth maps, procured using car-mounted cameras and Velodyne LiDAR sensors. We split the KITTI dataset into train and test following the description in [44], and we trained on 28 sequences and tested on the 697 images provided in [44]. The range of depth was set from 0 to 80 m, which meant a = 0, b = 80, and the depth was divided into K = 70 sub-intervals. Throughout our experiments, the time-step of the ConvLSTM was set to 3. We evaluated our approach by using the standard metrics proposed by [44].
We compared the methods using two commonly used kinds of metrics, error metrics and accuracy metrics, both following [44]. We denote $y$ as the predicted depth and $y^*$ as the ground-truth depth. Accuracy metric: the percentage of $y$ such that $\max(y / y^*, y^* / y) = \delta$ is below a given threshold.
We conducted three experiments on our proposed method. Our_vo means that the model took sequences as input and obtained its results from the ordinal classification layer, as shown in Figure 3. Our_vc replaced the ordinal classification layers with convolution layers, namely Conv(3 × 3 × 1), which were fine-tuned with the training data. We trained this model to investigate the importance of the ordinal classification layer. Our_so means that the model took single RGB images as input and obtained its results from the ordinal classification layer. Although the network was trained using sequences, the decoder was designed to recover each state of the encoder individually. This allowed us to use a single image as input, estimate its depth map, and compare the results to those of Our_vo to find out how much the performance improved by taking sequences as input. Table 2 shows the experimental results. We compared the proposed method with existing supervised methods: DORN [47], DepthNet [46], Kuznietsov [22], and Eigen [44]. These four methods were trained with depth supervision, and Kuznietsov [22] had extra pose supervision.
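For reference, the evaluation metrics mentioned above can be computed as in the following sketch. It assumes the definitions commonly attributed to [44] (absolute relative error, squared relative error, RMSE, RMSE in log space, and accuracy under the thresholds 1.25, 1.25² and 1.25³); which of these are reported in Table 2 is not restated here, and the function name is hypothetical.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Error and accuracy metrics over pixels with valid (positive) ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0                       # LiDAR ground truth is sparse
    y, y_star = pred[valid], gt[valid]
    ratio = np.maximum(y / y_star, y_star / y)   # delta per pixel
    return {
        # Error metrics: lower is better.
        "abs_rel": np.mean(np.abs(y - y_star) / y_star),
        "sq_rel": np.mean((y - y_star) ** 2 / y_star),
        "rmse": np.sqrt(np.mean((y - y_star) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(y) - np.log(y_star)) ** 2)),
        # Accuracy metrics: fraction of pixels with delta below the threshold.
        "delta_1": np.mean(ratio < 1.25),
        "delta_2": np.mean(ratio < 1.25 ** 2),
        "delta_3": np.mean(ratio < 1.25 ** 3),
    }
```

Pixels without ground truth are masked out before averaging, so the metrics are only as dense as the LiDAR measurements.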
In our experiments, we trained five different models each for Our_vo and Our_vc with the same training data and random initialization; the results of Our_vo, Our_vc and Our_so are the averages of five runs for each model. Firstly, Our_vc had better results than DepthNet. Both took sequences as input, used ConvLSTM to extract spatio-temporal features, and produced their output from a convolution layer; they also had similar network architectures. Nevertheless, Our_vc used ResNet Bottleneck to extract features from images and adopted a multi-scale strategy, which we believe brought the improvement.
Secondly, the performance of Our_vo was better than that of DORN, while the performance of DORN was better than that of Our_so. The main difference between DORN and Our_so was the architecture of the network. DORN designed a complex network with a dense feature extractor, a multi-scale feature learner, a cross-channel information learner and a full-image encoder, while Our_so lost its ability to infer across time because it took single RGB images as input. However, Our_vo beat DORN with a much simpler network, which indicates the importance of temporal features in depth estimation.
Finally, Our_vc achieved better results than Our_so, which shows that temporal features may play a much more important role than the ordinal classification layer.
Some visual results can be found in Figure 5. The results of DORN and Our_so were sharper and more jittery than those of Our_vc and Our_vo, because Our_vc and Our_vo took sequences as input and were able to smooth the output in the time domain.

Conclusions
In this paper, we design a deep neural network to extract spatio-temporal features from sequences, motivated by [46], and to predict the depth map from them by ordinal classification, inspired by [47]. The network encodes the input with ResNet Bottleneck and ConvLSTM layers, then decodes and recovers the resolution of the input images with Convolution and DeConvolution layers and skip-connections from the encoder. We train the network with a multi-scale loss function to improve the performance. Our experiments show that the proposed method improves error metrics by almost 10% and accuracy metrics by up to 2% compared with recently proposed supervised depth estimation methods. These results also show that extracting temporal features can significantly improve performance on the depth estimation task. In the future, we will build on this work and design a self-supervised network to remove the dependence on large-scale labeled data.