Visual Object Tracking Based on Cross-Modality Gaussian-Bernoulli Deep Boltzmann Machines with RGB-D Sensors

Visual object tracking is one of the key issues in computer vision. In this paper, we propose a visual object tracking algorithm based on cross-modality deep feature learning using Gaussian-Bernoulli deep Boltzmann machines (DBM) with RGB-D sensors. First, a cross-modality feature learning network based on a Gaussian-Bernoulli DBM is constructed, which can extract cross-modality features of the samples in RGB-D video data. Second, the cross-modality features of the samples are input into a logistic regression classifier, and the observation likelihood model is established according to the confidence score of the classifier. Finally, the object tracking results over RGB-D data are obtained using a Bayesian maximum a posteriori (MAP) probability estimation algorithm. The experimental results show that the proposed method is strongly robust to abnormal changes (e.g., occlusion, rotation, and illumination change). The algorithm can steadily track multiple targets and achieves higher accuracy.


Introduction
Visual object tracking is one of the key research topics in the field of computer vision. In recent years, it has found a wide range of applications, such as robot navigation, intelligent video surveillance, and video measurement [1][2][3][4]. Despite many research efforts, visual object tracking is still regarded as a challenging problem due to changes in object appearance, occlusions, complex motion, illumination variation and background clutter [5].
A typical visual object tracking algorithm often includes three major components: a state transition model, an observation likelihood model and a search strategy. A state transition model is used to model the temporal consistency of the states of a moving object, whereas an observation likelihood model describes the object and observations based on visual representations. Undoubtedly, feature representation is the most important factor in visual object tracking. Most existing RGB-D trackers [6][7][8] tend to use hand-crafted features to represent target objects, such as Haar-like features [9], histograms of oriented gradients (HOG) [10], and local binary patterns (LBP) [11]. Hand-crafted features aim to describe some pre-defined image patterns, but they cannot capture the complex and specific characteristics of target objects. Hand-crafted features may therefore lead to the unrecoverable loss of information that is useful for tracking in different scenarios. With the rapid development of computational power and the emergence of large-scale visual data, deep learning has received much attention and shown promising performance in computer vision tasks, e.g., object tracking [12], object detection [13], and image classification [14]. Wang et al. proposed the so-called deep learning tracker (DLT) for robust visual tracking [15]. The DLT tracker learns generic features from auxiliary natural images offline. However, the DLT tracker cannot obtain deep features with temporal invariance, which is important for visual object tracking. In [16], the authors proposed a video tracking algorithm using learned hierarchical features, in which the hierarchical features are learned via a two-layer convolutional neural network. Ding et al. [17] proposed a new tracking-learning-data architecture to transfer a generic object tracker to a blur-invariant object tracker without deblurring image sequences.
One of the research focuses of this paper is how to use deep learning effectively to extract the features of the target objects in RGB-D data.
To the best of our knowledge, the existing visual tracking methods using deep learning follow a similar procedure, which tracks objects in 2D sequences. Object tracking is performed over 2D video sequences in most early research works, such as the TLD tracker [18], MIL tracker [19] and VTD tracker [20]. With the great popularity of affordable depth sensors, such as Kinect, Asus Xtion, and PrimeSense, the amount of RGB-D data that can be used nowadays has grown explosively. Reliable depth images can provide valuable information to improve tracking performance. In [21], the authors establish a unified benchmark dataset of 100 RGB-D videos, which provides a foundation for further research in both RGB and RGB-D tracking. One of the research focuses of this paper is how to fuse RGB information and depth information effectively to improve the performance of visual object tracking in RGB-D data.
To overcome the problems in the existing methods, we propose a visual object tracking algorithm based on cross-modality feature learning using Gaussian-Bernoulli deep Boltzmann machines (DBM) over RGB-D data. A cross-modality deep learning framework is used to learn a robust tracker for RGB-D data. The cross-modality features of the samples are input into a logistic regression classifier, and the observation likelihood model is established according to the confidence score of the classifier. We obtain the object tracking results over RGB-D data using a Bayesian maximum a posteriori probability estimation algorithm. Experimental results show that such cross-modality learning can improve the tracking performance.
The main contributions of this paper can be summarized as follows:
• We present a cross-modality Gaussian-Bernoulli deep Boltzmann machine (DBM) to learn the cross-modality features of target objects in RGB-D data. The proposed cross-modality Gaussian-Bernoulli DBM is constructed from two single-modality Gaussian-Bernoulli DBMs by adding an additional layer of binary hidden units on top of them, which fuses RGB information and depth information effectively.
• A unified RGB-D tracking framework based on Bayesian MAP is proposed, in which a robust appearance description is obtained through cross-modality deep feature learning, and temporal continuity is fully considered in the state transition model.

• Extensive experiments are conducted to compare our tracker with several state-of-the-art methods on the recent benchmark dataset [21]. From the experimental results, we can see that the proposed tracker performs favorably against the compared state-of-the-art trackers.
The remainder of the paper is organized as follows. First, feature learning over RGB-D data with cross-modality deep Boltzmann machines is described in the next section. Then we introduce our tracking framework in Section 3. The implementation of our proposed method is presented in Section 4. Experimental results and analysis are demonstrated in Section 5, and finally we draw conclusions in Section 6.

Boltzmann Machine
The Boltzmann machine (BM) was proposed by Hinton and Sejnowski [22]. A Boltzmann machine is a feedback neural network consisting of fully connected coupled random neurons. The connections between neurons are symmetric, and there is no self-feedback. The outputs of neurons have only two states (active and inactive), which are expressed by 0 and 1, respectively. A set of visible units v ∈ {0, 1}^D and a set of hidden units h ∈ {0, 1}^F are included in a BM (as shown in Figure 1). The visible units and hidden units are composed of the visible nodes and hidden nodes, and D and F represent the number of visible nodes and hidden nodes, respectively.

We formulate the energy function over the state {v, h} as:

E(v, h; Ψ) = −v⊤Wh − (1/2)v⊤Lv − (1/2)h⊤Rh − B⊤v − A⊤h    (1)

where Ψ = {W, L, R, B, A} are the model parameters: W, L, R represent the symmetric interaction terms of visible nodes to hidden nodes, visible nodes to visible nodes, and hidden nodes to hidden nodes. The diagonal elements of L and R are set to 0. B and A are the threshold values of the visible layer and the hidden layer.

The model defines a probability distribution over a visible vector v as:

P(v; Ψ) = P*(v; Ψ)/Z(Ψ) = (1/Z(Ψ)) Σ_h exp(−E(v, h; Ψ))

where Z(Ψ) = Σ_v Σ_h exp(−E(v, h; Ψ)) is called the partition function, and P* is an unnormalized probability.

The following formulations give the conditional distributions over hidden and visible units:

p(h_j = 1 | v, h_−j) = σ(Σ_i W_ij v_i + Σ_{m≠j} R_jm h_m + A_j)
p(v_i = 1 | h, v_−i) = σ(Σ_j W_ij h_j + Σ_{k≠i} L_ik v_k + B_i)

where σ(x) = 1/(1 + exp(−x)) is the logistic function.
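As an illustration, the BM energy function above can be sketched in a few lines of NumPy. This is not code from the paper: the dimensions and parameter values are arbitrary placeholders chosen only to show the shape of the computation.

```python
import numpy as np

# Sketch of the BM energy
#   E(v, h) = -v'Wh - 0.5 v'Lv - 0.5 h'Rh - B'v - A'h
# for binary states v in {0,1}^D and h in {0,1}^F.
def bm_energy(v, h, W, L, R, B, A):
    return float(-v @ W @ h - 0.5 * v @ L @ v - 0.5 * h @ R @ h
                 - B @ v - A @ h)

rng = np.random.default_rng(0)
D, F = 4, 3
W = rng.normal(0, 0.1, (D, F))
# L and R are symmetric with zero diagonals, as in the model definition
L = rng.normal(0, 0.1, (D, D)); L = 0.5 * (L + L.T); np.fill_diagonal(L, 0)
R = rng.normal(0, 0.1, (F, F)); R = 0.5 * (R + R.T); np.fill_diagonal(R, 0)
B = rng.normal(0, 0.1, D)
A = rng.normal(0, 0.1, F)
v = rng.integers(0, 2, D).astype(float)
h = rng.integers(0, 2, F).astype(float)
energy = bm_energy(v, h, W, L, R, B, A)
```

Note that the all-zero configuration has energy exactly 0, since every term in the energy vanishes.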


Restricted Boltzmann Machine
Setting both L = 0 and R = 0 in Equation (1), we will recover the model of a restricted Boltzmann machine (RBM), as shown in Figure 2. A restricted Boltzmann machine(RBM) is a generative stochastic artificial neural networkthat can learn a probability distribution over its set of inputs. It is an undirected graphical model with each visible unit only connected to each hidden unit. The energy function over the visible and hidden units. where where the normalizing factor ( ) Z  denotes the partition function.


where the normalizing factor Z(Ψ) denotes the partition function.

Gaussian-Bernoulli Restricted Boltzmann Machines
When inputs are real-valued images, we formulate the energy function of the Gaussian-Bernoulli RBM over the state {v, h} as follows [23]:

E(v, h; Ψ) = Σ_i (v_i − b_i)²/(2σ_i²) − Σ_{i,j} (v_i/σ_i) W_ij h_j − Σ_j a_j h_j

where Ψ = {a, b, W, σ} are the model parameters, b_i and a_j are biases corresponding to visible and hidden variables, respectively, W_ij is the matrix of weights connecting visible and hidden nodes, and σ_i is the standard deviation associated with a Gaussian visible variable v_i.
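The Gaussian-Bernoulli energy can likewise be sketched directly from its definition; again, the sizes and parameter values below are illustrative only.

```python
import numpy as np

# Sketch of the Gaussian-Bernoulli RBM energy
#   E(v, h) = sum_i (v_i - b_i)^2 / (2 sigma_i^2)
#             - sum_{i,j} (v_i / sigma_i) W_ij h_j - sum_j a_j h_j
# for real-valued v and binary h.
def gb_rbm_energy(v, h, W, b, a, sigma):
    quadratic = np.sum((v - b) ** 2 / (2.0 * sigma ** 2))
    interaction = (v / sigma) @ W @ h
    return float(quadratic - interaction - a @ h)

rng = np.random.default_rng(2)
D, F = 5, 3
W = rng.normal(0, 0.1, (D, F))
b = rng.normal(0, 1.0, D)      # visible biases
a = np.zeros(F)                # hidden biases
sigma = np.ones(D)             # per-visible-unit standard deviations
v = rng.normal(0, 1.0, D)      # real-valued input (e.g., pixel intensities)
h = rng.integers(0, 2, F).astype(float)
e = gb_rbm_energy(v, h, W, b, a, sigma)
```

A quick sanity check: with v = b and all hidden units off, every term vanishes and the energy is exactly 0.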

Gaussian-Bernoulli Deep Boltzmann Machine
A deep Boltzmann machine (DBM) [24] contains a set of visible units v ∈ {0, 1}^D and a sequence of layers of hidden units h^(1) ∈ {0, 1}^{F_1}, h^(2) ∈ {0, 1}^{F_2}, . . . Connections only exist between hidden units in adjacent layers. We illustrate a two-layer Gaussian-Bernoulli deep Boltzmann machine, obtained by learning a stack of modified Gaussian-Bernoulli RBMs (see Figure 3).

Sensors 2017, 17, 121

The energy function of the joint configuration {v, h^(1), h^(2)} is formulated as:

E(v, h^(1), h^(2); Ψ) = Σ_i (v_i − b_i)²/(2σ_i²) − Σ_{i,j} (v_i/σ_i) W_ij^(1) h_j^(1) − Σ_{j,l} h_j^(1) W_jl^(2) h_l^(2)

where Ψ = {W^(1), W^(2)} are the model parameters, and h = {h^(1), h^(2)} denote the set of hidden units. The probability distribution over a visible vector v can be modelled as:

P(v; Ψ) = (1/Z(Ψ)) Σ_{h^(1), h^(2)} exp(−E(v, h^(1), h^(2); Ψ))


Feature Learning UsingCross-Modality Deep Boltzmann Machines over RGB-D Data
A Boltzmann machine (BM) is an effective tool for representing a probability distribution over its inputs. Deep Boltzmann machines (DBMs) have been successfully used in many application domains, e.g., topic modelling, classification, dimensionality reduction, and feature learning. Depending on the task, DBMs can be trained in either an unsupervised or a supervised way. In this paper, we propose cross-modality DBMs for feature learning in visual tracking over RGB-D data. In this section, we first review BMs, RBMs and Gaussian-Bernoulli restricted Boltzmann machines, and then describe in detail how to establish the cross-modality DBMs.
Multimodal deep learning was proposed for video and audio [25,26]. In RGB-D data, we can also learn deep features over multiple modalities (the RGB modality and the depth modality). The proposed cross-modality DBM is constructed from two single-modality Gaussian-Bernoulli DBMs by adding an additional layer of binary hidden units on top of them (see Figure 4). First, we model an RGB-specific Gaussian-Bernoulli DBM with two hidden layers, as in Figure 4a, where h^RGB(1) and h^RGB(2) are the two layers of hidden units in the RGB-specific DBM. Then, the energy function of the Gaussian-Bernoulli DBM over {v^RGB, h^RGB} is defined as:

E(v^RGB, h^RGB; Ψ^RGB) = Σ_i (v_i^RGB − b_i^RGB)²/(2(σ_i^RGB)²) − Σ_{i,j} (v_i^RGB/σ_i^RGB) W_ij^RGB(1) h_j^RGB(1) − Σ_{j,l} h_j^RGB(1) W_jl^RGB(2) h_l^RGB(2)

where σ_i^RGB is the deviation of the corresponding Gaussian model, and Ψ^RGB is the parameter vector of the RGB-specific Gaussian-Bernoulli DBM. Therefore, the distribution of the energy-based probabilistic model is defined through this energy function as:

P(v^RGB; Ψ^RGB) = (1/Z(Ψ^RGB)) Σ_{h^RGB} exp(−E(v^RGB, h^RGB; Ψ^RGB))

where Z(Ψ^RGB) is the partition function.
A depth-specific Gaussian-Bernoulli DBM with hidden layers h^Depth(1) and h^Depth(2) is defined analogously. Therefore, the joint probability distribution over the cross-modal input {v^RGB, v^Depth} can be written as:

P(v^RGB, v^Depth; Ψ^cross-modality) = (1/Z(Ψ^cross-modality)) Σ_h exp(−E(v^RGB, v^Depth, h; Ψ^cross-modality))

where Ψ^cross-modality is the parameter vector of the cross-modality Gaussian-Bernoulli DBM. The task of learning the cross-modality Gaussian-Bernoulli DBM is maximum likelihood learning for Equation (6) with respect to the model parameters.
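The fusion idea can be sketched as a deterministic bottom-up pass: each modality-specific DBM maps its input to a top-layer activation, and a shared binary layer h^(3) on top of both produces the cross-modality feature. All weights and layer sizes below are hypothetical placeholders, not the trained model, and the mean-field-style pass is a simplification of the full DBM inference.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sketch: deterministic bottom-up pass through a two-layer DBM.
def dbm_up(v, W1, W2):
    h1 = sigmoid(v @ W1)      # first hidden layer
    h2 = sigmoid(h1 @ W2)     # second (top) hidden layer
    return h2

# Shared binary layer fusing the top layers of the RGB and depth DBMs.
def cross_modality_feature(v_rgb, v_depth, params):
    h2_rgb = dbm_up(v_rgb, params["W1_rgb"], params["W2_rgb"])
    h2_depth = dbm_up(v_depth, params["W1_depth"], params["W2_depth"])
    return sigmoid(h2_rgb @ params["W3_rgb"] + h2_depth @ params["W3_depth"])

rng = np.random.default_rng(3)
sizes = dict(v=64, h1=32, h2=16, h3=8)   # placeholder layer sizes
params = {
    "W1_rgb":   rng.normal(0, 0.1, (sizes["v"],  sizes["h1"])),
    "W2_rgb":   rng.normal(0, 0.1, (sizes["h1"], sizes["h2"])),
    "W1_depth": rng.normal(0, 0.1, (sizes["v"],  sizes["h1"])),
    "W2_depth": rng.normal(0, 0.1, (sizes["h1"], sizes["h2"])),
    "W3_rgb":   rng.normal(0, 0.1, (sizes["h2"], sizes["h3"])),
    "W3_depth": rng.normal(0, 0.1, (sizes["h2"], sizes["h3"])),
}
h3 = cross_modality_feature(rng.normal(size=64), rng.normal(size=64), params)
```

The vector h3 plays the role of the cross-modality feature that is later fed to the classifier.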

Bayesian Framework
In this paper, object tracking is formulated as a Bayesian maximum a posteriori (MAP) estimation problem over a hidden state variable in a hidden Markov model. Given a set of observed variables Z_t = {Z_1, Z_2, . . . , Z_t}, we can estimate the hidden state variable X_t among N candidates {X_t^1, X_t^2, . . . , X_t^N} using Bayesian MAP theory [27].
According to Bayesian theory, the posterior probability distribution can be modelled as the following derivation:

p(X_t | Z_t) ∝ p(Z_t | X_t) ∫ p(X_t | X_{t−1}) p(X_{t−1} | Z_{t−1}) dX_{t−1}

where p(Z_t | X_t) stands for the observation likelihood model and p(X_t | X_{t−1}) is called the state transition model between two consecutive frames. We can obtain the optimal state X̂_t among all the candidates through maximum posterior probability estimation:

X̂_t = arg max_{X_t^i} p(X_t^i | Z_t), i = 1, . . . , N
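In a particle-style implementation, the MAP step reduces to scoring each candidate state with the observation likelihood and keeping the best one. A minimal sketch (the toy states and likelihood values are invented for illustration):

```python
import numpy as np

# Sketch of MAP estimation over N candidate states: each candidate X_t^i
# is scored by the observation likelihood p(Z_t | X_t^i); the highest-scoring
# candidate is taken as the tracking result for frame t.
def map_estimate(candidates, likelihoods):
    return candidates[int(np.argmax(likelihoods))]

candidates = np.array([[10.0, 5.0], [12.0, 6.0], [11.0, 5.5]])  # toy 2-D states
likelihoods = np.array([0.2, 0.7, 0.1])
best = map_estimate(candidates, likelihoods)   # -> [12.0, 6.0]
```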

State Transition Model
The state variable is defined as X_t = {x_t, y_t, θ_t, s_t, α_t, φ_t}, which includes the six parameters of the motion affine transformation, where x_t and y_t denote the x-direction and y-direction translation of the object in frame t, respectively, θ_t represents the rotation angle, s_t stands for the scale change, α_t denotes the aspect ratio, and φ_t represents the skew direction at time t.
We assume that the candidate states are generated according to a Gaussian distribution:

p(X_t | X_{t−1}) = N(X_t; X_{t−1}, Σ)

where Σ is a diagonal covariance matrix whose diagonal elements are the variances σ_x², σ_y², σ_θ², σ_s², σ_α², σ_φ² of the corresponding affine parameters.
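Sampling candidates from this transition model is a one-liner per frame. In the sketch below, the σ values are illustrative placeholders, not the values used in the paper:

```python
import numpy as np

# Sketch: draw N candidate affine states around the previous state from a
# diagonal Gaussian (one sigma per affine parameter).
def sample_candidates(prev_state, sigmas, n, rng):
    return prev_state + rng.normal(0.0, sigmas, size=(n, prev_state.size))

rng = np.random.default_rng(4)
prev = np.array([120.0, 80.0, 0.0, 1.0, 1.0, 0.0])   # x, y, theta, s, alpha, phi
sigmas = np.array([4.0, 4.0, 0.01, 0.01, 0.005, 0.001])  # illustrative values
candidates = sample_candidates(prev, sigmas, 300, rng)
```

Each row of `candidates` is one hypothesized state X_t^i to be scored by the observation model.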

Observation Likelihood Model
In this paper, the observation model that we use is discriminative. A binary linear classifier is adopted to classify tracking observations into object class and background class during tracking. Observations are represented using features learned from the DBM introduced previously. We can obtain a training dataset with approximate labels after extracting features of positive and negative samples. Deep representations are likely to be linearly separable, and linear classifiers are less prone to overfitting. We adopt the logistic regression classifier owing to its capability of providing predictions in probability estimation.
Let h_i^3 ∈ R^{r×1} denote the deep feature for the i-th training sample, and y_i ∈ {−1, +1} represent the label for the i-th training sample. Z^+ = [h_1^{3+}, h_2^{3+}, . . . , h_{D+}^{3+}] ∈ R^{r×D+} stands for the positive training set with labels Y^+ = [y_1^+, y_2^+, . . . , y_{D+}^+], and Z^− ∈ R^{r×D−} represents the negative training set with labels Y^−. We train the logistic regression classifier by optimizing:

min_w (1/2)‖w‖² + C^+ Σ_{i: y_i = +1} log(1 + exp(−w⊤h_i^3)) + C^− Σ_{i: y_i = −1} log(1 + exp(w⊤h_i^3))

where C^+ ∈ R is the parameter weighting the logistic cost of the positive class and C^− ∈ R is the parameter weighting the logistic cost of the negative class. The weight regularization on w is added to the cost function in Equation (19) to reduce overfitting. In the prediction stage, the confidence score of the trained logistic regression classifier can be computed as:

p(y = +1 | h^3) = 1/(1 + exp(−w⊤h^3))
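A minimal NumPy sketch of this class-weighted, L2-regularized logistic objective, trained by plain gradient descent on synthetic features (the data, learning rate, and iteration count are assumptions for illustration, not the paper's setup):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sketch of the class-weighted, L2-regularized logistic cost and gradient:
#   J(w) = 0.5 ||w||^2 + sum_i c_i * log(1 + exp(-y_i w' h_i))
# with c_i = C+ for positive samples and C- for negative samples.
def cost_and_grad(w, H, y, c_pos, c_neg):
    c = np.where(y > 0, c_pos, c_neg)
    m = -y * (H @ w)                              # per-sample margin term
    cost = 0.5 * w @ w + np.sum(c * np.logaddexp(0.0, m))
    grad = w + H.T @ (c * (-y) * sigmoid(m))
    return cost, grad

# Tiny synthetic training run: 20 positive and 20 negative feature vectors.
rng = np.random.default_rng(5)
H = np.vstack([rng.normal(1.0, 1.0, (20, 4)), rng.normal(-1.0, 1.0, (20, 4))])
y = np.array([1.0] * 20 + [-1.0] * 20)
w = np.zeros(4)
for _ in range(200):
    cost, grad = cost_and_grad(w, H, y, c_pos=1.0, c_neg=1.0)
    w -= 0.01 * grad

# Confidence score used by the observation model: p(y = +1 | h) = sigma(w'h)
scores = sigmoid(H @ w)
```

The confidence score `scores[i]` is what the observation likelihood model reads off for candidate i.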


The Implementation of Our Proposed Method
Our method has two major components, which are shown in Figures 5 and 6. In the first place, as demonstrated in Figure 5, unlabeled patches in RGB and depth modality are used to train the cross-modality Gaussian-Bernoulli DBM offline. Then, the trained cross-modality Gaussian-Bernoulli DBM is transferred to an observational model for visual tracking online based on Bayesian MAP, as shown in Figure 6.






Experimental Results and Analysis
The experiments of our proposed tracking algorithm were implemented in MATLAB R2014a on an Intel(R) Core(TM) i7-4712MQ CPU @ 3.40 GHz with a TITAN GPU and 8.00 GB RAM, running the Windows 8.1 operating system, in Beijing, China.

Qualitative Evaluation
In order to show the robustness of the visual object tracking algorithm discussed in this paper, we compare our tracker with several state-of-the-art methods on a recent benchmark dataset [21] in different environments with heavy or long-term partial occlusion, rotation, scale change, and fast motion. Given the limited space, in this section we only list four of the sequences to show the experimental results and the data statistics.
We compare our method with several state-of-the-art trackers, including the TLD Tracker [18], MIL Tracker [19], VTD Tracker [20], RGB-D Tracker [28], CT Tracker [29], Struck Tracker [30], Deep Tracker [15], and Multi-cues Tracker [31], and we ran the experiments based on the code provided by the authors. Figure 7 demonstrates that our method performs well in terms of rotation, scale and position when the object undergoes severe occlusion. The MIL tracker and VTD tracker are sensitive to occlusion.
Figure 9 illustrates the tracking results on a test video with severe occlusion, appearance change and fast motion. From the results, we can notice that the TLD, MIL and VTD methods are sensitive to target appearance change or occlusion. Figure 11 illustrates the "bad" tracking results of our method, meaning frames where tracking failures are observed. When the objects are fully occluded, the tracking results of our method experience a drift phenomenon.
As shown in the experimental results, the proposed tracking method performs favorably against the state-of-the-art tracking methods in handling challenging video sequences, but there are some limitations to our method. The robustness of the proposed tracking method is not strong enough to handle all occlusions and abrupt movements.


Quantitative Evaluation
We use two measurements to quantitatively evaluate tracking performance. The first is the average center location error [32], which measures the distance in pixels between the centers of the tracking results and the ground truths. The second is the success rate (SR), which is calculated as area(R_T ∩ R_G)/area(R_T ∪ R_G) and indicates the extent of region overlap between the tracking result R_T and the ground truth R_G. Figures 12-15 report the average center location errors of the different tracking methods over three test videos. The comparison results show that the proposed method has a smaller average center location error than the state-of-the-art methods in different situations.
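Both evaluation metrics can be stated precisely in a few lines. The sketch below assumes axis-aligned bounding boxes in (x, y, w, h) form; the example boxes are invented for illustration.

```python
import numpy as np

# Center location error in pixels between a tracked box and a ground-truth box.
def center_error(box_t, box_g):
    ct = np.array([box_t[0] + box_t[2] / 2.0, box_t[1] + box_t[3] / 2.0])
    cg = np.array([box_g[0] + box_g[2] / 2.0, box_g[1] + box_g[3] / 2.0])
    return float(np.linalg.norm(ct - cg))

# Region overlap area(R_T ∩ R_G) / area(R_T ∪ R_G) used for the success rate.
def overlap_ratio(box_t, box_g):
    x1 = max(box_t[0], box_g[0]); y1 = max(box_t[1], box_g[1])
    x2 = min(box_t[0] + box_t[2], box_g[0] + box_g[2])
    y2 = min(box_t[1] + box_t[3], box_g[1] + box_g[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_t[2] * box_t[3] + box_g[2] * box_g[3] - inter
    return inter / union

err = center_error((0, 0, 10, 10), (3, 4, 10, 10))   # -> 5.0
iou = overlap_ratio((0, 0, 10, 10), (5, 5, 10, 10))  # -> 25/175
```

A frame counts as a success when the overlap ratio exceeds a chosen threshold; the success rate is the fraction of such frames.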

Table 1 reports the success rates, where larger scores mean more accurate results. Table 2 lists the average speed of each method on the recent benchmark dataset [21]. The average speed of our method is 0.14 fps, implemented in MATLAB without optimization for speed; the fine-tuning of our method is time-consuming.

Table 2. The average speed of each method on the recent benchmark dataset [21].

Method | The Average Speed (fps)

Conclusions
By analyzing the problems of the existing technologies, this paper proposes a visual object tracking algorithm based on cross-modality feature learning using Gaussian-Bernoulli deep Boltzmann machines (DBM) over RGB-D data. We extract cross-modality features of the samples in RGB-D video data based on a cross-modality Gaussian-Bernoulli DBM and obtain the object tracking results over RGB-D data using a Bayesian maximum a posteriori probability estimation algorithm. The experimental results show that the proposed method greatly improves the robustness and accuracy of tracking. In the future, we will extend the proposed method to solve other vision problems (e.g., object detection, face recognition, etc.).