Article

An Efficient Bayesian Approach to Exploit the Context of Object-Action Interaction for Object Recognition

1 School of Electronic and Electrical Engineering, Sungkyunkwan University, Suwon 16419, Korea
2 School of Information and Communication Engineering, North University of China, Taiyuan 03000, China
* Author to whom correspondence should be addressed.
Sensors 2016, 16(7), 981; https://doi.org/10.3390/s16070981
Submission received: 8 March 2016 / Revised: 22 May 2016 / Accepted: 23 June 2016 / Published: 25 June 2016
(This article belongs to the Section Physical Sensors)

Abstract

This research features object recognition that exploits the context of object-action interaction to enhance the recognition performance. Since objects have specific usages, and human actions corresponding to these usages can be associated with these objects, human actions can provide effective information for object recognition. When objects from different categories have similar appearances, the human action associated with each object can be very effective in resolving ambiguities related to recognizing these objects. We propose an efficient method that integrates human interaction with objects into a form of object recognition. We represent human actions by concatenating poselet vectors computed from key frames and learn the probabilities of objects and actions using random forest and multi-class AdaBoost algorithms. Our experimental results show that poselet representation of human actions is quite effective in integrating human action information into object recognition.

1. Introduction

Object recognition is difficult due to a variety of factors, including viewpoint variation, illumination changes, and occlusion. However, even before these factors come into play, the inherent difficulty of object recognition lies in the fact that there is a large amount of intra-category appearance variation, while objects from different categories may have similar appearances. In order to improve the performance of object recognition, researchers have exploited contextual information that includes spatial [1,2,3], semantic [4,5,6,7], and scale [8,9] contexts. Spatial context refers to information about the potential locations of objects in images or the positional relationships between objects. Semantic context provides clues related to the co-occurrence of objects with other objects in a scene. Scale context gives the relative scale of objects in a scene.
In this work, we focus on the context of object-action interaction, which has been relatively unexplored. Since objects have specific usages, and human actions corresponding to these usages can be related to these objects, it is possible to improve the performance of object recognition by exploiting human interactions with objects as a type of context information. In particular, when objects from different categories have similar appearances, analyzing the human action associated with each object can be effective in resolving the ambiguity in recognizing these objects. As illustrated in Figure 1, when a cup or a spray bottle is held by a human hand, the two look very similar because of their cylindrical structures. In this case, exploiting the context of the object-action interaction greatly facilitates the distinction between the two objects.
A few previous studies have adopted similar ideas with different representations of human actions and objects and different computational algorithms. Moore et al. [10] depicted human actions using a hidden Markov model (HMM) by tracking hand locations, although it is not easy to normalize the different action speeds of different individuals. Gupta et al. [11] recognized human-object interactions based on the integration of action recognition and object recognition, where human actions and objects provide mutual context for each other. They also represented human actions using an HMM by detecting hand trajectories. Human actions were segmented into several atomic actions; however, stable segmentation of each action into atomic actions is difficult. Yao et al. [12] modeled the context between human poses and objects using Markov random fields to recognize human-object interactions. Their work is based on a single pose, and it is not clear which pose belongs to which action; thus, object categories may remain more ambiguous than when full action information is employed. Grabner et al. [13] described the relations between objects and human poses by matching their shapes, and exploited these relations to detect affordances, i.e., the functionality implied by objects, rather than to recognize specific object categories.
Alternatively, deep learning approaches, such as convolutional neural networks (CNN) [14], have achieved great success in object recognition. However, as can be seen in the experimental results, when there are not enough labelled images available, the recognition performance is not as high as expected. In addition, it is difficult to find an optimal CNN architecture for a given problem.
The goal of this study is to efficiently and effectively incorporate human action information into object recognition in order to boost the recognition performance. We employ a few image frames that contain key poses, which can be used to distinguish human actions. Since an assemblage of key poses can take advantage of the fiducial appearance of the human body in action, representation of human actions by concatenating a few key poses is quite effective. The main contribution of this work is the establishment of an effective Bayesian approach that exploits the probabilities of objects and actions, through random forest and multi-class AdaBoost algorithms.
Figure 2 overviews our method, which recognizes objects using object-action context. First, random forests for objects and actions are trained independently using object features obtained from object images and action features acquired from video sequences. In addition, by regarding each tree in a random forest as a weak classifier, the weight of each tree is determined using multi-class AdaBoost [15]. The object category of the input data is determined by applying a Bayesian approach to the probabilities calculated from the object and action features. We represent human actions by concatenating poselet vectors [16,17] computed from key frames in a video. Poselets, which depict local parts of human poses, are feature vectors that are tightly clustered based on their appearance. The value of an element in a poselet vector is the maximum response of the key frame to the corresponding poselet; we use a support vector machine (SVM) as the poselet classifier. Recently, with the resurgence of neural networks, a neural network-based version of poselets has been proposed [18]. However, the neural network-based approach is more computationally expensive than our random forest-based method and also requires many more training images to produce a well-trained network. We use the histogram of oriented gradients (HOG) to represent objects. The experimental results show that our method, using object-action context, enhances the performance of object recognition when the appearances of objects belonging to the same category differ considerably and objects of different categories are similar in appearance.
This paper is organized as follows. The following section presents the probabilistic model we propose for object recognition and describes our approach for determining the probabilities of objects and human actions using random forest and multi-class AdaBoost algorithms. The methods used for representing objects and actions are given in Section 3, and our experimental results are reported in Section 4.

2. Incorporating Object-Action Context into Object Recognition

Let $O$ and $A$ denote object categories and human action categories, respectively. $x_O$ is an appearance feature from an object, and $x_A$ is a feature of a human action related to the object. Given $x_O$ and $x_A$, the probability of the object category, $p(O \mid x_O, x_A)$, can be written as Equation (1):
$$p(O \mid x_O, x_A) = p(O \mid x_O)\, p(O \mid x_A) = p(O \mid x_O) \sum_{A} p(O, A \mid x_A) = p(O \mid x_O) \sum_{A} p(O \mid A)\, p(A \mid x_A) \quad (1)$$
Our method outputs the object that maximizes $p(O \mid x_O, x_A)$ as the recognition result. The goal of this method is to efficiently learn the probability of the object category $p(O \mid x_O)$ given an object feature $x_O$, the probability of the action category $p(A \mid x_A)$ given an action feature $x_A$, and $p(O \mid A)$.
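As a concrete illustration of this decision rule, the following minimal Python sketch (our own illustration, not the authors' code; all function and variable names are hypothetical) combines the two probability estimates according to Equation (1) and returns the maximizing object category.

```python
import numpy as np

def recognize_object(p_obj_given_xo, p_act_given_xa, p_obj_given_act):
    """Combine appearance and action evidence following Equation (1).

    p_obj_given_xo  : (num_objects,)              p(O | x_O), from the object random forest
    p_act_given_xa  : (num_actions,)              p(A | x_A), from the action random forest
    p_obj_given_act : (num_objects, num_actions)  p(O | A)
    """
    p_obj_given_xa = p_obj_given_act @ p_act_given_xa   # sum_A p(O | A) p(A | x_A)
    score = p_obj_given_xo * p_obj_given_xa             # p(O | x_O, x_A) under the model in Equation (1)
    return int(np.argmax(score))
```

With the object-action pairing used in the experiments of Section 4, p_obj_given_act reduces to the identity matrix (see the end of this section).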
We first describe how to estimate $p(O \mid x_O)$ and $p(A \mid x_A)$. We employ a random forest to learn $p(O \mid x_O)$ and $p(A \mid x_A)$. Figure 3 depicts the process used to calculate the probability of the object categories. The probability of object category $j$, $P_j$, is a weighted summation of the probabilities of the object categories, $P_{\theta_i}(j),\ i = 1, \ldots, n$, which are obtained from the trees in the random forest. The weights of the trees, $\alpha_{\theta_i}^{O},\ i = 1, \ldots, n$, are trained by multi-class AdaBoost.
The training process for the probability of an object category is as follows. First, given training data $D = \{(x^{o_1}, o_1), \ldots, (x^{o_n}, o_n)\}$ with $x^{o_j} = \{x_1^{o_j}, \ldots, x_{M_j}^{o_j}\}$, a random forest of $k$ trees, $\theta_1, \ldots, \theta_k$, is generated from the data. $x_i^{o_j}$ represents the $i$-th object feature belonging to object category $o_j$. The probability of $o_j$ computed by the random forest is given in Equation (2):
$$p(o_j \mid x_O) \approx p(o_j \mid x_O, \Theta_O) = \frac{1}{|\Theta_O|} \sum_{i=1}^{k} \alpha_{\theta_i^O} \frac{n_{o_j, \theta_i^O}}{n_{\theta_i^O}} \quad (2)$$
Here, $\Theta_O$ is the random forest built from the object features, $\theta_i^O \in \Theta_O$ denotes the $i$-th decision tree, and $|\Theta_O| = k$. $n_{o_j, \theta_i^O}$ represents the amount of training data that is classified as object category $o_j$, and $n_{\theta_i^O}$ is the total amount of training data at the leaf node of tree $\theta_i^O$. By treating each tree in the random forest as a weak classifier, the weight of each tree, $\alpha_{\theta_i^O}$, is learned using multi-class AdaBoost [15]. $p(A \mid x_A)$ is determined in exactly the same way using action features.
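Purely as an illustration of Equation (2) (assuming each tree exposes the per-class training-sample counts of the leaf an input falls into; the helper name is hypothetical), the weighted per-tree aggregation can be written as:

```python
import numpy as np

def forest_class_probability(trees, tree_weights, x, num_classes):
    """Weighted average of per-tree leaf class proportions, as in Equation (2).

    trees        : fitted decision trees; tree.leaf_counts(x) is a hypothetical helper that
                   returns the per-class training-sample counts of the leaf that x falls into.
    tree_weights : AdaBoost weights alpha_i, one per tree.
    """
    probs = np.zeros(num_classes)
    for tree, alpha in zip(trees, tree_weights):
        counts = tree.leaf_counts(x)              # n_{o_j, theta_i} for every category j
        probs += alpha * counts / counts.sum()    # alpha_i * n_{o_j, theta_i} / n_{theta_i}
    return probs / len(trees)                     # divide by |Theta_O| = k
```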
For splitting nodes in the trees of the random forests, two parameters, ‘MinParentSize’ and ‘MinLeafSize’, are defined: ‘MinParentSize’ is the minimum number of samples a node must contain to be split, and ‘MinLeafSize’ is the minimum number of samples allowed in a leaf node. We set ‘MinParentSize’ to 20 and ‘MinLeafSize’ to 10. A tree stops splitting a node when any of the following conditions holds: (1) the node contains only samples of one class; (2) the node has fewer than ‘MinParentSize’ samples; or (3) any split of the node would produce a child with fewer than ‘MinLeafSize’ samples.
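The parameter names above follow MATLAB's decision-tree options. As a rough equivalent, and only as an assumption on our part rather than the authors' implementation, the same stopping behavior can be configured in scikit-learn as follows:

```python
from sklearn.ensemble import RandomForestClassifier

# 'MinParentSize' = 20 -> min_samples_split: a node with fewer samples is not split further
# 'MinLeafSize'   = 10 -> min_samples_leaf: any split must leave at least this many samples per child
# A node containing samples of only one class stops splitting automatically.
object_forest = RandomForestClassifier(
    n_estimators=200,        # the experiments in Section 4 use forests of 100-300 trees
    min_samples_split=20,
    min_samples_leaf=10,
)
```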
Figure 4 describes the learning process of multi-class AdaBoost, which extends binary AdaBoost to multiple classes. It generates classification rules and readjusts the distribution of the training data using the preceding classification rules. When the amount of training data is $n$ and $C$ is the number of categories, the initial distribution of the data is computed in the first step. During $k$ repetitions, the weight vector $w$ is updated so that data that are not well classified are assigned higher values. In the second step, the error of the weak classifier $T^{(m)}(x)$ is computed and $w$ is renewed based on this error. Lastly, we obtain the probability of an object category as a linear combination of the probabilities obtained from the trees, which are the weak classifiers, using the weights $\alpha$. In Step 2c, the extra term $\log(C-1)$ is the only variation from the binary AdaBoost algorithm. Unlike binary classification, where the error rate of random guessing is $1/2$, the error rate of random guessing is $(C-1)/C$ for multi-class classification, so the binary AdaBoost requirement that the error rate of each weak classifier be less than $1/2$ is too strict. The $\log(C-1)$ term is added to resolve this issue.
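A condensed Python sketch of this weight-learning loop, assuming the weak classifiers are the already-trained trees of the forest (illustrative only; it mirrors the steps described above, including the extra log(C-1) term):

```python
import numpy as np

def samme_tree_weights(trees, X, y, num_classes):
    """Learn one weight per tree with multi-class AdaBoost (SAMME) [15]; the trees are kept fixed."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # Step 1: uniform distribution over the training data
    alphas = []
    for tree in trees:                           # Step 2: one repetition per weak classifier
        miss = (tree.predict(X) != y).astype(float)
        err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)   # weighted error of T^(m)
        alpha = np.log((1.0 - err) / err) + np.log(num_classes - 1)     # Step 2c: extra log(C-1) term
        w = w * np.exp(alpha * miss)             # up-weight data that are not well classified
        w = w / w.sum()
        alphas.append(alpha)
    return np.array(alphas)                      # used as the tree weights in Equation (2)
```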
To estimate $p(O \mid A)$, we apply Bayes' rule to $p(A \mid O)$:
$$p(O \mid A) = \frac{p(A \mid O)\, p(O)}{\sum_{O} p(A \mid O)\, p(O)} \quad (3)$$
where $p(A \mid O)$ can be calculated from the number of observations associated with the same object category:
$$p(A = a_j \mid O = o_i) = \frac{n_j}{N_i} \quad (4)$$
Here, $N_i$ is the number of observations associated with object category $o_i$, and $n_j$ is the number of those observations in which action category $a_j$ occurs. In our experiments, the training image sequences are collected such that each subject performs the action that corresponds to the correct usage of the given object. Thus, in the actual implementation, $p(A = a_j \mid O = o_i) = 1$ for $i = j$ and 0 otherwise, where $i = j$ denotes an object and its correct action pair.
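A small sketch of Equations (3) and (4), assuming the training observations are given as parallel lists of object and action labels (illustrative only; in our experiments the resulting matrix reduces to the identity because each object is paired only with its correct action):

```python
import numpy as np

def object_given_action(object_labels, action_labels, num_objects, num_actions):
    """Estimate p(O | A) from observed (object, action) training pairs (Equations (3) and (4))."""
    counts = np.zeros((num_actions, num_objects))
    for o, a in zip(object_labels, action_labels):
        counts[a, o] += 1
    # Equation (4): p(A = a_j | O = o_i) = n_j / N_i, i.e., column-wise normalization
    p_a_given_o = counts / counts.sum(axis=0, keepdims=True)
    p_o = counts.sum(axis=0) / counts.sum()          # empirical prior p(O)
    joint = p_a_given_o * p_o                        # p(A | O) p(O)
    # Equation (3): normalize over object categories, then return as (num_objects, num_actions)
    return (joint / joint.sum(axis=1, keepdims=True)).T
```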

3. Representing Objects and Human Actions

We can regard human actions as an assemblage of continuous poses. However, on account of the similarity between the poses in adjacent frames, singling poses out from all of the video frames creates needless duplication. Thus, we extracted the key frames from the video in order to use the minimum number of poses to express human actions. We then deployed poselet vectors to represent the key frames.
Figure 5 shows the procedure used for turning key frames into poselet vectors. To describe the key frames using poselet vectors, the labeled poselets shown in Figure 6 are expressed by HOG [19], and an SVM is learned for each poselet. A poselet vector is generated using the maximum response values, which are obtained by applying all of the learned poselet SVMs to a key frame through a sliding window technique. An action feature is then obtained by concatenation of the poselet vectors.
To extract key frames from an input video, a poselet vector is computed for each frame, and the Euclidean distance in poselet-vector space between the training key frames and the frames around the corresponding times in the input video is computed. The frames with the minimum distance are selected as the key frames of the input video. Objects are represented using HOG, with each object image resized to 50 × 50.
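The following sketch summarizes our understanding of the poselet-vector and key-frame computation described above. The sliding-window HOG extraction, the trained poselet SVMs, and the poselet vectors of the training key frames are assumed to be available, and for simplicity the sketch searches all frames rather than only a temporal window around each training key frame (all names are hypothetical):

```python
import numpy as np

def poselet_vector(frame_windows_hog, poselet_svms):
    """One element per poselet: the maximum SVM response over all sliding windows of a frame.

    frame_windows_hog : (num_windows, hog_dim) HOG features of the frame's sliding windows
    poselet_svms      : trained linear SVMs, one per poselet (38 in our experiments)
    """
    return np.array([svm.decision_function(frame_windows_hog).max() for svm in poselet_svms])

def select_key_frames(video_poselet_vectors, train_key_frame_vectors):
    """For each training key frame, pick the input frame with the minimum Euclidean distance."""
    key_indices = []
    for key_vec in train_key_frame_vectors:              # e.g., three key frames per action
        dists = np.linalg.norm(video_poselet_vectors - key_vec, axis=1)
        key_indices.append(int(np.argmin(dists)))
    return key_indices

def action_feature(video_poselet_vectors, train_key_frame_vectors):
    """Concatenate the poselet vectors of the selected key frames (3 x 38 = 114 dimensions)."""
    idx = select_key_frames(video_poselet_vectors, train_key_frame_vectors)
    return np.concatenate([video_poselet_vectors[i] for i in idx])
```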

4. Experimental Results

We compared the performance of our method with that of the method proposed by Gupta et al. [11] and with a CNN. To our knowledge, the algorithm of Gupta et al. [11] is the most representative work that exploits human actions as context information for object recognition. We included a CNN in the performance comparison because CNNs have recently achieved great success in object recognition.
We also conducted an experiment using local space-time action features. To represent actions, we used a Bag of Visual Words (BoV) model of local N-jets [20,21,22], built from space-time interest points (STIP) [21,22]. Local N-jets are popular and robust motion features whose first two orders capture velocity and acceleration. The codebook for the BoV model is constructed using the K-means algorithm.
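A brief sketch of this BoV pipeline under our assumptions (the N-jet descriptors at the STIPs are taken as given, and the codebook size is our own choice, as it is not reported above):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(training_njet_descriptors, num_words=200):
    """Cluster local N-jet descriptors from all training videos into a visual vocabulary.

    num_words is our own choice for illustration; the codebook size is not specified above.
    """
    return KMeans(n_clusters=num_words, n_init=10).fit(training_njet_descriptors)

def bov_action_feature(video_njet_descriptors, codebook):
    """Histogram of visual-word assignments over one video's STIP descriptors."""
    words = codebook.predict(video_njet_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)   # L1-normalize so videos with different numbers of STIPs are comparable
```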
For our experiments, we designed a CNN architecture by referring to the ConvNetJS CIFAR-10 demo [23]. As described in Table 1, the network contains 13 layers. The outputs of the first, second, and third convolutional layers are passed to rectified linear unit (ReLU) and pooling layers. The first pooling layer is a max pooling layer and the remaining pooling layers are average pooling layers. The fourth convolutional layer and the two fully-connected layers are linked to one another without intervening ReLU and pooling layers. The last fully-connected layer feeds its output to a softmax layer.
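For reference, the following PyTorch sketch reproduces the layer sizes listed in Table 1 under our assumptions ('same' padding for the 5 × 5 convolutions and ceil-mode pooling; this is our reconstruction, not the original implementation):

```python
import torch.nn as nn

class ObjectCNN(nn.Module):
    """Reconstruction of the 13-layer network in Table 1 (padding choices are assumptions)."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=1, padding=2),    # Layer 1:  50 x 50 x 32
            nn.MaxPool2d(3, stride=2, ceil_mode=True),   # Layer 2:  25 x 25 x 32
            nn.ReLU(),                                   # Layer 3
            nn.Conv2d(32, 32, 5, stride=1, padding=2),   # Layer 4:  25 x 25 x 32
            nn.ReLU(),                                   # Layer 5
            nn.AvgPool2d(3, stride=2, ceil_mode=True),   # Layer 6:  12 x 12 x 32
            nn.Conv2d(32, 64, 5, stride=1, padding=2),   # Layer 7:  12 x 12 x 64
            nn.ReLU(),                                   # Layer 8
            nn.AvgPool2d(3, stride=2, ceil_mode=True),   # Layer 9:  6 x 6 x 64
            nn.Conv2d(64, 64, 4, stride=1),              # Layer 10: 3 x 3 x 64
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 3 * 64, 64),                   # Layer 11 (fully-connected)
            nn.Linear(64, num_classes),                  # Layer 12 (fully-connected)
            # Layer 13: softmax is applied by the loss (e.g., nn.CrossEntropyLoss) during training
        )

    def forward(self, x):                                # x: (batch, 3, 50, 50)
        return self.classifier(self.features(x))
```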
For the experiments, we captured videos of 19 subjects performing four kinds of actions with four different objects (i.e., cups, scissors, phones, and spray bottles). Each of the subjects carried out actions using these objects. We constructed a dataset that contains 228 video sequences [24]. We extracted key frames from the video sequences in order to use the minimum number of poses to express human actions and used poselet vectors to represent the key frames. An action feature is represented as a concatenation of three poselet vectors. We used 38 kinds of poselets in this experiment; thus, an action feature has 114 dimensions. To learn the poselet SVMs, we used 20,308 positive images for the 38 different poses and 2321 negative images. The size of each poselet training image is 96 × 94. A linear SVM is used to differentiate samples in a single poselet category from samples belonging to all of the remaining poselet categories.
In order to obtain more positive action data for the random forest and SVM, we used combinations of the frames adjacent to the key frames. As a result, to train the action random forest, we used the following amounts of action features: 1625 action features for the “drinking water” action, 7149 for “calling on the phone”, 1674 for “cutting paper”, and 678 for “spraying”. For training the multi-class AdaBoost, we used 848 action features for “drinking water”, 1890 for “calling on the phone”, 330 for “cutting paper”, and 658 for “spraying”.
The object images used in the experiments were obtained from Google Image Search [25] and ImageNet [26]. We collected 3120 cup images, 4131 phone images, 2263 scissors images, and 2006 spray bottle images. We used 1200 images from each category to train the object random forest and 600 images for training the multi-class AdaBoost. Figure 7 shows some of the object images that were used in our experiments. The object image set contains objects that have a variety of appearances within the same category. Some objects, such as cups and spray bottles, are similar in appearance due to their cylindrical structure even though they belong to different categories.
We conducted experiments with random forests using 100, 150, 200, 250, and 300 trees. Figure 8 shows the confusion matrices, which describe the results of object recognition. The first column represents the results of object recognition using only object appearance features and the second column depicts the results of object recognition using both object appearances and human actions. As expected, we see improved object recognition when using the human actions. Overall, the recognition rate is improved by between 4% (scissors) and 30% (phone), as compared to when only object appearances are used. The number of trees has little influence on the performance of object recognition in the experiments, both with and without human action context.
Figure 9 shows the results of object recognition in which actions are represented by the BoV of local N-jets. For training the action random forest, we used 39 action features for the “drinking water” action, 39 for “calling on the phone”, 39 for “cutting paper”, and 40 for “spraying”. For training the multi-class AdaBoost, the same action features employed in the random forest were used. For testing, we used 18 action features for “drinking water”, 18 for “calling on the phone”, 18 for “cutting paper”, and 17 for “spraying”. We used 1200 images from each object category to train the object random forest and 600 images for training the multi-class AdaBoost. For testing, we used 18 object features for “cup”, 18 for “phone”, 18 for “scissors”, and 17 for “spray”.
Except for spray bottles, we observed that the performance of object recognition is also significantly improved when using the BoV of local N-jets as action features. The improvement in recognition rate ranges from 6% (cup) to 50% (phone). As shown in Figure 8 and Figure 9, the poselet representation of actions performs better for cups, phones, and spray bottles: the differences in recognition rate between the two action representations were 6%–22% for cups, 1%–4% for phones, and 3%–16% for spray bottles. On the other hand, recognition of scissors is 3%–12% better with the BoV of local N-jets.
Figure 10 shows the results of applying Gupta’s algorithm to our experimental data. With the exception of cups, the objects exhibit lower recognition performance than with our method: the differences in recognition rate were 28%–31% for phones, 21%–31% for scissors, and 7%–9% for spray bottles. We observed that this performance gap is caused mainly by their representation of human actions with incorrectly segmented atomic actions. From the experimental results, we see that our poselet representation of human actions, combined with a simple graphical model, is more effective at integrating human action information into object recognition.
The results of applying the CNN to our experimental data are shown in Figure 11. To train the CNN, we used the same number of images for each category as was used in our method (1800). It can be seen that our method outperforms the CNN: the performance improvements were 11% for cups, 34%–37% for phones, 3%–10% for scissors, and 4%–6% for spray bottles. To allow for a clearer performance comparison, we also include Figure 12. We observed that 1800 labelled images per category are not enough to adequately train the CNN and guarantee better performance than that obtained by our method. Moreover, it is difficult to find the optimal CNN architecture for a given problem.
Cups and spray bottles look similar to each other, especially when they are held in a human hand, because of their cylindrical structure. Even some phones, such as cordless home phones, have appearances that are similar to cups and spray bottles in the feature space (due to their rectangular form). From the experimental results, we confirmed that our method greatly facilitates distinction between similar looking objects from different categories by efficiently exploiting the action information associated with the objects.

5. Conclusions

This work focused on the efficient use of object-action context to resolve the inherent difficulty of object recognition caused by large intra-category appearance variations and inter-category appearance similarities. To accomplish this, we proposed a method that integrates how humans interact with objects into object recognition. The probabilities of objects and actions are computed effectively using random forest and multi-class AdaBoost algorithms. Through experiments, we confirmed that a few key poses provide sufficient information for distinguishing human actions. When objects from different categories have similar appearances, the use of the human actions associated with each object can be effective in resolving ambiguities in recognizing these objects. We also observed that when the amount of labelled data is insufficient, carefully designed statistical learning methods using handcrafted features are better suited to obtaining an efficient solution than deep learning methods.

Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2013R1A1A2006164).

Author Contributions

Sungbaek Yoon conceived and designed the experiment; Sungbaek Yoon and Hyunjin Park performed the experiments; and Sungbaek Yoon and Juneho Yi wrote the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HMM: Hidden Markov Model
CNN: Convolutional Neural Network
SVM: Support Vector Machine
HOG: Histogram of Oriented Gradients
k-NN: k Nearest Neighbors
ReLU: Rectified Linear Unit
BoV: Bag of Visual Words
STIP: Space-Time Interest Point

References

  1. Kumar, S.; Hebert, M. A hierarchical field framework for unified context-based classification. In Proceedings of the IEEE International Conference on Computer Vision, Beijing, China, 17–20 October 2005.
  2. Heitz, G.; Koller, D. Learning spatial context: Using stuff to find things. In Proceedings of the European Conference on Computer Vision, Marseille, France, 12–18 October 2008.
  3. Prest, A.; Schmid, C.; Ferrari, V. Weakly supervised learning of interaction between humans and objects. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 601–614. [Google Scholar] [CrossRef] [PubMed]
  4. Rabinovich, A.; Vedaldi, A.; Galleguillos, C.; Wiewiora, E.; Belongie, S. Objects in context. In Proceedings of the IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–20 October 2007.
  5. Wolf, L.; Bileschi, S. A critical view of context. Int. J. Comput. Vis. 2006, 69, 251–261. [Google Scholar] [CrossRef]
  6. Harzallah, H.; Jurie, F.; Schmid, C. Combining efficient object localization and image classification. In Proceedings of the International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009.
  7. Murphy, K.; Torralba, A.; Freeman, W. Using the forest to see the trees: A graphical model relating features, objects, and scenes. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–13 December 2003.
  8. Torralba, A. Contextual priming for object detection. Int. J. Comput. Vis. 2003, 53, 169–191. [Google Scholar] [CrossRef]
  9. Strat, T.; Fischler, M. Context-based vision: Recognizing objects using information from both 2-d and 3-d imagery. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 1050–1065. [Google Scholar] [CrossRef]
  10. Moore, D.J.; Essa, I.A.; Hayes, M.H. Exploiting human actions and object context for recognition tasks. In Proceedings of the IEEE International Conference on Computer Vision, Corfu, Greece, 20–27 September 1999.
  11. Gupta, A.; Kembhavi, A.; Davis, L.S. Observing human-object Interactions: Using spatial and functional compatibility for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 1775–1789. [Google Scholar] [CrossRef] [PubMed]
  12. Yao, B.; Li, F. Recognizing human-object Interactions in still images by modeling the mutual context of objects and human poses. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1691–1703. [Google Scholar] [PubMed]
  13. Grabner, H.; Gall, J.; Gool, L.V. What makes a chair a chair? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 21–25 June 2011.
  14. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Harrahs and Harveys, Lake Tahoe, CA, USA, 3–8 December 2012.
  15. Zhu, J.; Zou, H.; Rosset, S.; Hastie, T. Multi-class AdaBoost. Stat. Interface 2009, 2, 349–360. [Google Scholar]
  16. Raptis, M.; Sigal, L. Poselet key-framing: A model for human activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
  17. Maji, S.; Bourdev, L.D.; Malik, J. Action recognition from a distributed representation of pose and appearance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011.
  18. Bourdev, L.; Yang, F.; Fergus, R. Deep Poselets for Human Detection. Available online: http://arxiv.org/abs/1407.0717 (accessed on 5 July 2015).
  19. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005.
  20. Koenderink, J.; Doorn, A.V. Representation of local geometry in the visual system. Biol. Cybern. 1987, 55, 367–375. [Google Scholar] [CrossRef] [PubMed]
  21. Laptev, I.; Caputo, B.; Schuldt, C.; Lindeberg, T. Local velocity-adapted motion events for spatio-temporal recognition. Comput. Vis. Image Underst. 2007, 108, 207–229. [Google Scholar] [CrossRef]
  22. Chakraborty, B.; Holte, M.B.; Moeslund, T.B.; Gonzàlez, J. Selective spatio-temporal interest points. Comput. Vis. Image Underst. 2012, 116, 396–410. [Google Scholar] [CrossRef]
  23. ConvNetJS CIFAR-10 Demo. Available online: http://cs.stanford.edu/people/karpathy/convnetjs/cifar10.html (accessed on 25 September 2015).
  24. Action Videos. Available online: https://vision.skku.ac.kr (accessed on 5 February 2014).
  25. Google Images. Available online: https://images.google.com (accessed on 5 February 2014).
  26. ImageNet. Available online: https://www.image-net.org (accessed on 5 December 2014).
Figure 1. Examples of object-action context. Objects have specific usages and human actions corresponding to these usages can be related to these objects.
Figure 2. Object recognition using object-action context.
Figure 3. Computing the probabilities of object categories using a random forest and multi-class Adaboost.
Figure 4. Training the weight of each tree in the random forest using multi-class Adaboost.
Figure 5. Creation of poselet vectors and action features.
Figure 6. Examples of poselets.
Figure 7. Some of the object images used in our experiments.
Figure 8. The results of object recognition: the left column shows the results using only the object’s appearance and the right column represents the results using the object’s appearance and human actions.
Figure 9. The results of object recognition using the BoV of Local N-jets: The first column shows the results using only the object’s appearance and the second column represents the results using the object’s appearance and human actions.
Figure 10. The results of Gupta’s algorithm for our experimental data.
Figure 11. The results of applying the CNN to our experimental data.
Figure 12. The performance comparison of our methods using poselet vectors and local N-jets with Gupta’s algorithm and CNN.
Table 1. The CNN architecture used for the experiments.

Layer 1: Conv, input 50 × 50 × 3, filter 5 × 5 × 3 × 32, stride 1, output 50 × 50 × 32
Layer 2: Max pooling, input 50 × 50 × 32, pool 3 × 3, stride 2, output 25 × 25 × 32
Layer 3: ReLU, input 25 × 25 × 32, output 25 × 25 × 32
Layer 4: Conv, input 25 × 25 × 32, filter 5 × 5 × 32 × 32, stride 1, output 25 × 25 × 32
Layer 5: ReLU, input 25 × 25 × 32, output 25 × 25 × 32
Layer 6: Avg pooling, input 25 × 25 × 32, pool 3 × 3, stride 2, output 12 × 12 × 32
Layer 7: Conv, input 12 × 12 × 32, filter 5 × 5 × 32 × 64, stride 1, output 12 × 12 × 64
Layer 8: ReLU, input 12 × 12 × 64, output 12 × 12 × 64
Layer 9: Avg pooling, input 12 × 12 × 64, pool 3 × 3, stride 2, output 6 × 6 × 64
Layer 10: Conv, input 6 × 6 × 64, filter 4 × 4 × 64 × 64, stride 1, output 3 × 3 × 64
Layer 11: Fully-connected, input 3 × 3 × 64, filter 3 × 3 × 64 × 64, stride 1, output 1 × 1 × 64
Layer 12: Fully-connected, input 1 × 1 × 64, filter 1 × 1 × 64 × 4, stride 1, output 1 × 1 × 4
Layer 13: Softmax, input 1 × 1 × 4, output 1 × 1 × 4
