Unsupervised Object Modeling and Segmentation with Symmetry Detection for Human Activity Recognition

In this paper we present a novel unsupervised approach to detecting and segmenting objects as well as their constituent symmetric parts in an image. Traditional unsupervised image segmentation is limited by two obvious deficiencies: the object detection accuracy degrades when the boundaries of the segmented regions are misaligned with the target, and pre-learned models are required to group regions into meaningful objects. To tackle these difficulties, the proposed approach incorporates the pair-wise detection of symmetric patches to segment images into symmetric parts. The skeletons of these symmetric parts then provide estimates of the bounding boxes that locate the target objects. Finally, for each detected object, the graph cut-based segmentation algorithm is applied to find its contour. The proposed approach has significant advantages: no a priori object models are used, and multiple objects are detected. To verify the effectiveness of the approach, human objects are extracted from among the detected objects based on the cues that a face part contains an oval shape and skin colors. The detected human objects and their parts are finally tracked across video frames to capture the object part movements for learning the human activity models from video clips. Experimental results show that the proposed method gives good performance on publicly available datasets.


Introduction
Part-based object detection and segmentation is an important problem in computer vision. Classical object detection methods often use learned models to detect and recognize the targets [1,2]. Compared with classical object modeling schemes, which often delimit the training objects with inaccurate bounding boxes, quality-conscious object segmentation opens a new way to build the most discriminative models. Recently, segmentation-based tracking, which incorporates temporal information of object movements to improve the detection accuracy, has attracted great attention in the field of video object segmentation [3][4][5] due to its potential for many vision-based applications, such as video surveillance, man-machine interfaces, sports analysis, and authoring of video games [6]. Incorporating spatial and temporal information to improve the accuracy of object segmentation is particularly important and remains a challenge.
Object segmentation is generally far more difficult than low-level image segmentation, which groups pixels of similar features, e.g., colors, textures, and optical flows, into regions without inferring a complete image understanding model. During the past three decades, intensive research has been carried out in the automatic segmentation domain [7][8][9][10][11][12]. These techniques achieve efficient segmentation by subdividing an image into a number of moving objects and the background according to a homogeneous low-level feature criterion and object tracking. This homogeneous grouping often extracts semantically incomplete objects, because an object may consist of multiple parts with different homogeneous features. Moreover, tracking or body pose estimation in real-world videos is generally not reliable due to object occlusion, distortion, and changes in lighting. Semi-automatic semantic object segmentation algorithms [13][14][15] have thus been proposed to tackle these difficulties. In the common first step of these methods, users initially identify a semantic object by using a tracing interface, and the computer then automatically tracks the segmented object through the successive frames.
Recent approaches suggest using pre-learned object models to detect, segment, track, and recognize the target objects in images [1,16,17,18]. For instance, in [1], parts arranged in a deformable configuration are modeled to capture the local property of objects. The use of visual patterns of local patches in object modeling is related to several ideas, including the approach of local appearance codebooks [19] and the generalized Hough transform (GHT) [20] for object detection. At training time, these methods learn a model of the spatial occurrence distributions of local patches with respect to object centers. At testing time, based on the trained object models, the visual patterns of patches, with points of interest as their centers, are matched to visual codebooks to locate the targets using the Hough voting framework. However, the effectiveness of visual pattern grouping by Hough voting is heavily dependent on the quality of the learned visual model, the ability to precisely locate the target objects, and the features extracted from training samples.
Many object detection approaches are limited by ill-defined object models, which are trained from a limited set of views and are deficient in characterizing the texture of local parts and their spatial constraints [1,2]. The performance of these methods degrades dramatically when the input image is strongly deformed compared with the training images. Symmetry, however, is an essential characteristic of man-made and natural objects. Accordingly, the motivation of this paper is to integrate symmetry detection into classical object detection and segmentation to construct a model-free approach. Instead of learning a complex object model from a large number of training samples, our approach defines part-based object detection and segmentation as the task of decomposing an image into its constituent salient symmetric parts, each of which is characterized by a common set of local features, i.e., symmetric skeletons, dominant colors, and shape descriptors. Thus, our approach first detects salient symmetries in the test image with the Hough voting framework. The patches that constitute each of the detected symmetries are then determined by the inverse Hough transformation. Clusters of symmetries are generated to locate potential objects, each of which is specified by a bounding box. Finally, the target object is segmented by performing classical image segmentation on each bounding box.
Object classifiers can be further used to annotate, check, and interpret the detected objects. Traditional object classifiers are trained from a set of weakly annotated sample objects, each of which is specified by a bounding box that includes undesirable background information. In contrast, the proposed object detection and segmentation introduces less background noise into the extracted targets and helps avoid performance degradation in both the learning and recognition phases of object classifiers. To verify the effectiveness of the object detection and segmentation, we apply a face detection algorithm [21] to all detected parts to locate human objects. The detected human objects and their parts are then tracked across video frames to capture the object part movements for learning the poselet-like models, which have been verified to be effective in human activity recognition [22]. Experimental results show that the proposed method gives good performance on publicly available datasets in terms of detection accuracy and recognition rate.
The remainder of this paper is organized as follows. Section 2 presents the related work on semantic object segmentation and symmetry detection. Section 3 describes the approach to object segmentation based on the detected salient symmetric parts. Section 4 presents the application to human activity recognition. Section 5 describes the experimental tests that illustrate the effectiveness of the proposed method. Finally, conclusions are drawn in Section 6.

Related Work
Segmentation-based object recognition has been extensively studied in computer vision, with many algorithms available [12,23,24,25]. Among them, the approach most closely related to object recognition is semantic segmentation, which assigns each pixel in an image to one of several pre-defined semantic categories [23]. Compared to classical low-level unsupervised segmentation, which groups pixels of similar features, such as color, texture, or optical flow, into homogeneous regions, semantic segmentation uses a supervised learning algorithm to build up semantic object models.
State-of-the-art semantic segmentation algorithms often use the local appearance model of an object to estimate the score of a pixel, a patch, or a region belonging to the target category [12,23,26,27,28]. To address the labeling consistency between neighboring local appearances, a local consistency model is then used to further group pixels, patches, or regions into parts, though these parts still need to be merged to capture an object as a whole [1,2,29,30]. Therefore, a global consistency model is finally used to enforce global consistencies, i.e., at a region or image level [30,31]. Girshick et al. have shown that rich feature hierarchies are very useful for accurate object detection and semantic segmentation [32].
Recently, object segmentation in videos has opened a way to estimate object boundaries by tracking pixels, patches, or regions to obtain their trajectories. Local elements with similar trajectories are then grouped into parts and objects [3-5,7,9-15]. However, the accuracy of any boundary estimate is limited by a number of systemic factors such as image resolution, noise, motion skew, and object occlusion.
For example, formulating object segmentation as motion segmentation using optical flow rests on the assumption of brightness constancy, which is violated at moving boundaries, resulting in poor estimates of object contours [33]. Object segmentation also tries to detect and segment the observed motions in videos into semantically meaningful instances of particular activities [17]. To reach this goal, recent approaches consider the detection and recognition of video objects as an extension of 2D object detection to a higher dimensionality.
Many human-made objects, human bodies, natural scenes, and animals have symmetric parts. Several feature-based approaches have been proposed in the literature to detect symmetries in images for object detection and segmentation [34][35][36]. What these approaches have in common is that they focus on designing reliable features for patch correspondences. For instance, Hsieh et al. designed a symmetric transformation to provide a framework for finding pairs of symmetric patches in vehicle images [36]. A recent survey of symmetry in 3D geometry can be found in [37]. Although symmetries provide a natural way to group low-level patches into middle-level parts, the combination of symmetric parts into high-level objects remains a challenging problem. Some methods depend on a prior global consistency model of the target object to perform top-down detection [29]. In contrast, unsupervised object detection and segmentation, which relies on neither human input nor top-down information, is important due to its potential in a variety of applications.

Unsupervised Object Detection and Segmentation
In this section, we present a probabilistic symmetry-based framework for combined object detection and segmentation. First we outline the notations to define the problem, and then emphasize the symmetry detection and clustering used to estimate object locations. This is followed by image segmentation to obtain precise object boundaries. Finally, we describe a generative model that sets the foundation of our proposed object detection and segmentation.

Notations and System Overview
Let $I$ and $O$, respectively, denote the image frame and the object frame (a bounding box in $I$, shown in Figure 1b). Let $P = \{x_n\}$ denote the set of centers of the sampled patches in $O$, and let $\{f_n\}$ be the set of feature vectors that describe $P$. The object being segmented is represented by its shape $C$, the bounding box $B$, and the set $S$ of symmetries determined by the set $M$ of symmetric patch pairs. The bounding box $B$ can be intersected with the segmentation result obtained by performing image segmentation on $I$ [24] to obtain the final object segmentation. The feature of an 8 × 8 patch used in this study is the well-known histogram of oriented gradients (HOG) [38], though other, more complex features such as scale-invariant feature transform (SIFT) [39] or speeded up robust features (SURF) [36] descriptors can also be used as a replacement. A patch pair is in $M$ if the HOG distance between its two patches is less than a predefined threshold. When it is available, the optical flow of a patch can also be used as a supplementary feature to improve the detection accuracy of symmetric parts. The unsupervised approach consists of six pipelined steps, shown in Figure 2, to automatically locate multiple objects in an image $I$.
After performing the well-known Canny edge detection on $I$, we divide $I$ into multiple 8 × 8 patches, each of which is described by its center (an edge point) and its HOG feature vector. Next, based on a distance function in terms of HOG, patches in $I$ are grouped into multiple clusters, each of which determines a set of symmetric patch pairs; symmetry detection by Hough voting follows. The detected symmetries are then used to model the object structures with a graph representation, which is optimally partitioned with the dominant sets algorithm [39]. Each symmetry sub-graph estimates the bounding box $B$ of an object. Finally, by applying the graph cut algorithm [24] to $B$, the approach locates an object that contains as little background as possible. A significant contribution of our approach is that no tedious object models need to be learned in advance of object detection.

Discovery of Symmetric Patch Pairs
An image $I$ is first partitioned into multiple overlapping 8 × 8 patches $P_i$, $i = 1, \ldots, N$, with edge points as the centers $x_i$. For each patch, an 8-bin HOG descriptor $f_i$ with the quantization angles $j \times 45°$, $j = 0, \ldots, 7$, is used to represent its local appearance [1,38]. However, HOG by itself lacks the capability of defining symmetric patch pairs, and thus we should first define the symmetric relation between patches in terms of their HOG descriptors. Figure 3 shows that a small patch sampled from the contour of an object could contain a line edge, and the peak bin angle of the corresponding HOG approximates the gradient direction of the line in the patch. The similarity between two patches $P_i$ and $P_j$, written here as $s(P_i, P_j)$, is defined as
$$s(P_i, P_j) = \delta(\lVert x_i - x_j \rVert < L)\,(f_i \cdot f_j), \tag{1}$$
where $\delta(\lVert x_i - x_j \rVert < L)$ is the delta function that returns 1 when the geometric distance between the centers of $P_i$ and $P_j$ is less than $L$, and 0 otherwise; $(f_i \cdot f_j)$ is the inner product that measures the similarity between $f_i$ and $f_j$. Using (1) and the k-means clustering [40], patches in $I$ are grouped into $k$ clusters. As mentioned above, two patches belonging to the same cluster form a pair of symmetric patches. Thus, the set $M$ of symmetric patch pairs consists of all pairs $\{P_i, P_j\}$, $i \neq j$, whose two members fall in the same one of the $k$ clusters. Note that the value of $k$ should not be large, so that most of the potential symmetric patch pairs are preserved; a small $k$ also brings fast convergence to the k-means clustering. Thus, the computational cost of executing the patch clustering on the fly is not high.
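The patch description and pairing steps above can be illustrated with a minimal sketch. It assumes a grayscale patch given as a nested list, central-difference gradients, and an L2-normalized histogram; the function names `hog8`, `gated_similarity`, and `pairs_from_clusters` are hypothetical and are not taken from the original implementation. The cluster labels are assumed to come from any k-means routine.

```python
import itertools
import math

def hog8(patch):
    """8-bin orientation histogram; bin j collects gradients nearest to j * 45 degrees."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * 8
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (patch[y][x + 1] - patch[y][x - 1]) / 2.0  # central differences
            gy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 360.0
            hist[int(round(angle / 45.0)) % 8] += magnitude
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist

def gated_similarity(x_i, f_i, x_j, f_j, L):
    """Inner product of two descriptors, set to 0 unless the centers are closer than L."""
    if math.hypot(x_i[0] - x_j[0], x_i[1] - x_j[1]) >= L:
        return 0.0
    return sum(a * b for a, b in zip(f_i, f_j))

def pairs_from_clusters(labels):
    """All index pairs (i, j), i < j, whose patches share a cluster label."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    return [pair for members in groups.values()
            for pair in itertools.combinations(members, 2)]
```

For example, `pairs_from_clusters([0, 1, 0, 1, 0])` pairs patches 0, 2, and 4 with one another and patch 1 with patch 3. Because every within-cluster pair is kept, the number of candidate pairs grows quadratically with the cluster size.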

Discovery of Symmetric Parts
Let $\{P_i, P_j\}$ be a patch pair in $M$. The patch pairs of $M$ determine the skeleton $K$ of the corresponding symmetric part, as shown in Figure 4a. Also let $(l_i, m_i)$ and $(l_j, m_j)$ be the normal vectors of the gradient directions of $P_i$ and $P_j$, respectively. These two normal vectors determine two lines $L_i = x_i + t_i (l_i, m_i)$ and $L_j = x_j + t_j (l_j, m_j)$. The intersection point $(X, Y)$ of $L_i$ and $L_j$ is obtained by solving
$$x_i + t_i (l_i, m_i) = x_j + t_j (l_j, m_j) \tag{3}$$
for $t_i$ and $t_j$ and substituting the solution back into $L_i$. We can also compute the included angle $\psi$ between $L_i$ and $L_j$ from the directions of the two lines. Next, as shown in Figure 4b, we compute the skeleton $K$, which is characterized by two parameters $(r, \theta)$. The local similarity measurement for $\{P_i, P_j\}$ then casts a vote on the 2D $(r, \theta)$ space $V$. We collect the votes from all symmetric patch pairs in $M$ to generate the Hough voting image $V$. Peak detection on $V$ then defines the skeletons of salient symmetries: a cell $(r, \theta)$ is accepted as a peak when its vote $V(r, \theta)$ passes a pre-defined threshold $\gamma$. The set of member patch pairs $P_{ij} = (P_i, P_j)$ that constitute a symmetry $S$ with skeleton $K$ characterized by $(r, \theta)$ can thus be defined as
$$S = \{ P_{ij} \mid P_{ij} \in \mathrm{INV}(r, \theta) \},$$
where $\mathrm{INV}(r, \theta)$ is the inverse Hough transform on $V(r, \theta)$ that returns the set of patch pairs casting votes on $(r, \theta)$. Multiple peaks can be detected from $V$ to locate multiple salient symmetric parts in the input image $I$. Note that the patch pairs not in $M$ are supposed to be less similar and are excluded from casting a vote on the Hough voting image $V$. This avoids generating spurious peaks. Figure 5 shows an example to illustrate the Hough voting framework for symmetry detection.
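The voting procedure can be sketched as follows. The line intersection follows directly from the two parametric lines. The mapping from a patch pair to $(r, \theta)$, however, is an assumption made only for illustration: the sketch uses the perpendicular bisector of the two patch centers in the normal form $x\cos\theta + y\sin\theta = r$, which may differ from the construction in Figure 4b. The peak test is a plain threshold without non-maximum suppression, and the names `intersect`, `bisector_axis`, and `hough_symmetry` are hypothetical.

```python
import math
from collections import defaultdict

def intersect(x_i, d_i, x_j, d_j, eps=1e-9):
    """Intersection of the lines x_i + t_i * d_i and x_j + t_j * d_j (None if parallel)."""
    (xi, yi), (li, mi) = x_i, d_i
    (xj, yj), (lj, mj) = x_j, d_j
    det = li * mj - mi * lj
    if abs(det) < eps:
        return None
    t_i = ((xj - xi) * mj - (yj - yi) * lj) / det
    return (xi + t_i * li, yi + t_i * mi)

def bisector_axis(x_i, x_j):
    """(r, theta) of the perpendicular bisector of two centers (assumed axis model)."""
    mid_x, mid_y = (x_i[0] + x_j[0]) / 2.0, (x_i[1] + x_j[1]) / 2.0
    theta = math.atan2(x_j[1] - x_i[1], x_j[0] - x_i[0]) % math.pi  # fold into [0, pi)
    r = mid_x * math.cos(theta) + mid_y * math.sin(theta)
    return r, theta

def hough_symmetry(centers, pairs, weights, r_step=2.0, theta_bins=36, gamma=2.0):
    """Accumulate weighted votes per (r, theta) cell and return the voters of each peak."""
    votes = defaultdict(float)
    voters = defaultdict(list)   # remembers who voted where (inverse Hough lookup)
    for (i, j), weight in zip(pairs, weights):
        r, theta = bisector_axis(centers[i], centers[j])
        cell = (int(round(r / r_step)), int(theta / math.pi * theta_bins) % theta_bins)
        votes[cell] += weight
        voters[cell].append((i, j))
    return {cell: voters[cell] for cell, total in votes.items() if total > gamma}
```

Keeping the list of voters per cell makes the inverse Hough transform a dictionary lookup, at the cost of storing one entry per vote.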

Object Detection with Symmetry Graph Partitioning
The set of detected symmetries can be used to locate multiple objects in I by merging the set of skeletons to describe the symmetric axes of S.
Every skeleton $K_k$ is a line characterized by two parameters $(r_k, \theta_k)$. As mentioned above, using (3), the $(i,j)$-th patch pair $P_{ij}$ in $S_k$ defines an intersection point $(X_{ij}, Y_{ij})$. These intersection points defined by the patch pairs in $S_k$ can be used to estimate the bounding rectangle that locates the corresponding symmetric part. To achieve this goal, we first compute the part center
$$\bar{x}_k = (\bar{X}_k, \bar{Y}_k) = \frac{1}{|S_k|} \sum_{P_{ij} \in S_k} (X_{ij}, Y_{ij}),$$
where $|S_k|$ is the cardinality of $S_k$. We also compute the distances $d_{ij}$, defined in (10), to measure the part elongation along the skeleton $K_k$, which is characterized by the line parameters $(r_k, \theta_k)$. Using (10), the potential outliers in $S_k$ are identified and, to have a better estimation of the symmetry, eliminated from the original $S_k$. We also define the line $K_k^{\perp}$ passing through $\bar{x}_k$ and orthogonal to $K_k$ as
$$K_k^{\perp}:\; x \sin\theta_k - y \cos\theta_k = \bar{X}_k \sin\theta_k - \bar{Y}_k \cos\theta_k .$$
The line $K_k^{\perp}$ divides $S_k$ into two parts according to the side of $K_k^{\perp}$ on which each patch pair lies. The patch pairs $P_b$ and $P_u$ that define the top and bottom boundaries of the bounding box $B_k$ of $S_k$ are the pairs with the largest distance on each side, where the distance function $d$ is defined in (10). The lines $L_b$ and $L_u$ that pass through the centers of $P_b$ and $P_u$ and are parallel to $K_k^{\perp}$ then define the top and bottom boundaries of $B_k$, respectively.
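The box construction for a symmetric part (top and bottom boundaries orthogonal to the skeleton, left and right boundaries parallel to it) can be sketched as below. This is a simplified illustration under stated assumptions: the skeleton is taken in the normal form $x\cos\theta + y\sin\theta = r$, the box is spanned by the extreme member points on each side, and the outlier removal step is omitted. The name `oriented_box` is hypothetical.

```python
import math

def oriented_box(points, r, theta):
    """Corners of the skeleton-aligned box that encloses the given points."""
    c, s = math.cos(theta), math.sin(theta)
    offsets = [x * c + y * s - r for x, y in points]   # signed distance to the skeleton
    along = [-x * s + y * c for x, y in points]         # position along the skeleton
    extremes = [(min(offsets), min(along)), (max(offsets), min(along)),
                (max(offsets), max(along)), (min(offsets), max(along))]
    # Map (offset, position) back to image coordinates.
    return [((r + d) * c - t * s, (r + d) * s + t * c) for d, t in extremes]
```

For a vertical skeleton at $x = 10$ (that is, $r = 10$, $\theta = 0$) and member points (8, 0), (12, 0), (6, 5), (14, 5), the sketch returns the axis-aligned rectangle spanning $x \in [6, 14]$ and $y \in [0, 5]$.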
Similarly, the skeleton $K_k$ of $S_k$ divides the patches in $S_k$ into two parts according to the sign of $x_i \cos\theta_k + y_i \sin\theta_k - r_k$, where $(x_i, y_i)$ is the center of the patch $P_i$ in $S_k$. The patches $P_l$ and $P_r$ that define the left and right boundaries of $B_k$ are the patches on each side that maximize $|x_i \cos\theta_k + y_i \sin\theta_k - r_k|$, which is the distance between $P_i$ and $K_k$. The lines $L_l$ and $L_r$ that pass through the centers of $P_l$ and $P_r$ and are parallel to $K_k$ then define the left and right boundaries of $B_k$, respectively. Figure 6c shows the bounding boxes of the detected symmetries in Figure 6b. The bounding boxes belonging to the same object might heavily overlap with each other. To locate objects in the input image based on the symmetry graph representation, this paper uses the well-known dominant sets algorithm [39] to merge symmetries into objects. We first construct a weighted symmetry graph $G = (S, E)$, where $S$ is the set of detected symmetries (the set of nodes) and $E$ is the set of edges. The weight on an edge between nodes $i$ and $j$ is defined in terms of $B_i$ and $B_j$, the bounding boxes of symmetries $i$ and $j$, respectively. Modeling an object as a dominant set, the graph partitioning algorithm optimally divides the symmetry graph $G$ into multiple sub-graphs, each of which merges its member symmetries into an object [39]. Obviously, the bounding box of an object might contain background information, which would degrade the performance of the resulting object classification. To tackle this difficulty, the graph cut algorithm [24] can be further used to eliminate the irrelevant background in the bounding boxes of the detected objects.
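The merging of overlapping symmetries into objects can be illustrated with a small sketch of dominant set extraction by replicator dynamics, followed by peeling. Two points are assumptions rather than the paper's definitions: boxes are axis-aligned tuples `(x0, y0, x1, y1)`, and the edge weight is taken to be the intersection-over-union of two boxes. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def overlap_graph(boxes):
    """Symmetric weight matrix with zero diagonal (assumed edge weight: IoU)."""
    n = len(boxes)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = iou(boxes[i], boxes[j])
    return A

def dominant_set(A, iters=1000, tol=1e-9, cutoff=1e-4):
    """Support of the replicator-dynamics fixed point reached from the uniform start."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(iters):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(x[i] * Ax[i] for i in range(n))
        if total <= 0.0:
            return [0]            # no overlaps left: the first node is its own group
        new = [x[i] * Ax[i] / total for i in range(n)]
        change = sum(abs(new[i] - x[i]) for i in range(n))
        x = new
        if change < tol:
            break
    return [i for i in range(n) if x[i] > cutoff]

def partition(boxes):
    """Peel dominant sets one at a time until every box is assigned to an object."""
    remaining = list(range(len(boxes)))
    objects = []
    while remaining:
        members = dominant_set(overlap_graph([boxes[i] for i in remaining]))
        objects.append([remaining[i] for i in members])
        remaining = [idx for k, idx in enumerate(remaining) if k not in members]
    return objects
```

Three heavily overlapping boxes and one distant box are partitioned into two objects. An isolated box ends up in a group of its own.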

The Generative Model
We are now ready to describe the probabilistic generative model, which forms the foundation of the object detection and segmentation. The underlying concept behind the graphical model is that, given the set of symmetric patch pairs $M$, we can sample an object patch $\{x_n, f_n\}$. We first condition on $x_n$ and $f_n$ and assume both $P(x_n)$ and $P(f_n)$ to be constant. Then we condition on $P$, so the prior term $P(P)$ is removed. Dividing both sides of (18) by $P(x_n)$, $P(f_n)$, and $P(P)$, we obtain the patch-wise posterior of $B$, $S$, and $M$. Taking the product over the patch-wise posteriors gives the posterior probability to be maximized, in which $P(f_n \mid P)$ is the probability of the patch feature $f_n$ belonging to $P$, and $P(B \mid S)$ represents the probability of the bounding box $B$ given the set of detected symmetries $S$, which is determined by $M$ with the probability $P(S \mid M)$. Finally, $P(M \mid P)$ is the probability of patch pairs that are symmetrical with each other. The goal of our method is to seek the parameters of $B$, $M$, and $S$ that maximize this posterior probability. In principle, achieving this goal requires a pre-learned object model built up with a generic training approach. However, learning a high-precision object model is obviously not a trivial task. Instead of using an object model, the approach uses a greedy method to optimize the posterior. The value of $P(M \mid P)$ in (20) can thus be estimated from the $k$ peaks detected in the Hough voting image $V$, which locate the corresponding salient symmetries $S_i$, $i = 1, \ldots, k$. Applying the inverse Hough voting, we can estimate the value of $P(S_i)$ as the ratio of the number of patch pairs that construct $S_i$ to the size of $M$, i.e., $P(S_i) \approx |S_i| / |M|$, and the value of $P(S_i \mid M)$ by the voting value of the $i$-th peak in $V$. Finally, the dominant sets algorithm and the graph cut segmentation are used to optimize the remaining terms, including $P(B \mid S)$.

The Application to Human Activity Recognition
One obvious deficiency of unsupervised object detection and segmentation is that the detected objects lack semantics. To tackle this difficulty when constructing a real-world application, the object semantics can be augmented by a model. We use poselet models [22], shown in Figure 7, to explore how much the quality-conscious object detection and segmentation improves the performance of human activity recognition.
To train a human activity classifier using SVMs, a dataset of pairs $(V_i, y_i)$ is collected, where $V_i$ is a video sample and $y_i$ is the label of $V_i$. To build the multi-class activity model based on the symmetry-based object detection, we first perform a generic key frame detection [41] on the input video to obtain a compact video representation. Next, the proposed object detector divides every frame into multiple objects, among which the human objects are identified by a fast face detection algorithm [42]. The detected human objects in key frames are then divided into $J$ poselets, which localize discriminative parts of the body and have proven to be effective for human activity recognition [22]. Inspired by the work of [22] and based on a few weak annotations on a sparse set of frames, shown in Figure 8, two types of poselet features, the HOG descriptors and the BoW features, are used for training the poselet detector. The BoW features, which are quantized dense descriptors (SIFT [43], histogram of optical flow (HOF) [44], and motion boundaries (HoMB) [45]), are used to augment the HOG descriptors for capturing the motion information of poselets. In this paper, the background information is removed from the poselets by the segmentation scheme, which, in turn, improves both the quality of the poselet models in the learning phase and the recognition accuracy in the testing phase. In the training phase, the annotated training samples are used to learn poselet-specific HOG and BoW templates. In the testing phase, these poselet templates are used to locate the poselets in the human objects of a frame. For each video frame, we collect the highest scores from both the HOG-based and BoW-based poselet templates by performing the branch-and-bound technique [46] on the detected human objects, and represent the frame as a poselet activation vector [47]. Our feature representation thus describes a video as one HOG-based feature sequence and three BoW-based feature sequences. Finally, for each poselet model, an SVM classifier with a multi-channel string kernel [48] is trained to form a part-based weak classifier. The multi-channel string kernel is defined as
$$K(F, F') = \exp\Big(-\sum_{i} \frac{1}{A_i}\, D(F_i, F'_i)\Big),$$
where $F$ and $F'$ are two multi-channel histogram-based feature sequences; $(F_i, F'_i)$ are the $i$-th channel feature sequences of $(F, F')$; $D(F_i, F'_i)$ is the distance between $F_i$ and $F'_i$ computed using dynamic programming; and $A_i$ is the average of the dynamic-programming distances using the $i$-th channel features of the training samples. These poselet SVMs are then bootstrapped to constitute an ensemble classifier for human activity recognition. The rule to classify the input video clip $V$, which is represented by $k$ key frames, is thus to select the activity class $a \in A$ that maximizes the weighted vote $\sum_{j} \alpha_j\, \delta(s_j(V) = a)$, where $A$ is the set of activity classes and $s_j$ is the $j$-th poselet SVM classifier with the weighting factor $\alpha_j$, which is proportional to the accuracy of activity recognition using $s_j$. The weights $\alpha_j$ are determined in the training phase.
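The kernel and the ensemble rule can be sketched as follows. Several choices here are assumptions made only for illustration: the dynamic-programming distance is implemented as a dynamic-time-warping alignment, the per-frame cost is a chi-square distance between histograms, and each weak classifier is represented simply by the label it predicts. All function names are hypothetical.

```python
import math

def chi2(p, q, eps=1e-12):
    """Chi-square distance between two histograms (assumed per-frame cost)."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))

def dp_distance(seq_a, seq_b):
    """Alignment cost between two histogram sequences via dynamic programming (DTW)."""
    n, m = len(seq_a), len(seq_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = chi2(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def multichannel_kernel(F, G, A):
    """exp(-sum_i D(F_i, G_i) / A_i), with one sequence per channel i."""
    total = sum(dp_distance(f, g) / a for f, g, a in zip(F, G, A))
    return math.exp(-total)

def ensemble_predict(predicted_labels, alphas, classes):
    """Weighted vote: the class with the largest sum of alpha_j over classifiers voting for it."""
    scores = {a: 0.0 for a in classes}
    for label, alpha in zip(predicted_labels, alphas):
        scores[label] += alpha
    return max(classes, key=lambda a: scores[a])
```

Normalizing each channel by its average distance $A_i$ keeps channels with different dynamic ranges from dominating the exponent.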

Experimental Results
A series of experiments was conducted on an Intel Core i7 3.0 GHz PC. Three datasets, the INRIA dataset [49], the PASCAL VOC 2012 dataset [50], and the UT-Interaction dataset [51], are used to evaluate the performance of the human object detection and activity recognition system. The INRIA dataset has been used in many static person detection studies. Its annotated training set includes 614 positive samples and 1218 negative samples. Multiple poses are included in both the training and testing sets, and many different natural scenes are used to construct the set of negative examples. The size of an image in the INRIA dataset is 64 × 128. The PASCAL VOC 2012 dataset contains 20 object classes, with all images taken from natural scenes. The train and validation set has 11,530 images containing 27,450 region of interest (ROI) annotated objects and 6929 segmentations. Among them, the person class has 632 images. The UT-Interaction dataset contains 20 videos of continuous executions of six classes of human-human interactions: hand shaking, pointing, hugging, pushing, kicking, and punching. Ground truth labels for these interactions are provided, including time intervals and bounding boxes. Every video sequence is taken at a resolution of 720 × 480 and 30 fps, and the height of a person in the video is about 200 pixels. The video sequences are around one minute long. Each video contains at least one execution per interaction, providing eight executions of human activities per video on average. Several participants with more than 15 different clothing conditions appear in the videos. Furthermore, the dataset is divided into two sets. Set 1 is composed of 10 video sequences taken in a parking lot. The videos of set 1 are taken with slightly different zoom rates, and their backgrounds are mostly static with little camera jitter. Set 2 (i.e., the other 10 sequences) is taken on a lawn on a windy day. The background moves slightly (e.g., trees swaying), and the videos contain more camera jitter. Each set has a different background, scale, and illumination. Figure 9 shows several images from these three datasets.
First of all, to clarify the differences between the proposed unsupervised object segmentation method and the standard image segmentation method, we implement the graph cut algorithm [24], which segments images into regions. Notice that the segmentation results of both the proposed and graph cut algorithms contain multiple objects in an image. However, the latter does not group regions into objects. In contrast, our method opens a new way to group detected symmetries into objects using a symmetry graph partition algorithm. The contour of the target object can also be obtained by intersecting the detected object with the segmented regions. Thus, the proposed method solves the problem of image segmentation in object segmentation. Incorporating the segmentation results of the graph cut algorithm into the object detection approach, Figures 10-12 show examples of the object detection and segmentation using the three datasets. Compared with the regions with CNN features (R-CNN) method [32], the detection quality of the proposed method, judged subjectively, is comparable. Note that R-CNN trains high-capacity convolutional neural networks (CNNs) in advance and applies them to bottom-up region proposals in order to localize and segment objects. Accordingly, symmetry detection facilitates effective object detection and segmentation without object models. The class labels, provided as ground truth for images in the test datasets, are used to determine the accuracy of human object detection. In the proposed approach, the problem of human object detection is tackled by automatically locating objects with facial parts in the detected object set of an image. That is, we do not need to construct a person classifier, which is necessary for many existing person detectors. To test the effectiveness of the person detector, classification results are shown in Table 1 for the proposed and compared state-of-the-art recognition systems [1,53-59]. The proposed approach outperforms the compared methods since symmetric properties are salient features in person objects.
We follow the same localization evaluation rule as in [22]: a detection is considered correct if (1) the poselets in a human object are correctly classified, and (2) the intersection-union ratio of the detection and the ground truth bounding box is not less than a threshold θ. For the UT-Interaction dataset, selected frames were hand-annotated with bounding boxes, and the bounding boxes for the frames in between were generated by linear interpolation. Table 2 shows the performance comparison in poselet localization accuracy using the UT-Interaction dataset. The proposed method achieves a better result than [22] because the features used in constructing our poselet detectors come from more accurate human object detection and segmentation. Thus, our final features contain less irrelevant background in representing the corresponding poselets. Moreover, we detect poselets from the human objects detected in the previous step of the approach. Consequently, this increases the robustness of the poselet detection. Figure 13 also shows examples of poselet detection using the UT-Interaction dataset.

Table 1. Classification results of the compared person detectors.
Method                Result
...                   61.5%
CVC_CLS [54]          42.3%
LARSVM-V2 [1]         77.3%
NEC [55]              32.8%
MULTIFER+CSS [56]     75.0%
SYSU_DYNAMIC [57]     37.5%
FEATSNTH [58]         69.0%
OXFORD [59]           46.1%

Table 2. Poselet localization accuracy on the UT-Interaction dataset.
Threshold     Proposed     Raptis et al. [22]
...           ...          86.7%
θ = 0.50      100%         86.7%
θ = 0.75      85.4%        83.3%
θ = 1         81.3%        80.0%

Evaluations of our approach in human activity recognition are carried out with a leave-one-out cross-validation method. Classification results are shown in Table 3 and compared with state-of-the-art recognition systems [22,60-68]. According to Table 3, the proposed method improves the classification accuracy. Note that both the poselet models and the feature setting used to describe poselets in the approach of Raptis et al. [22] are adopted in our human activity recognition. However, the proposed approach has better performance in terms of classification accuracy. This is because the detected poselets are more accurate than those of [22]. Figure 14 shows the confusion matrices on the UT-Interaction dataset for the proposed method and the method by Raptis et al. Both matrices show similar confusion patterns. This shows that poselet models are effective in human activity recognition. The detection of symmetries is not always accurate in the class "Kick" because the symmetries that constitute the poselets in this class often occlude each other. This degrades the accuracy of recognizing "Kick" activities.
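The linear interpolation used to fill in bounding boxes between hand-annotated frames can be sketched as below. The box format `(x0, y0, x1, y1)` and the name `interpolate_boxes` are assumptions for illustration only.

```python
def interpolate_boxes(frame_a, box_a, frame_b, box_b):
    """Boxes for the frames strictly between two annotated frames, by linear interpolation."""
    span = frame_b - frame_a
    boxes = {}
    for frame in range(frame_a + 1, frame_b):
        w = (frame - frame_a) / span            # 0 at frame_a, 1 at frame_b
        boxes[frame] = tuple((1 - w) * a + w * b for a, b in zip(box_a, box_b))
    return boxes
```

For annotations at frames 0 and 4, the box at frame 2 is the coordinate-wise midpoint of the two annotated boxes.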

Conclusions
In this paper, we have presented an approach for unsupervised object detection and segmentation based on the fusion of symmetry detection, dominant sets clustering, and image segmentation. Using the object detection and segmentation as a preprocessing step, we have also presented a systematic way to construct a bank of poselet SVM classifiers for human activity recognition. The proposed activity recognition modeling encodes every video as a set of multi-channel histogram-based feature sequences. Multi-channel string kernels are thus introduced to improve the recognition accuracy of the weak classifiers built on individual poselet models. For each class, a set of training videos is also used to train an ensemble classifier, which verifies the correctness of the candidate detected human activities at testing time.
Compared with related human object detection and activity recognition methods, the main contribution of the proposed method is that it formulates object detection as a problem of symmetry detection. The dynamic programming process not only models the activities in training videos as multi-channel poselet feature sequences but can also be used to detect and recognize human objects in an input video clip automatically. Our system can detect multiple human objects in a video clip. Experimental results show that the proposed method performs well on several publicly available datasets in terms of detection accuracy and recognition rate.
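The dynamic programming step is not specified in this section. As an illustration only, the sketch below compares two multi-channel histogram sequences with dynamic time warping, one common dynamic programming alignment. The chi-square frame cost, the summation over channels, and the names `chi2`, `dtw`, and `multi_channel_distance` are assumptions and do not reproduce the string kernels used in the paper.

```python
from typing import List, Sequence

Histogram = Sequence[float]
Channel = List[Histogram]      # one histogram per frame
MultiChannel = List[Channel]   # one channel per poselet (assumed layout)


def chi2(p: Histogram, q: Histogram, eps: float = 1e-12) -> float:
    """Chi-square distance between two histograms of equal length."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))


def dtw(seq_a: Channel, seq_b: Channel) -> float:
    """Dynamic time warping cost between two histogram sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i and first j frames
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = chi2(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat frame j of seq_b
                                 cost[i][j - 1],      # repeat frame i of seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def multi_channel_distance(a: MultiChannel, b: MultiChannel) -> float:
    """Sum of per-channel alignment costs (channels are paired by index)."""
    return sum(dtw(ca, cb) for ca, cb in zip(a, b))
```

Because the alignment may repeat frames, two clips that show the same poselet sequence at different speeds obtain a low cost, which is the property a sequence comparison for activities of varying duration needs.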
The proposed method, however, has a limitation: the computational complexity of class-specific model matching through dynamic programming and Hough voting is high. Future work will focus on implementing the system on parallel architectures, e.g., GPU servers and cloud computing platforms.

Figure 1. An example to illustrate the generative model: (a) the original image; (b) two bounding boxes locating the target objects, i.e., a person and a bottle; (c) the segmentation results obtained with the graph cut segmentation algorithm [24]; (d) graphical representation of the generative model used in our method.

Figure 2. The overall procedure for object detection and segmentation: (a) the input image is first partitioned into multiple patches; (b) the set of candidate symmetric patch pairs generated by matching the patches in (a) with each other; (c) the Hough voting image generated from the patch pairs in (b), whose peaks locate the salient symmetric axes and parts shown in (d); (e) the detected symmetries are then used to estimate the bounding box of the target object; the sub-image constrained by the bounding box is segmented to obtain the segmentation mask and the result, shown in (f) and (g), respectively.

Figure 3. Using line edges to approximate the contour of an object.

Figure 4. Determining the skeleton of a symmetric part using pairwise symmetric patches {P_i, P_j}: (a) the skeleton K of the corresponding symmetric part determined by the pairwise patches; (b) the skeleton K characterized by two parameters (r, θ).

Figure 5. An example to illustrate the Hough voting framework for symmetry detection: (a) the original image; (b) the intersection points of symmetric patch pairs; (c) the Hough voting image; and (d) the peak detection and inverse Hough voting to compute the skeletons and symmetries.
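The voting idea illustrated in Figures 4 and 5 can be sketched as follows. This is a hedged illustration under stated assumptions: each symmetric patch pair is taken to vote for the perpendicular bisector of the segment joining its two patch centers, and the axis is written in the normal form x cos θ + y sin θ = r. The bin sizes and the names `axis_params`, `hough_vote`, and `strongest_axis` are hypothetical, and the voting scheme of the paper may differ.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

Point = Tuple[float, float]


def axis_params(p: Point, q: Point) -> Tuple[float, float]:
    """(r, theta) of the perpendicular bisector of segment pq, written as
    x*cos(theta) + y*sin(theta) = r with theta in [0, pi).
    The caller should skip degenerate pairs with p == q."""
    mx, my = (p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0
    # The axis normal points along q - p; fold its angle into [0, pi).
    theta = math.atan2(q[1] - p[1], q[0] - p[0])
    if theta < 0:
        theta += math.pi
    if theta >= math.pi:
        theta -= math.pi
    r = mx * math.cos(theta) + my * math.sin(theta)
    return r, theta


def hough_vote(pairs: Iterable[Tuple[Point, Point]],
               r_step: float = 2.0, theta_bins: int = 36) -> Counter:
    """Accumulate one vote per patch pair in a quantized (r, theta) space."""
    votes: Counter = Counter()
    for p, q in pairs:
        r, theta = axis_params(p, q)
        t_bin = int(theta / math.pi * theta_bins) % theta_bins
        r_bin = math.floor(r / r_step)
        votes[(r_bin, t_bin)] += 1
    return votes


def strongest_axis(votes: Counter) -> Tuple[Tuple[int, int], int]:
    """Return the (r_bin, theta_bin) cell with the most votes and its count."""
    return votes.most_common(1)[0]
```

Pairs that are mirror images about the same axis fall into the same accumulator cell, so peaks of the accumulator indicate salient symmetry axes, and the pairs that voted for a peak can be traced back to recover the corresponding symmetric part.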

Figure 6. Object detection with the symmetry graph representation: (a) the original image; (b) the detected skeletons and symmetries; (c) the estimated bounding boxes of the symmetries in (b); (d) the symmetry graph; (e) graph partitioning by the dominant sets algorithm [28]; (f) the merged bounding box for the person object; (g) the contour of the object using the graphcut segmentation algorithm [24] on (f); and (h) the segmentation result.
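Steps (d) to (f) of Figure 6 can be sketched as follows. The sketch extracts one dominant set from a symmetric, non-negative affinity matrix with replicator dynamics, a commonly used solver for dominant sets, and then merges the bounding boxes of the selected symmetries. The affinity values, the support threshold, and the names `dominant_set` and `merge_boxes` are assumptions, since this section does not state how the paper defines the graph weights.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def dominant_set(A: Sequence[Sequence[float]], iters: int = 1000,
                 tol: float = 1e-9, support_eps: float = 1e-4) -> List[int]:
    """Indices of one dominant set of the graph with affinity matrix A.
    A is assumed symmetric, non-negative, with a zero diagonal."""
    n = len(A)
    x = [1.0 / n] * n  # start from the barycenter of the simplex
    for _ in range(iters):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = sum(x[i] * Ax[i] for i in range(n))
        if denom <= 0:
            break
        # Replicator update: x_i <- x_i * (A x)_i / (x^T A x)
        new_x = [x[i] * Ax[i] / denom for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    # The dominant set is the support of the converged weight vector.
    return [i for i in range(n) if x[i] > support_eps]


def merge_boxes(boxes: Sequence[Box]) -> Box:
    """Smallest axis-aligned box that contains all given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))
```

Given the boxes of all detected symmetries, `merge_boxes([boxes[i] for i in dominant_set(A)])` yields a candidate object box that can then be passed to the segmentation step.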
[…] the patch center is x_n and the patch feature is f_n. The graphical model shown in Figure 1d gives the joint distribution for a patch as a product of conditional distributions […] each of the distribution terms in (20) in detail. P(x_n | B, P) is the probability of the pixel location x_n given the bounding box B and the set of sampled patches P. The function of this term is to select patches belonging to P and constrained by B. Similarly, P(… | …) […] the size of M. This value depends on the number of patch clusters. The probability P(S | M) can be further decomposed into […]

Figure 7. System flowchart of the application to human activity recognition.

Figure 9. Example images of the datasets: (a) the positive and negative training samples of the INRIA dataset; (b) examples of the "person" class of the PASCAL VOC 2012 dataset; (c) sample interactions in the UT-Interaction dataset.

Figure 13. Examples of poselet detection using the UT-Interaction dataset: (a) a "Kick" activity; (b) a "Push" activity. The bounding boxes locate the detected poselets in individual frames.

Table 1. Performance comparison in person detection on the INRIA and PASCAL VOC 2012 datasets.

Table 2. Performance comparison in poselet detection on the UT-Interaction dataset. A detection is considered correct if (1) the poselets in a human object are correctly classified, and (2) the intersection-over-union ratio of the detected and ground-truth bounding boxes is not less than a threshold θ.

Table 3. Comparison of UT-Interaction classification with other methods. "-" indicates that the data is not provided in the original papers.