Article

Action Recognition via Adaptive Semi-Supervised Feature Analysis

Zengmin Xu, Xiangli Li, Jiaofen Li, Huafeng Chen and Ruimin Hu

1 School of Mathematics and Computing Science, Guangxi Colleges and Universities Key Laboratory of Data Analysis and Computation, Guilin University of Electronic Technology, Guilin 541004, China
2 Center for Applied Mathematics, Guangxi (GUET), Guilin 541004, China
3 Anview.ai, Guilin 541010, China
4 School of Computer Engineering, Jingchu University of Technology, Jingmen 448000, China
5 National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, China
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2023, 13(13), 7684; https://doi.org/10.3390/app13137684
Submission received: 21 May 2023 / Revised: 12 June 2023 / Accepted: 16 June 2023 / Published: 29 June 2023
(This article belongs to the Special Issue Recent Advances in Image Processing)

Abstract: This study presents a new semi-supervised action recognition method based on adaptive feature analysis. We assume that action videos can be regarded as data points in an embedded manifold subspace, so their matching problem can be quantified through a specific Grassmannian kernel function while integrating feature correlation exploration and data similarity measurement into a joint framework. By maximizing intra-class compactness based on labeled data, our algorithm learns from multiple features and leverages unlabeled data to enhance recognition. We introduce Grassmannian kernels and the projected Barzilai–Borwein (PBB) method to train a subspace projection matrix as a classifier. Experimental results show that our method outperforms the compared approaches when only a few labeled training samples are available.

1. Introduction

Effective feature representation is key to image processing [1,2,3] and video understanding [4,5,6]. Spatio-temporal features [7,8], subspace features [9,10], and label information [11,12] have all been investigated for action recognition. Nevertheless, as Figure 1 shows, video understanding has evolved significantly through new datasets and approaches. Activity scenarios have moved from simple sports, isolated movie clips, and conventional surveillance to cluttered home sequences, egocentric kitchen interactions, real-world anomalous events, part-level action parsing, dark environments, and complex surveillance videos. Given the varied views, illumination, poses, and outdoor conditions of activities, and a feature-space data distribution that remains uncertain, how do we discover the underlying embedded subspace for different types of features, and what are the boundaries of action clips?
On the other hand, large-scale video collections are constantly emerging; many segments therefore need labeling, which demands considerable human labor. Moreover, normal behaviors far outnumber anomalous events. It is thus important to measure data similarity by sample matching with distance metric learning. Notably, some segments in untrimmed videos may fall outside the specified categories [13], and new sequences captured in dark environments may lack annotations altogether [14]. Therefore, to solve the point-matching problem in a semi-supervised manner, we discuss how to convert the video-set matching problem into a data distance measurement problem on the manifold subspace.
Correlations between multiple features may provide distinctive information; hence, feature correlation mining has been explored to improve recognition when labeled data are scarce [10,15]. However, these approaches have limitations in learning discriminative features. First, although existing algorithms evaluate the common structures shared among different actions, they do not take inter-class separability into account. Second, current semi-supervised approaches solve the non-convex optimization problem through intricate derivations, but the global optimum may not be attainable mathematically with the alternating least-squares (ALS) iterative method.
Figure 1. Sample frames from (a) home activities (Charades [16]), (b) real-world anomalies (UCF-Crime [13]), (c) dark environments (ARID [14]), (d) egocentric interactions (EPIC-KITCHENS-100 [17]), (e) part-level actions (Kinetics-TPS [18]), and (f) fight scenarios (CSCD [19]). These datasets span a large gap from target-oriented to diversity-oriented collections. (a) Charades depicts cluttered home actions from multimedia, (b) UCF-Crime shows real-world events containing anomalous and normal segments in untrimmed videos, (c) ARID aims to recognize actions in low illumination through semi-supervised methods, (d) EPIC-KITCHENS-100 consists of daily kitchen activities from first-person videos, (e) Kinetics-TPS develops a large-scale kinetics-temporal part state for encoding the composition of body parts, and (f) CSCD collects fight and no-fight scenarios from surveillance cameras.

2. Motivation and Contributions

To overcome the limitations of training with multiple features, we propose modeling intra-class compactness and inter-class separability simultaneously, then capturing high-level semantic patterns via multiple-feature analysis. For the optimization process, we introduce the PBB algorithm because of its effectiveness in obtaining an optimal solution [20]. The PBB method is a non-monotone line-search technique for minimizing differentiable functions over closed convex sets [21].
Inspired by research using multiple features [11,15], we extend our framework in a multiple-feature-based manner to improve recognition. We propose characterizing high-level semantic patterns through low-level action features using multiple-feature analysis. Multiple features are extracted from different views of labeled and unlabeled action videos. Based on the constructed graph model, pseudo-information for unlabeled videos can be generated by label propagation and feature correlations. For each type of feature, nearby samples preserve consistency separately, while label prediction for unlabeled training data jointly enforces global consistency across multiple features. In this way, an adaptive semi-supervised action classifier is trained. The main contributions can be summarized as follows:
(1) This work is the first to simultaneously consider manifold learning and Grassmannian kernels in semi-supervised action recognition, based on the assumption that action video samples lie in a Grassmannian manifold space. By modeling an embedded manifold subspace, both inter-class separability and intra-class compactness are considered.
(2) To solve the unconstrained minimization problem, we incorporate the PBB method to avoid matrix inversion and apply a globalization strategy with adaptive step sizes that renders the objective function non-monotonic, leading to improved convergence and accuracy.
(3) Extensive experiments verify that our method outperforms competing approaches on three benchmarks in a semi-supervised setting. We believe this study offers valuable insights into adaptive feature analysis for semi-supervised action recognition.

3. Related Work

We review the related research on semi-supervised action recognition, multiple-feature analysis, and embedded subspace representation in this section.

3.1. Semi-Supervised Action Recognition

Unlabeled samples are valuable for learning data correlations in a semi-supervised manner [9,10,12,22]. Although such methods tend to achieve remarkable performance with limited labeled data, many issues remain, such as inherent multi-modal attributes leading to local optima, or unconvincing pseudo-labels leading to inaccurate predictions [23,24].
Si et al. [25] tackle the challenge of semi-supervised 3D action recognition by effectively learning motion representations from unlabeled data. Singh et al. [6] maximize the similarity of the same video played at two different speeds and recognize actions by training a two-pathway temporal contrastive model. Kumar and Rawat [26] develop a spatio-temporal consistency-based approach with two regularization constraints, temporal coherency and gradient smoothness, which detects video actions in an end-to-end semi-supervised manner.

3.2. Multiple-Feature Analysis

Because an object can be described by different features providing complementary discriminative information, multiple-feature analysis has attracted increasing interest in many applications. Beyond early- and late-fusion strategies, multi-stage fusion schemes have recently been investigated [10,27,28,29]. However, most late-fusion approaches do not consider the correlations within each feature type.
Wang et al. [10] apply shared structural analysis to characterize discriminative information and preserve data distribution information from each type of feature. Chang and Yang [15] discover shared knowledge from related multi-tasks, take various correlations into account, then select features in a batch mode. Huynh-The et al. [30] capture multiple high-level features at image-based representation by fine-tuning a pre-trained network, transfer the skeleton pose to encoded information, and depict an action through spatial joint correlations and temporal pose dynamics.

3.3. Embedded Subspace Representation

Previous studies have shown that manifold subspace learning can mine geometric structure information by considering the space of probabilities as a manifold [31,32,33]. Recent research focuses on graph-embedded subspace or distance metric learning to measure activity similarity [34,35,36,37,38].
Rahimi et al. [39] build neighborhood graphs with geodesic distance instead of Euclidean distance, and project high-dimensional action to low-dimensional space by kernelized Grassmann manifold learning. Yu et al. [40] propose an action-matching network to recognize open-set actions, construct an action dictionary, and classify an action via the distance metric. Peng et al. [41] alleviate the over-smoothing issue of graph representation when multiple GCN layers are stacked by the flexible graph deconvolution technique.
Two aforementioned studies [9,10] are similar to ours. They assume that the visual words in different actions share a common structure in a specific subspace, and they introduce a transformation matrix to characterize the shared structures. They solve the constrained non-convex optimization problem through an ALS-like iterative approach and matrix derivation. Nevertheless, the deduced inverse matrix can become poorly scaled or nearly singular during optimization, which may lead to inaccurate results.
To address these problems, we hypothesize that manifold mapping can preserve local geometry and maximize discriminative power, as shown in Figure 2. However, we do not aim to mine shared structures; therefore, we omit shared-structure regularization and model the manifold by creating two graphs. As the optimization solutions in [9,10] may be mathematically imprecise, Karush–Kuhn–Tucker (KKT) conditions and PBB are introduced to improve convergence and avoid matrix inversion.
Different from related research [12], this work makes two major modifications: multiple-feature analysis with combined Grassmannian kernels, and a non-monotone line-search strategy with adaptive step sizes.

4. Proposed Approach

Our approach incorporates several techniques, including semi-supervised action recognition, multiple-feature analysis, PBB, KKT conditions, manifold learning, and Grassmannian kernels. We name it kernel Grassmann manifold analysis (KGMA).

4.1. Formulation

To leverage multiple-feature correlations, $n$ training sample points $X = [X_1, \ldots, X_n] \in \mathbb{R}^{d \times n}$ are defined on the underlying Grassmannian manifold, where $X_i \in \mathbb{R}^{d \times 1}$. We aim to uncover a new manifold while preserving the local geometry of the data points, that is, $\alpha : X_i \mapsto F_i$. To describe the data distribution on the manifold, a predicted label matrix $F = [F_1, \ldots, F_n] \in \mathbb{R}^{n \times n}$ is defined, where $F_i \in \mathbb{R}^{n \times 1}$ is the predicted vector of the $i$-th datum $X_i \in X$.

We assume that a similarity measurement between data points in the manifold subspace is available through a Grassmannian kernel [31], $k_{i,j} = \langle X_i, X_j \rangle$. By confining the solution to a linear function, that is, $\alpha_i = \sum_{j=1}^{n} a_{ij} X_j$, we define the prediction function $f$ as $f(X_i) = F_i = (\langle \alpha_1, X_i \rangle, \langle \alpha_2, X_i \rangle, \ldots, \langle \alpha_r, X_i \rangle)^T$. Denoting $A_l = (a_{l1}, \ldots, a_{ln})^T$ and $K_i = (k_{i1}, \ldots, k_{in})^T$, it can be shown that $\langle \alpha_l, X_i \rangle = A_l^T K_i$, and thus $f(X) = F = A^T K \approx Y$, where $A = [A_1 | A_2 | \ldots | A_r]$ and $K = [K_1 | K_2 | \ldots | K_n]$. As mentioned in [42], the least-squares loss performs comparably to the hinge or logistic loss. The diagonal label matrix is $Y = [Y_1, \ldots, Y_n] \in \{0, 1\}^{n \times n}$, where $Y_i \in \{0, 1\}^{n \times 1}$ is the label vector. We employ least-squares regression to solve the following optimization problem and obtain the projection matrix $A$:

$$\min_{A} \; \| A^T K - Y \|_F^2 + \eta \| A^T \|_F^2, \tag{1}$$

where $\eta$ is a regularization parameter, $\| \cdot \|_F$ denotes the Frobenius norm, and $\eta \| A^T \|_F^2$ controls the model complexity to prevent overfitting.
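As a concrete illustration of Equation (1), the NumPy sketch below solves this regularized least-squares problem in closed form, $A = (KK^T + \eta I)^{-1} K Y^T$, which follows from setting the gradient to zero; the kernel and label matrices here are random placeholders. Note that KGMA itself deliberately avoids this explicit matrix inversion by using the PBB method of Section 4.6.

```python
import numpy as np

def solve_projection(K, Y, eta=1e-2):
    """Closed-form minimizer of ||A^T K - Y||_F^2 + eta * ||A^T||_F^2.

    Setting the gradient 2(K K^T A - K Y^T + eta A) to zero yields
    A = (K K^T + eta I)^{-1} K Y^T (for illustration only; KGMA uses PBB).
    """
    n = K.shape[0]
    return np.linalg.solve(K @ K.T + eta * np.eye(n), K @ Y.T)

# Toy example: a symmetric PSD stand-in for the Grassmannian kernel matrix
# and a diagonal 0/1 label matrix, as defined above.
rng = np.random.default_rng(0)
n = 8
B = rng.standard_normal((n, n))
K = B @ B.T
Y = np.diag((rng.random(n) > 0.5).astype(float))
A = solve_projection(K, Y)
loss = np.linalg.norm(A.T @ K - Y, "fro") ** 2 + 1e-2 * np.linalg.norm(A.T, "fro") ** 2
print(A.shape, round(loss, 6))
```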

4.2. Manifold Learning

In contrast to [10], which utilizes a graph model to estimate the data distribution on the manifold, we model the local geometric structure by generating a between-class similarity graph $G_b$ and a within-class similarity graph $G_w$, where $G_w(i,j) = 1$ if $x_i \in N_w(x_j)$ or $x_j \in N_w(x_i)$, and $G_w(i,j) = 0$ otherwise. $G_b(i,j)$ is built in the same way, except that it requires $x_i \in N_b(x_j)$ or $x_j \in N_b(x_i)$. Here, $N_b(x_i)$ contains the neighbors of $x_i$ with different labels, and $N_w(x_j)$ is the set of neighbors of $x_j$ sharing the same label. Notably, intra-class and inter-class distances can be mapped onto a manifold through such similarity graphs [33].
Inspired by manifold learning [12,31,33], we maximize inter-class separability and intra-class compactness simultaneously. An ideal transform pushes the points connected in $G_b$ as far apart as possible while pulling the points connected in $G_w$ closer together. The discriminative information can be represented as follows:

$$f = \frac{1}{2} \sum_{i,j=1}^{n} (F_i - F_j)^2 \, G_w(i,j) - \frac{\beta}{2} \sum_{i,j=1}^{n} (F_i - F_j)^2 \, G_b(i,j) = \mathrm{tr}\big( F^T (L_w - \beta L_b) F \big), \tag{2}$$

where $\beta$ is a regularization parameter that controls the trade-off between inter-class separability and intra-class compactness, $\mathrm{tr}(\cdot)$ denotes the trace operator, and $L_w = D_w - G_w$ and $L_b = D_b - G_b$ are Laplacian matrices. Here, $D_w$ and $D_b$ are diagonal matrices with $D_w(i,i) = \sum_{j=1}^{n} G_w(i,j)$ and $D_b(i,i) = \sum_{j=1}^{n} G_b(i,j)$.
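The graph construction above can be sketched in a few lines of NumPy; the snippet below builds $G_w$ and $G_b$ with Euclidean k-nearest neighbors (the neighborhood rule and k are our placeholder choices) and forms the Laplacians used in Equation (2).

```python
import numpy as np

def class_graphs(X, labels, k=5):
    """Within-class (G_w) and between-class (G_b) k-NN similarity graphs.

    G_w(i, j) = 1 if x_i and x_j are neighbors sharing a label;
    G_b(i, j) = 1 if they are neighbors with different labels.
    """
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    Gw, Gb = np.zeros((n, n)), np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D[i])[1:]                     # skip the point itself
        same = [j for j in order if labels[j] == labels[i]][:k]
        diff = [j for j in order if labels[j] != labels[i]][:k]
        Gw[i, same], Gb[i, diff] = 1, 1
    Gw, Gb = np.maximum(Gw, Gw.T), np.maximum(Gb, Gb.T)  # edge if either endpoint selects it
    Lw = np.diag(Gw.sum(1)) - Gw                         # L_w = D_w - G_w
    Lb = np.diag(Gb.sum(1)) - Gb                         # L_b = D_b - G_b
    return Lw, Lb

def discriminative_score(F, Lw, Lb, beta=0.1):
    """tr(F^T (L_w - beta * L_b) F), the right-hand side of Equation (2)."""
    return np.trace(F.T @ (Lw - beta * Lb) @ F)

# Toy usage: 6 points in R^2 forming two tight clusters with two labels.
X = np.array([[0, 0], [0.1, 0], [0.2, 0], [5, 5], [5.1, 5], [5.2, 5]])
labels = np.array([0, 0, 0, 1, 1, 1])
Lw, Lb = class_graphs(X, labels, k=2)
```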

4.3. Multiple-Feature Analysis

Multiple-feature analysis here means combining kernelized embedding features, namely data-point manifold subspace learning (the first term in Equation (4)) and label propagation (the second term) with low-level feature correlations (the third term), for both labeled and unlabeled data.
We modify the aforementioned function to leverage both labeled and unlabeled samples. First, the training dataset is redefined as $X = [X_l^T, X_u^T]^T$, where $X_l = [X_1, \ldots, X_m]^T$ is the labeled subset and $X_u = [X_{m+1}, \ldots, X_n]^T$ is the unlabeled subset. The label matrix is $Y = [Y_l^T, Y_u^T]^T$, where $Y_l = [Y_1, \ldots, Y_m]^T \in \{1\}^{m \times m}$ and the unlabeled part is $Y_u = [Y_{m+1}, \ldots, Y_n]^T \in \{0\}^{(n-m) \times (n-m)}$. According to [9,43], the diagonal label matrix $Y$ and the similarity graphs $G_w$, $G_b$ should be consistent with the label prediction matrix $F$. We generalize the graph-embedded label consistency as follows:

$$\min_{F} \; \mathrm{tr}\big( F^T (L_w - \beta L_b) F \big) + \| F - Y \|_F^2. \tag{3}$$
In contrast to previous shared-structure learning algorithms, we do not consider shared-structure learning within a semi-supervised framework. Instead, we propose a novel joint framework that incorporates multiple-feature analyses of multiple manifolds. As discussed in the formulation section, by employing the Frobenius-norm-regularized loss function, we can reformulate the objective as

$$\min_{F, A} \; \mathrm{tr}\big( F^T (L_w - \beta L_b) F \big) + \| F - Y \|_F^2 + \mu \big( \| A^T K - Y \|_F^2 + \eta \| A^T \|_F^2 \big), \tag{4}$$

where $\beta > 0$, $\mu > 0$, and $\eta > 0$ are regularization parameters.
Equation (4) is an unconstrained convex optimization problem; hence, the global optimum can be obtained by ALS or a projected gradient method. Although the correlation matrix can be singular only under specific circumstances, the projected gradient method handles such issues without matrix inversion [20] and therefore reaches a better optimum than ALS. Notably, the convergence conditions in [9,10] depend merely on a monotone decrease, which may result in mathematically improper convergence; therefore, KKT conditions are utilized to address this problem.

4.4. Grassmannian Kernels

The similarity between two action sample points $X_i, X_j \in \mathbb{R}^{d \times 1}$ can be measured by the projection kernel:

$$k_{i,j}^{[\mathrm{proj}]} = \| X_i^T X_j \|_F^2. \tag{5}$$
One approach to the point-matching problem uses the notion of principal angles [31]. Given $X_i$ and $X_j$, we can define the canonical correlation kernel as

$$k_{i,j}^{[\mathrm{cc}]} = \max_{a_p \in \mathrm{span}(X_i)} \; \max_{b_q \in \mathrm{span}(X_j)} a_p^T b_q, \tag{6}$$

subject to $a_p^T a_p = b_p^T b_p = 1$ and $a_p^T a_q = b_p^T b_q = 0$ for $p \neq q$.
We create a combined Grassmannian kernel from existing Grassmannian kernels [31]:

$$k^{[A+B]} = \delta^{[A]} k^{[A]} + \delta^{[B]} k^{[B]}, \tag{7}$$

where $\delta^{[A]}, \delta^{[B]} \geq 0$. Notably, $k^{[A]} + k^{[B]}$ defines a new kernel by reproducing kernel Hilbert space theory, as described in [31].
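For reference, a small NumPy sketch of Equations (5)-(7) on orthonormal subspace bases follows; for such bases, the cosines of the principal angles are the singular values of $X_i^T X_j$, so the canonical correlation kernel reduces to the largest singular value. The mixing coefficients default to one, matching the setting used later in Section 5.1.

```python
import numpy as np

def orth_basis(M):
    """Orthonormal basis of span(M) via thin QR."""
    Q, _ = np.linalg.qr(M)
    return Q

def k_proj(Xi, Xj):
    """Projection kernel, Equation (5): ||Xi^T Xj||_F^2."""
    return np.linalg.norm(Xi.T @ Xj, "fro") ** 2

def k_cc(Xi, Xj):
    """Canonical correlation kernel, Equation (6): the largest cosine of
    the principal angles, i.e. the largest singular value of Xi^T Xj."""
    return np.linalg.svd(Xi.T @ Xj, compute_uv=False)[0]

def k_combined(Xi, Xj, d_proj=1.0, d_cc=1.0):
    """Non-negative kernel combination, Equation (7)."""
    return d_proj * k_proj(Xi, Xj) + d_cc * k_cc(Xi, Xj)

# Two random 3-dimensional subspaces of R^20.
rng = np.random.default_rng(1)
Xi = orth_basis(rng.standard_normal((20, 3)))
Xj = orth_basis(rng.standard_normal((20, 3)))
print(k_proj(Xi, Xj), k_cc(Xi, Xj), k_combined(Xi, Xj))
```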

4.5. Optimization

According to [20,21], a general unconstrained minimization problem can be solved via the trace operator and the PBB method. Hence, we define a new objective function $g(F, A)$ equivalent to Equation (4):

$$g(F, A) = \mathrm{tr}\big( F^T (L_w - \beta L_b) F \big) + \mathrm{tr}\big( (F - Y)^T (F - Y) \big) + \mu \, \mathrm{tr}\big( (A^T K - Y)^T (A^T K - Y) \big) + \mu \eta \, \mathrm{tr}\big( A A^T \big). \tag{8}$$
If $(F^*, A^*)$ is an approximate stationary point of Equation (8), it must satisfy the KKT conditions. We therefore use the iteration-stopping criterion

$$\| \nabla_F g(F^*, A^*) \|^2 + \| \nabla_A g(F^*, A^*) \|^2 \leq \varepsilon, \tag{9}$$

where $\varepsilon$ is a small non-negative constant.

4.6. Projected Barzilai–Borwein

Similar to [20], a sequence of feasible points $(F^t, A^t)$ is generated by the gradient method:

$$d_F^t = -\lambda_t \nabla_F g(F^t, A^t), \quad F^{t+1} = F^t + \sigma_t d_F^t, \qquad d_A^t = -\lambda_t \nabla_A g(F^t, A^t), \quad A^{t+1} = A^t + \sigma_t d_A^t, \tag{10}$$
where $\lambda_t = \min\{\lambda_{\max}, \max\{\lambda_{\min}, \lambda_{ABB}^t\}\} > 0$ is a safeguarded step size, and $\sigma_t$ denotes the non-monotone line-search step size determined by an appropriate selection rule. Following [21], we have two choices of step size:

$$\lambda_{BB1}^{t+1} = \frac{\langle s_1^t, s_1^t \rangle + \langle s_2^t, s_2^t \rangle}{\langle s_1^t, y_1^t \rangle + \langle s_2^t, y_2^t \rangle}, \qquad \lambda_{BB2}^{t+1} = \frac{\langle s_1^t, y_1^t \rangle + \langle s_2^t, y_2^t \rangle}{\langle y_1^t, y_1^t \rangle + \langle y_2^t, y_2^t \rangle}, \tag{11}$$

where

$$s_1^t = F^{t+1} - F^t, \quad s_2^t = A^{t+1} - A^t, \quad y_1^t = \nabla_F g(F^{t+1}, A^{t+1}) - \nabla_F g(F^t, A^t), \quad y_2^t = \nabla_A g(F^{t+1}, A^{t+1}) - \nabla_A g(F^t, A^t). \tag{12}$$
The adaptive step sizes of Equation (11) can render the objective function non-monotonic; hence, $g(F^t, A^t)$ may increase in some iterations. Moreover, alternating between the two rules in Equation (11) works better than using either alone [21]; the step size is expressed as

$$\lambda_{ABB}^t = \begin{cases} \lambda_{BB1}^t, & t \text{ odd}, \\ \lambda_{BB2}^t, & t \text{ even}. \end{cases} \tag{13}$$
To guarantee the convergence of $(F^t, A^t)$, a globalization strategy based on the non-monotone line-search technique is adopted [20]:

$$g(F^{t+1}, A^{t+1}) \leq C_t + \gamma \sigma_t \big( \langle \nabla_F g(F^t, A^t), d_F^t \rangle + \langle \nabla_A g(F^t, A^t), d_A^t \rangle \big), \tag{14}$$

where $\tau \in (0, 1]$ and $C_t$ are parameters of the Armijo line-search method [21]. Following [20], to overcome some drawbacks of non-monotone techniques, the traditional largest-function-value reference is replaced by a weighted-average function value:

$$C_t = \frac{\tau \cdot \min\{t-1, M\} \, C_{t-1} + g(F^t, A^t)}{\tau \cdot \min\{t-1, M\} + 1}. \tag{15}$$
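To make Sections 4.5 and 4.6 concrete, the following NumPy sketch minimizes the quadratic objective $g(F, A)$ of Equation (8) with alternating BB step sizes (Equation (13)), the safeguarded $\lambda_t$, the KKT-based stopping rule (Equation (9)), and the weighted-average non-monotone line search (Equations (14) and (15)). It is a minimal illustration under our reading of the update rules, not the authors' released implementation; the gradients follow from differentiating Equation (8), and all parameter values are placeholders.

```python
import numpy as np

def pbb_minimize(L, K, Y, mu=0.1, eta=0.01, max_iter=200, eps=1e-4,
                 gamma=1e-4, tau=0.85, M=10, lam_min=1e-5, lam_max=1e5):
    """Minimize g(F, A) of Equation (8); L = L_w - beta * L_b is precomputed."""
    n = K.shape[0]

    def g(F, A):
        return (np.trace(F.T @ L @ F)
                + np.linalg.norm(F - Y, "fro") ** 2
                + mu * np.linalg.norm(A.T @ K - Y, "fro") ** 2
                + mu * eta * np.linalg.norm(A, "fro") ** 2)

    def gF(F, A):   # gradient w.r.t. F (L symmetric)
        return 2 * L @ F + 2 * (F - Y)

    def gA(F, A):   # gradient w.r.t. A
        return 2 * mu * (K @ (K.T @ A) - K @ Y.T) + 2 * mu * eta * A

    F, A = Y.astype(float).copy(), np.zeros((n, n))
    C, lam = g(F, A), 1.0                              # C_0 = g(F^0, A^0)
    for t in range(1, max_iter + 1):
        GF, GA = gF(F, A), gA(F, A)
        if np.linalg.norm(GF) ** 2 + np.linalg.norm(GA) ** 2 <= eps:
            break                                      # stopping rule, Eq. (9)
        dF, dA = -lam * GF, -lam * GA
        sigma = 1.0
        descent = np.vdot(GF, dF) + np.vdot(GA, dA)    # always negative
        # Non-monotone Armijo condition, Eq. (14), with backtracking on sigma.
        while g(F + sigma * dF, A + sigma * dA) > C + gamma * sigma * descent:
            sigma *= 0.5
        Fn, An = F + sigma * dF, A + sigma * dA
        s1, s2 = Fn - F, An - A
        y1, y2 = gF(Fn, An) - GF, gA(Fn, An) - GA
        ss = np.vdot(s1, s1) + np.vdot(s2, s2)
        sy = max(np.vdot(s1, y1) + np.vdot(s2, y2), 1e-12)
        yy = max(np.vdot(y1, y1) + np.vdot(y2, y2), 1e-12)
        bb = ss / sy if t % 2 == 1 else sy / yy        # alternate BB1/BB2, Eq. (13)
        lam = min(lam_max, max(lam_min, bb))           # safeguarded lambda_t
        w = tau * min(t - 1, M)
        C = (w * C + g(Fn, An)) / (w + 1)              # weighted average, Eq. (15)
        F, A = Fn, An
    return F, A
```

In a full pipeline, L would come from the within- and between-class Laplacians of Section 4.2 and K from the combined Grassmannian kernel of Section 4.4.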

5. Experiments

The proposed KGMA method is summarized in Algorithm 1. For comparison, two conventional variants that use SPG [12] or the ALS method instead of PBB, called kernel spectral projected gradient analysis (KSPG) and kernel alternating least-squares analysis (KALS), respectively, were also adopted to solve objective function (8) in our experiments.
Algorithm 1: Kernel Grassmann Manifold Analysis (KGMA).

5.1. Features

For handcrafted features, we follow [12] to extract improved dense trajectories (IDT) and Fisher vectors (FV), as shown in Figure 3.
For deep-learned features, we retrained temporal segment network (TSN) [7] models on the 15 × c labeled samples, then extracted the corresponding global-pooling features using the pre-trained TSN model, concatenating the RGB and optical-flow streams into 2048 dimensions with power and L2 normalization, as listed in Table 1.
We verified the proposed algorithm using three kernels: the projection kernel $k^{[\mathrm{proj}]}$, the canonical correlation kernel $k^{[\mathrm{cc}]}$, and the combined kernel $k^{[\mathrm{proj+cc}]}$. In some cases $k^{[\mathrm{proj}]}$ outperforms $k^{[\mathrm{cc}]}$ and vice versa, suggesting that a kernel combination adapts better to different data distributions. For $k^{[\mathrm{proj+cc}]}$, the mixing coefficients $\delta^{[\mathrm{proj}]}$ and $\delta^{[\mathrm{cc}]}$ were fixed at one. We obtained better results by combining the two kernels.

5.2. Datasets

Three datasets were used in the experiments: JHMDB [44], HMDB51 [45], and UCF101 [46]. The JHMDB dataset has 21 action categories; average recognition accuracies over three training–test splits are reported. The HMDB51 dataset contains 51 action categories; we report the mean average precision (mAP) over three training–test splits. The UCF101 dataset includes 101 action categories with 13,320 video clips; the average accuracy on the first split is reported.
For the JHMDB dataset, we followed the standard data partitioning (three splits) provided by the authors. For the other datasets, we used the first split provided by the authors and applied the original testing sets for fair comparison. Because the semi-supervised training set contains unlabeled data, we performed the following procedure to form the training set for each dataset, where c denotes the number of classes (c = 21, 51, and 101 for JHMDB, HMDB51, and UCF101, respectively).
Using JHMDB as an example, we first randomly selected 30 training samples per category to form a training set (30 × c samples). From this training set, we randomly sampled m videos (m = 3, 5, 10, and 15) per category as labeled samples. Therefore, if m = 10, 10 × c labeled samples are available, leaving (30 × c − 10 × c) videos as unlabeled samples for the semi-supervised training setting. We used three testing splits on JHMDB and HMDB51 but only the first testing split on UCF101 owing to limited GPU memory. Because training samples were selected randomly, the experiments were repeated 10 times to avoid bias.
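The sampling protocol above can be summarized by the following sketch; `samples` (a mapping from category name to video ids) and the function name are hypothetical, but the 30-per-class pool and the m labeled / (30 − m) unlabeled split mirror the procedure described.

```python
import random

def make_semisupervised_split(samples, m, per_class=30, seed=0):
    """Pick `per_class` training videos per category, then mark `m` of them
    as labeled and hide the labels of the remaining `per_class - m`."""
    rng = random.Random(seed)
    labeled, unlabeled = [], []
    for cls, vids in samples.items():       # each category needs >= per_class videos
        train = rng.sample(vids, per_class)
        labeled += [(v, cls) for v in train[:m]]
        unlabeled += [(v, None) for v in train[m:]]   # labels withheld for training
    return labeled, unlabeled

# For JHMDB (c = 21) with m = 10: 10 * 21 labeled and 20 * 21 unlabeled samples;
# repeating with seeds 0..9 reproduces the 10 randomized runs.
```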

5.3. Experimental Setup

To demonstrate the superiority of our approach (KGMA), we adopted seven methods for comparison: SVM (linear and χ2), SFUS [47], SFCM [9], MFCU [10], KSPG, and KALS. Notably, SFUS, SFCM, MFCU, KSPG, and KALS are semi-supervised action recognition approaches. Using the publicly available codes facilitates a fair comparison.
For the semi-supervised parameters $\eta$, $\beta$, and $\mu$ of SFUS, SFCM, MFCU, KSPG, KALS, and KGMA, we followed the settings in [9,10], with values ranging over $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10^{1}, 10^{2}, 10^{3}, 10^{4}\}$. Because our algorithm is not sensitive to the PBB parameters, we initialized them as in [20], as indicated in Algorithm 1. Notably, since KGMA applies PBB to optimize objective function (8), convergence is non-monotonic with oscillating objective function values, as shown in Figure 4. The absolute error alone therefore made it difficult to decide when to stop iterating, and the relative error of the objective function values proved a better stopping criterion. We chose the constant $\varepsilon = 10^{-4}$ for the iteration-stopping criterion in (9).

5.4. Mathematical Comparisons

The recognition results with handcrafted features on the three datasets are shown in Figure 3; comparisons with deep-learned features are given in Table 1.
Regarding the objective function (8), Figure 4 summarizes the computational results of the three optimization methods. Using 2048-dimensional deep-learned TSN features on the JHMDB dataset, the model was trained with only 15 labeled and 15 unlabeled samples per class. With the same semi-supervised parameters η, β, and μ, the three solvers of the same objective function can be compared in terms of running time, number of iterations, absolute error, relative error, and objective function value. Figure 4 shows the convergence curves of the three optimization methods. Since both SPG and PBB are non-monotonic optimization methods with relatively large fluctuations in objective function values, we omitted their first 29 iterations in Figure 4 and display the data from the 30th iteration onward, so as to better illustrate the monotonic convergence of ALS.
As shown in Table 2, for a randomly selected video data sample, ALS required the fewest iterations and the shortest running time, with the fastest per-iteration computation (0.1220 s) after extracting the deep TSN features. In contrast, PBB required the most iterations and the longest running time, with the slowest per-iteration computation (0.4212 s), while SPG performed between ALS and PBB. Considering Figure 4 and Table 2, it is evident that, despite using the costlier PBB optimization, our KGMA algorithm achieves the highest accuracy in the kernelized Grassmann manifold space. Nevertheless, solving with SPG yields only a marginal improvement over ALS, which is likely attributable to our novel kernelized Grassmann manifold space.

5.5. Performance on Action Recognition

A linear SVM was utilized as the baseline. Based on the comparisons, we observe the following: (1) KGMA achieved the best performance, and our semi-supervised algorithm outperformed the linear SVM, a widely used supervised classifier; (2) all methods achieved better performance when using more labeled training data, as shown in Figure 3, or when enlarging the range of the semi-supervised parameters (i.e., η, β, μ), as in Figure 5; (3) averaging accuracy over the 3 × c, 5 × c, 10 × c, and 15 × c cases, KGMA improved recognition on JHMDB, HMDB51, and UCF101 by 2.97%, 2.59%, and 2.40%, respectively; with TSN features, the improvements on the same datasets were 2.21%, 3.77%, and 2.23%, respectively. Evidently, our semi-supervised method improves recognition by leveraging unlabeled data, compared with a linear SVM trained on labeled data only. Figure 3 illustrates that our algorithm benefits from the multiple-feature analysis, the kernelized Grassmann space, and the iterative scheme of the PBB method.
These results can be attributed to several factors. First, our method not only leverages semi-supervised learning but also models intra-class action variation and inter-class action ambiguity simultaneously; it therefore gains more over other approaches when few labeled samples are available. Second, we uncover the action feature subspace on the Grassmannian manifold by incorporating Grassmannian kernels, and we solve the objective function with the adaptive line-search strategy and the PBB method in a mathematically sound way. Hence, the proposed algorithm works well with few labeled samples.

5.6. Convergence Study

According to objective function (4), we conducted experiments with the TSN feature, fixed the semi-supervised parameters η, β, and μ, and executed both the ALS and PBB methods 10 times. The results are listed in Table 2. Although ALS converges without oscillation and requires fewer iterations, the PBB method can outperform ALS for three reasons. First, PBB uses a non-monotone line-search strategy to globalize the process [21], which can reach the global optimum of the objective function rather than being trapped in local optima as the monotone ALS method can be. Second, adaptive step sizes are an essential characteristic determining the efficiency of projected gradient methods [21], whereas no such step-size scheme is considered in ALS. Finally, the efficient convergence properties of the projected gradient method have been established because PBB is well defined [21].

5.7. Computation Complexity

In the training stage, we computed the Laplacian matrix L, the complexity of which was O ( n 2 ) . To optimize the objective function, we computed the projected gradient and trace operators of several matrices. Therefore, the complexity of these operations was O ( n 3 ) .

5.8. Parameter Sensitivity Study

We verified that KGMA benefits from intra-class and inter-class structures via manifold discriminant analysis, as shown in Figure 5. We analyzed the impact of manifold learning on JHMDB and HMDB51, setting $\eta = 10^{-3}$ and $\mu = 10^{-1}$ at their optimal values over split 2 with $15 \times c$ labeled training data. As $\beta$ varied from $10^{-4}$ to $10^{4}$, the accuracy oscillated significantly and reached a peak value at $\beta = 10^{4}$. Since $\beta$ controls the proportion between the intra-class local geometric structure and the inter-class global manifold structure, as shown in Figure 5, when the intra-class local geometric structure is treated as a constant 1, $\beta \gg 1$ gives the inter-class global manifold structure a larger weight in the objective function, and vice versa. When $\beta = 0$, no inter-class structure is utilized; as $\beta \to +\infty$, no intra-class structure remains. When the Grassmann manifold space achieves an adequate balance of intra-class action variation and inter-class action ambiguity, the proposed algorithm can further enhance the discriminative power of the transformation matrix.
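A sensitivity sweep of this kind can be scripted as below; `train_and_eval` is a hypothetical stand-in for training KGMA with the given parameters and returning test accuracy, so only the control flow is meaningful here.

```python
import numpy as np

def train_and_eval(beta, eta=1e-3, mu=1e-1):
    """Placeholder: train KGMA with (beta, eta, mu) and return test accuracy.
    Swap in the real training/evaluation pipeline; the constant return value
    merely keeps the sweep runnable end-to-end."""
    return 0.0

betas = [10.0 ** p for p in range(-4, 5)]      # beta in {1e-4, ..., 1e4}
accs = [train_and_eval(beta=b) for b in betas]
print("peak accuracy", max(accs), "at beta =", betas[int(np.argmax(accs))])
```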

6. Conclusions

This study proposed a new approach to categorizing human action videos. With Grassmannian kernel combinations and multiple-feature analysis on multiple manifolds, our method improves recognition by uncovering intrinsic feature relationships. We evaluated the approach on three benchmark datasets, and experimental results show that it outperformed all competing methods, particularly when few labeled samples are available.

Author Contributions

Conceptualization, Z.X.; methodology, Z.X., X.L., J.L. and H.C.; software, Z.X., X.L. and J.L.; validation, Z.X.; formal analysis, Z.X. and X.L.; investigation, Z.X. and H.C.; resources, Z.X. and R.H.; data curation, Z.X.; writing—original draft, Z.X., X.L., J.L. and H.C.; writing—review and editing, X.L. and J.L.; visualization, Z.X.; supervision, R.H.; project administration, Z.X.; funding acquisition, Z.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (61862015, 11961010, 12261026), the Science and Technology Project of Guangxi (AD21220114), the Guangxi Key Laboratory of Automatic Detecting Technology and Instruments (YQ23103), the Outstanding Youth Science and Technology Innovation Team Project of Colleges and Universities in Hubei Province (T201923), the Key Science and Technology Project of Jingmen (2021ZDYF024), the Guangxi Key Research and Development Program (AB17195025).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

Many thanks to all the authors who took the time out of their busy schedules to review the paper and provide references.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Sun, H.; Li, B.; Dan, Z.; Hu, W.; Du, B.; Yang, W.; Wan, J. Multi-level Feature Interaction and Efficient Non-Local Information Enhanced Channel Attention for image dehazing. Neural Netw. 2023, 163, 10–27.
2. Sun, H.; Zhang, Y.; Chen, P.; Dan, Z.; Sun, S.; Wan, J.; Li, W. Scale-free heterogeneous cycleGAN for defogging from a single image for autonomous driving in fog. Neural Comput. Appl. 2021, 35, 3737–3751.
3. Wan, J.; Liu, J.; Zhou, J.; Lai, Z.; Shen, L.; Sun, H.; Xiong, P.; Min, W. Precise Facial Landmark Detection by Reference Heatmap Transformer. IEEE Trans. Image Process. 2023, 32, 1966–1977.
4. Wang, H.; Oneata, D.; Verbeek, J.; Schmid, C. A Robust and Efficient Video Representation for Action Recognition. Int. J. Comput. Vis. 2016, 119, 219–238.
5. Xu, Z.; Hu, R.; Chen, J.; Chen, H.; Li, H. Global Contrast Based Salient Region Boundary Sampling for Action Recognition. In Proceedings of the 22nd International Conference on MultiMedia Modeling, Miami, FL, USA, 4–6 January 2016; pp. 187–198.
6. Singh, A.; Chakraborty, O.; Varshney, A.; Panda, R.; Feris, R.; Saenko, K.; Das, A. Semi-supervised action recognition with temporal contrastive learning. In Proceedings of the 2021 IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10384–10394.
7. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 20–36.
8. Xu, Z.; Hu, R.; Chen, J.; Chen, C.; Chen, H.; Li, H.; Sun, Q. Action recognition by saliency-based dense sampling. Neurocomputing 2017, 236, 82–92.
9. Wang, S.; Yang, Y.; Ma, Z.; Li, X.; Pang, C.; Hauptmann, A.G. Action recognition by exploring data distribution and feature correlation. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1370–1377.
10. Wang, S.; Ma, Z.; Yang, Y.; Li, X.; Pang, C.; Hauptmann, A.G. Semi-supervised multiple feature analysis for action recognition. IEEE Trans. Multimed. 2014, 16, 289–298.
11. Luo, M.; Chang, X.; Nie, L.; Yang, Y.; Hauptmann, A.G.; Zheng, Q. An Adaptive Semisupervised Feature Analysis for Video Semantic Recognition. IEEE Trans. Cybern. 2018, 48, 648–660.
12. Xu, Z.; Hu, R.; Chen, J.; Chen, C.; Jiang, J.; Li, J.; Li, H. Semisupervised discriminant multimanifold analysis for action recognition. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 2951–2962.
13. Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6479–6488.
14. Xu, Y.; Yang, J.; Cao, H.; Mao, K.; Yin, J.; See, S. ARID: A new dataset for recognizing action in the dark. In Proceedings of the International Workshop on Deep Learning for Human Activity Recognition, Kyoto, Japan, 8 January 2021; pp. 70–84.
15. Chang, X.; Yang, Y. Semisupervised Feature Analysis by Mining Correlations Among Multiple Tasks. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 2294–2305.
16. Sigurdsson, G.A.; Russakovsky, O.; Gupta, A. What Actions are Needed for Understanding Human Actions in Videos? In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2156–2165.
17. Wang, X.; Zhu, L.; Wang, H.; Yang, Y. Interactive Prototype Learning for Egocentric Action Recognition. In Proceedings of the 2021 IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 8148–8157.
18. Ma, Y.; Wang, Y.; Wu, Y.; Lyu, Z.; Chen, S.; Li, X.; Qiao, Y. Visual Knowledge Graph for Human Action Reasoning in Videos. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 4132–4141.
19. Aktı, Ş.; Tataroğlu, G.A.; Ekenel, H.K. Vision-based fight detection from surveillance cameras. In Proceedings of the 2019 Ninth International Conference on Image Processing Theory, Tools and Applications, Istanbul, Turkey, 6–9 November 2019; pp. 1–6.
20. Liu, H.; Li, X. Modified subspace Barzilai–Borwein gradient method for non-negative matrix factorization. Comput. Optim. Appl. 2013, 55, 173–196.
21. Barzilai, J.; Borwein, J.M. Two-point step size gradient methods. IMA J. Numer. Anal. 1988, 8, 141–148.
22. Harandi, M.T.; Sanderson, C.; Shirazi, S.; Lovell, B.C. Kernel analysis on Grassmann manifolds for action recognition. Pattern Recognit. Lett. 2013, 34, 1906–1915.
23. Xiao, J.; Jing, L.; Zhang, L.; He, J.; She, Q.; Zhou, Z.; Yuille, A.; Li, Y. Learning from Temporal Gradient for Semi-supervised Action Recognition. In Proceedings of the 2022 IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 3242–3252.
24. Xu, Y.; Wei, F.; Sun, X.; Yang, C.; Shen, Y.; Dai, B.; Zhou, B.; Lin, S. Cross-model pseudo-labeling for semi-supervised action recognition. In Proceedings of the 2022 IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 2959–2968.
25. Si, C.; Nie, X.; Wang, W.; Wang, L.; Tan, T.; Feng, J. Adversarial self-supervised learning for semi-supervised 3D action recognition. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
26. Kumar, A.; Rawat, Y.S. End-to-end semi-supervised learning for video action detection. In Proceedings of the 2022 IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 14700–14710.
27. Bi, Y.; Bai, X.; Jin, T.; Guo, S. Multiple feature analysis for infrared small target detection. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1333–1337.
28. Shahroudy, A.; Ng, T.T.; Gong, Y.; Wang, G. Deep multimodal feature analysis for action recognition in RGB+D videos. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1045–1058.
29. Khaire, U.M.; Dhanalakshmi, R. Stability of feature selection algorithm: A review. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 1060–1073.
30. Huynh-The, T.; Hua, C.H.; Ngo, T.T.; Kim, D.S. Image representation of pose-transition feature for 3D skeleton-based action recognition. Inf. Sci. 2020, 513, 112–126.
31. Harandi, M.T.; Sanderson, C.; Shirazi, S.; Lovell, B.C. Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011; pp. 2705–2712.
32. Yan, Y.; Ricci, E.; Subramanian, R.; Liu, G.; Sebe, N. Multitask linear discriminant analysis for view invariant action recognition. IEEE Trans. Image Process. 2014, 23, 5599–5611.
33. Jiang, J.; Hu, R.; Wang, Z.; Cai, Z. CDMMA: Coupled discriminant multi-manifold analysis for matching low-resolution face images. Signal Process. 2016, 124, 162–172.
34. Markovitz, A.; Sharir, G.; Friedman, I.; Zelnik-Manor, L.; Avidan, S. Graph embedded pose clustering for anomaly detection. In Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10536–10544.
35. Manessi, F.; Rozza, A.; Manzo, M. Dynamic graph convolutional networks. Pattern Recognit. 2020, 97, 107000.
36. Cai, J.; Fan, J.; Guo, W.; Wang, S.; Zhang, Y.; Zhang, Z. Efficient deep embedded subspace clustering. In Proceedings of the 2022 IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 1–10.
37. Islam, A.; Radke, R. Weakly supervised temporal action localization using deep metric learning. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 547–556.
38. Ruan, Y.; Xiao, Y.; Hao, Z.; Liu, B. A nearest-neighbor search model for distance metric learning. Inf. Sci. 2021, 552, 261–277.
39. Rahimi, S.; Aghagolzadeh, A.; Ezoji, M. Human action recognition based on the Grassmann multi-graph embedding. Signal Image Video Process. 2019, 13, 271–279.
40. Yu, J.; Kim, D.Y.; Yoon, Y.; Jeon, M. Action matching network: Open-set action recognition using spatio-temporal representation matching. Vis. Comput. 2020, 36, 1457–1471.
41. Peng, W.; Shi, J.; Zhao, G. Spatial temporal graph deconvolutional network for skeleton-based human action recognition. IEEE Signal Process. Lett. 2021, 28, 244–248.
42. Fung, G.M.; Mangasarian, O.L. Multicategory Proximal Support Vector Machine Classifiers. Mach. Learn. 2005, 59, 77–97.
43. Yang, Y.; Wu, F.; Nie, F.; Shen, H.T.; Zhuang, Y.; Hauptmann, A.G. Web and Personal Image Annotation by Mining Label Correlation With Relaxed Visual Graph Embedding. IEEE Trans. Image Process. 2012, 21, 1339–1351.
44. Jhuang, H.; Gall, J.; Zuffi, S.; Schmid, C.; Black, M.J. Towards Understanding Action Recognition. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3192–3199.
45. Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A Large Video Database for Human Motion Recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2556–2563.
46. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. arXiv 2012, arXiv:1212.0402.
47. Ma, Z.; Nie, F.; Yang, Y.; Uijlings, J.R.R.; Sebe, N. Web Image Annotation Via Subspace-Sparsity Collaborated Feature Selection. IEEE Trans. Multimed. 2012, 14, 1021–1030.
Figure 2. An illustration of our method. (a) Video sets can be represented in R^D; the principal angles between them can be used to compare two actions. (b) Data points on the Grassmannian manifold M can be described as linear subspaces in R^D. When points on the manifold have a proper geodesic distance, the video-set-matching problem may be converted into a point-distance measurement problem. (c) By employing a proper Grassmannian kernel, data points can be mapped into another Grassmannian manifold M, where the same actions become closer while different actions are well separated.
Figure 3. Comparison (average accuracy ± std) with IDT+FV when different numbers of training samples are labeled, gmmSize = 16.
Figure 4. The convergence curves of the three optimization methods on the JHMDB dataset, with the final convergence results shown in Table 2. Due to the larger oscillations of PBB, the data for the first 29 iterations of SPG and PBB have been omitted here in order to better illustrate the comparative convergence of ALS, SPG and PBB.
Figure 5. Accuracy on JHMDB using TSN, w.r.t. the parameter β with fixed η and μ.
Table 1. Comparison with deep-learned features (average accuracy ± std) when 15 × c training videos are labeled.
| Method | JHMDB | HMDB51 | UCF101 |
|---|---|---|---|
| SFUS | 0.6942 ± 0.0121 | 0.5217 ± 0.0114 | 0.7910 ± 0.0087 |
| SFCM | 0.7125 ± 0.0099 | 0.5394 ± 0.0108 | 0.8070 ± 0.0101 |
| MFCU | 0.7154 ± 0.0088 | 0.5556 ± 0.0098 | 0.8429 ± 0.0085 |
| SVM-χ² | 0.6931 ± 0.0106 | 0.5190 ± 0.0095 | 0.8138 ± 0.0108 |
| SVM-linear | 0.7140 ± 0.0086 | 0.5385 ± 0.0077 | 0.8450 ± 0.0087 |
| KSPG | 0.7287 ± 0.0114 | 0.5697 ± 0.0833 | 0.8552 ± 0.0111 |
| KALS | 0.7218 ± 0.0087 | 0.5607 ± 0.0098 | 0.8411 ± 0.0095 |
| KGMA | 0.7361 ± 0.0096 | 0.5762 ± 0.1040 | 0.8673 ± 0.0087 |
Table 2. Mathematical results on JHMDB using 15 × c labeled training samples. “Obj-Val” means objective function value.
| Methods | Features (dim × nSample) | Parameters | Time (s) | Iter. | Error | Relative Error | Obj-Val |
|---|---|---|---|---|---|---|---|
| ALS | TSN (2048 × 660) | η = 0.001, β = 0.01, μ = 0.001 | 0.4880 | 4 | 0.5972 | 2.0691 × 10⁻⁴ | 2.0137 |
| SPG | TSN (2048 × 660) | η = 0.001, β = 0.01, μ = 0.001 | 6.1992 | 49 | 0.4706 | 8.1024 × 10⁻⁴ | 32.0130 |
| PBB | TSN (2048 × 660) | η = 0.001, β = 0.01, μ = 0.001 | 23.5855 | 56 | 0.6146 | 7.1873 × 10⁻⁴ | 10.0185 |