A Framework for Short Video Recognition Based on Motion Estimation and Feature Curves on SPD Manifolds

Abstract: Given the popularity of video platforms such as TikTok and YouTube, the demand for short video recognition is becoming increasingly urgent. A significant characteristic of short videos is that they contain few scene switches, so the target (e.g., the face of the key person) often appears throughout the video. This paper presents a new short video recognition framework that transforms a short video into a family of feature curves on the symmetric positive definite (SPD) manifold as the basis of recognition. Thus far, no similar algorithm has been reported. Experimental results suggest that our method performs better on three challenging databases than seven other related algorithms published in top venues.


Introduction
Video, especially short video, has become the mainstream information medium on the Internet. However, given the huge daily output and segmented production of short videos, the review and recommendation of short videos are still highly dependent on tags and manual recognition. To achieve more intelligent and efficient content auditing, short video recognition algorithms are in great demand. Since the first step of recognition is feature extraction, many feature extraction methods based on static images have been established. However, their application to video data still faces great challenges. Meanwhile, Riemannian manifolds have proven to be robust for extracting video features under different imaging conditions and have therefore been successfully employed in many branches of video recognition, including face recognition and action recognition.
In particular, symmetric positive definite (SPD) matrices are widely used in video representation because they provide second-order statistical information. The space of all SPD matrices endowed with a Riemannian metric is called the SPD manifold [1,2]. In general, existing SPD-based methods for video recognition construct an SPD matrix to represent each video and then exploit the geometry of the resulting SPD manifold for recognition; for example, a video can be modeled as a covariance matrix [3], a kernel matrix [4,5], or a Gaussian model [6,7].
Moreover, dimensionality reduction (DR) techniques can effectively extract valid information from the original SPD matrix and drastically reduce the computational cost, which grows with the dimension of the SPD matrix. One DR approach directly extracts a feature vector in Euclidean space from the original SPD matrix, using notions from Riemannian geometry [8]. Alternatively, to maintain the SPD geometry, bilinear DR [9,10] learns a mapping that transforms the original manifold into a new manifold with a lower dimension and a more discriminative metric. Metric learning on SPD manifolds learns a more discriminative function that reduces the similarity between SPD matrices from different categories while increasing the similarity between SPD matrices from the same category. It usually projects the original SPD manifold onto its tangent space [11,12], or embeds the SPD manifold into a subspace of a reproducing kernel Hilbert space (RKHS) [13-15]. The tangent space and the subspace of the RKHS are isomorphic to Euclidean spaces of the same dimension.
The SPD-based methods introduced above regard the whole video as an image-set-based SPD matrix without considering the temporal correlation between frames. Moreover, the experiments of these methods are based on databases composed of segmented video clips, which are in fact short videos. Because short videos contain few transitions, it is feasible to trace the trajectories of regions of interest (ROIs) through the video. To extract both spatial and temporal features from video, the authors of [16,17] adopted space-time geometric representations of human landmark configurations. Inspired by the work in [16,17], in this paper we propose a framework for short video recognition that models the spatiotemporal trajectories of ROIs as a family of feature curves on the SPD manifold.
The main steps of our short video recognition framework are as follows: (1) According to the practical application, a key frame is extracted from the short video to determine the spatial feature blocks (i.e., ROIs) of the target. (2) Using motion estimation [18] from video encoding, each ROI in the key frame is traced forward, backward, or in both directions to string a time series of ROIs through the short video. The resulting family of time series of ROIs represents the temporal and spatial features of the short video. (3) Each ROI is transformed into an SPD matrix by a regional covariance descriptor (RCD). Hence, the family of time series of ROIs is transformed into a family of time series of SPD matrices. Each time series of SPD matrices is a curve on the SPD manifold, and the family of curves constitutes the feature curves of the short video, which are the basis of short video recognition. (4) Using dynamic time warping (DTW) [19] with Riemannian metrics [1,2,20] and divergences [21,22] on the SPD manifold, a similarity measure between curve families on the SPD manifold is established, so as to realize the recognition of short videos.
Taking face recognition as an example, an overview of our framework is shown in Figure 1. Thus far, no similar algorithm has been reported. The experiments on three databases show that our framework is superior to seven other related algorithms published in top venues in recent years. The main contributions of this paper are as follows: (1) We propose a short video recognition framework that is feasible for different applications and provides optional strategies for stringing time series of ROIs, constructing RCDs, and performing recognition between families of feature curves. (2) Instead of viewing a video as an image-set-based SPD matrix, which ignores the temporal correlation of features across frames, our framework models each video as a family of feature curves on the SPD manifold, considering both the temporal and spatial features of the video. (3) ROIs and motion estimation from video encoding are applied in our framework to reduce the computational burden caused by redundancy across frames. Compared with global frame images, ROIs convey more accurate spatial features. Moreover, tracing ROIs with motion estimation can effectively reduce the computation of feature detection. (4) Encoding the major spatial features by ROI-based covariance descriptors helps to build the SPD geometry and provides a discriminative Riemannian metric for recognition.
The organization of this paper is as follows. In Section 2, we present notation and preliminaries. In Section 3, several related works are introduced. Section 4 provides the details of our proposed framework. Section 5 introduces the application of our proposed framework to face recognition from video. Section 6 introduces seven recent SPD-based video/image set recognition algorithms as comparison algorithms. Section 7 shows the experimental results on three datasets with the seven comparison algorithms. Conclusions are given in Section 8.

Preliminaries
In this section, we first introduce the notation used throughout this paper and then describe the geometry of the SPD manifold, including the Riemannian metrics on the SPD manifold and the divergences of SPD matrices. In addition, we give an introduction to motion estimation in video compression coding.

Notation
In this paper, vectors are denoted by lowercase letters, e.g., $x$; matrices are denoted by uppercase letters, e.g., $X$; and sets of matrices are written as $\mathcal{X} = \{X_1, \dots, X_N\}$. $\mathbb{R}^D$ is the $D$-dimensional Euclidean space. $\mathrm{Sym}_{++}^D$ is the SPD manifold, which will be formally defined later. $T_X \mathrm{Sym}_{++}^D$ is the tangent space to the SPD manifold at $X \in \mathrm{Sym}_{++}^D$. $\mathrm{Gr}(d, D)$ is the Grassmannian manifold, i.e., the set of $d$-dimensional subspaces of $\mathbb{R}^D$. $\mathrm{GL}(D)$ is the general linear group, i.e., the set of all invertible $D \times D$ matrices. $\mathrm{Sym}_{+}^D$ is the positive semidefinite cone, i.e., the set of all $D \times D$ positive semidefinite matrices. $\mathcal{H}$ is the reproducing kernel Hilbert space.

The Geometry of SPD Manifold
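A $D \times D$ matrix $X$ is symmetric positive definite if $X^T = X$ and $v^T X v > 0$ for any non-zero column vector $v \in \mathbb{R}^D$. The space of all $D \times D$ SPD matrices is denoted by $\mathrm{Sym}_{++}^D$. Endowed with a Riemannian metric, this space becomes a Riemannian manifold. Namely, the SPD manifold is defined as

$$\mathrm{Sym}_{++}^D = \left\{ X \in \mathbb{R}^{D \times D} : X = X^T,\ v^T X v > 0\ \ \forall v \in \mathbb{R}^D \setminus \{0\} \right\}.$$

For all $X \in \mathrm{Sym}_{++}^D$, the tangent space $T_X \mathrm{Sym}_{++}^D$ is the space of symmetric $D \times D$ matrices, onto which points of the manifold are projected by the logarithm mapping $\mathrm{Log}_X : \mathrm{Sym}_{++}^D \to T_X \mathrm{Sym}_{++}^D$.

The geometry of the SPD manifold is usually studied through a Riemannian metric $\omega$, which defines an inner product $\langle \cdot, \cdot \rangle_X$ on the tangent space $T_X \mathrm{Sym}_{++}^D$ for any $X \in \mathrm{Sym}_{++}^D$. This inner product determines the length of curves between points on $\mathrm{Sym}_{++}^D$. The curve with the shortest length between two elements of $\mathrm{Sym}_{++}^D$ is the geodesic, and its length is called the geodesic distance.

The affine invariant Riemannian metric (AIRM) [1] is the most frequently used Riemannian metric. For all $\Phi, \Theta \in T_X \mathrm{Sym}_{++}^D$, the AIRM is defined as

$$\langle \Phi, \Theta \rangle_X = \left\langle X^{-1/2} \Phi X^{-1/2},\ X^{-1/2} \Theta X^{-1/2} \right\rangle_F = \mathrm{tr}\left( X^{-1} \Phi X^{-1} \Theta \right),$$

where $\langle A, B \rangle_F = \mathrm{tr}(A B^T)$ and $X \in \mathrm{Sym}_{++}^D$. Here, $\langle \cdot, \cdot \rangle_X$ is an inner product that varies smoothly over $\mathrm{Sym}_{++}^D$. The logarithm mapping projecting $X_2 \in \mathrm{Sym}_{++}^D$ onto the tangent space $T_{X_1} \mathrm{Sym}_{++}^D$ is defined as

$$\mathrm{Log}_{X_1}(X_2) = X_1^{1/2} \log\left( X_1^{-1/2} X_2 X_1^{-1/2} \right) X_1^{1/2}.$$

For any given pair $X_1, X_2 \in \mathrm{Sym}_{++}^D$, the unique geodesic [23] induced by the AIRM connecting $\Gamma_{X_1}^{X_2}(0) = X_1$ with $\Gamma_{X_1}^{X_2}(1) = X_2$ is given by

$$\Gamma_{X_1}^{X_2}(t) = X_1^{1/2} \left( X_1^{-1/2} X_2 X_1^{-1/2} \right)^{t} X_1^{1/2}, \quad 0 \le t \le 1.$$

The geodesic distance between $X_1, X_2 \in \mathrm{Sym}_{++}^D$ induced by the AIRM is as follows:

$$\delta_{AIRM}^2(X_1, X_2) = \left\| \log\left( X_1^{-1/2} X_2 X_1^{-1/2} \right) \right\|_F^2,$$

where $\|A\|_F^2 = \langle A, A \rangle_F$ is the squared Frobenius norm of a matrix. The AIRM is invariant to affine transformations, i.e., $\delta_{AIRM}^2(X_1, X_2) = \delta_{AIRM}^2(A X_1 A^T, A X_2 A^T)$ for any $A \in \mathrm{GL}(D)$.

For all $\Phi, \Theta \in T_X \mathrm{Sym}_{++}^D$, the log-Euclidean metric (LEM) [2,20] is defined as

$$\langle \Phi, \Theta \rangle_X = \left\langle D_X \log(\Phi),\ D_X \log(\Theta) \right\rangle_F,$$

where $X \in \mathrm{Sym}_{++}^D$ and $D_X \log(\Phi)$ denotes the directional derivative of the matrix logarithm at $X$ along $\Phi$. The logarithm mapping projecting $X_2 \in \mathrm{Sym}_{++}^D$ onto the tangent space $T_{X_1} \mathrm{Sym}_{++}^D$ is defined as

$$\mathrm{Log}_{X_1}(X_2) = D_{\log(X_1)} \exp\left( \log(X_2) - \log(X_1) \right).$$

Hence, the distance between $X_1, X_2 \in \mathrm{Sym}_{++}^D$ induced by the LEM is as follows:

$$\delta_{LEM}^2(X_1, X_2) = \left\| \log(X_1) - \log(X_2) \right\|_F^2.$$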
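As a numerical illustration of the AIRM and LEM quantities discussed above, the following minimal NumPy sketch computes both distances and the AIRM geodesic through eigendecompositions. The function names (`sym_fun`, `congruence`, `airm_distance`, `lem_distance`, `airm_geodesic`, `random_spd`) and the random test matrices are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def sym_fun(X, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * fun(w)) @ V.T

def congruence(X1, X2):
    """Return X1^{-1/2} X2 X1^{-1/2}, symmetrized against round-off."""
    S = sym_fun(X1, lambda w: w ** -0.5)
    M = S @ X2 @ S
    return 0.5 * (M + M.T)

def airm_distance(X1, X2):
    """Geodesic distance under the AIRM: ||log(X1^{-1/2} X2 X1^{-1/2})||_F."""
    return np.linalg.norm(sym_fun(congruence(X1, X2), np.log), "fro")

def lem_distance(X1, X2):
    """Distance under the log-Euclidean metric: ||log(X1) - log(X2)||_F."""
    return np.linalg.norm(sym_fun(X1, np.log) - sym_fun(X2, np.log), "fro")

def airm_geodesic(X1, X2, t):
    """Point at parameter t in [0, 1] on the AIRM geodesic from X1 to X2."""
    R = sym_fun(X1, np.sqrt)
    return R @ sym_fun(congruence(X1, X2), lambda w: w ** t) @ R

def random_spd(D, rng):
    """Hypothetical helper: a well-conditioned random SPD matrix."""
    A = rng.standard_normal((D, D))
    return A @ A.T + D * np.eye(D)

# Illustrative data (assumed, not from the paper).
rng = np.random.default_rng(0)
X1, X2 = random_spd(4, rng), random_spd(4, rng)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)  # an invertible matrix in GL(4)

print(airm_distance(X1, X2), airm_distance(A @ X1 @ A.T, A @ X2 @ A.T))
print(lem_distance(X1, X2))
```

The two printed AIRM values should agree up to round-off, which reflects the affine invariance property; the LEM distance is generally cheaper because each matrix logarithm can be computed once and reused.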

Bregman Divergences
In addition to the distances generated by Riemannian metrics, divergences of SPD matrices based on the Bregman divergence can also be used as dissimilarity measures between SPD matrices. For all $X_1, X_2 \in \mathrm{Sym}_{++}^D$, the Bregman matrix divergence [24] is defined as

$$d_{\Phi}(X_1, X_2) = \Phi(X_1) - \Phi(X_2) - \left\langle \nabla \Phi(X_2),\ X_1 - X_2 \right\rangle_F,$$

where $\Phi : \mathrm{Sym}_{++}^D \to \mathbb{R}$ is a strictly convex function, and $\nabla \Phi(X_2)$ is the gradient of $\Phi$ at the point $X_2$. The Bregman divergence resembles a distance measure, but it satisfies neither the triangle inequality nor symmetry. Different symmetrization schemes and seed functions $\Phi$ lead to different symmetric divergences. Among them, the Stein divergence [21] and the Jeffrey divergence [22] play an important role in computer vision.
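For any given pair $X_1, X_2 \in \mathrm{Sym}_{++}^D$, the Stein divergence adopting $\Phi(X) = -\log\det(X)$ and the Jensen-Shannon symmetrization is defined as

$$\delta_{S}^2(X_1, X_2) = \log\det\left( \frac{X_1 + X_2}{2} \right) - \frac{1}{2} \log\det(X_1 X_2).$$

For any given pair $X_1, X_2 \in \mathrm{Sym}_{++}^D$, the Jeffrey divergence adopting $\Phi(X) = -\log\det(X)$ and direct symmetrization is defined as

$$\delta_{J}^2(X_1, X_2) = \frac{1}{2} \mathrm{tr}\left( X_1^{-1} X_2 \right) + \frac{1}{2} \mathrm{tr}\left( X_2^{-1} X_1 \right) - D,$$

where $D$ is the size of the SPD matrices, i.e., $X_1, X_2 \in \mathrm{Sym}_{++}^D$. In the same way as the AIRM, the Stein divergence and the Jeffrey divergence are affine invariant.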
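The relation between the seed function $\Phi(X) = -\log\det(X)$ and the two symmetrized divergences can be checked numerically. The sketch below is illustrative only: the function names and test matrices are assumptions. It uses the closed form of the Bregman divergence for this seed, $\mathrm{tr}(X_2^{-1} X_1) - \log\det(X_2^{-1} X_1) - D$, which follows from $\nabla \Phi(X) = -X^{-1}$.

```python
import numpy as np

def logdet(X):
    """log(det(X)) for a matrix with positive determinant."""
    return np.linalg.slogdet(X)[1]

def bregman_logdet(X1, X2):
    """Bregman divergence with seed Phi(X) = -log det(X)."""
    M = np.linalg.solve(X2, X1)  # X2^{-1} X1
    return np.trace(M) - logdet(M) - X1.shape[0]

def stein_divergence(X1, X2):
    """Jensen-Shannon symmetrization of the log-det Bregman divergence."""
    return logdet(0.5 * (X1 + X2)) - 0.5 * (logdet(X1) + logdet(X2))

def jeffrey_divergence(X1, X2):
    """Direct symmetrization of the log-det Bregman divergence."""
    D = X1.shape[0]
    return (0.5 * np.trace(np.linalg.solve(X1, X2))
            + 0.5 * np.trace(np.linalg.solve(X2, X1)) - D)

def random_spd(D, rng):
    """Hypothetical helper: a well-conditioned random SPD matrix."""
    A = rng.standard_normal((D, D))
    return A @ A.T + D * np.eye(D)

# Illustrative data (assumed, not from the paper).
rng = np.random.default_rng(1)
X1, X2 = random_spd(4, rng), random_spd(4, rng)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)
Mid = 0.5 * (X1 + X2)

# Direct symmetrization: average of the two argument orders.
direct = 0.5 * (bregman_logdet(X1, X2) + bregman_logdet(X2, X1))
# Jensen-Shannon symmetrization: average divergence to the arithmetic mean.
jensen_shannon = 0.5 * (bregman_logdet(X1, Mid) + bregman_logdet(X2, Mid))

print(direct, jeffrey_divergence(X1, X2))
print(jensen_shannon, stein_divergence(X1, X2))
```

Each printed pair should coincide up to round-off, since the log-determinant terms cancel under direct symmetrization and the trace terms reduce to $D$ under the Jensen-Shannon symmetrization.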

Motion Estimation
The similarity between adjacent frames introduces inter-frame redundancy. With motion estimation [18], only the changing parts of adjacent video frames are encoded, which reduces both the amount of data and the inter-frame redundancy. In motion estimation, a frame is segmented into $M \times N$ or, more commonly, $N \times N$ pixel blocks. Within a matching window of size $(M + 2p) \times (N + 2p)$, the current block is compared with the candidate blocks in the previous frame. Here, $p$, the number of pixels searched on each side of the block, is called the search parameter. On the basis of a matching criterion, the best match is found, and the position of the block corresponding to the current block is obtained.
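There are various matching criteria, including the mean absolute difference (MAD):

$$\mathrm{MAD} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left| f_k(i, j) - f_{k-1}(i, j) \right|, \tag{14}$$

the mean squared error (MSE):

$$\mathrm{MSE} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( f_k(i, j) - f_{k-1}(i, j) \right)^2, \tag{15}$$

and the normalized cross correlation (NCC):

$$\mathrm{NCC} = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N} f_k(i, j)\, f_{k-1}(i, j)}{\sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} f_k(i, j)^2}\ \sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} f_{k-1}(i, j)^2}}, \tag{16}$$

where $f_k(i, j)$ are the pixels in the current $M \times N$ block and $f_{k-1}(i, j)$ are the pixels in the matching block in frame $k - 1$. The best match between two blocks is the one with the smallest MSE or MAD. In contrast, the NCC measures the similarity between two blocks in the range $[-1, 1]$: the closer the NCC is to 1, the closer the two blocks are to a linear relationship.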
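The block-matching procedure described above can be sketched as an exhaustive (full) search over the matching window. This is a minimal illustration under stated assumptions: the function name `block_match`, the choice of full search, and the synthetic frames are not taken from the paper, and the NCC is implemented in its uncentered form.

```python
import numpy as np

def mad(a, b):
    """Mean absolute difference between two equally sized blocks."""
    return np.mean(np.abs(a - b))

def mse(a, b):
    """Mean squared error between two equally sized blocks."""
    return np.mean((a - b) ** 2)

def ncc(a, b):
    """Uncentered normalized cross correlation, in [-1, 1]."""
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return np.sum(a * b) / denom if denom > 0 else 0.0

def block_match(prev_frame, block, top, left, p, criterion="mad"):
    """Full search within +/- p pixels of (top, left) in the previous frame.

    Returns the best (row, col) position and its cost; for NCC the cost is
    the negated correlation, so that smaller is always better.
    """
    M, N = block.shape
    H, W = prev_frame.shape
    block = block.astype(float)
    best_pos, best_cost = None, np.inf
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + M > H or x + N > W:
                continue  # candidate block would leave the frame
            cand = prev_frame[y:y + M, x:x + N].astype(float)
            if criterion == "mad":
                cost = mad(block, cand)
            elif criterion == "mse":
                cost = mse(block, cand)
            elif criterion == "ncc":
                cost = -ncc(block, cand)
            else:
                raise ValueError("unknown criterion: " + criterion)
            if cost < best_cost:
                best_pos, best_cost = (y, x), cost
    return best_pos, best_cost

# Synthetic example (assumed data): the previous frame is the current frame
# shifted by 2 rows down and 3 columns to the left.
rng = np.random.default_rng(0)
cur = rng.random((32, 32))
prev = np.roll(cur, shift=(2, -3), axis=(0, 1))
block = cur[10:18, 12:20]

pos_mad, cost_mad = block_match(prev, block, 10, 12, p=4, criterion="mad")
pos_ncc, cost_ncc = block_match(prev, block, 10, 12, p=4, criterion="ncc")
print(pos_mad, cost_mad, pos_ncc, cost_ncc)
```

Full search evaluates $(2p + 1)^2$ candidates per block; faster search patterns used in practical encoders trade some accuracy for fewer evaluations.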

Related Works
In this section, we introduce the relevant literature that models the temporal evolution of video as curves on a Riemannian manifold. In addition, we introduce classification and alignment methods for time series.

Modeling of Video as Curve on Riemannian Manifold
The key point here is to account for the temporal and spatial features of video simultaneously. Profiting from explicit landmark configurations, several recent works modeled each sequence of facial expressions in a video as a curve or a trajectory on a Riemannian manifold. Taheri et al. [25] represented a sequence of faces as a sequence of facial landmarks. Since the landmark configuration of each face is a full-rank $D \times d$ matrix encoding the $d$-dimensional coordinates of $D$ landmarks, a sequence of facial expressions can be considered as a curve on the Grassmannian manifold $\mathrm{Gr}(d, D)$ with the neutral face as the starting point. To classify the modeled curves, linear discriminant analysis (LDA) and a multi-class SVM are applied. Taking texture information into consideration, Otberdout et al. [16] encoded deep convolutional neural network features extracted from human faces by covariance descriptors, so as to model the temporal evolution of facial expressions as trajectories on SPD manifolds.
Besides its role in facial expression recognition, temporal modeling based on landmarks on Riemannian manifolds can be extended to action recognition [17,26-29]. Kacem et al. [17] represented each $D \times d$ landmark matrix of a skeleton by a $D \times D$ positive semidefinite Gram matrix. By doing so, the temporal evolution of skeletons is represented by a time series of the corresponding Gram matrices, which can be considered as a trajectory connected by pseudo-geodesics [23] on the positive semidefinite cone $\mathrm{Sym}_{+}^D$. Devanne et al. [26] used the square-root velocity function to construct trajectories in an $n$-dimensional space representing skeleton sequences. Tanfous et al. [27] represented the landmark configuration sequence as a trajectory in the Kendall shape space and encoded the trajectory through a dictionary learned from the sample set to generate Euclidean sparse time series.
However, all these works rely on geometric information from landmarks. These methods not only incur a high cost for detecting landmarks in each frame but also are poorly suited to videos without clear landmarks. Moreover, the features that can be extracted from pixel-level landmarks are limited. Distinct from the works introduced above, in this paper we focus on modeling curves of ROIs. By choosing the pattern of ROIs, different video recognition applications can be realized, not only facial expression recognition and action recognition.
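Dynamic time warping (DTW) [19] was originally used in the field of speech recognition, where the unknown speech is stretched or compressed until its length is consistent with that of the reference template. In this process, the time axis of the unknown speech is warped so that it corresponds to the reference pattern. Given two time series $T_1 = (a_1, \dots, a_{L_1})$ and $T_2 = (b_1, \dots, b_{L_2})$ with lengths $L_1$ and $L_2$, respectively, a pair of warping paths $\beta = (\beta_1, \beta_2)$ of common length $L$ between $T_1$ and $T_2$, with $\beta_1 : \{1, \dots, L\} \to \{1, \dots, L_1\}$ and $\beta_2 : \{1, \dots, L\} \to \{1, \dots, L_2\}$, needs to meet the following boundary, monotonicity, and continuity constraints:

$$\beta_1(1) = \beta_2(1) = 1, \quad \beta_1(L) = L_1, \quad \beta_2(L) = L_2, \quad \left( \beta_1(l+1) - \beta_1(l),\ \beta_2(l+1) - \beta_2(l) \right) \in \{(1, 0), (0, 1), (1, 1)\}.$$

The optimal path between $T_1$ and $T_2$ is given by

$$\beta^* = \arg\min_{\beta \in \Gamma} \sum_{l=1}^{L} \delta\left( a_{\beta_1(l)}, b_{\beta_2(l)} \right),$$

where $\Gamma$ is the set of all possible paths and $\delta(\cdot, \cdot)$ is a distance metric. The DTW distance is given by

$$\delta_{DTW}(T_1, T_2) = \sum_{l=1}^{L} \delta\left( a_{\beta_1^*(l)}, b_{\beta_2^*(l)} \right).$$

To sum up, searching for an optimal DTW path $\beta^*$ is equivalent to finding, among all possible warping paths, the one that minimizes the cumulative distance cost. The recurrence of the cumulative distance matrix $\Pi$ with $\Pi(0, 0) = 0$ and $\Pi(i, 0) = \Pi(0, j) = \infty$ for $i, j \ge 1$ can be written as

$$\Pi(i, j) = \delta(a_i, b_j) + \min\left\{ \Pi(i-1, j-1),\ \Pi(i-1, j),\ \Pi(i, j-1) \right\},$$

so that $\delta_{DTW}(T_1, T_2) = \Pi(L_1, L_2)$.

Several methods developed from DTW are widely used in time series classification: the 1NN-DTW model [19,33], which combines a 1NN classifier with DTW; the variant DDDTW [34], which is based on a derivative distance; and kernel functions constructed from the DTW distance [35,36]. Using the DTW distance instead of the Euclidean distance in the Gaussian RBF kernel, Bahlmann et al. proposed the Gaussian DTW (GDTW) kernel [35]. However, since the DTW distance does not satisfy the triangle inequality, the GDTW kernel is not guaranteed to be positive definite. To overcome this limitation, the global alignment kernel (GAK) [36], used in [16], takes all alignments into account, at a substantial computational cost.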
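The cumulative-distance recurrence can be sketched in a few lines. The following is an illustrative implementation, assuming the standard step pattern with diagonal, vertical, and horizontal moves; the function name `dtw` and the toy series are assumptions. The pointwise distance `dist` is passed in as a parameter so that the same routine could be used with a Riemannian metric or divergence.

```python
import numpy as np

def dtw(T1, T2, dist):
    """DTW distance and one optimal warping path between two sequences.

    Pi[i, j] is the minimal cumulative cost of aligning the first i elements
    of T1 with the first j elements of T2.
    """
    L1, L2 = len(T1), len(T2)
    Pi = np.full((L1 + 1, L2 + 1), np.inf)
    Pi[0, 0] = 0.0
    for i in range(1, L1 + 1):
        for j in range(1, L2 + 1):
            cost = dist(T1[i - 1], T2[j - 1])
            Pi[i, j] = cost + min(Pi[i - 1, j - 1], Pi[i - 1, j], Pi[i, j - 1])

    # Backtrack from (L1, L2) to recover an optimal path (0-based indices).
    path, i, j = [], L1, L2
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(Pi[i - 1, j - 1], i - 1, j - 1),
                 (Pi[i - 1, j], i - 1, j),
                 (Pi[i, j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return Pi[L1, L2], path

# Toy example (assumed data) with the absolute difference as distance.
T1 = [1.0, 2.0, 3.0]
T2 = [2.0, 2.0, 2.0, 3.0, 4.0]
distance, path = dtw(T1, T2, lambda a, b: abs(a - b))
print(distance, path)
```

The dynamic program costs $O(L_1 L_2)$ evaluations of `dist`, which is the dominant cost when `dist` involves matrix logarithms.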

A Framework for Feature Curves on SPD Manifold
In this section, we first formulate the short video recognition problem and then present our framework, which represents a short video as a family of feature curves on the SPD manifold. Specifically, our framework involves three parts: stringing time series of ROIs based on motion estimation, modeling feature curves with RCDs, and classifying feature curves.

Formulation of Short Video Recognition
Video recognition aims to identify the label of a query video $V_{query}$ on the basis of a number of sample videos $\{V_1, \dots, V_n\}$ labeled with $l_{V_1}, \dots, l_{V_n}$. It covers face recognition, scene classification, action recognition, and other application directions. For example, the recognition of surveillance video in the field of public security matches a query video $V_{query}$ against the video library obtained by the monitoring system. The target to be recognized in the video can be a person, a car, or even a license plate involved in a case.
This paper focuses on short video recognition. Short videos are now popular in social media owing to their high-frequency output and strong user participation. Hence, video media have high requirements for content auditing, information screening, and content recommendation, and it is necessary to promote research on short video recognition. To extract features for recognition, static images are normally described in terms of feature vectors, and videos are regarded as image sets. However, an image set ignores the temporal features of video. Spatial features in video change dynamically over time, and adjacent frames share similar spatial features. Unlike long videos such as movies, short videos contain few transitions between paragraphs and scenes, which means that the significant spatial features, i.e., the ROIs, may run through the whole short video. To extract spatial-temporal features from short videos, we propose a short video recognition framework that models the spatiotemporal trajectories of ROIs in a video as a family of feature curves on the SPD manifold.

Stringing Time Series of ROIs Based on Motion Estimation
Given a $k$-frame video $F = \{F_1, \dots, F_k\}$, where $F_i$ is an image matrix, the spatial features of the video are reflected within its frames. Our proposed framework takes the ROIs of a frame as the major spatial features of the video. The ROI is originally a concept from video encoding, where the image quality of regions of no concern can be sacrificed and only key regions, i.e., ROIs, are coded at high resolution. This meets users' requirements for high-definition video monitoring while saving network bandwidth, processing time, and video storage space. An ROI can be square, round, or irregularly shaped. In our framework, we first extract a key frame from the video and then determine a group of ROIs as a pattern. The ROIs possess semantic features and vary according to the specific application. Since ours is an algorithm framework, the choice of ROI selection and ROI detection methods is open. Meanwhile, the weights and number of ROIs can be adjusted to optimize the algorithm.
The temporal features of video are reflected in the temporal correlation between video frames. Given a pattern of $m$ ROIs $\{\mathrm{ROI}_i\}_{i=1}^{m}$, our proposed framework takes the $m$ corresponding time series of ROIs as the spatiotemporal features of the video. We use the motion estimation method from video compression coding to trace the ROIs. Tracing can be forward, backward, or bidirectional, depending on the position of the key frame in the frame sequence. Taking backward tracing as an example, we introduce two specific strategies for stringing time series of ROIs in the following: the continuous frame extraction strategy and the inter-frame extraction strategy.
Let $\mathrm{ROI}_{i,1}$ represent the $i$-th ROI in the key frame $F_s$ $(1 \le s \le k)$. For the continuous frame extraction strategy, we take $F_s$ as the starting frame for motion estimation and trace the region closest to $\mathrm{ROI}_{i,1}$ in the next frame $F_{s+1}$ $(s + 1 \le k)$ as $\mathrm{ROI}_{i,2}$. The difference between regions can be calculated by the matching criteria in Equations (14)-(16). Then, frame $F_{s+1}$ becomes the new starting frame for the next motion estimation, and so on. Each traced ROI is preserved in a time series of ROIs.
However, the ROIs in adjacent frames may be too similar in smoothly changing videos, and preserving an ROI for every frame causes data redundancy. To tackle this problem, we employ an inter-frame extraction strategy. This strategy still traces ROIs frame by frame, but if the difference between the traced ROI and $\mathrm{ROI}_{i,1}$ in $F_s$ is below a certain lower limit $\tau$, the traced ROI is not preserved and the starting frame is kept at $F_s$. Only when the difference between the traced ROI in $F_{s+\eta}$ and $\mathrm{ROI}_{i,1}$ exceeds $\tau$ is the traced ROI strung into the time series as $\mathrm{ROI}_{i,2}$; then, the starting frame is replaced by $F_{s+\eta}$ for the next motion estimation.
Since both strategies suffer from distortion of the prediction result as the error of each estimation accumulates, our proposed framework stops motion estimation at the ending frame $F_e$, where the accumulated error reaches an upper limit $\mu$. We call this a cycle of motion estimation. After a cycle, we return to the spatial domain to detect the ROIs in $\{F_{e+1}, \dots, F_k\}$ and redefine the starting frame to begin a new cycle of motion estimation. By repeating the cycle until the end of the video to find all the $i$-th ROIs on the video timeline, we can construct a time series of the $i$-th ROI:

$$R_i = \left\{ \mathrm{ROI}_{i,1}, \dots, \mathrm{ROI}_{i,L_i} \right\},$$

where $L_i$ is the length of the $i$-th time series. Therefore, the whole video can be represented by a family of time series of ROIs $\{R_i\}_{i=1}^{m}$, where $m$ represents the number of ROIs in the pattern.
It should be noted that the length $L_i$, the upper limit $\mu$, the lower limit $\tau$, the starting frame $F_s$, and the ending frame $F_e$ can differ between time series. Most importantly, by stringing time series of ROIs, our proposed framework combines the spatial and temporal features of short videos and transforms the task of short video recognition into recognition between families of time series of ROIs.
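The two stringing strategies within a single cycle of motion estimation can be sketched as follows. This is a simplified illustration under several assumptions that the text leaves open: the MAD criterion is used for matching, the accumulated error is taken to be the sum of the matching costs, the search window follows the traced position even when an ROI is not preserved, and the re-detection of ROIs after a cycle is omitted. The names `best_match` and `string_rois` and the synthetic frames are hypothetical.

```python
import numpy as np

def best_match(frame, ref, top, left, p):
    """Full-search block matching with the MAD criterion within +/- p pixels."""
    h, w = ref.shape
    best_pos, best_cost = (top, left), np.inf
    for y in range(max(0, top - p), min(frame.shape[0] - h, top + p) + 1):
        for x in range(max(0, left - p), min(frame.shape[1] - w, left + p) + 1):
            cost = np.mean(np.abs(frame[y:y + h, x:x + w] - ref))
            if cost < best_cost:
                best_pos, best_cost = (y, x), cost
    return best_pos, best_cost

def string_rois(frames, s, top, left, h, w, p=2, tau=None, mu=np.inf):
    """Trace one ROI from key frame s through the following frames.

    tau=None selects the continuous strategy (every traced ROI is kept).
    Otherwise a traced ROI is kept only if its difference from the current
    reference ROI exceeds tau (inter-frame strategy). Tracing stops once the
    accumulated matching error exceeds mu (assumed error model).
    Returns the ROI time series and the (frame, row, col) of each kept ROI.
    """
    ref = frames[s][top:top + h, left:left + w]
    series, positions = [ref], [(s, top, left)]
    accumulated = 0.0
    for k in range(s + 1, len(frames)):
        (y, x), cost = best_match(frames[k], ref, top, left, p)
        accumulated += cost
        if accumulated > mu:
            break  # end of the cycle; the previous frame is the ending frame
        top, left = y, x  # keep following the motion of the region
        if tau is None or cost > tau:
            ref = frames[k][y:y + h, x:x + w]  # new reference (starting frame)
            series.append(ref)
            positions.append((k, y, x))
    return series, positions

# Synthetic frames (assumed data): a texture moving one pixel per frame, and
# a static texture whose brightness increases by 0.05 per frame.
rng = np.random.default_rng(0)
base = rng.random((24, 24))
moving = [np.roll(base, k, axis=1) for k in range(6)]
fading = [base + 0.05 * k for k in range(7)]

_, pos_moving = string_rois(moving, 0, 8, 8, 6, 6)
series_cont, _ = string_rois(fading, 0, 8, 8, 6, 6)
_, pos_inter = string_rois(fading, 0, 8, 8, 6, 6, tau=0.12)
series_cycle, pos_cycle = string_rois(fading, 0, 8, 8, 6, 6, mu=0.12)
print(pos_moving, len(series_cont), pos_inter, pos_cycle)
```

On the slowly fading sequence, the inter-frame strategy keeps only every third ROI, which illustrates how $\tau$ trades temporal resolution for reduced redundancy, while $\mu$ bounds the length of a cycle.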

Feature Curves on SPD Manifold
Although an ROI is a semantic feature region in a frame, such a region is composed directly of image pixels. From the perspective of image recognition, such features are raw and coarse. Our proposed framework uses a regional covariance descriptor to extract an SPD feature matrix from each ROI, so as to transform a family of time series of ROIs into a family of feature curves on the Riemannian manifold. The specific methods are as follows. For a square ROI $\mathrm{ROI} = [r_1, r_2, \dots, r_d] \in \mathbb{R}^{D \times d}$, we can simply compute the corresponding RCD by

$$C = \frac{1}{d - 1} \sum_{i=1}^{d} (r_i - \mu)(r_i - \mu)^T, \quad \text{where} \quad \mu = \frac{1}{d} \sum_{i=1}^{d} r_i.$$

When the ROI is not square, or when its size is not uniform or its shape is irregular, we use each pixel of the ROI to generate a feature vector from which the covariance matrix is calculated. For an ROI with $\lambda$ pixels, each pixel generates a $D$-dimensional vector. The ROI can then be represented as $[v_1, \dots, v_\lambda] \in \mathbb{R}^{D \times \lambda}$, and the corresponding RCD is given by

$$C = \frac{1}{\lambda - 1} \sum_{i=1}^{\lambda} (v_i - \mu)(v_i - \mu)^T, \quad \text{where} \quad \mu = \frac{1}{\lambda} \sum_{i=1}^{\lambda} v_i.$$

The method of generating feature vectors is open and can vary with the application. For example, each pixel can generate a nine-dimensional feature vector composed of its RGB values and the first-order gradients of the RGB values in the X and Y directions. By doing so, no matter how the shape and size of the ROI vary, the size of the SPD feature matrix generated by the RCD is always $9 \times 9$.
In this way, all the resulting curves lie on SPD manifolds of the same dimension, which avoids the influence of the sizes of the original ROIs. Following [3], to avoid singularity, we adjust the original RCD as $C^* = C + \xi I$, where $I$ is the identity matrix and $\xi = 10^{-3} \times \mathrm{tr}(C)$. Then, we follow information geometry theory [37] to transform the Gaussian model $\mathcal{N}(\mu, C^*)$ into a $(D + 1) \times (D + 1)$ SPD matrix as the final RCD:

$$X = |C^*|^{-\frac{1}{D+1}} \begin{bmatrix} C^* + \mu \mu^T & \mu \\ \mu^T & 1 \end{bmatrix},$$

which means that the space of $D$-dimensional Gaussian models is embedded into the SPD manifold $\mathrm{Sym}_{++}^{D+1}$. By doing so, each $\mathrm{ROI}_{i,j}$ becomes an SPD matrix $X_{i,j}$, and a family of time series of ROIs $\{R_i\}_{i=1}^{m}$ is transformed into a family of time series of the embedded SPD matrices $\{\mathcal{X}_{R_i}\}_{i=1}^{m}$, with $\mathcal{X}_{R_i} = \{X_{i,1}, \dots, X_{i,L_i}\}$.

Geometrically, the RCD generated from an ROI is a point on the SPD manifold. A time series of SPD matrices $\mathcal{X} = \{X_1, \dots, X_N\}$ can therefore be defined as a feature curve $\Gamma(t)$, $0 \le t \le N - 1$, on the SPD manifold that passes through all SPD matrices belonging to $\mathcal{X}$ in sequence from $\Gamma(0) = X_1$ to $\Gamma(N - 1) = X_N$. Two adjacent SPD matrices are connected by a geodesic:

$$\Gamma(t) = \Gamma_{X_j}^{X_{j+1}}(t - j + 1), \quad j - 1 \le t \le j, \quad j = 1, \dots, N - 1,$$

where $\Gamma_{X_j}^{X_{j+1}}$ is the geodesic connecting $X_j$ with $X_{j+1}$. In other words, the feature curve representing a time series of ROIs is spliced together from multiple geodesics, one by one.
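The construction of an RCD from an irregular or arbitrarily sized ROI can be sketched as follows, using the nine-dimensional per-pixel feature mentioned above (RGB values plus their first-order gradients). The function names and the random test ROI are assumptions for illustration, and the gradients are approximated with `numpy.gradient`, which is one of several possible choices. The embedding of the Gaussian model is written in the commonly used form $|C^*|^{-1/(D+1)} [C^* + \mu\mu^T, \mu; \mu^T, 1]$, which is assumed here.

```python
import numpy as np

def pixel_features(roi_rgb):
    """Stack RGB values and their x/y gradients into a 9 x lambda matrix."""
    roi = roi_rgb.astype(float)
    gy, gx = np.gradient(roi, axis=(0, 1))  # derivatives along rows, columns
    feats = np.concatenate([roi, gx, gy], axis=2)  # shape (h, w, 9)
    return feats.reshape(-1, 9).T

def regularized_rcd(V):
    """Mean vector and regularized covariance C* = C + xi I, xi = 1e-3 tr(C)."""
    mu = V.mean(axis=1)
    Vc = V - mu[:, None]
    C = Vc @ Vc.T / (V.shape[1] - 1)
    xi = 1e-3 * np.trace(C)  # note: a constant ROI gives C = 0 and xi = 0
    return mu, C + xi * np.eye(V.shape[0])

def gaussian_embedding(mu, C):
    """Embed N(mu, C) as a (D+1) x (D+1) SPD matrix with unit determinant."""
    D = mu.shape[0]
    top = np.hstack([C + np.outer(mu, mu), mu[:, None]])
    bottom = np.hstack([mu, [1.0]])[None, :]
    P = np.vstack([top, bottom])
    logdet_C = np.linalg.slogdet(C)[1]
    return np.exp(-logdet_C / (D + 1)) * P

# Illustrative ROI (assumed data): a random 12 x 8 RGB patch.
rng = np.random.default_rng(0)
roi = rng.random((12, 8, 3))
V = pixel_features(roi)
mu, C_star = regularized_rcd(V)
X = gaussian_embedding(mu, C_star)
print(V.shape, X.shape, np.linalg.eigvalsh(X).min())
```

Because the determinant of the block matrix equals $|C^*|$ (by the Schur complement), the scaling factor normalizes the embedded matrix to unit determinant, regardless of the size of the ROI.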
Using this strategy, each time series of ROIs $R_i$ can be modeled as a curve $\Gamma_{R_i}(t)$, $0 \le t \le L_i - 1$, on the SPD manifold that passes through the SPD matrices $X_{i,1}, \dots, X_{i,L_i}$ in sequence. Thus, a short video can be modeled as a family of curves on the SPD manifold:

$$\left\{ \Gamma_{R_i}(t) \right\}_{i=1}^{m},$$

where the SPD manifold is a Riemannian manifold and this family of curves is the family of spatial-temporal feature curves of the video. Algorithm 1 summarizes the steps of computing the family of feature curves.
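A feature curve spliced from geodesics can be evaluated at any parameter value by locating the segment and evaluating the corresponding geodesic. The sketch below assumes the AIRM geodesic for each segment; the function names and the random SPD matrices are illustrative assumptions.

```python
import numpy as np

def sym_fun(X, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * fun(w)) @ V.T

def airm_geodesic(X1, X2, t):
    """Point at parameter t in [0, 1] on the AIRM geodesic from X1 to X2."""
    R = sym_fun(X1, np.sqrt)
    S = sym_fun(X1, lambda w: w ** -0.5)
    M = S @ X2 @ S
    return R @ sym_fun(0.5 * (M + M.T), lambda w: w ** t) @ R

def feature_curve(Xs, t):
    """Evaluate the piecewise-geodesic curve through Xs at t in [0, N - 1]."""
    N = len(Xs)
    if N < 2 or not 0 <= t <= N - 1:
        raise ValueError("need at least two matrices and 0 <= t <= N - 1")
    j = min(int(np.floor(t)), N - 2)  # 0-based index of the active segment
    return airm_geodesic(Xs[j], Xs[j + 1], t - j)

def random_spd(D, rng):
    """Hypothetical helper: a well-conditioned random SPD matrix."""
    A = rng.standard_normal((D, D))
    return A @ A.T + D * np.eye(D)

# Illustrative time series of four SPD matrices (assumed data).
rng = np.random.default_rng(0)
Xs = [random_spd(3, rng) for _ in range(4)]
mid_point = feature_curve(Xs, 1.5)  # halfway between the 2nd and 3rd matrix
print(np.linalg.eigvalsh(mid_point))
```

Every point of the curve remains SPD, so the Riemannian metrics and divergences apply along the whole curve and not only at the sampled matrices.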

Recognition between Families of Feature Curves
In our proposed method, time series of ROIs are extracted from short videos, and the comparison of similarity between short videos is transformed into a comparison between families of time series. The lengths of two time series may not be equal. Moreover, different frame rates, variable durations, and arbitrary starting/ending intensities of videos also pose obstacles. To tackle this problem, we adopt dynamic time warping (DTW) to find an alignment between two videos.
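DTW requires an appropriate metric for recognition. As introduced in Section 4.3, the time series of ROIs are transformed into feature curves on the SPD manifold, which makes discriminative Riemannian metrics and divergences available. Given two feature curves $\Gamma_1(t)$ and $\Gamma_2(t)$ concatenated from $L_1$ and $L_2$ SPD matrices, respectively, and a pair of warping paths $\beta = (\beta_1, \beta_2)$ between $\Gamma_1(t)$ and $\Gamma_2(t)$, similar to the optimal path between time series introduced in Section 3.2, the optimal path in the set of all possible paths $\Gamma$ between the two feature curves is

$$\beta^* = \arg\min_{\beta \in \Gamma} \sum_{l=1}^{L} \delta\left( \Gamma_1(\beta_1(l) - 1),\ \Gamma_2(\beta_2(l) - 1) \right),$$

where $\Gamma_1(\beta_1(l) - 1)$ is the $\beta_1(l)$-th SPD matrix on $\Gamma_1(t)$ and $\delta(\cdot, \cdot)$ can be any Riemannian metric or divergence on the SPD manifold. The similarity measure between two feature curves $\Gamma_1(t)$ and $\Gamma_2(t)$ under the optimal path can be defined as

$$\delta_{DTW}(\Gamma_1, \Gamma_2) = \sum_{l=1}^{L} \delta\left( \Gamma_1(\beta_1^*(l) - 1),\ \Gamma_2(\beta_2^*(l) - 1) \right).$$

The similarity measure between two videos then needs to fuse the information provided by each curve of its curve family. The classification strategy can be divided into two types. The first is an overall classification strategy, which is suitable when the number of ROIs $m$ in a pattern is small. Given the family of feature curves $\{\Gamma_i^{query}\}_{i=1}^{m}$ extracted from the query video $V_{query}$ and the families of feature curves $\{\Gamma_i^{V_j}\}_{i=1}^{m}$ of the sample videos $V_j$, $j = 1, \dots, n$, this approach compares each feature curve of $V_{query}$ with the corresponding curve of each sample video to obtain the $m \times n$ similarity measure matrix $\Psi$ using the DTW distance. The average similarity measure over all feature curves of $V_{query}$ represents the similarity measure between the query video and the sample videos. Taking the KNN classifier as an example, the steps of the overall classification method are shown in Algorithm 2.

Algorithm 2 Classification of a Family of Feature Curves
Input: Family of feature curves $\{\Gamma_i^{query}\}_{i=1}^{m}$ of the query video $V_{query}$; families of feature curves $\{\Gamma_i^{V_j}\}_{i=1}^{m}$ and labels $l_{V_j}$ of the sample videos, $j = 1, \dots, n$
Output: Predicted label $l_{query}$ of $V_{query}$
/* Compute DTW distances among sample and query feature curves */
for $i \leftarrow 1$ to $m$
    for $j \leftarrow 1$ to $n$ do
        $\Psi(i, j) \leftarrow \delta_{DTW}(\Gamma_i^{query}, \Gamma_i^{V_j})$
    end
end
/* Testing phase */
$\psi(j) \leftarrow \frac{1}{m} \sum_{i=1}^{m} \Psi(i, j)$ for $j = 1, \dots, n$
$l_{query} \leftarrow$ KNN classifier using the average DTW distance matrix $\psi$
Return Predicted label $l_{query}$ of $V_{query}$

The second is the pre-classification strategy, which is suitable when the number of ROIs $m$ in the pattern is large. Each feature curve of the query video $V_{query}$ is pre-classified independently according to the corresponding row of the similarity measure matrix $\Psi$. On the basis of the law of large numbers [38], the label with the highest frequency among $l_1^{query}, \dots, l_m^{query}$ is regarded as the label of the query video $V_{query}$. Taking the KNN classifier as an example, the steps of the pre-classification method are shown in Algorithm 3.

Algorithm 3 (pre-classification strategy)
Input: Family of feature curves $\{\Gamma_i^{query}\}_{i=1}^{m}$ of the query video $V_{query}$; families of feature curves $\{\Gamma_i^{V_j}\}_{i=1}^{m}$ and labels $l_{V_j}$ of the sample videos, $j = 1, \dots, n$
Output: Predicted label $l_{query}$ of query video $V_{query}$
/* Compute distances among sample and query feature curves */
for $i \leftarrow 1$ to $m$
    for $j \leftarrow 1$ to $n$ do
        $\Psi(i, j) \leftarrow \delta_{DTW}(\Gamma_i^{query}, \Gamma_i^{V_j})$
    end
    $l_i^{query} \leftarrow$ KNN classifier using the $i$-th row of $\Psi$
end
$l_{query} \leftarrow$ the most frequent label among $l_1^{query}, \dots, l_m^{query}$
Return Predicted label $l_{query}$ of $V_{query}$

In summary, our proposed framework focuses on short video recognition. The few transitions in short videos help to trace a coherent trajectory of each ROI with motion estimation. Stringing time series of ROIs extracts both spatial and temporal features from a short video. Furthermore, modeling the time series of ROIs as feature curves on the SPD manifold via RCDs introduces Riemannian geometry. Finally, our framework provides different strategies for stringing time series of ROIs, constructing RCDs, and classifying families of curves, which improves the universality and stability of the framework.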
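The two classification strategies can be contrasted in a short sketch. It assumes the LEM distance as the pointwise measure inside DTW, majority voting among the $K$ nearest samples, and that the $i$-th query curve is compared with the $i$-th curve of each sample video; all function names and the toy curves are hypothetical.

```python
import numpy as np
from collections import Counter

def spd_log(X):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def lem_distance(X1, X2):
    """Log-Euclidean distance between two SPD matrices."""
    return np.linalg.norm(spd_log(X1) - spd_log(X2), "fro")

def dtw_distance(c1, c2, dist):
    """DTW distance between two sequences of SPD matrices."""
    L1, L2 = len(c1), len(c2)
    Pi = np.full((L1 + 1, L2 + 1), np.inf)
    Pi[0, 0] = 0.0
    for i in range(1, L1 + 1):
        for j in range(1, L2 + 1):
            Pi[i, j] = dist(c1[i - 1], c2[j - 1]) + min(
                Pi[i - 1, j - 1], Pi[i - 1, j], Pi[i, j - 1])
    return Pi[L1, L2]

def similarity_matrix(query, samples, dist=lem_distance):
    """Psi[i, j]: DTW distance between curve i of the query and of sample j."""
    Psi = np.empty((len(query), len(samples)))
    for i, curve in enumerate(query):
        for j, sample in enumerate(samples):
            Psi[i, j] = dtw_distance(curve, sample[i], dist)
    return Psi

def knn_label(distances, labels, K):
    """Majority label among the K samples with the smallest distances."""
    nearest = np.argsort(distances, kind="stable")[:K]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

def overall_classification(Psi, labels, K=1):
    """Average the distances over all curves, then classify once."""
    return knn_label(Psi.mean(axis=0), labels, K)

def pre_classification(Psi, labels, K=1):
    """Classify each curve separately, then take the most frequent label."""
    votes = [knn_label(row, labels, K) for row in Psi]
    return Counter(votes).most_common(1)[0][0]

# Toy curves of 2 x 2 diagonal SPD matrices (assumed data).
def diag_curve(values):
    return [np.diag(v) for v in values]

curve_a = diag_curve([(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)])
curve_b = diag_curve([(1.0, 1.0), (1.0, 5.0), (1.0, 9.0)])
samples, labels = [[curve_a, curve_a], [curve_b, curve_b]], ["a", "b"]
warped = diag_curve([(1.0, 1.0), (2.0, 1.0), (2.0, 1.0), (3.0, 1.0)])
Psi = similarity_matrix([warped, warped], samples)
print(overall_classification(Psi, labels), pre_classification(Psi, labels))

# A hand-made distance matrix where one outlying curve changes the outcome.
Psi_outlier = np.array([[1.0, 2.0], [1.0, 2.0], [9.0, 2.0]])
print(overall_classification(Psi_outlier, labels),
      pre_classification(Psi_outlier, labels))
```

In the second example, a single poorly matched curve dominates the average and flips the overall decision, whereas the vote is unaffected; this illustrates why the pre-classification strategy is better suited to patterns with many ROIs.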

Application to Face Recognition
For face recognition in short video, where a video $F = \{F_1, \dots, F_k\}$ can be considered as a sequence of face images, we take the first frame with clear facial features as the key frame and define four square ROIs $\{\mathrm{ROI}_i\}_{i=1}^{4}$ located in the regions around the two eyes, the nose, and the mouth as a pattern (see Figure 1a). Using the continuous frame extraction strategy to trace the ROIs backward, we can string four time series of ROIs employing Equations (23) and (25). At the same time, in order to incorporate global spatial features, we also take the global face $G_1$ in the key frame as the starting point and link the next two nearest face images in time. For the time series of global spatial features $\{G_1, G_2, G_3\}$, we obtain the time series of their embedded SPD matrices $\mathcal{X}_G = \{X_{G_1}, X_{G_2}, X_{G_3}\}$. Then, we extend the time series of ROIs and global faces extracted from one short video to a family of feature curves $\{\Gamma_G, \{\Gamma_{R_i}\}_{i=1}^{4}\}$ on the SPD manifold. We set the weights of the four ROIs and the whole face to be equal and adopt the pre-classification strategy with KNN-DTW to classify the feature curves on the SPD manifold. Each of the five feature curves of a query video is pre-classified independently. For one query curve $\Gamma^{query}$, we find the $K$ training curves closest to the query curve and denote the set of these $K$ training curves by $N_K(\Gamma)$. The label of the query curve $\Gamma^{query}$ is obtained by solving

$$l_{\Gamma^{query}} = \arg\max_{l} \sum_{\Gamma_j \in N_K(\Gamma)} I\left( l, l_{\Gamma_j} \right),$$

where $I(l, l_{\Gamma_j})$ is the indicator function, which equals 1 if $l = l_{\Gamma_j}$ and 0 otherwise. Following the law of large numbers, the most frequent label among the five curve labels is taken as the label of the query video.

RieCovDs
Given an image set of n images, RieCovDs divides each image into m partially overlapping regions; that is, the image set is divided into m region sets, each containing n regions. RieCovDs models each region in the image set with a Gaussian model. For a region set, the set of Gaussian models is mapped to a set of SPD matrices, where $\mu_i$ is the mean vector and $C_i$ is the covariance matrix of the i-th Gaussian model. Finally, for each SPD matrix belonging to the image set, RieCovDs calculates a Riemannian local difference vector (RieLDV) [46]. The Riemannian covariance descriptor generated across the m region sets then represents the image set for recognition.

AidCovDs
AidCovDs proposes a framework representing image sets with approximate infinite-dimensional covariance descriptors (CovDs) based on the Riemannian kernel and the Nyström method [47,48]. Given an image set of n images, AidCovDs first calculates a covariance matrix of the SIFT or Gabor features of each image, giving $X = \{X_1, \cdots, X_n\} \subseteq \mathrm{Sym}^D_{++}$. The infinite-dimensional CovD in RKHS for $X$ is given by
$$C_X = \Phi(X) J_n J_n^T \Phi(X)^T,$$
where $J_n = n^{-3/2}\left(nI_n - 1_n 1_n^T\right)$, $1_n$ is a column vector of n ones, $\Phi(X) = [\phi(X_1), \cdots, \phi(X_n)]$, and $\phi : \mathrm{Sym}^D_{++} \to \mathcal{H}$ is a Riemannian kernel mapping. Considering a training set $Y = \{Y_1, \cdots, Y_m\} \subseteq \mathrm{Sym}^D_{++}$, the Riemannian kernel matrix $K_Y = [k(Y_i, Y_j)]_{m \times m} \in \mathbb{R}^{m \times m}$ of the training set can be approximated as $K_Y \cong Z^T Z = V E^{1/2} E^{1/2} V^T$, where $Z = E^{1/2} V^T \in \mathbb{R}^{d \times m}$, with $E$ being the diagonal matrix of the top d eigenvalues of $K_Y$ and $V$ being the matrix of the corresponding eigenvectors. On the basis of the Nyström method, $\Phi(X)$ is approximated in RKHS by $Z(X)$. The approximate infinite-dimensional CovD in RKHS for an image set can then be written as
$$C_Z = Z(X) J_n J_n^T Z(X)^T. \quad (35)$$
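As an illustration of the construction above, the following sketch builds a Nyström embedding from a training set of SPD matrices and then forms the descriptor $Z J_n J_n^T Z^T$. It rests on several assumptions that are not taken from the original work: a Gaussian kernel on the log-Euclidean distance stands in for the Riemannian kernel, the embedding uses the standard Nyström form $E^{-1/2} V^T k_Y(X)$, and all function names are hypothetical.

```python
import numpy as np

def spd_log(X):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def le_gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian kernel on the log-Euclidean distance (assumed stand-in kernel)."""
    d2 = np.linalg.norm(spd_log(X) - spd_log(Y), "fro") ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nystrom_fit(train, d, kernel=le_gaussian_kernel):
    """Top-d eigenpairs (E, V) of the training kernel matrix K_Y."""
    K = np.array([[kernel(a, b) for b in train] for a in train])
    w, V = np.linalg.eigh(K)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:d]         # keep the d largest
    return w[idx], V[:, idx]

def nystrom_embed(X_set, train, E, V, kernel=le_gaussian_kernel):
    """d x n matrix Z(X) = E^{-1/2} V^T [k(Y_i, X_j)] (standard Nystrom form)."""
    Kx = np.array([[kernel(y, x) for x in X_set] for y in train])  # m x n
    return (V / np.sqrt(E)).T @ Kx

def approx_covd(Z):
    """Approximate CovD Z J_n J_n^T Z^T with J_n = n^{-3/2} (n I_n - 1 1^T)."""
    n = Z.shape[1]
    J = float(n) ** -1.5 * (n * np.eye(n) - np.ones((n, n)))
    return Z @ J @ J.T @ Z.T
```

Since $J_n J_n^T = \frac{1}{n}\left(I_n - \frac{1}{n} 1_n 1_n^T\right)$, the descriptor is simply the biased sample covariance of the embedded columns, which is a convenient sanity check.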

CERML
Given n videos, CERML fuses both the Euclidean data (i.e., feature means) $X = \{x_1, \cdots, x_n\} \subseteq \mathbb{R}^D$ and the Riemannian representations (i.e., SPD matrices) $Y = \{Y_1, \cdots, Y_n\} \subseteq \mathrm{Sym}^D_{++}$ from videos. Data are transformed from the original Euclidean space and Riemannian space into RKHS via two transformation matrices $W_x \in \mathbb{R}^{n \times d}$ and $W_y \in \mathbb{R}^{n \times d}$. Transformed data in RKHS can be defined as where $K^x_{\cdot i}$ and $K^y_{\cdot i}$ are the i-th columns of the RBF kernel matrices $K^x \in \mathbb{R}^{n \times n}$ and $K^y \in \mathbb{R}^{n \times n}$ of $X$ and $Y$, respectively.
The first constraint is to minimize the distances between data with the same label and to maximize the distances between data with different labels: The second constraint aims to keep the Euclidean and Riemannian geometric relations in RKHS: where $\Lambda_1$ and $\Lambda_2$ are the neighborhood sizes. Then, the objective function with balancing parameters $\lambda_1 > 0$, $\lambda_2 > 0$ can be written as

LEML
Given n videos and their corresponding SPD matrices $X = \{X_1, \cdots, X_n\} \subseteq \mathrm{Sym}^D_{++}$, let $F : \mathrm{Sym}^D_{++} \to \mathrm{Sym}^d_{++}$ $(d \le D)$ be a mapping between manifolds, $X \in \mathrm{Sym}^D_{++}$ be the high-dimensional SPD matrix, and $F(X) \in \mathrm{Sym}^d_{++}$ be the lower-dimensional matrix. LEML aims to learn a tangent mapping $DF(X) : T_X \mathrm{Sym}^D_{++} \to T_{F(X)} \mathrm{Sym}^d_{++}$, where $T_X \mathrm{Sym}^D_{++}$ is the tangent space of $\mathrm{Sym}^D_{++}$ and $T_{F(X)} \mathrm{Sym}^d_{++}$ is the tangent space of $\mathrm{Sym}^d_{++}$. LEML uses a transformation matrix $W \in \mathbb{R}^{D \times d}$ to define the tangent mapping as $DF(\log(X)) = W^T \log(X) W$. The geodesic distance $D^Q_{le}(T_i, T_j) = \mathrm{tr}\left(Q (T_i - T_j)(T_i - T_j)\right)$ on the new SPD manifold $\mathrm{Sym}^d_{++}$ is obtained by substituting $W$ into the log-Euclidean distance on $\mathrm{Sym}^D_{++}$, where $Q = W W^T W W^T$ and $T_i = \log(X_i)$. LEML regards two points as similar if $D_{le}(T_i, T_j) \le u$ and as dissimilar if $D_{le}(T_i, T_j) \ge l$, where $D_{le}(\cdot, \cdot)$ is the geodesic distance, $u$ is the upper limit, and $l$ is the lower limit.
Finally, the objective function is given by where D ld is the LogDet divergence, Q 0 is an initialization of Q, D ld (Q, If the pair of samples come from the same class, δ ij = 1; otherwise, where ξ is a vector of slack variables.
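To make the tangent mapping used by LEML concrete, the sketch below maps an SPD matrix to a lower-dimensional one through $W^T \log(X) W$ followed by the matrix exponential, and measures the log-Euclidean distance between mapped points as a Frobenius norm in the projected tangent space. This is an illustrative sketch only: the function names are hypothetical, $W$ is supplied by the caller rather than learned, and the metric-learning objective itself is not implemented.

```python
import numpy as np

def spd_log(X):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def sym_exp(S):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def leml_map(X, W):
    """Map X in Sym^D_++ to Sym^d_++ through the tangent mapping W^T log(X) W."""
    return sym_exp(W.T @ spd_log(X) @ W)

def leml_distance(Xi, Xj, W):
    """Log-Euclidean distance between the mapped points F(Xi) and F(Xj)."""
    Ti, Tj = spd_log(Xi), spd_log(Xj)
    return np.linalg.norm(W.T @ (Ti - Tj) @ W, "fro")
```

Because the logarithm undoes the exponential on symmetric matrices, the distance can be computed directly from $W^T (T_i - T_j) W$ without ever forming $F(X_i)$ and $F(X_j)$.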
On the new manifold, the data points belonging to the same class should be as close as possible, and the points belonging to different classes should be as far away as possible. SPDML makes use of the notions of within-class similarity $g_w(\cdot, \cdot)$ and between-class similarity $g_b(\cdot, \cdot)$: where $N(X_i)$ is the collection of neighbors of $X_i$, $N_w(X_i)$ is the collection of neighbors belonging to the same class as $X_i$, and $N_b(X_i)$ is the collection of neighbors belonging to classes different from that of $X_i$. The affinity function is defined as $\alpha(X_i, X_j) = g_w(X_i, X_j) - g_b(X_i, X_j)$. Moreover, the loss function is where $\delta$ is a distance metric on the SPD manifold. To perform dimensionality reduction, the objective function is given by min

SPDSL
Consider n videos and their corresponding SPD matrices $X = \{X_1, \cdots, X_n\} \subseteq \mathrm{Sym}^D_{++}$ with labels $\{l_1, \cdots, l_n\}$, where $l_i = [0, \cdots, 1, \cdots, 0] \in \mathbb{R}^c$ and the k-th element being 1 indicates that $X_i$ belongs to the k-th of c total classes. Inspired by SPDML, SPDSL adopts the same full-rank matrix $W \in \mathbb{R}^{D \times d}$ to define the dimensionality reduction mapping, as well as the same within-class similarity $g_w(\cdot, \cdot)$ and between-class similarity $g_b(\cdot, \cdot)$ as SPDML.
Utilizing the supervised criterion of centered kernel target alignment [49,50], the objective function of SPDSL is given by where $\circ$ denotes the Hadamard product, $G = g_w + g_b$, $U = I_n - \frac{1_n 1_n^T}{n}$, $L = [l_1, \cdots, l_n]^T$, $k_{ij}(W) = \exp\left(-\alpha \delta^2(F(X_i), F(X_j))\right)$, $\alpha = \frac{1}{\sigma^2}$, and $\sigma$ is set to the mean distance of pairs in the training set.

++
are their corresponding Lie algebras in the unit tangent space mapped by the logarithmic mapping. Then, DALG defines the transformation $DF(X) : T_{I_D} \mathrm{Sym}^D_{++} \to T_{I_d} \mathrm{Sym}^d_{++}$ between unit tangent spaces as $DF(X) : Y_i = W^T X_i W$, $i = 1, \cdots, n$, with matrix $W \in \mathbb{R}^{D \times d}$. On the basis of the exponential and logarithmic mappings between the LGs and their unit tangent spaces, the transformation $F : \mathrm{Sym}^D_{++} \to \mathrm{Sym}^d_{++}$ is given by To maximize the similarity between points belonging to the same class and minimize the similarity between points from different classes in the low-dimensional LG, $W$ is optimized through a minimization and a maximization over $W \in \mathbb{R}^{D \times d}$, where $g_w(\cdot, \cdot)$ and $g_b(\cdot, \cdot)$ are shown in Equations (41) and (42). Combining both constraints, the overall objective function is min

Summary
In all the comparison algorithms introduced above, each video is regarded as an image set, and the whole set is represented by one SPD matrix without considering frame-to-frame correlation. Geometrically, a video is represented as a single point on the SPD manifold. In contrast, our framework is designed for short video: each short video is represented as a family of feature curves on the SPD manifold, with consecutive points connected by geodesics.

Database
The YouTube Celebrities (YTC) database [51] contains a large collection of YouTube videos of 47 celebrities. Each individual has three different long videos, and each long video is segmented into several video clips. In all, there are 1910 clips in the YTC database. Some examples are shown in Figure 2. Since all videos are encoded in MPEG4 at 25 fps with low resolution, the noise and poor imaging lead to a much more challenging recognition task.
We extracted gray-scale features (pixel values) of the face detected in each frame and resized it to 48 × 48. Then, histogram equalization was used for each face image. We conducted 10 cross-validation experiments and selected 20 individuals each time. Each person had six stochastically selected videos in the gallery/training set and three in the probes/testing set in an experiment.
ICT-BBT [52] contains large-scale video collections parsed from the whole first season of the TV series The Big Bang Theory (BBT). The BBT is a situation comedy in which most scenes are shot in bright rooms (see Figure 3 for an example). ICT-PB [52] is parsed from the TV series Prison Break (PB). In contrast to the BBT, the shooting scenes of ICT-PB (see Figure 4 for an example) vary widely, which results in large changes in lighting conditions and more facial occlusions such as shadows and railings. The frame sequence in each video clip is cut and resized into an image set with a size of 150 × 150 for each image. As for the YTC database, each face is resized to 48 × 48, and histogram equalization is applied to each face image. For each character, both the gallery/training set and the probes/testing set are composed of 10 randomly selected videos. We repeated the experiment 10 times and averaged the accuracy.

Method Setting
In our experiments, the size of each ROI was unified as a square of 16 × 16 pixels, and the search parameter was set to 7 pixels. Since MAD does not require multiplication, we took it as the matching criterion. With a full search algorithm, all 16 × 16 regions in the search window were compared to find the best match. With a search parameter of 7, the candidate displacement ranges over −7, ..., 7 in each direction, so the difference value needed to be calculated 15 × 15 times. Then, we needed to control the cumulative error, in which the factor 1/256 normalizes over the 16 × 16 pixels of an ROI. Once the cumulative error exceeds the upper limit µ, motion estimation does not continue. The upper limit µ was set so that the number of ROIs in most cycles is less than 15. We only ran one cycle of motion estimation for each video.
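The block-matching step described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: frames are taken to be 2-D gray-scale NumPy arrays, the cumulative error is assumed to be the running sum of the best MAD values along the trace, and the function names (`mad`, `full_search`, `trace_roi`) are hypothetical.

```python
import numpy as np

def mad(block_a, block_b):
    """Mean absolute difference between two equally sized blocks."""
    a = block_a.astype(np.float64)
    b = block_b.astype(np.float64)
    return np.mean(np.abs(a - b))

def full_search(prev_frame, next_frame, top, left, size=16, p=7):
    """Exhaustively test all (2p+1)^2 displacements and return (error, top, left) of the best match."""
    ref = prev_frame[top:top + size, left:left + size]
    height, width = next_frame.shape
    best = (np.inf, top, left)
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate block falls outside the frame
            err = mad(ref, next_frame[y:y + size, x:x + size])
            if err < best[0]:
                best = (err, y, x)
    return best

def trace_roi(frames, top, left, mu, size=16, p=7):
    """Follow one ROI through consecutive frames until the cumulative error exceeds mu."""
    positions = [(top, left)]
    cumulative = 0.0
    for prev, nxt in zip(frames[:-1], frames[1:]):
        err, y, x = full_search(prev, nxt, *positions[-1], size=size, p=p)
        cumulative += err
        if cumulative > mu:
            break  # stop tracing once the accumulated matching error is too large
        positions.append((y, x))
    return positions
```

With `p=7` the two loops visit 15 × 15 = 225 candidate positions per frame, which matches the count given in the text; faster search patterns exist, but the full search is the variant the paper describes.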
To be fair, the parameters in the comparison algorithms were set according to the original literature. For RieCovDs, the CovDs were calculated by IE-RieLDV-G; the sliding window was 16 × 16, the step size was 8 in both the horizontal and vertical directions, and α = 1 and β = 0.5. For AidCovDs, the features were extracted using SIFT, and the target dimensionality was D = 40. For CERML, λ 1 was set to 0.01, λ 2 was set to 0.1, k 1 was set to 1, k 2 was set to 20, σ s was the mean distance of the training data, and the iteration number was set to 20. For LEML, the parameters η and ζ were set to 10 and 0.1, respectively. For SPDML and SPDSL, the upper limit of the number of iterations was set to 50, v w was the minimum number of samples in one class, and v b was set by cross-validation. In DALG, v w and v b were set to 5 and 20, respectively. For the DR methods, the reduced dimensionality was searched in {20, 30, 40, 50, 60, 70, 80, 90}, and only the best results are shown. CERML, DALG, and LEML are based on the LEM, following the original works; for the other comparison algorithms, the two best-performing metrics/divergences were adopted.

Result and Analysis
In this section, we show the experimental comparison between different metrics/divergences within our proposed methods, as well as a comparison of our proposed methods and seven SPD-based comparison algorithms.
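For reference, the four metrics and divergences compared in this section (the AIRM and LEM of Section 2.2, and the Stein and Jeffrey divergences of Section 2.3) can be computed as in the sketch below. The formulas are the standard textbook forms and are assumed, not verified, to match the definitions given earlier in the paper; the function names are hypothetical.

```python
import numpy as np

def _eig_fun(X, f):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(X)
    return (V * f(w)) @ V.T

def airm(X, Y):
    """Affine-invariant Riemannian metric: ||log(X^{-1/2} Y X^{-1/2})||_F."""
    Xi = _eig_fun(X, lambda w: w ** -0.5)
    M = Xi @ Y @ Xi
    w = np.linalg.eigvalsh((M + M.T) / 2)  # symmetrize against rounding error
    return np.sqrt(np.sum(np.log(w) ** 2))

def lem(X, Y):
    """Log-Euclidean metric: ||log(X) - log(Y)||_F."""
    return np.linalg.norm(_eig_fun(X, np.log) - _eig_fun(Y, np.log), "fro")

def stein(X, Y):
    """Stein divergence: log det((X + Y) / 2) - 0.5 * log det(X Y)."""
    _, ld_mid = np.linalg.slogdet((X + Y) / 2)
    _, ld_x = np.linalg.slogdet(X)
    _, ld_y = np.linalg.slogdet(Y)
    return ld_mid - 0.5 * (ld_x + ld_y)

def jeffrey(X, Y):
    """Jeffrey divergence: 0.5 * tr(X^{-1} Y) + 0.5 * tr(Y^{-1} X) - n."""
    n = X.shape[0]
    return 0.5 * np.trace(np.linalg.solve(X, Y)) + 0.5 * np.trace(np.linalg.solve(Y, X)) - n
```

Under these definitions the AIRM and both divergences are unchanged by a congruence $X \mapsto A X A^T$ with invertible $A$, whereas the LEM in general is not, which is relevant to the discussion of affine invariance that follows.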
Since the feature curves we proposed are based on SPD geometry, the choice of metric/divergence on the SPD manifold, which determines the distance metric in DTW, is especially important. Hence, the AIRM and the LEM introduced in Section 2.2 and the Stein and Jeffrey divergences introduced in Section 2.3, each combined with the KNN classifier, were applied in the comparative experiments. Table 1 shows the average accuracies using these four metrics/divergences on the SPD manifold. Our method with the AIRM achieved the highest recognition accuracies on two databases. It should be noted that the AIRM defines a true geodesic distance. Although the LEM was confirmed to be much more efficient than the AIRM in [21], the LEM did not perform well in our proposed method. This may be because the LEM is not affine-invariant and therefore does not fully reflect the geometric relationship between two points on an SPD manifold.
Moreover, as introduced in Section 5, we utilized global spatial features (global faces) as the companion to regional spatial features (ROIs). To assess the improvement made by the global spatial features, we compared three settings of our method, namely, regional spatial features (ROIs) only, global spatial features (global faces) only, and the combination of both. As shown in Table 2, extracting only regional spatial features or only global spatial features from short videos underperformed in most cases, whereas the combination of ROIs and global faces provided richer information and achieved better accuracy than either alone.
Table 2. Recognition results with global spatial features and regional spatial features.

Columns: Method; Spatial Features; Database (YTC [51], ICT-BBT [52], ICT-PB [52]). First row label: Ours-Jeffrey.

The face recognition results of our method and seven SPD-based algorithms on three internet video face databases are summarized in Table 3. As can be seen from Table 3, our framework with the AIRM performed best on both the ICT-BBT and ICT-PB databases and came close to CERML on the YTC database. The DR methods LEML, SPDML, and SPDSL showed similar performance, possibly because set-based SPD matrix representations encode only approximate information about the global variations of videos. In contrast, the DALG, the CERML, and our proposed method generally outperformed the other SPD-based video recognition methods on both databases. This may be because the DALG utilizes the geometry of LGs, which provides high-order information, and the CERML fuses the Euclidean and Riemannian representations of videos, while our proposed method fuses both the major spatial features and the temporal features of video.
Table 3. Recognition results of comparison algorithms.

Conclusions and Future Work
In this paper, we propose a short video recognition framework that models the temporal evolution of ROIs in a short video as a family of feature curves on the SPD manifold, which fuses the spatial and temporal features of video. In this framework, the time series of ROIs are traced by motion estimation, which substantially reduces the computational cost compared with feature detection in every frame and provides a degree of information filtering. Moreover, by characterizing each ROI with the RCD, the original video recognition problem is transformed into the recognition of families of feature curves on the SPD manifold. Finally, the Riemannian metrics and divergences on the resulting SPD manifold provide appropriate distances in DTW to define similarity measures between feature curves. Our comparative experiments show that the proposed framework achieves competitive results on three challenging video-based face databases.
For future work, it would be interesting to study how the proposed framework, combined with feature detection methods, can be extended to other short video recognition tasks. Moreover, novel time series recognition methods based on the geometry of Riemannian manifolds remain to be explored.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.