Novel Cross-View Human Action Model Recognition Based on the Powerful View-Invariant Features Technique

Abstract: Human action recognition is one of the most important research topics today and is of significant interest to the computer vision and machine learning communities. Factors that hamper it include changes in posture and shape, as well as the memory space and time required to gather, store, label, and process the images. During our research, we noted that recognizing human actions from different viewpoints is considerably complex, which can be explained by the position and orientation of the viewer relative to the position of the subject. In this paper, we address this issue by learning view-invariant facets that are robust to view variations. Moreover, we explore view-specific as well as view-shared facets using a novel deep model built on the sample-affinity matrix (SAM). The SAM can accurately determine the similarities among video samples captured from diverse camera angles, which enables us to precisely fine-tune the transfer between views and to learn more detailed shared facets for cross-view action identification. Additionally, we propose a novel view-invariant facets algorithm that enabled us to better comprehend the internal processes of our project. Through a series of experiments on the INRIA Xmas Motion Acquisition Sequences (IXMAS) and the Northwestern–UCLA Multi-view Action 3D (NUMA) datasets, we show that our technique performs much better than state-of-the-art techniques.


Introduction
Human action data are ubiquitous, which is why such data are of significant interest to the computer vision [1,2] and machine learning communities [3,4]. Human action data can be studied from various views. One example is a group of dynamic actions captured by different camera views, as shown in Figure 1. Classifying this kind of data is quite difficult in a cross-view situation because the raw data are captured at varying locations by different cameras and, thus, could look completely different. An example is given in Figure 1a, where the action captured from a top view appears different from the one captured from a side view. This implies that features obtained from a single camera view cannot provide enough discriminative information to classify actions in another camera view. Many studies concentrate on developing view-invariant representations for action recognition [5,6], where all the actions captured on video are treated as time series of frames. Many approaches have utilized a self-similarity matrix (SSM) descriptor [5,7] to represent actions and have proven to be robust in cross-view scenarios. Information shared among camera views is kept and transferred to all the views [7,8]. These approaches assume that the shared features contribute equally to samples from different views. Through our research, we found that this assumption does not hold because the discriminative parameters of one view could be very far from those of other views. This may mislead classifiers, as they do not control the sharing of data between action categories, and lead to incorrect model results.
Future Internet 2018, 10, x FOR PEER REVIEW 2 of 17
As a response to the assumption of the equal contribution of shared features, in this paper, we put forward novel networks that learn view-invariant features for cross-view action categorization, and we introduce a novel sample-affinity matrix (SAM) that can accurately determine the similarities of video samples. By encouraging incoherence between shared and private features, we learned discriminative view-invariant information. Our approach retains two types of features: strong private features as well as shared features across views, acquired by a single autoencoder. SAM focuses on the resemblance between samples, whereas SSM concentrates on the video frames. The incoherence between the two types of features was achieved by encouraging incoherence between the mapping matrices. In addition, we stacked several layers of features and learned them in a layer-wise fashion. Experiments on three multi-view datasets show that our method performs much better than state-of-the-art methods.
The remainder of this paper is organized as follows. We first analyze works related to our topic and follow this with an analysis of view-invariant features. Then, we present a detailed description of the structure of our method together with the algorithm used. Finally, through a set of experiments, we show that our method performs much better than state-of-the-art methods.

Research Problem Definition
It is difficult for view-invariant methods to find similarities among frames captured simultaneously from different RGB camera (standard CMOS sensor camera) views. With an RGB-D camera (depth sensor added), however, this is not an issue because the camera provides a powerful feature that allows for easy extraction of the required feature from the plane (Figure 2). The only challenge is generating artificial views that have all the facets necessary for understanding the action. In this paper, we also introduce a novel algorithm based on the sample-affinity matrix (SAM) and several powerful autoencoders that allows for the extraction of the shared and unshared facets needed for identifying human action.

Related Research
A multi-view study, as cited in [4], establishes the similarities between two different views. Several other published methods serve the same purpose and concentrate on expressive as well as discriminative facets obtained from low-level observations [9][10][11][12][13][14]. Similarly, [15] exploited the intrinsic characteristics extracted from the views by combining color and depth information, which resulted in an improved perspective-invariant feature transform (PIFT) for RGB-D images. However, [15] focused on the combined use of the RGB and depth components and gave less attention to the features (shared and global) to be extracted from each viewpoint. Zhang et al. [16] learned, jointly in an end-to-end framework, a dictionary that converts 2D video to a view-invariant sparse representation and a classifier that recognizes actions from an arbitrary view. The authors of [16] also introduced a 3D trajectory that describes the action better; however, it does not emphasize the contribution of each view to that trajectory. Kerola et al. [17] represented an action as a temporal sequence of graphs and created a feature descriptor from it by applying a spectral graph wavelet transform. The authors of [17] also emphasized two well-known types of view-invariant graphs: key point-based and skeleton-based graphs. Rahmani et al. [18] devised a histogram of oriented principal components (HOPC) that is robust to noise. Unlike several other articles, [18] worked directly on point clouds for cross-view action recognition from unknown and unseen views. Hsu et al. [19] proposed a different approach to feature extraction that uses the Euclidean distance between the spatiotemporal characteristic vectors represented in a spatiotemporal matrix (STM).
Daumé and Kumar [20] presented a cotraining technique that trains a learning algorithm for every view and determines the correlation between pairs of information among the views. Zhang et al. [21] and He et al. [22] used canonical correlation analysis (CCA), which helps in maintaining a common space among various views. Moreover, [23] studied the challenge of incomplete views under the assumption that multiple views are generated from a shared space. Kumar et al. [24] provided another approach called generalized multi-view analysis (GMA). Liu et al. [11] found that matrix factorization is applicable to the clustering of multiple views, and [10] introduced the collective matrix factorization (CMF) method, which learns relationships between feature matrices. Similarly, Ding et al. [14] proposed a low-rank constrained matrix factorization model that addresses the challenge faced in multi-view learning.
Various studies have tried to provide solutions to the challenges associated with view-invariant action recognition. One of these challenges is the generation of action labels when multiple views are involved. Özuysal et al. [25] introduced an approach that offers a structured categorization of the 3D histogram of oriented gradients (HOG) and local separation in order to represent successive images. Dexter and colleagues [5,26] presented an SSM-based approach that computes a frame-wise similarity matrix of a video and extracts view-invariant descriptors in log-polar blocks of that matrix. A multitask learning technique can be used to enhance the representation power of SSM [7]. Shared facets among different views were examined more explicitly in [6,8,27,28]. Gao et al. [6] used an MRM-Lasso method to capture latent adjustments across views by learning a low-rank matrix comprising pattern-specific weights. Jiang et al. [8] and Jiang and Zheng [29] introduced dictionary pairs that encourage a shared sparse feature space.
Compared to other methods [20,22,23,30], our approach keeps multiple layers of learners and examines view-invariant facets more effectively. In addition, it captures complicated movements that are specific to certain views. Our approach uses private facets and encourages incoherence between shared and private facets. Compared to other methods of sharing knowledge [6,8,27,31,32], our approach shares information among different views according to sample similarities. Since samples from different classes can appear the same in some views, this enables us to distinguish the classes. While [30] calculated between-class and within-class Laplacian matrices, SAM Z directly captures within-class and between-class information. Moreover, the distance between two views of one sample is determined using SAM Z; such a distance cannot be encoded by [30].

Sample-Affinity Matrix (SAM)
SAM is used here to determine the similarities among views (a pair of videos). Assume that the input consists of training videos $\{X_t, y_t\}_{t=1}^{N}$, where $X_t$ denotes the data of the view frames at time interval $t$ and comprises N action videos. The SAM $Z$ is a block-diagonal matrix:

$$Z = \mathrm{diag}(Z_1, \ldots, Z_N), \qquad Z_i = \left[ Z_i^{mn} \right]_{m,n=1}^{V} \tag{1}$$

where diag(·) generates a diagonal matrix, and $Z_i^{mn}$ measures the distance between view frames m and n in the ith sample, obtained by $Z_i^{mn} = \exp(\|X_n - X_m\| / 2c)$ and parameterized by c. The block $Z_i$ in Z characterizes the appearance variations across different views within one class. This illustrates clearly how an action might be seen differently from different viewpoints and allows us to share information between views, which results in the construction of robust cross-view features. Furthermore, the zeros in the off-diagonal blocks of SAM Z limit the transfer of information between classes in the same view, which directly encourages separation between the features of different classes in the same view frame. It also makes it possible to differentiate multiple action categories that seem similar in some view frames.
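The construction of Z can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `sample_affinity_matrix`, the input layout (a list of V arrays of shape d × N) and the sample-major ordering of the blocks are assumptions made for illustration, and the entries follow the formula for $Z_i^{mn}$ literally, so the diagonal of each block evaluates to exp(0) = 1.

```python
import numpy as np

def sample_affinity_matrix(views, c=1.0):
    """Build a block-diagonal SAM from V view matrices of shape (d, N).

    views[m][:, i] is assumed to hold the feature vector of sample i in view m.
    Block i is the V x V matrix with entries exp(||x_n - x_m|| / (2c)).
    """
    V = len(views)
    N = views[0].shape[1]
    Z = np.zeros((V * N, V * N))  # off-diagonal blocks stay zero
    for i in range(N):
        block = np.empty((V, V))
        for m in range(V):
            for n in range(V):
                dist = np.linalg.norm(views[n][:, i] - views[m][:, i])
                block[m, n] = np.exp(dist / (2.0 * c))
        # place the block of sample i on the diagonal
        Z[i * V:(i + 1) * V, i * V:(i + 1) * V] = block
    return Z
```

With this layout, the columns of a stacked data matrix would have to be ordered sample by sample (all V views of sample 1, then all V views of sample 2, and so on) for products such as XZ to combine views of the same sample.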

Preliminary on Autoencoders
Based on popular deep learning approaches [33][34][35], we built an autoencoder (AE) that maps the raw input X to the hidden units H using an "encoder" $e_1(\cdot)$: $H = e_1(X)$, and uses a "decoder" $e_2(\cdot)$: $O = e_2(H)$ to map the hidden units to the outputs. The objective of learning an AE is to encourage similar or identical input-output pairs, so that the reconstruction loss is minimized after decoding:

$$\min \sum_{i=1}^{N} \left\| X_i - e_2(e_1(X_i)) \right\|^2 \tag{2}$$

where N is the number of training samples. Because the reconstruction process captures the structure of the incoming information, the neurons in the hidden layer represent the inputs. In contrast to the two-phase encoding and decoding in autoencoders, the marginalized stacked denoising autoencoder (mSDA) [33] reconstructs the corrupted inputs with a single mapping W:

$$\min_{W} \sum_{i=1}^{N} \left\| X_i - W\tilde{X}_i \right\|^2 \tag{3}$$

where $\tilde{X}_i$ represents the corrupted version of $X_i$, computed by setting each feature to 0 with a given probability p. To achieve a better result, mSDA needs to pass n times over the training set with a different corruption each time, which effectively performs a dropout regularization [36]. To obtain a robust transformation matrix W, n is set as n → ∞ so that mSDA effectively uses an infinite number of noisy copies of the data. Furthermore, mSDA is solved in closed form and is stackable.
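The closed-form solution of mSDA can be sketched as follows. This is a minimal illustration that assumes the marginalized form W = E[X X̃ᵀ] E[X̃ X̃ᵀ]⁻¹ with every feature kept independently with probability 1 − p; the bias feature used in some mSDA formulations is omitted, and the function name `msda_mapping` and the ridge term `reg` are assumptions of the sketch.

```python
import numpy as np

def msda_mapping(X, p, reg=1e-5):
    """Closed-form denoising mapping W for data X of shape (d, n).

    Each feature is assumed to be zeroed independently with probability p,
    and the corruption is marginalized out (the n -> infinity limit).
    """
    d = X.shape[0]
    S = X @ X.T                       # scatter matrix of the clean data
    q = np.full(d, 1.0 - p)           # survival probability of each feature
    # E[x_tilde x_tilde^T]: off-diagonal entries scale with q_i * q_j,
    # diagonal entries scale with q_i only
    Q = S * np.outer(q, q)
    np.fill_diagonal(Q, np.diag(S) * q)
    # E[x x_tilde^T]: column j scales with q_j
    P = S * q[np.newaxis, :]
    # W = P Q^{-1}; Q is symmetric, so solve the transposed system
    return np.linalg.solve(Q + reg * np.eye(d), P.T).T
```

With p = 0 there is no corruption, and the mapping reduces to (approximately) the identity, which is a quick sanity check for the expectations above.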

Single-Layer Feature Learning
As mentioned above, our proposed method is based on mSDA. We aimed to learn both shared features across multiple view frames and private features that belong to a single view frame for cross-view action recognition. Furthermore, to construct more robust features that are aware of the very large motion differences across view frames, we introduced SAM Z to learn shared facets with the objective of balancing information among view frames.
Using the objective function below, we learned private and shared features:

$$\min_{W,\{G^v\}} \Delta, \qquad \Delta = \psi + \sum_{v=1}^{V}\left(\alpha\,\phi^v + \beta\,r_1^v + \gamma\,r_2^v\right) \tag{4}$$

where $W$ and $\{G^v\}_{v=1}^{V}$ are, respectively, the mapping matrix used to learn shared features and a collection of mapping matrices used to learn the private features of each view frame, and $P^v = [W; G^v]$ stacks the two. The first term, ψ, is used to learn shared features (SF) among view frames; it reconstructs the action data of a single view frame with the help of the data extracted from other view frames. The second term, $\phi^v$, learns the private features of the vth view frame. To separate unshared facets that resemble shared features, the third term, $r_1^v = \|W^T G^v\|_F^2$, minimizes the redundancy between the mapping matrices, and the fourth term, $r_2^v = \mathrm{Tr}(P^v X^v L X^{vT} P^{vT})$, enforces the similarity of the private and shared features that belong to the same class and view frame. The parameters α, β, and γ, which are discussed below, balance these components. It is important to note that, in cross-view action recognition, the data from all view frames are available in training so that our model can learn private and shared features. In the testing phase, however, the data from some view frames are not available.

Shared Features
Humans can easily recognize an action from a single view, but how does the action appear if it is observed from multiple views? Humans can answer this question because we regularly observe the same action from several views. This motivated us to reconstruct the action data of one view (the target view) with the help of the action data from multiple views (the source views). The disparity ψ between the data of all the V source views and the data of the vth target view can be expressed as:

$$\psi = \sum_{i=1}^{N}\sum_{v=1}^{V}\left\| W\tilde{X}_i^v - \sum_{u=1}^{V} X_i^u Z_i^{uv} \right\|^2 = \left\| W\tilde{X} - XZ \right\|_F^2 \tag{5}$$

where $Z_i^{uv}$ is the weight measuring the contribution of the uth view action to the reconstruction of the sample $X_i^v$ of the vth view frame, and $W \in \mathbb{R}^{d \times d}$ is a single linear mapping for the corrupted inputs $\tilde{X}_i^v$ of all the view frames. The sample-affinity matrix $Z \in \mathbb{R}^{VN \times VN}$ encodes all the weights $\{Z_i^{uv}\}$. The corrupted version of the matrix X, $\tilde{X} \in \mathbb{R}^{d \times VN}$ [33], performs a dropout regularization on the model [33,36]. Furthermore, SAM Z is used to accurately regulate the information transfer among view frames and to learn discriminative shared features more easily. Instead of using equal weights [8], all V views of the ith training sample contribute, with different weights, to the reconstruction of that sample in the vth view. As illustrated in Figure 3, a sample of the side view frame (S1) is more similar to the side view t (the target view) than a sample of the top view frame (S2) is. This results in a larger weight for S1, with the objective of learning more expressive features for t.
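The shared-feature term can be evaluated numerically with a short NumPy sketch. It assumes that the disparity takes the matrix form ψ = ‖W X̃ − X Z‖²_F, that the columns of X are ordered sample by sample so that they line up with the blocks of Z, and that the corruption zeroes each entry independently with probability p; the helper names `corrupt` and `shared_loss` are hypothetical.

```python
import numpy as np

def corrupt(X, p, rng):
    """Zero each entry of X independently with probability p."""
    keep = rng.random(X.shape) >= p
    return X * keep

def shared_loss(W, X, X_tilde, Z):
    """Shared-feature reconstruction loss ||W X_tilde - X Z||_F^2.

    X and X_tilde have shape (d, V*N); Z has shape (V*N, V*N).
    Column k of X @ Z is the Z-weighted combination of the views that
    Z couples with column k, that is, the reconstruction target.
    """
    residual = W @ X_tilde - X @ Z
    return np.linalg.norm(residual, ord="fro") ** 2
```

A larger entry of Z for a given source view increases that view's share in the reconstruction target, which is how the weighting described above enters the loss.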

Private Features
Discriminative information can still be found in each view despite the information shared across view frames. To improve the robustness of that information, we used the robust feature learning of [33]: with a matrix $G^v \in \mathbb{R}^{d \times d}$, we learned view-specific private features for the samples in the vth view frame:

$$\phi^v = \left\| G^v \tilde{X}^v - X^v \right\|_F^2 \tag{6}$$

where $\tilde{X}^v$ is the corrupted version of the feature matrix $X^v$ of the vth view. Using the corresponding inputs of the different view frames, we obtained the V mapping matrices $\{G^v\}_{v=1}^{V}$. Note that Equation (6) can retain some redundant shared information from the vth view; by promoting incoherence between the view-specific mapping matrix $G^v$ and the view-shared mapping matrix W, we reduce this redundancy:

$$r_1^v = \left\| W^T G^v \right\|_F^2 \tag{7}$$

Label Information
Action data collected from different viewpoints may come with a considerable proportion of posture and motion variations. Thus, the features (private and shared) obtained by applying Equations (5) and (7) may not be sufficiently discriminative for classifying actions with considerable variation. To address this issue, our approach enforces similarity between the private and shared features of samples that belong to the same class and the same view. To regularize the view-specific mapping matrix $G^v$ and the view-shared mapping matrix W, we defined a "within class" and "within view frame" variance as:

$$r_2^v = \mathrm{Tr}\left(P^v X^v L X^{vT} P^{vT}\right) \tag{8}$$

where $P^v = [W; G^v]$, and $L \in \mathbb{R}^{N \times N}$, $L = D - A$, is the label-view Laplacian matrix with degree matrix $D_{(i,i)} = \sum_{j=1}^{N} a_{(i,j)}$ and adjacency matrix A. The (i,j)-th element $a_{(i,j)}$ of A is 1 if $y_i = y_j$ and 0 if $y_i \neq y_j$. Because Equation (5) already implicitly requires facets from different view frames to be similar, we do not need to enforce that requirement here. Furthermore, by using facets from different view frames of a given sample, we can better represent the expected facets of that sample. This is possible because the mapping function (based on the mapping matrix W) maps the shared features (SF) to a new space. Because label information is used in Equation (8), the approach is supervised (SA); by setting γ = 0, we can derive an unsupervised variant. In what follows, we refer to the supervised approach as SA.
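The label Laplacian and the within-class variance term can be sketched in a few lines of NumPy. The sketch assumes that $P^v$ is formed by stacking W on top of $G^v$ and that A includes the diagonal entries (since $y_i = y_i$); the function names `label_laplacian` and `within_class_variance` are hypothetical.

```python
import numpy as np

def label_laplacian(y):
    """Return L = D - A, where a_ij = 1 if y_i == y_j and 0 otherwise."""
    y = np.asarray(y)
    A = (y[:, None] == y[None, :]).astype(float)  # adjacency from labels
    D = np.diag(A.sum(axis=1))                    # degree matrix
    return D - A

def within_class_variance(W, G_v, X_v, L):
    """Tr(P X L X^T P^T) with P = [W; G_v] (stacking is an assumption)."""
    P_v = np.vstack([W, G_v])
    F = P_v @ X_v                                 # mapped features, one column per sample
    return np.trace(F @ L @ F.T)
```

Since Tr(F L Fᵀ) equals half the sum of $a_{(i,j)} \|f_i - f_j\|^2$ over all pairs, the term vanishes exactly when all samples of the same class are mapped to the same feature vector.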

Learning Process
Using a coordinate descent algorithm, we optimized the parameter W and the V mapping matrices $\{G^v\}_{v=1}^{V}$, which solves the optimization problem in Equation (4). In every step, one parameter matrix is updated while the others are fixed, by computing the derivative of ∆ with respect to that parameter and setting it to 0. First, we update W by setting the derivative $\partial\Delta / \partial W = 0$ while the matrices $\{G^v\}_{v=1}^{V}$ are fixed. By performing the corruption m → ∞ times, we obtain $F_1$ and $F_2$ in place of $\tilde{X}\tilde{X}^T$ and $XZ\tilde{X}^T$, respectively. As referred to in [33], the weak law of large numbers gives $F_1 = E_p(\tilde{X}\tilde{X}^T)$ and $F_2 = E_p(XZ\tilde{X}^T)$, where p is the corruption probability.
Similarly, by setting the derivative $\partial\Delta / \partial G^v = 0$, the parameter $G^v$ is updated while $\{G^u\}_{u=1,u\neq v}^{V}$ and W are fixed. The values of $X^v\tilde{X}^{vT}$ and $\tilde{X}^v\tilde{X}^{vT}$ are likewise replaced by their expectations $E_p$ under the corruption probability p. To solve the problem in Equation (4), we subdivide it into V + 1 subproblems, each of which is convex in its single variable. Thus, the learning algorithm finds an optimal solution to each subproblem and consequently converges to a local solution.
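One coordinate step for W can be illustrated under explicit assumptions. The sketch below is not the authors' closed-form update: it assumes that the W-dependent part of ∆ is ‖W X̃ − X Z‖²_F + β Σ_v ‖Wᵀ Gᵛ‖²_F + γ Σ_v Tr(W Xᵛ L Xᵛᵀ Wᵀ), uses a single corrupted copy X̃ instead of the expectations F₁ and F₂, and solves the resulting stationarity condition A W + W B = C exactly through its Kronecker form. The function name `update_W` and the argument `M_list` (the matrices Xᵛ L Xᵛᵀ) are hypothetical.

```python
import numpy as np

def update_W(X, X_tilde, Z, G_list, M_list, beta, gamma):
    """Solve dDelta/dW = 0 for W with all G^v held fixed.

    Setting the gradient of the assumed objective to zero gives the
    Sylvester equation A W + W B = C with
      A = beta * sum_v G^v G^v^T,
      B = X_tilde X_tilde^T + gamma * sum_v M_v,
      C = X Z X_tilde^T.
    """
    d = X.shape[0]
    A = beta * sum(G @ G.T for G in G_list)
    B = X_tilde @ X_tilde.T + gamma * sum(M_list)
    C = X @ Z @ X_tilde.T
    I = np.eye(d)
    # vec(A W + W B) = (I kron A + B^T kron I) vec(W), column-major vec
    K = np.kron(I, A) + np.kron(B.T, I)
    w = np.linalg.solve(K, C.flatten(order="F"))
    return w.reshape((d, d), order="F")
```

The Kronecker system has d² unknowns, so this direct solve is only practical for small feature dimensions; it is meant to make the alternating structure concrete rather than to be efficient.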
Every method mentioned above has advantages and disadvantages that need to be taken into account when designing a solution, as shown in Table 1. We also list our own approach as the alternative.

| Approach | Advantages | Disadvantages |
|---|---|---|
| Multi-view learning approach [9][10][11][12][13][14] | Focuses on expressive and discriminative features | Does not focus much on private features |
| Cotraining method [20] | Trains various learning algorithms for every view and finds explicit correlation of two pairs of information among various views | Cannot handle more than two views simultaneously |
| [21,22] | Maintains common distance between views, and utilizes the two projection matrices on a common feature space in order to map multimodal information | Has little interest in private features |
| [25] method | Achieves a structured categorization of the 3D histogram of oriented gradients (HOG) | Does not keep enough layers of learners |
| [30] method | Calculates between-class as well as within-class Laplacian matrices | Does not measure the space between two views of the same sample |
| Our approach | Can keep several layers of learners so as to study view-invariant features in a more effective manner; balances the sharing of information among views, as per sample similarities; measures the distance between two views using SAM Z | Because of the large amount of computation involved, the approach can process fewer views; requires the use of various computer resources |

The Design of the Proposed Approach-Deep Architecture
Following [33,37], we built a deep model by stacking multiple layers of the features discussed in Single-Layer Feature Learning. We applied the nonlinear feature mapping function θ(·) to the output of every layer, $C_g^v = \theta(G^v X^v)$ and $C_w = \theta(WX)$, so that the outcome is a series of latent feature matrices. We used a layer-wise training approach to train the networks. Note that the outputs of the kth layer, $C_{kg}^v$ and $C_{kw}$, are the inputs of the (k + 1)th layer, from which the matrices $\{G_{k+1}^v\}_{v=1}^{V}$ and $W_{k+1}$ are learned. For k = 0, which corresponds to the input of layer 1 of the K layers, $C_{0g}^v$ and $C_{0w}$ are the raw features $X^v$ and $X$, respectively.
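The layer-wise scheme can be sketched as a simple loop. The example below is a simplified stand-in, not the proposed model: each layer learns only one denoising map by ridge regression on a single corrupted copy of its input (instead of the shared and private mappings), and θ is taken to be tanh; the function names `fit_denoising_map` and `stack_layers` and the parameters `p` and `reg` are assumptions.

```python
import numpy as np

def fit_denoising_map(C, p, rng, reg=1e-3):
    """Ridge-regularised map that reconstructs C from one corrupted copy."""
    C_tilde = C * (rng.random(C.shape) >= p)   # zero entries with probability p
    d = C.shape[0]
    gram = C_tilde @ C_tilde.T + reg * np.eye(d)
    return C @ C_tilde.T @ np.linalg.inv(gram)

def stack_layers(X, K, p=0.1, seed=0):
    """Train K layers one after another; layer k+1 takes the output of layer k."""
    rng = np.random.default_rng(seed)
    outputs = [X]                              # C_0 is the raw feature matrix
    for _ in range(K):
        W_k = fit_denoising_map(outputs[-1], p, rng)
        outputs.append(np.tanh(W_k @ outputs[-1]))  # theta = tanh (assumption)
    return outputs
```

The list returned by `stack_layers` holds the raw features followed by the K latent feature matrices, which could then be concatenated or used individually as the representation of an action.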

Flow Chart of our Project
Figure 3 shows the steps of our project. These steps are described below:
• Take three videos from different perspectives at the same time: Here, we try to capture images of a person from varying angles.
• We then obtain key features from the captured pictures by utilizing Equations (5)-(7): The pictures obviously have various features in common because they belong to one subject and they were taken at the same time. These features are called shared features, while the unique features that every picture has are called private features. We submit these two types of features to the next component as input.
• Applying a novel invariant feature algorithm: This step is a learning point pertinent for the process.
• Create the target views: In this step, we solve the sample-affinity matrix Z for every arrow. We also solve the W mapping matrix and create the target view having all the relevant features that will help in understanding the action.
• Allocate a label and an explanation of the action taking place.

Novel View-Invariant Features Algorithm
In this paper, we aim to present a powerful view-invariant feature. Considering two subjects acting in the same environment, as shown in Figure 4a, our algorithm selects the zone of interest and extracts shared and private features, as shown in Figure 4b. It uses SAM to determine the similarities among views (a pair of videos). Additionally, Figure 4c shows a clear mapping among views, as described by our computed mapping matrix W.
Using our supervised technique, we determined the similarity between our target view and the source views (Figure 5). The view-shared mapping matrix W is used to accomplish this: it is incorporated into Equation (5) to calculate the weight value (Z) and obtain the shared feature. The arrow with the largest weight is the one situated between our target view and source view.
Here (Figure 6), we derived a supervised approach from our previous one. W is the mapping matrix updated through several iterations (as shown in our algorithm), while the shared and private features are determined. Our aim is to generate a target view (obtained from several viewpoints after applying the W matrix) that is accurate enough for the weight (Z) of the arrow to be the same as the one obtained after applying Equation (5).
i←  +  end while      In opposite of [3,6,8,27,28] where features extracted from views are used individually, we further process those features, and we derive a target view which contends the combined relevant features.
As specified in (Figures 4 and 6).Our approach achieves state of the art result because we apply the recognition algorithm on the target view.Thus, it is done at this level as we have a single viewpoint.

Experiment
We evaluated our method using three multi-view datasets: the Daily and Sports Activities (DSA) dataset [3], the multi-view IXMAS dataset [39], and the Northwestern-UCLA Multi-view Action 3D (NUMA) dataset [40]. These datasets were also used in many other studies, such as [3,6,8,27,28]. We considered two cross-view classification scenarios, one-to-one and many-to-one. The first trains on one view and tests on another view, while the second trains on V − 1 views and tests on the remaining view. For the vth view used for testing, the corresponding X^v in Equation (5) was set to 0. We employed the intersection kernel support vector machine (IKSVM) as our classifier with parameter C = 1. The default parameters were γ = 1, β = 1, α = 1, p = 0, and K = 1, with the default number of layers being 1.
NUMA and IXMAS are multi-camera-view video datasets. NUMA was acquired with three Kinect sensors in five scenarios comprising 10 human actions, while IXMAS was recorded with one top-view camera and four side-view cameras. We employed k-means clustering to create video words and quantize the descriptors, so that a given video can be represented by a histogram of its features.
As a further contribution, we were curious to see how our model would react if the action was not recorded continuously; that is, can our model keep the same accuracy if we remove some frames of the video? As described in Section 4.1, for an interval of time t, we have N actions. If we pause the recording for a short period (p << t), we can still see a flow described by the feature points of the target view, as shown in Figure 7. It is also worth noting that an action captured from V camera angles is represented by V feature vectors.
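The video-word quantization and the intersection kernel mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the codebook stands in for the output of k-means, and the function names and all numbers are hypothetical.

```python
def nearest_word(descriptor, codebook):
    """Index of the codeword closest (squared Euclidean distance) to a descriptor."""
    def sq_dist(word):
        return sum((a - b) ** 2 for a, b in zip(descriptor, word))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))

def bow_histogram(descriptors, codebook):
    """L1-normalised histogram of video-word counts for one video."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_word(d, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

def intersection_kernel(h1, h2):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Toy codebook (as if produced by k-means) and descriptors; all values are made up.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
video_a = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [3.9, 4.2]]
video_b = [[0.0, 0.2], [0.2, 0.1], [4.1, 3.8], [3.7, 4.0]]
hist_a = bow_histogram(video_a, codebook)
hist_b = bow_histogram(video_b, codebook)
similarity = intersection_kernel(hist_a, hist_b)
```

A kernel matrix built from `intersection_kernel` over all pairs of training histograms is what a precomputed-kernel SVM would consume; the SVM training itself is omitted here.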

Using the IXMAS Dataset
Just like in [33], we extracted dense trajectory and histogram of oriented optical flow features, and a dictionary for every feature type was obtained using k-means. A bag-of-words model was then used to encode the features and turn every video into a feature vector. For a fair comparison, a "leave one action class out" training scheme was used, just like in [8] and [28]: each time, a single action class was used for testing. Moreover, to examine the ability of our method to transfer information, we excluded all videos of that class from the feature learning step. We also introduced a fifth component, which was the result of our method when we performed several pauses p (p << t) (Table 2).
We compared our method with the well-known methods in [5,25,28,29] and found that our method achieved approximately 99.9%, as shown in Table 3. Our method, Seb1, outperformed the other approaches, which illustrates the benefit of our private and shared feature approach. To determine the resemblance among video samples across camera views, our method uses the sample-affinity matrix, so that the learned shared features accurately characterize the similarity across views. Furthermore, the learned private features become more informative for classification, as the redundancy between private and shared features is reduced.
In this experiment, we trained our model with data from a single camera view and tested it on data from another view. The private features were discarded here and only the learned shared features were used, since the private features of one view do not suffice.
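The two cross-view scenarios can be enumerated with a short sketch. The function names are hypothetical; the only fact taken from the text is that IXMAS has five camera views (one top view and four side views).

```python
from itertools import permutations

def one_to_one_splits(num_views):
    """All ordered (train_view, test_view) pairs with distinct views."""
    return list(permutations(range(num_views), 2))

def many_to_one_splits(num_views):
    """For each test view, train on the remaining V - 1 views."""
    views = list(range(num_views))
    return [([v for v in views if v != test], test) for test in views]

V = 5  # IXMAS: one top-view camera and four side-view cameras
pairs = one_to_one_splits(V)
folds = many_to_one_splits(V)
```

With five views, the one-to-one scenario yields 5 × 4 = 20 ordered train/test combinations, and the many-to-one scenario yields five folds, one per held-out test view.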
Comparing our approach (Seb1) with the recognition results reported in [8,29,44] (Table 4), our method performed best in 18 out of 20 combinations, which is considerably better than all the other methods. Our approach reached 99.8% in 16 instances, showing the potency of the learned shared features. Owing to the discriminative information obtained from the label information and the learned shared features, our approach is robust to viewpoint variation and achieves high cross-view performance.
The features used here are similar to those used for the IXMAS dataset. As mentioned in [40], many-to-one cross-view recognition accuracy is reported in three scenarios: cross-camera view, cross-subject, and cross-environment. Following [40], our approach is compared with [31,40,45-47].
Table 5 shows that our approach outperforms [40] with low-resolution visual features (LowR) by 10.4% and 3.9% in the cross-environment and cross-view scenarios, respectively. Furthermore, in the cross-subject scenario, it also performs very strongly compared with [40] with LowR added. A close look at the comparisons shows that the largest performance gains of Seb1 in the cross-environment, cross-subject, and cross-view scenarios are, respectively, 62.3% (over [48]), 30.4% (over [48]), and 32.0% (over [31]). This large gain is due to the use of the SAM to determine the similarities of samples across views and of shared and private facets for modeling cross-view data.
Table 5 (excerpt):
[40]: 78.9, 65.3, 71.9
Maji et al. [46]: 54.9, 24.5, 48.5
Our supervised method: 83.2, 77.3, 89.8

Parameter Analysis
We now evaluate the sensitivity of our method to the parameters α, β, and γ, as shown in Figure 8. The figure reports the mean performance of many-to-one cross-view action recognition for parameter values of 0.001, 0.01, 0.1, 1, 10, and 100. We concluded that our approach is insensitive to these parameter values: the largest performance gap observed across the settings of α, β, and γ was 2%, which again indicates that our method is robust.
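A sweep of this kind can be organized as in the sketch below. It assumes that one parameter is varied at a time while the other two stay at their default value of 1, which the text does not state explicitly. The helper names are hypothetical, and the accuracies are placeholders for illustration, not results from our experiments.

```python
GRID = [0.001, 0.01, 0.1, 1, 10, 100]
DEFAULTS = {"alpha": 1, "beta": 1, "gamma": 1}

def one_at_a_time_settings(grid=GRID, defaults=DEFAULTS):
    """Vary one parameter over the grid while the others stay at their defaults."""
    settings = []
    for name in defaults:
        for value in grid:
            setting = dict(defaults)
            setting[name] = value
            settings.append(setting)
    return settings

def max_performance_gap(accuracies):
    """Largest difference between any two accuracies in a sweep."""
    return max(accuracies) - min(accuracies)

settings = one_at_a_time_settings()
# Placeholder accuracies (percent) for illustration only; NOT measured values.
toy_accuracies = [80, 81, 82, 81, 80, 79]
gap = max_performance_gap(toy_accuracies)
```

The reported robustness figure corresponds to `max_performance_gap` applied to the measured accuracies of each sweep.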

Conclusions
In this paper, we proposed an approach that can label human actions across views. Our research focuses on two new methods that use unshared and shared facets to precisely classify human actions with varying appearances and viewpoints. We introduced a sample-affinity matrix (SAM) to determine similarities across views. This matrix is also used to monitor the shared features and to control the transfer of information, so that the contribution of every sample can be measured accurately. Our methods can stack several layers of learners to learn view-invariant features more efficiently. They balance the sharing of information among views according to sample similarities and can measure the distance between two views using the SAM Z. We also found that our method is robust to some variation in the recording time, as shown in Table 2. However, we noticed during our experiments that our approach requires substantial computing resources, and we will further improve our model so that it uses fewer. To help measure the contribution of every sample precisely, we carried out a series of experiments on NUMA and IXMAS and found that our methods performed well in classifying cross-view actions. Having seen the potential of our approach, we now intend to work on the extracted image flow and space-time, instead of taking it over a time t, in order to handle activities. We also intend to adjust our model and algorithm, building on our previous projects [49], so that they capture many activities more accurately.

Figure 1. A multi-view situation: (a) human actions captured from different perspectives; (b) sensors attached to the body of a human being to gather enough action data (images from Google Images).

Figure 2. The figure shows a recap of our features extractor.

Figure 3. Flow chart of the project.

Figure 4. (a) Powerful view-invariant feature for two subjects acting in the same environment; (b) our algorithm selects the zone of interest and extracts shared and private features; and (c) clear mapping among views.

Figure 5. Extracting the features from source pictures and creating the target view (image retrieved from the public dataset IXMAS).

Figure 6. Extraction of the features of the source views, followed by the computation of the mapping matrix (W) as well as the target view (t) (image obtained from a public dataset [38]).

Figure 7. Three-dimensional dense trajectories of the target view obtained by following each feature point during the interval of time t. This can be further interpreted as the set of positions occupied by each feature point of the target view [16].

Future Internet 2018, 10, x FOR PEER REVIEW

Figure 8. Robustness evaluation of our approach by applying different values of α, β, and γ.

Table 1. Advantages and disadvantages of the approaches used.

Table 2. Cross-view recognition outcome for different controlled methods on the IXMAS dataset. The values enclosed in brackets are the recognition accuracies for [8,29,41,42] and our supervised method (with and without the p pause). The experiment enabled us to evaluate our method of learning shared and private features.

Table 4. Many-to-one cross-view action recognition results on the IXMAS dataset, where each column corresponds to a test view.

Table 5. Cross-view, cross-environment, and cross-subject action recognition outcomes on the NUMA dataset.