
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

With the increasing demand for smart and networked cameras in intelligent and ambient technology environments, the development of algorithms for such resource-distributed networks is of great interest. Multi-view action recognition addresses many challenges related to view invariance and occlusion, but due to the huge amount of data to be processed and communicated in real-life applications, such methods are not easy to adapt to smart camera networks. In this paper, we propose a distributed activity classification framework in which several camera sensors observe the scene. Each camera processes its own observations and, by communicating with the other cameras, the nodes come to an agreement about the activity class. Our method recovers a low-rank matrix over consensus to perform distributed matrix completion via convex optimization, which we then apply to the problem of human activity classification. We test our approach on the IXMAS and MuHAVi datasets to show the performance and the feasibility of the method.

A camera sensor network (CSN) is defined as a set of vision sensors that communicate over a network. Each smart camera node also has its own processing element and memory. With such a setting, many applications can be addressed, thanks to the ease of deployment and the robustness of these networks. For instance, smart homes, intelligent environments and robot coordination are potential applications that can lead to a better quality of life. Traditional systems make each camera transmit its image data or low-level features over the network to a centralized processing unit, which analyzes everything in a centralized fashion.

Human action recognition has been proven to have many applications, including vision-based surveillance [

Understanding the events and activities of humans in video sequences is a challenging task, due to several issues, including: (1) the large variability in imaging conditions, as well as in the way different people perform a particular action; (2) background clutter and motion; (3) the high dimensionality of such data; and (4) the large amount of occlusion in real-world environments. Many previous works have targeted these challenges by introducing different sets of features [

Rank minimization has recently gained a lot of attention, due to its simplicity and effectiveness in solving many problems. As noted by [

In this paper, we develop a method for the recognition of human activities portrayed in multi-view video sequences. Our method is based on matrix completion, which finds the best action label(s) for each test scene. Each view is composed of a single smart camera, which locally processes its video sequence and decides about the activity being performed in the scene via communication. A sample configuration of the smart cameras for activity recognition is depicted in

In the rest of the paper, the next section reviews previous work, Section 3 explains our distributed matrix completion technique and the following section explains the proposed activity recognition approach in detail. Section 5 outlines a set of experiments on distributed activity recognition. Finally, Section 6 concludes the paper.

Action and activity recognition methods from single-view video sequences could be categorized into three classes: (1) models that directly utilize bag-of-words (BoWs) representations [

Several multi-camera and distributed action recognition approaches have also been proposed in the literature [

Matrix completion is a powerful tool for classification, where instances are classified by a convex optimization over the best labels while, simultaneously, the error and outliers present in the data are identified. Matrix completion by rank minimization is initially a non-convex optimization problem [

Distributed algorithms for matrix factorization and low-rank recovery mostly rely on parallel or distributed programming models, such as MapReduce and Hadoop. For instance, [

In this paper, a distributed algorithm is proposed, which uses a convex formulation of matrix completion and is applied to the problem of multi-view activity recognition in a network of smart cameras.

Let us assume that the network of the processing nodes, or the smart cameras, is modeled with a connected undirected graph whose p vertices are the camera nodes and whose edges connect each node, i, to its set of neighbors, N_i.

Matrix completion is the process of recovering a matrix from a sampling of its entries. We want to recover a data matrix, Z_0, of which we only observe a number of entries that is much smaller than the total number of elements in the matrix. Let Ω denote the set of known entries. With sufficiently many measurements and uniformly distributed entries, we can assume that there is only one low-rank matrix consistent with these entries [ ]. In our classification setting, this matrix stacks the training features, X_tr, the testing features, X_tst, and the training labels, Y_tr, while the testing labels, Y_tst, form the missing entries.

As noted by Goldberg et al. [ ], the unknown testing labels, Y_tst, can be estimated such that the rank of the stacked matrix, Z_0, is minimized. This can be done via a convex minimization process [ ]. Denote the known entries of Z_0 as Ω, split into the feature entries, Ω_X, and the label entries, Ω_Y. Here, Y_tr ∈ ℝ^{m×N_tr} and Y_tst ∈ ℝ^{m×N_tst} are the training and testing labels and X_tr ∈ ℝ^{n×N_tr} and X_tst ∈ ℝ^{n×N_tst} are the training and testing feature vectors, respectively. Therefore, the classification process is posed as finding the labels, Y_tst, that minimize a weighted sum of the rank of Z_0 and the errors on the observed label and feature entries, where the weights, λ_y and λ_x, are positive trade-off parameters [
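As a concrete illustration of this transductive setup, the stacked matrix, Z_0, and the observed-entry mask, Ω, can be formed as below. All the shapes and values here are toy assumptions for illustration, not the dimensions used in the paper.

```python
import numpy as np

# Toy dimensions (illustrative only): m label rows, n feature rows,
# N_tr training columns and N_tst testing columns.
m, n, N_tr, N_tst = 2, 4, 5, 3
rng = np.random.default_rng(0)

Y_tr = np.sign(rng.standard_normal((m, N_tr)))   # training labels
X_tr = rng.standard_normal((n, N_tr))            # training features
X_tst = rng.standard_normal((n, N_tst))          # testing features

# Stack labels over features; the testing-label block Y_tst is unknown.
Z0 = np.zeros((m + n, N_tr + N_tst))
Z0[:m, :N_tr] = Y_tr
Z0[m:, :N_tr] = X_tr
Z0[m:, N_tr:] = X_tst

# Omega marks the observed entries: everything except the Y_tst block.
omega = np.ones_like(Z0, dtype=bool)
omega[:m, N_tr:] = False
```

Completing the missing block of this matrix under a low-rank prior then directly yields the test labels.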

As shown by [ ], the rank objective can be relaxed to its convex surrogate, the nuclear norm, and the observations can be modeled as Z_0 = Z + E, where Z is the low-rank matrix to be recovered and the error matrix, E, collects the residuals on the observed feature entries, Ω_X, and label entries, Ω_Y.

Using the iterative thresholding or the singular value thresholding (SVT) approach [ ], the low-rank matrix can be recovered from the observed entries of Z_0 — each iteration shrinks the singular values of the current estimate and re-imposes the known entries.
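A minimal single-machine sketch of this idea is a soft-impute-style loop (this is a simplified stand-in, not the exact solver developed in this paper) that alternates singular value shrinkage with restoring the observed entries:

```python
import numpy as np

def complete_lowrank(Z0, omega, tau=1.0, iters=100):
    """Fill the unobserved entries of Z0 (omega marks the known entries)
    by repeatedly shrinking the singular values of the estimate."""
    Z = np.where(omega, Z0, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Z = (U * np.maximum(s - tau, 0.0)) @ Vt   # shrink singular values
        Z[omega] = Z0[omega]                      # keep observed entries fixed
    return Z
```

On an exactly low-rank matrix with enough observed entries, the missing values are approximately recovered; the shrinkage weight, tau, plays the role of the trade-off parameters above.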

With the singular value decomposition of a matrix, A = UΣV^T, we can apply an Alternating Direction Method (ADM) for recovering the low-rank matrix while penalizing the errors on Ω_X and Ω_Y. At the k^th iteration, a shrinkage (soft-thresholding) operator, S_ϵ[·], is applied to the singular values of the matrix, setting each singular value, σ, to max(σ − ϵ, 0).

The constraint that the last row of Z equals 1^T (the bias row) is enforced by resetting the last row of Z_k to 1^T after every update. Furthermore, for all unknown entries, (i, j) ∉ Ω_Y, the corresponding values of Z_k are left to be filled in by the low-rank estimate.

In order to parallelize this algorithm, we need to distribute the rows of Z_0 between the processing nodes. Therefore, we split the data matrix, Z_0 ∈ ℝ^{(n+m)×(N_tr+N_tst)}, into p row blocks, Z_i ∈ ℝ^{n_i×(N_tr+N_tst)}, one per node. Therefore, we can assume that the original data matrix is formed by stacking the blocks, Z_0 = [Z_1; Z_2; …; Z_p].
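As a sketch of this partition (toy sizes; np.array_split is just one convenient way to form row blocks of near-equal size n_i):

```python
import numpy as np

Z0 = np.arange(48.0).reshape(8, 6)      # toy (n+m) x (N_tr+N_tst) matrix
p = 3                                   # number of camera nodes
blocks = np.array_split(Z0, p, axis=0)  # Z_i: the rows held by node i
```

Stacking the blocks back together recovers the original matrix, so any quantity that decomposes over rows can be computed locally and merged.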

Therefore, the Lagrangian multiplier, ℒ_i, and the error matrix, E_i, are also maintained locally at each node, i, for its own block of rows.

Each node, i, keeps an internal state, x_i; in each iteration, nodes receive the internal states of their neighbors and update: x_i(t+1) = x_i(t) + ε Σ_{j∈N_i} (x_j(t) − x_i(t)), where the step size, ε, is chosen smaller than the inverse of the maximum node degree and decreased through time. It is shown [ ] that lim_{t→∞} x_i(t) equals the average of the initial values, x_i(0), over all nodes, so every node reaches the network-wide average through purely local communication.
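A minimal sketch of this consensus iteration on a small hand-made graph follows; the topology and the fixed step size (rather than one decreased over time) are illustrative assumptions:

```python
import numpy as np

def consensus_average(x0, neighbors, eps=0.3, iters=200):
    """Iterate x_i <- x_i + eps * sum_{j in N_i} (x_j - x_i).
    With eps below 1/(max degree), every node converges to the mean."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x + eps * np.array(
            [sum(x[j] - x[i] for j in neighbors[i]) for i in range(len(x))]
        )
    return x

# Path graph 0 - 1 - 2; every node ends up with the average of [0, 3, 6].
states = consensus_average([0.0, 3.0, 6.0], {0: [1], 1: [0, 2], 2: [1]})
```

Note that each node only ever reads its neighbors' states, yet all nodes agree on the global average.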

Note that the nodes need the SVD of the stacked matrix, Z = UΣV^T, with U ∈ ℝ^{(n+m)×r}, V^T ∈ ℝ^{r×(N_tr+N_tst)} and Σ ∈ ℝ^{r×r}. To compute it distributedly, each node forms the local Gram matrix, Z_i^T Z_i; a consensus summation yields Z^T Z = Σ_i Z_i^T Z_i = VΣ²V^T, whose eigendecomposition gives V and Σ, and each node then recovers its own slice of the left singular vectors as U_i = Z_i VΣ^{−1}.
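This Gram-matrix trick can be checked on toy data. The consensus summation is emulated here by a plain sum over the blocks; the symbols follow the text, while the data is random:

```python
import numpy as np

rng = np.random.default_rng(1)
blocks = [rng.standard_normal((4, 6)) for _ in range(3)]  # row blocks Z_i

# Consensus stage (emulated by a plain sum): G = Z^T Z = sum_i Z_i^T Z_i.
G = sum(Zi.T @ Zi for Zi in blocks)

# Eigendecomposition of G gives V and the squared singular values.
w, V = np.linalg.eigh(G)                  # ascending eigenvalues
w, V = w[::-1], V[:, ::-1]                # reorder to descending
sigma = np.sqrt(np.clip(w, 0.0, None))    # singular values of Z
keep = sigma > 1e-10 * sigma[0]           # drop numerically zero modes
V, sigma = V[:, keep], sigma[keep]

# Each node recovers its slice of U locally: U_i = Z_i V Sigma^{-1}.
U_blocks = [Zi @ V / sigma for Zi in blocks]
```

Each node only ever shares the small Gram matrix, never its raw rows, and U_i Σ V^T reproduces the local block Z_i.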

As a result, the i^th node obtains the rows of the SVD corresponding to its own data block without ever transmitting its raw data.

In summary, this algorithm consists of two stages: first, each node, i, calculates its local Gram matrix from its block, Z_i, and exchanges it with its neighbors over consensus; second, each node applies the shrinkage and multiplier updates to its own block, Z_i, until convergence.

Algorithm 1 (distributed matrix completion, at the i^th node). Given the observed entries of the local block, Z_{i0}, the parameters μ_0 and ρ and the shrinkage weight, τ, initialize Z_{i0} on the known entries (zero elsewhere), set ℒ_{i0} = 0 and k = 0, and repeat:

1. Fix all other variables and update the low-rank estimate: form C_i(0) = Z_{ik}^T Z_{ik}, iterate the consensus update, C_i(t+1) = C_i(t) + ε Σ_{j∈N_i} (C_j(t) − C_i(t)), eigendecompose the resulting sum as VΣ²V^T, set U_{ik} = Z_{ik}VΣ^{−1} and apply the shrinkage, Z_{ik+1} = U_{ik} S_τ[Σ] V^T.
2. Fix all other variables and update the local error matrix, E_{ik+1}, on the observed entries of the block.
3. Update the multiplier: ℒ_{ik+1} = ℒ_{ik} + μ_k (Z_{ik+1} + E_{ik+1} − Z_{i0}).
4. Update μ_{k+1} = min(ρ μ_k, 10^{10}) and increase k.

The loop stops when the local residual, ‖Z_{i0} − Z_{ik} − E_{ik}‖_F / ‖Z_{i0}‖_F, falls below a tolerance at every node.

Our task is to recognize activities present in the scene, which are captured with a networked set of cameras, as also illustrated in

To represent the video from each view, we use histograms of densely sampled features, which are extracted from space-time video blocks sampled along five dimensions; the resulting histogram forms the feature vector of the i^th view.

We can assume that both the training and testing action sequences are captured by N_c cameras; the per-view feature vectors, stacked with the action labels, form the data matrix, Z_0, for our case, as shown in

We construct the matrix, Z_0, by assigning each training or testing sample to a single column. During the capture of each action sequence, the subject could be facing any of the cameras while performing the action. For training, the samples are arranged such that all the sequences have the same orientation formation. Therefore, in order to enhance the recognition results, for each test sequence, we need to determine the orientation for which the action can best be recognized. The correspondence can be determined using a circular shift of the camera views. For instance, consider an action scene,
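For four cameras, the candidate orientation correspondences amount to the circular shifts of the per-camera feature stack. A toy sketch (the feature contents here are placeholders, not real histograms):

```python
import numpy as np

views = np.arange(12.0).reshape(4, 3)   # 4 cameras x 3-dim toy histograms

# Each circular shift of the camera order is one candidate alignment
# between the test sequence's views and the training formation.
candidates = [np.roll(views, s, axis=0).reshape(-1) for s in range(4)]
```

Each candidate column is then scored separately, and the one with the least error is kept.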

When performing matrix completion for determining the labels, all four combinations are considered, and the one with the least absolute error in the corresponding rows of the error matrix, over Ω_X and Ω_Y, is selected.

In this section, we set up several experiments on well-known multi-view activity datasets and compare the recognition results with state-of-the-art distributed and centralized methods. For comparison, we choose previous methods that have reported results with the same experimental setup. We also compare the execution times of our distributed matrix completion algorithm with those of the original centralized version of the algorithm, solving

In order to validate our approach, we carried out experiments using the IXMAS [

The IXMAS dataset has 13 action classes (check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, point, pick up and throw, either over the head or from the bottom up), each performed 3 times by 12 subjects. The scene is captured by 5 cameras, and the calibration/synchronization parameters are provided. In order to be consistent with a setup similar to those in the previous work [

The MuHAVi dataset contains 17 action classes (walk turn back, run stop, punch, kick, shotgun collapse, pull heavy object, pickup throw object, walk fall, look in car, crawl on knees, wave arms, draw graffiti, jump over fence, drunk walk, climb ladder, smash object and jump over gap) performed by 7 actors and recorded at 25 fps under challenging lighting conditions. In our experiments, we choose four cameras (two side and two corner views) for the evaluations. A manually annotated subset (MuHAVi-MAS) is also available, which provides silhouettes for two of these views (front-side and corner) for two actors and 14 action classes (called MuHAVi-14). We run our experiments on the whole dataset, since we do not require the manually annotated silhouettes, but we compare our method with some state-of-the-art methods on MuHAVi-14.

To set up this experiment, we simulated the network environment: each camera is implemented as a single process running on a core of a Core i7-3610QM CPU, and the communication is done via inter-process communication (IPC). The camera network is assumed to have a fully connected topology.

For extracting the spatio-temporal interest points and to form the histogram feature vectors, we set

The classification results for each individual camera using our method, in comparison with our distributed algorithm, are shown in

Many actions are very hard to recognize when viewed from a specific viewpoint. However, our distributed algorithm achieves better recognition rates compared to each single view of the same dataset. As can be seen, our method outperforms several distributed and centralized methods, both as an overall recognition system and at the class level. Only Wu and Jia [

In order to evaluate the gain in run time, we measured the execution times of the two versions of the algorithm.

In this paper, we have described a distributed action recognition algorithm based on low-rank matrix completion. We have proposed a simple distributed algorithm to minimize the nuclear norm of a matrix and adapted an inexact augmented Lagrangian multiplier method to solve the matrix completion problem. We have tested the algorithm on the IXMAS and MuHAVi datasets and achieved good results. With the experiments outlined in this paper, we show that our matrix completion framework can be well adapted to the classification of a scene in a distributed camera network. It therefore serves as a proof-of-concept study for using such algorithms in distributed computer vision.

As mentioned before, we have developed a distributed classification framework for human action recognition that can also be used for other distributed classification tasks. Matrix completion is a powerful tool for dealing with noisy data: as seen in the formulations, the error and outliers are identified during the minimization itself. Activity recognition data, due to its many variations across subjects and imaging/illumination conditions, contains many potential outliers, which is why our method achieves competitive results compared to the other state-of-the-art methods.

As a direction for future work, we plan to perform the training and testing procedures incrementally, so that huge amounts of data can be summarized into smaller matrices and used for testing purposes.

The authors declare no conflict of interest.


Sample camera network setup for human activity recognition.

Data matrix, Z_0, which contains training and testing instances, each as a single column.

A model for the data split between the processing camera nodes (distributing segments of each activity between the nodes).

Sample frames from the action datasets.

Recognition results for each of the single views and all four views, on the IXMAS dataset with training and testing on 11 actions and 10 subjects.

The confusion matrix of the recognition output on the IXMAS dataset.

Class-level recognition results of the IXMAS dataset with 11 actions, in comparison with Shao

Class-level recognition results of the IXMAS dataset with 13 actions, in comparison with Reddy

Recognition results for each of the single views and all four views, on the MuHAVi dataset.

The confusion matrix of the recognition output on the MuHAVi dataset.

Class-level recognition results of the MuHAVi dataset, in comparison with Wu and Jia [

Execution times for the distributed and centralized matrix completion on the human activity recognition datasets.

Overall accuracy results on the IXMAS dataset, using all four cameras. # Sub. and # Act. in the table are the number of subjects and the number of actions taken into account for evaluation in the method, respectively.

Method | # Sub. | # Act. | Setting | Accuracy
Srivastava et al. | 10 | 11 | Distributed | 81.4%
Weinland et al. | 10 | 11 | Centralized | 81.3%
Our Method | 10 | 11 | Distributed | 87.5%
Liu and Shah | 13 | 12 | Centralized | 82.8%
Reddy et al. | 13 | 12 | Centralized | 66.5%
Wu and Jia | 12 | 12 | View-invariant | 91.67%
Our Method | 13 | 12 | Distributed | 85.9%

Overall accuracy results on the MuHAVi dataset. The data column shows the subset of the data used for evaluation for each of the methods.

Method | Data | Setting | Accuracy
Singh et al. | MuHAVi-14 | Centralized | 82.4%
Chaaraoui et al. | MuHAVi-14 | Centralized | 91.2%
Wu and Jia | All of the dataset | View-invariant | 97.48%
Our method | All of the dataset | Distributed | 95.59%