Digital Commons @ Michigan Tech

Abstract: With the wide application of vision-based intelligent systems, image and video analysis technologies have attracted the attention of researchers in the computer vision field. In image and video analysis, human activity recognition is an important research direction. By interpreting and understanding human activities, we can recognize and predict the occurrence of crimes and help the police or other agencies react immediately. A large number of papers have been published on human activity recognition in video and image sequences. In this paper, we provide a comprehensive survey of recent developments in these techniques, including methods, systems, and quantitative evaluation of the performance of human activity recognition.


Introduction
After the tragic events of September 11 and the subsequent terrorist attacks around the world, visual surveillance has attracted much more attention and has been adopted in different applications for crime detection and prediction. Automatic activity recognition is an important research direction in surveillance vision analysis. By analyzing the detected human activities, especially abnormal activities, standoff threats can be recognized and predicted. In the past decade, a large number of in-depth research papers have been published on the recognition and understanding of human activities. They can be classified into two types of approaches: active techniques and passive techniques. Active techniques, such as radar, infrared, or microwave, have been widely used to obtain images. For example, commercial products such as Nintendo's Wii and Microsoft's Kinect make use of active techniques [1]. However, although such products have been partially successful, deploying them at every location is usually not practical in widespread public areas. Thus, we limit our work to summarizing past efforts on passive vision processing techniques.
As described in [2], activity recognition aims to describe human actions and interactions through the analysis and understanding of human motion patterns. It comprises two levels of processing [3]. As illustrated in Figure 1, the lower level detects the regions of interest (ROIs) corresponding to static or moving humans, while the higher level recognizes temporal motion patterns of human activities. From a technical viewpoint, human activity recognition can be considered a classification problem over time-varying feature data. Visual information is extracted from video sequences and represented as features, which are matched against the features extracted from a group of labeled reference sequences representing typical activities. During the extraction procedure, three kinds of features may be involved: features of a single object (e.g., position, velocity, veins, shape, and color), global features of multiple objects (e.g., average speed, region occupancy, and relative positional variations), and the relationships between objects and background [4]. Several efforts have been made to survey this area of research [5,6]. In [7], Popoola and Wang summarized the key points of previous review papers on activity recognition. Previous reviews focused mainly on methods for building normal activity templates or normal activity models, and therefore touch only on a subset of this research area. In this paper, we emphasize the existing high-level techniques and provide a summary of the progress achieved in building robust and intelligent vision-based methods, including abnormal activity templates, abnormal activity models, and manifold geometry. In addition, we discuss smart surveillance systems and evaluation metrics for human activity recognition.
Beyond activity recognition, similar fields include event recognition, goal recognition, and intent prediction. As pointed out in [8], although these terms may emphasize different aspects of activities, their essential goals are the same. Therefore, in this paper, we use the term activity recognition and do not distinguish between the minor differences among the terms mentioned above.
The remainder of this paper is organized as follows: Section 2 discusses the methods for activity recognition. Section 3 introduces the approved surveillance systems for activity recognition. Section 4 reviews the research projects on performance evaluation of activity recognition. The conclusions are given in Section 5.

Methods for Human Activity Recognition
The essence of activity recognition may be considered a classification problem over time-varying data. Accordingly, two critical issues need to be addressed during classification. The first is how to formulate the reference motion patterns for typical activities; the second is how to make the training and matching methods effective enough to cope with minor deviations in both temporal and spatial scales between similar motion patterns. In different circumstances, these two problems are treated differently, and we discuss the methods that deal with these differences from a technical viewpoint.
As stated in [9], investigations of human activity recognition can be divided into two kinds of approaches: template matching and state space. Most previous efforts have concentrated on state space methods because of their comparatively high recognition accuracy [10]. Spatial features including points, lines, and blobs are used during recognition. However, state space methods usually have high computational complexity, while template matching methods are computationally inexpensive. In template matching methods, meshes of a subject image were usually applied to identify a particular movement, and the features extracted from the given image sequence were compared to pre-stored patterns. As illustrated in Table 1, we classify past research along these two directions. Typical methods are outlined below.

Template Matching
Template matching methods extract motion features from the given image sequences and convert them into motion patterns. Human motion templates representing predefined activities can then be obtained from these motion patterns. Normal human activities can be recognized by matching observed activities against the templates [24]. However, methods for normal human activity recognition may present several drawbacks when applied to anomalous activities. An anomaly can be defined as an atypical activity pattern that is not represented by sufficient samples in a training data set but critically satisfies the specificity constraint of an abnormal pattern [25]. In many applications, data on anomalous activities is extremely scarce compared to data on normal activities. This may lead to significant differences in the methods for activity recognition, which we discuss in this section.

Normal Activity Template
Early work represented human activities by Motion History Images (MHI) and Motion Energy Images (MEI) in different views, from which square-based motion features could be extracted [11]. In these methods, image sequences were first processed by background subtraction and binarization. The resulting binary motion images, which contain the motion field, were accumulated over time to form the MEI, which was then enhanced into the MHI. The features extracted from the MEI and MHI of each view were used for template matching. However, this method can only recognize sample actions within a 180-degree range of viewing angles. Oren proposed a trainable object detection architecture that can recognize pedestrians from frontal and rear views [12]. Unlike the above method, this architecture did not rely on any a priori model or motion template, but defined the shape of an object as a series of regions and the relationships between them using wavelet templates. These wavelet templates can be compared with the image frames to search for the matching action.
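The accumulation of binary motion images into an MEI and an MHI can be sketched as follows. This is a minimal illustration, assuming the binary masks have already been produced by background subtraction and binarization, and using the common linear-decay update in which a moving pixel is set to `tau` and every other pixel decays by one per frame. The function names and the `tau` parameter are chosen here for illustration and are not taken from [11].

```python
def update_mhi(mhi, motion_mask, tau):
    """Return the next motion-history image.

    mhi         : 2D list of ints, previous history values in 0..tau
    motion_mask : 2D list of 0/1, binarized motion for the current frame
    tau         : number of frames for which a motion pixel stays visible
    """
    return [
        [tau if moving else max(0, prev - 1)
         for prev, moving in zip(mhi_row, mask_row)]
        for mhi_row, mask_row in zip(mhi, motion_mask)
    ]


def mei_from_mhi(mhi):
    """Motion-energy image: 1 wherever motion occurred within the last tau frames."""
    return [[1 if value > 0 else 0 for value in row] for row in mhi]


def build_templates(masks, tau):
    """Accumulate a non-empty sequence of binary motion masks into (MEI, MHI)."""
    rows, cols = len(masks[0]), len(masks[0][0])
    mhi = [[0] * cols for _ in range(rows)]
    for mask in masks:
        mhi = update_mhi(mhi, mask, tau)
    return mei_from_mhi(mhi), mhi


# A single pixel of motion sweeping left to right across a 1x3 image.
masks = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]
mei, mhi = build_templates(masks, tau=3)
```

In this toy sequence the most recent motion carries the largest MHI value, so the MHI encodes where motion happened and in which order, while the MEI only records where it happened.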
In order to make the recognition of human activity invariant to viewpoint variations, Ben-Arie described actions as temporal sequences of pose vectors that represent the motion of the human body [13]. They constructed a database for major body parts, in which all the activity templates were stored in multidimensional hash tables in the form of pose entries. A voting approach and multidimensional indexing were used in the recognition stage to improve the efficiency and stability of matching. Recently, Lu developed a system to automatically track multiple hockey players in a video sequence and simultaneously recognize their actions [14]. A Hue-Saturation-Value (HSV) color histogram and a Histogram of Oriented Gradients (HOG) descriptor were used to represent the color and shape information of the image of each hockey player, respectively. For the HOG descriptor, they used a 3D histogram based on the magnitudes of the gradients in the x and y directions and their orientations. Thus, their method is invariant to viewpoint variations. Action templates can be learned and updated from training data. For a candidate action, a Sparse Multinomial Logistic Regression (SMLR) classifier can be used to classify its HOG descriptors into action categories.

Abnormal Activity Template
Abnormal motion patterns can also be recognized by matching human motion templates. An internal list of anomalous motion patterns can be established as a set of templates to match against an ongoing activity. If the ongoing activity is on the list, it is confirmed to be anomalous. However, this kind of approach presents several drawbacks. The most significant is that, because of the way the templates are generated, new abnormal activities cannot be discovered [15]. To cope with this problem, Khalid proposed a method to filter anomalous activities [15]. Instead of generating templates from motion patterns, they assumed that normal behaviors are highly correlated with each other, so that abnormal activities can be detected through comparison with the normal activities recorded in video sequences. In this method, trajectories were represented as time series using modified Discrete Fourier Transform (DFT)-based coefficients in a low-dimensional feature space, and motion patterns were learned using iterative hierarchical semi-agglomerative clustering and learning vector quantization. This method did not need any prior knowledge about the number and type of activity patterns. In general, template matching methods are computationally efficient. However, despite their low computational cost, they are sensitive to noise and to variations in motion duration, so their recognition accuracy is not very high.
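To illustrate how a trajectory can be reduced to a low-dimensional feature vector with DFT coefficients, the following sketch keeps the first few coefficients of the x and y coordinate series. It uses the plain DFT, not the modified variant of [15], and the function names and the choice of three coefficients are assumptions made for this example.

```python
import cmath


def dft_coefficients(signal, num_coeffs):
    """First num_coeffs DFT coefficients of a real sequence, divided by its length."""
    n = len(signal)
    coeffs = []
    for k in range(num_coeffs):
        total = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        coeffs.append(total / n)
    return coeffs


def trajectory_features(points, num_coeffs=3):
    """Map a list of (x, y) positions to a fixed-length real feature vector.

    The vector holds the real and imaginary parts of the low-frequency
    coefficients of the x series, followed by those of the y series.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    features = []
    for series in (xs, ys):
        for c in dft_coefficients(series, num_coeffs):
            features.extend([c.real, c.imag])
    return features


# A stationary object: only the zero-frequency (mean) terms are non-zero.
stationary = trajectory_features([(1.0, 2.0)] * 4)
```

Because the feature vector has the same length for every trajectory, trajectories of different durations can be compared or clustered in one common space.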

State Space
Unlike template matching methods, the state space approach formulates a statistical model through training, which can then be used for the recognition of human activities [24]. In state space methods, each static posture is defined as a single state, and the states are related to each other through the statistical model. The motion sequence can thus be treated as an ergodic process passing through different states. For each motion sequence, the joint probability is calculated for every candidate model to find the maximum value [26]. State space methods can overcome the problem of motion duration variation in template matching approaches, because each state can be visited several times. However, other difficulties may arise. For example, it is far from easy to establish a model of fixed form, so different statistical models need to be established through complex iterative computation according to the specific situation [5]. Accordingly, several graphical models were proposed as an efficient way to perform probabilistic inference. Graphical models are a powerful tool for modeling dependencies among random variables, and can be divided into two categories: Directed Graphical Models (DGM) and Undirected Graphical Models (UGM) [27]. We explore recent efforts on generating statistical models in this section.

Normal Activity Model
One of the most typical DGMs is the Hidden Markov Model (HMM). The HMM was widely used in speech recognition in its early years, and was then successfully applied to the recognition of activities. For example, in order to model the dependence on the parameters of an activity explicitly, Wilson and Bobick proposed a framework that added a global parametric variation to the output probabilities of each state in the HMM [16]. In this framework, the expectation-maximization (EM) method was used to train the parametric HMM. Similarly, Duong introduced the switching Hidden Semi-Markov Model (HSMM) to learn and recognize human activities of daily living [17,18]. The parameters of the HSMM were determined by the switching variable at the high level.
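The basic recognition step with HMMs, scoring a sequence under one model per activity and picking the most likely one, can be sketched with the standard forward algorithm. The example below is a plain discrete HMM, not the parametric HMM of [16] or the switching HSMM of [17,18]. The two activity models, their labels, and all probabilities are made-up values for illustration, and the observation sequence is assumed to be non-empty.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Log P(obs | model) for a discrete HMM via the scaled forward algorithm.

    start[i]    : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability of emitting symbol o in state i
    """
    n_states = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][obs[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        log_lik += math.log(scale)
    return log_lik


def classify(obs, models):
    """Return the label of the model that assigns obs the highest likelihood."""
    return max(models, key=lambda label: forward_log_likelihood(obs, *models[label]))


# Hypothetical two-state models over two quantized posture symbols (0 and 1).
models = {
    "walk": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.8, 0.2]]),
    "run": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.2, 0.8]]),
}
label = classify([0, 0, 0, 1, 0], models)
```

In practice the model parameters would be estimated from labeled training sequences, for example with EM, rather than written by hand.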
A typical example of a UGM is the conditional random field (CRF), which has emerged in behavior recognition in the last few years. Compared with the HMM, the CRF can easily incorporate domain knowledge and achieve better classification accuracy [28]. For example, Chieu applied CRFs to the two behavior recognition tasks proposed at the Physiological Data Modeling Contest [19]. Generalized Expectation Maximization was used to train on the partially labeled sequences to improve performance. Similarly, Yin proposed a method based on a dynamic conditional random field (DCRF) model to detect events from large-scale sensor networks in real time [20]. The DCRF model incorporated temporal constraints among contiguous spatial fields, and relaxed the independent spatial-temporal relationship among events in a unified probabilistic framework. Thus, it can deal with partial sensor data and interactions between contiguous events.

Abnormal Activity Model
State space approaches set up profiles for normal activities. The activities deviating from these profiles are treated as anomalous. In other words, state space approaches construct a graphical model using a set of normal patterns to establish a classifier that can discriminate between normal and abnormal activities. The critical point of this method lies in whether or not the proposed graphical model can be used as an accurate predictor of normal activities. In this way, an ongoing pattern is likely to be anomalous when it cannot be predicted by the graphical model.
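The idea of flagging an activity that a model of normal behavior predicts poorly can be sketched with a much simpler stand-in for a full graphical model: a first-order Markov chain over discrete states, fitted to normal sequences only. The smoothing constant, the threshold, and the toy sequences below are assumptions chosen for illustration, and sequences are assumed to contain at least two states.

```python
import math


def fit_transition_model(normal_sequences, n_states, smoothing=1.0):
    """Estimate a smoothed first-order transition matrix from normal sequences."""
    counts = [[smoothing] * n_states for _ in range(n_states)]
    for seq in normal_sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def mean_log_likelihood(seq, trans):
    """Average log-probability per transition of seq under the normal model."""
    steps = list(zip(seq, seq[1:]))
    return sum(math.log(trans[a][b]) for a, b in steps) / len(steps)


def is_anomalous(seq, trans, threshold):
    """Flag a sequence whose transitions the normal model predicts poorly."""
    return mean_log_likelihood(seq, trans) < threshold


# Normal behavior cycles through the states 0 -> 1 -> 2 -> 0.
normal = [[0, 1, 2, 0, 1, 2, 0, 1, 2]] * 5
model = fit_transition_model(normal, n_states=3)
usual = is_anomalous([0, 1, 2, 0], model, threshold=-1.0)
unusual = is_anomalous([0, 2, 1, 0], model, threshold=-1.0)
```

The threshold controls the tradeoff between missed anomalies and false alarms, so it would normally be tuned on held-out data.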
Most of the graphical models used for normal activity recognition can also be used to detect abnormal events. However, because abnormal behaviors occur rarely and are not expected in advance, these models should be adjusted according to the specific application.
Taking DGMs as an example, Yin and Meng proposed a self-adaptive HMM-based framework to understand abnormal activities [21]. Unlike models of normal activities, which need to be trained on a large data set, this framework can learn online from the current data set and generate new models for abnormal activities. In order to detect anomalies in complex outdoor scenes, Loy proposed an activity-based decomposition of complex activities, and modeled them using a cascade of Dynamic Bayesian Networks (DBN) [22]. The activity space was factorized into sub-spaces by exploring the behavior semantics within the spatial-temporal visual context in which the activity occurred.
UGMs can also be used to recognize abnormal activities. For example, Hu and Yang presented a probabilistic, goal-correlation-based two-level framework to deal with concurrent and interleaving goals in observed activity sequences [23]. At the low level, a skip-chain CRF was used to estimate whether a new goal exists in the given observed activity. At the high level, a relational graph was adopted to represent the correlation between different goals.

Manifolds Geometry
Besides the above methods in Euclidean spaces, there are also some emerging and interesting techniques, such as manifold geometry. In [29], Liu et al. used Grassmann manifolds to classify human actions. A tensor was characterized as a point on the manifold, and then mapped to the geodesic distance on this manifold. Recently, Harandi et al. compared Riemannian manifolds with several state-of-the-art methods to assess their performance in representing human activities [30]. They conducted several vision-based classification experiments, including gesture recognition and person re-identification, and the experimental results indicate considerable improvements in discrimination accuracy.
In methods of this kind, human activities are related to a particular matrix manifold. Human motion patterns can then be characterized using some transformation. Besides Grassmann and Riemannian manifolds, the matrix manifolds of interest also include Lie groups and Stiefel manifolds.
Lui presented a good description of recent advances in matrix manifolds for computer vision, and introduced their applications in human activity recognition; see [31] for details.
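As a small illustration of a geodesic distance on a Grassmann manifold, the sketch below handles only the simplest case, in which each point is a one-dimensional subspace (a line through the origin). The distance is then the single principal angle between the two lines. The general case used in [29] involves higher-dimensional subspaces and several principal angles, which this example does not cover. The function name is chosen here for illustration.

```python
import math


def line_geodesic_distance(u, v):
    """Geodesic distance between the lines spanned by non-zero vectors u and v.

    Each line is a point on the Grassmann manifold of one-dimensional
    subspaces. The result is the principal angle, in the range [0, pi/2].
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # The absolute value makes the sign and scale of u and v irrelevant;
    # min() clamps rounding errors slightly above 1.
    cosine = min(1.0, abs(dot) / (norm_u * norm_v))
    return math.acos(cosine)


same_line = line_geodesic_distance([1.0, 0.0], [-2.0, 0.0])
orthogonal = line_geodesic_distance([1.0, 0.0], [0.0, 1.0])
```

Vectors that differ only in sign or length describe the same subspace and are at distance zero, which is what distinguishes a distance between subspaces from a Euclidean distance between the vectors themselves.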

Systems for Activity Recognition
Vision-based surveillance systems can be used to detect, analyze, and recognize activities. Good descriptions of vision processing techniques in surveillance systems were presented in [32][33][34]. As illustrated in Figure 2, the basic framework of an automatic vision surveillance system is composed of a set of cameras, a vision processing unit, a vision storage unit, and a visual control unit. These units are interconnected through a network or another kind of device. In this framework, the vision processing unit plays an important role, since it contains the key techniques for activity recognition. A large number of vision-based surveillance systems outfitted with inexpensive cameras have been proposed. We summarize the research projects approved in this domain. Typical systems include: Closed Circuit Television (CCTV) [35], Pfinder [36], W4 [37], Human Identification at a Distance (HID) [38], Context Aware Vision using Image-based Active Recognition (CAVIAR) [39,40], BEHAVE [41], Visual Surveillance and Monitoring (VSAM) [42], the project from the Center for Biometrics and Security Research (CBSR) [43], IBM Smart Surveillance System (S3) [44], etc. From the perspective of system architecture and technology, Kumar divided the evolution of vision-based surveillance systems into four stages [34]. Table 2 lists the past approved research projects on activity recognition in these four stages.
The first generation of vision-based surveillance systems consisted of a number of Charge-Coupled Device (CCD) cameras, which were connected to a set of monitors using automatic control switches. For example, Nwagboso proposed a CCTV system to assist in understanding the events in traffic networks and ultimately provide better traffic control, incident management, and traffic law enforcement [35]. The CCD cameras can continuously trigger image saving routines and monitor accident black spots, so they can be used as a forensic tool after vehicle crashes have taken place.
However, the widespread deployment of CCD cameras made human supervision more expensive and less effective. In order to detect alarming events proactively rather than record them passively, second generation surveillance systems were developed. Pfinder and W4, developed early on by the MIT Media Laboratory and the University of Maryland respectively, belong to this kind of system [36,37]. The significant feature of these systems lies in their ability to provide robust detection, tracking, and classification algorithms. Besides Pfinder and W4, several more recent second generation surveillance systems exist. For example, the HID project sponsored by the Defense Advanced Research Projects Agency (DARPA) fused biometric technologies into a human identification system to detect, recognize, and identify humans at significant standoff distances [38]. The incorporation of biometric technologies can enable faster and more accurate identification of humans, and thus can provide useful early warning support for force protection and homeland defense against terrorists, criminals, and other human-based threats. In contrast, the CAVIAR project funded by the Information Society Technology (IST) made use of various information, including task, scene, function, and object contextual knowledge, to provide rich descriptions of local images through hierarchical visual processes [39,40]. This information enables CAVIAR to detect nighttime crime and classify customers' commercial behaviors. In order to filter out uninteresting normal activity and non-activity from the video stream, the UK's Engineering and Physical Sciences Research Council funded the BEHAVE project undertaken by the University of Edinburgh [41]. BEHAVE, which uses the dynamic Hidden Markov Model to track individuals, can detect and discriminate between similar interactions. In addition, global probabilistic models were adopted to resolve inconsistencies during the tracking of individuals in crowd scenes, where images were obtained over a short time.
In order to achieve wide area surveillance, third generation surveillance systems were designed using distributed, heterogeneous, and synergistic cameras. A typical example is the VSAM project supported by DARPA [42]. Cooperative multi-sensors were used in VSAM to track humans and vehicles persistently in a cluttered environment. The main goal of VSAM was to monitor conditions in battlefields by automatically collecting real-time information, and to help improve the situational awareness of commanders and staff. The CBSR at the Institute of Automation, Chinese Academy of Sciences developed an intelligent visual surveillance system to help ensure public safety and enhance protection from terrorist attacks [43]. This system can recognize anomalies and abnormal activities, detect abandoned or removed objects, and track multiple objects at night; moreover, it can display overall information on a panoramic monitoring screen.
Recently, fourth generation surveillance systems were proposed to provide real-time event alerts and long-term statistical patterns in large-scale distributed video surveillance systems. Systems of this kind are built on top of existing IP network infrastructure using wireless networks and networked digital video cameras [34]. For example, IBM Corporation developed a fourth generation middleware named S3 to provide video-based activity analysis capabilities [44]. S3 can not only automatically monitor a scene, but also perform surveillance data management, event-based retrieval, long-term activity pattern statistics, and web-based real-time event alarms. There are two main components in S3. The first is the Smart Surveillance Engine (SSE), which provides the front-end video analysis capabilities; the other is the Middleware for Large Scale Surveillance (MILS), which enables data management and retrieval functions. These two components can be used along with IBM DB2 and IBM WebSphere Application Server to realize a series of functions, such as local and web-based real-time surveillance and event notification, web-based surveillance event retrieval, and web-based surveillance event statistics.

Evaluation Metrics for Activity Recognition
Effectively evaluating the performance of methods and systems for activity recognition in videos or image sequences is important both for improving surveillance algorithms in theory and for selecting proper surveillance solutions in practical applications. Much effort has been made towards generating metrics to evaluate the performance of video-based automatic surveillance systems. As illustrated in Table 3, we review some of the recent efforts.
Table 3. Past research projects on performance evaluation.

Research Projects for Performance Evaluation
The earliest effort in performance evaluation started with the Video Analysis and Content Extraction (VACE) program in the year 2000. VACE, supported by the Advanced Research and Development Activity (ARDA), aimed to develop novel algorithms and implementations to analyze video content including newscasts, meetings, and surveillance [45][46][47]. Thus, VACE pays special attention to tasks such as the detection and tracking of text, faces, and the positions of persons. The performance evaluation initiative in VACE is carried out by the University of South Florida (USF) under the guidance of the National Institute of Standards and Technology (NIST). The evaluation was based on the framework by Kasturi et al. [53], which is a well-established protocol for performance evaluation of object detection and tracking in video sequences. Evaluation criteria in VACE vary according to the task. For the detection tasks, VACE makes use of the Sequence Frame Detection Accuracy (SFDA) metric to capture the detection accuracy (misses and false alarms) and the detection precision (spatial alignment); for the tracking tasks, the Average Tracking Accuracy (ATA) metric is used to measure both tracking accuracy (number of correct trackers) and tracking precision (spatial and temporal accuracy).
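To show how a single detection score can reflect misses, false alarms, and spatial alignment at the same time, the following sketch computes a per-frame score from the spatial overlap of matched boxes and averages it over a sequence. It is written in the spirit of SFDA but is not the authoritative definition, which is given in [53]. In particular, the pairing of ground-truth and detected boxes is assumed to be supplied by the caller, and the function names and box format are chosen here for illustration.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def frame_detection_accuracy(truth, detected, pairs):
    """Summed overlap of matched pairs, normalized by the mean object count.

    truth, detected : lists of boxes for one frame (not both empty)
    pairs           : list of (truth_index, detected_index) matches
    Unmatched truth boxes (misses) and unmatched detections (false alarms)
    add nothing to the sum but still enlarge the denominator.
    """
    overlap = sum(box_iou(truth[i], detected[j]) for i, j in pairs)
    return overlap / ((len(truth) + len(detected)) / 2.0)


def sequence_detection_accuracy(frames):
    """Average frame score over frames that contain any truth or detection."""
    active = [f for f in frames if f[0] or f[1]]
    if not active:
        return 0.0
    return sum(frame_detection_accuracy(*f) for f in active) / len(active)


frames = [
    ([(0, 0, 2, 2)], [(0, 0, 2, 2)], [(0, 0)]),  # perfect detection
    ([(0, 0, 2, 2)], [], []),                    # missed object
    ([], [], []),                                # empty frame, ignored
]
score = sequence_detection_accuracy(frames)
```

A higher score is better here, with 1.0 reached only when every object is detected with perfect overlap and there are no false alarms.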
The Performance Evaluation of Tracking and Surveillance (PETS) workshop is another endeavor [48]. This yearly workshop investigated moving object detection and tracking in its earliest years, and has turned to density estimation, left luggage detection, and activity recognition in recent years. General outdoor surveillance benchmark datasets and an online evaluation service were provided for participants to evaluate their systems. Unlike VACE, all metrics in PETS are defined as error measures, meaning that the lower the score, the better the performance [52]. Like VACE, the metrics in PETS are task dependent. For the motion segmentation tasks, PETS defined four pixel-level metrics: Negative Rate, Misclassification Penalty, Rate of Misclassifications, and Weighted Quality Measure. For the tracking tasks, five criteria are used: percentage of dataset tracked, average overlap between bounding boxes, average overlap between bitmaps, average chamfer distance using the ground truth object bitmap, and average chamfer distance using the algorithm-generated bitmap.
However, in the early days both VACE and PETS lacked the evaluation metrics needed for event recognition tasks. The detection of activities is difficult to evaluate because the difficulty depends strongly on the events to be recognized. For instance, it is much easier to detect an intrusion into a zone of interest than a person opening a door [50].
To address this problem, NIST sponsored another evaluation project, the Text REtrieval Conference Video Retrieval (TRECVid) Evaluation for Event Detection, starting in 2005. TRECVid is a laboratory-style evaluation intended to promote the development of machine learning technology for event detection in video surveillance [49]. The video source data was mainly obtained from the UK Home Office at London Gatwick International Airport. The TRECVid Evaluation for Event Detection was performed by comparing the temporal similarity between the annotated reference event observations and the system-detected event observations. The performance results were reported as missed detections (MD) and false alarms (FA), which can be used to derive the Detection Cost Rate (DCR) model and Detection Error Tradeoff (DET) curves. The DCR model is a single error measure derived from a linear combination of MD and FA, while DET curves graphically depict the tradeoff between these two error types over a wide range of operating points.
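A linear combination of the two error types can be sketched as follows. This is an illustration under stated assumptions: misses are normalized by the number of reference events, false alarms by the duration of the video in hours, and the weight `beta` is a free parameter whose value below is made up. The actual normalization and weights used in the evaluation are specified in [49].

```python
def detection_cost_rate(n_missed, n_reference, n_false_alarms, hours, beta):
    """Weighted sum of the miss probability and the false-alarm rate.

    n_missed       : reference events the system failed to detect (MD)
    n_reference    : total number of reference events
    n_false_alarms : system detections that match no reference event (FA)
    hours          : duration of the evaluated video, in hours
    beta           : weight of the false-alarm rate (illustrative parameter)
    """
    p_miss = n_missed / n_reference
    fa_rate = n_false_alarms / hours
    return p_miss + beta * fa_rate


def det_points(operating_points, n_reference, hours):
    """(false-alarm rate, miss probability) pairs, one per decision threshold.

    operating_points : list of (n_missed, n_false_alarms) tuples obtained by
                       sweeping the detection threshold of a system
    """
    return [(fa / hours, md / n_reference) for md, fa in operating_points]


cost = detection_cost_rate(n_missed=5, n_reference=20,
                           n_false_alarms=10, hours=2.0, beta=0.005)
curve = det_points([(10, 0), (5, 10), (0, 40)], n_reference=20, hours=2.0)
```

The single cost value summarizes one operating point, whereas the list of pairs traces how misses fall as false alarms rise when the threshold is relaxed, which is the information a DET curve plots.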
ETISEO was approved to evaluate the performance of event detection tasks by comparing "the number of correctly recognized events with the constraint of time". The project started in January 2005 and was sponsored by the French government in order to evaluate vision techniques for video surveillance [50,51]. Unlike the above evaluation methods, which take the point of view of the algorithm, ETISEO investigates the relationship between algorithms and video sequences. In other words, ETISEO aims at identifying the suitable scene characteristics for a given algorithm and highlighting algorithm weaknesses for further improvement. Besides event detection, other aspects of video surveillance systems can also be evaluated in this project using various metrics. For instance, the accuracy of the 2D or 3D location of objects and the quality of the object shape can be used as criteria for the detection task, while tracking time, object ID persistence, and object ID confusion can be used as criteria for the tracking tasks. ETISEO displays its evaluation results in the form of a Receiver Operating Characteristic (ROC) curve, defined as a plot of the true positive rate against the false positive rate.
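The points of an ROC curve can be computed by sweeping a decision threshold over the confidence scores of a detector, as in the sketch below. The scores and labels are toy values, the function name is chosen for illustration, and the input is assumed to contain at least one positive and one negative example.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    scores : detector confidence for each candidate event (higher is more confident)
    labels : 1 if the candidate is a true event, 0 otherwise
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is accepted
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


points = roc_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

Each point corresponds to one operating threshold, and a curve closer to the top-left corner indicates a better detector.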

Collaboration between Different Projects
Besides the above-mentioned performance evaluation projects, many other programs have also been created in past years, such as Computers in the Human Interaction Loop (CHIL) [54] and the Challenge for Real-Time Event Detection Solutions (CREDS) [55]. However, the existence of many concurrent metrics makes it difficult to compare them fairly, as they are not formalized in the same way [56]. Since most of the performance evaluation programs share the same motivation of developing novel algorithms for detection, tracking, and behavior recognition of humans and objects in video sequences, technology mapping and transfer among individual projects may contribute to fair comparison and fast research growth. In addition, current performance evaluation is still limited to short sequences, and these sequences and their annotations are often available only to those who created them [51]. It is therefore also necessary to provide benchmark datasets and ground truth data with a common evaluation setup to all researchers. Table 4 shows some of the recent collaboration efforts. The Classification of Events, Activities and Relationships (CLEAR) Evaluation Workshop is the first attempt to bring together two projects: VACE and CHIL [57]. This collaboration has achieved great success. The evaluation metrics provided in CLEAR are widely accepted as an effective and informative assessment of system performance. In addition, CLEAR makes more data available to the research community for algorithm development.
After that, Manohar et al. [58] presented a qualitative comparison of detection and tracking tasks in the VACE and PETS programs. Performance metrics, along with other vital aspects such as the framework, the tasks, and the ground truth data, are compared thoroughly. They believed that the right set of metrics can be identified through continuing collaboration on task definitions, database development, etc. In 2010, PETS started to evaluate the object detection and tracking tasks based on the SFDA and ATA metrics, which are formally used by the VACE and CLEAR programs [59]. As a result, researchers can evaluate the detection and tracking performance of their systems using the same metrics (SFDA and ATA) and more data (from both CLEAR and PETS).
For the event detection task, Desurmont et al. [56] mapped the metrics in TRECVid, CREDS, and their own project onto each other. These metrics are compared using a toy example, where events have a temporal duration and are represented as a time interval with a beginning and an ending time. The results indicated that the metrics in the TRECVid project are fully consistent. Based on the problem formalization of TRECVid, the authors further proposed a faster implementation for duration-less events [60].

Conclusions
In this paper, we present an overview of recent techniques for vision-based activity recognition. We have summarized previous work from different technical viewpoints. In addition, we have reviewed past approved surveillance systems, as well as the research projects for performance evaluation.
However, there are still some problems that need to be solved in the future. Robust recognition of activities depends on rapid human motion detection, reliable motion tracking, and accurate data analysis [24,33]. These tasks are challenging for several reasons, such as noise and uncertain backgrounds. Even with robust human motion detection and tracking, activity recognition may still pose great difficulties, including variance in the appearance of particular events, similarity in the appearance of different events, and the lack of specific background information, which may contain a large amount of prior knowledge.
In addition, evaluating the performance of these tasks is another important issue. Although much work has been done on evaluating the performance of activity recognition, standardized evaluation metrics and benchmark datasets are still lacking. It is difficult to evaluate different algorithms on different datasets and compare their performance with that of others. Moreover, as far as we know, most current investigations focus on the evaluation of algorithms. There is scarcely any evaluation of the performance of practical surveillance products. Multiple metrics and criteria may help researchers evaluate their algorithms more effectively, but they are not convenient for manufacturers and end users. A comprehensive metric would help them select a suitable surveillance system from a large number of products. Unfortunately, current evaluation metrics reflect only part of the overall performance; comprehensive criteria are still lacking.