Video Sensor-Based Complex Scene Analysis with Granger Causality

In this report, we propose a novel framework to explore the activity interactions and temporal dependencies between activities in complex video surveillance scenes. Under our framework, a low-level codebook is generated by adaptive quantization with respect to an activeness criterion. The Hierarchical Dirichlet Processes (HDP) model is then applied to automatically cluster low-level features into atomic activities. Afterwards, the dynamic behaviors of the activities are represented as a multivariate point process. The pair-wise relationships between activities are explicitly captured by non-parametric Granger causality analysis, from which the activity interactions and temporal dependencies are discovered. Each video clip is then labeled with one of the activity interactions. Results on real-world traffic datasets show that the proposed method achieves high classification accuracy. Compared with traditional K-means clustering, the proposed causal grouping method improves the overall classification accuracy by up to 19.19 percentage points.


Introduction
With the development of video surveillance systems, intelligent video analysis has become a hot topic that has attracted significant attention [1][2][3][4]. Scene analysis is one of the most important aspects of an intelligent video surveillance system. Automatic scene analysis, especially discovering activity interactions and the temporal dependencies between them, is an active research area, but it remains a challenging problem because complex video surveillance scenes contain multiple activities occurring simultaneously. Currently, there are two main categories of approaches for discovering activity interactions and their temporal dependencies. The first category is based on probabilistic inference. These methods typically use a Dynamic Bayesian Network (DBN), such as a Hidden Markov Model (HMM), or a probabilistic topic model (PTM) [5][6][7][8], such as Latent Dirichlet Allocation (LDA) [9], Hierarchical Dirichlet Processes (HDP) [10] and their extensions. The second category is data driven and based on Granger causality [11]. Zhou et al. [12] apply continuous-time Granger causality to pairs of trajectories to extract features for activity classification. Prabhakar et al. [13] perform non-parametric Granger causality analysis on pairs of visual events and demonstrate its effectiveness in exploring causal event sets without tracking. Based on the above-mentioned studies, Yi et al. [14] present a framework that models human actions using temporal causal relations of joint movements for human action classification. Nevertheless, both of these studies are applied to human action analysis.
Most of the aforementioned studies are based on tracking and are therefore limited to situations in which object tracking can be performed reliably. In contrast, topic models applied to complicated scene analysis have the advantage that they work directly on low-level motion features [5,6,8], e.g., optical flow, which avoids unreliable detection and tracking. However, with respect to codebook generation, a fixed quantization is often used in these studies, which loses necessary resolution. Increasing the discretization resolution can reduce this loss, but at the cost of increased training data and computation time.
Our study builds on the success of the Granger causality applied to the analysis of visual events and takes advantage of the topic model. A flow chart of the proposed framework is shown in Figure 1. To create a temporal document from a video, low-level visual features are first detected and denoised. Then, a new method is proposed for codebook generation using adaptive quantization. Rather than using these words directly, low-level words are automatically clustered into topics (atomic activities) using the HDP model. Then, the dynamics of the visual activities are represented as a multivariate point-process [15]. Specifically, the temporal documents are created by counting the number of occurrences of the activities in each clip of the video. The pair-wise causal relationships between the atomic activities are then explicitly captured using the non-parametric Granger causality [16]. Based on the causal analysis results, the activity interactions and temporal dependencies are discovered, and a high quality classification performance is achieved.
The contributions of this study are two-fold. First, an activeness criterion-based method is proposed to determine the level of resolution of the optical flow quantization, and the low-level codebook is generated using adaptive multi-scale quantization. Specifically, the window size of the location quantization adapts to the spatiotemporal characteristics of the scene. Second, the causality analysis is combined with the topic model to automatically explore the activity interactions and the temporal links between them. Causal relationships between any pair of agents are established via the Granger causality statistic, from which we obtain an improved classification performance.
The remainder of this paper is organized as follows. In Section 2, a statistical denoising method is proposed to obtain informative features. Then, a codebook with minimal size and adequate discriminative power is constructed based on adaptive quantization. In Section 3.1, the video is interpreted as a point process based on the HDP model. The approach for the non-parametric Granger causality is explained in Section 3.2. In Section 4, we present the experimental results along with the analysis of the proposed approach for two real-world traffic sequences. Section 5 presents the conclusions and discusses future studies.

Feature Extraction and Denoising
In this section, optical flow is detected for video representation because it is typically more reliable in crowded scenes compared to the long-term trajectory in estimating various current activities. When the magnitude of the flow is smaller than a threshold, the flow is deemed unreliable and removed. Furthermore, the optical flow is denoised by determining whether it is informative for further analysis.
For each pixel position, the total number of optical flow vectors in the video sequence is counted. The result is a 2D matrix, which is divided by the total number of frames to obtain probabilities. The probability at each position is a measure of whether this position is active (a useful dynamic region). Pixel positions with extremely small probabilities are deemed static regions, while pixel positions with very large probabilities are regarded as noisy dynamic regions. The optical flow vectors at these positions, which were not discarded by the magnitude threshold, are therefore discarded because they are uninformative for further analysis. Figure 2 compares the optical flow spatial distribution with and without statistical denoising for the intersection video sequence. It is evident that statistical denoising brings the optical flow spatial distribution closer to the actual situation. For example, as shown in Figure 2a, the probabilities are large in the areas labeled by red and green circles; in actuality, these are stationary regions corresponding to buildings (see Figure 2c).
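The statistical denoising step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `active_mask` and the thresholds `low` and `high` are hypothetical, since the text does not state the values used.

```python
import numpy as np

def active_mask(flow_present, low=0.01, high=0.9):
    """Return a 2D boolean mask of active (useful dynamic) pixel positions.

    flow_present: bool array of shape (frames, height, width), True where a
    flow vector survived the magnitude threshold.
    low, high: hypothetical probability thresholds (not given in the text).
    """
    # Fraction of frames in which each pixel carries a flow vector.
    prob = flow_present.sum(axis=0) / flow_present.shape[0]
    # Very small probability: static region. Very large: noisy dynamic region.
    return (prob > low) & (prob < high)

# Toy example: 100 frames of a 1 x 3 image.
frames = np.zeros((100, 1, 3), dtype=bool)
frames[:30, 0, 1] = True   # flow in 30% of the frames: active
frames[:, 0, 2] = True     # flow in every frame: treated as noise
mask = active_mask(frames)
```

Flow vectors at positions where `mask` is False would then be dropped before codebook generation.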

Adaptive Quantization
In this section, a codebook is generated based on the denoised motion features. To obtain a codebook, previous methods often divide the scene spatially into non-overlapping grids (e.g., 5 × 5 and 10 × 10) and quantize the direction into four or eight bins. This discretization necessarily causes a loss of spatial and directional resolution. Increasing the discretization resolution can reduce this loss, but it also increases the training data requirements and the computation time. Furthermore, a fixed quantization does not consider the complexity of the motion in the actual scene.
To overcome the above limitations, we present a new method for codebook generation using adaptive quantization. The advantage of the proposed approach is that it maintains a codebook of minimal size with adequate discriminative power. The approach is based on the observation that the spatial and directional distributions of the optical flow may not be uniform, especially when observed over a significant length of time. Thus, a coarse quantization is used for regions of low activeness, and a fine quantization is used for regions in which motions are complex. First, an activeness criterion for a block is defined based on the flow density and diversity. The flow density of a block is defined as $D = N / S$, where $S$ is the area of the block and $N$ is the total number of optical flow vectors in this region. The density is then normalized by the maximum density value $D_{max}$, giving $\bar{D} = D / D_{max}$. To compute the flow diversity for each block, the optical flow directional histogram is first computed, and the histogram is normalized as a vector $H = (d_1, d_2, \ldots, d_M)$, where $M$ is the number of bins in the histogram.
The diversity of flows in the block is measured by the KL divergence (relative entropy) [17] between $H$ and the uniform distribution $U$ with $u_i = 1/M$ (maximum entropy): $KL(H \| U) = \sum_{i=1}^{M} d_i \log (d_i / u_i)$. Lastly, the diversity score of this block is derived from this divergence. The diversity score lies within [0,1], and a high score indicates that the direction distribution is more scattered. The activeness $A$ of the block is measured by a combination of these two terms, the normalized flow density and the diversity score, weighted by a prior mixture parameter $\mu$, so that $A$ ranges from 0 to 1. The first term accounts for the flow density, and the second term accounts for the flow diversity. In practice, the value of the parameter $\mu$ is determined empirically.
Our approach to location quantization is similar to quad-tree segmentation. A flow chart of the adaptive multi-scale location quantization is shown in Figure 3. The activeness of each initial block of size $N \times N$ is compared to a threshold. If a block is not sufficiently active, it is not divided further. If a block is active enough, it is divided into four sub-blocks of identical size, and the process is repeated on each of these four sub-blocks. The process stops when every block is regarded as non-active or the minimum block size is reached. Figure 4 illustrates the adaptive location quantization results for the intersection video sequence.
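The activeness test and the quad-tree style split can be sketched as below. Several choices here are assumptions made only for illustration: the diversity score is taken as $1 - KL(H \| U) / \log M$ (one form that lies in [0,1] and grows with scatter), the weight `mu` is placed on the density term, and the names `diversity`, `activeness` and `split` are hypothetical.

```python
import numpy as np

def diversity(hist):
    """Assumed diversity score: 1 - KL(H || U) / log(M), in [0, 1]."""
    h = np.asarray(hist, dtype=float)
    m = h.size
    if h.sum() == 0:
        return 0.0
    h = h / h.sum()
    nz = h > 0
    kl = np.sum(h[nz] * np.log(h[nz] * m))  # KL divergence to uniform 1/M
    return 1.0 - kl / np.log(m)

def activeness(density_norm, div, mu=0.5):
    """Assumed convex combination of normalized density and diversity."""
    return mu * density_norm + (1.0 - mu) * div

def split(x, y, size, score_fn, thresh, min_size):
    """Recursively split an active block into four equal sub-blocks.

    score_fn(x, y, size) returns the activeness of the block.
    Returns a list of (x, y, size) leaf blocks.
    """
    if size <= min_size or score_fn(x, y, size) < thresh:
        return [(x, y, size)]
    half = size // 2
    blocks = []
    for dx in (0, half):
        for dy in (0, half):
            blocks += split(x + dx, y + dy, half, score_fn, thresh, min_size)
    return blocks

# Toy scene: only blocks anchored at the origin are active.
toy_score = lambda x, y, size: 1.0 if (x, y) == (0, 0) else 0.0
blocks = split(0, 0, 8, toy_score, thresh=0.5, min_size=2)
```

In the toy scene, the active corner is refined down to the minimum block size while the three inactive quadrants stay coarse, which is the behavior the adaptive quantization aims for.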
After location quantization, the direction is quantized based on the flow diversity. There are two degrees of directional quantization: four bins or eight bins. If the diversity score of a block is larger than a predefined threshold, the direction in this block is quantized into eight bins; otherwise, it is quantized into four bins. After performing the spatial and directional quantization, a codebook is obtained whose size is $\sum_{i=1}^{L} d_i$, where $L$ is the total number of blocks and $d_i$ is the number of directional bins in block $i$. Each flow vector in the sequence is mapped to one of the visual words. To establish a bag-of-words representation, a video is temporally divided into non-overlapping clips, and each corresponding visual document is composed of the words accumulated over the frames of its clip. The activities are represented by co-occurring visual words.
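The mapping from a flow vector to a visual word can be sketched as follows. The bin layout (bins aligned to the positive x-axis) and the word ordering (block by block) are assumptions for illustration, and the function names are hypothetical.

```python
import math

def direction_bin(dx, dy, n_bins):
    """Quantize a flow direction into one of n_bins equal angular bins."""
    angle = math.atan2(dy, dx) % (2 * math.pi)
    return int(angle / (2 * math.pi / n_bins)) % n_bins

def codebook_size(bins_per_block):
    """Codebook size: the sum of directional bins over all blocks."""
    return sum(bins_per_block)

def word_index(block, dir_bin, bins_per_block):
    """Index of a visual word, with words ordered block by block."""
    return sum(bins_per_block[:block]) + dir_bin

# Three blocks: the middle one is diverse enough to get eight bins.
bins = [4, 8, 4]
size = codebook_size(bins)
word = word_index(1, direction_bin(-1.0, 0.5, 8), bins)
```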

Mid-Level Visual Words
Although the size of the codebook is decreased by the adaptive quantization, its dimension is still high. In this section, the HDP [10] model shown in Figure 5 is used for clustering these low-level words into topics (atomic activities). The HDP is a nonparametric hierarchical Bayesian model. The advantage of using the HDP model is that it can automatically provide the number of discovered topics (atomic activities) that are deemed as mid-level visual words. Thus, the video can be represented in a more compact way.
There are two levels of the Dirichlet process (DP) in the HDP model, and the generative process is given as follows:
• In the first level, the DP generates a global random measure $G_0$ with concentration parameter $\gamma$ and base probability measure $H$: $G_0 \mid \gamma, H \sim DP(\gamma, H)$. $G_0$ can be formulated using the stick-breaking construction as $G_0 = \sum_{k=1}^{\infty} \pi_{0k} \delta_{\phi_k}$, where the atoms $\phi_k$ are drawn from $H$ and the weights $\pi_{0k}$ are obtained by stick breaking with parameter $\gamma$.
• In the second level, the DP generates a random measure $G_t$ for each clip $d_t$ with base probability measure $G_0$ and a concentration parameter, denoted here by $\alpha$: $G_t \mid \alpha, G_0 \sim DP(\alpha, G_0)$. $G_t$ is a prior distribution over all the words in document $d_t$, with only a subset of the topics in $G_0$ active. It is again formulated using the stick-breaking construction as $G_t = \sum_{k=1}^{\infty} \pi_{tk} \delta_{\phi_k}$, so that all clips share the atoms $\phi_k$ of $G_0$.
In the learning process, every low-level word is assigned a topic identification. Furthermore, a document (video clip) $d_t$ is represented by the mixture $G_t$ of topics. The discovered topics (atomic activities) are used directly to create a multivariate point process in the following section.
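The stick-breaking construction mentioned above can be illustrated with a truncated sampler. This is only a sketch of the standard construction, not the inference procedure used in the paper; the truncation level `n_atoms` and the function name are assumptions.

```python
import numpy as np

def stick_breaking(concentration, n_atoms, rng):
    """Truncated stick-breaking weights.

    pi_k = b_k * prod_{l<k} (1 - b_l), with b_k ~ Beta(1, concentration).
    """
    b = rng.beta(1.0, concentration, size=n_atoms)
    # Length of the stick that remains before each break.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - b[:-1])))
    return b * remaining

rng = np.random.default_rng(0)
weights = stick_breaking(concentration=1.0, n_atoms=50, rng=rng)
```

A small concentration value puts most of the mass on a few atoms, which is why the model can settle on a modest number of topics without that number being fixed in advance.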

Multivariate Point-Process
For each video sequence, topics are detected and a multivariate point process is generated by considering each topic $z_i$ as a point event. By applying HDP inference to the low-level word documents, the probability of topic $z_i$ in document $d_t$ can be estimated. The number of occurrences of topic $z_i$ in the time interval $(t, t + dt]$ is then defined as $dM_i(t) = M_i(t + dt) - M_i(t) = p(z_i \mid d_t)\, n_t$, where $dt$ represents the time resolution; $M_i(t)$ denotes the number of occurrences of topic $z_i$ in the interval $(0, t]$; $p(z_i \mid d_t)$ is the probability of topic $z_i$ in document $d_t$; and $n_t$ represents the total number of visual words in the clip. The mean intensity of the process $M_i(t)$ is defined by $E\{dM_i(t)\} = \lambda_i\, dt$, and the zero-mean process is $N_i(t) = M_i(t) - \lambda_i t$.
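A minimal sketch of this construction is given below, assuming that the increment of topic $z_i$ in a clip is the topic probability $p(z_i \mid d_t)$ multiplied by the word count $n_t$, and that the mean intensity is estimated empirically as the average increment per clip. The function name is hypothetical.

```python
import numpy as np

def topic_increments(topic_probs, words_per_clip):
    """Build per-clip topic counts and their zero-mean version.

    topic_probs: array (clips, topics), row t holds p(z_i | d_t).
    words_per_clip: array (clips,), entry t holds n_t.
    """
    p = np.asarray(topic_probs, dtype=float)
    n = np.asarray(words_per_clip, dtype=float)
    d_m = p * n[:, None]          # occurrences of each topic in each clip
    lam = d_m.mean(axis=0)        # empirical mean intensity per clip
    return d_m, d_m - lam         # raw and zero-mean increments

d_m, d_n = topic_increments([[0.5, 0.5], [1.0, 0.0]], [10, 20])
```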

Nonparametric Granger Causality
In this section, a nonparametric estimation of Granger causality [16] for multivariate point processes is used. This method bypasses autoregressive (AR) model fitting. The calculation proceeds as follows. Given the multivariate point process, its spectral matrix is defined as $S(f) = [S_{ij}(f)]$, where the off-diagonal elements $S_{ij}(f)$, $i \neq j$, represent the cross-spectra and the diagonal elements $S_{ii}(f)$ represent the auto-spectra. The spectral matrix is estimated using the multitaper method [18], in which $K$ data tapers $\{h_k\}_{k=1}^{K}$ are applied successively to the $i$th topic and the Fourier transform of each tapered process is taken. The spectral matrix elements $S_{ij}(f)$ are then estimated from these tapered transforms [19]. The spectral matrix is factorized as [20] $S(f) = T(f)\, \Sigma\, T^{*}(f)$, where $T(f)$ is the transfer function between the processes, $T^{*}(f)$ is its conjugate transpose, and $\Sigma$ is the noise process covariance. After spectral factorization, the Granger causality from $N_j(t)$ to $N_i(t)$ at frequency $f$ is given by [21] as follows:
$G_{N_j \to N_i}(f) = \ln \dfrac{S_{ii}(f)}{S_{ii}(f) - \left(\Sigma_{jj} - \Sigma_{ij}^{2} / \Sigma_{ii}\right) |T_{ij}(f)|^{2}}$ (16)
Note that the measure is asymmetric, that is, $G_{N_j \to N_i} \neq G_{N_i \to N_j}$ in general. A scalar measure of causality between processes $N_j(t)$ and $N_i(t)$ can be obtained by integrating Equation (16) with respect to the frequency, which yields the causal score $C$, where $C(i, i) = 0, \forall i$.
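Once the transfer function and the noise covariance are available, the frequency-domain measure can be evaluated directly. The sketch below assumes the standard Geweke form of the measure for a pair of processes and takes `T` and `Sigma` as given; it does not perform the multitaper estimation or the spectral factorization, and the function name is hypothetical.

```python
import numpy as np

def granger_spectrum(T, Sigma, i, j):
    """Granger causality from process j to process i at each frequency.

    T: complex array (freqs, 2, 2), transfer function per frequency.
    Sigma: real array (2, 2), noise covariance.
    """
    # Spectral matrix S(f) = T(f) Sigma T*(f).
    S = T @ Sigma @ np.conj(np.transpose(T, (0, 2, 1)))
    s_ii = S[:, i, i].real
    partial = Sigma[j, j] - Sigma[i, j] ** 2 / Sigma[i, i]
    return np.log(s_ii / (s_ii - partial * np.abs(T[:, i, j]) ** 2))

# Toy system: process 0 drives process 1, but not the other way round.
T = np.tile(np.array([[1.0, 0.0], [0.5, 1.0]], dtype=complex), (4, 1, 1))
Sigma = np.eye(2)
g_0_to_1 = granger_spectrum(T, Sigma, i=1, j=0)
g_1_to_0 = granger_spectrum(T, Sigma, i=0, j=1)
```

In the toy system the measure is positive in the driving direction and zero in the reverse direction, which illustrates the asymmetry noted above.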

Causal Graph
To discover the interaction event sets, the pair-wise causal scores are first thresholded. The threshold is computed using an empirical null-hypothesis testing framework [22]. Finally, the Granger causality is represented by a directed graph, where nodes denote topics and edges denote the causal relations between them. Generally, the relation between two topics (e.g., topic 1 and topic 2) falls into one of four cases:
• Topic 1 → Topic 2: Topic 1 drives topic 2.
• Topic 1 ← Topic 2: Topic 2 drives topic 1.
In both of the above cases, we define these two topics as temporal causal topics.
• Topic 1 ↔ Topic 2: There is a bidirectional causal relation between these two topics, which leads to the conclusion that the two topics are reciprocally coupled. We define them as temporal concurring topics.
• Topic 1 × Topic 2: There is no direct causal relation between these two topics.
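The four cases can be read off a thresholded causal matrix as sketched below. The index convention (entry `C[a][b]` scores the influence of topic a on topic b) and the function name are assumptions for illustration; the matrix values are made up.

```python
def relation(C, a, b, thresh):
    """Classify the relation between topics a and b from causal scores.

    Assumes C[a][b] is the causal score from topic a to topic b.
    Scores below the threshold are treated as no causal relationship.
    """
    forward = C[a][b] >= thresh
    backward = C[b][a] >= thresh
    if forward and backward:
        return "concurring"        # reciprocally coupled topics
    if forward:
        return "a drives b"        # temporal causal topics
    if backward:
        return "b drives a"        # temporal causal topics
    return "none"                  # no direct causal relation

# Made-up causal scores for three topics.
C = [
    [0.0, 0.8, 0.7],
    [0.1, 0.0, 0.2],
    [0.9, 0.3, 0.0],
]
labels = {(a, b): relation(C, a, b, 0.6) for a, b in [(0, 1), (1, 0), (0, 2), (1, 2)]}
```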

Experiments and Discussions
In this section, the proposed algorithm is tested on three public video sequences: a street intersection dataset (360 × 288, 25 fps, 1 h), a roundabout dataset (360 × 288, 25 fps, 1 h), and a subway platform dataset (360 × 288, 25 fps, 40 min). Both the street intersection dataset and the roundabout dataset are traffic videos selected from the QMUL dataset, in which traffic is governed by traffic lights in a certain temporal order. Thus, the sequence of activities exhibits spatio-temporal periodicity. Typically, there are several flows at a time, and each flow may last for a period. The subway platform dataset is from the UK Home Office i-LIDS dataset and has been re-edited. This dataset is significantly different from the other two datasets. It is captured indoors and mainly features humans and trains. The camera was mounted significantly lower and closer to the objects. The typical behaviors in this scene include people leaving or approaching the platform, and people getting on or off the train. We assess the proposed method with both qualitative and quantitative evaluations. First, we demonstrate that our analysis can explore the activity interactions and temporal causal relationships. Then, we quantitatively evaluate the results of the causal analysis through the task of scene classification.

Activities Analysis
In this section, we apply our approach to extract the activities, which are then represented as multivariate point processes. Each video was temporally segmented into 3-s long clips. The optical flows were computed and denoised, and then mapped into words based on the adaptive quantization. Finally, the HDP was applied to learn a generative model of the video clips. During the learning process, this model effectively clusters co-occurring visual words into topics (activities). Twenty-one topics were automatically discovered by the HDP model for the intersection dataset, 26 topics for the roundabout dataset and 29 topics for the subway platform dataset. The topics are represented by different colors and are further analyzed to search for temporal causality. Figure 6a shows the motion distributions of the top eight topics (sorted by size) that explain at least 5% of all observations for the intersection dataset. Topics 1 and 5 describe vehicles moving upward along different lanes. Topic 8 describes vehicles turning right. Topics 2 and 7 describe vehicles moving downward but in different zones. Topic 3 describes the horizontal traffic flow from left to right. Topics 4 and 6 describe the horizontal traffic flow from right to left. Topic 6 may be shared by several activity interactions. Based on these topics, eight point processes are constructed, as shown in Figure 6b.
For the subway platform dataset, we select the top eleven topics (sorted by size) that explain at least 4% of all observations. As shown in Figure 8a, topics 1, 2, 5, 6 and 11 describe people leaving the platform. Topic 4 describes the motion of the train. Topics 8 and 9 represent people getting on and off the train, respectively. Topics 3, 7 and 10 represent people approaching the platform. Figure 8b shows the corresponding eleven point processes.

Granger Causality Analysis
To automatically discover the activity interactions and temporal causal relationships between the topics, Granger causality analysis was applied to the processes of Figures 6-8. The results are shown in Tables 1-3. Causal scores less than the threshold (0.6 for the intersection dataset, 0.7 for the roundabout dataset, and 0.5 for the subway platform dataset) are deemed to indicate no causal relationship. The causal matrices are then interpreted as directed causal graphs. As shown in Figure 9a, Figure 10a, and Figure 11a, nodes represent topics and edges denote detected pair-wise relationships. It can be observed in Table 2 that topics 2, 5 and 11 have no causal relationships with the other topics. This indicates that these three topics occur independently of the other topics.
Table 3. Causal matrix for the subway dataset.
In Figure 9a, a direct connection from topic 6 to topic 7 is observed, along with an indirect connection from topic 6 to topic 7 through topic 3. To distinguish the direct causal influence from the indirect causal influence, the conditional Granger causality is computed. When conditioned on topic 3, the causal score from topic 6 to topic 7 (0.56) falls below the threshold value (0.6). Therefore, the connection between topic 6 and topic 7 is spurious because of the mediated influence from topic 3, and this connection is removed in Figure 9a. Similarly, the connections between topics 1, 7 and 4 are also removed.
Based on the definitions of the relations between two topics (see Section 3), the topics are grouped into activity interactions (states). Furthermore, the temporal causal relationship between them is explored. The visualizations of the temporal groupings are shown in Figure 9b. It is evident that the Granger causality results can identify the traffic light cycle governing the scene. Thus, four states are automatically found.
Compared to the intersection dataset, the scene from the roundabout dataset is more complex and the video sensor is mounted significantly further from the objects. In Figure 10a, there are three sets of concurring topics: topics {8 10}, topics {4 6} and topics {1 3 12}. In conclusion, these topics are grouped into three activity interactions (states):
• State A: topics {8 10};
• State B: topics {4 6};
• State C: topics {1 3 7 9 12}.
The temporal relationship between them is then explored. We determined that the proposed approach failed to detect the relationship between State C and State A because State A and State C would occasionally occur simultaneously. Therefore, the complete traffic light cycle is not discovered. The visualizations of the temporal groupings are shown in Figure 10b. Thus, three states are automatically found.
Compared with the above two scenes regulated by traffic lights, the global temporal order in the subway platform scene is not obvious. However, the local temporal orders of the activities are well identified. To clarify, we construct three causal graphs, as shown in Figure 11a. The corresponding six motion patterns are illustrated in Figure 11b. State A represents the behavior where people get off the train and leave the platform after the train arrives. States B, C and D represent the motion of people leaving the platform along different lines. State E represents the behavior of people approaching the platform and getting on the train. State F represents the behavior of people traveling through the platform.

Scene Classification
To objectively measure the performance of the proposed approach, we use a scene classification task. We select the intersection and roundabout datasets for this experiment. Based on the causal analysis results, a causal grouping method is proposed for scene classification. After the HDP model is used as a feature dimension reduction step, a distribution over topics is associated with each clip. For causal grouping, we use the distribution over the same dominant topics (see Section 4.1) to represent the clips. Then, the largest topic is selected for each clip, and the state to which the clip belongs is predicted automatically. Finally, the short video clips are grouped into different classes. The classification performance of the proposed causal grouping method is compared with that of the K-means clustering method. The K-means method (with the Bhattacharyya distance) directly clusters video clips using the topic distributions as feature vectors. The number of cluster centers of K-means is set to the number of classes of the causal grouping. After clustering, each cluster is manually identified to determine the class to which it belongs. In each case, the results are quantified in terms of the overall classification accuracy. To evaluate the classification performance, a ground truth is created by manually labeling all video clips with the typical interactions.
Table 4 compares the classification accuracy for the intersection dataset with different quantization resolutions. For constant quantization, the optical flow directions are quantized into four bins. For location quantization, there are five different resolutions (4 × 4, 8 × 8, 16 × 16, 32 × 32 and 64 × 64). In this experiment, the K-means technique is used as the classification method. It is evident that with the adaptive quantization, the size of the codebook decreases while the classification accuracy improves.
Tables 5 and 6 compare the classification accuracy of the K-means clustering method and our causal grouping method for the intersection and roundabout datasets, respectively. For the intersection dataset, the proposed approach is superior to K-means clustering for classes A, C and D. However, for class B, the causal grouping is inferior to K-means clustering. Similarly, for the roundabout dataset, the causal grouping method performs significantly better than the K-means clustering method, except for class A. In terms of the overall classification accuracy, the proposed approach is superior to K-means clustering on both datasets. In particular, as seen in Table 6, K-means clustering obtains an overall classification accuracy of only 70.20%, while the causal grouping method achieves 89.39%.
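The causal grouping step described above can be sketched as follows: each clip takes the state of its largest topic, and accuracy is the fraction of clips whose predicted state matches the manual label. The topic-to-state mapping and the toy data are made up for illustration, and the function names are hypothetical.

```python
import numpy as np

def causal_grouping(topic_dist, topic_to_state):
    """Label each clip with the state that contains its largest topic.

    topic_dist: array (clips, topics) of per-clip topic proportions.
    topic_to_state: dict mapping topic index to a state label, as would be
    read off the causal graph.
    """
    dominant = np.argmax(topic_dist, axis=1)
    return [topic_to_state[int(k)] for k in dominant]

def accuracy(pred, truth):
    """Overall classification accuracy."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

# Made-up example: topics 0 and 1 belong to state A, topic 2 to state B.
dist = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6],
                 [0.2, 0.5, 0.3]])
pred = causal_grouping(dist, {0: "A", 1: "A", 2: "B"})
acc = accuracy(pred, ["A", "B", "B"])
```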
As shown in Figures 12 and 13, the full classification performance is also evaluated using normalized confusion matrices. Our method yields a high true positive rate for most classes. However, it is also evident that the true positive rates of class B (Figure 12a) and class A (Figure 13a) are not improved.
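A row-normalized confusion matrix of the kind used in this evaluation can be computed as in the sketch below, in which each row corresponds to a true class and sums to one, so the diagonal holds the per-class true positive rate. The function name and the toy labels are assumptions for illustration.

```python
import numpy as np

def normalized_confusion(truth, pred, n_classes):
    """Row-normalized confusion matrix (rows: true class, columns: predicted)."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(truth, pred):
        cm[t, p] += 1
    row = cm.sum(axis=1, keepdims=True)
    # Rows with no samples are left as zeros instead of dividing by zero.
    return np.divide(cm, row, out=np.zeros_like(cm), where=row > 0)

cm = normalized_confusion(truth=[0, 0, 1, 1], pred=[0, 1, 1, 1], n_classes=2)
```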
To provide further insight into the performance difference, the distributions over the atomic topics for each class are analyzed. For the manual labels and the causal grouping, the average topic mixture of each class is computed (without sorting). For the K-means method, however, the cluster centers are determined by the algorithm itself. As shown in the first rows of Figures 14 and 15, the dominant topics for each class are illustrated by the most likely visual words. In the other three rows, the topic mixture proportions corresponding to each class are represented by bars. The x-axis is the index of the atomic activities, and the y-axis is the mixture over the atomic activities. Different colors indicate different topics.
In Figure 14, (a) shows the traffic moving in a vertical direction; (b) represents turning traffic with various vertical traffic; (c) and (d) represent the rightward and leftward traffic flows, respectively. Clearly, topic 4 (topic 6 in Section 4.1) is shared between classes B and C, but class B is only represented by topic 9 (topic 8 in Section 4.1); thus, it is easily misclassified as class C (see Figure 12a).
In Figure 15, (a) shows the leftward traffic flows; (d) represents the rightward traffic flows; (c) represents the vertical traffic flow with right-turning traffic. As shown by the first row in Figure 15, topics 2, 4 and 10 (topics 2, 5 and 11 in Section 4.1) are independent of the other topics; they are always present. Furthermore, topic 3 (also topic 3 in Section 4.1) is shared by both class B and class C, but in Figure 10 class B does not include this topic. As a result, a few real class B clips are misclassified as class C (see Figure 13a). Overall, the topic distribution for each class obtained by the causal grouping method is more consistent with the manual labeling results, especially for class C.

Conclusions and Discussion
Scene analysis is a challenging problem in crowded outdoor environments, especially in situations where multiple activities occur simultaneously. In this paper, we present a novel framework to understand complex scenes by exploring activity interactions and their temporal dependencies. First, a statistical denoising method is proposed to select useful dynamic regions in the scene for further analysis, and a codebook is generated using adaptive quantization. Next, we propose an approach to interpret the atomic activities discovered by the HDP model as a multivariate point process. By performing non-parametric Granger causality analysis on pairs of atomic activities, we can identify patterns of activity interactions and temporal rules. Additionally, the result of the causal analysis is used as a feature for scene classification, which achieves higher accuracy than K-means clustering.
In future studies, additional experiments on different datasets will be conducted to evaluate the generalization of the proposed approach. Furthermore, the sensitivity to parameter settings is also a question that will be considered and investigated. Moreover, during the causality analysis, in addition to causal scores, the causal period should be considered to explore more exact topic interactions. We will study more precise causal grouping algorithms and better usage of the causal analysis results.