This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
Associating attributes to pedestrians in a crowd is relevant for various areas, such as surveillance, customer profiling and service providing. The attributes of interest greatly depend on the application domain and might involve such social relations as friends or family as well as the hierarchy of the group, including the leader or subordinates. Nevertheless, the complex social setting inherently complicates this task. We attack this problem by exploiting the small group structures in the crowd. The relations among individuals and their peers within a social group are reliable indicators of social attributes. To that end, this paper identifies social groups based on explicit motion models integrated through a hypothesis testing scheme. We develop two models relating positional and directional relations. A pair of pedestrians is identified as belonging to the same group or not by utilizing the two models in parallel, which defines a compound hypothesis testing scheme. By testing the proposed approach on three datasets with different environmental properties and group characteristics, we demonstrate an identification accuracy of 87% to 99%. The contribution of this study lies in its definition of positional and directional relation models, its description of compound evaluations, and its resolution of ambiguities with the proposed uncertainty measure based on local and global indicators of group relation.
Keywords: motion model; tracking; recognition

Introduction and Motivation
The observation of human behavior in public environments, such as shopping malls, sport venues or stations, is a common practice. To increase our understanding of these data and utilize them more efficiently, we must associate attributes to individual pedestrians. The attributes of interest depend considerably on the application. For instance, resolving the social relation between customers, such as mother-son, friends or couple, is relevant in customer profiling [1]. Similarly, in intelligent environments, service quality can be improved by tailoring services to clients according to their inferred relation to their partners. Besides, in public environments such as prisons or stadiums, recognizing the leader or the subordinates of groups is helpful for investigating aggressive or criminal activities [2,3].
However, the association of such attributes is considerably difficult due to the inherent contextual asperities and complex social relations. We propose treating this problem primarily by decomposing the entire crowd into smaller structures. In other words, we propose handling the crowd as a combination of social groups and single individuals. Once we obtain such a categorization, assigning social attributes is easier. We base our definition of social groups on the work of McPhail and Wohlstein [4], who regard a group as two or more people who are engaged in a social relation and move together toward a common goal.
The detection of pedestrian groups is challenging from several perspectives. Figure 1 illustrates a scene from a public space, where friends and families are walking and the detection of group relations is not straightforward. Here, the gender, clothing and age of the pedestrians are important cues indicating a social relation such as a couple or friends. Human cognition has evolved in such a way that these personal properties are identified easily in an unconscious manner. However, the estimation of such cues from surveillance footage is not possible in most cases, since traditional image-based methods do not perform well for such recordings.
Therefore, we propose taking a closer look at the trajectories, namely the distribution of the displacements and the scalar product of the velocity vectors. Based on these, we develop two explicit schemes for modeling the interaction among group members, in addition to two other schemes for modeling the interaction between groups and single pedestrians. The models are calibrated for different sorts of environments, group structures, and densities. With our proposed hypothesis testing scheme, we show that our method can resolve group relations to a considerable degree under various conditions.
The outline of the paper is as follows. Section 2 presents prominent works in this field, and Section 3 elaborates on the properties of the datasets employed in modeling and evaluation. Sections 4 and 5 discuss the motion models and the integration of the individual indicators with the help of uncertainty measures. Finally, Section 6 presents our experimental results indicating stability, performance, sensitivity, and generalization issues, in addition to a comparison with an earlier work in the literature and an alternative decision scheme.
Background and Related Work
As smart environments spread, a vast amount of data is gathered, particularly from public spaces. The analysis of the crowd behavior in this sort of data is of great interest to numerous research fields such as crowd modeling and simulation, public space design, visual surveillance, and event interpretation [5]. In this section, we focus on previous works that interpret ambient information from a social relation perspective.
Human activity analysis bears numerous challenging traits [6]. For the solution of this problem, a social signaling standpoint is adopted by Cristani et al. [7], utilizing primarily the nonverbal cues of human behavior. Gatica-Perez gives a detailed overview of the nonverbal cues of small group relations, such as internal states, personality, and social relations [8]. Additionally, Costa demonstrates that group behavior presents distinctions in interpersonal distances depending on dominance, attraction, age similarity, and gender of the group members [9]. In the rest of this section, we refer to such complex features as the high-level cues of group relation. Such cues are specific to individuals. In contrast, low-level cues involve features like spatial position, velocity or motion direction, which are not specific to individuals. We categorize low-level cues into two classes, linear and circular variables. Linear variables involve spatial position, trajectory shape, and the configuration of group members, while circular variables comprise motion direction and the correlation of velocities.
Recently, the utilization of high-level cues has become a popular approach in the association of attributes to individuals, particularly in social network research. Several works address the investigation of social relations based on such universally valid implicit cues as the age difference between parents and children or the opposite genders of heterosexual couples [10,11]. Some studies investigate kin relationships using photo albums that span a long time window of several years or even decades [12,13]. On the other hand, the proximity relation of faces in an image [14], clothing, or facial expressions [15] are used to estimate social relations.
For several contextual and practical reasons, these studies apply only to the image domain and not to surveillance footage. First of all, in images from family albums or social networks, it is evident that the individuals appearing in the same image are related to each other. The question then becomes resolving the type of relationship. However, the relation among pedestrians in a crowd is not obvious. Moreover, in video surveillance, high-level cues are not available at all times.
To account for these challenging conditions, several studies propose integrating low-level and high-level cues. For instance, Ding et al. employ low-level cues in concept detection and define a Gaussian-process-based affinity learning for spotting social networks in theatrical movies and YouTube videos [16]. However, the appearance matrix relating the actors in a movie is derived from the script by searching for the names of the characters, which is not applicable to surveillance footage. By identifying the group structure, such behaviors as aggression or agitation are analyzed in [2]. Yu et al. assume that the 3D tracks of individuals and corresponding high-resolution face images are provided to investigate social groups and their organizations [3], which cannot be generalized to most other problems.
Compared with high-level cues, low-level ones are easier to derive. However, the analysis of group-level activity based on low-level cues is profoundly integrated with stable multi-object tracking [1,17]. In other words, the occlusion arising from group motion, which stands as a significant challenge at first glance, can potentially be exploited for the enhancement of data association [18,19]. Namely, the search area is restricted based on the estimated future location of the objects from their past trajectories and motion models. Therefore, dynamic models accounting for the collective locomotion behavior of pedestrians are proposed to improve tracking performance, particularly against occlusions, in [20–22].
By exploiting low-level linear cues, several studies propose employing the contextual information provided by the configuration of groups to detect collective unusual behavior in public spaces. However, note that the problem of the resolution of group relations cannot be reduced to determining the similarity of trajectories [23]. Methods that investigate similarity between individual trajectories are mainly used in semantic scene modeling. They do not establish a relationship between simultaneously observed trajectories, which is the core of our problem [24,25]. Instead of finding the similarities between trajectories, Habe et al. propose finding interactions between trajectories to resolve the mutual relationship between pedestrians. The influence that pedestrians exert on each other in the transition of motion states is investigated [26]. Floor control constitutes another commonly used low-level linear cue of collective human activities [27,28]. However, French et al. propose employing only the circular low-level cue of velocity correlation in a Bayesian framework and ignore the interpersonal distances [29]. In their framework, close proximity is not regarded as an indicator of group motion since it is claimed to be misleading in complex settings. Similarly, Calderara et al. omit the spatial relationships of trajectory points and focus on trajectory shapes [30]. Namely, they handle the problem from a circular statistics standpoint and cluster trajectories into similarity classes.
Yücel et al. suggest combining the linear and circular attributes [31–33]. In their framework, group relation is characterized by the distance between the moving parties and the alignment of their velocity vectors. Similarly, Ge et al. propose an algorithm to detect pedestrian groups through a bottom-up hierarchical clustering scheme based on locomotion similarities derived from an aggregated measure of velocity difference vectors and spatial proximity [34]. Similar to [34], Sandıkçı et al. propose to integrate the positional and directional cues in the resolution of group relations by defining similarity metrics for position, velocity, and direction, all of which in turn are expressed in a joint similarity matrix, followed by an agglomerative clustering approach [35]. Nonetheless, their motion models assume a very simple structure, which might not suffice to capture the distinctive attributes of group behavior. Bahlmann integrates linear and circular variables in a fairly different problem: online handwriting recognition [36]. Integration is achieved through an approximated wrapped Gaussian distribution, which only holds for data with low deviation, i.e., σ < 1. Besides, this approach assumes that the probability density function of the linear variable is Gaussian. These two assumptions enable integration into a multivariate semi-wrapped distribution. However, neither holds for pedestrian trajectory data.
In addition to multi-object tracking and activity recognition, group models play an important role in such other fields as traffic analysis, evacuation dynamics, and the social sciences. Numerous works in pedestrian simulation are inspired by the social force model [37,38]. Lerner et al. describe a pedestrian simulation method, where a real-world recording is employed to reflect behavioral complexity on the individual and group levels [39].
In light of these observations, we introduce a fundamental insight into collective pedestrian motion models by focusing on a short time interval and deriving low-level cues to infer the social relation. We relax the conditions defining group motion and provide a flexible means of identification for group relations. Since the final decision regarding group relations is based on the combination of positional and directional indicators, this problem is regarded as compound hypothesis testing. Various experiments demonstrate that our proposed method effectively grasps the characterizing features of group relations and can recognize group activity with significantly high performance rates under varying environmental conditions and group configurations. Our paper makes the following contributions:
Positional modeling accounting for dyadic as well as multipartner groups;
Directional modeling in both uniform and nonuniform environments;
Integration of positional and directional indicators through compound hypothesis testing;
Definition of local and global indicators and an uncertainty measure.
Datasets
Three publicly available datasets are employed in development and testing of the motion models, namely Caviar, BIWI Walking Pedestrians dataset, and APT Pedestrian Behavior Analysis dataset [20,40,41]. These are picked so as to effectively demonstrate the generalization capabilities of our proposed approach against varying environmental conditions and distinctions in group structure.
In the Caviar dataset, five videos recorded from an oblique view over the entrance hall of a building involve group motion. The pedestrians present meeting and splitting behavior as well as uninterrupted group motion [40]. Although its size is quite moderate, the Caviar dataset is considered in this study mainly due to the publicly available ground truth concerning groups, which provides a fair comparison with other methods. The BIWI Walking Pedestrians dataset contains two sequences, BIWI-ETH and BIWI-Hotel, recorded from a bird's-eye view with a total of 650 tracks over 20 minutes [20]. The experiment scenes are the entrance of a building and a sidewalk. Due to the characteristics of these scenes, there is a dominant direction in the pedestrian flux (see Figure 2(b)). The APT Pedestrian Behavior Analysis dataset is recorded in the entrance hall of a shopping center [41] (see Figure 2(c)). Unlike BIWI, no prominent flow exists in any direction, though a tendency to walk along a certain direction is noticeable. Due to the homogeneous distribution of the flow, the APT dataset is regarded as coming from a uniform environment.
Table 1 shows the total number of observed pedestrians and the group sizes. The Caviar dataset involves a fairly small number of pedestrians. BIWI-ETH contains various multi-partner groups, whereas BIWI-Hotel and APT are composed mainly of dichotomous groups, which are often walking abreast. As the group size gets larger, the possibility of an abreast configuration decreases, particularly at high pedestrian densities; i.e., the groups may be bent forward or backward as well as arranged in a single file [42]. Among these sets, BIWI-ETH has the highest density, followed by BIWI-Hotel, APT, and Caviar, respectively.
From Figure 2 and Table 1, the main differences between these sets are concluded to be the presence of a preferred direction in BIWI-ETH and BIWI-Hotel against the more homogeneous distributions in Caviar and APT, and the frequent observation of multi-partner groups in BIWI-ETH against the dominance of dichotomous groups in BIWI-Hotel and APT. These variations are taken into consideration in the development of the motion models.
Since this study proposes an identification method for groups of pedestrians rather than a tracking algorithm, we consider well-tracked trajectories and carry out our analysis to identify the pedestrian groups from these trajectories. For the BIWI-ETH, BIWI-Hotel and APT datasets, trajectories obtained by state-of-the-art tracking algorithms are publicly available [20,41,43,44]. For the Caviar dataset, we performed manual annotation and estimated the homography matrix to map the annotated pixel coordinates to the ground plane. The sampling period of the trajectory points is 160 ms for the BIWI-ETH and BIWI-Hotel sets and 100 ms for the APT set. For the Caviar dataset, the sampling period is 200 ms. The group relations for all datasets are provided as ground truth [41,44,45]. Using these trajectories and ground truth values, a convenient formulation is offered in accordance with the characteristics of the environment and the group structure.
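The mapping from annotated pixel coordinates to the ground plane can be sketched as follows; `to_ground_plane` is a hypothetical helper, not the annotation pipeline's actual code, and the 3 × 3 homography matrix H is assumed to be already estimated.

```python
import numpy as np

def to_ground_plane(H, pixels):
    """Project annotated pixel coordinates to the ground plane.

    H is an estimated 3x3 homography matrix and `pixels` an (N, 2) array;
    this is a generic sketch, not the paper's annotation pipeline.
    """
    # Append a homogeneous coordinate, apply H, then divide by the last row.
    pts = np.hstack([pixels, np.ones((len(pixels), 1))])
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]
```

With the identity homography the points are returned unchanged, which offers a quick sanity check before plugging in the estimated matrix.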
Modeling Indicators of Group Motion
The questions addressed in this study are which parameters characterize group motion, how we can model them, and how we can determine whether two pedestrians belong to the same group or not. In what follows, we introduce the terminology used in the rest of this study and then describe our proposed models of the indicators of group motion.
We term any two pedestrians who are observed simultaneously as a pair. Suppose that the pairs who are engaged in a group relation, such as {p_{i},p_{j}} of Figure 3, constitute the set G, whereas the pairs who are not engaged in a group relation, such as {p_{i},p_{h}}, comprise the complementary set Ḡ [4].
Based on the findings of [46], group motion is mainly characterized by positional indicators and directional indicators. We quantify positional indicators in terms of interpersonal distance, whereas directional indicators are defined based on motion directions. In explicit terms, the positional indicator of group motion is represented by Δ and is composed of a set of linear variables {δ}, where δ stands for the instantaneous distance between pedestrians (see Figure 3). On the other hand, the directional indicator, which is represented by Θ, is a set of circular variables, i.e., angles between simultaneously observed velocity vectors {θ} (see Figure 3).
Obviously, in order to define a meaningful value for θ, the pedestrians should be moving with a velocity larger than a reasonable threshold. We picked this value by examining the distribution of velocity for all people in the environment (see Figure 4). In the BIWI-ETH dataset, the people who wait at the tram station have low velocities distributed more or less uniformly over 0 to 0.5 m/s. On the other hand, there are basically two peaks in the velocity distribution of the APT dataset. The first peak is centered around 0.1 m/s and relates to the people who are watching the shelves, whereas the second peak is centered around 1.2 m/s and relates to the people who walk steadily. Nevertheless, the number of slowly moving people is quite low compared with the steadily walking pedestrians. Thus, we picked 0.375 m/s as the velocity threshold.
Since the velocity threshold is picked around the local minimum of the velocity distribution separating the moving and stationary pedestrians, shifting it slightly would not affect a large number of pedestrians and thus would not change the performance of the proposed method drastically. Moreover, the local minima observed in the BIWI-ETH and APT datasets do not arise due to the specific characteristics of these environments. According to Helbing et al., at normal density the velocity of pedestrians follows a normal distribution with an average of 1.34 m/s and a standard deviation of 0.26 m/s [47]. These values may change slightly according to the environment, but by putting the velocity threshold around 0.3 ∼ 0.5 m/s we make sure that it lies at least 2σ from the peak [48].
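The threshold selection described above can be sketched as a search for the least-populated speed bin between the standing and walking modes; `pick_velocity_threshold` and its defaults are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def pick_velocity_threshold(speeds, bin_width=0.05, lo=0.3, hi=0.5):
    """Pick a speed threshold at the histogram minimum inside [lo, hi] m/s.

    `speeds` is a flat array of instantaneous pedestrian speeds (m/s); the
    search window follows the 0.3-0.5 m/s range quoted in the text.
    """
    bins = np.arange(0.0, speeds.max() + bin_width, bin_width)
    counts, edges = np.histogram(speeds, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    window = (centers >= lo) & (centers <= hi)
    # The least-populated bin in the window separates standing from walking.
    idx = np.argmin(counts[window])
    return centers[window][idx]
```

On speeds drawn from a standing mode near 0.1 m/s and a walking mode near 1.34 m/s, the returned threshold should fall inside the quoted 0.3 ∼ 0.5 m/s band.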
Based on these definitions, each pair of pedestrians is represented by a set composed of these two indicators, {Δ,Θ}. Moreover, each of G and Ḡ is described by two models characterizing the positional and directional relations, i.e., Δ_{G} and Θ_{G} or Δ_{Ḡ} and Θ_{Ḡ}. The identification problem is addressed with two parallel applications of the same approach, i.e., investigating whether Δ ∼ Δ_{G} or Δ ∼ Δ_{Ḡ} and whether Θ ∼ Θ_{G} or Θ ∼ Θ_{Ḡ}. The final decision is rendered based on the outcomes of these two tests, where the outcome implying the lower uncertainty is preferred in case of ambiguities.
In our previous study, we followed a similar strategy and proposed a simplistic method to identify group motion [31]. Ideally, the pedestrians involved in group motion are expected to be in close proximity and have perfectly aligned velocity vectors. Since these ideal conditions are seldom met, certain thresholds are applied to account for the non-ideal nature of the behavior. In this manner, satisfactory performance rates are achieved. Nevertheless, explicit models are necessary to improve the performance and to make the method flexible enough to adapt effectively to different settings. To that end, the proximity and motion direction of pedestrians involved in a group relationship are investigated closely, and a mathematical model is proposed for each of the relating probability density functions (pdfs) in what follows.
Modeling Positional Indicators
The positional indicators are modeled based on the following assumptions. First, an arbitrary reference frame is assigned to the observation environment. In addition, the probability of visiting each point in the environment is assumed to be equal,

P(p_{m}) = P(p_{n}),  ∀ p_{m}, p_{n} ∈ A,     (1)

where P(p_{m}) denotes the probability of visiting point p_{m} and A stands for the observation environment.
Modeling Positional Indicators Regarding G
Any displacement vector δ⃗ can be decomposed into two components, δ_{x} and δ_{y}, where δ = √(δ_{x}² + δ_{y}²). Namely, δ_{x} = δ cos(α) and δ_{y} = δ sin(α), where α stands for the argument of δ⃗ with respect to the chosen reference frame (see Figure 5). Since group members prefer to keep a comfortable distance of ν between each other, δ_{x} and δ_{y} are modeled as statistically independent normally distributed random variables,
δ_{x} ∼ N(ν cos(α), σ²),  δ_{y} ∼ N(ν sin(α), σ²).     (2)

Equation (2) implies that δ follows a Rice distribution,

p(δ ∣ ν, σ) = (δ/σ²) exp(−(δ² + ν²)/(2σ²)) I₀(δν/σ²),     (3)

where I₀ stands for the modified Bessel function of the first kind with order 0 [49].
This distribution is independent of the choice of reference frame. Of course, in the presence of a strong pedestrian flow along a certain direction, the distributions of δ_{x} and δ_{y} take different forms for different choices of reference frame, since α is then determined by the major flow direction and is distributed in a nonuniform manner. However, the distribution of δ given by Equation (3) is invariant to the orientation α, so the distribution of δ is still given by Equation (3). This result obviously holds in the absence of any prominent direction as well, where α is a uniformly distributed circular random variable.
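The claim that δ follows Equation (3) can be checked numerically; the sketch below draws components per Equation (2) and compares the resulting magnitudes against SciPy's Rice distribution, whose shape parameter is b = ν/σ. The parameter values are arbitrary illustrations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
nu, sigma, alpha = 1.0, 0.3, np.pi / 5   # assumed distance, spread, heading

# Draw displacement components as in Equation (2); alpha drops out of |delta|.
dx = rng.normal(nu * np.cos(alpha), sigma, 100_000)
dy = rng.normal(nu * np.sin(alpha), sigma, 100_000)
delta = np.hypot(dx, dy)

# SciPy parameterizes Rice as rice(b, scale) with b = nu / sigma.
ks = stats.kstest(delta, stats.rice(b=nu / sigma, scale=sigma).cdf)
print(ks.statistic)   # a small statistic indicates a match to the Rice model
```

Repeating the experiment with a different α leaves the statistic unchanged up to sampling noise, which mirrors the reference-frame invariance argued above.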
The unimodal formulation defined by Equation (3) provides a reasonable interpretation for the distance among the members of a dichotomous group. However, multi-partner groups, which are composed of three or more pedestrians, present more complex proxemics requiring a multimodal approach.
In order to have a better insight into the structure of multi-partner groups, we define the degree of neighborhood based on the configuration of the group members. Namely, the group structure is expressed in terms of a minimum spanning tree (MST). The degree of neighborhood concerning any two pedestrians is defined by the number of edges along the shortest path of the MST connecting them. According to this definition, {p_{i},p_{j}} of Figure 3 has a degree of neighborhood that equals 1. In other words, they are first neighbors, whereas {p_{i},p_{k}} of Figure 3 are second neighbors.
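The degree of neighborhood can be computed by building the MST over interpersonal distances and counting hops along it; `neighborhood_degree` is a hypothetical sketch of this definition, not the paper's implementation.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from scipy.spatial.distance import pdist, squareform

def neighborhood_degree(positions):
    """Degree of neighborhood (edge count along the MST path) for all pairs.

    `positions` is an (N, 2) array of group-member locations; a hypothetical
    helper illustrating the definition, not the paper's implementation.
    """
    dist = squareform(pdist(positions))
    mst = minimum_spanning_tree(dist)
    # Count hops (unweighted path lengths) on the undirected MST.
    hops = shortest_path(mst, directed=False, unweighted=True)
    return hops.astype(int)

# Three abreast walkers: the outer pair are second neighbors.
deg = neighborhood_degree(np.array([[0.0, 0.0], [0.8, 0.0], [1.6, 0.0]]))
```

For the three collinear walkers above, deg[0, 1] is 1 (first neighbors) while deg[0, 2] is 2 (second neighbors).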
In this framework, within multi-partner groups, the distance between first neighbors is modeled using the unimodal formulation of Equation (3). Assuming that the relative positions of all first neighbors within the group follow the same distribution, the distance between n^{th}-order neighbors, n > 1, is modeled by the n-fold convolution of the unimodal model. A multimodal framework, which is the linear combination of these N models, is suggested to embrace the relation among the members of a multi-partner group composed of N + 1 people. Namely,
Δ_{G}(δ ∣ ν, σ) = Σ_{n=1}^{N} K_{n} Δ_{G_{n}}(δ ∣ ν, σ),     (4)

where K_{n} is the observation frequency of the n^{th} neighborhood. The function Δ_{G_{n}} denotes the distribution of the distance between n^{th} neighbors and is equivalent to the n-fold convolution of Equation (3). It is suggested to restrict N ∈ {1, 2, 3}, because large groups (of 5 or more people) tend to be arranged in complex configurations instead of abreast formation [42]. This limits the degree of neighborhood and eliminates the need to extend N over 3.
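A discrete sketch of Equation (4): the first-neighbor Rice model of Equation (3) is evaluated on a grid, the n^{th}-neighbor components are produced by repeated numerical convolution, and the weighted sum forms the multimodal model. The parameter values and grid are illustrative assumptions.

```python
import numpy as np
from scipy.stats import rice

def delta_G(grid, nu, sigma, K):
    """Multimodal positional model of Equation (4) on a uniform grid.

    K holds the neighborhood observation frequencies K_1..K_N (N <= 3).
    """
    d = grid[1] - grid[0]
    base = rice.pdf(grid, b=nu / sigma, scale=sigma)   # first neighbors, Eq. (3)
    model, comp = np.zeros_like(grid), base.copy()
    for K_n in K:
        model += K_n * comp
        # One more convolution with the base model gives the next neighbor order.
        comp = np.convolve(comp, base)[: grid.size] * d
    return model

grid = np.linspace(0.0, 6.0, 1200)
pdf = delta_G(grid, nu=1.0, sigma=0.25, K=[0.6, 0.3, 0.1])
```

The components peak near ν, 2ν and 3ν, and the mixture integrates to roughly 1 when the K_n sum to 1.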
Modeling Positional Indicators Regarding Ḡ
If any two simultaneously observed pedestrians are not engaged in a group relation, their relative locations at a particular instant are independent. This assumption, together with Equation (1), makes the problem equivalent to randomly selecting two points from a uniform distribution in the observation environment and measuring the distance between them. Suppose that the dimensions of the observation environment along the x− and y−axes are both D. Then,
p(δ_{x}) = (2/D)(1 − δ_{x}/D),     (5)

while the pdf concerning δ_{y} is computed in the same manner. Assuming that δ_{x} and δ_{y} are independent, the relating joint pdf is resolved [50] as,
p(δ) = (2δ/D²) ((δ/D)² − 4(δ/D) + π),  if 0 ≤ δ ≤ D,
p(δ) = (2δ/D²) [4√((δ/D)² − 1) − ((δ/D)² + 2 − π) − 4 tan⁻¹(√((δ/D)² − 1))],  if D < δ ≤ √2 D.     (6)

This distribution describes δ regarding Ḡ in a large environment, D ≫ c, where c ≈ 400 mm stands for the width of the human body. However, it does not account for the constraint imposed by the physical dimensions of the pedestrians, which represents a minimum distance (cutoff) below which δ cannot assume values. To account for this cutoff, δ is substituted with δ′ = δ − c and p(δ) is renormalized by replacing D with D′ = D − c/2. Note that this distribution does not need to be calibrated since it only depends on the geometry of the observation area.
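Equation (6) is the classical density of the distance between two uniform random points in a D × D square, so it can be verified by simulation; the sketch below compares the closed form against a Monte Carlo histogram.

```python
import numpy as np

def p_delta_Gbar(delta, D):
    """Distance pdf for two independent uniform points in a D x D square, Eq. (6)."""
    t = delta / D
    near = 2 * delta / D**2 * (t**2 - 4 * t + np.pi)
    s = np.sqrt(np.clip(t**2 - 1, 0, None))
    far = 2 * delta / D**2 * (4 * s - (t**2 + 2 - np.pi) - 4 * np.arctan(s))
    return np.where(t <= 1, near, np.where(t <= np.sqrt(2), far, 0.0))

rng = np.random.default_rng(2)
D = 10.0
pts = rng.uniform(0, D, (2, 200_000, 2))
dists = np.linalg.norm(pts[0] - pts[1], axis=1)
hist, edges = np.histogram(dists, bins=60, range=(0, D * np.sqrt(2)), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
err = np.max(np.abs(hist - p_delta_Gbar(centers, D)))
print(err)   # maximum deviation between Equation (6) and the simulated histogram
```

The deviation shrinks as the sample count grows, which confirms that no calibration is needed beyond the environment dimension D.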
Modeling Directional Indicators
The directional indicator of group motion regarding any two pedestrians p_{i} and p_{j} is derived from their velocities. The scalar product of the velocity vectors υ⃗_{i} and υ⃗_{j} yields,

(υ⃗_{i} · υ⃗_{j}) / (|υ⃗_{i}| |υ⃗_{j}|) = cos(θ_{ij}),     (7)

where θ_{ij} denotes the angle between these vectors (see Figure 3). The directional indicators of group motion are represented in terms of this angle θ.
The pairs in G, excluding those exhibiting behaviors like meeting, splitting or standing, are expected to have the direction of the velocity vectors aligned to a considerable degree, whereas the pairs in Ḡ do not present any correlation of direction. This suggests that the expected value of θ is 0 for both G and Ḡ. If θ were a linear random variable over (−∞, ∞), such a behavior could be approximated with a normal distribution of mean 0 and standard deviation σ_{θ}. However, θ is a circular random variable defined over [−π, π] and, thus, it cannot be modeled in terms of a standard normal distribution.
Hence, the principles of directional statistics are invoked and the behavior of θ is modeled as a von Mises distribution [51], which is the circular analogue of the Gaussian distribution. The explicit form of the von Mises distribution is,

p(θ ∣ μ, κ) = exp(κ cos(θ − μ)) / (2π I₀(κ)),     (8)

where μ denotes the mean value and κ is analogous to 1/σ² of the normal distribution.
Note that the θ distributions relating G and Ḡ are described using the same function given by Equation (8), where the parameter κ enables the modeling of different behaviors. In other words, for the pedestrian pairs in G, the distribution of θ is very localized around μ = 0 and κ ≫ 1. On the other hand, for the pedestrian pairs in Ḡ, the distribution is uniform if there is no prominent flow, and κ → 0. Furthermore, in the presence of a major flow, the distribution of θ has two peaks, one for pedestrians moving in the same direction and another for pedestrians moving in opposite directions. In that case, the distribution of θ regarding Ḡ is modeled as a linear combination of two von Mises distributions, one with μ = 0 and the other with μ = π. Even in this case, the spread around each peak is expected to be larger than that of the pairs in G.
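The two regimes can be sketched with SciPy's von Mises density: a concentrated model for G and, under a major flow, a two-component mixture for Ḡ with peaks at 0 and π. The κ values and mixture weight here are illustrative, not calibrated.

```python
import numpy as np
from scipy.stats import vonmises

theta = np.linspace(-np.pi, np.pi, 721)

# Pairs in G: tightly aligned headings (kappa >> 1), Equation (8).
p_G = vonmises.pdf(theta, kappa=20.0, loc=0.0)

# Pairs in Gbar under a bidirectional flow: peaks at 0 and pi.
w = 0.5   # assumed share of same-direction strangers
p_Gbar = (w * vonmises.pdf(theta, kappa=2.0, loc=0.0)
          + (1 - w) * vonmises.pdf(theta, kappa=2.0, loc=np.pi))
```

Around θ = 0 the G model dominates, while the Ḡ mixture keeps substantial mass near ±π; in a uniform environment one would instead let κ → 0, recovering a flat density.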
Hypothesis Testing
The decision whether a pair belongs to G or Ḡ is carried out using a compound hypothesis testing scheme, as shown in Algorithm 1. Since G and Ḡ are mutually exclusive and complementary events, a decision can confidently be made as long as the individual indicators point to the same sort of group relation. In case of conflicts, a measure of uncertainty needs to be defined to resolve the final decision. In what follows, we describe how the individual decisions are carried out and define the uncertainty measures for resolving the final decision in case of contradictions.
Algorithm 1: Compound hypothesis testing.
Input: Trajectory of pedestrian p_{i} and those of the simultaneously observed pedestrians {p_{j}}, 1 ≤ j ≤ J.
Output: The nature of the group relation of p_{i} with each p_{j}.
for j ← 1 to J do
    Δ ← {δ_{ij}};
    Θ ← {∠(υ⃗_{i}, υ⃗_{j})};
    compute L^{δ} and L^{θ};   /* Equation (9) */
    if (L^{δ} > 0) ∧ (L^{θ} > 0) then   /* Equation (10) */
        {p_{i},p_{j}} ∈ G;
    else if (L^{δ} < 0) ∧ (L^{θ} < 0) then   /* Equation (10) */
        {p_{i},p_{j}} ∈ Ḡ;
    else
        compute ρ^{δ} and ρ^{θ};   /* Equation (13) */
        if [(Δ ∼ Δ_{G}) ∧ (Θ ∼ Θ_{Ḡ}) ∧ (ρ^{δ} < 1/ρ^{θ})] ∨ [(Δ ∼ Δ_{Ḡ}) ∧ (Θ ∼ Θ_{G}) ∧ (ρ^{θ} < 1/ρ^{δ})] then
            {p_{i},p_{j}} ∈ G;
        else
            {p_{i},p_{j}} ∈ Ḡ;
In binary decisions, a likelihood ratio test is one way of determining the underlying model. Concerning Δ, the log-likelihood ratio of being in a group relation over not being in a group relation, L^{δ}, is defined as,

L^{δ} = log( ∏_{δ∈Δ} Δ_{G}(δ ∣ ν, σ) / ∏_{δ∈Δ} Δ_{Ḡ}(δ) ).     (9)

The decision based on δ is rendered as,

Δ ∼ Δ_{G} if L^{δ} > 0,  Δ ∼ Δ_{Ḡ} if L^{δ} < 0.     (10)

The decision based on θ is carried out in a similar manner through the log-likelihood ratio concerning Θ, L^{θ}, computed in a way analogous to Equation (9).
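A minimal sketch of the decision rule in Equations (9) and (10), taking the calibrated densities as callables and working in the log domain for numerical stability; the Gaussian stand-ins in the toy check are assumptions, not the paper's models.

```python
import numpy as np
from scipy.stats import norm

def log_likelihood_ratio(samples, pdf_G, pdf_Gbar, eps=1e-12):
    """Log-likelihood ratio of Equation (9); a positive value favors G.

    `pdf_G` / `pdf_Gbar` stand for the calibrated densities (e.g. Equations
    (4) and (6)); the product over observations is taken as a sum of logs.
    """
    s = np.asarray(samples)
    return np.sum(np.log(pdf_G(s) + eps) - np.log(pdf_Gbar(s) + eps))

# Toy check with two Gaussians standing in for the calibrated models.
L = log_likelihood_ratio(norm(1.0, 0.3).rvs(50, random_state=3),
                         norm(1.0, 0.3).pdf, norm(5.0, 3.0).pdf)
```

Since the toy samples are drawn from the stand-in group model, L comes out positive and Equation (10) declares Δ ∼ Δ_G.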
As long as L^{δ} and L^{θ} have the same sign, a confident decision is made regarding the group relation (the first two branches of Algorithm 1). However, contradictions might arise. For example, when pedestrians cross next to each other, move along a flow, or pass through passages, their relative positions might become close or their velocity vectors might be aligned, independent of their social relation. One may argue that an intuitive way of resolving such cases is to pick the decision implying the larger absolute log-likelihood ratio. However, we demonstrate in Section 6 that this straightforward approach is not capable of compensating for the effect of these misleading cues. Therefore, we devise an uncertainty measure.
Inspired by the KullbackLeibler divergence, a reliability estimate is employed to quantify the uncertainty of individual decisions rendered through Equation (10) [52]. The KullbackLeibler divergence of two distributions such as P and Q is defined as,
D_{KL}(Q ‖ P) = ∑_{i} q(i) log( q(i) / p(i) )    (11)

Note that this measure is not symmetric, i.e., D_{KL}(Q‖P) ≠ D_{KL}(P‖Q). Mathematically speaking, it is therefore not a distance metric, but it does quantify the difference between two probability distributions. To have a common reference point, the divergence terms are computed with respect to the observed distributions. Hence, the divergences relating δ with respect to G and Ḡ are defined as D^{δ}_{G} = D_{KL}(Δ ‖ Δ_{G}) and D^{δ}_{Ḡ} = D_{KL}(Δ ‖ Δ_{Ḡ}). Since these terms embrace all {δ} through the summation in Equation (11), we call them global indicators of group motion.
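Over discretized histograms, the global indicators amount to a summed divergence between the observed distribution and each model. The following is a toy sketch with hypothetical bin values (not data from the paper); the small eps guards against empty bins.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(observed || model), summed over matched histogram bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Observed distance histogram vs. a group model and a non-group model (toy bins).
observed = [0.10, 0.60, 0.25, 0.05]
model_G  = [0.10, 0.55, 0.30, 0.05]   # close to the observation
model_nG = [0.25, 0.25, 0.25, 0.25]   # flat, far from the observation

d_G  = kl_divergence(observed, model_G)
d_nG = kl_divergence(observed, model_nG)
```

Here d_G < d_nG, i.e., the observation diverges less from the group model, which would push the positional decision toward G.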
However, θ relating G does not exhibit as regular a behavior as δ of G. Thus, we propose focusing on its local characteristics so as to avoid misleading temporal imperfections that might lead to a false similarity to Ḡ. Namely, the divergence term relating θ with respect to G is defined as,
D^{θ}_{G}(Θ ‖ Θ_{G}) = max_{θ} { Θ(θ) log( Θ(θ) / Θ_{G}(θ ∣ κ) ) }

where the divergence of θ with respect to Ḡ is computed in a similar manner. This equation implies that only the divergence term indicating the maximum dissimilarity is accounted for. Thereby, it defines a local indicator of group motion.
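The local indicator replaces the sum with a maximum over bins: one badly matched bin dominates the score even if the rest of the distribution fits well. A minimal sketch with hypothetical bin values:

```python
import math

def local_divergence(p, q, eps=1e-12):
    """Maximum single-bin divergence term (local indicator), rather than the sum."""
    return max(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# The third bin is poorly explained by the model, so it alone drives the indicator.
observed = [0.05, 0.80, 0.10, 0.05]
model    = [0.05, 0.80, 0.01, 0.14]
local = local_divergence(observed, model)
```

Contrast this with the global indicator, where a single outlier bin would be averaged away by well-fitting bins.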
A direct comparison of the divergence terms defined above is not possible, since they are not expressed in comparable measures. To enable a comparison, two uncertainty measures are defined, one per individual decision, as the ratios of the corresponding divergence values,
ρ^{δ} = D^{δ}_{G} / D^{δ}_{Ḡ},  ρ^{θ} = D^{θ}_{G} / D^{θ}_{Ḡ}

The final resolution is determined by picking the decision with the lower uncertainty (Algorithm 1).
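Putting the pieces together, a conflict between the positional and directional decisions is settled by the uncertainty ratios. This sketch simplifies the compound rule of Equation (13) to a direct comparison of ρ^{δ} against ρ^{θ} (the paper's rule compares ρ^{δ} against 1/ρ^{θ}); it is an illustration of the idea, not the authors' exact algorithm listing.

```python
def resolve(L_delta, L_theta, rho_delta, rho_theta):
    """Compound decision: agreeing signs are accepted directly; on conflict,
    the indicator with the lower uncertainty ratio rho wins (simplified)."""
    if (L_delta > 0) == (L_theta > 0):
        return L_delta > 0              # confident joint decision
    if rho_delta < rho_theta:
        return L_delta > 0              # positional cue is more reliable
    return L_theta > 0                  # directional cue is more reliable

both_agree   = resolve(2.0,  1.0, 0.5, 0.5)   # both say "group"
delta_wins   = resolve(2.0, -1.0, 0.2, 0.9)   # conflict, delta more reliable
theta_wins   = resolve(2.0, -1.0, 0.9, 0.2)   # conflict, theta more reliable
```

A low ρ means the observation diverges far less from the group model than from the non-group model (or vice versa), so that indicator's verdict is trusted.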
Experimental Results
This section evaluates the proposed approach in several respects: a qualitative comparison of the estimated distributions, the stability of the model parameters under varying training sets, the group identification performance and its sensitivity, generalization, and the improvement introduced by compound hypothesis testing over the individual models, the method of [31], and the maximum absolute log-likelihood ratio method.
Model Calibration
The models defined in Section 4 bear a number of parameters, which need to be tuned for different environments and group behaviors. For instance, the positional relation model regarding G, Δ_{G}, given in Equation (4) requires the determination of ν and σ. Similarly, the directional relation models, Θ_{G} and Θ_{Ḡ}, given in Equation (8) require calibration of κ.
To estimate these model parameters, we shuffle the dataset and randomly select 10% of the pairs in G and 10% of the pairs in Ḡ. The squared error between the proposed models and the distributions of the positional and directional indicators over these randomly selected sets is minimized using a golden section search. Subsequently, the remaining 90% of the data is employed to evaluate the proposed models. Section 6.2 presents the performance of this estimation scheme.
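The calibration step above can be sketched as a one-dimensional golden section search over a model parameter. The example below is a toy reconstruction under stated assumptions: it recovers the σ of a unimodal Gaussian distance model from a synthetic "observed" histogram, whereas the paper fits ν, σ, and κ against real indicator distributions.

```python
import math

def golden_section_min(f, a, b, tol=1e-6):
    """Minimize a unimodal function f on [a, b] by golden section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d
            d = a + inv_phi * (b - a)
    return (a + b) / 2

def gauss_pdf(x, nu, sigma):
    return math.exp(-0.5 * ((x - nu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Toy calibration: synthetic observations generated with sigma = 0.2,
# recovered by minimizing the squared error between model and observation.
bins = [0.1 * i for i in range(1, 16)]
observed = [gauss_pdf(x, 0.75, 0.2) for x in bins]
sq_err = lambda s: sum((gauss_pdf(x, 0.75, s) - o) ** 2 for x, o in zip(bins, observed))
sigma_hat = golden_section_min(sq_err, 0.05, 1.0)
```

Golden section search needs only function evaluations (no derivatives) and a unimodal objective, which fits the single-peaked squared-error profiles of these low-dimensional fits.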
In our investigation of the stability of the model parameters, and the sensitivity of the model against varying training sets, this procedure is repeated by shuffling the dataset 50 times. Sections 6.3 and 6.4 report the performance metrics following such a validation scheme.
Estimated Distributions
Figure 6 shows the modeled and observed distributions of the positional indicators for a particular run of the calibration scheme described in Section 6.1. The observed distribution is expressed in terms of histograms over the samples constituting 90% of all observations. Δ_{G} of BIWIETH is modeled with both unimodal and multimodal approaches; for this case, the multimodal approach in Equation (4) considers N to be 3. Since BIWIETH contains various multi-partner groups (see Table 1), the improvement of the multimodal approach over the unimodal approach is clearly visible in Figure 6(a). On the other hand, due to the dominance of dichotomous groups in APT, the unimodal scheme provides satisfactory performance in modeling Δ_{G} for APT. For Δ_{Ḡ}, fairly good fits are obtained for both sets. The smoother shape of the observed distribution of APT is due to its larger number of observations compared with BIWIETH.
Figure 7(a,b) illustrates the modeled and observed distributions of the directional indicators relating G. As expected, both models peak around 0, with the spread concerning APT slightly larger than that of BIWIETH. This difference reflects the more regular motion pattern of the pedestrians in BIWIETH, which offers fewer distractions than APT's shopping center environment. On the other hand, the models concerning Ḡ present a clear distinction arising from the different flow characteristics: due to the lack of a prominent flow direction, θ is distributed fairly evenly for APT, whereas it is concentrated around 0 and π for BIWIETH.
Stability of Parameters
Repeating the calibration method described in Section 6.1 50 times using a set of randomly selected samples that constitutes 10% of all the data, we obtain the statistics shown in Table 2.
The Δ_{G} models relating the different datasets lead to similar values for ν, ranging between 0.67 m and 0.81 m with fairly small deviations within 0.06 m. Hall defines the close phase of personal distance to be between 46 cm and 75 cm and the far phase to be between 76 cm and 120 cm [53]. Our findings are consistent with these values.
Regarding the θ models, the κ values relating G are always larger than those of Ḡ. As explained in Section 4.2, this indicates that the θ pattern concerning G is more structured than that of Ḡ. The distinction is most pronounced in APT due to the lack of a prominent flow direction. Moreover, the deviations of κ are quite insignificant provided that the sample set is large, as in BIWIETH and APT, whereas in BIWIHotel the deviation of κ regarding both G and Ḡ is higher relative to κ due to the reduced number of samples.
Performance and Sensitivity
Table 3 reports the group identification performance over 50 runs of the proposed method, together with the sensitivity of the identification rates. The overall success rates are all above roughly 85%, and the rates for G and Ḡ do not vary significantly between different runs of the proposed method or across datasets.
Since the structure of multi-partner groups becomes more complex, particularly at high pedestrian densities, it is not possible to provide stable statistics for the performance rates with respect to the degree of neighborhood [42]. This fact supports our unifying approach in modeling δ of G, where the different degrees of neighborhood are blended in Equation (4).
Moreover, in multi-partner groups, a cross-check can link pedestrians who are found to be in a group relation with the same pedestrians, independent of their degree of neighborhood. The detection rates regarding G increase to 100% by applying this cross-check. Figure 8 illustrates several challenging cases from the BIWIETH and BIWIHotel sets.
Comparison and Generalization
This section presents the performance rates based on the decisions of each individual indicator and ascertains that compound hypothesis testing improves the identification of group relations. Moreover, the alternative to hypothesis testing described in Section 5, where a decision is made in favor of the maximum absolute log-likelihood ratio, is applied, and the superiority of our proposed method is verified. In addition, the detection performance of the method of [31] is reported, and it is ascertained that our proposed method outperforms it.
The improvement introduced by integrating the two observations through compound hypothesis testing, as described in Section 5, is presented in Table 4. The improvement achieved by using both indicators (Δ + Θ), in comparison with using a single indicator (Δ or Θ), is expressed as the difference between the performance rates of the individual decisions and the performance rates after integration. The differences are almost always positive, which indicates that compound hypothesis testing improves over the individual models in nearly every case.
The detection of G in Caviar is the only exception. Using the positional indicator Δ, a detection rate of 93.18% is achieved; integrating positional and directional indicators, the detection rate decreases to 86.68%. This is because the pedestrians in Caviar follow scenarios such as meeting and splitting, which cannot be determined using the directional indicators, as explained in Section 4.2. The ground truth is annotated from the video sequence, where visual cues are available, whereas the group relation is resolved using indicators derived only from trajectory data. This implies that certain cues, such as gaze direction or body posture, are not reflected, whereas cues like position are still present. Therefore, it is not surprising that, for behaviors like meeting and splitting, the positional indicator Δ alone performs better than Δ + Θ.
Table 5 reports the performance rates of the model of [31] for pairs in G, pairs in Ḡ, and all pairs. In BIWIETH, which involves a nonuniform environment with high pedestrian density, the method of [31] has a positive bias toward Ḡ, which misleadingly increases the overall detection rate to 95.05%; however, G, which is observed less often than Ḡ, is detected with a success rate of only 65.52%. In BIWIHotel, which involves a dominant flow direction and low pedestrian density, the identification rates of [31] and the proposed method are comparable. In APT, the method of [31] detects both G and Ḡ with roughly 9% lower rates than our method.
In Section 5, it is mentioned that a straightforward way of dealing with conflicting decisions is to pick the decision with the larger absolute value. The identification rates achieved by selecting the decision with the higher absolute log-likelihood ratio, instead of applying compound hypothesis testing, are also presented in Table 5. Although the overall performance rates seem close to those of the proposed method, the detection rates of G are considerably lower than those of Ḡ; in other words, this approach has a positive bias toward Ḡ. Our proposed method, in contrast, shows no bias in favor of a particular class, which implies a fair distinction of group relation.
Conclusions
Positional and directional models are proposed for the identification of pedestrian groups in crowded environments, together with a compound evaluation scheme. Different environmental characteristics are accounted for, in addition to varying group structures. Our results indicate that the proposed models capture the characterizing features of different environmental settings and varying patterns of group relations. Moreover, the model parameters are shown to be derived stably from a small set of data, and the group relations are identified at satisfactorily high rates. The efficacy of compound evaluations is verified by comparison with the individual decisions as well as with another method in the literature. Our contributions are the positional and directional models that adjust to different environments and group structures, the description and comparison of compound evaluations, and the resolution of ambiguities with our proposed uncertainty measure based on the local and global indicators of group relations.
This research is supported by the Ministry of Internal Affairs and Communications of Japan through the project of Ubiquitous Network Robots for Elderly and Challenged.
References
1. Haritaoglu, I.; Flickner, M. Detection and Tracking of Shopping Groups in Stores. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, 8–14 December 2001; pp. I-431–I-438.
2. Chang, M.C.; Krahnstoever, N.; Ge, W. Probabilistic Group-Level Motion Analysis and Scenario Recognition. In Proceedings of the 2011 IEEE International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 747–754.
3. Yu, T.; Lim, S.N.; Patwardhan, K.A.; Krahnstoever, N. Monitoring, Recognizing and Discovering Social Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1462–1469.
4. McPhail, C.; Wohlstein, R. Using film to analyze pedestrian behavior.
5. Zhan, B.; Monekosso, D.N.; Remagnino, P.; Velastin, S.A.; Xu, L.Q. Crowd analysis: A survey.
6. Aggarwal, J.K.; Ryoo, M.S. Human activity analysis: A review.
7. Cristani, M.; Raghavendra, R.; Del Bue, A.; Murino, V. Human behavior analysis in video surveillance: A social signal processing perspective.
8. Gatica-Perez, D. Automatic nonverbal analysis of social interaction in small groups: A review.
9. Costa, M. Interpersonal distances in group walking.
10. Chen, Y.Y.; Hsu, W.H.; Liao, H.Y.M. Discovering Informative Social Subgraphs and Predicting Pairwise Relationships from Group Photos. In Proceedings of the 20th ACM International Conference on Multimedia, Nara, Japan, 29 October–2 November 2012.
11. Singla, P.; Kautz, H.; Luo, J.; Gallagher, A. Discovery of Social Relationships in Consumer Photo Collections Using Markov Logic. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Anchorage, AK, USA, 23–28 June 2008; pp. 1–7.
12. Xia, S.; Shao, M.; Luo, J.; Fu, Y. Understanding kin relationships in a photo.
13. Wang, G.; Gallagher, A.C.; Luo, J.; Forsyth, D.A. Seeing people in social context: Recognizing people and social relationships.
14. Gallagher, A.C.; Chen, T. Understanding Images of Groups of People. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 256–263.
15. Murillo, A.; Kwak, I.; Bourdev, L.; Kriegman, D.; Belongie, S. Urban Tribes: Analyzing Group Photos from a Social Perspective. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 28–35.
16. Ding, L.; Yilmaz, A. Learning relations among movie characters: A social network perspective.
17. Wang, X.; Tieu, K.; Grimson, E. Correspondence-free activity analysis and scene modeling in multiple camera views.
18. Wu, B.; Nevatia, R. Detection and segmentation of multiple, partially occluded objects by grouping, merging, assigning part detection responses.
19. Jorge, P.; Abrantes, A.; Marques, J. On-Line Tracking Groups of Pedestrians with Bayesian Networks. In Proceedings of the 6th International Workshop on Performance Evaluation for Tracking and Surveillance, Prague, Czech Republic, 10 May 2004.
20. Pellegrini, S.; Ess, A.; Schindler, K.; van Gool, L.J. You'll Never Walk Alone: Modeling Social Behavior for Multi-Target Tracking. In Proceedings of the IEEE 12th International Conference on Computer Vision, 29 September–2 October 2009; pp. 261–268.
21. Pellegrini, S.; Ess, A.; van Gool, L.J. Improving data association by joint modeling of pedestrian trajectories and groupings.
22. Bose, B.; Wang, X.; Grimson, E. Multi-Class Object Tracking Algorithm that Handles Fragmentation and Grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
23. Wertheimer, M. Laws of Organization in Perceptual Forms.
24. Wang, X.; Ma, K.T.; Ng, G.W.; Grimson, W.E.L. Trajectory analysis and semantic region modeling using nonparametric hierarchical Bayesian models.
25. Yan, W.; Forsyth, D.A. Learning the Behavior of Users in a Public Space through Video Tracking. In Proceedings of the 7th IEEE Workshops on Application of Computer Vision, Breckenridge, CO, USA, 5–7 January 2005; pp. 370–377.
26. Habe, H.; Honda, K.; Kidode, M. Human Interaction Analysis Based on Walking Pattern Transitions. In Proceedings of the 3rd ACM/IEEE International Conference on Distributed Smart Cameras, Como, Italy, 30 August–2 September 2009; pp. 1–8.
27. Zen, G.; Lepri, B.; Ricci, E.; Lanz, O. Space Speaks: Towards Socially and Personality Aware Visual Surveillance. In Proceedings of the 1st ACM International Workshop on Multimodal Pervasive Video Analysis, Florence, Italy, 25–29 October 2010; pp. 37–42.
28. Choi, W.; Shahid, K.; Savarese, S. Learning Context for Collective Activity Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 20–25 June 2011; pp. 3273–3280.
29. French, A.P.; Naeem, A.; Dryden, I.L.; Pridmore, T.P. Using Social Effects to Guide Tracking in Complex Scenes. In Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance, London, UK, 5–7 September 2007; pp. 212–217.
30. Calderara, S.; Prati, A.; Cucchiara, R. Mixtures of von Mises distributions for people trajectory shape analysis.
31. Yücel, Z.; Ikeda, T.; Miyashita, T.; Hagita, N. Identification of Mobile Entities Based on Trajectory and Shape Information. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, San Francisco, CA, USA, 25–30 September 2011; pp. 3589–3594.
32. Yücel, Z.; Zanlungo, F.; Ikeda, T.; Miyashita, T.; Hagita, N. Modeling Indicators of Coherent Motion. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 1–8.
33. Yücel, Z.; Miyashita, T.; Hagita, N. Modeling and Identification of Group Motion via Compound Evaluation of Positional and Directional Cues. In Proceedings of the 21st International Conference on Pattern Recognition, Tsukuba, Japan, 11–15 November 2012; pp. 1–8.
34. Ge, W.; Collins, R.T.; Ruback, B. Vision-based analysis of small groups in pedestrian crowds.
35. Sandikci, S.; Zinger, S.; de With, P.H.N. Detection of human groups in videos.
36. Bahlmann, C. Directional features in online handwriting recognition.
37. Helbing, D.; Molnar, P. Social force model for pedestrian dynamics.
38. Mehran, R.; Oyama, A.; Shah, M. Abnormal Crowd Behavior Detection Using Social Force Model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 935–942.
39. Lerner, A.; Chrysanthou, Y.; Lischinski, D. Crowds by example.
40. Fisher, R. The PETS04 Surveillance Ground-Truth Data Sets. In Proceedings of the 6th IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, Prague, Czech Republic, 10 May 2004.
41. Zeynep Yücel. Available online: http://www.irc.atr.jp/zeynep/research (accessed on 9 January 2013).
42. Moussaïd, M.; Perozo, N.; Garnier, S.; Helbing, D.; Theraulaz, G. The walking behaviour of pedestrian social groups and its impact on crowd dynamics.
43. Glas, D.; Miyashita, T.; Ishiguro, H.; Hagita, N. Laser Tracking of Human Body Motion Using Adaptive Shape Modeling. In Proceedings of the International Conference on Intelligent Robots and Systems, San Diego, CA, USA, 29 October–2 November 2007; pp. 602–608.
44. ETH Zurich, Department of Information Technology and Electrical Engineering, Computer Vision Laboratory. Available online: http://www.vision.ee.ethz.ch/datasets/index.en.html (accessed on 9 January 2013).
45. Benchmark Data for PETS-ECCV 2004. Available online: http://wwwprima.inrialpes.fr/PETS04/caviar_data.html (accessed on 9 January 2013).
46. Moussaïd, M.; Helbing, D.; Garnier, S.; Johansson, A.; Combe, M.; Theraulaz, G. Experimental study of the behavioural mechanisms underlying self-organization in human crowds.
47. Helbing, D.; Farkas, I.; Molnar, P.; Vicsek, T. Simulation of Pedestrian Crowds in Normal and Evacuation Situations.
48. Daamen, W.; Hoogendoorn, S. Free Speed Distributions for Pedestrian Traffic. In Proceedings of the 85th Annual Meeting of the Transportation Research Board, Washington, DC, USA, 22–26 January 2006.
49. Abramowitz, M.; Stegun, I.
50. Francesco Zanlungo. Available online: https://sites.google.com/site/francescozanlungo/squarelinepicking (accessed on 9 January 2013).
51. Mardia, K.V.; Jupp, P.E.
52. Kose, C.; Wesel, R. Robustness of Likelihood Ratio Tests: Hypothesis Testing under Incorrect Models. In Proceedings of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 4–7 November 2001; pp. 1738–1742.
53. Hall, E.T.
Figures and Tables
Which pedestrians are in a group? It is hard to tell from snapshots, since traditional image-based methods do not apply to surveillance footage. Trajectories are an important clue to group relation.
Experiment scenes from datasets (a) Caviar; (b) BIWIETH; (c) BIWIHotel and (d) APT. Pedestrians moving as a group are denoted with bounding boxes of the same color.
Pedestrians of the same group are denoted with the same color. Some positional and directional measures employed in the identification of groups are illustrated in reference to p_{i}.
Velocity distribution concerning all pedestrians in (a) BIWIETH and (b) APT datasets.
Distribution of δ_{x}, δ_{y} and δ regarding G.
Observed and modeled distributions of δ regarding G for (a) BIWIETH and (b) APT. Figures (c) and (d) are organized similarly for Ḡ.
Observed and modeled distributions of θ regarding G for (a) BIWIETH and (b) APT. Figures (c) and (d) are organized similarly for Ḡ.
Pedestrians of the same group are denoted with the same marker and color, whereas pedestrians who do not belong to a group are denoted with gray circles. (a–c) Two pedestrians present meeting and splitting behavior; (d) groups behave in a non-coherent manner; (e) considerable occlusion; (f,g) two groups move along the same flow; groups pass through each other moving (h,i) in opposite directions and (j) in the same direction; (k) unrelated pedestrians present group-like behavior; (l,m) unrelated pedestrians follow trajectories and velocities similar to those of groups; (n,o) waiting people introduce uncertainty.
Specifications of datasets.

Dataset      Duration   Groups of size 2   3    4   5   6   Total # of pedestrians
Caviar       1′11″      5                  –    1   –   –   17
BIWIETH      8′38″      38                 10   6   1   4   360
BIWIHotel    12′54″     38                 3    –   –   –   223
APT          30′00″     128                8    –   –   –   531
The mean values and standard deviations of ν, σ and κ over the 50 runs.

Model          Parameter   Caviar        BIWIETH        BIWIHotel      APT
Δ_{G}(δ∣ν,σ)   ν (m)       0.81 ± 0.04   0.76 ± 0.06    0.67 ± 0.03    0.71 ± 0.02
               σ (m)       0.33 ± 0.07   0.22 ± 0.05    0.14 ± 0.03    0.13 ± 0.02
Θ_{G}(θ∣κ)     κ           6.36 ± 1.15   69.53 ± 9.18   164 ± 40.25    9.59 ± 2.11
Θ_{Ḡ}(θ∣κ)     κ           0.32 ± 0.39   15.03 ± 1.38   36.29 ± 9.18   0.89 ± 0.13
Performance rates of the proposed method.

Dataset      G (%)          Ḡ (%)          Total (%)
Caviar       86.68 ± 0.33   94.36 ± 0.23   87.82 ± 0.32
BIWIETH      85.62 ± 0.00   91.15 ± 0.00   90.51 ± 0.00
BIWIHotel    95.89 ± 0.33   96.77 ± 1.61   96.57 ± 0.51
APT          94.77 ± 0.15   99.84 ± 0.10   99.10 ± 2.76
Improvement introduced by compound evaluation over individual decisions.

Dataset      Transition    G (%)    Ḡ (%)   Total (%)
Caviar       Δ → Δ + Θ     −6.5     22.4    −2.7
             Θ → Δ + Θ     4.96     3.89    4.77
BIWIETH      Δ → Δ + Θ     3.03     0.36    0.47
             Θ → Δ + Θ     13.62    2.65    3.18
BIWIHotel    Δ → Δ + Θ     0.06     0.04    0.04
             Θ → Δ + Θ     0.33     0.04    0.07
APT          Δ → Δ + Θ     5.74     0.56    3.14
             Θ → Δ + Θ     8.05     9.66    8.87
Performance comparison of the proposed method with the method of [31] and the maximum absolute log-likelihood ratio method.