An Unsupervised Framework for Online Spatiotemporal Detection of Activities of Daily Living by Hierarchical Activity Models

Automatic detection and analysis of human activities captured by various sensors (e.g., sequences of images captured by RGB camera) play an essential role in various research fields in order to understand the semantic content of a captured scene. The main focus of the earlier studies has been widely on supervised classification problem, where a label is assigned to a given short clip. Nevertheless, in real-world scenarios, such as in Activities of Daily Living (ADL), the challenge is to automatically browse long-term (days and weeks) stream of videos to identify segments with semantics corresponding to the model activities and their temporal boundaries. This paper proposes an unsupervised solution to address this problem by generating hierarchical models that combine global trajectory information with local dynamics of the human body. Global information helps in modeling the spatiotemporal evolution of long-term activities, hence, their spatial and temporal localization. Moreover, the local dynamic information incorporates complex local motion patterns of daily activities into the models. Our proposed method is evaluated using realistic datasets captured from observation rooms in hospitals and nursing homes. The experimental data on a variety of monitoring scenarios in hospital settings reveals how this framework can be exploited to provide timely diagnose and medical interventions for cognitive disorders, such as Alzheimer’s disease. The obtained results show that our framework is a promising attempt capable of generating activity models without any supervision.


Introduction
Activity detection has been considered as one of the major challenges in computer vision due to its utter importance in various applications including video perception, healthcare, surveillance, etc. For example, if a system could monitor human activities, it could prevent the elderly from missing their medication doses by learning their habitual patterns and daily routines. Unlike regular activities that usually occur in a closely controlled background (e.g., playing soccer), Activities of Daily Living (ADL) usually happen in uncontrolled and disarranged household or office environments, where the background is not a strong cue for recognition. In addition, ADLs are more challenging to detect and recognize, because of their unstructured and complex nature that create visually perplexing dynamics. Moreover, each person has his/her own ways to perform various daily tasks resulted in infinite variations of speed and style of performance which accordingly add extra complexity to detection and recognition tasks.
From the temporal aspect, detecting ADLs in untrimmed videos is a difficult task since they are temporally unconstrained and can occur at any time and in an arbitrarily long video (e.g., recordings of patients in a nursing home for days and weeks). Therefore, in activity detection, we are not only interested in knowing the types of the activities happening, but also we want to precisely know the temporal delineation of the activities in a given video (temporal activity localization).
Most of the available state-of-the-art approaches deal with this problem through detection by classification task [1][2][3]. These methods classify the generated temporal segments either in the form of sliding windows in multiple scales [4][5][6] or another external proposal mechanism [7,8]. These methods infer the occurring activity by exhaustively applying trained activity classifiers at each time segment. Although they achieve encouraging performances in short actions and small-scale datasets, these computationally expensive methods can not be applied conveniently to large-scale datasets and complex activities, such as ADLs. These methods are not capable of precisely predicting flexible activity boundaries. Temporal scale variability of the activities can be dealt with by using multiple-scale sliding window approaches, however, such methods are computationally expensive. To compensate the high computational cost of these methods, a class of methods [4,8,9] influenced by advancements in the field of object detection [10][11][12] have been developed in which instead of exhaustive scanning, perform a quick scan to single out candidate activity segments. The sought after activities are more likely to occur in these segments. In the second step, the activity classifiers are only applied to the candidate segments, therefore, remarkably reduce the operational cost. Although these methods have shown good results on activity recognition tasks [13][14][15], they rarely use context priors in their models. Another drawback is that instead of learning an end-to-end deep representation, they use off-the-shelf hand-crafted [16] or deep [17,18] representations independently learned from images. This will result in a poor detection performance as these representations are not intended and are not optimal for localization.
Most of the above-mentioned methods are single-layered supervised approaches. In the training phase of the activities, the labels are fully (supervised) [16,19,20] or partially (weakly supervised) [21,22] given. In other studies [23,24], the location of the person or the interacted object is known. Usually the discovery of temporal structure of activities is done by a linear dynamic system [25], a Hidden Markov Model [26], hierarchical grammars [27][28][29], or by spatiotemporal representation [30,31]. These methods have shown satisfying performances on well-clipped videos, however, ADLs consist of many simple actions forming a complex activity. Therefore, representation in supervised approaches is insufficient to model these activities and a training set of clipped videos for ADL cannot cover all the variations. In addition, since these methods require manually clipped videos, they can mostly follow an offline recognition scheme. There also exist unsupervised approaches [32,33] which are strong in finding meaningful spatiotemporal patterns of motion. Nevertheless, global motion patterns are not enough to obtain a precise classification of ADL. For long-term activities, many unsupervised approaches model global motion patterns and detect abnormal events by finding the trajectories that do not fit in the pattern [34,35]. Other methods have been applied to traffic surveillance videos to learn the regular traffic dynamics (e.g., cars passing a crossroad) and detect abnormal patterns (e.g., a pedestrian crossing the road) [36]. However, modeling only the global motion pattern in a single-layered architecture cannot capture the complex structure of long-term human activities. Moreover, a flat architecture focuses on one activity at a time and intrinsically ignores modeling of sub-activities. Hierarchical modeling, therefore, enables us to model activities considering their constituents in different resolutions and allows us to combine both global and local information to achieve a rich representation of activities.
In this work, we propose an unsupervised activity detection and recognition framework to model as well as evaluate daily living activities. Our method provides a comprehensive representation of activities by modeling both global motion and body motion of people. It utilizes a trajectory-based method to detect important regions in the environment by assigning higher priors to the regions with dense trajectory points. Using the determined scene regions, a sequence of primitive events can be created in order to localize activities in time and learn the global motion patterns of people. To describe an activity semantically, we can adapt a notion of resolution by dividing an activity into different granularity levels. This way, the generated models describe multi-resolution layers of activities by capturing their hierarchical structures and sub-activities. Hereupon, the system can move among different layers in the model to retrieve relevant information about the activities. We create the models to uniquely characterize the activities by deriving relative information and constructing a hierarchical structure. Additionally, a large variety of hand-crafted and deep features are employed as an implicit hint to enrich the representation of the activity models and finally perform accurate activity detection. To summarize, the core contributions of this paper set forth below: • an unsupervised framework for scene modeling and activity discovery; • dynamic length unsupervised temporal segmentation of videos; • generating Hierarchical Activity Models using multiple spatial layers of abstraction; • online detection of activities, as the videos are automatically clipped; • finally, evaluating daily living activities, particularly in health care and early diagnosis of cognitive impairments.
Following these objectives, we conducted extensive experiments on both public and private datasets and achieved promising results. The rest of the paper is organized as follows: Section 2 presents the related studies from the literature. Section 3 explains our suggested approach followed by describing conducted experiments in Section 4. Lastly, Section 5 concludes the paper.
Recent re-emergence of deep learning methods has been led to remarkable performances in various tasks. That success followed by adapting convolutional networks (CNNs) to activity recognition problem for the first time in [42]. The inclination toward using CNNs in the field reinforced by introduction of two-stream [43] and 3D-CNN [17] architectures to utilize both motion and appearance features. Most of these methods are segment-based and usually use a simple method for aggregating the votes of each segment (frame or snippet). There are also other approaches that investigate long-range temporal relations of activities through temporal pooling [37,44,45]. However, the main assumptions in these methods are that the given videos should be manually clipped and the activities should take place in the entire duration of the videos. Therefore, the temporal localization of those activities is not taken into account.

Temporal and Spatiotemporal Activity Detection
The goal in activity detection is to find both the beginning and end of the activities in long-term untrimmed videos. The previous studies performed in activity detection were mostly dominated by sliding window approaches, where the videos are segmented by sliding a detection window followed by training classifiers on various feature types [4,6,[46][47][48]. These methods are computationally expensive and produce noisy detection performances, especially in activity boundaries.
Recently, several studies [4,9,49,50] incorporate deep networks and try to avoid the sliding window approach and search for activities with dynamic lengths. This is usually achieved by temporal modeling of activities using Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM) networks [51,52]. For example, [9] uses an LSTM to encode Convolution3D (C3D) [17] features of each segment and classifies it without requiring an extra step for producing proposals. However, their model is still dependant on hand-crafted features. In order to resolve the problem of short dependencies in RNN based methods, time-series models, such as Temporal Convolutional Networks (TCN) [53,54], employ a combination of temporal convolutional filters and upsampling operations for acquiring long-range activity relations. However, applying convolutional operations on the local neighborhood for detecting long-range dependencies is not efficient in terms of computational time. Moreover, many methods use the concept of Actioness [55] to produce initial temporal activity proposals. Actioness indicates the likelihood of a generic activity localized in the temporal domain. Reliability of the Actioness hinges upon the correctness of distinguishing the background. Unlike conventional activity datasets which contain many background segments, long activities in ADL datasets are usually linked through short background intervals. Accordingly, methods [2,56] which are relied on Actioness cannot effectively determine the temporal boundary of ADLs in such datasets.
The methods used in [57][58][59][60][61] explore the videos to detect activities in spatial and temporal domains simultaneously. Some methods [61,62] employ a supervoxel approach to perform spatiotemporal detection, while others use human detectors [60,63] and treat the detection problem as a tracking problem [57,59]. Most of these approaches require object detection for a more accurate detection and therefore, demand exhaustive annotation of objects in long videos which is a tedious and time-consuming process. Note that the activity detection problem is closely related to object detection problem from images. A major part of the studies in the literature is inspired by object detection, but as it is not the focus of this study, we do not review object detection based methods here. It is worth mentioning that even though the models currently do not utilize object detection features, they still have a flexible design which depends on the availability of the features, any number and types of features can be included or excluded from the models.
Apart from the supervised methods mentioned above, recently there has been increasing attention on methods with unsupervised learning of activities. A pioneer study conducted by Guerra-Filho and Aloimonos [64] sought to overcome the problem of temporal segmentation of human motion which does not require training data. They suggest a basic segmentation method followed by clustering step relied on motion data. Based upon these motion descriptors, they make use of a parallel synchronous grammar system to learn sub-activities of a long activity analogous to identify words in a complete sentence. Another study performed by Fox et al. [65] makes use of the non-parametric Bayesian approach to model pattern of several related atomic elements of an activity identical to elements of a time series without any supervision. Similarly, Emonet et al. [66] proposes an unsupervised Non-parametric Bayesian methods based on Hierarchical Dirichlet Process (HDP) to discover recurrent temporal patterns of words (Motifs). Their method automatically finds the number of topics, recurrence of the activities and the time of their occurrence. Furthermore, several methods take advantage of temporal structure of video data for adjusting parameters of deep networks without using any labeled data for training [67,68]. Some others [69][70][71][72] utilize temporal pattern of activities in an unsupervised way for representation, hence, for detection of activities. Lee et al. [71] formulates representation learning as a sequence sorting problem by exploiting the temporal coherence as a supervisory hint. Temporally shuffled sequence of frames are taken as input for training a convolutional neural network to determine the correct order of the shuffled sequences. In another study conducted by Ramanathan et al. [72], a ranking loss based approach is presented for incorporating temporal context embedding based on past and subsequent frames. A data augmentation technique is also developed to emphasize the effect of visual diversity of context embedding. Fernando et al. [70] leverages the parameters of a frame ranking function as a new video representation method to encode temporal evolution of activities in the videos. The new representation provides a latent space for each video where a principled learning technique is used to model activities without requiring annotation of atomic activity units. Similarly, [73] encodes structured representation of postures and their temporal evolution as motion descriptors for activities. A combinatorial sequence matching method is proposed to realize the relationship between the frames and a CNN is also utilized to detect the conflict of transitions.
So far, state-of-the-art methods are constrained by full supervision and require costly frame level annotation or at least an ordered list of activities in untrimmed videos. By growing the size of the video datasets, it is very important to discover activities in long untrimmed videos. Therefore, recent works propose unsupervised approaches to tackle the problem of activity detection in untrimmed videos. In this work, we use training videos to specify temporal clusters of segments that contain similar semantics throughout the all training instances.

Unsupervised Activity Detection Framework
The proposed framework provides a complete representation of human activities by incorporating (global and local) motion and appearance information. It automatically finds important regions in the scene and creates a sequence of primitive events in order to localize activities in time and to learn the global motion pattern of people. To perform accurate activity recognition, it uses a large variety of features, such as Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), or deep features, as an implicit hint.
As Figure 1 shows, first, long-term videos are processed to obtain trajectory information of the people's movement (input). This information is used to learn scene regions by finding the parts of the scene with a higher prior for activities to occur, i.e., dense regions in terms of trajectory points. A common approach is to assume that there is only one kind of action occurs inside a region [34,36,74]. However, in unstructured scene settings, this assumption may not be valid. In order to distinguish actions occurring inside the same region, we benefit from the local motion and appearance features (visual vocabularies). The learned regions are employed to create primitive events which basically determine primitive state transitions between adjacent trajectory points. Based on the acquired primitive events, a sequence of discovered (i.e., detected) activities is created to define the global motion pattern of people, such as staying inside a region or moving between regions. For each discovered activity, motion statistics, such as time duration, etc., are calculated to represent the global motion of the person. Finally, a model of a certain activity is constructed through the integration of all extracted features and attributes. During the testing phase, the learned regions are used to obtain primitive events of the test video. Again, the video is clipped using discovered zones and the action descriptors are extracted for each discovered activity. Similar to the training phase, for each discovered activity, by combining the local motion information with global motion and other attributes, an activity model is constructed. To recognize activities, a comparison is performed between trained activity models and acquired test activity. A similarity score between the test instance and trained activity models are calculated by comparing global and local motion information of the models. Finally, the activity model with the maximum similarity score is considered as recognized activity of the test instance. Through all the steps, an online scheme is followed to perform continuous activity detection in assisted living scenarios. The subsequent sections describe different parts of the framework in more detail.

Feature Extraction
For local feature detection, improved dense trajectories [75] are employed which densely sample points of interests and track them in consecutive frames of a video sequence. The points of interests are sampled using a W pixels sized grid in multiple scales. Each trajectory is tracked separately at each scale for L frames and the trajectories exceeding this limit are removed from the process. Once the trajectories are extracted, the descriptors in the local neighborhood of the interest-points are computed. There are three different types of descriptors extracted from the interest-points: Trajectory shape, motion (HOF and Motion Boundaries Histogram, a.k.a MBH [75]), and appearance (HOG [76]) descriptors.
Given a trajectory of length L, its shape can be described by a sequence (S = (∆P t , . . . , ∆P t+L−1 )) of displacement vectors: ∆P = (P t+1 − P t ). The final descriptor (trajectory shape descriptor, a.k.a TSD) is computed by normalizing the magnitude of the displacement vector. Other than spatial scales, the trajectories are also calculated in multiple temporal scales in order to represent actions that rapidly occurred.
Motion descriptors (HOF and MBH) are computed in a volume around the detected interest-points and throughout their trajectories (spatiotemporal volume). Size of the constructed volume is N × N pixels around the interest-point and L frames long. For all of the grids in the spatiotemporal volume, the descriptors are calculated and concatenated to represent the final descriptor. While motion-based descriptors focus on the representation of the local motion, appearance descriptor (HOG) represents static appearance information by calculating gradient vectors around the calculated trajectory points.
Geometrical descriptors are also used for representing the spatial configuration of the skeleton joint information and model human body pose in each frame. To represent the skeleton, both joints' Euclidean distances and angles in polar coordinate are calculated using normalized joint positions. In order to preserve temporal information in pose representation, a feature extraction scheme based on temporal sliding window is adapted [77]. At each time instance, Euclidean distances between all the joints are calculated. Besides, for each joint, distance from other instances' joints included in the sliding window is calculated and stored. If J t i represents features of joint i at time t and w shows the sliding window size: Similarly, to calculate angular feature in polar coordinate, we use: and i, j = 1, 2, ..., 8 for both equations. Combining these features produces the final descriptor vector In order to compare the effect of hand-crafted and deep features on our generated activity models, the framework uses Trajectory-Pooled Deep-Convolutional Descriptors (TDD) introduced in [37]. Computing these features are similar to dense trajectory descriptors. The main difference here is that rather than computing the hand-crafted features around the spatiotemporal volume of the trajectories, deep features are extracted using Convolutional Neural Network (CNN) maps. To compute these features, multi-scale convolutional feature maps pool deep features around the interest-points of the detected trajectories. The two-stream ConvNet architecture proposed by Simonyan [43] is adapted for TDD feature extraction. The two-stream CNN consists of two separate CNNs: spatial and temporal networks. The motion features (temporal) are trained on optical flow and extracted using conv3 and conv4 layers of CNN. Additionally, for the training of the appearance features (spatial) on RGB frames, conv4 and conv5 layers of CNN are used.

Global Tracker
Information about the global position of the subjects is indispensable in order to achieve an understanding of long-term activities. For person detection, the algorithm in [78] is applied that detects head and shoulders from RGBD images. Trajectories of the detected people in the scene are obtained using the multi-feature algorithm in [79] using 2D size, 3D displacement, color histogram, the dominant color, and covariance descriptors as a feature and the Hungarian algorithm [80] to maximize the reliability of the trajectories. We use the control algorithm in [81] to tune tracking parameters in an online manner. The output of the tracking algorithm is the input for the framework: where Seq i = Traj 1 , . . . , Traj T . i is the label of the tracked subject and T is the number of trajectories in each sequence. Each scene region characterizes a spatial part of the scene and will be represented as a Gaussian distribution: SR i ∼ (µ i , σ i ).

Scene Model
In most of the trajectory-based activity recognition methods, a priori contextual information is ignored while modeling the activities. The proposed framework performs automatic learning of the meaningful scene regions (topologies) by taking into account the subject trajectories. The regions are learned at multiple resolutions. By tailoring topologies at different levels of resolution, a hierarchical scene model is created. A topology at level l is defined as a set of scene regions (SR): k indicates the number of scene regions defining the resolution of the topology. The scene regions are obtained through clustering which takes place in two stages. This two stages clustering helps to reduce the effect of outlier trajectory points in the overall structure of the topologies. In the first stage, the interesting regions for each subject in the training set are found by clustering their trajectory points. For each Seq, the clustering algorithm produces k clusters: Cluster(Seq i ) = {Cl 1 , ..., Cl k } where each resulted cluster characterizes the scene based on the motion information of subject i. µ and ω parameters of the distribution of the SR i are calculated from the clustering. Cth cluster center (Cl c ) corresponds to scene region i (SR i ). For SR i , µ is the spatial coordinate of the cluster centroid: SR i (µ) = centroid(Cl c ) and the standard deviation σ is computed from the point coordinate sequence of the trajectory set. The second stage of the clustering merges individual scene regions into a single comprehensive set of regions. Each region is a new cluster (Cl) in the second stage partitioning the obtained cluster centroids in the first stage. K-means algorithm is used for the clustering where the optimal value of K is calculated based on the Bayesian Information Criterion (BIC) [82]. We define a scene model as a set of scene regions (topologies) at different resolutions: SceneModel =< Topology highlevel , Topology midlevel , Topology lowlevel > .
We create a model with topologies at three levels, each aims to describe the scene at a high, medium and low degree of abstraction. Figure 2 depicts an example of the calculated scene regions in a hospital room in Centre Hospitalier Universitaire de Nice (CHU) dataset (https://team.inria.fr/stars/demcarechu-dataset/).

Primitive Events
To fill the gap between the low-level image features and high-level semantic description of the scene, an intermediate block capable of linking the two is required. Here, we describe a method that defines a construction block for learning the activity models. With a deeper look at the activity generation process, it can be inferred that the abstraction of low-level features into high-level descriptions does not happen in a single step and this transition is gradual. As a solution, we use an intermediate representation named Primitive Event (PE). Given the two consecutive trajectory data points (Traj i and Traj j ), by using their distance from the cluster centroids, their corresponding scene regions (StartRegion and EndRegion) can be found. A primitive event is represented as a pair of directed scene regions of these trajectory points: where StartRegion and EndRegion variables take values of SR indices. For example, if StartRegion of Traj i : SR 2 and EndRegion of Traj j : SR 4 then, we will have (2 → 4) as a primitive event. PE describes an atomic motion block and is used for characterizing motion of a person in a scene. This way, a whole sequence of trajectory can be translated into PEs. A Primitive Event's type is Stay, when the region labels (Such as SR1) stay constant between two time intervals. It is equivalent to a sequence of Stays in the scene region P: Primitive Event = Stay P_P .
When a Primitive Event's type is Change, a change of region (from region P to region Q) between two successive time instants (i.e., two successive trajectory points) occurs. It is equivalent to a region transition: Primitive Event = Change P_Q .
The duration of the current status (stay/change) can be calculated simply by where f ps is the frame rate of the recorded images.
Using a learned topology T for every video sequence, a corresponding primitive event sequence PE seq is calculated: A primitive event sequence provides information regarding the underlying structure of long-term activities.

Activity Discovery (Detection)
We refer to the detection of the boundaries of the activities as Activity Discovery. Annotating the beginning and end of the activities is a challenging task even for humans. The start/end time of the annotated activities varies from one human annotator to another. The problem is that humans tend to pay attention to one resolution at a time. For example, when a person is sitting on a chair, the annotated label is "sitting". Later, when the subject "moves an arm", she is still sitting. Discovering activities using a different resolution of the trained typologies helps to automatically detect these activity parts and sub-parts at different levels of activity hierarchy using previously created semantic blocks (Primitive Events). Input for activity discovery process is a spatiotemporal sequence of activities described by primitive events. After the activity discovery process: (1) The beginning and end of all activities in a video are estimated and the video is automatically clipped. (2) The video is classified naively into discovered activities indicating similar activities in the timeline. A discovered activity (DA) is considered either as (1) staying in current state ("Stay") or (2) changing of the current state ("Change"). Basically, a Stay pattern is an activity that occurs inside a single scene region and is composed of primitive events with the same type: A "Change" pattern is an activity that happens between two topology regions. A "Change" activity consists of a single primitive event of the same type: Although detection of primitive events takes place at three different resolutions, the activity discovery process only considers the coarse resolution. Therefore, after discovery process, the output of the algorithm for the input sequence is a data structure containing information about the segmented input sequence in the coarse level and its primitive events in two other lower levels. This data structure holds spatiotemporal information similar to the structure in Figure 3. The algorithm for this process simply checks for primitives' boundaries and constructs the data structure for each discovered activity. Employing DAs and PEs, it shows the hierarchical structure of an activity and its sub-activities. Although Discovered Activities present global information about the movement of people, it is not sufficient to distinguish activities occurring in the same region. Thus, for each discovered activity, body motion information is incorporated by extracting motion descriptors (Section 3.1). These descriptors are extracted in a volume of NxN pixels and L frames from videos. Fisher Vector (FV) method [83] is then followed to obtain a discriminative representation of activities. The descriptors are extracted for all Discovered Activities that are automatically computed. The local descriptor information is extracted only for Discovered Activities at the coarse resolution level.

Activity Modeling
Here, the goal is to create activity models with high discriminative strength and less susceptibility to noise. We use attributes of an activity and its sub-activities for modeling and accordingly, learning is performed automatically using the DAs and PEs in different resolutions. Learning such models enables the algorithm to measure the similarity between them. To create the models, a method for assembling the DAs and PEs from different resolutions is required. This is achieved by the concept of hierarchical neighborhood.

Hierarchical Neighborhood
The hierarchical representation of activity A at resolution level l is a recursive representation of the links between A and its primitive events B i at the finer resolutions: B1, . . . , Bn are the primitive events of A in the next finer resolution. The links between the different levels are established using temporal overlap information. For example, primitive event B is sub-activity of activity A in a higher level if their temporal interval overlaps in the activity timeline. Formally, B is sub-activity of A if the following statement holds: By applying (10) to a discovered activity, we can find the primitives in its neighborhood. This automatic retrieval and representation of the neighborhood of a DA help in creating the hierarchical activity models.

Hierarchical Activity Models
Hierarchical activity model (HAM) is defined as a tree that captures the hierarchical structure of daily living activities by taking advantage of the hierarchical neighborhoods to associate different levels. For an input DA (A neighborhood ) and its neighborhood, the goal is to group similar PEs obtained by clustering to create nodes (N) of the activity tree. Clustering is performed using Type attribute of the PEs which groups PEs of the same type in one cluster. This process is repeated for all levels. After clustering, nodes of the tree model are determined followed by linking them together to construct the hierarchical model of the tree. The links between the nodes are realized from the activity neighborhood of each node ( Figure 4 shows the complete procedure of creating an activity tree from neighborhood set instances of a DA). After linking, a complete tree structure of the given DA is obtained and the model is completed by adding attribute information for nodes of the tree. Each node in the activity tree contains information about the similar detected primitive events sharing similar properties, such as duration and type of the primitive, as well as similar sub-activities in the lower level. So, a node is the representative of all the similar primitives in that level. Each node has two types of properties.
The node attributes that store information about primitive events such as average duration of its constituents as well as information about parent node and the associated nodes in the lower level of the hierarchy. The nodes can keep different spatial and temporal attributes about the activity and its sub-activities. The former type consists of:

•
Type attribute is extracted from the underlying primitive or discovered activity (in case of the root node). For node N, Type N = Type PE or Type DA , where Type of PEs and DAs are either Stay or Change states.

•
Instances list PEs of training instances indicating the frequency of each PE included in the node.

•
Duration is a Gaussian distribution Duration(µ d , σ 2 d ) describing the temporal duration of the PEs ({PE 1 , PE 2 , . . . , PE n , }) or discovered activities ({DA 1 , DA 2 , . . . , DA n , }) of the node. It is frame length of the primitives or discovered activities calculated as: where n is the number of PEs or DAs.  Figure 4. The process of creating activity tree. The Primitive Events (PEs) from the training instances are clustered into nodes and, at the same time, the neighborhood set is detected. The final structure is constructed with those building blocks.
The above-mentioned attributes do not describe the relationship between the nodes which is important in the overall description of the activities. In order to model the relationship among the nodes, for each node, two other attributes are defined regarding their sub-nodes: Mixture and Timelapse. Mixture shows contribution of the type of the sub-activities (Stay 2−2 ) in the total composition of sub-nodes. This number is modeled with a Gaussian mixture Θ mixture type . Timelapse of the nodes (with the same type and level in different training instances) represents the distribution of the temporal duration of the sub-nodes. This attribute is also computed as a Gaussian distribution Θ timelapse type . The created HAM structure is a hierarchical tree that provides recursive capabilities. Accordingly, it makes the calculation of the attributes and the score in the recognition step efficient and recursive. Figure 5 illustrates an example of a HAM model with the nodes and their corresponding attributes and sub-attributes.

Descriptor Matching of Tree Nodes
Descriptor matching can be denoted as a method that captures the similarity between a given local dynamic information of an activity and a set of calculated multi-dimensional distributions. The obtained descriptor vectors (H) characterize local motion and appearance of a subject. Knowing the vector representation of the descriptors of discovered activities enables the use of a distance (Equation (14)) measurement to characterize the similarity between different activities.
As it is shown in Figure 6, in training, the scene model is used to clip the long videos to the short clips belonging to each region. Next, the descriptors of the clipped videos are extracted and employed to learn a visual codebook V (one for each region) by clustering the descriptors (using k-means). The codebook of each region is stored in the created activity model of that region. During the testing phase, when a new video is detected by the scene model, its descriptors are extracted and the feature vectors are created. These feature vectors are encoded with the learned dictionaries of the models. The distance of the current descriptor is calculated with the trained codebooks of all regions (to find the closest one) using the Bhattacharyya distance: where N is the number of learned code words and BC is the Bhattacharyya coefficient: N and M display dimensions of the descriptor and trained codebooks, respectively. The most similar codebook is determined by the minimum distance score acquired. That codebook (and its corresponding activity model) is assigned by a higher score in the calculation of the final similarity score with the test instance in the recognition phase.

Model Matching for Recognition
To measure the similarity among the trained HAM models, different criteria can be considered. The assumed criterion can vary from one application to another. While one application can emphasize more on the duration of activities, local motion can be more important for others. Although these criteria can be set depending on the application, the weights of the feature types are learned to determine the importance of each type. The recognition is carried out in five steps as follows: 1.
Perceptual information, such as trajectories of a new subject, is retrieved.

2.
Using the previously learned scene model, the primitive events for the new video are calculated.

3.
By means of retrieved primitive events, the discovered activities are calculated.

4.
Using the collected attribute information, a test instance HAM (ω * ) is created.

5.
The similarity score of the created HAM and trained HAM models are calculated and the activity with the highest score is selected as the target activity.
Once the activity models are trained, to find the one that matches with an activity in a test video, we follow a Bayesian scheme. We choose the final label using the Maximum A Posteriori (MAP) decision rule. If Ω = {ω 1 , ..., ω S }, where S = |Ω| represent the set of generated activity models and given the data for an observed test video, ω , we select the activity model, ω i , that maximizes the likelihood function (Equation (16)): where p (ω i |ω ) denotes the likelihood function defined for activity models ω 1 , ..., ω s in model set Ω. We assume that the activity models are independent. Therefore, a priori probability of trained models p (ω 1 , ..., ω s ) is considered equal. We can eliminate p (ω i ) and use the following formula (Equation (17) p (ω ) is the relative frequency of ω in the training set. Since the generated models are constructed following a tree structure, the likelihood value should be calculated recursively to cover all nodes of the tree. For each model, the recursive probability value is, therefore, calculated as Equation (18): Recur recursively calculates the probabilities of the nodes in lower levels and stops when there is no more leaf to be compared. Superscripts index the levels of the tree ([l] = 1, 2, 3). p(ω [l] i |ω [l] ) calculates probability in the current node given ω and p(ω [l] i |ω [l−1] ) returns the probability values of this node's child nodes (sub-activities). Given the data for node n of the activity in the test video, ω (n) = {type (n), duration (n), l (n)} and the activity model i, The likelihood function for node n is defined as Equation (19).p p ω (n)|l = Distance i (n) .
p ω (n)|type = type i (n) checks whether the types of nodes in test tree and trained model are the same or not: p duration (n)|∆ i duration (n) measures the difference between activity instance ω 's duration and activity model i bounded between 0 and 1. where p ω (n)|l = Distance i (n) compares the distance of training node's trained codebooks V and the test node's computed descriptor histogram H.
It should be noted that the Distance information is only available at root level l = 0 (only for DAs). The recursion stops when it traverses all the leaves (exact inference). Once we computed p(ω |Ω) for all model assignments, using MAP estimation, the activity model i that maximizes the likelihood function p(ω i |ω ) votes for the final recognized activity label (Equation (23)).

Datasets
The performance of the proposed framework is evaluated on two public and one private daily living activity datasets.

GAADRD Dataset
The GAADRD [84] activity dataset consists of 25 people with dementia and mild cognitive impairment who perform ADLs in an environment similar to a nursing home. The GAADRD dataset is public and was recorded under the EU FP7 Dem@Care Project (http://www.demcare.eu/results/ datasets) in a clinic in Thessaloniki, Greece. The camera monitors a whole room where a person performs directed ADLs. The observed ADLs include: "Answer the Phone", "Establish Account Balance", "Prepare Drink", "Prepare Drug Box", "Water Plant", "Read Article", "Turn On Radio". A sample of images for each activity is presented in Figure 7 (top row). Each person is recorded using an RGBD camera of 640×480 pixels of resolution. Each video lasts approximately 10-15 min. We randomly selected 2/3 of the videos for training and the remaining for testing.

CHU Dataset
This dataset is recorded in the Centre Hospitalier Universitaire de Nice (CHU) in Nice, France. The hospitals collecting the dataset have obtained the agreement of an ethical committee. Volunteers and their carers have signed informed consent. Data have been anonymized and can be used only for research. It contains videos from patients performing everyday activities in a hospital observation room. The activities recorded for this dataset are "Prepare Drink", "Answer the Phone", "Reading Article", "Watering Plant", "Prepare Drug Box", and "Checking Bus Map". A sample of images for each activity is illustrated in Figure 7 (middle row). Each person is recorded using an RGBD Kinect camera with 640 pixels × 480 pixels of resolution, mounted on the top corner of the room. The hospital dataset is recorded under the EU FP7 DemCare project (https://team.inria.fr/stars/demcare-chu-dataset/) and it contains 27 videos. For each person, the video recording lasts approximately 15 min. Domain experts annotated each video regarding the ADLs. Similar to GAADRD, for this dataset, we randomly chose 2/3 of the videos for training and the rest for testing.

DAHLIA Dataset
The DAHLIA dataset [85] consists of a total of 153 long-term videos of daily living activities (51 videos recorded from three different views) from 44 people. The average duration of the videos is 39 min containing 7 different actions (and a Neutral class). The considered ADLs are: "Cooking", "Laying Table", "Eating", "Clearing Table", "Washing Dishes", "Housework", and "Working" (Figure 7i-l). To evaluate this dataset, we followed a cross-subject protocol in order to compare our results with existing literature.

Evaluation Metrics
We use various evaluation metrics on each dataset to evaluate our results and compare it with other approaches. For the GAADRD and CHU datasets, we use Precision and recall metrics. True Positive Rate (TPR) or recall is the proportion of actual positives which are identified correctly: TPR = TP TP+FN . The higher the value of this metric, the better is the performance. Similarly, Positive Predictive Value (PPV) or precision is defined as: PPV = TP TP+FP . We also use F-score in our comparisons. The detected intervals are compared against the ground-truth intervals and an overlap higher than 80% of the ground-truth interval is considered as a True Positive detection of that activity.
For evaluation of the unsupervised framework, as the recognized activities are not labeled, there is no matching ground-truth activity label for them. The recognized activities are labeled, such as "Activity 2 in Zone 1". In order to evaluate the recognition performance, first, we map the recognized activity intervals on the labeled ground-truth ranges. Next, we evaluate the one-to-one correspondence between a recognized activity and a ground-truth label. For example, we check which ground-truth activity label co-occurs the most with "Activity 2 in Zone 1". We observe that in 80% of the time, this activity coincides with "Prepare Drink" label in the ground-truth. We, therefore, infer that "Activity 2 in Zone 1" represents "Prepare Drink" activity. For this purpose, we create a correspondence matrix for each activity which is defined as a square matrix where its rows are the recognized activities and the columns are ground-truth labels. Each element of the matrix shows the number of co-occurrences of that recognized activity with the related ground-truth label in that column: a ij ∈ Z + shows the correspondence between activity instance i and ground-truth label j. RA is the set of recognized activity instances and GT shows the set of ground-truth labels. We evaluate the performance of the framework based on the inferred labels. These labels are used for calculating the Precision, Recall, and F-Score metrics.
In order to evaluate the DAHLIA dataset, we use metrics based on frame level accuracy. For each class c in the dataset, we assume TP c , FP c , TN c , and FN c as the number of True Positive, False Positive, True Negative, and False Negative frames, respectively. Therefore, Frame-wise accuracy is defined as: FA 1 = ∑ c∈C TP c ∑ c∈C N c , where N c is the number of correctly labeled frames compared to the ground-truth. F-Score is defined as: F − Score = 2 |C| ∑ c∈C P c ×R c P c +R c where P c and R c are precision and recall metrics of class c, respectively. We also define Intersection over Union (IoU) metric as: C is the total number of action classes.

Results and Discussion
First, the results and evaluations of the three datasets are reported and then compared with state-of-the-art methods. Different codebook sizes are examined for the Fisher vector dictionaries: 16, 32, 64, 128, 256, and 512. Table 1 and Figure 8 show the accuracy of activity detection based on Precision and Recall metrics using the feature type with the highest accuracy. In the case of the GAADRD dataset, the best result achieved with incorporated Motion Boundaries Histogram in Y axis (MBHY) descriptor in the activity models with codebook size set to 256. Based on the obtained results, there is no special trend regarding the codebook size. For some features (MBHY and TDD spatial), the performance increases with an increase in the codebook size and drops when the codebook size becomes much bigger. For TDD temporal feature, performance increases linearly with the codebook size. For the geometrical features, particularly for the Angle feature, there is a big drop of performance with bigger codebook sizes. For others (HOG, HOF), medium-size codebook performs the best. Finding an optimal codebook size is challenging. Small datasets usually work better with smaller codebook size, and as the datasets' size grows, the codebook performs better. Regardless of the codebook size, MBHY descriptor performs better than other features in this dataset. The MBH descriptor is composed of X (MBHX) and Y (MBHY) components. As the activities involve many vertical motions, MBHY descriptor is able to model the activities better compared to the other dense trajectory descriptors and even deep features. It can be noticed that the performance of temporal deep features gets better as the codebook size gets bigger. In addition, motion features (TDD temporal, MBHY) perform better than appearance features and temporal deep features perform better than spatial TDDs. The reason for the lower performance of appearance features might be due to the activities performed in a hospital environment. Hereupon, the background does not contain discriminative information which can be encoded in activity models. It is clear that the Geometrical features perform poorly. Daily living activities are comprised of many sub-activities with similar motion patterns related to object interactions. It seems that geometrical features do not contain sufficient information to ensure encoding these interactions which result in poor detection. Furthermore, the confusion matrix in Figure 10 indicates that the activities with similar motion in their sub-activities are confused with each other the most.
On CHU dataset, the unsupervised framework achieves promising results (Table 2 and Figure 9). Similar to the GAADRD dataset, the effect of codebook size is different for different descriptor types. For MBHY descriptor, the accuracy increases as codebook size grow, whilst, it has the opposite effect on TDD appearance features. Differently, the accuracy increases and then decreases for TDD temporal feature. It can be observed that a bigger codebook size results in better performance. This trend is different from GAADRD dataset and the reason might be because of the larger size of this dataset. TDD temporal features demonstrate a better performance than deep appearance features (TDD spatial). Similarly, due to the similar background of the activities, temporal information shows better results. MBHY achieves the best performance on this dataset. The abundance of vertical motions in the performed activities helps the MBH descriptors to reach better recognition performance. Among appearance features, HOG descriptor shows a better performance since it can encode the appearance information efficiently, where it even outperforms deep appearance features. Detailed analysis ( Figure 10) indicates that the framework has difficulty in recognition of "Watering Plant" activity. It confuses this activity with all the other activities. The short duration of this activity leads to insufficient capture of local dynamic information resulting in recognition issues. The reason for the confusion of the other activities lies mainly on similar motion patterns of the sub-activities. Moreover, this dataset consists of activities recorded from subjects lateral view which makes recognition of those classes of activities challenging.
The results obtained from our proposed framework on GAADRD and CHU datasets are compared with the supervised approach in [75], where videos are manually clipped. Another comparison is made with an online supervised approach that follows [75] using a sliding window scheme. The activity models are evaluated with another version of the models [86] that does not embed local dynamic information (in this version, the score of the local descriptor attribute is omitted and not considered in the final score). A further comparison is performed with a Hybrid framework [87] that combines supervised and unsupervised information in the HAM models. We additionally compare GAADRD dataset with the produced results of another detection algorithm in [88]. Table 3 represents the comparison of our results with the reported performance on GAADRD dataset. In all approaches that use body motion and appearance features, the feature types with the best performances are selected. It can be noticed that using models equipped with both global and local motion features, the unsupervised obtains high sensitivity and precision rates. Compared to the online version of [75], thanks to the learned zones and discovered activities, we obtain better activity localization, thereby a better precision. Using only dense trajectories (not global motion) this online method fails to localize activities. For the "Watering Plant" this method can not detect any instances of this activity in the test set, hence, the Precision, Recall, and F-Score rates are zero. Compared to the unsupervised approach that either uses global motion features or body motion features, we can see that, by combining both features, our approach achieves more discriminative and precise models and improves both sensitivity and precision rates. Although the supervised approach in [75] outperforms the unsupervised framework in recall and F-Score metrics, it actually does not perform activity detection. It uses ground-truth intervals provided by manual clipping and performs offline activity recognition which is a much simpler task. As our approach learns the scene regions, we automatically discover the places where the activities occur, thereby we achieve precise and accurate spatiotemporal localization with a lower cost. As scene region information is missing in the supervised approach, it detects "Turning On Radio" while the person is inside the "Preparing Drink" region. On this dataset, the unsupervised method always performs better than the "Online Supervised" approach and significantly outperforms the sequential statistical boundary detection (SSBD) method. It also outperforms another unsupervised version of the framework while no descriptor information is used in the activity models. Only the supervised methods surpass our unsupervised models. The reason is that the supervised method works with pre-clipped activity videos and overlooks the challenging task of temporal segmentation of activity samples from the original video flow. Table 4 shows the results of evaluated approaches and their comparison with our results on CHU Nice Hospital dataset. In this dataset, as people tend to perform some of the activities in various regions (e.g., preparing the drink at the phone desk), it is difficult to obtain high precision rates. However, compared to the online version of the supervised method in [75], our approach detects all activities and achieves a much better precision rate. The online version of [75] again fails to detect activities accurately and misses some of the "Prepare Drink", and "Reading Article" activities and produces lots of false positives for all other activities. It cannot handle the transition states in the boundary of the activity regions (e.g., walking from telephone desk to DrugBox is detected as "Answer the Phone" activity). For this reason, a random label is assigned for transition states by the classifier, which consequently increases the rate of false positives. Compared to the Online Supervised method, we have increased the average precision rate from 48.06 to 87.65%. Compared to the unsupervised method without embedded descriptor information, we have decreased the false positive rates and increased the precision rates significantly. The highest improvements are on "Answering Phone" from 60 to 92%, "Checking BusMap" from 54.54 to 80.5%, "Prepare Drink" from 80 to 94%, and "Watering Plant" from 53 to 77%. For "Reading Article" activity, there is a small increase in false positive rates, causing an incremental decrease in precision rates. This might be because of the lack of local motion information caused by staying still in a sitting posture for a long time. Since the motion representation of [86] contains only global information, it fails to distinguish activities inside the regions precisely. For instance, passing by the phone zone and answering the phone in the phone zone are considered as the same activity in their models. Hence, their unsupervised approach results in high false positive rates. In addition, we can observe that the proposed approach improves the true positive rates and increased sensitivity rates for most of the activities when it is compared to the "Only Global Motion" method.

DAHLIA Dataset
Different from the two other datasets, the results on the DAHLIA dataset are compared with all the previous evaluations we could find in the literature. Meshry et al. [89] exploits gesturelets extracted from skeleton data to compute geometrical features and detect the activities. The proposed method in [90] takes a graphical approach and poses the activity detection task as a maximum-weight connected sub-graph problem. Inspired by the Hough transformation that is successfully applied in object detection, Chan-Hon-Tong et al. [91] proposes a method with discriminative features to globally optimize the parameters of Hough transform and utilize it for activity segmentation in videos. Finally, our results are compared with [92] that is a supervised method with a semi-supervised component to discover sub-activities. Table 5 demonstrates our results on the DAHLIA dataset. Different metrics are used for evaluation of this dataset to enable comparison with other methods. The table presents the best results that are produced by the generated models embedded with MBHY descriptors. It can be noticed that in this dataset, we significantly outperform [89,90] in all the categories. Efficient Linear Search (ELS) uses geometrical features and produces poor results that are only comparable with our framework when geometrical descriptors are used in the generated models. Despite being an efficient approach, Chen and Grauman [90] demonstrates poor detection performance on Dahlia dataset. Additionally, this method only works in offline mode. Chan-Hon-Tong et al. [91] is another supervised method that uses both skeleton and dense trajectory descriptors and outperforms our framework only on camera view 3, while using the F-score metric. The closest performance to ours is [92] which is a supervised method and utilizes person-centered CNN features (PC-CNN) to detect sub-activities. Moreover, it has an additional post-processing step to refine the sub-activity proposals in the activity boundaries. Although our framework is totally unsupervised, we outperform this method in camera view 2 using all evaluation metrics. Similar results are obtained using different camera angles underlying the robustness of our proposed framework to viewpoint variations and different types of occlusion. This indicates that an efficient multi-view fusion method can remarkably improve the results.
Overall, although our unsupervised framework does not utilize any supervised information, it achieved promising recognition performances. Compared to the fully supervised hybrid method [87], the unsupervised framework obtains acceptable and competitive results in the detection of most of the activities. However, the high performance of the hybrid method comes with the cost of human supervision. In the hybrid method, a supervised Support Vector Machine (SVM) classifier is trained with the ground-truth annotation provided by a human. The main benefits of the unsupervised method are automatic online clipping and detection of activities as well as unsupervised modeling and recognition. With all these benefits, the marginal difference in the recognition rate of the unsupervised method relative to supervised counterparts is admissible. Table 3. Comparison of different recognition frameworks with ours on the GAADRD dataset. The diagram shows the class-wise accuracy of each method with respect to their F-Score values. The best results in each section are indicated in bold.

Conclusions
An online unsupervised framework is proposed for detection of daily living activities, particularly for elderly monitoring. To create the activity models, we benefited from the superiority of unsupervised approaches on representing global motion patterns. Then, discriminative local motion features were employed in order to generate a more accurate model of activity dynamics. Thanks to the proposed scene model, online recognition of activities can be performed with reduced user interaction for clipping and labeling a huge amount of short-term actions which are essential for most of the previously proposed methods. Our extensive evaluations on three datasets revealed that our proposed framework is capable of detecting and recognizing activities in challenging scenarios. The evaluations were intentionally conducted on the datasets recorded in nursing homes, hospitals, and smart homes to examine the implication of the method on ambient surveillance in such environments. Further work will investigate how to generate generic models that can detect activities in any environment with minimum modification of the models. Our goal is to use the developed framework in the evaluation of long-term video recordings in nursing homes and to assess the performance of the subjects to impose early interventions which will result in early diagnosis of cognitive disorders, especially Alzheimer's disease.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: