A Regional Topic Model Using Hybrid Stochastic Variational Gibbs Sampling for Real-Time Video Mining

Abstract: Event localization and real-time computational performance in crowded scenes remain persistent challenges in the field of video mining. In this paper, we address these two problems with a regional topic model. During video topic modeling, the regional topic model simultaneously clusters the motion words of a video into motion topics and the locations of motion into motion regions, so that each motion topic is associated with its region. In addition, a hybrid stochastic variational Gibbs sampling algorithm is developed for inference in our regional topic model, which enables real-time inference on massive video stream datasets. We evaluate our method on simulated and real datasets. The comparison with the Gibbs sampling algorithm shows the superiority of the proposed model and its online inference algorithm in terms of anomaly detection.


Introduction
Video mining is a hot topic that has attracted significant interest in recent years. Video mining can find implicit, valuable, and understandable video patterns by analyzing the visual features, time structure, event relationships, and semantic information of video data [1], and it can be classified into video structure mining and video motion mining [2]. In particular, for poorly structured videos such as traffic surveillance video, video motion mining enables applications such as abnormal event detection and congestion analysis.
With the evolution of video mining technology, an increasing number of research works have focused on the use of topic models for video motion mining. Although probabilistic topic models were originally studied in the field of natural language processing [3,4], they also provide a way to discover hidden patterns in images or document corpora. In text mining, a topic model represents unlabeled documents as mixtures of topics, where latent topics are distributions over observed words. In video motion mining, the full video is treated as a document collection, a short video clip cut from the full video is treated as a document, and the video features are treated as words. In this way, with the introduction of probabilistic topic models into video motion analysis, a variety of latent motion patterns and latent motion correlations, represented by topics, have been discovered. Figure 1 shows the diagram of video topic modeling. Although several topic models have been successfully applied in surveillance systems [5][6][7][8], several problems in video topic modeling remain unresolved, such as abnormal event localization and the computational performance of real-time mining. In this paper, we focus on topic modeling with region information and use it to automatically detect abnormal events in a complex video scene in real time.
The rest of the paper is organized as follows. In the next section, we present a brief survey of related works. Section 3.1, Video Representation, explains the video representation. Section 3.2, Regional Topic Model and Its Online Inference Algorithm, presents our regional topic model (RTM) and its hybrid stochastic variational Gibbs sampling algorithm (HSVG). The datasets, evaluations, and comparisons are discussed in detail in Section 4. Our conclusions are presented in the last section.

Related Works
Recently, a significant number of research works have focused on the use of topic models for complex scene analysis. These methods have become quite popular due to their success in natural language processing, e.g., probabilistic latent semantic analysis (pLSA) [9] and latent Dirichlet allocation (LDA) [5]. Nevertheless, when many motions co-occur, LDA suffers from low sensitivity and is unable to detect abnormal events accurately. In addition, LDA has a problem with abnormal event localization: it can only detect which clip contains the abnormal event, but cannot determine where in the scene the event happened. Therefore, several attempts have been made to model video data using LDA extensions.
X. Wang [10] adopted hierarchical variants of LDA, including a Hierarchical Dirichlet Processes (HDP) [7] mixture model and a Dual Hierarchical Dirichlet Processes (Dual-HDP) model, to connect three elements in visual surveillance: low-level visual features, simple atomic activities, and interactions. Thereafter, X. Wang [11] converted tracks into words and applied a topic model to them. The words were the quantized positions and directions of motion; consequently, the topics represent routes shared between objects. J. Li [12] proposed WS-JTM to address the typical topic model weakness of inference speed and exploited weak supervision. They adapted delta latent Dirichlet allocation (dLDA) in their extension, multi-class dLDA, which is also used to detect rare and subtle behavior. Thereafter, a two-stage cascaded LDA model was formulated by Li et al. [13], where the first stage learns regional behavior and the second stage learns the global context over the regional models. Hospedales T.M. et al. [14] adopted a nonparametric Bayesian approach to automatically determine the number of topics shared by the documents and when they appear in each temporal document. Emonet R. [15] proposed a framework consisting of an activity-based semantic scene segmentation model for learning behavior spatial context and a cascaded probabilistic topic model for learning both behavior correlation context and behavior temporal context at multiple scales. Fu et al. [16] improved sparse topical coding (STC) to discover semantic motion patterns for a dynamic scene, which can be sparsely reconstructed. Yuan et al. [17] used a topic model to discover functional regions in a city using taxi probe data and point-of-interest information. Similarly, Farrahi and Gatica-Perez [18] used a topic model to discover human routines using mobile-phone location data. In [19], LDA was extended to model the flow of people entering or exiting a building. Yu et al. [20] proposed a topic model for detecting an anomalous group of individuals in a social network. Kinoshita et al. [21] introduced a traffic state model based on a probabilistic topic model to describe the traffic states for a variety of roads; the model can be learned using an expectation-maximization algorithm. Hospedales et al. [22] introduced a dynamic topic model named the Markov clustering topic model (MCTM), and formulated an approximation to online Bayesian inference to enable dynamic scene understanding and behavior mining in new video data online and in real time. To handle the temporal nature of video data, Fan et al. [23] devised a dynamical causal topic model (DCTM) that can detect the latent topics and the causal interactions between them.
Meanwhile, several attempts have also been made to find anomalies using topic models and surveillance cameras [24,25]. Jeong et al. [26] proposed a topic model for detecting anomalous trajectories of people or vehicles in surveillance video images. Kaviani et al. [27] addressed the problem of abnormality detection based on fully sparse topic models (FSTM). Isupova et al. [28] proposed a novel dynamic Bayesian nonparametric topic model and its batch and online Gibbs samplers for anomaly detection in video.
In general, existing studies on video mining with topic models face several key problems: (1) increasing the number of model parameters increases the model learning time, so traditional offline inference algorithms are not suitable for video monitoring systems; (2) detecting anomalies over the whole scene rather than within each region reduces the sensitivity of anomaly detection.
To address the problem of motion regions, Zou et al. [29] proposed a belief-based correlated topic model (BCTM) for the semantic region analysis of pedestrian motion patterns in crowded scenes. Haines proposed the regional LDA model (rLDA) [30], which not only models activities in a complicated scene but also achieves high-sensitivity detection and localization of motion topics (especially abnormal events) by extracting the spatial information ignored by LDA. Nonetheless, the inference algorithms in the above studies still rely on collapsed Gibbs sampling, which needs to scan all samples at each iteration. For huge datasets and data streams such as video, this leads to high memory overhead, slow running speed, and difficulty in judging convergence.
Classic inference algorithms for LDA are Gibbs sampling (GS) [31] and variational Bayesian (VB) batch inference [5]. To reduce computational complexity, collapsed Gibbs sampling (CGS) [6] and collapsed VB batch inference (CVB) [32] were proposed. Nevertheless, to apply LDA to video mining, the inference algorithm must be adapted to the characteristics of video streaming datasets, ideally achieving fast and efficient real-time online processing. For text databases that are huge or arrive as data streams, online LDA inference algorithms with lower memory use, faster running, and faster convergence have been developed. Hoffman proposed a stochastic gradient optimization algorithm (online LDA) [33], which repeatedly subsamples a small set of documents from the collection and then updates the topics from an analysis of the subsample. Since online LDA does not need to scan all samples to update the topic parameter matrix at each iteration, the topic parameters are updated more frequently. The algorithm not only uses less memory, runs faster, and converges faster, but also enables real-time online inference for huge datasets or data streams. Nonetheless, its complexity increases linearly with the number of topics, so it is not suitable for large collections with many topics. Building on the online LDA algorithm, Mimno proposed hybrid stochastic variational Gibbs sampling (HSVG) [34]. This algorithm introduces a second source of stochasticity through MCMC sampling and takes advantage of sparse computation so that complexity increases sublinearly with the number of topics. It is therefore suited to large collections with many topics. In addition, Girolami proposed the RLD (Riemannian Langevin dynamics) algorithm [35], a Langevin dynamics algorithm on a Riemannian manifold with MH correction. Welling proposed the SGLD (stochastic gradient Langevin dynamics) algorithm [36], which retains stochastic gradient optimization and can sample from the posterior distribution. Patterson proposed SGRLD (stochastic gradient Riemannian Langevin dynamics) [37] by combining the RLD and SGLD algorithms. In addition, Olga Isupova et al. [38] proposed new learning algorithms for activity analysis in video, which are based on the expectation maximization approach and variational Bayes inference.

Video Representation
To discover motion patterns in video by topic modeling, the definitions of visual words and visual documents are essential. Given an input video, we first temporally segment the video into non-overlapping clips. Each clip is considered a document. To create visual words, we segment the scene into a grid of cells. Next, we compute the optical flow field for moving objects from the foreground mask extracted in each frame, and then generate optical flow histograms for each clip by accumulating counts per grid cell $i$ over the frames of the clip. After spatial and directional quantization, a video motion word labeled $v \in \{0, 1, \ldots, V-1\}$ is split into a grid position $i \in \{0, 1, \ldots, I-1\}$ and a motion direction $\omega \in \{0, 1, \ldots, \Omega-1\}$. Finally, we select the largest optical flow histogram to generate a motion word sample $v = (i, \omega)$. Thus, for a visual word $v \in \{0, 1, \ldots, V-1\}$, the motion position and direction together express a motion word $(i, \omega)$, and all the motion words in a video clip constitute the bag of visual words (BOVW).
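The quantization step above can be illustrated with a minimal sketch. It assumes that optical flow vectors are already available as `(x, y, dx, dy)` tuples; the function names, the row-major cell numbering, the direction numbering (sector 0 centered on the positive x axis), and the flat index $v = i \cdot \Omega + \omega$ are illustrative assumptions rather than details given in the paper. The sketch also counts every vector instead of keeping only the largest histogram bin.

```python
import math
from collections import Counter

def quantize_flow(x, y, dx, dy, width, height, cols, rows, n_dirs):
    """Map a flow vector (dx, dy) at pixel (x, y) to (grid cell i, direction omega)."""
    col = min(int(x * cols / width), cols - 1)
    row = min(int(y * rows / height), rows - 1)
    i = row * cols + col                      # assumed row-major cell index
    sector = 2 * math.pi / n_dirs             # angular width of one direction bin
    angle = math.atan2(dy, dx) % (2 * math.pi)
    omega = int((angle + sector / 2) // sector) % n_dirs
    return i, omega

def encode_word(i, omega, n_dirs):
    """Assumed flat word index v = i * Omega + omega."""
    return i * n_dirs + omega

def decode_word(v, n_dirs):
    """Inverse of encode_word: returns (i, omega)."""
    return divmod(v, n_dirs)

def clip_to_bovw(flows, width, height, cols, rows, n_dirs):
    """Accumulate the bag of visual words for one clip from its flow vectors."""
    bag = Counter()
    for x, y, dx, dy in flows:
        i, omega = quantize_flow(x, y, dx, dy, width, height, cols, rows, n_dirs)
        bag[encode_word(i, omega, n_dirs)] += 1
    return bag
```

With a 4 by 4 grid and four directions, this yields $V = 16 \times 4 = 64$ possible words per clip.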

Regional Topic Model and Its Online Inference Algorithm
In our RTM, the goal is to discover a set of motions (topics) from video by learning the probability distributions of visual features over each topic and of topics over each clip. These two probability distributions are represented as two co-occurrence matrices in Figure 1. Meanwhile, the location information of motion is discovered. However, a BOVW representation under the latent Dirichlet allocation (LDA) model presumes that the words in a document are unordered and interchangeable. This hypothesis destroys the spatial information of motions or activities, so the motion region cannot be obtained from model learning.
To keep and use the spatial information of motion in video, we introduce RTM, in which each sample in a frame is labeled not only by its motion direction $\omega$ but also by a motion region label $r \in \{0, 1, \ldots, R\}$. This means that the latent motion topics in videos are associated with the regions in which they occur.
Suppose that there are $J$ documents (video clips), and each document $j \in \{1, \ldots, J\}$ contains $N_j$ observed samples $x_{jn}$. Let $t_{jn}$ be the motion topic of sample $x_{jn}$, and $r_{jn \in i}$ the motion region label of its location $i$. Then the video sequence can be represented as $X = \{X_j\}_{j=1}^{J}$, $X_j = \{x_{jn}\}_{n=1}^{N_j}$. The latent variables are the sets of motion topic and region labels, $Z = \{T, R\} = \{t_{jn}, r_{jn \in i}\}_{n=1, j=1}^{N_j, J}$. From a local (per-document) perspective, the motion topic weight vector of document $j$ can be expressed as $\pi_j = \{\pi_{jt}\}_{t=1}^{T}$. When a symmetric Dirichlet prior with hyperparameter $\alpha$ is placed on the topic weight vector $\pi_j$, we have $\pi_j \sim \mathrm{Dir}(\alpha)$. From a global perspective, the motion region weight vector can be expressed as $\rho = \{\rho_r\}_{r=1}^{R}$. When a symmetric Dirichlet prior with hyperparameter $\beta$ is placed on the region weight vector, we have $\rho \sim \mathrm{Dir}(\beta)$.
Document $j$ shares $T$ motion topics through the local topic weight vector $\pi_j = \{\pi_{jt}\}_{t=1}^{T}$. In other words, the motion topic label subset $T_j = \{t_{jn}\}_{n=1}^{N_j}$ obeys a $T$-dimensional multinomial distribution with parameter $\pi_j$:

$$T_j \sim \mathrm{Mult}(\pi_j)$$

Algorithms 2018, 11, 97

The local motion topic weight vector $\pi_j$ obeys a symmetric Dirichlet prior distribution with parameter $\alpha$:

$$\pi_j \sim \mathrm{Dir}(\alpha)$$

The corpus shares $R$ motion regions through the global region weight vector $\rho = \{\rho_r\}_{r=1}^{R}$. In other words, the motion region label set $R$ obeys an $R$-dimensional multinomial distribution with parameter $\rho$:

$$R \sim \mathrm{Mult}(\rho)$$

Likewise, the global region weight $\rho$ obeys a symmetric Dirichlet prior distribution with parameter $\beta$:

$$\rho \sim \mathrm{Dir}(\beta)$$

Given the motion topic label $t_{jn} = t$ and the motion region label $r_{jn \in i} = r$ of sample $x_{jn}$, the sample subset with these labels obeys a multinomial distribution with the hybrid parameter $\theta_{rt}$:

$$x_{jn} \mid t_{jn} = t,\ r_{jn \in i} = r \sim \mathrm{Mult}(\theta_{rt})$$

The hybrid parameter $\theta_{rt}$ obeys a symmetric Dirichlet prior distribution with parameter $\lambda$:

$$\theta_{rt} \sim \mathrm{Dir}(\lambda)$$

The construction of RTM is summarized as follows: the number of motion topics is $T$; the number of motion regions is $R$; the video sequence contains $J$ documents; the observed sample set is $X_j = \{x_{jn}\}_{n=1}^{N_j}$; and the corresponding latent variable set is $Z_j = \{T_j, R\} = \{t_{jn}, r_{jn \in i}\}$. The generative process of RTM is as follows, and the corresponding graphical model is shown in Figure 2.

Generate $\rho \sim \mathrm{Dir}(\beta)$
For each sample $x_{jn}$:
    Generate a regional label $r_i \sim \rho$ for its location $i$
For each video clip $j$:
    Generate a motion topic weight vector $\pi_j \sim \mathrm{Dir}(\alpha)$
    For each sample $x_{jn}$:
        Generate a motion topic label $t_{jn} \sim \pi_j$
        Generate $\theta_{rt} \sim \mathrm{Dir}(\lambda)$
        Generate $x_{jn} \sim \mathrm{Mult}(\cdot \mid \theta_{rt})$

In the generative process described above, the unknown parameters to be estimated are $\pi_j = \{\pi_{jt}\}_{t=1}^{T}$, $\rho = \{\rho_r\}_{r=1}^{R}$, and $\theta_{rt} = \{\theta_{\omega rt}\}_{\omega=1}^{\Omega}$; the known data are the observed samples $X_j = \{x_{jn}\}_{n=1}^{N_j}$ and their joint distribution, as shown in Equation (7):

$$\ldots\, p(t_{jn} \mid \pi_j)\, p(x_{jn} \mid t_{jn}, r_{jn \in i}, \theta_{r_{jn \in i}, t_{jn}}) \quad (7)$$

According to the above construction of RTM, model learning acts as a clustering of the document sample subsets $X_j = \{x_{jn}\}_{n=1}^{N_j}$. The word samples are clustered not only into $T$ motion topics but also into $R$ motion regions, so each latent motion topic is necessarily correlated with a spatial region. It is worth noting that several studies have introduced latent variables that merge various factors to jointly estimate document contents, but they differ clearly from our RTM. For instance, in the topic modeling of documents, Rosen-Zvi et al. [39] introduced an author latent variable, and Bao et al. [40] introduced an emotion latent variable. Both first generate the introduced variable (emotion or author) from a specific distribution, then generate a latent topic from a multinomial distribution conditioned on the generated variable, and finally generate document terms from another multinomial distribution based on the latent topic. In contrast, our RTM generates the introduced variable (region) and the latent topic in two independent steps, and finally generates document terms from a multinomial distribution based on the fixed latent region and topic. Therefore, a different generative process leads to a different form of joint distribution as well as a different inference algorithm.
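The generative process of RTM described above can be sketched as a small forward sampler. This is only an illustration under stated assumptions: the function names are hypothetical, one region label is drawn per location rather than per sample, the location of each sample is drawn uniformly because the text does not specify how it is generated, and $\theta_{rt}$ is drawn once per (region, topic) pair.

```python
import random

def dirichlet(alpha, k, rng):
    """Symmetric Dirichlet(alpha) draw of dimension k via normalized Gamma draws."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(g)
    return [x / total for x in g]

def categorical(p, rng):
    """Draw an index with probabilities p."""
    return rng.choices(range(len(p)), weights=p, k=1)[0]

def generate_rtm(J, N, T, R, I, Omega, alpha, beta, lam, seed=0):
    """Sample J clips of N motion words (i, omega) from the RTM generative story."""
    rng = random.Random(seed)
    rho = dirichlet(beta, R, rng)                           # rho ~ Dir(beta)
    region_of = [categorical(rho, rng) for _ in range(I)]   # r_i ~ rho, per location
    theta = [[dirichlet(lam, Omega, rng) for _ in range(T)]
             for _ in range(R)]                             # theta_rt ~ Dir(lambda)
    clips = []
    for _ in range(J):
        pi_j = dirichlet(alpha, T, rng)                     # pi_j ~ Dir(alpha)
        clip = []
        for _ in range(N):
            i = rng.randrange(I)                            # assumption: uniform location
            t = categorical(pi_j, rng)                      # t_jn ~ pi_j
            omega = categorical(theta[region_of[i]][t], rng)  # x_jn ~ Mult(theta_rt)
            clip.append((i, omega))
        clips.append(clip)
    return clips, region_of
```

Drawing the region and the topic in separate steps, and only combining them when the word is emitted, mirrors the distinction from the author-topic and emotion-topic models discussed above.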
As with traditional topic models, there are generally two kinds of inference methods for our RTM: MCMC sampling and VB inference. To realize real-time video mining, we propose a hybrid stochastic variational Gibbs (HSVG) sampling algorithm for RTM. In contrast to HSVG sampling, the Gibbs sampling algorithm is a batch algorithm that needs to scan all samples at each iteration. Therefore, because of its large memory overhead, slower running, and the difficulty of determining convergence, even the collapsed Gibbs sampling algorithm is not suitable for huge datasets or data streams. The HSVG algorithm introduces a second source of stochasticity through MCMC sampling and takes advantage of sparse computation so that complexity increases sublinearly with the number of topics, which suits large collections with many topics. The inference process of our HSVG algorithm is formulated in more detail as follows.
Firstly, the motion region label is considered a global latent variable. We eliminate the local motion topic weight $\pi_j$ by marginalization and obtain the local collapsed space of the latent variables. The strong correlation between the latent variable $Z$ and the local motion topic weight $\pi_j$ is thereby retained. The joint distribution becomes Equation (8). Next, to improve the inference accuracy by retaining the weak correlation among the local latent variables $T_j$, we suppose that $T_j$ obeys an indecomposable variational distribution

$$q(T_j \mid \eta_j) \neq \prod_{n=1}^{N_j} q(t_{jn} \mid \eta_{jn}) \quad (9)$$

Therefore, in the inference of the semi-collapsed RTM, we only need to suppose that the global latent variable $R$, the local latent variables $T_j$, the motion region weight $\rho$, and the global hybrid parameters $\theta$ are independent. Then the variational distribution $q$ with free variational parameters $\nu$, $\sigma$, $\mu_{rt}$, and $\{\eta_j\}_{j=1}^{J}$ can be decomposed into Equation (10). The semi-collapsed ELBO (evidence lower bound) of the document collection $X$ is then the global objective function. The motion topic weight $\pi$ in $p(\pi, T \mid \alpha)$ is eliminated by integration, which yields Equation (13). At this point, the local variational objective function $l_j$ of each document $j$ is obtained. Next, the stochastic variational inference of the global layer and the MCMC inference of the local layer are described.

Local MCMC Inference
We compute the first derivative of the local objective function $l_j$ with respect to the variational parameters $\eta_j$. Setting Equation (15) equal to zero gives the optimal variational distribution $q^*(T_j \mid \eta_j)$. Equation (16) involves the variational expectation of the sufficient statistic $N_{j\omega rt}$. The MCMC sampling method can be used to estimate the optimal variational distribution $q^*(T_j \mid \eta_j)$ without assuming independence of the local latent variables. We construct a Markov chain whose stationary distribution is the optimal variational distribution of the local latent variables. The key problem of Gibbs sampling is then computing the transition probability of the Markov chain, which equals the predictive probability of the motion topic label $t_{jn}$ of sample $x_{jn}$; this gives Equation (18). The predictive probability is learned iteratively in the Markov chain, which converges after $N$ Gibbs sampling state transitions during the burn-in period. After the Markov chain has converged, the arithmetic average of the sample sufficient statistics $N_{j\omega rt}$ is the estimate of $\mathbb{E}_{\eta_j}[N_{j\omega rt}]$.
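The burn-in and averaging procedure described above can be sketched as follows. The exact transition probability of Equation (18) is not reproduced here, so the sketch assumes a commonly used form for hybrid variational Gibbs updates, $p(t_{jn} = t) \propto (\alpha + N_{jt}^{\neg n}) \cdot w(\omega, r, t)$, where `weight` is a hypothetical callable standing in for the global-parameter term. All names are illustrative.

```python
import random
from collections import defaultdict

def local_gibbs_expectation(samples, weight, T, alpha, burn_in, n_keep, seed=0):
    """Estimate E[N_{j,omega,r,t}] for one document by Gibbs sampling.

    samples: list of (omega, r) pairs for the document's observed words.
    weight:  callable (omega, r, t) -> non-negative float (assumed global term).
    Returns a sparse dict {(omega, r, t): average count after burn-in}.
    """
    rng = random.Random(seed)
    topics = [rng.randrange(T) for _ in samples]   # random initial topic labels
    n_t = [0] * T                                  # per-topic counts in this document
    for t in topics:
        n_t[t] += 1
    avg = defaultdict(float)
    for sweep in range(burn_in + n_keep):
        for n, (omega, r) in enumerate(samples):
            n_t[topics[n]] -= 1                    # exclude sample n from the counts
            p = [(alpha + n_t[t]) * weight(omega, r, t) for t in range(T)]
            topics[n] = rng.choices(range(T), weights=p, k=1)[0]
            n_t[topics[n]] += 1
        if sweep >= burn_in:                       # average only after burn-in
            for (omega, r), t in zip(samples, topics):
                avg[(omega, r, t)] += 1.0 / n_keep
    return dict(avg)
```

Only the (direction, region, topic) triples that were actually visited appear in the returned dictionary, which is what makes the estimate sparse.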

Global Stochastic Variational Inference
In the $t$th stochastic iteration, the state space of the Markov chain is constructed from the sample sufficient statistics $N_{(B_t)\omega rt} = \sum_{j \in B_t} N_{j\omega rt}$ of a random mini-batch of documents $B_t$, and the contribution of $B_t$ to the natural gradient of the global variational parameters is computed from these statistics. The variational expectation of $N_{j\omega rt}$ or $N_{(B_t)\omega rt}$ obtained from local MCMC inference then enters the update of the global variational parameters. When the number of motion topics $T$, motion regions $R$, or words $\Omega$ is large, the $N_j$ observed samples of document $j$ are allocated to a large $T \times R \times \Omega$ hybrid parameter matrix $\theta = \{\theta_{\omega rt}\}_{\omega=1, r=1, t=1}^{\Omega, R, T}$, which makes many entries of the sufficient statistics $N_{j\omega rt}$ zero. Hence $\mathbb{E}_{\eta_j}[N_{j\omega rt}]$ estimated by MCMC sampling is a sparse matrix, and this sparsity reduces the amount of computation and improves computing speed.
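The mini-batch sum $N_{(B_t)\omega rt} = \sum_{j \in B_t} N_{j\omega rt}$ and its sparsity can be illustrated with a short sketch. It assumes that each document's statistics are stored as a dictionary keyed by $(\omega, r, t)$ that holds only nonzero entries; the function names are hypothetical.

```python
from collections import Counter

def minibatch_statistics(doc_stats, batch):
    """Sum sparse per-document statistics over the documents in one mini-batch.

    doc_stats: {doc_id: {(omega, r, t): count}} holding only nonzero entries.
    batch:     iterable of document ids forming B_t.
    """
    total = Counter()
    for j in batch:
        total.update(doc_stats[j])   # adds counts key by key
    return total

def density(stats, Omega, R, T):
    """Fraction of the Omega x R x T cells that are nonzero."""
    nonzero = sum(1 for v in stats.values() if v != 0)
    return nonzero / (Omega * R * T)
```

Because only nonzero cells are stored and touched, the cost of the sum scales with the number of occupied cells rather than with $T \times R \times \Omega$.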
Given the above description, the specific description of HSVG algorithm is shown in Algorithm 1.

Algorithm 1. HSVG algorithm of RTM model
Initialize global variational parameters $\varsigma_r$ and $\mu^{(0)}_{\omega rt}$
while the number of random iterations $t = 0 : t_{max}$ do
    Update iteration step: $\rho_t = a(\tau_0 + t)^{-\kappa}$
    Import a small batch of documents $B_t$
    Initialize $T_j^{(0)}$ from $(1, \ldots, T)$, $R^{(0)}$
    For a sample $n = 1 : N_j$ of document $j$
        local MCMC inference is adopted
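The step-size schedule $\rho_t = a(\tau_0 + t)^{-\kappa}$ in Algorithm 1 can be sketched as below. The global update itself is not spelled out in the listing, so `blend` assumes the usual stochastic variational form, a convex combination of the current parameter and the mini-batch estimate; both function names and the default constants are illustrative assumptions.

```python
def step_size(t, a=1.0, tau0=1.0, kappa=0.7):
    """Step size rho_t = a * (tau0 + t) ** (-kappa).

    kappa in (0.5, 1] satisfies the usual Robbins-Monro conditions.
    """
    return a * (tau0 + t) ** (-kappa)

def blend(mu, mu_hat, rho_t):
    """Assumed global update: (1 - rho_t) * mu + rho_t * mu_hat on sparse dicts."""
    keys = set(mu) | set(mu_hat)
    return {k: (1 - rho_t) * mu.get(k, 0.0) + rho_t * mu_hat.get(k, 0.0)
            for k in keys}
```

A larger $\tau_0$ damps the early iterations, while $\kappa$ controls how quickly older mini-batches are forgotten.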

Evaluation Criterion
In text mining and statistical inference for natural language, perplexity is commonly used to evaluate model performance. It is computed as $\mathrm{perp}(X_{test} \mid \phi) = \exp\{-\log p(X_{test} \mid \phi)/N_{test}\}$, where $\phi$ denotes the parameter estimate of the trained model, and $X_{test}$ and $N_{test}$ are the test dataset and its number of observed samples, respectively. Perplexity is thus the exponential of the negative log likelihood (NLL), $-\log p(X_{test} \mid \phi)$, divided by the number of observed samples $N_{test}$. As described above, computing perplexity mainly amounts to computing the NLL of the trained model, and the NLL reflects the cross entropy between the trained model and the unseen test data. Perplexity represents the uncertainty of the trained model when estimating an unseen test set. Therefore, the lower the NLL, the better the model performance.
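As a small worked example of the perplexity formula above, the following sketch computes the NLL and the perplexity from per-sample predictive probabilities. It assumes these probabilities $p(x \mid \phi)$ have already been produced by a trained model; the function names are illustrative.

```python
import math

def negative_log_likelihood(sample_probs):
    """NLL = -sum(log p(x | phi)) over the observed test samples."""
    return -sum(math.log(p) for p in sample_probs)

def perplexity(sample_probs):
    """perp = exp(NLL / N_test)."""
    return math.exp(negative_log_likelihood(sample_probs) / len(sample_probs))
```

For a model that assigns probability 0.25 to every test sample, the perplexity is 4, matching the intuition that the model is as uncertain as a uniform choice among four words.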
In our RTM, the NLL can be computed for the motion topics and regions of a test video clip. The parameters of the trained model and of a test video clip are learned in RTM; the learned local motion topic weight estimate $\pi_t$ and the global parameter estimates $\varsigma_{ir}$ and $\theta_{\omega rt}$ of the trained model are then used to compute the NLL of test clips.
In addition, t_NLL and r_NLL are computed as in Equation (26). t_NLL is used to evaluate how well our model learns motion topics, and r_NLL is used to evaluate how well it learns motion regions. Because the number of samples differs between regions, the abnormal event probability of a region containing few samples is lower, so we add a sample-number weight for each region. We regard the five regions with the highest r_NLL values as the most likely abnormal event regions. Furthermore, we use the receiver operating characteristic (ROC) curve and the AUC (area under the ROC curve), which are independent of threshold selection, to evaluate the abnormality detection performance of our model. For the ROC curve, the closer it is to the top-left corner, the better the abnormality detection performance. Similarly, the closer the AUC is to 1, the better the performance. The running platform of the experiment is shown in Table 1.
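The threshold independence of the AUC can be made concrete with a minimal sketch. It uses the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve: the probability that a randomly chosen abnormal clip receives a higher score than a randomly chosen normal clip. The function name and the 0/1 label encoding are assumptions for illustration.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (abnormal, normal) pairs ranked correctly.

    scores: abnormality scores, higher means more abnormal.
    labels: 1 for abnormal clips, 0 for normal clips.
    Ties between an abnormal and a normal score count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one abnormal and one normal clip")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

No threshold appears anywhere in the computation, which is why the AUC summarizes the whole ROC curve.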

Datasets and Parameter Settings
To evaluate our model and its inference algorithm comprehensively, we analyze performance on two types of dataset. The first is a simulated video dataset constructed by specific steps, and the second is a real video dataset.

Simulation Video Dataset
We construct a simulated video dataset that imitates a traffic intersection. Each frame image is divided into a 6 by 6 grid (a total of 36 positions) and five valid regions (up, middle, down, right, and left). Each valid region is composed of 4 grid cells, giving a total of 4 × 5 = 20 locations that simulate the center and four directions of a traffic intersection, as shown in Figure 3.

In Figure 3, white text represents the valid region number $r \in \{0, 1, 2, 3, 4\}$, and red text represents the valid location number $i \in \{0, \ldots, 19\}$. The four directions are down (green), left (blue), up (purple), and right (red), numbered $\omega \in \{0, 1, 2, 3\}$. Thus the number of locations is $I = 20$ and the number of motion directions is $\Omega = 4$. A motion word is denoted by $(i, \omega)$, where $i \in \{0, \ldots, I-1\}$ and $\omega \in \{0, \ldots, \Omega-1\}$, so the number of motion words is $W = I \times \Omega = 80$. A latent motion topic is constructed by combining a region number $r$ and a direction number $\omega$. For instance, $(i \in r = 0, \omega = 0)$ means that one location of region 0 (with location number in $\{0, 1, 2, 3\}$) is moving in direction 0 (down). Normal and abnormal motions can be constructed in this way. There are five kinds of normal motion states and two kinds of abnormal motion states. The generative algorithm of the simulated video is described below:
1. Generate an initial motion direction $\omega$ randomly.
2. Choose a motion topic, where the probability of an abnormal motion topic is 5% and the probability of a normal motion topic is 95%.
3. Generate 100 samples by repeating the following steps:
   (a) Based on the chosen motion topic, choose a motion state randomly.
   (b) Based on the chosen motion state, generate an observed sample $(\omega, i)$ randomly.
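The generative steps above can be sketched in code. The concrete motion states are not listed in the text, so the (region, direction) pairs below are hypothetical placeholders, as is the assumption that region $r$ owns locations $4r$ to $4r + 3$ (consistent with region 0 owning $\{0, 1, 2, 3\}$). The initial-direction step is omitted because its role in the later steps is not specified.

```python
import random

# Hypothetical motion states, each a (region, direction) pair.
NORMAL_STATES = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 0)]   # five normal states (assumed)
ABNORMAL_STATES = [(0, 2), (3, 1)]                          # two abnormal states (assumed)
LOCATIONS_PER_REGION = 4

def generate_clip(rng, n_samples=100, p_abnormal=0.05):
    """Generate one simulated clip as a list of (omega, i) samples."""
    abnormal = rng.random() < p_abnormal          # 5% abnormal, 95% normal topic
    states = ABNORMAL_STATES if abnormal else NORMAL_STATES
    samples = []
    for _ in range(n_samples):
        region, omega = rng.choice(states)        # choose a motion state
        # assumed layout: region r owns locations 4r .. 4r + 3
        i = region * LOCATIONS_PER_REGION + rng.randrange(LOCATIONS_PER_REGION)
        samples.append((omega, i))
    return samples, abnormal
```

Calling `generate_clip(random.Random(0))` repeatedly with the same generator yields a sequence of clips, each of which can serve as one document.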

Real Video Dataset
We use the QMUL street intersection dataset [41] to evaluate the abnormality detection performance of the model. This standard video is 50 min long, with a frame rate of 30 fps, a resolution of 360 × 288, and 90,000 frames in total. The video is encoded with MPEG-4 compression. A whole traffic light cycle lasts about 1.5 min, and the average duration of an abnormal event is 4.3 s (129 frames).

In our experiment, we divide the whole video into 250 clips of 12 s (360 frames) each. The first 14.36 min (30 documents) form the training dataset, and the last 10 min (50 documents) form the test dataset. Each clip scene is first divided into a 4 by 4 grid (16 positions in total) and five valid regions. After the sky portion is cut off and the motion codebook for model training is generated, RTM-HSVG (RTM learned and inferred by hybrid stochastic variational Gibbs sampling) and RTM-GS (RTM learned and inferred by Gibbs sampling) compute the negative log-likelihood (NLL) of every region in each test clip as an abnormality score, and a clip is flagged as abnormal when its score exceeds 1.5 times the average. The parameter settings are shown in Table 2.
Table 2. Parameter settings of RTM-HSVG and RTM-GS.
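The thresholding rule described above can be illustrated with a short Python sketch. It assumes that the per-region NLL values are already available as an array, that the score of a clip is the largest NLL among its regions, and that "average" means the mean clip score over the test set; the text does not pin down these details, and the function name `flag_abnormal_clips` is hypothetical.

```python
import numpy as np


def flag_abnormal_clips(region_nll, factor=1.5):
    """Flag clips whose abnormality score exceeds `factor` times the average.

    region_nll: array of shape (n_clips, n_regions) holding the negative
    log-likelihood of every region in every test clip.
    Returns a list of (clip_index, region_index, score) tuples.
    """
    region_nll = np.asarray(region_nll, dtype=float)
    # Assumption: the clip score is the NLL of its least likely region.
    clip_scores = region_nll.max(axis=1)
    worst_regions = region_nll.argmax(axis=1)
    # Assumption: the average is taken over all test clips.
    threshold = factor * clip_scores.mean()
    flagged = np.flatnonzero(clip_scores > threshold)
    return [(int(c), int(worst_regions[c]), float(clip_scores[c])) for c in flagged]


# Toy example: three clips with two regions each; only the last clip stands out.
example = flag_abnormal_clips([[1.0, 2.0], [1.0, 2.0], [1.0, 8.0]])
```

Returning the region index together with the clip index mirrors the aim of the regional model, which is to report where an abnormal motion occurs and not only when.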

Simulation Experiment of Visualization Traffic Intersection
First, the motion topics and regions discovered by RTM-GS and RTM-HSVG are shown in Figure 5, where Figure 5a shows the seven random simulated abnormal motions. The number of simulated training documents is the same as the number of test documents, namely 100. As can be seen from Figure 5, although RTM-GS discovers more latent regions, RTM-HSVG obtains a more refined topic division. Furthermore, in RTM-HSVG, the two roads with the same direction are combined into one latent region, so that the region extends across the crossroad. This is reasonable, because motions that obey the traffic rules of the same road should be the same.
To compare the abilities of the models to discover abnormal motion, the NLL and r_NLL of our model are compared with the actual values in Figure 6. As shown in Figure 6, for the r_NLL curve, the accuracy of RTM-GS appears to be higher than that of RTM-HSVG. Nonetheless, for the NLL curve, RTM-HSVG obtains a higher accuracy. Because HSVG is a stochastic algorithm, it causes volatile fluctuations in the r_NLL curve. This also suggests that our stochastic online algorithm needs to introduce more motion region information to aid early warning. The difference between RTM-GS and RTM-HSVG can also be observed in their ROC curves and AUC values, which are shown in Figure 7. As shown in Figure 7, the area under the NLL ROC curve is 0.68 for RTM-HSVG and 0.56 for RTM-GS. This result again indicates that the overall accuracy of RTM-HSVG is better than that of RTM-GS. On the other hand, the area under the r_NLL ROC curve is 0.64 for RTM-HSVG and 0.97 for RTM-GS, which shows that RTM-GS learns motion regions better than RTM-HSVG. These simulated experimental results show the validity of RTM for discovering motion topics and motion regions.

Real Video Experiment
According to the comparison of the latent motion topics and motion regions discovered on the QMUL dataset (Figures 8-11), RTM-GS obtained clearer atomic motion patterns, which form the motion topics. Nevertheless, RTM-HSVG obtained a more focused clustering in both motion topics and regions, and its direction representation is richer (as observed from the mixture of four directions). This is because the Markov chain state space of a large-scale dataset is larger, so a longer burn-in time is needed for the Markov chain to converge. Even though RTM-HSVG has a shorter burn-in time, it obtained a better performance than RTM-GS.
To test the impact of burn-in time on the performance of our model, the burn-in time of RTM-GS is increased fourfold, and the number of iterations is set to 8000. The resulting motion topics and regions are shown in Figures 12 and 13. Comparing Figure 8 with Figure 12 and Figure 9 with Figure 13, we find that a longer burn-in time yields a more focused clustering in both motion topics and regions. Nevertheless, it is difficult to decide when the Markov chain has converged in Gibbs sampling, and a longer burn-in time comes at the expense of time efficiency. Therefore, RTM-HSVG is more efficient than RTM-GS as an online algorithm. Meanwhile, we find that a larger number of topics produces clearer motion topics but also more repeated latent topics. This illustrates that the numbers of topics and regions are important factors in the performance of RTM.
The ROC curves of RTM-HSVG and RTM-GS are compared in Figure 14. As shown in Figure 14, the area under the ROC curve is 0.59 for RTM-HSVG and 0.577 for RTM-GS. These results again indicate that RTM-HSVG can improve the accuracy of abnormal event detection in comparison with RTM-GS.
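For readers who want to reproduce this kind of ROC and AUC comparison from per-clip abnormality scores, the following Python sketch shows one standard way to compute them. It is not the authors' evaluation code: the function names are hypothetical, the inputs are assumed to be one score per clip (for example an NLL value) and a ground-truth abnormal/normal label, and tied scores are not treated specially.

```python
import numpy as np


def roc_points(scores, labels):
    """Return (fpr, tpr) arrays obtained by sweeping a threshold over scores.

    scores: higher means more abnormal. labels: True for abnormal clips.
    Assumes at least one abnormal and one normal clip, and no tied scores.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores, kind="stable")   # most abnormal first
    sorted_labels = labels[order]
    true_pos = np.cumsum(sorted_labels)          # abnormal clips flagged so far
    false_pos = np.cumsum(~sorted_labels)        # normal clips flagged so far
    tpr = np.concatenate(([0.0], true_pos / labels.sum()))
    fpr = np.concatenate(([0.0], false_pos / (~labels).sum()))
    return fpr, tpr


def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Toy example with four clips, two of them abnormal.
fpr, tpr = roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
toy_auc = area_under_curve(fpr, tpr)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading values such as 0.59 and 0.577.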
Finally, several abnormal events discovered by RTM-HSVG are shown in Figure 15, where the regions of abnormal motion are labeled in dark red. Figure 15a shows an abnormal event in the 33rd clip (frames 37,410 to 37,769); its motion region is number 6, and NLL = 724,951.87. A fire engine drives into the crossing from the right and interrupts the vertical traffic.
Figure 15b shows an abnormal event in the 25th clip (frames 34,538 to 34,897); its motion region is number 9, and NLL = 753,774.17. A car turns right illegally.
Figure 15c shows an abnormal event in the 15th clip (frames 30,948 to 31,307); its motion region is number 13, and NLL = 792,248.81. A car makes an illegal U-turn.
Figure 15d shows an abnormal event in the 14th clip (frames 30,589 to 30,948); its motion region is number 15, and NLL = 819,426.78. A fire engine drives into the crossing from the left and interrupts the vertical traffic.
From the above experimental results, we can see that RTM is able to discover motion topics and motion regions efficiently. In particular, the HSVG inference algorithm designed for RTM is better than Gibbs sampling in accuracy and time efficiency. Therefore, we anticipate that RTM-HSVG is a promising method for real-time video mining.

Conclusions
To address the inability of traditional topic models to process video in real time and to model motion region information, we proposed an RTM and designed a hybrid stochastic variational Gibbs sampling algorithm for it. In RTM, observation data not only has the motion topic label but also has

Figure 1. Diagram of video topic modeling.

Figure 3. Directions and regions of the simulation video.

Figure 4 shows the training set and test set generated by the above steps. The color of each arrow represents the direction, and the brightness represents the probability in the motion topic.
Algorithms 2018, 11, x FOR PEER REVIEW 12 of 20


Figure 5. Motion topics and regions discovered by RTM-GS and RTM-HSVG in the simulation dataset. (a) The seven random simulated abnormal motions. (b) The four motion topics and five regions discovered by RTM-GS. (c) The four motion topics and three regions discovered by RTM-HSVG.

Figure 6. The NLL and r_NLL comparisons of our model and actual values. (a) NLL (left and blue) and r_NLL (right and blue) comparisons of RTM-HSVG. (b) NLL (left and blue) and r_NLL (right and blue) comparisons of RTM-GS.

4.3.2. Real Video Experiment. Likewise, for the QMUL dataset, the motion topics and regions discovered by RTM-GS and RTM-HSVG are shown in Figures 8-11, respectively.


Figure 8. Twenty latent motion regions discovered by RTM-GS.

Figure 9. Eighteen latent motion topics discovered by RTM-GS.

Figure 12. Twenty latent motion regions discovered by RTM-GS with a longer burn-in time.

Figure 13. Eighteen latent motion topics discovered by RTM-GS with a longer burn-in time.

Figure 14. Comparative results of ROC on real video.

Table 1. Software and hardware platform.