A Distance-Dependent Chinese Restaurant Process Based Method for Event Detection on Social Media

In this paper, we propose a method for event detection on social media, which aims at clustering media items into groups of events based on their textual information as well as available metadata. Our approach is based on the distance-dependent Chinese Restaurant Process (ddCRP), a clustering approach closely related to Dirichlet process mixture models. Furthermore, we examine the effectiveness of a series of pre-processing steps in improving the detection performance. We experimentally evaluated our method using the Social Event Detection (SED) dataset of the MediaEval 2013 benchmarking workshop, which pertains to the discovery of social events and their grouping in event-specific clusters. The obtained results indicate that the proposed method ranks among the top-performing approaches on this dataset.


Introduction
In recent years, there has been great research interest in techniques for event detection on data retrieved from social networks, focusing mostly on the Twitter platform. An event could be defined as an arbitrary classification of a space-time region and might include actively participating agents, passive factors, products, and a location in space/time [1]. Atefeh et al. [2] conducted a study based on three major categories: (i) the type of event being detected, distinguishing events into specified and unspecified; (ii) the detection task (new event detection and retrospective event detection); and (iii) the event detection method (supervised, unsupervised, and hybrid). Another classification of the different detection methods is presented in the survey [3], which focuses on the common traits the methods share (i.e., using probabilistic topic modeling, identifying interesting properties in a tweet's keywords/terms, and using incremental clustering).
This paper is organized as follows: Section 2 presents a description of work related to event detection from social media data and a review of Bayesian nonparametric models and ddCRP. A detailed description of the clustering algorithm is given in Section 3, while, in Section 4, we describe the experiments and evaluation of the event detection. Finally, Section 5 is devoted to the discussion and conclusions of our work.

Related Work
Benson et al. [4] developed a graphical model for record extraction from social media streams, using a MIRA-based binary classifier to predict whether a message mentions an event. The output of their model is a set of canonical records, the values of which are consistent with aligned messages. When testing their method, they used a fixed number of records (events) based on the training data. Doulamis et al. [5] addressed the dynamic nature of tweet messages by constructing fuzzy time signals and modeling the clustering task as a multi-assignment graph partitioning problem. Their method exploited pairwise similarities (a Riemannian distance metric between word signatures) to compensate for tweet submission and correlated words based on fuzzy time feature series, focusing mostly on unstructured datasets.
Petrovic et al. [6] addressed the problem of detecting new events from a stream of Twitter posts using an algorithm based on locality-sensitive hashing (LSH) [7], performing constant time and space estimations of the closest document. This way, the computational cost did not increase as the number of clusters increased. They recognized that the high degree of lexical variation in documents makes it very difficult to detect stories that talk about the same event using different words and, in their later work [8], they combined paraphrases with locality-sensitive hashing. While the First Story Detection (FSD) performance improved, the gain is much smaller when this technique is applied to newswire data, likely due to lexical mismatch between the knowledge bases and social media. In this direction, Moral et al. [9] used word embeddings to enhance the representation of social media posts and increase the effectiveness of LSH-based FSD. More specifically, they exploited Word2Vec [10] and expanded tweets with semantically related paraphrases identified via automatically mined word embeddings.
The distance-dependent Chinese Restaurant Process (ddCRP) was introduced in [11] as a flexible class of distributions over partitions for data clustering. It is a Bayesian clustering method for non-exchangeable sequences of observations, applicable when the number of clusters is unknown. ddCRP clusters data in a biased way: each data point is more likely to be clustered with other data that are near it in an external sense. Recent work has extended the original ddCRP model for use in different applications. Ghosh et al. [12] examined it in a spatial setting with the goal of natural image segmentation, while Socher et al. [13] combined ddCRP with spectral dimensionality reduction. These approaches, however, compute the distances between data points based only on the original data. Furthermore, they are not directly applicable when additional information can be used for the similarity computation [14].
Papaoikonomou et al. [15] proposed a similarity-based Chinese Restaurant Process to address the problem of event detection from social media data and evaluated their method on the SED MediaEval dataset, which is also used for this work's evaluation and is presented in detail in Section 4.1. When the number of events in the target set is not known in advance, this non-parametric algorithm, namely Dirichlet Process clustering, allows the dynamic creation of clusters based on the data. A ddCRP variation was proposed by Li et al. [16], using side information in a Bayesian nonparametric model for data clustering. They evaluated their method using normalized mutual information and F1-measure, taking advantage of the strong correlation of side information (such as citations, authors, and keywords for a document dataset) with the main data features. Object proposal generation via sampling from a ddCRP posterior on image segmentations is proposed in [17].

The Proposed Method
In this section, we describe the proposed event detection approach, which aims at analyzing media items to categorize them into meaningful collections, focusing both on the textual data and the metadata of the media posts and targeting mostly Twitter. However, the approach is applicable to any dataset which conforms to Twitter's data format and contains the following:

• The username of the author (metadata)

• The timestamp of the creation date (metadata)

• The actual textual content (of short length)

The username information is important given that we expect certain authors, who express themselves through a length-constrained message, to comment on a small number of topics each time they generate content (often just one). The temporal dimension is also crucial, since media items that refer to the same real event tend to lie close in time. Finally, the actual grouping of the media posts is also controlled by groups of words that co-occur frequently and by textual patterns.

Message Similarity and Clustering
A central concept in the development of a clustering algorithm is similarity among data points: similar objects should be grouped together, whereas dissimilar ones should be assigned to different collections. In the domain of text clustering, similarity is usually measured by the degree to which words tend to co-occur, i.e., the higher the rate of common words, the higher the similarity. On top of that perspective, and in our quest for additional signals, we choose to also mine the metadata of the media posts produced by a social network user, thus resulting in a similarity function that aggregates information from different parts of the social network messages. In particular, our similarity function, operating on two social media messages $m_u$ and $m_v$, is defined as:

$$\mathrm{sim}(m_u, m_v) = a_1\, I_{auth}(m_u, m_v) + a_2\, f_t(t_u, t_v) + a_3\, f_w(m_u, m_v)$$

where $a_i$ are coefficients to be learned during the training process. $I_{auth}$ is an indicator function that is equal to 1 if the two messages are published by the same author and 0 otherwise. $f_t(t_u, t_v)$ is a function that estimates the temporal proximity of the messages. It is a monotonically decreasing function of the gap between the timestamps of their creation dates ($t_u$ and $t_v$) and returns a similarity value based on their temporal spread. We choose an exponential decay function of the form:

$$f_t(t_u, t_v) = \exp\left(-\frac{|t_u - t_v|}{q}\right)$$

for some window parameter q, which takes its maximum value (1) when $t_u \approx t_v$ and decreases towards zero as the absolute difference $|t_u - t_v|$ increases.
In general, the decay function mediates how temporal distances between customers affect the resulting distribution over partitions. We assume that the decay function $f_t$ is non-increasing, takes non-negative finite values, and satisfies $f(\infty) = 0$. Following the decay functions proposed by Blei et al. [11], we consider several types of decay for our temporal distance model, all of which satisfy these non-restrictive assumptions. In terms of temporal distance, the window decay $f(d) = \mathbb{1}[d < a]$ only considers customers that are at most distance a from the current customer. The exponential decay $f(d) = \exp(-d/a)$ decays the probability of linking to an earlier customer exponentially with the distance to the current customer. The logistic decay $f(d) = \exp(-d + a)/(1 + \exp(-d + a))$ is a smooth version of the window decay. After examining their impact on the overall performance of our proposed approach, we consider the exponential decay function to be the most suitable for our method.
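The three decay functions above can be compared side by side with a minimal Python sketch. The function names and the sample values of d and a are illustrative assumptions, not part of the original implementation; d and a are assumed to be expressed in the same time unit.

```python
import math


def window_decay(d, a):
    """Window decay: 1 if the distance d is strictly below a, else 0."""
    return 1.0 if d < a else 0.0


def exponential_decay(d, a):
    """Exponential decay: exp(-d / a), equal to 1 at d = 0."""
    return math.exp(-d / a)


def logistic_decay(d, a):
    """Logistic decay: exp(-d + a) / (1 + exp(-d + a)).

    Written in the algebraically equivalent form 1 / (1 + exp(d - a)).
    Note: math.exp overflows for very large (d - a); a production version
    would need to guard against that.
    """
    return 1.0 / (1.0 + math.exp(d - a))


if __name__ == "__main__":
    a = 5.0  # hypothetical decay parameter
    for d in (0.0, 1.0, 5.0, 50.0):
        print(d, window_decay(d, a), exponential_decay(d, a), logistic_decay(d, a))
```

All three functions are non-increasing in d and approach zero as d grows, which is the property the text requires of a decay function.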
Finally, $f_w(m_u, m_v)$ is a function that directly mines the textual content of the two messages and outputs a higher similarity value when they share a large number of terms. To construct an effective similarity function targeting textual content from social networks, we need to take into account the special characteristics of user-generated content in social media. One such characteristic is the hashtag, an identifier inserted by the author of a media post to indicate the topic(s) on which she expresses her opinion. Typical hashtag terms are written in the form #<topic-identifier>, e.g., #Oscars. We consider hashtags highly significant for the task of event detection, and thus we highlight their role compared to the other terms in the media post by assigning them a larger weight.
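To make the three ingredients concrete, the following Python sketch combines an author indicator, an exponential temporal decay, and a hashtag-weighted term overlap into a single score. It is only an illustration under stated assumptions: the weighted-sum combination, the weighted Jaccard overlap used for the textual part, the coefficient values, the hashtag weight, and the window q are all hypothetical choices, since the text does not specify them (the coefficients are learned during training).

```python
import math
import re

HASHTAG_WEIGHT = 2.0  # assumed: hashtags count twice as much as plain terms


def tokens(text):
    """Lower-case the text and return its set of terms, keeping a leading '#'."""
    return set(re.findall(r"#?\w+", text.lower()))


def text_similarity(text_u, text_v, hashtag_weight=HASHTAG_WEIGHT):
    """Weighted Jaccard overlap of terms (an assumed stand-in for f_w)."""
    terms_u, terms_v = tokens(text_u), tokens(text_v)

    def weight(term):
        return hashtag_weight if term.startswith("#") else 1.0

    union = sum(weight(t) for t in terms_u | terms_v)
    if union == 0:
        return 0.0
    return sum(weight(t) for t in terms_u & terms_v) / union


def similarity(m_u, m_v, a=(0.2, 0.3, 0.5), q=3600.0):
    """Weighted sum of author, time and text signals.

    m_u and m_v are dicts with the hypothetical keys 'author', 'time'
    (seconds) and 'text'; a and q are placeholder values, not learned ones.
    """
    same_author = 1.0 if m_u["author"] == m_v["author"] else 0.0
    f_t = math.exp(-abs(m_u["time"] - m_v["time"]) / q)
    f_w = text_similarity(m_u["text"], m_v["text"])
    return a[0] * same_author + a[1] * f_t + a[2] * f_w


if __name__ == "__main__":
    m1 = {"author": "alice", "time": 0.0, "text": "Red carpet at the #Oscars"}
    m2 = {"author": "bob", "time": 1800.0, "text": "Watching the #Oscars tonight"}
    print(similarity(m1, m2))
```

With these placeholder coefficients summing to 1, two identical messages score 1 and the score shrinks as authors, timestamps, or vocabulary diverge.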

Methodology
The proposed method is closely related to Dirichlet Process (DP) mixture models, a family of flexible clustering algorithms for high-dimensional data analysis. More specifically, we use a variant of a DP mixture called the distance-dependent Chinese Restaurant Process, which was introduced by Blei et al. [11]. In general, a Dirichlet process [18] expresses a distribution over probability measures and underlies infinite mixture models. A DP mixture model can be viewed as a Chinese Restaurant Process (CRP), which is fancifully described by a sequence of customers joining a Chinese restaurant with an infinite number of tables. Every time a new customer enters the restaurant, he may choose to join an occupied table with probability proportional to the number of persons already sitting there, or to choose a new one with probability proportional to a predefined value, called the concentration parameter. The analogy in a CRP mixture is apparent: customers represent the data points, which belong to the same cluster if they "sit" at the same table.
In accordance with the notation of [11], we define:

• $z_i$ is the table assignment of the ith customer (the id of the table that customer i chooses).

• K is the total number of occupied tables.

The conditional probability of the table assignment for the ith customer, given the assignments of the customers before him, is computed through:

$$p(z_i = k \mid z_{1:(i-1)}, \alpha) \propto \begin{cases} n_k & \text{if } k \le K \\ \alpha & \text{if } k = K + 1 \end{cases}$$

where $n_k$ is the number of customers already sitting at table k and $\alpha$ is the concentration parameter. An interesting property of the CRP mixture model is that the number of finally occupied tables is random, and thus the number of clusters is determined by the data. This is a desirable feature given that it is usually difficult to estimate the right number of groups in real-world data.
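The seating rule of the traditional CRP can be simulated in a few lines. The sketch below is a generic illustration of the prior only (no data likelihood); the function name and the parameter values are assumptions made for the example.

```python
import random


def sample_crp(n_customers, alpha, seed=0):
    """Seat customers one by one according to the CRP prior.

    An occupied table k is chosen with weight n_k (its current size) and a
    new table with weight alpha (the concentration parameter, alpha > 0).
    Returns the table index of each customer and the final table sizes.
    """
    rng = random.Random(seed)
    assignments = []
    counts = []  # counts[k] = number of customers at table k
    for _ in range(n_customers):
        weights = counts + [alpha]  # last entry stands for a new table
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):
            counts.append(1)  # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts


if __name__ == "__main__":
    assignments, counts = sample_crp(n_customers=20, alpha=1.5)
    print(assignments)
    print(counts)
```

The number of tables is not fixed in advance: larger values of alpha tend to open more tables, which is how the number of clusters ends up being driven by the data and the concentration parameter.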
In the case of the traditional CRP mixture analogy, customers are exchangeable, i.e., the probability of a particular table configuration is the same even if the order of the customers is permuted. This property might seem reasonable for certain applications, but it is not appropriate when the order of data points matters. One such example is the social event detection task that we consider here, with its dependence on the temporal dimension, since we expect that media items that refer to an event will tend to group with other items that lie close in time. A variant of the CRP mixture model that enforces such a "non-exchangeability" constraint is the distance-dependent Chinese Restaurant Process introduced by Blei et al. in [11]. The difference in this approach is that the distances (e.g., based on time) among the customers are the key factors that lead to the seating assignment. In other words, while the traditional CRP connects customers to tables, the distance-dependent variant connects customers to other customers, and the allocation of customers to tables is a by-product of this process.
Let us define the following:

• $c_i$ is the assignment of the ith customer (the identifier of the customer with whom the ith customer chooses to sit).

• $d_{ij}$ is the "distance" between customers i and j, and D is the matrix of distances for all customer pairs.

• f is a decay function.

The conditional probability for the ith customer assignment is now:

$$p(c_i = j \mid D, \alpha) \propto \begin{cases} f(d_{ij}) & \text{if } j \neq i \\ \alpha & \text{if } j = i \end{cases} \qquad (3)$$

Figure 1 presents a sample application of the distance-dependent process, which is used to cluster social media objects. Given Equation (3), each "customer" will choose to either "sit close" to another customer (directed link) or "sit alone" (self-loop). Customers 2, 3 and 5 belong to the first category, whereas Customers 1, 4, and 6 belong to the second.
The final allocation of customers to tables depends on the pair-wise relationships among all the customers. Two customers that are reachable through a sequence of intermediate customer assignments will finally be assigned to the same table. In this way, Customers 1, 2 and 3 will sit at the first table (cluster), Customer 4 will form a cluster on her own, and Customers 5 and 6 will be assigned to the third cluster.
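Deriving tables from customer links amounts to finding the connected components of the link graph. The following Python sketch does this with a small union-find structure and reproduces the six-customer example. The text only states that Customer 2 links to Customer 3; the links 3 → 1 and 5 → 6 are assumptions chosen to be consistent with the described tables.

```python
def tables_from_links(links):
    """Group customers into tables given ddCRP customer links.

    links maps each customer to the customer it chose to sit with
    (a self-link means the customer sits alone). Customers connected
    through any chain of links end up at the same table.
    """
    parent = {c: c for c in links}

    def find(c):
        # Follow parent pointers to the representative, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for customer, target in links.items():
        parent[find(customer)] = find(target)

    groups = {}
    for customer in links:
        groups.setdefault(find(customer), []).append(customer)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    # Customers 1, 4 and 6 link to themselves; 2, 3 and 5 link to others.
    links = {1: 1, 2: 3, 3: 1, 4: 4, 5: 6, 6: 6}
    print(tables_from_links(links))  # [[1, 2, 3], [4], [5, 6]]
```

Because tables are a by-product of the links, changing a single link can split one table or merge two, which is what the Gibbs sampler exploits.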
Our proposed method applies the distance-dependent CRP algorithm using the similarity function of the previous sub-section, $\mathrm{sim}(m_u, m_v)$, as an inverted distance function. To train the model, we resort to Markov Chain Monte Carlo (MCMC) sampling to approximate the posterior distribution given the data. More specifically, given the set of hyper-parameters $\eta = \{D_{sim}, \alpha, G_0\}$, where $D_{sim}$ is the distance matrix of all data points computed through the similarity function, $\alpha$ the concentration parameter and $G_0$ the base measure, we perform Gibbs sampling by iteratively drawing from the conditional distribution of each latent variable ($c_i$) given the other latent variables ($c_{-i}$) and observations (x):

$$p(c_i^{(new)} \mid c_{-i}, x, \eta) \propto p(c_i^{(new)} \mid D_{sim}, \alpha)\; p(x \mid z(c_i^{(new)} \cup c_{-i}), G_0)$$

where z(c) are the table assignments that follow from the customer assignments and $z(c_i^{(new)} \cup c_{-i})$ expresses the new candidate partition.
The first term on the right side of this equation can be computed from the distance-dependent prior of Equation (3). The second term is the likelihood of the observations under the new candidate partition. To compute this term, we consider how removing a customer link and replacing it with another affects the table assignment, as depicted in Figure 2. Having the partition $z(c_{-i})$, in which a table may have been split, and the new candidate partition, as described above, we consider three cases based on the differences between the two partitions:

1. The new customer assignment $c_i$ links to itself (self-loop), which does not change the likelihood since no tables are joined together.

2. The new customer assignment $c_i$ links to another customer who is already at its table under $z(c_{-i})$. There is no change in the partition, since no tables are joined together.

3. The new customer assignment $c_i$ links to another customer and tables k and l are joined.

The Gibbs sampler thus needs to compute only the terms that correspond to changes in the partition.
How the algorithm scales as the input grows is an important issue to take into consideration. Even for medium-sized datasets, it is impractical to compute the distance for all pairs of tweets. To tackle this problem, we introduce a concept that reduces the computational resources required by our approach. For every table, we summarize the event as a list of the most important information, such as representative words, hashtags and statistics about temporal information. We use this "event summarization" to compare each customer only to the summary of each table, reducing the number of comparisons per customer from n (total number of customers) to K (total number of tables).
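A minimal Python sketch of this idea follows. It is only one possible reading of "event summarization": the class name, the choice of the most frequent terms as the representative terms, the mean timestamp as the temporal statistic, and the overlap-based matching are all assumptions made for illustration.

```python
from collections import Counter


class EventSummary:
    """Running summary of one table (event): term counts and time statistics."""

    def __init__(self):
        self.term_counts = Counter()
        self.size = 0
        self.time_sum = 0.0

    def add(self, terms, timestamp):
        """Fold a new media item (its terms and timestamp) into the summary."""
        self.term_counts.update(terms)
        self.size += 1
        self.time_sum += timestamp

    def top_terms(self, k=10):
        """The k most frequent terms (words and hashtags) of the event."""
        return {term for term, _ in self.term_counts.most_common(k)}

    @property
    def mean_time(self):
        """Average timestamp of the items in the event (requires size > 0)."""
        return self.time_sum / self.size


def best_table(terms, summaries, k=10):
    """Index of the summary sharing the most top terms with the item, or None.

    Only len(summaries) comparisons are made, one per table, instead of
    one per previously seen item.
    """
    best, best_overlap = None, 0
    for index, summary in enumerate(summaries):
        overlap = len(set(terms) & summary.top_terms(k))
        if overlap > best_overlap:
            best, best_overlap = index, overlap
    return best


if __name__ == "__main__":
    oscars, marathon = EventSummary(), EventSummary()
    oscars.add(["#oscars", "red", "carpet"], 100.0)
    marathon.add(["marathon", "berlin"], 5000.0)
    print(best_table(["berlin", "marathon", "finish"], [oscars, marathon]))
```

In a full system the comparison would presumably reuse the similarity function rather than a raw term overlap; the sketch only shows how per-table summaries replace per-item comparisons.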

Dataset
Our approach is applicable to any dataset which conforms to Twitter's data format. For the evaluation, we used the dataset of the Social Event Detection (SED) task of MediaEval 2013 [19,20]. That task requires the participants to discover social events and organize the related items in event-specific clusters. It is a supervised clustering task [21,22], where a set of training events is provided. Participants propose algorithms that group media items and produce a complete clustering of the dataset according to events, while also facing the challenge of discovering the actual number of target events, since it is not given.
The dataset consists of about 437,370 pictures gathered using the Flickr API [23]. They were uploaded between 2006 and 2012 and were assigned to about 21,169 events, annotated by people, referring to sport events, protests, marches, debates, expositions, festivals, and concerts. The pictures are separated into two parts: the training set (70% of the dataset) and the testing set (30% of the dataset). For our evaluation, we used the textual metadata of the pictures, which included information such as the title, description, time, tags, etc., focusing on the comparison of the available metadata of the dataset's multimedia items to assign each item to an event. All these metadata are included in the publicly available files of the dataset in XML and CSV format, which are published in [20,24] and follow the schema described in Table 1. As it is a real-world dataset, there are some features, such as timestamps and uploader information, that are available for every picture, but there are also features (e.g., geographic information) that are available for only a subset of the images.

Implementation Details
The Java programming language was used for the implementation, application and evaluation of our clustering-based event detection service, and a flow demonstrating the different steps of the method is presented in Figure 3. Various solutions were examined to improve the performance of our method, i.e., integration with relational databases [25] to avoid repetitive computations and memory overflow, and multithreading using dedicated libraries to reduce the overall computational time. The latter mostly concerns the part of the algorithm that handles the computation of distances among the approximately 131,200 different media items of the test set of this dataset, which has a complexity of $O(n^2)$. Dedicated Java libraries were used for processing the dataset and the available files in CSV and XML format, so that we can extract the information of interest [26,27]. As mentioned before, this approach was evaluated using the annotated SED dataset, but it could be applied to any dataset that conforms to Twitter's data format, using for example the Twitter API [28] and related libraries [29] to retrieve tweets.

Text Pre-Processing
The most common pre-processing techniques applied by event detection methods on the Twitter data stream are the following: POS tagging, NER, resolving temporal expressions, slang-word conversion, tweet filtering based on specific criteria (i.e., discarding retweets and/or non-English language tweets) and removing stop words, URLs and username mentions from tweets [3]. In this direction, we have extensively investigated how different pre-processing techniques affect the performance of our algorithm. Some of the pre-processing steps that had a marked impact on the overall effectiveness are:

• Hashtags: A type of metadata used by social media users. A hashtag typically consists of the character "#" followed by a string and is usually indicative of the topic the user refers to. We use hashtags in the similarity function among different tweets (customers in our ddCRP-based method) and also tested replacing them with the "#" character or omitting them.

• Stemming and lemmatization: Our goal was to capture the "base form" of a word, for example, house instead of houses, house's, etc.

• Stop-word removal: These are very common words such as "the", "at", etc.
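The pre-processing steps above can be sketched as a small Python pipeline. This is a rough illustration, not the authors' Java implementation: the stop-word list is a tiny placeholder, the suffix-stripping "stemmer" is deliberately crude (a real system would use a proper stemmer or lemmatizer), and the function and parameter names are hypothetical.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "at", "a", "an", "of", "in", "on", "and", "is", "to"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def naive_stem(word):
    """Crude base-form approximation: strip a trailing "'s" or "s"."""
    for suffix in ("'s", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, url_mode="replace", keep_hashtags=True):
    """Return the list of normalized terms of a post.

    url_mode: "replace" swaps each link for the word "url",
              "remove" drops links completely.
    keep_hashtags: keep hashtags as-is (with "#") or omit them.
    """
    text = text.lower()
    if url_mode == "replace":
        text = URL_RE.sub("url", text)
    elif url_mode == "remove":
        text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)  # drop username mentions

    terms = []
    for token in re.findall(r"#?[a-z0-9']+", text):
        if token.startswith("#"):
            if keep_hashtags:
                terms.append(token)
            continue
        if token in STOP_WORDS:
            continue
        terms.append(naive_stem(token))
    return terms


if __name__ == "__main__":
    print(preprocess("The houses at #Oscars http://t.co/x"))
```

Exposing each step as a switch makes it easy to measure its individual effect on clustering quality, which is the kind of comparison the text describes.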

Evaluation Metrics
For the evaluation, we compared our clustering results to the ground-truth clustering assignments that have been created by human annotators, using the F1-score and the Normalized Mutual Information (NMI) as metrics [30].

Experimental Setup
To evaluate the performance of our service, we developed and tested different approaches and variations of ddCRP, some of which are presented below:

1. Chinese Restaurant Process (CRP): We used the training set to determine the optimal value of the concentration parameter $\alpha$.

2. Distance-dependent Chinese Restaurant Process based only on time (ddCRP only time): In this approach, the similarity function used only the time information among all the available metadata of each multimedia item. This way, we could see the difference in the clustering results when more of the available metadata are used in the following algorithms.

3. Distance-dependent Chinese Restaurant Process sequential (ddCRP sequential): We compared each customer only to customers previously assigned to tables instead of comparing to all customers in the dataset. This approach requires about half the time for training and evaluation.

4. Distance-dependent Chinese Restaurant Process (ddCRP): We used the similarity function described in the previous sections. The training set was used to reach the optimal values for the parameters of our algorithm, such as the concentration parameter and the coefficients of the similarity function.

Results
We compared the results of our method with the rest of the submissions of SED 2013. In total, there are 11 submissions following different clustering approaches. Table 2 reports the performance of all the proposed algorithms in terms of F1-score and NMI. In Table 3, we observe that our method is among the top proposed approaches for event detection, outperforming most of the submissions. The results of our evaluation are presented in Table 3. We can see that our method ddCRP, which is highlighted, outperforms the other three approaches for both measures.

Conclusions and Future Work
In this paper, we present a method for social media event detection. We developed a variation of the distance-dependent Chinese Restaurant Process that groups media items into event clusters based on their textual data (i.e., actual text) and the available metadata (e.g., title, description, tags, and location). Additionally, we showed how pre-processing techniques can be used to enhance the performance of our event detection. For future work, we plan to apply our event detection service to more datasets and use other methods to capture the similarity of textual content. We also intend to examine the use of word embeddings, such as word2vec [10,40], to mitigate the problem of lexical variation among tweets that are related to the same event but use different, synonymous words, and thus further improve the performance of our proposed method.

Figure 1. Sample application of the distance-dependent Chinese Restaurant Process on a media stream of six media items. The pair-wise distances between the media objects determine the table assignment (clusters). Some customers choose to sit close to another customer, while some choose to sit alone. For example, customer 2 chooses to sit close to customer 3 (direct link depicted as blue arrow), while customer 1 chooses to sit alone.

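For our distance-dependent Chinese Restaurant Process approach, the Gibbs sampler is:

$$p(c_i^{(new)} \mid c_{-i}, x, \eta) \propto \begin{cases} \alpha & \text{if } c_i^{(new)} = i \\ f(d_{ij}) & \text{if } c_i^{(new)} = j \text{ does not join two tables} \\ f(d_{ij})\, \dfrac{p(x_{z^k(c_{-i}) \cup z^l(c_{-i})} \mid G_0)}{p(x_{z^k(c_{-i})} \mid G_0)\; p(x_{z^l(c_{-i})} \mid G_0)} & \text{if } c_i^{(new)} = j \text{ joins tables } k \text{ and } l \end{cases}$$

where $x_{z^k(c_{-i})}$ is the set of customers that are assigned to table k excluding the current customer i.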

Figure 2. An example of the Gibbs sampler, where blue arrows represent the links after running the first part of the algorithm and yellow ones the links after resampling. A table can be split when we remove an existing link from a customer to another. After resampling: (i) the customer links to itself; (ii) the customer links to another but two tables are not merged; or (iii) the link obtained for Customer 3 merges two tables.

Figure 3. Flow and steps of the proposed approach for event detection.


Table 1. Detailed description of the dataset of the Social Event Detection (SED) task, and more particularly of the metadata of each media item to be assigned to an event.
• RT (retweet): A re-post of an original tweet is called a retweet. It should usually be clustered together with the original one. Additionally, metadata fields indicate the number of times a specific post was retweeted.

• URL: We performed tests by keeping a link directing to an external source, by replacing the whole link with the word "url", and by removing it completely.

• F1-score, calculated from Precision and Recall with the formula

$$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$

where Precision and Recall are defined as follows:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

A true positive (TP) decision assigns two similar items to the same cluster, while a true negative (TN) decision assigns two dissimilar items to different clusters. There are two types of errors we can commit. A false positive (FP) decision assigns two dissimilar items to the same cluster. A false negative (FN) decision assigns two similar items to different clusters.
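The pair-based definitions above translate directly into code. The following Python sketch counts TP, FP and FN decisions over all item pairs and derives Precision, Recall and F1; the function name and the toy labels are illustrative, and the official benchmark scorer may differ in details.

```python
from itertools import combinations


def pairwise_f1(predicted, truth):
    """Pair-counting Precision, Recall and F1 of a clustering.

    predicted[i] and truth[i] are the cluster labels of item i in the
    predicted clustering and in the ground truth, respectively.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1  # similar items placed in the same cluster
        elif same_pred:
            fp += 1  # dissimilar items placed in the same cluster
        elif same_true:
            fn += 1  # similar items placed in different clusters
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    predicted = [0, 0, 0, 1]
    print(pairwise_f1(predicted, truth))
```

In this toy example there is one true positive pair, two false positive pairs and one false negative pair, giving a Precision of 1/3, a Recall of 1/2 and an F1-score of 0.4. Enumerating all pairs is quadratic in the number of items, so for large datasets the counts would normally be derived from a contingency table instead.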

Table 2. Performance of our approach on the SED 2013 dataset (for both metrics, the best value is 1 and the worst value is 0).