Entity-Based Integration Framework on Social Unrest Event Detection in Social Media

Abstract: Social unrest events are a concern in many countries, and mass unrest has occurred in many of them in the past few years. Meanwhile, social media has become a distinctive channel for spreading event information, so an effective method is needed to analyze unrest events through social media platforms. Existing methods mainly target well-labeled data and take relatively little account of how events develop. This paper proposes an entity-based integration event detection framework for event extraction and analysis in social media. The framework integrates two modules. The first module uses named entity recognition based on the bidirectional encoder representation from transformers (BERT) to extract the event-related entities and topics of social unrest events from social media communication. The second module applies the K-means clustering method and the dynamic topic model (DTM) for dynamic analysis of these entities and topics. As an experimental scenario, the effectiveness of the framework is demonstrated using data from the Lihkg discussion forum and Twitter from 1 August 2019 to 31 August 2020. In addition, a comparative experiment reveals the differences between Chinese users on Lihkg and on Twitter. The experimental results indicate characteristics of social unrest events that can be observed in social media.


Introduction
Social unrest events refer to behaviors and actions that violate informal social norms or formal social rules [1]. With the rapid development of society, some people express their dissatisfaction with government policy through protests and similar events [2]. These events have become potent tools for the public to influence government decision making and execution. Moreover, improper use of force and inappropriate handling measures may turn peaceful unrest events violent and escalate them into criminal activities [3]. For example, a series of social unrest events occurred in Hong Kong from 2019 to 2020. It is reported that more than 85 Mass Transit Railway stations were damaged, and the Hong Kong government spent more than HK$65 million (US$8.5 million) to repair the damaged public infrastructure [4]. Meanwhile, as a popular type of social media platform, the online discussion forum is convenient to use and can rapidly disseminate information [5,6]. It has become an important and rapid channel for gathering and disseminating event information, and the frequent outbreak of unrest events has drawn close attention to the gathering of information online. It is believed that active analysis of the intangible resources of the network can contribute to the understanding of human, structural, and relational capital among organizations [7]. Thus, it is important to detect unrest events accurately and efficiently on social media, understand the knowledge network of social unrest events, and further perform social network analysis. The results of event detection and analysis can help law enforcement grasp the status of unrest events in real time, reasonably respond to trends in public opinion, and maintain the stability and development of society [8].
As a crucial step in event extraction, event detection is one of the essential tasks in information extraction [9]. Usually, posts about unrest events on a discussion forum have specific characteristics that describe the event. From a linguistic point of view, these characteristics are the keywords that appear in the text. However, keyword matching alone yields only a limited analysis of the events. In previous research on event detection from textual data, the models are built on well-structured datasets expressly provided for event detection, such as ACE 2005 [10], MAVEN [11], and the Chinese Emergency Corpus [12]. These datasets are well-labeled and cover different event subtypes, such as earthquakes, arson, traffic accidents, and terrorist attacks. It is difficult to find a dataset that contains enough data related to unrest events from social media. The challenges lie in identifying the unlabeled posts related to unrest events on social media and tracing the development and evolution of social unrest events.
To tackle these challenges, this paper constructs an integration event detection framework for social media platforms. The problem of insufficient data is addressed by using the named entity recognition (NER) method to extract information from semi-structured data, and the accuracy of event detection is improved by utilizing entities related to unrest events. We propose an entity-based framework that combines NER with the concept of triggers from classic event detection methods. There are two modules in the framework. The first module, called the event extraction stage, involves NER and the bidirectional encoder representation from transformers (BERT) pre-trained language model. Named entities generally refer to objects with a specific meaning in the text. Event triggers, as one type of named entity, can be treated as clues to identify event information from social media and classify event subtypes. In this paper, actions and verbs are event triggers, such as "sit-in", "destroy", "gather", and "commit arson". The extracted entities provide a useful overview that can help law enforcement quickly and accurately grasp the status of an unrest event. The second module utilizes the K-means clustering method and the dynamic topic model (DTM) to characterize and trace social unrest events in a specific period based on the consistency of entities. Experiments and case studies are conducted on the Lihkg discussion forum and Twitter. The experimental results illustrate that the event detection framework is effective in collecting and analyzing relevant information about unrest events.
The contributions of this paper are as follows:
• In view of social network data collection and analysis, we capture open-source social network data, extract event-related content automatically from the raw dataset, and detect social unrest events;
• Using neural network algorithms, we propose an entity-based social unrest event detection model to preprocess a large amount of unlabeled and unstructured data. The results show that the model can effectively extract event-related entities from the raw text and perform dynamic topic analysis;
• Based on the proposed event detection model, we compare the posts on an online discussion forum (Lihkg forum) and on social media (Twitter) in the same period of time and perform long-term dynamic analysis. The comparison explores the differences between local and international social media platforms.
The rest of the paper is organized as follows. Section 2 overviews related work. Section 3 outlines the social unrest event detection framework. Section 4 introduces the experiment of the framework on the Lihkg forum and Twitter. Section 5 discusses and summarizes our findings. The last section is the conclusion and future work of this research.

Related Work
NER is receiving more and more attention in fields such as text classification and data analysis. Because information about social unrest events is naturally described by who, where, and when, we associate each post with multiple entities such as person, location, and time. Named entity recognition is therefore suitable for the analysis of social unrest events.
In the past few years, named entity recognition methods based on neural networks have been proposed one after another [13][14][15]. These studies mainly use the convolutional neural network [16], recurrent neural network [17], and other network structures to extract sequence information. Moreover, they also use a conditional random field (CRF) to find the optimal label sequence. In 2015, the bidirectional long short-term memory model with a CRF layer (BiLSTM-CRF) was applied to sequence labeling datasets for the first time [18]. Due to the BiLSTM component, the model can capture features of both the past and the future context. In recent years, neural networks have become more and more widely used in the field of NLP. The use of pre-trained word vectors avoids manual feature extraction and allows raw data to be processed directly. Devlin et al. [19] propose BERT, a deep bidirectional pre-trained model based on the transformer [20] that extracts the semantics of text at a deeper level and performs well on NLP tasks. Chen et al. [21] use the pre-trained language model BERT to improve the prediction performance of the NER task. Moreover, Boros et al. [22] propose a neural event detection model to identify and categorize events mentioned in text. However, the model is evaluated on the annotated ACE 2005 corpus of news articles. No consensus approach exists for NER-based event detection on raw social media data.
Electronics 2022, 11, 3416
Event detection models can assist in analyzing massive data and detecting events on social media. The traditional supervised learning method of event extraction is to label the 5W1H (What, who, where, when, why, and how) of each event, which can describe the main event from the article [23][24][25]. In [26], a framework based on the naive Bayes classifier to detect civil unrest events on Twitter is built to overcome text mining challenges. A keyword-based approach [27] is utilized to analyze civil unrest events. Keyword learning is performed by frequency and TF-IDF calculation. It also includes a clustering model to obtain an overview of public opinion about an event. However, it is often difficult in practice to obtain large quantities of high-quality labeled data, requiring much manpower and time. In addition to the 5W1H method, traditional event detection models mainly focus on news articles or event extraction datasets [28,29] because they are well-structured, and the event information is clear and obvious. For the well-structured dataset, event extraction technology is more related to ontology and knowledge graphs [30,31]. However, to deal with unrest event detection in social media, most of the data is not well-structured. An end-to-end approach [32] is developed to analyze the message stream on Twitter to distinguish between messages about real-world events and non-event messages. It uses the clustering and classification method to build the model and learn the cluster-level event features.
For topic modeling, the latent Dirichlet allocation (LDA) topic model based on probability and statistics proposed by Blei [33] has set off an upsurge in topic recognition research. The probabilistic topic model and the probabilistic graphical model are basic models of text mining. LDA is a popular generative probabilistic model and an unsupervised machine-learning technique that can be used to identify hidden subject information in a large document collection or corpus [34]. Each topic is a probability distribution over words, and the words with high probability in a topic describe the topic's meaning. Based on the probability distribution of the text, there are relevance and inheritance between keywords in different topics. However, LDA does not consider the order of the documents and topics, even though the same topic changes and develops as time goes on. Therefore, topic models based on time sequence have gradually become a research hotspot, such as the topics over time (TOT) model [35] and the dynamic topic model (DTM). DTM was originally developed by Blei et al. [36] to analyze the temporal evolution of topics in extensive document collections. Compared with LDA, DTM introduces the concept of time dynamics. The data are divided into time slices, and the topics in a later time slice evolve from those in the previous slice, which presents the change of topics well. To emphasize the topic in stories and filter the noise, Zhang et al. utilize a dynamic topic model [37], and their results show that the dynamic model can improve topic tracking performance. Yao et al. [38] use spatial factors to measure spatial impacts and improve DTM by embedding these factors to analyze raw data from Twitter and track disaster cases. Their proposed geo-topics-based system can provide specific information and trends about local events, such as place recommendations, traffic control, and crime reports.
However, these event detection approaches require a keyword list associated with each event, which makes it difficult to predict events with new keywords and to keep tracing social unrest events. Therefore, the integration framework of event detection is designed in this paper. The features of entities that change over time can be treated as the features of events. Clustering and tracing events based on these features achieves the purpose of event detection.

Integration Framework on Event Detection
This section describes the entity-based integration framework to identify social unrest events in social media. The proposed framework consists of two modules, as depicted in the workflow of Figure 1. First, the entity extraction module uses BERT-CRF to identify named entities and analyze some popular entities related to unrest events from the posts. In the second module, the K-means algorithm is used to cluster unrest events with the consistency of entity types, locations, and content features. Finally, DTM is utilized to present the trends of entities over time.

Definition of Social Unrest Event
The traditional NER method only focuses on the person, organization, and location labels. These labels are insufficient to describe the complete event information (such as when, where, and what happened). According to the characteristics of social unrest events on social media, the event e is composed of four main elements, which can be represented as the following equation:
$$e = \langle T, L, P, A \rangle$$
where T, L, P, and A refer to the time, location, person, and action entity, respectively. A post example is "On 5 August evening in Tsuen Wan, for the first time, a demonstrator was chopped and wounded". It can be represented as <5 August, Tsuen Wan, Demonstrator, Chop>. On some occasions, the person entity is not necessary. For example, there is no person entity in the post "21 Yuen Long West Rail sit-in", but the post is still related to an event.
Based on how events appear in the textual data, we define the trigger filtering principles. The hypothesis is that a post describing an event e on social media should contain at least three elements, namely T, L, and A, where A is treated as the event trigger. After eliminating every post that lacks a trigger, time, or location entity, the remaining posts are considered to be related to events. In this way, the relevance between posts and events becomes clearer, and event detection accuracy can be increased.
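The filtering principle above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each post has already been reduced to a list of (text, label) pairs by an NER model, and it borrows the tag names TIM, LOC, and ACT from the labeling scheme described later in the paper. The function name `is_event_post` is hypothetical.

```python
# Entity labels a post must contain to be treated as event-related:
# time (T), location (L), and action trigger (A). The person entity is optional.
REQUIRED_LABELS = {"TIM", "LOC", "ACT"}


def is_event_post(entities):
    """Return True if the (text, label) pairs cover time, location, and action."""
    found = {label for _, label in entities}
    return REQUIRED_LABELS <= found


# Illustrative posts, hand-labeled for this sketch.
post_a = [("5 August", "TIM"), ("Tsuen Wan", "LOC"),
          ("demonstrator", "PER"), ("chop", "ACT")]
post_b = [("Hong Kong Police", "ORG"), ("hatred", "EMO")]

print(is_event_post(post_a))  # kept: T, L, and A are all present
print(is_event_post(post_b))  # dropped: no time, location, or trigger
```

A set-inclusion test keeps the rule declarative, so adding or relaxing a required element only changes `REQUIRED_LABELS`.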

NER and Event Trigger
The BIO (begin, inside, and outside) labeling method is one of the most popular schemes in NER [39]. The entity labels represent the position of different entities in the textual data. For example, B-X marks the beginning character of entity X, I-X marks each subsequent (inside) character of entity X, and O marks a character that does not belong to any entity type.
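To make the scheme concrete, the following sketch decodes a character-level BIO tag sequence back into entities. It is only an illustration under stated assumptions: the helper name `bio_to_entities` and the sample characters are hypothetical, and an I-X tag that does not continue an open entity of the same type is simply treated as outside.

```python
def bio_to_entities(chars, tags):
    """Group characters into (entity_text, entity_type) pairs from BIO tags."""
    entities, current, label = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that is still open.
            if current:
                entities.append(("".join(current), label))
            current, label = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the entity opened by the preceding B- tag.
            current.append(ch)
        else:
            # "O" or an inconsistent I- tag ends the open entity.
            if current:
                entities.append(("".join(current), label))
            current, label = [], None
    if current:
        entities.append(("".join(current), label))
    return entities


# Illustrative input: a location (Tsuen Wan) followed by an action (sit-in).
chars = list("荃灣靜坐")
tags = ["B-LOC", "I-LOC", "B-ACT", "I-ACT"]
print(bio_to_entities(chars, tags))
```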
To train the NER model, the pre-trained BERT language model is utilized for encoding each character and obtaining the corresponding character vector with more context-related semantic information. BERT has multi-layer bidirectional transformers to extract the context feature. In the NER task, BERT can label the text sequence with various defined entities. The model is trained by inputting the token vector into the classification layer and converting the output result into the probability of entity classification through a fully connected layer.
Then, the semantic vector of the sequence is input to the CRF layer for decoding. CRF is a standard algorithm in sequence labeling tasks [40]. The linear-chain CRF, which is often used in sequence labeling models, is a discriminative model that predicts the output sequence based on the input sequence. The CRF layer outputs the label sequence with the highest probability, which gives the category of each character. It also takes the pattern of the label sequence into account. For example, I-LOC can only follow B-LOC or I-LOC, I-ORG can only follow B-ORG or I-ORG, and so on. The architecture of the BERT-CRF model is shown in Figure 2.
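The paper does not spell out the scoring function of its CRF layer. As a reference point, a commonly used linear-chain formulation for neural sequence labeling is sketched below. The symbols are assumptions introduced here: $P_{i,y_i}$ is the emission score of label $y_i$ for the $i$-th character (in this setting, the output of the fully connected layer on top of BERT), and $A_{y_i,y_{i+1}}$ is a learned transition score between consecutive labels, with $y_0$ and $y_{n+1}$ as start and end labels.

```latex
\begin{aligned}
s(X, y) &= \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i} \\
p(y \mid X) &= \frac{\exp\big(s(X, y)\big)}{\sum_{y'} \exp\big(s(X, y')\big)} \\
y^{*} &= \arg\max_{y'} \; s(X, y')
\end{aligned}
```

Under this formulation, a pattern such as "I-LOC directly after B-ORG" is discouraged because training drives the corresponding transition score $A$ toward a very low value, and the best sequence $y^{*}$ is typically found with the Viterbi algorithm.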

K-Means Clustering
Since there are many posts on social media, one-step clustering cannot produce a suitable result. Hence, we first divide the data into several time windows based on the time mentioned in the posts, with the day as the basic unit. The date and time are mentioned in the text and extracted by NER. There are mainly three expressions of date and time on social media [41]: (a) Formal expression with or without symbols between the year, month, and day, similar to YYYY-MM-DD HH:MM or YYMMDD, such as 2019-01-08, 08/01/2019, and 190108; (b) Abbreviated form, without year, similar to MMDD or MM/DD, such as 0928, 9/28, and September 28; (c) Textual expression, such as today, yesterday evening, and two days ago. After the date and time are extracted, they are normalized and converted into the formal expression, for example "2019-09-28". A time entity in textual expression is resolved relative to the publication date. For example, if the time mentioned in the post is "yesterday" and the post was published on 28 September 2019, the time of the event is "2019-09-27". Each time window contains the unrest event posts of 7 consecutive days. Then, in each time window, we divide the unrest event detection into the following three parts.
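The normalization step described above could look roughly like the following sketch. It is not the authors' code: the lookup table `RELATIVE_OFFSETS`, the function names, and the choice to fill a missing year with the publication year are assumptions made for illustration, and only a subset of the listed expressions (YYYY-MM-DD, YYMMDD, MM/DD, MMDD, and a few relative phrases) is handled.

```python
import re
from datetime import date, timedelta

# Hypothetical lookup of textual expressions -> offset in days (illustrative only).
RELATIVE_OFFSETS = {"today": 0, "yesterday": -1,
                    "yesterday evening": -1, "two days ago": -2}


def normalize_date(expr, published):
    """Convert a date expression to 'YYYY-MM-DD', or None if unrecognized."""
    text = expr.strip().lower()
    if text in RELATIVE_OFFSETS:
        # Textual expressions are resolved against the publication date.
        return (published + timedelta(days=RELATIVE_OFFSETS[text])).isoformat()
    patterns = [
        (r"(\d{4})-(\d{1,2})-(\d{1,2})", lambda y, m, d: (y, m, d)),      # YYYY-MM-DD
        (r"(\d{2})(\d{2})(\d{2})", lambda y, m, d: (2000 + y, m, d)),     # YYMMDD
        (r"(\d{1,2})/(\d{1,2})", lambda m, d: (published.year, m, d)),    # MM/DD
        (r"(\d{2})(\d{2})", lambda m, d: (published.year, m, d)),         # MMDD
    ]
    for pattern, to_ymd in patterns:
        match = re.fullmatch(pattern, text)
        if match:
            try:
                return date(*to_ymd(*map(int, match.groups()))).isoformat()
            except ValueError:  # e.g. month 13: not a real calendar date
                return None
    return None


def window_index(day_iso, start):
    """Index of the 7-day window containing a normalized date."""
    return (date.fromisoformat(day_iso) - start).days // 7


published = date(2019, 9, 28)
print(normalize_date("yesterday", published))
print(normalize_date("0928", published))
print(window_index("2019-08-08", date(2019, 8, 1)))
```

Ambiguous formats such as 08/01/2019 are deliberately left out of the sketch, because the day-month order cannot be decided without a convention.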
First, the location constraint is applied to reduce the complexity and error of further clustering. Given the timeliness and dissemination characteristics of posts and events, location is also an important factor in an unrest event. It is difficult to quantify the distance between two locations based only on textual information. For example, Tsun Yip Street is in Kwun Tong, Hong Kong. However, from the text or even the semantic information about the location, the relationship between Tsun Yip Street and Kwun Tong is not obvious if they do not appear together. Hence, it is necessary to obtain geographic information about the location and build a location-based event extraction model. The location information is gathered from the address lookup service [42] provided by the Office of the Government Chief Information Officer. The service is updated monthly and covers most of the premises in Hong Kong. From the lookup service, the longitude and latitude of a location can be obtained and used for event clustering.
Second, beyond the time and location of the event, the relationship between posts is captured by vectorizing each post and computing similarities. Even if the times and places mentioned in two posts are very close, they may not describe the same event. Including the linguistic description of an event in the similarity calculation makes the result more accurate and better reflects the semantic relationship between events. Each post is vectorized according to the consistency of times, locations, entity types, and post features, so that posts describing the same event are more likely to be clustered together. In addition, a location constraint is used to determine the optimal value of k in the K-means algorithm: the distances between the locations in each cluster are calculated, and the maximum distance must not exceed 12 km. In this way, the location constraint is also part of event detection within each time window. The distance is calculated with the Haversine formula, which gives the spherical distance between two points:
$$d = 2R \arcsin\left(\sqrt{\sin^{2}\left(\frac{x_2 - x_1}{2}\right) + \cos(x_1)\cos(x_2)\sin^{2}\left(\frac{y_2 - y_1}{2}\right)}\right)$$
where $x_1$, $x_2$ are the latitudes and $y_1$, $y_2$ are the longitudes of the two points $(x_1, y_1)$ and $(x_2, y_2)$, and $R$ is the radius of the Earth.
Finally, clustering analysis is performed to detect the essence of the unrest content in the posts. The post vectors in each time window are calculated, and the Euclidean distances between every two posts form the similarity matrix of the posts in this time window. Then, the K-means clustering algorithm is used to detect the events in each time window. The smallest cluster number that meets the distance constraint is taken as the best value of k, so the value of k is determined automatically. Through these processes, all posts are finally clustered into k clusters, where each cluster represents an event.
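The selection of k under the 12 km constraint can be sketched as follows. This is a simplified illustration rather than the authors' pipeline: the small `kmeans` routine is a plain Lloyd's algorithm with a deterministic farthest-point initialization (in practice a library implementation would likely be used), the Earth radius of 6371 km is an assumed mean value, and all function names and the sample coordinates are hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm on tuples, using Euclidean distance."""
    centers = [points[0]]
    while len(centers) < k:  # farthest-point initialization (deterministic)
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return labels


def max_spread_km(locations):
    """Largest pairwise Haversine distance within one cluster."""
    return max((haversine_km(p, q) for p, q in combinations(locations, 2)), default=0.0)


def smallest_valid_k(vectors, locations, max_km=12.0):
    """Smallest k whose clusters all satisfy the location constraint."""
    for k in range(1, len(vectors) + 1):
        labels = kmeans(vectors, k)
        clusters = {}
        for lab, loc in zip(labels, locations):
            clusters.setdefault(lab, []).append(loc)
        if all(max_spread_km(locs) <= max_km for locs in clusters.values()):
            return k, labels
    return len(vectors), list(range(len(vectors)))  # fallback: one post per cluster


# Hypothetical (lat, lon) coordinates; here the post vector is just the location.
locations = [(22.310, 114.220), (22.312, 114.225), (22.450, 114.020)]
print(smallest_valid_k(locations, locations))
```

In the sketch the post vectors and the locations are passed separately, so richer post vectors (entity types, content features) can be clustered while the distance check still uses only latitude and longitude.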

Dynamic Topic Model
In the event detection module, DTM is also utilized to analyze the trend of, and the interest in, the entities related to unrest events. By identifying the various entities and mining the dynamic topic chains from the posts, we analyze unrest events at two levels: the macro topic level and the micro keyword level. In this way, we can understand and grasp the popularity and evolution of the topics. All entities extracted by NER can be treated as keywords. Because we take the time series factor into account when seeking the trends of topics, we apply the DTM model [43], which is based on LDA. Figure 3 shows the graphical representation of a dynamic topic model.
The parameters of the topic model change over time: the topic distribution of each document changes, and so does the word distribution of each topic. The parameter $\beta_{t,k}$ for period $t$, which represents the word distribution of topic $k$, is generated from $\beta_{t-1,k}$ of the previous period $t-1$. Similarly, the parameter $\alpha_t$ for period $t$, which governs each document's topic distribution, is generated from $\alpha_{t-1}$ of the previous period. A topic distribution is drawn for each document so that the topics $Z$ and words $w$ can be obtained for each document. It is not straightforward to quantify how topics change over time. Nevertheless, DTM can show that the basic content of a topic remains the same while its keywords evolve.
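For reference, the chaining between periods can be written out explicitly. The block below follows the standard DTM formulation of Blei and Lafferty [36] rather than anything specific to this paper. The variance terms $\sigma^2$, $\delta^2$, and $a^2$, the per-document variable $\eta_{t,d}$, and the softmax map $\pi$ are symbols from that formulation and do not appear in the text above.

```latex
\begin{aligned}
\beta_{t,k} \mid \beta_{t-1,k} &\sim \mathcal{N}\left(\beta_{t-1,k},\ \sigma^{2} I\right) \\
\alpha_{t} \mid \alpha_{t-1} &\sim \mathcal{N}\left(\alpha_{t-1},\ \delta^{2} I\right) \\
\eta_{t,d} &\sim \mathcal{N}\left(\alpha_{t},\ a^{2} I\right) \\
Z_{t,d,n} &\sim \mathrm{Mult}\left(\pi(\eta_{t,d})\right) \\
w_{t,d,n} &\sim \mathrm{Mult}\left(\pi(\beta_{t,\,Z_{t,d,n}})\right)
\end{aligned}
\qquad
\pi(x)_i = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)}
```

Read this way, each topic's parameters take a small Gaussian step from one period to the next, which is why the core of a topic stays recognizable while its most probable keywords drift.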

Experiment and Analysis
The experiment is conducted on both the Lihkg online discussion forum and Twitter to construct the detection system. This section demonstrates how unrest event information is extracted from the discussion forum and social media and how attractive topics evolve as time goes by.

Data Collection and Labeling
We collect data from the Current Affairs section of the Lihkg forum from 1 August 2019 to 31 August 2020. The posts are scraped automatically with web crawler tools, and we scrape only the post title, content, creation time, and the author's ID so that no personal information is included in the data collection process. We collect 939,393 posts, the majority of which are in traditional Chinese or Cantonese. We randomly select 11 thousand posts and label them manually in BIO format: for each entity, the first character is marked as "B-(entity name)", each subsequent character as "I-(entity name)", and all irrelevant characters are marked as O. This labeling also yields word segmentation as a by-product. The following entity types are labeled:
• Person: Relevant person name. For example, "林鄭月娥" (Carrie Lam Cheng Yuet-ngor, the Chief Executive of Hong Kong at the time) and "thug" are labeled as <PER>;
• Time: Relevant date and time information. For example, "17:05", "10月20日 (20 October)", and so on. They are labeled with <TIM>;
• Location: Relevant addresses, including names of countries, public places, roads, and buildings. For example, "Hong Kong", "Airport", and "Cheung Sha Wan Station". Location information is labeled with <LOC>;
• Organization: Relevant names of organizations, including "Hong Kong Police", "the Government", and "The Central Committee". They are tagged as <ORG>;
• Crime: Relevant crimes such as "assault", "riot", "vandalism", and "fire". These nouns involved in illegal activities are tagged as <CRM>;
• Action: Relevant actions mentioned in the posts. For example, "sit-in", "destroy", "gather", "commit arson", and so on. These verbs are tagged as <ACT>;
• Tool: Hazardous tools mentioned in the posts, such as "metal rod", "arms", "fire extinguisher", and "petrol bomb". They are labeled as <TOO>;
• Emotion: Emotional words in the posts, such as "hatred", "love", and "support". They are labeled as <EMO>.
Meanwhile, we collect data from Twitter from 1 August 2019 to 31 August 2020 using the Twitter search function and the Twitter API. We scrape only the content, creation time, and the author's ID so that no personal information is included. Due to restricted access to the Twitter API, we can only retrieve a limited number of tweets per day. In total, we collect 32,061 Chinese tweets, including tags and keywords such as hkprotests, freehongkong, and HKpoliceterrorist.
Both the Lihkg and Twitter datasets relate to the frequent social unrest events across Hong Kong, which makes them well suited to studying the patterns and signals of unrest events in Hong Kong. Figure 4 illustrates the counts of posts and tweets released on Lihkg and Twitter over the collection period. The 2019-2020 Hong Kong protests started in March 2019, and demonstrators also launched an operation at the Hong Kong International Airport to draw the international community's attention to the movement. At that time, peaceful protests occurred during the day and turned violent at night. Social unrest events continued in November 2019, and there were larger full-day events in mid-November. In 2020, some of the protests were canceled as a result of the COVID-19 outbreak. In Figure 4, the number of posts increases in August and November and decreases in December 2019 and January 2020. The National People's Congress passed the Hong Kong National Security Legislation in May 2020, which reignited the protests. Hence, a large number of posts suddenly appeared in June and July 2020.

NER and Trigger Analysis
The NER module uses the Chinese BERT-Base model, which has 12 layers of transformer blocks and 12 attention heads, and is trained for 3 epochs. The batch size is 16, and the character vector has 768 dimensions by default. The optimizer is Adam [44], and the learning rate is set to 1e-5. The fully connected layer that feeds the CRF layer has an output size of 17, which means each character in the text is assigned to one of 17 classes.
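One plausible way to arrive at 17 classes is the BIO expansion of the eight entity types plus the "O" tag (8 × 2 + 1 = 17). The sketch below assumes this label inventory, which the text does not spell out; the dictionary keys in `TRAINING_CONFIG` are hypothetical names, while the values are the settings reported above.

```python
# Entity tags as listed in the data collection section.
ENTITY_TYPES = ["PER", "TIM", "LOC", "ORG", "CRM", "ACT", "TOO", "EMO"]

# Hypothetical key names; values are the hyperparameters reported in the text.
TRAINING_CONFIG = {
    "epochs": 3,
    "batch_size": 16,
    "learning_rate": 1e-5,
    "hidden_size": 768,
    "num_layers": 12,
    "num_attention_heads": 12,
}


def build_label_set(entity_types):
    """Build the BIO label list: "O" plus a B- and an I- tag per entity type."""
    labels = ["O"]
    for etype in entity_types:
        labels.append(f"B-{etype}")
        labels.append(f"I-{etype}")
    return labels


labels = build_label_set(ENTITY_TYPES)
label2id = {label: i for i, label in enumerate(labels)}
print(len(labels))
```

Under this assumption, the size of the label list matches the output size of the layer feeding the CRF.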
To give an overview of model performance, accuracy is used to measure the training performance, and precision, recall, and F1-score are used to evaluate the testing and prediction results of the NER model. Table 1 illustrates the overall performance of BERT-CRF on the training and testing datasets. The accuracy on the training dataset is 95%. Entity identification measures the proportion of entities that are successfully identified even when their boundaries are only partially correct. Entity classification measures the proportion of extracted entities that are completely correct. The results indicate that the BERT-CRF model performs well on named entity recognition and is feasible for extracting the feature entities of social unrest events.
To evaluate the effectiveness of our framework, we compare the BERT-CRF result with the popular NER model BiLSTM-CRF with default settings. Both models use the same training and testing datasets, and precision, recall, and F1-score are compared under the entity classification criterion. We set 3 epochs for BERT and 10 epochs for BiLSTM-CRF, based on where the error rates stabilize. Table 2 shows the NER performance of the different models. Compared with BERT-CRF, BiLSTM-CRF does not perform well in identifying words that were not seen in the training dataset, mainly because BERT can capture more context information and patterns of the word sequence.
Then, the trained BERT-CRF model is applied to the whole Lihkg dataset to identify entities and extract events. Based on the definition of the event in this paper, there are 40,150 posts related to unrest events. Among these posts, there are 737 action entities in total after removing duplicates.
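The entity classification criterion, under which an extracted entity counts only if it is completely correct, can be sketched as an exact-match comparison of entity spans. The function names and the toy tag sequences below are assumptions for illustration and do not reproduce the evaluation script behind Tables 1 and 2.

```python
def extract_entities(tags):
    """Collect (start, end, type) spans from a BIO tag list (end exclusive)."""
    entities = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            entities.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def prf1(gold_tags, pred_tags):
    """Exact-match precision, recall, and F1 over entity spans."""
    gold = set(extract_entities(gold_tags))
    pred = set(extract_entities(pred_tags))
    tp = len(gold & pred)                    # same boundaries and same type
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Toy example: the second predicted entity has a wrong right boundary.
gold = ["B-ORG", "I-ORG", "O", "B-LOC", "I-LOC"]
pred = ["B-ORG", "I-ORG", "O", "B-LOC", "O"]
precision, recall, f1 = prf1(gold, pred)
print(precision, recall, f1)
```

In the toy example, the truncated location entity is rejected under exact matching, whereas a partial-boundary criterion such as entity identification would still credit it.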
We analyze the frequency of some high-frequency entities. Table 3 shows some action entities with relatively high frequency in both the Lihkg and Twitter datasets. Action entities appear frequently in October and November 2019. Notably, the phrase "Set roadblock" appears 129 times in the posts released on 16 November and 121 times on 17 November 2019. In the real world, there was a siege at the Hong Kong Polytechnic University from 17 to 29 November 2019. Checking the location and organization entities during that time, "the Polytechnic University" appeared 62 times, the shortened name "HKPU" appeared 54 times, and the organization "Police" appeared 82 times on 17 November. Furthermore, the trained BERT-CRF model is also applied to the Twitter dataset to extract events. According to the definition of the event in this paper, there are 12,010 tweets related to social unrest events. Compared with the entities in the Lihkg dataset, the names of regions, countries, and ethnic groups appear more frequently in the Twitter dataset. For example, "China", "Japan", and "Taiwan" appear more than 1000 times, and "Hong Kong people", "Chinese", and "Japanese" appear more than 500 times. Because Twitter hashtags are widely used internationally, users can reach a much wider audience on Twitter than on anonymous discussion forums. Twitter users who are interested in social unrest events often browse event information by searching hashtags, while organizers can also use hashtags to promote event information.
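The per-day frequency analysis of entities can be sketched with a simple counter. The record format and the three toy records below are invented for illustration; they are not taken from the Lihkg or Twitter datasets.

```python
from collections import Counter


def daily_entity_counts(records, entity_type):
    """Count (date, entity text) pairs for one entity type.

    records: list of (date string, [(entity text, entity type), ...]).
    """
    counts = Counter()
    for day, entities in records:
        for text, etype in entities:
            if etype == entity_type:
                counts[(day, text)] += 1
    return counts


# Toy records standing in for NER output on individual posts.
records = [
    ("2019-11-16", [("set roadblock", "ACT"), ("Police", "ORG")]),
    ("2019-11-16", [("set roadblock", "ACT")]),
    ("2019-11-17", [("set roadblock", "ACT"), ("PolyU", "LOC")]),
]
counts = daily_entity_counts(records, "ACT")
top = counts.most_common(1)
print(top)
```

Sorting such counts by date makes spikes in a given action entity visible, which can then be checked against the location and organization entities of the same day.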
In addition, based on the action entities extracted in the experiment, we summarize seven features to describe these social unrest events: three strikes (labor strike, class boycott, and the closure of businesses), sit-ins, parades, arson, wounding, riots, and conflict. These features are not independent of each other. They cover both threatening behaviors (arson, wounding, riots, conflict) and more peaceful forms of demonstration (three strikes, sit-ins, parades). Such analysis of action entities enables investigators or law enforcement to quickly draw out important words and conclusions from a massive volume of posts on social media.

Clustering Comparison
As mentioned in Section 3.3, the post vector concatenates four vectors: T, L, Entity Type, and BERT Vector. In this experiment, the number of entity types is 8. Hence, each post vector has 30 dimensions. The length of each time window is 7 days. As an example, the time window from 30 July 2019 to 5 August 2019 contains 186 Lihkg posts and 93 tweets that mention social unrest events. As introduced in Section 3.3, a location constraint of 12 km is added to determine the value of k in the K-means algorithm. For the example time window, the best value of k is 9, which means the optimum number of clusters in that week is 9. After calculating the similarity of the posts and applying the K-means algorithm, we obtain the clustering result shown in Figure 5. This experiment verifies the feasibility of the unrest event detection framework after dividing time windows and clustering the posts.
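The division into 7-day time windows can be sketched as follows. The function names and post identifiers are hypothetical, and the window origin of 30 July 2019 is taken from the example window above.

```python
from datetime import date


def window_index(day, start, length_days=7):
    """Zero-based index of the window that contains `day`."""
    return (day - start).days // length_days


def group_by_window(posts, start, length_days=7):
    """Map window index -> list of post ids, for (post id, date) pairs."""
    windows = {}
    for post_id, day in posts:
        idx = window_index(day, start, length_days)
        windows.setdefault(idx, []).append(post_id)
    return windows


start = date(2019, 7, 30)
posts = [
    ("p1", date(2019, 7, 30)),   # first day of the example window
    ("p2", date(2019, 8, 5)),    # last day of the example window
    ("p3", date(2019, 8, 6)),    # first day of the next window
]
windows = group_by_window(posts, start)
print(windows)
```

Clustering is then run separately on the posts of each window, so the number of clusters can differ from week to week.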
For better visualization, the background in Figure 5 is a map of Hong Kong, and each post is plotted on the map according to the geographic information of the event location. Points with the same color belong to the same cluster, which means the posts very likely describe the same event. As shown in Figure 5, with the help of the location distance constraint and the four vectors in each post vector, unrest events that occur near one another are more likely to be allocated to the same cluster. The figure therefore gives a quick overview of all the extracted events. Moreover, the entities in each cluster summarize the key features of the unrest event. Table 4 displays the high-frequency entities extracted in some clusters.
Table 4. Some entities with relatively high frequency in each cluster (columns: Cluster No.; Named Entities).
Based on the experiment, we believe that the following aspects play an important role in the unrest event detection framework. First, the proposed framework combines the named entity recognition approach with event detection so that the dataset can be labeled and identified more easily. A suitable learning algorithm can identify new entities that are not included in the training dataset. It effectively improves the quality of labeling and provides a better basis for event detection. Second, the concept of the trigger word in traditional event detection models is applied to the detection framework. Action, time, and location entities are the main factors in an unrest event. In our framework, the action entity is treated as the event trigger. Therefore, the designed detection framework can effectively filter keywords and posts related to unrest events from a large number of posts, thereby improving the accuracy of event detection. Finally, the post vector is calculated based on the time, geographic information, entities, and feature vector from BERT-CRF. The K-means clustering algorithm, combined with the distance restriction, uses the Euclidean distance to calculate the similarity between the post vectors. The clustering result reflects the similarities between the unrest events and helps ensure the accuracy of event detection.
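The combination of Euclidean K-means with a distance restriction can be sketched as below. Section 3.3 is not reproduced here, so the rule used in `choose_k` (take the smallest k for which no two posts in the same cluster are more than 12 km apart) is an assumed reading of the location constraint, not the authors' exact procedure. The toy vectors contain only approximate coordinates rather than the 30-dimensional post vectors, and the K-means routine is a plain Lloyd's algorithm with a deterministic farthest-point initialization chosen for reproducibility.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def haversine_km(p, q):
    """Great-circle distance in km between (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def kmeans(vectors, k, iters=50):
    """Lloyd's algorithm with farthest-point initialization."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(euclidean(v, c) for c in centroids)))
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest centroid (Euclidean distance).
        labels = [min(range(k), key=lambda j: euclidean(v, centroids[j]))
                  for v in vectors]
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def choose_k(vectors, coords, max_km=12.0):
    """Smallest k whose clusters all span at most max_km (assumed rule)."""
    n = len(vectors)
    for k in range(1, n + 1):
        labels = kmeans(vectors, k)
        within_limit = all(
            haversine_km(coords[a], coords[b]) <= max_km
            for a in range(n) for b in range(a + 1, n)
            if labels[a] == labels[b])
        if within_limit:
            return k, labels
    return n, list(range(n))


# Approximate coordinates: airport, Tung Chung, Mong Kok, Tsim Sha Tsui.
coords = [(22.308, 113.918), (22.289, 113.941),
          (22.319, 114.169), (22.297, 114.172)]
vectors = coords  # toy simplification: location component only
k, labels = choose_k(vectors, coords)
print(k, labels)
```

With the full post vectors, the time, entity type, and BERT components would also influence the Euclidean assignment, while the kilometre check would still be applied to the geographic coordinates alone.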

Event Tracing
To trace social unrest events, DTM is used to present the results of dynamic entity analysis. To observe how entities change over time, we divide the whole timeline from August 2019 to August 2020 into 13 slices (each month is a time slice). For all entities extracted from the dataset, we use the DTM model to build probabilistic topic models with 5 to 10 potential topics. By mixing the entities to construct the topic model, we can identify semantic relations between different entities. After manually reading the keywords under the different topics, we find that the keyword lists are most semantically coherent when the number of topics is 7. We select the four most representative potential topics over the whole timeline, with some of their keywords (entities), as shown in Table 5. We also plot the trend of topic evolution for the different topics in Figure 6. This trend shows that people's focus changes as the rally and parade situation develops. At the same time, the occurrence of new events (such as the COVID-19 pandemic and the promulgation of the National Security Legislation) also affects people's attitudes toward and discussions about current events.
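The monthly slicing can be sketched as follows. Many DTM implementations expect the documents ordered by time together with the number of documents in each slice, so the sketch computes those counts; the function names and the four toy dates are assumptions for illustration.

```python
from datetime import date


def month_slice(day, start_year=2019, start_month=8):
    """Zero-based month index relative to August 2019."""
    return (day.year - start_year) * 12 + (day.month - start_month)


def slice_sizes(days, n_slices=13):
    """Number of documents in each monthly slice; out-of-range dates are skipped."""
    sizes = [0] * n_slices
    for day in days:
        idx = month_slice(day)
        if 0 <= idx < n_slices:
            sizes[idx] += 1
    return sizes


# Toy creation dates standing in for the posts that carry extracted entities.
days = [date(2019, 8, 1), date(2019, 8, 20),
        date(2019, 11, 17), date(2020, 8, 31)]
sizes = slice_sizes(days)
print(sizes)
```

August 2019 maps to slice 0 and August 2020 to slice 12, which gives the 13 slices used in the analysis.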

Discussion
To further discuss and summarize our findings, this section collates the characteristics of the spread of social unrest events in social networks and introduces several practical use cases to describe the practical application of our proposed approach.

Event Spreading on Social Media
The evolution of topics shows several characteristics: (1) People talk about different topics in different time periods. Even when the topic is the same, the wording changes over time, particularly in the way people describe others. For example, the way they refer to the police changes from "Police" to "Black Police", and then to "Black Dog". (2) Some regular topics maintain a relatively stable level of popularity, including international relations, laws and regulations, and some popular words related to demonstrations. (3) The appearance of irregular topics reduces the popularity of regular topics, and these irregular topics temporarily become hot topics, especially when emergencies that threaten people's safety or national interests occur. Examples include traffic jams caused by protests, the appearance of COVID-19, and topics arising from the words and deeds of public figures. (4) Some topics related to a specific time or festival become more and more popular as the date approaches. Before festivals such as Hong Kong Special Administrative Region Establishment Day and the National Day of the People's Republic of China, the number of posts and demonstrations increases greatly. For example, on National Day, there was a demonstration called "No national day! Only national death!", which was also a keyword of the posts in that period. These topics are predictable to a certain extent, and some preventive measures can be taken in advance.

Use Cases for Event Detection Framework
In addition to the analyses above, this section presents several use cases of the unrest event detection framework in social media.

Hong Kong International Airport, August 2019
The event at the airport began in June 2019. In August, many protesters went to the Hong Kong International Airport to participate in the event, and traffic to the airport was almost completely blocked. On 14 August 2019, the Hong Kong International Airport implemented access control at the terminal buildings. Only passengers with a valid ticket or boarding pass for a flight were allowed to enter. From then on, the airport was no longer a protest site, and the word "Airport" suddenly disappeared from the high-frequency words in the Lihkg forum. This suggests that early arrangements and timely prohibitions are necessary for public gathering activities.

The Hong Kong Polytechnic University, September 2019
People's attention to "Police", "PolyU", and "Demonstrator" increased sharply from September 2019. Some university students organized an event called "Boycotting classes but not education" in September. In November, as the conflict intensified, sieges and clashes between protesters and Hong Kong police broke out around PolyU, drawing more attention to both organizations.

Posts from 1 to 7 September 2019
In Figure 7, we present an entity graph for visual analysis of the posts. The graph includes only the five clusters with the largest number of posts. The nodes of the graph are time and location entities. The color of a node represents the event: nodes with the same color are in the same cluster and describe the same event. The legend in the top right corner shows the trigger word of each event. An edge between two nodes means the entities are mentioned in the same post. The information shown in Figure 7 points to events such as human chain events in Tuen Mun, the conflict between demonstrators and police at the MTR station, the disruption at the airport, demonstrations at several parks, and the flash mob event in Tai Po. From such visual analysis, we can quickly obtain information about the unrest events and understand what occurred or will happen on a given day at a given place.
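The construction of such an entity graph can be sketched as follows: nodes are time and location entities, an edge links two entities mentioned in the same post, and each node carries the cluster of the post in which it first appears. The function name, the edge weighting by post count, and the toy posts (including the times) are assumptions for illustration and are not the data behind Figure 7.

```python
from collections import defaultdict
from itertools import combinations


def build_entity_graph(posts):
    """Build a co-mention graph over time and location entities.

    posts: list of (cluster id, [(entity text, entity type), ...]).
    Returns (edges, node_cluster), where edges maps a sorted node pair
    to the number of posts mentioning both nodes.
    """
    edges = defaultdict(int)
    node_cluster = {}
    for cluster_id, entities in posts:
        nodes = sorted({text for text, etype in entities
                        if etype in ("TIM", "LOC")})
        for node in nodes:
            node_cluster.setdefault(node, cluster_id)   # used for coloring
        for a, b in combinations(nodes, 2):
            edges[(a, b)] += 1
    return dict(edges), node_cluster


# Toy posts with invented times.
posts = [
    (0, [("Tuen Mun", "LOC"), ("19:00", "TIM"), ("human chain", "ACT")]),
    (0, [("Tuen Mun", "LOC"), ("19:00", "TIM")]),
    (1, [("Airport", "LOC"), ("13:00", "TIM")]),
]
edges, node_cluster = build_entity_graph(posts)
print(edges)
print(node_cluster)
```

The action entity is left out of the node set here because, in the figure, the trigger word of each event appears in the legend rather than as a node.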

Detection of new events
To verify whether the framework can detect real-time or preemptive events, we need to consider its practicality in the social network. We collect some posts released after 31 August 2020 and apply the framework to the newly collected data. The data can be treated as "future" data relative to the period of the original data collection. Table 6 shows some examples of event entity detection. In this case study, we find that the number of posts related to unrest events after August 2020 is not as large as the number of posts in 2019 because of the National Security Legislation.

Conclusions
In this paper, we propose a social unrest event detection framework that utilizes the characteristics of event information in social media, combines NER with event detection methods, and analyzes social unrest events with clear visualization results. In addition, the framework reveals meaningful event-related entities in a particular region within a period. Moreover, we present a comparison experiment with the Lihkg and Twitter datasets and some case studies. The results show that although there is some overlap between Lihkg and Twitter regarding social unrest events, Twitter users are more likely to use country names and emotional terms to spread events, while Lihkg users are more likely to use specific locations in Hong Kong to indicate event locations.
However, there are limitations to our application and analysis. It is difficult for us to extract all the tweets related to the Hong Kong social unrest events from Twitter. We can only use the keyword filtering method, which leads to more forum data than Twitter data. Thus, the accuracy of event detection is limited, and there may be posts that mention events only implicitly and are difficult to identify. In this study, we define the elements contained in posts related to social unrest events. There may be posts that are related to an event but do not contain enough event entity elements. Although the impact is not significant, it can still affect the results of the event analysis. Nevertheless, the experimental results still illustrate specific information about social unrest events in Hong Kong to some extent and demonstrate the practicality and effectiveness of the framework.
Future research will further explore the causes, connections, evolution, and impact of different events. We will design an inductive learning model that covers different event types on various social media platforms. Furthermore, we are interested in tracking the development of events and in sentiment analysis of people's attitudes toward social unrest events.