AllforJan: How Twitter Users in Europe Reacted to the Murder of Ján Kuciak - Revealing Spatiotemporal Patterns through Sentiment Analysis and Topic Modeling

Abstract: Social media platforms such as Twitter are considered a new mediator of collective action, in which various forms of civil movements unite around public posts, often using a common hashtag, thereby strengthening the movements. After 26 February 2018, the #AllforJan hashtag spread across the web when Ján Kuciak, a young journalist investigating corruption in Slovakia, and his fiancée were killed. The murder caused moral shock and mass protests in Slovakia and in several other European countries as well. This paper investigates how this murder, and its follow-up events, were discussed on Twitter, in Europe, from 26 February to 15 March 2018. Our investigations, including spatiotemporal and sentiment analyses, combined with topic modeling, were conducted to comprehensively understand the trends and identify potential underlying factors in the escalation of the events. After a thorough data pre-processing including the extraction of spatial information from the users’ profile and the translation of non-English tweets, we clustered European countries based on the temporal patterns of tweeting activity in the analysis period and investigated how the sentiments of the tweets and the discussed topics varied over time in these clusters. Using this approach, we found that tweeting activity resonates not only with specific follow-up events, such as the funeral or the resignation of the Prime Minister, but in some cases, also with the political narrative of a given country affecting the course of discussions. Therefore, we argue that Twitter data serves as a unique and useful source of information for the analysis of such civil movements, as the analysis can reveal important patterns in terms of spatiotemporal and sentimental aspects, which may also help to understand protest escalation over space and time.


Introduction
The perception of inherent tensions between justice and injustice (or the disproportion of good and bad) often presses a group of people (or even the whole society) to seek change concerning politics and power, for example in the form of protests [1]. In the last few decades, the advent and rapid expansion of internet-based communication technologies transformed the way of seeking change through connective action, where besides the two main elements (the people and their intentions), the role of information, along with its spread and accessibility, has gained more and more significance [2].
The data-driven approach, relying on social media posts and activities, has many strengths, especially its high temporal resolution and the rapid user response to certain news and information [3]. Social movement research employs this approach. This paper addresses the following research questions:

1. Temporal aspects:
• How did tweeting activity related to the murder of Kuciak vary over time throughout Europe? (RQ1a)
• Can we identify the influence of specific events and incidents, such as media reports or findings of the investigation, based on this tweeting activity? (RQ1b)
2. Content aspects:
• How does the sentiment of the tweets vary over time, and how does it relate to specific events and news? (RQ2a)
• Around what topics do the tweets revolve, beyond the murder itself? (RQ2b)
3. Profiles:
• How can we characterize the countries based on the temporal aspects of the tweeting activity and the sentiment of the tweets? (RQ3a)
• Does the categorization of the countries also reflect differences in the identified topics or the changes of the sentiment values over time? (RQ3b)

Related Works
In the last decade, the use of social media data as a source of information has become increasingly widespread in a variety of fields ranging from medical research to urban planning [20][21][22][23]. The examination of information from social networking sites can be instrumental in achieving a thorough interpretation of the human environment and social dynamics. Such research projects usually utilize a so-called passive crowdsourcing approach, where data is generated collectively by users of a social media platform, but without direct contributions to a specific research or crowdsourcing project, as opposed to traditional (or active) crowdsourcing such as OpenStreetMap [24,25]. The emerging field of passive (or opportunistic) crowdsourcing relies on such data, granting empirical investigations that usually build on a semi-automated data collection process. The users either provide data in the form of text (e.g., Twitter) or images (e.g., Instagram), which are often paired with sensor-obtained information (e.g., location) and uploaded online [26,27]. Despite the unique usability of this research field, relatively few papers published so far have concentrated on multi-dimensional analysis applications of social media data in the analysis of collective actions, such as protests.
Although social media-based analytical methods have accelerated the study of collective actions compared with slow and expensive conventional methods such as surveys, they still suffer from a variety of deficiencies, such as a lack of representativeness or limited transferability of the results. Moreover, despite the widespread and well-established advantages of network modeling approaches on hashtags [13,14,28,29] or users [30,31] in recent literature analyzing protests based on social media data, these approaches are often not capable of handling the complexity of the sentiment, temporal, and spatial patterns of these actions in one approach. One limitation is that the analysis relies on tweets having coordinates as an inherent part of the dataset [32,33], which, according to earlier studies, represent only a small subset of all tweets (approximately 1-10%) posted within a specified time period [34]. Another limitation is that they focus on a single language (usually English), which may also limit the spatial interpretability of the results [35], especially in the case of movements that span multiple countries. This is where the benefits of a thorough data pre-processing method, including translation and user location extraction, become relevant: it supports a more precise assessment of a post-event situation by extracting an additional information layer from the digital footprints of users and revealing contextual insights. This approach already exists in other fields, for example to uncover disaster footprints, but not in the case of social movements or unrest [36][37][38].
This article suggests a comprehensive methodology to overcome limitations in the existing methods and handle the complexity of protest analyses. The proposed approach includes multi-lingual corpus translation, as well as location and sentiment extraction, and uses machine-learning topic modelling methods to reveal the hidden interests and motivators of collective action. Through this, our approach has a distinct advantage over prior investigations that primarily focused either on hashtag activism [39,40] (ignoring the spatial dimensions) or, on the contrary, on location-specific hashtags only [41,42]. Moreover, by applying machine learning algorithms and techniques that are almost entirely automatable, we can process a much wider range of input data than existing studies in which the researchers evaluate posts solely by hand [7,43].

Data
The Twitter data analyzed in this work were obtained using the Twitter Streaming Application Programming Interface [44] for the period between 26 February and 15 March 2018. The starting date corresponds to the first official report of the murder of Ján Kuciak, while the final day corresponds to the earliest statement of the resignation of Prime Minister Robert Fico. The dataset consists of the content of the tweets and additional attributes such as user name, user location, and the timestamp when the tweet was posted. In the first round, we harvested tweets with relevant content (such as the names kuciak, kusnirova, fico) or hashtags (#AllforJan) within this period. This resulted in 13,176 tweets from all over the world. However, as our approach also requires geospatial analyses, we implemented a secondary filtering on this dataset to identify relevant tweets by focusing on the place attributes or the coordinates of a tweet, at least one of which had to contain usable data. Around 3000 tweets whose user location was not specific enough (such as "World", "Internet", or "online") were excluded. Most of the tweets were posted from Europe; therefore, for the rest of the analysis, we consider only European locations. By transforming user location into coordinates, we could further increase the number of tweets used for the analysis, as the original dataset contained only 24 tweets with coordinates. Another 1800 tweets were removed from our dataset because they were too short or no meaningful coordinate could be attached to them. Overall, this two-step query and the subsequent filtering resulted in 8069 tweets distributed over an 18-day timeframe for 39 countries (see Figure 1 for an overview of the timeline). Figure 2 illustrates the key steps of our analysis performed on the dataset of harvested tweets described in Section 3.1.
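The first-round content filter described above can be illustrated with a minimal sketch. It assumes plain substring matching on lower-cased, accent-free text; the function names are hypothetical and the actual query logic of the Streaming API is not reproduced here.

```python
import unicodedata

# Search terms taken from the text above; matching is done on accent-free,
# lower-cased text so that "Kušnírová" also matches "kusnirova".
KEYWORDS = ("kuciak", "kusnirova", "fico")
HASHTAGS = ("#allforjan",)


def strip_accents(text: str) -> str:
    """Remove diacritics, e.g. 'Ján' -> 'Jan'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_relevant(tweet_text: str) -> bool:
    """Return True if the tweet mentions one of the keywords or the hashtag."""
    normalized = strip_accents(tweet_text).lower()
    return any(term in normalized for term in KEYWORDS + HASHTAGS)
```

Substring matching is a deliberate simplification: a short term such as "fico" would also match unrelated words that contain it, so a production filter would more likely match whole tokens.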
Our pre-processing approach (Step 1 performed on the raw tweets) consists of a thorough text cleaning workflow and the transformation of available meaningful location information to coordinates for further utilization in the spatial analysis step (Figure 2). In the first part of the pre-processing, we implement primary filtering on our dataset to ignore short tweets that hardly bear any semantic significance. Moreover, stop-words, rare, and too frequent words are removed to normalize the dataset and reduce the redundant noise from tweets [45]. Then, we use the available location information of the user to localize the tweets without direct coordinates attached to a tweet (also called geoparsing, see further details in Section 3.2.2) to further increase the size of the dataset for machine learning-based translation and spatiotemporal analysis.

Text Cleaning
The aim of this step is to increase the efficiency of the subsequent translation process. We remove short tweets (containing a single word) and posts that contain only hashtags or URLs, because they hold unclear and hardly interpretable semantic value [46]. Then, we remove replies (@user_name) from the text, as they are considered unnecessary noise in the analysis; in contrast, the hashtags are preserved but their sign (#) is removed. The consideration behind this step is that users tend to use hashtags as an integral part of the syntax [47], e.g., "on Friday I'm also going to #Bratislava to protest" -> "on Friday I'm also going to Bratislava to protest". As a final cleaning step, we remove new line characters as well as additional whitespaces from the tweets (Figure 2).
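The cleaning rules listed above can be sketched with a few regular expressions. This is an illustrative reading of the rules, not the authors' code; the patterns and the function name clean_tweet are assumptions.

```python
import re
from typing import Optional

MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")
WHITESPACE = re.compile(r"\s+")


def clean_tweet(text: str) -> Optional[str]:
    """Return the cleaned tweet, or None if the tweet should be dropped."""
    text = URL.sub(" ", text)
    text = MENTION.sub(" ", text)
    # Once URLs and mentions are gone, a tweet made only of hashtags is dropped.
    words = text.split()
    if words and all(word.startswith("#") for word in words):
        return None
    # Keep the hashtag word but remove its sign.
    text = text.replace("#", "")
    # Collapse new lines and repeated whitespace into single spaces.
    text = WHITESPACE.sub(" ", text).strip()
    # Empty and single-word tweets carry too little meaning to keep.
    if len(text.split()) < 2:
        return None
    return text
```

Applied to the example from the text, clean_tweet("on Friday I'm also going to #Bratislava to protest") returns the sentence with the hash sign removed, while a tweet consisting only of hashtags or a URL yields None.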

Locating Tweets Using Coordinates or User Profile Information
As our research questions heavily rely on spatial information, we attempted to process all Twitter fields that may contain relevant spatial attributes to increase the number of tweets having an identifiable location at least at the country level. In general, tweets that inherently include coordinates constitute only a small subset of all tweets. To overcome this limitation, our analytical approach tried to locate tweets without coordinates using the location information available in the users' profiles.
In 2018, Twitter still allowed users to add the exact location from which they were tweeting using coordinates; however, this feature is no longer available. Such tweets contain precise latitude/longitude coordinates representing a point somewhere in the world, and this information comes from the built-in GPS receiver of the device. This location type does not include further information beyond the coordinates, such as the city or country from which the tweet was posted. To obtain address information for these coordinates, we used the Geopy Python client [49] to access the geocoding web services provided by the OpenStreetMap API (Nominatim) [50]. The location.raw['address'] entry of the result is a dictionary of address components, such as country code, city, or road, allowing for a targeted query of relevant address information. The other location type, which can be assigned by the user to a tweet (the only way to add location information when this paper was written), is the so-called Twitter "Place" tag. In contrast to adding only coordinates, this tag has further properties such as the name of the city or region, along with the country code of the country where the given "Place" is located. The source of this information is still the built-in GPS (or GNSS) receiver of the device, but the way this information is visualized and presented differs from the option described above, where exact coordinates were attached to a tweet.
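A minimal sketch of the reverse-geocoding step follows. The helper names are hypothetical, and the geocoder is passed in as a callable so that the country-code lookup can be tried without network access; make_nominatim_reverse shows how Geopy's Nominatim client could be plugged in, assuming Geopy is installed and a valid user agent string is supplied.

```python
from typing import Callable, Optional


def country_code_from_raw(raw: dict) -> Optional[str]:
    """Pull the upper-cased country code out of a Nominatim-style raw result."""
    code = raw.get("address", {}).get("country_code")
    return code.upper() if code else None


def reverse_country(lat: float, lon: float, reverse: Callable) -> Optional[str]:
    """Reverse-geocode a point with the supplied callable and return its country code."""
    location = reverse((lat, lon))
    if location is None:  # the service found nothing for this point
        return None
    return country_code_from_raw(location.raw)


def make_nominatim_reverse(user_agent: str) -> Callable:
    """Build a reverse-geocoding callable backed by Geopy's Nominatim client."""
    from geopy.geocoders import Nominatim  # third-party, imported only when needed

    return Nominatim(user_agent=user_agent).reverse
```

With Geopy available, reverse_country(48.15, 17.11, make_nominatim_reverse("my-app")) would query the public Nominatim service, which is rate limited, so results should be cached when many tweets are processed.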
To increase the number of tweets having some kind of spatial reference, we can extract location information from the profile of the Twitter users, which can serve as a proxy for where the user might be active most of the time. Several data fields fall into this category, but all of them represent information that the users insert at the account level and not for each tweet separately; thus, the credibility of the information mainly relies on the user. Moreover, even if the location information at the account level is valid, it might not be correct for each individual tweet, for example if the person is traveling abroad. Generally, these values are not frequently altered and do not necessarily describe the tweet's exact location, but they may represent the user's residence at least at the city level. As our research considers tweets aggregated at the country level and not the exact location within a city or a country, we still find these data valuable as a proxy in instances where no direct location information for a given tweet was provided. To obtain useful information from the user_location data, we first ranked the individual user_location entries by frequency and then used the built-in map function of Python [51] and a translator dictionary developed by us to transform all location items and to group similar entries. For instance, "BaWü, DE" was transformed to "Baden-Württemberg, Germany" and similarly "B. Württemberg" was also converted to "Baden-Württemberg, Germany." The second step after the user_location transformation was to apply the Geolocator function of OpenStreetMap through Geopy, which provided latitude and longitude coordinates that we use for subsequent mapping applications. Originally, in our raw data set, only around 0.2% (24) of all tweets held coordinates; however, with the above-mentioned transformation approaches, we could provide coordinates for 8069 tweets, that is, 61.2% of the original dataset.
The remaining 38.8% of the tweets were either posted outside of Europe or could not be located; most of them were excluded as part of the text cleaning process described in Section 3.2.1.
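The grouping of free-text profile locations can be sketched as follows. The alias dictionary is a hypothetical two-entry excerpt that only mirrors the examples given above (the authors' full translator dictionary is not reproduced), and the function names are assumptions.

```python
from collections import Counter

# Hypothetical excerpt of a hand-built translator dictionary; the two entries
# mirror the examples given in the text.
LOCATION_ALIASES = {
    "BaWü, DE": "Baden-Württemberg, Germany",
    "B. Württemberg": "Baden-Württemberg, Germany",
}
# Placeholders named in the text as too unspecific to be located.
UNSPECIFIC = {"world", "internet", "online"}


def normalize_location(raw_location: str):
    """Map a free-text profile location to a canonical name, or None."""
    cleaned = raw_location.strip()
    if not cleaned or cleaned.lower() in UNSPECIFIC:
        return None
    return LOCATION_ALIASES.get(cleaned, cleaned)


def rank_locations(raw_locations):
    """Rank canonical locations by frequency, most common first."""
    normalized = map(normalize_location, raw_locations)  # built-in map, as in the text
    return Counter(loc for loc in normalized if loc is not None).most_common()
```

The coverage figures quoted above follow directly from the counts: round(100 * 8069 / 13176, 1) gives 61.2, and round(100 * 24 / 13176, 1) gives 0.2.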

Translation Using Google API
The majority of the tweets (68%) were non-English and therefore had to be translated for the subsequent analysis steps. In order to translate these non-English tweets, we used TextBlob, a text-processing library written in Python. According to its documentation, TextBlob [52] can also be used for part-of-speech tagging, parsing, sentiment analysis, spelling correction, and translation tasks. The algorithm relies on the most used online translation service, Google Translate's API. One of the most significant benefits of this Translation API is that it has a pre-trained model that instantly identifies languages with high accuracy and can translate them into more than one hundred target languages, including all the European languages we used in this analysis. In 2011, as part of a comprehensive accuracy evaluation, 51 languages were translated using Google Translate to another language and the results showed that most of the European languages had reliable results. Thanks to a service update in 2016 by applying a Neural Machine Translation (NMT) model, the translation accuracy score increased from 3.694 (out of 6) to 4.263, close to a human-level score of 4.636 [53]. This high value is acceptable for the next steps in our workflow, as our approach uses word-based analysis to extract sentiment values.

Emoji/Emoticon Transformation
In general, text pre-processing approaches tend to remove any emojis (small images) and emoticons (facial expression representation using keyboard characters and punctuations) from the text. The main problem of such approaches is that users use these small images and characters as the lingua franca of social media to express feelings or ideas, compressing a meaningful word in a short number of characters [54]. In our methodology, we are not considering their removal as the appropriate solution since emojis and emoticons contain valuable information, particularly for the subsequent sentiment analysis. Thus, we convert them to word format using the Python emote library [55] in order to preserve the emoji information for further analysis steps.
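The idea of converting emojis and emoticons to words, rather than deleting them, can be sketched with a small lookup table. The mapping below is a tiny illustrative stand-in and does not reproduce the lexicon of the library cited above; all names are hypothetical.

```python
import re

# Tiny illustrative mapping; a real lexicon would be far larger.
EMOTICON_WORDS = {
    ":)": "smile",
    ":-)": "smile",
    ":(": "sad",
    ":-(": "sad",
    "\U0001F622": "crying face",   # emoji U+1F622
    "\U0001F64F": "folded hands",  # emoji U+1F64F
}

# Longer keys are tried first so that multi-character emoticons win.
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(EMOTICON_WORDS, key=len, reverse=True))
)


def emoticons_to_words(text: str) -> str:
    """Replace each known emoji or emoticon with its word form."""
    replaced = _PATTERN.sub(lambda m: f" {EMOTICON_WORDS[m.group(0)]} ", text)
    return " ".join(replaced.split())
```

Because the replacement words stay in the text, a dictionary-based sentiment scorer applied afterwards can still pick up the emotional signal that the symbol carried.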

Semantic Analysis
The semantic text analysis process used in our approach is divided into two stages: first, we extend the list of stop words in the algorithm based on the characteristics of our data set and remove these words from the text; second, we perform a dictionary-based sentiment analysis, which classifies the subjective sentiment information contained in each tweet.
ISPRS Int. J. Geo-Inf. 2021, 10, 585

Removing Stop Words
The literature considers auxiliary verbs, conjunctions, and other parts of written text that do not bear significant semantic meaning as "stop words". A list of these words is predefined by the Natural Language Toolkit (NLTK) [56] in the algorithm we used. Nonetheless, we added further words to the stop word dictionary that are specific to the unedited text or the analyzed corpus, including first names such as "Martina", "Marian", and "Andrej" or words like "gonna" and "wanna". We remove these words from the dataset along with words of three or fewer characters, as they also have limited semantic significance.
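The stop word step can be sketched as below. The base list is a small stand-in for the NLTK English stop word list (an assumption made so that the example has no external dependency), while the custom additions are the ones named in the text.

```python
# Small stand-in for the NLTK English stop-word list (assumption: the real
# list is much longer), plus the corpus-specific additions named in the text.
BASE_STOP_WORDS = {"the", "and", "have", "been", "this", "that", "with", "from"}
CUSTOM_STOP_WORDS = {"martina", "marian", "andrej", "gonna", "wanna"}
STOP_WORDS = BASE_STOP_WORDS | CUSTOM_STOP_WORDS


def remove_stop_words(text: str, min_length: int = 4) -> str:
    """Drop stop words and words with fewer than min_length characters."""
    kept = [
        word
        for word in text.split()
        if word.lower() not in STOP_WORDS and len(word) >= min_length
    ]
    return " ".join(kept)
```

The default min_length of 4 corresponds to removing words with three or fewer characters.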

Sentiment Analysis
Sentiment scores are used to identify how positive or negative the text of a given tweet is. This identification is performed in an automated way by calculating the difference between the quantity of positive and negative terms using a vocabulary of positive and negative words. We selected for this purpose the VADER (Valence Aware Dictionary and sEntiment Reasoner) lexicon, a rule-based sentiment analysis tool that uses a lexicon-driven method and heuristics to assess the input data. This method is attuned to the sentiments expressed in social media, and it has a higher classification accuracy than other methods in the light of the recent literature [57]. In the comparative study of Hutto and Gilbert, it was found that the VADER correlation coefficient (r) was nearly equal to human raters' performance (r = 0.881 vs. 0.888). However, when they inspected the classification accuracy for social media text analysis (F1), it outperformed the human raters (F1 = 0.96 vs. 0.84) and eleven other highly regarded analysis tools such as the Hu-Liu04 opinion lexicon or WordNet [57].
Generally, if the score is lower than zero, the sentence or text element is assumed to carry a "negative sentiment," whereas it is considered a "positive sentiment" if the value is higher than zero. If the score equals zero, the sentence is identified as "neutral." The main problem of this arrangement with only three classes is that the algorithm has shortcomings in determining unambiguous negative or positive scores for sentiment values around zero, because these are either indeed neutral or misclassified with a relatively high probability. The VADER algorithm uses the so-called compound score, which is calculated by adding the valence scores of all words in the lexicon, adjusting them according to the rules of the algorithm, and then normalizing the value to fall between −1 (the most extreme negative value) and +1 (the most extreme positive value). It is a useful metric if we seek a single unidimensional assessment of a sentence's emotion. Thus, we categorize the tweets into five categories based on the compound score. This does not imply that all tweets with sentiment values of 0.1 and −0.1 are neutral; however, we could reduce the number of neutral tweets using this categorization.
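The normalization and the sign-based labelling described above can be sketched as follows. The squashing function x / sqrt(x² + alpha) with alpha = 15 is, to our understanding, the form used in the VADER reference implementation, but it should be read as an assumption here; the neutral_band parameter is a hypothetical addition that only illustrates how a band around zero changes the labels, and it does not reproduce the five categories used in the paper.

```python
import math


def normalize_compound(valence_sum: float, alpha: float = 15.0) -> float:
    """Squash a summed valence score into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


def sentiment_class(compound: float, neutral_band: float = 0.0) -> str:
    """Label a compound score; scores within +/- neutral_band count as neutral."""
    if compound > neutral_band:
        return "positive"
    if compound < -neutral_band:
        return "negative"
    return "neutral"
```

With the default band of zero, sentiment_class reproduces the three-class rule from the text; passing neutral_band=0.1 treats weak scores such as 0.05 as neutral instead of positive.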

Spatiotemporal Data Processing and Clustering
To understand the escalation of protests based on social media activity, we also performed spatiotemporal analysis using the tweets. Most of the tweets in our query results were posted from European countries; therefore, to keep the analysis concise we only considered countries from Europe, including Russia and Turkey. We kept the data aggregated at the country level, as language and the political characteristics of a country might influence tweeting behavior more strongly than other characteristics at the city level or other finer spatial scales. Furthermore, most of the time (except in Slovakia) the protests were held only in the capital.
Once the tweets were pre-processed and filtered (see Sections 3.1 and 3.2), we performed clustering to find countries with similar tweeting trends about Kuciak's murder, which would probably also indicate when protests took place or the presence of other influencing parameters such as the media or politics. For this purpose, we used Time Series Clustering in ArcGIS Pro 2.8, where time series data can be clustered based on three criteria: having similar values across time, tending to increase and decrease at the same time, or having similar repeating patterns. By identifying countries with similar patterns, we might be able to reveal the influencing parameters and how these parameters changed over time. Moreover, clustering provides a more concise and informative visualization and interpretation of the results than statistical values for 39 countries one by one.
For our analysis, we performed clustering based on the number of tweets over time for each country, to track the tweeting activity of the citizens in general. This means that we considered the second type of clusters (where values tend to increase and decrease at the same time but their absolute value is less relevant). For example, a time series with values (1, 0, 1, 0, 1) is more similar to a time series with values (10, 0, 10, 0, 10) than it is to a time series with values (1, 1, 1, 1, 1) because the values increase and decrease at the same time and stay in a consistent proportion. Therefore, we are able to avoid problems related to different population sizes and no normalization based on the population is needed for the clustering.
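The example above can be checked with a correlation-based distance, which is one common way to compare the shape of two series independently of their magnitude. This is a sketch of the general idea and not the ArcGIS Pro implementation; treating a constant series as having zero correlation is a convention chosen here, since the Pearson coefficient is undefined in that case.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    dev_a = [x - mean_a for x in a]
    dev_b = [y - mean_b for y in b]
    norm_a = math.sqrt(sum(d * d for d in dev_a))
    norm_b = math.sqrt(sum(d * d for d in dev_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a flat series has no profile
    return sum(x * y for x, y in zip(dev_a, dev_b)) / (norm_a * norm_b)


def profile_distance(a, b):
    """Distance that ignores magnitude: 0 for identical shapes, up to 2 for opposite ones."""
    return 1.0 - pearson(a, b)
```

For the three series from the text, the distance between (1, 0, 1, 0, 1) and (10, 0, 10, 0, 10) is essentially zero, whereas the distance to the flat series (1, 1, 1, 1, 1) is 1, which matches the intuition that only the first pair rises and falls together.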

Preparing Steps for Topic Modeling
The first step of topic modeling is tokenization, a technique for segmenting texts into smaller units. The algorithm divides the text at each space character to generate a list of separate tokens (unique words, numbers, and signs). We used the simple_preprocess function of the Gensim library for this step, which converts the tokens to Unicode strings, removes accent marks, and lowercases the strings [58]. Tokens shorter than three letters are discarded.
To filter out the most common bi- and trigrams (two- and three-word expressions) from a stream of sentences, we used the Gensim library. However, in order to set a proper filtering threshold, we first manually explored these multi-word expressions with the help of the scikit-learn CountVectorizer [59], which converts the text to a matrix of token counts. Then, we set the Gensim threshold to ignore those bi- and trigrams that bear well-known information, such as the fact of the murder, the victim's occupation, or the location of the assassination (e.g., "journalist jan", "murder of", "in slovakia", "the murder of", "of journalist jan"), for each cluster identified in the spatiotemporal analysis.
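The tokenization and the manual exploration of frequent bigrams can be approximated with the standard library alone. The sketch below is not the Gensim or scikit-learn code path; it only mimics the described behavior (lowercasing, accent removal, discarding tokens shorter than three letters) and counts adjacent token pairs.

```python
import re
import unicodedata
from collections import Counter

TOKEN = re.compile(r"[^\W\d_]+")  # runs of letters only


def tokenize(text: str, min_len: int = 3):
    """Lowercase, strip accents, and keep alphabetic tokens of at least min_len letters."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return [tok for tok in TOKEN.findall(plain) if len(tok) >= min_len]


def count_bigrams(documents):
    """Count adjacent token pairs across all documents."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        counts.update(zip(tokens, tokens[1:]))
    return counts
```

Calling count_bigrams(tweets).most_common(20) lists the most frequent pairs, which is the kind of overview needed to decide which expressions are too well known to be informative. Note that pairs are formed after short tokens are discarded, so words separated by a dropped token become neighbors.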

Lemmatization and Vectorization
The aim of this step is to reduce inflectional and derivationally related forms of a word to a common base form, similarly to a stemming approach. However, in the case of lemmatization, the part of speech of a word (POS tag), such as symbol, number, or verb, has to be determined first, and the normalization rules differ between parts of speech; thus, it is lexically more sophisticated. This method also involves grouping the inflected forms of each word, identified by the word's lemma, or dictionary form (for instance, "better" is lemmatized as "good", "cars" as "car"), so they can be analyzed as a single item, thus enhancing the significance of the topic-word associations [60]. For lemmatizing, we use the spaCy Lemmatizer [60], which provides rule-based lemmatization, with the setting to allow only proper nouns, verbs, and nouns in our LDA corpus, because our research concentrates on topics that primarily answer the question of who did what and where. It should be noted that in an earlier step (see Section 3.5.1) we already removed stop words (e.g., auxiliary verbs), which increased the speed and accuracy of the sentiment analysis; however, this step is not necessary for topic modeling, since the spaCy Lemmatizer is able to filter out certain POS tags, such as auxiliary verbs.
As a final step, the text corpus has to be converted into a vector format because LDA requires a document-word-count matrix and a word dictionary to create a "bag-of-words" corpus, i.e., a collection of words without information on the proper syntax.
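The bag-of-words conversion can be illustrated with a short standard-library sketch. It produces the same kind of output that LDA libraries typically expect, namely a token-to-id dictionary and, per document, a list of (token id, count) pairs; the function names are hypothetical.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def doc_to_bow(doc, token_to_id):
    """Convert one tokenized document to sorted (token_id, count) pairs."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())
```

Word order is discarded in this representation: only how often each token occurs in a document is kept, which is exactly the information a bag-of-words model uses.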

Performing the LDA Topic Modeling
For topic modeling, we used LDA with the Gensim library in Python on the final set of geolocated tweets [61]. To date, there is no generally established a priori parameter modeling approach for LDA. To find the most suitable parameters (alpha, beta, and the number of topics extractable from the dataset), we applied hyperparameter optimization that searches for the best setting on a validation corpus set (75%). We used the topic coherence measure (C_v) for performance comparison, which is considered to have the strongest correlation with human ratings [62]. The value of C_v combines an indirect confirmation measure that uses normalized pointwise mutual information (NPMI), cosine similarity, a Boolean sliding window, and the one-set segmentation of the top words. We applied this optimization approach to all clusters identified in the spatiotemporal analysis. Our hyperparameter optimization returned the following parameter settings: all clusters: α = symmetric, β = 0.91, and number_of_topics = 8; Cluster 1: α = 0.31, β = symmetric, and number_of_topics = 8; Cluster 5: α = 0.01, β = 0.91, and number_of_topics = 7.
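To make the NPMI building block of the coherence measure more tangible, the sketch below computes it for a pair of words from Boolean co-occurrence counts. It is a simplification under stated assumptions: whole documents are used instead of a sliding window, and the indirect cosine-similarity step of C_v is omitted, so the values are not C_v scores.

```python
import math


def npmi(word_a: str, word_b: str, documents) -> float:
    """NPMI of two words using Boolean document co-occurrence."""
    n = len(documents)
    doc_sets = [set(doc) for doc in documents]
    p_a = sum(word_a in d for d in doc_sets) / n
    p_b = sum(word_b in d for d in doc_sets) / n
    p_ab = sum(word_a in d and word_b in d for d in doc_sets) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 1.0  # the words always co-occur (limit value, avoids 0/0)
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)
```

Values close to 1 mean that two top words of a topic tend to appear in the same tweets, values around 0 indicate independence, and −1 means that they never appear together.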
The similar hyperparameter settings indicate that the corpus generally follows a symmetric distribution, in which the lower alpha indicates fewer topics, while the high beta represents increased topic-word density; consequently, the discussion revolved around a few themes. Finally, the tweets were classified according to the topic that produced the highest probability; then, we generated the 10 most frequent keywords for each topic. Keywords alone, however, do not always make the discussed topic clear; to overcome this limitation, we also identified the most representative tweet for each topic (see Section 4.4).
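The assignment of tweets to their most probable topic, and the selection of a representative tweet per topic, can be sketched as follows. The input format (one dictionary of topic probabilities per tweet) and the function names are assumptions made for illustration.

```python
def dominant_topic(topic_probs):
    """Return (topic_id, probability) of the most probable topic for one tweet."""
    return max(topic_probs.items(), key=lambda item: item[1])


def representative_tweets(tweets, all_topic_probs):
    """For each topic, pick the tweet assigned to it with the highest probability."""
    best = {}
    for tweet, probs in zip(tweets, all_topic_probs):
        topic, prob = dominant_topic(probs)
        if topic not in best or prob > best[topic][1]:
            best[topic] = (tweet, prob)
    return {topic: tweet for topic, (tweet, _) in best.items()}
```

Here the representative tweet of a topic is simply the one with the highest probability among the tweets assigned to that topic; other definitions are possible.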

General Spatial and Temporal Characteristics of the Tweets (RQ1)
After performing the pre-processing steps, we had a dataset of over 8000 tweets. Figure 3 shows how tweet counts varied daily in our analysis period for European countries. The most significant peak was observable on 28 February when Kuciak's unfinished work was published about the connections of the Italian mafia and Slovakian politicians, whereas there are several smaller peaks from 9 March onwards. These most likely represent the main protest day (9 March) and the news around the resignation of the Minister of Interior and the PM in Slovakia (12 and 14 March).
Figure 4 summarizes the absolute number of tweets per country. Users from Slovakia tweeted over 1800 times throughout the analysis period. The second most active country was Germany (ca. 1100), followed by Italy (around 800 tweets) and France (more than 500 tweets). Of course, these countries also have large populations, so to exclude the influence of population size on the number of tweets, Figure 5 also visualizes the daily distribution of tweets per country, normalized by population to make countries more comparable. To calculate the normalized value on each day for a country, we used the math.log1p() function in Python, which gives a reliable value also in the case of larger standard deviation and relatively small values.
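The exact normalization formula is not spelled out above, so the following is only one plausible reading: the daily tweet count is converted to a rate per million inhabitants, and math.log1p() is applied to that rate. The per-million scaling and the function name are assumptions.

```python
import math


def normalized_activity(tweet_count: int, population: int, per: int = 1_000_000) -> float:
    """log(1 + tweets per `per` inhabitants) for one country and one day."""
    rate = tweet_count / population * per
    return math.log1p(rate)
```

For example, 54 tweets in a country of roughly 5.4 million inhabitants correspond to a rate of 10 per million and a normalized value of about 2.40, whereas the same 54 tweets in a country of 83 million inhabitants yield a much smaller value. Because log1p(0) is 0, days without tweets need no special handling.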
There were eight countries with continuous tweeting activity throughout the 18 days considered for our analysis: Belgium, Czechia, Germany, France, Hungary, Poland, Slovakia, and Switzerland. Czechia, Hungary, and Poland are neighboring countries with a shared history, so their interest in the topic can be easily explained. Users from Austria, another neighboring country, also had high activity, except on 8 March, when no related tweet was posted. Germany, the United Kingdom, and France are big countries, and along with Belgium they represent strong political power in Europe or in the European Union, which may also explain why they were actively discussing the case. Malta, although a small country far away from Slovakia, also showed high interest in the topic, as a few months prior to the murder of Kuciak a journalist from Malta was also killed because of the investigations she was working on. The role of the Italian mafia was heavily discussed throughout the period due to the corruption among Slovakian politicians, so higher tweeting activity in Italy is also not surprising. The publisher of the journal that Kuciak was working for is in Switzerland, which probably also explains the high interest there, most likely thanks to the media and news reports. Further statistics about the tweeting activity of users per country can be found in the Supplementary materials.
Figure 6 shows the results of the clustering based on the number of tweets per country and their dynamics over time, whereas Table 1 summarizes which countries belong to each cluster and the main characteristics of the time series. The first cluster (blue) contains countries with high tweeting activity in the first few days (peak: 1 March) and a second, smaller peak at the end, when PM Fico resigned (14 March). Italy is the country with the most tweets in this group, probably owing to discussions about the responsibility of the Italian mafia for the murder and about corruption in Slovakia.
Users from the countries belonging to Cluster 2 (red) were the least active in tweeting about Kuciak's murder; tweeting activity in these countries remained low throughout the analysis period. Cluster 3 (green) has a similarly low activity level, with two smaller peaks on 28 February and 14 March, the dates of the two most significant events related to the murder (see Figure 1).
Cluster 4 contains only three countries and, interestingly, shows no peak at the beginning, when the murder and its motive were discovered. Instead, the peaks for these countries fall on 12 and 15 March and are more related to the resignation of the Prime Minister and other indirect consequences of the murder or of the journalist's work. The topic modeling results for these countries suggest that tweets discussing the murder and the subsequent events likely carry a strong political narrative rooted in the political systems of the countries in this cluster. For more details on the interpretation of these topics, see Section 4.4. Although the trend is clear, the absolute number of tweets is quite low (as in Cluster 3), which might also indicate that Twitter is not the most popular social media platform in these countries. Those who do use it might therefore be even less representative of the general population than users in countries where significantly more tweets were harvested, potentially distorting the results. Thus, we exclude these three clusters (Clusters 2-4) from the more detailed analysis of sentiment patterns and topic modeling, because they do not contain enough tweets for such an in-depth analysis.
In contrast, Cluster 5 (purple) has the most tweets (over 5500) and is the group to which Slovakia belongs. In this group, the highest peak occurred at the beginning, when the journalist was found dead and his unfinished work was discovered and published. Activity then slowly decreased, reaching its lowest number of tweets on 8 March, after which it increased again and reached secondary peaks on 9, 12, and 14 March.

Temporal Patterns of the Sentiment Values per Cluster (RQ2 and RQ3)
Figure 7 presents the daily mean sentiment values for each cluster, calculated as described in Section 3.5.2. The second half of the period (from 6 March on) is clearly more positive than the beginning. As Clusters 2-4 contain only a few hundred tweets, we focus mainly on Clusters 1 and 5 in detail (Figure 8). Overall, Cluster 5 tends to be even more positive than Cluster 1.
Figure 8 shows the country-specific results for the countries in Clusters 1 and 5, based on the mean compound score classes discussed in Section 3.6 (ranging from 1 to 5, with 5 being the most positive) and excluding neutral tweets (class 3) to highlight the range of sentiments even more. The most positive category occurs after 6 March and, interestingly, Slovakia tends to be more positive in this period than any other country. Relevant tweets from these days include "On friday, 9th March 2018 at 17:00 we will march again we demand a new and trustworthy government. Fico is over" and "finally, the president of slovakia has accepted the resignation of pm robertfico two weeks after the murder of a journalist, and elections are expected to choose a new government". Based on such tweets, we conclude that most people supported the claims about connections between the government and the mafia and were satisfied with the political consequences, such as the resignation of the PM.
Additionally, we checked the statistical significance of these trends using the original compound scores (before applying the sentiment classes). Among the countries with at least one tweet on each day of the analysis period, the increasing trend is statistically significant for Slovakia (99% confidence level) and Germany (95% confidence level).
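The significance check of the sentiment trend reported in this section can be illustrated with a small sketch. The text does not name the statistical test that was used, so the Mann-Kendall trend test below is a substitute chosen for illustration, and the daily mean compound scores are invented values rather than data from the study.

```python
import math
from itertools import combinations


def mann_kendall(values):
    """Two-sided Mann-Kendall trend test without tie correction.

    Returns the S statistic, the normal-approximation z score, and the p-value.
    """
    n = len(values)
    # S counts later-minus-earlier sign agreements over all ordered pairs.
    s = sum((later > earlier) - (later < earlier)
            for earlier, later in combinations(values, 2))
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return s, z, p_value


# Hypothetical daily mean compound scores for the 18 days (not study data).
rising = [-0.20 + 0.02 * day for day in range(18)]
flat = [0.10] * 18

s_rising, z_rising, p_rising = mann_kendall(rising)
s_flat, z_flat, p_flat = mann_kendall(flat)
print(f"rising: S={s_rising}, p={p_rising:.2e}")
print(f"flat:   S={s_flat}, p={p_flat:.2f}")
```

A p-value below 0.05 or 0.01 would correspond to the 95% and 99% confidence levels mentioned for Germany and Slovakia, respectively. A rank-based test is used here only because it makes no assumption about the distribution of the compound scores.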

Result of the Topic Modeling per Cluster (RQ2 and RQ3)
Figure 9 shows the eight most significant topics identified in the tweets of Cluster 1. The topics touch upon the events and news related to the peaks, such as the resignation of the PM, the Italian mafia, or the role of the European Union. Overall, the most significant topic consists of tweets that condemn the murder, followed by worried voices about press freedom and security. The third most discussed topic in Europe was Kuciak's article, which was published after his death on 28 February in English and Slovakian, and later in other languages (e.g., French). The article revealed several connections between the Slovakian governing elite and organized crime. The remaining topics discuss further findings of Kuciak's article (i.e., the 'Ndrangheta mafia in Slovakia, and the European echoes of the event). Table 2 shows the contribution of each topic in percent, which represents how likely it is that the representative tweet of a topic was discussing that topic or included its keywords. A value around 0.7 (70%) means that there is a 30% chance that the tweet discussed a different topic.
The topic modeling results for Cluster 5 (Figure 10, Table 3) contain seven identifiable topics. These topics are more distinct than those in Cluster 1 (the topic contribution percent values in Table 3 are higher), and they discuss not only the PM but also the role of the whole government, the protests, the mafia, the fiancée of Kuciak, and, interestingly, also Viktor Orban, the Hungarian PM, although Hungary was neither in this cluster nor among the most active countries in terms of tweeting behavior. The representative tweet of the most significant topic (Topic 2) revolved around the political crisis in Slovakia, especially the discussion about the resignations and the possibility of new elections. The next topic discusses the situation of press freedom in Europe; its urgency is underlined by the fact that some tweets drew a direct connection between the assassination of Kuciak and that of Daphne Caruana Galizia, a Maltese investigative journalist who was assassinated only a few months earlier, on 16 October 2017 [63]. The overrepresentation of this theme (Topic 4 and Topic 0) shows that the dissatisfied voices about the Caruana Galizia case further strengthened the Kuciak movement in the online sphere.
The tweets that drew a connection between the death of Caruana Galizia and that of Kuciak's fiancée could also be interpreted as a clear condemnation of violence against women, which may further strengthen this movement. Furthermore, Martina Kusnirova is represented twice among the topics, both by name and as fiancée, which may suggest that users in this cluster not only strictly condemned her death but also differentiated between an innocent death and a death related to work. The topic modeling also identified tweets discussing the connection between the Hungarian PM and George Soros, a Hungarian-born American billionaire who is a frequent theme in various conspiracy theories and fake news [64]. The reason for this relationship may be twofold. First, six months before the assassination of Kuciak, the Hungarian government started a countrywide billboard campaign portraying George Soros with the slogan "Don't let George Soros have the last laugh", thus making him a scapegoat for the refugee crisis in 2016 [65]. This may have created a solid base for the Slovakian PM Fico, who issued a political statement on 5 March 2018. In this statement, the PM raised questions about the connection between George Soros and the Slovakian President Andrej Kiska, who had declared the possibility of new elections a day earlier.
Through this political statement, PM Fico might have tried to discredit the president. Second, PM Orban also saw the "fingerprint" of George Soros behind the Slovakian crisis on 10 March [66]. Overall, this representation of Soros among the topics may indicate similar political tendencies among different countries; for example, as we have seen, the clustering algorithm put Hungary, Belarus, and Turkey in the same group.
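The topic contribution values reported in Tables 2 and 3 can be read as the share of a tweet's topic distribution that falls on its dominant topic. The sketch below illustrates this reading; the function name dominant_topic and the example distribution are hypothetical, and the topic model used in the study is not reproduced here.

```python
def dominant_topic(topic_probs):
    """Return (topic_id, contribution) for the most probable topic of one tweet.

    `topic_probs` maps topic ids to probabilities that sum to 1.
    """
    topic_id = max(topic_probs, key=topic_probs.get)
    return topic_id, topic_probs[topic_id]


# Hypothetical topic distribution of a single tweet (not taken from the study).
tweet_topics = {0: 0.05, 2: 0.70, 4: 0.25}

topic_id, contribution = dominant_topic(tweet_topics)
other = 1 - contribution  # probability mass assigned to all other topics

print(f"dominant topic: {topic_id}, contribution: {contribution:.0%}, other topics: {other:.0%}")
```

In this example the tweet is assigned to Topic 2 with a contribution of 70%, leaving a 30% chance that it discusses one of the other topics, which matches the interpretation given for values around 0.7.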

Discussion
The current work provided an in-depth analysis of the tweets posted in Europe in relation to the assassination of Kuciak. Twitter has many advantages as a data source for analyzing collective action, especially its high temporal resolution, which stems from people's immediate responses to events and news and which we intended to illustrate with our multilayered analysis. Although, to our knowledge, no comparable method can analyze social phenomena such as protests and the reactions to them at such a large scale and fine temporal resolution, our methodology is not free from limitations. It is well known that Twitter users are not representative of the whole population in terms of demographics; however, we can also hypothesize that, in the case of a protest or other follow-up events of a journalist's murder, the elderly and the very young are also unlikely to be highly active or involved.
Generally, user privacy is an important question in investigations that rely on the social media data of individuals. In this regard, it should be noted that we do not use or communicate identifiable user data; in fact, our research focuses on large-scale patterns and similar tendencies across countries. Individuals' private data is therefore not directly reflected in the outputs, and throughout the whole analysis it is only used to group users at the country level. Moreover, we analyzed collective action, and most of the tweets were published in response to the murder, so we interpreted them directly and did not infer any further information beyond a user's original motivation for posting a tweet.
To date, we are aware of only one study that analyzed the case of Kuciak and the public reaction to it using social media data. Kapanova and Stoykova (2019) conducted a timeline event analysis of the murder of Ján Kuciak to analyze the social networks related to the #AllforJan hashtag over a wider temporal span (28 February-28 July 2018), focusing on Slovakia. In this research, they examined the distribution of related hashtags. Their data collection comprises a total of 4611 tweets from 595 unique users (7.7 tweets per user), whereas our approach resulted in approximately 1800 tweets from 468 unique users (roughly 3.8 tweets per user) in Slovakia within only 18 days.
The current study analyzed only one case, so we cannot and should not draw general conclusions from the patterns we identified, either in terms of spatiotemporal characteristics or in terms of topics and sentiments. Still, based on news reports and other official sources, we can state that our analysis reflects what was going on in the countries we considered. Additionally, the workflow itself, comprising the whole pre-processing and the subsequent analysis steps to cluster the countries and to investigate topic and sentiment patterns in the data, can be considered transferable and can be used for the analysis of similar cases.
Moreover, the results of this general, exploratory analysis can provide a strong foundation for more specific analyses, for example country-specific investigations or studies focusing on a specific topic or day. Although we applied the current methodology to Twitter data after a particular event had already taken place, it is worth noting that both the subject and the region of the analysis are interchangeable with other themes and regions, and Twitter can be replaced by other social media data sources that have the required location attributes. Moreover, by using unsupervised topic modeling and an automatable approach, there is also the potential to adapt our workflow for early classification (for example, in the emerging phase of a protest) to predict the overall gravity of the analyzed collective actions.
Future research can also compare other cases to the murder of Kuciak to see whether more general conclusions can indeed be drawn about the connection between social media and protests or even social unrest. Based on past events, such as the riots after the death of George Floyd in the US [67], other studies found that pictures [68,69] or hashtags (#Ferguson, #oscargrant ...) shared on social media strongly influenced the societal or political consequences of a case [70].

Conclusions
Our work investigated over 8000 tweets to analyze users' reactions to the murder of Ján Kuciak and to the subsequent news and events in Europe. We provided a detailed pre-processing and data cleaning workflow to ensure high-quality results not only for the spatiotemporal and sentiment analyses but especially for the topic modeling. Thanks to its transferable nature, this workflow can also be used for other case studies in which Twitter data is analyzed in the light of similar events. Our topic modeling algorithm was able to identify the key topics, such as the connection to the Italian mafia, the murder of other journalists, or the resignation of the Prime Minister in Slovakia. The clustering algorithm was used to group countries with similar temporal patterns. We thus distinguished five groups, of which the two largest contained over 90% of the tweets and provided the most detail about the activity peaks. The first cluster, which includes countries such as Italy, France, and Austria, was less active in the second half of the analysis period, when the funeral and the resignation of the PM took place. The second large cluster includes Slovakia, Germany, and the UK, among others; it shows a larger peak both at the beginning, when the murder and the first connections of Slovakian politicians to the Italian mafia were discovered, and at the end (after 8 March), with the funerals taking place and the PM resigning. In terms of sentiment, this latter group was also more positive in the second half of the period, while Slovakia and Germany even show a statistically significant increasing trend in tweet sentiment throughout the whole analysis period.
The scientific contribution of our paper is twofold. First, it shows that geo-social media data can be utilized for generating a better understanding of political events even at a smaller spatial and societal scale and in a non-English language. Second, our methodological workflow combines time series clustering with semantic topic modeling and sentiment analysis, performed on georeferenced social media data, which provides multi-modal insights into the public's reactions to a specific political event.
Overall, the investigation of social media in the case of events such as the murder of a journalist seems an important tool to track and understand the immediate reaction of people, unlike any other method or source of information. Nevertheless, the current work only investigated one case study and therefore general conclusions about the flow of events should not be drawn. Yet, we showed the potential and also illustrated some possible future research directions to better understand this field of research by using novel analysis techniques.