1. Introduction
Keywords can be defined as special words embedded in a document that can provide a precise and accurate representation of that document’s content [1]. Keyword selection is a primary step in many areas, such as text mining [1,2], bibliometrics [3,4,5] and communication [6,7]. In general, selecting appropriate keywords helps to ensure the quality of collected data, reduce the cost of data cleaning and, accordingly, increase the rigor and significance of findings.
In recent years, social media, such as Twitter, Facebook and Weibo, has been cited as an important tool for scholars investigating a variety of fields, including social sciences [8], communication [6,7], politics [9,10], education [11,12], medical science [13], transportation [14] and disaster management [15,16]. Compared with traditional data collection channels, such as surveys or questionnaires, social media can yield a better understanding of public perceptions, decrease time and commercial costs in the data collection process, and display a greater variety and geographic distribution in the data [17,18,19]. However, since the volume of social media data is generally very large, redundant and invalid information should be controlled [20]. Such information includes rumors, advertising, purposeful information posted by spammers, posts with a disordered format or missing content, forwarded posts, and posts that contain too few words or make no actual sense. These data may have a negative influence on conclusions and accordingly affect the correctness and effectiveness of decisions. To address this problem, the data cleaning process plays an important role in social media analysis [17,21,22]. However, data cleaning is time consuming for social media-based data, and there is currently no rapid and effective way to examine its accuracy. Therefore, using a set of appropriate keywords that can accurately and comprehensively represent the researched topic is a practical alternative. Regarding the keywords in social media analysis, the primary issue is how to determine a list of keywords that retrieves the maximum number of posts closely related to the topic. Specifically, an effective keyword list can ensure the quality of the collected data and reduce the time costs of the subsequent data cleaning process [23,24]. On the other hand, if the selected keywords cannot adequately represent the characteristics of the entire topic, the reliability and accuracy of the subsequent analysis may be affected, which will in turn make it difficult to obtain reliable insights from the results [3].
However, previous studies have commonly neglected this process by assuming that the keywords used in their studies were extracted well and that there was no need to examine their applicability. For example, Han and Wang [25] studied the online posts on Weibo regarding the flood that occurred in Shouguang, a county-level city in Shandong Province, China. However, the only keyword they used was “Shouguang”, with no further elaboration beyond this single term. This may be due to the fact that the majority of online posts in social media containing the word “Shouguang” were related to the flood within their study’s time interval. Nevertheless, using “Shouguang” as the only keyword is likely to return irrelevant data that could negatively influence the final conclusions. In this case, we believe that the addition of “flood” to the keyword list would have been helpful.
In most social media analyses, there are some problems in the keyword selection process. For instance, because the same word can have multiple meanings in different contexts, irrelevant data can be collected. Cody et al. [26] used “climate” as a keyword when collecting public opinions about climate change; however, they found that not every tweet collected was about climate change. Thus, they had to filter the data manually, which consumed a considerable amount of time. Furthermore, Noh et al. [2] stated that the method for determining the number of keywords to select is critical, as it will affect the quantity and quality of the collected data. Specifically, if the number of keywords is relatively large, the quality will improve while the quantity decreases; if it is too small, the quantity will increase while the quality decreases.
To our knowledge, few studies to date have examined the role of keywords in social media analysis. Wang et al. [27] carried out a pioneering study by putting forward a novel technique named Double Ranking (DR), which has two steps: the first is to provide some general keywords related to the research topic based on personal experience, and the second is to use these words to collect possibly relevant online posts and extract more keywords based on the results of two rounds of ranking. This work first highlighted the importance of keyword identification in social media analysis and offered valuable explanations of how it differs from the traditional keyword extraction problem. As an extended study, Zheng and Sun [28] proposed an automatic keyword generation model by applying machine learning techniques and three properties: relevance, coverage and evolvement. However, this model can only be applied effectively to topics about major events, in which the number of online posts within each time window is substantial. In addition, there is no appropriate method for setting the size of the time window, especially for topics with short durations. Thus, putting forward a novel keyword selection method that is applicable to a wider range of topics in social media is of great significance. To address this gap in the literature, the current study investigates keyword selection in social media analysis and proposes a new keyword selection strategy based on social network analysis.
The original contributions of our research are listed below.
This paper highlights the important role played by the keyword selection process in social media analysis, a topic that has generally been neglected by most prior studies. We claim that the use of appropriate keywords for data collection can improve the quality of the data and accordingly greatly enhance the significance of a study. Thus, this paper draws researchers’ attention to keyword selection when using social media data for analysis, which could yield more accurate and persuasive research results.
Using a graph-based approach, we propose a new keyword selection method for social media analysis considering two different types of topics: conceptual topics and event-based topics. In particular, the normalized rich-club connectivity considering the weighted degree, closeness centrality, betweenness centrality and PageRank values are used to identify “rich keywords”, and community detection is applied to determine the keyword combinations for representing the research topic.
We evaluate our method using data related to four topics and compare it with four widely used keyword selection techniques. According to the results of the empirical test, our method can reach a balance between the quantity and quality of the data. In other words, it can greatly increase the amount of high-quality data. In addition, since social media is an essential data source for a variety of areas, especially for those studying public perceptions, our proposed keyword selection method can benefit future studies in various areas, including decision making, disaster management and policy development.
The overall structure of this paper is as follows.
Section 2 describes issues with keywords in social media analysis, including the general framework of social media analysis, the kinds of topics studied and existing keyword selection methods. The newly proposed keyword selection strategy is described in Section 3, and Section 4 provides a comparative analysis to test the performance of the proposed strategy. The research results are described and discussed in Section 5, and finally, Section 6 provides the conclusion and some directions for future research.
5. Results
In this section, we present the results of our experiments.
Table 3 shows the selected time intervals and primary keywords for the four topics. For Topic 1 and Topic 2, the primary keywords were determined based on the personal experience of scholars. For Topic 3 and Topic 4, the primary keywords were set based on their related texts. Then, by using these primary keywords, the online posts from within the set time interval that were related to the four topics were collected from Weibo. The number of collected messages for the four topics was 4201, 8226, 8319 and 5404, respectively. These data were also the results of the EB approach.
Word segmentation was performed on the collected online posts, and stop words and non-content-bearing words were removed from the results. For the SS approach, some additional keywords were selected based on word frequency and personal experience (Table 4). For DR, only two iterations were considered. With respect to LDA, each post was regarded as a mixture of latent topics, where a topic refers to a multinomial distribution over words. In addition, following related studies [66,67], the trial-and-error method was used to determine the optimum value of K. The values of K were 5, 5, 6 and 4 for the four topics. Then, the top two ranked keywords of each cluster were selected as a keyword combination. Table 4 shows the additional keywords for each topic. Finally, for SS, DR and LDA, the online posts related to each topic were collected from Weibo using the obtained additional keywords and the primary keywords, and duplicate posts were removed. The results are shown in Table 4.
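The LDA baseline step described above, in which the top two ranked keywords of each cluster form a keyword combination, can be sketched as follows. This is only an illustrative sketch: it assumes the topic-word weights have already been estimated by some LDA implementation, and the function name `top_keyword_pairs` and the toy weights are hypothetical rather than taken from the study.

```python
from typing import Dict, Tuple


def top_keyword_pairs(
    topic_word_weights: Dict[int, Dict[str, float]], n: int = 2
) -> Dict[int, Tuple[str, ...]]:
    """Return the n highest-weighted words of each topic as one combination.

    topic_word_weights maps a topic id to {word: weight}; the weights are
    assumed to come from an already fitted LDA model (not fitted here).
    """
    pairs = {}
    for topic_id, weights in topic_word_weights.items():
        # Sort by descending weight; break ties alphabetically for stability.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        pairs[topic_id] = tuple(word for word, _ in ranked[:n])
    return pairs


# Toy weights invented for illustration only.
toy_topics = {
    0: {"hospital": 0.30, "doctor": 0.25, "patients": 0.10},
    1: {"rights": 0.35, "blacklist": 0.40, "fair": 0.05},
}
keyword_pairs = top_keyword_pairs(toy_topics)
```

Each resulting pair would then be used as one combined query when collecting additional posts.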
Then, for our proposed strategy, keyword co-occurrence networks for each topic were built based on 300 keywords with high word frequency and practical significance, as displayed in Figure 2. In addition to the words that are directly related to the four researched topics, such as workplace violence, cyber violence, two sessions and permanent residence, those with higher frequencies mainly include hospital, doctor, fair and patients for Topic 1; network order, flesh search, real name system and keyboard man for Topic 2; Hong Kong, vaccines, epidemic situation and poverty for Topic 3; and racial discrimination, enjoy, justice and international for Topic 4. These results can roughly reflect how the public viewed the researched topics. Then, we calculated the values of δnorm(W), δnorm(C), δnorm(B) and δnorm(P) for the four topics’ keyword co-occurrence networks by using Formulas (1) and (3), and the results are shown in Figure 3, Figure 4, Figure 5 and Figure 6, respectively. The calculated normalized rich-club connectivity values for different ξ are almost all greater than 1 for the four topics and show a general upward trend during the beginning stage of the curves. This finding is consistent with that of Wei et al. [52], indicating a significant rich-club characteristic. In addition, some figures show discrete fluctuations, such as Figure 2a,b, while some others show continuous fluctuation, such as Figure 5b,d. This difference is mainly due to the different topological structures of the four keyword co-occurrence networks.
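To make the network construction concrete, the following minimal sketch builds a keyword co-occurrence network from segmented posts and computes each keyword's weighted degree. It assumes that two keywords co-occur when they appear in the same post and that the edge weight is the number of such posts; the paper's exact construction rule is not restated in this section, so both the rule and the function names are assumptions.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(posts, vocabulary):
    """Count, for each unordered keyword pair, the posts containing both.

    posts is a list of token lists (already segmented, stop words removed);
    vocabulary is the set of keywords kept as network nodes.
    """
    vocab = set(vocabulary)
    edges = Counter()
    for tokens in posts:
        # Use each keyword at most once per post; sort so pairs are canonical.
        present = sorted(set(tokens) & vocab)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


def weighted_degree(edges):
    """Sum the weights of all edges incident to each keyword."""
    strength = Counter()
    for (a, b), weight in edges.items():
        strength[a] += weight
        strength[b] += weight
    return strength


# Toy posts invented for illustration only.
toy_posts = [
    ["hospital", "doctor", "violence"],
    ["hospital", "doctor"],
    ["doctor", "rights"],
]
edges = build_cooccurrence(toy_posts, ["hospital", "doctor", "violence", "rights"])
strength = weighted_degree(edges)
```

In this toy network, "doctor" has the largest weighted degree because it co-occurs with every other keyword.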
Then, the volatility rate of each normalized rich-club connectivity curve was calculated for the four topics. Following our proposed rules for “rich keyword” identification, 5, 4, 6 and 3 “rich keywords” were obtained for Topic 1, Topic 2, Topic 3 and Topic 4, respectively. Specifically, for Topic 1, the rich keywords are “Yi Nao”, “physician”, “blacklist”, “rights” and “hospital”; for Topic 2, the rich keywords are “cyber violence”, “keyboard man”, “human flesh” and “network order”; for Topic 3, the rich keywords are “two sessions”, “Hong Kong”, “disease”, “proposal”, “poverty” and “vaccine”; for Topic 4, the rich keywords are “Ministry of Justice”, “international students” and “national treatment”.
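As a rough illustration of the quantity behind the rich-keyword identification, the sketch below computes a rich-club coefficient in its commonly used form: the density of links among nodes whose richness exceeds a threshold, divided by the same density in a reference (null) network. This is an assumption about the general technique and not a reproduction of Formulas (1) and (3), which are not restated in this section. The null edge list here is a hand-written stand-in; in practice it would typically come from randomizing the observed network.

```python
def rich_club(edges, richness, threshold):
    """Density of links among nodes whose richness exceeds the threshold.

    edges is an iterable of (node, node) pairs; richness maps node -> value
    (for example weighted degree, closeness, betweenness or PageRank).
    Returns None when fewer than two nodes are in the club.
    """
    club = {node for node, value in richness.items() if value > threshold}
    n = len(club)
    if n < 2:
        return None
    links = sum(1 for a, b in edges if a in club and b in club)
    return 2 * links / (n * (n - 1))


def normalized_rich_club(edges, null_edges, richness, threshold):
    """Observed rich-club coefficient divided by that of a null network."""
    observed = rich_club(edges, richness, threshold)
    expected = rich_club(null_edges, richness, threshold)
    if observed is None or not expected:
        return None
    return observed / expected


# Toy network and hand-written null network, both invented for illustration.
toy_edges = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}
toy_null_edges = {("a", "b"), ("a", "d"), ("b", "c"), ("c", "d")}
toy_richness = {"a": 2, "b": 2, "c": 3, "d": 1}

phi = rich_club(toy_edges, toy_richness, 1)
phi_norm = normalized_rich_club(toy_edges, toy_null_edges, toy_richness, 1)
```

A normalized value above 1 means that the rich nodes are more densely interconnected than in the reference network, which is the sense in which the text above speaks of a rich-club characteristic.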
Finally, community detection was performed for all four keyword co-occurrence networks. It should be noted that only the communities containing rich keywords were considered. For example, in Topic 4, there was a community containing the keywords “Weibo”, “pictures”, “comment” and others that could easily be determined as unrelated to the topic. Thus, such communities were not considered in the following steps. Then, based on the adjusted community detection results, several additional keyword combinations were formed. For example, for Topic 3, the additional keyword combinations included “Hong Kong + independence”, “disease + control”, “proposal + NPC”, “proposal + CPPCC”, “poverty + alleviation” and “vaccine + HPV”. Then, related online posts for each topic were collected from Weibo using these additional keyword combinations and the primary keywords, and duplicates were removed. The results are shown in Table 5.
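The filtering and combination step can be sketched as below. The community detection itself is assumed to have been run already, so the communities are given as sets of keywords. The pairing rule, which joins each rich keyword with its strongest co-occurring neighbor in the same community, is a hypothetical choice made for this example; the text above does not specify how the combinations were formed, and the edge weights are invented.

```python
def keyword_combinations(communities, rich_keywords, edge_weights):
    """Form (rich keyword, partner) pairs within each community.

    Communities that contain no rich keyword contribute nothing, which
    mirrors the rule that only communities with rich keywords are kept.
    The partner is the community member with the largest co-occurrence
    weight to the rich keyword (an assumed rule, see the lead-in).
    """
    rich = set(rich_keywords)
    combos = []
    for community in communities:
        members = set(community)
        for keyword in sorted(members & rich):
            best, best_weight = None, 0
            for partner in sorted(members - {keyword}):
                # Edges are undirected, so look up both key orders.
                weight = edge_weights.get(
                    (keyword, partner), edge_weights.get((partner, keyword), 0)
                )
                if weight > best_weight:
                    best, best_weight = partner, weight
            if best is not None:
                combos.append((keyword, best))
    return combos


# Toy communities and weights invented for illustration only.
toy_communities = [
    {"vaccine", "HPV", "epidemic situation"},
    {"Weibo", "pictures", "comment"},
]
toy_weights = {
    ("vaccine", "HPV"): 12,
    ("vaccine", "epidemic situation"): 5,
    ("Weibo", "pictures"): 30,
}
combos = keyword_combinations(toy_communities, ["vaccine"], toy_weights)
```

The second toy community is dropped because it contains no rich keyword, in the same way that the “Weibo”, “pictures”, “comment” community was excluded for Topic 4.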
Then, 10 people who had performed social media-based studies in the past manually identified the related online posts for each topic from the data collected by each keyword selection approach. Next, we calculated the values of the relevance coefficient r and the number of topic-related posts Nc. The final results are shown in Table 6 and Figure 7. The results show that although the size of the data can be increased by using the DR and LDA methods as compared to the EB and SS methods, the quality of the data fell dramatically because the selected additional keywords could not accurately represent the researched topic. In comparison, our proposed strategy could effectively address this shortcoming. In particular, although the relevance coefficient values for Topic 1, Topic 2 and Topic 4 under our proposed strategy were smaller than those of the EB method, we obtained the highest numbers of topic-related posts in three of the researched topics, namely 11,864, 15,873 and 16,951. In addition, the relevance coefficient for Topic 3 using our proposed strategy was even greater than that of the EB method, while the size of the data increased as much as 2.16-fold. These improvements were mainly due to the use of the additional keywords that were identified using graph-based techniques.
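The two evaluation quantities can be illustrated with a short sketch. It assumes that the relevance coefficient r is the share of collected posts that annotators judged topic-related, which is our reading of the text because the formal definition is not restated in this section. The labels below are invented and do not correspond to any value in Table 6.

```python
def relevance_summary(labels):
    """Return (n_c, r) for a list of manual relevance labels.

    n_c is the number of posts judged topic-related; r is assumed to be
    n_c divided by the number of collected posts.
    """
    n_c = sum(1 for is_related in labels if is_related)
    r = n_c / len(labels) if labels else 0.0
    return n_c, r


# Invented annotation results for two hypothetical keyword lists.
baseline_labels = [True] * 8 + [False] * 2    # small but clean collection
expanded_labels = [True] * 15 + [False] * 5   # larger, slightly noisier

baseline = relevance_summary(baseline_labels)
expanded = relevance_summary(expanded_labels)
```

In this toy case the expanded list lowers r from 0.8 to 0.75 while nearly doubling the number of topic-related posts, which is the kind of trade-off between quantity and quality discussed above.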
6. Conclusions
Like some previous studies [27,28,43], our work highlights the important role played by the keyword selection process in social media-based studies. We posit that keyword selection should be viewed as a primary step in social media-based studies as it directly determines the quantity and quality of the collected data and thus greatly affects the reliability of the corresponding findings and the validity of any decisions made based on these findings. In previous studies that have used social media-based data, the EB, SS, DR and LDA methods were employed for keyword selection, with the former two being more commonly used [22,23,25]. A major drawback of these approaches is that they ignore the relationships between keywords, which results in the collection of a large quantity of unrelated data and the omission of related data.
To solve this problem, we proposed a new strategy for keyword selection in social media-based studies that can increase the size and ensure the high quality of collected data while improving the reliability and validity of the corresponding conclusions. This strategy considers the co-occurrence relationships between different keywords for a particular researched topic. We used the normalized rich-club connectivity considering the weighted degree, closeness centrality, betweenness centrality and PageRank values to extract “rich keywords”, and community detection was performed to identify several keyword combinations that can entirely and accurately represent a topic. To test the performance of our proposed strategy, four topics were considered in an empirical experiment: medical violence, cyber violence, the National People’s Congress (NPC) and Chinese People’s Political Consultative Conference (CPPCC) Annual Sessions 2020, and the New Permanent Residence Law for Foreigners (Draft Version). The results have shown that, provided quality is assured, our proposed strategy can greatly increase the quantity of the collected data, which is of considerable importance for social governance [68]. In addition, while the use of graph-based techniques can be helpful for keyword selection, we still highlight the primary and indispensable role of scholar experience. In other words, unlike previous studies that simply emphasize the design of quantitative methods [43,45] in keyword selection tasks, we suggest that scholars should use their personal experience to optimize and adjust their keyword results based on qualitative methods, such as expert diagnosis and consultant systems.
In sum, the evidence indicates that our proposed strategy achieves a better balance between the quantity and quality of the collected data than the baselines across the four topics, which demonstrates its effectiveness in keyword selection for social media-based studies. Moreover, it should be noted that our empirical experiment only considered original online posts. However, comment data are also a main data source in social media-based studies [69]. In general, as comments are opinions directed toward a particular original online post, if the relevance of the original post can be ensured, the relevance of its comments can be ensured as well. Therefore, if the number of collected original online posts increases, the amount of comment data can also increase over time and ultimately improve the size and quality of the data for analysis.
The contributions of our main findings to addressing the challenges of sustainability are listed below. First, as social media has been one of the main data sources for exploring public perceptions and behaviors regarding sustainability issues [70], such as environmental pollution [71], sustainable education [72] and city development [73], our proposed keyword selection method could enhance the quality of the data collected from social media and accordingly yield a better and more accurate understanding of public opinions and behaviors. As a result, more effective solutions for addressing environmental and social sustainability problems could be designed. Second, social media has been commonly viewed as an important channel for spreading knowledge about sustainability science and concepts [74]. As our proposed model can offer useful guidelines for selecting the most important and relevant keywords from social media data, it contributes to knowledge discovery and to identifying the factors that influence knowledge dissemination. Third, as the keyword is one of the primary bases for fake news detection [75,76], which is an important part of sustainable education and emergency management, our proposed model can increase detection accuracy by improving the keyword selection process. Fourth, keyword selection is important in bibliometric analytics [77]. Therefore, our proposed model is helpful for providing a more accurate and comprehensive map of the existing literature related to sustainability. Overall, the contribution of this article is of great significance and has a wide range of applications for future studies focusing on sustainability.
There are, however, some limitations to this study. One major limitation of the current study is that we focused only on Weibo. We believe that similar shortcomings in keyword selection for social media-based data analysis can be found on Twitter and Facebook. Therefore, in our future studies, we will perform more empirical experiments focusing on Twitter and Facebook in consideration of the characteristics of their generated online posts. Moreover, since a purely quantitative approach for keyword selection may be insufficient, we will focus on how to combine the experience of scholars with quantitative methods more effectively and efficiently in the future in order to discover new strategies that can more precisely identify keywords for analysis.