Sports Information Needs in Chinese Online Q&A Community: Topic Mining Based on BERT

Abstract: Online Question and Answering (Q&A) communities have grown globally, allowing users to ask, discuss, and answer questions based on shared interests. As a gathering place for knowledge production, collaboration, and dissemination on today's Internet, the online Q&A community can intuitively reflect the public's information needs and behavior. It also accumulates a large amount of sports-related data and has become an effective vehicle for understanding mass sports information needs and disseminating sports knowledge. However, sports-related studies of online Q&A communities have rarely been reported. This study took the sports information in Zhihu, the largest Q&A community in China, as the research object to explore the public's needs for sports information in China. We collected 391,092 sports-topic questions from Zhihu with a self-written Python program and applied the BERT model for topic mining. Then, we explored the topic content, evolution trend, and user attributes of these questions. We found that the overall trend of sports information needs in Zhihu can be divided into three cycles: the London 2012 Olympic period, the Rio 2016 Olympic period, and the Tokyo 2020 Olympic period. The diversified content of information needs comprised eight first-level themes and 40 second-level themes. Male and female users showed both similarities and differences in sports information needs. Both had a need for fitness-related information. However, men were more concerned with strongly confrontational sports such as basketball and football, whereas women were more likely to care about weight loss, body shaping, and self-protection during sports activities. In addition, compared with men, women more often emphasized their gender attributes when expressing their needs for sports information in order to obtain more practical knowledge. In conclusion, our findings reveal that the sports community formed by the current online Q&A community in China is still a male-dominated information field.


Introduction
The acquisition, sharing, and dissemination of social knowledge have changed dramatically. Online Question and Answering (Q&A) communities that support users to ask, discuss and answer questions based on shared interests have become more and more popular in recent years [1,2]. With the primary function of attracting and promoting users to ask questions online and obtain answers from other community members, online Q&A communities construct a knowledge community around different topics and play an essential role in meeting people's information needs and disseminating social knowledge [3]. Quora, the world's most famous online Q&A community with about 300 million users in 2018, claims it is a place to share knowledge and better understand the world [4]. In China, Zhihu, launched in 2011, has become the most influential online Q&A community platform. By January 2019, the users of Zhihu exceeded 220 million, accumulated more than 130 million answers, and Zhihu had established a community-driven business model [5]. Online Q&A communities such as Zhihu have changed the traditional mode of people's knowledge exchange, and their knowledge production, information interaction, user characteristics and business models have gradually received widespread attention in the academic field [6][7][8][9].
As a platform for disseminating knowledge in the new media era, online Q&A communities can be seen as virtual knowledge aggregators that provide users with the flexibility to ask open-ended questions to a broad audience, the answers to which may be of great help to both the user and the community on specific topics such as health, sports or finance [10][11][12]. By 16 August 2021, the number of people following the sports topic in Zhihu had exceeded 10 million. Therefore, Zhihu has become a good channel for understanding Chinese public sports information needs. Exploring the sports information needs in online Q&A communities could deepen the understanding of sports ecology and development in online communities from users' perspectives. At the same time, it is of great significance to popularize scientific sports information knowledge and enhance public sports information literacy.
An essential goal of data analysis is to determine the shared characteristics of data points [13]. In text analysis, this usually means determining which events or concepts are discussed in a document. As a popular statistical tool, topic modeling is well suited to extracting latent variables from large text datasets [14]. Employing BERT, a pre-trained language model, our research centers on an analysis of all the questions asked by users about sports in Zhihu. Specifically, the study has two main objectives. The first is to investigate the sports information needs of people on Zhihu, China's online Q&A community platform, including the general profile of users' questions, such as temporal trends and user characteristics. The second is to explore the topic classification of sports information needs in Zhihu, and the temporal and user characteristics of these topics, through topic mining.
There are three main contributions of the paper. First, it explores the sports information needs of the public in online Q&A communities through big data mining and analysis, which provides a valuable perspective on our overall understanding of people's sports information interactions in the social media era. To the best of our knowledge, only a few papers have mined and analyzed the topics of information needs in online Q&A communities, mostly to identify topics of health information needs [15,16]. However, no study has elaborated on the characteristics of sports information needs in the Q&A community. This study examined a decade of the Chinese public's sports information needs in online Q&A communities. Second, the study of user attributes and topics supports not only the understanding of the Chinese public's sports participation but also the recognition of gender-specific and user-specific preferences for sports topics, and thus the understanding, from a sociological perspective of sports, of the power relations between male dominance and female subordination in online fields. Finally, given that these sports topics objectively reflect the level of concern of the Chinese public about sports issues and its demand for sports information, this study also provides useful guidelines for sports administrations and commercial companies seeking to optimize public sports policies or business solutions.

Literature Review
As a social media platform focusing on users' knowledge exchange, the online Q&A community accumulates tens of thousands of questions that people ask or answer every day. These questions form a diverse whole, supporting users searching for information. With the in-depth development of the Internet and new media technology, the online Q&A community has gradually shifted from search-engine-based interactive information service platforms such as Baidu Know and Answers to user-centered social network Q&A communities such as Zhihu and Quora. The user-centered communities tend to focus more on social interaction, with a well-established social network and feedback mechanism [17]. Therefore, it has also triggered the focus of researchers, and the online Q&A communities discussed below are such socialized communities.
Previous research on the social network Q&A community focuses on studies of users and their behavior and studies of the questions and answers of the communities. There are two main approaches of user-platform and platform-user in the user and behavior studies. The user-platform approach focused on exploring the motivation of user participation and platform content production by quality users. Bao et al. studied the drivers of user participation in online Q&A communities based on social cognitive theory [18]. They found that outcome-based expectations were positively correlated with user participation and that users' self-efficacy could positively influence their participation behavior. In the research approach of platform-user, Wang et al. [19] proposed an improved method for identifying key users of online Q&A communities from knowledge dissemination. Guo et al. [20] found that platform anonymity has an essential impact on user participation and content production and suggested that user anonymity be viewed in two ways for online Q&A community platforms. In the studies of the questions and answers of the communities, both commenting on others and receiving comments were significant motivating factors in users' continued use of online question and answer communities [21]. Shi et al. [22] collected answer data through a crawler, established three evaluation dimensions of textual, rhetorical, and emotional content, and then identified nine features that might affect the quality of answer content in Zhihu.
Information need is the basis for information behavior. Dervin defines it as an urge to understand the current situation when an individual is faced with a problem or concern or when there is a need to understand or make a choice [23]. Studies of user information behavior have found that the corresponding information behavior, such as information search, screening, and avoidance, occurs only after a person recognizes their knowledge needs [24]. Therefore, information need, as a motivating factor for information behavior, has also received academic attention [25]. Users' information needs evolve and are limited by time and space [26]. In the Internet era, social media has gradually become a diversified information field in which people address their information needs. As a result, scholars in different fields have conducted relatively abundant research on this subject. For example, Xing et al. [27] analyzed the type, content, and motivation of users' information needs through library microblogging data, which provided informative suggestions for libraries to understand users' needs and serve them better. Jia et al. [28] used a national survey during the COVID-19 outbreak to find that information needs influenced media use and media trust through the information matching mechanism, which helped people better understand the relationship among public information needs, media use, and media trust during emergencies. Some scholars have also explored the user needs of online Q&A communities. Wang et al. [29] analyzed the answers to weight loss topics in online Q&A communities to reflect the characteristics and gender differences of the current public needs for weight loss information. Huang et al. [30] improved the technical method for topic identification and analysis in online Q&A communities and evaluated the method's effectiveness using the topic of the elderly as an example.
Topic modeling has been widely used in studying topics related to Q&A communities. On the one hand, studies extracted hot topics of the Q&A community through topic modeling to detect its trending topics and provide references for recommending questions and answers [31]. On the other hand, studies used topic modeling to detect the trending topics in a particular field in the Q&A community to understand users' focus and information needs, mainly focusing on the areas of health information communication [32][33][34], science communication [35], data science [36], product user preference [37], tourism [38] and library user services [39]. Junghwa Bahng et al. [32] detected the topics of hearing loss on Naver Knowledge-iN through topic modeling to identify patients' perceptions, concerns, and needs regarding hearing loss. Zhang et al. [38] extracted the topics of tourism information in Zhihu through topic modeling to explore the tourism information needs of users in the context of COVID-19.
In terms of the use of topic mining models, previous research on Q&A communities applied the models including LDA, STM, BERT, and so on [32][33][34][35][36][37][38][39]. Zhao et al. [34] applied LDA on health topic information in Zhihu to explore internet users' needs for health information. Jiang et al. [35] used the STM model to extract the topics about climate change in Quora to investigate public opinion on science communication. Luo et al. [40] applied two BERT-based models on the Q&A site to detect pregnancy-related topics and concluded that the BERT-based models were better than the traditional models. In terms of the data processing, it was mainly divided into three main steps: First was data collection, which generally crawls data from Q&A communities. The second was data preprocessing, including Text Cleansing, Word Tokenization, POS Filtering, Stop-word Filtering and Word Stemming. The third was Topic modeling, including topic clustering and topic validation [32][33][34][35][36][37][38][39][40].
In terms of the Q&A communities studied, research has concentrated mainly on Quora, Zhihu, Reddit and Naver Knowledge-iN, covering multiple languages such as Chinese, English and Korean. Karbasian et al. [36] applied the LDA model to extract data science topics in two English Q&A communities, Stack Exchange and Reddit, to provide a path for detecting the trending topics of data science research. Han et al. [41] used the Korean-language KoBERT model to detect the topics of course teaching evaluation on a university online Q&A site. Qian et al. [42] used the Chinese BERT topic model released by Google to extract health-related topics in Chinese Q&A sites to understand the health information needs of the Chinese elderly. The research mentioned above shows that topic modeling tools, including BERT, are effective in multi-language environments and available for topic modeling of multi-language texts.

Methods and Data
Language model pre-training has brought a breakthrough to natural language processing (NLP) technology. Latent Dirichlet Allocation (LDA) [43] is a popular model in textual topic mining, whose main feature is that all documents in a collection share the same set of topics, but each document contains those topics in different proportions. In its iterative process, documents are observed one after another, while the hidden structure (the available topics, the topic distribution per document, and the topic assignment per word) remains hidden. Despite the popularity of this approach, it has many limitations: the number of topics must be fixed before modeling; irrelevant topics are likely to be generated; the identified topics are static and do not change over time; and semantic relevance is lost because the algorithm uses a bag-of-words model, which also performs poorly on small text data [44].
The transformer model, based on the self-attention mechanism, is the foundation of language model pre-training. GPT, BERT, XLNet, and other large-scale pre-trained language models are stacked and optimized on the basis of the transformer model. They rely on powerful computing resources to obtain a general language model and representation from easily accessible, unlabeled data, and the pre-trained model is then fine-tuned with the task corpus of the target NLP task so that it converges rapidly and improves accuracy in various downstream NLP tasks. Therefore, pre-trained language models have developed rapidly and been widely used since their inception, and they have become the core technology for various NLP tasks. Their effectiveness in various NLP projects is evident. As the parameter size and the amount of training data increase, pre-trained language models can improve in accuracy and generalization [44,45].

BERT Model
The BERT (Bidirectional Encoder Representations from Transformers) model is a natural language model pre-trained on a large-scale corpus under the pre-training and fine-tuning paradigm, which Google AI proposed in October 2018. It is an important recent research result in NLP, as it has significantly improved accuracy in several natural language processing tasks [46] and provides a good feature representation for word learning. BERT, BERT-like, and fusion-based models outperform traditional machine learning and deep learning models. BERT models include two training tasks: a masked language model (MLM) and Next Sentence Prediction (NSP). MLM is a good solution to the problem of information leakage in bidirectional modeling, while NSP helps the model understand the relationship between two texts, which is suitable for reading comprehension or textual entailment tasks [43,47]. Many studies have shown that BERT achieves good results in topic mining of texts in Chinese, English, Russian, Arabic and other languages [36,42,48,49]. This paper therefore adopted the BERT model for topic mining.

Data Sources
This study crawled the total raw data of 391,092 questions (as of 2 June 2021) under the sports topic of Zhihu with a self-written Python program. The data included the question, question time, questioner gender, questioner authentication status, and whether the question was anonymous, forming a question collection. After preprocessing, the questions were encoded with the BERT model.

Data Processing Process
The sentence_transformers module was used to load the pre-trained cross-lingual BERT model [50,51] and encode each preprocessed sentence into a 512-dimensional vector. The dimensionality of the feature vector should be reduced before clustering to avoid the curse of dimensionality. UMAP constructs a high-dimensional graph representation of the data and optimizes a low-dimensional graph layout to match it, which makes it an effective means of dimensionality reduction. Compared to t-SNE, UMAP is more efficient in processing [52]. Using UMAP, the 512-dimensional feature vectors were reduced to 20 dimensions. K-means is one of the most widely used clustering algorithms based on Euclidean distance; it assumes that the closer two targets are, the greater their similarity. It was used for clustering in this study, and the results are visualized in Figure 1.
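The clustering step described above can be illustrated with a minimal sketch. The study's own code is not given in the text, so this is not the authors' implementation: it is a plain K-means (Lloyd's algorithm) with Euclidean distance written with the standard library only, and the toy 2-dimensional points merely stand in for the 20-dimensional vectors that would come out of the embedding and UMAP steps. The function name `kmeans` and all parameter values are assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain K-means (Lloyd's algorithm) using Euclidean distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Toy stand-in for reduced sentence vectors (hypothetical data, two obvious groups).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

In practice a library implementation would normally be preferred over a hand-written loop; the sketch only makes the "closer means more similar" assumption of K-means concrete.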
After clustering, texts were classified into different topics. This study used the TF-IDF algorithm to derive the importance scores of the words in each topic and thus characterize each category. The TF-IDF algorithm, a common textual keyword mining method, considers that if a word appears more frequently in a topic and less frequently in other topics, it has a more significant influence on the core content of that topic [53]. TF-IDF involves two components, Term Frequency (TF) and Inverse Document Frequency (IDF), and is calculated as follows:

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

$$\mathrm{IDF}_{i,j} = \log \frac{N}{d_i + 1}$$

$$\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i,j}$$

where $n_{i,j}$ is the number of occurrences of the word $i$ in document $j$, and $\sum_{k} n_{k,j}$ is the total number of occurrences of all words in document $j$. $N$ is the total number of documents in the corpus, and $d_i$ is the number of documents in the corpus containing the word $i$; $d_i + 1$ is used in place of $d_i$ to prevent a zero denominator. $\mathrm{TF}_{i,j}$ is the normalized frequency of word $i$ in document $j$. $\mathrm{IDF}_{i,j}$ measures the ability of the word $i$ to distinguish document $j$, and it also adjusts the weight of $\mathrm{TF}_{i,j}$ to suppress words that occur with high frequency across documents, such as "的 (of)" and "和 (and)" [54]. The overall research idea is shown in Figure 2, in which a rhombus represents a process, and a box represents an outcome.
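As a rough illustration of the TF-IDF scoring described above, the following sketch treats the tokenized text of each topic as one document and scores words with the $d_i + 1$ smoothing mentioned in the text. It is an assumption-laden example rather than the study's code: the function name `tf_idf`, the natural logarithm, and the tiny English token lists are all hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per tokenized document."""
    n_docs = len(documents)
    # d_i: number of documents that contain word i at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)   # n_{i,j} for every word i in this document
        total = len(doc)        # sum over k of n_{k,j}
        scores.append({
            word: (n / total) * math.log(n_docs / (doc_freq[word] + 1))
            for word, n in counts.items()
        })
    return scores


# Hypothetical token lists, one per topic cluster.
topics = [
    ["nba", "player", "nba", "game"],
    ["yoga", "fitness", "yoga", "weight"],
    ["marathon", "running", "fitness", "running"],
    ["football", "league", "game", "league"],
]
scores = tf_idf(topics)
for topic_scores in scores:
    print(max(topic_scores, key=topic_scores.get))
```

With only a handful of documents, the $d_i + 1$ smoothing can push the IDF of a word that appears in nearly every document to zero or below, which is the intended suppression of uninformative high-frequency words.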

Results and Analysis
A total of 391,092 questions have been asked under the 'Sports' topic since 2010. Of these, 3020 questions were asked by authenticated users, 279,590 by non-authenticated users, and 1457 by users who have logged out of their Zhihu accounts. Of the total, 46,909 questions were asked by female users, 133,981 by male users, and 210,202 by users whose gender is unknown (no gender set, anonymous, or logged out).
Overall, most of the sports information needs in Zhihu are raised by non-authenticated users, accounting for 71.50%, while authenticated professional users account for only 0.77%. Among questioners of known gender, 74.07% were male. This reflects both the popular, non-expert character of sports information needs and the male-dominated character of information in online Q&A communities, and it further confirms the study by Vasilescu et al. [55], who found that men use online Q&A communities more frequently than women and post more questions and answers than women. In addition, users of online Q&A communities conceal their personal information to a high degree.
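The shares quoted above can be re-derived from the counts reported in this section. The short sketch below is only a worked check of that arithmetic; the variable names are hypothetical, and the male share is computed over questioners of known gender, as in the text.

```python
total = 391_092
by_auth = {"authenticated": 3_020, "non_authenticated": 279_590, "logged_out": 1_457}
by_gender = {"female": 46_909, "male": 133_981, "unknown": 210_202}


def share(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100 * part / whole


# Gender share is taken over questioners whose gender is known.
known_gender = by_gender["female"] + by_gender["male"]
male_share_known = share(by_gender["male"], known_gender)
authenticated_share = share(by_auth["authenticated"], total)
non_authenticated_share = share(by_auth["non_authenticated"], total)

print(f"male (known gender): {male_share_known:.2f}%")
print(f"authenticated:       {authenticated_share:.2f}%")
print(f"non-authenticated:   {non_authenticated_share:.2f}%")
```

From the counts as printed, the non-authenticated share comes out within about 0.01 percentage points of the 71.50% quoted in the text.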

Contents of Information Needs
According to the results of preliminary data processing, the sports information needs of Zhihu users were divided into 40 topic categories. However, the topic model can only show the keywords and their contribution to each topic category and cannot automatically generate a name for each topic. Following the way previous studies determined topic names, these 40 topic categories were named manually by reading the keywords with the highest contribution. In addition, this study randomly selected 100 original questions in each topic category to be read manually in order to confirm the topics and name them more accurately. For topic number 38, which had only four available subject terms, the questions were found to be highly consistent with the inquiry form "What were the highlights of the NBA regular-season game on X?" Therefore, the relevant subject line can summarize the topic. We then grouped the 40 topics into more concentrated themes: the 40 topic categories of information needs can be combined into eight primary categories (Table 1). In descending order of percentage, the primary topics are sports skills, sports events, sports shaping and weight loss, professional athletes and teams, Chinese sports and physical education, sports health, sports equipment, and sports experience. Among the 40 secondary topic categories, the most demanded information in Zhihu is sports slimming, followed by NBA player performance, European soccer league, and fitness consultation. Regarding temporal evolution, the trend of information needs for each primary topic is generally comparable to the overall trend of sports information needs in Zhihu, with slight differences among topics.
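The manual check described above, reading a random sample of 100 original questions per topic, could be set up roughly as follows. This is a hypothetical helper, not the procedure's actual code: the function name `sample_for_review`, the fixed seed, and the placeholder question strings are assumptions, and topics with fewer than 100 questions are simply returned in full.

```python
import random


def sample_for_review(questions_by_topic, per_topic=100, seed=42):
    """Draw up to `per_topic` questions at random from each topic for manual reading."""
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    return {
        topic: rng.sample(questions, min(per_topic, len(questions)))
        for topic, questions in questions_by_topic.items()
    }


# Placeholder data: topic id -> list of question texts.
questions_by_topic = {
    0: [f"question {i}" for i in range(250)],
    38: [f"NBA regular-season highlights {i}" for i in range(40)],
}
review = sample_for_review(questions_by_topic)
print({topic: len(sample) for topic, sample in review.items()})
```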
The information needs on sports skills first increased and then decreased, reaching a peak in 2017 and showing an increasing trend again in 2020. The information needs on sports events and on Chinese sports and physical education each had two significant peaks, the former in 2016 and 2018 and the latter in 2016 and 2019. The information needs on sports shaping and weight loss, professional athletes and teams, sports health, sports equipment, and sports experience first increased and then decreased, peaking in 2016 or 2017 (Figure 4).

Sports Information Needs of Different Gender Users
Analyzing the user attributes of the topics, we found that, among the eight primary topics, men are most concerned about sports events, accounting for 27.71%. Women are most concerned about sports shaping and weight loss, accounting for 21.42%, about twice the proportion of men. For both men and women, sports skills ranked second in information needs. For men, professional athletes and teams ranked third, at nearly three times the proportion of women. Regarding sports health information, women's needs are nearly twice as high as men's (Table 2). Stice et al. [56] found that women tend to have more negative evaluations of body size and appearance than men, and the resulting pressure to lose weight makes women feel more strongly negative about their bodies. Given the relatively wide range of information covered by each primary topic, the information requested by different genders for secondary topics was analyzed to further explore the similarities and differences between genders in online Q&A communities. Figure 5 shows the proportion of male and female information needs in each secondary topic. Thirty-seven out of forty topics have more male than female information needs. In the topics of NBA highlights and roasts and NBA regular-season highlights, all information seekers of known gender are male. The largest gaps between men and women are in basketball and soccer topics. The number of questions on exercise, body part shaping, and yoga learning is higher for women than men. The number of female questions on yoga learning is more than 50% higher than that of male questions. Table 3 shows the proportion of male and female users in the secondary topic information needs categories and reveals more specific differences in sports information needs between men and women. The top 10 sports information needs topics differ between the two groups.
For male users, NBA player performance, European soccer league, and NBA team are the topics with the highest information needs. Sports slimming, sports protection and rehabilitation, and middle and long-distance running and marathon performance improvement are the most popular topics for women. In general, men are more concerned about sports with intense confrontation, such as soccer, basketball, and fighting. At the same time, women are more concerned about sports that shape the body and sports with less confrontation, such as running, fitness, swimming, and self-protection during sports. The keyword word frequency analysis of the information needs of male and female users found that the keywords appearing more frequently in the information needs of male users are basketball and soccer, which indicates that men are more concerned about basketball and soccer sports. Those appearing more frequently in female users are fitness and weight loss, reflecting women pay more attention to the weight loss effect of fitness or sports. The keyword that appears more often in both male and female users is fitness, which indicates that both women and men have a greater demand for fitness information. Compared with men, women are more likely to emphasize their gender attributes, such as girls, when asking questions to obtain more practical information.

Characteristics of Sports Information Needs of Users with Different Authentication Attributes
From the perspective of different authentication attributes (authenticated, non-authenticated, and anonymous users), sports events and sports skills are the most important information content for all three types of users. Among authenticated users, sports events, professional athletes and teams, and Chinese sports and physical education all account for a higher share of demand than among non-authenticated and anonymous users. At the same time, sports experience is the only topic that is more popular among anonymous users than among authenticated and non-authenticated users (Table 4). Of the top five secondary topics (Table 5), authenticated users are most concerned with the topic of Chinese Super League and Korea-Japan World Cup, non-authenticated users are most concerned with sports slimming, and anonymous users are most concerned with NBA player performance. In addition, among the top five secondary topics, NBA team and Olympic Games discussions are unique to authenticated users. The topics of fitness consultation and of middle and long-distance running and marathon performance improvement are unique to non-authenticated users, and Esports tips are unique to anonymous users.

Limitations
There are three limitations of the study. First, we only studied one platform, Zhihu, which is the largest online Q&A community in China. It does not wholly represent Chinese netizens' sports information needs. Other online communities, such as Baidu Know, need further research to reflect their sports information needs more accurately. Second, in terms of the methodology, we chose the more mature BERT model, combining K-means and TF-IDF for topic analysis. With the development of NLP technology, more and more short text models suitable for texts of social networks have been created, and more accurate topic mining models could be used in future studies. Finally, the presentation of questions is just one side of the demand for sports information.

Discussion
The Internet is changing the way people disseminate and access information. Online Q&A communities, with their unique advantages of continuity, openness, timeliness, anonymity, and content diversity, have become an essential source of information and knowledge for the public, forming an online ecology that many disciplinary fields cannot ignore. In terms of quantitative trends in demand for sports information, the trend of an increase followed by a decrease does not indicate a decrease in users' needs for sports information. Although the number of relevant questions under the sports topic keeps increasing, not all of them are displayed. For many discussions, people ask a lot of repetitive questions. These repetitive questions can cause content and information to be scattered, causing trouble for the answerer, the reader, and the questioner. Zhihu employs a question redirection mechanism, which automatically redirects one question page to another so that discussions and thoughts about an issue can be presented more centrally on a single page. This mechanism is enabled when two or more questions are duplicates. Even when the wording differs, the mechanism is triggered if the questions are essentially about the same thing, so that the questions that remain are new ones. It also encourages users to search before asking a question and, if the question already exists, to read the answers under it instead of asking again [57]. After the peak in 2016, the number of sports information needs has remained above 10,000 every year, reflecting that Zhihu continues to mature and that public sports information needs tend to be stable.
The differences in sports information needs between male and female users are also noticeable, reflecting different motivations for sports participation among users of different genders. In terms of information needs, men pay more attention to strongly confrontational sports such as basketball and soccer, especially information about the NBA, European and American soccer, and the Chinese Super League. Women, on the other hand, pay more attention to less confrontational sports such as running, swimming, and yoga, and focus on the weight loss and shaping effects of sports and on sports protection issues. In addition, this study also found that women prefer to emphasize their gender attributes when expressing their needs for sports information in order to obtain more appropriate information. This suggests that women are disadvantaged in online Q&A communities, especially in sports topic discussions, where women tend to be marginalized and subconsciously view the discussion of sports information needs in Q&A communities as a male-dominated arena.
The need for sports information reflects, to some extent, people's concern about sports and the sports industry. This study can provide an intuitive understanding of the sports information needs of the Chinese public. For the sports authorities, this study could help them understand the sports hotspots people are concerned about and the problems in the sports participation process that surface in the online Q&A community. Therefore, based on the results of this study, sports authorities could provide solutions and strategies to improve the quality of the answers, address the problems raised by users in the online Q&A community, and make public sports participation more scientific. For example, sports coaches, players, and experts could be organized to answer the questions users are concerned about in the community. Secondly, the results show people's concern about sports music, sports equipment, scientific fitness and other topics, which could provide reference and direction for the products and services of sports enterprises. In addition, the exploration of user attributes contributes to the sociology of sports and sports communication.

Conclusions and Future Work
By taking the sports-related questions of Zhihu as a sample, this study found that the sports information needs in China's online Q&A community present the following three main characteristics. First, the number of sports information needs formed three distinct phases around the three Olympic periods, with the Rio Olympic period having the highest number of questions. Second, the information covers eight primary topics and 40 secondary topics, with rich content and a balanced proportion. The secondary topics are relatively balanced in number, with most topics accounting for between 2% and 3% of the overall number. The topics of sport-related information needs cover sports skills, sports events, sports shaping and weight loss, professional athletes and teams, Chinese sports and physical education, sports health, sports equipment, and sports experience, reflecting the diversity and richness of the sports information needs of users of the online Q&A community. Finally, the data based on the known gender of users showed that the percentage of male users was 74.07%. This indicates that male users in online Q&A communities pay more attention to sports information than female users and dominate sports information needs by number.
This article is an exploratory study, and it is a good attempt to study sports issues in online Q&A communities. There is much more work that can be completed in the future. First, more studies could focus on user issues, gender issues and power issues of sports topics in online Q&A communities to provide more academic exploration. User responses for Q&A community sports topics and quality content generation are also valuable research directions for sports development in the social media era. Second, in terms of information mining techniques, researchers can try more advanced and precise techniques to enhance the accuracy of their research. It is also worth noting that we anticipate seeing more studies on the varied consequences of sports information requirements' features. What fundamental changes will these features bring to the promotion of mass sports, China's sports industry's future development, and the link between sports and society? What are the strategies for optimizing and responding to them? These will significantly impact how people, the media, and sports interact in the future.