SocialTERM-Extractor: Identifying and Predicting Social-Problem-Specific Key Noun Terms from a Large Number of Online News Articles Using Text Mining and Machine Learning Techniques

In the digital age, the abundant unstructured data on the Internet, particularly online news articles, provide opportunities for identifying social problems and understanding social systems for sustainability. However, previous works have paid little attention to the social-problem-specific perspectives of such big data, and it is currently unclear how information technologies can use big data to identify and manage ongoing social problems. In this context, this paper introduces and focuses on social-problem-specific key noun terms, namely SocialTERMs, which can be used not only to search the Internet for social-problem-related data, but also to monitor ongoing and future events of social problems. Moreover, to alleviate the time-consuming human effort of identifying the SocialTERMs, this paper designs and examines the SocialTERM-Extractor, an automatic approach for identifying the key noun terms of social-problem-related topics, namely SPRTs, in a large number of online news articles and predicting the SocialTERMs among the identified key noun terms. To the best of our knowledge, this paper is the first attempt to identify and predict the SocialTERMs from a large number of online news articles. It contributes to the literature by proposing three types of text-mining-based features, namely temporal weight, sentiment, and complex network structural features, and by comparing the performances of these features with various machine learning techniques, including deep learning. In particular, when applied to a large number of online news articles that had been published in South Korea over a 12-month period and were mostly written in Korean, the experimental results showed that Boosting Decision Tree gave the best performances with the full feature sets, demonstrating that the SocialTERMs can be predicted with high performance by the proposed SocialTERM-Extractor.
Eventually, this paper can be beneficial for individuals or organizations who want to explore and use social-problem-related data in a systematic manner to understand and manage social problems, even if they are unfamiliar with the ongoing social problems.


Social Problems and Challenging Issues for Identifying Ongoing Social Problems
Facing social problems that challenge well-being and sustainability, e.g., the high suicide rate and air pollution with fine dust in South Korea, research and development (R&D) projects have been promoted, in addition to political and administrative measures, not only to solve social problems but also to improve the quality of people's lives by doing so. Meanwhile, handling a large amount of textual data is currently unavoidable when extracting the SocialTERMs for the detected SPRTs. Doing this manually requires significant human effort and is labor-intensive, expensive, time-consuming, and often error-prone. In addition, to reflect the importance of key noun terms in the textual data at different levels, e.g., the document level and the detected topic level, several weighting schemes have been used in previous works, e.g., tf, idf, and tfidf [9]. However, it is unknown whether those weighting schemes can reflect the different roles of key noun terms in representing social problems over time.

Key Term Identification in the Previous Text Mining Applications
With the emergence of Web 2.0 and social media, the amount of unstructured data, most of which are textual and publicly available on the Internet, has increased massively, especially the amount of data on individual entities such as persons and companies. This big data creates new opportunities for both qualitative and quantitative researchers of data and information sciences. Thus, big data is essential not only for scientific research on social systems but also for businesses and individuals [10], and it is important to develop a method that helps people obtain the relevant data quickly and accurately from big data and analyze it. To address this, text mining has been employed, with a focus on analyzing the statistical properties of terms [11]. In particular, key terms are essential for exploring the overall data set, and text mining uses them to inspect and process the obtained texts through typical preparation steps. Table 1 shows recent studies (2014-2018) in which key terms were extracted and used for text mining applications. The previous works in Table 1 can be grouped according to the final application of their identified key terms: indexing, clustering, summarization, classification (or categorization), or mapping [12,13]. Indexing, in which textual data are represented with a set of extracted key terms, is a research goal on its own, and is also a step in most text mining applications, such as feature generation and text representation [14]. Clustering groups textual data on the basis of their attributes to identify important themes, patterns, or trends [15], and it is employed for topic detection (TD) [16]. Summarization focuses on creating a summary that contains the most important points of the original documents [17]. Classification assigns textual data to two or more categories [18,19].
Mapping focuses on information visualization and supports effective and efficient searches of important subjects or topic areas, which are identified from textual data [15,20].
In addition, Table 1 provides three taxonomies for the key term identification of the previous works. First, the key terms can be discovered to best describe the textual data at different levels: the sentence level [21][22][23], the document level [24], and the topic level [6]. Second, the key term identifications used in the previous text mining applications can be divided into three categories: manual, automatic, and hybrid approaches. Third, particularly for the automatic approach, four types of techniques have been used: statistical, linguistic, machine-learning-based, and hybrid approaches [12,24,25]. While the statistical approaches do not require any learning mechanism and use statistical information of terms, e.g., tf, idf, and tfidf [26,27], the linguistic approaches use linguistic features of terms, e.g., parsing, sentiment analysis, and semantics [28,29]. Machine-learning-based approaches use key terms that are extracted from the collected textual data by means of a training process and apply them to a machine learning model to find key terms in new textual data [12,28]. The hybrid approaches combine one or more techniques [17,30].
Among the listed works (data sources in parentheses), Almeida et al. [31] proposed a method to normalize and expand original short and messy text messages (short message service (SMS) messages). Rao et al. [32] presented a new term weighting scheme called LGT, which jointly models the Local element, Global element, and Topical association of each story (online news articles). Lin et al. [33] proposed an Explicit Emotion Signal based cross-media sentiment learning approach (microblog, i.e., Sina Weibo, posts). Jiang, Chen, Nunamaker and Zimbra [6] proposed a novel stakeholder-based event analysis framework that uses online stylometric analysis and partitions stakeholders' messages into different time periods of major firm events (web forum posts). Alruily et al. [34] examined the crime domain in the Arabic language (unstructured text) using text mining techniques, and presented the development and application (online news articles). Lo et al. [35] presented an unsupervised multilingual approach for identifying highly relevant terms and topics from the mass of social media data.

Summarization
Zhang et al. [37] presented six term clumping steps that can clean and consolidate topical content in text sources for tech mining (research articles). Zheng, Lin, Wang, Lin and Song [23] studied an approach to extract product and service aspect words, as well as sentiment words, automatically from reviews. Li et al. [38] proposed a convolutional neural network (CNN)-based opinion summarization method for Chinese microblogging systems (microblog, i.e., Sina Weibo, posts). Hu, Chen and Chou [17] proposed a novel multi-text summarization technique for identifying the top-k most informative sentences of hotel reviews.

Classification
Weichselbraun et al. [39] presented a novel method for contextualizing and enriching large semantic knowledge bases for opinion mining, with a focus on Web intelligence platforms and other high-throughput big data applications. Xu, Zhang and Wang [22] proposed a support vector machine (SVM)-based approach to identify implicit features from Chinese customer reviews (reviews). Peetz, de Rijke and Kaptein [30] proposed a feature-based model based on three dimensions, i.e., the source of the tweet, the contents of the tweet, and the reception of the tweet (tweets). Li and Liu [40] established a classification model that predicts the temporal class of an original microblog's retweeting time series by using readily available social-influential, topical, and temporal factors.

Mapping
Jung and Segev [41] proposed methods to analyze how communities change over time in the citation network graph, without additional external information, based on node and link prediction and community detection (research articles). Lee et al. [42] suggested a way of technology opportunity identification that is customizable to the R&D capabilities of small and medium-sized enterprises (SMEs).
Sustainability 2019, 11, 196 6 of 44
According to Table 1, most prior text mining applications are either automatic or hybrid approaches, and they have adopted either statistical or hybrid techniques for extracting their key terms. Theoretically, under these taxonomies, this study can be categorized as classifying the topic-level key noun terms from online news articles into the SocialTERMs and the EventTERMs by adopting a hybrid technique, as highlighted in Table 1. Consequently, the key theoretical contributions of this paper can be summarized as follows: First, to the best of our knowledge, there exists a research gap that no previous work has dealt with: the automatic classification of key noun terms that were identified by TD into the SocialTERMs and the EventTERMs. This paper contributes to addressing this research gap.
Second, to label the identified topics, most of the previous works in Table 1 used simple statistical approaches for characterizing key terms from clustered documents. On the other hand, this paper proposes and employs temporal weight features, sentiment features, and complex network structural features to represent key noun terms, which can be identified to label the detected SPRTs, after reviewing the features that were used in the previous works of Table 1.
Third, according to Table 1, no previous study has compared the performances of the state-of-the-art classification techniques, particularly deep learning, for distinguishing between the SocialTERMs and the EventTERMs among the key terms of the detected SPRTs. This paper extends the related literature by taking on such a challenging issue.

Purpose and Organization of This Paper
To resolve the abovementioned challenging issues, this paper proposes an automatic approach, namely SocialTERM-Extractor, which identifies the SPRTs from a large number of Korean online news articles and classifies the key noun terms of the detected SPRTs into the SocialTERMs and the EventTERMs.
To design and examine the proposed approach, three research questions can be formulated as below, and a research framework is constructed to answer those research questions:

RQ1. How well do the three types of features, namely temporal weight, sentiment, and complex network structural features, perform in distinguishing the SocialTERMs and the EventTERMs among the key noun terms of the detected SPRTs from a large number of online news articles by using different classification techniques? Moreover, which feature set and features give the best results?
RQ2. Which classification technique among the five base learners, namely Decision Tree (DT), Naïve Bayes (NB), Radial Basis Function Network (RBFN), Support Vector Machine (SVM), and Deep Belief Network (DBN), is best suited for differentiating the key noun terms of the detected SPRTs into the SocialTERMs and the EventTERMs?
RQ3. Which ensemble learning method gives the best results? Is there a single ensemble method that achieves the best performances for all feature sets with any given base learner?
The rest of the paper is organized as follows: Section 2 outlines the research framework proposed to design and examine the SocialTERM-Extractor, and explains it in detail. Section 3 presents the results of applying the suggested research framework to the online news articles, which are collected from the best-known Korean news portal site in South Korea. Section 4 discusses the application results in terms of designing an automatic system. Finally, Section 5 presents the conclusions of this paper with reflections on limitations and future works.

Materials and Methods
To answer the research questions in the previous section, a research framework is proposed as summarized in Figure 1. First, online news articles reporting social-problem-related events are collected from a test-bed news portal site. Second, sentiment analysis selects online news articles with negative sentiment from the collected data. Then, from the online news articles having negative sentiment, the SPRTs are detected and labelled by their key noun terms. Third, the three types of features, i.e., temporal weight, sentiment, and complex network structural features, are measured to represent the key noun terms of the detected SPRTs.


Collect Data
In the first component of Figure 1, news sections related to society are targeted for data collection. Then, the data collection is performed mainly by two steps, crawling and parsing: First, a distributed web-crawling program is developed to collect the online news articles from the Internet in a significantly reduced timespan. In detail, the distributed web-crawling program is based on the simple remote procedure call (SRPC) framework, in which two tasks from a master computer are delivered to slave computers with various hardware configurations, i.e., a uniform resource identifier (URI) to crawl and how to crawl the given URI. Consequently, a large number of online news articles, published in the chosen society-related news sections of a test-bed news portal service, are collected as raw HTML pages.
Second, the textual data in <title>…</title> and <content>…</content> of the collected online news articles are parsed out from the raw HTML pages, and are stored in a relational database. In addition, the publication date of each online news article is stored in the database for the TD. This results in NEWS_0,t = {news | online news articles, published at time t and collected from the chosen society-related news sections of the test-bed news portal service} for time t = 1, . . . , T.
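As an illustration of this parsing step, the following sketch extracts the title, content, and publication date from one raw page. It is a minimal sketch: the flat `<title>`/`<content>`/`<date>` markers and the regex-based extraction are simplifying assumptions, since the portal's actual markup is not reproduced in the text and a production system would use a proper HTML parser.

```python
import re

def parse_article(raw_html):
    """Pull the title, content, and publication date out of one raw page.

    Illustrative only: the flat <title>/<content>/<date> markers below stand
    in for the portal's real markup, and a production system would use a
    proper HTML parser instead of regular expressions.
    """
    def first(pattern):
        m = re.search(pattern, raw_html, re.DOTALL)
        return m.group(1).strip() if m else ""

    return {
        "title": first(r"<title>(.*?)</title>"),
        "content": first(r"<content>(.*?)</content>"),
        "date": first(r"<date>(.*?)</date>"),  # hypothetical date marker
    }

raw = ("<title>Fine dust warning</title>"
       "<content>Seoul issued an alert.</content>"
       "<date>2018-03-02</date>")
article = parse_article(raw)  # stored in a relational database in the paper
```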

Select Online News Articles with Negative Sentiment
In this step, online news articles with negative sentiment are selected to focus on the online news articles more related to social problems. Considering the wide applicability to different languages, the sentiment of an online news article is obtained based on a multilingual sentiment feature set, constructed in the two main parts proposed in Dang et al. [43]: the extraction of an English sentiment feature set from SentiWordNet (http://sentiwordnet.isti.cnr.it/), and the construction of a multilingual sentiment feature set.
To explain, for the multilingual sentiment features in SentiWordNet, the average polarity score is calculated by using the prior-polarity formula, defined as

score(senti, pos, pol) = Σ_{synset ∈ SYNSET(senti, pos, pol)} swnscore(synset, pos, pol) / n(∪_{pol} SYNSET(senti, pos, pol)), (1)

where senti is a sentiment feature in SentiWordNet; pos is a sentiment-related part-of-speech (POS) sense, pos ∈ {verb, adverb, adjective}; pol is a type of polarity score, pol ∈ {objective, positive, negative}; SYNSET(senti, pos, pol) is the set of synsets, i.e., synonyms, belonging to senti for the given pos and pol; and swnscore(synset, pos, pol) is the SentiWordNet score of synset with the given pos and pol. From score(senti, pos, pol), the final sentiment score is determined by the sentiment feature-calculation strategy. In the strategy, only the sentiment features satisfying both score(senti, pos, pol = objective) < 0.5 and |score(senti, pos, pol = negative)| ≥ |score(senti, pos, pol = positive)| are taken into account, and the final negative sentiment score of a multilingual sentiment feature is calculated as

finalnegscore(senti, pos) = 0 if |score(senti, pos, pol = negative)| < |score(senti, pos, pol = positive)|, and |score(senti, pos, pol = negative)| otherwise. (2)

Then, using the constructed multilingual sentiment feature set, the negative sentiment score of an online news article, news, is obtained by averaging finalnegscore over the sentiment features appearing in the article, where NEWSSENTI(news) is the set of the multilingual sentiment features appearing in news ∈ NEWS_0,t.
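The feature-calculation strategy and the article-level negative score above can be sketched as follows; the function names are illustrative, and the polarity scores are assumed to be precomputed from SentiWordNet.

```python
def final_neg_score(score_neg, score_pos, score_obj):
    """The feature-calculation strategy in miniature: keep a sentiment
    feature only if it is sufficiently subjective (objective score < 0.5);
    its final negative score is |negative| unless the positive polarity
    dominates, in which case it is 0."""
    if score_obj >= 0.5:
        return 0.0
    if abs(score_neg) < abs(score_pos):
        return 0.0
    return abs(score_neg)

def news_neg_score(feature_scores):
    """Average the final negative scores of the sentiment features appearing
    in one article, i.e., over NEWSSENTI(news)."""
    return sum(feature_scores) / len(feature_scores) if feature_scores else 0.0

kept = final_neg_score(-0.75, 0.10, 0.20)     # subjective, negative-dominant
dropped = final_neg_score(-0.10, 0.50, 0.20)  # positive polarity dominates
```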
In particular, this study uses the multilingual sentiment feature set of English and Korean constructed by Suh [8], for the following reasons: it is based on the commonly used approach for measuring multilingual sentiment proposed by Dang, Zhang and Chen [43]; it enables researchers of other lingual cultures to make use of this study's research framework; Korean, the language selected for this study, is covered as the non-English language of the multilingual sentiment features; the sentiments of the synonyms of an English sentiment feature are considered for the corresponding Korean sentiment feature; and additional Korean sentiment features are generated and included to handle negation.
To explain briefly, a Korean sentiment feature constructed by Suh [8] inherits the final sentiment score and POS sense of its corresponding English sentiment feature, generated by Dang, Zhang and Chen [43]. For instance, as shown in Table A1 of Appendix A, the sentiment value of '더럽히/pvg+다/ef' is −0.7500 for pos = verb, and comes from the corresponding English sentiment feature, 'soil'. If the English sentiment feature has synonyms, the final sentiment scores of the synonyms are averaged for the corresponding Korean sentiment feature. For example, the sentiment value of '즉시/mag' is the average of the sentiment values of five English sentiment features: 'instantly', 'straight_away', 'right_away', 'at_once', and 'swiftly'. Moreover, if the morphological analysis splits a Korean sentiment feature into a stem and ending(s), extended Korean sentiment features are generated by adding various endings to the stem in the possible POS senses and tenses. For instance, the stem of '더럽히/pvg+다/ef' is '더럽히/pvg', and the extended Korean sentiment features of '더럽히/pvg+다/ef' are listed in Table A2 of Appendix A. They inherit the sentiment value of the original Korean sentiment feature, '더럽히/pvg+다/ef', and, when negation is added to their endings, their sentiment values are multiplied by −1.
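The inheritance-and-negation rule can be sketched as below; the endings and their negation flags are hypothetical examples, not the actual ending inventory used by Suh [8].

```python
def extend_with_negation(stem, base_score, endings):
    """Sketch of generating extended Korean sentiment features: each ending
    is attached to the stem and inherits the base sentiment value; endings
    flagged as negations flip the sign. The endings passed in below are
    hypothetical examples, not the actual inventory of Suh [8]."""
    extended = {}
    for ending, is_negation in endings:
        extended[stem + "+" + ending] = -base_score if is_negation else base_score
    return extended

features = extend_with_negation("더럽히/pvg", -0.75,
                                [("다/ef", False), ("지 않다/ef", True)])
```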
As a consequence, to select the online news articles more concerned with social problems, the online news articles with newsnegscore(news) > 0 are chosen for the TD of the next step. This leads to NEWS_1,t = {news | online news articles that were published at time t and turned out to have negative sentiment, selected from NEWS_0,t} for time t = 1, . . . , T.

Detect the SPRTs from the Collected and Negative Online News Articles
An event is defined as a real-world incident that is related to specific time(s) and location(s), e.g., the 9/11 attacks of 2001, Hurricane Katrina of 2005, and North Korea's nuclear weapon tests [44]. Due to the rapid growth and popularity of the Web, when an event occurs, a large number of event-related textual data are published online [45]. Generally, online news articles are the starting points, and Web 2.0 has recently led to the tremendous distribution of online news articles through individuals on social media [46,47]. As a result, managing, interpreting, and analyzing such a huge volume of event-related online news articles has become a difficult task. To address this, the many online news articles that are related to a set of events and interconnected with one another need to be grouped into the same topic [16]. Then, such topics and their changes can be identified over time by using TD methods [48]. Formally, a topic is a seminal event that is associated with all related events, that is, a set of related events [49].
Therefore, to detect the topics of this study's interest, i.e., the SPRTs, this step clusters the online news articles that were collected and evaluated to have negative sentiment. First, noun terms are identified from the online news articles through a series of natural language processing (NLP) techniques, i.e., spacing, part-of-speech (POS) tagging, regular-expression-based noun extraction, and stop word removal. Only noun terms are used for the TD for the following reasons: first, the target key terms of this study, i.e., the SocialTERMs and EventTERMs, are noun terms according to the introduction of this paper; second, the other types of key terms, i.e., verbs, adjectives, and adverbs, are more relevant to sentiments than to topics [6][7][8]. Next, let news be an online news article in NEWS_1,t, and noun be a noun term in NEWSNOUN_0(news) = {noun | all noun terms in news}. Then, the weight score of noun in news ∈ NEWS_1,t is obtained by w(noun, news) = tf(noun, news) × idf_t(noun) × ths(noun, news).
Here, tf(noun, news) is the normalized frequency of noun appearing in news, obtained by normalizing f(noun), the frequency of noun in <content>…</content> of news. In addition, idf_t(noun) is the inverse document frequency of noun, defined as idf_t(noun) = log(H_t / h_t(noun)), where h_t(noun) is the number of online news articles containing noun among the online news articles in NEWS_1,t, and H_t is the number of online news articles in NEWS_1,t. On the other hand, ths(noun, news) indicates the existence of noun in <title>…</title> of news, given by ths(noun, news) = 1 if noun appears in <title>…</title> of news, and 0.5 otherwise.
Using the obtained w(noun, news) values, the five noun terms with the highest weights are selected as the key noun terms for news. This results in NEWSNOUN_1(news) = {noun | five key noun terms for news}, and news is represented by the vector of its five key noun terms because of its simplicity compared to the other textual representation models, e.g., the graph-based model and the fuzzy set model [50,51]. One may ask how the number of key noun terms for an online news article should be decided; this study follows the number used to represent an online news article in the previous works, i.e., three to five keywords [6,8,52,53]. Therefore, in this study, the number of key noun terms for an online news article is set to five by default.
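A minimal sketch of the w(noun, news) = tf × idf_t × ths weighting and the top-five selection is given below; normalizing tf by the maximum in-article frequency is an assumption, since the exact tf normalization is not reproduced in the text.

```python
import math
from collections import Counter

def key_noun_terms(news_nouns, title_nouns, corpus_nouns, k=5):
    """Rank the nouns of one article by w = tf * idf_t * ths and keep the
    top k. corpus_nouns holds one noun set per article in NEWS_1,t;
    normalizing tf by the maximum in-article frequency is an assumption."""
    freq = Counter(news_nouns)
    max_f = max(freq.values())
    H = len(corpus_nouns)

    def weight(noun):
        tf = freq[noun] / max_f                    # normalized term frequency
        h = sum(1 for doc in corpus_nouns if noun in doc)
        idf = math.log(H / h) if h else 0.0        # inverse document frequency
        ths = 1.0 if noun in title_nouns else 0.5  # title-existence score
        return tf * idf * ths

    return sorted(set(news_nouns), key=weight, reverse=True)[:k]

corpus = [{"dust", "seoul"}, {"dust", "budget"}, {"strike"}]
top = key_noun_terms(["dust", "dust", "seoul", "strike"], {"seoul"}, corpus, k=2)
```

For instance, a title noun with moderate frequency can outrank a frequent body noun that appears across many articles, because the title bonus and idf both favor it.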
Then, Algorithm 1 is adopted to cluster the online news articles in NEWS_1,t for t = 1, . . . , T. Algorithm 1 is a modified version of the algorithm used in He, Chang, Lim and Banerjee [16] and Suh [8], which has been widely used and is known to be effective for TD because it overcomes the following drawbacks: while previous TD models are broadly classified into two types, i.e., non-probabilistic and probabilistic [16], non-probabilistic models do not provide the number of topic clusters, and existing probabilistic models, especially latent Dirichlet allocation (LDA), tend to be overly complex for TD problems.
Consequently, Algorithm 1 extracts the topics of similar online news articles from NEWS_1,t for t = 1, . . . , T. Over the iterations of Algorithm 1, the centroid of each topic keeps fewer than α key noun terms while excluding the less important ones. In consequence, Algorithm 1 results in TOPIC = {topic | SPRTs detected from NEWS_1,t for t = 1, . . . , T}, TOPICNEWS(topic) = {news | online news articles classified to topic ∈ TOPIC}, and TOPICNOUN(topic) = {noun | key noun terms in the centroid of topic ∈ TOPIC}. Algorithm 1 Detecting the SPRTs from online news articles in NEWS_1,t (t = 1, . . . , T).
Input: Online news articles in NEWS 1,t and their noun score vectors, and threshold ε Output: TOPIC, TOPICNEWS(topic), and TOPICNOUN(topic)

1: for time t = 1 (i.e., the first publication date among the online news articles of NEWS_1,t) to t = T (i.e., the last publication date) do
2:   select the online news articles in NEWS_1,t;
3:   if t = 1 and n(TOPIC) = 0 then
4:     create a topic, set the online news article as the centroid of the new topic, and announce it;
5:   else
6:     for each online news article of NEWS_1,t do
7:       compute the cosine similarity of the online news article with the centroid of each topic in TOPIC, defined as cos(v_i, v_j) = (v_i · v_j) / (|v_i| |v_j|), where v_i is a weight vector, v_i · v_j is the dot product of two weight vectors, and |v_i| is the magnitude of v_i;
8:       if the cosine similarity with the most similar topic > threshold ε then assign the online news article to that topic and update its centroid; otherwise, create a new topic;
9: finally, select the topics with more than β online news articles, and define them as the SPRTs.
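This single-pass clustering can be sketched compactly as below; the additive centroid update and the eps/beta defaults are illustrative assumptions, and pruning the centroid to fewer than α key noun terms is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse weight vectors (dicts: noun -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def detect_topics(articles, eps=0.3, beta=1):
    """Single-pass clustering in the spirit of Algorithm 1: an article joins
    the most similar existing topic if the similarity exceeds eps, otherwise
    it seeds a new topic; topics with more than beta articles become SPRTs.
    The additive centroid update is a naive stand-in, and pruning centroids
    to fewer than alpha key nouns is omitted."""
    topics = []  # each topic: {"centroid": weight dict, "news": articles}
    for art in articles:
        best, best_sim = None, 0.0
        for topic in topics:
            sim = cosine(art, topic["centroid"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim > eps:
            best["news"].append(art)
            for t, w in art.items():  # accumulate weights into the centroid
                best["centroid"][t] = best["centroid"].get(t, 0.0) + w
        else:
            topics.append({"centroid": dict(art), "news": [art]})
    return [t for t in topics if len(t["news"]) > beta]

articles = [{"dust": 1.0, "seoul": 0.5}, {"dust": 0.9, "seoul": 0.4}, {"strike": 1.0}]
sprts = detect_topics(articles)
```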

Measure the Three Types of Features to Represent the Key Noun Terms of the SPRTs
To label the identified topics, most of the previous works in Table 1 used simple statistical approaches for characterizing key terms from clustered documents. In contrast, this paper proposes and employs temporal weight features, sentiment features, and complex network structural features to represent the key noun terms, which can be identified to label the detected SPRTs, after reviewing the features used in the previous works of Table 1. The proposed three types of features are detailed as follows:
Temporal weight features. Temporal IR attempts to consider not only relevance but also temporal correspondence based on the underlying temporal factor behind search intention. A relatively large number of key noun terms, i.e., queries, for information access have temporal information needs [54]. Hence, to represent the temporally changing importance of a key noun term in the identified topics, this study modifies the traditional weighting statistics, e.g., tf, idf, and tfidf, by taking time into account, which yields the temporal weight features. In addition, basic statistics such as the mean, variance, and |skewness| are measured for the temporal weight features to capture their distributional characteristics over the given time period. Here, the absolute value of skewness measures the shape of the skew irrespective of whether the distribution is skewed to the left/negative or to the right/positive.
Sentiment features. The sentiment features of a key noun term are measured by sentiment analysis on the large-scale online news articles. In general, sentiment analysis determines whether a textual data instance is objective or subjective, determines whether a subjective textual data instance contains positive or negative statements, and measures the sentiment value of a subjective textual data instance [55,56].
In this paper, the approach of Suh [8] that uses SentiWordNet as a lexicon is adopted to extract multilingual sentiment features and score their sentiment values mainly for two reasons: it enables researchers in the other countries to use the research framework of this paper by constructing sentiment features with their own languages; and it takes into account the negations. In addition, this study exploits the basic statistics of a key noun term's sentiment features to represent the distributional characteristics over the news and topics that contain the key noun term.
Complex network structural features. Using the co-occurrence relationships of the key noun terms as links, which are called co-news and co-topic links, the complex networks of the key noun terms are constructed. Their complex network structural properties are measured by referring to the standard measures of node centrality, i.e., the degree, closeness, and betweenness centralities [57][58][59], and are used as features for this study. In addition, after specifying a boundary for the complex networks of the key noun terms, such as the identified SPRTs and detected topical communities, the basic statistics are measured to represent the distributions of a key noun term's in-boundary network properties over the different SPRTs and topical communities.
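For illustration, a degree-centrality computation over a co-topic network could look like the following sketch; closeness and betweenness centralities would follow their standard definitions (e.g., via a graph library) and are omitted for brevity.

```python
import itertools
from collections import defaultdict

def degree_centrality(topic_noun_sets):
    """Build a co-topic network (nouns labelling the same SPRT are linked)
    and return each noun's degree centrality = #neighbours / (n - 1).
    Nouns from single-noun topics gain no links in this sketch."""
    neighbours = defaultdict(set)
    for nouns in topic_noun_sets:
        for a, b in itertools.combinations(sorted(nouns), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {noun: 0.0 for noun in neighbours}
    return {noun: len(adj) / (n - 1) for noun, adj in neighbours.items()}

cent = degree_centrality([{"dust", "seoul", "health"}, {"dust", "budget"}])
```

A noun that labels several SPRTs (here "dust") links to more neighbours and thus scores higher, which is exactly the kind of cross-topic prominence these features are meant to capture.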
Thus, in this section (Section 2.3), the proposed three types of features are measured for all the extracted key noun terms of the SPRTs, i.e., noun ∈ ∪_{topic∈TOPIC} TOPICNOUN(topic). The three types of features measured for each topic-level key noun term are used to decide automatically whether it is a SocialTERM or an EventTERM in the next section (Section 2.4).

Measure the Temporal Weight Features of the SPRTs' Key Noun Terms
The temporal weight features of noun ∈ ∪_{topic∈TOPIC} TOPICNOUN(topic), namely F1, are measured in four respects: df, tf, ths, and idf. Moreover, these temporal weight features are measured at two different levels: the news level and the topic level.
First, the temporal weight features of noun at the news level are measured as follows: Given that NOUNNEWS_t(noun) is the set of online news articles containing noun and published at time t, dfscore_{1,t}(noun) is the normalized number of online news articles in NOUNNEWS_t(noun), and it is given by dfscore_{1,t}(noun) = n(NOUNNEWS_t(noun)) / n(∪_{noun} NOUNNEWS_t(noun)). (9)
Given that tf(noun, news) is the frequency of noun occurring in the content of news ∈ NOUNNEWS_t(noun), tfscore_{1,t}(noun) is obtained by normalizing tf(noun, news) by the number of online news articles in NOUNNEWS_t(noun), and it is defined as tfscore_{1,t}(noun) = Σ_{news ∈ NOUNNEWS_t(noun)} tf(noun, news) / n(NOUNNEWS_t(noun)). (10) Likewise, thsscore_{1,t}(noun) is the normalized ths(noun, news) over the online news articles in NOUNNEWS_t(noun), given by thsscore_{1,t}(noun) = Σ_{news ∈ NOUNNEWS_t(noun)} ths(noun, news) / n(NOUNNEWS_t(noun)), (11) where ths(noun, news) is 2 if noun appears in the title of news, and 1 otherwise. Finally, idfscore_{1,t}(noun) is based on the inverse of the number of online news articles containing noun at time t, and it is defined as idfscore_{1,t}(noun) = log(n(∪_{topic} TOPICNEWS_t(topic)) / n(NOUNNEWS_t(noun))). (12)
To represent the distribution of each of the Equations (9)-(12) over time t = 1, . . . , T, the mean, variance, and |skewness| are measured, and added as the temporal weight features of noun at the news level to F1. As a consequence, 12 features are measured as the news-level temporal weight features of noun.
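The per-score statistics used above can be sketched as below; since the exact variance and skewness estimators are not specified in the text, the population forms are assumed.

```python
def temporal_stats(series):
    """Mean, variance, and |skewness| of one temporal weight score over
    t = 1..T. The population estimators are assumed, since the text does
    not specify which forms were used."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    if var == 0.0:
        return mean, 0.0, 0.0  # a constant series has no skew
    skew = sum((x - mean) ** 3 for x in series) / (n * var ** 1.5)
    return mean, var, abs(skew)

stats = temporal_stats([0.1, 0.4, 0.1, 0.2])  # e.g., dfscore over four periods
```

Applying this to each of the four news-level scores yields the 3 × 4 = 12 news-level temporal weight features of a noun.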
Second, the temporal weight features of noun at the topic level are obtained as follows: Given that NOUNTOPIC_t(noun) is the set of the detected SPRTs that contain online news articles in NOUNNEWS_t(noun) and are thereby related to noun, dfscore_{2,t}(noun) is the normalized number of the detected SPRTs in NOUNTOPIC_t(noun), defined as dfscore_{2,t}(noun) = n(NOUNTOPIC_t(noun)) / n(∪_{noun} NOUNTOPIC_t(noun)). (13) tfscore_{2,t}(noun) is obtained by normalizing tfscore_{1,t}(noun) over the detected SPRTs related to noun (Equation (14)). In the same way, thsscore_{2,t}(noun) is obtained by normalizing thsscore_{1,t}(noun) over the detected SPRTs related to noun (Equation (15)), and idfscore_{2,t}(noun) is the normalized idfscore_{1,t}(noun) over the detected SPRTs related to noun (Equation (16)).
To represent the distribution of each of Equations (13)-(16) over time t = 1, . . . , T, 12 topic-level temporal weight features of noun are measured and added to F1. Consequently, Table 2 shows the 24 temporal weight features of noun, measured at both the news and topic levels.

Table 2. Temporal weight features of the detected SPRTs' key noun terms, proposed for this study.

Measure the Sentiment Features of the SPRTs' Key Noun Terms
This component extracts features related to the sentiment of noun, namely F2. To do so, this paper adopts the multilingual sentiment feature set constructed by Suh [8] through two main parts: the extraction of an English sentiment feature set from SentiWordNet (http://sentiwordnet.isti.cnr.it/), and the construction of a multilingual sentiment feature set.
Let NOUNSENTI(noun) be the set of constructed multilingual sentiment features that contain noun. Then, the sentiment score of a multilingual sentiment feature including noun with the given pos is obtained by

$$featuresentiscore(noun, pos) = \frac{\sum_{senti \in NOUNSENTI(noun)} finalscore(senti, pos)}{n(NOUNSENTI(noun))}.$$
In addition, the sentiment score of noun is defined based on featuresentiscore(noun, pos). The sentiment score of noun at the news level is obtained by averaging the sentiment scores of the online news articles containing noun:

$$sentiscore_1(noun) = \frac{\sum_{news \in NOUNNEWS(noun)} newssentiscore(news)}{n(NOUNNEWS(noun))},$$

where NOUNNEWS(noun) = NOUNNEWS_1(noun) ∪ . . . ∪ NOUNNEWS_T(noun) for time t = 1, . . . , T.
Here, the sentiment score of an online news article, news, is obtained by averaging the scores of the multilingual sentiment features appearing in it, where NEWSSENTI(news) denotes the set of multilingual sentiment features appearing in news. In addition, to represent the distribution of the sentiment scores of the online news articles that contain noun, the variance and |skewness| of newssentiscore(news) are measured over news ∈ NOUNNEWS(noun), and they are added as the sentiment features of noun to F2. Here, the mean value of newssentiscore(news) is equal to sentiscore_1(noun). The sentiment score of noun at the topic level is defined as

$$sentiscore_2(noun) = \frac{\sum_{topic \in NOUNTOPIC(noun)} topicsentiscore(topic)}{n(NOUNTOPIC(noun))}.$$

Here, the sentiment score of a detected topic, topic, is given by

$$topicsentiscore(topic) = \frac{\sum_{news \in TOPICNEWS(topic)} newssentiscore(news)}{n(TOPICNEWS(topic))}.$$
In addition, to represent the distribution of the sentiment scores over the detected SPRTs whose online news articles contain noun, the mean, variance, and |skewness| of topicsentiscore(topic) are measured over topic ∈ NOUNTOPIC(noun), and they are added as the sentiment features of noun to F2. Here, the mean value of topicsentiscore(topic) is equal to sentiscore_2(noun). Consequently, 10 sentiment features of noun are measured, as shown in Table 3; F22 and F23 are particularly measured to represent the distributions of the sentiment scores of noun over its news articles and topics.

Table 3. Sentiment features of the detected SPRTs' key noun terms, proposed for this study.
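A minimal sketch of the news-level aggregation above, assuming a simple term-to-polarity lexicon stands in for the multilingual sentiment feature set (the names and the [-1, 1] score range are illustrative assumptions):

```python
from statistics import mean, pvariance

def news_senti_score(news_terms, senti_lexicon):
    """Average polarity of the sentiment features appearing in one article.
    senti_lexicon: {term: sentiment score in [-1, 1]} (illustrative)."""
    found = [senti_lexicon[w] for w in news_terms if w in senti_lexicon]
    return mean(found) if found else 0.0

def noun_sentiment_features(noun_news, senti_lexicon):
    """Mean, variance, and |skewness| of the article scores for the articles
    containing the noun; the mean plays the role of sentiscore_1(noun)."""
    scores = [news_senti_score(a, senti_lexicon) for a in noun_news]
    m, v = mean(scores), pvariance(scores)
    skew = 0.0 if v == 0 else abs(mean(((s - m) / v ** 0.5) ** 3 for s in scores))
    return m, v, skew
```

The topic-level statistics follow the same pattern, with topic scores averaged over the articles of each detected SPRT.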

Measure the Complex Network Structural Features of the SPRTs' Key Noun Terms
A network whose structure is irregular, complex, and dynamically evolving over time is defined as a complex network. The research on complex networks has resulted in the identification of a series of unifying principles and statistical properties that are common to most real networks [60]. For a given plain graph, approaches that are based on the structure-based patterns of complex networks can be grouped into feature-based and proximity-based approaches: feature-based approaches extract graph-centric features, e.g., node degree; proximity-based approaches quantify the closeness of nodes in the graph to identify associations, e.g., PageRank [61]. In particular, feature-based approaches compute various measures that are associated with the nodes, dyads, triads, egonets, communities, and global graph structure. Among these measures, this paper focuses on the nodes and communities because they both correspond to the node perspective.
Network properties characterize an individual node's position within a complex network. The three most widely investigated concepts for evaluating such network properties are the degree, closeness, and betweenness centralities [57][58][59]. These are the standard measures of node centrality, which were originally introduced to quantify the importance of an individual in a social network. Given an adjacency matrix M_{n×n} = (m_{ij}) of a network, where n ≥ 3, the three normalized network centralities can be respectively defined as follows:

$$degree_i = \frac{\sum_{j \neq i} m_{ij}}{n - 1},$$

where m_{ij} = 1 if node i is connected to node j, and 0 otherwise. A high value of degree_i means that node i acts as a center in the network.
$$closeness_i = \frac{n - 1}{\sum_{j \neq i} d_{ij}},$$

where d_{ij} is the number of edges in the shortest path from node i to node j. closeness_i indicates the influence of node i on the other nodes.
$$betweenness_i = \frac{2}{(n - 1)(n - 2)} \sum_{j < k,\ j \neq i \neq k} \frac{g_{jik}}{g_{jk}},$$

where g_{jk} is the number of shortest paths between node j and node k, and g_{jik} is the number of shortest paths between node j and node k that contain node i. A high betweenness_i value means that node i is located at the core of the network and mediates more of the paths between the other nodes.

A community is a densely connected subgroup, which is known to exist in many real-world networks, and community detection (CD) can help us understand networks more deeply and identify interesting properties that are shared by the nodes [62,63]. The fundamental idea behind most CD methods is to partition the nodes of the network into modules [64]. Among the agglomerative methods of CD, two algorithms are commonly used: first, Newman's CD algorithm is a widely used agglomerative method that uses modularity to measure the goodness of the current partitioning; second, the more recently developed Louvain method [65] is commonly used because of its low computational complexity and high performance. When merging communities, the Louvain method considers not only the modularity but also the consolidation ratio [41]. Newman's algorithm is effective but slow, whereas the Louvain method is much more computationally efficient [66]. Therefore, this paper adopts the Louvain method for detecting topical communities from the complex networks of key noun terms, which are used to label the detected SPRTs.
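As a concrete illustration, the three centralities and Louvain CD can be sketched with NetworkX; this is purely an assumption of the sketch (the study itself used JUNG and Gephi), and the function names are illustrative:

```python
import networkx as nx

def node_centralities(edges):
    """Normalized degree, closeness, and betweenness for every node,
    matching the three standard centrality measures."""
    G = nx.Graph(edges)
    return (nx.degree_centrality(G),
            nx.closeness_centrality(G),
            nx.betweenness_centrality(G, normalized=True))

def louvain_topic_communities(weighted_edges, seed=0):
    """Louvain community detection on a weighted co-occurrence network
    (requires a recent NetworkX with nx.community.louvain_communities)."""
    G = nx.Graph()
    G.add_weighted_edges_from(weighted_edges)
    return nx.community.louvain_communities(G, weight="weight", seed=seed)
```

For a star graph, the sketch returns a degree and betweenness of 1.0 for the hub, consistent with the normalized definitions above.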
Based on the abovementioned definitions related to complex networks, this component extracts the complex network structural features of noun, namely F3, by constructing two types of complex networks of the SPRTs' key noun terms: cross-boundary networks and in-boundary networks. Figure 2 describes how these networks of the key noun terms are constructed, and the details are explained as follows.

Figure 2. Illustration of how to form cross-boundary networks (CBNs), i.e., CBN_co-news and CBN_co-topic, and in-boundary networks (IBNs), i.e., ITNs (topic) and ICNs (community).
The cross-boundary networks (CBNs) are constructed by using the key noun terms as nodes and setting edges by the co-occurrence relationships between the key noun terms in terms of news and topics. In other words, CBN_co-news is constructed by taking the key noun terms as nodes and their co-occurrence frequencies in online news articles, i.e., co-news frequencies, as the corresponding link weights. Similarly, by setting the co-occurrence frequencies in the detected topics, i.e., co-topic frequencies, as the corresponding link weights, CBN_co-topic is constructed.
In-boundary networks (IBNs) are built by using the key noun terms within a particular boundary and their co-occurrence relationships with respect to that boundary. For the IBNs, this study uses two types of boundaries: topics and communities. First, let ITN_co-news(topic) be a kind of IBN, constructed by setting topic as the boundary and the co-news frequencies of the key noun terms as link weights. Second, Louvain-method-based CD on CBN_co-topic is performed to take into account the semantic relationships among the key noun terms in terms of their co-topic frequencies. Unlike the TD, the CD assigns noun to only one of the detected communities. For each detected community, community, an in-community network, i.e., ICN_co-topic(community), is formed by setting the co-topic frequencies of the key noun terms within the boundary of community as the link weights.
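The CBN/IBN construction can be sketched as follows, assuming each article or topic has been reduced to its set of key noun terms (the function names are illustrative):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(groups):
    """Link weights for a co-occurrence network: `groups` is a list of
    key-noun-term sets, one per article (co-news) or per topic (co-topic).
    Returns {frozenset({u, v}): co-occurrence frequency}."""
    weights = Counter()
    for terms in groups:
        for u, v in combinations(sorted(set(terms)), 2):
            weights[frozenset((u, v))] += 1
    return weights

def in_boundary_network(groups, boundary_terms):
    """Restrict the network to the key terms of one boundary
    (a topic for ITNs, or a detected community for ICNs)."""
    kept = [set(t) & set(boundary_terms) for t in groups]
    return cooccurrence_network([t for t in kept if len(t) > 1])
```

Feeding article term sets yields a CBN_co-news-style network; feeding topic term sets yields a CBN_co-topic-style one.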
To evaluate the network properties of noun in both CBN_co-news and CBN_co-topic, the degree, closeness, and betweenness are respectively measured as complex network structural features of noun. Relating to the IBNs, the network properties of noun in ITN_co-news(topic) are degree(noun, ITN_co-news(topic)), closeness(noun, ITN_co-news(topic)), and betweenness(noun, ITN_co-news(topic)). In particular, to represent the distribution of the three network centralities of noun over the detected SPRTs, the mean, variance, and |skewness| are measured for noun, and they are added as complex network structural features of noun to F3. Then, the structural properties of noun in its corresponding ICN_co-topic(community) are obtained as degree(noun, ICN_co-topic(community)), closeness(noun, ICN_co-topic(community)), and betweenness(noun, ICN_co-topic(community)). As a result, Table 4 shows the 18 complex network structural features of noun for the constructed complex networks of the SPRTs' key noun terms.

Table 4. Complex network structural features of the detected SPRTs' key noun terms, proposed for this study:
- In-boundary, given topic (co-news): mean, variance, and |skewness| of degree(noun, ITN_co-news(topic)); mean, variance, and |skewness| of closeness(noun, ITN_co-news(topic)); mean, variance, and |skewness| of betweenness(noun, ITN_co-news(topic)).
- F34, given community (co-topic): degree(noun, ICN_co-topic(community)), closeness(noun, ICN_co-topic(community)), and betweenness(noun, ICN_co-topic(community)).

Notes: 1. The statistics of the degree, closeness, and betweenness of noun in ITN_co-news(topic) are obtained over topic ∈ NOUNTOPIC(noun). 2. If n(NOUNTOPIC(noun)) ≤ 1, the variance values of the degree, closeness, and betweenness of noun in ITN_co-news(topic) are set to 0; if n(NOUNTOPIC(noun)) ≤ 2, the skewness values are set to 0.

Classify the Key Noun Terms of the SPRTs into the SocialTERMs and the EventTERMs
This subsection defines a target variable for classification, and introduces machine learning techniques used for classification in the previous text mining applications. In addition, it explains the experimental settings to generate configurations, which result from combining the different feature sets and different classification techniques.

Definition for a Target Variable
By referring to the examples mentioned in the introduction, SocialTERM and EventTERM can be defined as below:

Definition 1. (SocialTERM) Given social-problem-related topics (SPRTs) and their key noun terms, a SocialTERM of an SPRT is defined as a key noun term that is perceived as: characterizing the SPRT as a social problem; and being a useful cue for identifying and monitoring the ongoing and future events of the social problem. SocialTERMs are independent of the event-specific characteristics of the SPRT, e.g., when and where the events of the SPRT happened, but reflective of its social-problem-specific perspectives, e.g., what social problems the SPRT includes and what causes underlie such social problems.

Definition 2. (EventTERM) Given SPRTs and their key noun terms, an EventTERM of an SPRT is defined as a key noun term that is not perceived as a SocialTERM, because it explains not the social-problem-specific characteristics of the SPRT but the event-specific characteristics of the events that belong to the SPRT. Thus, EventTERMs are considered not useful for identifying and monitoring the ongoing and future events of social problems.
For the key noun terms obtained from the detected topics, their target variables, y(noun), are manually identified by three professional and experienced social scientists, who were invited as inspectors. Defined as in Equation (26), these are used as the true values to be compared with the estimated values.
To assure the reliability of the manual investigation, Cohen's Kappa, κ, is calculated for the inter-rater agreement between the three inspectors, and it is defined as

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where p_o is the relative observed agreement among the inspectors, and p_e is the hypothetical probability of chance agreement. Cohen's Kappa is a statistic that measures inter-rater agreement for categorical items, and it serves as evidence that the combination of several sources reduces the bias of the individual sources [56,67,68]. For these reasons, it is adopted in this study to evaluate the consistency of the results annotated by the three inspectors.
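A minimal sketch of the kappa computation for a pair of annotators; with three inspectors, the pairwise kappas can be averaged (that averaging is an assumption of this sketch, not a step stated in the paper):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items:
    kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    # observed agreement
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement from each annotator's label distribution
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[c] * cb[c] for c in ca) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```

For example, two annotators agreeing on 3 of 4 SocialTERM/EventTERM labels, with marginals 2:2 and 3:1, yield kappa = 0.5.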

Machine Learning Techniques for Classification in the Previous Text Mining Applications
To distinguish between the SocialTERMs and the EventTERMs among the key noun terms of the detected SPRTs, this paper adopts supervised classification techniques, which have been extensively studied due to their high classification performance. Of the classification techniques used in the previous works in Table 1, four commonly used classification techniques and a recently proposed deep-learning-based technique are adopted as base learners for this study, namely C4.5 as Decision Tree (DT) [9,69], Naïve Bayes (NB) [70][71][72], Radial Basis Function Network (RBFN) [9], Support Vector Machine (SVM) [73,74], and Deep Belief Network (DBN) [75][76][77]. Each of them is explained in S.1 of the Supplementary Materials.
In addition to the five base learners, three types of ensemble methods are combined with each of the five base learners for this study. Ensemble learning is a machine learning paradigm in which multiple learners are trained to solve the same problem. In contrast to the base learners, which try to learn one hypothesis from the training data, the ensemble learning methods try to learn a set of hypotheses and combine them for use. In general, ensemble methods are divided into two categories: instance partitioning and feature partitioning. Bagging and Boosting are instance partitioning methods; RS is a feature partitioning method [78].
Particularly, the three ensemble methods, namely Bagging, Boosting, and RS, are summarized as follows: Bagging is one of the simplest ensemble methods but has surprisingly good performance. The combination strategy of base learners for Bagging is majority voting. This strategy reduces the variance when combined with the base learner generation strategies. Bagging is particularly appealing when the available data are of limited size [79]. Unlike Bagging, Boosting produces different base learners by sequentially giving instances that have been misclassified by the previous base learner larger weight in the next iteration of training. The final model that is obtained by Boosting is a linear combination of several base learners, which are weighted by their own performances. There are several Boosting algorithms; the most widely used is AdaBoost [78]. RS is an ensemble construction technique, which uses random subspaces to both construct and aggregate the base learners. If a dataset has many redundant or irrelevant features, base learners in random subspaces may be better than in the original feature space. The combined decision of such base learners may be superior to that of a single classifier that is constructed on the original training dataset in the complete feature sets.
To the best of our knowledge from Table 1, no previous study has compared the performances of the state-of-the-art classification techniques, particularly DBN, in distinguishing between the SocialTERMs and the EventTERMs among the key terms of the detected SPRTs. Hence, this study adopts the five base learners and their combinations with the three ensemble methods. Moreover, these classification techniques are compared in terms of their performances.

Experimental Settings on Features and Classification Techniques
In this paper, the experiments are performed with 60 configurations, which result from combining the three feature sets, namely F1, F1 + F2, and F1 + F2 + F3, and 20 classification techniques. Details on the experimental settings are as follows.
The three types of features, i.e., F1, F2, and F3, are obtained after the feature extraction of the Section 2.3. Based on these different types of features, three feature sets are constructed in an incremental way: feature set F1; feature set F1 + F2; and feature set F1 + F2 + F3. This incremental order implies the evolutionary sequence of features [19,80].
In addition, the three popular ensemble methods, i.e., Bagging, Boosting, and RS, are implemented respectively with the five base learners. Consequently, this paper uses 20 classification techniques to differentiate the SocialTERMs from the EventTERMs, as described in Table 5. For an experiment that uses one of the 20 classification techniques, 10-fold cross-validation is performed to train and evaluate a classifier. Before performing the experiments, if the sample sizes of the two classes in y(noun) of the data set for an experiment are imbalanced, the imbalance problem has to be resolved because imbalanced datasets may suffer from problems such as small sample size, overlapping or class separability, and small disjuncts [81]. Previous approaches for dealing with imbalanced datasets are grouped into four categories: algorithm-level, e.g., Hellinger Distance Decision Trees; data-level, e.g., random oversampling and the synthetic minority oversampling technique (SMOTE); cost-sensitive, e.g., AdaCost; and classifier ensembles, e.g., Bagging [82]. Among them, the SMOTE approach is known for its good performance when adopted with ensemble methods [81], and it is therefore used to deal with the imbalance problem of this study [83].
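The core SMOTE idea, interpolating between minority-class neighbours, can be sketched in a few lines; this is a deliberate simplification for illustration, not the implementation used in the study:

```python
import random

def smote_like(minority, k=2, n_new=4, seed=0):
    """SMOTE-style oversampling sketch: each synthetic sample interpolates
    between a minority instance and one of its k nearest minority
    neighbours (simplified; not the imbalanced-learn implementation)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours by squared Euclidean distance
        neighbours = sorted((m for m in minority if m != x),
                            key=lambda m: sum((a - b) ** 2
                                              for a, b in zip(x, m)))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment x -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Generating as many synthetic minority instances as the class gap (502 in this study) balances the two classes before cross-validation.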
Among the 20 classification techniques, to implement the 16 conventional classification approaches of DT, NB, RBFN, and SVM, the data mining toolkit WEKA (Waikato Environment for Knowledge Analysis) version 3.7.0 is used because it is the best-known open-source toolkit with a collection of various machine learning algorithms for solving data mining problems [19,78]. In detail, for the base learners, the J48 module (WEKA's own version of C4.5) is used for DT, the RBFNetwork module for RBFN, the NaïveBayes module for NB, and the SMO module for SVM; for the ensemble methods, the Bagging module is used for Bagging, the AdaBoostM1 module for Boosting, and the RandomSubSpace module for RS. Moreover, for DBN and its ensemble learning methods, the Python-based deep learning tutorials from 'www.deeplearning.net' are used as references and modified. In implementing DBN, the number of hidden layers is set to two, and the dimension of each layer is set to 100 by default.

Evaluate Results with Comparisons
This component assesses the performance of the configurations of three feature sets and 20 classification techniques for classifying the key noun terms of the SPRTs into the SocialTERMs and the EventTERMs. Among the standard metrics widely used in IR and text classification studies, this paper uses three performance measures, i.e., accuracy, F-measure, and AUC, to evaluate each configuration. In particular, the definition of accuracy can be explained with a confusion matrix as shown in Table 6, and it is defined as

$$accuracy = \frac{TP + TN}{TP + FP + FN + TN},$$

and the F-measure is obtained by

$$F\text{-}measure = \frac{2 \times precision \times recall}{precision + recall}.$$

In addition, pairwise t tests are used for the comparisons because they are the simplest statistical tests, and they are commonly used for comparing the performance of two algorithms. The pairwise t tests examine whether the average difference between two approaches is significantly different from 0 by repeating the same experiments many times, particularly 50 times for this study [19]. In detail, the effect of adding one feature set on the three performance measures for a certain classification technique is investigated by conducting 60 individual pairwise t tests, i.e., 60 = three feature set comparisons × 20 classification techniques. Moreover, the classification techniques for a certain feature set are compared in terms of the three performance measures by conducting 120 individual pairwise t tests, which are composed as follows: 30 between the five BL classification techniques, i.e., 30 = 10 technique comparisons × three feature sets; 45 between the five BL classification techniques and the 15 ensemble learning methods, i.e., 45 = 15 technique comparisons × three feature sets; and 45 between the 15 ensemble learning methods, i.e., 45 = 15 technique comparisons × three feature sets.
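The two confusion-matrix measures and the paired t statistic can be sketched as follows; the helper names are illustrative:

```python
from math import sqrt
from statistics import mean, stdev

def accuracy_f1(tp, fp, fn, tn):
    """Accuracy and F-measure from a binary confusion matrix."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return acc, f1

def paired_t(scores_a, scores_b):
    """t statistic of a pairwise (paired) t test over repeated runs:
    tests whether the mean difference between two approaches is 0."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))
```

In the study, each pair of configurations would be compared by feeding 50 repeated-run scores per configuration into `paired_t`.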

Test Bed for Data Collection: South Korea and Korean News Portal Site
Relating to Section 2.1, this paper selected South Korea as a test-bed country for three main reasons: first, it is an information and communication technology (ICT)-intensive nation, so many online news articles are available, and it is easier to identify SPRTs from online news articles [8,9]. Second, it is well known for its high prevalence of social problems, e.g., it has the highest suicide rate among OECD countries [84]. This means that South Korea needs to identify social problems more than other countries do, which corresponds to the desired application of this study. Third, it is a knowledge-intensive country, so, once identified, SocialTERMs can be better used to explore technologies for solving social problems than in other countries [85].
By using a distributed web-crawling program, the online news articles were collected from NAVER.com, which is the best-known Korean news portal site. These articles had been published in the society-related news sections over the 365 days from May 2013 to June 2014, i.e., t = 1, . . . , 365. In total, 126,402 online news articles were collected from the targeted society-related sections, and the parsed data were stored in a relational database for the experiments.

Evaluation Results
Relating to Section 2.2, 43,711 online news articles with negative sentiment were selected from the 126,402 collected online news articles. Next, the thresholds ε = 0.3, α = 20, and β = 10 were determined based on a pre-topic analysis of 100 online news articles published in the first month, and 2961 topics of online news articles were detected from the 43,711 online news articles by Algorithm 1. Among the 2961 detected topics, the 467 topics with more than 10 (=β) online news articles were chosen as the final detected topics, namely the SPRTs. Then, as explained in Section 2.3, the three types of features, namely the temporal weight, sentiment, and complex network structural features, were measured for the 1810 key noun terms extracted from the 467 detected SPRTs (see Table 7 for examples of the 1810 key noun terms). Particularly, in measuring the complex network structural features, JUNG (http://jung.sourceforge.net/), a Java-based software library for network analysis, was used to obtain the network centralities, and Gephi (https://gephi.org/), an open-source graph visualization platform, was used to identify communities from the constructed co-news and co-topic key term networks. Tables A3-A6 in Appendix B show the descriptive statistics of the three types of features that represent the 1810 key noun terms. The target variables of the 1810 key noun terms, denoted as y(noun), were manually identified by the three inspectors. The procedure yielded a Cohen's Kappa inter-rater reliability of 0.8678, indicating good agreement, i.e., κ ≥ 0.8, according to Lombard et al. [86]. Disagreements among the three inspectors were jointly reviewed until a final agreement was reached. These were used as the true values to be compared with the estimated values. In addition, the 1810 key noun terms were imbalanced in terms of the classes of their target variables.
To resolve this imbalance issue, SMOTE was applied to the 1810 key noun terms. By adding 502 new instances of y(noun) = SocialTERM, a balanced data set of 1156 SocialTERMs and 1156 EventTERMs was prepared for the following experiments.
Next, according to Section 2.4, experiments were performed on the prepared data set, and Table 8 shows the experimental results on the three performance measures for the different feature sets and classification techniques. Consequently, the configuration of the full feature set F1 + F2 + F3 with the ensemble learning method Boosting DT gave the best accuracy, i.e., 83.8769%, which is 1.3264% better than the second-best configuration, i.e., F1 + F2 with Boosting DT. Moreover, Table 8 shows that, with F1 + F2 + F3, Boosting DT also gave the best performances in terms of F-measure (1.7112% better than with F1 + F2) and AUC (1.8174% better than with F1 + F2). Thus, the results in Table 8 provided an answer to the part of RQ1 that asks how well the three types of features perform with different classification techniques.
The possible reasons for the best performances of Boosting DT are as follows: DT could properly handle the numerical features of this study by treating them as categorical features; and DT with Boosting could reduce the multi-collinearity problems that may exist among the features [9,74,78].

Comparisons of Feature Sets
Table A6 of Appendix C shows the comparison results of pairwise t tests, which were performed to evaluate the effects of different feature sets on the performance of a classification technique in terms of the three performance measures. The comparison results gave answers to the part of RQ1 about which feature set and features give the best results, and their details are as follows.
By summarizing the comparison results in Table A6, Figure 3 illustrates the ratio of agreement with the positive effect of adding a feature subset on increasing performance from different perspectives. One of its key findings is that, for most of the 20 classification techniques, adding F1, F2, and F3 individually increased performance with respect to the three performance measures. This indicates that each of the feature sets suggested by this study is useful for identifying the SocialTERMs from the SPRTs detected from online news articles. Adding the sentiment feature set F2 led to better performances, regardless of the classification technique. The effect of adding the complex network structural feature set F3 was smaller than those of adding F1 and F2.

Sustainability 2019, 11, x FOR PEER REVIEW

Figure 3. Ratio of agreement that a feature set improved performance.
Furthermore, using Boosting DT, which was shown to be the best classification technique in Table 8, this paper performed pairwise t tests to compare the different feature subsets and investigated the effect of adding each feature subset on the classification performance. Table 9 shows that, for all three performance measures, the significant performance improvements by feature sets F1, F2, and F3 were respectively attributed to feature subsets F11 and F12 for F1, F21 and F23 for F2, and F31 for F3. This indicates that these features are more useful in characterizing the relatedness of the key noun terms to social problems.

Comparisons on Classification Techniques
In addition, the classification techniques were compared in three ways: base learner vs. base learner (see Table A7 of Appendix C), base learner vs. ensemble learning method (see Table A8 of Appendix C), and ensemble method vs. ensemble method (see Table A9 of Appendix C). Table A7 shows the results of the pairwise t tests, which were performed to examine the effects of different base learners on the three performance measures for a specific feature set, and Figure 4 provides an overview of the results in Table A7. According to the results, the performance rankings of the five base learners differ according to the selected feature sets, which implies that there is no single best classification technique for all three performance measures.

Table A8 shows the results of the pairwise t tests, which were performed to examine the effect of combining an ensemble method on the three performance measures for a specific feature set, and Figure 5 summarizes the results in Table A8. To explain, Figure 5 shows that, in terms of all three performance measures, combining Bagging yielded better performances than the base learners alone in most configurations for all the incremental feature sets, while Boosting and RS did not perform as well as Bagging. A possible reason for the positive effect of Bagging is that Bagging preserves the important information better than the base learners by considering the features in their entirety, unlike the base learners, which only consider the average of the aggregated features. Overall, it is concluded that combining an ensemble learning method is appropriate for this study to identify the SocialTERMs from the detected SPRTs.

Table A9 shows the results of the pairwise t tests, which were performed to examine the effects of different ensemble methods on the three performance measures for a specific feature set when a base learner is given. Figures 6-8 explain the performance rankings of the ensemble methods, which are evaluated based on the results in Table A9.

Some interesting findings from Figure 6 are as follows: While Bagging was ranked best among the three ensemble methods when combined with DT and DBN for F1, Boosting gave better accuracies with DT and DBN for F1 + F2 and F1 + F2 + F3, and with NB for all feature sets. The possible reasons for the superiority of Boosting with DT, NB, and DBN are as follows: the strategy of Boosting, which gives higher weights to misclassified instances during training, was effective for training the models of DT, NB, and DBN with more features; and Boosting's robustness against the multi-collinearity problems among complex features could help DT, NB, and DBN achieve better accuracies. Moreover, for all feature sets, RS always achieved better accuracies than the other ensemble methods when used with RBFN, while no single ensemble method achieved better accuracy with SVM. In Figures 7 and 8, the same results as in Figure 6 were observed, except that, for all feature sets, Boosting gave better AUCs when combined with SVM, followed in descending order by Bagging and RS, as shown in Figure 8d.

Thus, Figures 6-8 indicate that the choice of an ensemble method for obtaining better performances depends on the feature sets and the base learners. Therefore, it can hardly be said that a single ensemble method gave the best accuracy for all feature sets with any single base learner. However, as shown in Figure 6f, when the accuracy rankings of the ensemble methods were averaged over the different base learners for a given feature set, Boosting was a comparatively better choice as an ensemble method for any base learner. Moreover, Figures 7f and 8f, which averaged the F-measure rankings and the AUC rankings, respectively, demonstrate the same results as Figure 6f, i.e., the superiority of Boosting over Bagging and RS.


Conclusions
This paper proposed and examined an automatic approach, namely the SocialTERM-Extractor, for distinguishing between the SocialTERMs and the EventTERMs among the key noun terms of the SPRTs detected from a large number of Korean online news articles. It aimed at resolving the challenging issues mentioned in Section 1. Using the best-known news portal site of South Korea as a test-bed, experiments were conducted by following the proposed research framework, as explained in Section 2. The experimental results in Table 8 showed that the configuration of the full feature set, namely F1 + F2 + F3, with Boosting DT gave the best performances in terms of accuracy, as well as F-measure and AUC. Its high performance, e.g., 83.8769% accuracy, implies that the proposed approach can automatically identify the SocialTERMs in a reliable way (RQ1 was partly answered).
Furthermore, according to Figure 3, the pairwise t tests on the three performance measures for adding a feature set in Table A6 indicated that most of the 20 classification techniques agreed that the three feature sets, namely F1, F2, and F3, contributed to improving the classification performance in a statistically significant way. In particular, all 20 classification techniques unanimously agreed that adding the sentiment feature set F2 improved the classification performance in terms of accuracy and AUC. When the best classification technique, namely Boosting DT, was used, Table 9 showed that the individual addition of feature subsets such as F11, F12, F21, F23, and F31 actually increased all three performance measures. This indicates that the significant improvements in the three performance measures obtained by adding feature sets in Table A6 are attributable to such feature subsets (RQ1 was partly answered).
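The kind of pairwise t test described above can be sketched as follows, assuming SciPy; the per-fold accuracies below are illustrative placeholders, not the paper's data, and compare a hypothetical classifier trained with F1 only against one trained with F1 + F2.

```python
# Paired (dependent-samples) t test: the same cross-validation folds are
# scored under two feature configurations, so the samples are paired.
from scipy import stats

acc_f1 = [0.78, 0.80, 0.79, 0.81, 0.77, 0.80, 0.79, 0.78, 0.82, 0.80]
acc_f1_f2 = [0.82, 0.84, 0.83, 0.85, 0.81, 0.84, 0.82, 0.83, 0.86, 0.84]

t_stat, p_value = stats.ttest_rel(acc_f1_f2, acc_f1)
print(f"t = {t_stat:.2f}, p = {p_value:.6f}")
if p_value < 0.05 and t_stat > 0:
    print("Adding F2 improves accuracy at the 5% significance level.")
```

A paired test is appropriate here because both configurations are evaluated on identical folds, which removes fold-to-fold variance from the comparison.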
Relating to the comparisons of the classification techniques, according to Figure 4 (and Table A7), the performance rankings of all five base learners differed according to the selected feature sets (RQ2 was answered). In addition, Figure 5 (and Table A8) revealed that most of the 20 configurations agreed that most ensemble learning methods produced better performances than the base learners (RQ3 was answered). According to Figures 6-8 (and Table A9), the ensemble method that obtains the best results depends on the feature sets and the base learners. Nevertheless, when the performance rankings of an ensemble method for a feature set were averaged over all types of base learners, ensemble learning with Boosting showed comparatively better results for all feature sets (RQ3 was answered).
Theoretically, this paper contributes to expanding the related literature by applying text mining and machine learning techniques to a large number of online news articles as big data. To the best of our knowledge, this study is the first to provide an automatic approach for identifying and predicting the SocialTERMs of the detected SPRTs from online news articles. Because the appropriate SocialTERMs can be identified automatically, anybody, even someone who is unfamiliar with the ongoing social problems, can benefit from the approach of this study; it enables everyone to recognize the landscape of the SPRTs from a large amount of event-related textual data without difficulty. In addition, this study has a significant impact on sustainability, since the SocialTERMs can be used as key noun terms in searching for technologies that are helpful for solving social problems and in monitoring the ongoing and future events associated with the social problems. Eventually, the paper may facilitate innovations in our society by driving the development of technologies for ongoing and future social problems.
Practically, by answering RQ1~RQ3, this paper provided a reference and guidance for researchers, government officials, politicians, and companies that are in need of such a system implementation. The paper investigated which kinds of feature sets are preferable, what kinds of classification techniques perform better, and how these two factors should be combined to obtain the best results. These results help determine the proper model for building a system with real-world large data. In the suggested research framework, the paper proposed novel approaches for representing the key noun terms: temporal weight, sentiment, and complex network structural features. Moreover, the paper compared state-of-the-art techniques, including the recently proposed DBN, which is a deep-learning-based technique. It showed that the simpler conventional classification methods were better for this study, while the more complex DBN gave worse results. This indicates that a deep architecture is not a magic key for all kinds of problems in machine learning research, as deep architectures are known to work best for big-data cases with many variables. However, as the results of DBN were not much worse than those of the other approaches, better performances by deep architectures in other applications may be possible.
If the automatic approach is implemented as a system, the system can automatically recommend the SocialTERMs, which are useful key noun terms for exploring technologies that can be used to solve social problems. The SocialTERMs can be applied to the prediction of future social problems and the monitoring of ongoing social problems from a large number of online news articles. Thus, this study helps obtain new insights into how to identify ongoing and upcoming social problems from big data, thereby paving the way to big-data-driven social and technological innovations for the public good.
Further research can be conducted to overcome the limitations of this study. First, this study used only a large number of online news articles; however, large-scale data from social media, e.g., YouTube, Twitter, and Facebook, may also provide good sources for extracting temporal weight, sentiment, and complex network structural features for the key noun terms of the detected SPRTs. Second, the paper focused on the three types of features, but there may be other useful features, and more sophisticated classification techniques can be taken into account to improve the classification performance.
In addition, as future work, a portal site that provides the proposed methodology can be planned so that the methodology becomes available to individuals and groups who need to identify the SPRTs and their SocialTERMs. Easier-to-use methods can also be considered in developing the portal site, e.g., k-means and latent Dirichlet allocation (LDA) for the TD approach. Once developed, the proposed methodology and system can be evaluated in terms of whether they help users not only explore technologies for solving social problems but also monitor ongoing and future social problems based on a large amount of event-related textual data.
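The easier-to-use alternatives mentioned above, k-means and LDA for topic detection, can be sketched as follows with scikit-learn. The toy English corpus below is a stand-in for the Korean news articles; cluster counts and vectorizer settings are illustrative assumptions.

```python
# Two simple topic detection (TD) alternatives: k-means clustering on TF-IDF
# vectors and latent Dirichlet allocation (LDA) on raw term counts.
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "flood damage disaster relief victims",
    "flood rain disaster emergency shelter",
    "unemployment jobs economy youth workers",
    "economy jobs wages unemployment policy",
]

# k-means: each cluster of documents is treated as a candidate topic.
tfidf = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)
print("k-means cluster labels:", km.labels_)

# LDA: each document receives a probability distribution over latent topics.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
topic_dist = lda.transform(counts)
print("LDA topic distributions:\n", topic_dist.round(2))
```

For a production portal, the key noun terms of each detected topic would then be fed into the SocialTERM/EventTERM classifier described in this paper.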