Social Big-Data Analysis of Particulate Matter, Health, and Society

The study collected particulate matter (PM)-related documents in Korea and classified main keywords related to particulate matter, health, and social problems using text and opinion mining. It then attempted to present a prediction model for important causes related to particulate matter using social big-data analysis. Documents on particulate matter topics were collected from online channels (online news sites, blogs, cafés, social network services, and bulletin boards) from 1 January 2015 to 31 May 2016, and 226,977 text documents were included in the analysis. Machine-learning techniques were applied to forecast the risk of particulate matter. Emotions related to particulate matter were found to be 65.4% negative, 7.7% neutral, and 27.0% positive. Intelligent services that detect and prevent unknown particulate matter crises at an early stage may be possible if risk factors of particulate matter are predicted through linkage with the machine-learning prediction model.


Introduction
Air pollution due to rapid industrialization and urbanization has become the fourth-biggest threat to global human health, following hypertension, dietary habits, and smoking [1]. The World Health Organization (WHO) estimates that premature deaths due to air pollution are reaching seven million. Of the deaths due to outdoor air pollution, ischemic heart disease accounts for 40%, stroke for 40%, chronic obstructive pulmonary disease for 11%, lung cancer for 6%, and acute lower respiratory infections in children for 3% [2]. In addition, extreme weather phenomena and air pollution influence communicable diseases transmitted through water, food, insect vectors, and rodents [3]. The major causes of air pollution are particulate matter (PM2.5, PM10), sulfur dioxide (SO2), and nitrogen oxides (NO2), which either pollute the air directly or change into secondary pollutants, including particulate matter (PM), through chemical reactions in the atmosphere [1]. In particular, since the International Agency for Research on Cancer (IARC) classified PM as a Group-1 carcinogen [4], interest in the effects of PM on health has been growing [5]. Particulate matter has been reported to increase the risk of respiratory-related diseases such as asthma exacerbation and chronic obstructive pulmonary disease, as well as cardiovascular diseases such as irregular heartbeats, vascular dysfunction, and arrhythmia [6]. It is also reportedly related to acute and chronic premature death [7].
South Korea set the standard for Total Suspended Particles (TSP) in the "Framework Act on Environmental Policy" in 1991 before adding PM10 and PM2.5 to the Act in 1993 and 2011, respectively. The number of deaths due to air pollution in South Korea is estimated to have been 11,944 (24 deaths per 100,000 population) in 2008 (http://apps.who). Generally, the air pollution level is represented by the daily or annual mass concentration (µg/m³) of particulate matter or ultrafine particles in the air. The WHO recommends daily and annual averages of ultrafine particles (PM10) equal to or less than 50 µg/m³

and
The present study collected PM-related documents from all online channels in South Korea from which documents can be collected, classified the main keywords related to particulate matter using text and opinion mining, and attempted to present a prediction model for important causes related to particulate matter.

Research Targets
The present study used social big data collected through the Internet from online news sites, blogs, cafés, social network services, and bulletin boards. Social big data were defined as text-based web documents (buzz) collectible through a total of 166 online channels: 146 online news sites, four blogs (Naver, Daum, Nate, Tistory), two cafés (Naver, Daum), one SNS (Twitter), and 13 bulletin boards (such as Naver "Jishik-iN," NateTalk, NatePann, and DaumAgora). Documents on particulate matter topics were collected from these channels every hour, including weekends and holidays, from 1 January 2015 to 31 May 2016. Of a total of 587,099 collected documents, the 226,977 text documents that mentioned causes and diseases related to particulate matter were included in the analysis. A crawler was used to collect the social big data, and text-mining techniques were used to classify topics. The topics used to collect relevant documents were "particulate matter, ultrafine particles, yellow sand, smog, atmospheric pollution, and air pollution." Documents containing stop words unrelated to particulate matter, such as "smog wrinkles, contaminated children, smog films, and smog work," were removed during the collection period. The main purpose of social big-data analysis is to examine complex social and environmental issues based on the varied opinions expressed in a large amount of online data. Such analysis can help identify risk factors and protective factors of particulate matter more accurately.
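The topic-and-stop-word filtering described above can be sketched as follows. This is a minimal illustration, not the crawler used in the study: the function name `is_relevant` is hypothetical, the English phrases stand in for the original Korean search terms, and plain substring matching is an assumption.

```python
# Hypothetical sketch of the topic filter; the study's crawler is not described in code.
# English phrases stand in for the Korean terms actually used.
TOPIC_TERMS = [
    "particulate matter", "ultrafine particles", "yellow sand",
    "smog", "atmospheric pollution", "air pollution",
]
# Phrases that contain a topic term but are unrelated to particulate matter.
EXCLUDE_PHRASES = [
    "smog wrinkles", "contaminated children", "smog films", "smog work",
]


def is_relevant(text: str) -> bool:
    """Keep a document if it mentions a topic term and no excluded phrase."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in EXCLUDE_PHRASES):
        return False
    return any(term in lowered for term in TOPIC_TERMS)


documents = [
    "Smog covers Seoul again this morning",
    "Three new smog films open this weekend",
    "Sunny weather expected all week",
]
kept = [doc for doc in documents if is_relevant(doc)]
print(kept)
```

Checking the exclusion list first means a document is dropped even when it also contains a valid topic term, which matches the removal step described in the text.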

Research Instruments
Text documents collected in relation to particulate matter were encoded as standardized data through the text-mining process as follows.

Emotions Related to Particulate Matter
Emotions related to particulate matter, the dependent variable of the present study, were classified through text mining. Documents expressing positive emotions (such as happy, neat, cool, pleasant, healing, possible, strong recommendation, refreshing, joy, recommendation, overcome, positive, expectation, clean, fortunate, great, satisfaction, harmless, relaxing, belief, crisp, invigorating, fresh, relief, safety, stability, agree, solve, cure, and comfort) were defined as positive, and documents expressing negative emotions (such as bad, danger, exceed the standard, watch, disaster level, worst, serious, stuffy, inconvenience, terror, anxiety, depression, fatigue, pain, fatal, care, damage, stifling, worry, high risk, bafflement, extreme, horrible, failure, perplexed, catastrophe, great confusion, fear, problem, opposition, resistance, emission, neglect, denial, dissatisfaction, helplessness, vicious circle, crisis, disaster, calamity, caution, abstinence, warning, and concern) were defined as negative. The emotional dictionary covers positive feelings (e.g., clean, safe) and negative feelings (e.g., unpleasant, depressed) found in online documents. It was developed by SK Telecom, Korea's leading telecommunications company. If positive and negative attitudes were equal, the document was defined as neutral.
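The dictionary-based labeling described above can be illustrated with a small sketch. It assumes that a document's label is decided by comparing counts of positive and negative dictionary words, with ties labeled neutral. The word sets below are a small English subset of the listed terms, and `classify_emotion` is a hypothetical name; the SK Telecom dictionary itself is not reproduced here.

```python
import re

# Small illustrative subsets of the emotion words listed in the text (assumption:
# the real dictionary is Korean and much larger).
POSITIVE_WORDS = {"clean", "safe", "fresh", "relief", "pleasant", "comfort"}
NEGATIVE_WORDS = {"bad", "danger", "worst", "anxiety", "worry", "fear"}


def classify_emotion(text: str) -> str:
    """Label a document by comparing positive and negative word counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"  # equal counts, as in the definition above


print(classify_emotion("The air is clean and fresh today"))
print(classify_emotion("Worst smog this year, bad air, little relief"))
print(classify_emotion("Bad air outside but safe indoors"))
```

Single-word tokenization cannot match multi-word entries such as "high risk"; handling those would need phrase matching, which is left out of this sketch.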

Causes Related to Particulate Matter
Causes related to particulate matter were defined with 17 keywords classified in the subject analysis, which include "dust, yellow sand, PM10, powder, tobacco, grilling, influenced China, PM2.5, air pollution, ozone, smog, pollutant, carcinogen, fossil fuel, bacteria, exhaust gas, and chemical substance."

Diseases Related to Particulate Matter
Diseases related to particulate matter were defined with seven keywords, "common cold, lung disease, cardiac disorder, cerebrovascular disease, hypertension, depression, death disease," which were classified through text-mining.
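One way to picture the "standardized data" produced by this step is a row of binary indicators, one per cause or disease keyword. The sketch below is an assumption about the encoding, not the study's text-mining pipeline: `encode_document` is a hypothetical helper, and whole-word matching on English keywords stands in for the Korean morphological analysis that would be needed in practice.

```python
import re

# The 17 cause keywords and 7 disease keywords listed in the text (lowercased).
CAUSES = [
    "dust", "yellow sand", "pm10", "powder", "tobacco", "grilling",
    "influenced china", "pm2.5", "air pollution", "ozone", "smog",
    "pollutant", "carcinogen", "fossil fuel", "bacteria", "exhaust gas",
    "chemical substance",
]
DISEASES = [
    "common cold", "lung disease", "cardiac disorder",
    "cerebrovascular disease", "hypertension", "depression", "death disease",
]


def encode_document(text: str) -> dict:
    """Return a 0/1 indicator for each keyword, matched as a whole word or phrase."""
    lowered = text.lower()
    row = {}
    for keyword in CAUSES + DISEASES:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        row[keyword] = int(re.search(pattern, lowered) is not None)
    return row


row = encode_document("Yellow sand and smog raise the risk of lung disease")
print([keyword for keyword, flag in row.items() if flag])
```

Word-boundary matching avoids false hits such as "dust" inside "industry", at the cost of missing inflected forms.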

Analysis Methods
The present study applied machine-learning techniques to forecast the risk of particulate matter. The representative machine-learning algorithms used were random forest, decision-tree analysis, and a multilayer neural network. In addition, association analysis was carried out to determine the relationships among the independent variables that influence the risk of particulate matter; the Apriori algorithm was used for this analysis. The social big-data and data-mining analysis is based on the relationship between PM-related factors and emotional responses to PM, and the findings could identify risk factors and protective factors for the PM issue. The ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve) were used to evaluate the machine-learning models. IBM SPSS 24.0 (SPSS Inc., Chicago, IL, USA) was used for the decision-tree analysis, and R 3.4.2 (R Foundation for Statistical Computing, Vienna, Austria) was used for the random forest, multilayer neural network analysis, association analysis, and model evaluation.
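For readers unfamiliar with the AUC used for model evaluation above, the following sketch computes it directly from its rank interpretation: the probability that a randomly chosen negative-emotion document receives a higher score than a randomly chosen non-negative one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; the study itself used R for model evaluation.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 1 for the target class (e.g., negative emotion), 0 otherwise.
    scores: model scores, higher meaning more likely to be the target class.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy data, not from the study.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(toy_labels, toy_scores)
print(round(auc, 3))  # 0.889
```

The pairwise loop is quadratic in the number of documents; for a data set of the size used in the study, a rank-based formula or a library routine would be the practical choice.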

Factors Affecting the Risk of Particulate Matter
The results of the random forest analysis of the main factors influencing the emotions related to particulate matter (Non-negative, Negative) are presented in Figure 1. The importance plot (IncNodePurity) of the random forest model indicates that the factor with the greatest influence on emotions related to particulate matter (that is, the most important factor for separating non-negative from negative emotions) is "Air Pollution." It is followed by cardiac disorder, smog, yellow sand, chemical substance, carcinogen, lung disease, PM2.5, and pollutants.
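As background on the importance measure, IncNodePurity in the R randomForest package accumulates the decrease in node impurity from splits on a variable, averaged over trees, and for regression forests the impurity is the residual sum of squares. The sketch below shows that decrease for a single binary split, assuming the emotion outcome is coded 0/1. The text does not state how the forest was configured, so this coding is an assumption, and `purity_gain` is a hypothetical helper operating on toy data.

```python
def residual_sum_of_squares(values):
    """Sum of squared deviations from the mean (0.0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def purity_gain(feature, outcome):
    """Decrease in residual sum of squares from splitting on a 0/1 feature."""
    absent = [y for x, y in zip(feature, outcome) if x == 0]
    present = [y for x, y in zip(feature, outcome) if x == 1]
    return (
        residual_sum_of_squares(outcome)
        - residual_sum_of_squares(absent)
        - residual_sum_of_squares(present)
    )


# Toy data: 1 = negative emotion, and whether a keyword appears in each document.
negative_emotion = [1, 1, 0, 0, 0, 1]
keyword_present = [1, 1, 1, 0, 0, 0]

print(round(purity_gain(keyword_present, negative_emotion), 4))
print(round(purity_gain(negative_emotion, negative_emotion), 4))
```

A keyword that separates the two classes perfectly recovers the full impurity of the parent node, while a weakly related keyword yields only a small gain; a forest sums such gains over every split that uses the variable.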
The decision-tree model for the prediction of the risk factor of particulate matter is shown in Figure 2. The root node at the top of the tree structure shows the frequency of the dependent variable before any predictor variables (independent variables) are entered. The emotion ratio for particulate matter at the root node was 39.2% negative and 60.8% non-negative. The cause-and-disease factor directly under the root node is the factor with the greatest influence on the dependent variable. Here the influence of the "air pollution" factor was the largest: negative emotions about particulate matter increased from 39.2% to 66.7% if an online document contained the air pollution factor, and from 66.7% to 93.6% when it contained both the air pollution and cardiac disorder factors.
To develop a prediction model for the risk of particulate matter, the 226,977 text documents that mentioned causes and diseases related to particulate matter were used as the learning data set, from which training data and test data were sampled 50:50. In the machine-learning evaluation of the causes of particulate matter and the risk of diseases, the neural network (AUC = 0.74) performed best (Figure 3).
Figure 4 shows a multilayer neural network model that uses the 17 causes related to particulate matter (dust, yellow sand, PM10, powder, tobacco, grilling, influenced China, PM2.5, air pollution, ozone, smog, pollutant, carcinogen, fossil fuel, bacteria, exhaust gas, and chemical substance) and the seven diseases (common cold, lung disease, cardiac disorder, cerebrovascular disease, hypertension, depression, death disease) as the input layer, with five hidden layers and risk (Negative) as the single output layer. The overall risk of causes-and-disease factors predicted by the multilayer neural network model was 39.35%. The risk of each factor, from the highest to the lowest, was: smog (8.21%), influenced China (5.19%), carcinogen (4.29%), pollutant (3.83%), death disease (2.37%), yellow sand (1.94%), tobacco (1.88%), fossil fuel (1.64%), ozone (1.42%), cardiac disorder (1.22%), exhaust gas (0.95%), bacteria (0.87%), chemical substance (0.8%), lung disease (0.79%), PM10 (0.79%), common cold (0.56%), cerebrovascular disease (0.52%), grilling (0.49%), air pollution (0.48%), depression (0.35%), powder (0.31%), hypertension (0.23%), PM2.5 (0.14%), and dust (0.08%).
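The figures reported above can be cross-checked with simple arithmetic. The short script below only re-adds the published per-factor risks and forms ratios of the decision-tree percentages; it does not re-estimate anything, and the variable names are illustrative.

```python
# Per-factor risks (%) copied from the text above.
factor_risk = {
    "smog": 8.21, "influenced China": 5.19, "carcinogen": 4.29,
    "pollutant": 3.83, "death disease": 2.37, "yellow sand": 1.94,
    "tobacco": 1.88, "fossil fuel": 1.64, "ozone": 1.42,
    "cardiac disorder": 1.22, "exhaust gas": 0.95, "bacteria": 0.87,
    "chemical substance": 0.80, "lung disease": 0.79, "PM10": 0.79,
    "common cold": 0.56, "cerebrovascular disease": 0.52, "grilling": 0.49,
    "air pollution": 0.48, "depression": 0.35, "powder": 0.31,
    "hypertension": 0.23, "PM2.5": 0.14, "dust": 0.08,
}

# The 24 per-factor values add up to the reported overall risk.
total_risk = sum(factor_risk.values())
print(round(total_risk, 2))  # 39.35

# Share of negative documents (%) at successive decision-tree nodes.
root_negative = 39.2
air_pollution_negative = 66.7
air_pollution_cardiac_negative = 93.6

# Relative increase over the root node at each step down the tree.
first_split_ratio = air_pollution_negative / root_negative
second_split_ratio = air_pollution_cardiac_negative / root_negative
print(round(first_split_ratio, 1), round(second_split_ratio, 1))  # 1.7 2.4
```

The per-factor values sum to the reported overall risk of 39.35%, and the second ratio corresponds to the roughly 2.4-fold increase relative to the root node.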
A multilayer neural network model using the 17 causes related to particulate matter as the input layer, with five hidden layers and the seven diseases as the output layer, is shown in Figure 5. The accuracy of the neural network model for the prediction of diseases caused by particulate matter was, from the highest to the lowest: common cold (10.49%), lung disease (5.19%), cardiac disorder (2.80%), cerebrovascular disease (2.14%), death disease (1.88%), hypertension (1.13%), and depression (0.71%).
Association analysis in social big-data analysis is performed to discover the relationships between two or more words included in an online document. The present study analyzed association rules between the causes of particulate matter and disease factors, as shown in Table 2. The rule among the four factors {pollutant, carcinogen, common cold} ⇒ {lung disease} had a support of 0.011, a confidence of 0.647, and a lift of 11.64, and it was observed in 2471 documents. This indicates that when the "pollutant, carcinogen, common cold" factors are mentioned in an online document, the probability of the document also mentioning lung disease is 64.7%, which is 11.6 times higher than the overall probability of a document mentioning lung disease.
The risk prediction of the disease-prediction neural network model of the causes of particulate matter showed a similar trend to the particulate matter (PM10, PM2.5) forecast by the Korea Meteorological Administration (Figure 6).
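The three association-rule measures reported above (support, confidence, and lift) follow standard definitions, which the sketch below computes for a single rule over a toy set of documents. The function name `rule_metrics` and the toy data are illustrative assumptions; the study used the Apriori algorithm in R to mine rules over the full corpus.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent.

    transactions: list of sets of keywords, one set per document.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    total = len(transactions)
    with_antecedent = sum(antecedent <= t for t in transactions)
    with_consequent = sum(consequent <= t for t in transactions)
    with_both = sum((antecedent | consequent) <= t for t in transactions)
    if with_antecedent == 0 or with_consequent == 0:
        raise ValueError("rule is undefined when either side never occurs")
    support = with_both / total                  # share of documents matching the whole rule
    confidence = with_both / with_antecedent     # P(consequent | antecedent)
    lift = confidence / (with_consequent / total)  # confidence relative to the base rate
    return support, confidence, lift


# Toy documents, not from the study.
toy_documents = [
    {"pollutant", "carcinogen", "lung disease"},
    {"pollutant", "carcinogen"},
    {"smog"},
    {"lung disease"},
    {"pollutant", "carcinogen", "lung disease"},
    {"dust"},
    {"smog", "dust"},
    {"yellow sand"},
]
support, confidence, lift = rule_metrics(
    toy_documents, {"pollutant", "carcinogen"}, {"lung disease"}
)
print(round(support, 3), round(confidence, 3), round(lift, 3))
```

Applied to the reported values, these definitions are mutually consistent: 2471 of 226,977 documents gives a support of about 0.011, and a confidence of 0.647 with a lift of 11.64 implies that roughly 5.6% of all documents mention lung disease.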

Discussion
The purpose of the present study was to develop a risk prediction model for particulate matter by collecting PM-related documents mentioned in all online channels from which documents can be collected in South Korea and using them as machine-learning data. The summary and implications of the present study are as follows.
First, emotions related to particulate matter were found to be 65.4% negative, 7.7% neutral, and 27.0% positive. The finding is similar to that of the Public Attitudes towards the Environment-2016 Survey [33], in which 68.6% of the respondents worried about the effects of particulate matter and ultrafine particles on health. The direct causes among the causes related to particulate matter were, from the highest to the lowest: exhaust gas, the influence of China, fossil fuels, and tobacco. The finding is similar to that of the survey of the Korean Federation for Environmental Movement [34], which found that the causes of particulate matter were the influence of neighboring countries such as China, exhaust gas including that from diesel cars, and coal-fired electrical power plants. Disease factors related to particulate matter were, from the highest to the lowest: common cold, lung disease, cardiac disorder, and cerebrovascular disease. These are similar to the particulate matter threats investigated by the Korea Environment Institute (KEI) [33], which found threats in the following order, from the highest to the lowest: cough, rhinitis and sinusitis, asthma, acute and chronic bronchitis, atopic dermatitis, dizziness and headaches, cardiovascular disease, and cerebrovascular disease.
Second, the factors with the greatest effect on emotions related to particulate matter in the random forest model were, from the highest to the lowest: air pollution, cardiac disorder, smog, yellow sand, chemical substance, carcinogen, lung disease, PM2.5, and pollutants. In the decision-tree model, negative emotions about particulate matter increased by about 2.4 times, from 39.2% at the root node to 93.6%, when both the air pollution and cardiac disorder factors were in the document. This signifies that when cardiac disorder is mentioned together with air pollution in an online document about particulate matter, the risk of particulate matter increases about 2.4 times.
Third, the multilayer neural network was found to perform best in the evaluation of machine-learning models for the causes of particulate matter and the risk prediction of diseases. This supports previous studies that reported superior prediction accuracy of the neural network in the development of a particulate matter prediction model using machine learning [22][23][24][25][27].
Fourth, the overall risk of causes-and-disease factors predicted by the multilayer neural network model was 39.35%. The risk of the causes of particulate matter by factor was, from the highest to the lowest: smog, influence of China, carcinogens, pollutants, yellow sand, tobacco, fossil fuel, ozone, exhaust gas, bacteria, chemical substance, PM10, grilling, air pollution, powder, PM2.5, and dust. The risk of disease by particulate matter by factor was, from the highest to the lowest: death disease, cardiac disorder, lung disease, common cold, cerebrovascular disease, depression, and hypertension. In the multilayer neural network model for the prediction of the influence of the causes of particulate matter on diseases, the prediction probability of particulate matter and related causative factors was, from the highest to the lowest: common cold, lung disease, cardiac disorder, cerebrovascular disease, death disease, hypertension, and depression.
Fifth, in the association analysis between the causes of particulate matter and disease factors, the probability of mentioning lung disease increased by about 11.6 times when the "pollutant, carcinogen, and common cold" factors were mentioned, and the common cold was found to be interconnected with yellow sand, dust, pollutants, lung disease, carcinogens, cardiac disorders, and bacteria. Lastly, the risk prediction of the disease-prediction neural network model on the causes of particulate matter showed a similar trend to the particulate matter (PM10, PM2.5) forecast by the Korea Meteorological Administration.

Conclusions
The policy implications and conclusions of the findings of the present study are as follows. The prediction model can be used to predict the risk and the degree of PM. The research may thus help provide more accurate weather information related to PM concentration and support the establishment of a prevention system.
First, when causes and diseases related to particulate matter are mentioned in online documents, the negative emotions toward particulate matter increase. Accordingly, management as well as promotion measures need to be prepared through accurate diagnosis of the causes of particulate matter for correct public understanding. To that end, the establishment of countermeasures based on the identification of causes through the analysis of various big data of particulate matter and an integrated management system that includes the influence of neighboring countries and the interaction of climate changes appears to be needed.
Second, the prevalence rate of diseases due to particulate matter is serious. Particulate matter has been reported to increase the risk of respiratory-related diseases such as asthma exacerbation and chronic obstructive pulmonary disease, and cardiovascular diseases such as irregular heartbeats, vascular dysfunction, and arrhythmia [6]. It has also been reported to be related to acute and chronic premature death. In the multilayer neural network analysis of the present study, the prediction probability was, from the highest to the lowest: common cold, lung disease, cardiac disorder, cerebrovascular disease, and death disease. Because people search for or share information online as a means to overcome a disease, such as the common cold, contracted due to various particulate-matter causes, verified information can be provided online in advance by predicting related diseases due to particulate matter using the machine-learning prediction model developed in the present study.
Third, as with the existing particulate-matter prediction models that use data measured by organizations such as the Korea Meteorological Administration, the neural network model was found to be superior for prediction using social big data. However, the neural network of the present study used words about causes and diseases mentioned in relation to particulate matter as input variables and positive and negative emotional words as output variables, so it can be less accurate than measurements of causes, diseases, and emotional state based on existing operational definitions. Accordingly, the development of an analysis technique that can perform subject analysis of causes, diseases, and emotions at the sentence level appears to be necessary.
Fourth, high-quality learning data with accurate classification are needed to increase the accuracy of the machine-learning model for particulate matter developed in the present study, and to that end, the development of ontology for a terminology system for particulate matter and a dictionary of emotional words is needed.
Fifth, the data used in the machine-learning model need to be updated continuously. For machine-learning models developed by learning from training data, the actual classification and the predicted classification can differ when test data are applied. Accordingly, prediction accuracy can be improved by producing high-quality training data, selecting the cases for which the actual and predicted classifications agree, and then retraining the model on those data.
Sixth, because many general consumers do not use professional terminology for the causes of particulate matter and diseases, glossaries of colloquial or slang words for them and a system for collecting such words need to be developed. Lastly, intelligent services that detect and prevent unknown particulate matter crises at an early stage may be possible if risk factors of particulate matter are predicted by linking the machine-learning prediction model for particulate matter developed in the present study with weather big data and disease big data.