Food Safety Event Detection Based on Multi-Feature Fusion

Abstract: Food safety event detection is a technique used to discover food safety events by monitoring online news. In general, a set of keywords is extracted as features to represent news, and then the news is clustered to generate events. The most popular method for news feature extraction is Term Frequency-Inverse Document Frequency (TF-IDF); however, it has some defects, such as being prone to the "dimension disaster", low computational efficiency, and a lack of semantic information. In addition, Latent Dirichlet Allocation (LDA) is also widely used in news representation. Despite its low dimension, it still suffers from some drawbacks, such as the need to set a predefined number of clusters and difficulty in recognizing new events. In this paper, a method based on multi-feature fusion is proposed, which combines the TF-IDF features, the named entity features, and the headline features to represent the news. Based on these representations, an incremental clustering method is used to cluster the news documents and to detect food safety events. Compared with the traditional methods, the proposed method achieved higher Precision, Recall, and F1 scores. The proposed method can help regulatory authorities to make decisions and improve the reputation of the government, whilst reducing social anxiety and economic losses.


Introduction
Topic Detection and Tracking (TDT) is an information processing technique for the information flow on news media [1], which can detect the appearance of new topics and track their reappearance and evolution [2], whilst helping people deal with the problem of the internet information explosion [3]. Topic detection is a sub-task of TDT, which can help decision makers find meaningful topics or events in a timely manner [4] and has attracted a great deal of attention in many application areas, such as public opinion monitoring, emergency management, decision-making support systems, and online reputation monitoring [5][6][7][8]. In the context of news, topic detection and event detection can be viewed as the same concept [9]. Food safety event detection is very important for governments and for society. In recent years, food safety events have occurred frequently, making rapid food safety event detection an urgent problem to be solved. Food safety events include food poisoning, food-borne diseases, food contamination, etc. Examples include the horsemeat scandal that occurred in Europe [10], rat meat that was found in famous snacks in Korea [11], and the melamine, Sudan red egg, and gutter oil scandals that occurred in China [12][13][14]. These events not only caused huge economic losses and brought anxiety to the public, but also seriously undermined the reputation of the relevant governments.
Several approaches have been proposed for event detection, including: (1) document clustering based on news feature extraction and representation [15][16][17], wherein most researchers use Term Frequency-Inverse Document Frequency (TF-IDF) [18] to extract keywords and the Vector Space Model (VSM) [19] to represent news, and then clustering algorithms such as single-pass [20] or k-means [21] are used to cluster news (news describing the same event are clustered to generate events); (2) the method based on a topic model [22], where Latent Dirichlet Allocation (LDA) [23], Probabilistic Latent Semantic Analysis (PLSA) [24], and various extended versions are used to explore the latent semantic knowledge of documents, i.e., treating each document as a probability distribution over topics, then representing news based on this distribution and clustering the news accordingly; (3) the method based on neural networks [25], which uses deep neural network models such as Doc2vec [26] and Sentence2vec [27] to obtain document vectors, and then clusters the document vectors to generate events; (4) the community partitioning method based on a complex network [28], which takes co-occurrence words as nodes in the network to establish a topic graph and detects topics by using community partitioning. The community partitioning methods include the Kernighan-Lin algorithm [29], the Blondel algorithm [30], etc. There are a few studies currently available on food safety event detection [31,32]. These studies are based on LDA and have lower data dimensions, achieving better results than methods based on TF-IDF [31,32].
Nevertheless, the document representation ability of the above research is still limited by the lack of semantic information, and the inference algorithm used in the model can be too complex [4]. In addition, such methods need manual labeling of events and a predefined number of clusters, and have difficulty in detecting new events [33], which is not conducive to large-scale data modeling and affects the precision of event detection.
In this paper, TF-IDF is used to calculate the weights of all words in the news, and a fixed number of words are selected as the features of the news. McMinn et al. [34] proposed a real-time entity-based event detection method for Twitter, which proved that named entities play a crucial role in describing an event. Their entity-based event detection method is able to detect more events than previous approaches whilst also providing improved precision and retaining low computational complexity. Therefore, we use named entities as part of the features to represent news in this paper, and combine them with the feature words obtained by TF-IDF to form the joint feature of documents. This can significantly reduce the data dimension and computation overhead while effectively retaining the important news information. Furthermore, the headlines of food safety news effectively summarize the news content; therefore, this paper uses the semantics of headlines to update the weights of the joint features, so that words with high similarity to the headlines have greater weights. In this paper, we propose the concept of a news "fusion feature", which fuses multiple features together, including the TF-IDF features, the named entity features, and the headline features. In this way, key information becomes more prominent and document semantics are well covered, meaning a more accurate representation of the news can be obtained to improve the event detection results.
The main contributions of this paper are: (1) the combination of TF-IDF features and named entity features to form the joint feature of news; (2) a method for updating the weights of feature words based on the semantics of headlines, which highlights the key information and allows the fusion feature of news to be obtained. The multi-feature fusion method proposed in this paper is used for the document representation of food safety news, which enhances the detection results and can help regulatory authorities to detect food safety events more accurately. In order to verify the effectiveness of the proposed method, experiments were carried out on real food news data, and the experimental results of TF-IDF, LDA, and multi-feature fusion are compared.

Methods
This paper proposes a food safety event detection method based on multi-feature fusion. The process is as follows: (1) the news data are preprocessed; (2) TF-IDF is used to calculate the weight of each word in a news document, and the M words with the largest weights in each news document are selected to form a feature word set W; (3) the named entities in the news document are recognized using the Bi-LSTM-CNN-CRF framework [35] to form the set E, and the joint feature set K is obtained by combining E with the feature set W; (4) word2vec is used to obtain the vectors of all words in the news dataset, which establishes the dictionary D and the corresponding word vector set $V_D$; (5) the headline vector $V_h$ is established for the headlines in the dataset, which have been preprocessed as in (1), the similarity between each word in the feature set K and the headline vector is calculated, the weights of the feature words are updated according to the similarity values, and the VSM is then used to represent the news document; and (6) the single-pass algorithm is used to cluster the news documents and generate events. The process is outlined in Figure 1 and described in detail in the following subsections.

Preprocessing
The data preprocessing includes filtering noise, removing meaningless symbols such as spaces and links, word segmentation, and stop-word removal. The news dataset S contains many news documents. After preprocessing, each news document is represented as a bag of words and recorded as a set $news_i$, which serves as the input of the subsequent components, as shown in Formula (1):

$$S = \{news_1, news_2, \ldots, news_m\}, \qquad news_i = \{w_1, w_2, \ldots, w_n\} \quad (1)$$

where m is the number of news documents in the news dataset S, and n is the number of words in each news document.

TF-IDF Feature Extraction
TF-IDF is a feature extraction algorithm, where TF denotes the term frequency, that is, the frequency of a word appearing in the document, and IDF denotes the inverse document frequency. The main idea is that if a word or phrase appears more frequently in one document and less frequently in other documents, it is considered to have good representation ability for the document. Generally, the words or phrases with higher TF-IDF values are more important in the documents. The tf of the word $t_i$ appearing in document $d_j$ is calculated by Formula (2):

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \quad (2)$$

where $n_{i,j}$ is the number of occurrences of word $t_i$ in document $d_j$, and $\sum_{k} n_{k,j}$ is the total number of occurrences of all the words in document $d_j$, which normalizes the count. idf is the inverse document frequency and is calculated by Formula (3):

$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \quad (3)$$

where $|D|$ is the total number of documents in the dataset S, and $|\{j : t_i \in d_j\}|$ denotes the number of documents containing the word $t_i$. In general, $1 + |\{j : t_i \in d_j\}|$ is used as the denominator to avoid it being zero. TF-IDF is calculated by Formula (4):

$$tfidf_{i,j} = tf_{i,j} \times idf_i \quad (4)$$

After the weights of all words are calculated by TF-IDF, the M words with the largest weights in each news document are selected to form the feature set $W = \{w_1, w_2, \ldots, w_M\}$.
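As an illustration of Formulas (2)-(4) and the top-M selection, the following is a minimal sketch rather than the implementation used in the experiments. The function name `tfidf_top_m`, the toy documents, the natural logarithm, and the alphabetical tie-breaking are all assumptions made for the example; the smoothed denominator follows the text.

```python
import math
from collections import Counter

def tfidf_top_m(docs, m):
    """docs: list of token lists, one per preprocessed news document.
    Returns, for each document, its m highest-weighted (word, weight) pairs."""
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    n_docs = len(docs)

    features = []
    for tokens in docs:
        counts = Counter(tokens)
        total = sum(counts.values())
        weights = {}
        for word, n in counts.items():
            tf = n / total                            # Formula (2)
            idf = math.log(n_docs / (1 + df[word]))   # Formula (3), denominator 1 + df
            weights[word] = tf * idf                  # Formula (4)
        # Highest weight first; ties are broken alphabetically (an assumption).
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:m]
        features.append(top)
    return features

# Hypothetical, already segmented toy documents.
docs = [
    ["milk", "melamine", "recall", "milk"],
    ["tea", "flies", "shop"],
    ["oil", "gutter", "restaurant"],
    ["egg", "dye", "market"],
]
W = tfidf_top_m(docs, m=2)
print(W[0])
```

With the smoothed denominator, a word that occurs in every document receives a negative idf, so a practical implementation may prefer a different smoothing variant.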

Named Entity Feature Extraction
Named entities include person names, place names, organization names, and proper nouns. In this paper, named entities are regarded as one of the features of food safety news. The Bi-LSTM-CNN-CRF framework is used to recognize named entities in the food safety news dataset. The framework is based on Bi-directional Long Short-Term Memory (Bi-LSTM) [36], Convolutional Neural Networks (CNN) [37], and Conditional Random Fields (CRF) [38]. The steps are as follows: firstly, word embedding is used to obtain the vectors of words; then a CNN is used to encode the character-level information of a word into its character-level representation; next, the character-level and word-level representations are fed into the Bi-LSTM to model the context information of each word. Finally, a sequential CRF is used to jointly decode the labels for the whole sentence. For each news document, the extracted named entity set can be expressed as $E = \{e_1, e_2, \ldots, e_N\}$. The TF-IDF feature set W is combined with the named entity feature set E to obtain the joint feature set of the news, as shown in Formula (5):

$$K = W \cup E = \{k_1, k_2, \ldots, k_T\} \quad (5)$$

where $T \le M + N$, because a word that appears in both W and E is counted only once. The weights of the words in the joint feature set K are denoted $\theta_i$.
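The combination of W and E can be sketched as an order-preserving union. This is only an illustration: the function name `joint_features` and the example words are hypothetical, and the sketch leaves open how named entities are weighted.

```python
def joint_features(tfidf_words, entities):
    """Order-preserving union of the TF-IDF feature words W and the entities E."""
    seen = set()
    joint = []
    for word in list(tfidf_words) + list(entities):
        if word not in seen:   # a word found in both W and E is kept once
            seen.add(word)
            joint.append(word)
    return joint

W = ["Heytea", "flies", "drink"]   # hypothetical TF-IDF feature words (M = 3)
E = ["Suzhou", "Heytea"]           # hypothetical recognized entities (N = 2)
K = joint_features(W, E)
print(K, len(K) <= len(W) + len(E))
```

Because "Heytea" appears in both sets, the joint set has T = 4 elements, which is smaller than M + N = 5.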

Feature Fusion Based on the Semantic of Headline
In general, the headline of a food safety news document is a summarization of the news content, as it contains the keywords of a certain food safety event. Figure 2 shows a news document about a food safety event, where (a) shows the original news in Chinese and (b) shows its English translation. From the keywords "苏州" (Suzhou), "喜茶" (Heytea), and "苍蝇" (flies) in the headline, we can understand what happened in this food safety event.
In this paper, a dictionary D of the food news dataset is constructed, and the vectors of all words in D are obtained by word2vec [39] to form the set $V_D = \{v_1, v_2, \ldots, v_i, \ldots, v_z\}$, where z is the size of D and each vector has 256 dimensions. The preprocessing of the headlines involves removing punctuation marks and spaces, Chinese word segmentation, and stop-word removal. Doc2vec is then used to map the headlines into vectors with a fixed dimension, so that the headline vector $v_d$ of each news document d is obtained; its dimension is also 256. The vectors of words and sentences carry their semantic meaning, and the relationships between words and sentences can be calculated from the vectors.
Each news document is represented by the joint feature set K. Words with high similarity to the headline better represent the key information of a news document and should be given greater weight. For each word $t_i$ in K, the similarity between its vector $v_i$ and the headline vector $v_d$ is measured by the cosine similarity $s(v_i, v_d)$, which is calculated by Formula (6):

$$s(v_i, v_d) = \frac{v_i \cdot v_d}{\|v_i\| \, \|v_d\|} \quad (6)$$

Thus, the similarity between every word in the joint feature set K and the headline is obtained, and the similarity set is expressed as $\{s_1, s_2, \ldots, s_i, \ldots, s_T\}$. The updated weights $\delta$ are obtained by Formula (7), where $\theta_i$ is the original weight, $s_i$ is the similarity of word i with the headline, and $\delta_i$ is the updated weight of word i. Please note that the coefficient of the similarity value $s_i$ was determined by our preliminary experiment.
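The similarity computation of Formula (6) and a headline-based weight update can be sketched as follows. Formula (7) and its coefficient are not reproduced in this text, so the update rule $\delta_i = \theta_i (1 + \alpha s_i)$ and the value of `alpha` are assumptions chosen only to show the mechanism; the toy three-dimensional vectors stand in for the 256-dimensional word2vec and Doc2vec vectors.

```python
import math

def cosine_similarity(u, v):
    """Formula (6): cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def update_weights(weights, word_vectors, headline_vector, alpha=1.0):
    """ASSUMED update rule, not the paper's Formula (7):
    delta_i = theta_i * (1 + alpha * s_i), with a hypothetical coefficient alpha."""
    updated = {}
    for word, theta in weights.items():
        s = cosine_similarity(word_vectors[word], headline_vector)
        updated[word] = theta * (1.0 + alpha * s)
    return updated

# Toy vectors; real embeddings would come from trained models.
word_vectors = {"flies": [1.0, 0.0, 0.0], "price": [0.0, 1.0, 0.0]}
headline_vector = [1.0, 0.0, 0.0]
theta = {"flies": 0.2, "price": 0.2}
delta = update_weights(theta, word_vectors, headline_vector)
print(delta)
```

In this toy case the word aligned with the headline doubles its weight, while the orthogonal word keeps its original weight, which is the qualitative behavior the text describes.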

News Representation Based on Multi-Feature Fusion Using the Vector Space Model
In this paper, the news representation based on multi-feature fusion is modeled by the VSM (vector space model). The VSM is one of the most popular methods for text modeling, as it regards the news document as a set of unordered words. The joint feature used in the VSM is the set K in Section 2.3, while the calculation process of the words' weights is given in Sections 2.1-2.4. Thus, the vector space model of a news document can be expressed by the weights of the unordered words, as shown in Formula (8):

$$d = (\delta_1, \delta_2, \ldots, \delta_i, \ldots, \delta_D) \quad (8)$$

where $\delta_i$ is the weight of the i-th word, and D is the number of words in the dataset after preprocessing, TF-IDF feature extraction, and named entity recognition. The representation of the news in the dataset can then be obtained as $\{d_1, d_2, \ldots, d_j, \ldots, d_N\}$, in which N is the number of news documents in the dataset. The news representation combines the TF-IDF features and the named entity features and fuses the headline information, thereby yielding the news representation model based on multi-feature fusion. In this way, the similarity between different news documents can be calculated and cluster analysis can be performed.

Experiment

Data Preparation
The food news data in our experiment were gathered from several popular news websites in China, such as Headlines Today (https://www.toutiao.com/), Sina News (https://news.sina.com.cn/), and Sohu News (http://news.sohu.com/). These websites provide valuable, real-time information for people. The news data were used to evaluate the performance and robustness of our approach. The dataset was named "Food Safety News" and contains human-annotated facts. It contains 1255 Chinese news documents and the corresponding headlines from 10 events, where each event contains a variable number of news items ranging from 68 to 180. The vocabulary contains 84,198 unique terms after preprocessing. The total time span for the 10 events is from 1 January 2017 to 30 June 2019 (Table 1). In addition, the People's Daily [40] annotated corpus, which contains a large number of annotated person names, place names, organizations, and other proper nouns, is combined with the "Food Safety News" dataset to train the named entity recognition model.

Evaluation Metrics
TDT [41] proposed several evaluation metrics for topic detection, including Precision P, Recall R, and F1 score. In addition, the Miss Rate ($P_{miss}$) and False Alarm Rate ($P_{fa}$) are also important evaluation metrics of system performance. The evaluation status of event detection is shown in Table 2.

Table 2. Evaluation status of event detection.

| Category | Event Related | Event Irrelevant |
|---|---|---|
| Detected | a | b |
| Undetected | c | d |

Here, a is the number of detected news stories related to an event, b is the number of detected news stories irrelevant to the event, c is the number of undetected news stories related to the event, and d is the number of undetected news stories irrelevant to the event. Following the notation in Table 2, the evaluation metrics of TDT are shown in Formula (9):

$$P = \frac{a}{a + b}, \quad R = \frac{a}{a + c}, \quad F1 = \frac{2PR}{P + R}, \quad P_{miss} = \frac{c}{a + c}, \quad P_{fa} = \frac{b}{b + d} \quad (9)$$

In this paper, the detection cost function $C_{det}$ is used to evaluate the system performance [41]. It is a metric proposed in TDT2004 [42] that combines the miss rate $P_{miss}$ and the false alarm rate $P_{fa}$, and it is calculated by Formula (10):

$$C_{det} = C_{miss} \cdot P_{miss} \cdot P_{target} + C_{fa} \cdot P_{fa} \cdot P_{non\text{-}target} \quad (10)$$

where $C_{miss}$, $P_{target}$, $C_{fa}$, and $P_{non\text{-}target}$ are predefined parameters (TDT2004 set these parameters to 1.0, 0.02, 0.1, and 0.98, respectively). $C_{det}$ is usually normalized by Formula (11):

$$Norm(C_{det}) = \frac{C_{det}}{\min(C_{miss} \cdot P_{target},\; C_{fa} \cdot P_{non\text{-}target})} \quad (11)$$

The lower the $Norm(C_{det})$ value, the better the system performs [42]. In the experiment, the evaluation metrics of each event were first calculated, and then their average was taken to determine the system performance.
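The metrics defined above can be computed directly from the four counts of Table 2. The sketch below uses the standard TDT definitions with the TDT2004 parameter values quoted in the text; the function name `tdt_metrics` and the counts in the example are hypothetical, and the sketch does not guard against empty denominators.

```python
def tdt_metrics(a, b, c, d, c_miss=1.0, p_target=0.02, c_fa=0.1, p_non_target=0.98):
    """a, b, c, d follow Table 2; the defaults are the TDT2004 parameter values."""
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * precision * recall / (precision + recall)
    p_miss = c / (a + c)                  # related stories that were not detected
    p_fa = b / (b + d)                    # irrelevant stories that were detected
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * p_non_target
    norm_c_det = c_det / min(c_miss * p_target, c_fa * p_non_target)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "p_miss": p_miss,
        "p_fa": p_fa,
        "c_det": c_det,
        "norm_c_det": norm_c_det,
    }

# Hypothetical counts for a single event.
m = tdt_metrics(a=90, b=10, c=10, d=890)
print(m)
```

As described in the text, these values would be computed per event and then averaged over all events to obtain the system-level score.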

Results
In this paper, experiments were designed to compare different methods and verify the advantage of the proposed method. The experiments consist of two parts: (1) exploring the influence of the TF-IDF feature number M and the clustering threshold T on system performance, and then determining the optimal values of M and T; (2) comparing the P, R, and F1 score of the proposed method and other methods under the same feature number M and threshold T.

Parameter Selection
For the first experiment, we set M = 5, 10, 15, 20, ..., 50 and the clustering threshold T = 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, and 0.40 to test the system performance. The $Norm(C_{det})$ values of the system under different M and T are shown in Figure 3.
From Figure 3, we can see that the $Norm(C_{det})$ value varies with M and T. When M < 30, the $Norm(C_{det})$ value decreases gradually as M increases; when M = 30, the $Norm(C_{det})$ value reaches its minimum for every threshold; when M > 30, the $Norm(C_{det})$ value increases as M increases. Therefore, the clustering result and the system performance are best, i.e., $Norm(C_{det})$ is lowest, when the number of features M = 30.
Comparing the different thresholds T in Figure 3, we found that the dotted line (threshold T = 0.25) was the lowest regardless of the value of M. Therefore, in the following experiments, we used M = 30 as the number of news features and T = 0.25 as the clustering threshold.
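To make the role of the clustering threshold T concrete, the following is a minimal sketch of single-pass clustering over sparse VSM vectors. It rests on several assumptions that the text does not specify: each cluster is summarized by the running mean of its members, a document joins the most similar cluster when the cosine similarity is at least T, and the toy weights are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {word: weight} dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def single_pass(docs, threshold):
    """Assign each document, in arrival order, to the most similar existing
    cluster, or open a new cluster when no similarity reaches the threshold."""
    clusters = []  # each cluster: {"centroid": dict, "members": [doc indices]}
    for idx, doc in enumerate(docs):
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(doc, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            n = len(best["members"])
            keys = set(best["centroid"]) | set(doc)
            # Running mean of member vectors (an assumed centroid update).
            best["centroid"] = {
                k: (best["centroid"].get(k, 0.0) * n + doc.get(k, 0.0)) / (n + 1)
                for k in keys
            }
            best["members"].append(idx)
        else:
            clusters.append({"centroid": dict(doc), "members": [idx]})
    return [c["members"] for c in clusters]

# Hypothetical fused weights for three news documents.
docs = [
    {"flies": 0.6, "tea": 0.5},
    {"flies": 0.5, "tea": 0.6, "shop": 0.1},
    {"melamine": 0.7, "milk": 0.6},
]
events = single_pass(docs, threshold=0.25)
print(events)
```

Since documents are processed once in arrival order and new clusters are opened on demand, no number of clusters has to be fixed in advance; a higher threshold yields more, smaller clusters, and a lower one merges more documents into the same event.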

Food Safety Event Detection Results
The food safety event detection method based on multi-feature fusion combines the TF-IDF features with the named entity features to form the joint features, and then fuses the headline features to make the news representation more accurate. In this paper, a single-pass clustering algorithm was used to cluster news and generate food safety events. The experiment compared the results of event detection under the following news representation methods: (1) TF-IDF, (2) LDA, (3) TF-IDF and named entity (TI-NE), and (4) multi-feature fusion.
The Precision P, Recall R, and F1 score of food safety event detection under the different news representation methods, with the number of TF-IDF features M = 30 and the clustering threshold T = 0.25, are shown in Table 3. The experimental results show that the Precision P, Recall R, and F1 score of event detection based on LDA are 0.79, 0.81, and 0.79, which are 3, 6, and 4 percentage points higher than those of the method based on TF-IDF; that is, the method based on LDA is better than the method based on TF-IDF. After the TF-IDF features are combined with the named entity features (TI-NE), the Precision P, Recall R, and F1 score are 0.86, 0.84, and 0.84, which are 7, 3, and 5 percentage points higher than those of the method based on LDA, showing that the named entity features are important in representing the news documents. Compared with TI-NE, which only fuses the named entity features with TF-IDF, our method based on multi-feature fusion is 8, 9, and 9 percentage points higher on the three metrics. Compared with the method based on LDA, our method is 15, 12, and 14 percentage points higher. Compared with TF-IDF, our method is 18 percentage points higher on each of Precision P, Recall R, and F1 score, which shows that the proposed multi-feature fusion method is effective and outperforms the baseline methods based on TF-IDF and LDA.
When using the method based on TF-IDF, the data dimension is equal to the size of the dictionary constituted by all words in the news dataset. The dictionary size is 84,198, which produces a high-dimensional and sparse matrix and thus leads to low computational efficiency. In the multi-feature fusion method proposed in this paper, the dictionary size is reduced to 4562 after preprocessing, TF-IDF feature extraction, and named entity recognition, which means that the dimension of the news representation based on the multi-feature fusion method is only 4562. Compared with the traditional TF-IDF method, the dimension of the news representation is greatly reduced, so the computational efficiency is greatly improved.

Discussion
A food safety event detection method based on multi-feature fusion is proposed in this paper. The method combines the TF-IDF features and the named entity features of food news, and then fuses the headline features to obtain a more accurate news representation. Finally, the news is clustered based on this representation and events are obtained. The method proposed in this paper addresses the excessively large data dimension and low computational efficiency of traditional TF-IDF [43], as well as the need for manual data annotation and the inability to identify new events that arise when using the LDA method [33].
In this paper, we designed experiments on a real food safety news dataset. The experimental results show that the normalized detection cost ($Norm(C_{det})$) varies with the number of TF-IDF features M, and reaches its smallest value when M = 30. This is because, when the number of TF-IDF features is less than 30, fewer features carry less semantic information, so the content of a news report cannot be represented well; when the number of features is too large, some unimportant information is introduced, which makes the clustering results worse. Therefore, the number of features affects the performance of event detection, and an appropriate number of features is important for the system performance. The experimental results show that the system performs best when M = 30. The clustering results show that the threshold also affects the clustering results; the $Norm(C_{det})$ value is smallest when the threshold T = 0.25, which indicates that the system performs best at this threshold.
Comparing the results of event detection based on different news representation methods, when the TF-IDF features are combined with the named entity features, the Precision P, Recall R, and F1 score are better than those of LDA. This is because the named entities are part of the important information of a news report, so the joint feature carries richer information and the news representation is better. When the headline semantic information is fused, the results are higher than those obtained from the other methods, because the headline is a summarization of the news content. By calculating the similarity between the feature words and the headline vector, the weights of the feature words are updated and the key information in the food safety news representation becomes more prominent, which improves the results of event detection compared with the methods based on TF-IDF and LDA.
The method proposed in this paper also has some limitations. Since the data used in this paper are derived from events that have already occurred, the real-time performance of event detection cannot be guaranteed. Nevertheless, the method proposed in this paper reduced the data dimension, improved the results, and addressed the problem of food safety event detection more effectively.
In the future, we will focus on combining a variety of data sources to construct a versatile event detection method, reducing the computation overhead, and handling real-time news feeds.

Conclusions
In this paper, we designed a food safety event detection method based on multi-feature fusion. The method integrates TF-IDF features, named entity features, and headline features for news representation and food safety event detection. It addresses the shortcomings of traditional methods, such as the high dimensionality of data, a lack of semantic information, and the need for labeling in advance, and therefore enhances the results of event detection. Our proposed method is of great significance in improving the reputation of governments and reducing social anxiety and economic losses.

Figure 1. Overview of food safety event detection based on multi-feature fusion, where D denotes the dictionary constituted by all words in the news data and $V_D$ denotes the word vectors corresponding to D. ⊕ denotes the feature joint calculator, ⊗ denotes the feature fusion calculator, $W_i$ denotes the keyword set of each news document extracted by Term Frequency-Inverse Document Frequency (TF-IDF), and $E_i$ denotes the named entity set in a news document. $K_i$ denotes the set of joint feature words and $V_{hi}$ denotes the headline vector. The black dots denote the words in the bag of words after preprocessing, the black diamonds denote the keyword features extracted by TF-IDF, and the black squares denote the named entities in the news content.

Figure 2. Example of food safety news: (a) is the Chinese version; (b) is the English translated version.

Figure 3. $Norm(C_{det})$ under different numbers of features and thresholds. The abscissa denotes the number of TF-IDF features M and the ordinate denotes the value of $Norm(C_{det})$. The solid line with triangles, solid line with circles, solid line with squares, dotted line, dotted line with triangles, dotted line with circles, and dotted line with squares denote the value of $Norm(C_{det})$ when the clustering threshold T = 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, and 0.40, respectively.

Table 1. Safety news data summary.

Table 3. Comparison of Precision P, Recall R, and F1 score of different methods.