4.1. Results of Shallow-Machine-Learning-Based Approaches
On detecting disruptive events, Alsaedi and Burnap [18] obtained an F-measure of 80.24% using NB. They also measured per-cluster precision: 81.39% for politics, 80.62% for finance, 79.57% for sports, 73.23% for entertainment, 76.13% for technology, 77.54% for culture, and 82.26% for disruptive events. Similarly, Dey et al. [3] found 87.5% accuracy using the BNB classification model. They also compared their model with the SVM and decision tree (DT) classifiers, which achieved accuracies of 84.72% and 83.33%, respectively. Hossny and Mitchell [19] compared KNN, NB, LR, and DT and obtained an ROC score of up to 0.91 and an F1-score of up to 0.79, with a classification accuracy of 87%, precision of 77%, and recall of 82%. Kotzé et al. [20] measured their results using both unigram and Word2Vec models. Using word unigrams (1,1), the accuracy, macro-average recall, macro-average precision, macro-average F1, and micro-average F1-scores of LR were 0.899, 0.767, 0.772, 0.769, and 0.899, respectively. Using Word2Vec (i.e., CBOW and skip-gram), these scores were 0.712, 0.756, 0.485, 0.514, and 0.712, respectively. Ristea et al. [21] reported AUC values for different crime types: assault (0.70–0.76), battery (0.74–0.79), criminal damage (0.65–0.70), motor vehicle theft (0.60–0.74), other offenses (0.65–0.77), robbery (0.65–0.79), and theft (0.72–0.77).
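Several of the studies above report both macro- and micro-averaged scores (e.g., Kotzé et al. [20]). As a generic illustration of the difference, not drawn from any of the cited papers, the two averages can be computed in pure Python; the toy label vectors below are invented for demonstration:

```python
from collections import Counter

def macro_micro_f1(y_true, y_pred):
    """Per-class F1 averaged two ways: macro (unweighted mean over classes)
    and micro (from pooled counts; for single-label multiclass problems the
    micro F1 equals plain accuracy)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(f1s) / len(labels)
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    return macro, micro

# toy example: 3 classes, 10 predictions
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
macro, micro = macro_micro_f1(y_true, y_pred)
```

Because macro averaging weights every class equally, a poorly predicted minority class pulls the macro score below the micro score, which is why papers such as [20] report both.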
In detecting traffic events, Suma et al. [25] applied big data and machine learning techniques to detect spatiotemporal events and successfully detected the Underbelly Festival, The Luna Cinema, and Notting Hill Carnival. Again, Suma et al. [27] found that, for targeted event detection, the accuracy, area under the PR curve, and area under the ROC curve of LR were 78.734%, 84.706%, and 78.825%, respectively. They detected the Underbelly Festival in the South Bank; The Luna Cinema around Greenwich Park, National Trust-Morden Hall Park, and Crystal Palace Park; and Notting Hill Carnival 2017. On the other hand, Alomari et al. [32] selected SVM as their main classifier; for event detection, they selected LR for accident and road closure events and SVM for the remaining six event types. For accident, traffic condition, road closure, road damage, roadwork, social event, weather, and fire events, the accuracy scores of (SVM, LR, NB) were (95%, 95%, 93%), (96%, 95%, 92%), (93%, 95%, 91%), (98%, 94%, 95%), (93%, 93%, 89%), (99%, 98%, 96%), (99%, 99%, 96%), and (99%, 99%, 95%), respectively. The corresponding precision scores were (92%, 94%, 96%), (97%, 95%, 95%), (97%, 95%, 97%), (98%, 91%, 99%), (89%, 88%, 91%), (99%, 99%, 95%), (99%, 98%, 95%), and (99%, 99%, 95%); the recall scores were (96%, 95%, 89%), (96%, 95%, 90%), (88%, 95%, 84%), (98%, 98%, 92%), (97%, 98%, 87%), (99%, 98%, 98%), (99%, 99%, 96%), and (99%, 99%, 95%); and the F1-scores were (95%, 95%, 93%), (96%, 95%, 92%), (93%, 95%, 91%), (98%, 94%, 95%), (93%, 93%, 89%), (99%, 98%, 96%), (99%, 99%, 96%), and (99%, 99%, 95%).
To detect natural disaster-related events, Sakaki et al. [33] used two keywords, earthquake and shaking. For earthquake tweets, the average precision, recall, and F-score were 87.50%, 63.64%, and 73.69%, respectively; for shaking tweets, they were 80.56%, 65.91%, and 72.50%, respectively. Pekar et al. [36] used three scenarios: scenario 1 (training and testing on the same disaster event), scenario 2 (training and testing on the same set of events), and scenario 3 (training on some events and testing on others). They ultimately focused on scenario 3, the most common real-world case, for ensemble learning, and their experiments included AdaBoost, GBC, RF, DT, and disaster-based classifiers (DB-DT, DB-SVM, DB-MaxEnt). The best precision, recall, and F1-score for relatedness were 91.5% (DB-SVM), 100% (DB-DT, DB-MaxEnt), and 94.9% (DB-MaxEnt); for informativeness, 86% (AdaBoost), 100% (DB-DT), and 86% (DB-SVM); for topics, 60% (GBC), 45% (GBC), and 52% (GBC); and for eyewitnesses, 41% (GBC), 98% (DB-DT), and 19% (DB-SVM). Spruce et al. [37] found accuracy scores of around 86–99% across different geographical areas, including 100% in some areas, with an overall accuracy of 95% in detecting high-impact rainfall events.
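The F-scores quoted throughout this section are the harmonic mean of precision and recall; for instance, the values reported by Sakaki et al. [33] can be re-derived directly from their quoted precision and recall:

```python
def f_score(precision, recall):
    # harmonic mean of precision and recall (the F-measure / F1)
    return 2 * precision * recall / (precision + recall)

# earthquake tweets: precision 87.50%, recall 63.64%
f_quake = round(f_score(87.50, 63.64), 2)
# shaking tweets: precision 80.56%, recall 65.91%
f_shake = round(f_score(80.56, 65.91), 2)
```

Both results match the averages reported in [33] (73.69% and 72.50%), a quick consistency check that works for any study in this section that reports all three numbers.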
In detecting foodborne disease, Cui et al. [4] obtained 32.2% accuracy for feature extraction using the TextRank algorithm with a fixed context window, and 21.15% using TF/IDF. With a dynamic context window, they obtained accuracies of 35.3% for TextRank and 23.9% for TF/IDF. For location estimation, they experimented with 500, 1000, 1500, and 2000 tweets and obtained accuracies of 66.4%, 64.7%, 67.3%, and 64%, respectively. On the other hand, for early detection of acute disease, Joshi et al. [5] observed that their system detected three events before the official time (approximately 9 h earlier), namely, BreathMelbourne-StatSVM-1 in 2000, BreathMelbourne-StatSVMPerf-1 in 1000, and BreathMelbourne-StatSVMPerf-1 in 2000. Similarly, their system detected events in five cases before the first news report.
For detecting journalistically relevant events, Guimarães et al. [43] mainly compared the performance of automatic and human approaches, reporting the F1-scores of their model with different machine learning algorithms. The (automatic, human) F1-scores with SVM, NB, DT, RF, GBT, and Auto MLP were (0.50, 0.28), (0.64, 0.59), (0.46, 0.28), (0.39, 0.30), (0.57, 0.55), and (0.52, 0.38), respectively. In contrast, for tracking monolingual events, Kolya et al. [45] reported 5000 total target stories, of which 56 were similar to the initial story, 39 were identified as similar by their system, and 33 were identified correctly, yielding a recall of 58.93% and a precision of 84.62%.
Kumar and Sangwan [7] presented a four-step rumor detection model. They categorized their dataset into two classes, rumor or non-rumor, utilizing naïve Bayes and SVM approaches.
To detect sub-events, Nolasco and Oliveira [46] conducted two experiments. For the political protest dataset, the results were divided into two periods: from 16 June 2013 to 18 June 2013, their proposed algorithm identified a total of 14 sub-events, and from 19 June 2013 to 21 June 2013, it found a total of 20. For the second dataset, on the Zika epidemic, the results were divided into international and local parts: 14 sub-events were found in the international part and nine in the local part.
Feng et al. [47] presented a real-time event detection system and obtained precision, recall, and F1-scores of 90.3%, 87.6%, and 88.9%, respectively, with all proposed features. To establish the effectiveness of their model, they compared it with traditional 1-NN clustering and an LSH-based approach; the precision of their approach on the five detected event clusters outperformed both baselines. For real-world event detection, they obtained 17.1% precision, which also outperformed the two baseline models.
Zhang et al. [48] detected local events using two datasets, Los Angeles (LA) and New York (NY). For LA, they obtained precision, recall, and F1-scores of 0.804, 0.612, and 0.695, and for NY, 0.765, 0.602, and 0.674, respectively.
Jain et al. [49] assessed the trustworthiness of an event, considered different event indicators, and employed a weakly supervised technique for detecting events. They reported a precision of 95% for their supervised baseline model. On the other hand, Bodnar et al. [50] assessed the veracity of an event by applying NLP and ML techniques. Using an RF classifier, they found a mean ROC area of 73.14% and an accuracy of 75.41%. In contrast, Abebe et al. [51] focused on the semantic meaning of an event and conducted four experiments. In experiment 1, wL = wS = (1 − wT)/2, and wT varied independently between 0 and 1; the best NMI and F-score were 0.9943 and 0.9845, respectively, at wL = wS = 0.35 and wT = 0.3. In experiment 2, wT = wS = (1 − wL)/2, and wL varied independently; the best NMI and F-score were 0.9942 and 0.9843 at wT = wS = 0.35 and wL = 0.3. In experiment 3, wT = wL = (1 − wS)/2, and wS varied independently; the best NMI and F-score were 0.9943 and 0.9845 at wT = wL = 0.3 and wS = 0.4. In experiment 4, all weight values varied in the range 0.25 to 0.45, and the best NMI and F-score were 0.9943 and 0.9845 at wT = 0.25, wL = 0.35, and wS = 0.4. PN and George [8] proposed a four-module HEE model. They compared SVM and PLNN classifiers, where SVM gave precision, accuracy, sensitivity, and specificity of 71.3245, 91.333, 6.45, and 99.457, and PLNN gave 75.435, 96.456, 64.466, and 98.321, respectively. Gu et al. [52] extracted incidents from social media, rather than finding meaning or assessing veracity, and applied their method to the Pittsburgh and Philadelphia metropolitan areas. Their results showed that geocodable traffic-incident-related tweets accounted for about 5% of all collected tweets; of these, 60–70% were posted by dominant users such as public agencies and media, and the rest by individual users. Similarly, Nguyen and Jung [53] also extracted events and compared their model with the existing BNGram and LDA models. For their first dataset, T-REC, K-PREC, and K-REC were 0.769, 0.453, and 0.548, and for their second dataset, 0.455, 0.652, and 0.714, respectively. Bide and Dhage [54] also detected similar events and compared the precision of their approach with the existing PLSA, LDA, and EVE models. They used the average silhouette method to decide among different values of k in the k-means algorithm, and their model achieved a precision of 1/1, 5/5, 8/8, and 10/10 for k = 1, k = 5, k = 8, and k = 10, respectively.
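The average silhouette method used by Bide and Dhage [54] to pick k can be sketched as follows. This is a generic illustration on synthetic two-dimensional points, not their implementation, and it assumes scikit-learn is available:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# three well-separated synthetic blobs standing in for document vectors
X = np.vstack([rng.normal(loc, 0.1, size=(30, 2)) for loc in (0.0, 5.0, 10.0)])

# run k-means for several candidate k and keep the average silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# the k with the highest average silhouette is chosen
best_k = max(scores, key=scores.get)
```

With three clearly separated blobs, the silhouette peaks at k = 3; on real tweet-embedding data the peak is flatter, which is why [54] evaluates several k values.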
4.2. Results of Deep-Machine-Learning-Based Approaches
Zhang et al. [55] detected traffic events and used a correlation-coefficient threshold in the performance measurement of their DBN. At a threshold of 0.05, they found an accuracy of 0.94, a precision of 0.90 for accident-related data, and a precision of 0.96 for non-accident-related data; at another threshold setting, the corresponding values were 0.80, 0.65, and 0.87.
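As a purely illustrative sketch, and not Zhang et al.'s method, a Pearson correlation coefficient of the kind used to screen features against a binary accident label can be computed as follows; the feature and label vectors here are hypothetical:

```python
import numpy as np

def pearson_r(x, y):
    # Pearson correlation: covariance normalized by both standard deviations
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# hypothetical token-frequency feature vs. binary accident label
feature = [3, 0, 4, 1, 5, 0, 2, 6]
label = [1, 0, 1, 0, 1, 0, 0, 1]
r = pearson_r(feature, label)
# a feature with |r| below a chosen threshold (e.g., 0.05) would be discarded
```

Raising the threshold keeps only features that track the label more strongly, which is the kind of trade-off reflected in the two metric sets above.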
For detecting meteorological events, Shi et al. [58] compared their model with CNN and a single-grained capsule model. The micro-average recall, micro-average precision, and micro-average F1-score of the SFMED model were 0.738, 0.862, and 0.795, respectively, and in every case SFMED's values were better than those of CNN and the single-grained capsule. SFMED also outperformed the other two models in accuracy: the SFMED, CNN, and single-grained capsule accuracies were 0.941, 0.897, and 0.931, respectively. On the other hand, Burel and Alani [59] detected crisis events; the precision, recall, and F1-scores of their CNN model (with the full dataset) in predicting relatedness (related or unrelated), event types, and information types were (0.861, 0.744, 0.797), (0.991, 0.986, 0.988), and (0.634, 0.590, 0.609), respectively. With the sample dataset, the corresponding scores were (0.839, 0.838, 0.838), (0.983, 0.983, 0.983), and (0.610, 0.610, 0.610). Similarly, Abavisani et al. [60] detected crisis events and compared their model with compact bilinear pooling [110], compact bilinear gated pooling [111], and MMBT [112]. The accuracy, macro-F1, and weighted F1-scores of the SSE-Cross-BERT-DenseNet model on the informativeness task, the humanitarian categorization task, and the damage severity task were (89.33, 88.09, 89.35), (91.94, 68.41, 91.82), and (72.65, 59.76, 70.41), respectively. These measures were calculated using Setting A (i.e., excluding training pairs with inconsistent labels). In Setting B (i.e., informativeness and humanitarian categorization task evaluations), the values were (90.05, 88.88, 89.90) and (93.46, 84.16, 93.35), respectively, outperforming the other models in all aspects. In Setting C, where real-world events were considered, their model's values were (62.56, 39.82, 62.08), (84.02, 63.12, 83.55), and (86.30, 65.55, 85.93), respectively, again outperforming the other models in all aspects. For evaluation, Imran et al. [9] considered two measures, the detection rate and the hit ratio (HR). For the Joplin and Sandy datasets, they obtained detection rates of 78% and 41% and HRs of 90% and 78%, respectively; in both cases, Joplin showed higher accuracy. Overall, their model could identify from 40% up to 80% of disaster-relevant tweets, and their approach generated output that was 80% to 90% accurate. Fan et al. [61] offered a hybrid machine learning model for detecting disaster locations from social media data. Their proposed fine-tuned BERT model obtained a validation accuracy of 95.55% and a test accuracy of 75.37%, outperforming the other baseline models. Huang et al. [62] compared their model with different classifiers and showed that text CNN obtained precision, recall, and F1-scores of 0.82, 0.92, and 0.87, respectively. For time extraction, they obtained an accuracy of 96.9%, and for location extraction, 94.7%.
For detecting events in the sports domain, Kannan et al. [6] showed the ROC curve of their model. For an evaluation window of 10 min, they obtained an AUROC of 80 percent, meaning that significant events were identified well within 10 min of their actual occurrence time.
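The AUROC reported by Kannan et al. has a useful probabilistic reading: it is the probability that a randomly chosen positive instance is scored above a randomly chosen negative one. A minimal sketch of this pairwise-ranking computation, using invented scores, is:

```python
def auroc(scores, labels):
    """Area under the ROC curve via its pairwise-ranking interpretation:
    the fraction of (positive, negative) pairs in which the positive
    instance receives the higher score (ties count as half a win)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# toy classifier scores and ground-truth event labels
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
area = auroc(scores, labels)  # 8 of the 9 positive-negative pairs are ranked correctly
```

An AUROC of 0.80, as in [6], thus means a detected event outranks a non-event 80% of the time.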
Shen et al. [63] detected adverse drug events (ADEs) using two datasets, TwiMed [113] and TwitterADR [114], in their experiment. On the TwiMed dataset, they obtained precision, recall, and F1-scores of 76.14, 75.26, and 75.25, and on the TwitterADR dataset, 80.19, 71.23, and 74.49, respectively.
For multilingual event detection, Liu et al. [64] measured their detection effectiveness and found precision, recall, and F1-scores of 0.6891, 0.7833, and 0.7332, respectively. For event detection, they achieved a 35.27% improvement in computing speed, and in generating the event evolution graph, their model took an average of 2.5 s less time.
Ahmad et al. [65] also followed a multilingual approach. Using monolingual word embeddings as input features in their first experiment, they obtained precision, recall, and F1-scores of 0.32, 0.25, and 0.25 for Hindi; 0.22, 0.18, and 0.18 for Bengali; and 0.18, 0.21, and 0.18 for English, respectively. With multilingual word embeddings as input features in their second experiment, they obtained 0.32, 0.25, and 0.26 for Hindi; 0.33, 0.25, and 0.26 for Bengali; and 0.33, 0.29, and 0.28 for English. In their third experiment, also using multilingual word embeddings, they obtained 0.40, 0.37, and 0.36 for Hindi; 0.35, 0.29, and 0.30 for Bengali; and 0.43, 0.38, and 0.39 for English.
Sub-events follow events, and Bekoulis et al. [66] concentrated on this. They compared both strategies (i.e., the relaxed evaluation and the bin-level evaluation), with and without a chronological LSTM, in terms of precision, recall, and F1-scores, and obtained the highest F1-score of 86.59% for the Tweet-AVG model under the relaxed evaluation. They concluded that their proposed binary classification baseline exceeded state-of-the-art approaches in sub-event detection, with F1-scores of 76.16% and 75.65% at the macro and micro levels, respectively.
Chen et al. [67] tracked and detected online events, measuring precision, recall, F1, and DERate, for which they obtained values of 0.800, 0.774, 0.750, and 0.063, respectively. They also compared their model with the PS [115], TS [116], and MABED [117] models, and in every case their model outperformed the other three. On the other hand, Aldhaheri and Lee [68] detected temporal events. They reported precision, recall, accuracy, and F1-scores for different regions to show the impact of image downsampling on their proposed model's performance and obtained the highest accuracy of 99.3%. Qiu et al. [69] proposed a single-pass clustering model and showed that their approach achieved an NMI of 0.86, an ARI of 0.69, and an F1-score of 0.70, demonstrating its effectiveness. Ali et al. [70] proposed sentence-level multiclass event detection; the accuracy, macro-average precision, macro-average recall, and macro-average F1-score were 0.84, 0.84, 0.84, and 0.84 for DNN; 0.81, 0.82, 0.80, and 0.81 for RNN; and 0.80, 0.82, 0.75, and 0.78 for CNN, respectively. They also compared their approach with machine learning algorithms, whose accuracies were 0.78 (KNN), 0.73 (DT), 0.70 (MNB), 0.80 (LR), 0.80 (RF), and 0.73 (SVM).
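The NMI and ARI reported by Qiu et al. [69] are standard clustering-agreement measures. Assuming scikit-learn is available, they can be computed from ground-truth and predicted cluster assignments as follows; the label vectors here are invented for illustration:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # ground-truth event clusters
pred  = [0, 0, 1, 1, 1, 1, 2, 2, 2]   # predicted clusters (one misassigned tweet)

nmi = normalized_mutual_info_score(truth, pred)
ari = adjusted_rand_score(truth, pred)
# both measures reach 1.0 for a perfect clustering and fall toward 0
# (ARI can go slightly negative) for a random one
```

Both measures are invariant to how the clusters are numbered, which makes them suitable for unsupervised event-clustering evaluations like those in [69].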
4.4. Results of Other Approaches
Ansah et al. [78] detected disruptive events. Their model detected 96% of ground-truth protest events (with the FP dataset) using a time window of 60 min. The topic intrusion, topic coherence, and precision values of SensorTree, TweetInfo, tLDA, EventTweet, and KwFreq were (0.22, 0.50, 0.70, 0.72, 0.80), (0.85, 0.50, 0.30, 0.55, 0.20), and (0.92, 0.62, 0.40, 0.60, 0.30), respectively; SensorTree outperformed the other models in all respects. On the other hand, for data breach events, the typed DQE (dynamic query expansion) method of Khandpur et al. [79] obtained precision, recall, and F1-scores of 0.74, 0.68, and 0.71; baseline 1 obtained 0.52, 0.93, and 0.67; and baseline 2 obtained 0.21, 0.20, and 0.20. For DDoS events, the typed DQE method obtained 0.80, 0.85, and 0.82; baseline 1, 0.72, 0.48, and 0.58; and baseline 2, 0.01, 0.02, and 0.01. For account hijacking events, the typed DQE method obtained 0.99, 0.45, and 0.61; baseline 1, 0.99, 0.48, and 0.65; and baseline 2, 0.01, 0.01, and 0.01.
In detecting traffic events, Alomari et al. [82] found that the highest number of event-related tweets in Jeddah were posted at hour 22 (i.e., 10 p.m.); the rate decreased after hour 3 and was lowest at hour 8. In Makkah, the tweet rate was consistently high, dropping only during the Al-Fajr and Al-Dhuhr prayer times (hours 5–12). They also identified the most mentioned places in Jeddah and Makkah. The top three events detected in Jeddah were accident, fire, and inauguration, while the top events detected in Makkah were rain and accidents.
Rossi et al. [83] detected disaster events. To test the effectiveness of their informativeness classification, they used the CrisisLexT26 database, which consists of tweets from 26 emergencies [123]. Evaluating their model on different subsets of this database, they obtained an overall effectiveness of around 70%.
To support pharmacovigilance, Rosa et al. [89] analyzed social media data and found that 91% of the extracted correlations were reliable, with a residual error of ±4%.
In detecting city events, Anantharam et al. [90] obtained a precision of around 94% for CRF model creation, while their baseline model showed a precision of around 91%. They extracted 1042 city events, 454 of which co-occurred with 511.org reports.
For the detection of sub-events, Arachie et al. [10] generated 40 clusters for Hurricane Harvey and 50 clusters for the Nepal Earthquake, comparing their method with the DEPSUB and MOAC+NV methods. For Hurricane Harvey, from 795,461 unlabeled and 4000 labeled tweets, their model identified 769,670 unique noun–verb pairs and 27,122 phrases, giving 796,792 candidate sub-events (the noun–verb pairs plus the phrases), while DEPSUB had 769,670. For the Nepal Earthquake, from 635,150 unlabeled and 3479 labeled tweets, their model identified 577,914 unique noun–verb pairs and 36,980 phrases, giving 614,894 candidate sub-events, while DEPSUB had 577,914.
In order to detect real-time events, Fedoryszak et al. [93] proposed a framework. They ultimately built a production application that resolved diverse product use cases and enhanced the user-examination-related metrics of continuous events.
For detecting local events, Choi et al. [94] considered OnlyTweet as their approach without relevant documents. The precision values of TBEM [19], GTBEM [48], OnlyTweet, and the proposed method were 0.42, 0.38, 0.64, and 0.78, respectively; the recall values were 0.40, 0.36, 0.61, and 0.83; and the F-scores were 0.41, 0.37, 0.62, and 0.85.
Yang et al. [95] proposed an event discovery approach and compared their model with SDR, DGNMF, LLRR, LSDT, and SMDR. In the X->Y scenario (using seed samples from the X domain to detect events in the Y domain), the NMI and F1-scores of ICLR were 0.7932 and 0.5748, respectively, and ICLR outperformed the other baseline models. In the Y->X scenario, the NMI and F1-scores of ICLR were 0.5363 and 0.2207, respectively; ICLR outperformed the other models in NMI, while in F1-score, SMDR (F1 = 0.2432) outperformed ICLR. Comito et al. [97] proposed an online algorithm and reported event recall, keyword recall, and keyword precision on three datasets, namely, FA Cup, Manhattan, and the US Election. For the FA Cup dataset, BEaTS obtained an event recall, keyword recall, and keyword precision of 0.61, 0.65, and 0.59, respectively; for the US Election dataset, 0.67, 0.55, and 0.57; and for the Manhattan dataset, 0.70, 0.72, and 0.69. For detecting events, Gao et al. [98] compared their model with the candidate-ranking (CR) [124] and CLASS-SVM (CS) [125] methods. For event detection from textual content, their proposed method showed improvements of 40.9%, 46.1%, 43.4%, and 19.4% in recall, precision, F-score, and ANMRR values, respectively, compared to the CR method, and improvements of 50.7%, 55.5%, 53.0%, and 20.9% compared to the CS method. For event detection from visual content, they found around 20% improvement in all evaluation metrics. Proposing a novel model, Shi et al. [100] compared their model with independent cascade (IC), bursty event detection (BEE), EVE, and HEE. The influence scores of BEE+IC, EVE+IC, HEE+IC, and EDMP in the first, second, and third propagations were (78, 78, 82, 82), (92, 95, 96, 110), and (82, 81, 84, 226), respectively, showing that EDMP outperforms the other models. Dong et al. [101] detected real-world events, including the OWS protests at Zuccotti Park, Union Square, and Foley Square, as well as the Raise Cache tech event and the Mastercard free lunch promotion event. Similarly, Hasan et al. [103] proposed a practical approach. For their experiments, they used the Events2012 corpus and a collection of 506 ground-truth events. Varying different parameters, such as the M, ts, Q, t, and g values, they obtained highest recall and precision pairs of 0.96 and 0.80, 0.96 and 0.82, 0.96 and 0.83, 0.96 and 0.85, 0.96 and 0.89, and 0.96 and 0.89, respectively; as a result, they determined the optimal parameter setting for their system, with a recall of 0.96 and a precision of 0.89, the recall of 0.96 being measured on the ground-truth events. Again, Valkanas and Gunopulos [104] presented an effective approach. On average, their proposed method took 3.36 ms, 0.35 ms, and 0.001 ms for location extraction, classification, and event extraction, respectively, for a total average time of 3.72 ms. Likewise, Sun et al. [105] detected events and compared their model with PLSA, finding that their method outperformed PLSA on every topic. For the "The Nepal Earthquake" topic, the precision of PLSA was 10/10, 11/20, and 13/30 for the top-10, top-20, and top-30 posts, respectively, whereas EVE's precision for the same topic was 10/10, 15/20, and 21/30. For "The Explosion at Fujian Plant" topic, the precision of PLSA for the top-10, top-20, and top-30 posts was 10/10, 19/20, and 25/30, while for EVE these values were 10/10, 20/20, and 28/30. On the other hand, Sivaraman et al. [106] introduced a unique parameter and compared their model with Twitinfo, obtaining precision, recall, and F1-scores of 0.727, 0.774, and 0.75. Although their precision was lower than Twitinfo's, their recall and F1-score outperformed it. To demonstrate the generality of their model, they also tested it on the ICWSM dataset and obtained a precision, recall, and F1-score of approximately 0.718, 0.988, and 0.832, respectively. On the contrary, Akbari et al. [107] detected wellness events. They compared their proposed model with various state-of-the-art approaches such as Alan12, SVM, Lasso, GL-MTL, and TN-MTL and obtained the best F1-score of 84.86% for gMTL. Again, Zakharchenko et al. [108] obtained a qualitative result which, after analysis of the data, demonstrated that their first hypothesis (H1) was true. They divided their disclosure analysis of the publications into three categories; their results showed that the interrelation between "media quality" and the relative media attention for the first type of publication was 0.043, which is insignificant. After their experiment, Pomytkina et al. [109] found that 44.8% of students (47 people) do not feel irritated by online bullying, while 27.6% (29 people) sometimes feel irritated, 18.1% (19 people) constantly feel irritated, and 9.5% (10 people) often feel irritated; additionally, 9.5% of students (10 people) believe that online bullying does not exist. They also identified the roles of students during cyberbullying: 21% cyberbullies (initiators), 12.7% assistants, 41.9% defenders, 12.4% victims, and 12% witnesses.