Improving Crisis Events Detection Using DistilBERT with Hunger Games Search Algorithm

This paper presents an alternative event detection model that integrates DistilBERT with a recent meta-heuristic technique named the Hunger Games Search (HGS). DistilBERT extracts features from the text dataset, while a binary version of HGS is developed as a feature selection (FS) approach that removes irrelevant features from those extracted. To assess the developed model, a set of experiments is conducted using real-world datasets. In addition, we compare the binary HGS with a set of well-known FS algorithms, as well as with state-of-the-art event detection models. The comparison results show that the proposed model is superior to the other methods in terms of the performance measures.


Introduction
In the past decade, as social media has grown in popularity and in number of users, more and more studies have considered the exploitation of crowdsourced events [1,2]. When a crisis occurs, individuals often use social media to express their concerns and expectations about specific agents, such as events, persons, targets, or policy proposals. These posts allow event organizers to learn quickly and effectively what is going on around them [3]. Social media has demonstrated its usefulness as a valuable spatial information source for awareness and disaster monitoring during many recent crisis events, such as the transmission of infectious disease, volcanic eruptions, tropical storms, tornadoes, earthquakes, river flooding, forest fires, and nuclear accidents.
Event detection (ED) is a crucial information extraction task that discovers event triggers (words or phrases that elicit events in text) and identifies event types. A crucial part of ED on social media is detecting and describing crisis-related events when the type of event of concern is unknown ahead of time [4]. Although there is much content on social networks, relatively little of it is related to crisis events and offers meaningful information. In most social media postings, informative material is overshadowed by unrelated noise. Some prior studies aim to obtain useful content by using text classifiers to process these large amounts of social media data and translate them into corporate data [5]. Previous work specifically on event detection centered on developing domain-specific text classifiers [6]. Numerous recent studies looked at natural disasters such as floods and storms, as well as man-made disasters such as acts of terrorism and bombings [7,8]. These studies concentrate on binary classification of various crisis characteristics, such as determining the source type, predicting the relation between a tweet and a crisis, and determining information quality and relevance. Other studies [9,10], on the other hand, presented multi-class classifiers with categories such as affected people, infrastructural facilities, deaths, donors, warnings, and guidance. Moreover, the recognition of crisis types, such as typhoons, floods, and fires, has also been carried out [11]. In recent years, deep learning (DL)-based models have frequently been used to carry out the tasks noted above.
A DL structure consists of multiple layers, each of which is loosely associated with a different aspect of the brain. Every layer has its own set of neurons, inputs, outputs, and activation functions. Although there is a trade-off between generalization and computational cost, the choice of method, the training of the weights, the way the feature maps are constructed, and the parameter values all affect the performance of DL techniques. DL techniques have recently demonstrated excellent performance in several fields, such as the Internet of Things [12], sentiment analysis [13], and toxicity classification on social media [14]. Moreover, several papers proposed DL models for crisis-related knowledge classification and detection. For example, using domain-specific and GloVe embeddings, Alrashdi and O'Keefe [15] investigated two DL structures: Bidirectional Long Short-Term Memory (BiLSTM) and a convolutional neural network (CNN). In addition, a similar CNN method was proposed that uses Twitter posts to identify disaster-related events [16]. DL architectures, particularly recurrent neural networks (RNNs), have recently gained popularity for detecting crisis events due to their ability to represent sequence data [17,18].
Nevertheless, there are a few drawbacks to using only RNNs for data modeling. First, while an RNN can capture sequential information through recurrence, it cannot encode information from both the left and right contexts of every position in the sequence. When detecting targeted activity in disaster event texts, it is critical to consider the complete contextual information rather than just the data from previous steps. Second, existing RNN-based event detection methods learn common patterns by predicting each following event document from the previous crisis texts. This learning goal is primarily concerned with capturing the relations between crisis event texts in normal sequences. When the relations in a test sequence are broken, the RNN model cannot accurately predict the following sample from the previous event messages, and the sequence is then classified as anomalous. Moreover, using only the prediction of the following event text as the objective function does not allow for the precise encoding of the patterns shared by all normal sequences.
Advances in language model pre-training have markedly improved the state of the art on a variety of Natural Language Processing (NLP) tasks, with notable fine-tuning approaches including BERT, ALBERT, XLNet, RoBERTa, and DistilBERT [19]. Due to the wide availability and ease of use of these techniques, multiple architectures relying on fine-tuning have emerged. In this paper, we use Distilled Bidirectional Encoder Representations from Transformers (DistilBERT) [19], which has surpassed many methods, to overcome the drawbacks of RNN-based methods described above [11]. For event detection datasets, which have large feature sizes, these transformer models suffer from local optima because of the large solution space. To overcome this problem, feature selection (FS) is used in classification tasks to remove noisy features and thereby improve performance. Recently, researchers have used metaheuristic optimization algorithms to remove irrelevant features and reduce high-dimensional datasets, as in [20,21]. Therefore, we use a recent metaheuristic algorithm, Hunger Games Search (HGS) [22], in our approach. The reasons for employing HGS to solve the FS problem in this paper are as follows: we want to examine the most recent HGS optimizer, and, when compared with complex, modern, and high-efficiency algorithms, the HGS optimizer has been reported to find better solutions for the problems examined, typically with better performance (i.e., fewer iterations and less execution time).
This paper focuses on identifying crisis events introduced or discussed on social media (such as Twitter and Wikipedia). As there is so much data to consider, the only practical way to analyze it is to use DistilBERT to extract features automatically and effectively. Next, the Hunger Games Search (HGS) optimization approach is used for feature selection, which is a crucial stage because of the "curse of dimensionality". As far as we are aware, there have been no studies using HGS for event detection problems. Finally, for the classification, Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) are used, as they are the classifiers most frequently used in the FS literature [23,24], for the following reasons. The SVM is used to tackle binary problems and seeks to maximize the margin between positive and negative classes around a hyperplane. As a result, the hyperplane with the greatest distance to the closest training point of any category is obtained, resulting in good class discrimination. Furthermore, KNN is one of the most widely used Machine Learning (ML) and pattern classification approaches, owing to its simplicity and ease of implementation compared to more complex supervised ML approaches. The proposed model's contributions can be summarized as follows:
• A transformer-based model named DistilBERT is used to learn and automatically extract meaningful and complex text representations from the input data.
• The pre-trained DistilBERT is fine-tuned on crisis detection data with the aim of maximizing the detection accuracy and performing feature extraction.
• A feature selection algorithm named binary Hunger Games Search (HGS) is combined with DistilBERT in a framework that performs feature selection and dimensionality reduction on the features extracted by DistilBERT.
• The overall proposed framework is evaluated on various real-world datasets to assess the crisis detection accuracy and to compare the performance of the framework with several state-of-the-art techniques.
The remainder of this paper is structured as follows. Section 2 reviews recent work on detecting crisis events. Section 3 describes the proposed technique. Experimental results and the case study are reported and discussed in Section 4, followed by the conclusion in the last section.

Related Works
Several articles have reported how to retrieve useful information from websites in the event of a disaster [25][26][27]. These techniques are mainly based on the characteristics used to identify crisis-related information. Rudra et al. [25] classified posts on Twitter into situational and non-situational information using low-level lexical and syntactic features. They tested their methodology on various disaster datasets and compared it with the Bag-of-Words (BOW) approach for English and Hindi Twitter posts. For accurate classification of eyewitnesses, Zahra et al. [26] applied linguistic and contextual features to the classification method. They divided eyewitness reports into three types: direct eyewitness, indirect eyewitness, and vulnerable direct eyewitness accounts. They conducted experiments with earthquake, flood, storm, and wildfire datasets. In order to detect emergency tweets, Kejriwal and Zhou [27] used low-supervision and transfer learning-based methods. They conducted experiments with datasets from the earthquakes in Kerala, Macedonia, and Nepal. They demonstrated that their technique is useful, particularly when the labeled data is limited, and that it surpasses existing baseline techniques. All of these studies, however, concentrate on disaster-related Twitter posts as a whole rather than on sub-categories of tweets such as infrastructure damage, human casualties, and resource needs and availability. Little research has concentrated on classifying tweets during a crisis [28,29]. Madichetty and Sridevi [30] concentrated on a design to detect the availability and scarcity of resources. They used a re-ranking feature selection method to extract information from Twitter posts and passed the output of the selection step to the classification model to identify the availability and need of resources.
To extract resource needs and availability from crisis Twitter posts, Basu et al. [31] used data gathering methods, individual word embeddings, and a mixture of word and character embeddings. They utilized earthquake datasets from Italy and Nepal. Dutt et al. [32] proposed a method for deciphering the semantic meaning of need and availability tweets, such as what resource is available or required and where it is located. They also devised a method for matching need and availability tweets in terms of resource and location. Alam et al. [33] looked at tweets from three hurricanes (Harvey, Irma, and Maria) to see what people were saying. They used a Random Forest (RF) based on the bag-of-words method for multi-class categorization in their trials, with categories such as affected individuals, infrastructure and utility damage, warnings, and guidance. They showed that a tweet's content and photo together provide additional information. Purohit et al. [29] created a social-EOC model to identify and rank community service requests shared on social media. They reduced duplication by combining related requests and saved time by using a linguistic clustering strategy. They utilized data from the Alberta floods, Hurricane Sandy, Hurricane Harvey, and the earthquake in Nepal, among others. Madichetty and Sridevi [28] developed a majority voting-based ensemble framework to identify medical resource tweets during an emergency. Relying on features directly applicable to medical resource posts on Twitter, they used classification techniques such as support vector machine (SVM), AdaBoost, Random Forest, and gradient boosting methods. Even so, they did not rely on final inspection from social media during an emergency. However, these approaches were still unable to achieve a high degree of efficiency.
Thus, researchers have recently turned to transformers, which are reported to perform well in understanding and inference tasks.
Transformers have unquestionably been the most popular type of Natural Language Processing (NLP) system in recent years [34]. BERT [35], the movement's "golden child", was the first method to apply bidirectional training of a transformer to language modeling. BERT is trained with a masked language modeling objective: randomly chosen tokens in a sentence are replaced by a [MASK] token, and the model tries to predict the masked tokens from their context. Following BERT's success, many comparable models have been proposed in biomedical NLP, each with a different form of in-domain training, such as using a different corpus, as introduced in [36]. SpanBERT [37] is another BERT variant that is a useful addition to the BERT family. Instead of single words, random contiguous spans of tokens are masked during SpanBERT's training, forcing the model to predict the entire span from the tokens at its boundaries. Recently, Liu et al. [11] proposed CrisisBERT, a transformer-based classification method that outperformed traditional linear and DL techniques in terms of stability and performance. CrisisBERT was introduced for crisis detection and crisis recognition tasks. However, in the recent approaches, fine-tuning the pre-trained transformer for classification tasks produced irrelevant features [38], which reduce the model performance. To overcome this problem, our methodology integrates a transformer with meta-heuristic optimization in order to improve performance.

Distilled BERT for Feature Extraction
The architecture of the proposed feature extraction model based on DistilBERT is shown in Figure 1. DistilBERT receives as an input X which represents a tweet from the dataset (word sequence). The inputted sequences to DistilBERT are converted to a set of embedding vectors where each vector is mapped to each word in a sequence (S1).
DistilBERT uses the transformer encoder to learn the contextual information for each word. The transformer encoder uses a self-attention mechanism to generate the contextual embeddings (S2). The extracted contextual embeddings for each word are concatenated into a single vector to represent the semantic information presented in the tweet (S3). S3 is the input of a fully connected layer that outputs a vector of size d where d is the number of neurons. Later, a classification layer is placed at the end of the feature extractor model to fine-tune the pre-trained DistilBERT on the event detection task and predict the corresponding event class for each inputted sequence (tweet). In what follows, we detail the model fine-tuning and feature extraction processes.

Lexicon Encoder
Each tweet is represented as a sequence of $s$ tokens (words). Thus, the input $X = \{x_1, \ldots, x_s\}$ is fed to the lexicon encoder, which maps each token to its corresponding embedding vectors representing the word, segment, and positional embeddings. Based on the encoding proposed by Devlin et al. [35], the special token [CLS] is placed as the first token in the sequence ($x_1$), whereas the [SEP] token is placed at the end of the sequence. To generate the embedding vectors for $X$, the lexicon encoder sums the word, segment, and positional embeddings for each token in $X$.
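To make the summation step concrete, the following is a minimal sketch of a lexicon encoder in plain Python. The toy vocabulary, the embedding dimension, and the randomly initialized tables are all hypothetical stand-ins; in the real model these tables are learned during pre-training.

```python
import random

def build_table(rows, dim, rng):
    """Hypothetical embedding table: one small random vector per row."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]

def lexicon_encode(token_ids, word_emb, seg_emb, pos_emb, segment_ids=None):
    """Sum the word, segment, and positional embeddings of every token."""
    if segment_ids is None:
        segment_ids = [0] * len(token_ids)  # single-sentence input
    encoded = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [w + s + p for w, s, p in zip(word_emb[tok], seg_emb[seg], pos_emb[pos])]
        encoded.append(vec)
    return encoded

rng = random.Random(0)
vocab = {"[CLS]": 0, "[SEP]": 1, "flood": 2, "in": 3, "city": 4}  # toy vocabulary
dim, max_len = 8, 16                                              # illustrative sizes
word_emb = build_table(len(vocab), dim, rng)
seg_emb = build_table(2, dim, rng)
pos_emb = build_table(max_len, dim, rng)

# [CLS] opens the sequence and [SEP] closes it, as described above.
tokens = ["[CLS]", "flood", "in", "city", "[SEP]"]
s1 = lexicon_encode([vocab[t] for t in tokens], word_emb, seg_emb, pos_emb)
```

Here `s1` holds one embedding vector per token and plays the role of S1 in the text.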

Transformer Encoder
Using DistilBERT, the representation is learned via pre-training. We employ a pre-trained multilayer bidirectional transformer encoder to map the input vectors (S1) into contextual embedding vectors, one for each word. DistilBERT uses knowledge distillation to reduce the number of parameters of the BERT base model (bert-base-uncased) by 40%, making the inference 60% faster, as shown in Figure 2. The main idea of distillation is to approximate the full output distributions of the BERT model using a smaller model such as DistilBERT. Thus, the number of transformer layers (encoders) in the BERT base (12 layers) has been reduced to six. The pre-trained model contains 66 million trainable parameters, compared to the BERT base model with 110 million parameters. In terms of training time, DistilBERT was trained in 3.5 GPU days (8 × V100) compared to 12 GPU days (8 × V100) for the BERT base. DistilBERT is trained on 16 GB of data collected from the Toronto Books Corpus and English Wikipedia (the same as the BERT base training data). During the training of DistilBERT, a large batch size (400) with gradient accumulation is used, where the gradients from multiple mini-batches are accumulated locally before the parameters are updated in each step. In addition, the next sentence prediction (NSP) objective and the segment embeddings are omitted in the training process. The static masking used in the BERT base model is replaced by dynamic masking applied during training.
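The idea of approximating the teacher's full output distribution can be illustrated with a soft-target loss. The sketch below is not the DistilBERT training code: it assumes the common temperature-softened formulation, and the temperature value and the example logits are illustrative only.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

teacher = [2.0, 1.0, 0.1]            # hypothetical teacher (BERT) logits
matched = distillation_loss(teacher, teacher)
mismatched = distillation_loss(teacher, [0.1, 1.0, 2.0])
```

The loss is smallest when the student reproduces the teacher's distribution, so `matched` is lower than `mismatched`.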

Fine-Tuning on Event Detection Task
Let S3 be the contextual embedding learned for the token [CLS], which serves as the semantic representation of the input tweet X. The task is formulated as a multiclass classification problem. Thus, the probability of X being classified as class c (i.e., the event) is predicted with the Softmax function as in Equation (1):
$$P(c \mid X) = \mathrm{softmax}\left(W^{\top} S_3\right), \quad c \in \{1, \ldots, r\} \qquad (1)$$
where W is the weight matrix learned during the fine-tuning of the pre-trained model that is used to initialize the feature extractor model, and r is the number of classes. It is worth noting that the first five transformer layers in the pre-trained model are not trainable. We only fine-tuned the last transformer layer (encoder) of the pre-trained model and replaced the classification layer with two fully-connected layers for feature extraction and classification, respectively.
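As a small illustration of the classification step, the sketch below computes class probabilities as a softmax over linear scores of the representation S3. The three-dimensional vector and the weight values are made up for the example; the real S3 and W are much larger and are learned.

```python
import math

def class_probabilities(s3, weights):
    """Return softmax(W^T s3); weights[c] holds the weight column of class c."""
    logits = [sum(w * x for w, x in zip(col, s3)) for col in weights]
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

s3 = [0.5, -1.0, 2.0]                     # toy [CLS] representation
W = [[0.1, 0.2, 0.3],                     # r = 3 classes, one column each
     [0.0, -0.5, 0.4],
     [1.0, 0.0, -0.2]]
probs = class_probabilities(s3, W)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

The predicted event class is simply the index with the highest probability.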

Feature Extraction Layer
As mentioned in the previous paragraph, a fully connected layer is placed on top of the pre-trained DistilBERT model and serves as our feature extraction point, rather than retrieving a large vector of size 768. The output vector S3 generated by the last transformer layer of DistilBERT is fed to the fully connected layer, which has an output of size 128, to reduce the dimensionality of the feature space, and the result is later passed to the classification layer. At this stage, a GELU activation function [39] is used with the fully connected layer, followed by a Dropout regularizer to prevent over-fitting. The GELU activation function is defined as follows:
$$\mathrm{GELU}(m) = m\,\Phi(m) \qquad (2)$$
where m is the output of the fully-connected layer and Φ(m) represents the cumulative distribution function of the standard Gaussian distribution.
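The GELU activation can be evaluated exactly with the error function, since Φ(m) = ½(1 + erf(m/√2)). The following sketch applies it to a tiny dense layer; the layer sizes and weights are placeholders, and dropout is left out for brevity.

```python
import math

def gelu(m):
    """GELU(m) = m * Phi(m), with Phi the standard normal CDF."""
    return m * 0.5 * (1.0 + math.erf(m / math.sqrt(2.0)))

def dense_gelu(s3, weights, biases):
    """Fully connected layer followed by GELU; one weight row per output unit."""
    return [gelu(sum(w * x for w, x in zip(row, s3)) + b)
            for row, b in zip(weights, biases)]

# Toy example: 2 inputs -> 2 outputs (the paper's layer maps 768 -> 128).
features = dense_gelu([0.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

For large positive inputs GELU behaves like the identity, and for large negative inputs it approaches zero.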

Hunger Games Search
Yang et al. [40] proposed the Hunger Games Search (HGS) algorithm (Algorithm 1) as an optimization approach that models the hunger-driven behavior of animals. HGS is characterized by treating hunger as one of the most important homeostatic motivations for decisions, behaviors, and actions in an animal's life. The mathematical model of HGS begins with a population of N solutions, X, and computes the objective function value $Fit_i$ of each solution. The solutions are then updated in the update phase, in which $r_1$ and $r_2$ are random numbers, rand generates numbers from a normal distribution, and R is a variable whose value lies in the interval $[-a, a]$, where a depends on the number of iterations. The parameter E in Equation (3) is a control parameter defined in terms of $Fit_i$ and $Fit_b$, where $Fit_b$ represents the best value of the objective function and sech is the hyperbolic secant, $\operatorname{sech}(x) = \frac{2}{e^{x} + e^{-x}}$. Furthermore, $W_1$ and $W_2$ represent the hunger weights given in Equations (6) and (7).
In these weights, $r_3$, $r_4$, and $r_5$ are random numbers in the interval [0, 1], and SH is the sum of the hunger feelings of all solutions. The variable $H_i$ denotes the hunger of solution $X_i$; it is updated from the best objective value $Fit_b$, the objective value $Fit_i$ of the current solution $X_i$, and the new hunger $H_n$. In the definition of $H_n$, $Fit_w$ is the worst value of the objective function, and $r_6 \in [0, 1]$ is a random number that indicates whether hunger has a positive or negative effect, depending on various factors.
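For reference, the update rules described in words above can be written out as follows. This is a reconstruction based on the commonly stated form of HGS [40], not a copy of this paper's displayed equations. The threshold l, the hunger lower bound LH, the search bounds UB and LB, the auxiliary quantity TH, and the random number r are names taken from that formulation and are assumptions here.

```latex
% Position update (the three "games")
X_i(t+1) =
\begin{cases}
X_i(t)\,\bigl(1 + \mathrm{randn}(1)\bigr), & r_1 < l \\
W_1\, X_b + R\, W_2\, \lvert X_b - X_i(t) \rvert, & r_1 > l,\ r_2 > E \\
W_1\, X_b - R\, W_2\, \lvert X_b - X_i(t) \rvert, & r_1 > l,\ r_2 < E
\end{cases}

% Range variable R in [-a, a], shrinking with the iteration counter t
R = 2a \cdot \mathrm{rand} - a, \qquad a = 2\left(1 - \frac{t}{T}\right)

% Control parameter
E = \operatorname{sech}\bigl(\lvert Fit_i - Fit_b \rvert\bigr), \qquad
\operatorname{sech}(x) = \frac{2}{e^{x} + e^{-x}}

% Hunger weights
W_1 = \begin{cases} H_i \cdot \dfrac{N}{SH} \cdot r_4, & r_3 < l \\ 1, & r_3 > l \end{cases}
\qquad
W_2 = \bigl(1 - e^{-\lvert H_i - SH \rvert}\bigr) \cdot r_5 \cdot 2

% Total hunger and per-solution hunger
SH = \sum_{i=1}^{N} H_i, \qquad
H_i(t+1) = \begin{cases} 0, & Fit_i = Fit_b \\ H_i(t) + H_n, & \text{otherwise} \end{cases}

% New hunger
TH = \frac{Fit_i - Fit_b}{Fit_w - Fit_b} \cdot r_6 \cdot 2 \cdot (UB - LB), \qquad
H_n = \begin{cases} LH \cdot (1 + r), & TH < LH \\ TH, & TH \ge LH \end{cases}
```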

Algorithm 1 Steps of HGS [40]
1: Start with the number of iterations T and the number of solutions N.
2: Initialize the positions of the solutions X.
3: while t ≤ T do
4:    Compute the objective value for the solutions X i .
5:    Identify the best solution X b and the values Fit b and Fit w .
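The loop outlined in Algorithm 1 can be sketched in plain Python as below. This is a simplified, continuous-domain illustration under stated assumptions, not the authors' implementation: the threshold `l`, the hunger floor `lh`, the clipping to the bounds, and the default population and iteration sizes are all illustrative choices.

```python
import math
import random

def sech(x):
    # sech(x) = 2 / (e^x + e^-x); return 0 for very large |x| to avoid overflow
    return 0.0 if abs(x) > 700 else 2.0 / (math.exp(x) + math.exp(-x))

def hgs_minimize(objective, dim, lower, upper, n=20, iters=50,
                 l=0.08, lh=100.0, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n)]
    hunger = [0.0] * n
    best_x, best_fit = None, math.inf
    for t in range(iters):
        fits = [objective(x) for x in pop]
        worst_fit = max(fits)
        i_best = min(range(n), key=fits.__getitem__)
        if fits[i_best] < best_fit:
            best_fit, best_x = fits[i_best], pop[i_best][:]
        # Hunger update: the best solution is "full", the others get hungrier.
        for i in range(n):
            if fits[i] == best_fit:
                hunger[i] = 0.0
            else:
                th = ((fits[i] - best_fit) / (worst_fit - best_fit)
                      * rng.random() * 2 * (upper - lower))
                hunger[i] += th if th >= lh else lh * (1 + rng.random())
        sh = sum(hunger)
        a = 2 * (1 - t / iters)           # range of R shrinks over time
        for i in range(n):
            w1 = (hunger[i] * n / sh * rng.random()
                  if (rng.random() < l and sh > 0) else 1.0)
            w2 = (1 - math.exp(-abs(hunger[i] - sh))) * rng.random() * 2
            e = sech(abs(fits[i] - best_fit))
            for j in range(dim):
                r = 2 * a * rng.random() - a
                r1, r2 = rng.random(), rng.random()
                if r1 < l:
                    new = pop[i][j] * (1 + rng.gauss(0, 1))
                elif r2 > e:
                    new = w1 * best_x[j] + r * w2 * abs(best_x[j] - pop[i][j])
                else:
                    new = w1 * best_x[j] - r * w2 * abs(best_x[j] - pop[i][j])
                pop[i][j] = min(max(new, lower), upper)  # keep within bounds
    return best_x, best_fit

def sphere(x):
    return sum(v * v for v in x)

x50, f50 = hgs_minimize(sphere, 5, -10.0, 10.0, iters=50, seed=1)
x1, f1 = hgs_minimize(sphere, 5, -10.0, 10.0, iters=1, seed=1)
```

Because the best solution found so far is always kept, running more iterations from the same seed can never return a worse objective value than the initial population's best.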

Proposed Framework
When utilizing approaches for extracting features, such as DistilBERT, the obtained features are not given directly to the classification stage because they would require additional computing time. Feature Selection (FS) algorithms remove unnecessary or superfluous features from the features extracted from a crisis text, acting as a data reduction technique. This means the FS approach reduces the amount of data passed on. Hence, an effective feature selection mechanism was adopted, in which the most essential features are determined using an optimization method, i.e., Hunger Games Search (HGS).
In general, the extracted features are divided into training and testing sets, where the training set is used to learn which features are relevant. The steps of the binary HGS as an FS approach are presented in this section, and Figure 3 depicts the general steps of the developed FS approach. The first phase is to build a set of N agents X that represent candidate solutions of the FS problem. Each agent is initialized randomly according to Equation (12), where Dim is the dimension of the given problem (i.e., the number of features) and U and L are the limits of the random search domain. The next step is to find the binary version BX i of each X i . The fitness value of each X i is then calculated using an objective function that is based on the binary BX i and the classification error.
In this objective function, |BX i |/Dim represents the ratio of selected features, and γ i denotes the classification error on the training set using KNN with K = 5. KNN is commonly used because it is more stable than other classifiers and has fewer parameters, while λ is a parameter used to balance the ratio of selected features against the classification error. The best solution X b , with the smallest fitness value Fit b , is then determined. The next step is to update each solution X i using the HGS operators defined in Equations (3)-(11).
Following that, the stop conditions are checked, and if they are met, the best solution is returned. Otherwise, the update steps are repeated.
The last step is to reduce the testing set based on the best solution and then evaluate the performance of the output using different measures.
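The FS-specific steps above (random initialization, binarization, fitness evaluation, and reducing a feature matrix with the best solution) can be sketched as follows. The 0.5 threshold, the weighted-sum form of the fitness, and the value of λ are assumptions commonly used for this kind of wrapper FS, not values confirmed by the text; the classification error is passed in as a number rather than computed with KNN.

```python
import random

def init_population(n, dim, lower, upper, rng):
    """N agents drawn uniformly from [lower, upper] in each dimension."""
    return [[lower + rng.random() * (upper - lower) for _ in range(dim)]
            for _ in range(n)]

def binarize(x, threshold=0.5):
    """Binary version BX of an agent: 1 keeps a feature, 0 drops it."""
    return [1 if v > threshold else 0 for v in x]

def fitness(bx, error, lam=0.9):
    """Weighted sum of classification error and ratio of selected features."""
    ratio = sum(bx) / len(bx)
    return lam * error + (1.0 - lam) * ratio

def reduce_features(rows, bx):
    """Keep only the columns selected by the binary solution."""
    return [[v for v, keep in zip(row, bx) if keep] for row in rows]

rng = random.Random(0)
agents = init_population(n=4, dim=6, lower=0.0, upper=1.0, rng=rng)
bx = binarize([0.2, 0.7, 0.5, 0.9])
score = fitness(bx, error=0.1, lam=0.9)          # 0.9*0.1 + 0.1*0.5
reduced = reduce_features([[1, 2, 3, 4]], bx)
```

A lower fitness is better: it rewards both a small classification error and a small number of selected features.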

Experiments and Results
In this section, we present and analyze a variety of experiments designed to evaluate the performance of our proposed technique. Section 4.1 provides a full explanation of the datasets used in our research. The metrics used to evaluate the performance of our HGS algorithm and the other FS methods in the trials are explained in Section 4.2. Section 4.3 concludes by summarizing the results achieved and making some final observations.

Datasets
To implement crisis classification tasks and validate the proposed framework, three datasets of labeled crisis-related tweets are used, including C6 [41], C36 [11] (a combination of C6, C8 [42], and C26, as shown in Tables 1-2), and MAVEN, which presents an event schema. It is worth mentioning that this is the first time that the MAVEN dataset has been used and validated for crisis event classification at the sentence level.
The statistics of the used datasets are presented in Table 1. The C6 and C36 datasets are divided into 95% training and 5% test sets.

Performance Measures
To detect events, the extraction classifier pair is evaluated using confusion matrices. The confusion matrix is shown in Figure 4. True Positives (TP) and False Negatives (FN) represent the number of events of a specific class that were correctly classified and wrongly classified, respectively, in the analysis.
True Negatives (TN) are the number of events correctly identified as not belonging to a specific class. False Positives (FP) are the number of events wrongly categorized as belonging to a specific class.
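The four confusion-matrix counts translate into the reported measures through the standard definitions of precision, recall, F1-score, and accuracy. The sketch below computes them for one class in a one-vs-rest manner; the labels are toy data, and the averaging across classes used for the reported results is not shown.

```python
def per_class_counts(y_true, y_pred, label):
    """TP, FN, FP, TN for one class, treating all other classes as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    return tp, fn, fp, tn

def scores(tp, fn, fp, tn):
    """Precision, recall, F1-score, and accuracy from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return precision, recall, f1, accuracy

y_true = ["flood", "flood", "fire", "quake", "fire", "flood"]   # toy labels
y_pred = ["flood", "fire", "fire", "quake", "fire", "flood"]
counts = per_class_counts(y_true, y_pred, "flood")
precision, recall, f1, accuracy = scores(*counts)
```

For the class "flood" in this toy example, one flood tweet is missed, which lowers recall while precision stays perfect.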

Results and Discussion
This section presents the analysis and discussion of the experimental results of the proposed HGS-based FS approach.

Comparison with FS Methods
To objectively assess its effectiveness, we compared the HGS method with five well-known algorithms, namely, Particle Swarm Optimization (PSO) [43], Multi-Verse Optimizer (MVO) [44], Whale Optimization Algorithm (WOA) [45], Firefly Algorithm (FFA) [46], and Bat Algorithm (BAT) [47]. Table 3 lists the parameter settings of each algorithm. The performance of each algorithm was evaluated in terms of Recall, Precision, F1-score, and Accuracy. The four metrics of the KNN and SVM classifiers on C6, C36, and MAVEN are given in Tables 4-6, respectively. The best accuracy results are highlighted in bold. According to the results in these tables, the HGS using SVM as the classifier typically outperforms the other studied FS algorithms, including PSO, MVO, WOA, FFA, and BAT.
Analyzing Table 4, it is clear that the HGS method plays a critical role in feature selection when employing an SVM classifier; this is evident across all metrics. The best results occur when using the SVM classifier: HGS correctly classifies 98.06% of the test samples on the accuracy metric, more than PSO, MVO, WOA, FFA, and BAT. In detail, the HGS scored 98.06%, followed by the BAT and FFA in second place, with 98.03%. WOA and MVO were equal in accuracy, with 98%. Lastly, the PSO had the worst outcome (i.e., 97.96%). The HGS achieved 98.09% on the precision metric, the highest result with the SVM classifier, followed by BAT and FFA, which achieved 98.06% and 98.05%, respectively. WOA and MVO had the same precision, 98.02%. PSO again scored the lowest performance, with 97.99%. In addition, the recall metric for the SVM classifier was 98.06%, 98.03%, 98.03%, 98%, 98%, and 97.96% for HGS, BAT, FFA, WOA, MVO, and PSO, respectively. Finally, for the F1-score metric, our HGS algorithm obtained the best performance, with 98.06%. Next, the BAT and the FFA algorithms had 98.03%. They were followed by the WOA and the MVO, which achieved 98.00%. Lastly, the PSO has the lowest performance. On the other hand, combining the six optimization algorithms with KNN gave the worst performance on all metrics.
In Table 5, according to the Recall, Precision, F1-score, and Accuracy metrics, the proposed HGS outperformed the other optimization algorithms on the C36 dataset. The results combine the six optimization algorithms with traditional machine learning classifiers (i.e., SVM and KNN). From the results, we noticed that SVM gives the best results with several optimization algorithms. For the SVM classifier, the HGS and FFA algorithms achieved an accuracy of 97.10%, which is the best performance. The WOA and MVO were in second place, with 96.94%, followed by BAT, with 96.90%. The worst result, 96.81%, was achieved by PSO. For the precision metric, 97.14% was the best result, achieved by our proposed HGS algorithm. Next, the FFA has 97.13%, followed by the WOA, with 97.01%. These three algorithms are followed by the MVO, BAT, and PSO, which have 96.97%, 96.92%, and 96.87%, respectively. The recall values were 97.10%, 97.10%, 96.94%, 96.94%, 96.90%, and 96.81% for the HGS, FFA, WOA, MVO, BAT, and PSO algorithms, respectively. On F1-score, the proposed HGS also outperformed the other algorithms, with 97.10%, followed by FFA, which achieved 97.09%. Next, the WOA, MVO, and BAT scored 96.94%, 96.93%, and 96.89%, respectively. Finally, the PSO achieved 96.81%, the worst result.
The results of HGS and the other optimizers for the MAVEN dataset are presented in Table 6. The table combines the SVM and KNN classifiers with the six optimizers (i.e., MVO, PSO, WOA, HGS, FFA, and BAT). The SVM achieved better results than the KNN. Thus, the analysis of the six optimization algorithms with SVM is presented in this section. On the accuracy metric, combining the HGS algorithm with SVM outperformed the other algorithms, achieving 84.13%, followed by the BAT and the MVO, which have the same result (83.53%). Furthermore, the FFA has 83.41%. The worst-performing algorithms were PSO and WOA, which achieved 83.29% and 83.17%, respectively. On the precision metric, the HGS also achieved the highest result, with 81.25%. The second-best result, 80.73%, belonged to the PSO and MVO. The other algorithms (i.e., BAT, FFA, and WOA) had the lowest performance, with 80.66%, 80.37%, and 80.12%, respectively. On recall, the best result belonged to the proposed HGS algorithm, followed by the BAT and the MVO, which share a score (83.53%). They are followed by the FFA and the PSO, with 83.41% and 83.29%, respectively. Finally, 83.17% was the worst value, which belonged to the WOA algorithm.
From another point of view, Figure 5 shows the average accuracy of each FS method on the three datasets C6, C36, and MAVEN. Discussing each dataset separately, for the C6 dataset, the HGS outperformed the other FS methods, as shown in Figure 5a, where its overall average value over the two classifiers (i.e., KNN and SVM) is nearly 98.03%, followed by the WOA method in second position, with 98.00%. The FFA provides results that are better than those of BAT and MVO, at 97.99%. Finally, the PSO has the lowest performance. The average accuracy on the C36 dataset over both classifiers (i.e., SVM and KNN) is displayed in Figure 5b. In the figure, we note that the HGS, FFA, and MVO optimizers outperformed the other algorithms, with 96.95%, 96.98%, and 96.94% accuracy, respectively. These results are close to each other. They are followed by the WOA, with 96.89%. Next, the BAT algorithm achieved 96.87%. Finally, the PSO has the lowest average accuracy, 96.86%. Figure 5c presents the average accuracy of the C6 dataset. The HGS optimization algorithm shows clear superiority over the other algorithms. This algorithm produced 98.03%, followed by the WOA and the FFA, with the same result, 98.00%. Next, the BAT and the MVO have the same value, 97.98%. Lastly, the PSO algorithm achieved the lowest performance, 97.96%.
The Friedman (FD) test is a nonparametric two-way analysis of variance by ranks, in which the statistical value is calculated and ranked. The FD test [48] is used to check whether there is a significant difference among the algorithms across multiple datasets. The best approach has the lowest (highest) mean rank if the smallest (largest) value is regarded as the best. Furthermore, Figure 6 displays the HGS algorithm's mean rank in terms of Recall, Precision, F1-measure, and Accuracy when compared to the five optimization algorithms on the three datasets.
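To illustrate how Friedman mean ranks are obtained, the sketch below ranks the six algorithms on each dataset (rank 1 for the lowest accuracy, with ties sharing the average rank) and averages the ranks over the datasets. It assumes that the ranking is done on the SVM accuracies quoted above for Tables 4-6; this is an assumption about how the ranks in Figure 6 were produced, not a statement from the text.

```python
def average_ranks(scores):
    """Rank values ascending (1 = smallest); ties share the mean of their ranks."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

algorithms = ["HGS", "BAT", "FFA", "WOA", "MVO", "PSO"]
svm_accuracy = {                           # accuracies (%) quoted in the text
    "C6":    [98.06, 98.03, 98.03, 98.00, 98.00, 97.96],
    "C36":   [97.10, 96.90, 97.10, 96.94, 96.94, 96.81],
    "MAVEN": [84.13, 83.53, 83.41, 83.17, 83.53, 83.29],
}
per_dataset = [average_ranks(v) for v in svm_accuracy.values()]
mean_rank = {a: sum(r[k] for r in per_dataset) / len(per_dataset)
             for k, a in enumerate(algorithms)}
```

Under this assumption, a higher mean rank indicates a better algorithm, and HGS obtains the highest mean rank on accuracy.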
When analysing the behavior of the HGS on the four measures, it can be seen that the HGS algorithm outperforms the others. For accuracy, the HGS has the best mean rank, 5.83, and the FFA has a mean rank of 4.33. The BAT and the MVO have almost the same mean rank. The WOA averages 2.33. Lastly, the PSO ranks below the others, with a mean rank of 1.33. According to the FD test findings for the F1-score, the HGS is again better than the others, with a mean rank of 6. The BAT and the FFA have the same mean rank, 3.83, followed by the MVO, with a mean rank of 2.83. The WOA and the PSO have the lowest mean ranks. For the precision metric, the HGS has the best mean rank, 6, and the FFA has a mean rank of 3.67. The BAT and the MVO have almost the same mean rank (i.e., 3.33). Lastly, the PSO ranks below the others, with a mean rank of 2.17. Finally, on the recall measure, the PSO, MVO, WOA, FFA, and BAT optimization algorithms obtain mean ranks of nearly 1.33, 3.5, 2.33, 4.33, and 3.67, respectively, all below that of the HGS.
Compared with the time a user would usually need, the completion time of the whole process is short. As shown in Figure 7, the average execution times of the proposed HGS algorithm for the C6, C36, and MAVEN datasets were 1.3733 s, 7.1789 s, and 1.6756 s, respectively. These times are lower than those of the other compared algorithms. In more detail, the execution time for the C6 dataset was 449.42 ms; the times were 2.3698 s, 1.7429 s, 1.5632 s, 1.8180 s, 1.3812 s, and 1.3733 s for PSO, MVO, WOA, FFA, BAT, and HGS, respectively. From these results, we note that the HGS has the lowest time. For the C36 dataset, the proposed algorithm also required less time than the other methods (7.1789 s), followed by the WOA algorithm, with 7.3930 s, and then by the BAT and the FFA, which executed in 8.3395 s and 11.5831 s, respectively. The other algorithms, MVO and PSO, require the most time (i.e., the worst time). Finally, the MAVEN dataset was evaluated in the same way as the other datasets. Here too, the average execution time of our proposed HGS is the best. The FFA, MVO, and WOA follow our algorithm at a similar level (i.e., they achieved 2.2951 s, 2.3741 s, and 2.4659 s, respectively).
In terms of the reduction rate, Figure 8 shows the average number of selected features. The HGS selects the smallest number of features on all three event datasets. This selection nevertheless has a beneficial impact on classification performance, as evidenced by its accuracy in Figure 5. Furthermore, because it selects only about 51, 41, and 44 features for the C6, C36, and MAVEN datasets, respectively, the HGS is able to choose the smallest portion of the feature set without harming event detection across all tested corpora.
However, analyzing the behavior of the FS methods on each dataset shows that, on the C6 dataset, the HGS and the WOA select the same number of features (i.e., 51). The MVO, BAT, and FFA follow, choosing 58, 60, and 61 features, respectively. Finally, the PSO selects the largest number of features, 74. For the C36 dataset, the PSO again selects the highest number of features, which makes it the worst algorithm in this respect. It selects 88 features, followed by the BAT and the MVO, with 58 and 57 features, respectively, while the WOA selects 52 and the FFA 49. In contrast, the HGS is the best algorithm, because it selects fewer features than the others (i.e., 41 features). In addition, it has higher accuracy, as mentioned above. On the MAVEN dataset, our proposed HGS optimization selects 44 features. The other algorithms, MVO, FFA, and WOA, select approximately the same number of features (i.e., nearly 65). The worst algorithms in terms of the number of features are the BAT and the PSO, with 75 for the BAT selector and 87 for the PSO selector.
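The feature counts above can be tabulated to show how many more features each optimizer keeps than the HGS and what reduction rate each count implies. The sketch below is illustrative only. The total of 768 features is an assumption (the hidden size of the base DistilBERT model) and is not stated in this section, and MAVEN is omitted because the counts for MVO, FFA, and WOA are only given approximately.

```python
# Assumption: 768 is the hidden size of the base DistilBERT model; the dimension
# actually used for the extracted features is not stated in this section.
TOTAL_FEATURES = 768

# Numbers of selected features quoted above for the C6 and C36 datasets.
selected = {
    "C6": {"PSO": 74, "MVO": 58, "WOA": 51, "FFA": 61, "BAT": 60, "HGS": 51},
    "C36": {"PSO": 88, "MVO": 57, "WOA": 52, "FFA": 49, "BAT": 58, "HGS": 41},
}


def reduction_rate(n_selected, n_total=TOTAL_FEATURES):
    """Fraction of the original features that were discarded."""
    return 1.0 - n_selected / n_total


for dataset, counts in selected.items():
    # List optimizers from the fewest to the most selected features.
    for name, n in sorted(counts.items(), key=lambda item: item[1]):
        extra = n - counts["HGS"]
        print(f"{dataset} {name}: {n} features ({extra:+d} vs HGS), "
              f"reduction rate {reduction_rate(n):.1%}")
```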
Figure 9 presents the average results of the SVM and KNN classifiers over the three selected datasets and the six optimization algorithms introduced earlier. In the figure, we note that the SVM outperformed the KNN on the different metrics. In detail, regarding accuracy, the SVM achieved 92.83%, while the KNN achieved 92.53%. The SVM achieved 91.90% on the precision metric, which is higher than the KNN (i.e., 91.73%). The results on recall were 92.53% and 92.83% for the KNN and the SVM, respectively. On the F1-score, the KNN achieved 91.99%, while the SVM obtained 92.17%.
The distribution of the SVM and KNN results over the six optimizers (i.e., PSO, MVO, WOA, FFA, BAT, and HGS) is displayed in Figure 10. From the figure, we note that the HGS achieved the best result on the SVM but the lowest on the KNN: 93.10% for the SVM and 92.49% for the KNN. On the SVM, the FFA followed the HGS, with 92.85%, and then the MVO and the BAT, which both obtained 92.82%. The WOA and the PSO have the worst results (i.e., 92.70% and 92.69%, respectively). On the KNN classifier, in contrast, the PSO has the best result, with 92.56%, followed by the WOA, FFA, MVO, and BAT, which achieved 92.55%, 92.54%, 92.53%, and 92.52%, respectively. In terms of the average accuracy over both classifiers, our proposed HGS algorithm outperformed the other algorithms, with 92.80%. The FFA and the MVO followed, with 92.69% and 92.68%, respectively. The BAT achieved 92.67%. Finally, the worst results belonged to the WOA (92.63%) and the PSO (92.62%).
In summary, for the C36 dataset, the HGS optimization algorithm combined with the SVM classifier had the best classification results of any combination. For the accuracy metric, this combination achieved 97.10 percent. Furthermore, it achieved a 97.10 percent F1-score, 97.14 percent precision, and 97.10 percent recall. This combination also produced the highest performance values on the C6 data: the accuracy was 98.06 percent, the F1-score 98.09 percent, and the recall 98.06 percent. Finally, the results for the MAVEN dataset were lower, with 84.13 percent, 82.14 percent, 84.13 percent, and 81.25 percent for the accuracy, F1-score, recall, and precision metrics, respectively. In addition, our HGS optimization algorithm outperformed the other optimization methods in that it selected the fewest features on every dataset while achieving the highest performance, as shown above.
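The per-optimizer averages over both classifiers reported for Figure 10 follow directly from the SVM and KNN accuracies quoted with them. The short check below recomputes each average as the unweighted mean of the two values, which is an assumption about how the averages were formed, and compares it with the reported figure to within rounding.

```python
# Accuracies (in percent) quoted in the text for Figure 10.
svm = {"PSO": 92.69, "MVO": 92.82, "WOA": 92.70, "FFA": 92.85, "BAT": 92.82, "HGS": 93.10}
knn = {"PSO": 92.56, "MVO": 92.53, "WOA": 92.55, "FFA": 92.54, "BAT": 92.52, "HGS": 92.49}
reported = {"PSO": 92.62, "MVO": 92.68, "WOA": 92.63, "FFA": 92.69, "BAT": 92.67, "HGS": 92.80}

# Assumption: the reported average is the plain mean of the two classifiers.
averages = {name: (svm[name] + knn[name]) / 2 for name in svm}

for name in sorted(averages, key=averages.get, reverse=True):
    gap = abs(averages[name] - reported[name])
    print(f"{name}: computed {averages[name]:.3f}, reported {reported[name]:.2f}, gap {gap:.3f}")
```

Every computed mean lies within about 0.005 of the reported value, that is, within two-decimal rounding, and the HGS keeps the highest average even though it has the lowest KNN accuracy.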

Comparison with Previous Studies
This section compares our approach with other state-of-the-art crisis event detection techniques. Table 7 shows the results of several important methods. Developing high-accuracy technology for event detection is a major undertaking, so it is important to compare our strategy with other models that have been tested on the same datasets. Table 7 evaluates the performance of several crisis identification techniques on the C6 and C36 datasets.
For both selected datasets, the authors of [49] used three methods: (1) a Logistic Regression (LR) classifier combined with pre-trained Word2Vec (w2v) embeddings, denoted LR_w2v; (2) an SVM classifier with pre-trained Word2Vec embeddings, denoted SVM_w2v; and (3) pre-trained Word2Vec embeddings combined with a Naive Bayes (NB) model that assumes a Gaussian distribution for the attributes, denoted NB_w2v. Moreover, in [50], the CNN approach was merged with the Global Vectors (GloVe, gv) embedding; the CNN has two convolutional layers of 250 filters, 128 hidden units, a pool size of 2, and a kernel size of 3. Kumar et al. [51] combined Word2Vec embeddings with a Long Short-Term Memory (LSTM) model with 30 hidden states in two layers. In [11], the authors proposed Crisis2Vec (c2v), a context-specific encoding technique for crisis representation that significantly outperformed traditional methods. They applied two methods: (1) LR_c2v, a linear LR classifier combined with the Crisis2Vec embedding, and (2) LSTM_c2v, a non-linear LSTM model combined with Crisis2Vec. In contrast, our approach employs a transformer-based method, DistilBERT, for feature extraction. Moreover, the HGS is used as a feature selection approach to exclude unnecessary features from the feature sets and thereby improve performance.
For the C6 dataset, our proposed HGS_db approach achieves 98.06 percent for the F1-score and 98.06 percent for the Accuracy, which also outperforms earlier models, notably the CNN with pre-trained GloVe embeddings, by 7.56 percent for the F1-score and 7.66 percent for the Accuracy. In terms of embeddings, LSTM with Crisis2Vec achieves 97.5 percent for both F1-score and Accuracy, a 10.2 percent improvement over LSTM with Word2Vec. Correspondingly, LR with Crisis2Vec achieves a 93.6 percent F1-score and a 93.7 percent Accuracy, improvements of 6.3 percent and 5.2 percent over LR with Word2Vec, respectively.
For the C36 dataset, our proposed HGS_db approach obtains a 97.10 percent F1-score and 97.10 percent Accuracy, outperforming previous prediction models such as LR with pre-trained Word2Vec encoding by significant margins of 25.0 percent and 14.8 percent, respectively. In terms of word embeddings, LSTM with Crisis2Vec achieves an 88.0 percent F1-score and 95.6 percent Accuracy, which outperforms LSTM with Word2Vec by 29.1 percent and 23.3 percent, respectively. Similarly, LR with Crisis2Vec achieves an 85.1 percent F1-score and 90.9 percent Accuracy, outperforming LR with Word2Vec by 13.0 percent and 8.6 percent, respectively.
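The margins quoted for the C36 dataset appear to be absolute differences in percentage points rather than relative improvements. Under that reading, which is an interpretation and not something the text states, the score of LR with Word2Vec can be recovered in two independent ways, and the two derivations should agree. The following sketch performs that check using only the figures quoted above.

```python
def implied_baseline(score, margin):
    """Baseline implied by 'score outperforms the baseline by margin' (percentage points)."""
    return score - margin


# C36 figures quoted above, as (F1-score, accuracy) pairs in percent.
hgs_score, hgs_margin = (97.10, 97.10), (25.0, 14.8)      # HGS vs LR with Word2Vec
lr_c2v_score, lr_c2v_margin = (85.1, 90.9), (13.0, 8.6)   # LR Crisis2Vec vs LR Word2Vec

from_hgs = [implied_baseline(s, m) for s, m in zip(hgs_score, hgs_margin)]
from_lr_c2v = [implied_baseline(s, m) for s, m in zip(lr_c2v_score, lr_c2v_margin)]

for metric, a, b in zip(("F1-score", "accuracy"), from_hgs, from_lr_c2v):
    print(f"LR with Word2Vec, {metric}: {a:.1f} (from HGS) vs {b:.1f} (from LR Crisis2Vec)")
```

Both derivations give an F1-score of 72.1 percent and an accuracy of 82.3 percent for LR with Word2Vec, which supports the percentage-point reading.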
In short, our approach removes superfluous features from the high-dimensional event representations obtained by DistilBERT. However, the fundamental drawback of this framework is its complexity, in terms of both time and memory. Future work should therefore reduce this complexity and improve the efficiency of the proposed method; other augmentation procedures can also be investigated to that end.
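How a binary feature mask acts on an embedding matrix, and how a candidate mask might be scored in a wrapper setting, can be sketched as follows. This is a hypothetical illustration and not the authors' implementation: the HGS search itself is omitted, a leave-one-out 1-nearest-neighbour classifier stands in for the KNN and SVM classifiers, the weighting `alpha` between error and feature ratio is an assumed convention common in wrapper feature selection, and the toy embeddings are invented for the example.

```python
def select_columns(rows, mask):
    """Keep only the features whose mask bit is 1."""
    return [[v for v, bit in zip(row, mask) if bit] for row in rows]


def loo_1nn_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier (stand-in classifier)."""
    correct = 0
    for i, row in enumerate(rows):
        best_j, best_d = None, float("inf")
        for j, other in enumerate(rows):
            if i == j:
                continue
            d = sum((a - b) ** 2 for a, b in zip(row, other))  # squared Euclidean distance
            if d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(rows)


def fitness(mask, rows, labels, alpha=0.99):
    """Lower is better: weighted sum of error rate and fraction of features kept.

    alpha is an assumed weighting, not a value taken from the paper.
    """
    if not any(mask):
        return 1.0  # an empty feature subset is treated as the worst case
    error = 1.0 - loo_1nn_accuracy(select_columns(rows, mask), labels)
    return alpha * error + (1 - alpha) * sum(mask) / len(mask)


# Toy "embeddings" (hypothetical): feature 0 separates the classes, the rest are noise.
X = [
    [0.0, 5.0, 1.0, 9.0], [0.1, 1.0, 8.0, 2.0], [0.2, 9.0, 3.0, 7.0],
    [1.0, 5.1, 1.1, 9.1], [1.1, 1.1, 8.1, 2.1], [1.2, 9.1, 3.1, 7.1],
]
y = [0, 0, 0, 1, 1, 1]

for mask in ([1, 1, 1, 1], [1, 0, 0, 0]):
    print(mask, f"fitness = {fitness(mask, X, y):.4f}")
```

On this toy data the single-feature mask scores better than the full feature set, which mirrors the claim that discarding irrelevant dimensions can reduce dimensionality without hurting classification.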

Conclusions
This paper presents a hybrid framework that combines the pre-trained DistilBERT model with the proposed feature selection algorithm to identify crisis events. Specifically, a pre-trained DistilBERT model with a defined architecture for feature extraction was fine-tuned on several real-world datasets to generate a sentence embedding for each data sample. The feature selection phase then uses a new meta-heuristic technique, the Hunger Games Search (HGS), in its binary form to select the most relevant features from the extracted tweet embeddings, in order to maximize the performance of the classification model and reduce the dimensionality of the feature representation space. Experiments and comparisons show that the proposed framework is superior in terms of event identification accuracy and feature reduction to other state-of-the-art feature selection techniques and event identification methods. DistilBERT is a relatively small model compared to existing state-of-the-art models. Thus, exploring other transformer-based models may reveal different and valuable feature sets and improve the overall framework performance. In addition, DistilBERT covers a single language, so multilingual language models are worth investigating.
As future work, the developed framework can be extended to cover different NLP tasks such as sentiment analysis, offensive language detection, and question answering. In addition, grouping tasks related to the main task in a multi-task learning framework may help boost performance.