A Novel Fuzzy-Logic-Based Multi-Criteria Metric for Performance Evaluation of Spam Email Detection Algorithms

Abstract: The increasing volume of unsolicited bulk emails has become a major threat to global security. While a significant amount of research has been carried out in terms of proposing new and better algorithms for email spam detection, relatively less attention has been given to evaluation metrics. Some widely used metrics include accuracy, recall, precision, and F-score. This paper proposes a new evaluation metric based on the concepts of fuzzy logic. The proposed metric, termed µ_O, combines accuracy, recall, and precision into a multi-criteria fuzzy function. Several possible evaluation rules are proposed. As proof of concept, a preliminary empirical analysis of the proposed scheme is carried out using two models from the domain of deep learning, namely BERT (Bidirectional Encoder Representations from Transformers) and LSTM (Long Short-Term Memory), utilizing three benchmark datasets. Results indicate that for the Enron and PU datasets, LSTM produces better values of µ_O, in the range of 0.88 to 0.96, whereas BERT generates better values of µ_O, in the range of 0.94 to 0.96, for the Lingspam dataset. Furthermore, extrinsic evaluation confirms the effectiveness of the proposed fuzzy logic metric.


Introduction
Emails have become a fundamental part of today's technology-augmented modern lifestyle. They have enabled people to collaborate with each other by providing one of the cheapest and fastest means of communication [1]. Since their emergence in the public domain in the mid-1990s, the positive impact of email usage can be clearly seen in the growth of several areas, including business, education, healthcare, and industry, among others.
Despite the positive effects of emails, the most prominent weakness of the technology is the continuous increase in the number of unsolicited email messages that end up in a recipient's inbox. Therefore, it is a task of utmost importance to segregate legitimate emails, called hams, from undesired emails, known as spam. According to a recent study [2], spam messages account for 56.87% of email traffic worldwide, and the most common types of spam emails are healthcare and dating spam. Similarly, another recent study [3] indicates that 333.2 billion emails are expected to be generated worldwide on a daily basis in the year 2022. This number is estimated to reach 376.4 billion in 2025, with 4.6 billion users [3]. Consequently, the increase in spam emails is proportional to the increase in overall email traffic. The continuous increase in spam emails at a global scale has resulted in high costs in terms of time, storage requirements, network bandwidth usage, and the indirect costs of protecting privacy and security [4]. In addition, the nuisance caused by spam emails to human mental health and mood may play its own role, as indicated by studies [5,6].
The problem of spam email detection has several dimensions. The first dimension is concerned with the algorithms used for spam detection. The prime objective of a spam detection algorithm is to segregate legitimate and spam emails as correctly as possible. For over two decades, researchers have proposed various solutions and methods, such as the use of Blacklists [7], Real-Time Blackhole Lists [8], and Content-Based Filters [9], to identify and filter spam from hams. Active research is still being conducted to develop methods that are faster and more accurate. In particular, artificial intelligence (AI)-based methods have received notable attention from researchers in recent years. Special emphasis has been placed on methods based on machine learning [1,10-25]. In addition, deep learning methods have recently been applied successfully to spam email detection [10-14,26-28]. The outcomes of these studies indicate that machine learning and deep learning algorithms provide an effective platform for efficiently solving the spam detection problem; yet there is still room for further improvement and development.
The second dimension of the spam detection problem is the datasets used to assess the performance of the spam detection algorithm. The dataset contained in a corpus plays a critical role in evaluating the performance of any spam detector [2]. A number of datasets have been reported in the literature, such as Spambase [29], Lingspam [29], PU1 [30], PU2 [31], PU3 [31], PUA [31], Enron [32,33], Genspam [34], Trec 2005 [35], Trec 2006 [36], and Trec 2007 [37]. A comprehensive list of email spam datasets available in the public domain is reported by Dada et al. [2]. The datasets have distinctive features, and their suitability and effectiveness are determined by the type of experiments performed to evaluate the performance of the spam detection algorithm. In addition, several researchers use datasets that are not well known or that were developed by the researchers themselves [1,12,17,19-22,24,28].
The third dimension is the performance metrics. In order to evaluate the performance of a spam detection algorithm on a dataset, an appropriate performance measure is necessary. One of the common metrics used for evaluation is accuracy, which measures the proportion of the total number of messages correctly classified (whether ham or spam) [15]. However, using accuracy as the only performance measure is not sufficient [2]. Therefore, other performance metrics, such as recall and precision, are also used [15]. The recall measures the number of messages correctly classified penalised by the number of missed entries, whereas the precision measures the number of messages correctly classified penalised by the number of incorrect classifications [15]. In addition to the aforementioned three metrics, which are frequently used in studies, other metrics such as the F-score [15], geometric mean (G-mean) [15], false positive rate (FPR) [2], false negative rate (FNR) [2], receiver operating characteristics (ROC) [15], and loss [26] are also employed. The use of different performance metrics is vital in the spam detection process due to the costs associated with misclassification. This misclassification is twofold, since it involves identifying a spam as a ham or vice versa. Although identifying a spam as a ham is a matter of concern, it is usually not a critical issue, because the only action the user has to take is to delete such a message. However, in the case of misclassifying a ham message as spam, the gravity of the situation can lead to chaos, since valuable information may be lost to the classification error of the underlying detection algorithm. Therefore, it is of utmost importance that a spam detection algorithm is tested and evaluated with multiple metrics. Continuing with the performance metrics, the aforementioned discussion, as well as the review of literature in Section 2, points towards classifying the metrics into different levels.
The first-level metrics are those used at the atomic level: true positive, true negative, false positive, and false negative. Then there are second-level metrics, such as accuracy, precision, and recall, which are composed of certain combinations of the first-level metrics. Finally, there are third-level metrics, such as the F-score and G-mean, which are based on second-level metrics. Note that both the F-score and G-mean combine two second-level metrics (F1 combines precision and recall, whereas G-mean combines specificity and recall). While there has been a great deal of effort in terms of developing new algorithms and datasets for the spam email detection problem, little emphasis has been placed on the development of new performance metrics, especially third-level metrics. More specifically, while the first- and second-level metrics are frequently used in studies, as highlighted by the review of literature in Section 2, the impact of existing third-level metrics, and the development of new ones, has received little attention from the research community.
The main motivation behind the underlying study is to propose a metric that combines three second-level metrics (accuracy, recall, and precision) and therefore aggregates the effect of all three in one measure, rather than looking at the three individually. This aggregated approach makes it convenient for the decision maker to reach a conclusive decision. For example, consider a situation where we have three algorithms, A, B, and C. Assume some hypothetical results, as shown in Table 1 below.

Table 1. An example of different values of accuracy, precision, and recall for three hypothetical algorithms.

Algorithm Accuracy Precision Recall
In the above scenario, A performs best in terms of accuracy, B performs best in terms of precision, and C performs best in terms of recall, but which algorithm is the best in all respects? The answer is difficult and requires human intervention and expertise to reach a conclusion. In contrast, our proposed fuzzy measure aggregates the impact of all three measures and therefore gives a single numerical value, which can be easily and clearly evaluated without any confusion or human intervention. Furthermore, the commonly used F1 score or G-mean would fail here, because they aggregate the impact of only two second-level measures (as highlighted above). In this case, our proposed fuzzy metric is a suitable solution when it comes to measuring the combined effect of the three second-level metrics.
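To make this conflict concrete, the following sketch uses hypothetical metric values (illustrative only; Table 1's actual numbers are not reproduced here) and shows that each single metric, taken alone, crowns a different winner:

```python
# Hypothetical per-metric scores for three algorithms (illustrative values only).
scores = {
    "A": {"accuracy": 0.95, "precision": 0.88, "recall": 0.90},
    "B": {"accuracy": 0.91, "precision": 0.97, "recall": 0.89},
    "C": {"accuracy": 0.90, "precision": 0.86, "recall": 0.98},
}

# Each metric, considered individually, selects a different "best" algorithm.
for metric in ("accuracy", "precision", "recall"):
    winner = max(scores, key=lambda alg: scores[alg][metric])
    print(f"best {metric}: {winner}")
# prints: best accuracy: A / best precision: B / best recall: C
```

This is exactly the ambiguity the proposed aggregated fuzzy measure is designed to resolve with a single number.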
Motivated by the aforementioned observations, the major and minor contributions of this study can be enumerated as follows:

1. As a major contribution, a fuzzy-logic-based performance measure is proposed. This measure combines accuracy, precision, and recall into a single fuzzy function using the Unified And-Or (UAO) fuzzy operator. Several fuzzy decision rules are developed that can be used in the performance evaluation process. To the best of our knowledge, no such attempt has been reported in the literature. It is worth mentioning that while the proposed measure is evaluated in the context of spam email detection, the measure is generic and can be used in any problem domain concerned with performance evaluation.

2. As proof of concept, two deep learning models are used for comparative performance evaluation. These models are Bidirectional Encoder Representations from Transformers (BERT) and Long Short-Term Memory (LSTM). The use of BERT and LSTM is motivated by the fact that these models have seen limited use for spam detection (as highlighted in the literature review in Section 2). This is a minor contribution, which supports the major contribution listed above.

3. As another minor contribution, a preliminary performance evaluation is carried out using several commonly used benchmark datasets.

The rest of the paper is organized as follows. Section 2 presents the literature review. Section 3 provides a brief description of the two deep learning models, namely BERT and LSTM, as used in this study. Then, in Section 4, the fuzzy-logic-based performance measure is discussed, with details on the Unified And-Or (UAO) aggregation operator (which serves as the basis of the new performance metric), the fuzzy membership functions, and the decision rules. Results and discussions are provided in Section 5. Finally, conclusions and future research directions are given in Section 6.

Literature Review
Table 2 provides a summary of the studies, showing the approaches, datasets, and performance metrics used in different studies. AbdulNabi and Yaseen [10] utilize BERT for spam detection and compare its performance with a deep neural network (bidirectional LSTM), KNN, and NB. Spambase and Spamfilter (from Kaggle) are used, with accuracy and F-score as the performance measures. Islam et al. [11] compare several algorithms, which include regression, XGBoost, SVM, random forest, word embedding, and LSTM. Five datasets are used: Trec Spam 2007, Enron, PU, Lingspam, and Basket (a combination of the other four datasets). Accuracy, loss, precision, and recall are used as the performance metrics. Srinivasan et al. [15] develop the DeepSpamNet algorithm by combining LSTM and CNN. The proposed algorithm is compared with logistic regression (LR), Gaussian NB, KNN, decision tree, AdaBoost, random forest, and SVM. Four datasets, namely Lingspam, PU, SpamAssassin, and Enron, are used. The study employs accuracy, precision, recall, F-score, and G-mean as the performance measures. Kumar et al. [16] also evaluate the performance of SVR, KNN, NB, decision tree, random forest, AdaBoost, and bagging. However, only a single dataset is used, which is Spamcsv from Kaggle. Moreover, accuracy and precision are used as the performance metrics.
Anitha et al. [17] utilize XGBoost for spam email detection. The XGBoost algorithm is compared with several algorithms, including SVM, rotation forest, MLP, decision tree (J48), and NB. Performance is evaluated using a single dataset called Spam_ham_dataset from Kaggle. Five performance measures are employed: accuracy, precision, recall, specificity, and F-score. Siddique et al. [12] employ NB, CNN, SVM, and LSTM for spam email detection in Urdu. A self-collected dataset named "Urduemaildataset" is used for evaluation. Four metrics, namely accuracy, precision, recall, and F-score, are used for performance evaluation. Sethi et al. [18] use a single dataset (SpamAssassin) to compare the performance of NN, logistic regression, SVM, and NB. Performance is measured with respect to accuracy, precision, recall, and F-score.
Bagui et al. [19] apply NB, SVM, decision tree, CNN, and LSTM to the spam detection problem. However, they use only accuracy as the performance measure, with a single dataset consisting of self-collected emails. Nayak et al. [20] employ NB and a decision tree (J48) on a single dataset from Kaggle. The performance metrics are accuracy, precision, recall, and F-score. Sheneamer [13] utilizes several algorithms, including CNN, LSTM, random forest, SVM, NB, and decision tree, using only the Enron dataset. Three metrics, namely accuracy, recall, and precision, are used for performance evaluation. Euna et al. [21] also evaluate the performance of several algorithms, which include SVM, decision tree, logistic regression, and multinomial NB. A single dataset from Kaggle is used, with accuracy, precision, recall, and F-score as the performance measures.
Zamir et al. [28] compare AdaBoost, random forest, decision tree (J48), SVM, bagging, and deep learning. Performance is evaluated with accuracy, precision, recall, and F1-score, using a single dataset called CSDM2010_SPAM. Kaddoura et al. [27] evaluate BERT using the Enron dataset, with a single evaluation metric, namely the F-score. Feng et al. [1] propose an algorithm combining SVM and NB, called SVM-NB. The proposed algorithm is compared with the SVM and NB algorithms using the DATAMALL dataset; recall and precision are used for performance evaluation. Chakraborty and Mondal [22] employ NB, J48 decision tree, and logistic tree classifier algorithms for spam detection. They measure the performance of the three algorithms using accuracy and precision, utilizing a dataset of self-collected emails.
Mallampati and Hegde [14] implement NB, a J48 decision tree, and a deep neural network. The performance of the three algorithms is evaluated using the SpamAssassin dataset, with accuracy, precision, and recall as the performance metrics. Rusland et al. [23] use NB with two datasets, namely Spambase and Spamdata. Performance is evaluated using accuracy, recall, precision, and F-score. Bibi et al. [24] employ NB and SVM while using a dataset from GitHub, with accuracy, recall, precision, and F-score. Iqbal et al. [26] implement BERT and evaluate the algorithm on the Enron dataset, with accuracy as the performance measure.
Novo-Lourés et al. [40] apply AdaBoost, Flexible Bayes, NB, random forest, and SVM to email spam detection using the SpamAssassin dataset. The performance is measured using accuracy, precision, recall, F-score, and other metrics. Occhipinti et al. [41] employ twelve machine learning algorithms, including KNN, XGBoost, and several variants of regression, SVM, and NB. The performance is evaluated on the Enron dataset using recall, precision, F-score, and a couple of other metrics. Guo et al. [42] use several algorithms, including SVM, MLP, and CNN, with Twitter and Sina Weibo datasets. The performance metrics are accuracy, recall, precision, and F-score. Venkateswarlu et al. [43] propose a variant of a generative adversarial network (GAN) for Twitter spam detection using a single dataset (i.e., Twitter Spam). They use precision, recall, and F-score as the evaluation metrics.

Table 2. Summary of studies on email spam detection. A: accuracy; R: recall; P: precision; L: loss; F: F-score/F1 score; G: G-mean.

Brief Background of BERT and LSTM
It is important to mention that the focus of this study is not on algorithm performance, but rather on assessing the impact of a fuzzy-logic-based metric when used in conjunction with an email spam detection algorithm. Therefore, as proof of concept, two models from the domain of deep learning have been used: BERT and LSTM. A brief description of these models is given below.

Bidirectional Encoder Representations from Transformers (BERT)
BERT [44] is a text representation model built from bidirectional Transformer encoders. Developed in 2018, the model has worked effectively for natural language processing tasks such as text classification, text summarization, and text generation. BERT is capable of pretraining deep bidirectional representations from unlabeled text by jointly conditioning on both the left and right context in all layers [44]. As a result, the pretrained BERT model can be fine-tuned with just one additional output layer, creating state-of-the-art models for a variety of tasks such as question answering and language inference, with only small modifications to a task-specific architecture. Few studies have used BERT for spam email detection [10,26,27]. A detailed discussion of BERT and its architecture can be found in Devlin et al. [44].
In this study, an efficient Natural Language Processing (NLP) model is built using the BERT tokenizer along with TensorFlow 2.0 (TF 2.0). Studies show that BERT is effective for various NLP tasks such as text classification, text generation, and text summarization. Building on its text classification capability, the proposed model (BERT + TF 2.0) is used to classify emails as ham or spam in the underlying study.

Long Short-Term Memory (LSTM)
LSTMs fall within the domain of deep learning [45] and belong to a particular class of recurrent neural networks (RNNs). LSTMs utilize their inherent memory capabilities, making use of past information to make efficient predictions. As a class of RNN, LSTMs have demonstrated great potential in capturing both long-term and short-term temporal dependencies within sequential data. A primary advantage of the LSTM network is that it mitigates the vanishing and exploding gradient problems inherent in conventional RNNs. An LSTM network accepts sequential time series data as input. The use of LSTM for email spam detection has been reported in several studies [10-12]. A detailed discussion of LSTM can be found in Hochreiter and Schmidhuber [45].
In the current study, LSTM, being widely used for sequence modeling, serves as a second model against which the BERT results can be compared.
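To illustrate the gating mechanism described above, a single LSTM time step can be sketched in plain NumPy. This is an illustrative sketch, not the study's actual architecture; the dimensions, weight names, and random initialization are arbitrary choices for demonstration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b each hold four parameter sets, keyed by
    gate: forget (f), input (i), candidate (g), and output (o)."""
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])  # forget gate
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])  # input gate
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])  # candidate cell state
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])  # output gate
    # The additive cell update is what mitigates vanishing gradients.
    c = f * c_prev + i * g
    h = o * np.tanh(c)  # new hidden state, bounded in (-1, 1)
    return h, c

# Tiny random example: 4-dimensional input, 3-dimensional hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W = {k: rng.standard_normal((d_h, d_in)) for k in "figo"}
U = {k: rng.standard_normal((d_h, d_h)) for k in "figo"}
b = {k: np.zeros(d_h) for k in "figo"}
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```

In practice, a framework such as TensorFlow provides this cell as a ready-made layer; the sketch only makes the gate arithmetic explicit.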

Fuzzy-Logic-Based Multi-Criteria Evaluation Metric for Spam Detection Approaches
The foundations of fuzzy logic (FL) were laid by Lotfi Zadeh [46]. In traditional set theory, a set is defined as a collection of elements x ∈ X, where an element either belongs to the set or does not; in other words, the membership of an element in a set is either TRUE or FALSE. However, in many practical scenarios, this rigid, binary notion of belonging is not effective. There can be situations where an element partially belongs to a set, making this belonging partially true (and partially false). This observation led to the development of fuzzy rules and the application of FL in the domain of multi-criteria decision-making (MCDM).
The domain of MCDM deals with decision problems where multiple factors (i.e., criteria) are used in the decision process [47,48]. More specifically, the main purpose of MCDM is to identify the best solution among a number of alternatives for a given problem. Over the years, several methods have been proposed to deal with the MCDM problems. As such, fuzzy logic has also been applied effectively to solve MCDM problems. The main ingredients of the FL based MCDM approach are the decision criteria, membership functions for the decision criteria, decision rules, and mathematical representation of the decision rule. These concepts are discussed in the next section in the context of the spam detection problem.

Decision Criteria for the Spam Detection Metric
The first issue in the development of the fuzzy-logic-based metric is the identification of decision criteria. Decision criteria play a pivotal role in the decision process. The domain expert, who has complete knowledge of the problem under study, identifies the most appropriate decision criteria; selecting inappropriate criteria leads to inaccurate solutions. In the context of the current study, the literature review has identified several commonly used metrics, such as accuracy, recall, precision, and F-score. Note that accuracy, recall, and precision are based on the four primary-level attributes given below, whereas the F-score is a derived (third-level) metric based on recall and precision, and is therefore not considered as a decision criterion in this study. The four primary-level attributes are as follows [15]:
• True positive (TP): the number of spam messages correctly classified as spam.
• True negative (TN): the number of ham messages correctly classified as ham.
• False positive (FP): the number of ham messages incorrectly classified as spam.
• False negative (FN): the number of spam messages incorrectly classified as ham.
Based on the above four attributes, accuracy, which measures the proportion of the total number of correct classifications [15], is mathematically represented as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)

Similarly, recall measures the number of correct classifications penalised by the number of missed entries [15], and is represented by the following equation:

Recall = TP / (TP + FN) (2)

Likewise, precision is a measure of the number of correct classifications penalised by the number of incorrect classifications, as given by the following equation:

Precision = TP / (TP + FP) (3)

It should be noted that while the F-score reflects the impact of precision and recall, no such measure exists that combines the effect of accuracy, precision, and recall in a single function. Therefore, there is a need to develop such a measure.
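The three second-level metrics follow directly from the four primary-level attributes; the short Python sketch below mirrors the accuracy, recall, and precision definitions above, with hypothetical confusion-matrix counts as an example:

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of all messages classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    """Correct spam classifications penalised by missed spam (false negatives)."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Correct spam classifications penalised by false alarms (false positives)."""
    return tp / (tp + fp)

# Hypothetical confusion-matrix counts for a classifier tested on 200 emails.
tp, tn, fp, fn = 80, 90, 10, 20
print(accuracy(tp, tn, fp, fn))  # 0.85
print(recall(tp, fn))            # 0.8
print(precision(tp, fp))         # ~0.889
```

Note that the three values can diverge substantially, which is precisely why a combined measure is useful.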

Membership Functions for the Decision Criteria
The next step in the process is the employment of fuzzy logic for combining the three decision criteria of accuracy, precision, and recall into a single, compound metric. This compound metric evaluates the quality of a solution in terms of the membership values of accuracy, precision, and recall. A membership function maps the actual value of each criterion onto a [0, 1] scale. The mapped value is referred to as the membership value and is commonly represented by the symbol µ.
To formulate the fuzzy-logic-based compound metric, the values of the individual criteria first need to be mapped through membership functions defined for each criterion. These membership functions are Min-Max functions, where the x-axis defines the upper and lower limits for a given criterion, while the y-axis represents the corresponding membership value between 0 and 1. Accordingly, the membership functions for accuracy, precision, and recall are represented by µ_A(x), µ_P(x), and µ_R(x). Since all three criteria vary between 0% and 100%, the upper and lower limits on the x-axis are defined as such, giving the linear mappings in Equations (4)-(6):

µ_A(x) = x/100, 0 ≤ x ≤ 100 (4)
µ_P(x) = x/100, 0 ≤ x ≤ 100 (5)
µ_R(x) = x/100, 0 ≤ x ≤ 100 (6)

Fuzzy Decision Rules for Spam Detection
Khan and Engelbrecht [49] proposed an approach of using fuzzy decision rules for MCDM problems. The approach can be extended to the problem addressed in this paper. The approach suggests that several fuzzy decision rules can be developed using the metrics of accuracy, recall, and precision. In each rule, the impact of the three decision metrics is reflected in different ways. The result of each rule classifies the performance of a given spam detection technique. The rule-based approach gives flexibility to the decision-maker to define a rule depending upon how many decision metrics are desired in the decision process. Accordingly, the different possible cases are discussed below. Note that the rules are scalable and more decision metrics can be added to the decision process. However, as proof of concept, only the aforementioned three decision metrics are considered.
Case 1: Performance evaluation based on all three decision metrics. This is one extreme, in which all three decision metrics are considered in the performance evaluation. The corresponding fuzzy rule is as follows:
• Rule R1: IF Accuracy is high AND Precision is high AND Recall is high THEN the Performance is Excellent.
Case 2: Performance evaluation based on any two decision metrics. In this case, any two of the three metrics are considered in the decision process. This leads to several fuzzy rules, as follows:
• Rule R2a: IF Accuracy is high AND (Precision is high OR Recall is high) THEN the Performance is Excellent.
• Rule R2b: IF (Accuracy is high OR Precision is high) AND Recall is high THEN the Performance is Excellent.
• Rule R2c: IF (Accuracy is high OR Recall is high) AND Precision is high THEN the Performance is Excellent.
Case 3: Performance evaluation based on any one decision metric. At the other extreme, any one metric is used in the decision process. The fuzzy rule is given below:
• Rule R3: IF Accuracy is high OR Precision is high OR Recall is high THEN the Performance is Excellent.

Mathematical Representation of Decision Rules
In order to evaluate the impact of a fuzzy decision rule, transformation of the rule into a mathematical form is essential for a tangible outcome. This is achieved through a fuzzy aggregation function. Over the years, a number of fuzzy aggregation functions have been proposed in the literature. Some well-known functions are Werner's function, the Dubois and Prade functions, Yager's ordered weighted average (OWA) functions, Hamacher's function, Einstein's function, and the Unified And-Or (UAO) function, among many others [50]. Among these operators, the UAO function [49] has some unique features (see Khan and Engelbrecht [49] for details and mathematical properties of the function), and has consequently been employed effectively in many diverse studies [51-57]. Therefore, the UAO function is also used in the underlying study. The prime uniqueness of the UAO function is that it can behave as either the "AND" function or the "OR" function. This "ANDing" or "ORing" is controlled by a variable ν ≥ 0, whose value decides the behavior of the function: with 0 < ν < 1, UAO behaves as the AND function, while ν > 1 gives the OR behavior. For the spam detection problem, the three membership functions for accuracy, precision, and recall are embedded in the UAO function, yielding Equation (7), where µ_O represents the overall membership value of the decision, I* signifies the AND function, and I represents the OR function. Using Equation (7), the fuzzy decision rules given in Section 4.3 are transformed into their corresponding mathematical representations. In these representations, the value of µ_O is proportional to the individual membership values (i.e., µ_A, µ_P, and µ_R); therefore, high input membership values result in a high value of µ_O, and vice versa.
Consequently, a high value of µ_O indicates that the performance of the algorithm is on the high side, while a low µ_O suggests degraded performance. For example, µ_O = 0.95 indicates that the performance of a given algorithm is near the ideal performance (since µ_O = 1 defines the ideal condition as per the fundamentals of fuzzy logic). Likewise, µ_O = 0.85 indicates that the performance is good enough, but there is room for improvement.
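The decision rules can be prototyped with the standard fuzzy min (AND) and max (OR) operators. Note that this is a simplified stand-in for the UAO operator, whose exact mathematical form and ν-controlled behavior are given in Khan and Engelbrecht [49]; the rule structure, however, is exactly that of Section 4.3:

```python
def fuzzy_and(*mus):
    """Standard fuzzy AND (minimum t-norm); stand-in for UAO with 0 < nu < 1."""
    return min(mus)

def fuzzy_or(*mus):
    """Standard fuzzy OR (maximum t-conorm); stand-in for UAO with nu > 1."""
    return max(mus)

def rule_R1(mu_a, mu_p, mu_r):
    # IF Accuracy is high AND Precision is high AND Recall is high
    return fuzzy_and(mu_a, mu_p, mu_r)

def rule_R2a(mu_a, mu_p, mu_r):
    # IF Accuracy is high AND (Precision is high OR Recall is high)
    return fuzzy_and(mu_a, fuzzy_or(mu_p, mu_r))

def rule_R3(mu_a, mu_p, mu_r):
    # IF Accuracy is high OR Precision is high OR Recall is high
    return fuzzy_or(mu_a, mu_p, mu_r)

# Hypothetical membership values for one algorithm.
mu_a, mu_p, mu_r = 0.95, 0.88, 0.90
print(rule_R1(mu_a, mu_p, mu_r))   # 0.88
print(rule_R2a(mu_a, mu_p, mu_r))  # 0.9
print(rule_R3(mu_a, mu_p, mu_r))   # 0.95
```

As the rules move from Case 1 to Case 3, the aggregation becomes progressively more permissive, which is visible in the increasing µ_O values of the example.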

Results and Discussion
To demonstrate the utility of the proposed fuzzy evaluation metric, empirical experiments were carried out. The following subsections show the datasets used for the study, as well as the reasons for selecting these datasets. Then, the details of the experiments and results are provided.

Characteristics of Datasets
Several email datasets have been reported in the literature. Two recent review articles, by Dada et al. [2] in 2019 and Bhowmick et al. [58] in 2018, list a number of datasets, which are shown in Table 3. As seen from the table, these datasets were developed between 1998 and 2007; since 2007, no new email dataset has been reported in the literature. Furthermore, another article [59] from 2020 identifies that among the known datasets (as given in Table 3), the SpamAssassin, Enron, and PU datasets are publicly available, while the other datasets are generally not available or easily accessible in the public domain.
As far as the current study is concerned, three datasets, namely Enron [32], Lingspam [38], and PU [39], were used. The reasons for selecting these datasets are as follows:
• Many recent studies utilized the Enron, Lingspam, and PU datasets. As shown in Table 2, the use of the Enron dataset has been reported in several studies between 2020 and 2022 [11,13,15,26,27,41]. Similarly, the use of Lingspam and PU has been reported in two recent studies published in 2021 [11,15]. This indicates that the datasets are still in active use in the research community.
• The emails in these datasets were generated between 2000 and 2010, which characterizes the change in wording and writing patterns in emails over a period of ten years [15]. Secondly, the Enron dataset is employed due to its bias towards the spam class [15]. Thirdly, the Lingspam dataset is used due to its domain-specific ham mails, which are extracted from scholarly linguistic discussions [15]. Lastly, the PU dataset is used since it is less frequently used by researchers, which encouraged us to study its behavior with respect to the two models and the proposed fuzzy-logic-based performance metric.

Empirical Results
Two sets of experiments were carried out. In the first set, the performance of the BERT and LSTM models was assessed using only Rule R1 as proof of concept. In the second set, the relative performance of the two models was evaluated. The data in each of the three email datasets were divided in a 70/30 ratio, with 70% of the data used for training and 30% for testing. Furthermore, binary cross-entropy [15] was used as the loss function. Note that the purpose of this study is to show the applicability of the proposed fuzzy measure; therefore, only the necessary details are provided, without an in-depth comparative analysis of the algorithms.
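The experimental setup described above (a shuffled 70/30 split and the binary cross-entropy loss) can be sketched as follows. This is a minimal illustration with hypothetical helpers, not the authors' actual training pipeline:

```python
import math
import random

def train_test_split(data, test_ratio=0.30, seed=42):
    """Shuffle a dataset reproducibly and split it 70/30, as in the experiments."""
    items = list(data)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1.0 - test_ratio))
    return items[:cut], items[cut:]

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over predicted spam probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# A toy corpus of 100 indexed emails splits into 70 training and 30 test items.
train, test = train_test_split(range(100))
print(len(train), len(test))  # 70 30
```

In practice, the clipping step matters: a model that outputs an exact 0 or 1 probability would otherwise produce an infinite loss.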

Evaluation of BERT and LSTM with Respect to the Three Datasets Using Rule R1
To show the applicability of the proposed scheme, Rule R1 was assessed as an example, with ν = 0.5. Tables 4 and 5 provide the results of the two models with respect to each dataset using Rule R1. Table 4 shows that the BERT model gives the best results for the Lingspam dataset, where the best value of µ_O was 0.96 (for 10 and 20 epochs). For the Enron and PU datasets, the best µ_O values were 0.89 (for 30 epochs) and 0.82 (for 10 epochs), respectively. The interpretation is that for the Lingspam dataset, the performance of BERT (with regard to Rule R1) is near ideal for 10 and 20 epochs, since µ_O = 0.96 is very close to the ideal membership of µ_O = 1. In the same sense, for the Enron and PU datasets, the performance of BERT is somewhat distant from the ideal value of µ_O = 1, since values of µ_O = 0.89 and µ_O = 0.82 were observed.
With regard to the LSTM model, the results in Table 5 differ from those of BERT. They indicate that the best results in terms of µ O were obtained for the Enron dataset, with a best µ O of 0.96. For Lingspam, µ O = 0.94 was obtained with 30 epochs, while for PU, the best µ O was 0.92 for 20 and 30 epochs. Again, these results can be interpreted as explained in the paragraph above.
A mutual comparison of the two models was also performed with respect to the proposed fuzzy metric. Figure 1 shows the comparison of the two models on each dataset, plotting the value of µ O for 10, 20, and 30 epochs. In Figure 1a, the plots indicate that for the Enron dataset, LSTM performs better than BERT for all epochs. With regard to the results for the Lingspam dataset given in Figure 1b, BERT shows better performance than LSTM for 10 and 20 epochs, and equal performance for 30 epochs. Finally, for the PU dataset, the trends in Figure 1c are similar to those observed for the Enron dataset: BERT shows inferior performance for all epochs.

Extrinsic Evaluation
A focus group was formed for extrinsic evaluation based on human judgements. As depicted in Table 6, the focus group comprised 34 male and 7 female undergraduate university students with a computer science major. As an evaluation requirement, the focus group was provided with the same environment, data, situation, and time period for analysis and decision-making. As each member of the focus group has a different mental approach towards the decision, this may result in a gray area/bias. Moreover, each member of the focus group was briefed about the objectives of the research and the steps involved in the basic evaluation process.
In the BERT model scenarios, decisions were based on the values of accuracy, precision, and recall, as shown in Table 7. A condition/restriction was also incorporated, whereby each member had to opt for only one choice, and only once. Considering the Enron dataset, the focus group decision-making results show that all 34 (i.e., 100%) males opted for the results obtained with 30 epochs, with all 7 females (i.e., 100%) also opting for the same option. For the Lingspam dataset, 59% of the males opted for 20 epochs, while the remaining 41% chose the results obtained with 30 epochs. Furthermore, out of seven females, three opted for the results with 20 epochs, while four preferred the results obtained with 30 epochs. Finally, for the PU dataset, out of 34 males, 10 opted for the results obtained with 10 epochs, 17 opted for the results obtained with 20 epochs, while 7 preferred the results obtained with 30 epochs. The results for females showed somewhat similar trends: two out of seven females chose the results of 10 epochs, one opted for 20 epochs, while four opted for the results with 30 epochs.
As far as the results for LSTM are concerned, Table 8 illustrates the scenarios where the decisions were based on the values of accuracy, precision, and recall derived from the LSTM model. The same condition/restriction as for the BERT model was applied. For the Enron dataset, the focus group decision-making results showed that 21 males and 6 females opted for the results obtained with 20 epochs, while 13 males and 1 female chose the results obtained with 30 epochs. Furthermore, no male or female participant chose the results with 10 epochs. With regard to the Lingspam dataset, the results indicate that 15 males and 3 females selected the results of 20 epochs, while 19 males and 4 females opted for the results with 30 epochs. Similar to the observations for the Enron dataset, no male or female subject opted for the results with 10 epochs. Finally, for the PU dataset, 10 males and 2 females selected the results of 20 epochs, while 24 males and 5 females selected the results with 30 epochs. Again, no male or female opted for the results with 10 epochs. While the results with respect to the Enron and Lingspam datasets are non-conflicting for both the BERT and LSTM models, a divided opinion is observed with regard to the PU dataset for both models. That is, with accuracy, precision, and recall as the performance measures, it was not simple for the human subjects to select the best option, and an automated decision-making approach that provides a concrete decision therefore became necessary. What we conclude from the above results for the PU dataset is that such human decisions are difficult to evaluate, as in some cases respondents preferred 20 epochs over 30 (or vice versa) simply because of different membership values for the three attributes. Moreover, in other cases, the males' and females' decisions contradict each other.
For instance, for the PU dataset with the BERT model, most males preferred the results with 20 epochs, while most females preferred the results obtained with 30 epochs. This shows the need for a concluding metric with a single value, which would result in simplified and standardized decisions. This automated and standardized decision-making is provided by the proposed fuzzy-logic-based metric.
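The advantage of a single-valued metric argued above can be illustrated in code: once each configuration is reduced to one score, selecting the best epoch count becomes a plain argmax, whereas three separate criteria can pull in different directions. The scores below are hypothetical placeholders, not values from the paper's tables.

```python
# Hypothetical single-valued fuzzy scores per epoch setting (illustrative only).
scores = {10: 0.82, 20: 0.89, 30: 0.87}

# With one score per configuration, the decision is a simple argmax;
# no human judgement is needed to trade off conflicting criteria.
best_epochs = max(scores, key=scores.get)
```

Here the decision is unambiguous: the configuration with 20 epochs wins because its single score dominates, removing the gray area the focus group faced when weighing accuracy, precision, and recall separately.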

Conclusions
Spam emails are a major threat to the network and data resources of individuals as well as organizations. Significant research has been carried out in the past to address spam email detection. Several elements are involved in this process. Among them, an effective performance metric that incorporates domain knowledge and reflects the performance of a spam detection algorithm is vital to the efficiency of the process. In this connection, a new performance metric has been proposed in this study. The new metric is based on fuzzy logic and combines the impact of accuracy, precision, and recall in a single, multi-criteria measure. For this purpose, decision rules have been developed, along with their corresponding mathematical representations in fuzzy logic. A preliminary analysis to assess the applicability of the proposed scheme has been carried out using one such fuzzy rule. The analysis consists of a comparative evaluation of two deep learning models, namely BERT and LSTM, using three benchmark datasets: Enron, Lingspam, and PU. Results indicate that for the Enron and PU datasets, LSTM demonstrated better performance. For the Lingspam dataset, the dominant trend indicates that BERT is the better option.
The proposed work has strong potential to evolve in several directions. For example, in addition to the three metrics used in the current study, other metrics (such as loss) can be added to the fuzzy model, and the decision rules can be modified accordingly. Furthermore, the impact of other mathematical representations (such as those mentioned in Section 4.4) on algorithm performance can be studied. Additionally, the performance of other recent approaches, such as XML and GAN, can be evaluated using more benchmark datasets. To add granularity to the fuzzy-rule-based approach, preference rules [49] can be added on top of the fuzzy decision rules. Moreover, the performance of the proposed fuzzy metric can be evaluated for diverse problems requiring performance evaluation in various other domains.