A Heterogeneous Ensemble Learning Framework for Spam Detection in Social Networks with Imbalanced Data

The popularity of social networks provides people with many conveniences, but their rapid growth has also attracted many attackers. In recent years, the malicious behavior of social network spammers has seriously threatened the information security of ordinary users. To reduce this threat, many researchers have mined the behavior characteristics of spammers and have obtained good results by applying machine learning algorithms to identify spammers in social networks. However, most of these studies overlook the class imbalance that exists in real-world data. In this paper, we propose a heterogeneous stacking-based ensemble learning framework to ameliorate the impact of class imbalance on spam detection in social networks. The proposed framework consists of two main components, a base module and a combining module. In the base module, we adopt six different base classifiers and utilize this classifier diversity to construct new ensemble input members. In the combining module, we introduce cost-sensitive learning into deep neural network training. By setting different costs for misclassification and dynamically adjusting the weights of the prediction results of the base classifiers, we can integrate the input members and aggregate the classification results. The experimental results show that our framework effectively improves the spam detection rate on imbalanced datasets.

For example, the click-through rate of spam pages on Twitter is 0.13%, whereas the click-through rate of e-mail spam ranges from only 0.0003% to 0.0006% [2]. Therefore, spam detection in social network platforms is important and valuable to many aspects of network environment security, including user privacy protection and public opinion analysis.
To maintain social network security by detecting spam, early researchers used blacklists and crowdsourced information to detect and filter abnormal accounts [2,3]. However, it has been shown that more than 90% of users click a malicious link before it is blocked by blacklisting [4]. In addition, these methods are time-consuming because they require people to take part in actively identifying suspicious information. To provide better detection methods, many scholars have proposed graph analysis-based methods [5][6][7], which extract features from social graph structures using node similarity based on following and follower relationships. However, attackers can forge the connection relationships of spammers by using artificial intelligence technology to imitate the social relationships of normal users, making it difficult to detect such malicious accounts effectively [1]. Current research focuses on machine learning methods, which train models on content features, behavior characteristics, and other related information [8][9][10]. These methods are based on data mining and analysis of large numbers of data samples. Thus, the data processing quality directly affects the detection performance.
However, most previous research on spam detection in social networks has focused on feature extraction, improving classification performance by combining various features or extracting more features of social network accounts to train classifiers, while overlooking the class imbalance problem in real-world data [11]. Imbalanced learning is studied in a wide range of research fields, from management science to engineering [12,13]. As shown in Figure 1, class imbalance means that the number of samples in different categories varies greatly: the majority class (non-spam) has many more samples than the minority class (spam). When a class imbalance problem occurs in the training data, the algorithm typically provides classification results biased toward the majority class because of its higher prior probability.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 2 of 18
In binary classification problems, we often encounter serious imbalances in the proportions of positive and negative samples; such imbalances can reach 50:1. If a classifier is trained to make predictions directly on such imbalanced data, the recall rate of the minority class is extremely low. This is because traditional classifiers aim to maximize the overall classification accuracy by treating all samples equally, which results in a higher classification accuracy for the majority class and a lower classification accuracy for the minority class. For example, in a case where the class ratio is 50:1 positive to negative samples, classifier accuracy can reach 98% even if all the negative samples are misclassified as positive samples; however, the true identification rate for the negative samples is zero. In addition, as long as the majority class can be correctly identified, even if the minority class is largely misclassified, the accuracy metric still obtains a high score, which misleads assessments.
As a result, instances belonging to a minority class are more likely to be misclassified than those belonging to a majority class. This undesirable effect makes it very difficult to predict different classes accurately.
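The 50:1 arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `majority_only_scores` and the sample counts (5000 and 100) are assumptions chosen to give a 50:1 ratio, not values from the paper.

```python
def majority_only_scores(n_majority, n_minority):
    """Score a classifier that labels every sample as the majority class."""
    total = n_majority + n_minority
    accuracy = n_majority / total        # only the majority samples are correct
    majority_recall = n_majority / n_majority
    minority_recall = 0 / n_minority     # no minority sample is ever identified
    return accuracy, majority_recall, minority_recall


# Hypothetical counts giving a 50:1 class ratio.
acc, maj_rec, min_rec = majority_only_scores(n_majority=5000, n_minority=100)
print(f"accuracy={acc:.4f}, majority recall={maj_rec:.2f}, minority recall={min_rec:.2f}")
```

The accuracy is 5000/5100, or about 98%, while the minority-class recall is zero, which is why accuracy alone is a misleading metric on imbalanced data.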
One study found that the proportion of spam on Twitter is approximately 3.75% [14], and approximately 8.7% of the accounts on Facebook are fake accounts created by attackers [1]. In another study, Grier et al. [2] found that approximately 5% of tweets are spam. On the Microblogging platform, approximately 10% of users are spammers [15]. Liu et al. [11] reported that when the class imbalance ratio (IR) in the Twitter dataset increases from two to 20, the spam detection rate drops by 33%, and the error rate for non-spam drops by 5% because traditional classifiers are biased toward the non-spam class. When faced with class imbalance problems, it is difficult to obtain satisfactory classification performances using traditional classification methods [16,17].
In this paper, we propose a two-level heterogeneous stacking-based ensemble learning framework to address the problem of class imbalance in spam detection in social networks.
First, the framework utilizes various machine learning algorithms as base classifiers to automatically extract effective features from the original data. Their prediction results are combined into metadata with new features, forming the input data to the next learning stage. Then, a deep neural network (DNN) is used as a metaclassifier to capture the deep information hidden in the output of the base classifiers. In addition, we set the misclassification costs based on cost-sensitive methods to improve the classification performance. Finally, we compare the proposed method with existing methods. The experimental results show that our method effectively improves the classification performance on data with imbalanced classes.
The remainder of this paper is organized as follows: In Section 2, we provide an overview of the related works; in Section 3, we present the process of the proposed approach in detail; in Section 4, we report an experiment using a real-world dataset to demonstrate the validity and robustness of our method; and in Section 5, we conclude the paper and suggest future work directions.

Related Works
This section reviews the related works from the following two aspects: Spam detection approaches and the class imbalance problem.

Spam Detection Approaches
At present, spam detection is one of the most important challenges for online social network security. Various types of spam detection methods exist, including crowdsourcing technology, graph-based techniques, and machine learning techniques. Among these, machine learning currently plays an important role in spam detection in social networks.
Supervised machine learning algorithms are the most common methods used for spam detection. Almaatouq et al. [8] trained six different classifiers using content, behavior, and network structure features and then compared their classification performances. Zheng et al. [18] first labeled samples as spam or non-spam and then proposed a classification algorithm based on support vector machines to detect spammers in Weibo. Recently, many scholars have combined deep learning methods to mine contextual features at the tweet level. Kudugunta and Ferrara [19] designed a deep neural network method based on the long short-term memory (LSTM) architecture, which uses context features extracted from user metadata to detect spambots at the tweet level. However, because social networks have very large numbers of users, labeling the massive amounts of posted data is complex and error-prone, which limits the use of supervised methods in practical applications.
In contrast to supervised machine learning, unsupervised machine learning does not rely on labeled data; instead, it uses unlabeled data to build a learning model. Lee and Kim [20] used an agglomerative hierarchical clustering method to cluster Twitter users; the method does not need to wait for occurrences of malicious behavior and can detect malicious accounts when they are created. In view of the emerging group spam behavior, Cresci et al. [21] proposed a method similar to a clustering algorithm. They generated a corresponding "digital DNA" signature by encoding user behavior information as strings and used those signatures to determine the spam similarity between account subgroup sequences. Chavoshi et al. [22] built an unsupervised tool named DeBot, which compares the account time series extracted from the Twitter streaming API to find spambots that send tweets synchronously. Unsupervised machine learning methods do not need a large set of labeled data, but their accuracy is usually lower than that of supervised machine learning methods.
Semi-supervised learning is a learning method that combines supervised learning with unsupervised learning. Li et al. [23] proposed a semi-supervised feature selection method based on the Laplacian score to detect spammers on Twitter. Gong et al. [24] applied semi-supervised learning to Sybil detection. This method classifies nodes together with information from directed messages and known node labels. Chen et al. [25] fused comprehensive clues explored from multiple views to identify spammers and predicted unlabeled instances iteratively based on a small number of labeled instances in a semi-supervised manner.
Previous studies have shown single classifiers are rarely superior to ensemble learning methods on any problem. Ensemble learning achieves better classification performances by training multiple classifiers and, consequently, it usually performs better than single classifiers (also known as base classifiers) [26]. Ensemble learning includes both homogeneous and heterogeneous ensemble learning algorithms. Homogeneous ensemble learning relies on multiple classifier instances of a single type, while heterogeneous ensemble learning uses a variety of different base classifiers to achieve better performance. For example, Tang et al. [17] proposed an ensemble method using three CS-SVMs with different parameters as base classifiers, combined with resampling technology, and achieved good performances for microblog spam detection. Madisetty et al. [27] developed an ensemble method involving five CNNs and a feature-based model; the metaclassifier used a multilayer neural network and achieved a good performance on Twitter. Thus, we chose heterogeneous ensemble learning as the basic framework of spam detection.

Class Imbalance Problem
Whether in academia or industry, imbalanced learning has attracted increasing attention. In the real world, this class imbalance problem exists in many application fields, such as anomaly detection, credit card fraud detection, and fault diagnosis, etc.
Many researchers have made efforts to solve the class imbalance problem and have achieved various results. These studies are classified into two main categories. One category operates at the data level. These methods create balanced datasets by reducing the majority class (undersampling) or increasing the minority class (oversampling). The best-known resampling method is SMOTE [28]. By analyzing the characteristics and distribution of the minority class, SMOTE generates new samples and adds them to the dataset. Although SMOTE increases the number of minority class samples and improves the classification performance, it takes extra time to generate new samples and the procedure can generate noise [11]. The undersampling method generates a balanced dataset by removing samples from the majority class. However, undersampling causes information loss, and therefore some studies use a hybrid sampling method. Liu et al. [29] proposed a novel method named fuzzy logic-based oversampling (FOS) to balance the class distribution through an information decomposition algorithm based on fuzzy logic [30]; they then combined this method with random undersampling and random oversampling and utilized ensemble learning to conduct spam detection on Twitter [11].
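To make the oversampling idea concrete, the following is a simplified sketch of SMOTE-style interpolation: a synthetic sample is placed at a random position on the segment between a minority sample and one of its nearest minority neighbors. It is not the reference SMOTE implementation; the function name `smote_like`, the neighbor count `k=3`, and the toy points are assumptions made for illustration.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating toward nearby minority samples."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the chosen base sample.
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: sq_dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical two-feature minority (spam) samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point lies between two existing minority samples, the new samples stay inside the region already occupied by the minority class; this is also why interpolation can add noise when minority and majority regions overlap.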
Another way to deal with the class imbalance problem is from the algorithm perspective. Algorithm-level solutions do not change the data distribution, and therefore they are suitable for multiple types of imbalanced datasets [31]. The typical algorithm-level method is cost-sensitive learning. Cost-sensitive learning optimizes an algorithm by considering the cost differences among distinct misclassification situations and assigning costs to the corresponding types, allowing the algorithm to achieve better performance on class-imbalanced data [31]. Cost-sensitive learning is popular for addressing unknown varying costs in class imbalance problems at the algorithm level. MetaCost [32] is a reweighting algorithm proposed by P. Domingos, in which the basic idea is to use Bayes risk theory to reweight instances in the training dataset based on the optimal cost classification. The AdaCost algorithm [33] is an improvement to the AdaBoost classification algorithm; it obtains cost-sensitive classification by reweighting. WSNN [34] is a class imbalance method that uses cost as a weight assigned to the minority classes to improve the final classification accuracy. Wang et al. [35] embedded the cost information into a modified cross-entropy loss function during prediction to solve the imbalance and skewness challenge in hospital readmission prediction. Zhang et al. [36] proposed an evolutionary cost-sensitive deep belief network (ECS-DBN) for imbalanced classification, which optimizes the misclassification cost based on the training data by using adaptive differential evolution. Liu et al. [37] decomposed the F-measure optimization into a series of cost-sensitive classification problems and investigated cost-sensitive feature selection by generating and assigning different costs to each class.

Problem Description and Methodology
In this section, we first describe the problem of class imbalance in spam detection in social networks. Then, we present a heterogeneous stacking-based ensemble learning framework to solve the problem.

Formulation of The Problem
We first introduce the problem of classification with class imbalance in spam detection on social networks and, then, extend it to cost-sensitive learning.
Assume that we are given a dataset $S = \{(x_n, y_n)\}_{n=1}^{N}$ with $N$ data samples, where $x_n$ represents the n-th sample instance belonging to the input space $X$, and $y_n$ indicates the label of $x_n$ and belongs to the label set $Y = \{1, \ldots, K\}$. The goal of classification is to train a classifier $f : X \to Y$ to minimize the expected error of the classifier on the training set.
However, class imbalance is a common problem in many classification applications. Because a conventional classifier uses the same cost to classify all the considered classes, it is highly susceptible to skewed class distributions. In spam detection on social networks, Liu et al. [11] found that the true positive rate (spam detection rate) of the positive class decreased significantly (by 33% on average) when the class imbalance ratio (IR) rose from 2 to 20. In particular, when the class imbalance ratio is 20, the average detection rate dropped to 34%, which means that about 66% of spam goes undetected.
To solve the class imbalance problem, we use cost-sensitive learning in our ensemble learning framework. Cost-sensitive learning extends conventional classification techniques to the classification of imbalanced data by assigning different costs to each class, which punishes each class of errors differently based on the assigned costs. The training goal after applying the cost-sensitive learning is to find a classifier f : X → Y that minimizes the expected risk.

The Proposed Ensemble Learning Framework
In this subsection, we describe the proposed heterogeneous stacking-based ensemble learning framework for spam detection in social networks. The existing empirical results have shown that ensemble learning tends to perform better when there are significant differences among the ensemble models, and the stacked model composed of several learning stages is the most popular ensemble learning approach. Thus, to solve the class imbalance problem in spam detection, we propose a novel framework that has a two-level structure, i.e., a base module and a combining module. Our proposed framework is shown in Figure 2, which illustrates the process to stack models using the base and combining modules.

Base Module
In our framework, the task of the base module is to use the training set to train the base classifiers; the metadata generated by these base classifiers is then used to train the metaclassifier.
In heterogeneous stacking-based ensemble learning, selection of the base classifiers is crucial to model performance because each classifier has its own advantages. It is generally believed that the diversity of the underlying individual classifiers is a key factor in good integration. To obtain discriminatory metadata for classification, the base classifiers should be as diverse and as complementary as possible. The goal of this paper is to solve the spam detection problem for social networks, which is regarded as a binary classification problem. Therefore, we employ the following six different base classifiers to form the base module of our framework: the support vector machine (SVM) [38], CART [39], Gaussian Naive Bayes (GNB) [40], K-nearest neighbors (KNN) [41], random forest (RF) [42], and linear regression (LR) [43]. Each of these algorithms approaches binary classification from its own point of view.
This stacked ensemble learning approach uses the prediction results of the base classifiers as the input of the combining module. However, we cannot directly use the complete dataset to train and test the base classifiers and send the prediction results to the combining module for training. Because the model would already have "seen" the test set, a risk of overfitting exists when the same data is input for prediction, which tends to have a large impact on model validation.
The metadata generation methods of the stacking model include bootstrap, bagging, and cross-validation. As shown in Figure 2, in this study, we selected the K-fold cross-validation method.
First, we partition the original dataset into a training set $D_{train}$ and a test set $D_{test}$. During the K-fold cross-validation procedure, $D_{train}$ is split into $K$ disjoint subsets of the same size; each subset is called a fold and maintains the same class proportions as the original dataset. Each cross-validation round consists of a training phase on $D_{train}$ and a testing phase on $D_{test}$. We take one classifier $C_n$ ($n = 1, \ldots, N$) as an example, where $N$ represents the number of base classifiers. At the training stage, we use one subset as a validation set $D_{valid}$ and use the remaining subsets as training sets. We repeat this procedure $K$ times, and all the prediction results on the validation sets are merged into a prediction matrix $P_n$ ($n = 1, \ldots, N$). At the test stage, we apply $C_n$ to $D_{test}$ to generate a classification matrix. After repeating this procedure $K$ times, we obtain $K$ classification matrices and average them by rows to generate a matrix $A_n$ ($n = 1, \ldots, N$). The entire procedure above is repeated for the $N$ classifiers; all the prediction matrices $P_n$ are combined into a new training set $P$, and all the $A_n$ are averaged to obtain a new test set $A$. Through this method, the generated metadata can be guaranteed to be test results rather than results obtained by overfitting the training samples. In our method, the number of folds ($K$) is set to 10, a choice based on the size of the real dataset and on experience from other research.
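The metadata generation procedure above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `CentroidClassifier` and `StumpClassifier` are hypothetical toy stand-ins for the six base classifiers, `stratified_folds` and `stack_metadata` are illustrative helper names, and the synthetic two-feature data is invented. One further assumption: the sketch places each fold-averaged $A_n$ in its own column so that $A$ has the same $N$ columns as $P$.

```python
import random


class CentroidClassifier:
    """Toy base classifier: scores a sample by its relative distance to the class centroids."""
    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_score(self, X):
        def dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
        scores = []
        for x in X:
            d0, d1 = dist(x, self.centroids[0]), dist(x, self.centroids[1])
            scores.append(d0 / (d0 + d1) if d0 + d1 > 0 else 0.5)
        return scores


class StumpClassifier:
    """Toy base classifier: thresholds one feature at the midpoint of the class means."""
    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        def class_mean(label):
            values = [x[self.feature] for x, t in zip(X, y) if t == label]
            return sum(values) / len(values)
        m0, m1 = class_mean(0), class_mean(1)
        self.threshold = (m0 + m1) / 2
        self.sign = 1 if m1 >= m0 else -1
        return self

    def predict_score(self, X):
        return [1.0 if self.sign * (x[self.feature] - self.threshold) > 0 else 0.0
                for x in X]


def stratified_folds(y, k, seed=0):
    """Assign sample indices to k folds, keeping class proportions similar in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, t in enumerate(y) if t == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def stack_metadata(make_classifiers, X_train, y_train, X_test, k=10):
    """Return (P, A): out-of-fold training metadata and fold-averaged test metadata."""
    folds = stratified_folds(y_train, k)
    n_models = len(make_classifiers)
    P = [[0.0] * n_models for _ in X_train]
    A = [[0.0] * n_models for _ in X_test]
    for n, make in enumerate(make_classifiers):
        for valid_idx in folds:
            held_out = set(valid_idx)
            fit_idx = [i for i in range(len(X_train)) if i not in held_out]
            model = make().fit([X_train[i] for i in fit_idx],
                               [y_train[i] for i in fit_idx])
            valid_scores = model.predict_score([X_train[i] for i in valid_idx])
            for i, s in zip(valid_idx, valid_scores):
                P[i][n] = s              # out-of-fold prediction, column n of P
            for j, s in enumerate(model.predict_score(X_test)):
                A[j][n] += s / k         # average of the K test predictions
    return P, A


# Hypothetical imbalanced data (10:1): class 1 = spam, class 0 = non-spam.
rng = random.Random(42)

def sample(label):
    center = 2.0 if label == 1 else 0.0
    return [rng.gauss(center, 1.0), rng.gauss(center, 1.0)]

y_train = [1] * 20 + [0] * 200
X_train = [sample(t) for t in y_train]
y_test = [1] * 5 + [0] * 50
X_test = [sample(t) for t in y_test]

makers = [CentroidClassifier, lambda: StumpClassifier(0), lambda: StumpClassifier(1)]
P, A = stack_metadata(makers, X_train, y_train, X_test, k=10)
print(len(P), len(P[0]), len(A), len(A[0]))
```

Each row of `P` is produced by a model that never saw that training sample, which is the property the K-fold scheme is meant to guarantee. In practice a library such as scikit-learn would normally supply the base classifiers and the stratified splitting.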
In the base module of this framework, we first train the different base classifiers to generate metadata with new features. Then, we input the resulting metadata to the combining module to train the metaclassifier.

Combining Module
Under the concept of stacked generalization, the outputs of the ensemble serve as the inputs to the metaclassifier, which learns a mapping between the metadata and the real class labels [44]. Metaclassifier selection is also important in ensemble learning, and appropriate data-combining strategies can improve the final classification capabilities. In this paper, we apply a deep neural network (DNN) improved with cost-sensitive learning as a metaclassifier for class imbalance tasks.
DNN models have strong learning ability and can extract higher-level features via their deep network structures. Therefore, using a DNN in the ensemble strategy has unique advantages for finding hidden information in metadata. Although DNNs have been successfully applied in many fields because of their powerful data mining ability, few studies have used a DNN to solve typical class imbalance problems in the social networking spam detection field.
As shown in Figure 3, the DNN model consists of an input layer, hidden layer(s), and an output layer. The input layer accepts information from the external world into the network. The hidden layer(s) extract multilevel input features to partition the different types of data linearly. Each hidden layer $h$ ($h \in \{1, \ldots, H\}$) has a set of parameters $\theta_h = \{W_h, b_h\}$, where $W_h$ is a fully connected weight matrix, and $b_h$ is a bias vector. The output layer is responsible for computing and transmitting information from the network to the outside world.
Training adjusts the weights and offsets between the hidden layer(s) and the output layer and between the input layer and the hidden layer(s).
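As a rough illustration of this layered structure, the sketch below runs one forward pass through a small fully connected network in plain Python. The helper names (`dense`, `softmax`, `forward`), the layer sizes, and all parameter values are hypothetical; the tanh hidden activation and the two-neuron softmax output are choices made for the sketch.

```python
import math


def dense(W, b, x):
    """Affine map W x + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden_layers, output_layer):
    """Propagate x through tanh hidden layers and a softmax output layer."""
    for W, b in hidden_layers:
        x = [math.tanh(z) for z in dense(W, b, x)]
    W_out, b_out = output_layer
    return softmax(dense(W_out, b_out, x))


# Hypothetical parameters: 3 metadata inputs -> 2 hidden units -> 2 classes.
hidden_layers = [([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1])]
output_layer = ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
probs = forward([0.9, 0.7, 0.8], hidden_layers, output_layer)
print(probs)
```

The two outputs are class probabilities that sum to one; training would adjust the entries of each weight matrix and bias vector, which are fixed by hand here.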
Given a fully connected DNN with $H$ hidden layers, as shown in Figure 3, during forward propagation, for an input feature vector $x$, the $H$ hidden layers of the DNN describe a complex feature transform function by computing:
$$z = W_h x + b_h, \qquad f_h(x) = S(z),$$
where $W_h$ and $b_h$, respectively, represent the weight matrix and bias vector in each hidden layer $h$ ($h \in \{1, \ldots, H\}$), denoted as $\theta_h = \{W_h, b_h\}$, $x$ is the input feature vector from the previous layer, and $S(z)$ denotes an activation function, which can be tanh or sigmoid. Because spam identification is a binary classification problem, the output layer has two neurons, and the softmax function is used between the last hidden layer and the output layer after feature conversion. In the output layer, the j-th neuron is responsible for estimating the probability that a given sample $x$ belongs to class $j$:
$$P(y = j \mid x) = \frac{\exp\left(W_{out}^{(j)} x + b_{out}^{(j)}\right)}{\sum_{k=1}^{2} \exp\left(W_{out}^{(k)} x + b_{out}^{(k)}\right)},$$
where $W_{out}^{(j)}$ and $b_{out}^{(j)}$ represent the weights and bias of the j-th neuron in the output layer, respectively, and $x$ here denotes the feature vector after the hidden-layer transform.
The goal of a standard machine learning method is to minimize the number of false predictions, but because its loss function uses the same misclassification cost for all the considered classes, it is highly susceptible to skewed class distributions [45]. This occurs because, under class imbalance, the loss function is easily minimized by focusing on the majority class and largely (or, in extreme cases, completely) ignoring the minority class. To solve the class imbalance problem of spam detection on social networks, we formalize it as a cost-sensitive classification problem, which assumes that an asymmetric misclassification cost exists between classes, defined in the form of a cost matrix, as shown in Table 1.
Table 1. Cost matrix for binary classification.

                 Predicted positive    Predicted negative
True positive    cost(p, p)            cost(p, q)
True negative    cost(q, p)            cost(q, q)

The typical form of cost-sensitive learning is to use a cost matrix such as the one shown in Table 1. Here, cost(p, q) denotes the cost of misclassifying an instance belonging to class p into a different class q. In spam detection, we regard spam as positive samples and non-spam as negative samples. Therefore, the misclassification cost of the minority class, cost(p, q), is higher than that of the majority class, cost(q, p). Using feedback from the base classifier performances, we use the misclassification ratio of the minority class in the base modules as the misclassification cost of the positive samples and set the misclassification cost of the majority class to one. The diagonal elements of the cost matrix, cost(p, p) and cost(q, q), represent correct predictions, and their cost is zero: cost(p, p) = cost(q, q) = 0 [32].
According to the minimum expected cost principle, the goal of cost-sensitive classification is to train the classifier so that it assigns each sample to the class that has the minimum expected cost. The classifier we are training therefore provides a class decision for each sample. The expected risk $R(p \mid x)$ of assigning a sample $x$ ($x \in X$) to the output class $p$ can be expressed as follows:

$$R(p \mid x) = \sum_{q=1}^{K} (q, p) \, P(q \mid x)$$

where $(q, p)$ is the cost of assigning a sample of class $q$ to class $p$, and $P(q \mid x)$ represents the posterior probability that a given sample $x$ belongs to class $q$ in a dataset of $K$ classes. On the basis of Bayesian decision theory, an ideal classifier makes its final decision by calculating the expected risk of each possible classification of a sample and predicting the label that achieves the minimum expected risk:

$$\arg\min_{p} R(p \mid x)$$

For a sample $x^{(i)}$ and its corresponding label $y^{(i)}$, the empirical risk can be expressed as:

$$R_{emp} = \frac{1}{N} \sum_{i=1}^{N} l\left(x^{(i)}, y^{(i)}\right)$$

where $l(\cdot)$ represents a loss function, such as the mean square error (MSE) or the cross-entropy loss function, and $N$ is the total number of data samples.

As described in Section 3, the goal of cost-sensitive learning is to minimize the overall cost on the training dataset (e.g., the Bayesian conditional risk). The misclassification cost can be regarded as a penalty factor introduced during classifier training (or, in some cases, in the prediction step) to raise the importance of classes that are difficult to classify, such as the spam class. By imposing larger penalties for errors on a given class, we force the training process, which is intended to minimize the overall cost, to focus on instances from the given distribution.
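The minimum expected risk rule described above can be illustrated with a small plain-Python sketch. The cost values, the class ordering (index 0 for spam, index 1 for non-spam), and the function names `expected_risk` and `min_risk_decision` are assumptions made for illustration only; the cost table is indexed as `cost[true][predicted]`.

```python
def expected_risk(cost, posterior, predicted):
    """Risk of predicting `predicted`: sum over the true classes of
    cost[true][predicted] * P(true | x)."""
    return sum(cost[true][predicted] * posterior[true]
               for true in range(len(posterior)))


def min_risk_decision(cost, posterior):
    """Return the class index with the smallest expected risk, plus all risks."""
    risks = [expected_risk(cost, posterior, k) for k in range(len(posterior))]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks


# Hypothetical costs: missing a spam sample costs 5, a false alarm costs 1,
# and correct decisions cost 0. Index 0 = spam, index 1 = non-spam.
ASYMMETRIC = [[0.0, 5.0],
              [1.0, 0.0]]
SYMMETRIC = [[0.0, 1.0],
             [1.0, 0.0]]

posterior = [0.3, 0.7]  # P(spam | x) = 0.3, P(non-spam | x) = 0.7

label_asym, risks_asym = min_risk_decision(ASYMMETRIC, posterior)
label_sym, risks_sym = min_risk_decision(SYMMETRIC, posterior)
print(label_asym, risks_asym)
print(label_sym, risks_sym)
```

With symmetric costs the rule reduces to picking the most probable class, whereas the asymmetric costs shift the decision toward spam even though its posterior probability is lower.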
In this study, we use a cost-sensitive modified cross-entropy as the loss function during classifier training. This paper mainly focuses on spam detection, and therefore we pay more attention to the spam class than to the non-spam class. Thus, (p, q) should be larger than (q, p), which makes misclassifying a spam sample more costly. The overall error of the cost-sensitive DNN can be formulated as follows:

$$E = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log\big(P(y = j \mid x, \theta) \cdot (p, q)\big) + (1 - y_n) \log\big(1 - P(y = j \mid x, \theta) \cdot (q, p)\big) \right]$$

The backpropagation algorithm is the most common method for optimizing the DNN parameters and is essentially a gradient descent procedure. We minimize the loss and optimize the parameters through backpropagation using minibatch stochastic gradient descent. Jiang et al. [46] reported that introducing costs into cross-entropy (CE) losses affects the output but does not change the gradient formulas.
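As a rough illustration of how asymmetric costs can enter a cross-entropy loss, the sketch below implements a simple class-weighted binary cross-entropy in plain Python. This is a swapped-in, commonly used variant in which the costs multiply the log terms; it is not claimed to be the exact loss used in this paper. The function name `weighted_cross_entropy`, its parameters, and the cost values are assumptions chosen for the example.

```python
import math


def weighted_cross_entropy(y_true, p_spam, cost_fn=5.0, cost_fp=1.0, eps=1e-12):
    """Class-weighted binary cross-entropy (illustrative variant).

    y_true  : labels, 1 for spam (positive) and 0 for non-spam (negative)
    p_spam  : predicted probabilities P(spam | x)
    cost_fn : assumed cost of misclassifying spam, playing the role of (p, q)
    cost_fp : assumed cost of misclassifying non-spam, playing the role of (q, p)
    """
    total = 0.0
    for y, p in zip(y_true, p_spam):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += cost_fn * y * math.log(p) + cost_fp * (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


labels = [1, 0, 0, 0]          # one spam sample among three non-spam samples
probs = [0.2, 0.1, 0.1, 0.1]   # the spam sample is badly under-scored

plain = weighted_cross_entropy(labels, probs, cost_fn=1.0)
weighted = weighted_cross_entropy(labels, probs, cost_fn=5.0)
print(plain, weighted)
```

Raising the cost of the spam class increases the contribution of the missed spam sample to the loss, so gradient-based training is pushed harder to correct errors on the minority class.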

Experiments
This section evaluates the heterogeneous stacking-based ensemble learning method on the imbalanced data problem of spam detection in social networks. We perform extensive experiments on a real dataset and compare the performance of our proposed algorithm with that of other algorithms used in social network spam detection. We first describe the dataset in detail. Section 4.2 describes the metrics used to assess the results, Section 4.3 reports the experimental design, and Section 4.4 discusses the experimental results.

Experimental Dataset
These experiments were conducted using the dataset collected by Chen et al. [47], which contains 600 million tweets, of which 6.5 million are malicious. This dataset was made available to other researchers studying spam detection. In the dataset, tweets containing malicious URLs are defined as Twitter spam.
Each tweet in the dataset is represented as a feature vector that contains user-based features and tweet-based features. The 12 lightweight statistical features that can be extracted directly from tweets are shown in Table 2. The first six are user-based features, and the remainder are tweet-based features.

Table 2. Twitter spam dataset.

| Feature | Description |
|---|---|
| account_age | The age (days) of an account since its creation until the time of sending the most recent tweet |
| no_follower | The number of followers of this Twitter user |
| no_following | The number of followings/friends of this Twitter user |
| no_userfavourites | The number of favorites this Twitter user received |
| no_lists | The number of lists this Twitter user added |
| no_tweets | The number of tweets this Twitter user sent |
| no_retweets | The number of retweets of this tweet |
| no_hashtag | The number of hashtags included in this tweet |
| no_usermention | The number of user mentions included in this tweet |
| no_urls | The number of URLs included in this tweet |
| no_char | The number of characters in this tweet |
| no_digits | The number of digits in this tweet |

Evaluation Metrics
The most common metrics used to measure classification performance are accuracy, precision, recall, and the F1-score. In class imbalance problems, accuracy does not reflect the overall situation, because a classifier that assigns every sample to the majority class still achieves high accuracy. We therefore employ the true positive rate (TPR), false positive rate (FPR), precision, F1-score, G-mean, and Kappa to measure the performance of the proposed spam detection method. We compute these indicators from a confusion matrix.
As shown in Table 3, each row represents the actual class, while each column represents the predicted class. TP, FP, FN, and TN represent true positives, false positives, false negatives, and true negatives, respectively.

Table 3. Confusion matrix.

| | Predicted Spam | Predicted Non-Spam |
|---|---|---|
| Actual spam | TP | FN |
| Actual non-spam | FP | TN |

The performance measures can be calculated as follows:
• True positive rate (TPR) The true positive rate, also known as recall, indicates the ratio of positive samples that are correctly classified.

$$TPR = \frac{TP}{TP + FN}$$

• False positive rate (FPR) The false positive rate indicates the ratio of negative samples that are classified as positive. In this paper, FPR represents the proportion of majority class samples classified into the minority class.

$$FPR = \frac{FP}{FP + TN}$$

• Precision Precision represents the ratio of correctly predicted positive samples to all predicted positive samples.

$$Precision = \frac{TP}{TP + FP}$$

• F1-score The F1-score is the harmonic mean of precision and recall. It is an important metric for evaluating the overall performance of our method.

$$F1 = \frac{2 \times Precision \times TPR}{Precision + TPR}$$

• G-mean G-mean evaluates the degree of inductive bias by combining the accuracy on positive samples with the accuracy on negative samples. A higher G-mean indicates that the classifier performs better on both the majority and minority classes.

$$G\text{-}mean = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}$$

• Kappa Kappa is an important index for measuring classification performance on imbalanced datasets. It represents the proportion of error reduction between the classifier and a completely random classification, and it measures the consistency between the classifier and the target distribution.

$$Kappa = \frac{p_o - p_e}{1 - p_e}$$

where

$$p_o = \frac{TP + TN}{TP + FP + FN + TN}, \qquad p_e = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{(TP + FP + FN + TN)^2}$$
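The metrics listed above can be computed directly from the four confusion-matrix counts. The sketch below assumes the standard definitions (G-mean as the geometric mean of TPR and the true negative rate, and Cohen's Kappa from observed and chance agreement). The function name `imbalance_metrics` and the example counts are hypothetical and are not taken from the experiments.

```python
import math


def imbalance_metrics(tp, fp, fn, tn):
    """Compute TPR, FPR, precision, F1, G-mean and Kappa from confusion counts.

    No guard against empty classes is included; every denominator is
    assumed to be non-zero.
    """
    total = tp + fp + fn + tn
    tpr = tp / (tp + fn)            # recall on the spam (positive) class
    fpr = fp / (fp + tn)            # non-spam wrongly flagged as spam
    tnr = tn / (tn + fp)            # recall on the non-spam class
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    g_mean = math.sqrt(tpr * tnr)
    p_o = (tp + tn) / total         # observed agreement
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    return {"tpr": tpr, "fpr": fpr, "precision": precision,
            "f1": f1, "g_mean": g_mean, "kappa": kappa}


# Hypothetical counts: 100 spam and 1000 non-spam test samples.
m = imbalance_metrics(tp=70, fp=30, fn=30, tn=970)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Accuracy for these hypothetical counts would be about 0.945, which shows why it is a poor summary here: it hides the fact that 30% of the spam samples are missed.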

Experimental Protocol
To facilitate the experiments, we first normalize the dataset and then randomly select spam and non-spam samples in the proportions required by each class imbalance ratio to form the experimental datasets. Finally, for each experimental dataset, we randomly select 50% of the samples of each class as training data and use the rest for testing.
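The sampling and splitting steps of this protocol can be sketched in plain Python as follows. The sketch assumes that the imbalance ratio is the number of majority samples divided by the number of minority samples, omits the normalization step (whose exact form is not specified here), and uses hypothetical function names, pool sizes, and seeds.

```python
import random


def sample_with_imbalance(spam, non_spam, ir, n_spam, seed=0):
    """Draw n_spam spam samples and ir * n_spam non-spam samples without replacement."""
    rng = random.Random(seed)
    return rng.sample(spam, n_spam), rng.sample(non_spam, ir * n_spam)


def split_half(samples, seed=0):
    """Randomly split the samples of one class into two halves (train, test)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical pools of labelled samples standing in for the real feature vectors.
spam_pool = [("spam", i) for i in range(1_000)]
ham_pool = [("non-spam", i) for i in range(50_000)]

spam, ham = sample_with_imbalance(spam_pool, ham_pool, ir=10, n_spam=200)

# Split each class separately so that train and test keep the same ratio.
spam_train, spam_test = split_half(spam)
ham_train, ham_test = split_half(ham)
train = spam_train + ham_train
test = spam_test + ham_test
print(len(train), len(test))
```

Splitting each class on its own keeps the chosen imbalance ratio identical in the training and test halves, which a single split over the pooled data would only achieve approximately.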
As described in Section 3.2, to ensure the diversity of ensemble learning, six different base classifiers constitute the base module. To achieve better classification results, the optimal parameters of each base classifier are determined via a grid search based on 10-fold cross-validation.
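A grid search with 10-fold cross-validation can be sketched in plain Python as below. In practice a library routine such as scikit-learn's `GridSearchCV` would normally be used; the hand-written helpers `k_fold_indices` and `grid_search_cv`, the toy threshold classifier, and the synthetic data are all assumptions made for illustration and do not reproduce the six base classifiers or their parameter grids.

```python
import itertools
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, validation_indices) for k shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def grid_search_cv(xs, ys, fit, score, grid, k=10):
    """Return the parameter combination with the best mean validation score."""
    best_params, best_score = None, float("-inf")
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for train, val in k_fold_indices(len(xs), k):
            model = fit([xs[i] for i in train], [ys[i] for i in train], **params)
            scores.append(score(model, [xs[i] for i in val], [ys[i] for i in val]))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def fit_threshold(train_x, train_y, threshold):
    """Toy stand-in for a base classifier: a fixed rule that ignores the training data."""
    return lambda x: int(x > threshold)


def accuracy(model, val_x, val_y):
    return sum(model(x) == y for x, y in zip(val_x, val_y)) / len(val_y)


# Synthetic one-dimensional data whose true decision boundary is 0.6.
xs = [i / 100 for i in range(100)]
ys = [int(x > 0.6) for x in xs]

best_params, best_score = grid_search_cv(
    xs, ys, fit_threshold, accuracy, {"threshold": [0.2, 0.4, 0.6, 0.8]})
print(best_params, best_score)
```

Accuracy is used as the selection score only to keep the toy example short; on imbalanced data a score such as the F1-score would be a more natural choice.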
In the combining module, we determine the final structure of the network through experiments. The number of DNN layers is selected from {2,3,4,5}, and the number of nodes in each hidden layer is selected from {16,32,64,128}. Following the experimental protocol, the classification performance peaks when the DNN has three hidden layers, with 64 hidden units in the first hidden layer, 32 in the second, and 16 in the last. We adopt the sigmoid function as the activation function of the hidden layers. We also use dropout to avoid overfitting.
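A forward pass through a network of this shape can be sketched in plain Python as follows. The input size of six (one prediction per base classifier), the random weight initialization, and all function names are assumptions made for illustration; dropout and training are omitted, so this is not the trained model from the experiments.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, biases):
    """Affine map W x + b, with `weights` stored as one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary choice for this sketch)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, hidden_layers, output_layer):
    """Sigmoid hidden layers followed by a two-neuron softmax output."""
    h = x
    for weights, biases in hidden_layers:
        h = [sigmoid(z) for z in dense(h, weights, biases)]
    w_out, b_out = output_layer
    return softmax(dense(h, w_out, b_out))


rng = random.Random(0)
sizes = [6, 64, 32, 16]  # assumed 6 inputs, then the 64-32-16 hidden layers
hidden = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
output = init_layer(sizes[-1], 2, rng)

# Hypothetical base-classifier scores for one tweet.
probs = forward([0.9, 0.8, 0.1, 0.7, 0.6, 0.95], hidden, output)
print(probs)
```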

Results and Discussion
In spam detection in social networks, although the number of spam items is small, their threats and impacts can be substantial. Therefore, under class imbalance, it is more important to classify the minority class accurately than the majority class. The main purposes of this experiment are to compare the proposed method with existing spam classification methods on imbalanced datasets to determine which detects spammers most accurately and to verify the effectiveness and robustness of the proposed method when dealing with similar imbalance problems.
Each experiment is repeated 10 times and, then, the average values are calculated and used in the comparisons to verify the robustness of the proposed method.

Comparisons with Base Classifiers
In this section, we compare the proposed method with four conventional machine learning algorithms. For simplicity, all the conventional algorithms use their default parameters. We adopt a class imbalance ratio of 10 as an example. The classification results on the test data are given in Table 4 in terms of TPR, FPR, precision, F1-score, G-mean, and Kappa. Table 4 shows that the class imbalance problem strongly affects the performance of the conventional machine learning algorithms. For example, the TPR of the SVM algorithm is only 0.10, which indicates that a large number of spam samples (the minority class) are misclassified as non-spam (the majority class). As a result, the G-mean of SVM is only 0.31 and its Kappa is 0.16. Although the GNB method achieves a TPR of 0.91, its false positive rate reaches 0.81, so its precision, F1-score, and Kappa values are the lowest. These results show that GNB misclassifies a large number of non-spam samples as spam.
Figure 4 presents the classification performance of each method in terms of TPR, FPR, precision, F1-score, G-mean, and Kappa. The histogram illustrates that only the CART method and our method have stable metrics; the other methods fluctuate widely. For example, the precision of SVM is 0.77, but its TPR is 0.10, its F1-score is 0.18, and its Kappa is 0.16, which means that the method is strongly influenced by the majority class and its classification results tend toward non-spam. The TPR, precision, F1-score, and Kappa of the CART method are stable within the range 0.50-0.57, but these values are lower than those of our approach.
In comparison to the conventional machine learning algorithms on the same dataset, Table 4 and Figure 4 show that our approach performs better than other approaches. In particular, the F1-score value (70%) and Kappa value (67%) of our approach are much higher than those of the other methods.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 13 of 18

Comparisons with Class Imbalance Methods
Building on the comparison with conventional machine learning algorithms in the previous section, we further compare our algorithm with more advanced algorithms in terms of their ability to solve the class imbalance problem. Section 2.2 mentions that ensemble learning can effectively improve performance on class imbalance problems. Therefore, we add ensemble learning algorithms that use majority voting as the ensemble strategy to the comparison. CSDNN is an algorithm based on cost-sensitive learning, as discussed in Section 3.2, while AdaCost, MetaCost, and WSNN are all improved algorithms based on cost-sensitive learning. Table 5 presents a comparison of the above algorithms in terms of TPR, FPR, precision, F1-score, and Kappa. As shown in Table 5, the performance of these improved methods is significantly better than that of the conventional machine learning algorithms. For example, the TPR values in Table 5 fluctuate less than those in Table 4, and their mean is higher. The same holds for G-mean, which averages 0.78 in this experiment but 0.59 in the previous experiment. These results show that the methods in this experiment provide improvements when faced with class imbalance problems.
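The majority-voting strategy used by the ensemble baseline can be sketched as below. How the baseline was actually implemented is not specified here, so the function name `majority_vote`, the three hypothetical classifiers, and their predictions are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions: list with one entry per classifier, each a list of labels
    (1 = spam, 0 = non-spam) in the same sample order. An odd number of
    classifiers is assumed, so binary votes cannot tie.
    """
    voted = []
    for sample_votes in zip(*predictions):
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted


# Hypothetical predictions of three classifiers on four samples.
clf_a = [1, 0, 0, 1]
clf_b = [1, 1, 0, 0]
clf_c = [0, 1, 0, 1]

final = majority_vote([clf_a, clf_b, clf_c])
print(final)  # [1, 1, 0, 1]
```

Unlike the stacking approach of the proposed framework, majority voting gives every classifier the same fixed weight and does not learn how to combine the predictions.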

Figure 5 intuitively illustrates an experimental comparison between our proposed method and the improved methods. Considering the MetaCost method as an example, the TPR of this method is 0.69, which is very close to the 0.70 achieved by our method, but its accuracy, F1-score, and Kappa are 0.3, 0.19, and 0.22 lower than those of our method. This phenomenon indicates that although the method obtains a better positive sample recognition rate, it also misclassifies a large number of negative samples. In the same way, the ensemble learning method achieved good results in this group of experiments, but its accuracy and F1-score are 7% and 4% lower than that of our method, and its Kappa is 5% lower. Therefore, our method yields good performances as compared with those of other improved methods. In addition, it is worth noting that CSDNN, AdaCost, MetaCost, and our method are all improved by cost-sensitive learning, but our method adopts the integrated learning framework based on stacking, and the data processing of the basic module effectively improves the classification performance of the overall algorithm.

Comparisons with Varying Class Imbalance Rates
To verify the robustness of our method, we conducted experiments under five different class imbalance rates. According to the results of the previous two experiments, CART and ensemble learning are the next-best methods after our approach. Therefore, this section compares the performance of the proposed method with these two methods under different class imbalance rates. In this work, we define the class imbalance rate as follows:

$$IR = \frac{|Majority|}{|Minority|}$$

where |Majority| and |Minority| are the numbers of samples belonging to the majority and minority classes, respectively. Figure 6 compares the average F1-score results of the three methods at different class imbalance rates (i.e., IR equals 2, 6, 10, 14, and 18).
As shown in Figure 6, the F1-score of all three methods gradually decreases as the class imbalance rate increases. For the ensemble learning method, the F1-score is 77% when IR = 2 but drops to approximately 55% when IR increases to 18, an overall decline of 22 percentage points. The second method, CART, falls by 20 percentage points, from 71% to 51%. Our method also shows a downward trend as the imbalance rate increases, from 78% when IR = 2 to 67% when IR = 18, but this trend is gentler than that of the other methods. Our method achieves the best F1-score at IR = 2 (78%). When IR increases to 10, our result is 70%, the result of ensemble learning is 66%, and CART falls to 55%. When the class imbalance rate increases to 18, the F1-score of our method is 67%, which is only 3 percentage points lower than when IR = 10, whereas CART and ensemble learning are 4 and 11 percentage points lower, respectively. Overall, the comparison shows that our method maintains relatively stable performance across different class imbalance rates and therefore shows better robustness.
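The declines quoted above follow directly from the F1-scores reported in the text. The short check below recomputes them and also shows the imbalance rate as the ratio of majority to minority samples; the dictionary layout and the function name `imbalance_rate` are assumptions made for illustration.

```python
def imbalance_rate(n_majority, n_minority):
    """Class imbalance rate: majority-class size divided by minority-class size."""
    return n_majority / n_minority


# F1-scores (%) quoted in the text for IR = 2, 10 and 18.
f1 = {
    "ours": {2: 78, 10: 70, 18: 67},
    "ensemble": {2: 77, 10: 66, 18: 55},
    "CART": {2: 71, 10: 55, 18: 51},
}

# (overall drop from IR = 2 to 18, drop from IR = 10 to 18), in percentage points
drops = {name: (s[2] - s[18], s[10] - s[18]) for name, s in f1.items()}

for name, (overall, late) in drops.items():
    print(f"{name}: overall drop {overall}, drop from IR=10 to IR=18 {late}")

print(imbalance_rate(1000, 100))  # 10.0
```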

Conclusions
In this paper, we proposed a spam detection framework for social networks. Considering the class imbalance problem in spam detection, we designed a heterogeneous stacking-based ensemble framework that balances the training of the base classifiers and the meta-classifier at the data and algorithm levels, respectively. First, we used six different learning methods as base classifiers to improve the learning effect of the base module. Then, we applied a deep neural network improved with cost-sensitive learning to implement the ensemble strategy. Training the meta-classifier with the individual errors of the classifiers from the previous stage, so that biased behavior can be detected, reduced the impact of imbalanced class distributions on classification performance.
We verified the validity of our method on the real spam dataset of Twitter. The experimental results show that the proposed method can handle class imbalance well and obtains the best classification performance among compared approaches.
In the future, we plan to mine deeper hidden feature representations, as well as test classifiers trained with different dataset features to further improve the spam detection performance on social networks.
Author Contributions: Conceptualization, C.Z.; methodology, C.Z.; software, X.L. and Y.C.; writing-original draft, C.Z.; writing-review and editing, Y.X. and Y.Y. All authors have read and agreed to the published version of the manuscript.