Social Botomics: A Systematic Ensemble ML Approach for Explainable and Multi-Class Bot Detection

Abstract: OSN platforms are under attack by intruders born and raised within their own ecosystems. These attacks range in scope from mild critiques to violent offences targeting individual or community rights and opinions. Negative publicity on microblogging platforms, such as Twitter, is due to the infamous Twitter bots, which strongly affect the circulation and virality of posts. A wide and ongoing research effort has been devoted to developing appropriate countermeasures against emerging “armies of bots”. However, the battle against bots is still intense and, unfortunately, it seems to be tilting in the bots' favor. Since, in an effort to win any war, it is critical to know your enemy, this work aims to demystify, reveal, and widen inherent characteristics of Twitter bots such that multiple types of bots are recognized and spotted early. More specifically, in this work we: (i) extensively analyze the importance and the type of data and features used to generate ML models for bot classification, (ii) address the open problem of multi-class bot detection, identifying new types of bots, and share two new datasets towards this objective, (iii) provide new individual ML models for binary and multi-class bot classification and (iv) utilize explainable methods and provide comprehensive visualizations to clearly demonstrate interpretable results. Finally, we utilize all of the above in an effort to improve the so-called Bot-Detective online service. Our experiments demonstrate high accuracy, explainability and scalability, comparable with the state of the art, despite multi-class classification challenges.


Introduction
OSNs have gone well beyond information circulation, becoming a vital part of all human activities. The power of OSNs in diverting public opinion can be compared to, and is frequently even greater than, the power of traditional mass media or other forms of social interaction [1]. Users of varying intentions and origins continuously interact and impact public opinion trends, sometimes with disruptive consequences. The public openness of OSN platforms has given ground to the rise of automated accounts which mimic human behavior. Such accounts, also known as social bots, are machine- or human-controlled software, either benevolent or malevolent, depending on their intentions [2]. The impact of such accounts has attracted the attention of the scientific community, as their spread and reach are constantly increasing, while their activities are continuously refined and transformed.
Background. Particularly in Twitter, where the target audience is more vocal on differing opinions and ideologies, bots find fertile ground and ample opportunities in planting discord, stealing sensitive data and attracting users for personal gain. In many cases, such as the recent US elections, official government authorities are becoming increasingly aware of the interference of social bots in the voting process, by influencing the opinions of individuals [3]. As OSN users become more engaged and absorbed in using media, different social bots with various intentions, focused on specific purposes, emerge. For example, there may be bot accounts, the primary purpose of which is to continuously spam explicit, questionable or misleading content in an attempt to spread false rumors and news, known as spam bots [4][5][6][7]. Another emerging instance of accounts are politically infused bots, namely political bots, that get actively involved in elections and political debates in order to sow discord and deconstruct democratic procedures [8][9][10][11][12]. There are also bot-assisted humans or human-assisted bots, known as cyborgs [13] and even self-declared bots [9], which are accounts that announce their bot identity.
It is evident that the threat that bots impose on OSNs is alarming as they frequently create practical confrontations between individuals. These threats expand beyond individual harm and can create rifts and divisions between communities, polarizing the public [14,15]. Social bots have frequently been employed in voicing negative opinions on sensitive matters (e.g., climate change) [16], causing harmful consequences. Moreover, they are also a challenge for the global economy, as their intrusion in internet traffic [17] and online transactions jeopardize the transparency of economic disputes and the protection of global economic data [18,19]. Characterizing all social bot accounts as ill-willed would be inaccurate, as there are many that are neutral or even beneficial for users [20]. However, the unfortunate reality is that the majority of social bot accounts do not have benign purposes [21].
Challenges. Recent research has shown that, in order to avoid being suspended, social bot accounts can potentially learn to adapt to human behaviors; they show evidence of evolution [22] and linger in social media for prolonged periods of time, further expanding their functions. What is also worrying is the fact that humans seem to have a hard time distinguishing bot from human accounts [23]. Although Twitter itself has put a lot of effort into detecting and removing fake and bot accounts [24][25][26], the issue still remains. Thus, detecting and tackling bot accounts is of vital importance for the structure of society to function properly. In accordance with this purpose, the existing and recent literature has focused mainly on developing robust frameworks and algorithms that can distinguish human from bot accounts and pinpoint some guidelines for future detection and suspension of such accounts. Dominant among the proposed methodologies are supervised and, to a lesser extent, unsupervised ML approaches which combine network dynamics for a well-rounded framework.
While significant research has been undertaken in detecting social bot accounts, there is a notable gap in distinguishing different types of bots and inferring what differentiates them [22]. The challenge that arises in this type of detection is that the existing models are suited for the binary classification of accounts and cannot be easily adjusted for the refined needs of detecting different types of bots. To facilitate the identification of different bots and eventually dictate new guidelines for tackling these accounts, a new ML approach is highly important. This work is motivated by the strong need for a refined framework which will advance ML algorithms to surpass the limitations of a simple bot or human distinction.
Another potential challenge is that behavioral information for the different characteristics of bot types is limited, and thus determining their varied traits requires more than a simple ML algorithm. Currently, there is restrained support in explaining the attributes that differentiate bot accounts (e.g., a spam bot from a simple bot). This creates a cloud of confusion in present bot detection models, as the plain labeling of bot or human cannot reveal the category of a bot account, nor the characteristics that led to this categorization. The present bibliography on bot detection adduces an impressive number of features employed in the detection of bots and humans. These features belong in several categories and serve different purposes on predicting the probability of an account being a bot. However, an open subject of research is the examination of the importance of these features in prediction and the deployment of an inferential process that could determine which features truly matter during a bot type classification. This work builds an efficient bot type detection approach which is more robust, exploiting the emerging field of explainable ML to provide feedback on the different classification of bot types.
Contributions. The above interconnected challenges have motivated this work which aims to cover the current research gap of multi-class bot detection, while also enhancing and improving current state-of-the-art practices with explainable and feature engineering methods. In summary, the major contributions of this work are the following:

1. Newly introduced bot categories and extensive exploratory analysis of datasets: We perform an exploratory analysis of all open datasets and reveal that most of them are outdated and do not keep pace with the evolutionary nature of bots (as discussed in Section 3). Based on this analysis, we define new bot categories that have not been detected in previous literature and map them to the training datasets of our classifiers, enriching a robust multi-class detection schema.

2. New ML models: We implement a set of classifiers combined with imbalance handling techniques, which in contrast to most previous works, are able to identify different types of bots. Our approach competes with existing state-of-the-art ML approaches, despite the multi-class classification challenges, as presented in Section 5.2. We also provide a publicly available web application and open API which uses these models for inferring both the probability of an account being a bot and the probability of it being a specific type of bot, at bot-detectiveV2.csd.auth.gr (accessed on 1 October 2021). We open our source code at https://github.com/idimitriadis/Botomics (accessed on 1 October 2021).

3. Two-stage multi-feature engineering: A multi-feature engineering process is introduced, aiming to examine the performance of the classifiers under different combinations of feature categories (as described in Section 4.1). Our experimentation has shown that even with few features, we achieve a high classification performance.

4. An explainable and exploratory ML framework: In contrast to most previous work, which did not emphasize the explainability of its results, we rationalize the predictions of the proposed classifiers (see Section 6). To that end, we utilize popular explainability frameworks, in conjunction with statistical methods, in order to profile each bot category, based on the features that most affect their prediction. Apart from providing bot probability scores through the API, we offer a web portal which also returns textual feature explanatory snippets to unfold the classification prediction steps.

5. Reproducible data sharing: We build and open two new datasets of accounts enhanced with the newly detected bot categories, which have not been included in the currently available bot and human account datasets. These datasets augment the different sources of bot types, contributing to further scientific research in the field.
The remainder of this paper is organized as follows: Section 2 summarizes the related work. Section 3 discusses the process of data collection and the results of our exploratory analysis, while Section 4 outlines our feature engineering approach and ML methodology. Section 5 presents the experimental results of our research and Section 6 describes our explainability approach. Finally, Sections 7 and 8 include the discussion, conclusions and future potential regarding this paper.

Related Work
Before delving into the developed methods and algorithms for bot detection, it is important to highlight the nature of a bot and its functionalities. According to Botwiki (https://botwiki.org/bots/, accessed on 1 October 2021), a bot is "a software application that runs automated tasks over the Internet. Typically, bots perform tasks that are both simple and structurally repetitive, at a much higher and intense rate than would be possible for a human alone." A different interpretation is that a bot is "a software application that is programmed to do certain tasks" (https://www.cloudflare.com/learning/bots/what-is-a-bot/, accessed on 1 October 2021). Essentially, bots are automated tools that function without the need for human intervention or control, except when control is defined by research objectives [13]. Their purpose is to perform repetitive tasks that would seem mundane or time consuming for humans, exhibiting much faster rates than a human account. A practical and more humanly intuitive notion is that bots are "predictable automatons that do not have the capacity for emotions, meaning-making, creativity, and sociality" [27].
Bots are rapidly gaining ground in OSNs and their dominance is already evident. For example, according to a recent survey, two thirds of the URLs shared in Twitter feeds are posted by bots [28]. As the popularity of bots grows and social media platforms routinely benefit or are harmed by their use, researchers have turned their attention to tracking bots, especially on Twitter. During the past years, there have been various attempts at detecting a bot, or specific categories of bots. In general, the timeline of bot detection spans a decade, with research studies increasing in later years [22]. This trend is further highlighted by the wide influence of bots in presidential elections [29,30], financial markets [19], pandemic related news [31] and cryptocurrency manipulation [32] as well as the various and different bot categories [2], and particularly social bots [33].
Recent studies have highlighted that the majority of existing literature relies either on ML approaches or unsupervised methodologies in order to construct robust implementations that fulfill this purpose. There have also been attempts that provide web-based tools which gather data related to user accounts on social media from specialized APIs and infer if the account is a bot. In the following we will review the latest and most important bot detection approaches.
Regarding supervised approaches and spam bots, Ref. [34] is the first attempted research work that refers to spam bot detection via an ML Classifier. Apart from classifying Twitter accounts as spam bot or human, this work takes the first step towards exploring the major characteristics of spam bot accounts, such as tweet frequency, longevity and networks of followers. Spammers and social bots are also the central point of another paper [35], where the authors construct fake accounts to attract spammer activity and exploit the accumulated data to construct robust classifiers based on textual and network features. However, it is fairly evident that, as detection methodologies evolve, so do bot accounts, which adapt and develop evasion strategies. Thus, Ref. [36] proposes new features that are not affected by evasion strategies and are either graph related (clustering coefficient, betweenness centrality) or neighbor related (followers of neighbors, tweets of neighbors, etc.). This meticulous construction of features eventually yields good results, with low false positive rates. Several studies [37][38][39] use Random Forest classifiers and rely on well-established user and content-based features for detection, while also examining the robustness of the utilized features. In addition, another approach [40] enhances the current frameworks by adding behavioral features, in an effort to move away from single account detection and instead track groups of spambots. Finally, Ref. [41] proposes a hybrid approach that primarily takes into account the interactions of a bot account with his followers and combines them with predefined features, constructing classifiers with high accuracy levels.
As far as social bots are concerned, a notable effort by the authors of [10] constructs various novel datasets that contain social bots which act as fake follower inflators for Twitter accounts. Along with this addition, they simultaneously examine the efficiency of different feature sets in conjunction with several different classifiers. The Random Forest classifier is once more proven dominant. BotOrNot [5,8,9,42], a more refined framework that utilizes over 1000 features distributed in several categories (user, content, network, sentiment), is also presented as a state-of-the-art solution in bot detection. In addition, BotOrNot is one of the few solutions that actually perform multi-class classification of bots. The authors of [28] propose a full-fledged framework that utilizes 1150 features and places emphasis on the interactions between a genuine human account and a social bot account. More recent efforts [43][44][45] move away from a simple classifier detection and instead focus on the malicious effect that social bots have on Twitter, by analyzing political and marketing campaigns or pointing out the biggest challenges in their detection, such as their evolving behavior, which largely imitates human actions. In addition, the latest methodologies include subsetting the bot accounts in order to select the best dataset setup for improved classifier accuracy [8] and adversarial algorithms [46,47] that synthetically create social bot accounts to directly interact with other accounts, learn their strategies and improve detection rates. Apart from works that focus solely on bots, Ref. [13] introduces the distinction between humans, bots and cyborgs (human-controlled bots) and proposes an entropy-based framework to classify them.
Unsupervised approaches are still gaining ground; thus, research ventures are notably limited in comparison with supervised methodologies. Researchers in [48] use statistical inference and behavioral features to trace clusters of extroverted users, labelling them as social bot accounts. Behavioral similarities are also the primary subject of [49], which reveals suspicious activities via unsupervised simulations. An online framework, namely DeBot, that does not require labeled data and is focused on correlated activities between accounts was presented in [50]. The rationale of this approach is that two accounts that appear to have highly similar activities are likely to be bots. The metric of temporal synchronicity is the core aspect of DeBot in detecting simultaneous changes in the state of discrete elements. In the same spirit, Rtbust [51] is another unsupervised framework that exploits temporal patterns (retweets, mentions, likes) to highlight malicious activity.
Recent studies have explored group approaches, that cannot be considered either supervised or unsupervised. Their focus is drawn towards botnets, which can be conceptualized as groups of bots that act together to achieve a common purpose. Notable in these methodologies are network approaches that attempt to detect synchronized and suspicious account behavior [52][53][54], analyze the connectivity of such accounts and propose appropriate countermeasures for their detection and deactivation.
Despite their robustness and effectiveness, bot detection methodologies have recently been criticized for some crucial deficiencies. As bot behaviors evolve over time, the produced methodologies are sometimes unable to stay up to date and are rendered useless in detection campaigns [6], while it has been pointed out that focusing on the evasion strategies of bots rather than simply detecting their presence is the next logical step towards the evolution of the field [55]. In addition, the need for training the classifiers on previously unknown bot classes in order to potentially increase their efficiency is raised as an urgent solution to restrained accuracy scores [56]. One of the main issues that emerge is the lack of credible, annotated datasets, on which supervised solutions are trained [57]. At this point it should also be mentioned that only a few works pay attention to the interpretability of their results [8,58,59], highlighting the need for more explainable models. Last but not least, bot detection platforms have been proven to be prone to increased false positive rates [60], a fact that undermines their credibility.
Our work provides a holistic approach in an attempt to fill the identified gaps, by broadening the scope of its research, rather than focusing on solving a particular issue. More specifically, our work initially presents an exploratory analysis of the open datasets, which have been used by all the previous works, and uncovers that most of these data are unavailable or outdated. At the same time, as recommended above, we propose a new validated division of different bot types, based on the description of each available dataset. Although feature engineering has been extensively discussed, we extend this work by providing a complete analysis of the features used in binary and multi-class bot classification tasks, while we also offer credible, interpretable results. Finally, we present a set of both binary and multi-class ML models, inspired by [9], which offer competitive results in comparison to current state-of-the-art approaches.

Current State of Bot-Related Data
In this section, we present the first two blocks of the proposed research methodology modular framework, as presented in Figure 1. In Section 3.1, we describe the data collection process and the results of our exploratory analysis. The analysis of the datasets allows us to proceed with a further categorization of bot types, as presented in Section 3.2.

Data Collection and Exploratory Analysis
As already mentioned, the volume of supervised ML methodologies focused on bot detection is constantly growing [22], and most approaches use already available labeled datasets. Most of these datasets can be found in the Data Repository section of Botometer [5,8,42]. This repository contains Twitter accounts and their identification numbers (Twitter user IDs), while also characterizing them as "human" or "bot". Apart from collecting data from Botometer, we also included two new datasets that were the result of a manual account search in Twitter. We opted to follow this approach to detect new types of accounts that were not contained in the Botometer repository, namely news agencies and companies that were labeled as "cyborgs", which also include celebrity accounts. This, in turn, allowed us to expand the defined bot categories that will be explained in subsequent sections. In Table 1, we present an overview of the characteristics of each dataset.
Having accumulated the targeted accounts and their IDs, instead of simply scraping Twitter for content (an action that is prohibited by the platform anyway), we exploited existing services provided by Twitter that enable rapid and easy collection of tweets for a given account. We "hydrated" each account with its most recent 1000 tweets (other approaches, like [42], use a few hundred tweets), unless it had fewer tweets in the entire timeframe of its existence. Naturally, as Twitter has launched a quite ambitious campaign to counter bot activity, many accounts had been suspended and were thus nonfunctional. As presented in Table 1, although there is a plethora of labeled Twitter users in the open datasets, most of these accounts have either been deleted or suspended. More precisely, out of more than 170K accounts that have been labeled in previous studies, only 69K are still available on Twitter. In an effort to help other researchers, we share the IDs of the accounts that are still valid at https://github.com/idimitriadis/Botomics (accessed on 1 October 2021). Additionally, we highlight that although these remaining accounts are still available, this does not mean that they are still active in posting content or interacting with other users. For this purpose, we continued with a further exploratory analysis of each dataset. For every account, we extracted the date of its latest tweet, to discover the last time it posted some type of content online. Figure 2 indicates the total number of accounts, the number of accounts per type and the year they posted their latest tweet. It is evident that most bot accounts are inactive, while humans and cyborgs continue their online activity. This fact points to a problem that has been widely discussed by other researchers in recent years.
Thus, the availability and credibility of up-to-date data has become a major issue and since the evolving nature of bots renders this data out of date, there is a strong need to constantly update and open bot-relevant databases.
There have been various efforts from researchers to provide up-to-date labeled data, such as DeBot [50], which supposedly offers an API and access to lists of users that have been deleted by Twitter after their detection, or Botometer [5,8,42], which provides a public API that evaluates the probability of an account being a bot. Unfortunately, such services are either no longer functional or highly criticized; thus, they do not offer solid ground to build on [6,55]. Here, we use only publicly available shared data that have been used by other researchers, filtered to include only accounts that have been neither suspended nor deleted. Unlike most previous work, which uses data that are no longer available, we follow a more realistic approach, based on the data that are actually available.
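The latest-tweet analysis described in this section can be sketched in a few lines. This is a minimal illustration, not the authors' code: the record layout (account type plus the ISO date of the latest tweet), the sample values and the function name `latest_tweet_year_counts` are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: (account type, ISO date of the account's latest tweet).
accounts = [
    ("bot", "2016-03-14"),
    ("bot", "2017-11-02"),
    ("human", "2021-06-30"),
    ("cyborg", "2021-09-12"),
    ("bot", "2016-08-21"),
]

def latest_tweet_year_counts(records):
    """Count accounts per (account type, year of latest tweet)."""
    counts = Counter()
    for account_type, date_str in records:
        year = datetime.strptime(date_str, "%Y-%m-%d").year
        counts[(account_type, year)] += 1
    return counts

counts = latest_tweet_year_counts(accounts)
for (account_type, year), n in sorted(counts.items()):
    print(f"{account_type:<7} {year}: {n}")
```

Grouping by both type and year yields the kind of per-type activity breakdown that a plot such as Figure 2 would be drawn from.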

Not All Bots Are Created Equal
Before proceeding to the feature extraction process, we defined multiple classes for the retrieved accounts. Thus, we proceeded beyond a simple binary categorization of either bot or human to explore different bot classes, which are presented in Table 2, along with a short description and the datasets that comprise them. The dataset numbers refer to the serial number of each dataset, as presented in Table 1. As presented in Section 1 and Table 2, the following bot classes appear in recent literature and have been retrieved in combination with the information contained in the datasets that we utilized.
• Spam bots encapsulate every type of bot that is related to continuously posting spam content.
• Social bots are related to impersonators, influence bots and paybots.
• Political bots are a rather unique class, including accounts that have been used for political purposes.
• Self-declared bots refer to accounts that identify themselves as bots.
• Cyborgs include celebrities, news agencies and organizations, and as discussed in Section 2, they have not been included in recent multi-class bot detection approaches.
• Other bots include all bots that do not fall into any of the previous categories according to the dataset description.
It is important to clarify that across these umbrella classes, there are malevolent and benevolent bots as well.
To measure the efficiency of the defined bot classes we trained six binary Random Forest classifiers, one for each bot type. As depicted in Section 2, Random Forests have proven to be quite effective for similar bot classification tasks. The hyperparameters of each classifier are similar to the ones defined in Section 5, while the features that have been used for training our classifiers are thoroughly presented in Section 4.1. Each classifier has been trained on a balanced subset of data, using ADASYN [63] as discussed in Section 4.2. Each subset includes human accounts and instances of each bot type, split into train and test samples. The produced classifiers were thus trained on each subset and tested across all the others. The rationale behind this approach was to evaluate the cross-type performance of the individual binary classifiers, validating our hypothesis.
We measured the performance of each bot classifier targeting the same or other bot types by calculating the precision and recall metrics. Our exploratory analysis (Figure 3) shows that precision and recall scores are strong for every classifier that targets the same bot type as the one it has been trained on. This is true for all bot types, as indicated by the dark colored diagonal blocks in Figure 3, except for "other bots". On the contrary, cross-type performance is very low, which highlights the different behavior of bots. It should also be noted that the only datasets where other classifiers' performance somewhat increases are the one for the "other bots" and the one for "self-declared bots". In summary, our main observations are the following:
• Low cross-type performance designates the need for the distinction of bots into separate types.
• Other bots: since this label was assigned to datasets whose descriptions did not include enough information to assign them to one of the other categories, this class most likely contains samples of the other bot types as well.
• The high precision observed in the self-declared dataset when tested with other classifiers is due to the fact that it includes bots with activities quite similar to those of spam bots and that the other bots' datasets most probably include self-declared bots as well.
• The majority of labels that we manually assigned correctly correspond to different kinds of bots, pointing out the lack of generalization of each bot-type-specific model.
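The cross-type evaluation protocol described above (train one binary model per bot type, then score every model on every type's test set) can be sketched as follows. This is only an illustration under stated assumptions: the two threshold rules stand in for the Random Forest classifiers, and the feature names `url_ratio` and `tweets_per_day`, the toy data and the function names are hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (bot = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def cross_type_matrix(models, test_sets):
    """Score every per-type model on every per-type test set."""
    return {
        (trained_on, tested_on): precision_recall(y, [predict(x) for x in X])
        for trained_on, predict in models.items()
        for tested_on, (X, y) in test_sets.items()
    }

# Stand-in "classifiers": simple threshold rules, NOT trained Random Forests.
models = {
    "spam": lambda x: 1 if x["url_ratio"] > 0.5 else 0,
    "political": lambda x: 1 if x["tweets_per_day"] > 100 else 0,
}

# Toy test sets: two bots of the given type followed by two humans.
test_sets = {
    "spam": (
        [{"url_ratio": 0.9, "tweets_per_day": 10},
         {"url_ratio": 0.8, "tweets_per_day": 5},
         {"url_ratio": 0.1, "tweets_per_day": 8},
         {"url_ratio": 0.2, "tweets_per_day": 3}],
        [1, 1, 0, 0],
    ),
    "political": (
        [{"url_ratio": 0.1, "tweets_per_day": 200},
         {"url_ratio": 0.3, "tweets_per_day": 150},
         {"url_ratio": 0.2, "tweets_per_day": 4},
         {"url_ratio": 0.1, "tweets_per_day": 6}],
        [1, 1, 0, 0],
    ),
}

matrix = cross_type_matrix(models, test_sets)
for (trained_on, tested_on), (p, r) in sorted(matrix.items()):
    print(f"trained on {trained_on:<9} tested on {tested_on:<9} P={p:.2f} R={r:.2f}")
```

In this toy setup the diagonal entries score perfectly while the off-diagonal entries collapse to zero, which mirrors the qualitative pattern reported for Figure 3 in an exaggerated form.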
In the following section, we discuss the various feature types, the feature extraction process and how these features affect our final dataset.

Methodology
In this section, we present the feature extraction and feature selection methodology (Section 4.1), which underpins the training of our ML models, described in Sections 4.2 and 4.3.

Feature Engineering: A Systematic Approach
In any typical ML task, the selection of appropriate features plays a very important role in building an efficient ML model. In this work, we utilized features that have already been introduced and defined by previous researchers and we have enriched them with additional ones that, to the best of the authors' knowledge, have not been used before. We have identified more than 400 individual features, which can be extracted by leveraging the plethora of information included in each Tweet object (https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet (accessed on 1 October 2021)), which also includes the User object (https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/user (accessed on 1 October 2021)).
We followed a systematic approach to categorize the different features, based on their contextual scope. Our approach introduces a simple taxonomy of feature categories which is further described next:

1. Content Features (C): Text-relevant metrics that capture the source data semantics as expressed in tweets and their relevant properties. For example, these reflect semantic elements found in text (e.g., text size) or numerical elements corresponding to the tweet's content (e.g., frequency distributions of punctuation marks, inter-tweet text similarity, etc.).

2. User Features (U): Extracted directly from the Twitter API User object, these cover several characteristics of the account owner. For example, in this class we include attributes of the user's profile description (e.g., the existence of the word "bot" in the description), numerical features regarding user activity (e.g., the number of followers, the number of friends and the followers/friends ratio) or Boolean values that provide additional information about the user (e.g., if the user is verified or if a location is provided).

Hashtag correlation features (H): Finally, we introduce a set of features that are generated by the network of used hashtags. Consider an undirected graph $G = (N, E)$, whose nodes $N$ are the hashtags used by the user and whose edges $E$ are defined from a hashtag co-occurrence matrix. Thus, whenever two hashtags $h_1, h_2$ are present in the same tweet, it holds that $h_1, h_2 \in N$ and $e_1 = (h_1, h_2) \in E$. This graph is used to calculate graph interconnection properties, such as the number of triangles or the node-to-edge ratio.
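To make the hashtag co-occurrence graph concrete, the following minimal sketch builds the node and edge sets from per-tweet hashtag lists and derives the two properties named above, the number of triangles and the node-to-edge ratio. The function names, the sample tweets and the decision to lowercase hashtags before matching are assumptions made for illustration; the paper does not specify its implementation.

```python
from itertools import combinations

def build_hashtag_graph(tweets_hashtags):
    """Return (nodes, edges) of the undirected hashtag co-occurrence graph."""
    nodes, edges = set(), set()
    for tags in tweets_hashtags:
        # Assumption: hashtags are compared case-insensitively.
        unique = sorted({t.lower() for t in tags})
        nodes.update(unique)
        # Sorted input gives each pair a canonical order, so (a, b) and
        # (b, a) are stored as the same undirected edge.
        edges.update(combinations(unique, 2))
    return nodes, edges

def count_triangles(nodes, edges):
    """Count triangles via common neighbours of each edge's endpoints."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Every triangle is found once per edge, i.e. three times in total.
    return sum(len(neighbors[a] & neighbors[b]) for a, b in edges) // 3

# Hypothetical hashtag lists, one per tweet of a single user.
tweets = [["AI", "ml", "bots"], ["ai", "news"], ["ml"], []]

nodes, edges = build_hashtag_graph(tweets)
triangles = count_triangles(nodes, edges)
node_edge_ratio = len(nodes) / len(edges) if edges else 0.0
print(len(nodes), len(edges), triangles, node_edge_ratio)
```

For the sample input, the first tweet contributes a fully connected triple of hashtags, which is the only triangle in the graph; accounts that never combine hashtags produce an edgeless graph, for which the ratio is defined here as 0.0 to avoid division by zero.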
During the feature extraction process, we identified some major bottlenecks which do not allow us to use the whole set of features. These issues are described below:
• Source data thread limitations, due to the minimum number of tweets and retweets that are included in the last 1000 tweets of each user, as described in Section 3.1. More specifically, in order to use the features included in the Social Neighborhood (N) category, we expect to have at least two retweets among the tweets that have been collected for each user; failing that, we cannot proceed to the calculation of statistical metrics for the distribution of features.
• Lack of content or activity information, since in our feature extraction process, we found that there are many accounts with no tweets or no retweets. For example, some bots in the Botwiki dataset do not post original tweets but are limited to posting retweets that include a specific keyword. In this case, extracting sentiment features will not be helpful, due to the absence of original content. Moreover, there are some users that do not retweet any content at all; thus, we are not able to identify their network and consequently the extraction of specific features, as we mentioned earlier, is unfeasible.
• Platform-specific limitations, posed by the continuous changes in the Twitter API. For example, we are no longer able to observe the evolution of some metrics that have been proven, by other researchers, to be quite useful as features. One such example is the number of friends the user has in a certain snapshot; even if the snapshot refers to the past, the User object (returned by Twitter) is the one currently effective; thus, we are unable to measure the temporal evolution of this metric.
We also have to mention that, although Twitter API allows developers to collect the network of each user (friends-followers), the process of doing so is not time-efficient with respect to the official Twitter API rate limits, and therefore it is not considered to be a good solution, due to the need for real-time results.
Two-stage feature extraction : To cover this limitation, we employ a two-stage feature extraction process. Once for the accounts that fulfill all the requirements, using all the available features, and once more using a pruned version of the identified features in order to include all the available accounts in our datasets, regardless of the number of tweets or retweets they have posted. In Table 3, we present an overview of the two feature sets along with the time needed to extract each one. We estimated the average calculation time by computing the average of the time needed to extract each set of features for every user. In an effort to not only reduce computational times in future work, but also to aid researchers in avoiding using redundant features, in the next sections we utilize some feature engineering packages to detect the most important and informative features and use them as a baseline of further research. Moreover, in the ML models (described in the next section), we experiment with all possible combinations of the proposed feature categories and explore the best combination that yields the most robust performance, by embarking on both binary-and multi-class bot detection (as discussed in Section 1).
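The routing of accounts between the two stages can be sketched as follows. This is only an illustration under stated assumptions: the function name `eligible_feature_sets` is hypothetical, the pruned set is assumed to require at least two posts in total, and the full set is assumed to additionally require at least one original tweet (for content-based features) and at least two retweets (for the Social Neighborhood features).

```python
def eligible_feature_sets(n_tweets, n_retweets, min_retweets=2, min_posts=2):
    """Return the feature sets that can be extracted for one account.

    n_tweets:   number of original tweets collected for the account
    n_retweets: number of retweets collected for the account
    The thresholds are assumptions chosen to mirror the constraints
    discussed in the text; they are not taken from the original code.
    """
    stages = []
    if n_tweets + n_retweets >= min_posts:
        # Every account with enough activity enters the pruned-features set.
        stages.append("pruned")
        # The full set also needs original content and a retweet network.
        if n_tweets >= 1 and n_retweets >= min_retweets:
            stages.append("full")
    return stages


print(eligible_feature_sets(10, 3))  # active account with tweets and retweets
print(eligible_feature_sets(0, 5))   # retweet-only bot: pruned set only
```

An account that qualifies for the full set is also part of the pruned-features dataset, which is why the function returns a list rather than a single label.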

Binary Bot Classification
As discussed in Section 2, since we go beyond the conventional binary bot detection approach, we need to define a basic model which will serve as our Baseline Binary Bot detection classifier, namely BBC. Our initial goal is to select the classification algorithm that will be used in all further experiments. To achieve this, we experimented with several classification algorithms, namely Logistic Regression (LR), Linear Discriminant Analysis (LDA), Decision Tree Classifier (CART), Multi-Layer Perceptron (MLP), AdaBoost (ADA), and Random Forest (RF). We split our dataset in a 75-25% ratio for training and testing the models. For each of these algorithms we performed a GridSearch process for hyperparameter tuning. The produced models were tested on the test sample (previously unseen data) and they were evaluated by measuring the accuracy using 10-fold cross validation. Our analysis showed that Random Forest clearly outperformed the other algorithms. Its performance has also been recognized as the most reliable in similar research [5,8,9,42,56], and it is well aligned with an explainable approach for classification in bot detection experiments. We have included the results of the other classification algorithms in Figure 4.
Beyond the bot classification baseline, we followed an imbalance handling approach to adjust for the uneven number of instances per class, as described in Section 3.1. To tackle this problem, right after the selection of the classification algorithm, we experimented with three balancing methods, namely SMOTE-Tomek [64], SMOTE-ENN [65] and ADASYN [63], to select the one that guaranteed a realistic equalizing of cases while maintaining an optimal performance. More specifically, in Table 4, we present the performance metrics for the BBC classifier, using the three aforementioned imbalance handling methods.
Our experimentation runs showed that ADASYN performs better than the other two, preserving a balance between precision and over-fitting.
Next we trained our algorithm on two different datasets, as defined in the feature extraction process. The first dataset includes the users that comply with all the limitations posed by the full-features set, while the second includes users that have at least two tweets or retweets and can be used to extract the shortened feature set. The two final datasets are the following:
Finally, with respect to the efficiency of our model, we used different feature combinations, among the ones that were identified earlier. Utilizing feature importance techniques and experimenting with all the possible feature type combinations, we ended up with a smaller set of features which produced similar results in comparison with the full set of features and required much less processing, and thus, calculation time.
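One way to arrive at such a reduced feature set is to discard features from the least important upwards for as long as the score stays close to the baseline. The sketch below illustrates this idea only; the function `prune_by_importance`, the tolerance value, the feature names and the toy scoring function are all assumptions, and in practice `evaluate` would retrain and score the classifier on the given features.

```python
def prune_by_importance(importances, evaluate, tolerance=0.01):
    """Drop the least important features while the score stays stable.

    importances: dict mapping feature name -> importance score
    evaluate:    callable taking a list of features and returning a score
    tolerance:   maximum accepted drop from the baseline score (assumed)
    """
    ranked = sorted(importances, key=importances.get)  # least important first
    kept = list(importances)
    baseline = evaluate(kept)
    for feature in ranked[:-1]:  # always keep at least one feature
        candidate = [f for f in kept if f != feature]
        if evaluate(candidate) < baseline - tolerance:
            break  # removing this feature hurts too much, stop here
        kept = candidate
    return kept


# Hypothetical importances and a toy score, for illustration only.
importances = {
    "followers": 0.40,
    "tweet_rate": 0.35,
    "has_location": 0.02,
    "name_length": 0.01,
}


def toy_score(features):
    informative = {"followers", "tweet_rate"}
    return 0.5 + 0.2 * len(informative & set(features))


selected = prune_by_importance(importances, toy_score)
```

In this toy setting the two uninformative features are removed without affecting the score, and the procedure stops as soon as an informative feature would have to be dropped.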

Multi-Class Bot Classification
While the binary classifier was useful in providing an initial estimation, the next step was to develop a multi-class classifier, to separate accounts into different classes defined in Section 3.2. From these classes we excluded the "other bots" class, which, as presented in Section 3.2, seems to be a mixture of all the other bot classes.
In our multi-class classifiers we provided sublabels for the bot class, as defined in Table 2. Similarly to our previous binary bot classification approach, and since the issue of an uneven amount of instances was even more evident in the multi-class classification task, we proceeded with the use of the ADASYN algorithm which performs better than the other algorithms, in this case as well. The ensemble multi-class classifier utilizes both class-specific binary classifiers and multi-class classifiers to predict the probability of an account being a human or a class-specific bot, in combination with the baseline binary bot or human classifier. The rationale behind this approach is to juxtapose the human or bot binary prediction to those that refer to specific bot types. Our ensemble classifier's goal is to initially decide if the account is a human or a bot and subsequently assign a more detailed label to a bot account, as presented in Figure 5.
Next, we present a set of ensemble multi-class classifiers which use the BBC classifier to calculate the probability of an account being a human, namely the Human Probability (HP), and which operate as follows:
- EMCREST combines the BBC classifier with a one-vs-rest classifier, from which we calculate the probability of an account being a specific type of bot, namely the Bot Probability (BP). The one-vs-rest classifier splits the multi-class dataset into multiple binary classification problems, and the prediction is the one given by the most confident of the resulting models. The final prediction of the EMCREST classifier is the most confident prediction between the BP and the HP.
- EMCS uses the BBC and a stacking Random Forest classifier which consists of a set of binary classifiers, one for each identified class of bots. Hence, we trained five binary Random Forest classifiers (following the same procedure as in the BBC classifier), each of which is responsible for classifying a user as human or as a specific type of bot. Naturally, we used the appropriate training data for each classifier. For example, the spam bot classifier is trained on data that consists of users which: (a) are labeled as spam bots, (b) are randomly chosen humans, based on the fundamental assumption that humans have a similar online behavior, which differentiates them from bots. The final prediction of the EMCS classifier is the most confident one between the HP and the stacking BP.
The ensemble process does not require any balancing, as this classifier relies on balanced pretrained models. We omit the imbalance handling stage because the ensemble pipeline comprises classifiers that have already been trained on balanced datasets; as a result, even on the original raw datasets, the model can easily discern the different instances and determine the class to which each instance belongs.
More specifically, as each classifier has been trained to detect a specific class and instance of every class, it would be fairly straightforward for the ensemble classifier to understand the instance class. More details about the performance of the classifiers can be found in Section 5, where the results and findings of this stage are presented.
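The decision rule shared by these ensembles, keeping the most confident prediction between the HP and the BP, can be sketched as below. The function name `ensemble_predict`, the example probabilities and the tie-breaking in favor of the human label are assumptions made for illustration.

```python
def ensemble_predict(human_probability, bot_type_probabilities):
    """Return (label, confidence) from the most confident of HP and BP.

    human_probability:      HP, as produced by the binary BBC classifier
    bot_type_probabilities: dict mapping bot type -> probability, as
                            produced by the bot-type classifier(s)
    Ties are resolved in favor of "human" (an assumption of this sketch).
    """
    bot_type, bot_probability = max(
        bot_type_probabilities.items(), key=lambda item: item[1]
    )
    if human_probability >= bot_probability:
        return "human", human_probability
    return bot_type, bot_probability


# Illustrative probabilities only; they are not results from this work.
bot_probabilities = {"spam bot": 0.85, "social bot": 0.40, "cyborg": 0.10}
print(ensemble_predict(0.30, bot_probabilities))
print(ensemble_predict(0.90, bot_probabilities))
```

Because the rule only compares confidences, the same function applies whether the bot-type probabilities come from a one-vs-rest model or from a set of per-type binary classifiers.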

Experimentation Results
In this section we present the wide range of experiments that have been conducted both at a feature- and model-based level. Apart from providing tuning and performance details on the classifiers that have been used, we also study their behavior based on the combination of the extracted sets of features.

Baseline Binary Classifier (BBC)
The first ML model built for the task of bot detection was a Random Forest classifier, with 160 decision trees and maximum depth equal to 10, trained to separate humans from bots based on the extracted features. As the feature selection experiments take place in subsequent steps, the baseline classifier utilizes all the features from the full-features dataset and the pruned-features dataset. We separated the datasets with a 75-25% ratio for training and testing the model. The training dataset was balanced using the ADASYN algorithm. Arguably, the imbalance of bots and humans was not very restraining, but we chose to balance the dataset in order to avoid a potential bias.
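A baseline with these settings can be sketched with scikit-learn as follows. This is a minimal illustration, not the original training code: the data are synthetic (generated with `make_classification`), the ADASYN balancing step is omitted to keep the example dependency-free, and the random seeds and sample sizes are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the extracted features (assumption): label 1 = bot.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    weights=[0.6, 0.4], random_state=0,
)

# 75-25% split for training and testing, stratified on the labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Random Forest with 160 trees and a maximum depth of 10, as in the text.
model = RandomForestClassifier(n_estimators=160, max_depth=10, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)                    # sensitivity
specificity = recall_score(y_test, y_pred, pos_label=0)  # recall on humans
g_mean = (recall * specificity) ** 0.5

# Indices of the five most important features, most important first.
top_features = model.feature_importances_.argsort()[::-1][:5]
```

The G-mean is computed here as the geometric mean of sensitivity and specificity, which is the usual definition for binary problems.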
Concurrently with predicting the instances of the test dataset, the most important features were also kept in a separate list. While labels were mapped in the same manner (0 and 1), the number of features was significantly reduced in the model trained on the pruned-features dataset, as discussed earlier. In Figure 6, we present the performance of our binary classifier on the pruned dataset and we show how it changes based on the number of features that we used. We iteratively remove features, starting from those which have been identified as less important and moving to the most important ones.
Figure 6. Performance of the BBC classifier for the pruned-features dataset. The x-axis corresponds to the number of features removed in every iteration starting from the least important ones and moving towards the most important. We observe that the performance remains the same, even if we remove more than half of the total features.
As presented in Table 5, we observe that the classifier achieves high precision, G-mean and recall scores. We also find that the reduction of features does not seriously affect the performance. More specifically, both for the full-features and the pruned-features dataset, all the evaluation metrics remain practically the same, even for a much smaller set of features, just 145 and 100 out of the total 420 and 309, respectively. The classifier trained on the pruned-features dataset seems to perform better than all the others, using only a small fraction of the total features. We also conducted another experiment, to discover which set or which combination of feature types, out of the ones that we defined in Section 4.1, provides better results. For this reason, we trained the BBC classifier using all the possible feature set combinations as input. In Table 6, we demonstrate the results of our experiment.
Since we experimented with all the possible combinations, we present only the best ones for each number of combined feature sets. For example, the third column shows the best combination of three feature types per classifier, which is a combination of User, Temporal and Hashtag correlation features for the pruned-features dataset. We observe that the accuracy for each feature combination does not fluctuate significantly, regardless of the number of combined feature sets. Moreover, we notice that the User object derived features are present in most cases, followed by those that refer to the temporal activity of the user.
Table 6. Accuracy performance of the BBC classifiers in comparison to the different combinations of feature categories. Each column refers to the best combination for each number of combined feature categories. The combination of User, Temporal and Hashtag feature categories provides the best results. However, the accuracy does not fluctuate significantly, regardless of the feature category combination that has been used.
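The exhaustive search over feature category combinations can be sketched as below. The helper `best_combination_per_size`, the list of categories and the toy additive score are assumptions for illustration; in the actual experiment the score of a combination would be the accuracy of a BBC classifier trained on the corresponding features.

```python
from itertools import combinations


def best_combination_per_size(categories, score):
    """For every combination size, keep the best-scoring combination.

    categories: list of feature category names
    score:      callable taking a tuple of categories, returning a number
    """
    best = {}
    for size in range(1, len(categories) + 1):
        best[size] = max(combinations(categories, size), key=score)
    return best


# Illustrative categories and weights only; not measured values.
categories = ["User", "Temporal", "Content", "Hashtag"]
toy_weights = {"User": 0.5, "Temporal": 0.3, "Content": 0.1, "Hashtag": 0.2}


def toy_score(combo):
    return sum(toy_weights[c] for c in combo)


best = best_combination_per_size(categories, toy_score)
for size, combo in best.items():
    print(size, combo)
```

With $n$ categories the search evaluates $2^n - 1$ combinations, which remains tractable for a handful of categories even when each evaluation requires retraining a model.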

In the next section, following a similar approach, we proceed with the experimentation related to the multi-class bot classifiers.

Ensemble Multi-Class Classifiers
As presented earlier, we have built three ensemble classifiers (EMCREST, EMCS, EBBC) which combine the BBC classifier with three sets of different classifiers. We performed our experiments using the pruned dataset that was previously described and we used the pruned-feature set, which in the binary bot classification problem seemed to perform better than the full set of features. Following the same procedure as before, we split our data into train and test samples in a 75-25% ratio, and handled the imbalance of our training set using the ADASYN algorithm.
Initially we trained the one-vs-rest algorithm, which creates six binary classification problems, where each class "competes" against all the others. The next step involved the training of a stacking Random Forest classifier which consists of five binary Random Forest classifiers, where the competing classes are always "humans" and one specific type of bot. These binary classifiers were introduced in Section 3.2, where we used them to validate the labels we manually assigned. The results of our experiments are presented in Table 7. As we can see, the EBBC classifier's performance is superior to that of all the other classifiers. Evidently, the EBBC classifier would be our choice even if we only wanted to distinguish humans from bots in general, since it outperforms the binary bot or human classifier BBC. In Figure 7, we present the distribution of maximum prediction probabilities for the whole dataset. Apart from a few cases where the maximum probability lies below the 0.5 threshold, most instances have been correctly classified with a high probability.
Figure 7. Distribution of maximum prediction probabilities for both correct and misclassified instances. Higher values show that our model predicts the instance class with higher confidence. We observe that the largest part of the correctly classified instances were predicted with high confidence.
In this case, defining the most important features is not trivial, since we refer to a multi-class classification problem. For this reason, we examined the most important features per bot type, using the binary bot type classifiers. The results once more validated our hypothesis and manual assignment of labels to each bot type. More specifically, our experiments showed that the top 25 features are completely different for each bot type. This means that there is not even one single feature among the twenty-five top-ranking ones that is present in every class. In Figure 8, we observe that, surprisingly, the important features tend to become common for all bot classes after we reach the top 250, while the first common one makes its appearance after the first 25 ranked as most important. Regarding the type of features that have been found to be most important for each class, Table 8 summarizes our findings. More specifically, we analyzed the twenty most important features per bot type and we discovered that, unlike in the case of the binary bot classifier, content features play the most important role in distinguishing humans from specific types of bots. It is also interesting that, although content features are prevalent for all bot types, none of those ranked as more important are common for all. Taking into consideration the wide range of improvements of this new model, with respect to the one already available in our bot-detection service, namely Bot-Detective, we have decided to use it in the updated beta version of our publicly available web service.
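The comparison of top-ranked features across bot types amounts to intersecting the top-$k$ lists of the per-type classifiers for growing $k$. The sketch below shows this under illustrative assumptions: the function `first_common_rank` is hypothetical and the rankings are made-up feature names, not the rankings obtained in our experiments.

```python
def first_common_rank(rankings):
    """Find the smallest k at which all top-k lists share a feature.

    rankings: dict mapping bot type -> list of feature names ordered from
              most to least important.
    Returns (k, common_features), or None if no feature is ever shared.
    """
    shortest = min(len(ranking) for ranking in rankings.values())
    for k in range(1, shortest + 1):
        common = set.intersection(
            *(set(ranking[:k]) for ranking in rankings.values())
        )
        if common:
            return k, common
    return None


# Made-up rankings for three bot types, for illustration only.
rankings = {
    "spam bot": ["url_ratio", "tweet_rate", "followers", "hashtags"],
    "social bot": ["followers", "mentions", "tweet_rate", "url_ratio"],
    "cyborg": ["tweet_rate", "source_count", "mentions", "followers"],
}

result = first_common_rank(rankings)
```

Applied to the real importance rankings, the returned $k$ corresponds to the point at which the first common feature appears among all bot classes.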

Explainability and Statistical Analysis
As highlighted in our previous work [58], bot detection without explainable functionality hinders decision rationalization and trust. In this work, the predictions of our models are augmented by highlighting the features that define whether an account is a human or a bot, thus providing the reasoning behind each prediction. Most importantly, in the multi-class detection problem, this strategy showcases the differences in each bot class, to reflect the behavior of accounts belonging in each class.
This work proposes a simple and flexible explainability pipeline (outlined in Figure 9). Although a Random Forest classifier provides interpretable results by design, its explainability cannot be utilized in our case due to the large number of features and estimators used. In our explainability pipeline we use the widely accepted LIME [66] as our explainer, to offer interpretable predictions of the binary classifier and/or the multi-class classifier. The explainability phase, as depicted in Figure 1, resulted in explainer predictions for each instance, accompanied by the features that affected this prediction, all adjusted to the model that we were examining per explainer. For example, the explanations of the baseline binary classifier would include a "bot" or "human" label predicted by the classifier and interpreted by the explainer, the true label, which should ideally coincide with the predicted one, and the top ten features with their corresponding weights. Even though the results of the explainers are quite useful, not only in validating the feature engineering process but also in shedding light on what differentiates bot classes, they remain static representations and lack a graphical visualization. To that end, once the explanation process was completed, we conducted a statistical analysis to pinpoint the varying aspects of each class, and we used distribution plots for the top features of each class, for both classifiers, to portray the distribution of values over every instance of the datasets. Moreover, we experimented with some key parameters of the explainability framework, in an effort to provide results that are as realistic and accurate as possible. Our work in this direction is further presented below.

Refining Explainable Results
Previous work on explaining the results of human-bot classifiers has been strictly based on providing some information on the features that define each prediction. Moreover, this analysis has only been applied to binary classifiers. In this section, we present our attempt to provide improved results, not only by experimenting with different parameters, but by extending the application to all bot types that are referred to in this paper.
We will use a popular explainability framework called LIME. Although LIME is considered to be a state-of-the-art tool for ML interpretability, recent research has shown that in some cases it faces some issues of instability [67,68], mainly due to the sampling step when new instances are randomly chosen. Improving the model is beyond the scope of this paper; however, we experiment with different kernel widths, which seem to greatly affect the predictions made by LIME in comparison to those produced by our ML models, see Figure 10. The kernel width defines the region around the reference point from which new points are generated.
In order to provide more realistic explanations, we calculated the mean squared error (MSE) between our ML model predictions and those of LIME for different kernel widths. We can clearly observe that as the kernel width parameter increases, so does the MSE. However, a very low MSE is not always preferable, since setting the kernel width to a very low value would mean that our main goal would be to predict this exact point correctly, which is not the case. Therefore, in our experiments we set the kernel width equal to five, which seems to balance the MSE against the generalization of our model. Initially we applied our explainability model to the BBC classifier, in order to highlight the main features that affect our model's predictions. We created our explainer using the data that our model had been trained on and tested it on a sample of the spare test data. The next step involved applying a similar methodology to all the binary bot type classifiers, though limiting our dataset only to humans and certain bot types, depending on the classifier we were trying to interpret. The key findings are presented in Table 9, along with the most important features per bot class. Finally, since our intentions included presenting an improved version of Bot-Detective, we decided to also present a visual explanation that rationalizes the prediction of our models, hence improving the interpretability of our results. To that end, our updated web service will also include figures which highlight the difference between the explored instance and the data that our model has been trained on. Such an example is presented in Figure 11, where the red line depicts the explored instance, while the histogram refers to the bots and humans included in our dataset.
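The role of the kernel width can be illustrated with a small sketch. It assumes an exponential kernel of the form $\sqrt{\exp(-d^2 / w^2)}$, which, to the best of our understanding, is the default used by LIME's tabular explainer; the helper names, the sample distances and the probabilities below are illustrative assumptions and not values from our experiments.

```python
import math


def exponential_kernel(distance, kernel_width):
    """Weight of a perturbed sample at a given distance from the instance.

    Assumed form: sqrt(exp(-d^2 / width^2)); larger widths give distant
    samples more influence on the local surrogate model.
    """
    return math.sqrt(math.exp(-(distance ** 2) / kernel_width ** 2))


def mean_squared_error(model_probs, explainer_probs):
    """MSE between the classifier's and the explainer's predictions."""
    pairs = list(zip(model_probs, explainer_probs))
    return sum((m - e) ** 2 for m, e in pairs) / len(pairs)


distances = [0.0, 1.0, 5.0, 10.0]
weights_narrow = [exponential_kernel(d, 1.0) for d in distances]
weights_wide = [exponential_kernel(d, 5.0) for d in distances]

# Illustrative probabilities for three instances (not measured values).
mse = mean_squared_error([0.9, 0.2, 0.7], [0.8, 0.3, 0.7])
```

With a narrow kernel, samples at distance 5 receive a weight close to zero, so the surrogate fits almost only the explained point; with a width of five they still contribute substantially, which trades some local fidelity (a higher MSE) for a more general explanation.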

Discussion
Our results and experimentation with supervised ML models and bot-related data towards effective bot detection are discussed in the next sections.

Data Exploration
The vast majority of published research on supervised ML for bot detection uses publicly available annotated data to train their models. They use specific datasets to highlight the efficiency of their approach. We have rehydrated all of the datasets and we discovered that a great part of them is outdated or includes accounts that are no longer available on Twitter. More specifically, out of the 170 K accounts included, only 69 K are still available, and those that are still active mainly belong to human users. Based on this fact, we conclude that the respective ML models are also outdated and do not follow a realistic approach. On the contrary, our work is based on currently available data and takes a step further into distinguishing different bot classes among them. However, it is important to point out the never-ending need for new, up-to-date, credible annotated datasets; the lack of such data also explains the scientific turn to unsupervised methods for bot detection.

Feature Selection
During the past years, the type and number of features that have been used for training have been constantly growing. As presented by other researchers, this number has exceeded 1200 single features. The fact that previously available information is no longer accessible (e.g., followers growth rate per tweet) highlights the need for constantly updated feature sets. Moreover, instead of persistently increasing the number of features, a wiser choice would be to assess their importance and try to achieve a balance between efficiency and accuracy. In that direction, our work has shown that a careful selection of features can still provide comparable results for binary bot or human detection. However, this is not the case in the multi-class bot classification problem, where the important features vary for each bot class.

Generalization and Accuracy
Previous research, our past work included, has reported really high accuracy in binary classification tasks, reaching a level of 98-99% in distinguishing bots from humans. However, most of this work has used specific datasets, which include all the labeled accounts, along with their respective tweets at the time of data collection. These datasets have a high homogeneity score and such accuracy scores are not realistic, since, as also shown in [9], these models perform well only on the datasets that they have been trained on. On the contrary, we train our binary model on the dataset that is produced by merging all available datasets, while removing the accounts that are no longer available on Twitter. Naturally, we do not expect to have a higher accuracy than other models that use more "biased" or richer data than ours. Nonetheless, our model performs well, achieving relatively high accuracy, precision and recall scores for previously unseen data, as presented in Section 4.2.
With respect to the multi-class bot classification task, we propose a set of multi-class classifiers trained on all the available datasets and with different labels than those proposed in the few related works by other researchers. In Figure 12, we show the t-SNE plot, where we visualize our high-dimensional dataset, limited to user features, in two dimensions and show the clusters formed for social bots and cyborgs. We can clearly observe the distinction between these two classes. A direct comparison between ours and other approaches is not feasible, due to the different methodology followed during the training phase and the different training data. However, taking into consideration the performance metrics that were reported in [9], where they achieve an average recall score of 0.84 and F1-score of 0.73, our method seems to provide improved results with an average recall score of 0.92 and F1-score of 0.9 for our best solution, namely the EBBC classifier. Finally, we attempt to improve the explainability of our predictions by providing more realistic and accurate results, as shown in Section 6.
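For reference, average recall and F1 scores of a multi-class classifier can be computed as in the sketch below. It assumes unweighted (macro) averaging over classes, which may differ from the averaging used for the figures quoted above; the function name `macro_recall_f1` and the labels in the example are illustrative only.

```python
def macro_recall_f1(y_true, y_pred):
    """Return (macro recall, macro F1) over all labels seen in the data."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    recalls, f1_scores = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        recalls.append(recall)
        f1_scores.append(f1)
    return sum(recalls) / len(labels), sum(f1_scores) / len(labels)


# Toy predictions for illustration; one spam bot is mistaken for a cyborg.
y_true = ["human", "human", "spam bot", "spam bot", "cyborg", "cyborg"]
y_pred = ["human", "human", "spam bot", "cyborg", "cyborg", "cyborg"]
avg_recall, avg_f1 = macro_recall_f1(y_true, y_pred)
```

Macro averaging gives every class the same weight regardless of its size, so a poorly detected minority bot class lowers the average as much as a poorly detected majority class would.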
Finally, we should highlight that our approach, with some small modifications mostly on the features used for the training of the models, could be used for the detection of bots in other OSNs (such as Facebook, Instagram, etc.). Nevertheless we focus on Twitter bots, since Twitter is the only OSN that offers an open API for data collection.

Conclusions and Future Work
In this paper we investigated the current state of supervised ML bot detection approaches on Twitter. We showed that the available data are mostly outdated and that previous research does not comply with the present data status, but rather focuses on past data. We also proposed a new bot type classification schema, based on the descriptions of the publicly available annotated datasets and newly introduced ones and proved its efficiency.
We perform a comprehensive feature analysis, enriched with explainability functionalities, and demonstrate that different features account for different types of bots. Although we acknowledge the drawbacks of these data, we follow a different methodology to provide novel models for binary and multi-class bot detection. Our experiments show that our models perform really well on previously unseen data and that they generalize well, since they are tested on data coming from different datasets.
In future work, taking into consideration the lack of credible, up-to-date data, we intend to investigate adversarial methods to improve our models and make them adaptive to future, currently unobserved, new types of bots. Our main future goal would be to create novel generative models able to produce adversarial bot examples, circumventing the issue of data unavailability, and to test new discriminators against old and newly produced types of bots.