Detection of Username Enumeration Attack on SSH Protocol: Machine Learning Approach

Abstract: Over the last two decades (2000–2020), the Internet has rapidly evolved, resulting in symmetrical and asymmetrical Internet consumption patterns and billions of users worldwide. With the immense rise of the Internet, attacks and malicious behaviors pose a huge threat to our computing environment. The brute-force attack is among the most prominent and commonly used attacks; it is carried out using password-attack tools, a wordlist dictionary, and a list of usernames obtained through a so-called enumeration attack. In this paper, we investigate username enumeration attack detection on the SSH protocol by using machine-learning classifiers. We apply four asymmetrical classifiers on our generated dataset collected from a closed-environment network to build machine-learning-based models for attack detection. The use of several machine-learners offers a wider investigation spectrum of the classifiers' ability in attack detection. Additionally, we investigate how beneficial it is to include or exclude network port information as a feature set in the learning process. We evaluated and compared the performance of the machine-learning models for both cases. The models used are k-nearest neighbor (KNN), naïve Bayes (NB), random forest (RF) and decision tree (DT), with and without port information. Our results show that machine-learning approaches to detecting SSH username enumeration attacks were quite successful, with KNN having an accuracy of 99.93%, NB 95.70%, RF 99.92%, and DT 99.88% when port information is excluded. Furthermore, the results improve when port information is used.


Introduction
The Internet is widely recognized for its rapid growth and tremendous usage in recent years [1]. As a result, there are symmetrical and asymmetrical Internet consumption patterns. Over four billion individuals have Internet access and utilize it on a regular basis. This equates to 63.2% of the global population having access to the Internet. According to statistics, Internet usage surged by 1266% over the past two decades [2,3]. The explosive growth and widespread nature of the Internet have made almost everyone rely on computer networks for their day-to-day activities [4]. With an immense rise in dependency on the Internet and computer network services, attacks and malicious behaviors have become commonplace in our computing environment [5–7].
The emergence of attacks and malicious behaviors poses a significant danger to computer security [8]. They attempt to circumvent the deployed network security mechanisms by exploiting the vulnerabilities found in the target networks [4,6]. Computer system attacks are achievable at several levels, ranging from the data link layer to the application layer. Attacks can also be classified as passive or active [9,10]. An active attack occurs when attackers change system resources and affect their operation. A passive attack occurs when attackers gather or make use of information from the systems but do not affect system resources [11,12]. Password-based attacks, like dictionary-based attacks and brute-force attacks, are among the various types of computer attacks [9,13].
The brute-force attack, often referred to as a high-level attack, is one of the most widespread and difficult challenges among today's computer system attacks [6,14–16]. In a brute-force attack, attackers attempt to log in by trying different passwords on the victim's machine to reveal the login passwords [6,16–18]. They generate password combinations using automated tools. There are several smart brute-force attack tools available, including Hydra, the most well-known brute-force attack tool, which comes pre-installed in the Kali Linux operating system [6,16]. Brute-force attacks can be used against a wide range of services or protocols, with SSH and FTP being among the primary targets.
In order to carry out a dictionary-based or brute-force attack, an attacker needs two important items: a list of valid, existing usernames of the targeted system and a wordlist dictionary (a text file containing a collection of words for use in the attacks). One of the key first steps when attempting to gain access to, or launch an attack on, a victim system or application is to enumerate usernames. This means an attacker first gathers the fundamental information about a user [19]. Once the intended usernames have been enumerated, targeted password-based attacks can be launched against the usernames found.
Username enumeration is a kind of passive attack (reconnaissance) that retrieves a list of existing and valid usernames from a system that requires user authentication [20,21]. Since an attacker can quickly generate a list of legitimate usernames from the username enumeration attack, the time and effort necessary to brute-force a login is considerably reduced [22]. However, it does not allow the attacker to log in immediately; rather, it gives half of the necessary information, which the attacker could use to run a brute-force attack to further exploit the obtained information.
Username enumeration attacks can be initiated against any system that requires user authentication, including SSH servers. Specific versions of OpenSSH suffer from a timing-based attack: if a valid username with a long password is given, the time taken to respond is noticeably longer than for an invalid username with a long password [23]. By exploiting how the server responds to forged queries, the attacker can enumerate the service's registered usernames. The server responds with an authentication failure if the username does not exist, but the outcome is different if the user exists. Other areas where username enumeration occurs are a website's login page and its 'forgot password' functionality.
The demand for traffic anomaly detection in cybersecurity is increasing because of the enormous and rapid expansion of sophisticated computer attacks, including password-based attacks [6]. Several approaches for detecting and mitigating password-related attacks, such as brute-force, have been suggested, developed, and deployed on a variety of systems and services, including SSH, FTP, and HTTP. However, in the era of cybersecurity, username enumeration attacks continue to be a problem. The majority of the recommended solutions focus on detecting and preventing password-based attacks, ignoring the fact that username enumeration is the first attack to identify and resist.
Inspired by the advancement and promising results of machine-learning techniques in traffic anomaly detection and mitigation [24–26], this study focuses on detection of the username enumeration attack on the SSH protocol by applying and analyzing machine-learning classifiers.
Machine learning is a branch of artificial intelligence that allows machines to learn without having to be explicitly programmed [27]. Machine learning automates operations by carrying out each stage in a systematic way. Machine learning comprises several learning techniques categorized as supervised and unsupervised learning. This categorization depends on the existence or nonexistence of a labelled dataset. Supervised learning uses labelled samples to train the model, allowing it to predict the labels of comparable unlabelled samples. There are no labelled training samples in unsupervised learning, hence it relies on statistical methods such as density estimation. Unsupervised learning is based on the notion of grouping data of the same type to uncover the underlying structure of the data.
The ability of machine learning to recognize and give clues on real-life issues is greatly valued and has thus led to its appeal and pervasiveness. These accomplishments have led to the adoption of machine learning in numerous fields [28,29]. Cybersecurity is among the fields that benefit from this trend, where intrusion detection systems (IDS) are enhanced with machine-learning modules [30]. With their real-time response and adaptive learning process, machine-learning algorithms are becoming particularly efficient in intrusion detection systems [31]. They are a superior choice over conventional rule-based algorithms [32].
Attack and anomaly detection uses supervised learning, where a known dataset is used to make a classification or prediction. The training dataset contains input features and target values. The supervised learning algorithm then builds a model to classify or predict the target values [33].
In this work, we examine four machine-learning classifiers for the detection of username enumeration attacks: k-nearest neighbor, naïve Bayes, random forest and decision tree. The use of several classifiers offers a wider investigation spectrum of the machine-learners' ability in the detection of username enumeration attacks. Section 3 has more information on these classifiers.
Our findings show that utilizing machine-learning algorithms to detect SSH username enumeration attacks is a very successful approach. Additionally, we examine the impact of source and destination ports usage in the detection of username enumeration attacks. This is achieved by including source and destination ports as feature sets in model development and evaluation.
The remaining part of the paper is arranged as follows: Section 2 discusses the works related to brute-force attacks and various detection methods. The experimental setup, the dataset and its pre-processing, and the classifiers we used are all presented in Section 3. We discuss our findings in Section 4. Finally, in Section 5, we wrap up our research and make recommendations for future investigation.

Related Works
The username enumeration attack to get a list of existing usernames works hand in hand with password-related attacks like brute-force. A typical brute-force attack looks for the right user and password combination, frequently without knowing if the user already exists on the system. The Verizon 2020 data breach investigation report highlighted that brute-force attacks accounted for more than 80% of all data breaches. It is a long-standing strategy, yet it is still prevalent and effective among hackers today [34]. In various research, the dominance of brute-force attack has indeed been observed.
One of the studies that observed the prevalence of brute-force attacks is [35], which examined the attack pattern on the SSH protocol by investigating aggregated NetFlow data using a decision tree classifier. Their evaluation was conducted in a high-speed university campus network. Satoh et al. [36] investigated SSH dictionary attacks by means of machine-learners. They subsequently suggested two novel elements for dictionary attack detection. The two studies had promising results; however, neither of them addressed the issue of the username enumeration attack.
Mobin et al. [37] studied distributed SSH brute-force attack detection by using statistical analysis on a dataset of thousands of users collected over 8 years. They suggested that significant statistical changes in a parameter that summarizes aggregate activity revealed brute-force attacks. They further indicated that some of the approaches for detecting specific attacks are complex to implement. In paper [6], the authors explored the detection of brute-force attacks on SSH by examining NetFlow data with four machine-learning classifiers, using their own generated labeled dataset. The two approaches proved to be successful with promising results. The focus was on detection of password-based attacks, but there was no effort on detecting username enumeration attacks.
Kim et al. [38] investigated intrusion detection using the KDDCUP99 dataset with an LSTM recurrent neural network classifier and machine-learning algorithms. They then compared the neural network results with the machine-learning results and concluded that the former outperformed the latter. Hossain et al. [16] also studied SSH and FTP brute-force attack detection using LSTM and machine-learning classifiers. They also concluded that deep-learning results outperformed machine-learning results. Both studies attained outstanding results, but neither focused on detecting username enumeration attacks.
Hofstede et al. [39] delved into brute-force attacks on web applications and discussed several phases brute-force attacks undergo. They concluded that at a high-speed network, it is challenging to detect the attacks. Hynek et al. [40] proposed a study on redefined brute-force attack detection using a machine-learning approach. They used extended IP flow features obtained from backbone network traffic dataset to differentiate successful and unsuccessful login. Other research, in addition to the studies mentioned above, suggests that brute-force attacks are still amongst the most common attacks on the Internet [41].
All the aforementioned studies have focused on, and achieved excellent results in, detecting and mitigating password-related attacks such as brute force that are generated by various password attack tools. However, none of them has adequately addressed the detection and mitigation of username enumeration attacks. For any password-based attack to be launched, an attacker must have gathered all the necessary information, including the list of usernames of the targeted system obtained from the username enumeration attack. Therefore, the detection and prevention of the username enumeration attack is highly needed in order to deny an attacker the opportunity to retrieve a list of valid, existing usernames of the targeted system.

Materials and Methods
This section is organized as follows. The experimental setup and attack scenario are explained in the first part. In the second part, network traffic data from a closed-environment network is collected and given corresponding labels, resulting in a new dataset. Third, several data pre-processing techniques are applied in order to transform the raw dataset into a format that machine-learning algorithms can read and understand. As previously stated, four classifiers are then used to create classification models from the labeled traffic data. We carry out two sets of experiments to see how using and not using port information affects username enumeration attack detection. The rest of this section delves deeper into the steps listed above.

Experimental Setup
The attack simulation is carried out in a closed-environment network consisting of a victim machine, a penetration testing platform and a data collection point. The victim machine, an SSH server, was registered with thousands of users. The SSH server was a patched version of OpenSSH server version 7.7 [42] that listens on standard TCP port 22 for incoming and outgoing traffic. We chose this version because the attack affects versions 2.3 to 7.7 [43]. The SSH server runs on Ubuntu Linux 20.04 (×64) on a computer with a 2.8 GHz Intel Core i7 CPU and 16 GB of RAM. A penetration testing platform, Kali Linux 2020.4 (×64) with kernel version 5.9.0, targets this SSH server. This penetration platform operates on a machine with 16 GB of RAM and a 3.4 GHz Intel Core i7 CPU. The data collection server runs on Linux Mint 20.2 on a computer with 16 GB of RAM and a 2.8 GHz Intel Core i7 CPU. The IP addresses for the SSH server, penetration testing system and data collection server are 192.168.56.115, 192.168.100.117 and 192.168.100.16, respectively, and are in the private IPv4 range.

Attack Scenario
The attack was launched from Kali Linux, a penetration testing platform, against the SSH server, the victim machine. To do this, we used an exploit for the vulnerability with the Common Vulnerabilities and Exposures (CVE) identifier CVE-2018-15473, retrieved from the public exploits database [43]. The exploit is written entirely in Python. It generates username enumeration attack traffic from the penetration testing platform, the Kali machine, to the victim machine, the SSH server. The attack was accomplished by employing the attack command shown in Figure 1. Figure 2 depicts the attack's output, which lists all the usernames found on the SSH server, including the root account. It marks each existing username as "valid user" and each username not found in the system as "is not a valid user". To get a mix of normal and attack traffic, a pcap file of normal traffic was obtained from a public training repository [44]. The pcap file was replayed using the tcpreplay [45] tool at the same time as the attack was launched from the Kali machine against the SSH server. Finally, both attack and normal traffic were collected at the data collection point.

Data Collection and Labelling
The dataset is collected from a closed-environment network using the network monitoring tools tcpdump [46] and Wireshark [47] installed at the data collection point. A total of 36,273 raw packets were collected, each containing 25 features, excluding the label. The packets were then given their corresponding labels as username enumeration attack and non-username enumeration attack. We chose the terms "username enumeration attack" and "non-username enumeration" instead of the traditional "attack" and "normal" label notations since "normal" traffic data could contain attacks other than the username enumeration attack, which is the focus of our research. Since the goal of this study is to detect username enumeration attacks, we found that labeling the dataset in this way is more suitable. The username enumeration attack class corresponds to the attack traffic, while the non-username enumeration class corresponds to the normal traffic. This normal traffic reflects different services including email, DNS, HTTP and web, to mention a few. We thus obtained a raw dataset [48] comprising attack traffic and normal traffic. The dataset was then split into a training subset and a testing subset with an 80/20 ratio to deliver evaluation results on the classifiers' efficacy. The dataset split was based on the Pareto Principle [49], also known as the 80-20 rule. The 80-20 split ratio is one of the most common ratios in the machine-learning and deep-learning fields and was used in similar work on intrusion detection systems such as [16]. The distribution of the dataset is indicated in Tables 1 and 2.
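As a minimal sketch of the 80/20 split described above, assuming scikit-learn's train_test_split and a small synthetic stand-in for the packet dataset (the array contents, the stratify option and the random seed are illustrative assumptions, not settings reported in the paper):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1,000 packets, 7 numeric features, binary label
# (1 = username enumeration attack, 0 = non-username enumeration).
X = rng.random((1000, 7))
y = rng.integers(0, 2, size=1000)

# Hold out 20% of the packets for testing. Stratifying keeps the class
# proportions similar in both subsets (an assumption, not stated in the text).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, which matters when the same training and test subsets are reused across several classifiers.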

Data Preprocessing
Data pre-processing is a data mining technique that transforms raw datasets into a readable and understandable format. Machine-learning algorithms work on datasets in numerical form, and this form is achieved through data pre-processing [50]. Data pre-processing techniques include missing-data treatment, categorical encoding, data projection and data reduction. Missing-data treatment involves deleting missing values or replacing them with estimates. Categorical encoding transforms categorical values into numerical values. Data projection scales the values into a symmetric range, which changes the appearance of the data. Data reduction reduces the size of datasets using several techniques, including feature selection.
In this work, the missing values in the dataset were treated using imputation. For the categorical features, the most-frequent strategy was used within each column. For the numerical features, a constant strategy was used to replace the missing values. Both label encoding and one-hot encoding were used to transform categorical feature values into numerical feature values; hence, two versions of the dataset were generated. However, in this work the label-encoded dataset was used. Though one-hot encoding is a common method, it increases the dimension of the dataset, in contrast to label encoding, which directly converts the nominal feature values into specific numerical values. All features were scaled into the same predefined range using the MinMaxScaler() method. Dataset reduction was implemented using feature selection. We selected 7 different features from the dataset. The description of each feature is shown in Table 3. All the data pre-processing techniques were carried out using the scikit-learn library.
Table 3. Description of features selected.

Feature            Description
Time               Packet duration time in seconds
Packet Length      The length of the packet in bytes
Delta              Time interval between packets in seconds
Flags              Flags seen in the packet
Total Length       The total length of the packet in bytes
Source Port        The source port of the packet
Destination Port   The destination port of the packet
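The pre-processing steps described in this section (imputation, label encoding and min-max scaling) can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' script: the toy values, the subset of columns, and the constant fill value of 0 are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical toy packets; column names follow Table 3, values are made up.
df = pd.DataFrame({
    "Time": [0.10, 0.25, np.nan, 0.40],
    "Packet Length": [74.0, 66.0, 1514.0, np.nan],
    "Flags": ["SYN", "ACK", np.nan, "ACK"],
    "Source Port": [51514.0, 22.0, 443.0, 51514.0],
})
numeric_cols = ["Time", "Packet Length", "Source Port"]
categorical_cols = ["Flags"]

# Numerical features: replace missing values with a constant (0.0 is assumed).
num_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
df[numeric_cols] = num_imputer.fit_transform(df[numeric_cols])

# Categorical features: replace missing values with the most frequent value.
cat_imputer = SimpleImputer(strategy="most_frequent")
df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])

# Label encoding: map each category to an integer code, one column at a time.
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

# Scale every feature into the same [0, 1] range.
scaled = MinMaxScaler().fit_transform(df)
print(scaled)
```

Label encoding keeps one column per categorical feature, whereas one-hot encoding would add one column per category, which is the dimensionality concern raised in the text.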

Applying Machine-Learning Classifiers to Dataset
In this work, we picked four distinct machine-learning classifiers for our study: k-nearest neighbor, naïve Bayes, random forest and decision tree. We picked different classifiers to cover a wider scope of investigation in username enumeration attack detection. These classifiers have asymmetric features and are computationally lightweight. A brief explanation of each classifier is provided below. We developed all models using the scikit-learn library in a GPU environment using Python v3.7. All the models were built by tuning their parameters. Table 4 shows the parameter tuning for each model.
A decision tree is a widely known machine-learning classifier built in a tree-like structure [51]. It contains internal nodes, which represent attributes, branches, and leaf nodes, which represent class labels. To form classification rules, the root node is selected first; it is the attribute that best separates the data. A path is then followed from the root node to a leaf node. The decision tree classifier operates by taking attribute values as input data and producing decisions as output [52].
Random forest is another dominant machine-learning classifier in the category of supervised learning algorithms [53]. It is likewise used in machine-learning classification problems. This classifier works in two asymmetric steps. The first step creates the asymmetrical forest from the specified dataset, and the second makes the prediction from the classifier acquired in the first step [54].
Naïve Bayes is a common probabilistic machine-learning classifier used in classification or prediction problems. It operates by calculating the probability of a certain class for a sample in a specified dataset. It involves two kinds of probability: class and conditional probabilities. Class probability is the ratio of the number of instances of each class to the total number of instances. Conditional probability is the ratio of the occurrences of each feature value within a certain class to the number of samples of that class [55,56]. The naïve Bayes classifier presumes that every attribute is independent and does not consider associations between the attributes [57].
K-nearest neighbors is a classifier that considers three important elements in its classification: the record set, the distance, and the value of k [58]. It functions by calculating the distance between the sample point and the training points. The point with the smallest distance is the nearest neighbor [59]. The nearest neighbors are determined with respect to the value of k (in our case k = 4), which defines the number of nearest neighbors that must be examined in order to determine the class of the sample data point [60].
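The class and conditional probabilities described above combine, under the independence assumption, into the standard naïve Bayes decision rule. As a sketch in generic notation (the symbols $y$ for a class and $x_1, \dots, x_n$ for the feature values of a sample are introduced here for illustration and do not appear in the paper):

```latex
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

Here $P(y)$ is the class probability and each $P(x_i \mid y)$ is a conditional probability; the product form is exactly what the independence assumption permits.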
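The distance-and-vote procedure described above can be sketched in plain Python. This is an illustrative implementation with k = 4 as in the text; the Euclidean distance, the two-dimensional toy points and the tie-breaking behavior are assumptions of the sketch, not details taken from the paper.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=4):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = []
    for point, label in zip(train_points, train_labels):
        dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(point, query)))
        distances.append((dist, label))
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    # On a tie, the label seen first among the nearest points wins.
    return votes.most_common(1)[0][0]

# Hypothetical, already-scaled 2-D feature vectors.
train_points = [(0.1, 0.1), (0.2, 0.1), (0.15, 0.2),
                (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
train_labels = ["attack", "attack", "attack",
                "non-attack", "non-attack", "non-attack"]

print(knn_predict(train_points, train_labels, (0.12, 0.15)))
```

With an even k such as 4, a two-class vote can tie at 2 to 2, so a practical implementation needs an explicit tie-breaking rule.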
We built all four classification models using a subset of 80% of the given dataset and used the remaining 20% for testing the models. The train-test split ratio was the same for each classifier. The performance metrics used to evaluate the effectiveness of our developed models were precision, recall and overall accuracy. The metrics are defined below.
The Receiver Operating Characteristic (ROC) curve was also considered as an additional performance metric. This evaluation metric plots the True Positive rate against the False Positive rate of the corresponding model. It shows the trade-off between the True Positive rate and the False Positive rate, where a higher ROC value indicates a high True Positive rate and a low False Positive rate, which is desirable in anomaly detection.
We conducted two types of experiments: one excludes source and destination ports and the other includes them as input features. This is because network administrators sometimes change the destination port to a number other than the default port for the SSH protocol, which is port 22. With these two experiments, we observed that including or excluding port information has a significant impact on the classification outcomes. The outcome scores indicate that using port information as input features improves the performance metrics of the developed models to a degree that depends on the kind of classifier used. However, excluding port information from the input features also has the significant benefit of producing a robust model that reflects the situation in which the SSH protocol is not configured on the standard default port.
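In terms of confusion-matrix counts, these metrics are conventionally defined as follows, where $TP$, $TN$, $FP$ and $FN$ denote true positives, true negatives, false positives and false negatives. Treating the username enumeration attack class as the positive class is an assumption made here for illustration:

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```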
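To make the ROC construction concrete, the following sketch uses scikit-learn's roc_curve and roc_auc_score on four hypothetical packets. The labels and scores are made up for illustration and are not taken from the paper's dataset.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical ground truth (1 = attack) and model scores for four packets.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each score threshold yields one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# The area under the curve summarizes the ROC curve as a single number.
auc = roc_auc_score(y_true, y_score)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print("ROC-AUC:", auc)
```

An area close to 1 means the model ranks almost every attack packet above every non-attack packet, which corresponds to the high true positive rate and low false positive rate described above.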
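A minimal sketch of this two-experiment design, assuming scikit-learn estimators with mostly default hyperparameters. Only k = 4 comes from the text; the use of GaussianNB as the naïve Bayes variant, the random seeds and the synthetic data are assumptions for illustration, so the printed accuracies do not reproduce Tables 5 and 6.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

feature_names = ["Time", "Packet Length", "Delta", "Flags",
                 "Total Length", "Source Port", "Destination Port"]

# Synthetic stand-in data: the label depends on one non-port feature
# and on the destination port, so removing the ports loses information.
rng = np.random.default_rng(0)
X = rng.random((600, len(feature_names)))
y = ((X[:, 1] + X[:, 6]) > 1.0).astype(int)

def evaluate(X, y, columns):
    """Train the four classifiers on the chosen columns; return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, columns], y, test_size=0.2, random_state=42
    )
    models = {
        "KNN": KNeighborsClassifier(n_neighbors=4),
        "NB": GaussianNB(),
        "RF": RandomForestClassifier(random_state=0),
        "DT": DecisionTreeClassifier(random_state=0),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores

all_cols = list(range(len(feature_names)))
no_port_cols = [i for i, name in enumerate(feature_names) if "Port" not in name]

with_ports = evaluate(X, y, all_cols)
without_ports = evaluate(X, y, no_port_cols)
for name in with_ports:
    print(name, "with ports:", with_ports[name], "without ports:", without_ports[name])
```

Using the same random_state for both runs keeps the train and test rows identical, so any difference in accuracy comes from the feature set alone.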

Results and Discussion
For each classification model developed, we used the same training set and test set. 80% of the given dataset was used for training the classification models and the remaining 20% was used to test them. Tables 5 and 6 show the results of the four machine-learning-based classification models when port information is included and not included in the feature set. The prediction results show that all the classification models in both tables, with and without port information, provide outstanding results, as indicated by an accuracy of at least 95.70%, which confirms the models' effectiveness in the detection of the username enumeration attack. The KNN classifier has the highest performance metrics, with an accuracy of 99.95% when source and destination ports are included as input features and an accuracy of 99.93% when they are excluded.
Additionally, Figures 3 and 4 show the ROC curves of the models for the two kinds of experiments conducted. They represent the True Positive rate versus the False Positive rate of each classification model developed. From the figures, we observe that the correctly classified rate is high, close to the maximum value of 1, while the falsely classified rate is low in both cases, with and without port information. Therefore, from the results in Tables 5 and 6 together with the ROC curves in Figures 3 and 4, we can conclude that our machine-learning-based classification models are able to detect the username enumeration attack effectively, with a high detection rate and a low false alarm rate.

Effectiveness Comparison When Including and Excluding Ports Information
The effectiveness comparison between the two kinds of experiments shows that performance improves when source and destination ports are included as input features compared to when they are excluded. Tables 5 and 6 show the relative comparison of precision, accuracy and ROC-AUC on the dataset discussed in the earlier section.
The classification performances of the DT, RF, and KNN models improve slightly. The KNN model increases from an accuracy of 99.93% when source and destination ports are excluded from the feature set to an accuracy of 99.95% when they are included. Similarly, the RF model improves slightly from an accuracy of 99.92% to 99.94% when source and destination ports are included as input features. The decision tree improves from an accuracy of 99.88% to 99.93%. The naïve Bayes model improves significantly when port information is included in the feature set: its accuracy increases from 95.70% to 99.85%. Naïve Bayes is usually a weak classifier, and when port information is excluded from the input features in our study, the other classifiers outperform it. However, when source and destination ports are included in its feature set, naïve Bayes performs almost as well as DT, RF and KNN.
We observe that the DT, RF and KNN classification models produce almost the same classification performance regardless of whether port information is included in or excluded from the feature set. This can be interpreted to mean that even if source and destination ports are not included as input features, the distribution of samples in the feature space is still such that samples with the same label are clustered together.
We also observe that the naïve Bayes classification model shows a significant performance improvement when port information is included as an input feature. This is due to the presumption that features in naïve Bayes are completely independent. Therefore, it is rational to accept that the independent nature of naïve Bayes' features can be compensated for by adding attributes to its attribute set, which yields a performance improvement.
Thus, according to the results shown in Tables 5 and 6 and the above experimental analysis, we can conclude that including source and destination ports as input features has various impacts on the developed classifiers depending on their type; however, generally it enhances the performances, ensuring the models' effectiveness in the detection of the username enumeration attacks.

Conclusions
In this paper, we present a novel SSH username enumeration attack detection method using machine-learning approaches. To achieve this, we collected data from a closed-environment network and then labelled it to generate a labelled dataset. We trained four distinct classifiers on a dataset containing username enumeration and non-username enumeration attack class instances. The former represented the attack class while the latter represented the normal class. We evaluated the models' performance using accuracy, precision, and ROC-AUC values. Our findings show that, using machine-learning approaches to detect SSH username enumeration attacks, we can achieve reasonable results, with KNN having an accuracy of 99.93%, NB 95.70%, RF 99.92%, and DT 99.88%.
In addition, when training the classification models, we investigated the impact of including port information in the feature set. Our findings imply that including source and destination ports as input features results in some performance improvements without compromising computation power. However, the performance improvements vary from classifier to classifier based on their nature. Naïve Bayes shows a significant performance improvement when port information is included. Naïve Bayes assumes that its features are completely independent; hence, including port information yields significant performance improvements.
In future work, we aim to gather data in a production-environment network and evaluate how the developed models perform on a real-world live dataset. Deep-learning techniques may also be incorporated in the future to detect username enumeration attacks.

Data Availability Statement:
Due to the novelty of the study, the dataset had to be generated through the use of public exploits and pcap files from public training repositories. The generated datasets are publicly available to everyone and can be found at https://doi.org/10.5281/zenodo.5564663 (accessed on 9 August 2021).

Conflicts of Interest:
The authors declare no conflict of interest.