A Compression-Based Method for Detecting Anomalies in Textual Data

Nowadays, information and communications technology systems are fundamental assets of our social and economic model, and thus they should be properly protected against the malicious activity of cybercriminals. Defence mechanisms are generally articulated around tools that trace and store information in several ways, the simplest one being the generation of plain text files known as security logs. Such log files are usually inspected, in a semi-automatic way, by security analysts to detect events that may affect system integrity, confidentiality and availability. On this basis, we propose a parameter-free method to detect security incidents from structured text regardless of its nature. We use the Normalized Compression Distance to obtain a set of features that can be used by a Support Vector Machine to classify events from a heterogeneous cybersecurity environment. In particular, we explore and validate the application of our method in four different cybersecurity domains: HTTP anomaly identification, spam detection, Domain Generation Algorithms tracking and sentiment analysis. The results obtained show the validity and flexibility of our approach in different security scenarios with a low configuration burden.


Introduction
We live in a complex world with multiple and intricate interactions among countries, companies and people. Those relations are predominantly conducted through Information and Communication Technologies (ICT) [1]. As the number of devices connected to the Internet has increased, so has the number of malicious agents that try to profit from system vulnerabilities. These malicious actors can target governments, companies or individuals using several kinds of attacks. Malicious activities range from simple attacks such as parameter tampering, spam or phishing, to more complex threats such as botnets, Advanced Persistent Threats (APTs) or social engineering attacks that leverage Domain Generation Algorithms (DGAs) and Open Source Intelligence (OSINT) techniques [2,3]. To help protect their networks, systems and information assets, cybersecurity analysts use different tools, such as Intrusion Detection Systems (IDS), firewalls or antivirus software. These tools generate huge amounts of information in the form of text logs that blend normal behaviors and malicious interactions. Moreover, the proper implementation of the security life-cycle is highly dependent on the adequate aggregation of additional records of activity and other sources of digital evidence [4,5]. Due to the ever increasing size of security related data, the automatic treatment and analysis of text logs is a key component in the deployment of adaptive and robust defensive strategies [6]. Therefore, it is desirable to have a single methodology that can be applied to these logs to simplify the analysts' duties and support their decision making. Our approach consists of developing parameter-free methods for such a goal [7].
As a starting point, we consider five binary classification problems in four different cybersecurity related areas: anomalous versus normal URLs, spam versus non-spam messages, normal versus malicious Internet domains and malicious versus normal messages in fora or social networks. This is a heterogeneous landscape where the common nexus is that the sources of information can be treated as text. Compression-based techniques are helpful for processing textual data with different encoding systems. Indeed, these techniques have been successfully used in several domains, such as clustering and classification of genomic information [8], language identification [9], plagiarism detection [10] or URL anomaly detection [11].
In this work, we explore and validate the use of a previous method based on compression techniques [12] in a wider scenario that encompasses several binary classification problems in the current cybersecurity context. These include spam detection, malicious URL identification, DGA targeting and sentiment analysis in Twitter and in movie reviews. We use the Normalized Compression Distance (NCD) [10] to extract a set of features that characterise the textual data associated with each problem, and train a Support Vector Machine (SVM) that classifies the data using these features. Our results on the different problems show the generality of this approach, which emerges as a unified framework for the analysis of textual information in a general cybersecurity context. Indeed, although we do not improve on the best state of the art solutions in each considered problem, we obtain competitive results in all the domains. The slight decrease in accuracy is offset by the versatility of the methodology.
The rest of the paper is organized as follows: Section 2 describes previous works on each of the considered domains, Section 3 describes the proposed method, Section 4 explains the details and results of each of the experiments and Section 5 outlines the conclusions.

Anomaly Detection in Cybersecurity
This section discusses different strategies followed by the cybersecurity community to tackle some of the threats they face. Traditional approaches involve the use of anomaly detection techniques, which have been used, for example, for virus or intrusion detection. More recently, with the rise of social networks and open sources of information, the attack surface has increased [2]. To overcome these new risks, cybersecurity researchers have started using Natural Language Processing (NLP) techniques, such as sentiment analysis. For instance, NLP has been applied to detect cyberbullying [13] and malware [14], and to predict cyber-attacks [15]. In this article we focus on four representative domains where the analysis of textual information is also relevant. In the following paragraphs we describe the problems and the main approaches used to cope with them.

HTTP Anomaly Detection
Usually, the detection of anomalies in the cybersecurity field is done using IDS [16]. An IDS can be targeted at analysing either network or host activity. Moreover, we can adopt a static approach, comparing activity traces to concrete patterns or signatures, or a dynamic approach based on behaviour analysis. The latter has attracted most research efforts. The techniques for HTTP traffic anomaly detection can be classified into seven general groups: statistical models, static analysis, statistical distances, data mining and pattern analysis, Markov models, machine learning and knowledge-based approaches [16-21]. Different features of the HTTP packets and the HTTP protocol, including the text of the URL, are used to model the normal network behaviour.

Spam Detection
Spam detection is another area that has been broadly studied in the cybersecurity context. Several machine learning techniques have been applied, ranging from Naïve Bayes to logistic regression or neural networks [22]. As this problem depends heavily on text, text analysis and clustering techniques have also been widely studied, with the Term Frequency (TF) and Term Frequency Inverse Document Frequency (TFIDF) statistics [23] usually applied for attribute construction, as well as edit distances [24]. Of particular interest to our work is the application of compression based techniques [25,26]. These perform well and are quite robust against noise in the channel that could otherwise induce an erroneous classification [27].

DGA Detection
A DGA is a technique used by malware to conceal its Command and Control (C&C) panel. It consists of generating random domain names, and it is one of the techniques that botnets, i.e., groups of hijacked computers and infected devices (https://www.trendmicro.com/vinfo/us/security/definition/botnet, last access 16 February 2021), use to communicate and to hide the real address of their C&C.
The main techniques applied to detect DGA domains analyse the Domain Name System (DNS) traffic in a specific network. They use features such as the frequency and the number of domains that do not resolve (NXDomain) to cluster and identify malware families [28,29]. Another approach involves the use of the domain name to identify whether it has been generated by a DGA. This is done by the detection of patterns, and usually requires a feature extraction step before applying algorithms such as Neural Networks [30,31] or n-gram categorisation [32,33].

Sentiment Analysis
Sentiment analysis is a field of text mining that has attracted the attention of cybersecurity researchers. Some examples of application of sentiment analysis in the security context include the identification of threats [34], radicalism and conflicts in fora [35], the detection of cybercrime facilitators in underground economy based on customer feedback [36], the characterisation of disinformation phenomena [37,38], and the use of Twitter to anticipate attacks [39] and generate alerts [40]. The main strategies consist of a feature selection step using techniques such as TFIDF or Point-wise Mutual Information (PMI), followed by a sentiment classification step using either machine learning or lexicon based approaches [41,42]. Deep Learning techniques, including Convolutional Neural Networks [43,44] and Word Embeddings [45], are also considered in this context.

Materials and Methods
It is worth noting that most of the previously reviewed techniques are problem specific. Even though all the presented problems involve some kind of analysis of textual information, the field lacks a general methodology that faces all these issues from a common perspective. This section describes a general mechanism to extract a set of numerical attributes that characterize a text using a compression based metric, such as the NCD. Given a text $T$, the main idea is to compute the distance between $T$ and a set of $k$ additional texts $\{g_1, g_2, \ldots, g_k\}$, known as attribute generators (AGs). This provides a vector of $k$ numbers that represents the text $T$ in a $k$-dimensional attribute space, and can be used as input for a classification algorithm [12]. Although in this work we have used an SVM classifier, it is necessary to highlight that our methodology for attribute extraction can be used with any other classification algorithm.
For a dataset consisting of text strings belonging to two different classes, the detailed procedure is as follows. We randomly divide the data into two disjoint groups, $G$ and $I$, with $m$ and $n$ texts, respectively. Both groups are balanced, meaning that they contain the same number of strings for each of the two classes. The first group, $G$, is used to build the attribute generators. To do so, the strings in $G$ are packed into a set of $k$ generator files, each one with strings from one class only. Thus we have $k/2$ generators for each class. The second group, $I$, contains the set of $n$ strings that are used to train the classifier, after being characterized by the distances to the attribute generators. That is, for each string $s_i$ in $I$, we compute the distance between $s_i$ and each of the $k$ attribute generators to obtain the vector:
$$S_i = (x_{i1}, x_{i2}, \ldots, x_{ik}),$$
which is used to characterize the string $s_i$. The components $x_{ij}$ of $S_i$ are given by:
$$x_{ij} = D(g_j, s_i),$$
where $D(g_j, s_i)$ is the normalized conditional compressed information [10] between the generator $g_j$ and the string $s_i$:
$$D(g_j, s_i) = \frac{C(g_j \cdot s_i) - C(g_j)}{C(s_i)}.$$
Here $C(x)$ is the compressed size of $x$ and the dot operator represents concatenation. In the experiments we use the gzip compressor because it is faster than other compression algorithms [46]. Once the attribute vectors have been generated for all the strings in $I$, we use them to train an SVM with a radial basis function (RBF) kernel that predicts the class associated with each string. We use the scikit-learn Python library [47] for the implementation, and measure the quality of the classifiers using the accuracy (ACC) and the area under the receiver operating characteristic curve (AUC). The SVM model depends on two hyperparameters, the complexity $C$ and the kernel width $\gamma$, which are tuned, for each $k$, using a standard 5-fold cross-validation strategy as described in [12]. The results presented in Section 4 are averaged over 10 different partitions of the data into the $G$ and $I$ sets.
Figure 1 shows the end-to-end process, from the initial dataset to the final classification.
Figure 1. End-to-end representation of the proposed method. The set of texts composing the dataset is first divided into two groups, $G$ and $I$. The $m$ texts in the $G$ group are further divided into $k$ additional groups, or attribute generators. Then the distance between each text $s_i$ in $I$ and each generator $g_j$ is computed in order to obtain an $n \times k$ attribute matrix, which is used to train an SVM.
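As a minimal sketch of the attribute extraction step, the following Python code computes the distance between a string and each attribute generator with the standard-library gzip module. The form used for D, (C(g·s) − C(g)) / C(s), the function names and the toy generators are illustrative assumptions rather than the implementation used in the experiments, and Python's default gzip compression level is used because the text does not state one.

```python
import gzip
from typing import List, Sequence


def compressed_size(data: bytes) -> int:
    """C(x): size in bytes of the gzip-compressed input."""
    return len(gzip.compress(data))


def distance(generator: bytes, text: bytes) -> float:
    """Assumed form of D(g, s): (C(g . s) - C(g)) / C(s), '.' being concatenation."""
    extra = compressed_size(generator + text) - compressed_size(generator)
    return extra / compressed_size(text)


def attribute_vector(text: bytes, generators: Sequence[bytes]) -> List[float]:
    """Characterize one string by its distance to each of the k generators."""
    return [distance(g, text) for g in generators]


# Toy generators, one per class (made-up strings, not taken from any dataset).
normal_gen = b"\n".join(
    b"id=%d&name=user%d&action=view&page=home&lang=en" % (i, i) for i in range(50)
)
attack_gen = b"\n".join(
    b"id=%d' OR '1'='1'; DROP TABLE users--" % i for i in range(50)
)

query = b"id=77&name=user77&action=view&page=home&lang=en"
features = attribute_vector(query, [normal_gen, attack_gen])
print(features)  # the component for the normal generator should be the smaller one
```

A string that shares many substrings with a generator adds little to the compressed size of the concatenation, so its distance to that generator is small. Stacking the vectors of all the strings in I gives the n × k attribute matrix.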

Results
To assess the validity of the proposed method, we have performed experiments in four different and heterogeneous domains. Specifically, we have explored the detection of malicious HTTP requests (Section 4.1), the identification of spam in SMS messages (Section 4.2), the detection of DGA domains (Section 4.3) and the analysis of sentiment both in Twitter (Section 4.4) and in movie reviews (Section 4.5). All the problems consist of a set of strings belonging to two different classes, for example normal versus anomalous HTTP requests or positive versus negative sentiment in tweets. Nevertheless, the characteristics of each dataset are unique, and we observe a high variability with respect to both the string length (Table 1) and the string content (Tables 2, 4, 6, 8 and 10). Despite this variability, the features extracted with the presented method provide a good description of the problem data in all cases, and the classifiers trained on them obtain accuracy competitive with the state of the art. The following subsections describe each of the experiments in detail.

Experiment 1-Malicious HTTP Requests
This experiment tackles the problem of identifying a malicious HTTP request using only the related URL string. Usually, this problem has been addressed by analysing additional information, such as the URL length, the number of URL parameters or the parameter values [48]. Our method simplifies the preprocessing step by considering the raw text of the URL, hence avoiding any sort of manual attribute construction.

Data Preparation
We use the public CSIC-2010 dataset [49], which contains examples of both normal and anomalous HTTP requests. We extract all the POST requests, remove duplicates and balance the queries. This results in a total of 9600 queries, 4800 of each class. After this preprocessing step, we divide the dataset into the I set, with 1600 randomly chosen queries (800 normal and 800 anomalous) and the G set, with 8000 queries (4000 normal and 4000 anomalous). Some examples of normal and anomalous queries are shown in Table 2.
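The partition into the I and G sets and the packing of G into generator files can be sketched as follows, using the sizes of this experiment. This is a hedged illustration: the function name, the round-robin assignment of strings to generators, the newline separator and the placeholder strings are all assumptions, since the text does not specify how strings are distributed among the generator files.

```python
import random
from typing import Dict, List, Tuple


def split_and_pack(
    strings_by_class: Dict[str, List[str]],
    n_train_per_class: int,
    k: int,
    seed: int = 0,
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Return (I, generators) for a balanced two-class dataset.

    I is a list of (string, label) pairs used to train the classifier.
    generators is a list of k (text, label) pairs, k/2 per class, each built
    from strings of a single class joined by newlines (an assumption).
    """
    if k % 2 != 0:
        raise ValueError("k must be even so that each class gets k/2 generators")
    rng = random.Random(seed)
    per_class = k // 2
    train, generators = [], []
    for label, strings in strings_by_class.items():
        shuffled = strings[:]
        rng.shuffle(shuffled)
        train += [(s, label) for s in shuffled[:n_train_per_class]]
        g_strings = shuffled[n_train_per_class:]
        # Deal the G strings round-robin into per_class generator files.
        for j in range(per_class):
            generators.append(("\n".join(g_strings[j::per_class]), label))
    return train, generators


# Sizes from this experiment: 4800 queries per class, 800 per class in I.
data = {
    "normal": [f"normal-query-{i}" for i in range(4800)],
    "anomalous": [f"anomalous-query-{i}" for i in range(4800)],
}
I_set, gens = split_and_pack(data, n_train_per_class=800, k=80)
print(len(I_set), len(gens), gens[0][0].count("\n") + 1)  # 1600 80 100
```

With k = 80, each class contributes 40 generators of 4000 / 40 = 100 queries each. Increasing k therefore yields more but smaller generator files.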

Results
We have performed experiments using a number of attributes, k, ranging from 8 to 160. The results can be seen in Table 3. The highest accuracy is obtained for k = 80, with 95% of the HTTP queries correctly classified. This value is similar to other results reported in the literature [50]. Nevertheless, our approach does not require a feature selection step and depends on a smaller number of hyperparameters.
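For reference, the tuning of the two SVM hyperparameters by 5-fold cross-validation can be sketched with scikit-learn as shown below. The attribute matrix is replaced by synthetic data, and the search grid is a hypothetical choice made for illustration. The ranges actually explored follow [12] and are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the n x k attribute matrix (n = 200, k = 8).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.3, 0.1, (100, 8)), rng.normal(0.7, 0.1, (100, 8))])
y = np.array([0] * 100 + [1] * 100)

# Hypothetical grid over the complexity C and the RBF kernel width gamma.
param_grid = {"C": [0.1, 1.0, 10.0, 100.0], "gamma": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

In a full run this search would be repeated for each value of k and for each of the 10 random partitions into G and I, and the resulting scores averaged.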

Experiment 2-Spam
In this experiment we apply the method to the problem of discriminating between legitimate and spam SMS messages. One of the main characteristics of this kind of message is that it is usually written in very informal language, with many invented terms and abbreviations that are not always grammatically correct. This may be a problem for traditional NLP methods based on lemmatisation or parsing [51]. The method proposed here is, however, agnostic to the grammar or the rules followed by the messages, and it can be directly applied to this problem without any adaptation. The next paragraphs describe the data preparation and the results obtained.

Data Preparation
We use the SMS Spam Collection v.1, a public dataset that can be obtained from the authors' page (http://www.dt.fee.unicamp.br/~tiago/smsspamcollection, last access 16 February 2021) and also from Kaggle (https://www.kaggle.com/uciml/sms-spamcollection-dataset/data, last access 16 February 2021). It contains 5574 SMS messages, 747 of them labeled as spam and the rest, 4827, labeled as neutral, or ham. We balance the dataset by taking all the spam messages and randomly selecting a sample of 747 ham messages. The balanced data are further divided into the I set, with 400 messages (200 ham and 200 spam), and the G set, with 1094 messages (547 of each class). Table 4 shows some examples of ham and spam messages.

Results
As before, we have performed experiments for k ranging between 8 and 160. A summary of the results can be found in Table 5. We observe an increase in performance as more attribute generators are used, with a maximum of 0.96 AUC and 0.904 accuracy for k = 160. These values are slightly worse than the best results reported in the literature for the same dataset [52], but the latter require a more complex and problem-specific preprocessing step. The proposed method avoids this step, which simplifies the overall process.
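The two metrics reported throughout the experiments can be illustrated with a small pure-Python sketch. The labels and scores are made up for the example, and the quadratic pairwise AUC is chosen for readability rather than efficiency. A library routine would normally be used instead.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0]              # 1 = spam, 0 = ham (toy example)
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]  # hypothetical classifier scores
preds = [int(s >= 0.5) for s in scores]  # threshold the scores at 0.5

print(accuracy(labels, preds), auc(labels, scores))  # accuracy = 4/6, AUC = 8/9
```

Accuracy depends on the decision threshold, whereas the AUC only depends on how the scores rank positive against negative examples, which is why the two values can differ noticeably.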

Experiment 3-DGAs
In the third experiment we apply the method to the detection of DGAs relying on the domain name only. The main characteristic of this problem, which makes an important difference with respect to the rest of considered scenarios, is that the string lengths are significantly shorter. It is very unlikely that a domain name contains more than 100 characters (see Table 1). In spite of this fact the proposed approach has been applied with no adaptations, and the results are quite satisfactory.

Data Preparation
We use a dataset where the legitimate domain names are extracted from the Alexa top one million list (https://www.alexa.com/topsites, last access 16 February 2021), whilst the malicious domains are generated with 11 different malware families, such as zeus, cryptolocker, pushdo or conficker. The dataset can be downloaded from Andrey Abakumov's GitHub repository (https://github.com/andrewaeva/DGA, last access 16 February 2021). Raw data contain 1,000,000 normal and 800,000 malicious DGA domains. After balancing the classes, we randomly select a subset of 13,000 domains, 6500 for each class. From each class we use 800 domains for the I set, and the remaining 5700 domains per class are used to build the attribute generators (G set). Table 6 shows some examples of both DGA and normal domain names.

Results
A summary of our results on the DGA dataset can be seen in Table 7. As in previous experiments, we have carried out tests with different k values. The classification accuracy increases with k up to a point where it saturates. The best results are obtained for k = 80, with an accuracy of 0.94 and an AUC of 0.98. These values are better than those reported when using traditional methods, although they can be improved by using deep neural models such as recurrent neural networks [53]. Note however that we are using only a small subset of the original data to train the classifier.

Experiment 4-Sentiment Analysis in Twitter
For this experiment we use the Sentiment140 dataset (http://help.sentiment140.com, last access 16 February 2021) described in [54], which contains a training set with 1,600,000 tweets labeled as either positive or negative according to their sentiment. There are 800,000 positive and 800,000 negative tweets, collected between 6 April and 25 June 2009, and labeled using the emoticons contained within the message. Table 8 shows a sample of 5 positive and 5 negative tweets extracted from the training set. The dataset also contains a small test set with 498 tweets which were labeled manually, to be used for validation purposes.

Data Preparation
In our experiments we use only the tweets in the training set. Prior to our analysis we preprocess the data in order to remove both duplicated tweets and tweets that appear both as positive and negative. After this preprocessing stage we obtain a new training set with 787,956 tweets of each class. The whole text of the messages, without any further preprocessing, is used to characterize the tweets. The I and G sets are built as in previous sections. In particular we use 50,000 tweets of each class to build the attribute generators.
The remaining messages are used to train the classifiers after being characterized by their distance to the generators. In this case, due to the dataset size, we use an SVM classifier with a linear kernel.

Results
The results are shown in Table 9. The accuracies in the table are slightly worse than those reported in [54], but there are two important points to consider. First, the authors of [54] perform additional preprocessing of the tweets. In particular, they replace all usernames with the token USERNAME, replace all URLs with the keyword URL, and eliminate repeated letters in order to correct some uses of informal language usually present in tweets. In this article we have decided to omit these steps in order to show the generality of our approach. Second, their results are obtained on the test set, which contains only 498 tweets. Results on such a small dataset may be biased. In fact, when we evaluate our method on these data, we observe a systematic increase of both the accuracy and the AUC.

Experiment 5-Sentiment Analysis in Movie Reviews
In this last scenario we tackle a sentiment analysis problem in movie reviews. Its particularity is that the strings are of arbitrary length. Specifically, the average length of the movie reviews in the dataset is 1325 characters for the positive reviews and 1294 for the negative reviews (see Table 1). This contrasts with the Twitter problem, where the string length is limited to 140 characters. Besides, the use of language and grammar tends to be more formal, and abbreviations and emoticons are less widespread. These are fundamental differences that motivate the application of the proposed method to this problem.

Data Preparation
We use a public dataset from the Stanford NLP group (http://ai.stanford.edu/~amaas/data/sentiment/, last access 16 February 2021). It contains 50,000 movie reviews extracted from the Internet Movie Database (IMDb) (https://www.imdb.com/, last access 16 February 2021). Each review is a text string commenting on a movie, and is classified as either positive or negative. There are 25,000 reviews classified as positive and 25,000 classified as negative. Table 10 contains samples of both classes. From these raw data we use a subset of 5350 randomly chosen reviews (half positive, half negative). The I and G string groups are built as for previous problems. The I set contains 1600 movie reviews (800 positive and 800 negative), and the G set contains the remaining 3750 texts.
Table 10. Some examples of positive and negative movie reviews in the Stanford dataset.

Two Examples of Positive Reviews
1. I havent seen that movie in 20 or more years but I remember the attack scene with the horses wearing gas-masks vividly, this scene ranks way up there with the best of them including the beach scene on Saving private Ryan, I recommend it strongly.
2. Some people are saying that this film was "funny". This film is not "funny" at all. Since when is Freddy Krueger supposed to be "funny"? I would call it funnily crap. This film is supposed to be a Horror film, not a comedy. If Freddy had a daughter, would not that information have surfaced like in the first one!? The ending was also just plain stupid and cheesy, exactly like the rest of it.

Results
This experiment has been carried out using different k values as in previous sections. In this case the best results are obtained for k = 160 (acc. = 0.86, AUC = 0.93, see Table 11). Although these results do not improve on the best reported for the same dataset [55], they are quite competitive, even more so if we take into account that they have been obtained on a small subsample of the original data.

Conclusions
Information security calls for a comprehensive deployment of protection measures and detection controls [56]. Security logs are the core of such controls, since they enable event recording and attack characterization. Anomaly detection in security logs is one of the most relevant means to detect possible malicious activity. However, those security logs are derived from many different network and information flow modalities. Therefore, there is an urgent need to adopt mechanisms to process security information regardless of the concrete nature of each log [57]. In this vein, we have proposed a method that is able to process textual data from different sources using a common approach: the NCD is used to extract features that characterize the text, and an SVM is trained on these features to perform the classification.
To test the method, we have performed five experiments over four different domains. Our results are competitive in general, although for some problems we obtain a classification accuracy slightly below the best values reported in the literature. Nevertheless, it is worth noting that we are using a single procedure to address all the problems despite their disparate nature. In other words, we have sacrificed accuracy for adaptability. In addition, we are using much less data to train the classifiers than other proposals in the literature, since part of the available data are used to build the attribute generators. The use of all training data, following the approach in [58], could further improve the results presented here. Moreover, additional efforts are required to test the suitability of our methodology in multiclass classification problems in cybersecurity (such as those in [59,60]).
Another advantage of our method is that it requires neither a preprocessing step nor the manual construction of features from the data, and the hyperparameter tuning is minimal. The number of generated attributes, k, appears to be the most relevant parameter. In general, in all the problems under consideration we observed an increase in performance as k grows, saturating for k large enough. We have found that the optimal k values are related to the sliding window of the gzip compressor. When k is small, the size of the generator files is usually larger than this sliding window, and hence not all the information contained in the attribute generator files is used to characterize the string instances. On the other hand, increasing the number of generators by reducing their size below the sliding window size does not seem to further improve the classification accuracy. The use of other compressors has not been considered in the present study. Future work should be devoted to testing how different compression algorithms may affect our results.
All in all, our proposal leads to an adequate trade-off between adaptability and performance, and it can be interpreted as a complementary procedure in frameworks tackling the limitations of the "no free lunch" theorem [61] by the convenient integration of several anomaly detection methods. This integration demands an exhaustive study of how to improve the method by using different compression algorithms to calculate the NCD and estimate Kolmogorov complexity. Besides, it would be advisable to analyze the impact of using models other than the SVM to conduct classification in the k-dimensional space determined by our NCD-based methodology.
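The effect of the sliding window can be illustrated with a small experiment, offered here as a hedged sketch rather than as part of the reported study. It relies on the fact that DEFLATE, the algorithm behind gzip, can only reference data within the previous 32 KiB. The distance formula (C(g·s) − C(g)) / C(s) and all the strings below are assumptions made for the illustration.

```python
import gzip
import random


def compressed_size(data: bytes) -> int:
    """C(x): size in bytes of the gzip-compressed input."""
    return len(gzip.compress(data))


def distance(generator: bytes, text: bytes) -> float:
    """Assumed form of D(g, s): (C(g . s) - C(g)) / C(s)."""
    extra = compressed_size(generator + text) - compressed_size(generator)
    return extra / compressed_size(text)


rng = random.Random(0)
# A 300-character string of random letters and 40,000 random digits of filler.
# Letters and digits share no substrings, so only a copy of s helps compress s.
s = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(300)).encode()
filler = "".join(rng.choice("0123456789") for _ in range(40_000)).encode()

near = filler + s  # the copy of s sits at the very end of the generator
far = s + filler   # the copy of s lies more than 32 KiB before the end

d_near = distance(near, s)
d_far = distance(far, s)
print(f"near: {d_near:.2f}  far: {d_far:.2f}")
```

Both generators contain exactly the same characters, yet the distance is small only when the copy of s is still inside the window at the point where s is appended. Content located farther back than the window does not contribute to the attribute value, which is consistent with the observation that very large generator files are only partially exploited.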

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: