The Spatial Analysis of the Malicious Uniform Resource Locators (URLs): 2016 Dataset Case Study

In this study, we aimed to identify spatial clusters of countries with high rates of cyber attacks directed at other countries. The cyber attack dataset was obtained from the Canadian Institute for Cybersecurity and contains over 110,000 Uniform Resource Locators (URLs), each classified into one of 5 categories: benign, phishing, malware, spam, or defacement. The disease surveillance software SaTScan™ was used to perform a spatial analysis of the country of origin for each cyber attack. It allowed the identification of spatial and space-time clusters of locations with unusually high counts or rates of cyber attacks. The number of internet users per country, obtained from the 2016 CIA World Factbook, was used as the population baseline for computing rates and for the Poisson analysis in SaTScan™. The clusters were tested for significance with a Monte Carlo study within SaTScan™, where any cluster with p < 0.05 was designated as a significant cyber attack cluster. Results using the rate of the different types of malicious URL cyber attacks are presented in this paper. This novel approach of studying cyber attacks from a spatial perspective provides an invaluable relative risk assessment for each type of cyber attack that originated from a particular country.


Introduction
The use of the internet has grown continuously in recent years as we become more dependent on computer networks and infrastructure in the "connected" digital age [1]. This growth, however, leads to an increase in cyber attacks, potential platforms for fraud, and vulnerability to identity theft [2]. As a result, it is crucial for security systems to be up to date against scams, malware (malicious software), spam, and phishing attacks [3]. Although one solution is to focus on the detection [4] and classification [5] of cyber attacks, this will not be sufficient, considering that these attacks happen globally [6]. There is an evident need for a deeper understanding of cyber attacks in terms of spatial analysis [7]. Such an analysis can provide not only a unique perspective on the origins and victims of cyber attacks but also insight into the potential risk of regions becoming targets of cyber attacks [8].
One of the most important subjects in cybersecurity is the detection of cyber attacks to prevent possible threats and disruption. The literature presents various techniques and approaches for the detection and identification of cyber attacks, including k-means clustering [9], statistical classifiers [10], adversary threat modeling [11], Bayesian networks [12], non-attribution based anomaly detection [13], Schmitt analysis [14], bioinspired immunological metaphors [15], simulated annealing [16], genetic algorithms [17], fuzzy logic [18], deep neural networks [19], and machine learning [20]. A few studies on the spatial analysis of cyber attacks are worth mentioning. Merien et al. presented an entropy-based model that represents cyber attacks as paths from a source to a target [21]. Their results revealed a pattern in cyber attack origins that could be used to categorize internet attacks [21]. To characterize attack patterns, Chen et al. introduced predictability measures using a probability matrix [22]. They showed the correlation between large-scale cyber attacks and their predictabilities [22]. Geographic Internet Protocol (GEO-IP) based analysis was presented by Hu et al. [23]. In their study, GEO-IP was used to detect the location of the cyber attack origin, and advanced spatial statistical analysis was then used to explore cyber attack patterns [23]. Lin et al. presented a lexical analysis approach for detecting malicious Uniform Resource Locators (URLs) [24]. According to their results on big data, 90% of malicious URLs could be detected using two-step filtering of the data. Another approach based on URL classification was introduced by Feroz and Mengel [25]. They clustered URLs to divide them into various categories, which the classification model then used to predict cyber attacks [25].
Du and Yang followed a slightly different approach for the detection and classification of cyber attacks: grouping sources as coordinated teams [26]. Their approach provided an additional perspective on high impact attacks compared to trivial statistical approaches [26]. Furthermore, Koike et al. focused on the visualization of cyber threats using a 2-D IP address matrix [27]. They claimed that the presented visualization framework could enhance detection algorithms in, for instance, worm propagation models [27].
Although these studies exist, little attention has been given to the spatial analysis of cyber attacks, especially malicious URLs, in terms of the physical geolocation of the attacks. Most studies in the literature represented cyber locations with IP addresses instead of geo-locations, which makes advanced analysis challenging if additional covariates are to be added to find complex relations between attacks and other variables. Moreover, previous studies looked only into trivial source-target relations in cyber attacks, whereas risk analysis could be performed with a state-of-the-art approach, such as applying an epidemiological perspective. The contributions of this study can be summarized as answers to the following questions:

• Which nations are the top for origins, by count, of cyber attacks?
• Which nations are the top for origins, by relative risk, of cyber attacks?
• Do the spatial hotspots for cyber attack origins differ from the spatial relative risk hotspots for cyber attacks?
The main differentiator of this paper from the annual reports on malware published by commercial entities is the in-depth analyses, both visual and textual, that are provided. While the annual reports offer a cursory description of the data gathered, this paper dives deep into the numbers and offers meaningful insight into the state of cyber attacks delineated by geographical location.
The rest of the paper is organized as follows: Section 2 describes the background of the data used, as well as software and pre-data analysis details, while the methodology used is explained in Section 3. Section 4 provides information about analysis design, results, and discussions. In the final section (Section 5), conclusions and future directions are presented.

Data Collection
In this study, we utilize URL data from 2016. A URL is the fundamental network identification for any resource connected to the web. A URL consists of five parts: Scheme, Subdomain, Second-level Domain, Top-level Domain, and Subdirectory. The data contain 5 different URL categories, namely (i) benign, which covers safe websites with normal services, (ii) phishing, in which a website attempts to obtain information, such as usernames, passwords, and credit card details, by masquerading as a trustworthy entity in an electronic communication, (iii) malware, which is created by attackers to disrupt computer operation, gather sensitive information, or gain access to private computer systems, (iv) spam, which is the act of spreading unsolicited and unrelated content, and (v) defacement, which is the exploitation of techniques to alter the content of web pages by a suspicious user.
We used cyber attack data obtained from the Canadian Institute for Cybersecurity (CIC) [28], and the details of the dataset can be found in Reference [8]. Mamun et al. developed a lexical analysis to detect and categorize malicious URLs. The data came from disparate internet sources: web crawlers, a web spam dataset repository (from the Web Laboratory of the University of Milan), the open source OpenPhish website (a widely recognized global website for phishing information), the DNS-BH website of malware collections, and a random selection of URLs from a website with a list of defaced URLs. The fact that these data came from multiple, properly categorized, and publicly available repositories that are open for inspection and close scrutiny provides ample evidence to support the credibility of the dataset. The result was a categorization of over 110,000 URLs from the year 2016. We used the resulting categorized data to perform a spatial analysis of the country of origin for each type of attack. One limitation of this study is that the authors did not attempt to identify or correct any errors in the identification or categorization of potentially malicious URLs made in the original data. The data from the CIC are publicly available, and the authors downloaded a dataset titled "URL dataset (ISCX-URL-2016)" [28].
The URL dataset (ISCX-URL-2016) contains over 35,000 URLs identified as benign, and those URLs were not included in this paper's spatial analysis because they are safe websites with normal services. Because our study is focused on cyber attacks, we deliberately eliminated those 35,000 benign URLs so as not to introduce data that could lead to misinterpretation. Further, in order to normalize the data and put the study in the light of internet activity, we use the population and the estimated number of internet users for that year as bases for the model analysis. Additionally, there were approximately 12,000 URLs categorized as spam, 10,000 URLs categorized as phishing, 11,500 URLs categorized as malware, and 45,500 URLs categorized as defacement [28]. In the covariate analysis, we used the population of each country; the population and the estimated number of internet users per country were obtained from the 2016 CIA World Factbook, which is publicly available [29]. Three countries were missing values for internet users (Afghanistan, Ascension, and South Sudan); these values were filled using online wiki sources and cross-checked against other countries with a similar ratio of population to internet users. The number of internet users per country was used as the population baseline for computing rates and for the Poisson model analysis in SaTScan™ [30]. Based on the raw data, a ranking by country of origin for each type of cyber attack is presented in Table 1. It should be noted that the countries are listed in rank order based on the raw number of cyber attacks in the dataset [28]; that is why not all countries are listed in the table. Further, the identified clusters in the following sections may include one or more countries.

Software and Tools Used
In our analysis, we used SaTScan™, a free, open source software package that analyzes spatial, temporal, and space-time data using statistical probability models. It is designed for any of the following interrelated purposes:
• Perform geographical surveillance of disease, to detect spatial or space-time disease clusters, as well as to see if they are statistically significant.
• Test whether a disease is randomly distributed over space, over time, or over space and time.
• Evaluate the statistical significance of disease cluster alarms.
• Perform prospective real-time or time-periodic disease surveillance for the early detection of disease outbreaks.
Although SaTScan™ can also be used for temporal and space-time data, we performed only the purely spatial analysis because the data cover a single year. We chose SaTScan™ for the surveillance because it relies on statistical modeling with probability models, in which hypothesis testing with p-Values identifies significant clusters that have only a small chance of arising from random causes. While elliptically shaped clusters could have been used here, the results are very similar when using circular windows to identify clusters. It is understood that clusters may include countries with low cyber attack counts, but the overall counts within a cluster will be unusually high. It is possible to re-analyze each identified cluster with SaTScan™ to identify smaller clusters. The likelihood function of the Poisson model used in our analyses is proportional to

$$\left(\frac{n}{E}\right)^{n}\left(\frac{N-n}{N-E}\right)^{N-n} I(n > E)$$

where n is the number of cyber attack counts within the scan window, N is the total number of internet users in the population, E is the expected cyber attack counts under the null hypothesis, and I(n > E) is an indicator that equals 1 when the window contains more attacks than expected and 0 otherwise. There are several other cluster analysis algorithms in the literature, such as DBSCAN, and there are many libraries available on the internet in different programming languages, for instance Python, R, or MATLAB. It is understood that no cluster analysis software is better than all other cluster analysis software packages. Each major cluster analysis software package has some advantages for specific applications. We selected SaTScan™ because, unlike DBSCAN, it allows the statistical significance of each cluster to be assessed. While SaTScan™ has been widely used with epidemiology data, it is perfectly fine to use this surveillance software with data from applications that have nothing to do with epidemiology.
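The Poisson scan statistic for a single window can be sketched in a few lines of Python. This is an illustrative sketch, not the SaTScan™ implementation. It follows the standard Kulldorff form of the statistic, in which the total that appears in the ratio is the total number of observed attacks and the internet-user population enters through the expected count; that reading is an assumption here. The function names and the example numbers are hypothetical.

```python
import math


def expected_count(total_attacks: float, window_users: float, total_users: float) -> float:
    """Expected attacks in a window if risk is proportional to internet users."""
    return total_attacks * window_users / total_users


def poisson_log_likelihood_ratio(n: float, expected: float, total_attacks: float) -> float:
    """Log of (n/E)^n * ((N-n)/(N-E))^(N-n) for a high-rate window.

    Returns 0.0 when the window does not hold more attacks than expected,
    so that only high-rate windows are scored.
    """
    if n <= expected:
        return 0.0
    inside = n * math.log(n / expected)
    remaining = total_attacks - n
    outside = 0.0
    if remaining > 0:
        outside = remaining * math.log(remaining / (total_attacks - expected))
    return inside + outside


# Hypothetical numbers: a window holding 5% of all internet users
# but 10 of the 100 attacks in the study region.
e = expected_count(total_attacks=100, window_users=5_000_000, total_users=100_000_000)
llr = poisson_log_likelihood_ratio(n=10, expected=e, total_attacks=100)
print(f"E = {e:.1f}, LLR = {llr:.3f}")
```

The window with the largest value of this statistic over all candidate windows would be the most likely cluster.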
The statistical modeling used in this software is based on probability distributions, such as the Normal distribution or the Poisson distribution; alternatively, nonparametric modeling can be chosen. In our cyber attack study, we used counts of cyber attacks as the random variable of interest. The likelihood function of the Poisson distribution includes the number of internet users. In this context, the "disease" was considered to be the different types of malicious URL cyber attacks. We used SaTScan™ to perform a purely spatial Poisson statistical analysis to identify significant clusters of cyber attacks by type and country of origin. Settings in SaTScan™ were used to search for high-rate clusters only, with a restriction of at least 3 cases per potential cluster. Further, high-rate clusters were restricted to a relative risk greater than or equal to 1.2, and only the most likely clusters were reported using hierarchical clustering. These are countries with cyber attack rates that are at least 20% higher than in the rest of the world. It should be noted that this paper does not claim to use a model of the spatial spread of pathogens to model the behavior of cyber attacks. Although the SaTScan™ software is used primarily for performing space-time disease cluster analyses and testing whether a disease is randomly distributed over time, we used its capability to perform a purely spatial Poisson statistical analysis to identify significant clusters of cyber attacks.
Additionally, we used ArcMap, which is a licensed software tool from ArcGIS that represents geographic information as a collection of layers and other elements in a map. ArcMap was used to display the clusters found by SaTScan™ and to produce two types of heat maps. The first type of heat map displays the different malicious URL cyber attacks by country based on rate (attacks per internet user). The second type of heat map displays the relative risk and clusters obtained from SaTScan™. A 2015 Tiger Shapefile from ArcGIS was used as a base map that was then manipulated with joined data to present heat maps by country and clusters when appropriate.

Data Manipulation
Before the spatial analysis, we first processed the data to extract geo-locations, using the URL information to map each URL to a physical geographic location. Further, we normalized the data. The values of the covariates were adjusted for population (where appropriate) and normalized by the Blom method [31]. This was done so that variation would be preserved within each variable, but the variation between variables would be minimized. Normalizing prevents variables that are measured in larger units from having a disproportionately large impact on the model. Normalizing also lessens the effect of outliers. Finally, for each country and type of malicious URL cyber attack, a rate was calculated by dividing the count of each type of attack by the country's number of internet users. The number of internet users was also used as the population file in SaTScan™.
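As an illustration of these two steps, the sketch below computes a per-country rate and then applies Blom's rank-based normal scores, which map a rank r among n values to the standard normal quantile of (r - 3/8)/(n + 1/4). Applying the scores to the rates, the average-rank handling of ties, and all country names and numbers are assumptions made for the example, not details taken from the study.

```python
from statistics import NormalDist


def attack_rate(count: float, internet_users: float) -> float:
    """Attacks of one type per internet user in a country."""
    return count / internet_users


def blom_scores(values: list[float]) -> list[float]:
    """Rank-based normal scores using Blom's formula (r - 3/8) / (n + 1/4)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over a run of tied values and give them their average rank.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]


# Hypothetical counts and internet users for three placeholder countries.
data = {"AAA": (120, 2_000_000), "BBB": (15, 9_000_000), "CCC": (40, 1_500_000)}
rates = [attack_rate(count, users) for count, users in data.values()]
for code, score in zip(data, blom_scores(rates)):
    print(code, round(score, 3))
```

Because the scores depend only on ranks, a single extreme rate cannot dominate the scale, which is the outlier-dampening effect described above.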

Methodology
The disease surveillance software SaTScan™ was used to identify significant clusters of high cyber attack activity around the world. Specifically, SaTScan™ was used to perform a spatial scan for clusters of high counts of the different types of malicious URL cyber attacks. It allows the user to choose among various models for analysis; the Poisson model was selected for the cyber attack study since it is more appropriate for (rare) count data. The malicious URL cyber attack counts by country were used as the SaTScan™ variables of interest in the analysis. Country ID codes were used to relate the counts to the relevant country, population, latitude, and longitude. The analysis was performed with the "purely spatial" method (vs. temporal) using hierarchical clusters, with no geographical overlap (meaning that clusters are reported only if they do not overlap with a previously reported cluster), and with the clusters required to have a relative risk of at least 1.2 and to contain at least 3 cases. Please refer to the SaTScan™ documentation for more details about these settings [30].
SaTScan™ performs a cluster analysis as follows: a dynamic geographic unit, or "moving window," is systematically scanned across the study region, and the observed variable counts within it are compared to the expected counts (recall that we are using the discrete Poisson model to analyze counts). For this study, each country serves as the initial geographic unit used by SaTScan™ as a potential cluster center. For a cluster, a circle of varying size (from 0 up to 3500 km) is analyzed around each country with higher than expected counts. The maximum size of a cluster was a circle containing 25% of the population at risk, not to exceed 3500 km. The 3500 km upper limit was used to keep clusters from extending to the polar regions, which caused unusual patterns when transferred to a flat projection map. The null hypothesis is that the number of cases in each area is proportional to its population size; the alternative hypothesis is that there is an elevated risk within the window.
Clusters are reported for those circles where the observed counts are "much" greater than the expected values. To identify clusters, a likelihood function is maximized across all locations. The cluster with the maximum (largest) likelihood function is the cluster that is least likely to have occurred by chance. This cluster is identified as Cluster 1. Once Cluster 1 has been calculated, secondary clusters are calculated and ranked after Cluster 1 in decreasing order of their likelihood ratio test statistic. Clusters can be statistically significant or not, based on a p-Value obtained from SaTScan™ via Monte Carlo hypothesis testing. For this study, we reported only clusters that were significant at the α = 0.01 significance level. SaTScan™ provides detailed information about each cluster, some of which is listed below.
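The moving-window step can be sketched as follows. This is a simplified illustration, not the SaTScan™ algorithm itself. It assumes each country is represented by a single latitude/longitude point, uses great-circle (haversine) distance, and grows a window around one center until either the 3500 km radius or the 25% population limit would be exceeded. The dictionary layout and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def candidate_windows(center, countries, max_radius_km=3500.0, max_pop_share=0.25):
    """Yield growing lists of country codes around `center`.

    `countries` maps a code to {"lat": ..., "lon": ..., "users": ...}.
    Growth stops at the radius limit or the population-share limit.
    """
    total_users = sum(c["users"] for c in countries.values())
    origin = countries[center]
    by_distance = sorted(
        (haversine_km(origin["lat"], origin["lon"], c["lat"], c["lon"]), code)
        for code, c in countries.items()
    )
    window, users = [], 0
    for distance, code in by_distance:
        if distance > max_radius_km:
            break
        users += countries[code]["users"]
        if users > max_pop_share * total_users:
            break
        window.append(code)
        yield list(window)


# Hypothetical points on the equator, with made-up internet-user counts.
demo = {
    "A": {"lat": 0.0, "lon": 0.0, "users": 10},
    "B": {"lat": 0.0, "lon": 10.0, "users": 10},
    "C": {"lat": 0.0, "lon": 60.0, "users": 10},
    "D": {"lat": 0.0, "lon": 20.0, "users": 10},
    "E": {"lat": 0.0, "lon": 120.0, "users": 60},
}
print(list(candidate_windows("A", demo)))
```

Each yielded window would then be scored with the likelihood function, and the procedure repeated with every country as the center.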

• Location IDs: the geographic center and a list of countries that belong to the cluster.
• Population: the number of internet users in each cluster.
• Observed/expected: the observed number of cases within the cluster divided by the expected number of cases within the cluster (under the null hypothesis that risk is the same inside and outside the cluster). Put another way, this is the estimated risk within the cluster divided by the estimated risk for the study region as a whole.
• Relative risk: the estimated risk within the cluster divided by the estimated risk outside the cluster. It is calculated as the observed divided by the expected within the cluster divided by the observed divided by the expected outside the cluster.
• p-Value: the probability of obtaining the observed (or a greater) number of cases in a cluster if the risk were the same as it is outside the cluster.
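The reported quantities can be illustrated with a short sketch. It assumes that the expected counts over the whole study region sum to the total number of observed attacks, so the observed and expected counts outside the cluster are obtained by subtraction, and it uses the usual Monte Carlo rank formula for the p-Value. The function names and the numbers are hypothetical.

```python
def observed_over_expected(n: float, expected: float) -> float:
    """Risk inside the cluster relative to the study region as a whole."""
    return n / expected


def relative_risk(n: float, expected: float, total_attacks: float) -> float:
    """(observed/expected inside) divided by (observed/expected outside)."""
    inside = n / expected
    outside = (total_attacks - n) / (total_attacks - expected)
    return inside / outside


def monte_carlo_p_value(observed_stat: float, simulated_stats: list[float]) -> float:
    """Rank of the observed statistic among the simulated ones, divided by (replications + 1)."""
    rank = 1 + sum(1 for s in simulated_stats if s >= observed_stat)
    return rank / (len(simulated_stats) + 1)


# Hypothetical cluster: 10 observed attacks, 5 expected, 100 attacks overall.
print(observed_over_expected(10, 5))
print(relative_risk(10, 5, 100))

# Hypothetical test statistics from 999 random replications under the null.
simulated = [0.01 * i for i in range(999)]
print(monte_carlo_p_value(12.0, simulated))
```

With 999 replications the smallest attainable p-Value is 1/1000, which is why the number of replications bounds the significance levels that can be reported.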
Heat maps were used to visualize geographic data patterns, especially in conjunction with cluster maps. Since our research objective was to study the origination of malicious URL cyber attacks, we created heat maps by country for both the rate of attack and relative risk. The clusters obtained from SaTScan™ were overlaid on these heat maps with ArcMap.

Results and Discussion
The results obtained using the data and methodology described above are presented in this section. Results using the rate of the different types of malicious URL cyber attacks are presented first, followed by results using relative risk. Both sets of results are similar.

Type of Cyber Attacks by Rate
The following data are presented as normalized rate data for each type of cyber attack. The normalized rates correspond to map colors as follows:
• "Red" represents normalized rates at or above 1.18 standard deviations above the mean.
• "Orange" represents normalized rates 0.38 to 1.17 standard deviations above the mean.
• "Yellow" represents normalized rates centered on the mean (plus or minus 0.37).
• "Light Green" represents normalized rates 0.38 to 1.17 standard deviations below the mean.
• "Dark Green" represents normalized rates more than 1.18 standard deviations below the mean.
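The color classes can be written as a small lookup. The listed break points are rounded to two decimals, so this sketch assumes half-open intervals that close the small gaps between classes (for example, between 1.17 and 1.18) and treats the two outer classes symmetrically; the function name is hypothetical.

```python
def color_class(normalized_rate: float) -> str:
    """Map a normalized rate (in standard deviations from the mean) to a map color."""
    if normalized_rate >= 1.18:
        return "Red"
    if normalized_rate >= 0.38:
        return "Orange"
    if normalized_rate > -0.38:
        return "Yellow"
    if normalized_rate > -1.18:
        return "Light Green"
    return "Dark Green"


for z in (1.5, 0.5, 0.0, -0.5, -2.0):
    print(z, color_class(z))
```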

Defacement
First, we look into the defacement cyber attack type in our analysis. According to the results depicted in Figures 1-3 and Table 2, the highest rates of defacement attacks originate in Europe. Each rate is defined as (cyber attack counts)/(internet users count). All of the world's highest normalized rates ("red" countries) are found in Europe, as are most of the countries with normalized rates above the mean ("orange" countries). Outside of Europe, Turkmenistan, Australia, and the United States are the only countries with above average rates of defacement-type malicious URL attacks. Turkmenistan is the only country outside of Europe to be placed in the top 10 countries when ordered by normalized rates. It is also worth noting the importance of using rate data. If simple count data had been used, 8 of 10 countries in the table would have been replaced.

Malware
Second, our analysis focuses on malware-type cyber attacks. The data show a much greater diversity in the normalized rate of malware-type attacks than occurred with defacement (Figures 4 and 5 and Table 3). North America accounts for all of the highest normalized rates except for Brunei. The Cayman Islands, British Virgin Islands, and Brunei are hard to depict in the figure due to their relatively small size and the scale of the map. Even though normalized rates are used for rankings, seven countries would still be placed in the top 10 based on count data alone.

Phishing
Third, we look into phishing cyber attacks. Even more than for malware, the country of origin for phishing-type attacks is diverse, and there is no geographic relationship when using normalized rates (Figures 6 and 7 and Table 4). Four of the ten highest rate countries would also be present in a list of the top 10 countries by number of phishing-type attacks. Nine of the ten countries listed for phishing-type attacks are in the highest rate category ("red"); this is the highest percentage of any type of attack.

Spam
Finally, we analyze spam-type cyber attacks in the data. The results show that all of the highest rate countries are located in Europe or the United States (Figures 8 and 9 and Table 5). It should be noted that cluster analysis was not performed for spam attacks (Figure 9), as we only had data for 8 countries. For the first time in our analysis, there is only one country (Ireland) that has the highest level of normalized rate, while two other countries show slightly elevated rates. Moreover, also for the first time, the list of top rates contains countries with a below average normalized rate. Spam-type attacks, when analyzed by rate, appear to be confined to only 3 countries with higher than average rates.

Cyber Attacks by Relative Risk with Clusters
To further the analysis, we look into the relative risk of cyber attacks with clusters. The resulting maps and tables for the relative risk analysis are given in this section. The relative risk values, along with the clusters, were obtained from SaTScan™ for each type of cyber attack. Relative risk can be interpreted as "x times more likely than expected" to have a cyber attack originate from that country. Since SaTScan™ calculates different expected counts for each type of attack, the relative risk and break points will be different for each type of cyber attack. The color scheme is explained in the legend on each map.

Defacement
According to the results given in Tables 6 and 7, the highest relative risks of defacement attacks originate in Europe. The highest risk cluster, which includes almost all of Europe (32 countries), has a relative risk 11.2 times greater than expected based on internet users and attack counts. The next five clusters all have relative risks between 2.4 and 5.4. While Russia and Oman were declared as clusters, the p-Value for those clusters indicates they should not be included. The clusters and relative risk align well with those found using normalized rate.

Malware
Subsequently, we analyze malware with relative risk. Malware results show two countries with a relative risk over 150 times greater than expected: the Cayman Islands and Brunei (Figure 10 and Tables 8 and 9). These two countries were also the top 2 in normalized rates of malware-type cyber attacks. Two other countries, the U.S. and Hong Kong, also had a higher than expected relative risk, each being approximately 27 times more likely to originate a malware attack. Four countries were approximately twice as likely to have malware attacks originate ("yellow" countries). After this, the relative risk falls rapidly to below 1.0, which indicates a lower likelihood than expected.
Figure 10. Country of origin for malware attack by relative risk.

Phishing
We then analyze phishing attacks with relative risk, and the results show some differences between relative risks and rates. France and Hong Kong replaced South Africa and the British Virgin Islands on the list of top 10 countries as the origin of a phishing attack (Figure 11 and Tables 10 and 11). There is one large cluster that covers 23 countries in Europe, while the other six clusters are all single-country clusters. The United States is the most likely cluster and has the second-highest relative risk (21 times more likely than expected), after the Seychelles (39 times more likely than expected).

Spam
The last analysis with relative risk is for the spam attack type. The data for spam-type cyber attacks contain observations for only 8 countries. Just as when analyzed by normalized rate, three countries, Ireland, the UK, and the U.S., are the only countries where spam attacks originate at a higher than expected rate (Figure 12 and Tables 12 and 13). One cluster includes Ireland, the UK, and the Isle of Man. This cluster has a relative risk nearly 240 times greater than expected. The individual countries of Ireland (165 times) and the UK (123 times) have relative risks an order of magnitude greater than any other country. The U.S. is the only other country with a relative risk greater than expected, at 3.4 times. From the maps and data, it is apparent that spam attacks are not as common as other forms of cyber attack and that their origination is limited to a small number of countries.

All Cyber Attack Types Combined
Finally, we run the analysis for countries with all cyber attack types combined. Figure 13 and Tables 14 and 15 give a summary and categorization of all types of cyber attacks combined and analyzed as a group. The count data are simply the total of all 4 individual types of cyber attacks. The SaTScan™ analysis and heat map followed the same procedures.

Conclusions and Future Directions
In this study, our goal was to analyze cyber attack data obtained from the Canadian Institute for Cybersecurity (CIC) and to identify significant clusters of different cyber attack types based on country of origin by looking into Uniform Resource Locators (URLs). The data from the CIC contain over 110,000 URLs with 4 cyber attack types: phishing, malware, spam, or defacement. We performed a spatial analysis using SaTScan™, along with the number of internet users per country. We presented cluster analysis results in two categories: cyber attack type by rate per country, and cyber attack clusters by relative risk. Our results provide not only a geo-physical representation of the cyber attacks but also a novel perspective on these attacks as "hotspots" for cyber attack origins. To close, we summarize the important contributions of this work:
• demonstrate the feasibility of visual analytics as a cybersecurity tool;
• enable the realization that cybersecurity data analysis could be approached from multiple perspectives;
• provide the base framework for more advanced and enhanced spatial cluster analytics tools; and
• provide recognition of the need for reliable data that can be used for analytics.