4.2. Interpretation of Results
The results of the analysis illustrate the complexity and diversity of network traffic behavior. They address RQ1 by identifying principal trends and patterns in cyberattacks throughout the examined period, and they support Hypothesis (H1), which posits an escalation in the sophistication and target specificity of cyberattacks.
Classifying network traffic data into distinct clusters reveals variability in network behavior patterns. Understanding these patterns is fundamental to developing robust security measures against increasingly sophisticated threats [41]. Each cluster exhibits distinct network characteristics and therefore calls for specialized preventative and responsive measures to maintain network security effectively as attacks become more specific.
The time-series analysis captures the temporal patterns in attack counts and identifies periods of unusual activity. This unusual activity could be attributed to increasing cyberattack sophistication, as suggested in H1 [43]. Notably, the spikes in attack counts in July 2017, December 2018, and October 2019 emphasize the need for a temporal approach to network security as attack techniques evolve, in line with RQ1.
The behavior score, ranging from 0 to 136, is a quantifiable measure of potential anomalies [39]. It serves as a tool to quantify the increasing sophistication and target specificity of cyberattacks. Validation of these scores has supported their effectiveness as reliable indicators of abnormal behavior.
Geographical and autonomous system data are critical to understanding the sources of network anomalies [38]. The higher frequency of abnormalities originating from the United States and Germany, and from specific autonomous systems, namely “DigitalOcean-ASN”, “F3 Netze e.V.”, and “Zwiebelfreunde e.V.”, suggests that these regions and systems need close monitoring, given the evolving nature and increasing specificity of cyberattacks outlined in RQ1 and Hypothesis (H1).
Despite these findings on the nature and origins of network anomalies, it is important to acknowledge that the behavior score indicates the likelihood of abnormalities but does not identify the exact type or severity of an anomaly. This limitation is consistent with Hypothesis (H1)’s proposition of escalating sophistication, considering that emerging attacks may diverge from recognized patterns [39]. As such, further research should aim to augment the current methods with procedures to identify the exact nature and potential implications of detected anomalies, particularly as threats evolve to be more intricate and targeted [41].
In conclusion, these results underscore the multifaceted nature of network traffic and the necessity for a comprehensive approach to network security. As suggested by RQ1 and Hypothesis (H1), the dynamic nature of cyberthreats calls for a multi-pronged approach that integrates temporal, geographical, and autonomous system data with a quantitative measure of behavior [40]. These elements should all be considered to effectively identify and address network anomalies in an ever-evolving threat landscape.
4.3. Data Collection and Preprocessing
The data collection and preprocessing procedures undertaken in this study enhanced the reliability and validity of the findings. Before the analysis, the data were cleaned, normalized, and transformed to ensure consistency. Feature extraction and data transformation techniques were then applied to extract relevant information and prepare the dataset for subsequent analysis, contributing to the robustness of the study’s outcomes.
The findings from this dataset reveal a complex and multifaceted cyberthreat landscape. The striking disparities in the origin of attacks highlight the global nature of the cyberthreat, pointing toward the need for enhanced international cooperation and coordination in addressing cyberthreats. However, it is also important to note that these disparities may be influenced by various factors, including the digital infrastructure, policies, and practices in different regions, as well as the ability of attackers to disguise their actual location.
These insights underscore the need for continuous monitoring and analysis of cyberactivities and for developing effective and adaptive strategies to mitigate cyberthreats. This study demonstrates the advantage of such comprehensive data collection and preprocessing efforts in generating critical insights that can inform policy and practice in cybersecurity.
4.3.1. Descriptive Analysis
The observed daily frequency of approximately 45,741 entries and the peak of 888,203 attacks in a single day reveal the scale and intensity of cyberthreats. The sporadic non-attack days, such as 16 November 2016, suggest periods of relative calm or a shift in attack strategies. These patterns underscore the dynamic nature of the cyberthreat landscape, requiring constant vigilance and adaptive responses.
The analysis underscores the erratic and volatile nature of cyberattacks, with daily counts varying widely over the six years. The high degree of variation and the skewed distribution highlight the challenge of predicting and preparing for cyberthreats. Days with no recorded attacks are rare (17 out of 2191 days), reinforcing the persistent nature of the cyberthreat landscape.
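The daily statistics discussed above (mean, peak, and the number of zero-attack days) can be reproduced with a computation along the following lines. This is a minimal sketch, not the study's code: the function name `daily_summary`, the input layout (a mapping from date to entry count), and the toy values are assumptions made for illustration.

```python
from datetime import date, timedelta
from statistics import mean, median


def daily_summary(counts_by_day, start, end):
    """Summarise daily attack counts over the inclusive range [start, end].

    counts_by_day maps a date to the number of log entries on that date.
    Dates missing from the mapping are treated as zero-attack days.
    """
    n_days = (end - start).days + 1
    series = [counts_by_day.get(start + timedelta(days=i), 0) for i in range(n_days)]
    return {
        "days": n_days,
        "zero_days": sum(1 for c in series if c == 0),
        "mean": mean(series),
        "median": median(series),
        "max": max(series),
    }


# Toy data (hypothetical values, not taken from the honeypot log).
toy = {
    date(2020, 1, 1): 120,
    date(2020, 1, 2): 95,
    date(2020, 1, 4): 4000,  # a single spike pulls the mean far above the median
    date(2020, 1, 5): 110,
}
summary = daily_summary(toy, date(2020, 1, 1), date(2020, 1, 5))
print(summary)
```

A mean far above the median, as in the toy data, is the signature of the skewed distribution described in the text.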
The marked distribution disparity points towards the global nature of cyberthreats, highlighting the necessity for international cooperation to mitigate these threats effectively. However, it is essential to remember that these distribution disparities might only partially represent the actual origin of the attacks, as cybercriminals often obscure their actual locations.
The descriptive analysis of the honeypot log presents a quantitative understanding of the cyberthreat landscape. The observed distribution disparities, peak activity, and periods of calm comprehensively depict cyberactivities. This study lays the groundwork for further analysis and interpretation of cyberthreats, emphasizing the importance of data-driven strategies to strengthen cybersecurity. The findings underscore the dynamic and complex nature of the cyberthreat landscape, reiterating the need for robust and adaptive cybersecurity measures informed by meticulous data analysis.
4.3.2. Temporal Analysis
The temporal analysis yielded a critical understanding of the cyclical trends in cyberattacks. The marked peaks in July 2017 and October 2019, followed by an overall increase in attack volumes from late 2019 onwards, point to an evolving and escalating cyberthreat landscape. These patterns suggest that cyberthreats are becoming more sophisticated and targeted, aligning with the initial Hypothesis (H1) that cyberattacks show a marked increase in sophistication and target specificity over time.
However, it is essential to consider the possibility of attack automation and an overall increase in Internet activity contributing to these high volumes. The variations in attack volumes could also indicate changing attacker tactics, advancements in detection methods, or the influence of global events. Consequently, these temporal trends necessitate ongoing evaluation to adapt and update cybersecurity measures in response to the evolving threat landscape.
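One way to make the notion of a "peak month" concrete is to aggregate daily counts by month and flag months that lie far above the typical level. The sketch below uses a median plus k times the median absolute deviation (MAD) rule. That rule, the threshold `k`, the function names, and the toy data are assumptions for illustration; the study does not state how its peaks were identified.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def monthly_totals(daily_counts):
    """Sum daily counts into (year, month) buckets."""
    totals = defaultdict(int)
    for day, count in daily_counts.items():
        totals[(day.year, day.month)] += count
    return dict(totals)


def flag_spikes(totals, k=3.0):
    """Return months whose total exceeds median + k * MAD (an assumed rule)."""
    values = list(totals.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    threshold = med + k * mad
    return sorted(month for month, v in totals.items() if v > threshold)


# Toy data (hypothetical values, not taken from the honeypot log).
toy_daily = {
    date(2021, 1, 10): 60,
    date(2021, 1, 20): 40,
    date(2021, 2, 5): 110,
    date(2021, 3, 5): 90,
    date(2021, 4, 5): 105,
    date(2021, 5, 5): 900,
}
totals = monthly_totals(toy_daily)
spikes = flag_spikes(totals)
print(spikes)
```

A robust rule such as median and MAD is used here because a few extreme months would inflate a mean and standard deviation and could mask themselves.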
The findings emphasize the importance of continual monitoring, evolution, and adaptation of cybersecurity strategies to detect and mitigate threats effectively. The study substantiates the growing significance of data-driven approaches to understanding and addressing the complexities of cyberthreats in the evolving digital era.
4.3.3. Correlation Analysis
The moderate to high correlations observed between the source AS numbers, corporate names, and numerous other indicators of malicious Internet activity suggest potential associations within the parameters studied. Such meaningful relationships may assist in predicting and identifying malicious activity based on known patterns. However, it must be emphasized that correlation does not imply causation, thereby necessitating further examination to ascertain causal relationships between these variables.
In interpreting these correlations, one could hypothesize that attackers preferentially use specific AS numbers, as indicated by the high correlations. However, additional factors, such as the nature of the organization and its Internet traffic, the network infrastructure, and other contextual factors, could influence these correlations. Considering these variables in future investigations would therefore be crucial to better validate and understand the observed correlations.
The study underscores the necessity for a cautious interpretation of these correlations and the importance of further research to establish causal links. These findings highlight the potential of data-driven, statistical approaches to augment understanding and predict cyberthreats, contributing to more efficient and proactive cybersecurity strategies.
4.3.4. Geographic Analysis
The geographic distribution of cyberattacks offers crucial insights into the patterns of malicious cyberactivity. The significant fraction of cyberactivities originating from the United States, Russia, and China could indicate several factors, including technological advancement, economic influence, and geopolitical relevance. However, it is worth considering that cybercriminals frequently mask their precise location, which could skew the geographic data. Furthermore, the high concentration of cyberactivity within the top 20 countries might reflect their technological infrastructure and international standing. Such insights could be instrumental in shaping geographically precise cybersecurity policies and strategies. However, future studies should address the potential discrepancies resulting from attackers’ masking of specific locations.
These findings emphasize the global nature of cyberthreats and highlight the importance of international cooperation and strategy development in cybersecurity. However, it is crucial to note the potential for location obfuscation by attackers, which indicates the need for additional corroborative strategies to trace the origins of cyberthreats accurately. The complex and international nature of these threats necessitates a multifaceted and global response.
4.3.5. Threat Analysis
The threat analysis presented in the study underscores the complexity and diversity of the cyberthreat landscape. The substantial number of unidentified threats (zero-count entries) emphasizes the continual evolution of cyberthreats and the limitations of current threat intelligence repositories in capturing the complete range of malicious activity. The prominence of specific categories among the non-zero count entries indicates which malicious activities or sources are most prevalent, providing valuable insights for devising targeted defense strategies. However, it is also important to consider the minor categories, which, although constituting a smaller portion of the dataset, may represent emerging or less common threat vectors that warrant further exploration.
The significant number of unidentifiable threats reiterates the need to continuously enhance threat intelligence repositories and adopt adaptive, multifaceted cyberdefense strategies. The study’s findings highlight the importance of ongoing research to understand the rapidly changing nature of cyberthreats and develop effective strategies to counter them.
4.3.6. Source IP Address Analysis
The study highlights the importance of scrutinizing the source IP address variable in understanding the origins and patterns of cyberattacks. The findings suggest concentrated sources of attacks from specific IP addresses and ASNs, pointing towards the potential utilization of botnets or centralized attack mechanisms. Notably, a significant percentage of entries were linked to the top 20 IP addresses, suggesting a concentrated nature of cyberthreats. The findings indicate a need for increased vigilance even in environments perceived to be trustworthy, particularly considering the predominant utilization of reputable cloud services as attack vectors. Understanding the dispersion and concentration of attacks from individual source IPs informs the development of targeted defense mechanisms and fosters international collaboration to counter cybercrime effectively.
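The concentration described above can be quantified as the share of all log entries contributed by the N most frequent source IP addresses. The following is a minimal sketch under stated assumptions: the function name `top_n_share` and the input (a flat sequence of source IP strings, one per log entry) are hypothetical, and the addresses come from documentation-reserved ranges rather than from the dataset.

```python
from collections import Counter


def top_n_share(source_ips, n=20):
    """Fraction of all entries that come from the n most frequent source IPs."""
    counts = Counter(source_ips)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(n))
    return top / total


# Toy log: three sources, ten entries (documentation-range addresses).
events = ["192.0.2.1"] * 6 + ["192.0.2.2"] * 3 + ["198.51.100.7"] * 1
share = top_n_share(events, n=2)
print(f"top-2 share: {share:.0%}")
```

A share close to 1 for a small N would be consistent with botnet-like or centralized sources, whereas a share that grows slowly with N would indicate widely dispersed origins.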
When cross-referenced with threat intelligence data repositories, the comprehensive analysis of source IP addresses revealed critical insights into the distribution of cyberthreats. The study reaffirms the necessity of an exhaustive analysis of source IP addresses to comprehend cyberattack patterns and develop effective threat detection and prevention strategies. By fostering international collaboration and sharing these insights, this approach contributes to the broader cybersecurity field’s capacity to navigate the myriad cybersecurity challenges.
4.3.7. Destination Ports Analysis
The study’s findings suggest an increasing sophistication and targeted approach to cyberattacks. This observation is reinforced by the high prevalence of attacks on services such as the VNC-Server (port 5900), which require more sophisticated attack vectors than standard ports such as HTTPS (443) or SSH (22). The data also point to a high concentration of attacks from specific IP addresses and ASNs, implying the potential use of botnets or centralized attack mechanisms. The use of reputable cloud services to initiate attacks emphasizes the need for advanced security measures.
The “count_diff” data provides a dynamic perspective on the changes in network traffic over the years. Cyberattacks have become more targeted and sophisticated, with changing preferences for specific ports across different years. The fact that ports such as 5900 and 8 show a marked increase in traffic points to shifting attacker strategies. Conversely, a decrease in traffic for port 22 may suggest changes in the targeted systems’ security measures or network configurations. Such insights could be leveraged in future cybersecurity studies and provide network administrators with crucial information to bolster network security measures. Therefore, the study offers a vital understanding of the cyberthreat landscape, underlining the need for continuous vigilance and adaptability in response to changing cyberthreats.
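The section does not define how “count_diff” was computed. A plausible reading, assumed here, is the year-over-year change in the number of entries per destination port. The sketch below implements that reading; the function name, the `(year, port)` record layout, and the toy values are illustrative assumptions.

```python
from collections import Counter


def count_diff(records):
    """Year-over-year change in entry counts per destination port.

    records is an iterable of (year, port) pairs, one per log entry.
    Returns {port: {year: count(year) - count(previous observed year)}}.
    Years are compared in sorted order of the years present in the data,
    so a gap in the data yields a difference that spans the gap.
    """
    per_year = Counter(records)  # keyed by (year, port); missing keys count as 0
    years = sorted({year for year, _ in per_year})
    ports = sorted({port for _, port in per_year})
    return {
        port: {
            cur: per_year[(cur, port)] - per_year[(prev, port)]
            for prev, cur in zip(years, years[1:])
        }
        for port in ports
    }


# Toy data (hypothetical counts): port 22 declines, port 5900 rises.
records = [(2019, 22)] * 5 + [(2019, 5900)] * 2 + [(2020, 22)] * 3 + [(2020, 5900)] * 7
diffs = count_diff(records)
print(diffs)
```

Positive values correspond to ports that attracted more traffic than in the preceding year, and negative values to ports that attracted less.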
4.3.8. Destination Services Analysis
The study’s results suggested an increased focus on less known or difficult-to-categorize services, indicative of a rise in the complexity and sophistication of cyberattacks. This finding aligns with the initial hypothesis. A consistent pattern of annual increases in attacks was noted for certain services such as “ICMP-Echo-Request”, “Unknown”, and “VNC-Server”. In contrast, other services, such as “BGP” and “Domain-s”, were only recorded in specific years.
The “Cluster” column, introduced through a KMeans clustering algorithm, provided additional depth to the analysis. It grouped destination services into clusters based on similarity, revealing distinct patterns for services like “Unknown”, “VNC-Server”, “ICMP-Echo-Request”, and “SSH”.
The analysis of “Destination Services”, combined with the “count_diff” data and KMeans clustering, painted a comprehensive picture of the evolving nature and complexity of cyberattacks. It revealed a diverse range of targeted services and a marked increase in attacks on less known or harder-to-categorize services. These results are of value to network administrators and security professionals, providing insights for developing and reinforcing robust cybersecurity measures in response to the evolving threat landscape.
4.3.9. Autonomous System Numbers and Names Analysis
Despite the significant network activity linked to entities such as DigitalOcean, Amazon-AES, and Amazon-02, it is crucial to understand that these organizations’ high entry numbers do not necessarily signify direct involvement in malicious activities. These numbers reflect the large customer bases of these organizations, which could include users exploiting these services for nefarious activities.
Temporal trends demonstrate the ever-changing nature of the cyberthreat landscape. The fluctuations observed in specific ASNs over the years highlight the need for continuous monitoring and updating of cybersecurity measures to match the evolving nature of threats. Furthermore, the cluster analysis of ASNs offered more profound insights into the patterns of malicious network activity, indicating the changing landscape of cyberthreats.
The analysis of ASNs revealed distinct patterns of network activity linked to malicious intent, with significant variations across different ASNs and years. The findings emphasized the critical role of robust cybersecurity measures and continuous cyberthreat analysis in understanding and combating these evolving threats. By shedding light on the temporal behavior and clustering characteristics of ASNs, this analysis provides insights for future research in this area, thereby contributing to a broader understanding of cyberthreats and strengthening the defenses against them.
4.3.10. Behavior Analysis
As a metric, the behavior score demonstrated its potential in discerning anomalous from expected network behavior. This approach leverages the inherent structure of the Internet, employing AS numbers and organizations as critical factors in behavior analysis.
In the context of cyberthreat intelligence, these results highlight the significant role of behavioral patterns in network traffic analysis. Countries such as the United States and Germany exhibited higher behavior scores through their AS numbers and organizations, signaling potential security threats. Notably, these countries are significant Internet nodes, reinforcing the necessity of vigilant cybersecurity measures in these regions.
However, it is essential to consider that a higher behavior score may not directly correspond to malicious intent. Network traffic can exhibit strange behavior for several reasons, such as configuration changes, software updates, or non-standard user behavior. Therefore, these results should be interpreted with caution and need to be corroborated with additional data or context.
This study shed light on the potential of using behavior scores as an effective tool for anomaly detection in network traffic. The high behavior scores associated with specific AS numbers and organizations emphasize the need for rigorous and continuous monitoring of these entities. These findings and the distribution of behavior scores offer valuable insights for cybersecurity practitioners in their ongoing efforts to detect, mitigate and prevent cyberthreats.
While the study offers promising results, future work should focus on refining the behavior score by incorporating more diverse factors. Such enhancements will contribute to a decrease in false positives and improve the precision of the anomaly detection process. Additionally, further research is required to understand the reasons behind the elevated behavior scores observed for certain entities. Understanding these anomalies more deeply will facilitate the development of more effective threat intelligence strategies.
4.3.11. Clustering Analysis
The clustering analysis was built on a pipeline that integrates preprocessing and clustering stages, providing a systematic and reproducible way to manage the complexity of the data. The pipeline was constructed using Python’s scikit-learn (sklearn) library and comprised a column transformer followed by K-means clustering, with the K-means algorithm run with ten different initializations to improve the reliability and stability of the resulting clusters.
Key features, including “Date”, “Count”, and “AS-Number”, were selected for the analysis. These features were processed by the pipeline, resulting in a scaled version of the data. The transformation of the data was not merely a technical procedure but rather a vital step that prepared the data for an effective clustering analysis.
The clustering procedure identified three distinct clusters with different characteristics and frequencies of data points. It was observed that clusters 0 and 2 contained the majority of the data points, while cluster 1 had significantly fewer data points. The disparities in the size of the clusters raised interesting questions about the underlying structure of the data and the significance of cluster 1, which may require further investigation.
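A pipeline of the kind described above might be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the choice of `StandardScaler`, the conversion of “Date” to days since the first observation, the treatment of “AS-Number” as a numeric feature, the `random_state`, and the synthetic data (including the private-use AS numbers) are all assumptions. Only the use of a column transformer, K-means with ten initializations, the three features, and three clusters are taken from the text.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["days_since_start", "Count", "AS-Number"]


def build_pipeline(n_clusters=3, random_state=0):
    """Column transformer (scaling) followed by K-means with ten initializations."""
    preprocess = ColumnTransformer([("scale", StandardScaler(), FEATURES)])
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return Pipeline([("preprocess", preprocess), ("kmeans", kmeans)])


# Synthetic stand-in for the honeypot data (hypothetical values).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Date": pd.date_range("2016-01-01", periods=n, freq="D"),
    "Count": rng.integers(1, 500, size=n),
    "AS-Number": rng.choice([64512, 64513, 64514], size=n),  # private-use ASNs
})
# Assumed encoding of the date as a numeric feature.
df["days_since_start"] = (df["Date"] - df["Date"].min()).dt.days

pipe = build_pipeline()
df["Cluster"] = pipe.fit_predict(df)

# Cluster sizes and mean "Count" per cluster.
print(df.groupby("Cluster")["Count"].agg(["size", "mean"]))
```

Because K-means relies on Euclidean distance, scaling matters here: without it, the feature with the largest numeric range (the AS number) would dominate the cluster assignment.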
Moreover, the computation of the mean “Count” for each cluster revealed considerable variations among them. This finding highlighted the heterogeneous nature of the data and provided insight into the potential significance of the different clusters.
The temporal distribution of “Count” within each cluster was also examined, revealing unique patterns and shifts over time. The shift from cluster 2 towards cluster 0 suggested dynamic changes in the data over time, which could be an interesting area for future study. The sporadic and high “Count” values in cluster 1 were intriguing and may suggest anomalies or rare events in the data.
The addition of cluster labels to the original data frame provided an insightful layer of information. Each data point was tagged with a precise label indicating its cluster affiliation, which could be useful in interpreting the data and guiding further analyses.
The results from the clustering analysis provided a deeper understanding of the data structure, revealing unique patterns and potential areas of interest. The clear delineation of data into specific clusters allowed for the identification of potential anomalies and marked the first step towards a comprehensive anomaly detection procedure. The insights gained from this analysis are essential for guiding subsequent investigations and enhancing the understanding of the data’s inherent structure and patterns.
4.3.12. Anomaly Detection with Clustering
In the context of the conducted research, anomaly detection held a pivotal role, particularly when considered in tandem with the clustering analysis. This study employed the Isolation Forest algorithm for anomaly detection due to its notable competency in handling high-dimensional datasets. The algorithm was applied to the scaled data, previously organized into clusters, and the outputs were integrated into the original dataset as an “anomaly” column.
The algorithm identified 218 data points as potential anomalies, against 21,522 data points designated as non-anomalous. The “Count” variable played a crucial role in this differentiation, as the average “Count” for anomalies vastly exceeded that for non-anomalous data points. This substantial difference is consistent with the Isolation Forest algorithm isolating outliers primarily along the “Count” variable.
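The anomaly detection step can be illustrated with scikit-learn's `IsolationForest`. The reported split (218 of 21,740 points, roughly 1%) would be consistent with a contamination setting of 0.01, but the section does not state the parameter, so that value is an assumption, as are the `random_state`, the single-feature input, and the synthetic data below.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Synthetic "Count" values (hypothetical): mostly moderate, with a few extreme spikes.
rng = np.random.default_rng(42)
normal = rng.normal(loc=100.0, scale=15.0, size=(990, 1))
spikes = rng.normal(loc=5000.0, scale=500.0, size=(10, 1))
counts = np.vstack([normal, spikes])

# Scale first, mirroring the text's statement that the scaled data were used.
scaled = StandardScaler().fit_transform(counts)

# contamination=0.01 is an assumed setting, not one confirmed by the study.
forest = IsolationForest(contamination=0.01, random_state=42)
labels = forest.fit_predict(scaled)  # -1 marks an anomaly, 1 marks a normal point
is_anomaly = labels == -1

print("flagged:", int(is_anomaly.sum()), "of", len(counts))
print("mean Count, anomalies:", float(counts[is_anomaly].mean()))
print("mean Count, others:   ", float(counts[~is_anomaly].mean()))
```

Comparing the mean “Count” of the two groups, as in the last two lines, is the same check the text describes. Note that the contamination parameter fixes the approximate fraction of points flagged, so the number of anomalies is largely a consequence of that setting rather than an independent finding.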
A closer examination of anomaly occurrence over time, as drawn from the summary analysis, revealed a clear pattern. Anomalies began to appear in January 2017, initially at a modest rate of one per month. This frequency gradually increased, reaching a peak of 17 anomalies in September 2021.
These observations provided invaluable insights into the behavior of the data over time and underlined the effectiveness of the anomaly detection method in combination with clustering analysis. The alignment of the areas of interest identified by both clustering and Isolation Forest algorithms enhanced the confidence in the robustness of the chosen methodologies.
The resulting findings offer a solid foundation for further investigation into these anomalies, which could lead to novel discoveries and enrich the existing body of knowledge in this domain. Future work is envisaged to explore these anomalies more intensively to elucidate their full implications, thereby contributing to a more profound comprehension of the data’s structure and behavior.
4.4. Comparison to Previous Research
The results of the current study align with the existing body of research on network anomaly detection while also providing unique insights. In line with previous research, the study reaffirms the role of machine learning in detecting network anomalies [18,21,25,30,31,32,35]. It further extends this by focusing on anomalies in network behavior characterized by unusual patterns potentially indicative of cyberthreats.
The behavioral scoring system used in this study is in line with the approaches by Alsarhan [30], Boateng [18], and Mengidis et al. [35], who also utilized machine learning methodologies for anomaly detection. The study, however, distinguishes itself by tying the scoring to a combination of Autonomous System Numbers and Names (ASNs), the country of origin, and the number of connections made, a more holistic approach to understanding network behavior.
The current research validates the significance of IP address and ASN in identifying anomalous network activities, an assertion also supported in the works of Alowaisheq [5] and Li [19]. It extends this understanding by providing quantifiable evidence through a behavior-scoring mechanism that links these factors with the frequency and nature of abnormal behavior, a contribution not previously articulated in such detail.
Echoing the works of Aboah Boateng [18] and Mengidis et al. [35], this study employs unsupervised machine learning methods for anomaly detection but differs in applying them to a purely network-centric dataset. The current study also stresses the need for data reduction and dimensionality reduction techniques, a sentiment shared with the work of Fu et al. [34]. However, this study employs both source IP addresses and ASNs to carry out the reduction process, boosting the efficiency of anomaly detection.
The study concurs with Moriano Salazar’s [44] emphasis on analyzing real-world temporal networks. It highlights the importance of continuous and real-time monitoring due to the dynamic nature of cyberthreats. Drawing on Alowaisheq’s [5] work, it also examines network behavior from multiple angles, considering the origin of traffic and its associated behavior.
In line with the discourse of Moriano Salazar [44] and Ongun [25] on the temporally dynamic nature of network behavior, the study further reinforces the need for constant model updates as cyberthreats evolve. It also integrates this concept into a practical framework, thereby offering actionable insights for the cybersecurity community.
The study diverges from previous research in focusing on a behavior-based scoring system linked to ASNs and the country of origin. While Chatterjee [26] used deep learning mechanisms for network intrusion detection, the current research offers a behavior-based scoring system as a potentially more accessible method for anomaly identification and severity assessment.
The work of Christopher [45] underscores the significance of protecting the Industrial Control System (ICS) environment. The study builds upon this concept, emphasizing the role of ASNs and behavior scoring in improving defenses against intrusions. Similarly, the current research aligns with Wendt’s work [7], reinforcing the strategies needed to enhance adaptive cyberdefenses, particularly in the financial sector. However, it further accentuates the role of behavior-based scoring and unsupervised learning in strengthening these defenses.
Aghaei’s study [22] on the automated classification and mitigation of cybersecurity vulnerabilities resonates with the present research, which also emphasizes the automated detection of anomalies using machine learning techniques. The research diverges, however, in its application of these techniques specifically for network behavior analysis.
Research conducted by Bajic [14] underscores the importance of dynamic defense in computer networks. The current study complements this perspective, considering the dynamic nature of network behavior and the need for adaptive mechanisms to detect anomalies. The current research, however, extends this premise by developing a behavior-scoring system, contributing a novel approach to network anomaly detection.
The study aligns with the work of Villalón-Huerta and Ripoll-Ripoll [42] by emphasizing the importance of detecting and sharing behavioral indicators of compromise. It expands on this work by proposing a behavior-scoring system to classify and understand these indicators.
In summary, the present study builds upon and extends the knowledge in network anomaly detection, informed by prior research, while providing new insights through a unique behavior-based scoring system. These findings should be considered as a starting point for future research, refining and enhancing the understanding of network behavior anomalies.