Intelligent Threat Detection—AI-Driven Analysis of Honeypot Data to Counter Cyber Threats

Abstract: Security adversaries are rampant on the Internet, constantly seeking vulnerabilities to exploit. The sheer proliferation of these sophisticated threats necessitates innovative and swift defensive measures to protect the vulnerable infrastructure. Tools such as honeypots effectively determine adversary behavior and safeguard critical organizational systems. However, it takes a significant amount of time to analyze these attacks on the honeypots, and by the time actionable intelligence is gathered from the attacker’s tactics, techniques, and procedures (TTPs), it is often too late to prevent potential damage to the organization’s critical systems. This paper contributes to the advancement of cybersecurity practices by presenting a cutting-edge methodology, capitalizing on the synergy between artificial intelligence and threat analysis to combat evolving cyber threats. The current research articulates a novel strategy, outlining a method to analyze large volumes of attacker data from honeypots utilizing large language models (LLMs) to assimilate TTPs and apply this knowledge to identify real-time anomalies in regular user activity. The effectiveness of this model is tested in real-world scenarios, demonstrating a notable reduction in response time for detecting malicious activities in critical infrastructure. Moreover, we delve into the proposed framework’s practical implementation considerations and scalability, underscoring its adaptability in diverse organizational contexts.


Introduction
The escalating number of cybersecurity attacks on Internet-connected systems is a pressing concern. According to research published by the International Monetary Fund in 2024, cyberattacks have more than doubled since the pandemic [1]. The research also reveals a staggering increase in financial losses due to cybersecurity incidents, quadrupling since 2017 to $2.5 billion. These figures underscore the urgency for security teams worldwide to detect and remediate attacks on their infrastructure promptly.
Unfortunately, data suggest that detecting a cyberattack is incredibly complex. Research published by IBM in October 2023 within the IBM Data Breach Action Guide indicates that it takes around 207 days on average to identify a cybersecurity breach and around 70 days to contain it once it is detected [2]. Meanwhile, there is a sharp reduction in the average amount of time it takes to launch an attack such as ransomware on enterprise networks. Research from IBM suggests that the time taken for a full-fledged ransomware attack has sharply fallen from 2 months to under 4 days. The pandemic has further complicated security threats in a typical office setting with the rise of remote work environments. The increased usage of cloud services and the trend of organizations exposing internal applications to the Internet using cloud platforms have significantly expanded the attack surface of typical corporate environments and associated infrastructure [3]. Both sophisticated Advanced Persistent Threats (APTs) and less sophisticated script kiddies are increasingly targeting cloud exposures [4]. The rise of sophisticated cyberattacks underscores the urgent need for innovative detection mechanisms, such as User and Entity Behavior Analytics (UEBA), to confront the evolving threat landscape. UEBA systematically examines the typical behaviors of users and entities, utilizing this contextual understanding to pinpoint irregular deviations from established norms.
One of the mechanisms to target the adversaries attacking the infrastructure and understand their TTPs is by leveraging honeypots. A full-fledged high-interaction honeypot provides insights into attacker metadata, including the IP address they are coming from, the time of the attacks, the mechanism of compromise, and the commands used to establish persistence. Three types of honeypots offer different levels of insight into adversary behavior [5,6]:

• Low-Interaction Honeypot: A low-interaction honeypot offers a plain vanilla TCP/IP service with minimal or no access to the operating system. The adversary cannot meaningfully interact with the honeypot, nor can the honeypot respond to attackers in a way that captures their TTPs. The data yield from these honeypots is very low, but it can be easily analyzed to derive attacker information and use it to protect organizations' critical infrastructure. The quality of the data, especially the intention of attackers, cannot be captured in this type of honeypot [7];
• High-Interaction Honeypot: A high-interaction honeypot is on the other side of the spectrum compared to a low-interaction honeypot. This type of honeypot gives attackers complete root access to the operating system. Instead of merely mimicking specific protocols or services, the attacker is given authentic systems to target, significantly reducing the chances of detecting that they are being redirected or monitored [6]. A high-interaction honeypot successfully captures the attackers' intentions and their TTPs. However, the data obtained from this type of honeypot are diverse, and the data analysis takes time to derive actionable insights for protecting organizations' critical infrastructure;
• Medium-Interaction Honeypot: A medium-interaction honeypot operates between a low-interaction and a high-interaction honeypot, sacrificing some operating system authenticity to facilitate easier data analysis [6]. Organizations often resort to medium-interaction honeypots as a compromise, unable to rely on low-interaction honeypots due to their low data quality and unable to timely analyze the wealth of information offered by high-interaction honeypots. However, these honeypots are easily identified by adversaries, leading to a decline in the quality of extracted data.
Clouds are heavily targeted by attackers [8]. The more popular the cloud infrastructure, the more heavily the exposed infrastructure is targeted. Implementing honeypots in the cloud gives organizations a free glimpse into attackers' activities, helping them prepare against malicious activity on the Internet.
One of the fundamental challenges in analyzing data from high-interaction honeypots is that many diverse shell commands can achieve an identical outcome in an operating system. Blocking a specific shell command through either a block-list-based or an allow-list-based approach is therefore largely ineffective at preventing attacks on the critical infrastructure of organizations. Many obfuscation techniques hide the actual command behind various shell-based masking techniques [9].
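As a minimal illustration of why string-level blocking fails, the following sketch (with hypothetical command variants, not drawn from the paper's dataset) shows several distinct command strings that all achieve the same outcome, of which a literal blocklist catches only one:

```python
# Hypothetical shell-command variants that all read /etc/passwd; none shares
# an exact string with the others, so a literal blocklist catches only one.
variants = [
    "cat /etc/passwd",
    "c''at /etc/passwd",                         # quote-splicing obfuscation
    'cat "/etc/"passwd',                         # partial quoting
    "tac /etc/passwd | tac",                     # different binary, same output
    "base64 -d <<< Y2F0IC9ldGMvcGFzc3dk | sh",   # base64 of 'cat /etc/passwd'
]
blocklist = {"cat /etc/passwd"}
blocked = [v for v in variants if v in blocklist]
print(f"{len(blocked)} of {len(variants)} variants blocked")  # 1 of 5
```

Allow-lists fare no better, since legitimate administrators also vary their command strings freely.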
In the past couple of years, machine learning and artificial intelligence algorithms have grown exponentially. A Bloomberg Intelligence report states that AI-related spending among organizations will grow from 1% to 10% by 2032 [10]. The most significant revenue drivers will be generative AI infrastructure as a service used for training LLMs, digital ads, and specialized generative AI assistant software. One can use LLMs to detect malicious events in organizational critical systems. However, this notion comes with the following challenges:

• Most LLMs today are trained using historical datasets [11]. They can only evaluate a command if the training data references it. Even if the training datasets are current, the models have no mechanism today to pre-emptively understand an event and classify it as malicious;
• LLMs do not have enough information about attacks within an organization, especially on its hosted honeypots, to identify and classify an event as malicious;
• It is possible to train new LLMs with the organization's honeypot data. However, significant investments in cost and infrastructure are required to produce the relevant models while the honeypot data are still viable and usable from a security standpoint;
• Lastly, one way for an LLM to derive a suitable classification of events in critical infrastructure (e.g., malicious, non-malicious) is to provide data from the honeypot as context. However, the amount of data coming to the honeypot is significantly high. Providing this context might be impossible for models where the context exceeds the allowed number of tokens. Further, as the context grows, the number of input tokens an LLM consumes increases, making it cost-prohibitive and computationally expensive to leverage machine learning to identify malicious events.
Honeypots help derive precious data on attacker TTPs as they target a given organization. If there is a way to promptly convert these TTPs into actionable intelligence, the data's effectiveness and potential could be realized to prevent a malicious incident. With the advent of cloud computing, cyber adversaries are constantly probing for vulnerabilities in open ports on the Internet [12]. Leveraging advancements in machine learning algorithms and the rich data from high-interaction honeypots, organizations can process that data and detect incidents in their infrastructure as early as possible.

Novelty of the Current Research
Indiscriminately applying a machine learning model to an arbitrary dataset can result in spurious correlations. Machine learning must be applied to datasets where analysis of the data yields reasonable estimates of underlying patterns and relationships. Before analysis, it is paramount to ensure that the data are sufficiently clean, relevant, and representative of the security event classification domain. In essence, this research focuses on three research objectives:

• Develop a novel machine learning model tailored to the security event classification domain. The model aims to efficiently parse data from high-interaction honeypots and accurately identify malicious events in critical infrastructure;

• Investigate the effectiveness of leveraging native data analysis techniques with machine learning algorithms to score realistic events extracted from high-interaction honeypots. The aim is to enhance the accuracy and efficiency of identifying malicious events in critical infrastructure while reducing detection time;

• Design and implement a comprehensive data pipeline integrating various analysis steps, including data parsing from high-interaction honeypots, machine learning-based event classification, and the application of the Retrieval-Augmented Generation (RAG) model to actively seek out and identify malicious events in critical infrastructure. Evaluate the performance and effectiveness of the proposed approach through real-world attack simulations in a cloud-based environment.
The document is arranged as follows: Section 2 presents related work that has been leveraged to derive the model in this paper. Section 3 details the framework and explains the process of choosing the datasets for the current research, including a sample architecture used to validate the model. Section 4 discusses the results and observations identified through the established real-world attack scenario setup. Finally, Section 5 concludes this paper with key findings.

Related Work
Considerable machine learning research is being carried out across all domains, leveraging large language models (LLMs) and refining them to make them efficient for various datasets. Much of this research in the security space is focused on securing LLMs, preventing data exfiltration from LLMs, or ensuring the models do only what they have been specifically instructed to do. However, very little research has surfaced on integrating security data from log analysis, understanding attackers' TTPs and mindset, and aligning it with an incident detection framework within an organization.
Research from No et al. [13] proposes RAPID, a model leveraging log data's inherent characteristics for real-time anomaly detection without training delays. By treating logs as natural language and utilizing pre-trained language models for representation extraction, RAPID enables efficient detection without needing log-specific training, incorporating token-level information for refined and robust anomaly detection, particularly for unseen logs. The authors introduce a core-set technique to reduce the computational cost of comparison. Compared to prior models, experimental results demonstrate RAPID's competitive performance and effectiveness in real-time detection without delay. The RAPID framework showcases how LLMs can be used to parse log files; log statements are closer to human-readable text than the bash commands executed on the honeypots. The current research extends this notion, using LLMs to analyze malicious commands from adversaries and applying the results for anomaly detection.
Research from Karlsen et al. [14] examines the effectiveness of large language models (LLMs) in cybersecurity, particularly in analyzing log files. Various LLM architectures, including BERT, RoBERTa, DistilRoBERTa, GPT-2, and GPT-Neo, are benchmarked for their ability to analyze application and system log files for security purposes. The research deploys and benchmarks 60 fine-tuned language models for log analysis, demonstrating their effectiveness in this domain. Fine-tuning is found to be crucial for domain adaptation to specific log types. Additionally, the study introduces a new experimentation pipeline, LLM4Sec, which utilizes LLMs for log analysis experimentation, evaluation, and analysis. This research proves the versatility of using LLMs for analyzing logs. A key learning from this paper is that combining techniques that enrich the context and augment LLMs with extra information yields a robust detection mechanism. The current research builds on the techniques proposed by Karlsen et al. and adapts them to analyze honeypot commands.
Research from Guu et al. [15] focuses on implementing a Retrieval-Augmented Generation model to enhance the capabilities of an LLM. LLMs store knowledge implicitly in network parameters, making it challenging to discern specific knowledge locations. REALM tackled this issue by introducing a discrete retrieval step called the 'textual knowledge retriever' in its pre-training algorithm. This retriever is incentivized to retrieve documents containing relevant information and penalized otherwise. REALM utilizes this mechanism to retrieve pertinent documents and focuses solely on these documents to make predictions, thereby enhancing the model's efficiency and effectiveness in leveraging stored knowledge. Retrieval-Augmented Generation models enhance the output of LLMs. The current research leverages this model to parse the commands from the adversaries in the honeypot and query the commands executed by normal users in critical systems.
Research from Yang et al. [16] proposes adopting honeypot technology to shift from reactive to proactive cyber defense, aiming to address the limitations of typical defensive measures in current cyber confrontations. The system aims to enhance protective capabilities and ease of use by employing highly interactive honeypots and a modular design approach. The high-interactivity honeypot technology lures attackers into controlled environments for observation and performs advanced functions such as network threat analysis and vulnerability perception. It is thus possible to implement proactive detection measures leveraging data from the honeypot by effectively designing the honeypot and making the associated analysis modular and incremental. The research from Yang et al. showcases how high-interaction honeypots can be used for proactive defenses in a given organization. The current research leverages high-interaction honeypots to lure attackers and observe their behavior. Further, the modular framework proposed by Yang et al. has been expanded to perform LLM analysis of the commands executed by adversaries, aiding the detection of malicious commands in the critical infrastructure.
Research from Szabó and Bilicki [17] investigates using LLMs (specifically GPT) for static analysis of front-end application source code to detect the CWE-653 vulnerability. By leveraging GPT's interpretive capabilities, the research aims to automate the identification of inadequately isolated sensitive code segments that could lead to unauthorized access or data leakage. Methodologically, the study involves classifying sensitive data, preprocessing code, and generating prompts for analysis using GPT. This paper showcases an initial framework for leveraging LLMs to interpret security issues that could be expanded to other use cases, especially log analysis. However, while this framework could be applied to programming languages, honeypot logs are diverse and require more than prompt engineering to derive malicious events. The current research extends this work using Retrieval-Augmented Generation techniques through LLMs to automate the identification of malicious commands in critical infrastructure, leveraging analysis performed on the commands executed by adversaries.
Research from Wang et al. [18] introduces AI@NTDS, a network threat detection system that uses behavioral features of attackers and intelligent techniques. It combines data analysis, feature extraction, and evaluation to build a detection model, aiding operating systems in defending against network attacks. Linux system interaction data from SSH and Telnet are collected from the Cowrie honeypot and labeled based on MITRE ATT&CK tactics for dataset credibility. The key learning from Wang et al. is that SSH and Telnet provide a solid basis for a detection model that identifies malicious behavior in honeypots. However, Cowrie is still a medium-interaction honeypot. The current research leverages a high-interaction honeypot and more complex machine learning models and LLMs to derive adversary behavior and compare it with user commands in the critical infrastructure.
Research from Lanka et al. [19] verifies that security adversaries exploit various targets, focusing on easy compromises to extend their attacks. Cloud environments, including AWS, Azure, GCP, and OCI, are prime targets due to the volume of attacks, offering insights into attacker objectives and patterns. The study examines adversary practices on commonly exposed protocols in these platforms, documenting a honeypot model that compares attacker behavior across multiple cloud environments. Additionally, the article highlights security measures to mitigate threats from adversaries probing insecure targets on cloud platforms. The model from this paper helps drive the honeypot design and validate the machine learning model in the current research. The research from Lanka et al. stops short of utilizing adversary behavior to identify malicious commands from user interactions with critical infrastructure. The current research expands on this notion and showcases a model through which the commands executed in the honeypots can be used to analyze the commands executed in the critical infrastructure.
Research from Lewis et al. [20] investigates the limitations of large pre-trained language models in accessing and manipulating knowledge, which impacts their performance on knowledge-intensive tasks. The authors propose a general-purpose fine-tuning approach for Retrieval-Augmented Generation (RAG) models, which integrate pre-trained parametric and non-parametric memory for language generation. The study compares two RAG formulations: one that uses the same retrieved passages for the entire generated sequence and another that allows different passages per token. For language generation tasks, RAG models produce more specific, diverse, and factual language compared to a state-of-the-art parametric-only seq2seq baseline. RAG can be used in the current research to compare commands: the commands may vary significantly between adversaries, but the essence of command comparison lies in describing the commands and comparing the descriptions against each other.

Methodology
Table 1 illustrates a sample log event from an SSH-based high-interaction honeypot and the corresponding timestamps. Typical honeypot logs contain extensive information about the interaction with the adversary; for instance, substantial attacker metadata can be deduced from the session depicted in Table 1, a sample attack that occurred over just 22 s. (In Table 1, yellow-labeled text indicates a protocol-parsed value.)
Although Table 1 depicts a sample log involving the SSH protocol, similar information can be deduced from honeypots of any protocol. High-interaction honeypots yield large volumes of discrete data from various interactions, and the datasets are not similar to one another. Therefore, significant processing is required before deriving meaningful outcomes from the data and using it for threat detection in organizations.
The first pre-processing step involves identifying the types of datasets from the logs. Specifically, it involves separating the datasets that could be used for active learning from those that could indicate a compromise without any machine learning analysis. Table 2 represents further information that could be deduced from the datasets identified in Table 1. Each subsection below details the relevance of these data elements in analyzing malicious commands and discusses how they could be refined further to aid in the identification of malicious activity in the critical infrastructure.

Attacker Credentials (Identification Data; used for learning: No)
Credentials from attackers usually represent the dataset these adversaries use to perform brute force attacks and do not provide any benefit for learning.

Connection Creation Attempts (Protocol Data; used for learning: No)
Different types of protocols offer different methodologies for connection creation. Connection creation attempts are also influenced by how stable the underlying connection to the honeypot is. These data would not provide any benefit for learning.

Commands Executed (Attacker Commands; used for learning: Yes)
Commands executed by the attacker provide a glimpse into their TTPs. Learning from these data would aid in malicious activity detection.

Total Execution Time (Activity Data/Numeric Data; used for learning: Yes)
Attackers are on a constant lookout for externally exposed services and strongly focus on establishing persistence in the cloud [19]. These data can be used for machine learning classification to identify malicious activity.

Attacker IP Address
Attacker identity-based datasets, particularly IP addresses and connection credentials, are considered direct indicators of compromise. IP addresses cannot be meaningfully generalized through any analysis or machine learning methodology: even when such strings are learned by a model, its predictions would not yield results usable for detecting malicious activity on the company's sensitive infrastructure. However, these strings can be used independently to identify suspicious activity. For instance, activity could be considered suspicious if attacks happened on the honeypot from the IP address 183.81.169.238 and new connections on critical infrastructure originated from that same address within a relatively short time (e.g., the past week).
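The IP-overlap check described above can be sketched as follows. This is a minimal illustration with hypothetical names and an assumed seven-day window, not the paper's implementation:

```python
from datetime import datetime, timedelta

# Hypothetical record of when each source IP was first seen attacking the
# honeypot (illustrative data, not from the paper's dataset).
honeypot_hits = {"183.81.169.238": datetime(2024, 5, 10, 3, 12)}

def is_suspicious(src_ip, seen_at, window=timedelta(days=7)):
    """Flag a connection to critical infrastructure if its source IP attacked
    the honeypot within the look-back window."""
    first_attack = honeypot_hits.get(src_ip)
    return first_attack is not None and seen_at - first_attack <= window

print(is_suspicious("183.81.169.238", datetime(2024, 5, 12, 9, 0)))  # True
print(is_suspicious("10.0.0.5", datetime(2024, 5, 12, 9, 0)))        # False
```

In practice the look-back window would be tuned to how quickly attacker infrastructure is recycled.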

Attacker Credentials
Similarly, if the attacker uses the same password to log into both the honeypot and critical infrastructure, this increases the suspicion that the activity in critical infrastructure is malicious. Usage of the same credentials in both honeypot and critical infrastructure suggests a deliberate effort by the adversary to penetrate secure systems, signaling malicious intent rather than incidental or innocuous actions. In addition, identification of identical passwords raises alarms about the potential for additional infiltration or harm elsewhere within the critical infrastructure that previously exposed user credentials to the attacker. This scenario underscores the importance of robust cybersecurity protocols and continuous monitoring to promptly detect and respond to malicious behavior. Additionally, it highlights the significance of threat intelligence and collaborative information sharing in identifying and countering the strategies and techniques employed by attackers, empowering organizations to enhance their defenses against future attacks.
Sometimes, user credentials are weak and do not adhere to secure password generation methodologies. Rather than making a deliberate effort to compromise credentials, an attacker can simply guess the passwords of legitimate users without compromising any other service in the organization. These attacks are easy for attackers and come at no cost, leading to increased attacks in cloud environments [19]. Regardless of the underlying root cause, attackers using identical credentials on both honeypots and critical infrastructure pose a significant danger to an organization, as it grants adversaries nearly resistance-free access to the infrastructure.
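A credential-reuse check along the lines discussed above could be sketched as follows. The names and the plain SHA-256 digest are illustrative assumptions; a real deployment would compare salted hashes from authentication logs rather than hash raw credentials:

```python
import hashlib

# Hypothetical store of credential digests observed in honeypot login attempts.
honeypot_creds = {hashlib.sha256(b"admin:123456").hexdigest()}

def reused_in_infra(username, password):
    """Return True if a credential pair seen on critical infrastructure was
    also used against the honeypot."""
    digest = hashlib.sha256(f"{username}:{password}".encode()).hexdigest()
    return digest in honeypot_creds

print(reused_in_infra("admin", "123456"))       # True: same pair hit the honeypot
print(reused_in_infra("alice", "S3cure!pass"))  # False
```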

Protocol Data
Protocol data indicate the behavior of a connection. A large number of connection attempts to the server in a short timeframe and an increase in the number of channel requests usually indicate poor connectivity to the honeypot, regardless of the protocol. Since protocol data cannot be directly quantified in all cases, they usually do not yield any helpful metric, either by direct correlation or machine learning, to identify adverse security events.

Timestamp Data
Timestamps usually indicate attackers' active timeframes. However, with attacks on exposed ports happening around the clock, especially in the cloud, and adversaries following automation procedures, the attack timeframe does not provide a direct mechanism to learn and identify malicious threats to sensitive infrastructure. However, these timestamps can be used to determine the age of an attack and its relevance when parsing events from critical infrastructure. Furthermore, the timestamps can determine the attacker's engagement time in a session. Automated sessions by adversaries typically take less time to execute a sequence of commands than manual sessions driven by a human. The difference in session times between honeypot and critical infrastructure, together with the number of commands issued, can be used to estimate the probability of malicious activity.
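The automated-versus-manual distinction above can be illustrated with a simple command-rate heuristic. The example sessions and any threshold one might apply are assumptions for the sketch, not values from the paper:

```python
def commands_per_minute(timestamps):
    """Command rate for a session, given per-command timestamps in seconds.
    Single-command sessions use a one-second floor to avoid division by zero."""
    span = (max(timestamps) - min(timestamps)) or 1
    return len(timestamps) * 60 / span

bot_session = [0, 1, 2, 3, 4, 5]        # six commands in five seconds
human_session = [0, 40, 95, 170, 260]   # five commands over ~4.5 minutes

# Automated sessions show a far higher command rate than interactive ones.
print(commands_per_minute(bot_session))    # 72.0
print(commands_per_minute(human_session))  # ~1.15
```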

Adversary Commands
Commands executed by the attacker directly translate into adversaries' TTPs and can provide valuable information for identifying malicious events in critical infrastructure. However, adversaries can alter these commands and still deduce similar information from the servers. The commands can sometimes be obfuscated or unnecessarily chained to avoid detection. They yield a wealth of information; however, reading them as strings would not help, since some of the commands are variable even though the underlying TTPs of attackers are consistent. As an example, Figure 1 demonstrates a non-exhaustive list of commands, each one checking if the user "ubuntu" exists in the system. Therefore, some retrieval is needed to ensure these commands are normalized to derive the attacker's TTP.
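The problem with reading such commands as strings can be quantified with a quick sketch. The two commands below are illustrative variants in the spirit of Figure 1 (both check whether the user "ubuntu" exists), yet their surface-level string similarity is low:

```python
import difflib

# Two functionally equivalent checks for the existence of user "ubuntu"
# (illustrative variants, not the paper's exact Figure 1 list).
a = "id ubuntu"
b = "grep '^ubuntu:' /etc/passwd"

# SequenceMatcher ratio: 1.0 means identical strings, 0.0 means no overlap.
ratio = difflib.SequenceMatcher(None, a, b).ratio()
print(f"string similarity: {ratio:.2f}")  # well below any useful match threshold
```

Any string-distance threshold loose enough to match these two would also match unrelated benign commands, which is why the TTP must be normalized first.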

Command Retrieval Using LLM
Due to the varied nature of commands with similar TTPs, treating the commands as strings is not the correct approach to analyzing adversary behavior. However, LLMs can be leveraged to analyze a command and understand the attacker's underlying TTP. Appropriate prompts must be used to ensure that LLMs yield a predictable output each time a command is analyzed. Research from Amatriain explores creating custom prompts that deliver quality results from LLM models [21]. Out of all the methodologies used to retrieve results from an LLM, Chain of Thought (CoT) prompting, Role Playing, Teaching Algorithm in Prompt, and Generate Different Options deliver a robust output for the datasets parsed in this work.
Table 3 provides information on the prompt used in this research to parse adversary commands and derive the attacker's TTPs. The instructions and prompts were executed against the model "gpt-4-turbo" in May 2024. Table 4 represents the sample output by the LLM after parsing the commands listed in Figure 1. As shown, the LLM can parse the commands and present clean output that clearly articulates the attacker's TTP, regardless of how differently a command is expressed in the honeypot.
However, the challenge with LLM-generated text is that it still contains the variables that were passed into the command. For instance, the username "ubuntu" is a variable sent to the bash command, and the attacker's TTPs are unaffected even if the variable value changes to a different username (e.g., "johndoe"). Therefore, LLM-generated data cannot be directly used for text-based matching; certain post-processing is required before it is stored in the database for later searches. The bashlex library provides a Python-based bash parsing capability to interpret a complex bash command and identify the words/arguments provided [22]. Using bashlex helps identify the arguments/entities in the bash commands issued by the adversary. Table 5 shows a sample list of commands executed in the vulnerable honeypot and the identification of entities from the data. The identified entities are replaced with generic strings in the LLM responses in Table 4.
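The entity-masking step can be sketched as follows. The paper performs this with the bashlex parser; the sketch below substitutes the stdlib shlex tokenizer and a hypothetical known-command list, so it only approximates the real pipeline:

```python
import shlex

# Hypothetical allow-list of command words; everything else is treated as an
# entity (argument value) and masked. bashlex would do this from the AST.
KNOWN_COMMANDS = {"id", "grep", "cat", "getent", "ls", "wget", "curl"}

def mask_entities(command):
    """Replace tokens that are neither known commands nor flags with a
    generic placeholder, so equivalent commands normalize alike."""
    masked = []
    for token in shlex.split(command):
        if token in KNOWN_COMMANDS or token.startswith("-"):
            masked.append(token)
        else:
            masked.append("<ENTITY>")
    return " ".join(masked)

print(mask_entities("id ubuntu"))             # id <ENTITY>
print(mask_entities("getent passwd johndoe")) # getent <ENTITY> <ENTITY>
```

With the entities masked, "id ubuntu" and "id johndoe" normalize to the same string, matching the observation above that the TTP is unchanged when only the variable value differs.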


Role Playing
You are an advanced Linux bash interpreter. You will be given explicit commands that are executed in a bash shell in a honeypot server.

Setting Expectations for Input
These bash shell commands were executed by malicious users. All the bash commands must be assumed to be malicious.

Setting Expectations for Input
You would need to interpret the malicious bash commands and identify the tactics used by the malicious user.

Setting Expectations for Output
The output of your analysis must be a list of bullet points with a verbose description of what the bash commands are doing.

Setting Expectations for Output
The output must be in plain English, showing details of what the command does without any reference to the command or its arguments.

Teaching Algorithm in Prompt
For example, for the sample input "ls /", the output must be "This command lists the directories and files in the filesystem root of the server in a non-recursive manner. The default implementation of the command does not show the hidden directories and files in the server".

Chain of Thought (CoT)
Take time before you respond. Please proceed step by step and think this through thoroughly for accuracy.

Generate Different Options
Please explore multiple potential interpretations and implications of each command to ensure a comprehensive and accurate analysis.

Setting Expectations for Output
Please provide no greetings and do not offer help toward the end. Please provide the responses as a bullet list of sentences.
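The prompt components above can be assembled into a single chat payload for a model such as gpt-4-turbo. The condensed system-prompt wording below paraphrases the table, and the message layout is an assumption for illustration rather than the paper's exact request:

```python
# Condensed system prompt paraphrasing the components in the table above.
SYSTEM_PROMPT = (
    "You are an advanced Linux bash interpreter. You will be given commands "
    "executed in a bash shell on a honeypot server; assume every command is "
    "malicious. Interpret each command and identify the tactics used. "
    "Respond with a bullet list of verbose plain-English descriptions, with "
    "no reference to the command or its arguments, no greetings, and no "
    "offers of help. Take time, proceed step by step, and explore multiple "
    "potential interpretations of each command."
)


def build_messages(command):
    """Return the chat-completion message list for one honeypot command."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": command},
    ]
```

The returned list can then be passed to the chat-completion endpoint of the chosen LLM provider.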
Finally, once the entities in the LLM responses have been replaced with generic strings, the resultant text is converted into an embedding vector for efficient searching during the augmentation step. Multiple embedding models can transform text into vectors, and identifying more efficient models remains an active area of research. For the current research, "Salesforce/SFR-Embedding-Mistral" is used as the embedding model since it is one of the top retrieval models on the Hugging Face MTEB leaderboard as of May 2024 [23,24]. Since this model is quite large, executing it requires an A100 GPU running in a Google Colab environment; however, augmentation can also be achieved with smaller models in a local environment. The generated embeddings can be stored in a database and searched to match the TTPs from critical infrastructure during the retrieval process.
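Instruct-style embedding models of this family embed documents (here, the honeypot TTP descriptions) as-is, while queries carry a one-line task description. The `Instruct:/Query:` layout below follows the E5-Mistral-family convention on which SFR-Embedding-Mistral is based and is an assumption here, not a quote from the paper:

```python
def build_embedding_inputs(task_description, query, documents):
    """Prepare inputs for an instruct-style embedding model: the query is
    prefixed with a one-line task description, documents are embedded raw."""
    query_input = f"Instruct: {task_description}\nQuery: {query}"
    return [query_input] + list(documents)
```

The resulting strings would then be fed to the embedding model in a batch; the first output vector is the query embedding and the rest are document embeddings.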
The retrieval part of the model utilizes commands generated by the company's critical infrastructure users. To ensure consistency with the embeddings, the processing of user commands must adhere to the same steps as before. The user command is forwarded to an LLM with an appropriate prompt to retrieve the command description in a format consistent with the honeypot models. After obtaining the description, the embeddings are computed using "Salesforce/SFR-Embedding-Mistral" as the embedding model. The model Salesforce/SFR-Embedding-Mistral requires the query to be accompanied by a task description in a specific format. Figure 2 shows a sample task description and response from the model for the query command:

CATCMD = cat; eval "$CATCMD/etc/passwd"|grep ubuntu

The primary objective of this command is to extract and display specific information about user accounts that include "ubuntu" in their details, which can include usernames, home directories, or shell paths. This action might be used to inspect or manipulate user properties or to validate the existence and specifics of particular user accounts. The model returned the following interpretations:

• This command sequence sets an environment variable CATCMD to store the name of a Unix command used for displaying the content of files;
• Using eval, it executes the command stored in CATCMD to display the content of the server's user database file;
• The output of this command is then filtered to specifically search for entries containing the term "ubuntu";
• The purpose is essentially to isolate and reveal details pertaining to users with "ubuntu" in their information from the user database. This can be used for inspecting user configurations, checking user existence, or analyzing security policies related to those user accounts.

• This command sequence displays the contents of the user database file from the absolute root directory by repeatedly traversing up through parent directories until the root is reached and then accessing the /etc/passwd file;
• It then filters the displayed content to find lines containing the text "ubuntu", typically to identify information regarding any user whose details include "ubuntu" in the user database;
• The primary aim is to search for and isolate specific user information, potentially for unauthorized access, exploitation, or verification of specific user existence from a sensitive system file that stores all user details.

• This command first reads the entire contents of the server's user database file directly into memory without explicitly calling a separate command to display the file;
• This input is processed to specifically filter and show details that involve the term "ubuntu", likely echoing back information about a user or users related to "ubuntu";
• The primary intent of this operation is to precisely extract user-specific information in order to monitor, verify, or manipulate details regarding user accounts on the system, specifically those involving "ubuntu".

Embeddings from the model for the query are matched against embeddings from the honeypot commands. The embeddings from "Salesforce/SFR-Embedding-Mistral" are not normalized; hence, cosine similarity is used to obtain the matching vectors from the embeddings derived from the command. The cosine similarity metric facilitates vector comparison by quantifying the cosine of the angle between vectors, offering a robust measure of directional similarity while disregarding magnitude discrepancies. Its application spans various domains, enabling precise analysis and classification tasks in complex datasets. The cosine similarity function provides two outputs: the similarity score that indicates the level of match between the texts and the list of indices that match the score.
The retrieval mechanism aims to match the top 'k' results against the input value. The value of this threshold 'k' depends on multiple factors, such as the sensitivity of the critical infrastructure, the underlying model used for similarity analysis, the organization's appetite for false positives and false negatives, and the similarity between the honeypot infrastructure and the sensitive critical infrastructure. It is important to note that a lower value of this threshold increases false positives if the adversary's commands differ from those executed in the sensitive infrastructure. In comparison, a higher value of the threshold increases false negatives and LLM hallucinations. Also, an increase in the value of 'k' further increases the number of tokens in the input, thus increasing the underlying cost of performing LLM searches. In the current scenario of a simple honeypot and a mimicking SSH server, the 'k' value selected is 3. However, this value could be varied depending on the criteria listed above.
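The similarity search described above can be sketched in a few lines of plain Python; a real deployment would use batched matrix operations or a vector database, but the logic is the same. The function returns the two outputs mentioned, the matching indices and their cosine scores:

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two (not necessarily normalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def top_k(query_vec, honeypot_vecs, k=3):
    """Return (index, score) pairs for the k honeypot embeddings closest
    to the query embedding; k=3 matches the setup described above."""
    scored = sorted(
        ((cosine(query_vec, v), i) for i, v in enumerate(honeypot_vecs)),
        reverse=True,
    )
    return [(i, score) for score, i in scored[:k]]
```

Because cosine similarity normalizes by vector length, it works directly on the unnormalized embeddings produced by the model.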

Augmentation Using LLM
A retrieval analysis of honeypot commands from an adversary typically yields a reasonably good measure of similarity to the commands executed in the sensitive infrastructure. The magnitude of the cosine similarity score indicates the extent of the match between two commands: a score closer to 1.0 means a closer match between the commands executed in the critical infrastructure and the honeypot. The objective of the current research is to identify malicious events in the critical infrastructure. Therefore, a specific threshold on the cosine similarity score must be applied to ensure that only sufficiently similar values are selected for subsequent analysis.
To meet the research objectives, it is not necessary to consider every command that exceeds the cosine similarity threshold for the LLM analysis. Instead, the focus is on the top 'k' values from the cosine similarity results that exceed the selected threshold. An LLM can be used to analyze these top 'k' search results, allowing it to quantitatively determine whether a command is malicious or benign and the category of the attacker TTPs employed in the command. By employing suitable prompt engineering approaches and providing context from the retrieved queries, the LLM can be augmented to generate quantifiable feedback. Table 6 lists the instructions given to "gpt-4-turbo" for generating responses to the query. This query uses Role Playing and Teaching Algorithm in Prompt to assess whether the command is malicious, and it ensures the LLM interprets the input and provides a minimal binary output for further analysis. Further, LLMs can also derive the attack category the adversary uses. MITRE publishes a global database of enterprise ATT&CK categories and a knowledge base [25]. Leveraging the top attack categories from the list, Table 6 lists the query to determine the category of the attacker's TTP:

You would be given a bulleted list of command descriptions that adversaries have used against our organization's honeypot. These are called attacker TTPs.

Setting Expectations for Output
For a given query, leverage the provided TTPs and categorize the query into one of the ATT&CK categories: "Reconnaissance", "Persistence", "Impact", "Exfiltration", and "Command and Control".

Setting Expectations for Output
Please provide no greetings and do not offer help toward the end. Please provide the response as a single word with the category name.
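The augmentation step can be sketched as follows: retrieved honeypot TTP descriptions whose similarity exceeds the threshold become context for the categorization prompt. The exact wording and the 0.75 default are illustrative; the paper's Table 6 gives the actual instructions.

```python
ATTACK_CATEGORIES = ["Reconnaissance", "Persistence", "Impact",
                     "Exfiltration", "Command and Control"]


def build_augmented_prompt(query_description, retrieved, threshold=0.75):
    """Assemble the categorization prompt from retrieved honeypot TTP
    descriptions above the similarity threshold. `retrieved` is a list
    of (description, score) pairs from the retrieval step."""
    context = [desc for desc, score in retrieved if score >= threshold]
    bullets = "\n".join(f"- {desc}" for desc in context)
    return (
        "You would be given a bulleted list of command descriptions that "
        "adversaries have used against our organization's honeypot. These "
        "are called attacker TTPs.\n"
        f"{bullets}\n"
        "For the following query, leverage the provided TTPs and categorize "
        f"it into one of: {', '.join(ATTACK_CATEGORIES)}. Respond with a "
        "single word.\n"
        f"Query: {query_description}"
    )
```

The assembled prompt is then sent to the LLM, whose single-word response maps directly onto one of the shortlisted ATT&CK categories.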

Generation Using LLM
Figure 3 provides a snapshot of this LLM prompt in action. Leveraging the query and the instructions, an LLM can be used to determine whether the command executed in the critical infrastructure is malicious. As part of the research, the "gpt-4-turbo" LLM is used for the analysis. However, it could easily be replaced with lower-cost LLMs or models that run natively within the organization's infrastructure. A combination of attacker categories, attack command matches, and command session time learning can be used to determine whether a given event in critical infrastructure is malicious.

Putting Model Together-Creating a Pipeline
While the example focuses on SSH, similar insights into adversaries can be extracted from other protocol-based honeypots. As indicated in the sections above, the datasets gathered from high-interaction honeypots are diverse and must be pre-processed and carefully consolidated before integration into machine learning frameworks. Although attacker IP addresses and credentials are not directly conducive to machine learning and data analysis, they can indicate suspicious activity when cross-referenced across honeypots and infrastructure logs. The protocol and timestamp data present challenges in directly quantifying malicious threats. Adversary commands provide valuable insights into tactics, techniques, and procedures (TTPs) and can be analyzed using large language models. Utilizing LLMs, specifically Retrieval-Augmented Generation (RAG), enables TTP extraction from commands, aiding in identifying malicious activity. These methodologies, coupled with suitable embedding models and retrieval mechanisms, facilitate comprehensive analysis and identification of threats within critical infrastructure, bolstering cybersecurity defenses.


Model Algorithm Implementation
Figure 4 describes the overall algorithm for the model. The model is constructed through a series of if-then-else statements that classify a given command executed in critical infrastructure as malicious or benign. Before the model can be constructed, it is essential to identify a sufficient time window on which the rest of the analysis can be based. The time window is important because commands are classified as malicious or benign within a specific window based on the attacks happening concurrently in the honeypot; it would not be ideal to flag an event in critical infrastructure as malicious by referencing a honeypot event that occurred months or even years ago. The time window length depends on various factors, such as the organization's industry, the compliance window needed for the reference time frame, the nature of the endpoint, and the amount of traffic received by the honeypot. For this paper, the time window of the analysis is selected to be one week (7 days).
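Applying the window is a simple timestamp filter over the honeypot records. The 7-day constant matches the paper's choice; the event field name is an assumption for illustration:

```python
import time

WINDOW_SECONDS = 7 * 24 * 3600  # the paper's selected one-week window


def events_in_window(honeypot_events, now=None):
    """Keep only honeypot events whose epoch `timestamp` falls inside the
    analysis window ending at `now` (field name assumed)."""
    now = time.time() if now is None else now
    return [e for e in honeypot_events
            if 0 <= now - e["timestamp"] <= WINDOW_SECONDS]
```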
For a given time window, the various user parameters to the critical infrastructure are iteratively parsed, as described in Figure 4, to determine if a given command is malicious. The attacker's IP address and credentials directly correlate with malicious activity; if these are identified in the critical infrastructure, they must be immediately classified as malicious. Machine learning algorithms are executed on each of the remaining datasets: attacker commands executed and attacker session time. RAG analysis is performed on the attacker commands, resulting in a True or False result along with a TTP category result, together indicating whether the command is suspicious or benign. Attacker session time also provides a mechanism to classify the transaction. Algorithms such as K-Means could be applied to classify existing benign sessions in the critical infrastructure and adversary sessions in the honeypot within the selected time frame. The trained model would then help identify how suspicious a given session in the critical infrastructure is. The data from adversary commands and session time analysis are, at best, probability-based and should ideally not be marked as malicious by themselves. Therefore, qualitative criteria are chosen to appropriately rate the outputs from the machine learning models and determine the result.
The machine learning models described in this section are converted into Python code snippets to aid data parsing from the honeypots and critical infrastructure. This paper picks a threshold of 0.75 for the cosine similarity above which a honeypot command is used for subsequent analysis. Note that this threshold varies depending on factors such as the similarity between the honeypots and critical infrastructure, the number of commands to match, and the risk sensitivity of an organization. Figure 5 presents the pseudocode to parse the data from the honeypots, and Figure 6 presents the pseudocode to parse the commands from the critical infrastructure and identify malicious events. For the current setup and analysis of the events from the honeypot over two weeks, five MITRE ATT&CK categories are shortlisted: (1) Reconnaissance, (2) Persistence, (3) Impact, (4) Exfiltration, and (5) Command & Control.
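The decision flow of Figure 4 can be approximated by the sketch below. The field names and the rule combining the probability-based signals are assumptions; the paper's actual combination criteria are qualitative.

```python
def classify_event(event, honeypot_ips, honeypot_passwords):
    """Approximate the if-then-else flow of Figure 4: direct IP or
    credential reuse from the honeypot is immediately malicious; otherwise
    the RAG verdict and the session-time anomaly signal are combined."""
    if event["src_ip"] in honeypot_ips:
        return "malicious"
    if event["password"] in honeypot_passwords:
        return "malicious"
    # Probability-based signals: require agreement before flagging.
    if event["rag_is_malicious"] and event["session_anomalous"]:
        return "malicious"
    return "benign"
```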

Model Infrastructure Implementation
Figure 7 represents the high-level architecture for the current research. The following are the key design objectives identified for the honeypot and infrastructure design:

• System Similarity: The underlying honeypot and the critical infrastructure have similar setups. Both operating systems are Ubuntu 22.04 LTS Jammy Jellyfish. The installed software and the configuration settings are identical between both systems, preventing attackers from differentiating one system from another. The honeypot is configured to allow adversary login with any username and password. Only one account credential used by the critical infrastructure (account name: ubuntu) has a weak password ("abc123");
• Attracting Adversaries: The AWS cloud hosts a combination of a honeypot network and critical infrastructure intended for authenticated, organization-approved users. Based on prior research, AWS was selected to host the honeypots because adversaries target the cloud providers with the highest revenues [19]. A key objective of the research is to maximize the attacks so that correlations to the honeypot data can happen at a shorter frequency;
• High-Interaction Setup: ContainerSSH (version 0.5) is the SSH proxy selected, allowing adversaries and system users to access the honeypot. ContainerSSH offers full SSH functionality to the end users [26] while acting as a proxy to record adversary and user commands. At the same time, the users receive a full-fledged shell into the host and can execute any command to their liking. The ContainerSSH proxy helps log all the SSH protocol-based interactions between adversaries and native infrastructure users;
• Availability Monitoring, Detection, and Correction: High-interaction honeypots expose the internals of the operating system to adversaries. Due to the nature of the attacks executed on them, these honeypots are extremely fragile [27]. Therefore, additional infrastructure is needed to protect against service availability concerns. For this reason, a Kubernetes service leveraging Kops (version 1.28.4) was selected to host the honeypot containers. Appropriate health checks are configured in Kubernetes to recycle the containers each time the health check results degrade;
• Concurrent Log Storage: Logs from the honeypots and critical infrastructure are quickly transmitted to an offsite S3 bucket (outside the VPC exposed to the users). Every session in both infrastructures is logged in binary format through the ContainerSSH audit log. This binary format allows for the storage of all the SSH protocol communications and commands executed by the end user. The parity of the logging formats helps cross-correlate data between the two architectures. Further, concurrent logging to the offsite bucket helps ensure the availability of the logs in the event of a denial of service on the honeypot;
• Time Window Implementation: The Lambda service automatically parses all the logs from the S3 buckets and extracts the relevant information for the log analysis. The parsed honeypot data are stored in DynamoDB tables with an automated expiration TTL for each record. DynamoDB automatically deletes old records whose TTL value has elapsed [28]. The current TTL selected for this research is 7 days. Hence, a current log record from critical infrastructure is matched against honeypot logs from the past seven days;
• Machine Learning Analysis Compute: For the current research, the embedding model selected is "Salesforce/SFR-Embedding-Mistral", and the LLM selected is "gpt-4-turbo". The embedding model requires a powerful GPU, and the code snippets need an Nvidia A100 GPU for smooth execution. Since the cost of constantly running the Nvidia A100 GPU 24 × 7 is enormous, a scheduler is used to launch the servers and code once every 24 h to parse the commands in the selected timeframe. The servers then automatically shut down after the parsing for the day is complete. Since the parsing happens daily, it would take up to one day to identify any malicious activity in the servers using this model. Note that smaller models could be selected instead of "Salesforce/SFR-Embedding-Mistral" for the analysis.
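Setting the DynamoDB TTL described above amounts to writing an epoch-seconds expiry attribute with each honeypot record; DynamoDB then deletes the record automatically once that time passes. The attribute name `ttl` is an assumption (it is configurable per table):

```python
import time

TTL_DAYS = 7  # records expire after the 7-day analysis window


def record_with_ttl(item, now=None):
    """Attach a DynamoDB-style `ttl` attribute (epoch seconds) so the
    record is automatically deleted after the analysis window."""
    now = time.time() if now is None else now
    item = dict(item)
    item["ttl"] = int(now + TTL_DAYS * 24 * 3600)
    return item
```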
Electronics 2024, 13, x FOR PEER REVIEW


Experimental Setup
• The experimental setup, depicted in Figure 7, was implemented in both AWS and GCP environments. Kops was employed to automatically create the necessary infrastructure within the AWS account. The Kubernetes API server was configured to be private and accessible only through a bastion host managed by Kops. Kubernetes was deployed on two instances: one dedicated to Kubernetes management and monitoring APIs and the other hosting containers for exploitation purposes. Monitoring capabilities within the Kubernetes service were enhanced using Grafana, Alert Manager, and Prometheus solutions. Alerts were configured to notify administrators if the CPU load on the EC2 instances exceeded 80% for a continuous period of 15 min.

• Kubernetes namespaces were created to host the SSH honeypot solution. The first namespace, named ContainerSSHAdmin, hosted the ContainerSSH solution and its supporting infrastructure. The second namespace, named AdversaryTargets, hosted the target containers for SSH interactions. The containerized SSH service was created using Kubernetes deployment YAML files. Custom code was developed for the authentication service within ContainerSSH, allowing adversaries to access the high-interaction honeypot with any password or private key. The overall solution comprised two containers: one running ContainerSSH code acting as a proxy for SSH connections from adversaries and the other hosting the authentication service that accepted credentials submitted by the adversaries. The authentication service, invoked by ContainerSSH, created a container in the AdversaryTargets namespace for each successful authentication. Each successful SSH connection by an adversary instantiated a new container in the AdversaryTargets namespace, where the adversary obtained root privileges within the SSH session. To mimic critical infrastructure within the organization, Internet access from the SSH session was restricted.
• The ContainerSSH solution required access to S3 buckets, facilitated using an AWS access key and secret key pair securely stored within the ContainerSSH deployment. This setup ensured a robust and monitored environment for experimentation and analysis. Lambda functions were deployed on the AWS accounts to automatically parse logs arriving in the S3 buckets and write them to a new location within the same bucket. GCP servers equipped with A100 GPUs were deployed to analyze the logs and utilize them for LLM analysis. Scripts within Google Colab automatically downloaded parsed logs from the AWS S3 bucket and performed machine learning analysis leveraging the GPUs. Additionally, a Google Colab environment with A100 GPUs was employed for manual analysis and monitoring of the logs. Figure 8 depicts the overall dataflow diagram of the setup.

• This comprehensive setup provided a secure and efficient infrastructure for conducting detailed experiments and analyses in a controlled environment. The honeypot and critical infrastructure were kept on the Internet without firewall controls between 26 April 2024 and 3 May 2024. The availability of both systems during this time was observed to be 100%.

Results and Discussion
The architected SSH honeypot and critical infrastructure servers remained open to attacks for 7 days, continuously collecting and analyzing data on the critical infrastructure while adversaries carried out attacks. The Kubernetes platform running the honeypot server helped stabilize it during the attacks.

IP Address-Based Analytics
Attacks from 1154 unique IP addresses were observed on the honeypot over 7 days. Figure 9 illustrates the geographic distribution of identified adversaries during this period. Among the 1154 IP addresses, only 345 unique IP addresses issued commands to the honeypot server (command executors). The other IP addresses scanned the service for usernames and passwords (scanners), presumably to launch attacks later once a database of vulnerable hosts is created. The blue dots in Figure 9 represent IP addresses primarily functioning as scanners, while the red dots are IP addresses that attempted to exploit the honeypot with valid commands. The IP addresses appear to be geographically distributed and cannot be statistically pinpointed to any location or region. This is likely because attackers usually do not use their own IP addresses to scan and maliciously connect to exposed infrastructure; instead, they leverage existing compromised hosts for such connections. Such hosts exist all over the world, and some of them are compromised using attacks similar to those observed on the honeypot. Any successful login from these IP addresses to the organization's critical infrastructure should be considered malicious and treated as an incident.
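The IP-based criterion above can be sketched as a simple set lookup against the honeypot's recorded adversary addresses. The function name and data structures below are hypothetical illustrations, not the paper's implementation; the IP addresses are reserved documentation values.

```python
# Sketch: flag critical-infrastructure logins originating from IP addresses
# previously recorded on the honeypot. Data structures are hypothetical.

def flag_malicious_logins(infra_logins, honeypot_ips):
    """Return the subset of logins whose source IP was seen on the honeypot.

    infra_logins: iterable of (source_ip, username) tuples taken from the
                  critical infrastructure's successful SSH logins.
    honeypot_ips: set of IP addresses recorded on the honeypot
                  (scanners and command executors alike).
    """
    return [login for login in infra_logins if login[0] in honeypot_ips]

honeypot_ips = {"203.0.113.7", "198.51.100.42"}   # example documentation IPs
infra_logins = [("203.0.113.7", "admin"), ("192.0.2.10", "deploy")]

incidents = flag_malicious_logins(infra_logins, honeypot_ips)
```

Because any honeypot-recorded IP is a direct indicator of compromise, each match here would be raised as an incident rather than scored further.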

Credential-Based Analytics
Over the 7-day period, approximately 23,059 passwords were identified in the honeypot, with 10,274 unique strings recorded from unique IP addresses. Figure 10 displays a word cloud populated with the top 1000 unique strings from unique IP addresses and their observed frequencies within the honeypot. It is important to note that most passwords come from publicly available lists such as rockyou.txt [29]. Looking at the strings by holistic count rather than by strings used by unique IP addresses provides a different perspective. Figure 11 presents a word cloud representing the top 1000 passwords recorded in the honeypot based on their connection frequency from all IP addresses. This word cloud includes strings outside standard dictionary-based passwords, explicitly set by adversaries in previous connections to honeypots. Figure 12 depicts a sample attack in action, where an adversary from IP address 137.186.242.99 connected using the username and password "craft" and then attempted to change the password to "GwB4zfcxU2tm6UN" in the honeypot. Subsequent connections from this IP address used the password "GwB4zfcxU2tm6UN" to connect to the honeypot. All passwords used by adversaries should be considered malicious, and if an adversary-used password is observed in a connection to critical infrastructure, the session must be considered malicious.

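The credential criterion reduces to the same kind of membership test. The sketch below uses the two passwords quoted in the text plus one hypothetical dictionary entry; the function and data structures are illustrative, not the paper's implementation.

```python
# Sketch: flag any critical-infrastructure authentication that reuses a
# password previously recorded on the honeypot. "craft" and
# "GwB4zfcxU2tm6UN" are the examples from the text; "123456" is a
# hypothetical dictionary-list entry.

adversary_passwords = {"craft", "GwB4zfcxU2tm6UN", "123456"}

def session_is_malicious(password_used: str) -> bool:
    """Any session authenticated with an adversary-used password is malicious."""
    return password_used in adversary_passwords

flagged = session_is_malicious("GwB4zfcxU2tm6UN")
```

In a deployment, the comparison would happen at authentication time inside the monitoring pipeline; like the IP criterion, a match is a direct indicator of compromise.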
Session Time-Based Analysis
Approximately 1902 sessions in the selected seven-day window had a session time greater than 0 s. During the same period, the critical infrastructure recorded around 51 SSH sessions. A K-Means plot generated two clusters from the combined honeypot and critical-infrastructure data. Figure 13 illustrates the K-Means plot, with data reevaluated for the observed time window. The blue dots represent the total time for automated sessions in the honeypot, and the orange dots represent the total time required for manual sessions in the critical infrastructure. The details of the K-Means plot are summarized in Table 7. The silhouette score of the K-Means clustering was calculated to be 0.98, indicating a very high-quality clustering result and a good level of confidence in distinguishing automated sessions from manual sessions. Table 8 presents the confusion matrix of the clustering analysis, with an F1 score of 0.91.

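The session-time clustering can be sketched as a one-dimensional two-means (Lloyd's algorithm) over session durations, assuming short automated honeypot sessions and longer manual sessions. The durations below are illustrative values, not the paper's measured data.

```python
# Minimal sketch of the session-time analysis: cluster 1-D session
# durations into two groups (automated vs. manual) with Lloyd's algorithm.
import statistics

def two_means_1d(durations, iters=100):
    """Cluster 1-D durations into two groups; returns (centroids, labels).

    Cluster 0 is seeded at the minimum (short/automated sessions) and
    cluster 1 at the maximum (long/manual sessions).
    """
    c0, c1 = min(durations), max(durations)
    labels = [0] * len(durations)
    for _ in range(iters):
        labels = [0 if abs(d - c0) <= abs(d - c1) else 1 for d in durations]
        g0 = [d for d, l in zip(durations, labels) if l == 0]
        g1 = [d for d, l in zip(durations, labels) if l == 1]
        if g0:
            c0 = statistics.mean(g0)
        if g1:
            c1 = statistics.mean(g1)
    return (c0, c1), labels

# Illustrative durations (seconds): automated honeypot sessions vs. manual ones.
durations = [1.2, 0.8, 2.5, 1.9, 0.5, 310.0, 450.5, 280.0]
(c_auto, c_manual), labels = two_means_1d(durations)
```

With well-separated durations like these, the clusters barely overlap, which is consistent with the very high silhouette score the paper reports for its data.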
Command Analysis
Adversaries executed 47,764 instances of commands on the honeypot machine over the time window. Bashlex analysis removed the variable entities within each command, revealing that many command instances were repetitions of the same command with different variable values. Only 165 unique commands were identified among the 47,764 command instances executed in the honeypot over one week. Table 9 details the top 10 commands executed in the honeypot. Notably, about 27,368 of the 47,764 command instances were a single echo command with specific arguments: echo -e "\x6F\x6B". This command simply prints "ok" on the screen; however, it is likely used in attacker scripts to check whether the shell environment responds to basic commands such as echo with an ASCII representation of hex characters. This finding correlates with research by Touch and Colin on commands executed within various honeypots on the Internet [30].
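The hex-escape probe is easy to verify: the two escape sequences in the dominant echo command decode to "ok". The Python decode below mirrors what bash's echo -e does with those escapes.

```python
# \x6F and \x6B are the hex escapes from the observed command
# `echo -e "\x6F\x6B"`; decoding them shows why attackers use it as a
# cheap check that the shell honours escape sequences.
decoded = b"\x6F\x6B".decode("ascii")
```

A shell that prints "ok" in response convinces the attacker's script that it is talking to a functional bash-like environment rather than a dumb listener.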
The 165 commands were analyzed using the "gpt-4-turbo" LLM with the prompt discussed in Section 3 to derive their behavior. Additionally, some post-processing was performed on the LLM results using Bashlex functions to ensure that variables used in each command were entirely removed and replaced with generic strings. The resultant text was then converted to vectors using the "Salesforce/SFR-Embedding-Mistral" embedding model and stored in the DynamoDB database. The DynamoDB table has a TTL attribute set, ensuring each record automatically expires in a week.
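The normalization step can be illustrated with a small regex-based stand-in: variable operands (quoted strings, IP addresses, paths, numbers) are replaced with generic placeholders so that repeated commands differing only in arguments collapse to one canonical form. The real pipeline uses the bashlex library for this; the regexes below are only a sketch of the idea, and the patterns are assumptions.

```python
# Regex stand-in for the Bashlex-based entity removal: replace variable
# operands with placeholders so argument-only variants collapse together.
import re

def normalize_command(cmd: str) -> str:
    cmd = re.sub(r'"[^"]*"|\'[^\']*\'', "<STR>", cmd)           # quoted strings
    cmd = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<IP>", cmd)   # IPv4 addresses
    cmd = re.sub(r"(?<=\s)/[\w./-]+", "<PATH>", cmd)            # absolute paths
    cmd = re.sub(r"\b\d+\b", "<NUM>", cmd)                      # bare numbers
    return cmd

# Two argument-only variants of the same download command collapse to one form:
a = normalize_command("wget http://203.0.113.7/x.sh -O /tmp/x.sh")
b = normalize_command("wget http://198.51.100.9/x.sh -O /tmp/x.sh")
```

This collapsing is what reduces the 47,764 raw instances to the 165 unique commands reported above.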
A similar exercise was conducted for the commands executed in the critical infrastructure. Command details were derived from "gpt-4-turbo", and the LLM-derived descriptions were normalized with variables removed using a Python script leveraging the Bashlex library. The resultant text was then converted to vectors using the previously mentioned embedding model, "Salesforce/SFR-Embedding-Mistral". A cosine similarity search was performed on the vectors obtained from commands in the critical infrastructure against the list of vectors from the honeypot commands executed by the adversaries. Table 10 provides examples of the commands and their cosine similarity search results. The LLM proved useful at this step: even when the cosine similarity scores exceeded the threshold, sometimes the command was not malicious, or the malicious category was not identified. In these circumstances, running the LLM helped refine the results of the similarity search.
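The similarity search itself reduces to cosine comparisons between the infrastructure command's embedding and the stored honeypot embeddings. The three-dimensional vectors, record names, and threshold below are tiny illustrative stand-ins; real "Salesforce/SFR-Embedding-Mistral" embeddings are high-dimensional, and the paper's actual cut-off may differ.

```python
# Sketch of the cosine-similarity lookup: compare one infrastructure
# command vector against stored honeypot command vectors; matches above
# a threshold are escalated to the LLM for final classification.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

THRESHOLD = 0.8  # assumed cut-off for illustration

honeypot_vectors = {"cmd-download-malware": [0.9, 0.1, 0.4],
                    "cmd-add-ssh-key":      [0.1, 0.8, 0.2]}
infra_vector = [0.88, 0.12, 0.41]

best = max(honeypot_vectors, key=lambda k: cosine(infra_vector, honeypot_vectors[k]))
suspicious = cosine(infra_vector, honeypot_vectors[best]) >= THRESHOLD
```

As the text notes, a score above the threshold is treated as a candidate match, not a verdict; the LLM pass then confirms or rejects the classification.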

Identifying Malicious Command Execution in Critical Infrastructure
Combining the above, the criteria to identify malicious command execution in critical infrastructure are as follows:
1. A successful connection and login to critical infrastructure from a known attacker's IP address recorded in the honeypot during the observation time window could be deemed malicious;
2. A successful login into critical infrastructure leveraging credentials used by adversaries in the honeypot during the observation window could be deemed malicious;
3. Table 12 presents the qualitative weights for identifying malicious command execution between session time and command analysis.
By following these algorithms, one can quickly parse high-interaction honeypot logs on an ongoing basis to establish whether a given event in an organization's critical infrastructure is malicious.
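The combined decision can be sketched as a single function: the two direct indicators (IP address and credentials) are conclusive on their own, while the behavioural signals (session-time clustering and command similarity) are weighted together. The 0.5 weights below are hypothetical placeholders standing in for the qualitative weights of Table 12, not the paper's actual values.

```python
# Sketch of the combined classification logic. Weights are assumed
# placeholders for the qualitative weights described in the text.

def classify_session(ip_seen_on_honeypot: bool,
                     credential_seen_on_honeypot: bool,
                     session_looks_automated: bool,
                     command_similarity_hit: bool) -> str:
    # Direct indicators of compromise: either alone is conclusive.
    if ip_seen_on_honeypot or credential_seen_on_honeypot:
        return "malicious"
    # Behavioural indicators combined with assumed qualitative weights.
    score = 0.0
    if session_looks_automated:
        score += 0.5   # K-Means session-time signal
    if command_similarity_hit:
        score += 0.5   # RAG/cosine-similarity command signal
    return "malicious" if score >= 0.5 else "benign"

verdict = classify_session(False, False, True, False)
```

Running this check on each new session gives the ongoing, automated parsing of honeypot intelligence that the paragraph describes.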

Remarks and Conclusions
A robust model has been designed to quickly parse high-interaction honeypot logs and automatically associate them with activities within the organization's infrastructure, leveraging machine learning and large language model-based techniques. The model was tested using real-world attacks on a high-interaction honeypot, and the commands were compared to those executed on real-world SSH servers running in the AWS cloud. The attackers' motive to quickly exploit and establish persistence in Internet-exposed servers drives their attack tactics and explains the similarity of attacks across servers exposing the same protocol. The following conclusions can be drawn from the research in this article.
Four independent parameters were used to analyze adversary behavior in high-interaction honeypots. These four parameters help shape the monitoring implemented in the critical infrastructure:
1. Two of the parameters, the adversary's IP address and credentials, are direct indicators of compromise. Essentially, if a connection to the critical infrastructure originates from a previously recorded adversary IP address or leverages one of the attacker-compromised credentials, the activity in the critical infrastructure can be deemed malicious;
2. K-Means clustering has proven to be an effective method for analyzing session time on honeypot servers and critical infrastructure. Attackers spend minimal time executing commands to gather information and establish persistence in the honeypot [19]. K-Means clustering derives clusters that separate honeypot sessions from typical server sessions with an excellent silhouette score and F1 score;
3. The Retrieval-Augmented Generation (RAG) method offers a mechanism for identifying whether commands executed in critical infrastructure are malicious, leveraging the information from existing commands executed in the honeypot infrastructure. Further, the LLM processing also classifies a command executed in the critical infrastructure into various MITRE ATT&CK categories;
4. A qualitative combination of the K-Means analysis and the RAG analysis using language model-based techniques provides another avenue for categorizing events in critical infrastructure as malicious.

Further Research and Next Steps
Previously, data analysis from high-interaction honeypots could not be automated. By leveraging the research in this paper, any organization can process the data from high-interaction honeypots and detect incidents in their infrastructure as early as possible. Faster detection reduces incident costs. This research helps organizations make automated decisions and stay aware of the threat actors in their environment.
Once an attack is deemed malicious, organizations can define plans to enhance their incident management efforts to contain it. Further research is needed to develop a robust incident management plan that aligns with the discussions in this paper to aid in analyzing, containing, and mitigating the threat and remediating any underlying security issues. The current research aims to understand adversary TTPs by leveraging attacks on the application layer of the networking stack. Further research is required to extend the methodology and design proposed here to identify DoS attacks executed on non-application layers (e.g., ping flood, TCP SYN attacks). The research could also be extended to include a more complex honeypot setup involving multiple honeypot servers to allow for lateral movement of adversaries and to leverage honeytokens to add new dimensions to the analysis. Normalizing this research across multiple different

Figure 1. Sample list of commands to identify if the user "ubuntu" exists in a bash shell.

Figure 3. Sample response to a user command query to identify a malicious transaction.

Figure 4. Flow chart of the proposed model.

Figure 6. Pseudocode to parse the commands from critical infrastructure using the LLM.

Figure 7. High-level architecture setup for the model validation.

Figure 8 depicts the overall data flow diagram of the setup.

Figure 8. Data flow diagram of the experimental setup.

Figure 9. Geographic distribution of the attacks on the honeypot. Legend: blue dots = recorded TCP connection without executing a single command in the honeypot (scanners); red dots = connection that executed commands in the honeypot (command executors).

Figure 10. Word cloud of passwords from unique IP addresses. Size is proportional to the number of logins.

Figure 11. Word cloud of passwords. The size of each password is proportional to the number of commands executed using that password.

Figure 12. Attack to change the password (included in the word cloud in Figure 11) of a user logged in during a subsequent session.

Figure 13. Plot of clustering into two partitions using the K-Means algorithm.

Table 2. Data types collected from the honeypot.

Table 3. Prompts used to derive adversary TTPs.

Table 5. Identification of command-line attributes (entities) from a bash command.

Table 6. Prompts used to derive responses to infrastructure commands.

Table 7. Analysis of the K-Means plot.

Table 11. Illustrative LLM model results to categorize commands and fetch TTP categories.

Table 12. Qualitative criteria between machine learning models.