Prompt Engineering or Fine-Tuning? A Case Study on Phishing Detection with Large Language Models

: Large Language Models (LLMs) are reshaping the landscape of Machine Learning (ML) application development. The emergence of versatile LLMs capable of undertaking a wide array of tasks has reduced the necessity for intensive human involvement in training and maintaining ML models. Despite these advancements, a pivotal question emerges: can these generalized models negate the need for task-specific models? This study addresses this question by comparing the effectiveness of LLMs in detecting phishing URLs when utilized with prompt-engineering techniques versus when fine-tuned. Notably, we explore multiple prompt-engineering strategies for phishing URL detection and apply them to two chat models, GPT-3.5-turbo and Claude 2. In this context, the maximum result achieved was an F1-score of 92.74% by using a test set of 1000 samples. Following this, we fine-tune a range of base LLMs, including GPT-2, Bloom, Baby LLaMA, and DistilGPT-2—all primarily developed for text generation—exclusively for phishing URL detection. The fine-tuning approach culminated in a peak performance, achieving an F1-score of 97.29% and an AUC of 99.56% on the same test set, thereby outperforming existing state-of-the-art methods. These results highlight that while LLMs harnessed through prompt engineering can expedite application development processes, achieving a decent performance, they are not as effective as dedicated, task-specific LLMs.


Introduction
In recent years, ML application development has witnessed a transformative impact due to the emergence of LLMs.These models, notable for their vast training across diverse datasets, have demonstrated an impressive adaptability and performance across a wide array of tasks [1][2][3][4].This development signifies a shift toward utilizing LLMs through prompting for various applications, challenging the conventional need for intensive, taskspecific training and the maintenance of ML models.This paradigm shift has propelled ML toward a new era where the development of specialized models for each task is being questioned since LLMs already perform a multitude of tasks in a decent way [5,6].
The primary motivation behind this research lies in understanding the extent to which LLMs can replace dedicated models for specific tasks.Our investigation delves into the critical domain of cybersecurity, with a specific focus on the detection of phishing URLs-an enduring and sophisticated online threat [7].In this realm, the conventional approach has primarily revolved around the use of rule-based systems [8,9], ML algorithms like decision trees, support vector machines [10,11], or neural networks trained on specific phishing characteristics [12][13][14].These traditional methods often require extensive feature engineering and are limited by the need for constant updates to keep pace with the evolving nature of phishing attacks.We aim to assess whether LLMs, with their broad training and adaptability, can provide a more efficient yet effective alternative in this critical domain.
Specifically, two novel approaches are adopted, the prompt engineering and finetuning of LLMs, to assess their efficacy in the context of detecting phishing URLs.Prompt engineering involves crafting specific input prompts to guide the LLM toward desired outputs without modifying the model itself [15], a new technique that emerged with the rise of LLMs and not previously applied in the phishing context.Fine-tuning, on the other hand, involves relying on a pretrained model and adjusting its parameters on a dataset specific to the task at hand [16], a method also novel in the phishing domain.This dualstrategy approach offers a new perspective in cybersecurity research, moving away from the traditional focus on predefined algorithms or feature-dependent models.It enables a comprehensive comparison between the prompt engineering and fine-tuning of LLMs for a specific application.
In this work, we examine various prompt-engineering strategies, such as zero-shot, role-playing, and chain-of-thought prompting, while applying them to two widely used chat models, GPT-3.5-turbo and Claude 2. Our findings reveal that Claude 2 performs better than GPT-3.5-turbofor all prompts and achieves an F1-score of 92.74% with a prompt that combines role-playing and chain-of-thought prompting on a 1000-sample test set sourced from the phishing dataset provided by Hannousse and Yahiouche [17].While this performance is acceptable given that no training has been conducted on the model, it is much less than what task-specific models with much fewer parameters have achieved in the literature [18].
As such, we delve into investigating how fine-tuning LLMs compares to prompt engineering and previous work.Notably, we fine-tuned foundational LLMs such as GPT-2, Bloom, Baby LLaMA, and DistilGPT-2, which have much less parameters than chat LLMs.These base LLMs were originally designed for text generation, and in this study, we finetune them to perform sequence classification to detect phishing URLs.When evaluated on the same 1000-sample test set used for the chat models, the minimal performance achieved for these models is an F1-score of 92.43% for Bloom, which is very similar to the highest performance attained in prompt engineering, and the peak performance achieved is an F1-score of 97.29% for GPT-2 (medium), which surpasses existing state-of-the-art techniques on this task.
To further assess the real-world applicability of these methods, we tested the best fine-tuned and prompt-engineered models on datasets with varying ratios of phishing URLs.Recognizing the importance of realistic testing conditions, we adjusted the phishing URL ratios in our test sets to reflect the varied prevalence of phishing URLs in actual internet traffic.These ratios ranged from as low as 5% to as high as 45%, thereby covering a broad spectrum of potential real-world scenarios.The results show that fine-tuned LLMs have more potential than those used with prompt engineering in real-world scenarios where the proportion of phishing URLs is lower than that of legitimate ones.
These findings underscore that models tailored for particular tasks often outperform general-purpose ones on these tasks, and the rise of LLMs does not negate the necessity for specialized models.Moreover, we show that fine-tuning LLMs to perform specific tasks presents a higher potential than prompt engineering and existing solutions in the literature.
In summary, the main contributions of this work are the following: • This research is the first to offer a unique comparative analysis between the performance of prompt engineering and fine-tuning techniques for LLMs.

•
Exploring prompt-engineering strategies for phishing URL detection and providing valuable insights into their effectiveness.

•
The investigation of the fine-tuning of text-generation LLMs for phishing URL detection, showcasing its potential in this domain.

•
This study achieves a remarkable performance of 97.29% as the F1-score for phishing URL detection, surpassing existing state-of-the-art techniques.
The rest of this paper is organized as follows: In Section 2, we provide essential background information on LLMs, prompt engineering, fine-tuning, and the challenges associated with phishing URL detection.Understanding these foundational concepts is crucial to grasp the context of our research.Section 3 presents some related work.In Section 4, we detail the methodology employed in our study, including the design and implementation of prompt-engineering strategies and the fine-tuning process.Section 5 offers a comprehensive overview of the experimental setup, experiments, and results.We provide insights into the effectiveness of each approach in Section 6 and compare their outcomes.Section 7 summarizes our key findings and contributions and discusses potential avenues for future research and improvements.

Background and Preliminaries
This section provides essential background information on key topics that form the foundation of our study.

Large Language Models (LLMs)
LLMs are a type of neural-network-based model designed to generate, understand, and interpret human language.They are characterized by their large number of parameters, deep architectures, and the extensive amount of data they are trained on.These models are adept at capturing the complexities and nuances of language, making them valuable for a wide range of applications from text generation to question-answering systems [19,20].The introduction of the transformer architecture by Vaswani et al. marked a significant milestone in the evolution of LLMs [21].Their work demonstrated the effectiveness of selfattention mechanisms, leading to substantial improvements in various natural languageprocessing tasks including language translation, sentiment analysis, and text generation.One of the key aspects of LLMs is their ability to learn contextual representations of words and phrases.Unlike earlier language models that relied on fixed word embeddings [22,23], LLMs use dynamic embeddings, allowing for a more nuanced understanding of language in different contexts.
The landscape of LLMs has witnessed the emergence of several landmark models.The GPT (Generative Pretrained Transformer) by OpenAI is one such example, which showcased the power of unsupervised learning for language understanding and generation [24].Next, GPT-2 and GPT-3 further pushed the boundaries in terms of size and capabilities, highlighting the scalability of transformer architectures [25,26].Building on the advancements of GPT-3, models like ChatGPT represent a significant evolution [27].Chat-GPT (https://chat.openai.com/(accessed on 2 January 2024)) developed by OpenAI, is a variant of the GPT-3 model specifically fine-tuned for conversational responses.This model exemplifies the transition from broad language understanding to specialized, context-aware conversational applications, marking a pivotal step in the practical deployment of LLMs.Nowadays, the trend is shifting to rely on such black box models to build systems and applications without the need to train or maintain ML models.

Prompt Engineering
Prompt engineering is a strategy employed to harness the capabilities of LLMs for specific tasks.By providing carefully constructed prompts or instructions, LLMs are guided to produce desired responses or perform targeted tasks.Prompt-engineering techniques come in various forms.Some examples include zero-shot prompts, few-shot prompts, role-playing prompts, and chain-of-thought prompts.Zero-shot prompting, for instance, is the standard way of prompting, and it involves framing a task in a way that the LLM can comprehend and generate an appropriate response without explicit training [28].Few-shot prompting involves giving some examples to the model in order to guide it on how to respond.Typically, it is used to control the output format by providing some examples to follow the structure of their responses and does not provide much help for reasoning [29].Role-playing prompts encourage the LLM to simulate a particular persona or role when generating responses, enhancing its ability to provide contextually relevant information [30].Chain-of-thought prompts ask the model to provide the reasoning step by step before reaching the end response.This helps the model make more informed decisions and allows it to understand the reason behind specific decisions [31].These strategies play a crucial role in our study, where we explore their effectiveness in the context of phishing URL detection.

Fine-Tuning LLMs
Fine-tuning is a critical process in adapting pretrained LLMs for specialized tasks.It involves training the LLMs on task-specific datasets to improve their performance on particular domains [32].Fine-tuning allows one to tailor the general language capabilities of LLMs to excel in specific applications, such as phishing URL detection.The process typically begins with a pretrained LLM, such as GPT, which has already learned a broad range of language patterns and semantics from large corpora of text data.Then, models are fine-tuned on a smaller dataset relevant to the specific task, effectively transferring the general language knowledge to the specialized domain [33].This approach helps LLMs become highly proficient in specific tasks while retaining their overall language understanding.In this study, since the goal is phishing URL detection, we fine-tune LLMs to perform URL classification where they receive a URL as input and predict a class as an output.The process is detailed in the methodology section.

Phishing Detection
Phishing attacks continue to be a prominent threat in the cybersecurity landscape [34].According to Cloudflare's 2023 phishing threats report [35], approximately 13 billion emails were processed between May 2022 and May 2023, out of which about 250 million malicious messages were blocked.The report highlights that deceptive links are the most common phishing tactic.Detecting phishing URLs poses a significant challenge due to the sophisticated and constantly evolving tactics of malicious actors.These URLs often closely mimic legitimate websites, making it difficult to distinguish them from genuine ones [36].Attackers frequently use misleading domain names, embed trusted brand names within URLs, or employ homoglyphs-characters that visually resemble each other-to create seemingly authentic URLs [37,38].The use of legitimate elements, such as valid TLS certificates [39] and brand logos [40], further complicates their detection.Additionally, the adoption of URL shortening services and redirection tactics helps attackers to conceal the true nature of malicious URLs [41,42].Attackers' frequent changes in tactics and URL obfuscation underscore the need for a robust understanding of URL structures and content analysis to discern the subtle differences between legitimate and phishing URLs.This study aims to leverage the power of LLMs to effectively identify phishing URLs.

Related Work
In the realm of machine-learning-based phishing detection, significant emphasis has been placed on URL-based methods.These methods primarily analyze various features and patterns inherent in URLs, such as the length, the inclusion of special characters, and the utilization of subdomains [10,43,44].Techniques like lexical and token analysis are instrumental in deciphering URL structures, aiding in the identification of phishing attempts.The application of ML algorithms, including SVM, decision tree, and Random Forest, has been a cornerstone in classifying URLs based on these extracted features [45,46].Advancements have led to the integration of deep learning techniques, including convolutional neural networks (CNNs) and natural language processing (NLP), further enhancing the automation of cybersecurity tasks [47,48].Pioneering work by researchers such as Le et al. [49] who proposed the URLNet framework and others like Tajaddodianfar et al. [50] and Jiang et al. [51] showcases the application of character-level deep neural networks in this field.Besides URL-based approaches to detect phishing, other approaches rely on additional information such as the content, appearance, or behavior of the page [52,53].While these techniques are promising, they might require more computation in order to classify a website as phishing or legitimate, yet they do not offer much more advantage in terms of performance compared to URL-based methods.The focus of this study is to predict phishing websites based solely on URL analysis.Specifically, we rely on the capabilities of LLMs to perform this classification, using prompt engineering and fine-tuning.In both cases, the advantage is that a raw URL can be inputted directly into the LLM without the need for any kind of feature extraction.Then, both techniques are compared with the performance achieved in state-of-the-art methods.

Methodology
In this section, we provide an overview of the methodology employed in our study, detailing the steps taken to investigate the effectiveness of LLMs in detecting phishing URLs by using prompt engineering and fine-tuning techniques.

Prompt Engineering
Prompt engineering refers to the process of carefully crafting prompts to elicit desired responses from an LLM such as ChatGPT, Google Bard, LLaMA2, etc.In this technique, the architecture of the LLM remains the same; only the input prompt is altered to observe its impact on the output.To investigate how prompt-engineering strategies affect the abilities of chat-completion LLMs in detecting phishing URLs, we use a subset of 1000 URLs for testing.Feeding all URLs simultaneously to the model is impractical as it would exceed the allowed context length.Therefore, we adopt the following process:

Fine-Tuning
The goal of fine-tuning an LLM is to tailor it more specifically for a particular task.In this study, we investigate the fine-tuning of pretrained text-generation LLMs for phishing URL detection.For all LLMs used, we follow a consistent fine-tuning process.This involves loading the LLM with pretrained weights for the embedding and transformer layers and adding a classification head on top, which categorizes a given URL as phishing or legitimate.This makes the LLM dedicated to performing URL classification.For the data to be processed by the LLM, it must be tokenized.For each LLM, we use its corresponding tokenizer, setting a maximum length of 100 tokens with right padding.Then, we train the complete architecture for several epochs on the training data while tuning some hyperparameters on the validation data.Finally, we evaluate the model by using the same 1000 testing samples as in the prompt-engineering method.The full architecture through which a URL is processed for classification is depicted in Figure 2. The specific models utilized for fine-tuning are detailed in the experiments section.In this study, we utilize the phishing dataset provided by Hannousse and Yahiouche [17], available for download at Mendeley Data (https://data.mendeley.com/datasets/c2gw7fy2j4/3 (accessed on 2 January 2024)), comprising a total of 11,430 URL samples.The dataset is balanced, containing an equal number of phishing and legitimate URLs, with each category represented by 5215 samples.Each URL in the dataset is accompanied by 87 extracted features and a classification label denoting whether it is legitimate or phishing.Details about the data collection and feature-extraction processes can be found in [54].
For the purpose of this study, we focus exclusively on analyzing the raw URLs by using LLMs while disregarding the extracted features.This approach enables us to evaluate the LLMs' capability to discern phishing URLs based solely on their textual characteristics.
To assess the performance of the prompt engineering and fine-tuning methods, we extract a subset of 1000 samples from the dataset.This subset maintains the dataset's balanced nature, comprising 500 phishing and 500 legitimate URLs.The remaining 10,430 samples are divided into training and validation sets, with 90% (9387 samples) used for training and 10% (1043 samples) for validation.This division is employed to fine-tune the LLMs effectively for the phishing URL detection task.For all divisions of the dataset, we ensure consistency and reproducibility by setting the random_state to 42 by using the scikit-learn Python library [55].

Models
For prompt engineering, we utilized two chat models, GPT-3.5-turbo and Claude 2. These models are accessible via API, making them suitable for developing AI applications, which is why we selected them.For fine-tuning, we chose LLMs that specialize in text generation, focusing on those with a minimal number of parameters.This approach allows us to compare their performance with that of chat models, which possess a significantly higher number of parameters.Specifically, we selected the following Hugging Face models [56]: openai-gpt, gpt2, gpt2-medium, distilgpt2, baby-llama-58m, and bloom-560m.For all models, we used the Adam optimizer, a batch size of 16, and a learning rate of 5 × 10 −5 .The number of training epochs was determined by early stopping, based on the validation data, to avoid overfitting.The fine-tuning parameters are presented in Table 1.

Evaluation Metrics
In both prompt engineering and fine-tuning, evaluating the performance of LLMs is crucial.Since the goal is to classify URLs as phishing or legitimate, we use the following classification metrics: • Accuracy: This is the most intuitive performance measure and is simply the ratio of correctly predicted observations to the total observations.It is particularly useful when the target classes are well-balanced.However, its utility is limited in scenarios with significant class imbalance, as it can yield misleading results.• Precision: Also known as the positive predictive value, precision is the ratio of correctly predicted positive observations to the total predicted positive observations.High precision, which indicates a low rate of false positives, is critical in phishing detection, where mistakenly labeling legitimate URLs as phishing can have serious consequences.

•
Recall: Also referred to as sensitivity, recall is the ratio of correctly predicted positive observations to all actual positives.This metric is essential in phishing detection as it is vital to identify as many phishing instances as possible to prevent data breaches.• F1-score: The F1-score is the harmonic mean of precision and recall.It is a more reliable measure than accuracy, particularly when dealing with unevenly distributed datasets.It considers both false positives and false negatives, making it suitable for scenarios where both precision and recall are important.
When comparing our results with state-of-the-art models, we employ additional metrics such as:

•
Area Under the Curve (AUC): this metric measures the ability of a classifier to distinguish between classes and is used as a summary of the Receiver Operating Characteristic (ROC) curve.

•
True Positive Rate at a Given False Positive Rate (TPR@FPR): This metric evaluates the model's ability to correctly identify positives at a specific false positive rate.It is particularly useful in scenarios where maintaining a low rate of false positives is crucial, which is the case in phishing detection.

Prompt Engineering
We use the flow outlined in the methodology section and we apply it to two chatcompletion models: GPT-3.5-turbo and Claude 2. We vary the input prompts to observe how this affects the results.Notably, we investigate three prompt templates for phishing URL detection, as shown in Figure 3: • Prompt 1 (zero-shot): We start with a baseline prompt that simply asks the LLM to classify the given URLs while generating the response according to a specific output format.The accuracy, precision, recall, and F1-score of both LLMs for this prompt on the test set are reported in Figure 4.
• Prompt 2 (role-playing): We modify the baseline prompt to ask the LLM to assume the role of a cybersecurity specialist analyzing URLs for a company.This approach is intended to help the model adopt a specific mindset while responding, which is expected to enhance its responses.We apply this prompt to both LLMs, and the results are shown in Figure 5. • Prompt 3 (chain-of-thought): We further modify the second prompt (role-playing) by asking the model to provide succinct reasoning for classifying a given URL as phishing or legitimate.This approach encourages the LLM to classify based on specific criteria that it articulates, which is expected to improve performance.The results of this prompt for both LLMs are illustrated in Figure 6.
During this process, a notable observation was made with GPT-3.5-turbo.The model occasionally refrained from classifying certain URLs due to internal content filters being triggered by specific words.This observation was very minor, affecting only one URL for prompts 1 and 2 and no URLs for prompt 3. Consequently, it did not impact the overall results of the study.

Fine-Tuning
For fine-tuning, we consider several base models that are pretrained for text generation.We fine-tune each of these models following the steps detailed in the methodology section.The LLMs we fine-tuned, available on Hugging Face, are as follows: •  [57].It is designed for efficiency and can perform various language tasks, optimized for environments with limited computational resources.
• bloom-560m: Part of the Bloom series [58], this large-scale multilingual language model is designed to understand and generate text in multiple languages.With 560 million parameters, it offers substantial capability in language-processing tasks.
We assess the fine-tuned LLMs by using the same test set that was utilized for assessing prompt-engineering techniques.The results are presented in Table 2, organized in ascending order based on their F1-scores.

Discussion
Our investigation into the effectiveness of prompt engineering and fine-tuning strategies for LLMs in phishing URL detection has provided new insights.In this section, we discuss the results achieved with each approach.Subsequently, we compare these approaches both against each other and with state-of-the-art methodologies.Finally, we delve into the performance of these models under imbalanced test conditions, which simulate real-world scenarios where phishing URLs are less prevalent than legitimate ones.

Prompt Engineering
In our assessment of prompt-engineering capabilities using both LLMs, we observed an increase in performance as the prompts were refined.Notably, Claude 2 outperformed GPT-3.5-turbo.To compare the effectiveness of the various prompt types, we report the F1-scores achieved for each when applied to both LLMs, and we present these results in Figure 7 for easier comparison.Specifically, the zero-shot prompt yielded F1-scores of 77.87% for GPT-3.5-turbo and 90.75% for Claude 2. The role-playing prompt, which asks the LLM to act as a cybersecurity consultant, significantly improved the performance of both models, resulting in F1-scores of 82.95% for GPT-3.5-turbo and 92.26% for Claude 2. Lastly, the chain-of-thought prompt, which requires the LLM to provide reasoning behind its classification, further enhanced the performance, achieving F1-scores of 88.54% for GPT-3.5-turbo and 92.74% for Claude 2.
We noticed that Claude 2 consistently outperformed GPT-3.5-turboacross all prompt types.However, the reason for this is not entirely clear, as both models offer limited information about their training processes and are generally treated as 'black boxes' by users.Interestingly, when the prompts were improved, GPT-3.5-turboshowed more significant gains than Claude 2, likely because Claude's baseline performance was already high (an F1-score of 90.75% for the zero-shot prompt).Therefore, its room for improvement was more limited compared to GPT-3.5-turbo, which started with a baseline F1-score of 77.87%.
On the other hand, the results achieved with prompt engineering are remarkable, considering that no specific training was conducted to enable the LLMs to distinguish between phishing and legitimate URLs.The effectiveness of a simple zero-shot prompt in detecting phishing demonstrates the inherent capabilities of such models.Moreover, throughout all prompt-engineering techniques, we observed a trend where precision was consistently higher than recall.This likely indicates that the LLMs, when prompted, were more inclined to accurately identify true positive cases (legitimate URLs correctly identified as legitimate) but were somewhat less effective in correctly identifying all phishing instances, leading to a higher rate of false negatives.This pattern suggests that while LLMs were efficient in minimizing false positives, this was at the expense of potentially missing some phishing cases.

Fine-Tuning
When fine-tuning, we observe that LLMs achieve a very high performance with minimal training, such as after only a few epochs.It is noteworthy that the GPT models outperform Bloom, despite the latter having more parameters.This discrepancy could be attributed to the different training settings used for each model.The Baby LLaMA model achieved a performance comparable to the GPT family, which can be explained by its design, as it is distilled from both LLaMA models and GPT.The lowest performance was achieved by bloom-560m with an F1-score of 92.43%.The highest performance was recorded by gpt2-medium, with an F1-score of 97.29%, surpassing the state-of-the-art model, a point we discuss further later in this section.

Comparing Prompt Engineering and Fine-Tuning
In this part, we compare prompt engineering and fine-tuning LLMs from multiple aspects.

•
Performance differences: In our analysis, we found significant performance differences between prompt-engineered and fine-tuned LLMs.Fine-tuning generally results in better performance.For example, the least performing fine-tuned model, bloom-560m, achieved an F1-score of 92.43%, which is comparable to the highest performance in prompt engineering with the chain-of-thought prompt on Claude 2 (92.74%).Notably, GPT-2-medium, with fewer parameters, significantly outperforms prompt-engineered models, achieving an F1-score of 97.29%.This is particularly striking when considering that chat-completion models like GPT-3.5-turbo and Claude 2 have substantially more parameters.This disparity highlights the potential of fine-tuning in enhancing the specialization and effectiveness of smaller models for specific tasks like phishing URL detection.• Data privacy and security: When using prompt engineering, interacting with LLMs via their APIs, as commonly performed in AI development, involves data transmission to third-party servers.This raises data privacy and security concerns.In contrast, fine-tuning as outlined in this study generally involves downloading the model for local adjustments, which enhances data security and minimizes risks of data leakage.
• Resource requirements: The resource demands of the two approaches differ significantly.Prompt engineering is generally less resource intensive, requiring minimal adjustments to apply various prompts.This makes it more accessible and practical, particularly in resource-limited settings.On the other hand, fine-tuning demands more substantial resources, including a significant amount of domain-specific training data and computational power, which can be a limiting factor in its scalability and practicality.

•
Model maintenance: The maintenance approaches for prompt-engineered and finetuned models differ considerably.For prompt-engineered models like GPT-3.5-turbo and Claude 2, updates and maintenance are typically handled by the companies that developed them, such as OpenAI for GPT-3.5-turbo and Anthropic AI for Claude 2. This means that improvements and adaptations to evolving data or tasks are managed centrally, relieving users from the burden of continual model updates.However, this also means that users are dependent on the companies for timely updates.In contrast, fine-tuned models require the users to actively manage and update the models.This might involve retraining the models as new data become available or as the nature of tasks, such as phishing URL detection, evolves.While this allows for more control and customization, it also adds to the resource intensity and demands ongoing attention from the users.
To conclude, our analysis of prompt engineering and fine-tuning in LLMs has illuminated distinct strengths and limitations for each method.Prompt engineering, while offering flexibility and lower resource requirements, tends to fall short in performance compared to fine-tuning.This is particularly evident when considering the superior outcomes achieved by fine-tuning, even with models having fewer parameters than large chat-completion models like GPT-3.5-turbo and Claude 2. Fine-tuning, despite its higher demands for resources and active maintenance, provides a notable increase in specialization and effectiveness, especially in specific tasks such as phishing URL detection.Additionally, fine-tuning affords enhanced data security through local processing as opposed to the potential privacy concerns associated with using third-party servers in prompt engineering.The choice between these approaches should be made based on the specific requirements of the task at hand, weighing factors such as performance, data security, resource availability, and the need for ongoing model maintenance and adaptability.

Comparison with State-of-the-Art Approaches
In this section, we compare the best results achieved by both methods with state-ofthe-art approaches.Initially, we contrast these results with studies that used the same dataset to ensure a fair comparison, as shown in Table 3.Although prompt engineering theoretically performs well, it did not achieve the performance level of state-of-the-art methods, indicating that prompt-engineering techniques prioritize convenience over performance.However, it is evident that the fine-tuned GPT-2 model surpasses state-of-the-art techniques in all metrics, demonstrating the significant potential of fine-tuning LLMs for specific tasks.Additionally, in Table 4, we compare the results achieved by the fine-tuned GPT-2 with two state-of-the-art models that were not trained on the same dataset.The purpose of this comparison is to provide an approximate indication of how the fine-tuned GPT-2 performs relative to these models.Specifically, we compare it with PhishBERT, the only model that uses LLMs for phishing detection, and URLNet, a state-of-the-art model that processes raw URLs in a manner similar to LLMs.This contrasts with other techniques that necessitate feature extraction.

Testing on Imbalanced Datasets
Evaluating models on imbalanced test sets is essential in phishing detection to mirror real-world conditions, where phishing URLs are typically less frequent than legitimate ones.Balanced datasets may not accurately represent these real-life scenarios, often leading to an overestimation of the model performance [64].To address this, we selected the bestperforming models from our initial experiments: Claude 2, used with a chain-of-thought prompt, and the fine-tuned gpt2-medium.These models were then assessed on imbalanced datasets to reflect the varying prevalence of phishing URLs in real-world settings.Given the uncertainty in the actual distribution of phishing versus legitimate URLs in real-world settings, we altered the phishing ratios in our test sets to include 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40% and 45%, all derived from our original balanced test set.This method allowed for an evaluation of the models' effectiveness across different scenarios, particularly in environments where phishing is less prevalent than legitimate content.
Across the varying phishing ratios, we observed significant differences in the effectiveness of the models.Claude 2 showed increased performance at higher phishing ratios, indicating better detection in datasets with a larger presence of phishing URLs.However, in scenarios more representative of real-world conditions, where phishing URLs are less common, the fine-tuned GPT-2 consistently outperformed Claude 2 across all metrics.This was particularly noticeable at lower phishing ratios.For example, at 5% and 10% phishing ratios, GPT-2 achieved F1-scores of 76.19% and 86.73%, respectively, significantly surpassing Claude 2's F1-scores of 59.15% and 74.58%, respectively.This trend of superior performance by GPT-2-medium persisted across all evaluated ratios, culminating in an F1-score of 96.02% at a 45% phishing ratio, demonstrating its enhanced effectiveness in realistic settings with a lower incidence of phishing.The results, including precision, recall, and F1-scores for both models at the varying phishing URL ratios, are presented in Figures 8-10.Notably, these findings suggest that fine-tuned LLMs are more adept than their prompt-engineered counterparts in handling real-world scenarios.

Conclusions
In this study, we explored the effectiveness of LLMs in detecting phishing URLs, focusing on prompt engineering and fine-tuning strategies.Our investigation encompassed a variety of prompt-engineering mechanisms, as well as multiple LLMs for fine-tuning.We found that although prompt engineering facilitates the construction of AI systems without the need for training or monitoring ML models, it does not match the superior performance of the fine-tuned LLMs.Notably, the fine-tuned LLMs achieved an F1-score of 97.29%, surpassing current state-of-the-art benchmarks.Moreover, in realistic scenarios where the ratio of phishing to legitimate URLs is low, the fine-tuned LLMs significantly outperformed those used with prompt engineering.This work presents a detailed comparison between prompt engineering and the fine-tuning of LLMs, offering comprehensive insights into the appropriate scenarios for employing each technique.
For future research, we suggest exploring hybrid approaches that combine the convenience of prompt engineering with the high performance of fine-tuning in phishing URL detection.It is also crucial to address the resilience of LLM-based detection methods against adversarial attacks, necessitating the development of robust defense mechanisms.Additionally, optimizing real-time detection systems, mitigating biases in LLMs, and incorporating multimodal cues for enhanced detection accuracy are key areas that warrant further investigation and research.These efforts will contribute to more effective and reliable phishing-detection tools in the rapidly evolving landscape of cybersecurity.

Figure 7 .
Figure 7. Comparing the performance of the three used prompts on GPT-3.5-turbo and Claude 2.

Table 1 .
Number of training epochs for each LLM.

Table 2 .
Performance metrics of the fine-tuned LLMs.

Table 3 .
Comparing our results with previous studies using our same dataset.

Table 4 .
Comparing our results with previous studies using different datasets.
Precision of GPT-2-medium and Claude 2 with varying phishing URL ratios.Recall of GPT-2-medium and Claude 2 with varying phishing URL ratios.Figure 10.F1-score of GPT-2-medium and Claude 2 with varying phishing URL ratios.