1. Introduction
The Dark Web represents a concealed sector of the internet characterized by high anonymity, volatility, and the proliferation of illicit activities. Accessible primarily via anonymity networks such as Tor, Dark Web platforms host a range of illegal services and content, from drug markets and arms sales to hacking tools and forged documents [
1,
2]. For cybersecurity professionals and intelligence analysts, identifying and categorizing this content is essential; however, the task is inherently challenging due to the linguistic diversity, thematic ambiguity, and the constantly evolving nature of these online environments.
To support the development of automated classifiers, several labeled datasets have been introduced, including DUTA, DUTA-10K, and, most notably, CoDA (Comprehensive Darkweb Annotations), a corpus of 10,000 manually annotated Dark Web documents across ten illicit categories [
3]. Early classification efforts relied on supervised machine learning models and, more recently, on fine-tuned transformer-based architectures such as BERT and DarkBERT, which have demonstrated strong performance on CoDA and similar corpora [
4,
5]. Nonetheless, these approaches face practical limitations: they require annotated data, extensive retraining, and domain-specific knowledge, making them costly to deploy and update in dynamic forensic settings [
6].
In contrast, zero-shot classification using Large Language Models (LLMs) offers a scalable alternative by allowing categorization based solely on natural language prompts. Recent studies have shown that models such as GPT-3.5 and GPT-4 can classify text with competitive performance in zero-shot scenarios without fine-tuning, particularly when guided by well-structured instructions [
7,
8,
9]. In the context of the Dark Web, Prado-Sánchez et al. [
10] demonstrated the viability of zero-shot prompting for classifying CoDA documents, although with limitations in handling ambiguous or semantically overlapping categories like Others or Electronic.
Other recent works have also emphasized the importance of model alignment with human judgment. For instance, Prado-Sánchez et al. [
6] measured inter-annotator agreement between human experts and ChatGPT (OpenAI,
https://chat.openai.com, accessed on 18 October 2025), revealing that in some domains, LLMs can match or exceed human consistency. Similarly, in the domain of name-based gender detection, ChatGPT demonstrated higher classification coverage and stability compared to specialized APIs like Namsor or Gender-API, especially in culturally ambiguous cases [
11]. These findings support the broader hypothesis that LLMs can serve not only as efficient classifiers, but also as reliable surrogates for auditing and validating human annotations.
While these studies have explored the utility of individual models, they lack a systematic comparison across multiple commercial LLMs. With the growing number of powerful proprietary models, such as Claude 3.5 Haiku, DeepSeek Reasoner, Gemini 2.0 Flash, Grok, and the latest versions of GPT, there is a pressing need to benchmark them under standardized settings, not only in terms of classification accuracy, but also in their alignment with human decisions and mutual consistency across outputs.
This study addresses that gap by conducting a comparative experimental analysis of eight state-of-the-art commercial LLMs (GPT-3.5 Turbo, GPT-4o, GPT-4o Mini, Claude 3.5 Haiku, DeepSeek Chat, DeepSeek Reasoner, Gemini 2.0 Flash, and Grok) on the task of zero-shot classification of Dark Web content using the CoDA dataset. We seek to answer the following research questions:
RQ1: How do state-of-the-art commercial LLMs compare in classifying Dark Web content under zero-shot prompting?
RQ2: To what extent do large language models (LLMs) classify Dark Web content in a manner consistent with human judgment?
RQ3: How similarly do different commercial LLMs classify Dark Web content?
By addressing these questions, this work contributes to the growing body of research on the use of LLMs in high-risk, real-world applications. It sheds light on the comparative performance, reliability, and alignment of commercial LLMs in one of the most sensitive domains for natural language classification.
The remainder of this paper is organized as follows.
Section 2 reviews related work on illicit Dark Web content classification and recent evaluations of LLM performance and reliability.
Section 3 describes the methodology employed to assess the classification capabilities of eight commercial LLMs, their alignment with human judgment, and their inter-model consistency.
Section 4 presents the experimental results obtained.
Section 5 discusses the findings in relation to the study objectives. Finally,
Section 6 concludes the paper and suggests directions for future research.
2. Related Work
The task of classifying illicit content on the Dark Web has been the subject of increasing academic attention due to its implications for cybersecurity, law enforcement, and intelligence gathering. Early approaches relied on supervised learning algorithms such as support vector machines (SVMs), logistic regression, and decision trees, trained on manually annotated corpora like DUTA and DUTA-10K [
1,
12]. While these datasets provided a foundation for benchmarking, they were limited in scope and domain coverage. The introduction of the CoDA dataset [
3], which includes 10,000 documents manually labeled across ten illicit categories, has since become the standard benchmark for text-based classification of Dark Web content.
The emergence of transformer-based deep learning models, including BERT and its domain-specific variant DarkBERT [
4], has significantly improved classification performance on CoDA. Some models have surpassed 90% F1-scores in certain categories [
5]. Nonetheless, these models require large amounts of labeled data and domain-specific retraining, limiting their rapid applicability in operational scenarios [
13]. Recent studies, such as Giannilias et al. [
14], have explored zero-shot, few-shot, and fine-tuned LLMs, such as Mistral, Gemma, and LLaMA, for classifying hacker-related content on the Dark Web, reporting viable results even in resource-constrained environments. Complementary work by Chen et al. [
15] proposed a legal document retrieval-based approach to classify illicit content using LLMs without needing fine-tuned classifiers, highlighting the potential of prompt engineering for Dark Web tasks.
In contrast, zero-shot classification with LLMs allows models to be applied directly to classification tasks without additional task-specific training. Recent studies have shown that models such as GPT-3.5 and GPT-4 can perform competitively in zero-shot settings across a range of domains, including financial document classification by Loukas et al. [
8], hate speech detection by Bauer et al. [
16], and disinformation analysis [
7,
9,
17]. Within the Dark Web context, Prado-Sánchez et al. [
10] applied zero-shot prompting with GPT models to the CoDA dataset, demonstrating promising performance while noting that categories such as Electronic and Others remained difficult to disambiguate.
More recent work has shifted focus toward the reliability and alignment of LLM outputs with human judgment. In Prado-Sánchez et al. [
6], the authors measured inter-coder agreement between human annotators and ChatGPT using Krippendorff’s alpha and Cohen’s kappa, finding comparable reliability in several categories. Further, Prado-Sánchez et al. [
6] evaluated the explanatory power of LLM-generated rationales, concluding that while these may enhance perceived interpretability, they do not consistently improve annotation accuracy. Parallel findings have emerged in the domain of gender detection: Domínguez-Díaz et al. [
11] showed that ChatGPT outperforms specialized APIs like Namsor and Gender-API in both coverage and consistency, especially when resolving ambiguous cultural name contexts.
Beyond these domain-specific evaluations, recent studies have begun to examine intercoder reliability between LLMs and humans across annotation tasks. Dunivin [
18] found that GPT-4 achieved intercoder agreement with human raters comparable to expert-level reliability (Cohen’s κ ≥ 0.79) in qualitative coding tasks, whereas GPT-3.5 exhibited lower consistency. Similarly, Gilardi et al. [
19] reported that ChatGPT produced annotations more consistent with expert ground truth than those from crowd workers, highlighting the feasibility of LLMs as stand-alone coders in certain social science contexts [
20].
Despite these promising results, most studies are limited to single-model evaluations, usually involving GPT-3.5 or GPT-4. A comprehensive benchmark that compares the most advanced commercial LLMs including Claude, DeepSeek, Gemini, Grok, and multiple GPT variants has yet to be conducted. The only exception is a recent survey-style review by Chen et al. [
15], which outlines potential applications of LLMs in the Dark Web but does not present empirical benchmarking. As a result, our understanding of model performance diversity, robustness, and agreement in high-stakes classification tasks remains limited.
Our work addresses this need by conducting a zero-shot comparative analysis of eight commercial LLMs on the CoDA dataset. In addition to evaluating classification performance, we assess alignment with human annotations and inter-model consistency, offering a broader view of the capabilities and limitations of current commercial LLMs in the context of Dark Web content analysis.
3. Methodology
This section describes the experimental design developed to evaluate the classification capabilities, human alignment, and consistency of eight commercial LLMs in a zero-shot setting on illicit Dark Web content. The methodology is structured to provide a transparent, reproducible, and fair comparison across models and dimensions. It covers the dataset used, the selection and configuration of LLMs, the prompt formulation protocol, the evaluation metrics applied, and the technical details of the implementation.
3.1. Dataset
The foundation of our experiments is the CoDA (Comprehensive Darkweb Annotations) dataset, a manually labeled corpus introduced by Jin et al. [
3], specifically designed for the classification of content extracted from Dark Web domains. It comprises 10,000 documents, each assigned to one of ten pre-defined illicit categories: Others, Drugs, Financial, Gambling, Cryptocurrency, Hacking, Arms, Violence, Porn, and Electronics.
Table 1 presents the language distribution of the 10,000 documents in the CoDA dataset. English is the overwhelmingly dominant language, accounting for 88.5% of the corpus (8855 documents), followed by Russian (542), German (129), and French (100). A wide range of other languages is present with much lower frequency, including Spanish (61), Portuguese (54), Chinese (38), and more than 40 additional languages with fewer than 15 documents each. This substantial linguistic skew toward English has direct implications for both model evaluation and generalization, particularly when assessing performance in low-resource languages. The multilingual nature of the dataset, despite its imbalance, reflects the diversity of communication within Dark Web forums and marketplaces.
Each document in CoDA is a textual excerpt crawled from real .onion websites on the Tor network. These texts are typically brief, unstructured, and exhibit substantial lexical noise, including spelling errors, use of slang, code-switching between languages, and intentional obfuscation, a realistic reflection of adversarial environments on the Dark Web. The diversity and complexity of these documents make CoDA a highly appropriate benchmark for testing the robustness of language models under zero-shot constraints.
To ensure a more robust and representative evaluation of the model’s zero-shot classification capabilities, we performed a preprocessing step to remove duplicate and near-duplicate documents from the dataset. This step was taken to prevent the artificial inflation of performance metrics that can occur when highly similar instances are repeatedly presented to the model. By reducing redundancy in the input data, we aimed to minimize potential biases and ensure that the evaluation better reflects the model’s ability to generalize across a diverse range of document types and contents.
Duplicate detection was performed using cosine similarity over TF-IDF vector representations of the documents. Any pair of documents with a similarity score above 0.95 was considered near-duplicate, and one of them was removed from the dataset. This threshold was selected after manually reviewing borderline cases to ensure that documents flagged as near-duplicates were practically identical. Typical removed entries included replicated pages differing only in minor metadata, such as site titles or contact information. This approach follows established practices in information retrieval, where TF-IDF combined with cosine similarity is widely applied for redundancy detection [
21].
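As an illustration of this step, the sketch below implements a TF-IDF and cosine-similarity filter with scikit-learn; the function name, the greedy pair handling, and the toy corpus are simplifications of the preprocessing described above rather than the exact script used in our pipeline.

```python
# Simplified near-duplicate filter: TF-IDF vectors + cosine similarity with a
# 0.95 threshold, as described in the text. Names and the greedy strategy are
# illustrative, not the exact preprocessing script.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate(documents, threshold=0.95):
    """Return the indices of documents kept after near-duplicate removal."""
    tfidf = TfidfVectorizer().fit_transform(documents)   # sparse TF-IDF matrix
    sims = cosine_similarity(tfidf)                      # pairwise cosine similarities
    kept, removed = [], set()
    for i in range(len(documents)):
        if i in removed:
            continue
        kept.append(i)
        for j in range(i + 1, len(documents)):
            if sims[i, j] > threshold:                   # near-duplicate of a kept document
                removed.add(j)
    return kept

# Toy example: the second document is a near-duplicate of the first.
docs = ["buy pills here cheap", "buy pills here cheap!", "weapons for sale"]
print(deduplicate(docs))  # -> [0, 2]
```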
After deduplication, the dataset was reduced from 10,000 to 7970 documents. The updated distribution of documents per category, along with the average character length before and after deduplication, is summarized in
Table 2. Notably, all categories experienced a reduction in document count, with Porn and Gambling maintaining the highest average character lengths both before and after deduplication (25,597 to 15,749 and 8339 to 7246 characters, respectively). Overall, the mean text length across all categories decreased from 69,032 to 52,202 characters, reflecting the removal of repetitive and overly long entries.
Although CoDA contains only anonymized text, its illicit nature necessitates careful ethical handling. We ensured the absence of personally identifiable information and adhered to responsible research standards. Nevertheless, potential harm from misclassification, particularly in sensitive categories such as Violence or Porn, must be acknowledged. Moreover, since the original dataset lacks inter-annotator agreement data, assessing labeling bias remains limited.
3.2. Language Models Evaluated
We selected eight commercial LLMs from different providers based on their high performance, public availability via APIs, and relevance in state-of-the-art natural language understanding:
GPT-3.5 Turbo (OpenAI);
GPT-4o (OpenAI);
GPT-4o Mini (OpenAI);
Claude 3.5 Haiku (Anthropic);
Gemini 2.0 Flash (Google DeepMind);
DeepSeek Chat (DeepSeek);
DeepSeek Reasoner (DeepSeek);
Grok (xAI).
Each model was accessed using its official API or web interface between March and April 2025. The models were queried using deterministic parameters (e.g., temperature = 0 when available) to reduce sampling variance. All models were evaluated in their default zero-shot configuration, without fine-tuning or exposure to CoDA or similar datasets.
This selection reflects a broad spectrum of model architectures, training philosophies, and deployment ecosystems. The inclusion of models from multiple vendors also enables a meaningful comparison of their respective generalization abilities on real-world illicit content.
To provide additional context for model selection and performance interpretation,
Table 3 summarizes key characteristics of the evaluated commercial LLMs. These include estimated number of parameters, architecture type, training data sources, API provider, average cost per 1000 input/output tokens, and inference latency. The values reported are based on publicly available documentation from providers or best estimates when official figures are not disclosed.
3.3. Prompting Protocol and Base Template
All models were prompted using a standardized zero-shot instruction, designed to maximize semantic clarity and category disambiguation without providing any in-context examples. The prompt included the following:
A direct instruction to classify a given text.
A structured list of categories, each with a detailed but concise natural-language definition.
The document content, clearly enclosed between triple quotes.
A constraint to return only the classification, formatted in JSON with a predefined key.
This approach ensures that all models are evaluated under equal conditions and with the same informational context. The exact prompt used is shown in
Table 4.
This format was chosen based on the prior work of Prado-Sánchez et al. [
6] and manually refined through early testing to ensure proper category recognition by all LLMs. Each model received the same prompt structure, and outputs were post-processed only for formatting standardization (e.g., JSON parsing, lowercasing, trimming whitespace).
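For concreteness, the snippet below sketches one way of materializing this prompt structure and the subsequent output normalization; the category definitions and the JSON key name are placeholders, since the exact wording used in the experiments is the one reported in Table 4.

```python
# Illustrative prompt builder and response normalizer. The category definitions
# and the "category" JSON key are placeholders; the actual prompt wording is the
# one given in Table 4.
import json

CATEGORIES = {
    "Drugs": "sale or discussion of narcotics and controlled substances",
    "Gambling": "online casinos, betting, and lotteries",
    # ... remaining CoDA categories, each with a concise definition ...
    "Others": "content not covered by any other category",
}

def build_prompt(document: str) -> str:
    category_block = "\n".join(f"- {name}: {desc}" for name, desc in CATEGORIES.items())
    return (
        "Classify the following text into exactly one of the categories below.\n"
        f"{category_block}\n\n"
        f'Text: """{document}"""\n\n'
        'Return only the classification as JSON: {"category": "<label>"}'
    )

def normalize_response(raw_output: str) -> str:
    """Parse the JSON reply and standardize the label (lowercase, trimmed)."""
    return json.loads(raw_output)["category"].strip().lower()
```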
All models were queried using deterministic parameters (temperature = 0) to ensure stable outputs and reduce variance. However, we acknowledge that commercial APIs evolve over time: modifications to model weights, backend versions, or prompt handling may impact reproducibility. As such, the results presented reflect a snapshot of model behavior between March and April 2025. We encourage future replications to re-evaluate performance under current API conditions.
3.4. Evaluation Metrics
To comprehensively assess the performance of the evaluated models, we designed a multi-faceted evaluation protocol that addresses three main dimensions: classification accuracy, alignment with human judgment, and inter-model agreement.
All reported performance metrics are accompanied by 95% confidence intervals (CIs), estimated via nonparametric bootstrap resampling, to ensure statistical robustness and enable reliable comparisons across models. To address the issue of multiple pairwise comparisons, the Benjamini–Hochberg (BH) procedure was applied to control the false discovery rate (FDR) at a significance level of
α = 0.05. The BH correction was performed on the raw
p-values derived from global F1-score comparisons using the multipletests function from the Python statsmodels library (version 0.14.2; available at:
https://www.statsmodels.org, accessed on 18 October 2025). The adjusted p-values and corresponding rejection decisions were subsequently used to assess the statistical significance of differences among models [
22].
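As a minimal sketch of this procedure (the number of bootstrap resamples and the p-values shown are illustrative choices, not reported settings), the confidence intervals and the BH correction can be obtained with scikit-learn and statsmodels as follows.

```python
# Percentile bootstrap CI for the F1-score and Benjamini-Hochberg correction of
# raw p-values from pairwise model comparisons. Resample count and placeholder
# p-values are illustrative.
import numpy as np
from sklearn.metrics import f1_score
from statsmodels.stats.multitest import multipletests

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, seed=0):
    """95% percentile bootstrap confidence interval for the macro F1-score."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample documents with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    return float(np.percentile(scores, 2.5)), float(np.percentile(scores, 97.5))

# FDR control over the raw p-values of the pairwise F1 comparisons.
raw_p = [0.0004, 0.031, 0.27, 0.048]                      # placeholder values
reject, p_adjusted, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
```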
3.4.1. Classification Accuracy (RQ1)
To evaluate classification performance, we computed macro-averaged precision, recall, and F1-score across all ten CoDA categories for each model. The use of macro-averaged metrics was particularly important, as it ensures that each category contributes equally to the overall score, regardless of its frequency in the dataset. This approach mitigates biases that could arise due to the significant class imbalance inherent in CoDA, where categories such as Drugs are substantially more frequent than others like Violence or Gambling.
In addition to global performance indicators, we analyzed per-class metrics (precision, recall, and F1-score) to identify specific categories where models exhibited higher accuracy or, conversely, where they faced greater challenges. Moreover, confusion matrices were generated for each model, allowing us to uncover common misclassification patterns and better understand overlaps between semantically adjacent categories, such as between Hacking and Financial.
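These quantities can be computed directly with scikit-learn, as in the brief sketch below; the toy label lists are purely illustrative, whereas the actual evaluation covers the full deduplicated CoDA corpus.

```python
# Macro-averaged metrics, per-class breakdown, and confusion matrix for one
# model's predictions against the CoDA gold labels (toy data shown).
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_fscore_support)

gold = ["Drugs", "Gambling", "Others", "Drugs", "Hacking"]
pred = ["Drugs", "Gambling", "Drugs", "Drugs", "Hacking"]
labels = sorted(set(gold))                                # the ten CoDA categories in practice

# Macro averaging weights every category equally, regardless of frequency.
prec, rec, f1, _ = precision_recall_fscore_support(
    gold, pred, labels=labels, average="macro", zero_division=0)

print(classification_report(gold, pred, labels=labels, zero_division=0))
cm = confusion_matrix(gold, pred, labels=labels)          # rows: gold, columns: predicted
```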
3.4.2. Agreement with Human Judgment (RQ2)
To assess the extent to which model classifications align with human annotations, we employed two complementary agreement metrics: Cohen’s Kappa (κ) and Krippendorff’s Alpha (α). Cohen’s Kappa was calculated to measure the degree of pairwise agreement between model predictions and the ground-truth human labels provided in CoDA, considering the probability of chance agreement. Krippendorff’s Alpha, being a more general and robust reliability measure [
23], was additionally computed to accommodate the multi-categorical nature of the classification task and to better handle potential missing or ambiguous data points.
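The sketch below shows how both coefficients can be computed with scikit-learn and the krippendorff Python package; it is a simplified illustration with toy labels rather than the full evaluation code provided in Appendix A.

```python
# Agreement between one model's predictions and the CoDA gold labels:
# Cohen's kappa (scikit-learn) and Krippendorff's alpha (krippendorff package).
import numpy as np
import krippendorff
from sklearn.metrics import cohen_kappa_score

gold = ["Drugs", "Gambling", "Others", "Drugs", "Hacking"]
pred = ["Drugs", "Gambling", "Drugs", "Drugs", "Hacking"]

kappa = cohen_kappa_score(gold, pred)                     # chance-corrected pairwise agreement

# Alpha treats the human labels and the model as two raters; labels are encoded
# as integers for the nominal measurement level.
label_ids = {c: i for i, c in enumerate(sorted(set(gold) | set(pred)))}
reliability = np.array([[label_ids[x] for x in gold],
                        [label_ids[x] for x in pred]])
alpha = krippendorff.alpha(reliability_data=reliability, level_of_measurement="nominal")
```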
3.4.3. Inter-Model Agreement (RQ3)
To evaluate consistency across models, we first computed pairwise agreement percentages between each pair of LLMs. This metric captures the proportion of documents for which two different models produced the same classification output, offering insight into model similarity under identical zero-shot prompting conditions. In addition, to obtain a global measure of consensus across all models, we calculated Krippendorff’s Alpha treating each model as an independent rater. This metric allowed us to quantify the overall degree of convergence in classification decisions across the evaluated LLMs, highlighting areas of strong agreement as well as domains where model predictions diverged significantly, especially in ambiguous or borderline cases.
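A compact illustration of both measures is given below, where predictions is a hypothetical mapping from model names to their label sequences over the same documents; the structure and values are toy examples only.

```python
# Pairwise raw agreement and Cohen's kappa between models, plus Krippendorff's
# alpha with every model treated as an independent rater (toy data).
from itertools import combinations
import numpy as np
import krippendorff
from sklearn.metrics import cohen_kappa_score

predictions = {
    "model_a": ["Drugs", "Others", "Gambling", "Drugs"],
    "model_b": ["Drugs", "Others", "Drugs", "Drugs"],
    "model_c": ["Drugs", "Others", "Gambling", "Hacking"],
}

for m1, m2 in combinations(predictions, 2):
    a, b = predictions[m1], predictions[m2]
    raw = np.mean([x == y for x, y in zip(a, b)])         # proportion of identical labels
    print(f"{m1} vs {m2}: agreement={raw:.2f}, kappa={cohen_kappa_score(a, b):.2f}")

# Global consensus across the whole ensemble of models.
label_ids = {c: i for i, c in enumerate(sorted({l for seq in predictions.values() for l in seq}))}
matrix = np.array([[label_ids[l] for l in seq] for seq in predictions.values()])
print(krippendorff.alpha(reliability_data=matrix, level_of_measurement="nominal"))
```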
3.5. Experimental Setup
All evaluations were conducted using fully automated Python scripts (
Appendix A), designed to ensure the complete reproducibility and traceability of the experimental pipeline. For each document in the CoDA dataset, the script dynamically constructed the full prompt by inserting the text content into the standardized base template described previously. Once the prompt was prepared, the system queried each target model through its respective API, using deterministic parameters such as a temperature setting of zero when configurable, to minimize output variability across requests.
After receiving the model response, the script parsed the output and systematically mapped it to one of the ten predefined CoDA categories, applying normalization steps to handle minor formatting inconsistencies (such as capitalization, whitespace, or alternative label phrasing). Each prediction was stored together with accompanying metadata, including the model version, API endpoint used, raw output, processing timestamp, and any error or exception encountered during parsing.
Subsequently, the collected outputs were compared against the ground-truth labels provided by CoDA, and evaluation metrics were computed using established Python libraries, primarily scikit-learn for classification statistics and statsmodels for reliability analysis. This setup ensured that all model evaluations were performed under identical conditions, and that the full experimental workflow was logged for future auditing or replication.
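The condensed sketch below captures the core of this loop for a single provider, using the OpenAI Python client as one example; the other models were queried analogously through their own APIs, and the helper names, abbreviated label set, and metadata layout are simplified relative to the full scripts in Appendix A.

```python
# Condensed per-document classification step: build the prompt, query the model
# deterministically, normalize the reply, and log metadata. Helper names and the
# metadata fields are illustrative; the full pipeline is listed in Appendix A.
import json
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()                                          # reads OPENAI_API_KEY from the environment
VALID_LABELS = {"drugs", "gambling", "hacking", "violence", "others"}  # abbreviated; ten in practice

def build_prompt(document: str) -> str:                    # placeholder; see the Section 3.3 sketch
    return f'Classify the text into one CoDA category. Text: """{document}""" Return JSON {{"category": "<label>"}}.'

def classify(document: str, model: str = "gpt-4o") -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(document)}],
        temperature=0,                                     # deterministic decoding where supported
    )
    raw = response.choices[0].message.content
    try:
        label = json.loads(raw)["category"].strip().lower()
        if label not in VALID_LABELS:
            label = "parse_error"
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        label = "parse_error"
    return {"model": model, "raw_output": raw, "label": label,
            "timestamp": datetime.now(timezone.utc).isoformat()}
```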
4. Results
This section presents the evaluation results of the eight commercial LLMs tested in the zero-shot classification of illicit Dark Web content using the CoDA dataset. The results are organized around the three research questions: (RQ1) classification performance, (RQ2) agreement with human labels, and (RQ3) inter-model consistency. Quantitative results are complemented by error analysis to better understand model behavior.
As shown in
Table 5, the document refers to operational changes in a vendor’s shipping policy but also mentions Dark Web marketplaces (e.g., DarkMarket, Empire, Cannazon). While human annotators categorized this entry as “Others” due to its informational rather than transactional nature, GPT-3.5 Turbo classified it as “Drugs,” likely due to the co-occurrence of drug-marketplace names. Such examples reveal how LLMs may over-rely on lexical cues (e.g., marketplace names) without fully interpreting the broader semantic context. This pattern was recurrent in low-performing categories like Violence, Electronic, and Financial, where contextual overlap and subtle wording frequently triggered misclassification.
The macro-level performance of each model is summarized in
Table 6, reporting weighted precision, recall, and F1-score values with 95% confidence intervals (CIs). Overall, DeepSeek Chat achieved the highest F1-score (0.870 [0.862, 0.877]), closely followed by Grok (0.868 [0.861, 0.876]), Gemini 2.0 Flash (0.861 [0.854, 0.868]), and DeepSeek Reasoner (0.860 [0.853, 0.867]). The OpenAI models GPT-4o Mini and GPT-4o reached F1-scores of 0.859 [0.851, 0.866] and 0.857 [0.849, 0.864], respectively. Claude 3.5 Haiku followed with 0.854 [0.846, 0.862], while GPT-3.5 Turbo obtained the lowest performance, with an F1-score of 0.831 [0.823, 0.838].
In addition to aggregate metrics, we further analyzed the performance of the best-performing LLM, DeepSeek Chat, by computing its confusion matrix (
Figure 1). This visualization reveals category-level misclassifications and provides insights into common confusion patterns, particularly in ambiguous or overlapping categories such as Others, Violence, and Financial.
To further examine the statistical significance of the observed performance differences among the evaluated commercial LLMs,
Table 7 reports the pairwise F1-score differences with 95% confidence intervals and BH-adjusted
p-values. The results confirm that GPT-3.5 Turbo significantly underperforms compared to all other models, with highly significant differences (
p < 0.001) against Claude 3.5 Haiku, GPT-4o, GPT-4o Mini, DeepSeek Reasoner, Gemini 2.0 Flash, Grok, and DeepSeek Chat. In contrast, DeepSeek Chat shows statistically significant improvements over GPT-4o, Claude 3.5 Haiku, and DeepSeek Reasoner (
p < 0.05), but not when compared with GPT-4o Mini or Gemini 2.0 Flash, suggesting comparable effectiveness among these top-performing models. Other pairwise comparisons, such as between Grok, Gemini 2.0 Flash, and DeepSeek Reasoner, exhibit non-significant
p-values, indicating no meaningful difference in F1 performance. Overall, these findings emphasize the robustness and competitive parity of newer-generation LLMs, while highlighting the substantial performance gap that persists for GPT-3.5 Turbo.
In terms of per-class performance (
Table 8), all models exhibited strong results for categories such as Drugs and Gambling, with F1-scores frequently exceeding 94%. For instance, DeepSeek Chat achieved the highest F1-score in Drugs (0.955), while GPT-4o and GPT-4o Mini both reached top scores in Gambling (0.980). Grok also performed exceptionally well in Arms (0.940) and Drugs (0.942), confirming its strength in lexically anchored categories. Across most models, these categories showed minimal variance, indicating that tasks with explicit semantic cues remain consistently easier to classify under zero-shot prompting conditions.
Conversely, categories such as Violence, Electronic, and Hacking proved more challenging for all evaluated models, with F1-scores frequently dropping below 80%. Notably, GPT-3.5 Turbo recorded the lowest results in these ambiguous domains, reaching only 0.682 in Violence and 0.735 in Electronic, whereas larger or more recent models like GPT-4o (0.803 in Violence) and DeepSeek Reasoner (0.782 in Violence) showed moderate improvements. These trends underscore the continued difficulty of handling diffuse or overlapping semantic boundaries within illicit content categories.
A comprehensive version of this table, including per-category F1-scores with 95% confidence intervals (CIs) for all LLMs on the CoDA dataset, is provided in
Appendix B for full statistical reference.
Regarding the alignment of model predictions with human annotations (RQ2),
Table 9 presents inter-coder agreement metrics, including Cohen’s Kappa, weighted Cohen’s Kappa, and Krippendorff’s Alpha with 95% confidence intervals. Grok and DeepSeek Chat achieved the highest weighted Cohen’s Kappa values (0.867 [0.855, 0.875] and 0.862 [0.852, 0.871], respectively), indicating a high degree of agreement with human-labeled CoDA annotations. These results were further corroborated by Krippendorff’s Alpha, where both models again topped the rankings (0.844 and 0.845, respectively). GPT-4o and GPT-4o Mini also showed strong consistency (both with 0.832 in Kappa and Alpha), though slightly below Grok and DeepSeek Chat.
Continuing with the analysis of inter-model agreement (RQ3), we computed pairwise Cohen’s Kappa scores with 95% confidence intervals between all evaluated LLM pairs. As shown in
Table 10, the highest inter-model consistency was observed between models from the same provider or with similar architectural foundations. In particular, Claude 3.5 Haiku and GPT-4o exhibited the strongest inter-model agreement (κ = 0.911), followed closely by DeepSeek Chat and Grok (κ = 0.909), as well as Claude 3.5 Haiku with Gemini 2.0 Flash (κ = 0.907) and DeepSeek Chat with Gemini 2.0 Flash (κ = 0.907). Similarly, GPT-4o and GPT-4o Mini showed high alignment (κ = 0.889), reflecting cohesive classification behavior among models sharing training data or instruction-tuning methodologies.
Conversely, pairs involving GPT-3.5 Turbo consistently demonstrated lower inter-model agreement, with Kappa values between 0.832 and 0.856, indicating greater divergence in decision boundaries and reduced consistency with newer-generation LLMs. These results highlight that while modern architectures tend to converge toward similar latent representations and output patterns, earlier models such as GPT-3.5 Turbo maintain distinctive and less harmonized behaviors.
A detailed version of this table, including pairwise Cohen’s Kappa values with 95% confidence intervals (CIs) for evaluating inter-model agreement among commercial LLMs, is provided in
Appendix C.
To assess the overall reliability across models and with human annotations, we computed Krippendorff’s Alpha. As shown in
Table 11, the agreement including the original CoDA labels reached 0.871 [0.866, 0.877], indicating strong alignment between human annotations and LLM predictions. When considering only the model outputs, the alpha increased slightly to 0.884 [0.879, 0.889], suggesting even greater internal consistency among the evaluated LLMs.
These results confirm that commercial LLMs not only perform well individually but also maintain high coherence with each other and with human-labeled data, reinforcing their robustness for classifying complex Dark Web content.
As shown in
Table 12, traditional supervised classifiers such as SVM, CNN, and BERT [
3] slightly outperform the best-performing commercial LLM (DeepSeek Chat) in terms of precision and F1-score. This highlights the advantage of task-specific training on the CoDA dataset.
However, DeepSeek Chat delivers competitive results in a zero-shot setting, requiring no fine-tuning. Its stable and general performance makes it a practical alternative in scenarios with limited labeled data or high training costs.
5. Discussion
This section discusses the experimental findings in relation to the three core dimensions explored in this study: the effectiveness of LLMs in zero-shot illicit content classification, their alignment with human judgment, and the consistency of their decisions across different commercial models.
5.1. Classification Effectiveness of Commercial LLMs
The primary objective of this study was to assess the effectiveness of leading commercial Large Language Models in classifying illicit Dark Web content under zero-shot prompting conditions. Our findings confirm that modern LLMs exhibit substantial generalization capabilities even in adversarial domains such as the Dark Web, despite the absence of task-specific fine-tuning. All eight evaluated models achieved strong macro-averaged F1-scores, ranging from 0.831 [0.823, 0.838] (GPT-3.5 Turbo) to 0.870 [0.862, 0.877] (DeepSeek Chat), indicating their maturity and applicability to complex classification tasks in high-risk settings.
Among the tested models, DeepSeek Chat, Grok, and Gemini 2.0 Flash delivered the highest macro-level performance, slightly outperforming OpenAI’s GPT-4o and GPT-4o Mini. While OpenAI models have traditionally led benchmarks in general-purpose natural language understanding, our results suggest that competitors, particularly xAI, Google, and DeepSeek, have considerably narrowed this gap. These observations align with Giannilias et al. [
14], who reported similarly competitive results using open-source LLMs like Mistral and LLaMA in cybercrime-related classification tasks.
To assess the statistical significance of the observed performance differences, 95% confidence intervals were computed using nonparametric bootstrap resampling. In several cases, these intervals overlap, particularly among DeepSeek Chat, GPT-4o Mini, and Gemini 2.0 Flash, indicating that, although DeepSeek Chat achieves one of the highest mean F1-scores, not all pairwise differences are statistically significant. Therefore, the apparent superiority of some models should be interpreted with caution. Pairwise F1-score comparisons with adjusted p-values show that GPT-3.5 Turbo consistently underperforms compared to all other models, with highly significant differences (p < 0.001) against Claude 3.5 Haiku, GPT-4o, GPT-4o Mini, DeepSeek Reasoner, Gemini 2.0 Flash, Grok, and DeepSeek Chat. In contrast, DeepSeek Chat demonstrates statistically significant improvements over GPT-4o, Claude 3.5 Haiku, and DeepSeek Reasoner (p < 0.05), while maintaining comparable performance to GPT-4o Mini and Gemini 2.0 Flash. These findings highlight the robustness and parity among the most recent commercial LLMs, while emphasizing the notable performance gap that persists for earlier-generation models such as GPT-3.5 Turbo.
At a category-specific level, tasks with clear lexical anchors, such as Drugs and Gambling, yielded the highest F1-scores across all models, frequently exceeding 94–95%. For example, DeepSeek Chat scored 0.955 [0.945, 0.965] on Drugs, while GPT-4o and GPT-4o Mini both reached 0.980 [0.972, 0.988] in Gambling. These results suggest that zero-shot prompting can approach or even replicate the performance of specialized models like DarkBERT [
4] when applied to semantically well-defined categories.
Conversely, categories with diffuse semantics, such as Violence, Electronic, and Hacking, posed a greater challenge. These typically resulted in F1-scores below 80%, reflecting ambiguity and semantic overlap with other classes. For example, GPT-3.5 Turbo scored only 0.682 [0.644, 0.717] on Violence and 0.735 [0.695, 0.771] on Electronic. This is consistent with prior findings from Prado-Sánchez et al. [
10], where even GPT-4 struggled with fine-grained category differentiation under structured prompting conditions.
Model-level behavior also differed. GPT-4o Mini occasionally outperformed its larger counterpart in categories like Others and Electronic, suggesting that model size is not the sole determinant of zero-shot robustness. Grok, meanwhile, excelled in domain-specific categories like Gambling and Drugs, but showed reduced stability in more ambiguous classes. These discrepancies likely stem from variations in pretraining data, instruction tuning regimes, and sensitivity to prompt formulation.
In sum, our analysis reinforces that zero-shot prompting with commercial LLMs is a viable method for Dark Web content classification. However, performance varies significantly across categories and overlapping confidence intervals among top models suggest the need for careful statistical interpretation. Future research should explore prompt adaptation or few-shot fine-tuning to address residual ambiguity and improve classification reliability in operational deployments.
5.2. Alignment of Model Classifications with Human Judgment
The second objective of this study was to examine the degree to which commercial LLMs align with human judgment when classifying illicit Dark Web content. This alignment was evaluated using two well-established inter-rater reliability metrics: Krippendorff’s Alpha and Cohen’s Kappa, both of which are commonly employed in annotation tasks characterized by subjectivity, semantic ambiguity, or incomplete information.
The results indicate that several LLMs, including DeepSeek Chat, Grok, and GPT-4o, achieved high levels of agreement with human-labeled ground truth in the CoDA dataset. As shown in
Table 9, weighted Cohen’s Kappa scores surpassed 0.86 for these models (e.g., Grok: 0.867 [0.861–0.873]; DeepSeek Chat: 0.863 [0.857–0.869]), suggesting strong reliability relative to expert human annotations. This is further supported by the Krippendorff’s Alpha values in
Table 11, which reached 0.871 [0.866–0.877] when including human annotations and 0.884 [0.879–0.889] among models alone. These levels are consistent with prior work using GPT-4 on sensitive classification tasks [
10], and are comparable to expert-human agreement levels in social science research [
19] and qualitative analysis [
24].
Importantly, this comparative analysis across eight commercial LLMs confirms that high alignment with human labels is no longer exclusive to OpenAI models. DeepSeek Chat and Grok produced agreement metrics that matched or slightly exceeded those of GPT-4o, highlighting that alternative providers have matured in generating semantically consistent and instruction-compliant responses even in zero-shot settings.
Nonetheless, the results also revealed systematic differences in annotation behavior. A consistent pattern among models was their conservative tendency to assign semantically ambiguous or lexically underspecified instances to the Others category. This fallback behavior, more frequent than in human coders, reflects a risk-averse classification strategy likely driven by the lack of contextual clues or prompt-specific anchors, an effect also observed in gender classification studies [
11].
While this cautious bias may reduce false positives, which is critical in operational settings such as threat intelligence or forensic investigations, it may also lead to underreporting of specific illicit content types. Human annotators, relying on inferential reasoning or broader context, are often more willing to assign nuanced categories despite lexical ambiguity. In contrast, LLMs, particularly under zero-shot constraints without access to retrieval or grounding, tend to default to safer options.
These findings underscore a key limitation of zero-shot LLM classification: although aggregate-level agreement with human judgments is high, decision behavior under uncertainty still diverges. For real-world applications, this suggests the need for hybrid workflows where low-confidence classifications are escalated for human review, or where uncertainty-aware prompting techniques are incorporated to mitigate overuse of fallback categories.
5.3. Consistency of Classification Across Different LLMs
The third dimension of this study focuses on the degree of consistency among commercial LLMs when classifying the same set of illicit Dark Web documents. Inter-model agreement is particularly relevant in high-stakes or ambiguous classification contexts, where deployment pipelines may rely on multiple models or assess uncertainty through model diversity.
Our updated analysis reveals a high level of convergence across the evaluated LLMs, particularly among models from the same provider or alignment family. As shown in
Table 10, pairwise Cohen’s Kappa scores with 95% confidence intervals exceeded 0.89 for several closely related model pairs: Claude 3.5 Haiku and GPT-4o (κ = 0.911 [0.904, 0.918]), DeepSeek Chat and Grok (κ = 0.909 [0.903, 0.916]), Claude 3.5 Haiku and Gemini 2.0 Flash (κ = 0.907 [0.900, 0.914]), and DeepSeek Chat and Gemini 2.0 Flash (κ = 0.907 [0.900, 0.914]). GPT-4o and GPT-4o Mini also showed high alignment (κ = 0.889 [0.882, 0.897]), reflecting cohesive classification behavior among models sharing architectural and instruction-tuning foundations. These values suggest that architectural lineage, alignment strategies, and shared training paradigms contribute to highly similar decision boundaries under zero-shot prompting. This finding supports prior evidence from Giannilias et al. [
14], who noted nearly identical behavior among GPT-family and Anthropic-Google models on multiclass classification tasks outside the Dark Web domain.
However, when evaluating agreement across the full ensemble of LLMs using Krippendorff’s Alpha, the metric drops to 0.884 [0.879–0.889] without human labels, and 0.871 [0.866–0.877] when including them (
Table 11). These values indicate moderate-to-high but not perfect convergence, with semantic divergence especially notable in diffuse or overlapping categories like Others, Financial, and Electronic. In these cases, classification appears to be more sensitive to differences in pretraining data, prompt interpretations, and risk calibration strategies.
Interestingly, while Grok and DeepSeek Chat tended to assign more definitive labels, often committing to narrower illicit categories, models like GPT-4o and Gemini 2.0 Flash displayed more conservative classification patterns, aligning with their cautious behavior also noted in
Section 5.2. This echoes findings from Gilardi et al. [
19], who documented differences in granularity and confidence thresholds across LLMs even under standardized annotation setups.
From an operational perspective, inter-model disagreement offers both challenges and design opportunities. While inconsistencies may flag ambiguous cases that require human intervention, the strong convergence observed in clearly defined categories such as Drugs, Gambling, and Pornography supports the use of ensemble or majority-vote mechanisms to reinforce confidence in automated pipelines. The potential for leveraging disagreement as an uncertainty signal, particularly in edge cases, also suggests promising directions for developing audit-ready content monitoring systems in sensitive environments.
In summary, while LLMs exhibit robust individual performance, their collective behavior varies by context and category, reinforcing the importance of multi-model evaluation frameworks in high-risk domains like the Dark Web.
5.4. Ethical Considerations and Dataset Bias
Given the sensitive nature of illicit content in the Dark Web, ethical considerations are central to the design, execution, and interpretation of this study. While the CoDA dataset contains only anonymized text snippets and excludes personally identifiable information, the classification of potentially harmful content, including categories such as Violence, Pornography, or Drugs, raises legitimate concerns about misclassification risks and the amplification of bias.
One limitation of the CoDA dataset is the lack of inter-annotator agreement (IAA) scores from the original annotation process. Without validated measures of human consistency, such as Cohen’s Kappa or Krippendorff’s Alpha among annotators, it is difficult to fully assess the subjectivity or ambiguity inherent in certain class assignments, particularly in overlapping or context-dependent categories like Others, Financial, or Violence. This gap reduces our ability to evaluate the origin of disagreement between models and human annotations and complicates bias mitigation.
In terms of preprocessing, we applied a cosine similarity threshold of 0.95 to remove near-duplicate documents using TF-IDF vectors, following established practices in information retrieval. While this approach reduces redundancy and overfitting risk, we did not perform a sensitivity analysis to assess how the choice of threshold might impact class balance, particularly for underrepresented categories. As a result, further investigation is needed to ensure that deduplication strategies do not disproportionately affect rare or marginal cases.
Additionally, the zero-shot classification setting, while avoiding task-specific fine-tuning, imposes its own constraints. LLMs tend to default to conservative labeling strategies, especially when faced with lexically sparse or semantically vague inputs. This was reflected in the overuse of fallback categories like Others, which may inadvertently underreport specific illicit activities or obscure actionable intelligence. These tendencies highlight the tension between minimizing false positives and maintaining sufficient granularity in sensitive applications.
Future research should address these limitations by (i) incorporating datasets with publicly available IAA metrics; (ii) testing alternative deduplication thresholds and their effect on minority class retention; and (iii) exploring the use of confidence scores or model uncertainty as flags for human review in deployment pipelines. Ensuring ethical robustness and transparency is crucial for the responsible use of LLMs in security-sensitive environments.
5.5. Limitations of the Study
While the findings presented offer significant insights into the capabilities of commercial LLMs for Dark Web content classification, several limitations must be acknowledged. First, all evaluations were conducted under zero-shot prompting conditions without incorporating domain-specific fine-tuning or retrieval-augmented generation techniques, which could potentially enhance model performance. Second, the CoDA dataset, although comprehensive, contains inherent ambiguities and overlaps between categories that might influence the interpretation of classification errors and model disagreements. Third, human labels used for ground truth comparison could themselves include biases or inconsistencies, affecting agreement metric interpretations. Fourth, the models were evaluated under deterministic API settings (e.g., temperature set to zero), which may not fully reflect real-world variability in model outputs.
To reduce variance and ensure reliability in results, confidence intervals have been computed for key evaluation metrics. This enhances the robustness of the analysis and mitigates concerns about result generalizability.
Finally, given the rapid evolution of LLMs, the models evaluated in this study represent a snapshot in time; newer or future versions could yield different classification behaviors. Future research should explore adaptive prompting strategies, investigate few-shot or fine-tuned configurations, and extend evaluation frameworks to cover multi-label and hierarchical classification schemes.
6. Conclusions
This study presented a comprehensive comparative evaluation of eight commercial Large Language Models (LLMs) in the task of zero-shot classification of illicit Dark Web content using the CoDA dataset. Our analysis spanned three critical dimensions: classification accuracy (RQ1), alignment with human judgment (RQ2), and inter-model agreement (RQ3). The findings offer robust evidence on the current capabilities and limitations of LLMs when applied to high-risk, semantically complex domains.
First, in terms of classification effectiveness, all evaluated LLMs demonstrated strong macro-level performance, with F1-scores ranging from 0.831 to 0.870. Notably, models such as DeepSeek Chat, Grok, and Gemini 2.0 Flash outperformed GPT-4o and GPT-4o Mini in several categories, underscoring the progress of alternative providers in closing the performance gap with OpenAI. While models performed exceptionally well in lexically rich categories such as Drugs and Gambling (F1 > 0.94), performance dropped in more ambiguous classes like Violence and Electronic, highlighting the importance of category-specific evaluation.
Second, our evaluation of model alignment with human-labeled CoDA annotations revealed high inter-rater reliability. Weighted Cohen’s Kappa and Krippendorff’s Alpha values exceeded 0.84 for top-performing models such as Grok and DeepSeek Chat. These results indicate that several LLMs can emulate human labeling behavior with strong consistency, although systematic patterns like overuse of the fallback Others category persist, especially in cases of lexical or contextual ambiguity.
Third, the study found substantial convergence in classification decisions among models with similar architectures or alignment paradigms. Pairwise Cohen’s Kappa scores surpassed 0.90 between model pairs like DeepSeek Chat and Grok or GPT-4o and GPT-4o Mini. However, Krippendorff’s Alpha across the full model ensemble revealed only moderate agreement (0.871–0.884), with higher variability observed in diffuse categories. This suggests that while intra-family consistency is high, architectural and training differences still lead to divergence in handling semantically complex cases.
Importantly, our study also highlighted ethical and methodological considerations. The absence of inter-annotator agreement data in CoDA, the lack of sensitivity analysis for deduplication thresholds, and the deterministic nature of API querying all present limitations that warrant future investigation. Nonetheless, we addressed concerns related to reproducibility by reporting 95% confidence intervals across all major metrics and by making our classification code publicly accessible.
In conclusion, commercial LLMs have reached a level of maturity that makes them viable tools for forensic Dark Web classification in zero-shot settings. However, operational deployment should consider category-specific weaknesses, model variability, and ethical safeguards. Future work should explore adaptive prompting strategies, retrieval-augmented techniques, few-shot learning configurations, and the use of hierarchical or multi-label taxonomies to better capture the complexity of real-world illicit content monitoring.