1. Introduction
The Dark Web represents a concealed sector of the internet characterized by high anonymity, volatility, and the proliferation of illicit activities. Accessible primarily via anonymity networks such as Tor, Dark Web platforms host a range of illegal services and content, from drug markets and arms sales to hacking tools and forged documents [
1,
2]. For cybersecurity professionals and intelligence analysts, identifying and categorizing this content is essential; however, the task is inherently challenging due to the linguistic diversity, thematic ambiguity, and the constantly evolving nature of these online environments.
To support the development of automated classifiers, several labeled datasets have been introduced, including DUTA, DUTA-10K, and, most notably, CoDA (Comprehensive Darkweb Annotations), a corpus of 10,000 manually annotated Dark Web documents across ten illicit categories [
3]. Early classification efforts relied on supervised machine learning models and, more recently, on fine-tuned transformer-based architectures such as BERT and DarkBERT, which have demonstrated strong performance on CoDA and similar corpora [
4,
5]. Nonetheless, these approaches face practical limitations: they require annotated data, extensive retraining, and domain-specific knowledge, making them costly to deploy and update in dynamic forensic settings [
6].
In contrast, zero-shot classification using Large Language Models (LLMs) offers a scalable alternative by allowing categorization based solely on natural language prompts. Recent studies have shown that models such as GPT-3.5 and GPT-4 can classify text with competitive performance in zero-shot scenarios without fine-tuning, particularly when guided by well-structured instructions [
7,
8,
9]. In the context of the Dark Web, Prado-Sánchez et al. [
10] demonstrated the viability of zero-shot prompting for classifying CoDA documents, although with limitations in handling ambiguous or semantically overlapping categories like Others or Electronic.
Other recent works have also emphasized the importance of model alignment with human judgment. For instance, Prado-Sánchez et al. [
6] measured inter-annotator agreement between human experts and ChatGPT (OpenAI,
https://chat.openai.com, accessed on 18 October 2025), revealing that in some domains, LLMs can match or exceed human consistency. Similarly, in the domain of name-based gender detection, ChatGPT demonstrated higher classification coverage and stability compared to specialized APIs like Namsor or Gender-API, especially in culturally ambiguous cases [
11]. These findings support the broader hypothesis that LLMs can serve not only as efficient classifiers, but also as reliable surrogates for auditing and validating human annotations.
While these studies have explored the utility of individual models, they lack a systematic comparison across multiple commercial LLMs. With the growing number of powerful proprietary models, such as Claude 3.5 Haiku, DeepSeek Reasoner, Gemini 2.0 Flash, Grok, and the latest versions of GPT, there is a pressing need to benchmark them under standardized settings, not only in terms of classification accuracy, but also in their alignment with human decisions and mutual consistency across outputs.
This study addresses that gap by conducting a comparative experimental analysis of eight state-of-the-art commercial LLMs (GPT-3.5 Turbo, GPT-4o, GPT-4o Mini, Claude 3.5 Haiku, DeepSeek Chat, DeepSeek Reasoner, Gemini 2.0 Flash, and Grok) on the task of zero-shot classification of Dark Web content using the CoDA dataset. We seek to answer the following research questions:
RQ1: How do state-of-the-art commercial LLMs compare in classifying Dark Web content under zero-shot prompting?
RQ2: To what extent do large language models (LLMs) classify Dark Web content in a manner consistent with human judgment?
RQ3: How similarly do different commercial LLMs classify Dark Web content?
By addressing these questions, this work contributes to the growing body of research on the use of LLMs in high-risk, real-world applications. It sheds light on the comparative performance, reliability, and alignment of commercial LLMs in one of the most sensitive domains for natural language classification.
The remainder of this paper is organized as follows.
Section 2 reviews related work on illicit Dark Web content classification and recent evaluations of LLM performance and reliability.
Section 3 describes the methodology employed to assess the classification capabilities of eight commercial LLMs, their alignment with human judgment, and their inter-model consistency.
Section 4 presents the experimental results obtained.
Section 5 discusses the findings in relation to the study objectives. Finally,
Section 6 concludes the paper and suggests directions for future research.
2. Related Work
The task of classifying illicit content on the Dark Web has been the subject of increasing academic attention due to its implications for cybersecurity, law enforcement, and intelligence gathering. Early approaches relied on supervised learning algorithms such as support vector machines (SVMs), logistic regression, and decision trees, trained on manually annotated corpora like DUTA and DUTA-10K [
1,
12]. While these datasets provided a foundation for benchmarking, they were limited in scope and domain coverage. The introduction of the CoDA dataset [
3], which includes 10,000 documents manually labeled across ten illicit categories, has since become the standard benchmark for text-based classification of Dark Web content.
The emergence of transformer-based deep learning models, including BERT and its domain-specific variant DarkBERT [
4], has significantly improved classification performance on CoDA. Some models have surpassed 90% F1-scores in certain categories [
5]. Nonetheless, these models require large amounts of labeled data and domain-specific retraining, limiting their rapid applicability in operational scenarios [
13]. Recent studies, such as Giannilias et al. [
14], have explored zero-shot, few-shot, and fine-tuned LLMs, such as Mistral, Gemma, and LLaMA, for classifying hacker-related content on the Dark Web, reporting viable results even in resource-constrained environments. Complementary work by Chen et al. [
15] proposed a legal document retrieval-based approach to classify illicit content using LLMs without needing fine-tuned classifiers, highlighting the potential of prompt engineering for Dark Web tasks.
In contrast, zero-shot classification with LLMs allows models to be applied directly to classification tasks without additional task-specific training. Recent studies have shown that models such as GPT-3.5 and GPT-4 can perform competitively in zero-shot settings across a range of domains, including financial document classification by Loukas et al. [
8], hate speech detection by Bauer et al. [
16], and disinformation analysis [
7,
9,
17]. Within the Dark Web context, Prado-Sánchez et al. [
10] applied zero-shot prompting with GPT models to the CoDA dataset, demonstrating promising performance while noting that categories such as Electronic and Others remained difficult to disambiguate.
More recent work has shifted focus toward the reliability and alignment of LLM outputs with human judgment. In Prado-Sánchez et al. [
6], the authors measured inter-coder agreement between human annotators and ChatGPT using Krippendorff’s alpha and Cohen’s kappa, finding comparable reliability in several categories. Further, Prado-Sánchez et al. [
6] evaluated the explanatory power of LLM-generated rationales, concluding that while these may enhance perceived interpretability, they do not consistently improve annotation accuracy. Parallel findings have emerged in the domain of gender detection: Domínguez-Díaz et al. [
11] showed that ChatGPT outperforms specialized APIs like Namsor and Gender-API in both coverage and consistency, especially when resolving ambiguous cultural name contexts.
Beyond these domain-specific evaluations, recent studies have begun to examine intercoder reliability between LLMs and humans across annotation tasks. Dunivin [
18] found that GPT-4 achieved intercoder agreement with human raters comparable to expert-level reliability (Cohen’s κ ≥ 0.79) in qualitative coding tasks, whereas GPT-3.5 exhibited lower consistency. Similarly, Gilardi et al. [
19] reported that ChatGPT produced annotations more consistent with expert ground truth than those from crowd workers, highlighting the feasibility of LLMs as stand-alone coders in certain social science contexts [
20].
Despite these promising results, most studies are limited to single-model evaluations, usually involving GPT-3.5 or GPT-4. A comprehensive benchmark that compares the most advanced commercial LLMs including Claude, DeepSeek, Gemini, Grok, and multiple GPT variants has yet to be conducted. The only exception is a recent survey-style review by Chen et al. [
15], which outlines potential applications of LLMs in the Dark Web but does not present empirical benchmarking. As a result, our understanding of model performance diversity, robustness, and agreement in high-stakes classification tasks remains limited.
Our work addresses this need by conducting a zero-shot comparative analysis of eight commercial LLMs on the CoDA dataset. In addition to evaluating classification performance, we assess alignment with human annotations and inter-model consistency, offering a broader view of the capabilities and limitations of current commercial LLMs in the context of Dark Web content analysis.
3. Methodology
This section describes the experimental design developed to evaluate the classification capabilities, human alignment, and consistency of eight commercial LLMs in a zero-shot setting on illicit Dark Web content. The methodology is structured to provide a transparent, reproducible, and fair comparison across models and dimensions. It covers the dataset used, the selection and configuration of LLMs, the prompt formulation protocol, the evaluation metrics applied, and the technical details of the implementation.
3.1. Dataset
The foundation of our experiments is the CoDA (Comprehensive Darkweb Annotations) dataset, a manually labeled corpus introduced by Jin et al. [
3], specifically designed for the classification of content extracted from Dark Web domains. It comprises 10,000 documents, each assigned to one of ten pre-defined illicit categories: Others, Drugs, Financial, Gambling, Cryptocurrency, Hacking, Arms, Violence, Porn, and Electronics.
Table 1 presents the language distribution of the 10,000 documents in the CoDA dataset. English is the overwhelmingly dominant language, accounting for 88.5% of the corpus (8855 documents), followed by Russian (542), German (129), and French (100). A wide range of other languages is present with much lower frequency, including Spanish (61), Portuguese (54), Chinese (38), and more than 40 additional languages with fewer than 15 documents each. This substantial linguistic skew toward English has direct implications for both model evaluation and generalization, particularly when assessing performance in low-resource languages. The multilingual nature of the dataset, despite its imbalance, reflects the diversity of communication within Dark Web forums and marketplaces.
Each document in CoDA is a textual excerpt crawled from real .onion websites on the Tor network. These texts are typically brief, unstructured, and exhibit substantial lexical noise, including spelling errors, use of slang, code-switching between languages, and intentional obfuscation, a realistic reflection of adversarial environments on the Dark Web. The diversity and complexity of these documents make CoDA a highly appropriate benchmark for testing the robustness of language models under zero-shot constraints.
To ensure a more robust and representative evaluation of the model’s zero-shot classification capabilities, we performed a preprocessing step to remove duplicate and near-duplicate documents from the dataset. This step was taken to prevent the artificial inflation of performance metrics that can occur when highly similar instances are repeatedly presented to the model. By reducing redundancy in the input data, we aimed to minimize potential biases and ensure that the evaluation better reflects the model’s ability to generalize across a diverse range of document types and contents.
Duplicate detection was performed using cosine similarity over TF-IDF vector representations of the documents. Any pair of documents with a similarity score above 0.95 was considered near-duplicate, and one of them was removed from the dataset. This threshold was selected after manually reviewing borderline cases to ensure that documents flagged as near-duplicates were practically identical. Typical removed entries included replicated pages differing only in minor metadata, such as site titles or contact information. This approach follows established practices in information retrieval, where TF-IDF combined with cosine similarity is widely applied for redundancy detection [
21].
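As an illustration of this step, the sketch below implements a TF-IDF and cosine-similarity filter with scikit-learn; the function name, the greedy pair handling, and the toy corpus are simplifications of the preprocessing described above rather than the exact script used in our pipeline.

```python
# Simplified near-duplicate filter: TF-IDF vectors + cosine similarity with a
# 0.95 threshold, as described in the text. Names and the greedy strategy are
# illustrative, not the exact preprocessing script.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate(documents, threshold=0.95):
    """Return the indices of documents kept after near-duplicate removal."""
    tfidf = TfidfVectorizer().fit_transform(documents)   # sparse TF-IDF matrix
    sims = cosine_similarity(tfidf)                      # pairwise cosine similarities
    kept, removed = [], set()
    for i in range(len(documents)):
        if i in removed:
            continue
        kept.append(i)
        for j in range(i + 1, len(documents)):
            if sims[i, j] > threshold:                   # near-duplicate of a kept document
                removed.add(j)
    return kept

# Toy example: the second document is a near-duplicate of the first.
docs = ["buy pills here cheap", "buy pills here cheap!", "weapons for sale"]
print(deduplicate(docs))  # -> [0, 2]
```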
After deduplication, the dataset was reduced from 10,000 to 7970 documents. The updated distribution of documents per category, along with the average character length before and after deduplication, is summarized in
Table 2. Notably, all categories experienced a reduction in document count, with Porn and Gambling maintaining the highest average character lengths both before and after deduplication (25,597 to 15,749 and 8339 to 7246 characters, respectively). Overall, the mean text length across all categories decreased from 69,032 to 52,202 characters, reflecting the removal of repetitive and overly long entries.
Although CoDA contains only anonymized text, its illicit nature necessitates careful ethical handling. We ensured the absence of personally identifiable information and adhered to responsible research standards. Nevertheless, potential harm from misclassification, particularly in sensitive categories such as Violence or Porn, must be acknowledged. Moreover, since the original dataset lacks inter-annotator agreement data, assessing labeling bias remains limited.
3.2. Language Models Evaluated
We selected eight commercial LLMs from different providers based on their high performance, public availability via APIs, and relevance in state-of-the-art natural language understanding:
GPT-3.5 Turbo (OpenAI);
GPT-4o (OpenAI);
GPT-4o Mini (OpenAI);
Claude 3.5 Haiku (Anthropic);
Gemini 2.0 Flash (Google DeepMind);
DeepSeek Chat (DeepSeek);
DeepSeek Reasoner (DeepSeek);
Grok (xAI).
Each model was accessed using its official API or web interface between March and April 2025. The models were queried using deterministic parameters (e.g., temperature = 0 when available) to reduce sampling variance. All models were evaluated in their default zero-shot configuration, without fine-tuning or exposure to CoDA or similar datasets.
This selection reflects a broad spectrum of model architectures, training philosophies, and deployment ecosystems. The inclusion of models from multiple vendors also enables a meaningful comparison of their respective generalization abilities on real-world illicit content.
To provide additional context for model selection and performance interpretation,
Table 3 summarizes key characteristics of the evaluated commercial LLMs. These include estimated number of parameters, architecture type, training data sources, API provider, average cost per 1000 input/output tokens, and inference latency. The values reported are based on publicly available documentation from providers or best estimates when official figures are not disclosed.
3.3. Prompting Protocol and Base Template
All models were prompted using a standardized zero-shot instruction, designed to maximize semantic clarity and category disambiguation without providing any in-context examples. The prompt included the following:
A direct instruction to classify a given text.
A structured list of categories, each with a detailed but concise natural-language definition.
The document content, clearly enclosed between triple quotes.
A constraint to return only the classification, formatted in JSON with a predefined key.
This approach ensures that all models are evaluated under equal conditions and with the same informational context. The exact prompt used is shown in
Table 4.
This format was chosen based on the prior work of Prado-Sánchez et al. [
6] and manually refined through early testing to ensure proper category recognition by all LLMs. Each model received the same prompt structure, and outputs were post-processed only for formatting standardization (e.g., JSON parsing, lowercasing, trimming whitespace).
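For concreteness, the snippet below sketches one way of materializing this prompt structure and the subsequent output normalization; the category definitions and the JSON key name are placeholders, since the exact wording used in the experiments is the one reported in Table 4.

```python
# Illustrative prompt builder and response normalizer. The category definitions
# and the "category" JSON key are placeholders; the actual prompt wording is the
# one given in Table 4.
import json

CATEGORIES = {
    "Drugs": "sale or discussion of narcotics and controlled substances",
    "Gambling": "online casinos, betting, and lotteries",
    # ... remaining CoDA categories, each with a concise definition ...
    "Others": "content not covered by any other category",
}

def build_prompt(document: str) -> str:
    category_block = "\n".join(f"- {name}: {desc}" for name, desc in CATEGORIES.items())
    return (
        "Classify the following text into exactly one of the categories below.\n"
        f"{category_block}\n\n"
        f'Text: """{document}"""\n\n'
        'Return only the classification as JSON: {"category": "<label>"}'
    )

def normalize_response(raw_output: str) -> str:
    """Parse the JSON reply and standardize the label (lowercase, trimmed)."""
    return json.loads(raw_output)["category"].strip().lower()
```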
All models were queried using deterministic parameters (temperature = 0) to ensure stable outputs and reduce variance. However, we acknowledge that commercial APIs evolve over time: modifications to model weights, backend versions, or prompt handling may impact reproducibility. As such, the results presented reflect a snapshot of model behavior between March and April 2025. We encourage future replications to re-evaluate performance under current API conditions.
3.4. Evaluation Metrics
To comprehensively assess the performance of the evaluated models, we designed a multi-faceted evaluation protocol that addresses three main dimensions: classification accuracy, alignment with human judgment, and inter-model agreement.
All reported performance metrics are accompanied by 95% confidence intervals (CIs), estimated via nonparametric bootstrap resampling, to ensure statistical robustness and enable reliable comparisons across models. To address the issue of multiple pairwise comparisons, the Benjamini–Hochberg (BH) procedure was applied to control the false discovery rate (FDR) at a significance level of
α = 0.05. The BH correction was performed on the raw
p-values derived from global F1-score comparisons using the multipletests function from the Python statsmodels library (version 0.14.2; available at:
https://www.statsmodels.org, accessed on 18 October 2025). The adjusted p-values and corresponding rejection decisions were subsequently used to assess the statistical significance of differences among models [
22].
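As a minimal sketch of this procedure (the number of bootstrap resamples and the p-values shown are illustrative choices, not reported settings), the confidence intervals and the BH correction can be obtained with scikit-learn and statsmodels as follows.

```python
# Percentile bootstrap CI for the F1-score and Benjamini-Hochberg correction of
# raw p-values from pairwise model comparisons. Resample count and placeholder
# p-values are illustrative.
import numpy as np
from sklearn.metrics import f1_score
from statsmodels.stats.multitest import multipletests

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, seed=0):
    """95% percentile bootstrap confidence interval for the macro F1-score."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample documents with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    return float(np.percentile(scores, 2.5)), float(np.percentile(scores, 97.5))

# FDR control over the raw p-values of the pairwise F1 comparisons.
raw_p = [0.0004, 0.031, 0.27, 0.048]                      # placeholder values
reject, p_adjusted, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
```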
3.4.1. Classification Accuracy (RQ1)
To evaluate classification performance, we computed macro-averaged precision, recall, and F1-score across all ten CoDA categories for each model. The use of macro-averaged metrics was particularly important, as it ensures that each category contributes equally to the overall score, regardless of its frequency in the dataset. This approach mitigates biases that could arise due to the significant class imbalance inherent in CoDA, where categories such as Drugs are substantially more frequent than others like Violence or Gambling.
In addition to global performance indicators, we analyzed per-class metrics (precision, recall, and F1-score) to identify specific categories where models exhibited higher accuracy or, conversely, where they faced greater challenges. Moreover, confusion matrices were generated for each model, allowing us to uncover common misclassification patterns and better understand overlaps between semantically adjacent categories, such as between Hacking and Financial.
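These quantities can be computed directly with scikit-learn, as in the brief sketch below; the toy label lists are purely illustrative, whereas the actual evaluation covers the full deduplicated CoDA corpus.

```python
# Macro-averaged metrics, per-class breakdown, and confusion matrix for one
# model's predictions against the CoDA gold labels (toy data shown).
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_fscore_support)

gold = ["Drugs", "Gambling", "Others", "Drugs", "Hacking"]
pred = ["Drugs", "Gambling", "Drugs", "Drugs", "Hacking"]
labels = sorted(set(gold))                                # the ten CoDA categories in practice

# Macro averaging weights every category equally, regardless of frequency.
prec, rec, f1, _ = precision_recall_fscore_support(
    gold, pred, labels=labels, average="macro", zero_division=0)

print(classification_report(gold, pred, labels=labels, zero_division=0))
cm = confusion_matrix(gold, pred, labels=labels)          # rows: gold, columns: predicted
```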
3.4.2. Agreement with Human Judgment (RQ2)
To assess the extent to which model classifications align with human annotations, we employed two complementary agreement metrics: Cohen’s Kappa (κ) and Krippendorff’s Alpha (α). Cohen’s Kappa was calculated to measure the degree of pairwise agreement between model predictions and the ground-truth human labels provided in CoDA, considering the probability of chance agreement. Krippendorff’s Alpha, being a more general and robust reliability measure [
23], was additionally computed to accommodate the multi-categorical nature of the classification task and to better handle potential missing or ambiguous data points.
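The sketch below shows how both coefficients can be computed with scikit-learn and the krippendorff Python package; it is a simplified illustration with toy labels rather than the full evaluation code provided in Appendix A.

```python
# Agreement between one model's predictions and the CoDA gold labels:
# Cohen's kappa (scikit-learn) and Krippendorff's alpha (krippendorff package).
import numpy as np
import krippendorff
from sklearn.metrics import cohen_kappa_score

gold = ["Drugs", "Gambling", "Others", "Drugs", "Hacking"]
pred = ["Drugs", "Gambling", "Drugs", "Drugs", "Hacking"]

kappa = cohen_kappa_score(gold, pred)                     # chance-corrected pairwise agreement

# Alpha treats the human labels and the model as two raters; labels are encoded
# as integers for the nominal measurement level.
label_ids = {c: i for i, c in enumerate(sorted(set(gold) | set(pred)))}
reliability = np.array([[label_ids[x] for x in gold],
                        [label_ids[x] for x in pred]])
alpha = krippendorff.alpha(reliability_data=reliability, level_of_measurement="nominal")
```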
3.4.3. Inter-Model Agreement (RQ3)
To evaluate consistency across models, we first computed pairwise agreement percentages between each pair of LLMs. This metric captures the proportion of documents for which two different models produced the same classification output, offering insight into model similarity under identical zero-shot prompting conditions. In addition, to obtain a global measure of consensus across all models, we calculated Krippendorff’s Alpha treating each model as an independent rater. This metric allowed us to quantify the overall degree of convergence in classification decisions across the evaluated LLMs, highlighting areas of strong agreement as well as domains where model predictions diverged significantly, especially in ambiguous or borderline cases.
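A compact illustration of both measures is given below, where predictions is a hypothetical mapping from model names to their label sequences over the same documents; the structure and values are toy examples only.

```python
# Pairwise raw agreement and Cohen's kappa between models, plus Krippendorff's
# alpha with every model treated as an independent rater (toy data).
from itertools import combinations
import numpy as np
import krippendorff
from sklearn.metrics import cohen_kappa_score

predictions = {
    "model_a": ["Drugs", "Others", "Gambling", "Drugs"],
    "model_b": ["Drugs", "Others", "Drugs", "Drugs"],
    "model_c": ["Drugs", "Others", "Gambling", "Hacking"],
}

for m1, m2 in combinations(predictions, 2):
    a, b = predictions[m1], predictions[m2]
    raw = np.mean([x == y for x, y in zip(a, b)])         # proportion of identical labels
    print(f"{m1} vs {m2}: agreement={raw:.2f}, kappa={cohen_kappa_score(a, b):.2f}")

# Global consensus across the whole ensemble of models.
label_ids = {c: i for i, c in enumerate(sorted({l for seq in predictions.values() for l in seq}))}
matrix = np.array([[label_ids[l] for l in seq] for seq in predictions.values()])
print(krippendorff.alpha(reliability_data=matrix, level_of_measurement="nominal"))
```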
3.5. Experimental Setup
All evaluations were conducted using fully automated Python scripts (
Appendix A), designed to ensure the complete reproducibility and traceability of the experimental pipeline. For each document in the CoDA dataset, the script dynamically constructed the full prompt by inserting the text content into the standardized base template described previously. Once the prompt was prepared, the system queried each target model through its respective API, using deterministic parameters such as a temperature setting of zero when configurable, to minimize output variability across requests.
After receiving the model response, the script parsed the output and systematically mapped it to one of the ten predefined CoDA categories, applying normalization steps to handle minor formatting inconsistencies (such as capitalization, whitespace, or alternative label phrasing). Each prediction was stored together with accompanying metadata, including the model version, API endpoint used, raw output, processing timestamp, and any error or exception encountered during parsing.
Subsequently, the collected outputs were compared against the ground-truth labels provided by CoDA, and evaluation metrics were computed using established Python libraries, primarily scikit-learn for classification statistics and statsmodels for reliability analysis. This setup ensured that all model evaluations were performed under identical conditions, and that the full experimental workflow was logged for future auditing or replication.
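The condensed sketch below captures the core of this loop for a single provider, using the OpenAI Python client as one example; the other models were queried analogously through their own APIs, and the helper names, abbreviated label set, and metadata layout are simplified relative to the full scripts in Appendix A.

```python
# Condensed per-document classification step: build the prompt, query the model
# deterministically, normalize the reply, and log metadata. Helper names and the
# metadata fields are illustrative; the full pipeline is listed in Appendix A.
import json
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()                                          # reads OPENAI_API_KEY from the environment
VALID_LABELS = {"drugs", "gambling", "hacking", "violence", "others"}  # abbreviated; ten in practice

def build_prompt(document: str) -> str:                    # placeholder; see the Section 3.3 sketch
    return f'Classify the text into one CoDA category. Text: """{document}""" Return JSON {{"category": "<label>"}}.'

def classify(document: str, model: str = "gpt-4o") -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(document)}],
        temperature=0,                                     # deterministic decoding where supported
    )
    raw = response.choices[0].message.content
    try:
        label = json.loads(raw)["category"].strip().lower()
        if label not in VALID_LABELS:
            label = "parse_error"
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        label = "parse_error"
    return {"model": model, "raw_output": raw, "label": label,
            "timestamp": datetime.now(timezone.utc).isoformat()}
```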
4. Results
This section presents the evaluation results of the eight commercial LLMs tested in the zero-shot classification of illicit Dark Web content using the CoDA dataset. The results are organized around the three research questions: (RQ1) classification performance, (RQ2) agreement with human labels, and (RQ3) inter-model consistency. Quantitative results are complemented by error analysis to better understand model behavior.
As shown in
Table 5, the document refers to operational changes in a vendor’s shipping policy but also mentions Dark Web marketplaces (e.g., DarkMarket, Empire, Cannazon). While human annotators categorized this entry as “Others” due to its informational rather than transactional nature, GPT-3.5 Turbo classified it as “Drugs,” likely due to the co-occurrence of drug-marketplace names. Such examples reveal how LLMs may over-rely on lexical cues (e.g., marketplace names) without fully interpreting the broader semantic context. This pattern was recurrent in low-performing categories like Violence, Electronic, and Financial, where contextual overlap and subtle wording frequently triggered misclassification.
The macro-level performance of each model is summarized in
Table 6, reporting weighted precision, recall, and F1-score values with 95% confidence intervals (CIs). Overall, DeepSeek Chat achieved the highest F1-score (0.870 [0.862, 0.877]), closely followed by Grok (0.868 [0.861, 0.876]), Gemini 2.0 Flash (0.861 [0.854, 0.868]), and DeepSeek Reasoner (0.860 [0.853, 0.867]). The OpenAI models GPT-4o Mini and GPT-4o reached F1-scores of 0.859 [0.851, 0.866] and 0.857 [0.849, 0.864], respectively. Claude 3.5 Haiku followed with 0.854 [0.846, 0.862], while GPT-3.5 Turbo obtained the lowest performance, with an F1-score of 0.831 [0.823, 0.838].
In addition to aggregate metrics, we further analyzed the performance of the best-performing LLM, DeepSeek Chat, by computing its confusion matrix (
Figure 1). This visualization reveals category-level misclassifications and provides insights into common confusion patterns, particularly in ambiguous or overlapping categories such as Others, Violence, and Financial.
To further examine the statistical significance of the observed performance differences among the evaluated commercial LLMs,
Table 7 reports the pairwise F1-score differences with 95% confidence intervals and BH-adjusted
p-values. The results confirm that GPT-3.5 Turbo significantly underperforms compared to all other models, with highly significant differences (
p < 0.001) against Claude 3.5 Haiku, GPT-4o, GPT-4o Mini, DeepSeek Reasoner, Gemini 2.0 Flash, Grok, and DeepSeek Chat. In contrast, DeepSeek Chat shows statistically significant improvements over GPT-4o, Claude 3.5 Haiku, and DeepSeek Reasoner (
p < 0.05), but not when compared with GPT-4o Mini or Gemini 2.0 Flash, suggesting comparable effectiveness among these top-performing models. Other pairwise comparisons, such as between Grok, Gemini 2.0 Flash, and DeepSeek Reasoner, exhibit non-significant
p-values, indicating no meaningful difference in F1 performance. Overall, these findings emphasize the robustness and competitive parity of newer-generation LLMs, while highlighting the substantial performance gap that persists for GPT-3.5 Turbo.
In terms of per-class performance (
Table 8), all models exhibited strong results for categories such as Drugs and Gambling, with F1-scores frequently exceeding 94%. For instance, DeepSeek Chat achieved the highest F1-score in Drugs (0.955), while GPT-4o and GPT-4o Mini both reached top scores in Gambling (0.980). Grok also performed exceptionally well in Arms (0.940) and Drugs (0.942), confirming its strength in lexically anchored categories. Across most models, these categories showed minimal variance, indicating that tasks with explicit semantic cues remain consistently easier to classify under zero-shot prompting conditions.
Conversely, categories such as Violence, Electronic, and Hacking proved more challenging for all evaluated models, with F1-scores frequently dropping below 80%. Notably, GPT-3.5 Turbo recorded the lowest results in these ambiguous domains, reaching only 0.682 in Violence and 0.735 in Electronic, whereas larger or more recent models like GPT-4o (0.803 in Violence) and DeepSeek Reasoner (0.782 in Violence) showed moderate improvements. These trends underscore the continued difficulty of handling diffuse or overlapping semantic boundaries within illicit content categories.
A comprehensive version of this table, including per-category F1-scores with 95% confidence intervals (CIs) for all LLMs on the CoDA dataset, is provided in
Appendix B for full statistical reference.
Regarding the alignment of model predictions with human annotations (RQ2),
Table 9 presents inter-coder agreement metrics, including Cohen’s Kappa, weighted Cohen’s Kappa, and Krippendorff’s Alpha with 95% confidence intervals. Grok and DeepSeek Chat achieved the highest weighted Cohen’s Kappa values (0.867 [0.855, 0.875] and 0.862 [0.852, 0.871], respectively), indicating a high degree of agreement with human-labeled CoDA annotations. These results were further corroborated by Krippendorff’s Alpha, where both models again topped the rankings (0.844 and 0.845, respectively). GPT-4o and GPT-4o Mini also showed strong consistency (both with 0.832 in Kappa and Alpha), though slightly below Grok and DeepSeek Chat.
Continuing with the analysis of inter-model agreement (RQ3), we computed pairwise Cohen’s Kappa scores with 95% confidence intervals between all evaluated LLM pairs. As shown in
Table 10, the highest inter-model consistency was observed between models from the same provider or with similar architectural foundations. In particular, Claude 3.5 Haiku and GPT-4o exhibited the strongest inter-model agreement (κ = 0.911), followed closely by DeepSeek Chat and Grok (κ = 0.909), as well as Claude 3.5 Haiku with Gemini 2.0 Flash (κ = 0.907) and DeepSeek Chat with Gemini 2.0 Flash (κ = 0.907). Similarly, GPT-4o and GPT-4o Mini showed high alignment (κ = 0.889), reflecting cohesive classification behavior among models sharing training data or instruction-tuning methodologies.
Conversely, pairs involving GPT-3.5 Turbo consistently demonstrated lower inter-model agreement, with Kappa values between 0.832 and 0.856, indicating greater divergence in decision boundaries and reduced consistency with newer-generation LLMs. These results highlight that while modern architectures tend to converge toward similar latent representations and output patterns, earlier models such as GPT-3.5 Turbo maintain distinctive and less harmonized behaviors.
A detailed version of this table, including pairwise Cohen’s Kappa values with 95% confidence intervals (CIs) for evaluating inter-model agreement among commercial LLMs, is provided in
Appendix C.
To assess the overall reliability across models and with human annotations, we computed Krippendorff’s Alpha. As shown in
Table 11, the agreement including the original CoDA labels reached 0.871 [0.866, 0.877], indicating strong alignment between human annotations and LLM predictions. When considering only the model outputs, the alpha increased slightly to 0.884 [0.879, 0.889], suggesting even greater internal consistency among the evaluated LLMs.
These results confirm that commercial LLMs not only perform well individually but also maintain high coherence with each other and with human-labeled data, reinforcing their robustness for classifying complex Dark Web content.
As shown in
Table 12, traditional supervised classifiers such as SVM, CNN, and BERT [
3] slightly outperform the best-performing commercial LLM (DeepSeek Chat) in terms of precision and F1-score. This highlights the advantage of task-specific training on the CoDA dataset.
However, DeepSeek Chat delivers competitive results in a zero-shot setting, requiring no fine-tuning. Its stable and general performance makes it a practical alternative in scenarios with limited labeled data or high training costs.
5. Discussion
This section discusses the experimental findings in relation to the three core dimensions explored in this study: the effectiveness of LLMs in zero-shot illicit content classification, their alignment with human judgment, and the consistency of their decisions across different commercial models.
5.1. Classification Effectiveness of Commercial LLMs
The primary objective of this study was to assess the effectiveness of leading commercial Large Language Models in classifying illicit Dark Web content under zero-shot prompting conditions. Our findings confirm that modern LLMs exhibit substantial generalization capabilities even in adversarial domains such as the Dark Web, despite the absence of task-specific fine-tuning. All eight evaluated models achieved strong macro-averaged F1-scores, ranging from 0.831 [0.823, 0.838] (GPT-3.5 Turbo) to 0.870 [0.862, 0.877] (DeepSeek Chat), indicating their maturity and applicability to complex classification tasks in high-risk settings.
Among the tested models, DeepSeek Chat, Grok, and Gemini 2.0 Flash delivered the highest macro-level performance, slightly outperforming OpenAI’s GPT-4o and GPT-4o Mini. While OpenAI models have traditionally led benchmarks in general-purpose natural language understanding, our results suggest that competitors, particularly xAI, Google, and DeepSeek, have considerably narrowed this gap. These observations align with Giannilias et al. [
14], who reported similarly competitive results using open-source LLMs like Mistral and LLaMA in cybercrime-related classification tasks.
To assess the statistical significance of the observed performance differences, 95% confidence intervals were computed using nonparametric bootstrap resampling. In several cases, these intervals overlap, particularly among DeepSeek Chat, GPT-4o Mini, and Gemini 2.0 Flash, indicating that, although DeepSeek Chat achieves one of the highest mean F1-scores, not all pairwise differences are statistically significant. Therefore, the apparent superiority of some models should be interpreted with caution. Pairwise F1-score comparisons with adjusted p-values show that GPT-3.5 Turbo consistently underperforms compared to all other models, with highly significant differences (p < 0.001) against Claude 3.5 Haiku, GPT-4o, GPT-4o Mini, DeepSeek Reasoner, Gemini 2.0 Flash, Grok, and DeepSeek Chat. In contrast, DeepSeek Chat demonstrates statistically significant improvements over GPT-4o, Claude 3.5 Haiku, and DeepSeek Reasoner (p < 0.05), while maintaining comparable performance to GPT-4o Mini and Gemini 2.0 Flash. These findings highlight the robustness and parity among the most recent commercial LLMs, while emphasizing the notable performance gap that persists for earlier-generation models such as GPT-3.5 Turbo.
At a category-specific level, tasks with clear lexical anchors, such as Drugs and Gambling, yielded the highest F1-scores across all models, frequently exceeding 94–95%. For example, DeepSeek Chat scored 0.955 [0.945, 0.965] on Drugs, while GPT-4o and GPT-4o Mini both reached 0.980 [0.972, 0.988] in Gambling. These results suggest that zero-shot prompting can approach or even replicate the performance of specialized models like DarkBERT [
4] when applied to semantically well-defined categories.
Conversely, categories with diffuse semantics, such as Violence, Electronic, and Hacking, posed a greater challenge. These typically resulted in F1-scores below 80%, reflecting ambiguity and semantic overlap with other classes. For example, GPT-3.5 Turbo scored only 0.682 [0.644, 0.717] on Violence and 0.735 [0.695, 0.771] on Electronic. This is consistent with prior findings from Prado-Sánchez et al. [
10], where even GPT-4 struggled with fine-grained category differentiation under structured prompting conditions.
Model-level behavior also differed. GPT-4o Mini occasionally outperformed its larger counterpart in categories like Others and Electronic, suggesting that model size is not the sole determinant of zero-shot robustness. Grok, meanwhile, excelled in domain-specific categories like Gambling and Drugs, but showed reduced stability in more ambiguous classes. These discrepancies likely stem from variations in pretraining data, instruction tuning regimes, and sensitivity to prompt formulation.
In sum, our analysis reinforces that zero-shot prompting with commercial LLMs is a viable method for Dark Web content classification. However, performance varies significantly across categories and overlapping confidence intervals among top models suggest the need for careful statistical interpretation. Future research should explore prompt adaptation or few-shot fine-tuning to address residual ambiguity and improve classification reliability in operational deployments.
5.2. Alignment of Model Classifications with Human Judgment
The second objective of this study was to examine the degree to which commercial LLMs align with human judgment when classifying illicit Dark Web content. This alignment was evaluated using two well-established inter-rater reliability metrics: Krippendorff’s Alpha and Cohen’s Kappa, both of which are commonly employed in annotation tasks characterized by subjectivity, semantic ambiguity, or incomplete information.
The results indicate that several LLMs, including DeepSeek Chat, Grok, and GPT-4o, achieved high levels of agreement with human-labeled ground truth in the CoDA dataset. As shown in
Table 9, weighted Cohen’s Kappa scores surpassed 0.86 for these models (e.g., Grok: 0.867 [0.861–0.873]; DeepSeek Chat: 0.863 [0.857–0.869]), suggesting strong reliability relative to expert human annotations. This is further supported by the Krippendorff’s Alpha values in
Table 11, which reached 0.871 [0.866–0.877] when including human annotations and 0.884 [0.879–0.889] among models alone. These levels are consistent with prior work using GPT-4 on sensitive classification tasks [
10], and are comparable to expert-human agreement levels in social science research [
19] and qualitative analysis [
24].
Importantly, this comparative analysis across eight commercial LLMs confirms that high alignment with human labels is no longer exclusive to OpenAI models. DeepSeek Chat and Grok produced agreement metrics that matched or slightly exceeded those of GPT-4o, highlighting that alternative providers have matured in generating semantically consistent and instruction-compliant responses even in zero-shot settings.
Nonetheless, the results also revealed systematic differences in annotation behavior. A consistent pattern among models was their conservative tendency to assign semantically ambiguous or lexically underspecified instances to the Others category. This fallback behavior, more frequent than in human coders, reflects a risk-averse classification strategy likely driven by the lack of contextual clues or prompt-specific anchors, an effect also observed in gender classification studies [
11].
While this cautious bias may reduce false positives, which is critical in operational settings such as threat intelligence or forensic investigations, it may also lead to underreporting of specific illicit content types. Human annotators, relying on inferential reasoning or broader context, are often more willing to assign nuanced categories despite lexical ambiguity. In contrast, LLMs, particularly under zero-shot constraints without access to retrieval or grounding, tend to default to safer options.
These findings underscore a key limitation of zero-shot LLM classification: although aggregate-level agreement with human judgments is high, decision behavior under uncertainty still diverges. For real-world applications, this suggests the need for hybrid workflows where low-confidence classifications are escalated for human review, or where uncertainty-aware prompting techniques are incorporated to mitigate overuse of fallback categories.
5.3. Consistency of Classification Across Different LLMs
The third dimension of this study focuses on the degree of consistency among commercial LLMs when classifying the same set of illicit Dark Web documents. Inter-model agreement is particularly relevant in high-stakes or ambiguous classification contexts, where deployment pipelines may rely on multiple models or assess uncertainty through model diversity.
Our updated analysis reveals a high level of convergence across the evaluated LLMs, particularly among models from the same provider or alignment family. As shown in
Table 10, pairwise Cohen’s Kappa scores with 95% confidence intervals exceeded 0.89 for several closely related model pairs: Claude 3.5 Haiku and GPT-4o (κ = 0.911 [0.904, 0.918]), DeepSeek Chat and Grok (κ = 0.909 [0.903, 0.916]), Claude 3.5 Haiku and Gemini 2.0 Flash (κ = 0.907 [0.900, 0.914]), and DeepSeek Chat and Gemini 2.0 Flash (κ = 0.907 [0.900, 0.914]). GPT-4o and GPT-4o Mini also showed high alignment (κ = 0.889 [0.882, 0.897]), reflecting cohesive classification behavior among models sharing architectural and instruction-tuning foundations. These values suggest that architectural lineage, alignment strategies, and shared training paradigms contribute to highly similar decision boundaries under zero-shot prompting. This finding supports prior evidence from Giannilias et al. [
14], who noted nearly identical behavior among GPT-family and Anthropic-Google models on multiclass classification tasks outside the Dark Web domain.
However, when evaluating agreement across the full ensemble of LLMs using Krippendorff’s Alpha, the metric drops to 0.884 [0.879–0.889] without human labels, and 0.871 [0.866–0.877] when including them (
Table 11). These values indicate moderate-to-high but not perfect convergence, with semantic divergence especially notable in diffuse or overlapping categories like Others, Financial, and Electronic. In these cases, classification appears to be more sensitive to differences in pretraining data, prompt interpretations, and risk calibration strategies.
Interestingly, while Grok and DeepSeek Chat tended to assign more definitive labels, often committing to narrower illicit categories, models like GPT-4o and Gemini 2.0 Flash displayed more conservative classification patterns, aligning with their cautious behavior also noted in
Section 5.2. This echoes findings from Gilardi et al. [
19], who documented differences in granularity and confidence thresholds across LLMs even under standardized annotation setups.
From an operational perspective, inter-model disagreement offers both challenges and design opportunities. While inconsistencies may flag ambiguous cases that require human intervention, the strong convergence observed in clearly defined categories such as Drugs, Gambling, and Pornography supports the use of ensemble or majority-vote mechanisms to reinforce confidence in automated pipelines. The potential for leveraging disagreement as an uncertainty signal, particularly in edge cases, also suggests promising directions for developing audit-ready content monitoring systems in sensitive environments.
In summary, while LLMs exhibit robust individual performance, their collective behavior varies by context and category, reinforcing the importance of multi-model evaluation frameworks in high-risk domains like the Dark Web.
5.4. Ethical Considerations and Dataset Bias
Given the sensitive nature of illicit content in the Dark Web, ethical considerations are central to the design, execution, and interpretation of this study. While the CoDA dataset contains only anonymized text snippets and excludes personally identifiable information, the classification of potentially harmful content, including categories such as Violence, Pornography, or Drugs, raises legitimate concerns about misclassification risks and the amplification of bias.
One limitation of the CoDA dataset is the lack of inter-annotator agreement (IAA) scores from the original annotation process. Without validated measures of human consistency, such as Cohen’s Kappa or Krippendorff’s Alpha among annotators, it is difficult to fully assess the subjectivity or ambiguity inherent in certain class assignments, particularly in overlapping or context-dependent categories like Others, Financial, or Violence. This gap reduces our ability to evaluate the origin of disagreement between models and human annotations and complicates bias mitigation.
In terms of preprocessing, we applied a cosine similarity threshold of 0.95 to remove near-duplicate documents using TF-IDF vectors, following established practices in information retrieval. While this approach reduces redundancy and overfitting risk, we did not perform a sensitivity analysis to assess how the choice of threshold might impact class balance, particularly for underrepresented categories. As a result, further investigation is needed to ensure that deduplication strategies do not disproportionately affect rare or marginal cases.
Additionally, the zero-shot classification setting, while avoiding task-specific fine-tuning, imposes its own constraints. LLMs tend to default to conservative labeling strategies, especially when faced with lexically sparse or semantically vague inputs. This was reflected in the overuse of fallback categories like Others, which may inadvertently underreport specific illicit activities or obscure actionable intelligence. These tendencies highlight the tension between minimizing false positives and maintaining sufficient granularity in sensitive applications.
Future research should address these limitations by (i) incorporating datasets with publicly available IAA metrics; (ii) testing alternative deduplication thresholds and their effect on minority class retention; and (iii) exploring the use of confidence scores or model uncertainty as flags for human review in deployment pipelines. Ensuring ethical robustness and transparency is crucial for the responsible use of LLMs in security-sensitive environments.
5.5. Limitations of the Study
While the findings presented offer significant insights into the capabilities of commercial LLMs for Dark Web content classification, several limitations must be acknowledged. First, all evaluations were conducted under zero-shot prompting conditions without incorporating domain-specific fine-tuning or retrieval-augmented generation techniques, which could potentially enhance model performance. Second, the CoDA dataset, although comprehensive, contains inherent ambiguities and overlaps between categories that might influence the interpretation of classification errors and model disagreements. Third, human labels used for ground truth comparison could themselves include biases or inconsistencies, affecting agreement metric interpretations. Fourth, the models were evaluated under deterministic API settings (e.g., temperature set to zero), which may not fully reflect real-world variability in model outputs.
To reduce variance and ensure reliability in results, confidence intervals have been computed for key evaluation metrics. This enhances the robustness of the analysis and mitigates concerns about result generalizability.
Finally, given the rapid evolution of LLMs, the models evaluated in this study represent a snapshot in time; newer or future versions could yield different classification behaviors. Future research should explore adaptive prompting strategies, investigate few-shot or fine-tuned configurations, and extend evaluation frameworks to cover multi-label and hierarchical classification schemes.
6. Conclusions
This study presented a comprehensive comparative evaluation of eight commercial Large Language Models (LLMs) in the task of zero-shot classification of illicit Dark Web content using the CoDA dataset. Our analysis spanned three critical dimensions: classification accuracy (RQ1), alignment with human judgment (RQ2), and inter-model agreement (RQ3). The findings offer robust evidence on the current capabilities and limitations of LLMs when applied to high-risk, semantically complex domains.
First, in terms of classification effectiveness, all evaluated LLMs demonstrated strong macro-level performance, with F1-scores ranging from 0.831 to 0.870. Notably, models such as DeepSeek Chat, Grok, and Gemini 2.0 Flash outperformed GPT-4o and GPT-4o Mini in several categories, underscoring the progress of alternative providers in closing the performance gap with OpenAI. While models performed exceptionally well in lexically rich categories such as Drugs and Gambling (F1 > 0.94), performance dropped in more ambiguous classes like Violence and Electronic, highlighting the importance of category-specific evaluation.
Second, our evaluation of model alignment with human-labeled CoDA annotations revealed high inter-rater reliability. Weighted Cohen’s Kappa and Krippendorff’s Alpha values exceeded 0.84 for top-performing models such as Grok and DeepSeek Chat. These results indicate that several LLMs can emulate human labeling behavior with strong consistency, although systematic patterns like overuse of the fallback Others category persist, especially in cases of lexical or contextual ambiguity.
Third, the study found substantial convergence in classification decisions among models with similar architectures or alignment paradigms. Pairwise Cohen’s Kappa scores surpassed 0.90 between model pairs like DeepSeek Chat and Grok or GPT-4o and GPT-4o Mini. However, Krippendorff’s Alpha across the full model ensemble revealed only moderate agreement (0.871–0.884), with higher variability observed in diffuse categories. This suggests that while intra-family consistency is high, architectural and training differences still lead to divergence in handling semantically complex cases.
Importantly, our study also highlighted ethical and methodological considerations. The absence of inter-annotator agreement data in CoDA, the lack of sensitivity analysis for deduplication thresholds, and the deterministic nature of API querying all present limitations that warrant future investigation. Nonetheless, we addressed concerns related to reproducibility by reporting 95% confidence intervals across all major metrics and by making our classification code publicly accessible.
In conclusion, commercial LLMs have reached a level of maturity that makes them viable tools for forensic Dark Web classification in zero-shot settings. However, operational deployment should consider category-specific weaknesses, model variability, and ethical safeguards. Future work should explore adaptive prompting strategies, retrieval-augmented techniques, few-shot learning configurations, and the use of hierarchical or multi-label taxonomies to better capture the complexity of real-world illicit content monitoring.