1. Introduction
The Dark Web has consolidated itself as a resilient digital ecosystem where anonymity, decentralisation and advanced cryptography converge, creating a conducive environment for the illicit trade of psychoactive substances and falsified pharmaceutical products [1,2]. Its architecture, based on the Tor protocol and onion routing, hinders the traceability of information flows and the identification of criminal actors, which makes it a strategic area of interest for cyber intelligence and law enforcement agencies [3,4].
Dark Web cryptomarkets and forums have driven a parallel transnational economy in which the sale of narcotics, such as fentanyl, grows at weekly rates of around 15 percent despite ongoing police disruption efforts [5,6]. In addition, the accessibility of these platforms has intensified public health risks, being associated with high overdose rates and the proliferation of falsified opioids [7,8]. In this context, the automated analysis of traffic, interactions and textual content on the Dark Web has become a priority for artificial intelligence research applied to cybersecurity, aimed at detecting behavioural patterns and anticipating emerging criminal trends [9,10].
Despite advances in the use of collaborative threat analysis platforms such as MISP (Malware Information Sharing Platform), the taxonomies currently available present significant structural limitations for the study of drug trafficking on the Dark Web. Most existing taxonomies focus on chemical or pharmacological aspects (for example, type of substance or molecular family) or on generic classifications of cyber incidents, without capturing the contextual diversity and linguistic characteristics that are specific to illicit digital markets [11,12].
This constraint prevents an adequate representation of the semantic complexity of posts, which typically combine chemical descriptions, commercial slang, consumption instructions and references to packaging. As a result, traditional taxonomies are insufficient for automated classification tasks, risk detection and behavioural analysis in dark environments. It is therefore necessary to extend and adapt MISP taxonomies towards a model capable of incorporating semantic, contextual and physical dimensions of the advertised product.
One of the most relevant challenges for the automated detection of drug trafficking on the Dark Web is the correct classification of substances according to their primary physical form, for example “powder”, “pill”, “oil” or “solid extract”, since this morphological dimension is a key indicator for health risk assessment, forensic traceability and inference of the mode of distribution [13,14]. From an ontological perspective, primary physical form is modelled as a MISP predicate with mutually exclusive values, intended to describe the physical form in which the substance is found (for example, crystals, resin, tablet, liquid). This predicate should not be confused with the route of administration, substance names or quantities, which are non-morphological dimensions that, although analytically valuable, do not fulfil the structural function of disambiguating the physical representation of the object in discourse. The proposed predicate thus contributes to the disambiguation of existing slang and enables a more precise and coherent representation of substances within cyber intelligence ontological models. Taken together, this discussion seeks to link empirical results with contemporary theoretical frameworks in cyber intelligence and computational semantics, highlighting their relevance for the construction of more robust and replicable hybrid models.
Posts in cryptomarkets often employ ambiguous or coded terminology, which hampers automatic identification using traditional text processing techniques. Precise classification by physical form makes it possible not only to differentiate between consumable forms (such as edibles or capsules) and technical forms (such as solvents or resins), but also to detect patterns of adulteration or falsification associated with high-risk products such as synthetic opioids [7,15]. This taxonomic perspective adds analytical value by integrating a material and contextual dimension into the study of criminal discourse.
Taxonomies provide a shared language that organises operational knowledge and supports coordination between teams, as well as underpinning risk analysis and incident response in complex environments [16,17,18]. As the classification basis, we adopt MISP, which strengthens collaborative indicator sharing and can be reinforced with prioritisation proposals such as CARIOCA (Cybersecurity Actionable Risk-Informed Operational Capability Assessment) to improve traceability and effectiveness [19,20]. Interoperability is ensured through STIX 2.1 (Structured Threat Information Expression) and TAXII 2.1 (Trusted Automated eXchange of Intelligence Information), which standardise the representation and distribution of intelligence and enable the modelling of relationships between technical threats, targets and events for automated consumption [21,22].
Human-in-the-Loop (HITL) is integrated here as a natural extension of this interoperability framework, acting on the same taxonomic and sharing artefacts to resolve contextual ambiguity and ensure decision traceability. In high-risk domains, expert intervention guides system learning and corrects edge cases, which reduces bias and stabilises labels generated by NLP (Natural Language Processing) and cybersecurity models [23,24]. In the specific context of cryptomarkets, the HITL layer operates as a quality control and audit mechanism over the classification that is interoperable with MISP and STIX/TAXII, maintaining the internal coherence of the pipeline and recording iterative feedback for operational exploitation [23,24,25].
Despite the growing use of Machine Learning and Deep Learning models for the detection of illicit content, critical methodological gaps persist that limit reproducibility and comparability across studies [9,12], particularly in the mining of drug forums [26,27]. First, there are no taxonomies specifically adapted to the domain of drug cryptomarkets, which forces researchers to rely on ad hoc categories or models trained on non-standardised datasets. Second, much of the current research is based on small or partial forum samples, which affects the generalisation of results [2]. Finally, the reliance on language models without documentation of the validation process reduces transparency and hinders replication [28].
These methodological limitations highlight the urgency of establishing open and replicable frameworks in which the generation and evaluation of taxonomic categories can be documented, audited and shared among researchers and agencies. This situation is even more acute in the case of drug forums, where there is no framework for extending MISP taxonomies using Large Language Models (LLMs) in a controlled and verifiable way.
Within this framework, the present research focuses on the analysis of drug cryptomarkets on the Dark Web, with the aim of proposing and adapting MISP taxonomies to a complex thematic domain in which technological, social and economic aspects converge around drugs. By applying deep learning models and semantic clustering techniques, and using LLMs in a controlled and verifiable manner, this study seeks to identify emerging categories and reduce ambiguity in the classification of content related to illicit substances. Ultimately, the purpose is to offer a scalable frame of reference that contributes to improving interoperability between cyber intelligence systems and to strengthening the capabilities of law enforcement agencies in the fight against digital drug trafficking.
The main contributions of this study are threefold. First, it proposes a reproducible framework for extending MISP taxonomies to the specific domain of drug-related discourse in Dark Web forums, thereby addressing the limitations of existing taxonomies that are mainly oriented towards chemical classification or generic cyber indicators. Second, it develops a hybrid LLM+HITL pipeline for the classification of primary physical forms, combining automated semantic inference with expert validation to improve traceability, consistency and ambiguity resolution. Third, it evaluates the proposed framework on a corpus of Dark Web forum posts, showing that the extension of the taxonomy improves semantic coverage and strengthens the interpretability of drug-related cyber intelligence analysis.
Accordingly, this work is structured around six research questions (RQs) that guide the methodological development and interpretation of the results:
RQ1. How can the MISP taxonomy be adapted to the domain of drugs in Dark Web forums?
RQ2. What impact does the integration of an LLM (Mistral 7B) have on the initial classification?
RQ3. How does the HITL component contribute to the reduction in ambiguities?
RQ4. Which new categories or mergers emerge from the extension process?
RQ5. How do class proportions vary after the final reclassification?
RQ6. Which semantic or thematic patterns are observed in the network representation?
This paper is organised into eight sections.
Section 1 introduces the context of the study, highlighting the relevance of the Dark Web as a space for cyber-intelligence analysis and the need to extend MISP taxonomies to the drugs domain.
Section 2 reviews the theoretical background and related work on Dark Web cryptomarkets, drug trafficking dynamics, trust mechanisms, and current taxonomic and methodological limitations.
Section 3 presents the methodology, including the design of the hybrid LLM+HITL pipeline, dataset preparation, initial taxonomy construction, and ambiguity detection.
Section 4 details the human review and taxonomic extension process, including cue extraction, threshold definition, and reclassification with the extended taxonomy.
Section 5 reports the quantitative results of the classification and reclassification process.
Section 6 presents the semantic and thematic analysis of the corpus through network visualisation.
Section 7 discusses the main findings, together with their practical, organisational, and methodological implications.
Finally, Section 8 sets out the conclusions, limitations of the study, and future lines of research.
3. Methodology
The preliminary analysis of the literature and of the taxonomies implemented in MISP reveals the absence of a formal framework for their extension using LLMs in a controlled, verifiable and reproducible manner. Existing experiences are based mainly on manual curation of categories or on undocumented ad hoc contributions, which generates semantic inconsistencies, conceptual overlaps and a lack of traceability in the results. This methodological gap is even more pronounced in non-traditional thematic domains, such as drug trafficking on the Dark Web, where the linguistic and contextual diversity of content exceeds the limits of conventional chemical taxonomies.
To overcome these limitations, a HITL methodological pipeline is proposed, designed specifically for verifiable taxonomic extension within MISP. This procedure combines automated processing using an LLM (in this case, Mistral 7B) with an expert human review phase that validates and adjusts the results according to criteria of coherence, semantic justification and ontological compatibility.
The pipeline consists of four main stages:
Definition of the review subset (S): selection of records that present ambiguity or uncertain classification.
Extraction and normalisation of morphological cues: identification of linguistic patterns that indicate physical form or type of substance.
Support calculation by neutral families and deduplication: consolidation of equivalent terms through neutral semantic groupings.
Application of conservative thresholds and textual justification: acceptance of new categories or mergers only if they meet verifiable statistical and semantic criteria.
The output of this flow is a verifiable taxonomic patch, aligned with the original MISP structure and measurable before and after implementation. In this way, the reproducibility of the process and interoperability between different intelligence analysis and sharing systems are ensured.
In this context, the present study differs from prior work by combining a domain-adapted MISP taxonomy, an LLM-based classification stage, and a Human-in-the-Loop validation process specifically designed to improve semantic coverage, ambiguity reduction, and reproducibility in the analysis of Dark Web drug-forum discourse.
The comparative overview presented in Table 2 summarises the main conceptual differences between the original MISP “drugs” taxonomy and the approach proposed in this work.
Rather than replacing chemistry-oriented taxonomies, the proposed approach introduces a complementary morphology-oriented layer that is better aligned with the linguistic and commercial structure of Dark Web forum discourse.
3.1. Dataset and Data Preparation
The dataset used comes from six compressed files in .onion.war.gz format, which contain the full pages of different Dark Web forums in WARC (Web ARChive) format. These files comprise a total of 11,101 posts extracted from forums that are representative of the ecosystem of drug and illicit substance trading.
Due to the sensitive nature of the source material, the study relied exclusively on passive analysis of previously archived textual content from Tor-based forums. No interaction with forum participants took place, no transactions or purchases were conducted, and no authentication barriers were bypassed for the purposes of this research. Because the corpus derives from illicit-platform environments, the manuscript reports only aggregated results and does not disclose forum identifiers, onion addresses, usernames, wallet addresses, or other potentially identifying information.
The initial extraction and structuring were performed using a Python 3.12 script, with regular expressions employed to capture the variables of interest, including:
Site_name: name of the forum or .onion domain.
Page_title: title of the post, which generally provides a brief description of the content.
Content: full content of the post, including user replies.
Authors_vendors: name of the author or vendor mentioned.
Prices: prices expressed in text or cryptocurrencies.
Cryptocurrencies: references to digital means of payment (BTC, XMR, LTC, etc.).
Emails: visible email addresses.
Telegram_handles: Telegram user identifiers.
Onion_links: internal references to other .onion sites.
Of all these columns, the variables Page_title and Content were the most relevant for the analysis, as they concentrate the main description of the content. Using the extracted data, a JSON file was built with all the aforementioned attributes, representing one record per post.
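As an illustration of this extraction step, the following minimal sketch builds one record per post with a few of the attributes listed above. The regular expressions and the helper names `extract_record` and `to_json` are assumptions introduced here for illustration; the patterns actually used in the study are not reported in the text.

```python
import json
import re

# Hypothetical patterns: the study does not report its regular expressions.
PATTERNS = {
    "Cryptocurrencies": re.compile(r"\b(?:BTC|XMR|LTC)\b", re.IGNORECASE),
    "Emails": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # The lookbehind avoids matching the domain part of an email address.
    "Telegram_handles": re.compile(r"(?<![\w.])@[A-Za-z][A-Za-z0-9_]{4,31}\b"),
    "Onion_links": re.compile(r"\b[a-z2-7]{16,56}\.onion\b"),
}


def extract_record(site_name, page_title, content):
    """Build one record per post from the raw text fields."""
    record = {"Site_name": site_name, "Page_title": page_title, "Content": content}
    for field, pattern in PATTERNS.items():
        found = {match.group(0) for match in pattern.finditer(content)}
        if field == "Cryptocurrencies":
            # Normalise tickers so that "xmr" and "XMR" are not counted twice.
            found = {ticker.upper() for ticker in found}
        record[field] = sorted(found)
    return record


def to_json(records):
    """Serialise the list of records as a JSON document."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Calling `to_json([extract_record(site, title, text)])` would yield a JSON document with one object per post, in line with the one-record-per-post layout described above.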
Due to the linguistic diversity of the forums, which include content in English, German, Spanish and other minority languages, automatic translation was applied to all texts using the Python deep-translation module, based on the Google Translate API. This process generated an additional column named content_translated, which normalises the content into English.
Translation quality assessment. Language identification of the full corpus revealed that 9352 out of 9360 posts (99.91%) were written in English, with the remaining 8 posts distributed across other languages, including German, Spanish and Romanian, among others. Given this near-monolingual composition, large-scale automatic translation was not required and does not constitute a substantive processing stage in the pipeline. The 8 non-English posts were automatically translated and subsequently subjected to full manual review by the authors, who assessed whether each translation preserved (i) the meaning of the substance reference, (ii) the morphological cue, and (iii) the transactional context. Semantic adequacy was judged acceptable in all 8 cases, with no instances of nuance loss or alteration of physical-form interpretation identified. These findings confirm that translation introduces no meaningful source of error in the present corpus, and that the working dataset can be treated as effectively monolingual for the purposes of downstream classification.
Subsequently, the Mistral 7B model (base version, non-quantised and executed locally via Ollama) was used to extract a minimum of three representative keywords per post, enabling a preliminary understanding of the semantic content and preparing the ground for subsequent classification. This model was selected for its balance between computational efficiency and contextual depth.
The data cleaning and preprocessing phase removed 1741 duplicated posts and 2904 posts unrelated to drugs, using a script named drugs-base.py. As shown in Table 3, the initial cleaning stage resulted in 9360 unique posts after duplicate removal. This script employs inference with the Mistral 7B model to distinguish between relevant posts (illicit drugs, narcotics, medicines of abuse, paraphernalia, distribution logistics) and other non-pertinent categories. After filtering, a final set of 6456 posts directly linked to drug-related content was obtained. To provide an initial visual overview of the lexical patterns identified in the filtered corpus, Figure 1 presents a bubble-based representation of the extracted keywords.
Accordingly, the final working corpus used for classification consisted of 6456 drug-related posts, whereas 2904 unique posts were excluded as not relevant to the drugs domain (Table 4).
These results confirm the consistency of the cleaning process, allowing only records with analytical relevance for the study to be retained.
For clarity, the dataset construction followed three sequential stages: (i) 11,101 raw extracted posts; (ii) 9360 unique posts after duplicate removal; and (iii) 6456 posts retained as the final drug-related working corpus after excluding 2904 posts classified as not drug-related. However, the morphology-classification analyses reported in Section 3.2 onward were conducted on a stratified analytical subset of 2904 posts, and all percentages in the corresponding classification tables are calculated relative to that subset.
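The three construction stages can be checked with simple arithmetic. The sketch below uses only the counts reported in the text; the variable and function names are illustrative.

```python
# Counts reported in the text.
RAW_POSTS = 11_101
DUPLICATES_REMOVED = 1_741
NOT_DRUG_RELATED = 2_904

# Stage (ii): unique posts after duplicate removal.
unique_posts = RAW_POSTS - DUPLICATES_REMOVED
# Stage (iii): drug-related working corpus after the relevance filter.
drug_related_posts = unique_posts - NOT_DRUG_RELATED


def funnel():
    """Return the dataset construction stages as (stage, count) pairs."""
    return [
        ("raw", RAW_POSTS),
        ("unique", unique_posts),
        ("drug_related", drug_related_posts),
    ]
```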
3.2. Initial Taxonomy via LLM
Once the study corpus had been delimited, an initial ad hoc taxonomy was developed, named machinetag_packing.json, defining a single predicate:
form = primary physical form of the substance.
The initial categories considered were:
pill-tablet-capsule, powder and crystal-rock.
This taxonomy was applied to the dataset using the script drugs-initial.py, which used the Mistral 7B model to assign each post to one of the proposed values. The model was instructed through prompt engineering to behave as a narcotics specialist, required to select strictly one of the categories or to label the post as “unclear” if the content did not allow a confident classification.
In this manuscript, “unclear” refers exclusively to the classifier output label, whereas semantic ambiguity is treated as an analytical property of unresolved or under-specified cases.
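A minimal sketch of this constrained labelling step is given below. The prompt wording and the helper names `build_prompt` and `normalise_label` are assumptions made for illustration (the prompts used in the study are those reported in Appendix A), and the call to the model itself is deliberately left out.

```python
BASE_LABELS = ("pill-tablet-capsule", "powder", "crystal-rock")
FALLBACK = "unclear"


def build_prompt(post_text, labels=BASE_LABELS):
    """Assemble an illustrative prompt; not the wording used in the study."""
    options = ", ".join(labels)
    return (
        "You are a narcotics specialist. Identify the primary physical form "
        "of the substance described in the post below.\n"
        f"Answer with exactly one of: {options}, or '{FALLBACK}' if the post "
        "does not allow a confident classification.\n\n"
        f"Post:\n{post_text}"
    )


def normalise_label(raw_answer, labels=BASE_LABELS):
    """Map a free-text model answer onto an admissible label or the fallback."""
    answer = raw_answer.strip().strip("\"'.").lower()
    return answer if answer in labels else FALLBACK
```

Keeping the admissible labels as a parameter mirrors the idea that only the label set changes when the taxonomy is extended, while the rest of the prompt stays fixed.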
To improve reproducibility, the core prompts used in the study are reported in representative form in Appendix A. The prompts were kept stable across runs, with only the admissible output labels being updated when the taxonomy was extended. All runs were executed with temperature = 0 in order to minimise output variability.
The results of the initial classification are presented in Table 5.
The distribution of the initial classification is shown in Figure 2, where the predominance of powder, crystal-rock and unclear can be observed, confirming the need to refine the scheme before the human phase using the pre_distribution.csv data.
The results show that the model clearly classified 76.48% of the records (powder, crystal-rock and pill-tablet-capsule), with a predominance of the powder category. However, the 683 records assigned the label “unclear” (23.52%) indicated a substantial number of unresolved cases in the corpus, which in turn motivated a second methodological phase of taxonomic extension and refinement.
Under this scheme, the task assigned to the LLM was to combine the initial proposals within this classification with its own language processing capabilities in order to classify content and propose taxonomic extensions to the original scheme.
Mistral 7B was selected as the base inference model because it produced stable deterministic outputs under fixed prompting conditions whilst remaining computationally viable in a local environment. Nevertheless, the use of a single non-fine-tuned LLM does not allow model-specific effects to be ruled out entirely. A formal inter-model robustness analysis and an out-of-sample validation on held-out data are identified as relevant directions for future work, in order to determine whether the main ambiguity patterns observed are attributable to model-specific behaviour or instead reflect structural properties of the corpus and the proposed taxonomy.
3.3. Identification of Ambiguous Records and Basis for Extension
The analysis of the preliminary results showed that the high proportion of posts assigned the label “unclear” was associated with two main causes:
The diversity of expressions and slang specific to the forums, which include colloquial or coded descriptions;
The limitation of the initial categories, which were insufficient to represent all the morphological manifestations observed.
From the 683 posts initially assigned the label “unclear”, a review subset (S) was constructed, to which the HITL extension phase was applied. This subset was reprocessed using Mistral 7B to detect morphological cues by combining the fields content_translated and keywords, thus enabling the inference of descriptive patterns that suggested new potential classes (for example, edible solid, oil extract, vape cartridge or gel capsule).
Subsequent human review verified the linguistic coherence of the proposals and consolidated those categories with sufficient statistical support and contextual grounding. This iteration significantly reduced the proportion of ambiguous records and broadened the semantic coverage of the taxonomy.
Taken as a whole, the applied pipeline, from data preparation to HITL validation, constitutes a reproducible and scalable methodology that combines automated inference, expert control and documentary traceability. The final outcome is an expanded and verifiable taxonomy, aligned with MISP standards and specifically adapted to the domain of drug trafficking on the Dark Web.
To distinguish between model misclassification within the existing taxonomy and genuine evidence of missing taxonomic categories, the label “unclear” was treated as a classifier output indicating unresolved cases at the initial stage, rather than as direct evidence of taxonomy incompleteness. During the HITL stage, each record in subset S was manually reviewed against the original three-category scheme (powder, crystal-rock, pill-tablet-capsule) before any new category was considered. Records were assigned to one of three outcomes: (i) reassignable to an existing category, indicating probable model under-classification; (ii) not reassignable but showing recurrent and semantically coherent morphological evidence, indicating a candidate taxonomic gap; or (iii) remaining ambiguous due to insufficient or non-morphological evidence. Only the second group was considered eligible for taxonomic extension.
4. Human Review and Taxonomic Extension (HITL Process)
4.1. Foundations of the HITL Approach
The human review phase constitutes the central axis of the HITL process applied in this research. This component was implemented after the initial automatic classification with the Mistral 7B model, with the aim of detecting semantic gaps, identifying emerging morphological patterns and validating the extension of the MISP taxonomy in the drugs domain.
The HITL approach makes it possible to balance the statistical inference of the model with expert judgement, ensuring that newly incorporated categories are grounded both in empirical evidence and in ontological coherence. The interaction between the model and the human reviewer is not merely corrective but also constructive and explanatory, as the system generates hypotheses based on morphological cues that are subsequently evaluated and refined by the analyst.
4.2. Selection of the Review Subset (S)
The process relies exclusively on the results of the initial classification and the base taxonomy. The review subset (S) was defined as the set of 683 posts initially classified as unclear.
Importantly, inclusion in subset S did not imply that a post necessarily required a new category. Rather, S was designed as a validation stratum containing both potentially under-classified posts and genuinely out-of-taxonomy cases. This distinction was resolved during human review by testing whether the post could be confidently mapped onto one of the existing base classes using the operational definition of the primary physical form. Only when such reassignment was not justified, and when recurrent cue patterns exceeded the predefined thresholds, was the case treated as supporting taxonomic extension.
Subset S therefore represents the cases in which the model was unable to determine a primary physical form with sufficient confidence.
The corresponding file (S.csv) and the selection rules (AMBIGUITY_SELECTION.md) were documented to ensure process traceability. This subset was used as the basis for applying the HITL pipeline, in which the model and the expert collaborate in the detection, quantification and validation of morphological cues.
Human Review Protocol and Reviewer Agreement
Human review protocol. The HITL validation stage was conducted by two reviewers with complementary expertise: one researcher in cyber-intelligence and digital forensics, and one researcher in computational linguistics/NLP applied to illicit online discourse. Both reviewers independently examined the records in subset S, assessed the semantic adequacy of the extracted cues, and evaluated whether the proposed cue families justified category creation, merging, redirection, or rejection.
Inter-rater reliability. To assess annotation consistency, a double-review procedure was applied to the full subset S. Agreement was calculated at the level of final taxonomic decision (retain existing class/create new class/merge/reject as non-morphological or insufficient). Inter-rater agreement was Cohen’s κ = 0.82, indicating strong agreement, with a raw agreement of 89.3%. Disagreements were resolved through discussion and, where necessary, by consulting the operational definition of primary physical form adopted in the study.
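For reference, Cohen’s κ compares the observed agreement between two raters with the agreement expected by chance. The sketch below is a generic implementation of that statistic, not the authors’ script, and the decision labels in the usage note are toy values rather than data from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items with identical decisions.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1 - expected)
```

With six toy decisions in which the raters disagree once, for example `["retain", "create", "merge", "reject", "retain", "create"]` against `["retain", "create", "merge", "reject", "create", "create"]`, the raw agreement is 5/6 and κ equals 20/26 (about 0.77).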
Both reviewers independently examined the records in subset S using a decision protocol with three ordered questions: (1) Does the post contain sufficient morphological evidence to be assigned to one of the existing categories (powder, crystal-rock, pill-tablet-capsule)? If yes, the case was treated as probable model misclassification or under-classification within the original taxonomy. (2) If not, does the post contain recurrent and semantically coherent morphological evidence not captured by the base taxonomy? If yes, the case was marked as candidate evidence for taxonomic extension. (3) If neither condition was met, the record remained assigned to the label “unclear” due to insufficient, mixed, or non-morphological evidence. This protocol ensured that new categories were not created from isolated model errors, but only from repeated and validated out-of-taxonomy patterns.
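The ordering of the three questions can be expressed as a small decision function. This is a sketch only: the two Boolean inputs stand for the reviewers’ judgements, and the names `Outcome` and `review_outcome` are introduced here for illustration.

```python
from enum import Enum


class Outcome(Enum):
    REASSIGN_EXISTING = "probable model misclassification or under-classification"
    CANDIDATE_EXTENSION = "candidate evidence for taxonomic extension"
    REMAINS_UNCLEAR = "unclear"


def review_outcome(fits_existing_category, has_recurrent_new_evidence):
    """Apply the three ordered questions of the review protocol."""
    # Question 1: existing categories take precedence over any new proposal.
    if fits_existing_category:
        return Outcome.REASSIGN_EXISTING
    # Question 2: recurrent, coherent evidence outside the base taxonomy.
    if has_recurrent_new_evidence:
        return Outcome.CANDIDATE_EXTENSION
    # Question 3: insufficient, mixed, or non-morphological evidence.
    return Outcome.REMAINS_UNCLEAR
```

Because the first question is evaluated first, a post that fits an existing category never counts as support for a new one, even if it also contains out-of-taxonomy cues.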
Table 6 summarises the manual review setting and the agreement achieved between reviewers, while Table 7 presents the distribution of the validation outcomes observed in the reviewed subset.
4.3. Extraction of Cues and Semantic Grouping
In this phase, the Mistral 7B model was instructed to extract, for each row in subset S, a set of morphological cues (Ci), combining the information contained in the fields content_translated and keywords. These cues are terms or expressions that function as semantic indicators of the physical form of the substance, for example: pill, crystal, gummy, rock, capsule, resin.
Each cue $c$ has a frequency $f(c)$, defined as the number of rows in which it appears, and a prevalence $p(c) = f(c)/|S|$, where $|S|$ is the number of rows in subset S.
To avoid terminological ambiguities and redundancy, the cues were grouped into neutral semantic families or cue groups (G), according to the morphological similarity of the terms. The main groups defined were:
Oral_solid → {pill, tablet, capsule, bar}
Crystal_like → {crystal, rock, shard}
Powder_like → {powder, flake, dust}
Edible_matrix → {gummy, brownie, cookie, chocolate, candy}
Concentrate_solid → {hash, resin, wax, extract}
Liquid_like → {oil, syrup, droplet}
The group support s(G) is defined as the number of rows that contain at least one cue belonging to family G.
To avoid inflating support, if a single post includes several synonyms within the same group, the row is counted only once. This procedure reduces the variance associated with synonymy and improves the accuracy of the estimation of the targeted morphological concept.
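The deduplicated counting rule can be sketched as follows, assuming that each row of S has already been reduced to a set of normalised cues. The group definitions are taken from the list above; the function names are illustrative.

```python
CUE_GROUPS = {
    "Oral_solid": {"pill", "tablet", "capsule", "bar"},
    "Crystal_like": {"crystal", "rock", "shard"},
    "Powder_like": {"powder", "flake", "dust"},
    "Edible_matrix": {"gummy", "brownie", "cookie", "chocolate", "candy"},
    "Concentrate_solid": {"hash", "resin", "wax", "extract"},
    "Liquid_like": {"oil", "syrup", "droplet"},
}


def cue_frequency(rows, cue):
    """Number of rows in which a single cue appears."""
    return sum(1 for cues in rows if cue in cues)


def group_support(rows, group):
    """Number of rows containing at least one cue of the group.

    A row with several synonyms from the same group is counted once.
    """
    members = CUE_GROUPS[group]
    return sum(1 for cues in rows if members & set(cues))


def prevalence(count, n_rows):
    """Share of rows represented by a frequency or support count."""
    return count / n_rows
```

For the toy rows `[{"pill", "tablet"}, {"capsule"}, {"oil"}, {"powder", "pill"}]`, the support of Oral_solid is 3 rather than 4, because the first row contributes only once despite containing two synonyms.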
4.4. Definition of Thresholds and Decision Criteria
The HITL process established conservative thresholds for deciding when to add or merge categories within the taxonomy, in order to minimise false positives arising from noise or anecdotal occurrences. The decision criteria were defined as follows:
Addition of a new form
Minimum prevalence: ≥0.5% of the sample (that is, a prevalence of at least 0.005).
Minimum absolute frequency: a fixed minimum number of occurrences.
Meeting both thresholds is required in order to consider the creation of a new value in the predicate form.
Mergers are applied when several pre-existing categories represent lexical variants or conceptual redundancies (for example, pill, tablet and capsule).
With a sample of 683 records, the prevalence threshold corresponds to at least 4 records (0.005 × 683 ≈ 3.4).
The HITL model uses these combined metrics (relative and absolute) to distinguish between statistical noise and structured evidence, ensuring that each proposed extension is backed by a significant empirical volume and a coherent semantic context.
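The combined use of a relative and an absolute threshold can be sketched as below. The prevalence threshold of 0.5% is taken from the text; the minimum absolute frequency is not stated in this section, so `min_absolute` is left as a required parameter, and any value passed to it is an assumption.

```python
import math

MIN_PREVALENCE = 0.005  # From the text: at least 0.5% of the sample.


def min_rows_for_prevalence(sample_size, min_prevalence=MIN_PREVALENCE):
    """Smallest number of rows that satisfies the prevalence threshold."""
    return math.ceil(sample_size * min_prevalence)


def accept_new_form(support, sample_size, min_absolute,
                    min_prevalence=MIN_PREVALENCE):
    """Return True only when both thresholds are met.

    min_absolute is a hypothetical input here: the study's value is not
    given in this section.
    """
    meets_absolute = support >= min_absolute
    meets_relative = support / sample_size >= min_prevalence
    return meets_absolute and meets_relative
```

For a sample of 683 records, `min_rows_for_prevalence(683)` evaluates to 4, since 0.005 × 683 ≈ 3.4 must be rounded up to a whole number of rows.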
The set of 683 records in subset S was used to propose candidates for taxonomic extension, of which 475 yielded at least one extension proposal.
4.5. Results of the HITL Process
4.5.1. Consolidation and New Categories
The analysis of cue distributions revealed robust support for the plant_like and oral_solid families, which justified merging the labels plant, herb and weed under a single category named plant-matter, and consolidating pill, tablet, and capsule under pill-tablet-capsule. At the same time, the 683 records initially labelled as “unclear” (23.52%) confirmed the persistence of semantic ambiguity in the corpus.
In addition, two further valid categories were identified that exceeded the defined thresholds and showed both morphological and contextual coherence, as summarised in Table 8.
The cue family associated with resin-like materials (e.g., hash, hashish, charas, resin) was examined during the HITL stage because of its conceptual relevance in illicit drug markets. However, after expert review it was not retained as an independent final category in the extended taxonomy, as its empirical support and contextual consistency were not sufficient to justify a stable standalone class under the conservative inclusion criteria adopted in this study. Instead, these cases were treated as context-dependent concentrate-like references and documented as a relevant candidate for future refinement.
4.5.2. Evaluated and Rejected Cases
The HITL process also considered candidate categories that did not reach the thresholds or that were interpreted as documentary aliases of existing values, as summarised in Table 9:
These results reinforce the non-arbitrary nature of the process: proposals arise from the corpus, are quantified empirically and are filtered according to predefined criteria before final human approval.
4.5.3. Exclusion Criteria (HITL Rejections)
The pipeline also identified non-morphological cues which, despite their frequency, do not represent a valid primary physical form. These were excluded from the final computation in order to avoid distorting the metrics or inducing erroneous categories.
The exclusion groups defined include:
Tools or utensils: needle, vial (routes of administration).
Transaction or concealment: banknotes, bills, euro bills (economic or concealment indicators).
Chemical substance: heroin, ketamine, methamphetamine (composition, not morphology).
Quantities or units: 5 g, 1 g, kg, uncut (sales magnitudes).
Composition or mixture: mixed, combo, sugar (additives or mixes).
When a record contained both valid morphological cues and exclusion cues, the system prioritised the morphological evidence. In cases with only exclusion cues, the final result was labelled as unclear.
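The prioritisation rule above can be illustrated with a short sketch. The exclusion terms are those listed in the groups above; the morphological cue map is a small hypothetical sample, and returning the first matching form when several are present is an assumption, since the text does not state how such conflicts were resolved.

```python
# Exclusion cues, taken from the groups listed above.
EXCLUSION_CUES = {
    "needle", "vial",                         # tools or utensils
    "banknotes", "bills", "euro bills",       # transaction or concealment
    "heroin", "ketamine", "methamphetamine",  # chemical substance
    "5 g", "1 g", "kg", "uncut",              # quantities or units
    "mixed", "combo", "sugar",                # composition or mixture
}

# Hypothetical sample of morphological cues mapped to taxonomy values.
MORPHOLOGICAL_CUES = {
    "powder": "powder",
    "shard": "crystal-rock",
    "weed": "plant-matter",
    "tablet": "pill-tablet-capsule",
}


def resolve_form(cues):
    """Pick a primary physical form from the cues extracted for one record."""
    # Exclusion cues never count as morphological evidence.
    informative = [c for c in cues if c not in EXCLUSION_CUES]
    forms = [MORPHOLOGICAL_CUES[c] for c in informative if c in MORPHOLOGICAL_CUES]
    # Morphological evidence wins; otherwise the record stays unresolved.
    return forms[0] if forms else "unclear"
```

Under this sketch, a record with the cues heroin, powder and 1 g resolves to powder, whereas a record containing only ketamine and vial is labelled unclear.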
4.5.4. Synthesis of Results and Extended Version of the Taxonomy
The HITL process concluded with a verifiable and documented extension of the MISP taxonomy for the drugs domain. The final set of categories for the predicate form = primary_physical_form is defined as:
powder, crystal-rock, plant-matter, pill-tablet-capsule, liquid, blotter.
In this way, the taxonomy moves from a chemically descriptive focus to a morphologically and linguistically contextualised classification, aligned with the discursive reality of Dark Web forums.
Each decision to add or merge categories is justified with quantitative evidence, ensuring transparency, reproducibility and ontological coherence throughout the process.
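For illustration, the extended predicate could be serialised in the machine-tag style used by MISP taxonomies (namespace:predicate="value"). The sketch below is an assumption about layout rather than the published artefact: the namespace `dark-web-drugs` is hypothetical, and the description and version fields that a complete taxonomy file would carry are omitted.

```python
import json

NAMESPACE = "dark-web-drugs"  # hypothetical namespace, not taken from the paper
PREDICATE = "primary_physical_form"
FINAL_VALUES = [
    "powder", "crystal-rock", "plant-matter",
    "pill-tablet-capsule", "liquid", "blotter",
]
# Merges reported in Section 4.5.1: legacy label -> consolidated value.
MERGED_LABELS = {
    "plant": "plant-matter", "herb": "plant-matter", "weed": "plant-matter",
    "pill": "pill-tablet-capsule", "tablet": "pill-tablet-capsule",
    "capsule": "pill-tablet-capsule",
}


def machine_tag(value):
    """Format one value as a triple tag: namespace:predicate="value"."""
    canonical = MERGED_LABELS.get(value, value)
    if canonical not in FINAL_VALUES:
        raise ValueError(f"{value!r} is not part of the extended taxonomy")
    return f'{NAMESPACE}:{PREDICATE}="{canonical}"'


# Minimal taxonomy skeleton holding the six final values.
taxonomy = {
    "namespace": NAMESPACE,
    "predicates": [{"value": PREDICATE}],
    "values": [{"predicate": PREDICATE,
                "entry": [{"value": v} for v in FINAL_VALUES]}],
}
print(json.dumps(taxonomy, indent=2))
print(machine_tag("tablet"))
```

Keeping the merge map separate from the list of final values makes each consolidation decision explicit and easy to audit.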
4.6. Reclassification with the Extended Taxonomy
Following validation and consolidation of the HITL process, a full reclassification of the corpus was carried out using the extended primary physical form taxonomy. The aim of this new iteration was to assess the practical effectiveness of the final scheme, quantify changes in class distribution and determine the reduction in ambiguity achieved after human intervention.
The process consisted of re-running the classifier over the entire set of posts (N = 6456), using the updated version of the taxonomy, which comprises the following values:
powder, crystal-rock, plant-matter, pill-tablet-capsule, liquid, blotter.
To guarantee the comparability of results, all experimental conditions used in the initial classification were kept constant, with only the list of available categories being modified. The conditions are described below:
Model used: Mistral 7B (same configuration as previously).
Temperature: 0, ensuring deterministic and stable behaviour in responses.
Model inputs: concatenation of the fields page_title and keywords, previously translated and semantically normalised.
Prompting strategy: identical to the previous phase, with the sole difference that the set of possible output values was updated to the final version of the extended taxonomy.
Expected output type: a single physical form value per post; in the absence of sufficient evidence, the system was required to return the unclear marker.
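The fixed conditions listed above can be sketched as a prompt builder and an output validator. The prompt wording and the function names are hypothetical, and no model call is shown; the sketch only illustrates how the closed set of values and the unclear fallback might be enforced around a deterministic (temperature 0) inference step.

```python
ALLOWED_FORMS = [
    "powder", "crystal-rock", "plant-matter",
    "pill-tablet-capsule", "liquid", "blotter",
]
UNCLEAR = "unclear"


def build_prompt(page_title, keywords):
    """Concatenate the two input fields and list the permitted outputs."""
    text = f"{page_title} {keywords}".strip()
    options = ", ".join(ALLOWED_FORMS + [UNCLEAR])
    # Hypothetical wording; the study's actual prompt is not reproduced here.
    return (
        "Classify the primary physical form of the product in this post.\n"
        f"Allowed values: {options}.\n"
        f"Answer with exactly one value. If the evidence is insufficient, "
        f"answer {UNCLEAR}.\n"
        f"Post: {text}"
    )


def parse_prediction(raw_output):
    """Map the raw model text to one allowed value, else fall back to unclear."""
    candidate = raw_output.strip().strip(".\"'").lower()
    return candidate if candidate in ALLOWED_FORMS else UNCLEAR
```

Validating the output against the closed list ensures that any free-text or out-of-vocabulary answer is counted as unclear rather than silently creating a new class.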
Model execution was automated via the script drugs-final-expanded.py, configured to record both the final prediction and the estimated contextual confidence, in order to enable subsequent comparative analyses. The total inference time was approximately 12 h in a local hardware environment equipped with two AMD EPYC 7552 48-Core processors (96 cores/192 threads total), six NVIDIA Quadro RTX 5000 GPUs (16 GB GDDR6 each, 96 GB total VRAM), and 640 GB of DDR4 ECC RAM at 3200 MHz, processing batches of 256 posts per iteration.
This second classification constitutes the comparative evaluation stage of the work, making it possible to observe how the incorporation of new categories affects corpus redistribution and the reduction in ambiguous cases under controlled conditions. Because a fully annotated ground-truth dataset was not available for the full corpus, ambiguity reduction was not treated as sufficient evidence of performance improvement on its own.
The following section details the quantitative results obtained after this reclassification, including the evolution of class proportions, the decrease in the unclear category and the semantic implications derived from the application of the extended taxonomy.
5. Analysis of Results
Two different ambiguity indicators were considered during the pipeline: (i) local ambiguity reduction within the review subset during the HITL refinement stage, and (ii) global ambiguity reduction in the full corpus after final reclassification. The manuscript reports the second indicator as the primary summary measure, in order to avoid confusion between intermediate and corpus-level effects.
5.1. General Classification Statistics
After running the Mistral 7B model with the extended primary physical form taxonomy, the corpus-level results show a substantial reduction in semantic ambiguity and a more differentiated class distribution. However, because these in-corpus comparisons are based on the same dataset used to derive the taxonomy extension, they are interpreted as descriptive evidence of improved fit rather than as sufficient proof of generalizable performance.
Consequently, reductions in the “unclear” label are presented as changes in the classifier output, while semantic ambiguity is interpreted as a broader analytical construct.
Comparing the initial classification (v1) with the subsequent reclassification (v2) makes it possible to observe the concrete effects of the taxonomic extension and the HITL process on class distribution.
In both cases, the reported percentages are calculated relative to the 2904 posts included in the morphology-classification subset used for direct PRE/POST comparison.
In the initial version, the proportion of posts assigned the label unclear reached 23.52% of the total, indicating a substantial number of unresolved cases in the identification of primary physical form. After reclassification, this value fell to 11.29%, representing a decrease of 12.23 percentage points, equivalent to a 51.99% relative reduction. This change constitutes the clearest classifier-output indication of the positive effect of the HITL pipeline and, at corpus level, is consistent with a reduction in semantic ambiguity.
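The two quantities reported above, the absolute change in percentage points and the relative reduction, can be recomputed from the published percentages with a short sketch. Because the inputs are themselves rounded, the result agrees with the reported figures only to within rounding.

```python
def ambiguity_change(pre_pct, post_pct):
    """Return (change in percentage points, relative reduction in %)."""
    delta_pp = pre_pct - post_pct
    relative_reduction = delta_pp / pre_pct * 100
    return delta_pp, relative_reduction


# Share of posts labelled unclear before (v1) and after (v2) reclassification.
delta_pp, relative = ambiguity_change(23.52, 11.29)
print(f"{delta_pp:.2f} pp, {relative:.1f}% relative reduction")
```

The distinction matters for interpretation: the percentage-point change is measured against the whole comparison subset, whereas the relative reduction is measured against the initially unresolved posts only.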
Importantly, the reduction in posts assigned the label unclear should not be interpreted exclusively as evidence of missing categories in the original taxonomy. Manual validation showed that a substantial fraction of the reviewed cases could in fact be reassigned to existing classes, indicating model under-classification, whereas only a smaller but recurrent subset provided evidence for genuine taxonomic extension. Accordingly, in this manuscript, unclear is treated as a classifier output label, while semantic ambiguity is interpreted as a broader analytical property of unresolved or under-specified cases.
The behaviour of the crystal-rock category provides an additional indicator of structural stability in the classification scheme. This category remained broadly stable after reclassification, changing from 20.90% in the initial version to 23.14% in the final classification, a slight increase of 2.24 percentage points. This relative stability suggests that the extension process mainly affected ambiguous or under-specified records, while posts already associated with crystal- or rock-related forms remained consistently classified across both versions.
Taken together, the results of the initial classification (v1) are summarised in Table 10:
The behaviour of the remaining categories confirms the coherence of the reclassification. As noted above, crystal-rock increased only marginally (from 20.90% to 23.14%, +2.24 pp), which suggests strong stability in contexts where markers such as shard, rock or crystal are present. By contrast, powder decreased from 39.36% to 35.88% (−3.48 pp), a result consistent with the reassignment of certain records to more specific categories such as plant-matter.
Finally, the new categories introduced in the taxonomic extension show real, albeit limited, coverage. Plant-matter reaches 8.23% (239 rows), liquid accounts for 3.27% (95 rows) and blotter represents 1.34% (39 rows). Although modest, these percentages are consistent with the expected distribution of such posts in the forums analysed.
Overall, the final classification (v2) can be summarised as follows:
Table 11 presents the final classification, which shows a reduction in posts assigned the label unclear and an increase in pill-tablet-capsule that is consistent with the consolidation of oral solid forms. The subsequent comparison between PRE and POST summarises the changes in percentage points (Δ pp), highlighting the drop in the unclear label and the reassignment of part of these previously unresolved cases into more specific categories, including pill-tablet-capsule, as well as moderate adjustments in powder, crystal-rock and liquid. Sources: pre_distribution.csv and post_distribution.csv.
Figure 3 compares the PRE and POST distribution of posts by physical form. Its purpose is to provide a concise overview of the changes introduced by the taxonomic reclassification in the morphological structure of the corpus.
These data show that the system has succeeded in reducing uncertainty and rebalancing class proportions according to a more precise semantic structure, thereby validating the methodological impact of the taxonomic extension pipeline.
5.2. Transition Analysis and Structural Stability After Reclassification
Analysis of the transition matrix between the initial classification (v1) and the extended classification (v2) makes it possible to examine how category migrations occurred within the same analytical subset and which reclassification flows accounted for the main changes introduced by the extended taxonomy.
The diagonal of the matrix, which represents exact matches between the two versions, concentrates most of the cases, indicating high structural stability in the model and strong coherence in the already consolidated categories. However, the most relevant transitions occur precisely in those cases where a direct effect of the extension process was expected, particularly in the reassignment of records initially labelled as unclear.
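A transition matrix of this kind can be derived directly from the paired v1 and v2 labels. The sketch below is illustrative: the function names are hypothetical and the five label pairs are toy data, not records from the corpus.

```python
from collections import Counter


def transition_counts(v1_labels, v2_labels):
    """Count (v1 label, v2 label) pairs for the same posts, in the same order."""
    if len(v1_labels) != len(v2_labels):
        raise ValueError("both classifications must cover the same posts")
    return Counter(zip(v1_labels, v2_labels))


def diagonal_share(transitions):
    """Fraction of posts whose label is identical in both versions."""
    total = sum(transitions.values())
    unchanged = sum(n for (src, dst), n in transitions.items() if src == dst)
    return unchanged / total if total else 0.0


# Toy labels for illustration only.
v1 = ["powder", "unclear", "unclear", "crystal-rock", "powder"]
v2 = ["powder", "plant-matter", "unclear", "crystal-rock", "liquid"]
matrix = transition_counts(v1, v2)
stable = diagonal_share(matrix)
```

Off-diagonal entries such as `matrix[("unclear", "plant-matter")]` then correspond to the migration flows discussed in this section.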
The most significant migrations were:
unclear → plant-matter: 113 cases.
unclear → liquid: 57 cases.
unclear → blotter: 17 cases.
The most significant migration flows are summarised visually in Figure 4.
These three transitions account for a substantial part of the reduction in records initially assigned the label unclear. Together, they represent the effective reassignment of 27.37% of the initially unresolved records, showing that the refinement of morphological categories improved the interpretability of cases that were previously under-specified at the classifier-output level.
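The share quoted above follows from the three migration counts and the 683 records initially labelled unclear. A brief sketch of the arithmetic is given below; it reproduces the reported share to within rounding.

```python
INITIAL_UNCLEAR = 683  # records labelled unclear in v1 (23.52% of 2904)
MIGRATIONS_FROM_UNCLEAR = {"plant-matter": 113, "liquid": 57, "blotter": 17}

# Records that left the unclear label for one of the three new categories.
reassigned = sum(MIGRATIONS_FROM_UNCLEAR.values())
share = reassigned / INITIAL_UNCLEAR * 100
print(f"{reassigned} of {INITIAL_UNCLEAR} records ({share:.1f}%)")
```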
The remaining transitions reflect secondary adjustments between already assigned categories, such as the reassignments powder → plant-matter (118 cases), powder → liquid (36 cases) and pill-tablet-capsule → blotter (21 cases). These migrations can be interpreted as the natural result of introducing more precise morphological markers, for example wax, shatter and crumble, which were previously subsumed under more generic categories such as powder or paste.
In general terms, the transition matrix confirms that the greatest reclassification flow is concentrated in transitions from unclear to newly differentiated categories, especially plant-matter. This trend reinforces the hypothesis that a substantial part of the initial semantic ambiguity stemmed from posts containing imprecise references to forms that the original model was unable to discriminate adequately under the initial category structure.
5.3. Evaluation of Ambiguity and Model Stability
The reduction in the percentage of posts assigned the label unclear is the main classifier-output indicator of improvement in the model, as shown in Table 12. Moving from 23.52% to 11.29% implies not only a numerical decrease in unresolved outputs, but also a more precise fit between the classification scheme and the morphological patterns present in the corpus. Analysis of the reclassified cases shows that the HITL process did not generate overfitting or distort the overall structure of the taxonomy. In fact, the most relevant percentage variations are concentrated in classes directly affected by the new definitions or mergers, such as pill-tablet-capsule, while the remaining categories remain practically stable.
This stability is evidence of the ontological maturity of the model: the introduction of new values did not significantly alter the global distribution, which suggests that the taxonomic extension did not add noise to the system but rather improved the local precision of classification.
The reduction in unresolved cases at the classifier-output level, together with the stability of proportions and the observed semantic coherence, indicates that the HITL methodology applied was both effective and scalable.
6. Contextual Application of the Extended Taxonomy Through Co-Occurrence Network Analysis
The network representation generated with VOSviewer 1.6.20 made it possible to explore the semantic relationships between the most frequent terms in Dark Web drug forums and to identify the underlying thematic structure of the classified corpus.
Beyond its exploratory value, the co-occurrence network was used here as a contextual application layer for the extended taxonomy. Rather than introducing a separate line of analysis, this section examines whether the final morphology-based categories are embedded in coherent semantic environments within the forum discourse. In this sense, the network analysis contributes to RQ6 by showing how the taxonomic extensions identified through the LLM+HITL pipeline relate to broader thematic, commercial, and logistical structures in the corpus.
To construct the map, a minimum threshold of eight occurrences per term was established, applying a lexical normalisation thesaurus (thesaurus_drugs) that unified variants and synonyms.
This threshold was selected as a compromise between semantic coverage and visual interpretability: lower thresholds generated excessively dense maps dominated by rare or idiosyncratic terms, whereas higher thresholds removed relevant domain-specific vocabulary and reduced thematic diversity. In practical terms, the threshold of eight retained 127 interpretable nodes while filtering out sparse lexical noise. The normalisation method employed was Association Strength, with full counting, which ensures a proportional representation of semantic co-occurrence between terms. The final result comprised 127 nodes distributed across six thematic clusters (C1–C6), interpreted as semantic communities that reflect the discourses, products and dynamics of the cryptomarkets analysed.
Cluster detection was performed using the VOSviewer built-in weighted modularity-based clustering procedure, which groups nodes according to co-occurrence strength while maximising within-cluster association. The clustering resolution parameter was set to 1.00.
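The thresholding and normalisation steps described above can be approximated with the sketch below. It is a simplified illustration and not VOSviewer's own implementation: terms are kept when their total occurrence count reaches the minimum, links are counted once per document in which both terms appear, and association strength is taken as c_ij / (w_i · w_j), with w_i the occurrence count of term i. How VOSviewer accumulates link weights under full counting, and any scaling constant it applies, are not reproduced here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(documents, min_occurrences=8):
    """Build a thresholded term network from lists of normalised terms.

    Returns (kept terms, raw link counts, association-strength weights).
    """
    # Total occurrences of each term across all documents.
    occurrences = Counter(term for doc in documents for term in doc)
    kept = {t for t, n in occurrences.items() if n >= min_occurrences}

    # One link increment per document in which both terms are present.
    links = Counter()
    for doc in documents:
        present = sorted(set(doc) & kept)
        for a, b in combinations(present, 2):
            links[(a, b)] += 1

    # Association strength: observed co-occurrence relative to term frequency.
    strength = {
        (a, b): c / (occurrences[a] * occurrences[b])
        for (a, b), c in links.items()
    }
    return kept, links, strength
```

Raising `min_occurrences` shrinks the set of retained nodes, which is the trade-off between lexical coverage and visual interpretability discussed above.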
To support the methodological justification for the selected threshold, Table 13 summarises a brief sensitivity check comparing three minimum-occurrence values in VOSviewer. This comparison illustrates how the threshold choice directly affected the number of retained nodes and the interpretability of the semantic map. As shown below, the value of eight occurrences provided the most suitable balance between lexical coverage and analytical clarity.
6.1. General Structure of the Network
The semantic network presents a clearly modular configuration in which several nodes act as organising axes of the conversation. The terms packing and distribution occupy central positions and concentrate a high number of links, indicating that the description of packaging and distribution processes constitutes a discursive meeting point across multiple substances and sales modalities. Around these nodes cluster references to types of drugs (heroin, ketamine, cocaine, MDMA, cannabis), forms of presentation (pills, tabs, blisters, crystal, shards) and logistical elements (shipping, worldwide_shipping, expresspost, uk2uk, marketplace). This overall network configuration is illustrated in Figure 5.
The global structure thus combines two partially overlapping dimensions. On the one hand, a productive dimension sustained by differentiation between opioids, benzodiazepines, stimulants and cannabis derivatives. On the other, a logistical dimension focused on describing shipping modes, the degree of visibility of the vendor (physical_vendor, vendor, veteran_vendor) and geographical routes (Afghanistan, Iran, Germany, Netherlands, Canada, UK, Argentina, Peru, Venezuela). The intersection of these two dimensions gives rise to a discursive ecosystem in which product identity is defined jointly by its chemical composition, its origin and the promise of safe and discreet delivery.
6.2. Identified Thematic Clusters
Cluster 1 (29 items) brings together a heterogeneous set of substances and commercial brands articulated around the semantics of packaging and shipping. It includes classic psychedelics (lsd, dmt, ecstasy, mdma, xtc_pills), analgesics and opioids (tramadol, tapentadol, suboxone), cannabis derivatives (blueberry_weed, power_plant_weed) and references to pill shapes or designs (mickey_mouse, tesla, supreme, rolls_royce). These products are linked to operational terms such as pack, packing, pills, delivery, distribution and worldwide_shipping, as well as explicit mentions of dark markets (darkdock_market, darknet_market). The cluster reflects a multiproduct discourse in which the variety of substances is integrated under a shared logic of attractive packaging, international shipping and affiliation with consolidated marketplaces.
From the standpoint of the extended taxonomy, this cluster supports the interpretability of categories such as pill-tablet-capsule and blotter, showing that these forms are embedded not only in substance naming but also in recurrent commercial and logistical discourse.
Cluster 2 (24 items) is organised around high-demand recreational drugs and spatial references situating the offer in a transnational context. It includes cannabis strains (afghan_kush, amherst_sour_diesel_hun, auto_american_pie, white_russian, weed, pot), depressants and anxiolytics (alprazolam, benzos, xanax), cocaine and generic terms (drugs, 1g, clearance). These terms combine with logistical and geopolitical markers (marketplace, darknet_market, tor, uk, argentina, peru, venezuela), suggesting the existence of a recreational market aimed at consumers seeking information on origin, volume and type of cultivation. The cluster represents the space of an everyday consumption economy, where emphasis falls on cannabis varieties, unit doses and the geographical location of the supplier.
Taxonomically, this cluster reinforces the contextual distinctiveness of plant-matter and powder, as the dominant lexical environment consistently links these forms to recurrent patterns of retail description, quantity signalling, and product presentation.
Cluster 3 (21 items) concentrates vocabulary associated with higher volume transactions and explicit commercial strategies. Terms such as bulk, quarter_ounce, discount, discreet, packaging, shipping, worldwide_shipping refer to medium or large-scale operations, while methamphetamine, xtc and ketamine (in its variants ketamine_hcl, ketamine_s_isomer, ketamine_shards) highlight the importance of synthetic stimulants. The presence of distribution_asap_market and distribution_worldwide links these offers to specific marketplaces and to a global projection. The inclusion of counterfeit, euro and india_import points to an overlap between drug trafficking and monetary or document counterfeiting, where the same distribution channels are used to move both substances and fraudulent products.
From a taxonomic perspective, this cluster supports the analytical separation between crystal-rock and powder, as the co-occurring terms reflect distinct modes of presentation and circulation in wholesale and synthetic-drug discourse.
Cluster 4 (19 items) places ketamine at the centre of a semantic network that combines chemical purity, physical form and shipping routes. The node ketamine is connected with isomer, s-isomer, s-ketamine, racemic_rocks, shard, shards, sugar_s-isomer, indicating a high degree of specialisation in the description of product variants and textures (crystal, racemic, sugar-like). Alongside these appear geographical references (Afghanistan, Germany, India) and specific logistical operators (dhlgermany, expresspost, drugpearl, drugzfromnl), composing a narrative in which origin and supply chain function as authenticity markers. This cluster reflects a professionalised discourse around ketamine, where distinctions between isomers and crystalline forms are used both as quality arguments and as identity markers for certain vendors.
This cluster provides particularly strong contextual support for the crystal-rock category, since its semantic core is structured around lexical cues that refer to crystalline texture, shard-like presentation, and visually recognisable solid forms rather than to chemical denomination alone.
Cluster 5 (18 items) groups terms linked to prescription opioids and to the pharmaceutical presentation of the product. It includes direct references to heroin from Afghanistan (afghan_heroine), high-potency opiates and opioids (dilaudid, hydromorphone, opium, oxy, oxycodone, percocet, ghb), together with markers of sales format (blisters, m30, press, tabs) and quality (high_quality, pure, quality). The terms physical and uk2uk suggest the coexistence of physical and digital channels, particularly in domestic shipments within the United Kingdom that seek to minimise customs risks. The cluster describes a segment of the market that reproduces the language of the formal pharmaceutical chain but redirects it towards the illicit supply of medicines and heroin derivatives, with a strong emphasis on purity and hand-to-hand delivery.
In taxonomic terms, this cluster shows that categories such as pill-tablet-capsule, powder and, in some cases, liquid are embedded in discourse where pharmaceutical naming, opioid circulation and morphology-based presentation overlap in a meaningful way.
Cluster 6 (16 items) is structured around heroin and the construction of a narrative of extreme purity and vendor expertise. The term heroin is linked to high_purity, uncut, powder, goldenbulk, which reveals a rhetoric focused on non-adulterated products and volume formats. The semantic network also incorporates references to amphetamine and to imports from countries traditionally associated with trafficking (iran, turkish_import, turkish_heroine, france), as well as to vendor identifiers (dutch, dutchdrugs, vendor, veteran_vendor, physical_vendor). This cluster expresses the more classic dimension of heroin trafficking, transposed to the digital environment and legitimised through references to professional experience, origin and exceptional quality.
Although this cluster does not map onto a single morphology-based category, it functions as a cross-cutting contextual layer that helps explain how taxonomic labels are embedded in broader evaluative and logistical discourse within Dark Web drug markets.
6.3. Relationship Between Thematic Clusters and the Extended Taxonomy
To connect the co-occurrence analysis more directly with the core contribution of the paper, Table 14 summarises the relationship between the thematic clusters identified in VOSviewer and the final morphology-based taxonomy. Rather than treating the clusters as a separate exploratory result, this mapping shows how the extended taxonomic categories are embedded in recurrent semantic, commercial, and logistical environments within the corpus.
6.4. Connection Patterns Between Nodes
Beyond the segmentation into six communities, the network displays a web of semantic trajectories that systematically connect products, routes and logistical devices. One of the most visible patterns is organised around the packing/distribution axis, which links terms from C1 with shipping-related notions from C3. Sequences such as packing (C1) → distribution (C1) → shipping (C3) → worldwide_shipping (C3) show that packaging is described as part of an integrated chain culminating in the promise of global delivery, regardless of the specific substance.
A second pattern is articulated around ketamine, which operates as a bridge between clusters C3 and C4. Chains such as ketamine_hcl (C3) → ketamine (C4) → s-isomer (C4) → sugar_s-isomer (C4) reveal a discursive continuum that moves from a generic reference to the active ingredient towards highly specific descriptors of the isomer and its physical appearance. This configuration is also associated with import routes (india_import, afghanistan, germany, drugzfromnl), reinforcing the idea of ketamine as a product with high symbolic and logistical value.
Third, heroin and opioids construct a semantic arc connecting C5 and C6. Paths such as afghan_heroine (C5) → pure (C5) → high_purity (C6) → uncut (C6) evidence a narrative continuity between pharmaceutical opioids, traditional heroin and high-purity formats offered by specialised vendors. These trajectories extend towards vendor-related nodes (veteran_vendor, physical_vendor) and transit locations (iran, turkish_import, france), integrating quality, experience and geopolitics into a single legitimising narrative.
Finally, several terms act as connectors between the recreational cannabis and cocaine market (C2) and the rest of the network. The co-occurrence of darknet_market and marketplace with drugs such as cocaine, MDMA, cannabis, weed links recreational consumption discourses with the global logistics discourses present in C1 and C3. In this way, the map reveals a continuous space in which segmentation by substance type overlaps with affiliation to shared infrastructures of trade and distribution.
6.5. Global Interpretation and Response to RQ6
The identified semantic patterns reveal a complex ecosystem in which at least three major discursive axes are combined. The first is a recreational–commercial axis centred on cannabis, cocaine, MDMA, LSD and ketamine, where references to strains, unit doses, pill design and brand-oriented marketing predominate. The second is a pharmaceutical–opioid axis structured around prescription opioids, benzodiazepines and high-purity heroin, which informally reproduces the semantics of the pharmaceutical chain (quality, dosage, origin, physical channel). The third is a transnational–logistical axis that cuts across the entire network, integrating vocabulary related to packaging, shipping, geographical routes and vendor visibility.
These three axes should be understood as higher-order interpretive dimensions emerging from the interaction among the six clusters, rather than as a replacement for the cluster structure itself.
The convergence of these three axes confirms that Dark Web forums do not merely list products but construct a shared language in which the identity of each offer is defined by the combination of substance, form of presentation and distribution guarantees. Highly central nodes (packing, distribution, ketamine, heroin, physical_vendor) act as discursive hubs connecting the different clusters and articulating a semantics of professionalised crime, in which chemical purity, vendor reputation and logistical efficiency are strategic elements for generating trust.
With respect to RQ6, the co-occurrence network does not operate as an independent exploratory result, but as a contextual application of the extended taxonomy. The six clusters show that the proposed morphology-based categories are not isolated labels but are embedded in recurrent semantic environments associated with packaging, purity, global shipping, pharmaceutical branding, and product presentation. In particular, categories such as plant-matter, crystal-rock, liquid, blotter, and pill-tablet-capsule appear linked to differentiated thematic constellations, which supports their interpretability within the discourse structure of Dark Web drug forums.
7. Discussion
This study demonstrates the methodological and conceptual feasibility of adapting MISP taxonomies to non-conventional thematic domains, such as drug trafficking on the Dark Web, through a hybrid process that combines automated inference with expert review. The integration of LLMs with HITL methodologies enables progress towards more adaptive, transparent and reproducible systems for semantic classification, overcoming the limitations identified in the state of the art.
This section presents a general discussion of the findings obtained throughout the study. Starting with research question RQ1, it integrates the results derived from the different stages of ontological extension, automated classification and semantic validation with the responses to the research questions formulated in the introduction.
In light of the results, it is confirmed that the ontological structure of MISP can be extended through a semantic recontextualisation centred on the primary physical form of substances. In contrast to traditional approaches, where MISP taxonomies are restricted to technical incidents or chemical compositions [9, 11], this work proposes a model oriented towards the linguistic and commercial morphology of discourse in onion forums. The shift from a chemical predicate to a morphological one (for example, form = primary_physical_form) aligns with trends observed in the recent literature on contextual categorisation in cyber intelligence, where phenotypic descriptions of phenomena are prioritised over rigid taxonomies [3, 6].
This adaptation not only broadens the applicability of MISP, but also introduces a reproducible framework for domains in which textual information is noisy, incomplete or polysemic, a structural feature of illicit digital markets [2]. Consequently, the proposed model contributes to the convergence between computational semantics and operational ontologies in cyber intelligence.
With respect to RQ2, the incorporation of the Mistral 7B model in the initial classification stage highlights the potential of LLMs as morphological detection agents in digital criminal domains. In the present corpus, the initial classification yielded 76.48% of posts directly assigned to one of the base categories, while 23.52% were labelled as unclear. These figures should be interpreted as a baseline result obtained under noisy, multilingual, and semantically non-standard conditions, rather than as a standalone performance benchmark. Because prior studies in this area often address different tasks, datasets, and evaluation settings, the present result is not directly comparable in strict quantitative terms. It is therefore more appropriate to interpret this outcome as evidence of the practical usefulness of LLMs for assisted semantic pre-classification, whilst recognising that subsequent HITL validation was required to achieve a more robust taxonomic resolution [10].
This also indicates that Mistral 7B performed adequately as a first-pass classifier, but not as a fully autonomous solution, particularly in posts affected by slang, abbreviated vendor language, or limited morphological evidence.
Although the present study relies on a single inference model, the contribution should be understood less as a claim about the superiority of one particular LLM and more as evidence that ontology-aware HITL refinement can improve semantic classification under noisy illicit-market conditions. Whether the observed ambiguity patterns and the gains associated with taxonomic extension generalise across alternative architectures and unseen data remains an open empirical question, which is identified as a priority direction for future work.
The model exhibits substantial contextual capacity, identifying semantic patterns beyond surface-level keywords and generating coherent morphological labels even in texts affected by noise or lexical ambiguity. This property supports the hypothesis advanced by authors such as Sharma et al. [12], who argue that LLMs can operate as instruments of “assisted semantic curation” within supervised classification environments. In this case, Mistral 7B functions as an interpretative component that translates the informal language of the forum into an ontologically legible space for MISP.
Regarding RQ3, the HITL component is confirmed as a crucial mechanism for reducing semantic ambiguity. Human intervention reduced the proportion of posts classified as unclear from 23.52% to 11.29%, a decrease of 12.23 percentage points and a relative reduction of 51.99%. These results reinforce the value of hybrid systems that combine automated pre-classification with expert validation in taxonomy-extension tasks, in line with Mancini et al. [6] and Abbas et al. [9].
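The two reduction figures can be checked with a few lines of arithmetic. The sketch below uses only the two rounded percentages reported in the text; the variable names are illustrative. Because the inputs are already rounded to two decimals, the recomputed relative reduction may differ from the reported 51.99% in the last digit.

```python
# Percentages of posts labelled "unclear", as reported in the text.
unclear_before = 23.52  # after the initial Mistral 7B pass
unclear_after = 11.29   # after HITL validation

# Absolute change, expressed in percentage points.
absolute_drop_pp = unclear_before - unclear_after

# Relative change with respect to the initial proportion.
relative_drop_pct = 100 * absolute_drop_pp / unclear_before

print(f"absolute drop: {absolute_drop_pp:.2f} percentage points")
print(f"relative drop: {relative_drop_pct:.1f}%")
```

The distinction matters when comparing studies: the percentage-point figure describes the change in corpus share, whereas the relative figure describes how much of the initial ambiguity was resolved.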
The HITL process described in this study also provides a model for documentary traceability and support quantification that was absent from previous initiatives. Whereas most community taxonomies are expanded through informal contributions, this pipeline establishes explicit acceptance criteria, frequency thresholds and verifiable textual justification. In doing so, it introduces a methodological standard that can be replicated in other sensitive domains (for example, terrorism, child exploitation or ransomware ecosystems).
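As an illustration of how explicit acceptance criteria, frequency thresholds and textual justification might be combined into an auditable decision, the following minimal sketch can be considered. The threshold values, the `Candidate` structure and the `review` function are hypothetical and are introduced here only for illustration; they do not reproduce the study's actual parameters.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds, chosen only for illustration; the study's
# actual acceptance values are not restated in this section.
MIN_SUPPORT = 30   # minimum number of posts containing the cue
MIN_SHARE = 0.02   # minimum share of the ambiguous subset


@dataclass
class Candidate:
    label: str                                               # proposed category name
    support: int                                             # posts in which the cue was found
    justifications: list[str] = field(default_factory=list)  # quoted textual evidence


def review(candidate: Candidate, subset_size: int) -> dict:
    """Return an auditable decision record for one candidate category."""
    share = candidate.support / subset_size if subset_size else 0.0
    checks = {
        "support_ok": candidate.support >= MIN_SUPPORT,
        "share_ok": share >= MIN_SHARE,
        "justified": len(candidate.justifications) > 0,
    }
    return {
        "label": candidate.label,
        "support": candidate.support,
        "share": round(share, 4),
        "checks": checks,
        "accepted": all(checks.values()),
    }
```

Returning the individual checks alongside the final decision keeps the record traceable: a reviewer can see which criterion a rejected candidate failed, not only that it was rejected.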
In relation to RQ4, the HITL-driven extension process enabled the consolidation and expansion of the taxonomic vocabulary, generating new empirically grounded categories such as plant-matter, liquid, and blotter, and merging redundant terms (pill, tablet, capsule → pill-tablet-capsule). This semantic evolution is consistent with the trend described by Zabihimayvan et al. [2], according to which the semantics of digital drug trafficking tends to hybridise the technical and the commercial, using material descriptors rather than strictly chemical ones.
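The merge and the new categories could be expressed in a MISP-style taxonomy roughly as follows. This is a sketch under stated assumptions: the namespace `darkweb-drug` and the predicate `form` are hypothetical names, the category list contains only the labels named in this section, and the fallback to unclear for unknown labels is an assumption rather than the study's documented rule.

```python
import json

# Merge rule reported in the text: three oral-solid labels collapse into one.
MERGED = {
    "pill": "pill-tablet-capsule",
    "tablet": "pill-tablet-capsule",
    "capsule": "pill-tablet-capsule",
}

# Categories named in this section; the full list in the study may be longer.
FORMS = ["powder", "crystal-rock", "pill-tablet-capsule",
         "plant-matter", "liquid", "blotter", "unclear"]

# Hypothetical namespace and predicate names, laid out in the
# namespace / predicates / values structure used by MISP taxonomy files.
taxonomy = {
    "namespace": "darkweb-drug",
    "description": "Illustrative morphological taxonomy (sketch).",
    "version": 1,
    "predicates": [{"value": "form", "expanded": "Physical form"}],
    "values": [{
        "predicate": "form",
        "entry": [{"value": f, "expanded": f} for f in FORMS],
    }],
}


def to_tag(label: str) -> str:
    """Map a raw label to a triple tag, applying the merge rule first."""
    form = MERGED.get(label, label)
    if form not in FORMS:
        form = "unclear"  # assumption: unknown labels fall back to unclear
    return f'{taxonomy["namespace"]}:form="{form}"'


print(json.dumps(taxonomy, indent=2))
print(to_tag("tablet"))
```

Keeping the merge rule as a separate mapping, rather than deleting the old labels, means that posts tagged under the initial taxonomy can be re-tagged deterministically.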
The identification of these new classes reflects the dynamic nature of drug markets on the Dark Web, where language evolves in parallel with consumption and distribution practices. From an ontological perspective, the proposed extensions are not merely labels but instruments of social observation, capable of capturing how criminal actors negotiate identity, reputation and product through discourse. This finding complements sociolinguistic approaches to digital trafficking, such as those of Broseus et al. [43] and Weimann [44], which emphasise the role of language as a marker of criminal legitimacy.
For RQ5, the global reclassification using the extended taxonomy confirmed the structural stability and internal coherence of the model, as well as the practical usefulness of the HITL process at corpus level. The reduction in ambiguity and the reconfiguration of class proportions indicate an improvement in local precision without loss of global coherence. Although the pill-tablet-capsule category increased slightly after reclassification, powder remained the dominant class in the final distribution (35.88%), whilst the proportion of unclear cases declined markedly. Taken together, these results suggest that the main effect of the extension was not to replace the overall class hierarchy, but to enable a more precise redistribution of previously ambiguous or overly generic cases. Whether these gains persist on unseen data and across alternative model architectures remains an open question that is addressed in the future lines of research.
This behaviour suggests that the initial ambiguity was concentrated in posts with vocabulary related to oral solids, which reinforces the validity of the merger and the relevance of the new taxonomic structure. Methodologically, these results provide quantitative validation for the proposal of hybrid supervised learning models, such as those outlined by Abbas et al. [9], in which human curation guides semantic convergence without compromising scalability.
With respect to RQ6, analysis of the network based on 126 terms and six clusters shows that the semantic ecosystem of Dark Web drug forums is structured around the intersection between recreational substances, prescription opioids and logistical devices. The thematic segmentation highlights, on the one hand, a recreational–commercial space dominated by cannabis, cocaine, MDMA and ketamine and, on the other, a pharmaceutical–opioid space in which heroin, synthetic opioids and high-risk pharmaceuticals are offered. Both spaces are traversed by a shared logistical axis that emphasises packaging, global distribution and the geographical specialisation of vendors.
The network patterns identified are broadly consistent with structures already described in the prior cryptomarket literature. In this study, their value lies less in novelty at market level than in showing that the extended taxonomy aligns with recurrent semantic and commercial structures observed in the corpus.
The most central nodes reveal that trust and reputation are constructed discursively through repeated emphasis on purity (high_purity, pure, uncut), origin (Afghanistan, Iran, Netherlands, Germany, UK, Canada) and the promise of discreet and reliable delivery (shipping, expresspost, uk2uk, worldwide_shipping). Taken together, these results reinforce the idea that the forums analysed operate not only as illicit marketplaces but also as spaces of symbolic production, where meanings, hierarchies and criminal affiliations are negotiated through a shared semantic repertoire that integrates products, logistics and geopolitics.
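To make the notion of a central node concrete, the sketch below builds a small term co-occurrence network and ranks terms by weighted degree. It is a simplified stand-in: the posts are toy token lists assembled from terms quoted in the text, not real forum data, and weighted degree is used here only as an easily verified measure, since the centrality metric and clustering method applied in the study are not restated in this section.

```python
from collections import Counter
from itertools import combinations

# Toy token lists built from terms quoted in the text; not real forum data.
posts = [
    ["heroin", "afghanistan", "high_purity", "shipping"],
    ["cocaine", "uncut", "worldwide_shipping"],
    ["mdma", "netherlands", "pure", "shipping"],
    ["ketamine", "uk2uk", "shipping"],
]

# Count how often each unordered pair of terms appears in the same post.
edges = Counter()
for tokens in posts:
    for a, b in combinations(sorted(set(tokens)), 2):
        edges[(a, b)] += 1

# Weighted degree: sum of co-occurrence weights on edges touching a term.
degree = Counter()
for (a, b), weight in edges.items():
    degree[a] += weight
    degree[b] += weight

top_term, top_weight = degree.most_common(1)[0]
print(top_term, top_weight)
```

In this toy example the logistics term comes out on top because it co-occurs with substances, origins and purity claims alike, which mirrors the shared logistical axis described above.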
8. Conclusions
This study demonstrates the feasibility of a reproducible taxonomic extension process based on empirical evidence, oriented towards the classification of drug-related content in Dark Web forums. Unlike the manual or spontaneous extensions that typically characterise community MISP taxonomies, the procedure proposed here articulates a set of methodological stages that ensure traceability, verifiability and ontological consistency.
Each step, from the selection of the ambiguous subset to the generation of textual justifications and quantitative thresholds, follows a logic of control and documentation that turns the process into a model of methodological replication for other cyber-intelligence domains.
The integration of LLMs, specifically Mistral 7B, with expert human validation (HITL) has proved to be an effective combination for semantic expansion and classification improvement. The LLM component contributes contextual detection capability and generalisation over heterogeneous corpora, while human supervision introduces criteria of rigour, coherence and ontological adequacy.
This synergy makes it possible to overcome the limitations of purely automatic systems, reducing ambiguity and ensuring that taxonomic extensions reflect both real linguistic patterns and expert domain knowledge. In terms of results, the verifiable reduction in the percentage of unclear records and the consolidation of morphological categories confirm the effectiveness of the hybrid approach. The resulting taxonomy therefore offers better semantic coverage and greater classificatory coherence than previous versions.
The extension with new categories—such as plant-matter, liquid and blotter—and the merging of redundant terms—pill, tablet and capsule—not only optimises classification accuracy, but also reflects the discursive evolution of illicit digital markets. The resulting model provides a more realistic view of criminal language and forum dynamics, in which morphological and commercial descriptions of products prevail over purely chemical denominations.
In addition, the quantitative comparison between the initial and extended classifications confirms that the modifications introduced improve granularity without distorting the structural balance of the system. Further inter-model and out-of-sample validation, as outlined in the future lines of research, would strengthen the external validity of these findings.
From a methodological perspective, the research lays the foundations for standardising future MISP extensions. The proposed pipeline—comprising automated cue detection, statistical support calculation, human review and final reclassification—can be reproduced in other cyberthreat domains, such as ransomware, malware families or phishing ecosystems.
Its added value lies in offering a controlled extension methodology, in which each new category is justified with empirical evidence, validated semantically and documented to facilitate inter-institutional interoperability. In this way, the study contributes to the development of a taxonomic governance model grounded in transparency and verifiability, aligned with the current needs of threat intelligence sharing platforms.
8.1. Practical Implications
From an operational standpoint, the results of this research have a direct impact on the efficiency of Dark Web intelligence detection and classification systems. The new taxonomy enables more accurate identification of the physical forms and contextual settings of substances, facilitating the creation of alerts, indicators of compromise and correlations between forums, markets and actors.
This leads to greater analytical capacity for cyber-intelligence units and law enforcement agencies, which can prioritise resources according to emerging trafficking or consumption trends inferred from language.
Moreover, the partial automation of the process reduces manual review times and makes it possible to keep MISP systems continuously updated with a controlled operational effort.
8.2. Organisational Implications
At an organisational level, the methodology proposed supports standardisation and inter-institutional cooperation.
The use of an audited and documented procedure for extending taxonomies allows different organisations—incident response centres, police units, cybercrime observatories or OSINT communities—to share and reproduce the same classificatory structure without loss of semantic coherence.
This promotes sustainable interoperability within the cyber-intelligence ecosystem, where the evolution of taxonomies no longer depends on individual actors but on verifiable and collaborative processes. In the long term, this model contributes to institutionalising a culture of structured, evidence-based knowledge management, which is essential to address the rapid mutation of criminal language and the dynamics of the illicit digital economy.
8.3. Limitations of the Study
This work is subject to several limitations arising from both its methodological design and the empirical characteristics of the corpus. Although the results support the usefulness of the taxonomic extension process and of the hybrid LLM–HITL approach, these constraints delimit the scope of the findings and should be considered when interpreting the results.
First, the corpus is restricted to a specific set of Dark Web forums and to content processed through a translation pipeline centred on English and Spanish. This improves comparability across posts, but it may also reduce the semantic fidelity of slang, abbreviations, and culturally specific expressions present in the original messages, especially in multilingual or non-Anglophone communities. The findings should therefore be interpreted as valid for the analysed corpus and related contexts, rather than as universally generalisable to all cryptomarket environments.
Second, the pipeline depends in part on the behaviour of the language model used for the initial classification stage. Although the model was run under controlled settings and its outputs were later reviewed through a HITL process, automatic classification remains sensitive to contextual ambiguity, weak morphological evidence, and possible biases inherited from the training data. In addition, because the study relies on a single non-fine-tuned LLM, model-specific effects cannot be fully ruled out, and the observed classification patterns may partially reflect characteristics of the selected model rather than only structural properties of the corpus and the proposed taxonomy.
Third, the final class distribution is uneven across categories. While this reflects the empirical structure of the corpus, it reduces the analytical robustness of minority classes such as blotter or liquid, which contain substantially fewer cases than powder or crystal-rock. Future work should therefore test the model on broader and more balanced datasets in order to assess the stability of low-frequency categories.
Fourth, an additional limitation concerns research governance. The study did not undergo a formal institutional ethics committee review prior to analysis. This should be taken into account when assessing the overall design of the research. To mitigate risk, we adopted a risk-minimisation approach based on data minimisation, restricted handling of raw files, non-disclosure of operational identifiers, and controlled access to derived materials. While these measures reduce potential harm, they do not substitute for a formal ethics review and therefore remain a limitation of the study.
Finally, although the workflow includes a human validation stage, external expert validation was not incorporated in the present study. The review process was conducted within the research framework itself, which ensured procedural consistency but did not provide an independent forensic or linguistic assessment of the resulting taxonomy. Future inter-institutional validation would strengthen the external credibility and transferability of the proposed model.
8.4. Future Lines of Research
On the basis of the results obtained in this study and the limitations identified, the following future lines of research are proposed for the development and consolidation of semantic classification methodologies in cyber-intelligence environments.
First, a primary avenue for progress is to apply the taxonomic extension methodology in an inter-domain validation process that encompasses different areas of illicit activity on the Dark Web, such as the trade in weapons, falsified pharmaceuticals or restricted biological materials. Cross-domain application of the model would make it possible to test the degree of semantic transferability of the generated taxonomies and the consistency of the HITL approach in thematic domains with distinct languages, hierarchies and markets.
Second, the practical implementation of the extended taxonomy within the MISP ecosystem is proposed, moving from an experimental environment to operational integration. This would involve developing a controlled extension plugin or module that allows HITL taxonomies to be incorporated directly into active MISP instances, with automatic logging of thresholds, textual justifications and validation results. Adoption of this approach would transform the proposed model into a functional tool, enhancing MISP’s capacity to manage contextualised information on emerging threats.
Third, it is proposed to broaden the nature of the signals processed by the pipeline, integrating multimodal components that combine text, imagery and metadata. To date, the analysis has focused exclusively on textual forum content, but many drug listings include images of the product, vendor logos or publication metadata that provide additional semantic and contextual information. Integrating these elements would enable the development of a more robust model capable of performing multimodal classifications based on both linguistic descriptions and visual characteristics of substances. Likewise, the fusion of textual and visual signals would open the door to research on authenticity, visual camouflage and illicit marketing practices—areas that remain underexplored in cyber-intelligence.
Fourth, a complementary future line is to explore the incorporation of next-generation language models with enhanced reasoning and multimodal capabilities, such as GPT-5 or Gemini Ultra, in order to extend the functional scope of the pipeline rather than merely compare classification outcomes. The relevance of this direction lies in the possibility of improving the generation of explanatory justifications, enriching contextual cue interpretation, and supporting more advanced forms of semantic traceability in cyber-intelligence workflows. In this sense, the adoption of more capable models would not only serve to improve process efficiency, but also to examine whether advances in AI enable richer and more interpretable forms of ontology-oriented classification in illicit digital environments.
Fifth, a particularly relevant direction for future work is to conduct a comprehensive robustness evaluation comprising two complementary strategies: an inter-model comparison and an out-of-sample validation. In the inter-model component, a common stratified subset of the corpus would be processed by multiple instruction-tuned LLMs, including Mistral 7B, Falcon 7B, LLaMA-2 7B, Qwen 7B, GPT-5, and Gemini Ultra, under harmonised prompting conditions, in order to determine whether the concentration of ambiguity in specific semantic zones and the gains associated with the extended taxonomy are reproducible across architectures or, conversely, depend on the particular model employed. In the out-of-sample component, a held-out set of posts specifically excluded from the HITL refinement stage would be classified under both the initial and extended taxonomies, making it possible to assess whether the observed improvements in ambiguity reduction and classification quality generalise beyond the development corpus or are partly attributable to in-sample optimisation. Establishing a provisional reference standard through independent expert annotation and comparing both taxonomy versions against stratified, manually annotated subsets would further strengthen the external validity of the proposed framework and provide more definitive evidence regarding the semantic quality of the extended classification.
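The two components of the robustness evaluation described above could be prototyped along the following lines. This is a sketch under assumptions: the text does not name an agreement statistic, so pairwise Cohen's kappa is used here as one common choice, and the function names, the 20% hold-out fraction and the fixed seed are illustrative.

```python
import random
from collections import Counter, defaultdict


def stratified_holdout(labels, fraction=0.2, seed=0):
    """Return indices of a held-out set that preserves class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    held_out = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        k = max(1, round(len(idxs) * fraction))  # keep at least one per class
        held_out.extend(idxs[:k])
    return sorted(held_out)


def cohen_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c]
                   for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1 - expected)
```

In the inter-model component, `cohen_kappa` would be computed for each pair of models on the common stratified subset; in the out-of-sample component, the indices returned by `stratified_holdout` would be set aside before HITL refinement and classified afterwards under both taxonomy versions.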