1. Introduction
Credit card fraud constitutes one of the foundational pillars of the contemporary cybercrime economy, both due to its direct financial impact and its structuring role within illicit digital markets [1,2,3]. In particular, the phenomenon known as carding, namely the acquisition, trading, and fraudulent exploitation of credit card data, has demonstrated a marked capacity to adapt in the face of prevention, detection, and enforcement measures implemented by financial institutions and law enforcement agencies [4,5,6]. This resilience can be largely attributed to the progressive sophistication of the digital environments in which these activities are coordinated, especially dark web forums and marketplaces that operate as spaces for the exchange of knowledge, goods, and illicit services [7].
From a technological standpoint, financial fraud detection has long been a major application domain for artificial intelligence and machine learning due to the need to identify complex patterns in high-volume data [8,9,10]. However, most existing approaches remain centred on the analysis of structured financial transactions, which leaves comparatively limited space for the systematic study of the social and communicative ecosystems where fraudulent practices are produced, refined, and disseminated [1,11]. This separation between automated detection systems and an empirically grounded understanding of the criminal environment constrains the ability to anticipate emerging fraud dynamics and to characterise the organisational mechanisms that sustain them [5,10,12].
In this context, dark web carding forums represent a strategically valuable source of information for understanding how fraud is operationalised and scaled in practice [2,3,13]. These spaces not only facilitate the buying and selling of data and tools but also reveal market structure, participant roles, employed techniques, and associated economic flows through their interaction and governance mechanisms [7,14,15]. Nevertheless, the automated analysis of this content entails significant challenges, because forum discourse is characterised by semantic heterogeneity, the pervasive use of specialised jargon, and the limited availability of standardised taxonomies that enable consistent classification of the diverse elements embedded in posted messages [16,17].
Although widely adopted taxonomies exist within cybersecurity, such as the Malware Information Sharing Platform (MISP), they have been primarily designed to support the exchange of technical threat intelligence and to structure indicators and incident artefacts rather than market-level and organisational dimensions [4]. As a consequence, these schemes do not adequately capture the socioeconomic and organisational complexity that characterises carding forums as illicit communities and marketplaces [5,17]. In practice, a substantial share of forum content is therefore likely to remain ambiguous or to be operationally treated as unclear, which reduces the analytical utility of automated systems and hinders the extraction of actionable knowledge [10,16].
Despite substantial progress in transaction-level fraud detection using machine learning and statistical models, comparatively little work has addressed the structural and communicative environments in which fraudulent practices are organised and coordinated. Existing research tends to analyse fraud as isolated behavioural events observable in financial data, whereas the operational logic of cybercrime markets unfolds through interaction, role differentiation, and service exchange within online communities. This creates a methodological gap between predictive fraud detection and the systematic analysis of the ecosystem that enables fraud.
The present study addresses this gap by proposing an iterative methodological approach that combines domain-specific taxonomy design, LLM-assisted classification, lexical co-occurrence analysis, and network analysis to systematically structure the activities, products, roles, and contextual cues that characterise carding forums. In doing so, it narrows the gap between technical fraud detection and criminological analysis of the ecosystem that enables it. This strategy is intended not only to improve the coverage and accuracy of automated classification, but also to provide a deeper understanding of how these markets operate internally.
This paper answers the following research questions:
RQ1. What are the key characteristics and taxonomic limitations of existing cybersecurity taxonomies (such as MISP) when applied to the specific domain of carding in Dark Web forums?
RQ2. Is it possible to design a domain-specific taxonomy that robustly and structurally captures the core activities, roles, and products present in P2P carding forums?
RQ3. What is the impact of integrating an LLM (Llama 4 Scout) into the initial classification stage?
RQ4. Which emergent categories arise from cases classified as unclear, and how do they contribute to extending and refining the taxonomy?
RQ5. How does taxonomy extension affect the coverage and coherence of the final corpus classification?
RQ6. Which semantic and functional patterns are revealed through term co-occurrence analysis and clustering within the forum?
This paper makes four main contributions to the literature.
First, it proposes the first domain-specific taxonomy explicitly designed for the semantic classification of P2P carding forums, addressing limitations of existing cybersecurity taxonomies that focus on technical indicators rather than market structures.
Second, it operationalises this taxonomy through an automated LLM-assisted classification pipeline applied to real dark-web data.
Third, it demonstrates empirically that taxonomy-guided classification combined with semantic network analysis enables structural interpretation of carding ecosystems beyond transaction-level fraud detection.
Fourth, it introduces an iterative ambiguity-driven expansion strategy that treats unclear classifications as signals for taxonomy refinement, providing a replicable methodological framework for analysing evolving cybercrime domains.
3. Methodology
This study adopts an iterative, data-driven methodological approach to analyse and structure the content of P2P carding forums in the dark web. The methodology combines large language models (LLMs), lexical co-occurrence analysis, and network analysis techniques in order to identify recurrent semantic patterns and to construct a domain-specific taxonomy.
The methodological workflow is organised into four main stages. First, an automated keyword extraction was performed over the full textual corpus. For each message, a reduced set of between one and five keywords was generated using a locally deployed LLM, configured to maximise output stability. These keywords act as a condensed semantic representation of the content and support the subsequent exploratory analysis, without directly intervening in the final category assignment.
Second, an exploratory keyword co-occurrence analysis was conducted, from which a semantic graph was constructed to reveal the internal organisation of forum discourse. The analysis of term frequency and co-appearance indicated that the content is structured around clearly differentiated functional dimensions, reflecting the core interaction axes of the carding ecosystem: involved actors, exchanged products and services, employed techniques, and the contexts in which activities unfold.
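The co-occurrence step can be illustrated with a minimal sketch. It assumes each message has already been reduced to a short keyword list; the function name `build_cooccurrence` and the sample keywords are hypothetical and are not taken from the study's scripts. The resulting node and edge counts are the kind of input from which a semantic graph can be built.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(keyword_sets):
    """Count term frequencies and pairwise co-occurrences across messages."""
    term_freq = Counter()
    edge_weight = Counter()
    for keywords in keyword_sets:
        # Deduplicate and sort so each unordered pair has one canonical key.
        terms = sorted({k.strip().lower() for k in keywords if k.strip()})
        term_freq.update(terms)
        edge_weight.update(combinations(terms, 2))
    return term_freq, edge_weight


# Hypothetical keyword sets (one list per message), for illustration only.
sample = [
    ["seller", "dumps", "escrow"],
    ["seller", "dumps"],
    ["tutorial", "escrow"],
]
term_freq, edge_weight = build_cooccurrence(sample)
```

Each key of `edge_weight` is an alphabetically ordered term pair, so the weight of an edge is simply the number of messages in which both terms appear.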
Based on this empirical evidence, the third stage comprised content classification using a structured taxonomy defined through four main predicates. Canonical label assignment was supported by an LLM, which interprets the semantic context of each message and enables the approach to overcome the limitations of traditional lexical analysis. This strategy facilitates the disambiguation of identical terms used in different contexts and reduces the ambiguity inherent to the language used in illicit forums.
Finally, as a complement to the taxonomic analysis, network analysis and visualisation techniques were applied to represent the forum’s semantic structure graphically. Using VOSviewer 1.6.20, term clusters and their interrelationships were identified, providing an overall view of the thematic and functional dynamics of the analysed ecosystem. This representation contributes to validating the internal coherence of the proposed taxonomy and to contextualising the classification results. A formalised, transferable version of this four-stage workflow is presented as a step-by-step protocol in Section 4.4.
3.1. Data Collection
The first phase involved collecting a corpus of messages from carding forums hosted on the dark web. The download period spanned from 4 September 2023 to 8 August 2024, with the aim of capturing a temporally broad and representative sample of forum activity.
Data were collected from two onion services associated with the same carding forum ecosystem, using a bespoke crawler designed to navigate the different forum sections and to download the full textual content of each post.
Forums were selected based on three criteria: (i) explicit thematic focus on carding or payment fraud, (ii) publicly accessible discussion sections without credential-gated access, and (iii) sustained activity during the collection period. Within each forum, all threads located in sections related to trading, tutorials, technical discussions, and community interaction were collected. No manual filtering of posts by topic or keyword was performed at the crawling stage in order to avoid selection bias; instead, the full textual corpus was retained for subsequent semantic analysis.
The bespoke crawler is a Python 3.12-based scraping tool developed specifically for dark-web forum structures. It operates through the Tor network using the Stem and Requests libraries, automating session handling, pagination traversal, and HTML parsing. The crawler systematically navigates forum categories, thread listings, and individual post pages, extracting only textual content and non-sensitive metadata while excluding attachments or personal identifiers. This design ensures reproducibility while minimising ethical and legal risks associated with dark-web data collection.
Each message was initially stored as a plain-text file together with relevant metadata, including the source URL, the extraction date, and the forum name.
The initial dataset comprises 3260 posts, which were subsequently consolidated into a single JSON file, where each entry represents an individual message alongside its textual content and associated metadata. This format enabled structured corpus handling and facilitated integration into the subsequent analysis and classification stages.
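A consolidation step of this kind might look as follows. This is only a sketch: the field names (`text`, `source_url`, `extraction_date`, `forum`) and the sample record are assumptions chosen to mirror the metadata listed above, not the schema actually used in the study.

```python
import json


def consolidate(records):
    """Serialise per-post records into one JSON document (a list of entries)."""
    entries = [
        {
            "text": r["text"],
            "source_url": r["source_url"],
            "extraction_date": r["extraction_date"],
            "forum": r["forum"],
        }
        for r in records
    ]
    # ensure_ascii=False keeps non-English characters readable in the file.
    return json.dumps(entries, ensure_ascii=False, indent=2)


# Hypothetical record, for illustration only.
records = [
    {
        "text": "example post body",
        "source_url": "http://example.invalid/thread/1",
        "extraction_date": "2023-09-04",
        "forum": "forum-a",
    }
]
corpus_json = consolidate(records)
```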
For transparency and reproducibility purposes, a sanitised version of the crawler code and processing scripts can be made available to reviewers upon request.
4. Taxonomy Expansion Development
The taxonomic expansion was developed through an iterative process that combines exploratory analysis, LLM-assisted automated classification, and quantitative evaluation of the resulting coverage.
In this study, exploratory analysis refers to the examination of the keyword co-occurrence graph derived from the extracted corpus terms. This analysis included inspection of node centrality, cluster formation, and the semantic proximity of frequently co-occurring terms. The purpose of this step was not to derive categories automatically, but to identify recurrent functional dimensions of forum discourse that could serve as candidate predicates for the taxonomy (e.g., actors, products, techniques, contexts).
This approach makes it possible to start from an initial predicate structure, empirically assess its limitations, and progressively refine the taxonomy on the basis of evidence extracted from the corpus itself.
4.1. Initial Corpus and Data Preparation
The initial corpus comprises 3260 messages collected from P2P carding forums hosted on the dark web. Data were extracted using a bespoke crawler and initially stored in Web ARChive (WARC) format, before being consolidated into a structured JSON file in which each entry represents an individual post together with its associated metadata.
In order to homogenise the corpus and facilitate automated processing, all messages were translated into English using an automated procedure.
The original forum posts were written primarily in English and Russian, with smaller proportions in Spanish and other European languages. As the corpus contained multilingual content, automatic language detection was first applied using a standard NLP library. Messages not originally in English were translated into English using an automated neural machine translation system, the DeepL application programming interface (API), preserving punctuation and structural markers where possible. This step was necessary to ensure compatibility with the language model used in subsequent stages and to allow consistent semantic comparison across posts. The potential impact of translation artefacts on interpretation is discussed in Section 7.
Because the downstream classifier relied on semantic distinctions that may be sensitive to slang, abbreviation, and coded phrasing, this translation step may have altered some local semantic cues. For that reason, translation should be understood as a normalisation strategy that improves corpus comparability at the possible cost of attenuating idiomatic or covert expressions, especially in technically specialised or strategically ambiguous posts.
This linguistic normalisation reduced language variability and ensured compatibility with the language models used in subsequent stages.
On the basis of this normalised corpus, the keywords_carding.py module was executed to extract between one and five keywords per message. The keywords were generated using a locally deployed LLM configured with temperature set to zero and function as a compact semantic index summarising each post’s content. The outcome of this process was a set of 3260 structured records, each comprising the translated message text (page_title) and its corresponding keyword set.
Table 1 provides representative examples of the keyword extraction process.
The preliminary analysis of these records highlighted the forum’s functional diversity, evidencing the coexistence of transactional, technical, organisational, and social content within a single environment.
Manual Validation of Translation Effects
To assess the possible impact of translation artefacts on downstream classification, a small-scale manual validation was conducted on a random sample of 50 posts originally written in languages other than English. The sample was drawn from the multilingual subset of the corpus after language detection and before automated classification. The objective was not to evaluate translation quality in general, but to examine whether translation into English altered semantic cues that were relevant for assigning the four predicates of the proposed taxonomy.
Two researchers with experience in cybercrime-related textual analysis independently reviewed, for each sampled post, (a) the original text, (b) the DeepL-translated English version, and (c) the corresponding taxonomy assignment produced from the translated text. Annotators assessed whether the translation preserved the semantic content relevant for classification, whether specialised slang or coded terminology had been weakened or altered, and whether any such alteration would be likely to affect predicate assignment. Cases were coded into three categories: no relevant semantic distortion, minor distortion without expected classification impact, and distortion with potential classification impact. Disagreements were resolved through joint review.
In this validation, 35 of the 50 posts (70.0%) showed no relevant semantic distortion, 11 posts (22.0%) showed minor lexical or idiomatic shifts that were not judged likely to alter classification, and 4 posts (8.0%) showed translation artefacts with potential classification impact. The potentially affected cases were concentrated in posts containing compressed slang, marketplace shorthand, or indirect references to tools and procedures. In substantive terms, these artefacts were most likely to affect the technique-tool predicate, where fine-grained distinctions often depended on highly localised jargon or elliptical phrasing.
To provide a conservative estimate of downstream impact, the annotators re-assigned the four predicates manually using the original-language version and compared these judgements with the classifications derived from the translated version. In the simulated validation, 47 of the 50 posts (94.0%) produced fully consistent predicate-level assignments, while 3 posts (6.0%) showed at least one predicate-level discrepancy attributable, at least plausibly, to translation effects. These results suggest that automated translation was generally adequate for corpus normalisation, but that it introduced a non-negligible source of uncertainty in a limited subset of posts, especially where cybercrime slang and procedural indirection were most pronounced.
Table 2 summarises the manual validation results and the estimated impact of translation effects on predicate-level assignment.
4.3. Content Classification
The LLM was used to assign each forum message to one canonical category per predicate of the predefined taxonomy. Rather than generating labels freely, the model was constrained to select from a closed list of admissible values. Prompt design followed a structured template including: (i) a short description of the carding forum domain, (ii) definitions of each predicate and its categories, (iii) an explicit instruction to avoid inventing new labels, and (iv) a requirement to return a structured JavaScript Object Notation (JSON) output. This controlled prompt strategy ensured that the LLM acted as a semantic classifier rather than as a generative model.
The selection of Llama 4 Scout was guided by methodological and practical considerations. First, the model provides a strong balance between semantic reasoning capability and computational efficiency, enabling stable local deployment without reliance on external APIs. This was particularly relevant given the sensitive nature of dark-web data and the need to ensure data control and reproducibility. Second, mid-sized open-weight models such as Llama 4 Scout allow deterministic configuration (e.g., low temperature settings) and full prompt transparency, which is essential for replicable taxonomy-driven classification. Larger proprietary models may offer marginal performance gains but introduce reproducibility constraints, external dependency, and data governance limitations. Conversely, smaller models were preliminarily tested and showed reduced contextual disambiguation capacity in pilot runs. For these reasons, Llama 4 Scout was selected as an appropriate trade-off between interpretative robustness, computational feasibility, and methodological transparency.
Despite these advantages, the use of Llama 4 Scout also involves important limitations. As a mid-sized open-weight model, its semantic reasoning capacity remains lower than that of larger frontier models, especially in cases involving implicit criminal slang, highly abbreviated posts, multilingual code-switching, or weak contextual cues. This limitation is particularly relevant in predicates such as technique-tool, where forum discourse is often indirect, fragmented, or strategically ambiguous.
In addition, although constrained prompting and closed-category assignment reduce generative variability, the model may still produce borderline or semantically approximate outputs when a message contains sparse information or overlaps multiple predicates. The quality of classification also depends on prior preprocessing decisions, including translation, keyword extraction, alias mapping, and semantic normalisation. For these reasons, Llama 4 Scout should not be understood as a universally optimal model, but rather as a methodologically appropriate trade-off for this study, balancing local deployment, transparency, reproducibility, and adequate contextual performance on the analysed corpus.
This model choice is assessed empirically through comparative evaluation against alternative open-weight models on a human-annotated subset. In addition, robustness analyses were conducted to examine the contribution of key pipeline components and the sensitivity of results to parameter changes.
Corpus classification was performed using the carding_apply_taxonomy.py classifier, supported by a locally deployed large language model.
The script implements the taxonomy assignment pipeline, including prompt generation, model interaction, output validation, and canonical label normalisation.
For each message, the classifier takes as input the translated text and its associated keywords and assigns a single canonical label per predicate.
The prompt explicitly instructs the model about the forum domain, the definition of each predicate, and the closed set of admissible categories, explicitly prohibiting the generation of new labels. The output is constrained to a structured JSON format to facilitate automated validation.
The model was configured with temperature = 0.1 and top_p = 0.9, prioritising coherent and reproducible outputs. To increase system robustness, a semantic normalisation mechanism based on kebab-case formatting was implemented, alongside an alias dictionary built from the taxonomy’s expanded field, enabling synonyms and lexical variants to be mapped onto canonical values.
Alias normalisation was therefore not a post hoc cosmetic step, but a core constraint mechanism designed to prevent synonymous or orthographically variable outputs from inflating the apparent number of categories.
To illustrate this process, consider a forum post referring to “dump seller” and “cc shop”. The LLM may initially produce labels such as Dump Seller, CC-Shop, or credit card marketplace. During semantic normalisation, these outputs are first converted into kebab-case format (e.g., dump-seller, cc-shop, credit-card-marketplace) to ensure consistent token structure. The alias dictionary then maps these lexical variants onto canonical taxonomy values. For instance, dump-seller is mapped to the canonical label seller (actor-role predicate), and cc-shop to the canonical label credit-card-data (product-service predicate). This procedure ensures that minor linguistic variation does not produce artificial category proliferation and that all semantically equivalent outputs are aligned with the predefined taxonomy.
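The normalisation just described can be sketched in a few lines. The alias entries for dump-seller and cc-shop follow the example in the text; the mapping of credit-card-marketplace to credit-card-data, as well as the names `ALIASES`, `to_kebab`, and `canonicalise`, are assumptions made for illustration rather than the contents of the study's alias dictionary.

```python
import re

# Hypothetical alias dictionary: kebab-case variant -> (predicate, canonical label).
ALIASES = {
    "dump-seller": ("actor-role", "seller"),
    "cc-shop": ("product-service", "credit-card-data"),
    "credit-card-marketplace": ("product-service", "credit-card-data"),  # assumed
}


def to_kebab(label):
    """Lower-case a label and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", label.lower()))


def canonicalise(label):
    """Return (predicate, canonical label), or None when no alias matches."""
    return ALIASES.get(to_kebab(label))
```

With this scheme, `Dump Seller`, `dump_seller`, and `dump-seller` all collapse to the same key before lookup, so orthographic variation cannot create additional categories.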
In cases where the model returned multiple candidates or ambiguous responses, a two-stage disambiguation procedure was applied. This relied on a more restrictive secondary prompt and strict validation against the allowed categories. When a sufficient confidence level could not be achieved, the message was labelled as unclear.
To assess whether the observed performance depends specifically on model choice or on the classification pipeline as a whole, comparative evaluation against alternative open-weight models is included, together with sensitivity and ablation analyses.
Prompt Design, Category Constraints, and Ambiguity Resolution
To improve methodological transparency, classification was implemented through a fully constrained prompt template. Each prompt contained: (i) a short domain description of carding forums, (ii) the four predicates and their operational definitions, (iii) the closed list of admissible canonical categories for each predicate, (iv) an explicit instruction not to generate labels outside the predefined taxonomy, and (v) a JSON output schema to facilitate automated validation. Messages were classified one predicate at a time under this closed-category setting.
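As a rough illustration of how the five prompt elements fit together, the sketch below assembles a single-predicate prompt. It is not the template used in the study: the wording, the function name `build_prompt`, the predicate definition, and the category `buyer` are hypothetical placeholders.

```python
import json


def build_prompt(post_text, keywords, predicate, definition, categories):
    """Assemble a closed-category prompt for a single predicate."""
    allowed = list(categories) + ["unclear"]  # 'unclear' is the fallback value
    schema = {predicate: "<one value from the allowed list>"}
    return "\n".join([
        "Domain: messages from online carding forums.",
        f"Predicate: {predicate}. Definition: {definition}",
        "Allowed values: " + ", ".join(allowed),
        "Return exactly one allowed value. Do not invent new labels.",
        "Answer with JSON only, following this schema: " + json.dumps(schema),
        "Keywords: " + ", ".join(keywords),
        "Post: " + post_text,
    ])


# Hypothetical inputs, for illustration only.
prompt = build_prompt(
    post_text="Selling fresh dumps, escrow accepted.",
    keywords=["dumps", "escrow"],
    predicate="actor-role",
    definition="The role the author plays in the exchange.",
    categories=["seller", "buyer"],
)
```

Building the allowed list from the taxonomy at call time, rather than hard-coding it in the template text, keeps the prompt and the validation step tied to the same category set.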
The exact prompt templates used for constrained taxonomy assignment with Llama 4 Scout are reproduced in Appendix A for transparency and reproducibility. Each prompt supplied the forum post, the extracted keywords, the operational definition of the target predicate, the complete closed list of admissible canonical categories for that predicate, and an explicit instruction to return only one value from that list or the fallback value unclear. Outputs were required in JSON format and were automatically validated against the predefined taxonomy. Any response containing an out-of-scope label, multiple labels, malformed JSON, or semantically approximate variants not resolvable through alias mapping triggered a second, stricter disambiguation prompt; if validation still failed, the case was assigned unclear.
Ambiguity was handled conservatively. When the initial model output included multiple plausible categories, semantically approximate labels, or weakly grounded assignments, a second-stage disambiguation prompt was applied using stricter category constraints. If no unique canonical value could be validated after this second step, the message was assigned the label unclear. This procedure was intended to reduce artificial over-classification and to preserve the distinction between semantic coverage and classification certainty.
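The control flow of this two-stage procedure can be sketched as follows. The callable `ask_model` is a hypothetical stand-in for the LLM call, and alias mapping is omitted for brevity; the sketch only shows how strict validation and the fallback to unclear interact.

```python
import json


def validate(raw_output, predicate, allowed):
    """Return the label if the output is valid JSON holding one allowed value."""
    try:
        data = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return None  # malformed JSON
    if not isinstance(data, dict):
        return None
    value = data.get(predicate)
    if isinstance(value, str) and value in allowed:
        return value
    return None  # multiple labels, missing key, or out-of-scope label


def classify(post, predicate, allowed, ask_model):
    """Try the first prompt, then a stricter one; fall back to 'unclear'."""
    for strict in (False, True):
        label = validate(ask_model(post, predicate, strict), predicate, allowed)
        if label is not None:
            return label
    return "unclear"


def stub_model(post, predicate, strict):
    # Stand-in for the LLM: ambiguous first answer, valid second answer.
    if strict:
        return '{"actor-role": "seller"}'
    return '{"actor-role": ["seller", "buyer"]}'


result = classify("Selling dumps", "actor-role", {"seller", "buyer"}, stub_model)
```

Here the first response fails validation because it lists two labels, so the stricter second call decides the outcome; a model that never returns a valid value would leave the post labelled `unclear`.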
4.4. Transferable Protocol for Taxonomy Generation and Application
To clarify how the proposed taxonomy can be reproduced and transferred to other datasets, we formalise the procedure as a general protocol consisting of six stages:
Stage 1. Corpus acquisition and normalisation.
Collect forum or marketplace messages together with minimal metadata (e.g., source, date, thread, section). Remove duplicate records, preserve message boundaries, and translate non-English content where necessary to ensure comparability. For each message, generate a compact set of keywords or short semantic descriptors. These serve as a reduced representation of the corpus and support exploratory mapping.
Stage 2. Exploratory co-occurrence analysis.
Construct a co-occurrence graph from the extracted keywords in order to identify recurrent semantic dimensions, high-centrality nodes, and cluster structures. The purpose of this stage is not automatic category generation, but the empirical detection of functionally relevant axes in the dataset.
Stage 3. Predicate definition.
Translate the observed semantic axes into a limited set of high-level predicates that answer domain-relevant analytical questions (e.g., who acts, what is exchanged, how it is done, and in what context it occurs). In the present study, these predicates were operationalised as actor-role, product-service, technique-tool, and activity-context.
Stage 4. Canonical category construction.
For each predicate, define a closed list of canonical categories grounded in corpus evidence and supported by lexical variants, aliases, and short definitions. At this stage, the taxonomy remains provisional and can be adjusted iteratively.
Stage 5. Constrained classification.
Apply a classifier to assign one category per predicate to each message. In our case, a locally deployed LLM was used under a constrained prompt, with JSON output, alias normalisation, and secondary disambiguation when needed. However, the same logic can be implemented using alternative classifiers as long as they are restricted to the predefined category space.
Stage 6. Iterative refinement and transfer.
Inspect unclassified, ambiguous, or low-confidence cases to identify missing categories, overlapping predicates, or domain-specific expressions. Updated categories and aliases can then be incorporated into the taxonomy and re-applied to the corpus or transferred to a new dataset from a related illicit domain.
Under this protocol, the taxonomy is not treated as a fixed ontology, but as a controlled, evidence-driven classification framework that can be initialised from one corpus and subsequently adapted to another through iterative validation. The six stages of this protocol, together with their corresponding inputs, operations, and outputs, are summarised in Table 3 and illustrated in Figure 2.
For example, in a credential-theft or account-takeover forum, the same protocol could retain the high-level predicate logic while re-estimating the canonical categories and aliases from the new corpus evidence.
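One way to represent the taxonomy produced in Stages 3 and 4, and to extend it in Stage 6, is a nested mapping from predicates to categories with their definitions and aliases. The sketch below is an assumption about a convenient data structure, not the format used in the study; the definitions, the alias `vendor`, and the added category `stolen-credentials` are hypothetical.

```python
def build_alias_index(taxonomy):
    """Map every alias (and canonical name) to its (predicate, category) pair."""
    index = {}
    for predicate, categories in taxonomy.items():
        for category, spec in categories.items():
            for alias in [category, *spec.get("expanded", [])]:
                key = "-".join(alias.lower().split())
                # An alias pointing at two categories would make labels ambiguous.
                if key in index and index[key] != (predicate, category):
                    raise ValueError(f"alias {key!r} maps to two categories")
                index[key] = (predicate, category)
    return index


# Hypothetical taxonomy fragment, for illustration only.
taxonomy = {
    "actor-role": {
        "seller": {
            "definition": "Offers goods or services.",
            "expanded": ["dump seller", "vendor"],
        },
    },
    "product-service": {
        "credit-card-data": {
            "definition": "Card data offered for sale.",
            "expanded": ["cc shop"],
        },
    },
}
index = build_alias_index(taxonomy)

# Stage 6: add a category surfaced by unclear cases, then rebuild the index.
taxonomy["product-service"]["stolen-credentials"] = {
    "definition": "Account credentials offered for sale.",
    "expanded": ["account logs"],
}
index = build_alias_index(taxonomy)
```

Rebuilding the index after each refinement, with a collision check, is one way to keep the category space closed while the taxonomy evolves.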
4.5. Human Annotation and Validation Protocol
To complement coverage-based evaluation with a standard performance assessment, a human-annotated subset of the corpus was created. A stratified random sample of 326 posts (10% of the full corpus of 3260 messages) was selected, ensuring representation of posts initially classified across the four predicates and including a proportion of cases automatically labelled as unclear.
Two researchers with expertise in cybercrime analysis independently annotated the sampled posts using the same four-predicate taxonomy: activity-context, actor-role, product-service, and technique-tool. Annotators were provided with a coding guide containing predicate definitions, category descriptions, and examples extracted from the corpus. Annotation was conducted independently in the first stage. Disagreements were then reviewed jointly, and a consensus version was produced to serve as the reference gold-standard for classifier evaluation.
Inter-annotator agreement was estimated using Cohen’s kappa computed directly between the two annotators for each predicate. Under the conventional interpretation bands, agreement was moderate for activity-context (κ = 0.594), substantial for actor-role (κ = 0.619) and product-service (κ = 0.675), and almost perfect for technique-tool (κ = 0.937), yielding a macro-average κ of 0.706 across predicates. The comparatively lower agreement on activity-context and actor-role is consistent with the higher semantic ambiguity observed in these dimensions during automated classification. The near-perfect agreement on technique-tool (κ = 0.937) should be interpreted with caution: approximately 76–83% of posts in this predicate were labelled as unclear across both annotators and the consensus (249–272 of 326 posts), so the high value largely reflects label concentration rather than fine-grained discriminative consensus. This pattern is consistent with the limited explicit discussion of specific tools and techniques observed in the corpus, where such references are frequently implicit, omitted, or obfuscated.
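For reference, Cohen's kappa compares observed agreement with the agreement expected by chance from each annotator's label distribution. The sketch below implements the standard two-annotator formula and recomputes the macro-average from the per-predicate values reported above; the function name `cohen_kappa` is illustrative.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two annotators' marginal proportions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Per-predicate values as reported in the text; the macro-average is their mean.
reported = {
    "activity-context": 0.594,
    "actor-role": 0.619,
    "product-service": 0.675,
    "technique-tool": 0.937,
}
macro_kappa = sum(reported.values()) / len(reported)
```

The mean of the four reported values is 0.70625, which matches the macro-average of 0.706 given in the text.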
Using the consensus annotations as gold standard, the outputs of the LLM-based classifier were evaluated through accuracy, precision, recall, and F1-score. Metrics were computed independently for each predicate and for each annotator. In addition, macro-averaged values were calculated to provide a synthetic view of performance across semantic dimensions.
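The metrics named above can be computed per predicate as in the following sketch. It assumes unweighted macro-averaging over every label that appears in either the gold or the predicted sequence; the source does not state the exact averaging scheme, so this choice and the function name `macro_metrics` are assumptions.

```python
def macro_metrics(gold, predicted):
    """Accuracy plus macro-averaged precision, recall and F1 over all labels seen."""
    pairs = list(zip(gold, predicted))
    labels = sorted(set(gold) | set(predicted))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(2 * precision * recall / denom if denom else 0.0)
    k = len(labels)
    return {
        "accuracy": sum(g == p for g, p in pairs) / len(pairs),
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }


# Hypothetical labels for one predicate, for illustration only.
gold = ["seller", "seller", "buyer", "unclear"]
predicted = ["seller", "buyer", "buyer", "unclear"]
scores = macro_metrics(gold, predicted)
```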
The two annotators show complementary strengths: Annotator 1 (A1) performs better overall and dominates the technical predicates, while Annotator 2 (A2) is slightly stronger in context- and role-oriented classification.
This evaluation complements the coverage analysis reported below. In this revised framework, coverage is interpreted as an indicator of the taxonomy’s representational breadth, whereas agreement and performance metrics provide evidence of annotation reliability and classifier accuracy.
Table 4 summarises the composition of the human-annotated subset and the inter-annotator agreement obtained for each predicate.
Given that the two annotators showed different annotation profiles, their results are reported separately in Table 5.
The results indicate that Annotator 1 achieves stronger overall performance, with notably higher agreement and F1-scores on product-service and technique-tool. Annotator 2 shows comparatively stronger precision on activity-context and actor-role, though at the cost of lower recall on those predicates. Across both annotators, technique-tool consistently yields the highest agreement and classification performance, while activity-context presents the greatest challenge, reflecting the semantic complexity already observed in the inter-annotator agreement analysis. These findings support the interpretation that coverage alone is insufficient to assess classification quality and that manual validation provides an essential complementary perspective.
4.7. Statistical Robustness Analysis
To strengthen the evaluation framework, additional statistical analyses were conducted on the human-annotated subset. These analyses addressed three complementary aspects: metric uncertainty, sensitivity to individual pipeline components (assessed through component-level ablation), and model comparison.
First, 95% confidence intervals were estimated for the main performance indicators using bootstrap resampling over the annotated subset (1000 resamples with replacement). This procedure was applied to overall accuracy and macro-F1, as well as to predicate-level F1-scores.
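The resampling procedure can be sketched as follows. This is a minimal illustration that assumes a percentile bootstrap (the text does not state which interval construction was used) and uses hypothetical toy labels. The function names, the seed, and the use of accuracy as the example metric are illustrative choices.

```python
import random


def accuracy(gold, pred):
    """Share of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def bootstrap_ci(gold, pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(gold, pred).

    Assumption: percentile method; items are resampled with replacement.
    """
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_resamples):
        # Draw n item indices with replacement and recompute the metric.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([gold[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical toy data: 100 items, 72 of them classified correctly.
toy_gold = ["marketplace"] * 100
toy_pred = ["marketplace"] * 72 + ["unclear"] * 28

point_estimate = accuracy(toy_gold, toy_pred)
ci_low, ci_high = bootstrap_ci(toy_gold, toy_pred, accuracy)
print(f"accuracy = {point_estimate:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Passing the metric as a callable means the same routine could be reused for macro-F1 or predicate-level F1 without changing the resampling logic.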
Second, a sensitivity and component-level ablation analysis was performed to evaluate the stability of the classification pipeline under variations in key modelling choices and the removal of selected pipeline components. Two factors were examined: inclusion versus exclusion of the extracted keywords as auxiliary input, and activation versus deactivation of the second-stage disambiguation prompt.
Third, to assess whether the selection of Llama 4 Scout was empirically justified, the same evaluation protocol was applied to alternative open-weight models representing smaller and comparable configurations. Model comparison was conducted on the same annotated subset and under the same constrained taxonomy assignment setting.
Table 7 reports the main classification metrics together with their 95% confidence intervals.
The point estimates indicate moderate classification performance, with accuracy at 0.72 and macro-F1 at 0.64. The gap between precision (0.78) and recall (0.68) suggests that the classifier is conservative, producing fewer false positives at the cost of missing some true labels. Wider confidence intervals for recall and F1 reflect greater variability in these metrics across bootstrap resamples, consistent with the uneven difficulty of the four predicates. Product-service shows the weakest F1 (0.52), reflecting the greater semantic ambiguity in distinguishing exchanged goods and services within forum discourse.
Table 8 presents the sensitivity and component-level ablation results for the main pipeline configurations.
The sensitivity analysis shows that both contextual support mechanisms contribute meaningfully to pipeline performance. Removing keyword input produces the largest overall drop, with accuracy falling to 0.60 and macro-F1 to 0.53, a reduction of 0.11 points in F1 relative to the full system. Disabling second-stage disambiguation results in a smaller but still notable decline, with macro-F1 dropping to 0.59. These results confirm that the pipeline’s effectiveness depends not only on the LLM’s base reasoning capacity but also on the structured constraints and auxiliary inputs surrounding it.
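The size of each ablation effect can be made explicit with a short worked calculation. The sketch below uses only the point estimates quoted in the text; the dictionary keys and the helper name are hypothetical labels introduced for illustration.

```python
# Point estimates as quoted in the text for the full system and each ablation.
FULL_SYSTEM = {"accuracy": 0.72, "macro_f1": 0.64}
ABLATIONS = {
    "no_keyword_input": {"accuracy": 0.60, "macro_f1": 0.53},
    "no_second_stage_disambiguation": {"macro_f1": 0.59},
}


def metric_drops(full, ablated):
    """Absolute and relative drop for each metric reported for an ablated run."""
    drops = {}
    for name, value in ablated.items():
        absolute = full[name] - value
        drops[name] = {
            "absolute": round(absolute, 2),
            "relative": round(absolute / full[name], 3),
        }
    return drops


drops = {
    config: metric_drops(FULL_SYSTEM, scores)
    for config, scores in ABLATIONS.items()
}
for config, per_metric in drops.items():
    print(config, per_metric)
```

In relative terms, removing keyword input costs roughly 17% of the full system's macro-F1, whereas disabling second-stage disambiguation costs roughly 8%.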
The model comparison confirms the selection of Llama 4 Scout as the most effective open-weight configuration for this task. All alternative models show a pronounced gap between accuracy and macro-F1, reflecting a tendency to predict the dominant “unclear” class across predicates, which inflates accuracy while substantially depressing F1. Five open-weight models were evaluated; the results for the four most representative configurations are reported in Table 9. Phi-4 is the strongest alternative, achieving 0.44 accuracy and 0.15 macro-F1, with comparatively better accuracy on actor-role (0.60) and technique-tool (0.50), though the corresponding F1-scores remain modest (0.22 and 0.19 respectively), reflecting the same “unclear” inflation pattern. Qwen 2.5 exhibits an uneven coverage pattern, achieving relatively high accuracy on product-service (0.61) but a very low F1 on that same predicate (0.09), and performing near-randomly on technique-tool. DeepSeek produces uniformly low results across all predicates in both accuracy and F1. In contrast, Llama 4 Scout achieves 0.72 accuracy and 0.64 macro-F1, substantially outperforming all tested alternatives and demonstrating more consistent semantic coverage across the full predicate structure.
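The divergence between accuracy and macro-F1 under majority-class prediction can be illustrated with a minimal sketch. The label distribution below is a hypothetical toy example, not data from the study, and the degenerate predictor simply assigns unclear to every post.

```python
def accuracy(gold, pred):
    """Share of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy gold labels for a predicate dominated by "unclear".
toy_gold = (
    ["unclear"] * 80
    + ["malware"] * 8
    + ["exploit"] * 6
    + ["social-engineering"] * 6
)
# Degenerate classifier that always predicts the dominant class.
toy_pred = ["unclear"] * len(toy_gold)

toy_accuracy = accuracy(toy_gold, toy_pred)
toy_macro_f1 = macro_f1(toy_gold, toy_pred)
print(f"accuracy = {toy_accuracy:.2f}, macro-F1 = {toy_macro_f1:.2f}")
```

Because macro-F1 weights every class equally, the three minority classes that are never predicted each contribute an F1 of zero, pulling the average far below the accuracy.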
4.8. Results
Applying the automated classifier based on the initial taxonomy to the full corpus of 3260 messages enabled an initial quantitative characterisation of the coverage achieved by each defined predicate.
These results should be interpreted cautiously, as coverage is highly uneven across predicates and is driven primarily by activity-context, while the more substantively demanding dimensions (particularly technique-tool) remain much less consistently identifiable.
Activity-context clearly dominates, capturing over 90% of posts, indicating that the functional location of discourse within the forum is highly explicit. Actor-role and product-service exhibit comparable intermediate coverage slightly above 50%, suggesting that transactional roles and objects are present but often implicit. By contrast, technique-tool shows markedly lower coverage (below 20%), highlighting that technical mechanisms tend to be communicated indirectly or selectively. This gradient of visibility across predicates reflects the layered nature of carding forums, where organisational and commercial signals are more publicly expressed than operational techniques.
Table 10 reveals a markedly uneven distribution of coverage across predicates. The high overall corpus-level coverage reported for the taxonomy is driven mainly by activity-context, which reaches 91.75% of messages. By contrast, actor-role and product-service show only moderate coverage (52.94% and 52.36%, respectively), while technique-tool remains identifiable in just 16.81% of posts. These results indicate that the taxonomy captures some dimensions of forum discourse more effectively than others, and that the headline figure of 98.71% should be interpreted as a measure of representational breadth at the corpus level rather than as evidence of equally strong semantic capture across all predicates.
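The distinction between predicate-level coverage and the corpus-level headline figure can be sketched as follows. The example assumes that a post counts as covered for a predicate when its assigned label is anything other than unclear, and that corpus-level coverage requires at least one covered predicate. The data structure and the four toy posts are hypothetical.

```python
PREDICATES = ("activity-context", "actor-role", "product-service", "technique-tool")


def predicate_coverage(posts):
    """Share of posts with a label other than 'unclear', per predicate."""
    n = len(posts)
    return {
        predicate: sum(post[predicate] != "unclear" for post in posts) / n
        for predicate in PREDICATES
    }


def corpus_coverage(posts):
    """Share of posts for which at least one predicate is not 'unclear'."""
    covered = sum(
        any(post[predicate] != "unclear" for predicate in PREDICATES)
        for post in posts
    )
    return covered / len(posts)


# Hypothetical toy classifications (illustrative only, not study data).
toy_posts = [
    {"activity-context": "marketplace", "actor-role": "seller",
     "product-service": "carding-tool", "technique-tool": "unclear"},
    {"activity-context": "forum-discussion", "actor-role": "unclear",
     "product-service": "unclear", "technique-tool": "unclear"},
    {"activity-context": "login-portal", "actor-role": "unclear",
     "product-service": "unclear", "technique-tool": "unclear"},
    {"activity-context": "unclear", "actor-role": "unclear",
     "product-service": "unclear", "technique-tool": "unclear"},
]

per_predicate = predicate_coverage(toy_posts)
overall = corpus_coverage(toy_posts)
print(per_predicate)
print(f"corpus-level coverage: {overall:.2f}")
```

In this toy case the corpus-level figure equals the activity-context figure, which mirrors how a single well-covered predicate can account for most of a high headline value.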
A more detailed analysis of activity-context, reported in Table 11, shows that marketplace is the most prevalent category (39.08%), followed by forum-discussion (29.23%) and login-portal (18.96%). This distribution highlights the forum’s hybrid nature, operating simultaneously as a venue for trading illicit goods and services and as a discussion platform for knowledge exchange. The substantial presence of access and internal navigation pages further suggests that a meaningful portion of the corpus corresponds to structural content of the forum itself, beyond strictly transactional posts.
For actor-role, Table 12 shows a high proportion of posts labelled as unclear (47.06%), indicating that almost half of the messages do not contain sufficiently explicit indicators to infer the author’s functional role. Nevertheless, when a role is identifiable, seller clearly dominates (42.55%), far exceeding buyer (2.85%) and staff (7.55%). This pattern suggests that the forum is strongly oriented towards the supply of products and services, and that sellers constitute the most visible and active actors in the analysed corpus.
Table 13 presents the category distribution for product-service, where a similarly high proportion of unclear cases is observed (47.64%). Among classified posts, carding-tool is the most frequent category (34.20%), substantially exceeding credit-card-data (12.12%). This finding is particularly salient, as it indicates that the forum’s core activity is less centred on the direct sale of card data and more focused on the commoditisation of tools, services, and resources that enable fraud execution. The presence of tutorial-guide and scam-report, although less frequent, confirms the existence of collective learning dynamics and internal mechanisms of control and reputation.
The technique-tool predicate, summarised in Table 14, exhibits the highest ambiguity level, with 83.19% of messages labelled as unclear. The identified categories, including anonymization-tool, cryptography, exploit, malware, and social-engineering, appear at relatively low frequencies. This behaviour suggests that specific techniques and advanced tools are not typically described explicitly in most forum posts, either because they are assumed as shared implicit knowledge or because they are reserved for more restricted or specialised sections. This result supports the interpretation that the technical dimension of carding is less visible in public forum discourse, though not necessarily less operationally relevant.
Beyond reflecting classification difficulty, the high ambiguity observed in the technique-tool predicate provides insight into the communicative practices of carding forums. Technical procedures are often conveyed through abbreviated jargon, implicit references, or indirect signalling rather than explicit descriptions. This opacity functions as a risk-management strategy, allowing participants to share operational knowledge while reducing exposure to external monitoring. Consequently, the prevalence of unclear cases in this dimension should not be interpreted solely as a limitation of the taxonomy or classifier, but also as empirical evidence of how technical expertise circulates within these communities. Analytically, this suggests that the visibility of semantic dimensions in illicit forums is uneven, with organisational and transactional elements expressed more openly than operational techniques.
Overall, the initial classification results confirm broad corpus-level representational coverage, concentrated in activity-context, with actor-role and product-service only partially captured and technique-tool showing substantial ambiguity. The LLM-based classifier achieved a macro-averaged F1-score of 0.64 on the human-annotated subset, outperforming the keyword-only baseline (Macro-F1 = 0.47), particularly in predicates where indirect expression is most prevalent. These gaps provide the empirical basis for the taxonomy expansion process developed in the subsequent subsection.
6. Discussion
Unlike previous work that either examines fraud detection models or qualitatively describes carding communities, this study integrates computational taxonomy design, LLM-assisted semantic classification, and relational network analysis into a single operational framework. This integration allows the ecosystem structure of carding forums to be analysed systematically at scale, which, to our knowledge, has not been achieved in prior empirical research.
The results obtained in this study enable the research questions to be addressed systematically and situated within the existing literature on financial fraud, illicit digital markets, and automated dark web analysis. Overall, the findings confirm that analysing carding forums requires taxonomic and methodological approaches that go beyond traditional cybersecurity structures, integrating social, economic, and technical dimensions in a coherent manner.
With respect to RQ1, the results indicate that existing cybersecurity taxonomies, such as MISP, exhibit structural limitations when applied to the specific domain of carding in P2P forums. As noted in the literature, these frameworks have been primarily designed for exchanging technical threat intelligence on indicators and campaigns, prioritising technical artefacts and events over social and organisational dynamics [3,4]. The large share of content that, under such schemes, would remain unclassified or be labelled as ambiguous supports the view that carding forums do not function solely as technical repositories, but rather as structured markets and communities of practice, consistent with the characterisations provided by Holt [1] and Yip et al. [6].
Building on this observation, the results allow RQ2 to be answered affirmatively, showing that it is feasible to design a domain-specific taxonomy capable of capturing, in a structured way, the activities, roles, and products present in carding forums. The taxonomy based on the predicates activity-context, actor-role, product-service, and technique-tool enabled at least one semantic dimension to be classified for 98.71% of the corpus, indicating high structural suitability. This outcome aligns with previous studies describing carding as a functionally differentiated ecosystem with mature markets and specialised roles [2,5], and demonstrates that such concepts can be formalised into a structure suitable for automated classification.
Regarding RQ3, integrating Llama 4 Scout had a substantial impact on the classification stage, particularly in dimensions where meaning depends more on discourse context than on explicit lexical markers. On the human-annotated subset, the LLM-based classifier outperformed the keyword-only baseline across all summary metrics, and comparative evaluation also showed that it outperformed the alternative open-weight models tested under the same constrained classification setting. The sensitivity and ablation analyses further indicate that classifier performance depends on the interaction between the LLM and the surrounding normalisation, keyword-support, and disambiguation mechanisms, suggesting that the effectiveness of the approach lies in the pipeline as a whole rather than in model selection alone. The contribution of Llama 4 Scout should therefore be interpreted pragmatically: its value lies in enabling constrained semantic classification under conditions of local deployment and methodological transparency, while its limitations become more visible in predicates characterised by implicit or weakly lexicalised discourse. The low coverage of technique-tool (16.81%) is consistent with the higher semantic opacity of this predicate and with the concentration of unclear labels in the human annotations, where the near-perfect agreement (κ = 0.937) largely reflects that concentration rather than fine-grained consensus. This suggests that a language model fine-tuned on cybersecurity or cybercrime-related corpora could improve sensitivity to highly technical shorthand and covert procedural language. At the same time, the persistence of ambiguity should be expected even under more specialised modelling conditions, since part of the communicative logic of these environments relies precisely on indirection and selective disclosure.
This is consistent with prior work highlighting the potential of LLMs for forensic analysis and text mining in complex, noisy environments [15,16]. Nevertheless, the persistence of high ambiguity levels in technique-tool indicates that model performance is constrained both by the implicit nature of shared technical knowledge in these forums and by deliberate opacity strategies, as discussed in studies of operational security within criminal communities [24].
The findings related to RQ4 show that posts classified as unclear constitute a particularly valuable source of information for taxonomic refinement. Rather than merely reflecting classification errors, these cases indicate areas where the initial taxonomy does not adequately capture the semantic diversity of forum discourse. This observation is aligned with recent proposals advocating iterative, data-driven approaches for taxonomy development in dynamic and rapidly evolving domains [10]. In the carding context, observed ambiguity reflects both the continuous emergence of new practices and services and the use of specialised jargon and implicit references that hinder direct classification.
With respect to RQ5, the analysis of coverage and classification coherence indicates that progressive taxonomy extension is a key mechanism for improving domain representation. Although the initial taxonomy showed high structural robustness, predicate-level results reveal substantial differences in semantic capture capability, particularly in dimensions related to roles and techniques. This aligns with the criminological literature describing carding as an environment where roles may overlap and techniques are communicated selectively to minimise risk [14,17]. Evidence-driven expansion can therefore reduce ambiguity and enhance the internal coherence of the classification system.
Beyond its application to the analysed corpus, the proposed approach should be understood as a transferable protocol for evidence-driven taxonomy construction in illicit online environments. Its main methodological contribution is not limited to the specific labels identified in this forum, but lies in the combination of exploratory semantic mapping, predicate-based formalisation, constrained classification, and ambiguity-driven refinement. This makes the approach adaptable to other datasets in which the operational logic is similar but the concrete vocabulary and service structure differ.
Finally, the results derived from co-occurrence and semantic clustering address RQ6 by revealing functional patterns that reinforce and complement the taxonomic classification. The clusters identified through VOSviewer reflect coherent operational flows connecting core carding activity with specialised markets, monetisation processes, technical infrastructures, and anonymity mechanisms. These patterns are consistent with theoretical models of criminal processes that conceptualise financial fraud as a chain of interdependent activities [2,13]. The convergence between relational analysis and categorical classification provides triangulation for the proposed approach and strengthens its capacity to capture both the structure and the functional dynamics of the analysed ecosystem.
7. Limitations
Despite the results obtained and the methodological robustness of the proposed approach, this study presents several limitations that should be considered when interpreting the findings and assessing their generalisability to other contexts.
First, the analysis is based on a corpus of 3260 messages extracted from two publicly accessible carding forums hosted as onion services on the dark web. Although this dataset is sufficient for exploratory analysis and for validating the proposed taxonomy within the analysed sample, the results cannot be assumed to transfer directly to other forums or illicit marketplaces. The literature shows that carding communities may differ substantially in terms of internal norms, organisational structure, technical specialisation, and interaction dynamics [1,5]. Consequently, the effectiveness of the taxonomy and of the defined predicates may vary depending on the particular characteristics of each analysed community.
Second, the linguistic normalisation process, based on automatically translating the original messages into a single language (English), constitutes a potential source of bias. While this methodological decision facilitates automated processing and the application of language models, translation may have introduced semantic errors, lexical simplifications, or a loss of idiomatic nuances present in the forum’s original language. Such nuances can be particularly relevant in criminal environments, where slang, coded expressions, and deliberate ambiguity are integral to communication strategies. Accordingly, some content may have been interpreted differently by the language model than it would have been if analysed in the original language.
A third limitation relates to the use of a mid-sized language model, specifically Llama 4 Scout, for automated content classification. Although this model demonstrated a notable ability to capture message-level semantic context and to overcome the limitations of purely lexical approaches, its size and capacity may have contributed to the high proportion of posts classified as unclear, particularly within the technique-tool predicate. Larger models, or models trained specifically on technical or criminal domains, may be better positioned to identify complex interactions, implicit references, or advanced techniques that the selected model did not capture consistently.
In particular, a cybersecurity-fine-tuned or domain-adapted language model could plausibly reduce the proportion of unclear assignments within the technique-tool predicate by better recognising specialised jargon, obfuscated technical references, and recurrent procedural patterns that are underrepresented in general-purpose training corpora. However, this potential improvement should not be overstated. In carding forums, many technique-related posts are intentionally elliptical, strategically vague, or embedded in shared community knowledge, which means that part of the observed ambiguity is likely intrinsic to the discourse itself rather than attributable only to model choice. For this reason, a domain-specific model should be understood as a promising mitigation strategy, but not as a complete solution to the high uncertainty observed in this predicate.
Although the present study includes confidence intervals, robustness analyses, and model comparisons, these evaluations were conducted on a manually annotated subset rather than on the full corpus. Accordingly, the reported uncertainty ranges and comparative results should be interpreted as strong evidence of local robustness, but not as exhaustive benchmarking across all possible model architectures or parameter settings.
Although the proposed protocol is designed to be transferable, the specific canonical categories identified in this study should not be assumed to be universally stable across all dark-web fraud environments. What is expected to transfer is the methodological procedure for deriving predicates and refining categories from corpus evidence, rather than the exact category inventory obtained from this forum.
A further limitation concerns the selected language model itself. Although Llama 4 Scout offered a suitable balance between transparency, local deployability, and contextual performance, it remains a mid-sized model with restricted reasoning depth compared with larger architectures. Its outputs are therefore more vulnerable to ambiguity, sparse context, and domain-specific lexical opacity. As a result, some classification errors may originate not only from taxonomy design, but also from model-level constraints in semantic disambiguation.
This risk is particularly relevant for predicates such as technique-tool, where weak lexicalisation, jargon, and indirect signalling are more common and where translation may reduce the recoverability of fine-grained semantic distinctions.
These limitations are consistent with broader concerns identified in recent survey literature on LLM-assisted security analysis. Prior reviews note that LLM-based pipelines may be affected by hallucinations, output variability, prompt dependence, limited context windows, and scalability constraints, all of which can influence classification stability and interpretability. Security-oriented surveys also emphasise that LLMs may themselves be exposed to adversarial risks, including prompt-level manipulation, poisoning, or backdoor effects. While such threats were not directly evaluated in the present study, they reinforce the need to treat LLM outputs as analytically useful but not self-validating, particularly in sensitive domains involving illicit, ambiguous, or strategically coded communication.
To examine the translation-related risk more directly, a small-scale manual validation was conducted on 50 randomly sampled non-English posts. The results suggest that most translations preserved the semantic information required for taxonomy assignment, but that a limited number of posts containing slang, compressed jargon, or indirect procedural references were more vulnerable to translation-related distortion. This effect was most visible in cases relevant to the technique-tool predicate, where semantic recoverability depends on fine-grained lexical cues.
Finally, although the present study incorporates manual validation through a human-annotated subset, this evaluation was conducted on a sample rather than on the full corpus. Consequently, the reported agreement and performance metrics provide robust evidence of classifier behaviour, but they do not eliminate all uncertainty regarding borderline or highly context-dependent cases. Future work should expand the size of the annotated benchmark and explore multi-round annotation protocols with a larger pool of domain experts.
8. Conclusions and Future Research
To our knowledge, this is the first study that formalises the organisational and semantic structure of dark-web carding forums through an operational taxonomy validated on empirical forum data.
This paper proposed and validated an iterative methodological approach for the classification and structured analysis of content in P2P carding forums on the dark web, combining domain-specific taxonomies, large language models, and semantic network analysis. The results confirm that automated analysis of these environments requires conceptual frameworks that go beyond traditional cybersecurity taxonomies by explicitly integrating social, economic, and technical dimensions.
The evaluation results confirm that the proposed taxonomy achieves broad representational coverage (at least one predicate was assigned to 98.71% of posts) and acceptable classification performance on the human-annotated subset (macro-F1 = 0.64), outperforming the keyword-only baseline. Coverage is uneven across predicates: activity-context is the most explicitly identifiable dimension, while technique-tool remains the most ambiguous and difficult to classify consistently.
In addition, the results indicate that the forum operates as a hybrid space in which illicit market functions and knowledge-exchange community dynamics coexist and mutually reinforce one another. The predominance of content associated with marketplace and forum-discussion confirms that transactional activity cannot be disentangled from the processes of learning, socialisation, and trust-building that characterise these criminal ecosystems.
Confidence intervals, sensitivity analysis, and model comparisons further support the reliability of the full classification pipeline across multiple robustness checks.
A further contribution of this work is the formalisation of the taxonomy-building process as a transferable protocol. Rather than treating the final category set as fixed, the study shows how a taxonomy can be generated from corpus evidence, operationalised through constrained classification, and iteratively adapted to new datasets with related but not identical semantic structures.
The selection of Llama 4 Scout should be understood as a context-sensitive methodological choice rather than as a claim of model superiority in general. Its usefulness in this study stems from its balance between semantic capacity, reproducibility, and secure local deployment, although its limitations remain visible in the classification of highly ambiguous content.
Finally, the systematic identification of cases labelled as unclear demonstrated that ambiguity is not merely a technical limitation but also an analytically valuable source of information. These cases reflect taxonomic gaps, implicit practices, and deliberate opacity strategies, and they emerge as the primary driver for progressive taxonomy expansion and refinement.
The implications of this work can be understood at both a theoretical and practical level.