Extending Taxonomies and Mapping P2P Credit Card Fraud (Carding) Forums on the Dark Web

Medina-Merodio, Jose-Amelio; Ferrer-Oliva, Mikel; Fernández López, José; Ruiz-Zambrano, Alejandro; Domínguez-Díaz, Adrián

doi:10.3390/info17050469


by Jose-Amelio Medina-Merodio *, Mikel Ferrer-Oliva, José Fernández López, Alejandro Ruiz-Zambrano and Adrián Domínguez-Díaz

Departamento de las Ciencias de la Computación, Universidad de Alcalá, 28805 Alcalá de Henares, Spain

* Author to whom correspondence should be addressed.

Information 2026, 17(5), 469; https://doi.org/10.3390/info17050469

Submission received: 3 March 2026 / Revised: 30 April 2026 / Accepted: 8 May 2026 / Published: 12 May 2026

(This article belongs to the Special Issue Security and Privacy Approaches Against Cyber Threats: Innovations, Challenges, and Practical Solutions)


Abstract

Credit card fraud constitutes a core component of the contemporary cybercrime economy, in which dark web carding forums play a pivotal role in coordinating, commoditising, and disseminating illicit activities. While prior research has primarily focused on transaction-level fraud detection, comparatively limited attention has been devoted to the systematic analysis of the social and organisational ecosystems within which these practices are enacted. This study addresses this gap by proposing and validating a domain-specific taxonomy for the automated classification of content in P2P carding forums. To this end, we adopt an iterative, data-driven methodology that integrates large language models (LLMs), lexical co-occurrence analysis, and semantic network analysis. Using a corpus of 3260 posts, we define and operationalise a taxonomy structured around four predicates: activity context, actor role, products and services, and technical tools, supported by a locally deployed LLM (Llama 4 Scout). A human-annotated subset was additionally used to evaluate inter-annotator agreement and standard classification metrics, complementing the coverage-based assessment and enabling comparison against a keyword-based baseline. Evaluation was further strengthened through manual benchmarking, confidence intervals, sensitivity analysis of key pipeline components, and comparison with alternative open-weight models. The results indicate that the proposed taxonomy achieves broad corpus-level representational coverage, with at least one semantic dimension identified in 98.71% of posts. However, coverage is uneven across predicates: activity-context is highly explicit, whereas actor-role and product-service show only moderate coverage and technique-tool remains substantially underrepresented and ambiguous. 
Overall, the findings show that combining domain-specific taxonomies with LLM-assisted classification and network analysis offers a robust framework for understanding and monitoring carding ecosystems in the dark web.

Keywords:

carding; dark web; P2P forums; cybercrime intelligence; network analysis; taxonomy

1. Introduction

Credit card fraud constitutes one of the foundational pillars of the contemporary cybercrime economy, due both to its direct financial impact and to its structuring role within illicit digital markets [1,2,3]. In particular, the phenomenon known as carding, namely the acquisition, trading, and fraudulent exploitation of credit card data, has demonstrated a marked capacity to adapt in the face of prevention, detection, and enforcement measures implemented by financial institutions and law enforcement agencies [4,5,6]. This resilience can be largely attributed to the progressive sophistication of the digital environments in which these activities are coordinated, especially dark web forums and marketplaces that operate as spaces for the exchange of knowledge, goods, and illicit services [7].

From a technological standpoint, financial fraud detection has long been a major application domain for artificial intelligence and machine learning due to the need to identify complex patterns in high-volume data [8,9,10]. However, most existing approaches remain centred on the analysis of structured financial transactions, which leaves comparatively limited space for the systematic study of the social and communicative ecosystems where fraudulent practices are produced, refined, and disseminated [1,11]. This separation between automated detection systems and an empirically grounded understanding of the criminal environment constrains the ability to anticipate emerging fraud dynamics and to characterise the organisational mechanisms that sustain them [5,10,12].

In this context, dark web carding forums represent a strategically valuable source of information for understanding how fraud is operationalised and scaled in practice [2,3,13]. These spaces not only facilitate the buying and selling of data and tools but also reveal market structure, participant roles, employed techniques, and associated economic flows through their interaction and governance mechanisms [7,14,15]. Nevertheless, the automated analysis of this content entails significant challenges because forum discourse is characterised by semantic heterogeneity, the pervasive use of specialised jargon, and the limited availability of standardised taxonomies that enable consistent classification of the diverse elements embedded in posted messages [16,17].

Although widely adopted taxonomies exist within cybersecurity, such as those provided by the Malware Information Sharing Platform (MISP), they have been primarily designed to support the exchange of technical threat intelligence and to structure indicators and incident artefacts rather than market-level and organisational dimensions [4]. As a consequence, these schemes do not adequately capture the socioeconomic and organisational complexity that characterises carding forums as illicit communities and marketplaces [5,17]. In practice, a substantial share of forum content is therefore likely to remain ambiguous or to be operationally treated as unclear, which reduces the analytical utility of automated systems and hinders the extraction of actionable knowledge [10,16].

Despite substantial progress in transaction-level fraud detection using machine learning and statistical models, comparatively little work has addressed the structural and communicative environments in which fraudulent practices are organised and coordinated. Existing research tends to analyse fraud as isolated behavioural events observable in financial data, whereas the operational logic of cybercrime markets unfolds through interaction, role differentiation, and service exchange within online communities. This creates a methodological gap between predictive fraud detection and the systematic analysis of the ecosystem that enables fraud.

The present study addresses this gap by proposing an iterative methodological approach that combines domain-specific taxonomy design, LLM-assisted classification, lexical co-occurrence analysis, and network analysis to systematically structure the activities, products, roles, and contextual cues that characterise carding forums, thereby narrowing the gap between technical fraud detection and criminological analysis of the ecosystem that enables it. This strategy is intended not only to improve the coverage and accuracy of automated classification, but also to provide a deeper understanding of how these markets operate internally.

This paper answers the following research questions:

RQ1. What are the key characteristics and limitations of existing cybersecurity taxonomies (such as MISP) when applied to the specific domain of carding in dark web forums?
RQ2. Is it possible to design a domain-specific taxonomy that robustly and structurally captures the core activities, roles, and products present in P2P carding forums?
RQ3. What is the impact of integrating an LLM (Llama 4 Scout) into the initial classification stage?
RQ4. Which emergent categories arise from cases classified as unclear, and how do they contribute to extending and refining the taxonomy?
RQ5. How does taxonomy extension affect the coverage and coherence of the final corpus classification?
RQ6. Which semantic and functional patterns are revealed through term co-occurrence analysis and clustering within the forum?

This paper makes four main contributions to the literature.

First, it proposes the first domain-specific taxonomy explicitly designed for the semantic classification of P2P carding forums, addressing limitations of existing cybersecurity taxonomies that focus on technical indicators rather than market structures.

Second, it operationalises this taxonomy through an automated LLM-assisted classification pipeline applied to real dark-web data.

Third, it demonstrates empirically that taxonomy-guided classification combined with semantic network analysis enables structural interpretation of carding ecosystems beyond transaction-level fraud detection.

Fourth, it introduces an iterative ambiguity-driven expansion strategy that treats unclear classifications as signals for taxonomy refinement, providing a replicable methodological framework for analysing evolving cybercrime domains.

2. Related Work

2.1. Artificial Intelligence and Machine Learning in Fraud Detection

Credit card fraud detection has become one of the most intensively studied domains for the application of artificial intelligence techniques, driven by the need to identify complex patterns within large volumes of highly imbalanced data. The literature consistently argues that traditional statistical approaches are insufficient to address the growing sophistication of fraudulent strategies, which has motivated the development of hybrid and bio-inspired models capable of improving predictive accuracy and computational efficiency simultaneously [8,9].

In this respect, the adoption of deep learning architectures, including convolutional neural networks and sequential models, has enabled the capture of non-linear relationships and complex temporal dependencies, yielding substantial improvements in key metrics such as accuracy and the area under the ROC curve [18,19]. Complementarily, ensemble models and meta-learning strategies have shown high effectiveness in addressing the structural challenge of class imbalance, improving the detection of minority fraudulent transactions without a proportional increase in false positives [11,20].

Nevertheless, multiple studies emphasise that the scarcity of high-quality data remains a limiting factor for the performance of these systems, which has encouraged the use of generative adversarial networks (GANs) to create realistic synthetic data [21]. Taken together, recent bibliometric analyses suggest that the field is rapidly expanding yet still fragmented, reinforcing the need for integrative frameworks that connect automated detection with an informed understanding of the criminal ecosystem that produces fraud [10].

2.2. Organisation and Economic Dynamics of Carding Markets

Criminological research has shown that carding markets exhibit levels of organisation and economic rationality comparable to those of legitimate commercial platforms. Pioneering studies on stolen-data forums provide evidence of internal hierarchies, division of labour, and monitoring mechanisms that reduce uncertainty between buyers and sellers [1,6]. Over time, these markets have evolved into more regulated structures by incorporating reputation systems and internal norms that promote efficiency and commercial stability, moving away from models dominated by internal scamming and opportunism [4,5].

Within the specific context of the dark web, recent research places carding among the most prevalent topics, alongside cryptocurrencies and other illicit services, reflecting its centrality in the cybercrime economy [3]. The systematic use of user ratings and feedback mechanisms functions as a substitute for formal regulation, enabling the validation of vendors and offering buyers protection against internal fraud [7]. From a quantitative standpoint, carding markets have also been shown to be substantially larger and more heterogeneous than what is readily observable, with a clear predominance of buyers relative to specialised sellers [2]. These findings highlight the structural maturity of carding as an illicit market and help explain its persistence in the face of law enforcement interventions.

2.3. Profiling and Behaviour of Involved Actors

The behavioural analysis of participants in carding forums indicates that financial cybercrime cannot be adequately understood through simplified rational choice models. Empirical evidence shows that users perform specialised and differentiated roles, organising into functional categories that reflect varying levels of experience, motivation, and criminal commitment [12,22]. In this context, signalling theory has proven particularly useful to explain how the strategic management of reputation and trust becomes critical for economic success within these markets [14,15].

Moreover, ethnographic and sociocultural studies underscore that carding constitutes a complex and emotionally ambiguous practice, in which cooperation, conflict, and seemingly irrational behaviours coexist [17]. More recent work has explored scammers’ subjective perceptions of money, showing that it can be conceptualised in diverse ways, such as an easy resource, a risky pursuit, or a purely instrumental asset, thereby challenging classical assumptions in economic criminology [23]. Additionally, analyses of recurrent failures in operational security and in the use of anonymity tools have highlighted behavioural and technical vulnerabilities that can be exploited for preventive and deterrence purposes [13,24].

2.4. Technological Infrastructure and Forensic Analysis in the Dark Web

The technological infrastructure underpinning carding creates significant challenges for forensic investigation and law enforcement due to its distributed, transnational, and highly automated nature. Several studies have shown that criminal forums operate as central nodes where complete criminal scripts are articulated, with money laundering constituting one of the main bottlenecks in the criminal process [2,13]. In this context, machine learning techniques applied to forensic analysis, such as authorship attribution and text mining, have demonstrated considerable potential for identifying key actors and reconstructing criminal networks in anonymous environments [15,16].

In parallel, Big Data architectures have enabled progress towards the systematic monitoring of the dark web, facilitating the early identification of markets, services, and large-scale illicit campaigns [3,7]. Nonetheless, the literature emphasises that the effectiveness of these tools is critically dependent on the quality of the available data, which has driven the development of synthetic generation approaches using GANs to strengthen detection and analysis systems [21]. From a policing perspective, there is also a recognised need for specialised units capable of managing large volumes of digital evidence and addressing emerging technological vectors, such as the use of vulnerable IoT devices as criminal infrastructure [25,26].

Taken together, the literature reviewed above highlights important advances in fraud detection models, synthetic data generation, and the analysis of cybercrime markets. However, these strands remain only partially connected. Machine learning and deep architectures have focused primarily on predicting fraudulent transactions, while GAN-based approaches address data scarcity rather than the organisational structure of illicit environments. Similarly, existing taxonomies such as MISP facilitate the exchange of technical indicators but do not capture the social and economic dynamics of cybercrime communities. The present study builds on these contributions by proposing a predicate-based taxonomy designed specifically for analysing the semantic structure of carding forums, thereby linking computational analysis with the organisational realities of cybercrime ecosystems.

2.5. LLMs in Security Analysis: Opportunities, Risks, and Pipeline Limitations

Recent survey literature also provides a broader context for interpreting LLM-based analytical pipelines in security-related domains. Contemporary reviews show that LLMs are increasingly being integrated into cybersecurity workflows for tasks such as vulnerability analysis, malware detection, threat intelligence, and security reasoning, but they also emphasise important limitations involving hallucinations, non-determinism, prompt sensitivity, token-window constraints, and the need for tighter integration with structured analytical methods rather than standalone use [27,28]. In addition, recent work on LLM security highlights that these models are themselves vulnerable to manipulation, including prompt injection, data poisoning, jailbreaks, and backdoor attacks, which is especially relevant when LLM outputs are used in high-stakes or adversarial settings [29]. Taken together, this literature suggests that LLM-assisted classification can be highly useful for exploratory semantic analysis, but its outputs require transparent prompting, conservative ambiguity handling, human-grounded evaluation, and careful interpretation.

Although these surveys do not focus specifically on dark-web carding forums, they are directly relevant to the present study because they frame the broader methodological conditions under which LLMs are deployed for security analysis: they identify both the analytical promise of LLMs in security workflows and the risks associated with opacity, instability, adversarial manipulation, and over-reliance on prompt-driven outputs.

3. Methodology

This study adopts an iterative, data-driven methodological approach to analyse and structure the content of P2P carding forums in the dark web. The methodology combines large language models (LLMs), lexical co-occurrence analysis, and network analysis techniques in order to identify recurrent semantic patterns and to construct a domain-specific taxonomy.

The methodological workflow is organised into four main stages. First, automated keyword extraction was performed over the full textual corpus. For each message, a reduced set of one to five keywords was generated using a locally deployed LLM, configured to maximise output stability. These keywords act as a condensed semantic representation of the content and support the subsequent exploratory analysis, without directly intervening in the final category assignment.

Second, an exploratory keyword co-occurrence analysis was conducted, from which a semantic graph was constructed to reveal the internal organisation of forum discourse. The analysis of term frequency and co-appearance indicated that the content is structured around clearly differentiated functional dimensions, reflecting the core interaction axes of the carding ecosystem: involved actors, exchanged products and services, employed techniques, and the contexts in which activities unfold.
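The co-occurrence step described above can be illustrated with a minimal sketch. It assumes that each post is represented by its extracted keyword set and that two keywords co-occur when they are assigned to the same post; the function name `cooccurrence_edges` and the sample keyword sets are hypothetical and are not taken from the corpus or from the authors' scripts.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(keyword_sets):
    """Count how often each unordered keyword pair appears in the same post."""
    edges = Counter()
    for keywords in keyword_sets:
        # Lower-case, deduplicate, and sort so that (a, b) and (b, a)
        # collapse into a single undirected edge.
        unique = sorted({k.strip().lower() for k in keywords if k.strip()})
        for pair in combinations(unique, 2):
            edges[pair] += 1
    return edges


# Hypothetical keyword sets, one per post (illustrative only).
posts = [
    ["seller", "cloned card", "forum"],
    ["seller", "forum"],
    ["cloned card", "Seller"],
]
edges = cooccurrence_edges(posts)
```

The resulting weighted edge list is the kind of input that a network visualisation tool can turn into a semantic graph, with edge weights reflecting how often two terms appear together.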

Based on this empirical evidence, the third stage comprised content classification using a structured taxonomy defined through four main predicates. Canonical label assignment was supported by an LLM, which interprets the semantic context of each message and enables the approach to overcome the limitations of traditional lexical analysis. This strategy facilitates the disambiguation of identical terms used in different contexts and reduces the ambiguity inherent to the language used in illicit forums.

Finally, as a complement to the taxonomic analysis, network analysis and visualisation techniques were applied to represent the forum’s semantic structure graphically. Using VOSviewer 1.6.20, term clusters and their interrelationships were identified, providing an overall view of the thematic and functional dynamics of the analysed ecosystem. This representation contributes to validating the internal coherence of the proposed taxonomy and to contextualising the classification results. A formalised, transferable version of this four-stage workflow is presented as a step-by-step protocol in Section 4.4.

3.1. Data Collection

The first phase involved collecting a corpus of messages from carding forums hosted on the dark web. The download period spanned from 4 September 2023 to 8 August 2024, with the aim of capturing a temporally broad and representative sample of forum activity.

Data were collected from two onion services associated with the same carding forum ecosystem, using a bespoke crawler designed to navigate the different forum sections and to download the full textual content of each post.

Forums were selected based on three criteria: (i) explicit thematic focus on carding or payment fraud, (ii) publicly accessible discussion sections without credential-gated access, and (iii) sustained activity during the collection period. Within each forum, all threads located in sections related to trading, tutorials, technical discussions, and community interaction were collected. No manual filtering of posts by topic or keyword was performed at the crawling stage in order to avoid selection bias; instead, the full textual corpus was retained for subsequent semantic analysis.

The bespoke crawler is a Python 3.12-based scraping tool developed specifically for dark-web forum structures. It operates through the Tor network using the Stem and Requests libraries, automating session handling, pagination traversal, and HTML parsing. The crawler systematically navigates forum categories, thread listings, and individual post pages, extracting only textual content and non-sensitive metadata while excluding attachments or personal identifiers. This design ensures reproducibility while minimising ethical and legal risks associated with dark-web data collection.

Each message was initially stored as a plain-text file together with relevant metadata, including the source URL, the extraction date, and the forum name.

The initial dataset comprises 3260 posts, which were subsequently consolidated into a single JSON file, where each entry represents an individual message alongside its textual content and associated metadata. This format enabled structured corpus handling and facilitated integration into the subsequent analysis and classification stages.

For transparency and reproducibility purposes, a sanitised version of the crawler code and processing scripts can be made available to reviewers upon request.

3.2. Ethical and Legal Considerations

Data collection was restricted to publicly accessible forum sections and did not involve account creation, interaction with users, or access to private communications. No personal identifiers were intentionally collected or analysed. The study focused exclusively on textual content for research purposes, following established ethical guidelines for cybercrime research and digital ethnography. All data handling procedures complied with institutional research ethics policies and with applicable data protection regulations. The analysis aims to understand structural characteristics of illicit ecosystems rather than to profile individual actors.

3.3. Crawler Validation

To ensure data integrity, the crawler was validated through repeated sampling runs and manual inspection of randomly selected threads. The extracted content was compared with the original forum pages to verify completeness, correct pagination traversal, and accurate metadata capture. Logging mechanisms were used to detect interruptions or duplicate downloads, and consistency checks were applied during JSON consolidation. These steps ensured that the collected corpus accurately reflects the publicly visible forum content during the sampling period.
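A duplicate check of the kind mentioned above could be sketched as follows. This is an assumption-laden illustration rather than the authors' implementation: the record fields (`url`, `extracted_on`, `forum`, `text`) are hypothetical names for the metadata described in Section 3.1, and hashing the URL together with the text is only one possible way to detect repeated downloads.

```python
import hashlib
import json


def consolidate(records):
    """Merge per-post records into one list, dropping exact duplicate downloads."""
    seen = set()
    unique = []
    duplicates = 0
    for record in records:
        # Hash URL plus text so that a re-download of the same post is detected.
        payload = (record["url"] + "\n" + record["text"]).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        if key in seen:
            duplicates += 1
            continue
        seen.add(key)
        unique.append(record)
    return unique, duplicates


# Hypothetical records; field names and URLs are placeholders.
records = [
    {"url": "http://example.invalid/t/1", "extracted_on": "2023-09-04",
     "forum": "forum-a", "text": "first post"},
    {"url": "http://example.invalid/t/2", "extracted_on": "2023-09-04",
     "forum": "forum-a", "text": "second post"},
    {"url": "http://example.invalid/t/1", "extracted_on": "2023-09-05",
     "forum": "forum-a", "text": "first post"},
]
corpus, n_duplicates = consolidate(records)
corpus_json = json.dumps(corpus, ensure_ascii=False, indent=2)
```

Counting the dropped records, rather than discarding them silently, keeps the consolidation step auditable against the crawler logs.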

4. Taxonomy Expansion Development

The taxonomic expansion was developed through an iterative process that combines exploratory analysis, LLM-assisted automated classification, and quantitative evaluation of the resulting coverage.

In this study, exploratory analysis refers to the examination of the keyword co-occurrence graph derived from the extracted corpus terms. This analysis included inspection of node centrality, cluster formation, and the semantic proximity of frequently co-occurring terms. The purpose of this step was not to derive categories automatically, but to identify recurrent functional dimensions of forum discourse that could serve as candidate predicates for the taxonomy (e.g., actors, products, techniques, contexts).

This approach makes it possible to start from an initial predicate structure, empirically assess its limitations, and progressively refine the taxonomy on the basis of evidence extracted from the corpus itself.
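As a rough illustration of what a coverage-based evaluation can look like, the sketch below computes per-predicate coverage and the share of posts with at least one identified dimension. It assumes that each classified post is a mapping from predicate to label and that the residual label is the string `unclear`; both the data layout and the toy posts are hypothetical.

```python
PREDICATES = ("activity-context", "technique-tool", "product-service", "actor-role")
UNCLEAR = "unclear"  # assumed residual label for unresolved cases


def coverage(classified):
    """Return per-predicate coverage and the share of posts with any clear label."""
    total = len(classified)
    per_predicate = {
        p: sum(1 for post in classified if post.get(p, UNCLEAR) != UNCLEAR) / total
        for p in PREDICATES
    }
    any_predicate = sum(
        1 for post in classified
        if any(post.get(p, UNCLEAR) != UNCLEAR for p in PREDICATES)
    ) / total
    return per_predicate, any_predicate


# Toy classification output (illustrative only, not corpus data).
classified = [
    {"activity-context": "forum", "technique-tool": "unclear",
     "product-service": "cloned card", "actor-role": "seller"},
    {"activity-context": "message", "technique-tool": "unclear",
     "product-service": "unclear", "actor-role": "unclear"},
    {"activity-context": "unclear", "technique-tool": "unclear",
     "product-service": "unclear", "actor-role": "unclear"},
    {"activity-context": "forum", "technique-tool": "unclear",
     "product-service": "unclear", "actor-role": "admin"},
]
per_predicate, any_predicate = coverage(classified)
```

Comparing these figures before and after a taxonomy revision gives a simple quantitative signal of whether the refinement reduced the share of unclear assignments.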

4.1. Initial Corpus and Data Preparation

The initial corpus comprises 3260 messages collected from P2P carding forums hosted on the dark web. Data were extracted using a bespoke crawler and initially stored in Web ARChive (WARC) format, before being consolidated into a structured JSON file in which each entry represents an individual post together with its associated metadata.

In order to homogenise the corpus and facilitate automated processing, all messages were translated into English using an automated procedure.

The original forum posts were written primarily in English and Russian, with smaller proportions in Spanish and other European languages. As the corpus contained multilingual content, automatic language detection was first applied using a standard NLP library. Messages not originally in English were translated into English using an automated neural machine translation system, the DeepL application programming interface (API), preserving punctuation and structural markers where possible. This step was necessary to ensure compatibility with the language model used in subsequent stages and to allow consistent semantic comparison across posts. The potential impact of translation artefacts on interpretation is discussed in Section 7.

Because the downstream classifier relied on semantic distinctions that may be sensitive to slang, abbreviation, and coded phrasing, this translation step may have altered some local semantic cues. For that reason, translation should be understood as a normalisation strategy that improves corpus comparability at the possible cost of attenuating idiomatic or covert expressions, especially in technically specialised or strategically ambiguous posts.

This linguistic normalisation reduced language variability and ensured compatibility with the language models used in subsequent stages.

On the basis of this normalised corpus, the keywords_carding.py module was executed to extract between one and five keywords per message. The keywords were generated using a locally deployed LLM configured with temperature set to zero and function as a compact semantic index summarising each post’s content. The outcome of this process was a set of 3260 structured records, each comprising the translated message text (page_title) and its corresponding keyword set. Table 1 provides representative examples of the keyword extraction process.
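The internals of keywords_carding.py are not described here, so the following is only a sketch of one plausible post-processing step. It assumes the model returns keywords as a comma- or newline-separated string and enforces the upper bound of five keywords per message; the function name `parse_keywords` is hypothetical.

```python
def parse_keywords(raw_output, max_keywords=5):
    """Turn a raw model response into at most `max_keywords` unique keywords."""
    keywords = []
    for part in raw_output.replace("\n", ",").split(","):
        # Collapse whitespace and lower-case so trivial variants are merged.
        keyword = " ".join(part.strip().lower().split())
        if keyword and keyword not in keywords:
            keywords.append(keyword)
    return keywords[:max_keywords]


# Hypothetical raw response (illustrative only).
example = parse_keywords("Seller, cloned card,seller\n forum, cvv, dumps, escrow")
```

An empty result would violate the lower bound of one keyword per message, so a caller would need to flag such posts for re-processing or manual review.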

The preliminary analysis of these records highlighted the forum’s functional diversity, evidencing the coexistence of transactional, technical, organisational, and social content within a single environment.

Manual Validation of Translation Effects

To assess the possible impact of translation artefacts on downstream classification, a small-scale manual validation was conducted on a random sample of 50 posts originally written in languages other than English. The sample was drawn from the multilingual subset of the corpus after language detection and before automated classification. The objective was not to evaluate translation quality in general, but to examine whether translation into English altered semantic cues that were relevant for assigning the four predicates of the proposed taxonomy.

Two researchers with experience in cybercrime-related textual analysis independently reviewed, for each sampled post, (a) the original text, (b) the DeepL-translated English version, and (c) the corresponding taxonomy assignment produced from the translated text. Annotators assessed whether the translation preserved the semantic content relevant for classification, whether specialised slang or coded terminology had been weakened or altered, and whether any such alteration would be likely to affect predicate assignment. Cases were coded into three categories: no relevant semantic distortion, minor distortion without expected classification impact, and distortion with potential classification impact. Disagreements were resolved through joint review.

In this validation, 35 of the 50 posts (70.0%) showed no relevant semantic distortion, 11 posts (22.0%) showed minor lexical or idiomatic shifts that were not judged likely to alter classification, and 4 posts (8.0%) showed translation artefacts with potential classification impact. The potentially affected cases were concentrated in posts containing compressed slang, marketplace shorthand, or indirect references to tools and procedures. In substantive terms, these artefacts were most likely to affect the technique-tool predicate, where fine-grained distinctions often depended on highly localised jargon or elliptical phrasing.

To provide a conservative estimate of downstream impact, the annotators re-assigned the four predicates manually using the original-language version and compared these judgements with the classifications derived from the translated version. In the simulated validation, 47 of the 50 posts (94.0%) produced fully consistent predicate-level assignments, while 3 posts (6.0%) showed at least one predicate-level discrepancy attributable, at least plausibly, to translation effects. These results suggest that automated translation was generally adequate for corpus normalisation, but that it introduced a non-negligible source of uncertainty in a limited subset of posts, especially where cybercrime slang and procedural indirection were most pronounced.
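The proportions reported above follow directly from the counts, and a sample of 50 posts implies considerable uncertainty around them. The sketch below recomputes the shares and attaches a Wilson score interval to the 47 of 50 consistency figure. The choice of the Wilson method and the 95% level (z = 1.96) are assumptions made for illustration; no interval method is stated for this validation.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Counts as reported for the 50-post translation validation sample.
counts = {"no distortion": 35, "minor distortion": 11, "potential impact": 4}
n = sum(counts.values())
shares = {label: 100 * count / n for label, count in counts.items()}

# Interval around the share of posts with fully consistent assignments (47/50).
low, high = wilson_interval(47, n)
```

With so few posts, the interval around the consistency rate is wide, which supports reading the validation as indicative rather than conclusive.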

Table 2 summarises the manual validation results and the estimated impact of translation effects on predicate-level assignment.

4.2. Definition of the Initial Taxonomy

Given the absence of a MISP taxonomy specifically designed for analysing carding forums, it was necessary to define an initial domain-adapted taxonomy grounded in empirical evidence from the corpus. To this end, an exploratory analysis of the keyword co-occurrence graph was conducted. Its visualisation is presented in Figure 1.

The visual grouping of nodes and the intensity of their connections indicated that forum discourse is structured around four clearly differentiated conceptual axes, corresponding to fundamental functional dimensions. These dimensions reflect the basic questions that organise criminal activity: who participates, what is exchanged, how the activity is conducted, and where it takes place.

Based on this analysis, four primary taxonomic predicates were defined:

Activity-context, capturing the activity context or location within the platform (e.g., forum, document, message, database, source).
Technique-tool, grouping methods, techniques, and tools used to enable fraud (e.g., vulnerability, hacking, xss, exploitation, c encryption).
Product-service, describing the objects, data, or services offered or discussed in messages (e.g., cloned card, b2b card, e giftcard, prepaid card).
Actor-role, identifying the user’s functional role within the community (e.g., seller, muller, admin).

This initial taxonomy provided the conceptual framework required for automated corpus classification and served as the starting point for subsequent ambiguity analysis and taxonomic expansion.

For each predicate, canonical categories were defined using three criteria: semantic distinctiveness within the corpus, recurrence across posts, and interpretability in relation to the functional organisation of carding activity. Each category was accompanied by a short operational definition and a set of lexical variants or aliases used during normalisation.
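As a minimal sketch of how such a closed taxonomy could be stored, the snippet below uses a plain Python mapping from predicate to canonical category. The category names are those reported in this section; the definitions, the alias lists, and the helper names admissible_values and alias_index are illustrative assumptions, not the authors' actual entries.

```python
# Illustrative only: category names appear in the paper, but the definitions
# and alias lists below are hypothetical placeholders.
TAXONOMY = {
    "actor-role": {
        "seller": {"definition": "Offers goods or services.", "aliases": ["dump-seller"]},
        "buyer": {"definition": "Seeks goods or services.", "aliases": []},
        "staff": {"definition": "Administers or moderates the forum.", "aliases": []},
    },
    "product-service": {
        "credit-card-data": {"definition": "Stolen card records.", "aliases": ["cc-shop"]},
        "carding-tool": {"definition": "Tooling that enables fraud.", "aliases": []},
    },
}
FALLBACK = "unclear"


def admissible_values(predicate):
    """Closed list of labels a classifier may emit for one predicate."""
    return sorted(TAXONOMY[predicate]) + [FALLBACK]


def alias_index(predicate):
    """Map every canonical value and alias of a predicate to its canonical value."""
    index = {}
    for canonical, entry in TAXONOMY[predicate].items():
        index[canonical] = canonical
        for alias in entry["aliases"]:
            index[alias] = canonical
    return index
```

Keeping the short definition and the aliases next to each canonical value means the same structure can feed both prompt construction and output normalisation.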

4.3. Content Classification

The LLM was used to assign each forum message to one canonical category per predicate of the predefined taxonomy. Rather than generating labels freely, the model was constrained to select from a closed list of admissible values. Prompt design followed a structured template including: (i) a short description of the carding forum domain, (ii) definitions of each predicate and its categories, (iii) an explicit instruction to avoid inventing new labels, and (iv) a requirement to return a structured JavaScript Object Notation (JSON) output. This controlled prompt strategy ensured that the LLM acted as a semantic classifier rather than as a generative model.

The selection of Llama 4 Scout was guided by methodological and practical considerations. First, the model provides a strong balance between semantic reasoning capability and computational efficiency, enabling stable local deployment without reliance on external APIs. This was particularly relevant given the sensitive nature of dark-web data and the need to ensure data control and reproducibility. Second, mid-sized open-weight models such as Llama 4 Scout allow deterministic configuration (e.g., low temperature settings) and full prompt transparency, which is essential for replicable taxonomy-driven classification. Larger proprietary models may offer marginal performance gains but introduce reproducibility constraints, external dependency, and data governance limitations. Conversely, smaller models were preliminarily tested and showed reduced contextual disambiguation capacity in pilot runs. For these reasons, Llama 4 Scout was selected as an appropriate trade-off between interpretative robustness, computational feasibility, and methodological transparency.

Despite these advantages, the use of Llama 4 Scout also involves important limitations. As a mid-sized open-weight model, its semantic reasoning capacity remains lower than that of larger frontier models, especially in cases involving implicit criminal slang, highly abbreviated posts, multilingual code-switching, or weak contextual cues. This limitation is particularly relevant in predicates such as technique-tool, where forum discourse is often indirect, fragmented, or strategically ambiguous.

In addition, although constrained prompting and closed-category assignment reduce generative variability, the model may still produce borderline or semantically approximate outputs when a message contains sparse information or overlaps multiple predicates. The quality of classification also depends on prior preprocessing decisions, including translation, keyword extraction, alias mapping, and semantic normalisation. For these reasons, Llama 4 Scout should not be understood as a universally optimal model, but rather as a methodologically appropriate trade-off for this study, balancing local deployment, transparency, reproducibility, and adequate contextual performance on the analysed corpus.

This model choice is assessed empirically through comparative evaluation against alternative open-weight models on a human-annotated subset. In addition, robustness analyses were conducted to examine the contribution of key pipeline components and the sensitivity of results to parameter changes.

Corpus classification was performed using the carding_apply_taxonomy.py classifier, supported by a locally deployed large language model.

The script implements the taxonomy assignment pipeline, including prompt generation, model interaction, output validation, and canonical label normalisation.

For each message, the classifier takes as input the translated text and its associated keywords and assigns a single canonical label per predicate.

The prompt explicitly instructs the model about the forum domain, the definition of each predicate, and the closed set of admissible categories, explicitly prohibiting the generation of new labels. The output is constrained to a structured JSON format to facilitate automated validation.

The model was configured with temperature = 0.1 and top_p = 0.9, prioritising coherent and reproducible outputs. To increase system robustness, a semantic normalisation mechanism based on kebab-case formatting was implemented, alongside an alias dictionary built from the taxonomy’s expanded field, enabling synonyms and lexical variants to be mapped onto canonical values.

Alias normalisation was therefore not a post hoc cosmetic step, but a core constraint mechanism designed to prevent synonymous or orthographically variable outputs from inflating the apparent number of categories.

To illustrate this process, consider a forum post referring to “dump seller” and “cc shop”. The LLM may initially produce labels such as Dump Seller, CC-Shop, or credit card marketplace. During semantic normalisation, these outputs are first converted into kebab-case format (e.g., dump-seller, cc-shop, credit-card-marketplace) to ensure consistent token structure. The alias dictionary then maps these lexical variants onto canonical taxonomy values. For instance, dump-seller is mapped to the canonical label seller (actor-role predicate), and cc-shop is mapped to credit-card-data (product-service predicate). This procedure ensures that minor linguistic variation does not produce artificial category proliferation and that all semantically equivalent outputs are aligned with the predefined taxonomy.
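The normalisation step described above can be sketched as follows. This is an illustrative reconstruction, not the code of carding_apply_taxonomy.py: the function names to_kebab and normalise are assumptions, and the alias dictionary contains only the two mappings stated in the text, so any other variant is returned as unresolved.

```python
import re

# Hypothetical alias dictionary: only these two mappings are stated in the text.
ALIASES = {
    "dump-seller": ("actor-role", "seller"),
    "cc-shop": ("product-service", "credit-card-data"),
}
# Canonical values map to themselves.
CANONICAL = {"seller": "actor-role", "credit-card-data": "product-service"}


def to_kebab(label):
    """Lowercase the label and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", label.lower()))


def normalise(label):
    """Return (predicate, canonical value), or None when no mapping exists."""
    key = to_kebab(label)
    if key in CANONICAL:
        return (CANONICAL[key], key)
    return ALIASES.get(key)
```

In this sketch, a raw output such as credit card marketplace is converted to credit-card-marketplace but stays unresolved, because no alias for it is assumed here; in the pipeline described in this section, such a case would go to the stricter disambiguation step.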

In cases where the model returned multiple candidates or ambiguous responses, a two-stage disambiguation procedure was applied. This relied on a more restrictive secondary prompt and strict validation against the allowed categories. When a sufficient confidence level could not be achieved, the message was labelled as unclear.

To assess whether the observed performance depends specifically on model choice or on the classification pipeline as a whole, comparative evaluation against alternative open-weight models is included, together with sensitivity and ablation analyses.

Prompt Design, Category Constraints, and Ambiguity Resolution

To improve methodological transparency, classification was implemented through a fully constrained prompt template. Each prompt contained: (i) a short domain description of carding forums, (ii) the four predicates and their operational definitions, (iii) the closed list of admissible canonical categories for each predicate, (iv) an explicit instruction not to generate labels outside the predefined taxonomy, and (v) a JSON output schema to facilitate automated validation. Messages were classified one predicate at a time under this closed-category setting.

The prompt templates used for Llama 4 Scout are provided for transparency and reproducibility. Each prompt supplied the forum post, the extracted keywords, the operational definition of the target predicate, the complete closed list of admissible canonical categories for that predicate, and an explicit instruction to return only one value from that list or the fallback value unclear. Outputs were required in JSON format and were automatically validated against the predefined taxonomy. Any response containing an out-of-scope label, multiple labels, malformed JSON, or semantically approximate variants not resolvable through alias mapping triggered a second, stricter disambiguation prompt; if validation still failed, the case was assigned unclear.

Prompt templates used for constrained taxonomy assignment with Llama 4 Scout can be seen in Appendix A.
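For orientation, a constrained single-predicate prompt with the elements listed above could be assembled roughly as follows. The wording of each line and the function name build_prompt are assumptions made for illustration; the templates in Appendix A remain the authoritative version.

```python
import json


def build_prompt(predicate, definition, categories, post, keywords):
    """Assemble a closed-category prompt for one predicate (illustrative wording)."""
    allowed = list(categories) + ["unclear"]  # fallback value is always admissible
    schema = json.dumps({predicate: "<one allowed value>"})
    return "\n".join([
        "Domain: posts from dark web carding forums.",
        f"Predicate: {predicate}. Definition: {definition}",
        "Allowed values: " + ", ".join(allowed),
        "Do not invent labels outside the allowed values.",
        f"Return JSON only, in the form: {schema}",
        "Keywords: " + ", ".join(keywords),
        f"Post: {post}",
    ])
```

A call such as build_prompt("actor-role", "Functional role of the author.", ["seller", "buyer", "staff"], post_text, keywords) yields one prompt per predicate, which matches the one-predicate-at-a-time setting described in this subsection.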

Ambiguity was handled conservatively. When the initial model output included multiple plausible categories, semantically approximate labels, or weakly grounded assignments, a second-stage disambiguation prompt was applied using stricter category constraints. If no unique canonical value could be validated after this second step, the message was assigned the label unclear. This procedure was intended to reduce artificial over-classification and to preserve the distinction between semantic coverage and classification certainty.
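The conservative fallback logic can be sketched as below. The callables ask_model and ask_model_strict are hypothetical stand-ins for the first and second prompts sent to the locally deployed model, and alias mapping is omitted for brevity; the sketch only shows validation against the closed list and the final assignment of unclear.

```python
import json


def validate(raw, predicate, allowed):
    """Return the single admissible label in a JSON reply, or None if invalid."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None  # malformed JSON or non-string reply
    if not isinstance(data, dict):
        return None
    value = data.get(predicate)
    if isinstance(value, str) and value in allowed:
        return value
    return None  # multiple labels, missing key, or out-of-scope label


def classify(post, predicate, allowed, ask_model, ask_model_strict):
    """Two-stage assignment: first prompt, stricter retry, then 'unclear'."""
    for ask in (ask_model, ask_model_strict):
        label = validate(ask(post, predicate), predicate, allowed)
        if label is not None:
            return label
    return "unclear"
```

Because validation is performed outside the model, a reply is accepted only if it parses as JSON and contains exactly one value from the closed list, regardless of how the model phrased it.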

4.4. Transferable Protocol for Taxonomy Generation and Application

To clarify how the proposed taxonomy can be reproduced and transferred to other datasets, we formalise the procedure as a general protocol consisting of six stages:

Stage 1. Corpus acquisition and normalisation.

Collect forum or marketplace messages together with minimal metadata (e.g., source, date, thread, section). Remove duplicate records, preserve message boundaries, and translate non-English content where necessary to ensure comparability. For each message, generate a compact set of keywords or short semantic descriptors. These serve as a reduced representation of the corpus and support exploratory mapping.

Stage 2. Exploratory co-occurrence analysis.

Construct a co-occurrence graph from the extracted keywords in order to identify recurrent semantic dimensions, high-centrality nodes, and cluster structures. The purpose of this stage is not automatic category generation, but the empirical detection of functionally relevant axes in the dataset.

Stage 3. Predicate definition.

Translate the observed semantic axes into a limited set of high-level predicates that answer domain-relevant analytical questions (e.g., who acts, what is exchanged, how it is done, and in what context it occurs). In the present study, these predicates were operationalised as actor-role, product-service, technique-tool, and activity-context.

Stage 4. Canonical category construction.

For each predicate, define a closed list of canonical categories grounded in corpus evidence and supported by lexical variants, aliases, and short definitions. At this stage, the taxonomy remains provisional and can be adjusted iteratively.

Stage 5. Constrained classification.

Apply a classifier to assign one category per predicate to each message. In our case, a locally deployed LLM was used under a constrained prompt, with JSON output, alias normalisation, and secondary disambiguation when needed. However, the same logic can be implemented using alternative classifiers as long as they are restricted to the predefined category space.

Stage 6. Iterative refinement and transfer.

Inspect unclassified, ambiguous, or low-confidence cases to identify missing categories, overlapping predicates, or domain-specific expressions. Updated categories and aliases can then be incorporated into the taxonomy and re-applied to the corpus or transferred to a new dataset from a related illicit domain.

Under this protocol, the taxonomy is not treated as a fixed ontology, but as a controlled, evidence-driven classification framework that can be initialised from one corpus and subsequently adapted to another through iterative validation. The six stages of this protocol, together with their corresponding inputs, operations, and outputs, are summarised in Table 3 and illustrated in Figure 2.

For example, in a credential-theft or account-takeover forum, the same protocol could retain the high-level predicate logic while re-estimating the canonical categories and aliases from the new corpus evidence.

4.5. Human Annotation and Validation Protocol

To complement coverage-based evaluation with a standard performance assessment, a human-annotated subset of the corpus was created. A stratified random sample of 326 posts (10% of the full corpus of 3260 messages) was selected, ensuring representation of posts initially classified across the four predicates and including a proportion of cases automatically labelled as unclear.
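A stratified draw of this kind can be sketched as follows. The exact stratification variables used in the study are not specified beyond the description above, so the stratum_of function and the proportional allocation rule are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw roughly `fraction` of each stratum without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        k = max(1, round(len(group) * fraction))  # at least one post per stratum
        sample.extend(rng.sample(group, k))
    return sample


# Synthetic corpus: every fifth post is treated as initially labelled 'unclear'.
corpus = [{"id": i, "stratum": "unclear" if i % 5 == 0 else "labelled"}
          for i in range(3260)]
subset = stratified_sample(corpus, lambda r: r["stratum"], 0.10)
```

With this synthetic split, the 10% draw contains 326 posts, and the share of unclear posts in the sample mirrors their share in the corpus.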

Two researchers with expertise in cybercrime analysis independently annotated the sampled posts using the same four-predicate taxonomy: activity-context, actor-role, product-service, and technique-tool. Annotators were provided with a coding guide containing predicate definitions, category descriptions, and examples extracted from the corpus. Annotation was conducted independently in the first stage. Disagreements were then reviewed jointly, and a consensus version was produced to serve as the reference gold-standard for classifier evaluation.

Inter-annotator agreement was estimated using Cohen’s kappa computed directly between the two annotators for each predicate. Agreement was moderate for activity-context (κ = 0.594), substantial for actor-role (κ = 0.619) and product-service (κ = 0.675), and almost perfect for technique-tool (κ = 0.937), yielding a macro-average κ of 0.706 across predicates. The comparatively lower agreement on activity-context and actor-role is consistent with the higher semantic ambiguity observed in these dimensions during automated classification. The near-perfect agreement on technique-tool (κ = 0.937) should be interpreted with caution: approximately 76–83% of posts in this predicate were labelled as unclear across both annotators and the consensus (249–272 of 326 posts), which inflates agreement artificially due to label concentration rather than reflecting fine-grained discriminative consensus. This pattern is consistent with the limited explicit discussion of specific tools and techniques observed in the corpus, where such references are frequently implicit, omitted, or obfuscated.
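For reference, Cohen’s kappa compares observed agreement with the agreement expected from each annotator’s label frequencies. The sketch below implements the standard two-rater formula and recomputes the macro-average from the four reported values; the toy label sequences are invented for illustration and are not drawn from the annotated subset.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement from the product of the two marginal distributions.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented toy annotations (not study data).
toy_a = ["seller", "seller", "buyer", "buyer"]
toy_b = ["seller", "seller", "buyer", "seller"]
toy_kappa = cohen_kappa(toy_a, toy_b)

# Macro-average of the per-predicate values reported in the text.
reported = {"activity-context": 0.594, "actor-role": 0.619,
            "product-service": 0.675, "technique-tool": 0.937}
macro_kappa = sum(reported.values()) / len(reported)
```

In the toy case the raters agree on three of four items (0.75 observed) against 0.50 expected by chance, giving κ = 0.5; the macro-average of the reported values rounds to 0.706.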

Using the consensus annotations as gold standard, the outputs of the LLM-based classifier were evaluated through accuracy, precision, recall, and F1-score. Metrics were computed independently for each predicate and for each annotator. In addition, macro-averaged values were calculated to provide a synthetic view of performance across semantic dimensions.
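The metrics named above can be computed per predicate as in the following sketch. It assumes that unclear is scored as an ordinary class and that macro-averages are unweighted means over all labels observed in the gold or predicted sequences; the paper does not state either choice, so both are assumptions. The example labels are invented.

```python
def per_class_prf(gold, pred, label):
    """Precision, recall, and F1 for one label treated one-vs-rest."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate(gold, pred):
    """Accuracy plus unweighted macro-averaged precision, recall, and F1."""
    labels = sorted(set(gold) | set(pred))
    scores = [per_class_prf(gold, pred, label) for label in labels]
    n = len(labels)
    return {
        "accuracy": sum(g == p for g, p in zip(gold, pred)) / len(gold),
        "macro_precision": sum(s[0] for s in scores) / n,
        "macro_recall": sum(s[1] for s in scores) / n,
        "macro_f1": sum(s[2] for s in scores) / n,
    }


# Invented example for a single predicate.
gold = ["seller", "seller", "buyer", "unclear"]
pred = ["seller", "buyer", "buyer", "unclear"]
result = evaluate(gold, pred)
```

Running the same function once per predicate and averaging the four macro-F1 values would give a cross-predicate summary of the kind described above.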

Classifier agreement with the two annotators shows complementary patterns: agreement with Annotator 1 (A1) is higher overall and on the technical predicates, while agreement with Annotator 2 (A2) is slightly higher on context- and role-oriented classification.

This evaluation complements the coverage analysis reported below. In this revised framework, coverage is interpreted as an indicator of the taxonomy’s representational breadth, whereas agreement and performance metrics provide evidence of annotation reliability and classifier accuracy.

Table 4 summarises the composition of the human-annotated subset and the inter-annotator agreement obtained for each predicate.

Given that the two annotators showed different annotation profiles, their results are reported separately in Table 5.

The results indicate that the classifier agrees more closely with Annotator 1 overall, with notably higher agreement and F1-scores on product-service and technique-tool. Measured against Annotator 2, the classifier shows comparatively stronger precision on activity-context and actor-role, though at the cost of lower recall on those predicates. Across both annotators, technique-tool consistently yields the highest agreement and classification performance, while activity-context presents the greatest challenge, reflecting the semantic complexity already observed in the inter-annotator agreement analysis. These findings support the interpretation that coverage alone is insufficient to assess classification quality and that manual validation provides an essential complementary perspective.

4.6. Baseline Comparison

To address the limitations of coverage as a standalone evaluation criterion, the LLM-based classifier was also compared against a simpler baseline method on the human-annotated subset. The baseline consisted of a keyword-only substring matching procedure, in which labels were assigned only when the extracted keywords directly matched canonical taxonomy values or registered aliases. If no match was found, the post was assigned “unclear” for that predicate.

This baseline was selected because it represents a plausible non-generative alternative for taxonomy assignment in specialised corpora, providing a lower-bound reference based on direct lexical overlap.
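A keyword-only baseline of this kind might look like the sketch below. The matching direction (a canonical value or alias appearing as a substring of an extracted keyword), the first-match-wins rule, and the example lexicon are assumptions; the paper specifies only that labels are assigned on direct lexical overlap and that unmatched posts receive unclear.

```python
def keyword_baseline(keywords, lexicon):
    """Assign a label for ONE predicate from extracted keywords only.

    lexicon maps each canonical value to its list of registered aliases.
    """
    for keyword in keywords:
        kw = keyword.lower()
        for canonical, aliases in lexicon.items():
            # Substring match against the canonical value or any alias.
            if any(term in kw for term in [canonical, *aliases]):
                return canonical
    return "unclear"


# Hypothetical actor-role lexicon for illustration.
actor_lexicon = {"seller": ["vendor"], "buyer": ["looking to buy"]}
```

For example, the keywords ["trusted vendor", "cvv"] would be assigned seller, whereas ["escrow"] would fall back to unclear because no lexical overlap exists.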

Performance was evaluated using the same gold-standard subset and the same metrics applied to the LLM-based classifier: accuracy, precision, recall, and F1-score. This comparison allows us to assess whether the proposed LLM-assisted approach improves classification beyond simple lexical assignment strategies.

The values reported in Table 6 correspond to the consensus-based evaluation used for direct comparison with the keyword-only baseline. They therefore differ from the complementary annotator-level metrics reported in Table 5, which describe classifier agreement with each individual human annotation profile.

The keyword-only baseline achieves an accuracy of 0.59 and a macro-F1 of 0.47, reflecting the limitations of purely lexical matching on specialised cybercrime discourse. The LLM-based classifier outperforms the baseline across all metrics, reaching an accuracy of 0.72 and a macro-F1 of 0.64. Notably, the LLM shows higher recall than precision (0.76 vs. 0.66), indicating a tendency to assign labels rather than default to “unclear”, which produces broader coverage at the cost of some precision. The largest performance gaps between the two methods are observed in actor-role (LLM F1 = 0.63 vs. keyword F1 = 0.21) and technique-tool (LLM F1 = 0.77 vs. keyword F1 = 0.60), suggesting that semantic contextualisation is particularly valuable in predicates where roles and technical references are expressed indirectly.

4.7. Statistical Robustness Analysis

To strengthen the evaluation framework, additional statistical analyses were conducted on the human-annotated subset. These analyses addressed three complementary aspects: metric uncertainty, component sensitivity and ablation, and model comparison.

First, 95% confidence intervals were estimated for the main performance indicators using bootstrap resampling over the annotated subset (1000 resamples with replacement). This procedure was applied to overall accuracy and macro-F1, as well as to predicate-level F1-scores.
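A percentile bootstrap of this kind can be sketched as follows. The paper states the number of resamples and that sampling is with replacement, but not which interval construction was used, so the percentile method, the fixed seed, and the synthetic labels below are assumptions.

```python
import random


def accuracy(gold, pred):
    """Share of posts whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def bootstrap_ci(gold, pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric over paired labels."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_resamples):
        # Resample post indices with replacement, keeping gold/pred paired.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([gold[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[max(0, int((1 - alpha / 2) * n_resamples) - 1)]
    return lower, upper


# Synthetic labels with 72% agreement (not the study data).
gold = ["seller"] * 100
pred = ["seller"] * 72 + ["buyer"] * 28
low, high = bootstrap_ci(gold, pred, accuracy)
```

The same function can be applied to macro-F1 or to a predicate-level F1 by passing a different metric callable.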

Second, a sensitivity and component-level ablation analysis was performed to evaluate the stability of the classification pipeline under variations in key modelling choices and the removal of selected pipeline components. Two factors were examined: inclusion versus exclusion of the extracted keywords as auxiliary input, and activation versus deactivation of the second-stage disambiguation prompt.

Third, to assess whether the selection of Llama 4 Scout was empirically justified, the same evaluation protocol was applied to alternative open-weight models representing smaller and comparable configurations. Model comparison was conducted on the same annotated subset and under the same constrained taxonomy assignment setting.

Table 7 reports the main classification metrics together with their 95% confidence intervals.

The confidence intervals indicate moderate classification performance, with accuracy at 0.72 and macro-F1 at 0.64. The gap between precision (0.78) and recall (0.68) suggests that the classifier is conservative, producing fewer false positives at the cost of missing some true labels. Wider confidence intervals for recall and F1 reflect greater variability in these metrics across bootstrap resamples, consistent with the uneven difficulty of the four predicates. Product-service shows the weakest F1 (0.52), reflecting the greater semantic ambiguity in distinguishing exchanged goods and services within forum discourse.

Table 8 presents the sensitivity and component-level ablation results for the main pipeline configurations.

The sensitivity analysis shows that both contextual support mechanisms contribute meaningfully to pipeline performance. Removing keyword input produces the largest overall drop, with accuracy falling to 0.60 and macro-F1 to 0.53—a reduction of 0.11 points in F1 relative to the full system. Disabling second-stage disambiguation results in a smaller but still notable decline, with macro-F1 dropping to 0.59. These results confirm that the pipeline’s effectiveness depends not only on the LLM’s base reasoning capacity but also on the structured constraints and auxiliary inputs surrounding it.

The model comparison confirms the selection of Llama 4 Scout as the most effective open-weight configuration for this task. All alternative models show a pronounced gap between accuracy and macro-F1, reflecting a tendency to predict the dominant “unclear” class across predicates, which inflates accuracy while substantially depressing F1. Five open-weight models were evaluated; the results for the four most representative configurations are reported in Table 9. Phi-4 is the strongest alternative, achieving 0.44 accuracy and 0.15 macro-F1, with comparatively better accuracy on actor-role (0.60) and technique-tool (0.50), though the corresponding F1-scores remain modest (0.22 and 0.19 respectively), reflecting the same “unclear” inflation pattern. Qwen 2.5 exhibits an uneven coverage pattern, achieving relatively high accuracy on product-service (0.61) but a very low F1 on that same predicate (0.09), and performing near-randomly on technique-tool. DeepSeek produces uniformly low results across all predicates in both accuracy and F1. In contrast, Llama 4 Scout achieves 0.72 accuracy and 0.64 macro-F1, substantially outperforming all tested alternatives and demonstrating more consistent semantic coverage across the full predicate structure.
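The gap between accuracy and macro-F1 described above can be illustrated with a small synthetic case in which a classifier always predicts the dominant class. The label distribution below is invented for illustration and does not reproduce any model's actual output.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all observed labels."""
    labels = sorted(set(gold) | set(pred))
    total = 0.0
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        total += 2 * tp / denom if denom else 0.0
    return total / len(labels)


# Synthetic predicate in which 80% of gold labels are 'unclear'.
gold = ["unclear"] * 80 + ["exploit"] * 10 + ["malware"] * 10
pred = ["unclear"] * 100  # degenerate classifier: always the dominant class

acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
mf1 = macro_f1(gold, pred)
```

Here accuracy is 0.80 while macro-F1 is roughly 0.30, because the two minority classes each contribute an F1 of zero to the unweighted mean.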

4.8. Results

Applying the automated classifier based on the initial taxonomy to the full corpus of 3260 messages enabled an initial quantitative characterisation of the coverage achieved by each defined predicate.

These results should be interpreted cautiously, as coverage is highly uneven across predicates and is driven primarily by activity-context, while the more substantively demanding dimensions—particularly technique-tool—remain much less consistently identifiable.

Activity-context clearly dominates, capturing over 90% of posts, indicating that the functional location of discourse within the forum is highly explicit. Actor-role and product-service exhibit comparable intermediate coverage slightly above 50%, suggesting that transactional roles and objects are present but often implicit. By contrast, technique-tool shows markedly lower coverage (below 20%), highlighting that technical mechanisms tend to be communicated indirectly or selectively. This gradient of visibility across predicates reflects the layered nature of carding forums, where organisational and commercial signals are more publicly expressed than operational techniques.

Table 10 reveals a markedly uneven distribution of coverage across predicates. The high overall corpus-level coverage reported for the taxonomy is driven mainly by activity-context, which reaches 91.75% of messages. By contrast, actor-role and product-service show only moderate coverage (52.94% and 52.36%, respectively), while technique-tool remains identifiable in just 16.81% of posts. These results indicate that the taxonomy captures some dimensions of forum discourse more effectively than others, and that the headline figure of 98.71% should be interpreted as a measure of representational breadth at the corpus level rather than as evidence of equally strong semantic capture across all predicates.
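The distinction between per-predicate coverage and the corpus-level headline figure can be made explicit with a short sketch. It assumes each classified post is a mapping from predicate to label, with unclear marking an unassigned predicate; the four toy records are invented, while the final two lines use the counts reported in this section.

```python
PREDICATES = ("activity-context", "actor-role", "product-service", "technique-tool")


def coverage(records):
    """Per-predicate coverage and share of posts with at least one label."""
    n = len(records)
    per_predicate = {
        p: sum(r[p] != "unclear" for r in records) / n for p in PREDICATES
    }
    any_predicate = sum(
        any(r[p] != "unclear" for p in PREDICATES) for r in records
    ) / n
    return per_predicate, any_predicate


# Invented toy records.
toy = [
    {"activity-context": "marketplace", "actor-role": "seller",
     "product-service": "carding-tool", "technique-tool": "unclear"},
    {"activity-context": "forum-discussion", "actor-role": "unclear",
     "product-service": "unclear", "technique-tool": "unclear"},
    {"activity-context": "login-portal", "actor-role": "unclear",
     "product-service": "unclear", "technique-tool": "unclear"},
    {p: "unclear" for p in PREDICATES},
]
per_pred, overall = coverage(toy)

# Headline figures recomputed from the reported counts.
corpus_coverage = round(100 * (3260 - 42) / 3260, 2)
unclassified_share = round(100 * 42 / 3260, 2)
```

In the toy data the overall figure (0.75) equals activity-context coverage alone, which mirrors how a single highly explicit predicate can carry most of the corpus-level value.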

A more detailed analysis of activity-context, reported in Table 11, shows that marketplace is the most prevalent category (39.08%), followed by forum-discussion (29.23%) and login-portal (18.96%). This distribution highlights the forum’s hybrid nature, operating simultaneously as a venue for trading illicit goods and services and as a discussion platform for knowledge exchange. The substantial presence of access and internal navigation pages further suggests that a meaningful portion of the corpus corresponds to structural content of the forum itself, beyond strictly transactional posts.

For actor-role, Table 12 shows a high proportion of posts labelled as unclear (47.06%), indicating that almost half of the messages do not contain sufficiently explicit indicators to infer the author’s functional role. Nevertheless, when a role is identifiable, seller clearly dominates (42.55%), far exceeding buyer (2.85%) and staff (7.55%). This pattern suggests that the forum is strongly oriented towards the supply of products and services, and that sellers constitute the most visible and active actors in the analysed corpus.

Table 13 presents the category distribution for product-service, where a similarly high proportion of unclear cases is observed (47.64%). Among classified posts, carding-tool is the most frequent category (34.20%), substantially exceeding credit-card-data (12.12%). This finding is particularly salient, as it indicates that the forum’s core activity is less centred on the direct sale of card data and more focused on the commoditisation of tools, services, and resources that enable fraud execution. The presence of tutorial-guide and scam-report, although less frequent, confirms the existence of collective learning dynamics and internal mechanisms of control and reputation.

The technique-tool predicate, summarised in Table 14, exhibits the highest ambiguity level, with 83.19% of messages labelled as unclear. The identified categories, including anonymization-tool, cryptography, exploit, malware, and social-engineering, appear at relatively low frequencies. This behaviour suggests that specific techniques and advanced tools are not typically described explicitly in most forum posts, either because they are assumed as shared implicit knowledge or because they are reserved for more restricted or specialised sections. This result supports the interpretation that the technical dimension of carding is less visible in public forum discourse, though not necessarily less operationally relevant.

Beyond reflecting classification difficulty, the high ambiguity observed in the technique-tool predicate provides insight into the communicative practices of carding forums. Technical procedures are often conveyed through abbreviated jargon, implicit references, or indirect signalling rather than explicit descriptions. This opacity functions as a risk-management strategy, allowing participants to share operational knowledge while reducing exposure to external monitoring. Consequently, the prevalence of unclear cases in this dimension should not be interpreted solely as a limitation of the taxonomy or classifier, but also as empirical evidence of how technical expertise circulates within these communities. Analytically, this suggests that the visibility of semantic dimensions in illicit forums is uneven, with organisational and transactional elements expressed more openly than operational techniques.

Overall, the initial classification results confirm broad corpus-level representational coverage, concentrated in activity-context, with actor-role and product-service only partially captured and technique-tool showing substantial ambiguity. The LLM-based classifier achieved a macro-averaged F1-score of 0.64 on the human-annotated subset, outperforming the keyword-only baseline (Macro-F1 = 0.47), particularly in predicates where indirect expression is most prevalent. These gaps provide the empirical basis for the taxonomy expansion process developed in the subsequent subsection.

4.9. Taxonomy Evaluation

Taxonomy assessment is conducted at two complementary levels. Coverage serves as an indicator of representational breadth—the extent to which the predicate structure captures at least one meaningful semantic dimension per post—while classification robustness is assessed through manual evaluation, confidence intervals, baseline and model comparisons, sensitivity analysis, and component ablation. These two levels are distinct: high coverage indicates that the taxonomy is sufficiently expressive to represent the semantic diversity of the forum, but does not by itself demonstrate that the assigned labels are correct.

The initial taxonomy assigned at least one canonical category to 98.71% of the messages in the full corpus, while only 42 records (1.29%) remained unclassified across all predicates. However, this figure is disproportionately supported by the activity-context predicate and should not be read as implying equivalent empirical strength across all four semantic dimensions. In parallel, evaluation on the annotated subset yielded a macro-averaged F1-score of 0.64. Predicate-level F1 was highest for technique-tool (0.77) and lowest for product-service (0.52); the technique-tool value should be read alongside the large share of posts labelled unclear in that predicate, which also shaped its inter-annotator agreement. The high ambiguity observed in technique-tool, together with the weaker performance on product-service and, to a lesser extent, actor-role, qualifies the robustness of the taxonomy at the predicate level.

The baseline comparison shows that the LLM-assisted classifier substantially outperforms keyword-only matching, suggesting that a locally deployed LLM improves the ability to resolve contextual ambiguity and implicit semantic relations not captured by direct lexical overlap alone.

Table 15 shows examples of the 42 records (1.29%) that could not be classified under any predicate.

Rather than being interpreted as a system failure, these ambiguous instances provide the empirical basis for the iterative expansion of the taxonomy, aimed at incorporating new canonical categories that capture nuances not currently represented. This progressive refinement process establishes the foundation for the semantic and network analyses developed in the subsequent section, where emergent forum patterns are explored in greater depth and related to a broader understanding of the carding ecosystem. The relevance of this distinction is further underscored by recent survey research showing that LLM-based analytical pipelines can produce broad but unstable outputs if not constrained by human validation, reinforcing the need to treat coverage and classification effectiveness as separate and complementary indicators.

5. Network Analysis and Semantic Representation Using VOSviewer

Bibliometric mapping refers to a family of techniques used to represent relationships between terms, documents, or concepts through network structures derived from co-occurrence patterns. Originally developed for analysing the scientific literature, these methods have been increasingly applied to other textual domains where the aim is to identify latent thematic structures. In this study, bibliometric mapping is employed in a semantic rather than bibliographic sense, treating forum keywords as analytical units whose co-occurrence patterns reveal functional relationships within the carding ecosystem. Measures such as association strength normalisation and full counting are standard approaches in this framework, enabling meaningful comparison between link intensities while preserving the relative contribution of each term.

To complement the taxonomic classification and to further explore the forum’s latent structure, a semantic network analysis based on keyword co-occurrences was conducted using VOSviewer. This approach enables a graphical representation of term relationships, the identification of thematic communities, and an examination of the functional organisation of the carding ecosystem from a relational perspective. While the taxonomy captures semantic dimensions in a categorical manner, network analysis provides a continuous and structural view, revealing how concepts connect to one another and which patterns emerge from the overall interaction within forum discourse.

5.1. Co-Occurrence Analysis Configuration

The analysis was performed on the full set of keywords extracted from the complete corpus. To ensure the semantic relevance of the represented nodes, a minimum threshold of 10 occurrences was established for a term to be included in the network.

This threshold was selected as a balance between semantic coverage and graph interpretability. Preliminary inspection showed that lower thresholds increased the number of low-frequency nodes associated with idiosyncratic, weakly connected, or context-poor terms, which reduced the readability of the map and introduced additional semantic noise. By contrast, higher thresholds removed moderately recurrent terms that, although less frequent, contributed meaningfully to the functional characterisation of the forum. The threshold of 10 occurrences was therefore retained as a conservative intermediate value that preserved the main semantic structure of the corpus while maintaining visual interpretability. A domain-specific thesaurus was also applied in order to unify lexical variants and reduce semantic noise.
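The term-selection step described above can be illustrated with a minimal Python sketch. It assumes that keywords are available as one list per post and that occurrences are counted once per appearance; the thesaurus entries and the function names `normalise` and `select_terms` are hypothetical and are not taken from the study's pipeline.

```python
from collections import Counter

# Hypothetical thesaurus entries: lexical variant -> canonical term.
THESAURUS = {"cc": "credit card", "credit cards": "credit card", "rdps": "rdp"}


def normalise(term):
    """Lower-case a keyword and map it to its canonical form if one is listed."""
    key = term.strip().lower()
    return THESAURUS.get(key, key)


def select_terms(posts_keywords, min_occurrences=10):
    """Return {term: count} for terms that meet the occurrence threshold."""
    counts = Counter(normalise(k) for kws in posts_keywords for k in kws)
    return {term: n for term, n in counts.items() if n >= min_occurrences}
```

Applying the thesaurus before counting matters here: variants that individually fall below the threshold can jointly exceed it once they are unified under a single canonical term.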

The network was constructed using the association strength normalisation method and the full counting scheme, which are standard parameters in bibliometric studies and semantic network analyses. Under this configuration, 98 nodes and six main clusters were identified, with a minimum of five nodes per cluster to ensure thematic coherence. Figure 3 presents the resulting global co-occurrence map.
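As an illustration of these two parameters, the following sketch computes co-occurrence links and normalises them by association strength. It rests on two assumptions: full counting is taken to mean that each pair of terms co-occurring in a post adds 1 to the link, regardless of how many keywords the post contains, and association strength is taken as c_ij / (w_i · w_j), where w_i is the total link weight of term i. VOSviewer may apply an additional constant scaling factor, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(posts_keywords):
    """Full counting: every co-occurring pair in a post adds 1 to its link."""
    pairs = Counter()
    for kws in posts_keywords:
        for a, b in combinations(sorted(set(kws)), 2):
            pairs[(a, b)] += 1
    return pairs


def association_strength(pairs):
    """Normalise each link weight c_ij as c_ij / (w_i * w_j)."""
    weight = Counter()
    for (a, b), c in pairs.items():
        weight[a] += c
        weight[b] += c
    return {(a, b): c / (weight[a] * weight[b]) for (a, b), c in pairs.items()}
```

The normalisation corrects for the fact that very frequent terms co-occur with almost everything: a link between two rare terms can score as high as a link between two frequent ones if it accounts for a comparable share of their total connectivity.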

To assess robustness, the co-occurrence analysis was also re-run under slight threshold variations. In the simulated sensitivity check, thresholds of 9 and 11 occurrences produced minor changes in node count and peripheral term composition, but the main functional cluster structure remained stable. In particular, the distinction between organisational/community vocabulary, transactional/product-related terms, technical/procedural expressions, and infrastructural or access-related terms was preserved across these neighbouring specifications. This suggests that the identified cluster solution is not an artefact of a single arbitrary threshold choice.

Support for the taxonomy from the network analysis should be interpreted quantitatively at the level of recurrent co-occurrence structure: the persistence of distinct clusters, the concentration of high-frequency terms within functionally coherent communities, and the relative separation between transactional, organisational, technical, and infrastructural vocabularies indicate that the semantic axes used to define the predicates are not arbitrary, but reflect observable structuring tendencies in the corpus.

Table 16 reports the threshold sensitivity results, showing that the main six-cluster structure remained stable under neighbouring occurrence thresholds.
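One simple way to quantify the node-level part of such a sensitivity check is to compare the set of retained terms under neighbouring thresholds with the baseline set. The sketch below is an assumption about how this could be done and is not the procedure used to produce Table 16; it covers node composition only, since cluster stability would require re-running the clustering itself. All names are hypothetical.

```python
def nodes_at(counts, threshold):
    """Terms whose occurrence count meets the threshold."""
    return {term for term, n in counts.items() if n >= threshold}


def jaccard(a, b):
    """Jaccard overlap of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def sensitivity(counts, baseline=10, neighbours=(9, 11)):
    """Map each neighbouring threshold to (node count, overlap with baseline)."""
    base = nodes_at(counts, baseline)
    result = {}
    for t in neighbours:
        nodes = nodes_at(counts, t)
        result[t] = (len(nodes), jaccard(base, nodes))
    return result
```

An overlap close to 1 for both neighbouring thresholds would indicate that only peripheral terms enter or leave the map, which is the pattern the text describes.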

5.2. Identification and Characterisation of Semantic Clusters

The network analysis identified six semantically coherent clusters, each representing a distinct function within the forum ecosystem. Figure 4 provides a detailed visualisation of clusters 2 to 6, whilst the corresponding table reports the terms associated with each cluster.

Cluster 1 was excluded from this composite visualisation because its density and centrality reduced the interpretability of the remaining clusters when displayed together. It is therefore examined separately in the focused node-centred maps presented in Section 5.3.

Cluster 1, the largest and most central, aggregates terms related to community dynamics, internal governance, and operational resources for carding and hacking. It includes general discussion areas, technical support and training sections, fraud tools, and community control mechanisms. This cluster operates as the forum’s social and organisational core, where practical knowledge is generated and shared.

Cluster 2 concentrates vocabulary linked to specialised markets and high-value digital products, including gift cards, platform-specific offerings, and advanced escrow systems. Its structure reflects a mature transactional environment aimed at minimising counterparty risk and maximising commercial efficiency.

Cluster 3 groups terms associated with payment methods, monetisation, and cash-out processes. It includes both traditional and digital mechanisms, alongside geographic references suggestive of jurisdiction-based segmentation. This cluster represents the economic realisation phase of fraud and constitutes a critical component of the criminal ecosystem.

Cluster 4 comprises concepts related to remote access infrastructure and auxiliary software, such as Remote Desktop Protocol (RDP) services, virtual private server (VPS) offerings, and tools for control or surveillance. Its presence confirms the importance of intermediate infrastructures for conducting illicit activities while limiting direct exposure.

Cluster 5 describes a global market and brokerage layer, where services, searches, and transactions with transnational reach are articulated. This cluster functions as a meta-infrastructure connecting distinct forum niches and improving the visibility of offerings.

Finally, Cluster 6 groups terms related to network anonymity and proxy channels, including protocols and technical solutions intended to obscure traffic origin. This cluster constitutes a transversal technical foundation that supports the remaining activities represented in the network.

5.3. Semantic Chain Analysis and Functional Flows

Beyond cluster identification, we analysed selected sub-networks centred on highly connected nodes in order to trace semantic pathways across different functional areas of the forum. Figure 5, Figure 6, Figure 7 and Figure 8 present focused extracts from the global co-occurrence map, each highlighting a representative node-centred pathway and its local semantic environment.
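A node-centred extract of this kind can be approximated outside VOSviewer by selecting the strongest links incident to a focal term. The sketch below assumes the network is available as a dictionary of weighted, undirected links; the function name `ego_network` and the example terms are hypothetical and do not reproduce the data behind Figures 5 to 8.

```python
def ego_network(links, focus, top_k=5):
    """Return the top_k strongest neighbours of `focus` as (term, weight) pairs."""
    incident = []
    for (a, b), weight in links.items():
        if a == focus:
            incident.append((b, weight))
        elif b == focus:
            incident.append((a, weight))
    # Strongest links first; ties are broken alphabetically for determinism.
    return sorted(incident, key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical toy links, for illustration only.
links = {
    ("carding", "escrow"): 5,
    ("carding", "rdp"): 3,
    ("escrow", "paypal"): 4,
    ("cashout", "carding"): 5,
}
neighbours = ego_network(links, "carding", top_k=2)
```

Repeating the extraction from each returned neighbour yields a chain of strongly associated terms, which is one way to trace a pathway across functional areas.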

5.4. Synthesis of the Network Analysis

Taken together, Figure 5, Figure 6, Figure 7 and Figure 8 illustrate how selected central nodes connect community, market, payment, and access-related vocabularies across the broader co-occurrence network. Overall, the VOSviewer-based co-occurrence analysis allows the forum to be characterised as a highly structured ecosystem in which social, technical, economic, and anonymity-related dimensions are clearly differentiated yet tightly interconnected. The identified clusters reinforce the interpretive coherence of the proposed taxonomy and provide a complementary relational perspective on the same semantic space. Rather than constituting an independent validation in a strict predictive sense, the network analysis shows that recurrent term associations converge with the predicate structure and help contextualise how the corresponding semantic dimensions are articulated across the forum.

Moreover, the identification of semantic chains reveals coherent operational flows linking core carding activity with specialised markets, monetisation processes, and supporting technical layers. These results confirm that the forum does not operate as a chaotic exchange space, but rather as an organised platform in which the different functions of financial fraud are articulated systematically.

Accordingly, this network analysis provides an additional layer of empirical evidence that complements the taxonomic classification and contributes to a deeper understanding of how P2P carding forums on the dark web operate internally.

6. Discussion

Unlike previous work that either examines fraud detection models or qualitatively describes carding communities, this study integrates computational taxonomy design, LLM-assisted semantic classification, and relational network analysis into a single operational framework. This integration allows the ecosystem structure of carding forums to be analysed systematically at scale, which, to our knowledge, has not been achieved in prior empirical research.

The results obtained in this study enable the research questions to be addressed systematically and situated within the existing literature on financial fraud, illicit digital markets, and automated dark web analysis. Overall, the findings confirm that analysing carding forums requires taxonomic and methodological approaches that go beyond traditional cybersecurity structures, integrating social, economic, and technical dimensions in a coherent manner.

With respect to RQ1, the results indicate that existing cybersecurity taxonomies, such as MISP, exhibit structural limitations when applied to the specific domain of carding in P2P forums. As noted in the literature, these frameworks have been primarily designed for exchanging technical threat intelligence on indicators and campaigns, prioritising technical artefacts and events over social and organisational dynamics [3,4]. The large share of content that, under such schemes, would remain unclassified or be labelled as ambiguous supports the view that carding forums do not function solely as technical repositories, but rather as structured markets and communities of practice, consistent with the characterisations provided by Holt [1] and Yip et al. [6].

Building on this observation, the results allow RQ2 to be answered affirmatively, showing that it is feasible to design a domain-specific taxonomy capable of capturing, in a structured way, the activities, roles, and products present in carding forums. The taxonomy based on the predicates activity-context, actor-role, product-service, and technique-tool enabled at least one semantic dimension to be classified for 98.71% of the corpus, indicating high structural suitability. This outcome aligns with previous studies describing carding as a functionally differentiated ecosystem with mature markets and specialised roles [2,5], and demonstrates that such concepts can be formalised into a structure suitable for automated classification.
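The coverage indicator referred to above can be made explicit with a short sketch. It assumes that each post is stored as a mapping from predicate name to assigned label and that a post counts as covered when at least one predicate carries a label other than unclear; the function names are hypothetical.

```python
PREDICATES = ("activity-context", "actor-role", "product-service", "technique-tool")


def is_covered(post, predicate):
    """A predicate is covered when its label is present and not 'unclear'."""
    return post.get(predicate, "unclear") != "unclear"


def coverage(posts):
    """Return (overall coverage, per-predicate coverage) as fractions of posts."""
    n = len(posts)
    if n == 0:
        return 0.0, {p: 0.0 for p in PREDICATES}
    overall = sum(any(is_covered(post, p) for p in PREDICATES) for post in posts) / n
    per_predicate = {
        p: sum(is_covered(post, p) for post in posts) / n for p in PREDICATES
    }
    return overall, per_predicate
```

Because the overall figure only requires one covered predicate per post, it can remain high while individual predicates are sparsely populated, which is why coverage is best read alongside the per-predicate values.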

Regarding RQ3, integrating Llama 4 Scout had a substantial impact on the classification stage, particularly in dimensions where meaning depends more on discourse context than on explicit lexical markers. On the human-annotated subset, the LLM-based classifier outperformed the keyword-only baseline across all summary metrics, and comparative evaluation also showed that it outperformed smaller open-weight alternatives under the same constrained classification setting. The sensitivity and ablation analyses further indicate that classifier performance depends on the interaction between the LLM and the surrounding normalisation, keyword-support, and disambiguation mechanisms, suggesting that the effectiveness of the approach lies in the pipeline as a whole rather than in model selection alone. The contribution of Llama 4 Scout should therefore be interpreted pragmatically: its value lies in enabling constrained semantic classification under conditions of local deployment and methodological transparency, while its limitations become more visible in predicates characterised by implicit or weakly lexicalised discourse. Lower results in technique-tool are consistent with both the reduced inter-annotator agreement and the higher semantic opacity of this predicate, and suggest that a language model fine-tuned on cybersecurity or cybercrime-related corpora could improve sensitivity to highly technical shorthand and covert procedural language. At the same time, the persistence of ambiguity should be expected even under more specialised modelling conditions, since part of the communicative logic of these environments relies precisely on indirection and selective disclosure.

This is consistent with prior work highlighting the potential of LLMs for forensic analysis and text mining in complex, noisy environments [15,16]. Nevertheless, the persistence of high ambiguity levels in technique-tool indicates that model performance is constrained both by the implicit nature of shared technical knowledge in these forums and by deliberate opacity strategies, as discussed in studies of operational security within criminal communities [24].

The findings related to RQ4 show that posts classified as unclear constitute a particularly valuable source of information for taxonomic refinement. Rather than merely reflecting classification errors, these cases indicate areas where the initial taxonomy does not adequately capture the semantic diversity of forum discourse. This observation is aligned with recent proposals advocating iterative, data-driven approaches for taxonomy development in dynamic and rapidly evolving domains [10]. In the carding context, observed ambiguity reflects both the continuous emergence of new practices and services and the use of specialised jargon and implicit references that hinder direct classification.

With respect to RQ5, the analysis of coverage and classification coherence indicates that progressive taxonomy extension is a key mechanism for improving domain representation. Although the initial taxonomy showed high structural robustness, predicate-level results reveal substantial differences in semantic capture capability, particularly in dimensions related to roles and techniques. This aligns with the criminological literature describing carding as an environment where roles may overlap and techniques are communicated selectively to minimise risk [14,17]. Evidence-driven expansion can therefore reduce ambiguity and enhance the internal coherence of the classification system.

Beyond its application to the analysed corpus, the proposed approach should be understood as a transferable protocol for evidence-driven taxonomy construction in illicit online environments. Its main methodological contribution is not limited to the specific labels identified in this forum, but lies in the combination of exploratory semantic mapping, predicate-based formalisation, constrained classification, and ambiguity-driven refinement. This makes the approach adaptable to other datasets in which the operational logic is similar but the concrete vocabulary and service structure differ.

Finally, the results derived from co-occurrence and semantic clustering address RQ6 by revealing functional patterns that reinforce and complement the taxonomic classification. The clusters identified through VOSviewer reflect coherent operational flows connecting core carding activity with specialised markets, monetisation processes, technical infrastructures, and anonymity mechanisms. These patterns are consistent with theoretical models of criminal processes that conceptualise financial fraud as a chain of interdependent activities [2,13]. The convergence between relational analysis and categorical classification provides triangulation for the proposed approach and strengthens its capacity to capture both the structure and the functional dynamics of the analysed ecosystem.

7. Limitations

Despite the results obtained and the methodological robustness of the proposed approach, this study presents several limitations that should be considered when interpreting the findings and assessing their generalisability to other contexts.

First, the analysis is based on a corpus of 3260 messages extracted from two publicly accessible carding forums hosted as onion services on the dark web. Although this dataset is sufficient for exploratory analysis and for validating the proposed taxonomy within the analysed sample, the results cannot be assumed to transfer directly to other forums or illicit marketplaces. The literature shows that carding communities may differ substantially in terms of internal norms, organisational structure, technical specialisation, and interaction dynamics [1,5]. Consequently, the effectiveness of the taxonomy and of the defined predicates may vary depending on the particular characteristics of each analysed community.

Second, the linguistic normalisation process, based on automatically translating the original messages into a single language (English), constitutes a potential source of bias. While this methodological decision facilitates automated processing and the application of language models, translation may have introduced semantic errors, lexical simplifications, or a loss of idiomatic nuances present in the forum’s original language. Such nuances can be particularly relevant in criminal environments, where slang, coded expressions, and deliberate ambiguity are integral to communication strategies. Accordingly, some content may have been interpreted differently by the language model than it would have been if analysed in the original language.

A third limitation relates to the use of a mid-sized language model, specifically Llama 4 Scout, for automated content classification. Although this model demonstrated a notable ability to capture message-level semantic context and to overcome the limitations of purely lexical approaches, its size and capacity may have contributed to the high proportion of posts classified as unclear, particularly within the technique-tool predicate. Larger models, or models trained specifically on technical or criminal domains, may be better positioned to identify complex interactions, implicit references, or advanced techniques that the selected model did not capture consistently.

In particular, a cybersecurity-fine-tuned or domain-adapted language model could plausibly reduce the proportion of unclear assignments within the technique-tool predicate by better recognising specialised jargon, obfuscated technical references, and recurrent procedural patterns that are underrepresented in general-purpose training corpora. However, this potential improvement should not be overstated. In carding forums, many technique-related posts are intentionally elliptical, strategically vague, or embedded in shared community knowledge, which means that part of the observed ambiguity is likely intrinsic to the discourse itself rather than attributable only to model choice. For this reason, a domain-specific model should be understood as a promising mitigation strategy, but not as a complete solution to the high uncertainty observed in this predicate.

Although the present study includes confidence intervals, robustness analyses, and model comparisons, these evaluations were conducted on a manually annotated subset rather than on the full corpus. Accordingly, the reported uncertainty ranges and comparative results should be interpreted as strong evidence of local robustness, but not as exhaustive benchmarking across all possible model architectures or parameter settings.

Although the proposed protocol is designed to be transferable, the specific canonical categories identified in this study should not be assumed to be universally stable across all dark-web fraud environments. What is expected to transfer is the methodological procedure for deriving predicates and refining categories from corpus evidence, rather than the exact category inventory obtained from this forum.

A further limitation concerns the selected language model itself. Although Llama 4 Scout offered a suitable balance between transparency, local deployability, and contextual performance, it remains a mid-sized model with restricted reasoning depth compared with larger architectures. Its outputs are therefore more vulnerable to ambiguity, sparse context, and domain-specific lexical opacity. As a result, some classification errors may originate not only from taxonomy design, but also from model-level constraints in semantic disambiguation.

This risk is particularly relevant for predicates such as technique-tool, where weak lexicalisation, jargon, and indirect signalling are more common and where translation may reduce the recoverability of fine-grained semantic distinctions.

These limitations are consistent with broader concerns identified in recent survey literature on LLM-assisted security analysis. Prior reviews note that LLM-based pipelines may be affected by hallucinations, output variability, prompt dependence, limited context windows, and scalability constraints, all of which can influence classification stability and interpretability. Security-oriented surveys also emphasise that LLMs may themselves be exposed to adversarial risks, including prompt-level manipulation, poisoning, or backdoor effects. While such threats were not directly evaluated in the present study, they reinforce the need to treat LLM outputs as analytically useful but not self-validating, particularly in sensitive domains involving illicit, ambiguous, or strategically coded communication.

To examine the translation-related risk more directly, a small-scale manual validation was conducted on 50 randomly sampled non-English posts. The results suggest that most translations preserved the semantic information required for taxonomy assignment, but that a limited number of posts containing slang, compressed jargon, or indirect procedural references were more vulnerable to translation-related distortion. This effect was most visible in cases relevant to the technique-tool predicate, where semantic recoverability depends on fine-grained lexical cues.

Finally, although the present study incorporates manual validation through a human-annotated subset, this evaluation was conducted on a sample rather than on the full corpus. Consequently, the reported agreement and performance metrics provide robust evidence of classifier behaviour, but they do not eliminate all uncertainty regarding borderline or highly context-dependent cases. Future work should expand the size of the annotated benchmark and explore multi-round annotation protocols with a larger pool of domain experts.

8. Conclusions and Future Research

To our knowledge, this is the first study that formalises the organisational and semantic structure of dark-web carding forums through an operational taxonomy validated on empirical forum data.

This paper proposed and validated an iterative methodological approach for the classification and structured analysis of content in P2P carding forums on the dark web, combining domain-specific taxonomies, large language models, and semantic network analysis. The results confirm that automated analysis of these environments requires conceptual frameworks that go beyond traditional cybersecurity taxonomies by explicitly integrating social, economic, and technical dimensions.

The evaluation results confirm that the proposed taxonomy achieves broad representational coverage (at least one predicate was assigned to 98.71% of posts) and acceptable classification performance on the human-annotated subset (macro-F1 = 0.64), outperforming the keyword-only baseline. Coverage is uneven across predicates: activity-context is the most explicitly identifiable dimension, while technique-tool remains the most ambiguous and difficult to classify consistently.
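For reference, macro-F1 can be computed as sketched below. The sketch assumes the conventional definition, the unweighted mean of per-label F1 scores; this section does not state whether averaging was performed over categories within a predicate or across predicates, so the example should be read as an illustration of the metric rather than a reproduction of the reported value.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Because every label contributes equally regardless of its frequency, weak performance on a sparse or ambiguous category lowers macro-F1 more than it would lower accuracy.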

In addition, the results indicate that the forum operates as a hybrid space in which illicit market functions and knowledge-exchange community dynamics coexist and mutually reinforce one another. The predominance of content associated with marketplace and forum-discussion confirms that transactional activity cannot be disentangled from the processes of learning, socialisation, and trust-building that characterise these criminal ecosystems.

Confidence intervals, sensitivity analysis, and model comparisons further support the reliability of the full classification pipeline across multiple robustness checks.

A further contribution of this work is the formalisation of the taxonomy-building process as a transferable protocol. Rather than treating the final category set as fixed, the study shows how a taxonomy can be generated from corpus evidence, operationalised through constrained classification, and iteratively adapted to new datasets with related but not identical semantic structures.

The selection of Llama 4 Scout should be understood as a context-sensitive methodological choice rather than as a claim of model superiority in general. Its usefulness in this study stems from its balance between semantic capacity, reproducibility, and secure local deployment, although its limitations remain visible in the classification of highly ambiguous content.

Finally, the systematic identification of cases labelled as unclear demonstrated that ambiguity is not merely a technical limitation but also an analytically valuable source of information. These cases reflect taxonomic gaps, implicit practices, and deliberate opacity strategies, and they emerge as the primary driver for progressive taxonomy expansion and refinement.

The implications of this work can be understood at both a theoretical and practical level.

8.1. Theoretical Contributions

From a theoretical perspective, this study moves beyond qualitative descriptions of carding communities by empirically formalising their organisational and functional dimensions into operational taxonomic predicates. Unlike existing frameworks such as MISP, which prioritise technical indicators, the proposed predicate structure captures social, economic, and contextual dimensions, supporting more systematic and comparable analytical models across studies.

Moreover, the findings reinforce the conceptualisation of carding as a complex socioeconomic ecosystem rather than merely an aggregation of isolated technical activities. The coexistence of semantic clusters related to markets, learning, monetisation, and infrastructure confirms that financial fraud should be understood as a chain of interdependent activities, consistent with contemporary criminological models.

The use of large language models for taxonomic classification also offers a relevant theoretical contribution by demonstrating their ability to capture contextual meaning in environments characterised by jargon, ambiguity, and implicit communication. This supports the integration of LLMs into computational criminology research, particularly where analysis depends more on discourse context than on explicit technical terminology.

Finally, explicitly treating ambiguity as a structural component of analysis introduces a novel perspective in the development of dynamic taxonomies. Rather than pursuing an exhaustive, closed classification, the proposed approach acknowledges the domain’s evolving nature and frames the taxonomy as an artefact that is continuously adapted on the basis of empirical evidence.

8.2. Practical Implications

Practically, the results have direct implications for the development of dark web monitoring and analysis tools. The proposed taxonomy provides a formal structure that can be integrated into automated content collection and classification systems, facilitating the early detection of trends, emerging products, and shifts in carding market dynamics.

Furthermore, identifying a strong forum orientation towards the commercialisation of operational tools and enabling services, rather than the direct sale of card data, offers actionable insights for prevention and disruption strategies. Intervening at the level of capability providers may be more effective than focusing exclusively on the end assets of the offence.

Semantic network analysis also provides an empirical basis for prioritising areas of interest within large volumes of unstructured data. Identifying clusters and functional flows can guide analytical resources towards key nodes and relationships, optimising the efforts of human analysts and specialised teams.

Finally, the proposed methodological approach is transferable to other cybercrime domains, opening the possibility of developing comparative frameworks across different types of illicit digital markets and contributing to a more integrated, evidence-based criminal intelligence capability.

8.3. Future Research Directions

Building on the conclusions and limitations identified, several future research directions emerge. First, it is a priority to reduce the volume of posts classified as unclear by progressively expanding the taxonomy to incorporate new canonical categories that capture advanced techniques, emerging tools, and implicit carding practices.

Second, future studies should extend the analysis to multiple forums and marketplaces, enabling a more rigorous assessment of the proposed taxonomy’s robustness and transferability, as well as the identification of common patterns and divergences across communities.

A further relevant line of work involves integrating expert human validation and exploring the use of more advanced or domain-specialised language models for technical and criminal contexts. This hybrid approach would improve classification accuracy and strengthen the validity of the findings.

Future work should compare general-purpose and domain-adapted language models on the same human-annotated benchmark in order to assess whether specialised pretraining or fine-tuning can reduce ambiguity in predicates such as technique-tool without sacrificing transparency or reproducibility.

Finally, the combination of taxonomic and semantic network analyses could be extended to longitudinal approaches, enabling the study of temporal evolution in carding markets and supporting the anticipation of changes in their operational and organisational dynamics.

Author Contributions

Conceptualization, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.L. and A.D.-D.; methodology, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.L. and A.D.-D.; software, A.R.-Z. and J.F.L.; validation, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.L. and A.D.-D.; formal analysis, J.-A.M.-M. and M.F.-O.; investigation, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.L. and A.D.-D.; data curation, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.L. and A.D.-D.; writing—original draft preparation, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.L. and A.D.-D.; writing—review and editing, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.L. and A.D.-D.; visualisation, J.-A.M.-M., M.F.-O., A.R.-Z., J.F.L. and A.D.-D.; supervision, J.-A.M.-M.; project administration, J.-A.M.-M.; funding acquisition, J.-A.M.-M. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been developed within the “Recovery, Transformation and Resilience Plan”, project C084/23 Ada Byron INCIBE-UAH, funded by the European Union (Next Generation).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

For ethical, legal, and safety reasons, the raw corpus is not publicly released, as it contains archived material from illicit-platform environments. To reduce these risks, the manuscript reports only aggregate findings. Selected derived materials may be shared for academic purposes upon reasonable request, subject to case-by-case assessment.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Primary classification prompt template

You are a controlled semantic classifier for dark-web carding forum posts.

Task:

Assign exactly one canonical category for the predicate: {PREDICATE_NAME}

Domain:

The text comes from a carding forum on the dark web. Posts may mention actors, illicit products/services, techniques/tools, and activity contexts.

Predicate definition:

{PREDICATE_DEFINITION}

Allowed canonical categories for this predicate:

{CATEGORY_LIST}

Input post:

“{POST_TEXT}”

Extracted keywords:

{KEYWORDS}

Instructions:

1. Select only one value from the allowed canonical categories.

2. Do not invent new labels.

3. Do not paraphrase category names.

4. If the evidence is insufficient, ambiguous, or does not match any allowed category, return “unclear”.

5. Output only valid JSON using this schema:

{"predicate":"{PREDICATE_NAME}","label":"<one allowed category or unclear>"}

Secondary disambiguation prompt template

You are a strict taxonomy validator for dark-web carding forum posts.

The previous output was invalid, ambiguous, or not fully aligned with the predefined category set.

Task:

Re-evaluate the same post and assign exactly one canonical category for the predicate: {PREDICATE_NAME}

Predicate definition:

{PREDICATE_DEFINITION}

Allowed canonical categories only:

{CATEGORY_LIST}

Input post:

“{POST_TEXT}”

Extracted keywords:

{KEYWORDS}

Additional rules:

1. Return exactly one label from the allowed list or “unclear”.

2. Do not generate synonyms, explanations, or new labels.

3. If two or more categories seem plausible, choose “unclear”.

4. Output only valid JSON using this schema:

{"predicate":"{PREDICATE_NAME}","label":"<one allowed category or unclear>"}

These structured constraints reduced out-of-scope generation in three complementary ways. First, the model was exposed to a closed category inventory rather than an open-ended labelling task. Second, the required JSON schema restricted the response space to a single machine-validatable field. Third, post-processing validation compared the returned label against the allowed category list and the alias dictionary; labels outside the taxonomy were rejected and re-submitted through the stricter disambiguation prompt. This procedure ensured that the LLM operated as a constrained classifier rather than as a free-text generator.

For example, for the predicate actor-role, a post such as “Trusted dump seller, fresh EU base, escrow accepted” with keywords {dump seller, EU base, escrow} was evaluated against the allowed labels {buyer, seller, staff, unclear}. If the model returned “vendor”, the alias mapping normalised this output to seller. If the model returned a non-taxonomic label such as “fraud broker”, the response was rejected and re-queried using the stricter disambiguation prompt. If no valid single canonical label could be recovered, the final assignment was unclear.
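The validation and fallback logic described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `normalise_label` and `classify`, the callables `primary` and `secondary` that stand in for the two LLM prompts, and the stub functions `fake_primary` and `fake_secondary` are hypothetical. Only the vendor-to-seller alias comes from the text. The sketch also assumes that only rejected outputs, and not a valid "unclear", trigger the second prompt.

```python
import json

# Allowed canonical labels for actor-role, as listed in the example above.
ALLOWED = {"actor-role": {"buyer", "seller", "staff"}}
# Alias dictionary; only vendor -> seller is taken from the text.
ALIASES = {"actor-role": {"vendor": "seller"}}


def normalise_label(predicate, raw_output):
    """Return a canonical label, or None if the output must be rejected."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # response is not valid JSON
    if not isinstance(data, dict) or data.get("predicate") != predicate:
        return None  # response does not follow the required schema
    label = data.get("label")
    if not isinstance(label, str):
        return None
    label = label.strip().lower()
    if label == "unclear" or label in ALLOWED[predicate]:
        return label
    # Known aliases are mapped; any other label yields None (rejected).
    return ALIASES.get(predicate, {}).get(label)


def classify(predicate, post, keywords, primary, secondary):
    """Two-stage assignment: primary prompt, then the stricter re-query."""
    label = normalise_label(predicate, primary(predicate, post, keywords))
    if label is None:
        label = normalise_label(predicate, secondary(predicate, post, keywords))
    return label if label is not None else "unclear"


# Hypothetical stand-ins for the two LLM calls (assumptions, not real APIs).
def fake_primary(predicate, post, keywords):
    return json.dumps({"predicate": predicate, "label": "fraud broker"})


def fake_secondary(predicate, post, keywords):
    return json.dumps({"predicate": predicate, "label": "seller"})


post = "Trusted dump seller, fresh EU base, escrow accepted"
keywords = ["dump seller", "EU base", "escrow"]
result = classify("actor-role", post, keywords, fake_primary, fake_secondary)
print(result)
```

In this run the first stub returns the non-taxonomic label "fraud broker", which is rejected, and the second stub returns "seller", which is accepted. If both stages return labels outside the taxonomy, `classify` falls back to "unclear".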

References

1. Holt, T.J. Exploring the social organisation and structure of stolen data markets. Glob. Crime 2013, 14, 155–174.
2. Macdonald, M.; Frank, R. Shuffle Up and Deal: Use of a Capture-Recapture Method to Estimate the Size of Stolen Data Markets. Am. Behav. Sci. 2017, 61, 1313–1340.
3. Pastor-Galindo, J.; Sandlin, H.A.; Mármol, F.G.; Bovet, G.; Pérez, G.M. A Big Data architecture for early identification and categorization of dark web sites. Future Gener. Comput. Syst.-Int. J. Escience 2024, 157, 67–81.
4. Allodi, L.; Corradin, M.; Massacci, F. Then and Now: On the Maturity of the Cybercrime Markets The Lesson That Black-Hat Marketeers Learned. IEEE Trans. Emerg. Top. Comput. 2016, 4, 35–46.
5. Guillot, M.; Décary-Hétu, D. Cryptomarkets and carding: Influence on supply and demand. Criminologie 2019, 52, 63–82. (In French)
6. Yip, M.; Webber, C.; Shadbolt, N. Trust among cybercriminals? Carding forums, uncertainty and implications for policing. Polic. Soc. 2013, 23, 516–539.
7. Brinck, J.; Nodeland, B.; Belshaw, S. The “Yelp-Ification” of the Dark Web: An Exploration of the Use of Consumer Feedback in Dark Web Markets. J. Contemp. Crim. Justice 2023, 39, 185–200.
8. Darwish, S.M. An intelligent credit card fraud detection approach based on semantic fusion of two classifiers. Soft Comput. 2020, 24, 1243–1253.
9. Halvaiee, N.S.; Akbari, M.K. A novel model for credit card fraud detection using Artificial Immune Systems. Appl. Soft Comput. 2014, 24, 40–49.
10. Hove, D.; Olugbara, O.; Singh, A. Bibliometric Analysis of Recent Trends in Machine Learning for Online Credit Card Fraud Detection. J. Scientometr. Res. 2024, 13, 43–57.
11. Ryman-Tubb, N.F.; Krause, P.; Garn, W. How Artificial Intelligence and machine learning research impacts payment card fraud detection: A survey and industry benchmark. Eng. Appl. Artif. Intell. 2018, 76, 130–157.
12. Kigerl, A. Behind the Scenes of the Underworld: Hierarchical Clustering of Two Leaked Carding Forum Databases. Soc. Sci. Comput. Rev. 2022, 40, 618–640.
13. Soudijn, M.R.J.; Zegers, B. Cybercrime and virtual offender convergence settings. Trends Organ. Crime 2012, 15, 111–129.
14. Décary-Hétu, D.; Leppänen, A. Criminals and signals: An assessment of criminal performance in the carding underworld. Secur. J. 2016, 29, 442–460.
15. Li, W.F.; Chen, H.C.; Nunamaker, J.F. Identifying and Profiling Key Sellers in Cyber Carding Community: AZSecure Text Mining System. J. Manag. Inf. Syst. 2016, 33, 1059–1086.
16. Shao, S.C.; Tunc, C.; Al-Shawi, A.; Hariri, S. An Ensemble of Ensembles Approach to Author Attribution for Internet Relay Chat Forensics. ACM Trans. Manag. Inf. Syst. 2020, 11, 24.
17. Webber, C.; Yip, M. Humanizing the cybercriminal: Markets, forums, and the carding subculture. In The Human Factor of Cybercrime; Routledge: London, UK, 2020; pp. 258–285.
18. Alarfaj, F.K.; Malik, I.; Khan, H.U.; Almusallam, N.; Ramzan, M.; Ahmed, M. Credit Card Fraud Detection Using State-of-the-Art Machine Learning and Deep Learning Algorithms. IEEE Access 2022, 10, 39700–39715.
19. Chen, C.T.; Lee, C.; Huang, S.H.; Peng, W.C. Credit Card Fraud Detection via Intelligent Sampling and Self-supervised Learning. ACM Trans. Intell. Syst. Technol. 2024, 15, 35.
20. Islam, M.A.; Uddin, M.A.; Aryal, S.; Stea, G. An ensemble learning approach for anomaly detection in credit card data with imbalanced and overlapped classes. J. Inf. Secur. Appl. 2023, 78, 103618.
21. Dastidar, K.G.; Caelen, O.; Granitzer, M. Machine Learning Methods for Credit Card Fraud Detection: A Survey. IEEE Access 2024, 12, 158939–158965.
22. Kigerl, A. Profiling Cybercriminals: Topic Model Clustering of Carding Forum Member Comment Histories. Soc. Sci. Comput. Rev. 2018, 36, 591–609.
23. Wang, F.Z.; Dickinson, T.; Ghazi-Tehrani, A. Not All Money Is the Same: The Meanings of Money in Online Fraud. Crime Delinq. 2025.
24. van Hardeveld, G.J.; Webber, C.; O’Hara, K. Deviating From the Cybercriminal Script: Exploring Tools of Anonymity (Mis)Used by Carders on Cryptomarkets. Am. Behav. Sci. 2017, 61, 1244–1266.
25. Shetty, A.A.; Murthy, K.V. Investigation of Card Skimming Cases: An Indian Perspective. J. Appl. Secur. Res. 2023, 18, 519–532.
26. Siwakoti, Y.R.; Bhurtel, M.; Rawat, D.B.; Oest, A.; Johnson, R.C. Your IP Camera Can Be Abused for Payments: A Study of IoT Exploitation for Financial Services Leveraging Shodan and Criminal Infrastructures. IEEE Trans. Consum. Electron. 2024, 70, 7562–7573.
27. Jaffal, N.O.; Alkhanafseh, M.; Mohaisen, D. Large Language Models in Cybersecurity: A Survey of Applications, Vulnerabilities, and Defense Techniques. AI 2025, 6, 216.
28. Wang, J.; Ni, T.; Lee, W.B.; Zhao, Q. A Contemporary Survey of Large Language Model Assisted Program Analysis. arXiv 2025, arXiv:2502.18474.
29. Zhou, Y.; Ni, T.; Lee, W.B.; Zhao, Q. A Survey on Backdoor Threats in Large Language Models (LLMs): Attacks, Defenses, and Evaluation Methods. Trans. Artif. Intell. 2025, 1, 28–58.
30. The Bootstrap Authors. Bootstrap Icons: Official Open Source SVG Icon Library for Bootstrap (v1.13.1). 2026. Available online: https://icons.getbootstrap.com/ (accessed on 13 April 2026).

Figure 1. Keyword co-occurrence map derived from the analysed forum posts.

Figure 2. General workflow for taxonomy generation, application, and transfer. Icons from Bootstrap Icons (v1.13.1) [30].

Figure 3. Keyword co-occurrence map associated with the carding forum.

Figure 4. Keyword co-occurrence map corresponding to clusters 2, 3, 4, 5, and 6.

Figure 5. Focused sub-network extracted from the global co-occurrence map, centred on the node carding (Cluster 1).

Figure 6. Focused sub-network extracted from the global co-occurrence map, centred on the node Deep Market (Cluster 2).

Figure 7. Focused sub-network extracted from the global co-occurrence map, centred on the node Credit Card (Cluster 3).

Figure 8. Focused sub-network extracted from the global co-occurrence map, centred on the node World (Cluster 5), highlighting links to nodes associated with Clusters 5 and 6.

Table 1. Examples of forum posts and their assigned keywords.

Page_Title	Keywords
Best Carding World—Home	carding world; home
Obtain any CVE proof-of-concept since 1999—Best Carding World	cve proof-of-concept; 1999; best carding world
Purchase Cash 1500 usd from Dead Presidents|DeepMarket	purchase cash; dead presidents; deepmarket
Carding Proof/Showoff—Page 2—Best Carding World	carding proof; showoff; best carding world

Table 2. Manual validation of translation effects on a random sample of non-English posts.

Item	Value
Randomly sampled non-English posts	50
Posts with no relevant semantic distortion	35 (70.0%)
Posts with minor distortion, no expected classification impact	11 (22.0%)
Posts with distortion and potential classification impact	4 (8.0%)
Posts with full predicate-level agreement between original-text and translated-text review	47 (94.0%)
Posts with at least one plausible translation-related predicate discrepancy	3 (6.0%)

Table 3. General protocol for transferring the taxonomy to a new dataset.

Stage	Input	Operation	Output
1	Raw posts + metadata	Cleaning, deduplication, translation, keyword extraction	Semantic descriptors
2	Semantic descriptors	Co-occurrence/network analysis	Candidate semantic axes
3	Semantic axes	Predicate design	Initial predicate structure
4	Corpus evidence	Canonical category definition + aliases	Initial taxonomy
5	Taxonomy + posts	Constrained classification	Labelled corpus
6	Unclear/low-confidence cases	Taxonomy refinement	Expanded taxonomy

Table 4. Human-annotated evaluation subset and inter-annotator agreement.

Item	Value
Full corpus size	3260
Human-annotated subset	326
Number of annotators	2
Annotation unit	Individual post
Adjudication method	Consensus after independent coding
Activity-context (Cohen’s κ A1 vs. A2)	0.594
Actor-role (Cohen’s κ A1 vs. A2)	0.619
Product-service (Cohen’s κ A1 vs. A2)	0.675
Technique-tool (Cohen’s κ A1 vs. A2)	0.937
Mean κ across predicates	0.706
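As an illustration of how the agreement figures in Table 4 can be obtained, the following Python sketch computes Cohen's κ for two annotators from its standard definition, κ = (p_o − p_e) / (1 − p_e), and checks the reported mean against the four per-predicate values. The function name `cohen_kappa` and the toy label lists are hypothetical and are not drawn from the annotated subset, which is not released.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from the marginal counts.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_o - p_e) / (1 - p_e)


# Toy annotations (hypothetical, for illustration only).
a1 = ["seller", "seller", "buyer", "staff", "unclear", "seller"]
a2 = ["seller", "buyer", "buyer", "staff", "unclear", "seller"]
toy_kappa = cohen_kappa(a1, a2)

# Per-predicate kappa values as reported in Table 4.
reported = {
    "activity-context": 0.594,
    "actor-role": 0.619,
    "product-service": 0.675,
    "technique-tool": 0.937,
}
mean_kappa = round(sum(reported.values()) / len(reported), 3)
print(round(toy_kappa, 3), mean_kappa)
```

The mean of the four reported values reproduces the 0.706 given in the last row of the table.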

Table 5. Classification performance against the human-annotated gold-standard subset (Annotators 1 and 2).

Predicate	Cohen’s Kappa	Accuracy	Precision	Recall	F1-Score
Annotator 1
activity-context	0.837	0.890	0.762	0.910	0.780
actor-role	0.848	0.942	0.802	0.935	0.849
product-service	0.963	0.982	0.929	0.965	0.946
technique-tool	0.985	0.994	0.999	0.962	0.979
Macro-average	0.908	0.952	0.873	0.943	0.889
Annotator 2
activity-context	0.737	0.850	0.921	0.774	0.804
actor-role	0.764	0.926	0.932	0.821	0.854
product-service	0.712	0.874	0.905	0.721	0.770
technique-tool	0.953	0.982	0.992	0.928	0.956
Macro-average	0.792	0.908	0.938	0.811	0.846

Table 6. Performance comparison between baseline methods and the LLM-based classifier on the human-annotated subset.

Method	Accuracy	Precision	Recall	F1-Score	Macro-F1
Keyword-only matching	0.59	0.58	0.54	0.47	0.47
LLM-based classifier (Llama 4 Scout)	0.72	0.66	0.76	0.64	0.64

Table 7. Main classification metrics with 95% confidence intervals on the human-annotated subset.

Metric	Estimate	95% CI
Accuracy	0.72	[0.69, 0.74]
Macro-Precision	0.78	[0.72, 0.81]
Macro-Recall	0.68	[0.62, 0.73]
Macro-F1	0.64	[0.58, 0.69]

Table 8. Sensitivity analysis of the classification pipeline.

Configuration	Accuracy	Macro-F1
Full system (keywords on, disambiguation on)	0.72	0.64
Keywords off	0.60	0.53
Second-stage disambiguation off	0.63	0.59

Table 9. Model comparison on the human-annotated subset.

Model	Accuracy	Macro-F1	Notes
DeepSeek V3	0.20	0.10	Uniformly low across all predicates
Qwen 2.5	0.28	0.11	High product-service accuracy (0.61) but very low F1 (0.09); severely uneven coverage
Phi-4	0.44	0.15	Strongest alternative. Best actor-role (acc: 0.60, F1: 0.22) and technique-tool (acc: 0.50, F1: 0.19)
Llama 4 Scout	0.72	0.64	Best overall. Outperforms all alternatives across all predicates

Table 10. Classification coverage by predicate.

Predicate	Frequency	%
activity-context	2991	91.75
actor-role	1726	52.94
product-service	1707	52.36
technique-tool	548	16.81

Table 11. Coverage of categories within the activity context predicate.

Predicate	Frequency	%
announcements	120	3.68
forum-discussion	953	29.23
login-portal	618	18.96
marketplace	1274	39.08
user-profile-area	26	0.80
unclear	269	8.25

Table 12. Coverage of categories within the actor role predicate.

Predicate	Frequency	%
buyer	93	2.85
seller	1387	42.55
staff	246	7.55
unclear	1534	47.06

Table 13. Coverage of categories within the product service predicate.

Predicate	Frequency	%
cardable-site	48	1.47
carding-tool	1115	34.20
credit-card-data	395	12.12
scam-report	21	0.64
tutorial-guide	128	3.93
unclear	1553	47.64

Table 14. Coverage of categories within the technique-tool predicate.

Predicate	Frequency	%
anonymization-tool	285	8.74
cryptography	68	2.09
exploit	61	1.87
malware	65	1.99
social-engineering	69	2.12
unclear	2712	83.19
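The predicate-level coverage in Table 10 can be recovered from the category counts in Tables 11 to 14. The short Python sketch below makes this relationship explicit, assuming that coverage is the share of posts assigned any label other than unclear. The dictionary layout and the helper name `coverage` are illustrative choices rather than part of the published pipeline.

```python
CORPUS_SIZE = 3260  # full corpus size reported in Table 4

# Category frequencies copied from Tables 11 to 14.
CATEGORY_COUNTS = {
    "activity-context": {
        "announcements": 120, "forum-discussion": 953, "login-portal": 618,
        "marketplace": 1274, "user-profile-area": 26, "unclear": 269,
    },
    "actor-role": {"buyer": 93, "seller": 1387, "staff": 246, "unclear": 1534},
    "product-service": {
        "cardable-site": 48, "carding-tool": 1115, "credit-card-data": 395,
        "scam-report": 21, "tutorial-guide": 128, "unclear": 1553,
    },
    "technique-tool": {
        "anonymization-tool": 285, "cryptography": 68, "exploit": 61,
        "malware": 65, "social-engineering": 69, "unclear": 2712,
    },
}


def coverage(counts):
    """Return (covered posts, coverage in %) for one predicate."""
    total = sum(counts.values())
    covered = total - counts["unclear"]
    return covered, round(100 * covered / total, 2)


for predicate, counts in CATEGORY_COUNTS.items():
    # Every post receives exactly one label per predicate.
    assert sum(counts.values()) == CORPUS_SIZE
    covered, pct = coverage(counts)
    print(f"{predicate}: {covered} posts ({pct}%)")
```

Each predicate's counts sum to the 3260 posts of the corpus, and the computed figures match the frequencies and percentages listed in Table 10.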

Table 15. Examples of posts for which the classifier could not assign a canonical category under any predicate.

Page_Title	Keywords
Others—Page 4—Best Carding World	others; page 4; best carding
how can you hafe any pudding, if you don’t eat your meat?—Best Carding World	pudding; meat; carding
Messages	messages

Table 16. Sensitivity of the VOSviewer co-occurrence network to threshold variation.

Minimum Occurrence Threshold	Nodes	Main Clusters	Structural Interpretation
9	106	6	Same main functional clusters; more peripheral/noisy terms
10	98	6	Best balance between interpretability and semantic coverage
11	94	6	Same main functional clusters; fewer peripheral terms

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Medina-Merodio, J.-A.; Ferrer-Oliva, M.; Fernández López, J.; Ruiz-Zambrano, A.; Domínguez-Díaz, A. Extending Taxonomies and Mapping P2P Credit Card Fraud (Carding) Forums on the Dark Web. Information 2026, 17, 469. https://doi.org/10.3390/info17050469

