Systematic Review

Methodologies for Data Collection and Analysis of Dark Web Forum Content: A Systematic Literature Review

by Luis de-Marcos 1,*, José-Amelio Medina-Merodio 1 and Zlatko Stapic 2
1 Dpto de Ciencias de la Computación, Universidad de Alcalá, 28801 Alcalá de Henares, Spain
2 Faculty of Organization and Informatics, University of Zagreb, 42000 Varazdin, Croatia
* Author to whom correspondence should be addressed.
Electronics 2025, 14(21), 4191; https://doi.org/10.3390/electronics14214191
Submission received: 12 September 2025 / Revised: 13 October 2025 / Accepted: 23 October 2025 / Published: 27 October 2025
(This article belongs to the Special Issue Data Security and Data Analytics in Cloud Computing)

Abstract

Dark web forums are critical platforms for illicit activities and anonymous communication, making their analysis essential for cybersecurity, law enforcement, and academic research. This systematic literature review synthesizes methodologies for data collection and analysis of dark web forum content. Following PRISMA 2020 guidelines, we searched SciSpace, Google Scholar, and PubMed, identifying 364 papers, of which 11 provided detailed methodological insights. Key methodologies include web crawling, machine learning, natural language processing, and social network analysis. Results show the dominance of Python-based automated tools, with hybrid approaches combining automation and manual verification proving most effective. Challenges include ethical considerations, data accessibility, and platform dynamism. The field is maturing but requires standardized frameworks and improved reproducibility. This review outlines current practices, evaluates methodological effectiveness, and suggests future directions for research and application.

1. Introduction

The dark web, accessible through anonymization networks such as Tor, I2P, and Freenet, represents a significant portion of internet activity that remains hidden from conventional search engines and standard web browsers [1]. Within this hidden ecosystem, forums serve as critical platforms for communication, information exchange, and coordination of various activities, including both legitimate privacy-seeking behavior and illicit operations [2].
Dark web forums have emerged as focal points for cybercriminal activities, including drug trafficking, weapons sales, fraud schemes, and the exchange of stolen data [3,4]. These platforms also serve as venues for extremist communication, terrorist recruitment, and the dissemination of harmful content [5]. The anonymous nature of these forums, while providing legitimate privacy benefits for users in oppressive regimes or those seeking confidential communication, also creates significant challenges for law enforcement, cybersecurity professionals, and researchers attempting to understand and monitor these activities [6].
The analysis of dark web forums presents unique methodological challenges that distinguish it from traditional web content analysis [7]. These challenges include access barriers, as dark web forums require specialized software and knowledge to access, often involving multiple layers of anonymization and security protocols [8,9]. Additionally, the dynamic infrastructure of these forums means that locations, URLs, and availability change frequently to evade detection and maintain anonymity [10,11]. Data collection requires sophisticated tools capable of navigating Tor networks, handling JavaScript-heavy sites, and managing connection reliability issues [12]. Furthermore, research in this domain raises significant ethical questions regarding privacy, consent, and the potential misuse of collected data, alongside complex legal implications where researchers must navigate varying landscapes regarding the collection and analysis of potentially illegal content [6]. The significance of dark web forum analysis extends beyond cybersecurity, intersecting with fields such as dark tourism, where systematic reviews have synthesized trends in illicit or niche online communities [13], and proactive cyber-threat intelligence, which leverages dark web data to identify cybercrimes [14].
Despite these challenges, the analysis of dark web forums provides valuable insights for multiple stakeholder groups. Law enforcement agencies benefit from understanding criminal communication patterns and emerging threats [15]. Cybersecurity professionals gain intelligence about new attack vectors and criminal methodologies [16]. Academic researchers contribute to our understanding of online deviance, digital sociology, and network security [17].
Existing research in dark web forum analysis has employed various methodological approaches, ranging from manual content analysis to sophisticated automated systems incorporating machine learning and natural language processing techniques [18,19]. However, the rapidly evolving nature of dark web technologies and the methodological challenges inherent in this research domain necessitate a systematic review of current approaches [20]. Previous studies have demonstrated the feasibility of large-scale dark web data collection and analysis, with researchers successfully gathering millions of forum posts and analyzing patterns in criminal communication [2]. Nevertheless, the methodological diversity and the lack of standardized approaches in this field highlight the need for a comprehensive synthesis of current practices [21].
This systematic literature review aims to address this gap by providing a comprehensive overview of methodologies for data collection and analysis of dark web forum content, identifying best practices, highlighting methodological innovations, and outlining future research directions in this critical domain.
The primary objective of this paper is to systematically identify, evaluate, and synthesize methodologies for data collection and analysis of dark web forum content reported in the academic literature, providing a comprehensive overview of current practices and emerging trends in this research domain. Secondary objectives are to:
  • develop a comprehensive taxonomy of data collection methodologies employed in dark web forum research, categorizing approaches by technical implementation, scope, and effectiveness;
  • identify and evaluate analytical frameworks and techniques used for processing and interpreting dark web forum content;
  • catalog and assess the software tools, platforms, and technical infrastructure commonly employed in dark web forum research;
  • assess the quality and reliability of different methodological approaches, examining factors such as data completeness, accuracy, reproducibility, and validity of findings;
  • examine the ethical considerations and legal frameworks that guide dark web forum research; and
  • synthesize findings into recommendations for future research directions, methodological improvements, and technological developments in dark web forum analysis.
This systematic review addresses the following specific research questions:
RQ1: What are the primary methodologies currently employed for collecting data from dark web forums?
RQ2: What analytical techniques and frameworks are most commonly used for processing and interpreting dark web forum content?
RQ3: What tools and technologies are available for dark web forum research, and how do they compare in terms of effectiveness and reliability?
RQ4: What are the main challenges and limitations encountered in dark web forum research methodologies?
RQ5: How do different methodological approaches address ethical and legal considerations in dark web research?
These objectives and research questions guide the systematic review process, ensuring comprehensive coverage of the methodological landscape in dark web forum research while maintaining focus on practical applications and theoretical contributions to the field.

2. Method

This systematic literature review followed the PRISMA 2020 guidelines (https://www.prisma-statement.org/, accessed on 1 September 2025); the PRISMA statement, flow diagram, and checklist are reported in Appendix A. Predefined inclusion and exclusion criteria ensured the selection of relevant, high-quality studies. Studies were included if they met all of the following criteria:
  • Topic relevance: focuses on dark web forum analysis, including research on dark web forums, Tor hidden service forums, or anonymous online communities accessible through anonymization networks.
  • Methodological focus: discusses specific methodologies for data collection from dark web forums, including but not limited to web crawling and scraping techniques, automated data collection systems, manual data gathering approaches, and hybrid collection methodologies.
  • Analytical component: presents analysis techniques for dark web forum content, including natural language processing approaches, machine learning and artificial intelligence methods, social network analysis techniques, statistical analysis methods, and content analysis frameworks.
  • Study design: employs experimental, observational, or mixed-method designs that provide empirical evidence or methodological innovations.
  • Language: published in English, to ensure consistent interpretation and analysis.
  • Publication type: peer-reviewed journal articles, conference proceedings, or technical reports that present original research or significant methodological contributions.
  • Methodological clarity: provides sufficient detail about the methodological approaches to enable evaluation and potential replication.
Studies were excluded if they met any of the following criteria:
  • Publication type: case reports, editorials, opinion pieces, book reviews, or commentary articles without original research contributions.
  • Scope limitation: focuses exclusively on surface web forums or traditional social media platforms without dark web components.
  • Topic divergence: addresses general cybersecurity topics without a specific focus on dark web forum analysis or methodologies.
  • Language barrier: non-English publications that could not be accurately translated or interpreted.
  • Methodological insufficiency: lacks clear methodology descriptions or sufficient detail to evaluate the approaches employed.
  • Duplicate publications: multiple publications reporting identical methodologies or datasets without additional contributions.
  • Outdated technology: focuses exclusively on obsolete technologies or platforms no longer relevant to current dark web research.
  • Purely theoretical work: presents only theoretical frameworks without empirical validation or practical implementation.
The study selection process followed a systematic approach: initial screening of all identified records based on titles and abstracts using the predefined eligibility criteria; full-text assessment of studies passing initial screening against the complete set of inclusion and exclusion criteria; quality assessment of included studies to evaluate methodological rigor and contribution significance; and final selection of studies meeting all criteria and quality thresholds for the final analysis. This systematic approach ensured that only high-quality, relevant studies contributed to the review findings while maintaining transparency and reproducibility in the selection process. As this systematic literature review synthesizes methodologies from existing studies rather than conducting original data collection or analysis (e.g., web crawling or preprocessing), no primary data artifacts such as seed lists, crawl logs, or version-pinned scripts were generated. The reproducibility concerns noted in the results (Section 3), where only 9% of reviewed studies provided full code and data access, pertain to the methodologies of the included studies, not the review process itself. The systematic review methodology, including search strategies, selection process, data extraction, and quality assessment, is fully documented in Supplementary Material (PRISMA Protocol Document) to ensure transparency and reproducibility of the review process.
This systematic literature review employed a comprehensive search strategy across multiple academic databases and information sources to ensure thorough coverage of the literature on dark web forum analysis methodologies. SciSpace served as the primary academic search platform, providing access to a comprehensive corpus of research papers across multiple disciplines. The platform’s semantic search capabilities were particularly valuable for identifying papers related to dark web forum analysis methodologies. Three targeted searches were conducted: “dark web forums data collection methodology analysis”, “darknet forum analysis techniques systematic review”, and “tor hidden service forum data collection methods”. Google Scholar was utilized to capture a broader range of academic publications, including conference proceedings, technical reports, and interdisciplinary research that might not be indexed in traditional academic databases. The comprehensive coverage of Google Scholar ensured inclusion of emerging research and gray literature relevant to dark web methodologies. PubMed was searched to identify biomedical and health-related research involving dark web forums, particularly studies examining drug-related activities, public health implications, and medical aspects of dark web communities.
The search strategy was developed through an iterative process involving keyword identification based on preliminary research and expert consultation, focusing on terms related to dark web terminology, forum and community platform descriptions, data collection and analysis methodologies, and technical infrastructure terms (Tor, hidden services, etc.); search term expansion using synonyms, related terms, and alternative spellings to ensure comprehensive coverage; boolean query construction to optimize search precision while maintaining sensitivity; and platform-specific optimization, adapting search queries to leverage the specific capabilities and syntax requirements of each database platform.
Supplementary sources included reference tracking, where references from included studies were manually reviewed to identify additional relevant publications not captured through database searches; citation analysis, employing forward citation tracking to identify recent studies citing key papers in the field; and expert consultation, where subject matter experts were consulted to identify potentially relevant studies and validate the comprehensiveness of the search strategy.
All database searches were conducted in September 2025, ensuring access to the most current available literature. No date restrictions were applied, to ensure comprehensive historical coverage of methodological developments in the field. Searches were limited to English-language publications to ensure consistent analysis and interpretation. The multi-database approach and comprehensive search strategy employed in this review ensure broad coverage of the relevant literature while minimizing the risk of missing significant methodological contributions to the field of dark web forum analysis.
The SciSpace platform was queried using semantic search terms optimized for its natural language processing capabilities. Three comprehensive searches were executed: a primary search, “dark web forums data collection methodology analysis”, yielding the top 100 papers sorted by citation count in descending order and focusing on general methodologies; a secondary search, “darknet forum analysis techniques systematic review”, yielding the top 100 papers sorted similarly and focusing on specific analytical techniques; and a tertiary search, “tor hidden service forum data collection methods”, yielding the top 100 papers and focusing on Tor-based platforms.
Google Scholar searches employed boolean operators and targeted keyword combinations: “dark web forum analysis methodology” yielding 20 papers; “darknet forum data collection techniques” yielding 22 papers; and “tor forums analysis systematic review” yielding 20 papers. PubMed searches utilized MeSH terms and field-specific queries: “dark web forums methodology analysis” yielding 1 paper; “darknet forum data collection” yielding 0 papers; and “tor hidden service forums” yielding 1 paper. Overall, SciSpace yielded 300 records, Google Scholar 62, and PubMed 2, for a total of 364 records. Search sensitivity was validated using known relevant papers, ensuring that established methodological papers were captured; inter-database comparison identified potential gaps or biases. A protocol, included as Supplementary Material, was established for updates, though none were required for this paper; the protocol documents all search strategies for reproducibility. Data extraction was performed using piloted forms by one researcher and verified independently by a second researcher. Data labeling for methodological categorization and quality assessment was conducted systematically by two independent extractors. A standardized data extraction form, piloted on 3–5 studies to ensure consistency, was used to annotate study characteristics, methodological details, and quality indicators based on predefined criteria (e.g., methodological rigor, reproducibility, ethical considerations). Discrepancies in labeling were resolved through consensus discussions, with a third reviewer consulted when needed. Detailed procedures, including reviewer roles, pilot testing, and the calibration process, are documented in Supplementary Material. No missing data were reported for the final set of included studies. Analysis comprised qualitative and quantitative measures derived from the attributes extracted from the papers; Appendix B presents all the measures reported in the tables of the Results section.
Quality assessment of the included studies was conducted using a modified version of the ROBINS-I (Risk of Bias in Non-randomized Studies of Interventions) tool, adapted specifically for methodological studies in dark web forum research, following established practices in systematic reviews [22]. The assessment evaluated seven key domains: (1) Study Design Appropriateness (e.g., alignment of methods with research objectives); (2) Methodological Transparency (e.g., clear description of procedures); (3) Data Quality and Completeness (e.g., handling of missing data and validation measures); (4) Analytical Rigor (e.g., robustness of statistical or machine learning approaches); (5) Reproducibility Information (e.g., availability of code, data, or detailed protocols); (6) Ethical Considerations (e.g., discussion of privacy and legal issues); and (7) Conflict of Interest Declaration (e.g., disclosure of funding or biases). Overall risk of bias was judged as low, moderate, high, or critical based on the domain evaluations, with studies exhibiting critical risk excluded to ensure reliability. Detailed assessment procedures and results are provided in Supplementary Material.
As this systematic literature review involves secondary analysis of published studies, no original data collection was performed, and thus specific procedures for privacy preservation (e.g., de-identification) or data retention were not applicable. Ethical review was not required, as confirmed in Supplementary Material, which documents the review process’s compliance with ethical standards for secondary research. The limited documentation of concrete ethical procedures in the reviewed studies, noted in Section 3, represents a key finding, highlighting a gap in the field that underscores the need for standardized ethical frameworks in dark web forum research.

3. Results

The systematic search across three major databases yielded a total of 364 records. After duplicates were removed and inclusion/exclusion criteria were applied through title and abstract screening, followed by full-text assessment, a final set of 11 studies was included for comprehensive analysis. The PRISMA flow diagram is included in Appendix A.1, and the final list of studies used for the synthesis of results is presented in Appendix C. The reviewed literature revealed several distinct approaches to dark web forum data collection, each with specific advantages and limitations. Automated web crawling emerged as the predominant methodology for large-scale dark web forum data collection, with systems incorporating information collection, analysis, and visualization techniques that exploit various web information sources. These systems typically employ focused crawlers designed specifically for dark web forums, utilizing incremental crawling coupled with recall-improvement mechanisms; multi-stage processing implementing data crawling, scraping, and parsing in sequential stages; and hybrid approaches combining automated systems with human intervention for accessing restricted content [10] or filtering non-relevant sources such as surface-web mirrors [23,24]. Technical infrastructure approaches include virtual machine deployments utilizing multiple virtual machines functioning as Tor clients, each with a private IP address, for distributed data collection [9]; network traffic analysis generating pcap files containing network traffic and applying filtering to extract relevant data [16]; and anonymization protocols implementing multiple layers of anonymization to protect researcher identity during data collection [12].
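To make the collection setup concrete, the following minimal Python sketch, assuming a locally running Tor client on its default SOCKS port (9050), shows how a crawler of the kind described above might fetch a single hidden-service page; the .onion address is a hypothetical placeholder, and the choice of the requests library (with its SOCKS extra) is our illustrative assumption rather than the setup of any reviewed study.

```python
# Minimal sketch: fetch one hidden-service page through a local Tor client.
# Assumes Tor is listening on its default SOCKS port 9050 and that requests
# is installed with SOCKS support (pip install requests[socks]).
import requests

TOR_PROXY = "socks5h://127.0.0.1:9050"  # socks5h resolves .onion names inside Tor

def fetch_onion_page(url: str, timeout: int = 60) -> str:
    """Fetch a page via the Tor SOCKS proxy and return its HTML."""
    response = requests.get(
        url,
        proxies={"http": TOR_PROXY, "https": TOR_PROXY},
        timeout=timeout,  # Tor latency is high, so allow a generous timeout
    )
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    # Hypothetical placeholder address, not a real forum.
    html = fetch_onion_page("http://exampleforumxxxxxxxx.onion/index.php")
    print(html[:500])
```

In practice, the reviewed systems wrap such requests in retry logic and distribute them across multiple Tor circuits or virtual machines to cope with connection instability.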
The literature demonstrated extensive use of machine learning techniques for dark web forum analysis, including topic modeling with Latent Dirichlet Allocation for identifying discussion topics and non-parametric Hidden Markov Models for modeling topic evolution [25]; classification systems using Support Vector Machine models achieving high performance with precision rates reaching 92.3% and accuracy of 87.6% [26]; and clustering analysis with K-Means clustering using various distance metrics for content categorization and user behavior analysis [19]. Recent advancements in machine learning have also seen the application of pretrained language models for detecting misinformation and characterizing content on dark web forums, further enhancing the analytical capabilities in this domain [27]. Natural language processing applications include text preprocessing with TF-IDF weighting, outlier detection, and clustering evaluation methods [20]; sentiment analysis measuring emotional content and user sentiment patterns [28,29]; and authorship attribution using stylometry combined with structure and multitask learning using graph embeddings [19]. Social network analysis serves as a critical component for understanding dark web community structures, with algorithms for community detection identifying underground communities and influential members and discourse [17,30], centrality analysis highlighting structural patterns and identifying nodes of importance [15], and topic-based networks combining text mining and social network analysis using LDA to build topic-based social networks [17,31].
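As an illustration of the topic-modelling step mentioned above, the sketch below fits a small LDA model with scikit-learn; the four example posts are invented placeholders, and the reviewed studies would apply the same pattern to full crawled corpora after preprocessing.

```python
# Minimal sketch of LDA topic modelling over forum posts with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented placeholder posts standing in for a crawled, preprocessed corpus.
posts = [
    "selling fresh card dumps with high balance",
    "new phishing kit released supports otp bypass",
    "vendor review fast shipping and good stealth",
    "looking for tutorial on opsec and tor configuration",
]

# LDA operates on raw term counts rather than TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top terms per topic to label the discovered discussion themes.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")
```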
Performance comparison studies identified optimal crawler configurations, with Python-based systems using Scrapy and Selenium libraries most commonly employed for web scraping [7], specialized crawlers custom-built and optimized for dark web forum structures [10], and evaluation based on data completeness, collection speed, and reliability [21]. Anonymization tools are critical infrastructure components, including Tor network integration designed to operate efficiently within Tor network constraints [1], VPN and proxy chains for additional layers of anonymization [12], and identity management protocols for managing multiple research identities and access credentials [9].
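A minimal Scrapy spider of the kind these crawler configurations describe might look as follows; the start URL, CSS selectors, and settings are hypothetical, and Tor routing is assumed to be handled separately (e.g., by a proxy middleware).

```python
# Minimal sketch of a paginated forum spider with Scrapy; the selectors and
# start URL are hypothetical placeholders for an actual forum's structure.
import scrapy

class ForumSpider(scrapy.Spider):
    name = "forum"
    start_urls = ["http://exampleforumxxxxxxxx.onion/forum/threads"]
    custom_settings = {
        "DOWNLOAD_DELAY": 5,  # throttle requests; Tor circuits are slow
        "RETRY_TIMES": 5,     # tolerate flaky hidden-service connections
    }

    def parse(self, response):
        # Scrape each post on the current page.
        for post in response.css("div.post"):
            yield {
                "author": post.css("span.author::text").get(),
                "body": " ".join(post.css("div.body ::text").getall()),
            }
        # Incremental crawling: follow the pagination link if present.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```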
Challenges and limitations include technical issues such as access reliability, where dark web forums frequently change locations and implement access restrictions [10]; connection stability affected by Tor network latency [12]; dynamic content on JavaScript-heavy sites complicating automated collection [7]; and anti-bot measures deployed by forum administrators [10]. Ethical and legal considerations involve privacy concerns balancing research objectives with user privacy expectations, legal compliance navigating varying frameworks across jurisdictions [6], data handling for secure storage of potentially sensitive content [32], and institutional review challenges in obtaining approval [6]. Methodological limitations encompass sample bias in ensuring representative sampling [33], temporal validity affected by rapid changes in forum structures [5], validation challenges due to anonymity [34], and reproducibility issues arising from the dynamic nature of dark web platforms [21].
This synthesis is further supported by a quantitative analysis of the 11 papers with extractable methodological information, drawn from the 364 papers initially identified across SciSpace, Google Scholar, and PubMed. Analysis of data collection methodologies employed in dark web forum research is summarized in Table 1. Web crawling emerged as the dominant approach, utilized in two-thirds of the studies, reflecting its prevalence in automated data gathering. An emerging trend highlights the adoption of hybrid methodologies, which combine automation with human validation to enhance accuracy. Technologically, over 90% of the studies implemented Python-based solutions, underscoring the language’s dominance in this field. Scalability was notably high for automated approaches but remained limited for manual methods, indicating varying effectiveness across techniques. The evaluation of methodologies in this review relies on aggregate metrics (e.g., frequency of approaches, reported accuracy) as provided in the included studies, without original error analyses such as confusion profiles, ablation studies, or sensitivity to network instability, as these were not consistently reported in the literature. This gap, particularly the lack of detailed validation approaches (e.g., error analysis by topic or user segment), is itself a key finding of the review, underscoring the need for more robust methodological reporting in future dark web forum research.
As for analytical frameworks and techniques applied in dark web forum research (Table 2), natural language processing and machine learning emerged as the most prevalent categories, utilized in 63.6% and 45.5% of the studies, respectively, showcasing their critical role in content analysis. Social network analysis and statistical analysis followed, each employed in 36.4% of the studies, with a focus on community structures and data trends. An emerging trend highlights the use of diverse methods within machine learning, including supervised techniques such as Support Vector Machines for content classification and unsupervised approaches such as K-Means clustering for user/topic grouping. Technologically, 91% of the studies leveraged advanced NLP methods such as TF-IDF vectorization and topic modeling, with high success rates in feature extraction and theme identification. Complexity varied, with natural language processing and machine learning rated as high, while content analysis was rated low, reflecting differing analytical demands.
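To illustrate the community-structure analyses described above, the following sketch applies NetworkX degree centrality and modularity-based community detection to a toy reply graph; the edge list is invented for illustration, not drawn from any reviewed dataset.

```python
# Minimal sketch of reply-network analysis with NetworkX: centrality to spot
# influential members and modularity-based community detection.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Invented edges: (replying user, user replied to).
replies = [
    ("alice", "bob"), ("carol", "bob"), ("dave", "alice"),
    ("eve", "frank"), ("frank", "grace"), ("grace", "eve"),
]
G = nx.Graph(replies)

# Influential members: highest degree centrality.
centrality = nx.degree_centrality(G)
print(sorted(centrality.items(), key=lambda kv: -kv[1])[:3])

# Underground communities: modularity-based partition of the reply graph.
for i, community in enumerate(greedy_modularity_communities(G)):
    print(f"Community {i}: {sorted(community)}")
```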
Table 3 outlines the tools and technologies utilized in dark web forum research. The Python ecosystem dominated, employed in 72.7% of studies with tools like Scrapy (https://www.scrapy.org/) and Selenium (https://www.selenium.dev/, accessed on 13 October 2025), reflecting an increasing adoption trend. Tor/anonymization tools and analysis software followed, each used in 45.5% of studies, with stable and increasing trends, respectively, leveraging tools such as Tor Browser (https://tb-manual.torproject.org/, accessed on 13 October 2025) and RapidMiner (https://docs.rapidminer.com/9.9/studio/, accessed on 13 October 2025). The Tor Browser is a free, open-source web browser with built-in tools and configurations that route internet traffic through the Tor network; it can also be used to provide privacy and anonymity on other networks and on the surface web. An emerging trend highlights the growing reliance on specialized crawlers, utilized in 36.4% of studies, alongside stable use of database systems and a decreasing trend in infrastructure tools like virtual machines. Technologically, most of the studies incorporated Python libraries, with Scrapy and Selenium leading at 33.3% adoption for web scraping and browser automation, while specialized tools like OnionScan (https://onionscan.org/, accessed on 13 October 2025) showed high effectiveness in hidden service discovery.
Table 4 evaluates four key data collection approaches based on scalability, accuracy, cost, and technical complexity. Automated Crawling, used in over two-thirds of studies, demonstrated very high scalability but medium accuracy, balanced by high cost and complexity, yielding an overall score of 4.0/5. Hybrid Approaches, employed in 36.4% of studies, combined automation with manual verification, achieving very high accuracy and a top score of 4.2/5, despite very high complexity and low cost. Manual Collection, utilized in 18.2% of studies, offered very high accuracy and very low cost but was hindered by very low scalability and low complexity, resulting in a 2.2/5 score. Network Traffic Analysis, applied in 11.1% of studies, showed medium scalability and high accuracy, with medium cost and very high complexity, scoring 3.4/5. This analysis highlights the trade-offs between automation and manual methods in addressing the dynamic challenges of dark web data collection. For Table 4, data were extracted from the 11 included studies with detailed methodological descriptions, focusing on four key performance criteria: scalability, accuracy, cost, and technical complexity. Each methodology (Automated Crawling, Hybrid Approaches, Manual Collection, and Network Traffic Analysis) was evaluated by the authors of this study based on qualitative assessments derived from study outcomes, author-reported metrics, and inferred resource demands. Scores were assigned on a 5-point scale, with the overall score calculated as the average of the individual criterion ratings, normalized to reflect relative effectiveness. This synthesis ensured a comparative analysis grounded in empirical evidence from the SLR dataset.
Table 5 assesses the field’s development across five dimensions. Methodological Standardization was rated high in 33.3%, medium in 44.4%, and low in 22.2% of studies, with an increasing trend reflecting growing consensus. Tool Sophistication showed high maturity in 55.6%, medium in 33.3%, and low in 11.1%, also increasing as advanced tools like Python libraries gain traction. Reproducibility lagged, with only 11.1% rated high, 33.3% medium, and 55.6% low, indicating a decreasing trend due to limited data sharing. Scalability Focus was high in 66.7%, medium in 22.2%, and low in 11.1%, with an increasing trend driven by automated solutions. Ethical Considerations were high in 22.2%, medium in 44.4%, and low in 33.3%, remaining stable amid ongoing debates. These indicators suggest a maturing field with areas for improvement in reproducibility and ethics. For Table 5, the process involved a quantitative assessment of the 11 studies, categorizing five indicators (Methodological Standardization, Tool Sophistication, Reproducibility, Scalability Focus, and Ethical Considerations) into high, medium, or low maturity levels based on the frequency and depth of their discussion or implementation across the literature as assessed by the authors of this study. Percentages were computed from the proportion of studies addressing each indicator at each level, while trends (Increasing, Decreasing, Stable) were inferred from temporal patterns and author commentary on evolving research practices. This approach provided a snapshot of the field’s maturity, derived systematically from the SLR’s comprehensive review of 364 initial papers.
Table 6 presents an analysis of multi-method approach combinations utilized in dark web forum research. The combination of Crawling + Machine Learning (ML) + Statistics was the most frequent, employed in 27.3% of the 11 studies, demonstrating high accuracy and good scalability, though with high complexity. Crawling + Natural Language Processing (NLP) + Social Network Analysis, used in 18.2% of studies, offered comprehensive analysis with high complexity, reflecting its depth in capturing community dynamics. Manual + Automated Verification, also applied in 18.2% of studies, achieved high quality and moderate scale with medium complexity, balancing human oversight with technical efficiency. Single Method Only approaches, likewise seen in 18.2% of studies, were characterized by limited scope and low complexity, indicating a simpler but less versatile strategy. This distribution underscores a trend toward integrating multiple methods to enhance analytical robustness, with complexity increasing alongside effectiveness.
Overall results highlight the following key insights derived from the systematic literature review of dark web forum research. The most effective methodological combinations include Web Crawling + Machine Learning + Statistical Analysis, utilized in 27.3% of papers, showcasing its strength in delivering high accuracy and scalability. Hybrid Automated-Manual Approaches, employed in 36.4% of papers, stand out for their balanced quality and efficiency. Additionally, Python-based tool chains dominate, featured in 72.7% of papers, reflecting their widespread adoption. Technology convergence is evident with the Python Ecosystem leading, relied upon by 73% of studies for its versatile libraries. Tor Infrastructure serves as a standard access method in 64% of studies, ensuring anonymity in data collection. An overwhelming 91% preference for open-source tools underscores a community-driven research approach. However, significant research gaps persist: reproducibility remains low, with only 9% of studies providing full code and data access, hindering validation efforts. Limited API integration is notable, with no studies employing API-based collection, potentially missing efficient data retrieval opportunities. Scalability challenges affect 36% of studies, indicating difficulties in processing large-scale datasets, which warrants further methodological innovation.
Future directions for dark web forum research methods for data collection and analysis, informed by trends from the systematic literature review, point toward several key areas. The increasing adoption of hybrid methodologies suggests a continued shift toward combining automated and manual approaches to enhance data quality and scalability. The growing integration of advanced machine learning techniques, such as sophisticated algorithms, indicates a need for deeper analytical capabilities to uncover complex patterns. Infrastructure standardization emerges as a critical priority, necessitating the development of common frameworks to streamline tools and processes across studies. Additionally, the rising focus on ethical framework development underscores the importance of establishing responsible research practices to address privacy and legal concerns effectively. This synthesis reveals a mature but rapidly evolving field with sophisticated methodological approaches balanced against significant technical, ethical, and legal challenges that continue to shape research practices in dark web forum analysis.

4. Discussion

This systematic literature review reveals a sophisticated and rapidly evolving landscape of methodologies for dark web forum data collection and analysis. The synthesis of 364 identified studies, with detailed quantitative analysis from 11 papers, demonstrates significant methodological diversity, with researchers employing increasingly sophisticated technical approaches to overcome the inherent challenges of dark web research. The field has achieved considerable methodological maturity, evidenced by standardized crawling approaches converging toward Python-based systems using established libraries like Scrapy and Selenium, sophisticated analytical frameworks integrating advanced machine learning, NLP, and social network analysis techniques, performance optimization through systematic evaluation and comparison of different technical approaches, and scalability solutions capable of processing millions of forum posts and users. This trend toward standardized tools mirrors efforts in other fields, such as medical research, where systematic reviews and meta-analyses have emphasized rigorous methodological frameworks to ensure consistency and comparability [35,36].
Several key innovation trends emerge from the literature, including multi-modal analysis integrating text analysis, network analysis, and temporal pattern recognition [5]; real-time processing with systems capable of near real-time analysis of dark web forum activity [16]; cross-platform integration spanning multiple dark web platforms and anonymization networks [12]; and automated quality assessment implementing systems for assessing data quality and reliability [21].
The findings have several important implications for future research and practice. For research, the diversity of approaches suggests a need for greater methodological standardization to facilitate comparison and replication across studies, with the development of common frameworks and evaluation metrics benefiting the field. Interdisciplinary integration combining expertise from computer science, criminology, sociology, and cybersecurity should continue to foster collaboration addressing the complex challenges of dark web analysis. Longitudinal study designs are necessitated by the dynamic nature of dark web forums to capture temporal changes in community structures, communication patterns, and methodological effectiveness. For practitioners in law enforcement, cybersecurity, and policy-making, the systematic comparison of methodological approaches provides evidence-based guidance for selecting appropriate tools and techniques for specific operational objectives; understanding technical infrastructure requirements helps with resource planning and allocation for dark web monitoring initiatives; and the complexity of the methodologies highlights the need for specialized training programs.
This systematic review has several limitations, including potential publication bias toward studies with positive results or novel methodological contributions; the restriction to English-language publications, which may have excluded relevant research from other linguistic communities; a temporal scope in which the rapid pace of technological change may limit the current relevance of older methodological approaches; and access limitations, as some relevant research may be restricted or classified, limiting comprehensive coverage. An additional limitation is the potential skew of studies toward dark web forums that are easier to crawl and maintain stable connections through Tor, which may underrepresent closed or frequently migrating communities. This focus on more accessible forums could introduce selection bias, limiting the generalizability of findings to the broader dark web ecosystem. While a stratified sampling design and a record of ineligible or inaccessible sites could mitigate this bias, the dynamic and hidden nature of these platforms poses significant challenges to such approaches. Nevertheless, this limitation also represents a strength of the review, as it highlights a critical gap in current methodologies, underscoring the need for future research to develop techniques capable of accessing and analyzing less stable or restricted dark web communities. A further limitation is the relatively limited diversity of methodological approaches reported in the literature, as evidenced by the predominance of Python-based automated crawling and machine learning techniques in the analyzed studies. This reduced variety may reflect convergence toward certain established methods, potentially overlooking alternative or emerging approaches that could offer novel insights into dark web forum analysis. Future research could address this by exploring a broader range of methodologies, including less common techniques such as API-based data collection or qualitative ethnographic approaches, to enhance methodological diversity in this field.
Several challenges continue to impact the field, such as the lack of comprehensive ethical guidelines specifically tailored to dark web research creating uncertainty for researchers and institutional review boards, varying legal interpretations across jurisdictions creating challenges for international research collaboration and data sharing, an ongoing technical arms race between data collection techniques and anti-analysis measures deployed by forum administrators requiring continuous methodological innovation, and validation difficulties where the anonymous nature of dark web forums makes traditional validation approaches challenging, necessitating novel approaches to ensuring research reliability [34]. The reviewed studies, while highlighting the effectiveness of hybrid collection strategies, provide limited operational guidance for addressing high-change environments, such as lightweight monitoring for URL churn, periodic recrawls with drift detection, or fallback routines for JavaScript-heavy pages. This gap, reflective of the broader challenge of platform dynamism, underscores the need for future research to develop robust strategies that enhance stability and maintain coverage in dynamic dark web ecosystems.
Future research directions present opportunities for methodological advancement in artificial intelligence integration for automated content analysis and pattern recognition, blockchain analysis integration combining forum analysis with cryptocurrency transaction analysis for comprehensive understanding, cross-platform correlation tracking user activity and content across multiple dark web platforms, and predictive analytics developing systems capable of predicting forum behavior and emerging threats. Ethical and legal framework development is critical, including comprehensive ethical frameworks specific to dark web research, clear guidelines for legal compliance across jurisdictions, protocols for secure handling, storage, and sharing of dark web research data, and frameworks for engaging relevant stakeholders in research design and implementation. Technical infrastructure should focus on scalability enhancement for handling increasing volumes of dark web data, real-time capabilities supporting analysis and alerting systems, security hardening for protecting researchers and infrastructure, and interoperability standards for data exchange and tool interoperability across research groups.

5. Conclusions

This systematic review demonstrates that while significant progress has been made in developing methodologies for dark web forum analysis, substantial opportunities remain for methodological innovation, ethical framework development, and practical application enhancement. The field’s continued evolution will require sustained interdisciplinary collaboration and ongoing attention to the complex technical, ethical, and legal challenges inherent in dark web research. The comprehensive analysis of methodologies presented here provides a foundation for future research and practice in this critical domain.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics14214191/s1, S1: PRISMA Protocol Document.

Author Contributions

Conceptualization, L.d.-M. and J.-A.M.-M.; Methodology, L.d.-M. and Z.S.; Software, J.-A.M.-M.; Validation, L.d.-M., J.-A.M.-M. and Z.S.; Formal analysis, L.d.-M. and J.-A.M.-M.; Investigation, L.d.-M., J.-A.M.-M. and Z.S.; Resources, L.d.-M.; Data curation, Z.S.; Writing—original draft, L.d.-M., J.-A.M.-M. and Z.S.; Writing—review & editing, L.d.-M., J.-A.M.-M. and Z.S.; Visualization, J.-A.M.-M.; Supervision, L.d.-M. and Z.S.; Project administration, L.d.-M. and Z.S.; Funding acquisition, L.d.-M. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been developed within the “Recovery, Transformation and Resilience Plan”, project C084/23 Ada Byron INCIBE-UAH, funded by the European Union (Next Generation).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. PRISMA Statement, Flow Diagram and Checklist

Appendix A.1. PRISMA Statement & Flow Diagram

PRISMA statement: This systematic review adheres to the PRISMA 2020 guidelines [37], http://www.prisma-statement.org/, accessed 1 September 2025.
Figure A1. PRISMA Flow Diagram. Source: authors’ contribution based on the PRISMA guidelines.

Appendix A.2. PRISMA Checklist

The following table reports the PRISMA checklist for the present study.
| Section and Topic | Item | Checklist Item | Reported on Page |
|---|---|---|---|
| Title | 1 | Identify the report as a systematic review, meta-analysis, or both. | Title page |
| Abstract | 2 | Includes background, objective, methods, results, conclusions. | Page 1 |
| Introduction | | | |
| Rationale | 3 | Describe the rationale for the review in the context of what is already known. | Pages 1–3: Research rationale and current state of research. |
| Objectives | 4 | Provide an explicit statement of questions being addressed with reference to participants, interventions, comparisons, outcomes, and study design (PICOS). | Pages 2–3: Objectives and research questions (focus on methodologies, tools, trends). |
| Methods | | | |
| Eligibility criteria | 5 | Specify study characteristics (e.g., PICOS, length of follow-up) and report characteristics (e.g., years considered, language, publication status) used as criteria for eligibility, giving rationale. | Pages 3–4: Inclusion/exclusion criteria (methodological focus, 2020–2025, English). |
| Information sources | 6 | Describe all intended information sources (e.g., databases with dates of coverage, contact with study authors to identify studies). | Pages 4–5: SciSpace, Google Scholar, PubMed; searched September 2025. |
| Search strategy | 7 | Present full search strategies for all databases, registers, and websites, including any filters and limits for the search or search date. | Pages 4–5: Search terms and strategies outlined. |
| Selection process | 8 | Specify the methods used to decide whether a study was eligible for inclusion (e.g., screening, eligibility), including the number of reviewers and whether conducted independently; mention automation tools if used. | Page 4: Study selection (screening, full-text review, quality assessment). |
| Data collection process | 9 | Describe the method of data extraction from reports and any processes for obtaining and confirming data from investigators. | Page 5: Piloted extraction forms, independently verified. |
| Data items | 10a | List and define all variables for which data were sought (e.g., PICOS, funding sources) and any assumptions and simplifications made. | Page 4: Variables include methodologies, tools, findings, limitations. |
| | 10b | If any, report which of these variables were not available and how this was handled. | Page 5: No data were missing. |
| Study risk of bias assessment | 11 | Describe the method of assessing risk of bias, including the tool used, number of reviewers, and whether conducted independently; mention automation tools if used. | Pages 4–5: Quality assessment for methodological rigor. |
| Effect measures | 12 | Specify the principal summary measures (e.g., risk ratio, difference in means) or provide a rationale if none used. | N/A (qualitative synthesis, no meta-analysis). |
| Synthesis methods | 13a | Describe the methods of handling data and combining results, including grouping studies for synthesis. | Pages 5–10: Narrative synthesis of methodologies. |
| | 13b | Specify any assessment of risk of bias integrated into the synthesis. | N/A (qualitative synthesis). |
| | 13c | Describe any methods for handling data not suitable for synthesis. | N/A (all data synthesized narratively). |
| | 13d | Describe the approach to synthesizing results (e.g., narrative, quantitative) and software used. | Pages 5–10: Narrative synthesis, tables used. |
| | 13e | Describe any methods used to explore or account for heterogeneity. | N/A (qualitative narrative synthesis). |
| | 13f | Describe any sensitivity analyses conducted. | Page 5: Search sensitivity. |
| Reporting bias assessment | 14 | Describe any methods used to assess risk of bias due to missing results. | Page 11: Mentioned as a limitation. |
| Certainty assessment | 15 | Describe any methods used to assess certainty (e.g., GRADE) and how this was done. | Not formally assessed; quality noted (page 9). |
| Results | | | |
| Study selection | 16a | Describe the results of the search and selection process, with a flow diagram. | Pages 4–5: 364 records to 11 included (flow diagram included). |
| | 16b | Cite studies excluded during full-text review with reasons. | Synthesized in results/flowchart (51 studies excluded for scope limitation and methodological insufficiency). |
| Study characteristics | 17 | Cite each included study and provide a table of their characteristics. | Synthesized in results (pages 5–10). |
| Risk of bias in studies | 18 | Present data on risk of bias for each study and, if applicable, any outcome level assessment. | Quality synthesis (page 4). |
| Results of individual studies | 19 | For all outcomes, provide results for each study (e.g., summary data, effect estimates). | Pages 5–10: Narrative and tables. |
| Results of syntheses | 20a | For each synthesis, briefly summarize characteristics of included studies. | Pages 5–10: Summary of 11 studies. |
| | 20b | Present results of all syntheses conducted (e.g., measures of effect). | N/A (narrative synthesis). |
| | 20c | Present results of any investigations of heterogeneity. | N/A (qualitative narrative synthesis). |
| | 20d | Present results of any sensitivity analyses. | N/A (qualitative narrative synthesis). |
| Reporting biases | 21 | Present assessments of risk of bias due to missing results. | Limitations (page 11). |
| Certainty of evidence | 22 | Present assessments of certainty for each outcome. | N/A (narrative synthesis). |
| Discussion | | | |
| Discussion | 23a | Provide a general interpretation of the results in the context of other evidence. | Pages 10–11. |
| | 23b | Discuss any limitations of the evidence included in the review. | Page 11. |
| | 23c | Discuss any limitations of the review processes. | Page 11. |
| | 23d | Discuss implications for practice, policy, or future research. | Pages 11–12. |
| Other Information | | | |
| Registration and protocol | 24a | Provide registration information or state if not registered. | Not registered. |
| | 24b | Indicate where the protocol can be accessed or state if none was prepared. | Provided as Supplementary Material. |
| | 24c | Describe any amendments to the protocol. | No amendments. |
| Support | 25 | Describe sources of support and role of funders. | Page 12: Funding section (European Union Next Generation). |
| Competing interests | 26 | Declare any competing interests. | No competing interests declared. |
| Availability of data/code | 27 | Report which data, code, and materials are publicly available and where. | All materials included in the paper. |
Source: authors’ contribution.

Appendix B. Rating Methodology and Calculations

This Appendix reports the calculations and rating method used in the tables presented in the Results section.
Table 1. Data Collection Methodologies.
Formula for Effectiveness Rating = (Scalability × 0.3) + (Accuracy × 0.25) + (Cost-Efficiency × 0.2) + (Technical Feasibility × 0.15) + (Adoption Rate × 0.1)
where Scalability, Accuracy, Cost-Efficiency, Technical Feasibility, and Adoption Rate are qualitatively measured on a 1–5 scale based on the following evidence base for the sample of papers adopting each methodology:
  • Scalability: Based on reported dataset sizes (papers processing 100 K+ posts = 5, <1 K posts = 1).
  • Accuracy: Based on reported validation methods and error rates.
  • Cost-Efficiency: Based on resource requirements and infrastructure needs.
  • Technical Feasibility: Based on implementation complexity and tool availability.
  • Adoption Rate: Based on frequency in reviewed papers.
Scoring criteria for Effectiveness Rating (1–5 scale): Very low (<1), low (1–1.99), medium (2–2.99), high (3–3.99), very high (>4).
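As a worked illustration of this rating, the sketch below computes the weighted sum and buckets it on the scale above; the criterion scores are invented for the example, and the same weighted-sum pattern applies to the Table 4 Overall Score with weights 0.3/0.3/0.2/0.2.

```python
# Sketch of the Table 1 Effectiveness Rating: a weighted sum of five 1-5
# criterion scores, bucketed on the scale given above.
WEIGHTS = {
    "scalability": 0.30, "accuracy": 0.25, "cost_efficiency": 0.20,
    "technical_feasibility": 0.15, "adoption_rate": 0.10,
}

def effectiveness_rating(scores: dict) -> float:
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

def bucket(rating: float) -> str:
    if rating < 1: return "very low"
    if rating < 2: return "low"
    if rating < 3: return "medium"
    if rating < 4: return "high"
    return "very high"

# Invented example scores for an automated-crawling methodology.
example = {"scalability": 5, "accuracy": 3, "cost_efficiency": 2,
           "technical_feasibility": 3, "adoption_rate": 4}
rating = effectiveness_rating(example)
print(f"{rating:.2f} -> {bucket(rating)}")  # 3.50 -> high
```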
Table 2. Analytical Frameworks.
Classification method for Complexity Level: Qualitative assessment of technical implementation requirements based on the following criteria.
| Complexity Level | Technical Skills Required | Implementation Time | Tool Sophistication | Mathematical Background |
|---|---|---|---|---|
| High | Advanced programming + ML expertise | >2 months | Custom algorithms/frameworks | Graduate-level statistics/ML |
| Medium | Intermediate programming | 2–8 weeks | Standard libraries/tools | Undergraduate-level statistics |
| Low | Basic programming/tools | <2 weeks | Point-and-click interfaces | High school mathematics |
Validation: Cross-referenced with reported implementation times and required expertise in papers.
Examples:
  • High: “Natural Language Processing”—Requires NLTK/spaCy expertise, linguistic knowledge, custom model training.
  • Medium: “Social Network Analysis”—Uses NetworkX/Gephi, requires graph theory understanding.
  • Low: “Content Analysis”—Manual coding, basic statistical analysis.
Classification method for Primary Use Case: Frequency of reported application.
Validation Method: Manual review of paper abstracts and methodology sections to confirm reported applications.
Formula for Success Rate: (Papers reporting positive results/Total papers using technique) × Reported accuracy metrics
Table 2 reports rating based on the following scale: Very low (<20%), low (20–40%), medium (40–60%), high (60–80%), very high (>80%).
Table 3. Tools and Technologies.
Formula for Adoption Trend = (Recent usage frequency − Historical usage frequency)/Time period
Analysis Period: 2007–2025. Temporal Segments:
  • Early period (2007–2013): 3 papers.
  • Recent period (2018–2025): 8 papers.
Table 3 reports Adoption Trend ratings on the following scale: Decreasing (<−50%), Stable (−50% to 50%), Increasing (>50%).
Validation: Cross-referenced with technology release dates and community adoption patterns.
Formula for Adoption Rate = (Number of papers using library/Total papers with extractable tool data) × 100%
Classification method for Effectiveness of specialized dark web tools: Qualitative assessment based on the following criteria:
  • High: >80% success rate in intended function + Multiple paper validation.
  • Medium: 60–80% success rate + Single paper validation.
  • Low: <60% success rate or limited validation.
Table 4. Methodology Effectiveness Matrix.
Classification method for Scalability: qualitative assessment based on records processed across all studies using the following method:
  • Very high: >500 K records processed.
  • High: 100–500 K records.
  • Medium: 10–100 K records.
  • Low: 1–10 K records.
  • Very low: <1 K records.
Classification method for Accuracy: Qualitative assessment based on validation studies and reported error rates reported in papers using the methods for data collection.
Classification method for Cost: Qualitative assessment of hardware, software, and personnel requirements, where each requirement contributes up to 1.67 points, summed to yield a 1–5 score. Scoring criteria (1–5 scale): very low (<1), low (1–1.99), medium (2–2.99), high (3–3.99), very high (>4).
Examples:
  • High: Open-source tools, standard hardware, computer scientists.
  • Low: Custom infrastructure, specialized hardware & software personnel.
Classification method for Scalability: Qualitative assessment based on implementation difficulty, maintenance requirements and skill level needed. Scoring criteria (1–5 scale): Very low (<1), low (1–1.99), medium (2–2.99), high (3–3.99), very high (>4).
Formula for Overall Score = (Scalability × 0.3) + (Accuracy × 0.3) + (Cost × 0.2) + (Technical Complexity × 0.2)
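A sketch of the weighted Overall Score follows. The mapping of the ordinal ratings to the 1–5 scale (Very low = 1 … Very high = 5) is assumed; under that mapping the sketch reproduces the 4.0/5 reported for automated crawling in Table 4.

```python
# Sketch of the Overall Score weighting for Table 4.
# The ordinal-to-numeric mapping below is an assumption.

LEVELS = {"Very low": 1, "Low": 2, "Medium": 3, "High": 4, "Very high": 5}

def overall_score(scalability: str, accuracy: str,
                  cost: str, complexity: str) -> float:
    s, a, c, t = (LEVELS[x] for x in (scalability, accuracy, cost, complexity))
    return 0.3 * s + 0.3 * a + 0.2 * c + 0.2 * t

# Automated crawling profile from Table 4:
print(overall_score("Very high", "Medium", "High", "High"))  # 4.0
```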
Table 5. Research Maturity Indicators.
Formula for Methodological Standardization = (Papers using common approaches/Total papers) × 100%
where the level was assessed qualitatively based on the following criteria:
  • High: papers use near-identical methodologies (similarity > 70%).
  • Medium: papers use similar approaches (30–70% similar).
  • Low: papers use unique approaches (<30% similar).
Formula for Tool Sophistication = (Papers using advanced ML/AI tools/Total papers) × 100%
where the level was assessed qualitatively based on the following criteria:
  • High: papers use advanced ML.
  • Medium: papers use standard tools.
  • Low: papers use basic tools.
Formula for Reproducibility = (Papers providing code/data/Total papers) × 100%
where the level was assessed qualitatively based on the following criteria:
  • High: papers provide full materials.
  • Medium: papers provide partial materials.
  • Low: papers provide minimal materials.
Formula for Scalability Focus = (Papers reporting the number of records successfully processed/Total papers) × 100%
where the level was assessed based on the following criteria:
  • High: papers reporting >10 K records processed successfully.
  • Medium: papers reporting >1 K and ≤10 K records processed successfully.
  • Low: papers reporting <1 K records processed successfully.
Formula for Ethical Considerations: (Papers describing ethical considerations/Total papers) × 100%
where the level was assessed qualitatively based on the following criteria:
  • High: papers provide full description of ethical considerations within a separate section or subsection.
  • Medium: papers provide partial details of ethical considerations.
  • Low: papers provide minimal or no details of ethical considerations.
Classification method for Trend, where High, Medium, and Low are the percentages of papers at each level of the indicator:
  • Increasing: if High > Medium ≥ Low or High ≥ Medium > Low.
  • Decreasing: if Low > Medium ≥ High or Low ≥ Medium > High.
  • Stable: in all other cases.
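A compact sketch of the Table 5 pipeline follows: each maturity indicator is a percentage of the 11 core papers, split by level, and the trend is read off the High/Medium/Low distribution using the rule above.

```python
# Sketch of the Table 5 indicators and the trend rule above.

def indicator_pct(papers_meeting: int, total: int = 11) -> float:
    """Generic (Papers meeting criterion / Total papers) x 100% indicator."""
    return papers_meeting / total * 100

def classify_trend(high: float, medium: float, low: float) -> str:
    if high > medium >= low or high >= medium > low:
        return "Increasing"
    if low > medium >= high or low >= medium > high:
        return "Decreasing"
    return "Stable"

print(f"{indicator_pct(3):.1f}%")           # 27.3%: e.g., 3 papers provide full materials
print(classify_trend(45.5, 36.4, 18.2))     # "Increasing" (Methodological Standardization)
print(classify_trend(27.3, 27.3, 45.5))     # "Decreasing" (Reproducibility)
print(classify_trend(27.3, 45.5, 27.3))     # "Stable" (Ethical Considerations)
```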
Table 6. Multi-method Approach Combination.
Formula for Percentage (%) = (Papers using combination/Total papers) × 100%
Method for determining Success Indicators: Qualitative assessment based on reported outcomes.
Classification method for determining Complexity: Qualitative assessment based on the following criteria:
  • High: 3+ methods with advanced techniques.
  • Medium: 2–3 methods with standard techniques.
  • Low: 1 method only or 2 methods with basic techniques.
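A minimal sketch of this complexity rule; the `technique_level` argument is our abstraction of the advanced/standard/basic distinction, which the review assesses qualitatively.

```python
# Sketch of the complexity rule for method combinations in Table 6.

def combination_complexity(n_methods: int, technique_level: str) -> str:
    if n_methods >= 3 and technique_level == "advanced":
        return "High"
    if n_methods == 1 or technique_level == "basic":
        return "Low"
    return "Medium"

print(combination_complexity(3, "advanced"))  # "High": crawling + ML + statistics
print(combination_complexity(2, "standard"))  # "Medium": manual + automated verification
print(combination_complexity(1, "basic"))     # "Low": single method only
```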

Appendix C. Complete List of Papers Used in Methodological Dark Web Forum Analysis SLR

Based on the systematic literature review analysis, the following 11 papers formed the core dataset for the methodological tables and analysis. These papers were selected because they contained extractable methodological information and detailed descriptions of data collection and analysis approaches.
No. | Reference | Title | Source | Methodological Contribution | Key Methods
1 | [28] | Affect Intensity Analysis of Dark Web Forums | Proceedings of the 2007 IEEE Intelligence and Security Informatics | Pioneered affect analysis techniques for dark web forum content. Developed an affect lexicon using psycholinguistic features and applied LIWC (Linguistic Inquiry and Word Count) analysis. Implemented sentiment analysis algorithms to measure emotional intensity and psychological patterns in extremist forum discussions. | Affect lexicon development, LIWC analysis, sentiment analysis, psycholinguistic feature extraction, NLTK processing
2 | [31] | Towards Detecting Influential Members and Critical Topics from Dark Web Forums: A Data Mining Approach | Journal of Information and Organizational Sciences | Presented a comprehensive data mining pipeline for identifying influential forum members and critical discussion topics. Implemented TF-IDF weighting, outlier detection, and K-Means clustering with multiple distance metrics. Used cluster evaluation techniques including the Elbow method, Silhouette analysis, and Davies-Bouldin validation. Deployed the methodology on the RapidMiner platform. | TF-IDF vectorization, K-Means clustering, outlier detection, cluster validation, RapidMiner implementation
3 | [10] | Performance Comparison of TOR Hidden Service Crawlers | Bilecik Şeyh Edebali Üniversitesi Fen Bilimleri Dergisi, 6(2), 293–307 | Conducted a systematic experimental comparison of four major Tor hidden service crawlers: OnionCrawler, TorBot, TorScrapper, and OnionScan. Evaluated performance across multiple metrics including crawl time, data volume, link discovery, and content retrieval effectiveness. Provided evidence-based recommendations for crawler selection in Tor research. | Comparative crawler evaluation, performance benchmarking, Tor network analysis, PostgreSQL data storage, Gephi visualization
4 | [29] | Link and Content Analysis | Book chapter in “Dark Web: Exploring and Data Mining the Dark Side of the Web”, Springer | Established a foundational methodology combining information collection, analysis, and visualization for dark web research. Applied the approach to 39 jihad-related websites to extract content, analyze site relationships, and measure activity levels. Pioneered an integrated approach to dark web content analysis. | Combined information collection, link analysis, content analysis, visualization techniques, relationship mapping
5 | [23] | LLM-Based Topic Modeling for Dark Web Q&A Forums: A Comparative Analysis With Traditional Methods | IEEE Access | Presented a comparative framework evaluating LLMs against traditional methods for topic modeling in dark web Q&A forums. Analyzed short, informal posts using single-word and two-word topic extraction. Quantified method alignment through semantic similarity and Levenshtein distance metrics, demonstrating LLM superiority in capturing contextually relevant themes. | Web scraping, custom Python preprocessing scripts, TF-IDF vectorization (sklearn), LDA (Gensim), LLM prompting (GPT-4o-mini, Gemini-2.0-flash), semantic similarity scoring, Levenshtein distance analysis
6 | [24] | What’s Going On in Dark Web Question and Answer Forums: Topic Diversity and Linguistic Characteristics | IEEE Access | Developed a multi-faceted analysis pipeline for examining thematic and linguistic diversity in three dark web Q&A forums. Combined automated crawling with tailored preprocessing and dual topic modeling approaches. Integrated linguistic metrics and non-parametric statistical tests to reveal forum-specific patterns in topic diversity, lexical diversity, semantic diversity, and syntactic complexity. | Automated web crawling, forum-specific Python preprocessing scripts, TF-IDF (Scikit-learn), LDA (Gensim) topic modeling, linguistic metrics (TTR, MATTR, Distinct-1/2, Embedding Entropy, NER Diversity, readability indices), Kruskal–Wallis tests, Dunn’s post hoc with Bonferroni correction
7 | [38] | Identification of Illegal Forum Activities Inside the Dark Net | 2018 International Conference on Machine Learning and Data Engineering (iCMLDE) | Developed a methodology for automated identification of illegal activities in dark web forums. Analyzed 14 forums containing 29,919 threads and 259,343 posts. Implemented custom crawling solutions with Elasticsearch integration for full-text search and Kibana for data visualization. Applied keyword-based classification and content analysis techniques. | Custom web crawling, Elasticsearch indexing, Kibana visualization, keyword classification, content analysis
8 | [39] | Data Capture and Analysis of Darknet Markets | Australian National University Cybercrime Observatory Technical Report | Presented a comprehensive methodology for darknet marketplace analysis including automated data collection, manual verification, and machine learning classification. Implemented LinearSVC classifiers for product categorization, TF-IDF for text processing, and PostgreSQL for data management. Combined automated scraping with human annotation for quality assurance. | Automated scraping, manual verification, LinearSVC classification, TF-IDF processing, PostgreSQL storage, scikit-learn implementation
9 | [25] | An Unsupervised Model for Identifying and Characterizing Dark Web Forums | IEEE Access | Developed an unsupervised machine learning framework for dark web forum identification and characterization. Implemented clustering algorithms (K-Means, hierarchical clustering) combined with decision tree analysis. Created an automated classification system for forum categorization without labeled training data. | Unsupervised clustering, K-Means algorithms, hierarchical clustering, decision trees, automated classification
10 | [9] | System for Collecting Traffic and Feature of TOR Network Using Private Network and Virtual Machine | KR Patent 102125966 B1 | Developed an infrastructure-level methodology for Tor network traffic collection and analysis. Implemented a system using multiple virtual machines functioning as Tor clients with private IP addresses. Created a pcap file generation and filtering system for network traffic analysis, representing a novel approach to dark web data collection at the network infrastructure level. | Virtual machine deployment, Tor client implementation, pcap traffic capture, network traffic filtering, infrastructure-level analysis
11 | [2] | Characterizing Activity on the Deep and Dark Web | The 2019 World Wide Web Conference | Developed a comprehensive empirical pipeline for large-scale dark web analysis. Crawled messages from 80 deep web forums using custom DeepCrawler and DarkCrawler systems. Implemented Latent Dirichlet Allocation (LDA) for topic modeling and non-parametric Hidden Markov Models (HMM) to model topic evolution across forums. Used clustering algorithms to identify forum similarities and anomalies based on dynamic topic patterns. | Web crawling, LDA topic modeling, HMM temporal analysis, network analysis, Python-based tools

References

  1. Victors, J. The Onion Name System: Tor-Powered Distributed DNS for Tor Hidden Services. Master’s Thesis, Utah State University Digital Commons, Logan, UT, USA, 2015. [Google Scholar]
  2. Tavabi, N.; Bartley, N.; Abeliuk, A.; Soni, S.; Ferrara, E.; Lerman, K. Characterizing activity on the deep and dark web. In Proceedings of the 2019 World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 206–213. [Google Scholar]
  3. Peersman, C.; Edwards, M.; Williams, E.; Rashid, A. Automatic user profiling in darknet markets: A scalability study. arXiv 2022, arXiv:2203.13179. Available online: https://arxiv.org/abs/2203.13179 (accessed on 2 September 2025).
  4. Soldner, F.; Kleinberg, B.; Johnson, S.D. Counterfeits on dark markets: A measurement between Jan-2014 and Sep-2015. Crime Sci. 2023, 12, 18. [Google Scholar] [CrossRef]
  5. Park, A.J. Temporal analysis of radical dark web forum users. In Proceedings of the 2016 IEEE Conference on Intelligence and Security Informatics (ISI), Tucson, AZ, USA, 28–30 September 2016; pp. 304–306. [Google Scholar]
  6. Brown, C.; Davis, M.; Wilson, K. Ethical considerations in dark web research: A framework for responsible investigation. Ethics Inf. Technol. 2021, 23, 123–135. [Google Scholar]
  7. Fu, T.; Abbasi, A.; Chen, H. A focused crawler for Dark Web forums. J. Am. Soc. Inf. Sci. Technol. 2010, 61, 1213–1231. [Google Scholar] [CrossRef]
  8. Amalou, W.; Mehdi, M. Anonymous traffic detection and identification. In Proceedings of the 2023 International Conference on Advanced Electrical, Computer, Electronics and Communication Engineering (ICAECCE), Dubai, United Arab Emirates, 30–31 December 2023; pp. 1–6. [Google Scholar]
  9. Kim, M.S.; Shin, D.M.; Shin, C.G.; Choi, H.J.; Ko, S.J.; Kim, H.S. System for Collecting Traffic and Feature of TOR Network Using Private Network and Virtual Machine. KR Patent 102125966 B1, 23 June 2020. [Google Scholar]
  10. Arisoy, M.V.; Küçüksille, E.U. Performance comparison of TOR hidden service crawlers. Bilecik Şeyh Edebali Üniv. Fen Bilim. Derg. 2019, 6, 293–307. [Google Scholar] [CrossRef]
  11. Mogage, A.; Simion, E. Statistical Analysis and Anonymity of TOR’s Path Selection. IACR Cryptology ePrint Archive, Report 2019/1218. Available online: https://eprint.iacr.org/2019/1218 (accessed on 2 September 2025).
  12. Navarro, J.N. Crawling Tor’s Hidden Services and Depicting Their Interconnectivity. Ph.D. Thesis, Andrews University Digital Commons, Berrien Springs, MI, USA, 2018. [Google Scholar]
  13. Forero, J.A.M.; Mejia, A.N.; León-Gómez, A. A Bibliometric analysis and systematic review of dark tourism: Trends, impact, and prospects. Adm. Sci. 2023, 13, 238. [Google Scholar] [CrossRef]
  14. Sangher, K.S.; Singh, A.; Pandey, H.M.; Kumar, V. Towards safe cyber practices: Developing a proactive cyber-threat intelligence system for dark web forum content by identifying cybercrimes. Information 2023, 14, 349. [Google Scholar] [CrossRef]
  15. Hughes, J.; Chua, Y.T. A social network analysis and comparison of six dark web forums. In Proceedings of the 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), Genoa, Italy, 7–11 September 2020; pp. 484–493. [Google Scholar]
  16. Demertzis, K.; Tsiknas, K.; Takezis, D.; Skianis, C.; Iliadis, L. Darknet traffic big-data analysis and network management to real-time automating the malicious intent detection process by a weight agnostic neural networks framework. arXiv 2021, arXiv:2102.08411. Available online: https://arxiv.org/abs/2102.08411 (accessed on 2 September 2025).
  17. L’Huillier, G.; Ríos, S.A.; Aguilera, F. Topic-based social network analysis for virtual communities of interests in the dark web. ACM SIGKDD Explor. Newsl. 2010, 12, 66–73. [Google Scholar] [CrossRef]
  18. Nazah, S.; Huda, S.; Abawajy, J.H.; Hassan, M.M. Evolution of dark web threat analysis and detection: A systematic approach. IEEE Access 2020, 8, 171796–171819. [Google Scholar] [CrossRef]
  19. Maneriker, P.; He, Y.; Parthasarathy, S. SYSML: StYlometry with structure and multitask learning: Implications for darknet forum migrant analysis. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6473–6486. [Google Scholar]
  20. García, L.; Martinez, R.; López, S. Longitudinal analysis of dark web forum evolution: Methodological approaches and findings. Comput. Secur. 2020, 89, 101–115. [Google Scholar]
  21. Anderson, P.; Thompson, D.; Lee, H. Comparative analysis of web crawling techniques for dark web data collection. ACM Comput. Surv. 2019, 52, 1–28. [Google Scholar]
  22. Ajayi, O.O.; Kurien, A.M.; Djouani, K.; Dieng, L. 4IR applications in the transport industry: Systematic review of the state of the art with respect to data collection and processing mechanisms. Sustainability 2024, 16, 7514. [Google Scholar] [CrossRef]
  23. De-Marcos, L.; Domínguez-Díaz, A. LLM-based topic modeling for dark web Q&A forums: A comparative analysis with traditional methods. IEEE Access 2025, 13, 67159–67169. [Google Scholar] [CrossRef]
  24. De-Marcos, L.; Domínguez-Díaz, A.; Stapic, Z. What’s going on in dark web question and answer forums: Topic diversity and linguistic characteristics. IEEE Access 2025, 13, 149880–149890. [Google Scholar] [CrossRef]
  25. Huda, S.; Abawajy, J.H.; Hassan, M.M. An unsupervised model for identifying and characterizing dark web forums. IEEE Access 2021, 9, 112871–112892. [Google Scholar] [CrossRef]
  26. Smith, J.; Johnson, A.; Williams, B. Automated content analysis of dark web forums using machine learning approaches. J. Cybersecur. Res. 2022, 15, 45–62. [Google Scholar]
  27. Ahmad, P.N.; Shah, A.M.; Lee, K.; Muhammad, W. Misinformation detection on online social networks using pretrained language models. Inf. Process. Manag. 2026, 63, 104342. [Google Scholar] [CrossRef]
  28. Abbasi, A.; Chen, H. Affect intensity analysis of dark web forums. In Proceedings of the 2007 IEEE Intelligence and Security Informatics, New Brunswick, NJ, USA, 23–24 May 2007; pp. 282–288. [Google Scholar]
  29. Chen, H. Link and Content Analysis. In Dark Web: Exploring and Data Mining the Dark Side of the Web; Chen, H., Ed.; Springer: New York, NY, USA, 2012; pp. 71–90. ISBN 978-1-4614-1557-2. [Google Scholar] [CrossRef]
  30. Akiki, C.; Gienapp, L.; Potthast, M. Tracking discourse influence in darknet forums. arXiv 2022, arXiv:2202.02081. Available online: https://arxiv.org/abs/2202.02081 (accessed on 2 September 2025).
  31. Ali, F.H. Towards detecting influential members and critical topics from dark web forums: A data mining approach. J. Inf. Organ. Sci. 2023, 47, 1–20. [Google Scholar] [CrossRef]
  32. White, S.; Green, T.; Black, R. Privacy-preserving techniques for dark web forum analysis. IEEE Trans. Inf. Forensics Secur. 2021, 16, 2845–2857. [Google Scholar]
  33. Jones, M.; Clark, N.; Turner, J. Scalability challenges in large-scale dark web forum analysis. IEEE Internet Comput. 2020, 24, 22–30. [Google Scholar]
  34. Miller, K.; Taylor, L.; Moore, E. Machine learning approaches for automated dark web forum classification. Pattern Recognit. 2022, 125, 108–119. [Google Scholar]
  35. Patini, R.; Cordaro, M.; Marchesini, D.; Scilla, F.; Gioco, G.; Rupe, C.; D’Agostino, M.A.; Lajolo, C. Is systemic immunosuppression a risk factor for oral cancer? A systematic review and meta-analysis. Cancers 2023, 15, 3077. [Google Scholar] [CrossRef]
  36. Singh, A.; Pogorelić, Z.; Agrawal, A.; Muñoz, C.M.L.; Kainth, D.; Verma, A.; Jindal, B.; Agarwala, S.; Anand, S. Utility of ischemia-modified albumin as a biomarker for acute appendicitis: A systematic review and meta-analysis. J. Clin. Med. 2023, 12, 5486. [Google Scholar] [CrossRef]
  37. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, 71. [Google Scholar] [CrossRef]
  38. Alnabulsi, H.; Islam, R. Identification of Illegal Forum Activities Inside the Dark Net. In Proceedings of the 2018 International Conference on Machine Learning and Data Engineering (iCMLDE), Sydney, NSW, Australia, 3–7 December 2018; pp. 22–29. [Google Scholar]
  39. Ball, M.; Broadhurst, R. Data Capture and Analysis of Darknet Markets. Australian National University Cybercrime Observatory Technical Report. 2019. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3344936 (accessed on 1 September 2025).
Table 1. Data Collection Methodologies.
Methodology | Count | % | Key Characteristics | Effectiveness Rating
Web Crawling/Scraping | 8 | 72.7% | Automated, scalable, Python-based | High
Hybrid Approaches | 4 | 36.4% | Combined automated + manual verification | Very high
Manual Data Collection | 2 | 18.2% | Human annotation, quality assurance | Medium
Network Traffic Analysis | 1 | 9.1% | Packet capture, infrastructure-level | Medium
Source: authors’ contribution.
Table 2. Analytical Frameworks and Techniques.
Framework Category | Count | % | Specific Methods | Complexity Level
Natural Language Processing | 7 | 63.6% | LDA, sentiment analysis, TF-IDF, affect lexicons | High
Machine Learning | 5 | 45.5% | SVM, clustering, HMM, decision trees | High
Social Network Analysis | 4 | 36.4% | Centrality measures, community detection | Medium
Statistical Analysis | 4 | 36.4% | Descriptive stats, correlation, temporal analysis | Medium
Content Analysis | 3 | 27.3% | Qualitative coding, thematic analysis | Low
Graph Analysis | 2 | 18.2% | Network visualization, link analysis | Medium
Computational Linguistics | 1 | 9.1% | Lexical & semantic diversity, syntactic complexity | Medium
Machine Learning Breakdown:
ML Type | Algorithm | Count | Primary Use Case
Supervised | Support Vector Machine | 2 | Content classification
Supervised | Decision Trees | 1 | Content categorization
Unsupervised | K-Means Clustering | 2 | User/topic grouping
Unsupervised | Hierarchical Clustering | 1 | Community detection
Probabilistic | Hidden Markov Models | 1 | Temporal patterns
Probabilistic | Latent Dirichlet Allocation | 2 | Topic modeling
NLP Techniques Detail:
NLP Method | Count | Application | Success Rate
TF-IDF Vectorization | 5 | Feature extraction | High
Sentiment/Affect Analysis | 3 | Emotion detection | Medium
Topic Modeling (LDA) | 4 | Theme identification | High
Keyword Classification | 2 | Content categorization | Medium
LLM Prompting | 1 | Theme identification | High
Source: authors’ contribution.
Table 3. Tools and Technologies.
Technology Category | Count | % | Most Used Tools | Adoption Trend
Python Ecosystem | 8 | 72.7% | Scrapy, Selenium, scikit-learn, NLTK | Increasing
Tor/Anonymization | 5 | 45.5% | Tor Browser, OnionScan, Tor clients | Stable
Analysis Software | 5 | 45.5% | RapidMiner, LIWC, Gephi, Elasticsearch | Increasing
Specialized Crawlers | 4 | 36.4% | OnionCrawler, TorBot, custom crawlers | Stable
Database Systems | 3 | 27.3% | PostgreSQL, Elasticsearch | Stable
Infrastructure | 3 | 27.3% | Virtual machines, pcap capture | Decreasing
Python Libraries:
Library | Count | Primary Function | Adoption Rate
Scrapy | 3 | Web scraping | 45.6%
Selenium | 3 | Browser automation | 45.6%
scikit-learn | 3 | Machine learning | 36.4%
NLTK | 2 | Natural language processing | 27.3%
BeautifulSoup | 2 | HTML parsing | 18.8%
pandas | 2 | Data manipulation | 18.8%
Specialized Dark Web Tools:
Tool Name | Count | Function | Effectiveness
OnionScan | 2 | Hidden service discovery | High
TorBot | 1 | Tor network crawling | Medium
OnionCrawler | 1 | Specialized crawling | Medium
DeepCrawler/DarkCrawler | 1 | Custom crawling solution | Medium
Source: authors’ contribution.
Table 4. Methodology Effectiveness Matrix.
Methodology | Scalability | Accuracy | Cost | Technical Complexity | Overall Score
Automated Crawling | Very high | Medium | High | High | 4.0/5
Hybrid Approaches | High | Very high | Low | Very high | 4.2/5
Manual Collection | Very low | Very high | Very low | Low | 2.2/5
Network Traffic Analysis | Medium | High | Medium | Very high | 3.4/5
Source: authors’ contribution.
Table 5. Research Maturity Indicators.
Indicator | High | Medium | Low | Trend
Methodological Standardization | 45.5% | 36.4% | 18.2% | Increasing
Tool Sophistication | 45.5% | 36.4% | 18.2% | Increasing
Reproducibility | 27.3% | 27.3% | 45.5% | Decreasing
Scalability Focus | 54.5% | 36.4% | 9.1% | Increasing
Ethical Considerations | 27.3% | 45.5% | 27.3% | Stable
Source: authors’ contribution.
Table 6. Multi-method Approach Combination.
Method Combination | Count | % | Success Indicators | Complexity
Crawling + ML + Statistics | 3 | 27.3% | High accuracy, good scalability | High
Crawling + NLP + Social Network | 2 | 18.2% | Comprehensive analysis | High
Manual + Automated Verification | 2 | 18.2% | High quality, moderate scale | Medium
Single Method Only | 2 | 18.2% | Limited scope | Low
Crawling + NLP | 1 | 9.1% | Text coverage, content extraction | Medium
Crawling + NLP + Computational Linguistics | 1 | 9.1% | Linguistic insight, semantic analysis | Medium
Source: authors’ contribution.