Electronics
  • Article
  • Open Access

2 January 2025

Identifying Similar Users Between Dark Web and Surface Web Using BERTopic and Authorship Attribution

1 School of Computer Engineering & Applied Mathematics, Hankyong National University, Pyeongtaek-si 17738, Republic of Korea
2 Department of AI Software, Gachon University, Sungnam-si 13120, Republic of Korea
3 Cyber Warfare, LIG Nex1, Seongnam-si 13488, Republic of Korea
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Applications of Deep Learning in Cyber Threat Detection

Abstract

The dark web is a part of the deep web that ensures anonymity to users, thus facilitating various malicious activities, such as the sale of drugs, firearms, and personal information or the dissemination of malware and cyberattack tools. These activities extend beyond the dark web and have negative effects on the surface web, which is commonly accessed by internet users. Recent studies on the dark web are limited to the detection and classification of specific malicious activities; that is, they cannot trace or identify the authors of dark web content or the source of a given piece of information. Therefore, we herein propose a method for identifying similar authors between the surface and dark webs using BERTopic and authorship attribution. We applied BERTopic to the surface and dark webs to extract previously unidentified topics and measured the similarity between the topics to detect similar topics between the two webs. In addition, we applied authorship attribution to the content written by the authors of similar topics to extract the unique author characteristics. The similarity between the authors was measured to identify authors with similar characteristics. Thus, we identified authors who had written content on similar topics on both the surface and dark webs as well as authors who are simultaneously active on both webs.

1. Introduction

The steady advancement in information technologies has given internet users easy access to extensive information, thus encouraging the development of new technologies; however, these technologies have also been used to create cyber threats. Instances of loss of personal information through malware, spam emails, zombie personal computers, or malicious ads have become increasingly common. Moreover, the misuse of the internet has made it difficult for people to distinguish between misinformation and real information.
The dark web is a part of the deep web and represents the dark side of the internet []. It provides cyber attackers with opportunities to easily and quickly access advanced attack techniques. The dark web has become a means to facilitate these malicious activities, and various cyber threats are constantly occurring through the dark web. Unlike the surface web, which is used by general users, the dark web is hidden []. Moreover, the dark web is approximately 4000 to 5000 times larger than the surface web []. Hence, the deep web contains extensive information about malicious activities. Because the dark web ensures the anonymity of users, it enables them to conduct illegal activities freely; a study found that the total revenue generated by cybercrime in 2018 was around USD 1.5 trillion [,]. Information about malicious activities is shared through various dark web forums and communities, and malware or hacking tools related to cyber threats, drugs, personal information, firearms, etc., are sold on the marketplaces []. For example, a marketplace called the Silk Road was an underground market for illegal goods and services on the dark web []. Among the malicious activities conducted on the dark web, malicious code such as ransomware, Trojan horses, and DDoS malware has been disseminated on the surface web, causing substantial harm to internet users. Two characteristics of the dark web account for this. The first and strongest characteristic is anonymity, which enables users to hide their identity and location []. This anonymity can be achieved through software called Tor. The Tor network encrypts and hides information from the entry node to the exit node so that users engaging in malicious activities can avoid detection by the government or a vendor []. The second characteristic is the change in the internet environment.
Internet users are becoming increasingly curious about accessing the dark web, and it has become easier to purchase technology for malicious activities []. Furthermore, with the advent of cryptocurrency, transactions have become more frequent []. To address these issues, studies are being conducted to collect data by crawling the marketplaces, forums, and websites on the dark web; to analyze the images, files, HTML, and text contained within them to classify the types of malicious activities []; to identify the topics related to malicious activities in the forums; and to collect the IDs of the main forum users []. Existing methods classify the types of malicious activities being carried out on the dark web and identify the main authors in the dark web forums by examining the number of articles or posts [,,,,,,,,]. However, these methods focus solely on classifying the dark web data according to the type of malicious activity; hence, obtaining specific information about the users or authors who carried out these activities is impossible. Furthermore, because the methods that find the main authors or vendors by analyzing the dark web forums or marketplaces rely on the number of articles written, they cannot identify the grammatical characteristics or writing habits of individual authors. Moreover, because most dark web analysis techniques utilize only dark web data, it is impossible to determine the association between the articles posted on the surface web and those on the dark web. In addition, it is difficult to identify the authors who move back and forth between the surface and dark webs and write posts while performing malicious activities.
Therefore, in this study, we aimed to find common topics and authors between the surface and dark webs by applying BERTopic and authorship attribution. We attempted to identify authors who use both the surface and dark webs simultaneously—particularly those who have not been previously identified—by finding the authors engaged in writing content on similar topics. First, relevant data were collected from both the surface and dark webs, and the topics included in the forums were extracted by applying BERTopic. The similarity between the topics extracted from the surface and dark webs was calculated to find similar topics between the two webs. Next, the contents corresponding to the list of similar topics were examined to extract the unique characteristics of the authors based on the authorship attribution of each author. Finally, the extracted unique characteristics of the authors were compared to identify authors with similar characteristics on both the surface and dark webs. In addition, the unique user characteristics were utilized to identify malicious users who are simultaneously active on both the surface and dark webs. The main contributions of this study are summarized as follows:
  • BERTopic was applied to the surface and dark webs to extract the topics from each web, and identical or similar topics between the two webs were extracted utilizing similarity measurements;
  • Information about the users creating content on the surface and dark webs was collected, and the unique characteristics of the authors were extracted using authorship attribution;
  • Malicious users actively engaged on the surface and dark webs can be identified based on the unique characteristics of the authors;
  • Malicious authors engaged in simultaneous activities on the surface and dark webs can be identified using BERTopic and authorship attribution.
The remainder of this paper is organized as follows. Section 2 reviews the related studies, and Section 3 introduces the method proposed in this study. Section 4 presents the experimental results. Section 5 discusses the findings and their limitations, and Section 6 presents the conclusions and summarizes the future research directions.

3. Proposed Method

Figure 1 shows the framework proposed in this study for identifying similar users on the surface and dark webs by using BERTopic and authorship attribution. The proposed method consists of five processes. Through these processes, closely related topics between the surface and dark webs are identified, and the similarity between the malicious authors of these topics is determined. First, forum data are collected from the surface and dark webs for training in the data collection stage. In the text preprocessing stage, the minimal processing needed to use BERTopic is performed. In contrast to conventional topic modeling, BERTopic does not require the removal and transformation of unnecessary information. Therefore, all the collected forum content, except for the text written by the users of interest, is removed. In the topic extraction stage, BERTopic is used to find various topics in surface web forums and dark web forums and extract the topics that include malicious activities. During the extraction of similar topics, keywords that appear in the identified malicious topics from the surface and dark webs are used to measure the similarity, and topics with high similarity between the surface and dark webs are identified. During the identification of similar malicious users, contents written by authors associated with highly similar topics on both the surface and dark webs are individually collected, and the unique characteristics of these authors are extracted using authorship attribution. These unique characteristics are then utilized to identify authors with high similarity among the authors of specific topics on the surface and dark webs and to identify the specific characteristics of the authors that contribute to the high similarity.
Figure 1. Framework for identifying similar users on the surface and dark webs using BERTopic and authorship attribution.

3.1. Data Collection

To analyze the authors of specific topics on the surface and dark webs, data related to such topics are needed. Therefore, we collected the data from forums on the surface web in which dark web-related information was exchanged and from active forums on the dark web. Several studies [,,,,,,,,,,] have used datasets containing data from two dark web forums and one surface web forum. We used these data, excluding the data for author identification, in our study. There were 17,879 authors from the surface web forum and 12,159 and 30,206 authors from the two dark web forums. Among the forum users, the active users had written over 2000 posts, whereas the less active users had written approximately 100 posts.

3.2. Data Preprocessing

BERTopic was applied to the collected data to extract various topics. Unlike conventional topic modeling methods, this method does not require preprocessing, such as checking for empty data, removing stop words and special characters, and unifying the capitalization. Therefore, in this stage, only the posts written by the users were extracted from the contents of the two webs to facilitate the training of BERTopic.

3.3. Topic Extraction

In this stage, various hidden topics were extracted from the surface and dark webs using BERTopic. Unlike conventional models for topic modeling, BERTopic effectively extracts the topics by combining transformer-based embedding and clustering techniques. The algorithm first generates document embeddings and clusters them. It then uses class-based TF-IDF (c-TF-IDF) to extract the representative words for each topic from each cluster.
BERTopic transforms documents into vectors by using Sentence-BERT (SBERT). Here, SBERT transforms sentences or paragraphs into a high-dimensional vector space, enabling the calculation of semantic similarity between documents. SBERT is an improved model designed to effectively generate sentence-level embeddings. In BERTopic, SBERT assists in determining whether documents with the same topic are located close to each other in the vector space. Moreover, the language model used in the embedding generation stage can be replaced based on the user requirements. In the document clustering stage, uniform manifold approximation and projection (UMAP), a nonlinear technique that effectively preserves the local and global structures of high-dimensional data, is used to reduce the dimensions. Next, clustering is performed using hierarchical density-based spatial clustering of applications with noise (HDBSCAN), which is an enhanced version of the DBSCAN algorithm. It locates clusters in high-density regions and regards low-density regions as noise. This algorithm can effectively process clusters of different densities, making it suitable for complex data such as dark web data. In the clustering process, similar documents are grouped into a single cluster, and the documents within a cluster are assumed to have the same topic. The topics derived from each cluster are examined using c-TF-IDF, which is a variation of the typical TF-IDF method. It evaluates the importance of words based on clusters rather than documents. Equation (1) is the formula for calculating the c-TF-IDF, where $tf_{t,c}$ represents the frequency of the word $t$ in a cluster $c$, $tf_t$ represents the frequency of the word $t$ across all clusters, and $A$ denotes the average number of words per cluster. This equation calculates the importance of a word within a cluster, and the topic of each cluster is determined based on this importance.
$$W_{t,c} = tf_{t,c} \cdot \log\left(1 + \frac{A}{tf_t}\right) \tag{1}$$
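As a worked illustration of Equation (1), the following minimal sketch computes c-TF-IDF weights with the Python standard library only. The two toy clusters, their tokens, and the helper names `c_tf_idf` and `top_keywords` are assumptions made for this example and are not taken from the study; the BERTopic library's own implementation may differ in details such as normalization.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Compute W[t, c] = tf[t, c] * log(1 + A / tf[t]) for every word and cluster.

    `clusters` maps a cluster id to the list of tokens obtained by
    concatenating all documents assigned to that cluster.
    """
    # tf[t, c]: frequency of word t inside cluster c
    tf = {c: Counter(tokens) for c, tokens in clusters.items()}

    # tf[t]: frequency of word t across all clusters
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # A: average number of words per cluster
    avg_words = sum(len(tokens) for tokens in clusters.values()) / len(clusters)

    return {
        c: {t: n * math.log(1 + avg_words / total[t]) for t, n in counts.items()}
        for c, counts in tf.items()
    }


def top_keywords(weights, k=3):
    """Return the k highest-weighted words (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical toy clusters, for illustration only.
clusters = {
    "drugs": "lsd mdma weed vendor order lsd mdma".split(),
    "payment": "bitcoin wallet escrow vendor order bitcoin".split(),
}
weights = c_tf_idf(clusters)
print(top_keywords(weights["drugs"], k=2))
```

Words that occur in every cluster (here "vendor" and "order") receive a lower weight than words concentrated in one cluster, which is why the latter surface as topic keywords.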

3.4. Similar Topic Extraction

Fifteen to twenty topics were extracted from the surface and dark webs, and the extracted topics had between fifteen and twenty keywords. The keywords of a topic are words frequently found in webpages related to that topic. The similarity between the topics on the surface and dark webs was analyzed using the topics and their keywords generated in this manner from the two webs. The similarity was assessed by performing a one-to-one comparison between the topics on the two webs. The degree of similarity was calculated by applying the cosine measure to the keywords of the two topics, as shown in Equation (2).
$$\text{similarity} = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \tag{2}$$
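The one-to-one comparison of Equation (2) can be sketched as follows. How the keyword lists are turned into vectors is not specified in the text, so this example assumes binary bag-of-keywords vectors over the union of the two lists; the keyword lists and the function names are likewise hypothetical.

```python
import math


def keyword_vector(keywords, vocabulary):
    """Binary vector: 1.0 if the vocabulary word is one of the topic keywords."""
    present = set(keywords)
    return [1.0 if word in present else 0.0 for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_similarity(keywords_a, keywords_b):
    """Compare two topics through their keyword lists."""
    # Assumption: the shared vector space is the union of both keyword lists.
    vocabulary = sorted(set(keywords_a) | set(keywords_b))
    return cosine_similarity(
        keyword_vector(keywords_a, vocabulary),
        keyword_vector(keywords_b, vocabulary),
    )


# Hypothetical keyword lists for one surface web topic and one dark web topic.
surface_topic = ["lsd", "mdma", "weed", "vendor"]
dark_topic = ["lsd", "mdma", "cocaine", "vendor"]
score = topic_similarity(surface_topic, dark_topic)
print(round(score, 2))
```

With weighted keywords (for example, c-TF-IDF scores), the binary entries would simply be replaced by the weights; the cosine computation itself stays the same.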

3.5. Identifying Malicious Similar Users

The users posting content on the topics selected by analyzing the topic similarity between the two webs were extracted, and the topics were segregated according to the user. Next, authorship attribution was applied to extract the unique characteristics of the style of writing of each user. The authorship attribution-based method analyzes the writing habits of authors to identify their unique characteristics. In this process, analysis methods such as character or word embeddings and the n-grams of the document are applied, and special symbols frequently used by the author as well as the author’s spacing patterns, types of contractions, and typographical errors are analyzed. In this stage, both methods were applied to extract the unique characteristics, and the users deemed to have high similarity based on the similarity analysis between the topics of the two webs were extracted. Fifteen unique characteristics of the authors were extracted through authorship attribution, as shown in Table 1. In addition to grammatical features, such as word length, count, and ratio, the writing habits of the users—such as emojis, quotation marks, URLs, and contractions—were also considered along with the document analysis method. The unique characteristics of the authors, generated based on authorship attribution, were used to identify the users writing similar types of posts on similar topics found on both webs.
Table 1. Unique characteristics of authors based on authorship attribution.
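To make the idea of writing-habit features concrete, the sketch below extracts a handful of stylometric features from a single post. It is illustrative only: the feature names, the regular expressions, and the sample post are assumptions for this example and cover just a subset of the kinds of characteristics described above, not the fifteen features of Table 1.

```python
import re

# Illustrative patterns; real-world text would need more robust handling.
URL_RE = re.compile(r"https?://\S+")
CONTRACTION_RE = re.compile(r"\b\w+'(?:s|t|re|ve|ll|d|m)\b", re.IGNORECASE)
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def style_features(text):
    """Return a small dictionary of writing-style features for one post."""
    # URLs are removed before word counting so they do not inflate word lengths.
    words = re.findall(r"[A-Za-z']+", URL_RE.sub(" ", text))
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "digit_ratio": sum(ch.isdigit() for ch in text) / n_chars,
        "url_count": len(URL_RE.findall(text)),
        "quote_count": text.count('"'),
        "contraction_count": len(CONTRACTION_RE.findall(text)),
        "emoji_count": len(EMOJI_RE.findall(text)),
    }


# Hypothetical post, for illustration only.
post = 'I can\'t ship 2 packs today, see "terms" at http://example.onion/faq \U0001F642'
features = style_features(post)
for name, value in features.items():
    print(f"{name}: {value}")
```

In practice, the features of all posts by one author would be aggregated (for example, averaged) into a single vector per author before any comparison.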

4. Experiment and Results

4.1. Dataset

In this study, we conducted an experiment using datasets comprising the data from one surface web forum and two dark web forums, as shown in Table 2. This dataset contained the data of 60,244 authors—17,879 authors from the surface web and 42,365 authors from the dark web. The number of posts made by each author varied substantially, ranging from 140 to 2400. On the surface web, posts related to the dark web were collected from the Reddit forum, whereas on the dark web, posts were collected from the Silk Road and Agora forums. The data were collected from January 2014 until May 2015. In addition, we did not utilize the data used for author identification in the dataset because these data were repetitions of the surface web data.
Table 2. Datasets used in this study.

4.2. Topic Similarity Analysis Using BERTopic

Preprocessing was performed on the collected dataset, and 15 to 20 topics were extracted from the surface and dark webs using BERTopic. Topics were created for each forum through BERTopic, and the results classified through hierarchical clustering are shown in Figure 2, Figure 3 and Figure 4. Through clustering, similar topics were organized into one stem, and approximately 20 topics could be identified through additional segmentation within it. The x-axis represents each topic, whereas the y-axis represents the branches generated through hierarchical clustering. Each branching point indicates the presence of distinct characteristics that separate the topics. Through this analysis, we were able to identify the key topics and important words commonly mentioned in one surface web platform and two dark web forums. By comparing the key words of the topics across the three visualizations, we could easily and visually assess the extent to which similar topics existed among them. The keywords extracted for each topic are shown in Table 3. As shown in the table, not all topics extracted from both webs were related to malicious activities. However, keywords indicating malicious activities can be observed in some of the topics. For example, words related to drugs, such as LSD, MDMA, weed, cocaine, and drug, and words associated with contract killing, such as gun, karma, and people, were found in some of the topics on both the surface and dark webs. Furthermore, topics related to illegal trade, such as bitcoin, anonym, credit card, and PayPal, were also found. The results of the similarity analyses between the topics derived by analyzing the contents on the two webs are shown in Table 4 and Table 5. Table 4 shows the results of topic association analysis between the Agora forum on the dark web and the surface web. The results reveal that some topics exhibit over 80% similarity. 
Upon examining these topics, it was found that they were related to drugs, as they contained frequent mentions of the names of drugs. Table 5 presents the topic association analysis between Silk Road on the dark web and the surface web. Similar to Table 4, the topics with the highest similarity were all related to drugs. However, topics with a similarity range of approximately 60–70% were found to involve transactions leveraging anonymity, such as those related to encryption or cryptocurrency. These topics included details on how the transactions are conducted and which items are commonly traded.
Figure 2. Results of surface web clustering created using BERTopic.
Figure 3. Results of dark web (Silk Road) clustering created using BERTopic.
Figure 4. Results of dark web (Agora) clustering created using BERTopic.
Table 3. Keywords extracted for similarity analysis between the surface and dark webs (Agora and Silk Road).
Table 4. Results of topic association analysis between surface and dark webs (Agora).
Table 5. Results of topic association analysis between surface and dark webs (Silk Road).
Multiple similar topics were found by applying the cosine measure to the topics extracted from the two dark web forums and one surface web forum. The degree of similarity between the topics extracted from the two webs varied, with the highest similarity being 80%. The analyses were conducted with a similarity score of at least 50% as the criterion for establishing similarity. The similar topics between the surface and dark webs showed substantial overlapping of the keywords. Among these topics, only those in which words related to malicious activities were frequently mentioned were selected for additional analysis.

4.3. Author Similarity Analysis Using Authorship Attribution

A list of users posting content on similar topics on the surface and dark webs was generated, and the content written by each user was collected to extract the unique characteristics of the authors. These characteristics were identified by applying the authorship attribution method; the generated features are shown in Figure 5. The figure illustrates the writing characteristics of users from the three web forums. The columns represent the features of the users’ writing that were extracted through authorship attribution. Grammatical features, such as sentence or word length and average number of words, as well as personal characteristics, such as the ratio of numbers in the text, use of emojis, and use of contractions, differed depending on the author. This information was used to establish the similarity between users posting content on related topics on both the surface and dark webs. Table 6 shows the results of applying the cosine measure to the authors of similar topics on the two webs. In this way, it was possible to identify users with similar writing styles among the users writing content on similar topics on the two webs.
Figure 5. Unique characteristics of authors.
Table 6. Similarity comparison results between authors of similar topics on surface and dark webs.
Furthermore, it could be inferred that even though these users employ various IDs, they primarily engage with the same topics and do not move to other topics. Moreover, Table 7 presents the analysis of the words frequently used by authors with similar unique characteristics among those actively writing on similar topics on the two webs. The analysis results indicate that both the words related to malicious activities and the frequently used common words are similar. In particular, Table 6 demonstrates that user15376 and user10238, who are active on the surface web and dark web, respectively, exhibit high similarity scores despite operating on different platforms. This is attributed to their similar writing habits and characteristics. Additionally, as shown in Table 7, an analysis of the words frequently used by these two users revealed that they predominantly discuss topics related to drugs.
Table 7. Comparison of keywords used by similar authors on the surface and dark webs.
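The author comparison described in this section can be sketched as a cross-forum nearest-neighbour search over per-author feature vectors. The author IDs, the three features, and their values below are hypothetical, and the min-max scaling step is an assumption added here so that no single feature dominates the cosine measure; the text does not state whether or how the features were scaled.

```python
import math


def min_max_scale(vectors):
    """Scale every feature column to the range [0, 1]."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    # A constant column has span 0; use 1.0 to avoid division by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(v - lo) / sp for v, lo, sp in zip(vec, lows, spans)] for vec in vectors]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_matches(surface, dark):
    """For each surface web author, find the most similar dark web author."""
    keys = [("surface", name) for name in surface] + [("dark", name) for name in dark]
    scaled = dict(zip(keys, min_max_scale([*surface.values(), *dark.values()])))
    result = {}
    for s_name in surface:
        s_vec = scaled[("surface", s_name)]
        d_name = max(dark, key=lambda name: cosine(s_vec, scaled[("dark", name)]))
        result[s_name] = (d_name, cosine(s_vec, scaled[("dark", d_name)]))
    return result


# Hypothetical feature vectors: [avg_word_length, digit_ratio, contraction_rate]
surface_authors = {
    "surface_user_a": [4.1, 0.02, 0.10],
    "surface_user_b": [5.6, 0.10, 0.01],
}
dark_authors = {
    "dark_user_x": [5.5, 0.09, 0.02],
    "dark_user_y": [4.0, 0.03, 0.09],
}
matches = best_matches(surface_authors, dark_authors)
for surface_name, (dark_name, score) in matches.items():
    print(f"{surface_name} -> {dark_name} ({score:.2f})")
```

A high score only indicates a similar writing style; as noted in the discussion that follows, it is not by itself evidence that two accounts belong to the same person.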

5. Discussion

In this study, we propose a method that uses BERTopic and authorship attribution to identify similar topics between the surface and dark webs. Using this method, we extracted similar topics from one surface web forum and two dark web forums. We then collected the unique characteristics of the authors of these topics and identified users with similar writing styles across the two webs through a similarity measure between the unique characteristics of the authors. Based on this information, we identified authors who are actively engaged on both webs. By using BERTopic, it was possible to identify topics that were apparently dissimilar but were actually common between the two webs and to verify whether these topics contained genuinely similar content. Additionally, through authorship attribution, we were able to assess whether users discussing common topics on both webs could actually be the same individuals by analyzing their habits, characteristics, and writing styles.
However, some of the topics extracted using BERTopic were often composed of common words unrelated to malicious activities, and there were relatively few words associated with malicious activities. Moreover, it was challenging to provide definitive evidence to determine whether similar users identified on both webs were indeed the same individuals. In some cases, users were deemed similar based solely on basic language patterns or commonly used phrases that anyone might use. Therefore, we believe that further research needs to be conducted to compensate for these biased data. In addition, it is necessary to extract more diverse types of unique characteristics by using authorship attribution to further segment and distinguish the authors. Furthermore, methods for determining the optimal number of topics from the collected data should be improved. Our future research will consider these aspects to build a more robust model.

6. Conclusions

Malicious behaviors have become increasingly common on the dark web in recent years. Consequently, studies are being actively conducted to preemptively block these activities. Various methods are employed to detect malicious activities occurring on the dark web, classify such activities, identify risky vendors in marketplaces, and analyze key users in forums. Although key users can be inferred based on the number of activities on the forum and number of posts, it is difficult to obtain specific information about the authors owing to the anonymous nature of the dark web. Moreover, because the information is restricted to the dark web, it is challenging to detect malicious activities by obtaining dark web-related information from the surface web. Therefore, this study proposes a method for identifying users active on both webs using BERTopic and authorship attribution. BERTopic was utilized to identify common topics and key terms mentioned across the two webs, whereas authorship attribution was employed to extract distinguishing features such as the textual, linguistic, and semantic characteristics of users. Through these processes, it became possible to identify suspicious users active on both webs. However, certain features failed to clearly distinguish users active on the webs, or they incorrectly identified general linguistic traits as unique user characteristics, highlighting limitations of the proposed approach. Future research will aim to address these issues by enhancing the user profiling methods to improve the differentiation and characterization of individual users.

Author Contributions

Conceptualization, G.-Y.S.; methodology, G.-Y.S.; investigation, D.-W.K.; formal analysis, G.-Y.S., M.-M.H., S.P., A.-r.P., and Y.K.; writing—original draft preparation, G.-Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a Korea Research Institute for Defense Technology Planning and Advancement (KRIT) grant funded by the Korea Government’s Defense Acquisition Program Administration (DAPA) (Grant No. KRIT-CT-21-037), and the Basic Science Research Program through the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (Grant No. RS-2023-00248132).

Data Availability Statement

No new data were created or analyzed in this study. Data are contained within the article.

Acknowledgments

The authors would like to express their gratitude to the National Research Foundation of Korea (NRF) and LIG Nex1 for their generous financial support, which was crucial for the successful completion of this research. We are also deeply appreciative of the invaluable feedback and insightful comments provided by our colleagues, whose expertise and encouragement significantly enhanced the quality of this paper. Their support in various aspects, from conceptual discussions to technical assistance, has been instrumental. Additionally, we would like to acknowledge the administrative and logistical support from our institution, which facilitated our research activities. Finally, we extend our sincere thanks to the anonymous reviewers for their thorough and constructive critiques, which helped us refine and improve the manuscript. For any technical inquiries related to this research, please contact the authors or the corresponding author via email.

Conflicts of Interest

Authors SungJin Park, A-ran Park, and Younghwan Kim were employed by the company LIG Nex1. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors also declare that this study received funding from the Korea Government’s Defense Acquisition Program Administration. The funder was not involved in the study design; the collection, analysis, or interpretation of data; the writing of this article; or the decision to submit it for publication.

References

  1. Al Nabki, M.W.; Fidalgo, E.; Alegre, E.; de Paz, I. Classifying illegal activities on Tor network based on web textual contents. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; Volume 1, pp. 35–43. [Google Scholar] [CrossRef]
  2. Beshiri, A.; Susuri, A. Dark web and its impact in online anonymity and privacy: A critical analysis and review. J. Comput. Commun. 2019, 7, 30–43. [Google Scholar] [CrossRef]
  3. Finklea, K.M. Dark Web; Rep. R44101; Congressional Research Service: Washington, DC, USA, 2017. [Google Scholar]
  4. Patel, P.B.; Thakor, H.P.; Iyer, S. A comparative study on cyber crime mitigation models. In Proceedings of the 6th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 13–15 March 2019; pp. 466–470. [Google Scholar]
  5. Cascavilla, G.; Tamburri, D.A.; Van Den Heuvel, W.-J. Cybercrime threat intelligence: A systematic multi-vocal literature review. Comput. Secur. 2021, 105, 102258. [Google Scholar] [CrossRef]
  6. Gupta, A.; Maynard, S.B.; Ahmad, A. The dark web phenomenon: A review and research agenda. arXiv 2021, arXiv:2104.07138. [Google Scholar] [CrossRef]
  7. Minnaar, A. Online ‘underground’ marketplaces for illicit drugs: The prototype case of the dark web website ‘Silk Road. Acta Criminol. Afr. J. Criminol. Vict. 2017, 30, 23–47. [Google Scholar]
  8. Catakoglu, O.; Balduzzi, M.; Balzarotti, D. Attacks landscape in the dark side of the web. In Proceedings of the Symposium on Applied Computing, Marrakech, Morocco, 3–7 April 2017; pp. 1739–1746. [Google Scholar] [CrossRef]
  9. Parkar, A.; Sharma, S.; Yadav, S. Introduction to deep web. Int. Res. J. Eng. Technol. 2017, 4, 229–234. [Google Scholar] [CrossRef]
  10. Basheer, R.; Alkhatib, B. Threats from the dark: A review over dark web investigation research for cyber threat intelligence. J. Comput. Netw. Commun. 2021, 2021, 1302999. [Google Scholar] [CrossRef]
  11. Lee, S.; Yoon, C.; Kang, H.; Kim, Y.; Kim, Y.; Han, D.; Son, S.; Shin, S. Cybercriminal minds: An investigative study of cryptocurrency abuses in the dark web. In Proceedings of the 26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, CA, USA, 24–27 February 2019. [Google Scholar]
  12. Ruiz Ródenas, J.M.; Pastor-Galindo, J.; Gómez Mármol, F. A general and modular framework for dark web analysis. Clust. Comput. 2024, 27, 4687–4703. [Google Scholar] [CrossRef]
  13. Pete, I.; Hughes, J.; Chua, Y.T.; Bada, M. A social network analysis and comparison of six dark web forums. In Proceedings of the 2020 IEEE European symposium on security and privacy workshops (EuroS&PW), Genoa, Italy, 7–11 September 2020; pp. 484–493. [Google Scholar]
  14. Sabbah, T.; Selamat, A.; Selamat, M.H.; Ibrahim, R.; Fujita, H. Hybridized term-weighting method for dark web classification. Neurocomputing 2016, 173, 1908–1926. [Google Scholar] [CrossRef]
  15. He, S.; He, Y.; Li, M. Classification of illegal activities on the dark web. In Proceedings of the 2nd International Conference on Information Science and Systems, Tokyo, Japan, 16–19 March 2019; pp. 73–78. [Google Scholar] [CrossRef]
  16. Cascavilla, G.; Catolino, G.; Ebert, F.; Tamburri, D.A.; van den Heuvel, W.J. “When the code becomes a crime scene” Towards dark web threat intelligence with software quality metrics. In Proceedings of the 2022 IEEE ICSME, Limassol, Cyprus, 3–7 October 2022; pp. 439–443. [Google Scholar] [CrossRef]
  17. Ball, M.; Broadhurst, R. Data capture and analysis of darknet markets. SSRN 2021, 3344936. [Google Scholar] [CrossRef]
  18. Ursani, Z.; Peersman, C.; Edwards, M.; Chen, C.; Rashid, A. The impact of adverse events in darknet markets: An anomaly detection approach. In Proceedings of the 2021 IEEE EuroS&PW, Vienna, Austria, 6–10 September 2021; pp. 227–238. [Google Scholar] [CrossRef]
  19. Alnabulsi, H.; Islam, R. Identification of illegal forum activities inside the dark net. In Proceedings of the 2018 iCMLDE, Sydney, NSW, Australia, 3–7 December 2018; pp. 22–29. [Google Scholar] [CrossRef]
  20. Jin, Y.; Jang, E.; Cui, J.; Chung, J.-W.; Lee, Y.; Shin, S. DarkBERT: A language model for the dark side of the internet. arXiv 2023, arXiv:2305.08596. [Google Scholar] [CrossRef]
  21. Dos Reis, E.F.; Teytelboym, A.; ElBahrawy, A.; De Loizaga, I.; Baronchelli, A. Identifying key players in dark web marketplaces through Bitcoin transaction networks. Sci. Rep. 2024, 14, 2385. [Google Scholar] [CrossRef] [PubMed]
  22. Tavabi, N.; Bartley, N.; Abeliuk, A.; Soni, S.; Ferrara, E.; Lerman, K. Characterizing activity on the deep and dark web. In Proceedings of the 2019 World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 206–213. [Google Scholar] [CrossRef]
  23. Yang, C.C.; Tang, X.; Thuraisingham, B.M. An analysis of user influence ranking algorithms on Dark Web forums. In Proceedings of the ACM SIGKDD Workshop on Intelligence and Security Informatics, Washington, DC, USA, 25–28 July 2010; Article 10. pp. 1–7. [Google Scholar] [CrossRef]
  24. Tang, X.; Yang, C.C.; Zhang, M. Who will be participating next? Predicting the participation of Dark Web community. In Proceedings of the ACM SIGKDD Workshop on Intelligence and Security Informatics, Beijing, China, 12 August 2012; Article 1. pp. 1–7. [Google Scholar] [CrossRef]
  25. Anwar, T.; Abulaish, M. Identifying cliques in dark web forums—An agglomerative clustering approach. In Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, Washington, DC, USA, 11–14 June 2012. [Google Scholar] [CrossRef]
  26. Alkhatib, B.; Basheer, R.S. Mining the dark web: A novel approach for placing a dark website under investigation. Int. J. Mod. Educat. Comput. Sci. 2019, 11, 1–13. [Google Scholar] [CrossRef]
  27. Chen, H.; Chung, W.; Qin, J.; Reid, E.; Sageman, M.; Weimann, G. Uncovering the dark web: A case study of jihad on the web. J. Am. Soc. Inf. Sci. Technol. 2008, 59, 1347–1359. [Google Scholar] [CrossRef]
  28. Iliou, C.; Kalpakis, G.; Tsikrika, T.; Vrochidis, S.; Kompatsiaris, I. Hybrid focused crawling on the Surface and the Dark Web. EURASIP J. Inf. Sec. 2017, 2017, 11. [Google Scholar] [CrossRef]
  29. Sangher, K.S.; Singh, A.; Pandey, H.M. LSTM and BERT based transformers models for cyber threat intelligence for intent identification of social media platforms exploitation from darknet forums. Int. J. Inf. Technol. 2024, 16, 5277–5292. [Google Scholar] [CrossRef]
  30. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  31. Ramage, D.; Hall, D.; Nallapati, R.; Manning, C.D. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–7 August 2009; pp. 248–256. [Google Scholar]
  32. Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794v1. [Google Scholar] [CrossRef]
  33. Ríos, S.A.; Muñoz, R. Dark web portal overlapping community detection based on topic models. In Proceedings of the ACM SIGKDD Workshop on Intelligence and Security Informatics, Beijing, China, 12 August 2012; Article 2. pp. 1–7. [Google Scholar]
  34. Porter, K. Analyzing the DarkNetMarkets subreddit for evolutions of tools and trends using LDA topic modeling. Digit. Investig. 2018, 26, S87–S97. [Google Scholar] [CrossRef]
  35. Faizan, M.; Khan, R.A. A two-step dimensionality reduction scheme for dark web text classification. In Proceedings of the Ambient Communications and Computer Systems: RACCCS 2019, Ajmer, India, 29–30 May 2019; Hu, Y.C., Tiwari, S., Trivedi, M., Mishra, K., Eds.; Springer: Singapore, 2019; pp. 303–312. [Google Scholar] [CrossRef]
  36. Bahamazava, K.; Nanda, R. The shift of DarkNet illegal drug trade preferences in cryptocurrency: The question of traceability and deterrence. Forens. Sci. Int. Digit. Investig. 2022, 40, 301377. [Google Scholar] [CrossRef]
  37. Shin, G.-Y.; Jang, Y.; Kim, D.-W.; Park, S.; Park, A.-R.; Kim, Y.; Han, M.-M. Dark side of the web: Dark web classification based on TextCNN and topic modeling weight. IEEE Access 2024, 12, 36361–36371. [Google Scholar] [CrossRef]
  38. Pastor-Galindo, J.; Sandlin, H.-Â.; Mármol, F.G.; Bovet, G.; Pérez, G.M. A Big Data architecture for early identification and categorization of dark web sites. Future Gener. Comput. Syst. 2024, 157, 67–81. [Google Scholar] [CrossRef]
  39. Al-Rowaily, K.; Abulaish, M.; Haldar, N.A.-H.; Al-Rubaian, M. BiSAL—A bilingual sentiment analysis lexicon to analyze Dark Web forums for cyber security. Digit. Investig. 2015, 14, 53–62. [Google Scholar] [CrossRef]
  40. Dalvi, A.; Bhoir, S.; Naik, N.; Kitkaru, A.; Siddavatam, I.; Bhirud, S. A hybrid TF-IDF and RNN model for multi-label classification of the deep and dark web. Int. J. Adv. Comp. Sci. Appl. 2023, 14. [Google Scholar] [CrossRef]
  41. Alghamdi, H.; Selamat, A. Techniques to detect terrorists/extremists on the dark web: A review. Data Technol. Appl. 2022, 56, 461–482. [Google Scholar] [CrossRef]
  42. Ranaldi, L.; Nourbakhsh, A.; Patrizi, A.; Ruzzetti, E.S.; Onorati, D.; Fallucchi, F.; Zanzotto, F.M. The dark side of the language: Pre-trained transformers in the DarkNet. arXiv 2022, arXiv:2201.05613v3. [Google Scholar] [CrossRef]
  43. Tavabi, N.; Goyal, P.; Almukaynizi, M.; Shakarian, P.; Lerman, K. DarkEmbed: Exploit prediction with neural language models. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32, No. 1. [Google Scholar] [CrossRef]
  44. Murty, C.A.; Rughani, P.H. Sentiment & pattern analysis for identifying nature of the content hosted in the dark web. Indian J. Comput. Sci. Eng. 2021, 12, 1822–1836. [Google Scholar] [CrossRef]
  45. Deguara, N.; Arshad, J.; Paracha, A.; Azad, M.A. Threat Miner—A text analysis engine for threat identification using dark web data. In Proceedings of the 2022 IEEE International Conference on Big Data, Osaka, Japan, 17–20 December 2022. [Google Scholar] [CrossRef]
  46. Ranaldi, L.; Ranaldi, F.; Fallucchi, F.; Zanzotto, F.M. Shedding light on the dark web: Authorship attribution in radical forums. Information 2022, 13, 435. [Google Scholar] [CrossRef]
  47. Litvinova, T.; Litvinova, O.; Panicheva, P. Authorship attribution of Russian forum posts with different types of N-gram features. In Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval, Tokushima, Japan, 28–30 June 2019; pp. 9–14. [Google Scholar] [CrossRef]
  48. Manolache, A.; Brad, F.; Barbalau, A.; Ionescu, R.T.; Popescu, M. VeriDark: A large-scale benchmark for authorship verification on the dark web. Adv. Neural Inf. Process. Syst. 2022, 35, 15574–15588. [Google Scholar]
  49. Arabnezhad, E.; La Morgia, M.; Mei, A.; Nemmi, E.N.; Stefa, J. A light in the dark web: Linking dark web aliases to real internet identities. In Proceedings of the 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), Singapore, Singapore, 29 November–1 December 2020; pp. 311–321. [Google Scholar] [CrossRef]
  50. Nazah, S.; Huda, S.; Abawajy, J.H.; Hassan, M.M. An unsupervised model for identifying and characterizing dark web forums. IEEE Access 2021, 9, 112871–112892. [Google Scholar] [CrossRef]
  51. Nazah, S.; Huda, S.; Abawajy, J.; Hassan, M.M. Evolution of dark web threat analysis and detection: A systematic approach. IEEE Access 2020, 8, 171796–171819. [Google Scholar] [CrossRef]
  52. Spitters, M.; Klaver, F.; Koot, G.; Van Staalduinen, M. Authorship analysis on dark marketplace forums. In Proceedings of the 2015 European Intelligence and Security Informatics Conference, Manchester, UK, 7–9 September 2015; pp. 1–8. [Google Scholar]
  53. Klaver, F. Authorship Attribution of Forum Posts. Thesis, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands, 2014. [Google Scholar]
  54. Sennewald, B. Authorship Attribution in the Dark Web. Master’s Thesis, University of New Brunswick, Fredericton, NB, Canada, 2020. [Google Scholar]
  55. Benjamin, V.; Chung, W.; Abbasi, A.; Chuang, J.; Larson, C.A.; Chen, H. Evaluating text visualization for authorship analysis. Secur. Inform. 2014, 3, 10. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
