Identifying Similar Users Between Dark Web and Surface Web Using BERTopic and Authorship Attribution

Gun-Yoon Shin; Dong-Wook Kim; SungJin Park; A-ran Park; Younghwan Kim; Myung-Mook Han

doi:10.3390/electronics14010148

¹ School of Computer Engineering & Applied Mathematics, Hankyong National University, Pyeongtaek-si 17738, Republic of Korea

² Department of AI Software, Gachon University, Sungnam-si 13120, Republic of Korea

³ Cyber Warfare, LIG Nex1, Seongnam-si 13488, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2025, 14(1), 148; https://doi.org/10.3390/electronics14010148

This article belongs to the Special Issue Applications of Deep Learning in Cyber Threat Detection

Abstract

The dark web is a part of the deep web that ensures anonymity to users, thus facilitating various malicious activities, such as the sales of drugs, firearms, and personal information or the dissemination of malware and cyberattack tools. These activities extend beyond the dark web and have negative effects on the surface web, which is commonly accessed by internet users. Recent studies on the dark web are limited to the detection and classification of specific malicious activities; that is, they cannot trace or identify the authors of dark web content or the source of a given piece of information. Therefore, we propose a method for identifying similar authors between the surface and dark webs using BERTopic and authorship attribution. We applied BERTopic to the surface and dark webs to extract previously unidentified topics and measured the similarity between the topics to detect similar topics between the two webs. In addition, we applied authorship attribution to the contents written by the authors of similar topics to extract the unique author characteristics. The similarity between the authors was measured to identify authors with similar characteristics. Thus, we identified authors who had written contents on similar topics on both the surface and dark webs as well as authors who are simultaneously active on both webs.

Keywords: dark web; author identification; BERTopic; authorship attribution; user similarity

1. Introduction

The steady advancement in information technologies has given internet users easy access to extensive information, thus encouraging the development of new technologies; however, these technologies have also been used to create cyber threats. Instances of loss of personal information through malware, spam emails, zombie personal computers, or malicious ads have become increasingly common. Moreover, the misuse of the internet has made it difficult for people to distinguish between misinformation and real information.

The dark web is a part of the deep web and represents the dark side of the internet []. It provides cyber attackers with opportunities to easily and quickly access advanced attack techniques. The dark web has become a means to facilitate these malicious activities, and various cyber threats are constantly occurring through the dark web. Unlike the surface web, which is used by general users, the dark web is hidden []. Moreover, the dark web is approximately 4000 to 5000 times larger than the surface web []. Hence, the deep web contains extensive information about malicious activities. Because the dark web ensures the anonymity of users, it enables them to conduct illegal activities freely; a study found that the total revenue generated by cybercrime in 2018 was around USD 1.5 trillion [,]. Information about malicious activities is shared through various dark web forums and communities, and malware or hacking tools related to cyber threats, drugs, personal information, firearms, etc., are sold on the marketplaces []. For example, a marketplace called the Silk Road was an underground market for illegal goods and services on the dark web []. Among the malicious activities conducted on the dark web, malicious code such as ransomware, Trojan horses, and DDoS malware have been disseminated on the surface web, causing substantial harm to internet users. This is caused by two characteristics of the dark web. First, anonymity is the strongest characteristic of the dark web, which enables users to hide their identity and location []. This anonymity can be achieved through software called Tor. The Tor network encrypts and hides information from the entry node to the exit node so that users engaging in malicious activities can avoid detection by the government or a vendor []. The second is the change in the internet environment. 
Internet users are becoming increasingly curious about accessing the dark web, and it has become easier to purchase technology for malicious activities []. Furthermore, with the advent of cryptocurrency, transactions have become more frequent []. To address these issues, studies are being conducted to collect data by crawling the marketplaces, forums, and websites on the dark web; to analyze the images, files, HTML, and text contained within them to classify the types of malicious activities []; to identify the topics related to malicious activities in the forums; and to collect the IDs of the main forum users []. Existing methods classify the types of malicious activities being carried out on the dark web and identify the main authors in the dark web forums by examining the number of articles or posts [,,,,,,,,]. However, these methods focus solely on classifying the dark web data according to the type of malicious activity; hence, obtaining specific information about the users or authors who carried out these activities is impossible. Furthermore, because the methods that find the main authors or vendors by analyzing the dark web forums or marketplaces rely on the number of articles written, they cannot identify the grammatical characteristics or writing habits of individual authors. Moreover, because most dark web analysis techniques utilize only dark web data, it is impossible to determine the association between the articles posted on the surface web and those on the dark web. In addition, it is difficult to identify the authors who move back and forth between the surface and dark webs and write posts while performing malicious activities.

Therefore, in this study, we aimed to find common topics and authors between the surface and dark webs by applying BERTopic and authorship attribution. We attempted to identify authors who use both the surface and dark webs simultaneously—particularly those who have not been previously identified—by finding the authors engaged in writing content on similar topics. First, relevant data were collected from both the surface and dark webs, and the topics included in the forums were extracted by applying BERTopic. The similarity between the topics extracted from the surface and dark webs was calculated to find similar topics between the two webs. Next, the contents corresponding to the list of similar topics were examined to extract the unique characteristics of the authors based on the authorship attribution of each author. Finally, the extracted unique characteristics of the authors were compared to identify authors with similar characteristics on both the surface and dark webs. In addition, the unique user characteristics were utilized to identify malicious users who are simultaneously active on both the surface and dark webs. The main contributions of this study are summarized as follows:

BERTopic was applied to the surface and dark webs to extract the topics from each web, and identical or similar topics between the two webs were extracted utilizing similarity measurements;
Information about the users creating content on the surface and dark webs was collected, and the unique characteristics of the authors were extracted using authorship attribution;
Malicious users actively engaged on the surface and dark webs can be identified based on the unique characteristics of the authors;
Malicious authors engaged in simultaneous activities on the surface and dark webs can be identified using BERTopic and authorship attribution.

The remainder of this paper is organized as follows. Section 2 reviews the related studies, and Section 3 introduces the proposed method. Section 4 presents the experimental results. Section 5 presents the conclusions and summarizes the future research directions.

2. Related Work

2.1. Dark Web Analysis

The dark web refers to a space within the deep web where malicious activities occur regularly. Moreover, because the dark web provides anonymity, malicious activities such as the distribution of drugs and firearms, illegal gambling, and the dissemination of malware and cyberattack tools are prevalent. As the dark web has become increasingly influential in recent years, studies are being conducted to analyze it to address these issues. Dark web classification classifies the types of malicious activities found in newly collected websites based on the information gained from previously analyzed websites with malicious activities. In the analysis of dark web forums, the topics of the contents on malicious activities frequently mentioned in the forums are identified or classified, and the key users are detected. In addition, tasks such as searching for vendors selling malicious tools on marketplaces and analyzing the key buyers and sellers are performed.

Sabbah et al. [] presented a classification method that combines five-term weighting methods to generate various features and performed classification using these features. He et al. [] used the term frequency-inverse document frequency (TF-IDF) and bag of words, and Cascavilla et al. [] classified the characteristics of marketplaces by utilizing clustering. Ball and Broadhurst [] identified vendor activity information by analyzing the marketplaces, whereas Ursani et al. [] detected malicious activities occurring on marketplaces. Alnabulsi and Islam [] analyzed the similarities between forums based on the number of posts for each malicious activity class. Jin et al. [] proposed a model that utilized bidirectional encoder representations from transformers (BERT) to classify dark web activities and detect malware, whereas Dos Reis et al. [] generated a buyer–seller graph from 31 marketplaces and identified the key users through category analysis. In [], a large corpus of messages uploaded on over 80 dark web forums was examined, and the variation in the topic between the forums was analyzed using latent Dirichlet allocation (LDA) and hidden Markov model. Through this, the similarity between various topics was checked across multiple forums, and heterogeneous and abnormal events hidden between the forum messages were identified. In [], the influential users in dark web forums were identified and analyzed using various ranking algorithms. To this end, the authors proposed the UserRank algorithm, which incorporates content similarity between the messages and response speed as its weights. In [], a method that analyzes the interactions between the users of dark web forums was proposed to detect criminal activities and understand the structural characteristics of the communities. In addition, the authors constructed an interaction network of six dark web forums and analyzed the forums using social network analysis algorithms. Tang et al. 
[] proposed a model that predicts the participants of new posts on specific topics in a dark web community. They aimed to predict the probability of participation based on the topic and user interest. In contrast to previous methods, they incorporated user interest and topic detection models to enhance the performance. Anwar and Abulaish [] proposed a new agglomerative clustering method for identifying user groups with similar views in dark web forums. In their study, they considered each post an independent entity and defined a similarity function that considered contextual and temporal coherence to calculate the similarity between the posts. Furthermore, a new method that used data mining techniques was proposed in [] to investigate malicious activities on the dark web. This method comprised a crawling module, data cleaner, and data mining module and analyzed frequently used words in product titles and sales styles by using association rule generation and clustering techniques. They found frequently occurring patterns in product titles by analyzing the association rules and identified the key sellers by determining the similarity between the sellers through seller clustering. Chen et al. [] analyzed the mode of use of the dark web by terrorists and proposed a new methodology for understanding their activities. To this end, they developed a semi-automated methodology that combined information collection, analysis, and visualization techniques. They also identified topics using topic modeling and analyzed the dark web to identify terrorists and their interests. A hybrid-focused crawler was proposed in [] to discover web resources related to specific topics on the dark and surface webs. Through this, the authors aimed to effectively identify common information between both webs. The proposed crawler was designed to follow links related to a given topic by using various link selection methods. Sangher et al. 
[] proposed a deep learning model based on long short-term memory and BERT for identifying cyber threats that exploit social media platforms in dark web forums. Through this, they aimed to classify and predict criminal activities that exploit social media in dark web forums. Previous studies have primarily applied natural language processing to the collected data to calculate weights based on frequently used words or commonly mentioned phrases or to determine the level of malicious activity through simple count analyses and comparisons of items uploaded for sale in marketplaces. Although these methods allow for the detection and basic analysis of malicious activities, they fail to identify the developers, sellers, or distributors responsible for such activities. Additionally, these approaches are limited to analyzing visible information on the web, making it difficult to uncover hidden data.

2.2. Topic Modeling

Topic modeling is a method for finding topics within a set of documents with unclear topics. In this technique, the model is trained on a set of documents to identify abstract topics included in the set. It can be used to identify keywords for the topics, word probabilities, and relationships between the topics and documents. Document-term matrix and TF-IDF, which are used in conventional document classification, simply calculate the frequencies of specific words and therefore do not reflect the meanings of the words. However, topic modeling solves this issue. LDA is a representative topic modeling algorithm that uses the Dirichlet distribution to estimate and assess the distributions of words related to each topic and the probability distribution of the topics within a document []. Because LDA cannot use labeled data, supervised learning approaches have been considered to solve this problem []. In addition, the BERT algorithm has been applied to identify topics without specific preprocessing of the existing documents [].

Ríos and Muñoz [] used topic modeling to classify ambiguous topics in forums and identify key topics. Porter [] proposed a method for analyzing the trends of topics, which change each month, in specific forums. Faizan and Khan [] performed text classification by applying both mutual information and topic modeling, whereas Bahamazava and Nanda [] used topic modeling and sentiment analysis to analyze the activities of drug users participating in forums. Shin et al. [] proposed a method for identifying keywords through topic modeling to reduce the size of the data and classify malicious activities by applying a convolutional neural network. Pastor-Galindo et al. [] proposed a scalable big data model for early identification of new Tor sites on the dark web and analysis of their contents. They continuously collected onion addresses, removed duplicate content using the MinHash locality-sensitive hashing algorithm, and classified the contents on the sites using the BERTopic model. Bilingual Sentiment Analysis Lexicon (BiSAL), a method for sentiment analysis related to cybersecurity in dark web forums, was proposed in []. In [], a hybrid model combining TF-IDF and a recurrent neural network (RNN) was proposed to classify the contents of the dark web and deep web using multiple labels. The authors aimed to simultaneously classify HTML documents from the deep and dark webs into multiple classes using this model. After preprocessing the text data of the documents, they extracted important words using the TF-IDF technique, assigned labels using FastText, and trained the RNN model to perform multi-label classification, thus ensuring accurate classification.

In [], techniques for analyzing and detecting the contents of websites on the dark web were presented. Feature selection and extraction techniques were used to analyze various methods for performing topic modeling, content analysis, and text clustering. The authors verified that this approach yielded important information for monitoring the web usage of terrorists and understanding their influence. In [], the performance of pre-trained transformer models on sentences encountered for the first time was verified based on data collected from the dark web. The authors aimed to assess the effectiveness of this method in classifying legal and illegal activities. In addition, a neural network language model called DarkEmbed was proposed in [], which predicts the exploitability of software vulnerabilities discussed on the dark and deep webs. This model was used to predict the likelihood of exploitation of software vulnerabilities, and a method for prioritizing security patches based on these predictions was devised. In [], a method was proposed for sentiment analysis and pattern exploration to identify the nature of the content hosted on the dark web. Sentiment analysis was used to check whether the text had a positive, negative, or neutral sentiment, and frequently occurring terms and topics in the data were analyzed through pattern exploration. Furthermore, various topics were selected through LDA, and the nature of the content was identified based on the sentiment analysis results of each topic. In [], a text analysis engine called Threat Miner was proposed to collect and analyze threat data from dark web forums and derive cyber threat information. It was used along with the Word2Vec model to analyze the information shared by cyber attackers on the dark web to enhance defense mechanisms. Previous studies have identified topics frequently mentioned on the dark web to determine the presence of malicious activity or to identify important users or topics. 
However, in identifying key topics or users, the focus was primarily on analyzing users with high activity levels, such as those who posted frequently or left many comments. This approach made it difficult to assess the value of the content in the posts or to distinguish active participants based on specific key topics. Furthermore, with this method, it remains unclear whether a highly active user is a single individual or if multiple individuals are using the same ID.

2.3. Authorship Attribution

It is difficult to determine the author of content on the dark web owing to its provision of anonymity. This is because the dark web has limited user information, such as IP addresses, IDs, and tracking codes, which can be used to identify individuals. Therefore, additional measures are needed to identify the author of content posted on the dark web. Authorship attribution can find traces of the author in the text and extract these traces as features. It can be used on code and HTML documents as well as general documents to find a variety of information based on the writing habits of the author. Typical features related to the writing habits of authors include contractions, capitalization, number notation, URLs, sentence length, and emojis.

In [], the authors of some of the contents on the dark web were identified by applying style-based, lexical-based, and machine learning techniques. In [], the authors of various posts were distinguished by extracting the characteristics of the authors based on n-grams. In addition, BERT was used to analyze the writings of various authors and identify similar authors [], and potential dark web users were identified using stylometric and temporal features []. An identification method was proposed in [], which used forum clustering and the characteristics of forum users. Furthermore, the authors of [] proposed a method for detecting cyber threats on the dark web by using authorship attribution. In [], a method was proposed to find users by collecting the author’s identification information, which can be extracted even from short documents on the dark web. The user characteristics were extracted in five feature categories, and the best-performing feature group was identified by combining the five types of features. In [], a method for extracting meaningful user characteristics by combining short documents into one was proposed. Additional characteristics could also be extracted, and their combination improved the performance. In [], we constructed four author feature categories and combined them to construct a feature dataset. In addition, we applied a machine learning algorithm to confirm the user identification performance. We tried to extract approximately 600 types of information and use them to identify users. In [], a plan to visualize user information based on the analyzed user information was proposed, making it easier to analyze the user’s characteristics or personal profiles. Previous studies analyzed users’ activity in individual forums to determine whether they could be identified. Through this approach, it became possible to attribute unidentified posts within the same forum to specific users. However, these methods faced limitations in identifying the same user across different dark web forums or tracking users active on both the surface web and dark web. Additionally, as the number of users increases, the performance of such methods tends to degrade.

3. Proposed Method

Figure 1 shows the framework proposed in this study for identifying similar users on the surface and dark webs by using BERTopic and authorship attribution. The proposed method consists of five processes. Through these processes, closely related topics between the surface and dark webs are identified, and the similarity between the malicious authors of these topics is determined. First, forum data are collected from the surface and dark webs for training in the data collection stage. In the text preprocessing stage, the minimal processing needed to use BERTopic is performed. In contrast to conventional topic modeling, BERTopic does not require the removal and transformation of unnecessary information. Therefore, all the collected forum content, except for the text written by the users of interest, is removed. In the topic extraction stage, BERTopic is used to find various topics in surface web forums and dark web forums and extract the topics that include malicious activities. During the extraction of similar topics, keywords that appear in the identified malicious topics from the surface and dark webs are used to measure the similarity, and topics with high similarity between the surface and dark webs are identified. During the identification of similar malicious users, contents written by authors associated with highly similar topics on both the surface and dark webs are individually collected, and the unique characteristics of these authors are extracted using authorship attribution. These unique characteristics are then utilized to identify authors with high similarity among the authors of specific topics on the surface and dark webs and to identify the specific characteristics of the authors that contribute to the high similarity.

Figure 1. Framework for identifying similar users on the surface and dark webs using BERTopic and authorship attribution.

3.1. Data Collection

To analyze the authors of specific topics on the surface and dark webs, data related to such topics are needed. Therefore, we collected the data from forums on the surface web in which dark web-related information was exchanged and from active forums on the dark web. Several studies [,,,,,,,,,,] have used datasets containing data from two dark web forums and one surface web forum. We used these data, excluding the data for author identification, in our study. There were 17,879 authors from the surface web forum and 12,159 and 30,206 authors from the two dark web forums. Among the forum users, the active users had written over 2000 posts, whereas the less active users had written approximately 100 posts.

3.2. Data Preprocessing

BERTopic was applied to the collected data to extract various topics. Unlike conventional topic modeling methods, this method does not require preprocessing, such as checking for empty data, removing stop words and special characters, and unifying the capitalization. Therefore, in this stage, only the posts written by the users were extracted from the contents of the two webs to facilitate the training of BERTopic.

3.3. Topic Extraction

In this stage, various hidden topics were extracted from the surface and dark webs using BERTopic. Unlike conventional models for topic modeling, BERTopic effectively extracts the topics by combining transformer-based embedding and clustering techniques. The algorithm first generates document embeddings and clusters them. It then uses class-based TF-IDF (c-TF-IDF) to extract the representative words for each topic from each cluster.

BERTopic transforms documents into vectors by using Sentence-BERT (SBERT), an improved model designed to effectively generate sentence-level embeddings. SBERT maps sentences or paragraphs into a high-dimensional vector space, enabling the calculation of semantic similarity between documents, so that documents with the same topic are located close to each other in the vector space. Moreover, the language model used in the embedding generation stage can be replaced based on the user requirements. In the document clustering stage, the dimensions of the embeddings are first reduced using uniform manifold approximation and projection (UMAP), a nonlinear technique that effectively preserves the local and global structures of high-dimensional data. Next, clustering is performed using hierarchical density-based spatial clustering of applications with noise (HDBSCAN), which is an enhanced version of the DBSCAN algorithm. It locates clusters in high-density regions and regards low-density regions as noise. This algorithm can effectively process clusters of different densities, making it suitable for complex data such as dark web data. In the clustering process, similar documents are grouped into a single cluster, and the documents within a cluster are assumed to have the same topic. The topics derived from each cluster are examined using class-based TF-IDF (c-TF-IDF), a variation of the typical TF-IDF method that evaluates the importance of words based on clusters rather than documents. Equation (1) is the formula for calculating the c-TF-IDF, where $tf_{t,c}$ represents the frequency of the word $t$ in a cluster $c$, $tf_{t}$ is the frequency of the word $t$ across all clusters, and $A$ denotes the average number of words per cluster. This equation calculates the importance of a word within a cluster, and the topic of each cluster is determined based on this importance.

$$W_{t,c} = tf_{t,c} \cdot \log\left(1 + \frac{A}{tf_{t}}\right) \qquad (1)$$
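Equation (1) can be illustrated with a minimal sketch in plain Python. The sketch applies the formula literally to raw word counts; it is not the BERTopic library's implementation, and the function names, whitespace-free pre-tokenized input, and toy clusters are assumptions made only for illustration.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Compute W_{t,c} from Equation (1) for every word in every cluster.

    clusters: dict mapping a cluster id to a list of tokenized documents
              (each document is a list of words). Hypothetical input format.
    Returns:  dict mapping a cluster id to a dict of word -> weight.
    """
    # tf_{t,c}: frequency of word t in cluster c (documents of c pooled together)
    tf_per_cluster = {
        cid: Counter(tok for doc in docs for tok in doc)
        for cid, docs in clusters.items()
    }
    # tf_t: frequency of word t across all clusters
    tf_total = Counter()
    for counts in tf_per_cluster.values():
        tf_total.update(counts)
    # A: average number of words per cluster
    avg_words = sum(tf_total.values()) / len(tf_per_cluster)
    return {
        cid: {t: tf * math.log(1 + avg_words / tf_total[t]) for t, tf in counts.items()}
        for cid, counts in tf_per_cluster.items()
    }


def top_keywords(weights, k=3):
    """Return the k highest-weighted words (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


# Toy example: two clusters of already-tokenized posts (invented data).
toy_clusters = {
    "drugs": [["lsd", "mdma", "price"], ["lsd", "weed"]],
    "crypto": [["bitcoin", "wallet", "price"], ["bitcoin"]],
}
toy_weights = c_tf_idf(toy_clusters)
print(top_keywords(toy_weights["drugs"], k=2))
```

In the toy data, a word such as "price" that occurs in both clusters receives a lower weight than a word of equal frequency that occurs in only one cluster, which is the behavior the logarithmic term is intended to produce.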

3.4. Similar Topic Extraction

Fifteen to twenty topics were extracted from the surface and dark webs, and the extracted topics had between fifteen and twenty keywords. The keywords of a topic are words frequently found in webpages related to that topic. The similarity between the topics on the surface and dark webs was analyzed using the topics and their keywords generated in this manner from the two webs. The similarity was assessed by performing a one-to-one comparison between the topics on the two webs. The degree of similarity was calculated by applying the cosine measure to the keywords of the two topics, as shown in Equation (2).

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}} \qquad (2)$$
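As a minimal sketch of Equation (2), the following function computes the cosine measure between two topics represented as keyword-to-weight mappings. How the keywords are turned into the vectors A and B is not specified here, so the dictionary representation (with absent keywords treated as zero) and the function name are assumptions.

```python
import math


def cosine_similarity(topic_a, topic_b):
    """Cosine measure of Equation (2) for two sparse keyword vectors.

    topic_a, topic_b: dict mapping keyword -> weight (for example, c-TF-IDF
    scores). A keyword missing from one topic contributes 0 for that topic.
    """
    dot = sum(w * topic_b.get(k, 0.0) for k, w in topic_a.items())
    norm_a = math.sqrt(sum(w * w for w in topic_a.values()))
    norm_b = math.sqrt(sum(w * w for w in topic_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty topic is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Invented keyword weights for one surface web topic and one dark web topic.
surface_topic = {"lsd": 2.4, "mdma": 1.7, "price": 1.2}
dark_topic = {"lsd": 2.0, "mdma": 1.5, "vendor": 1.1}
print(round(cosine_similarity(surface_topic, dark_topic), 3))
```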

3.5. Identifying Malicious Similar Users

The users posting content on the topics selected by analyzing the topic similarity between the two webs were extracted, and the topics were segregated according to the user. Next, authorship attribution was applied to extract the unique characteristics of the style of writing of each user. The authorship attribution-based method analyzes the writing habits of authors to identify their unique characteristics. In this process, document analysis methods such as character or word embeddings and n-grams are applied, and writing habits such as the special symbols frequently used by the author, spacing patterns, types of contractions, and typographical errors are analyzed. In this stage, both types of analysis were applied to extract the unique characteristics, and the users associated with the topics deemed highly similar in the topic similarity analysis between the two webs were extracted. Fifteen unique characteristics of the authors were extracted through authorship attribution, as shown in Table 1. In addition to grammatical features, such as word length, count, and ratio, the writing habits of the users (such as emojis, quotation marks, URLs, and contractions) were also considered along with the document analysis method. The unique characteristics of the authors, generated based on authorship attribution, were used to identify the users writing similar types of posts on similar topics found on both webs.
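The kind of writing-habit features described above can be sketched as follows. This is an illustrative subset only and does not reproduce the fifteen characteristics of Table 1; the feature names, regular expressions, and emoji code-point ranges are assumptions chosen for a short example.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Matches forms such as don't, I'm, we'll; it also matches possessive 's.
CONTRACTION_RE = re.compile(r"\b\w+'(?:t|s|re|ve|ll|d|m)\b", re.IGNORECASE)
# Rough pictograph ranges; not an exhaustive emoji definition.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def style_features(text):
    """Extract a few stylometric features from the posts of one author."""
    plain = URL_RE.sub(" ", text)  # keep URLs out of word and sentence counts
    words = re.findall(r"[A-Za-z']+", plain)
    sentences = [s for s in re.split(r"[.!?]+", plain) if s.strip()]
    n_chars = max(len(text), 1)  # guards against division by zero
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "digit_ratio": sum(ch.isdigit() for ch in text) / n_chars,
        "uppercase_ratio": sum(ch.isupper() for ch in text) / n_chars,
        "url_count": len(URL_RE.findall(text)),
        "contraction_count": len(CONTRACTION_RE.findall(text)),
        "quote_count": text.count('"'),
        "emoji_count": len(EMOJI_RE.findall(text)),
    }


# Invented post used only to show the call.
sample = "I don't trust it. Send 2 BTC to http://example.onion now!"
for name, value in style_features(sample).items():
    print(name, value)
```

Concatenating all posts of an author before calling the function gives one feature dictionary per author, which can then be converted to a fixed-order vector for comparison.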

Table 1. Unique characteristics of authors based on authorship attribution.

4. Experiment and Results

4.1. Dataset

In this study, we conducted an experiment using datasets comprising the data from one surface web forum and two dark web forums, as shown in Table 2. This dataset contained the data of 60,244 authors—17,879 authors from the surface web and 42,365 authors from the dark web. The number of posts made by each author varied substantially, ranging from 140 to 2400. On the surface web, posts related to the dark web were collected from the Reddit forum, whereas on the dark web, posts were collected from the Silk Road and Agora forums. The data were collected from January 2014 until May 2015. In addition, we did not utilize the data used for author identification in the dataset because these data were repetitions of the surface web data.

Table 2. Datasets used in this study.

4.2. Topic Similarity Analysis Using BERTopic

Preprocessing was performed on the collected dataset, and 15 to 20 topics were extracted from the surface and dark webs using BERTopic. Topics were created for each forum through BERTopic, and the results classified through hierarchical clustering are shown in Figure 2, Figure 3 and Figure 4. Through clustering, similar topics were organized into one stem, and approximately 20 topics could be identified through additional segmentation within it. The x-axis represents each topic, whereas the y-axis represents the branches generated through hierarchical clustering. Each branching point indicates the presence of distinct characteristics that separate the topics. Through this analysis, we were able to identify the key topics and important words commonly mentioned in the one surface web forum and the two dark web forums. By comparing the keywords of the topics across the three visualizations, we could visually assess the extent to which similar topics existed among them.

The keywords extracted for each topic are shown in Table 3. As shown in the table, not all topics extracted from both webs were related to malicious activities. However, keywords indicating malicious activities can be observed in some of the topics. For example, words related to drugs, such as LSD, MDMA, weed, cocaine, and drug, and words associated with contract killing, such as gun, karma, and people, were found in some of the topics on both the surface and dark webs. Furthermore, keywords related to illegal trade, such as bitcoin, anonym, credit card, and PayPal, were also found.

The results of the similarity analyses between the topics derived by analyzing the contents on the two webs are shown in Table 4 and Table 5. Table 4 shows the results of topic association analysis between the Agora forum on the dark web and the surface web. The results reveal that some topic pairs reach 80% similarity. Upon examining these topics, it was found that they were related to drugs, as they contained frequent mentions of the names of drugs. Table 5 presents the topic association analysis between Silk Road on the dark web and the surface web. Similar to Table 4, the topics with the highest similarity were all related to drugs. However, topics with a similarity range of approximately 60–70% were found to involve transactions leveraging anonymity, such as those related to encryption or cryptocurrency. These topics included details on how the transactions are conducted and which items are commonly traded.
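To make the keyword lists concrete, the following minimal sketch ranks terms per topic with a class-based TF-IDF weighting of the kind BERTopic uses, in which a term's frequency inside a topic is multiplied by log(1 + A / tf_t), where A is the average number of words per topic and tf_t is the term's frequency across all topics. The function names `c_tf_idf` and `top_keywords` and the toy posts are hypothetical and are not taken from the study's data or code; the sketch only illustrates the weighting and omits BERTopic's embedding and clustering steps.

```python
import math
from collections import Counter


def c_tf_idf(topic_docs):
    """Score terms per topic with a class-based TF-IDF weighting.

    topic_docs maps a topic id to the list of tokenized posts assigned to it.
    Returns {topic_id: {term: score}}.
    """
    # Treat all posts of a topic as one "class document" and count its terms.
    tf = {topic: Counter(tok for doc in docs for tok in doc)
          for topic, docs in topic_docs.items()}
    # Frequency of each term across all topics.
    total_tf = Counter()
    for counts in tf.values():
        total_tf.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total_tf.values()) / len(tf)
    return {
        topic: {term: freq * math.log(1 + avg_words / total_tf[term])
                for term, freq in counts.items()}
        for topic, counts in tf.items()
    }


def top_keywords(scores, n=10):
    """Return the n highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:n]


# Hypothetical, already-stemmed toy posts (illustration only).
toy_topics = {
    0: [["lsd", "mdma", "good"], ["lsd", "drug", "good"]],
    1: [["encrypt", "key", "good"], ["pgp", "key", "encrypt"]],
}
scores = c_tf_idf(toy_topics)
keywords = {topic: top_keywords(s, n=3) for topic, s in scores.items()}
```

In this toy input, the word "good" occurs in both topics and is therefore down-weighted relative to topic-specific terms such as "lsd" or "encrypt", which is the behavior that lets topic keywords separate drug-related vocabulary from common words.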

Figure 2. Results of surface web clustering created using BERTopic.

Figure 3. Results of dark web (Silk Road) clustering created using BERTopic.

Figure 4. Results of dark web (Agora) clustering created using BERTopic.

Table 3. Keywords extracted for similarity analysis between the surface and dark webs (Agora and Silk Road).

Table 4. Results of topic association analysis between surface and dark webs (Agora).

Table 5. Results of topic association analysis between surface and dark webs (Silk Road).

Multiple similar topics were found by applying the cosine measure to the topics extracted from the two dark web forums and the one surface web forum. The degree of similarity between the topics extracted from the two webs varied, with the highest similarity being 80%. A similarity score of at least 50% was used as the criterion for treating two topics as similar. The similar topics between the surface and dark webs showed substantial keyword overlap. Among these topics, only those in which words related to malicious activities were frequently mentioned were selected for additional analysis.
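The thresholding step can be sketched as follows. The text does not state how each topic is represented as a vector, so this sketch assumes, purely for illustration, that a topic is a binary vector over its keywords; under that assumption the cosine measure reduces to the number of shared keywords divided by the geometric mean of the two list lengths. The function names and the toy keyword lists are hypothetical.

```python
import math


def keyword_cosine(a, b):
    """Cosine similarity of two keyword lists viewed as binary term vectors."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))


def similar_topic_pairs(surface, dark, threshold=0.5):
    """Return (surface_id, dark_id, score in %) for pairs at or above threshold."""
    pairs = []
    for s_id, s_keywords in surface.items():
        for d_id, d_keywords in dark.items():
            score = keyword_cosine(s_keywords, d_keywords)
            if score >= threshold:
                pairs.append((s_id, d_id, round(100 * score, 2)))
    # Highest similarity first.
    return sorted(pairs, key=lambda pair: -pair[2])


# Hypothetical keyword lists (illustration only, not the study's topics).
surface_topics = {
    0: ["lsd", "mdma", "drug", "take", "good"],
    1: ["encrypt", "key", "password", "disk", "drive"],
}
dark_topics = {
    3: ["lsd", "mdma", "drug", "ship", "good"],
    7: ["encrypt", "key", "pgp", "tor", "onion"],
}
matches = similar_topic_pairs(surface_topics, dark_topics, threshold=0.5)
```

Here the two drug-related lists share four of five keywords and pass the 50% criterion, whereas the two encryption-related lists share only two of five and are filtered out.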

4.3. Author Similarity Analysis Using Authorship Attribution

A list of users posting content on similar topics between the surface and dark webs was generated, and the content written by each user was collected to extract the unique characteristics of the authors. These unique characteristics were identified by applying the authorship attribution method; the generated features are shown in Figure 5. The figure illustrates the writing characteristics of users from the three web forums. The columns represent the various features of the users' writing that were extracted through authorship attribution. It was found that grammatical features, such as sentence or word length and average number of words, as well as the personal characteristics of the authors, such as the ratio of numbers in the text, use of emojis, and use of contractions, differed depending on the author. This information was used to measure the similarity between users posting content on related topics on both the surface and dark webs. Table 6 shows the results of applying the cosine measure to the authors of similar topics on the two webs. Thus, it was possible to identify users with similar writing styles among the users writing content on similar topics on the two webs.
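A minimal sketch of the author comparison is given below. It assumes that each author has already been reduced to a numeric feature vector and that the features are standardized (z-scored) across authors before the cosine measure is applied; the standardization step is an assumption of this sketch and is not described in the text. The author names and feature values are hypothetical.

```python
import math
from statistics import mean, pstdev


def standardize(vectors):
    """Z-score every feature column so that no single feature dominates."""
    names = list(vectors)
    columns = list(zip(*(vectors[name] for name in names)))
    # Fall back to 1.0 when a feature is constant to avoid division by zero.
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return {
        name: [(x - m) / s for x, (m, s) in zip(vectors[name], stats)]
        for name in names
    }


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical vectors: [average word length, average words per sentence,
# contraction ratio, uses emojis (1.0 or 0.0)].
authors = {
    "surface_a": [4.1, 12.0, 0.60, 1.0],
    "dark_b": [4.2, 12.5, 0.55, 1.0],
    "dark_c": [5.3, 24.0, 0.05, 0.0],
}
z = standardize(authors)
sim_ab = cosine(z["surface_a"], z["dark_b"])
sim_ac = cosine(z["surface_a"], z["dark_c"])
```

Without some form of scaling, large-valued features such as the total number of words would dominate the cosine measure, which is why the sketch standardizes first; other scaling choices would be equally plausible.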

Figure 5. Unique characteristics of authors.

Table 6. Similarity comparison results between authors of similar topics on surface and dark webs.

Furthermore, it could be inferred that even though these users use various IDs, they primarily engage with the same topics and do not transition to other topics. Moreover, Table 7 presents the analysis of the words frequently used by authors with similar unique characteristics among those actively writing on similar topics on the two webs. The analysis results indicate that both the words related to malicious activities and the frequently used common words are similar. In particular, Table 6 shows that user15376 and user10238, who are active on the surface web and dark web, respectively, have the highest similarity score (68.86) despite operating on different platforms. This is attributed to their similar writing habits and characteristics. Additionally, as shown in Table 7, an analysis of the words frequently used by these two users revealed that they predominantly discuss topics related to drugs.

Table 7. Comparison of keywords used by similar authors on the surface and dark webs.

5. Discussion

In this study, we propose a method that uses BERTopic and authorship attribution to identify similar topics and similar users between the surface and dark webs. Using this method, we extracted similar topics from one surface web forum and two dark web forums. We then collected the unique characteristics of the authors of these topics and identified users with similar writing styles across the two webs through a similarity measure between the unique characteristics of the authors. Based on this information, we identified authors who are actively engaged on both webs. By using BERTopic, it was possible to identify topics that were apparently dissimilar but were actually common between the two webs and to verify whether these topics contained genuinely similar content. Additionally, through authorship attribution, we were able to assess whether users discussing common topics on both webs could actually be the same individuals by analyzing their habits, characteristics, and writing styles.

However, some of the topics extracted using BERTopic were composed largely of common words unrelated to malicious activities, with relatively few words associated with malicious activities. Moreover, it was challenging to provide definitive evidence to determine whether similar users identified on both webs were indeed the same individuals. In some cases, users were deemed similar based solely on basic language patterns or commonly used phrases that anyone might use. Therefore, we believe that further research needs to be conducted to compensate for these biases in the data. In addition, it is necessary to extract more diverse types of unique characteristics by using authorship attribution to further segment and distinguish the authors. Furthermore, methods for determining the optimal number of topics from the collected data should be improved. Our future research will consider these aspects to build a more robust model.

6. Conclusions

Malicious behaviors have become increasingly common on the dark web in recent years. Consequently, studies are being actively conducted to preemptively block these activities. Various methods are employed to detect malicious activities occurring on the dark web, classify such activities, identify risky vendors in marketplaces, and analyze key users in forums. Although key users can be inferred based on the number of activities on the forum and number of posts, it is difficult to obtain specific information about the authors owing to the anonymous nature of the dark web. Moreover, because the information is restricted to the dark web, it is challenging to detect malicious activities by obtaining dark web-related information from the surface web. Therefore, this study proposes a method for identifying users active on both webs using BERTopic and authorship attribution. BERTopic was utilized to identify common topics and key terms mentioned across the two webs, whereas authorship attribution was employed to extract distinguishing features such as the textual, linguistic, and semantic characteristics of users. Through these processes, it became possible to identify suspicious users active on both webs. However, certain features failed to clearly distinguish users active on the webs, or they incorrectly identified general linguistic traits as unique user characteristics, highlighting limitations of the proposed approach. Future research will aim to address these issues by enhancing the user profiling methods to improve the differentiation and characterization of individual users.

Author Contributions

Conceptualization, G.-Y.S.; methodology, G.-Y.S.; investigation, D.-W.K.; formal analysis, G.-Y.S., M.-M.H., S.P., A.-r.P., and Y.K.; writing—original draft preparation, G.-Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a Korea Research Institute for Defense Technology Planning and Advancement (KRIT) grant funded by the Korea Government’s Defense Acquisition Program Administration (DAPA) (Grant No. KRIT-CT-21-037), and the Basic Science Research Program through the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (Grant No. RS-2023-00248132).

Data Availability Statement

No new data were created or analyzed in this study. Data are contained within the article.

Acknowledgments

The authors would like to express their gratitude to the National Research Foundation of Korea (NRF) and LIG Nex1 for their generous financial support, which was crucial for the successful completion of this research. We are also deeply appreciative of the invaluable feedback and insightful comments provided by our colleagues, whose expertise and encouragement significantly enhanced the quality of this paper. Their support in various aspects, from conceptual discussions to technical assistance, has been instrumental. Additionally, we would like to acknowledge the administrative and logistical support from our institution, which facilitated our research activities. Finally, we extend our sincere thanks to the anonymous reviewers for their thorough and constructive critiques, which helped us refine and improve the manuscript. For any technical inquiries related to this research, please contact the authors or the corresponding author via email.

Conflicts of Interest

Authors SungJin Park, A-ran Park and Younghwan Kim were employed by the company LIG Nex1. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors also declare that this study received funding from the Korea Government’s Defense Acquisition Program Administration. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.


Figure 1. Framework for identifying similar users on the surface and dark webs using BERTopic and authorship attribution.

Table 1. Unique characteristics of authors based on authorship attribution.

Feature Name	Description
Average Word Length	The average length of words in the entire document
Total Number of Words	The number of words used in the entire document
Ratio of Numbers	The ratio of numbers used per sentence
Number of Special Characters	The total number of special characters
Number of Function Words	The total number of function words
Ratio of Sentences Beginning with a Capital Letter	The ratio of sentences that begin with a capital letter in the entire document
Ratio of Contractions Used	The ratio of actual contractions among the words that can be formed into contractions
Number of Short Words	The number of short words used that are of four letters or fewer
Average Number of Words per Sentence	The average number of words used per sentence
Number of Words Used	The number of non-duplicate words used in the entire document
Number of Punctuation Marks	The number of punctuation marks used
Average Number of Punctuation Marks per Sentence	The average number of punctuation marks used per sentence
Use of Quotation Marks	Whether quotation marks are used in the entire document
Use of Emojis	Whether emojis are used in the entire document
Use of URLs	Whether URLs are used in the entire document
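Several of the features in Table 1 can be computed with simple text processing. The sketch below covers a subset of them under stated assumptions: sentences are split on terminal punctuation followed by whitespace, words are runs of letters and apostrophes, and punctuation is counted with Python's `string.punctuation`. These tokenization rules, the function name `stylometric_features`, and the sample text are hypothetical choices for illustration and are not taken from the study.

```python
import re
import string


def stylometric_features(text):
    """Compute a subset of the Table 1 features for one author's text."""
    # Split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words are runs of letters, keeping apostrophes so "don't" is one word.
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)
    return {
        "avg_word_length": sum(map(len, words)) / n_words if n_words else 0.0,
        "total_words": n_words,
        # Short words: four letters or fewer, as defined in Table 1.
        "short_words": sum(len(w) <= 4 for w in words),
        "avg_words_per_sentence": n_words / len(sentences) if sentences else 0.0,
        # Number of non-duplicate words, compared case-insensitively.
        "distinct_words": len({w.lower() for w in words}),
        "punctuation_marks": sum(ch in string.punctuation for ch in text),
        "uses_quotes": any(q in text for q in ('"', "“", "”")),
        "uses_url": bool(re.search(r"https?://\S+", text)),
    }


sample = "I don't know. Ship it today!"
features = stylometric_features(sample)
```

The resulting dictionary can be turned into a fixed-order numeric vector (with booleans mapped to 0 and 1) before any similarity measure is applied.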

Table 2. Datasets used in this study.

Dataset	Size	Authors	Average Number of Words	Source
Agora	4,195,381	12,159	143	Dark Web
Silk Road	614,656	30,206	119	Dark Web
Reddit	106,252	17,879	84	Surface Web

Table 3. Keywords extracted for similarity analysis between the surface and dark webs (Agora and Silk Road).

Name	Keywords (topics separated by semicolons)
Surface Web	‘lsd’, ‘get’, ‘like’, ‘meth’, ‘drug’, ‘mdma’, ‘take’, ‘good’, ‘would’, ‘know’; ‘exif’, ‘pic’, ‘pictur’, ‘data’, ‘imag’, ‘upload’, ‘photo’, ‘jpg’, ‘camera’, ‘imgur’; ‘encrypt’, ‘key’, ‘truecrypt’, ‘password’, ‘use’, ‘disk’, ‘brute’, ‘drive’, ‘comput’, ‘ae’; ‘schizophrenia’, ‘schizophren’, ‘psychosi’, ‘voic’, ‘hear’, ‘brain’, ‘ndma’, ‘bing’, ‘disord’, ‘xl0’, ‘wiki’, ‘littlehelperrobot’, ‘wut’, ‘github’, ‘mobil’, ‘wikipedia’, ‘http’, ‘judg’, ‘org’
Dark Web (Silk Road)	‘vendor’, ‘get’, ‘order’, ‘would’, ‘bitcoin’, ‘btc’, ‘like’, ‘make’, ‘use’, ‘account’; ‘key’, ‘encrypt’, ‘messag’, ‘use’, ‘tor’, ‘onion’, ‘pgp’, ‘http’, ‘public’, ‘file’; ‘like’, ‘trip’, ‘the’, ‘think’, ‘get’, ‘karma’, ‘peopl’, ‘one’, ‘gun’, ‘say’; ‘shall’, ‘state’, ‘child’, ‘debt’, ‘bank’, ‘committe’, ‘govern’, ‘unit’, ‘the’, ‘parti’; ‘cocain’, ‘test’, ‘lab’, ‘result’, ‘convert’, ‘maximum’, ‘levamisol’, ‘puriti’, ‘analyz’, ‘level’
Dark Web (Agora)	‘vein’, ‘needl’, ‘inject’, ‘shoot’, ‘arm’, ‘elbow’, ‘arteri’, ‘bruis’, ‘blood’, ‘use’; ‘week’, ‘zone’, ‘utc’, ‘end’, ‘time’, ‘gmt’, ‘sunday’, ‘monday’, ‘midnight’, ‘friday’; ‘mac’, ‘spoof’, ‘address’, ‘router’, ‘network’, ‘chang’, ‘tail’, ‘wireless’, ‘connect’, ‘macchang’; ‘second’, ‘later’, ‘less’, ‘ago’, ‘pleas’, ‘last’, ‘post’, ‘bmr’, ‘tri’, ‘the’; ‘bitcoin’, ‘paypal’, ‘creditcard’, ‘accept’, ‘anonym’, ‘shop’, ‘minut’, ‘729’, ‘519’, ‘1,099’

Table 4. Results of topic association analysis between surface and dark webs (Agora).

Surface Web	Dark Web	Similarity Score
0	1	70.00
0	3	70.00
1	5	80.00
1	11	80.00
5	14	60.00
7	18	66.67
8	13	60.00
10	4	55.33

Table 5. Results of topic association analysis between surface and dark webs (Silk Road).

Surface Web	Dark Web	Similarity Score
0	3	66.67
0	4	66.67
2	7	70.00
3	5	70.00
8	15	80.00
9	14	60.00
10	14	55.33
12	17	80.00
16	17	80.00

Table 6. Similarity comparison results between authors of similar topics on surface and dark webs.

Surface Web	Dark Web (Agora)	Similarity Score	Surface Web	Dark Web (Silk Road)	Similarity Score
user2666	user4618	57.8	user15376	user10238	68.86
user7287	user9865	26.87	user2364	user4978	16.83
user1847	user20	63.58	user748	user8723	21.75
user6405	user265	19.26	user5546	user1474	10.22
user1470	user452	26.1	user6600	user1369	63.39

Table 7. Comparison of keywords used by similar authors on the surface and dark webs.

Author	High-Frequency Words
user15376	lsd, day, drug, take, week, good, send, messag, know, wait
user10238	ship, lsd, messag, week, agora, good, btc, mdma, send, know
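The overlap between the two word lists in Table 7 can be checked directly. The short sketch below uses the words exactly as listed in the table and reports the shared words together with a Jaccard ratio; the Jaccard ratio is an illustrative measure chosen for this sketch and is not the similarity measure reported in Table 6.

```python
# High-frequency (stemmed) words as listed in Table 7.
user15376 = ["lsd", "day", "drug", "take", "week",
             "good", "send", "messag", "know", "wait"]
user10238 = ["ship", "lsd", "messag", "week", "agora",
             "good", "btc", "mdma", "send", "know"]

# Words that appear in both authors' lists.
shared = sorted(set(user15376) & set(user10238))

# Jaccard ratio: shared words divided by all distinct words in either list.
jaccard = len(shared) / len(set(user15376) | set(user10238))
```

Six of the ten words in each list are shared (good, know, lsd, messag, send, week), giving a Jaccard ratio of 6/14, or roughly 0.43.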

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
