Fake User Detection Based on Multi-Model Joint Representation

Abstract: Existing deep learning-based detection of fake information focuses on the transient detection of news itself. Compared to user category profile mining and detection, transient detection is prone to higher misjudgment rates due to insufficient temporal information, posing new challenges to social public opinion monitoring tasks such as fake user detection. This paper proposes a multimodal aggregation portrait model (MAPM) based on multi-model joint representation for social media platforms. It constructs a deep learning-based multimodal fake user detection framework by analyzing user behavior datasets within a time retrospective window, and integrates a pre-trained domain large model to represent user behavior data across multiple modalities, thereby constructing a high-generalization implicit behavior feature spectrum for users. In response to the tendency of existing fake user behavior mining to neglect time-series features, this study introduces an improved network, the Sequence Interval Detection Net (SIDN), based on Sequence to Sequence (seq2seq), to characterize time interval sequence behaviors, achieving strong expressive capability for detecting fake behaviors within the time window. Ultimately, the combination of latent behavioral features and explicit characteristics serves as the input for spectral clustering to detect fraudulent users. Experimental results on a real Weibo dataset demonstrate that the proposed model outperforms detection using explicit user features alone, with an improvement of 27.0% in detection accuracy.


Introduction
Social media plays a crucial role in today's society, serving as a primary platform for information dissemination, social interaction, and opinion propagation. Weibo, one of the most popular social media platforms in China, allows users to publish short text messages as well as various media formats, including images, videos, and audio. Users on Weibo can follow others and be followed, and content can be interacted with through retweets, comments, and likes. Massive amounts of data are generated daily, among which the proliferation and dissemination of fake news is prevalent, posing one of the serious challenges faced by social media platforms. Past research has mainly focused on analyzing fake news itself. Typically, these studies emphasize a single modality, specifically text or visual information recognition. Additionally, some studies employ multimodal methods to identify fake news, with existing approaches largely integrating both image and text modalities. Studies on single-modal fake news detection mainly include text-based detection [1][2][3][4] and visual-based detection. Integrating multiple modalities is one of the key challenges in multimodal fake news detection; a common approach maps information from different modalities into the same feature space and fuses it in a classifier to identify fake news [5][6][7][8]. Many scholars believe that if news images do not match the textual content, the news is fake. Based on this assumption, methods that contrast modalities are employed, using thresholds on image-text similarity to recognize fake news. Similarly effective methods include enhancing multimodal information through co-attention mechanisms [9] and assisting fake news detection by extracting user features for credibility assessment [10].
For detecting fake users on social media platforms, various methods have been proposed, most of which analyze features in users' profiles [11][12][13] or use machine learning methods that combine text features with user profiles. However, with the emergence of high-quality generative models such as ChatGPT and BigGAN, fake users on social media platforms are becoming increasingly realistic and human-like. These advanced generative models endow fake users with more sophisticated and realistic machine-generated capabilities, making them increasingly difficult to distinguish on social media platforms and further increasing the threat of widespread dissemination of fake information.
In this paper, we adopt a multi-modal joint representation strategy and propose the Multi-Modal Aggregation Portrait Model (MAPM) based on this strategy for detecting fake users. Specifically, the MAPM model consists of three parts: (1) using the Chinese version of Pegasus [14] to extract text summaries as input for the Chinese CLIP pre-training module [15,16], which analyzes text and image similarity using multi-modal information; utilizing the BERT pre-training module [17] to classify blogs posted by users; and introducing the HiFi-NET network module [18] to determine whether blog images have undergone AI synthesis or tampering. (2) Designing a Sequence Interval Detection Net (SIDN) module based on the Sequence to Sequence [19] architecture to model the time interval sequences of user blog posts, representing implicit temporal behavior features. (3) Adjusting the outputs of the feature extraction modules to user-level scalars and mapping them to the same feature space, then combining implicit behavioral features with explicit user features and constructing a spectral clustering-based unsupervised classification module [20] to achieve multi-modal detection of fake users.
The main contributions of this paper are as follows:
1. MAPM utilizes large models such as CLIP, BERT, and HiFi-NET as transfer learning modules to aggregate explicit and implicit features across multiple dimensions, including text, images, and user data, thereby achieving a concrete fusion of user behavior characteristics.
2. We propose a Time Interval Detection Network aimed at representing the latent temporal characteristics of user posting behavior, enhancing the model's capability to detect unusual user behavior.
3. By combining implicit behavioral features with explicit user features, we construct a spectral clustering-based unsupervised classification module to further classify fake users, providing a potential approach to public opinion analysis in social media management.
The structure of the remainder of the paper is as follows: Section 2 discusses current popular methods for identifying fake news and fake users. Section 3 delves into the theoretical foundations of the model and details the representation of user behavior. Section 4 introduces the experimental classification results of the model and discusses the data preparation work. Finally, Section 5 provides the conclusion of this paper.

Fake News Detection
The narrow definition of fake news refers to intentionally falsified information that can be verified as false and may mislead readers [21]. This definition has two key features: falsity and intent. Firstly, fake news includes falsified information that can be verified. Secondly, fake news is characterized by the intent to mislead readers, a definition widely adopted in recent research [22]. Broadly speaking, fake news is defined as news that is deceptive in terms of either the truthfulness of its content or the intent of its publisher, even including satirical news, which, despite being entertainment-oriented, still contains false content [23]. Additionally, broad fake news may also cover other forms of deceptive news, such as serious fabrications, hoaxes, and other news that can have a misleading effect [24]. Formally, we define fake news as news that contains untrue information. There are several reasons for choosing the broad definition. Firstly, this study identifies users who have posted articles on military-related topics. Compared to regular news, the dissemination of fake military news can have more serious and complex consequences: regardless of the intent of the disseminator, misleading reports may shape unjust perceptions of a country, its leaders, or its policies. Secondly, the broad definition of fake news helps enhance the model's generality, because content posted on social media platforms often has a certain entertainment value; such a definition enables the model to better adapt to the diversity of social media platforms.
Currently, significant progress has been made in research on detecting fake news from text and visual information, with effective methods proposed to address this issue. In the field of text-based fake news detection, Ma et al. [25] first applied deep learning techniques to fake news detection. Their method inputs sentences into a Gated Recurrent Unit (GRU) network to represent news information using the hidden-layer vectors of recurrent neural networks. In subsequent work, they introduced the idea of multi-task learning into fake news detection [26]. Nan et al. proposed the MDFEND model, which applies the pre-trained BERT model to text encoding, aggregates the outputs of expert layers through domain gates into a classifier, and finally uses softmax for binary fake news classification. Liu et al. [27] proposed a fake news filtering algorithm based on generative adversarial networks. Qi et al. [28] proposed the fake image discriminator MVNN, which uses frequency-domain features to determine whether images have been manipulated with image editing software and spatial-domain features to recognize the semantic information of images. The above research indicates that single-modal information detection is effective in fake news detection tasks, but single-modal methods may overlook crucial information contained in other modalities, resulting in an incomplete understanding of the overall situation.
The key issue in multi-modal fake news detection is coordinating text and visual features. Singhal et al. employed the Visual Geometry Group (VGG) model and XLNet to effectively classify fake news using a cross-modal approach. Zhou et al. [29] used image2text to map textual and visual information into the same vector space and then compared the similarity between visual and textual information. Meng et al. [30] noticed the neglect of interactions between single and multiple modalities in existing research and proposed intra-modality and inter-modality attention mechanisms to capture high-level interactions between the text and visual domains. Qi et al. [31] comprehensively considered the complementary information across modalities and fused textual and visual information using co-attention. Raza et al. [32] leveraged a Transformer model combining news content with heterogeneous auxiliary information, aiming at early detection of fake news. Ying et al. [33] proposed the BMR model, which treats consistency between text and image as an auxiliary discriminant factor for fake news, combines multi-task models to adaptively weight features from multiple modalities, and ultimately determines whether a piece of news is fake. These research achievements provide valuable examples for improving classification performance in fake news detection. However, while significant advances have been made in detecting fake news, social media platforms emphasize user participation and creation [34], and previous work has not fully addressed user characteristics on social media platforms [35].
In recent years, research on deep learning-based fake news detection has mostly drawn on social media platforms. Weibo is China's largest social media platform, where users are not only receivers of information but also important sources of it. In this context, some works assess the credibility of users by utilizing their personal profiles and posting histories. Lu et al. [36] extracted user features from user profiles and social interactions on Twitter, learned user propagation representations from these features using CNNs and RNNs, and proposed a dual co-attention mechanism to learn the correlation between source tweets and retweet propagation as well as the mutual influence between source tweets and user interactions. Dou et al. [37] identified user credibility from user posting history as an intrinsic factor; at the same time, this work regarded the propagation of news as an extrinsic factor and used intrinsic and extrinsic factors jointly for fake news detection.
Current fake news research focuses on detecting visual and textual information, neglecting the widespread participation and influence of users in social media networks, or treating user behavior only as an auxiliary task for fake news detection rather than the core of the detection task.We believe that combining users' historical behaviors with user characteristics is a key clue to revealing fake news.

Fake User Detection
The key to identifying fake users lies in mining and analyzing features. Extracting user profile information as features combined with low-dimensional user behavior features is a common method in fake user detection [38][39][40]. Common low-dimensional behavioral features include the number of stopwords, emojis, posting frequency, etc. Building classification models on comprehensive social network features and user interaction behaviors enables more comprehensive fake user detection [41,42]. Implicit features of users refer to their underlying behavioral patterns. These features are not directly visible like explicit features and require algorithms and analysis to reveal deeper characteristics of users, such as "posting patterns" and "correspondence between text and images". Yang et al. [43] noticed that traditional user detection focuses on identifying individual anomalous users and proposed an anomaly detection framework based on bipartite graphs and co-clustering to identify abnormal users; however, due to the singular form of abnormal user samples, it is difficult to address anomaly detection in different social environments. Heidari et al. [44] explored the impact of sentiment features mined by text sentiment classification networks on the accuracy of detecting fake users on social media but failed to fully characterize user behavior. Furthermore, Mohammad et al. [45] used the first tweet of each user as input to a convolutional neural network for identifying bot users; however, a single tweet lacks temporal information. Therefore, there is a need for a detection method that can fully represent user behavior and generalizes well.
On another front, automated users on social media platforms refer to social entities implemented by programs to mimic human social interaction norms [46][47][48]. Malicious bot accounts manipulate user influence through behaviors such as commenting, follower boosting, and reposting, thereby manipulating public opinion, disseminating harmful information, and posing serious threats to the online environment [49][50][51]. The core focus of this study lies in analyzing user behaviors on social media platforms. Therefore, the research focuses on identifying automated users and conducting fine-grained classification of fake users based on their different automation behaviors.
The first category of fake users mimics the daily posting behavior of genuine users and actively participates in platform giveaways to seek rewards while evading detection by automated systems or manual checks. The behavior of the other category primarily involves automatically reposting multimedia messages sourced from multiple platforms, such as images, videos, or other content. The operations of these machine users typically aim to widely disseminate specific types of information, potentially involving promotion, propaganda, or other purposes; their automated nature enables them to efficiently execute large-scale message reposting. In this study, these two categories are referred to as lottery users and repost users, respectively, and collectively as fake users. In past research, there has been limited algorithmic analysis of user behavior features [52][53][54], and traditional machine learning algorithms struggle to identify fake user behaviors that are combined with generative models. Therefore, the proposed method jointly analyzes textual, visual, multimodal, and temporal information using multiple models to fully characterize user behaviors.

Methodology
This paper proposes MAPM for multimodal joint representation and detection of fake users; the architecture of MAPM is shown in Figure 1. The core work of this study focuses on characterizing user behavior. Firstly, all textual and image content from each user's blogs is analyzed. Specifically, we fine-tuned the BERT model for domain classification of text, identifying the proportion of blog content in different domains. Subsequently, the HiFi-NET module is introduced to detect whether images are AI-generated or tampered with, calculating the mean probability of image forgery over all images. Due to the text length limitation of the CLIP module, we first input the blog text into the Pegasus module to extract Chinese text summaries, which are then used alongside blog images as inputs to the CLIP module to extract consistency scores between text and images. Next, user posting richness features are extracted using the entropy formula. Secondly, the SIDN module is introduced to analyze time interval sequences, representing user posting behavior features. Finally, the text domain vectors extracted by the BERT module are flattened into scalars and combined with the text-image consistency scores, mean forgery probabilities, posting richness, and temporal behavior feature scores to form user behavior features. These features are combined with existing explicit user features to create individual user feature sets. Subsequently, a spectral clustering algorithm is used to cluster users with similar feature profiles, enabling the detection of fake users.

The Window User Feature Space
The explicit feature set U comprises the membership level, number of blogs, number of follows, number of fans, number of likes, number of retweets, and number of comments. U_l, U_r, and U_c denote the numbers of likes, retweets, and comments summed over each of the user's blogs.
The composition rule of the blog set B for a user θ is B = {S_1, S_2, ..., S_n}, where & represents the string concatenation symbol and the triple S_n = (P_tn & T_n & I_n) represents the nth blog of the user in the dataset; P_tn, T_n, and I_n respectively represent the posting time, text content, and image of that blog. Here, images are linked to T_n in the form of text.
In the user behavior analysis stage, S n is extracted for preprocessing, and a method using regular expressions is employed to remove text tags and emoticons from T n on the Weibo platform.
On the Weibo platform, blogs from bot users exhibit a high degree of similarity in style: some bot users' blogs are all accompanied by images, while others are all plain text. We therefore introduce information entropy to measure the diversity of user blogs. For most normal users, the presence of images in a blog cannot be predicted, which yields higher information entropy. With p(x) denoting the proportion of blogs that contain pictures, the richness of a user's blogs is calculated as H = -p(x) log2 p(x) - (1 - p(x)) log2(1 - p(x)). In the text classification task, we used the "bert-base-chinese" model. During pre-training, the model parameters were frozen, and a linear layer was used for the output, with output length equal to the number of text-domain categories. T_n was extracted and truncated at punctuation marks as input to the model. Finally, the share of each domain in the textual part of user blogs was computed as a percentage.
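As a minimal sketch of the richness measure described above, the following computes the binary information entropy of a user's image/no-image mix; the function name and input format are illustrative, not from the paper.

```python
import math

def posting_richness(has_image_flags):
    """Binary information entropy of a user's image/no-image blog mix.

    has_image_flags: list of 0/1 flags, one per blog (1 = contains a picture).
    Uniform, bot-like users score near 0; unpredictable users approach 1.
    """
    n = len(has_image_flags)
    p = sum(has_image_flags) / n  # p(x): proportion of blogs with pictures
    h = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:  # skip zero terms, since lim q->0 of q*log q = 0
            h -= q * math.log2(q)
    return h
```

A user whose blogs all carry images scores 0, while a user with a 50/50 mix scores the maximum of 1 bit, matching the intuition that bot users show low diversity.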
The output of the pre-trained BERT model is i_max = argmax(Wx + b). Here, the output layer of the pre-trained model is fully connected: after the linear operation Wx + b, the predicted category i_max indicates that the text is classified into the i-th category, where W is the weight parameter and x is the hidden state of the model's last layer before the output. The feature output is P_i = (1/n) Σ_j ∏[Map(i_max^(j)) = c_i], where Map is a mapping function from the index to the predefined text categories and ∏ denotes the indicator function that records, for each sentence j, whether it falls into category c_i. Summing the indicator over all sentences and dividing by the total number of sentences n yields the percentage distribution P.
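The argmax-and-count step above can be sketched without the BERT encoder itself; the function below takes per-sentence logits (standing in for Wx + b) and returns the domain percentage distribution. Names are illustrative.

```python
def domain_distribution(logits_per_sentence, categories):
    """Map each sentence's logits to a category via argmax (i_max),
    count category hits with an indicator, and return the percentage
    distribution P over all categories."""
    counts = {c: 0 for c in categories}
    for logits in logits_per_sentence:
        # i_max = argmax(Wx + b) over the category scores
        i_max = max(range(len(logits)), key=lambda i: logits[i])
        counts[categories[i_max]] += 1
    n = len(logits_per_sentence)
    return {c: counts[c] / n for c in categories}
```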
In visual tasks, we use HiFi-NET to detect whether an image is CNN-generated or produced by stitching and synthesis, and output the probability. Given an image X, the logits of branch θb and the predicted probability are denoted θb(X) and p(yb|X), respectively, with predicted probability p(yb|X) = softmax(θb(X)). In the multimodal task, we freeze the parameters of Pegasus and cn_clip. Due to the text input length limitation of the CLIP pre-trained model, we use the Chinese version of the Pegasus [55] model to extract text summaries of the blogs. This not only effectively overcomes the text length limitation of CLIP but also helps standardize the textual information of the blogs. The summaries, together with the images, are jointly used as inputs to the cn_clip model, whose output is s_i = (f_text @ f_img) / (|f_text| |f_img|), the cosine similarity between the text embedding vector f_text and the image embedding vector f_img, where @ denotes matrix multiplication.
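The text-image consistency score reduces to a cosine similarity between the two embedding vectors; a minimal sketch (illustrative names, plain lists in place of cn_clip's output tensors):

```python
import math

def cosine_similarity(f_text, f_img):
    """Cosine similarity between a text embedding and an image embedding:
    s = (f_text . f_img) / (||f_text|| * ||f_img||)."""
    dot = sum(a * b for a, b in zip(f_text, f_img))
    norm = (math.sqrt(sum(a * a for a in f_text))
            * math.sqrt(sum(b * b for b in f_img)))
    return dot / norm
```

Matching embeddings score near 1, unrelated ones near 0, which is what the threshold-based consistency check relies on.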

Sequence Interval Detect Net
Liu et al. extracted the time interval sequences of blogs to obtain a modified conditional entropy for each user. The study found that the modified conditional information entropy of bot users is significantly lower than that of regular users, indicating strong regularity in their tweeting behavior. It is speculated that the timing of bot users' blogs is controlled by automated scripts, which may manifest certain patterns in the intervals between blogs. In Ying Q. et al.'s study of posting behavior on online social networks (OSNs), it was observed at a macro level that social media platforms exhibit two distinct traffic peaks: the first during morning working hours and the second in the evening, with the evening peak more pronounced for users who post frequently [56]. Given the clustering of user posting times, it is challenging for a multilayer perceptron to capture such complex behavioral patterns, and datasets for modeling time interval sequences are limited. Therefore, the study adopted the seq2seq architecture from recurrent neural networks; this structure can better incorporate information from multiple time steps and is suitable for capturing the temporal patterns of users' posting behavior on social media platforms.
Specifically, for each user we extract Pt and calculate the time interval it_i by subtracting the previous Pt_{i-1} from the current Pt_i, combining all values to form a time interval sequence IT = (it_1, it_2, ..., it_n). We then delete the first element to obtain a new sequence IT' = (it_2, it_3, ..., it_n), which serves as the expected output data for the Decoder.
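The interval extraction and one-step shift can be sketched as follows; the function name is illustrative, and timestamps are assumed to be sorted numeric values (e.g. minutes).

```python
def build_sidn_pairs(post_times):
    """From a user's sorted posting timestamps Pt, build the encoder
    input IT = (it_1, ..., it_n) and the decoder target IT' = (it_2, ..., it_n),
    i.e. the same intervals shifted one step ahead."""
    intervals = [t1 - t0 for t0, t1 in zip(post_times, post_times[1:])]
    encoder_input = intervals
    decoder_target = intervals[1:]  # expected output: the *next* interval
    return encoder_input, decoder_target
```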
We improved the seq2seq network architecture, as shown in Figure 2, to model the time interval sequence. SIDN is divided into an Encoder and a Decoder, with their embedding layers removed. The encoder uses 4 layers of GRU, and the decoder likewise uses 4 layers of GRU. The hidden-state update at each time step of a single-layer GRU is z_t = σ(W_z · [h_{t-1}, x_t]), r_t = σ(W_r · [h_{t-1}, x_t]), h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ tanh(W_h · [r_t ⊙ h_{t-1}, x_t]). In the equation, h_t represents the current hidden state, (1 - z_t) ⊙ h_{t-1} denotes the old hidden state being partially forgotten, and z_t ⊙ tanh(W_h · [r_t ⊙ h_{t-1}, x_t]) represents the new candidate hidden state being partially added. z_t is the output of the GRU's update gate and r_t the output of its reset gate; both are computed by concatenating the current input x_t with the previous hidden state h_{t-1}, with W_z, W_r, and W_h as the corresponding weight matrices.
At each time step of the Decoder's computation, the expected output is the next time interval, produced through a linear layer. The training process is illustrated in Figure 3. Since the training data is continuous, the mean squared error loss function is utilized, defined as L(it, ît) = (1/n) Σ_i (it_i - ît_i)², where ît represents the elements of the model's predicted output. To detect anomalies in the time series, the data is fed into the model, and L(it, ît) is extracted as the reconstruction error, which is used to quantify the anomalies exhibited by user posting behavior over time.
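A minimal scalar sketch of the GRU update and the MSE reconstruction error, assuming single-unit weights with biases omitted; all names are illustrative, not the paper's implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h_prev, x_t, Wz, Wr, Wh):
    """One scalar GRU update following
    h_t = (1 - z_t) * h_{t-1} + z_t * tanh(Wh . [r_t * h_{t-1}, x_t]).
    Each weight pair W* = (w_hidden, w_input) acts on the concatenated
    [hidden, input] pair; biases are omitted for brevity."""
    z_t = sigmoid(Wz[0] * h_prev + Wz[1] * x_t)               # update gate
    r_t = sigmoid(Wr[0] * h_prev + Wr[1] * x_t)               # reset gate
    h_cand = math.tanh(Wh[0] * (r_t * h_prev) + Wh[1] * x_t)  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_cand

def reconstruction_error(pred, target):
    """Mean squared error L(it, it_hat), used as the anomaly score
    for a user's posting-interval sequence."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
```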


Classifier Module of MAPM
The behavior features extracted by the deep learning model are concatenated with explicit features to obtain the window user feature space X = {U, H, C, P, S, Lt}. Consider a user θ, whose explicit features are represented by U; the richness of user-generated content is denoted by H; and C = {c_0, c_1, ..., c_10} represents the text classification results, consisting of eleven categories indicating the composition of text domains. Additionally, P represents the probability of image forgery, S the similarity between text and images, and Lt the reconstruction error of the time series.
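The assembly of the window user feature space can be sketched as a plain concatenation; the exact ordering of the components is our assumption, and the function name is illustrative.

```python
def window_user_features(U, H, C, P, S, Lt):
    """Concatenate explicit features U with implicit behaviour features
    into the window user feature space X = {U, H, C, P, S, Lt}.
    U: 7 explicit profile metrics (membership level, blogs, follows,
       fans, likes, retweets, comments);
    C: 11 text-domain proportions;
    H, P, S, Lt: richness, forgery probability, text-image similarity,
       and temporal reconstruction error (user-level scalars)."""
    return list(U) + [H] + list(C) + [P, S, Lt]
```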
Spectral clustering, as an unsupervised learning method based on a data graph, can fully exploit the input feature information when mining the inherent structure of data and capture latent cluster structure in high-dimensional space. Spectral clustering is unaffected by sample distribution, as its optimization objective focuses solely on measuring similarity between samples; in contrast, imbalanced datasets often exhibit a more concentrated distribution pattern. The similarity between two user samples x_i and x_j is calculated as w_ij = exp(-||x_i - x_j||² / (2σ²)), where ||x_i - x_j|| is the Euclidean distance between the samples, ||x_i - x_j|| = sqrt(Σ_k (x_ik - x_jk)²), where d denotes the feature dimension of the samples, x_ik and x_jk represent the values of samples x_i and x_j on the k-th feature, and σ is the parameter of the Gaussian kernel function, which controls the decay rate of similarity. Subsequently, a Laplacian matrix is constructed from the similarities, eigenvalue decomposition is performed, and the data is clustered based on the eigenvectors.
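The affinity and Laplacian construction described above can be sketched as follows (function names are illustrative; the final k-means step on the eigenvectors is omitted):

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """Pairwise similarity w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    over the rows of X (one user feature vector per row)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def normalized_laplacian(W):
    """Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2};
    the eigenvectors of its smallest eigenvalues embed the users
    before the final clustering step."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    return np.eye(len(W)) - d_inv_sqrt @ W @ d_inv_sqrt
```

In practice a library implementation such as scikit-learn's SpectralClustering would typically be used; the sketch only shows how σ shapes the affinity graph.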

Experiments
This section will introduce the datasets used in the work, analyze the performance of different modules, and employ ablation experiments to validate their effectiveness.

Data Sets

BERT Text Categorization Dataset
When fine-tuning BERT for text classification, the dataset used was THUCNews [57] from Tsinghua University, consisting of 740,000 news documents. Based on the original Sina Weibo news classification system, it was reorganized into 10 candidate categories: politics, finance, real estate, stocks, education, science, society, sports, games, and entertainment. Building upon this, we developed a topic-specific web crawler dedicated to fetching blog content. To add the "military" category, additional text data was collected, including over 6700 training samples, more than 1200 testing samples, and over 1600 validation samples.

Weibo User Dataset
We developed a web crawling tool in-house specifically for collecting data from Weibo user profiles. With this automated tool, we obtained a large amount of relevant information about Weibo users, including their posted content, social context information, and social media metrics. After applying data cleaning methods such as deduplication and removal of low-quality data, we constructed a Weibo user dataset consisting of 2076 user records and 198,705 blogs. Figure 4 is a feature heatmap of the different user categories in the dataset, used to reveal patterns of feature values across categories; darker colors indicate higher values. The features for the three user categories are averaged, merged, and normalized. For legibility of the legend, the textual composition features extracted by BERT, which cover proportions across 11 domains, are omitted from the heatmap. In the figure, red font denotes implicit user features, and "Avg" represents mean values. "Consistency", "Forgery", and "Time Feature" respectively denote the text-image consistency, AI image probability, and temporal reconstruction error extracted by the cn_clip, HiFi-NET, and SIDN feature extraction modules. The heatmap shows that repost users exhibit relatively low posting frequency, text-image consistency, and image forgery probability, in line with our earlier findings: this category of users tends to share blogs with rich multimedia content, albeit often with a mismatch between text and image descriptions. Lottery users demonstrate higher values in time-related features, so the features extracted by the SIDN network offer robust support in identifying them. These users exhibit relatively low metrics on social media platforms, consistent with their behavioral objective of not publishing valuable content but participating solely in reposting and lottery activities. The color diversity in the blocks of normal users indicates that their behavior lies, to some extent, between the low activity of lottery users and the high activity of repost users.

Analyzing Module Performance
After fine-tuning on the expanded THUCNews dataset, we obtained experimental results for the BERT model. The results clearly demonstrate that the pre-trained BERT model exhibits high accuracy in Chinese text classification across eleven domains. To validate the classification performance of BERT, we selected multiple baselines for comparison, including composite-CNN [58], classical CNNs, traditional machine learning methods, and THUCTC. Specifically, the classical CNNs include a single-layer convolutional neural network (CNN-1) and a multi-layer convolutional neural network (CNN-3), while the traditional machine learning methods include naive Bayes (NB), k-nearest neighbors (KNN), and support vector machine (SVM). The classification results of the different models are shown in Table 1, which indicates that BERT performs second only to the domain-specific model composite-CNN in Chinese text classification tasks.
When extracting time intervals from the dataset of Weibo users, we observed a significantly right-skewed distribution. This is attributed to the irregular activity patterns of Weibo platform users, which give the collected time interval data extremely long-tailed characteristics. Figure 5 illustrates the box plot of the time interval sequence, with the time series measured in minutes. By capping the maximum value at 2880 and applying two logarithmic transformations to all data, we effectively alleviated the right skewness of the data. This processing strategy helps avoid excessive weights during training, thereby maintaining the stability of the model. The blue points in the left plot represent outlier samples; the right plot shows the box plot of the time interval sequence after processing.
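A minimal sketch of this preprocessing step is given below. The cap of 2880 min comes from the text; the use of two successive `log1p` transforms is our assumption about how "two logarithmic transformations" are applied.

```python
import numpy as np

def preprocess_intervals(intervals_minutes, cap=2880):
    """Cap posting-time intervals and apply two log transforms to reduce
    right skew. The double log1p is an assumed reading of the paper's
    'two logarithmic transformations'."""
    x = np.minimum(np.asarray(intervals_minutes, dtype=float), cap)
    x = np.log1p(x)  # first log transform compresses the long tail
    x = np.log1p(x)  # second log transform flattens it further
    return x

# Example: a heavily right-skewed interval sequence (in minutes)
raw = [1, 5, 30, 600, 10000]
transformed = preprocess_intervals(raw)
```

Note that any interval above the cap (e.g., 10,000 min) maps to the same value as the cap itself, which is what keeps extreme outliers from dominating training weights.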
We utilized kernel density estimation plots to reveal patterns in posting times among different user categories. Due to the highly skewed nature of the time interval data, to prevent the curves from being overly flattened, we capped the maximum value at 2880, as shown in Figure 6.
Reposting users require continuous tweeting to maintain a high level of engagement, with frequent tweeting within short time intervals being particularly prominent. In contrast, lottery users exhibit a posting frequency closer to that of inactive users, with a relatively uniform posting frequency observed within the 500 to 2500 min interval range.
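The per-category density estimates can be sketched as follows. The samples here are synthetic stand-ins (an exponential draw for repost users and a uniform draw for lottery users, mimicking the shapes described above), not the real Weibo data.

```python
import numpy as np
from scipy.stats import gaussian_kde

def interval_kde(intervals_minutes, cap=2880, grid_points=200):
    """Estimate the posting-interval density for one user category.
    Intervals are capped at `cap` minutes before estimation, as in the text."""
    x = np.minimum(np.asarray(intervals_minutes, dtype=float), cap)
    kde = gaussian_kde(x)
    grid = np.linspace(0, cap, grid_points)
    return grid, kde(grid)

# Hypothetical samples: repost users cluster at short intervals,
# lottery users are spread roughly uniformly over 500-2500 minutes.
rng = np.random.default_rng(0)
repost_samples = rng.exponential(scale=30, size=500)
lottery_samples = rng.uniform(500, 2500, size=500)

grid, d_repost = interval_kde(repost_samples)
_, d_lottery = interval_kde(lottery_samples)
```

Plotting `d_repost` and `d_lottery` over `grid` would reproduce the qualitative shapes in Figure 6: a sharp peak near zero for repost users and a broad plateau for lottery users.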

The Influence of Implicit Features on Sample Distribution
Before applying clustering algorithms, it is crucial to observe the distribution of samples in space. Figure 7 illustrates the impact of removing individual features on the distribution of samples in space, alongside the corresponding Davies-Bouldin Index (DBI) measurements. The DBI is a metric used to evaluate clustering quality that considers both cluster cohesion and separation; a DBI closer to 0 indicates better clustering results. The clustering metrics and T-SNE distributions indicate a noticeable deterioration in clustering effectiveness upon the removal of all latent features, with no distinct user delineation boundaries discernible. The incorporation of temporal analysis helps clustering algorithms better discriminate between repost users and normal users. Analysis of the text composition of blog content and the diversity of posting reveals clear classification boundaries between normal users and repost users, suggesting discernible distinctions in posting behavior between repost users controlled by automated programs and normal users. Lottery users exhibit relatively uniform patterns in both posting format and content, such as the consistent absence of image content in blogs and the lack of opinions or viewpoints in blog content.
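A minimal sketch of this feature-ablation measurement is shown below. The feature matrix is synthetic (the real features come from the BERT, cn_clip, HIFI-NET, and SIDN modules), and KMeans stands in for the clustering step purely to keep the example small.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic stand-in for the user feature matrix: 300 users, 6 features,
# 3 underlying categories (normal / repost / lottery).
X, _ = make_blobs(n_samples=300, centers=3, n_features=6,
                  cluster_std=1.0, random_state=0)

def dbi_for(features):
    """Cluster the features and score the result with the DBI (lower is better)."""
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
    return davies_bouldin_score(features, labels)

full_dbi = dbi_for(X)             # all features available
ablated_dbi = dbi_for(X[:, :-1])  # drop one feature, mimicking the ablation
```

Comparing `full_dbi` against the scores of the ablated variants is the quantitative counterpart to the visual T-SNE comparison in Figure 7.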


Performance Evaluation Metrics on Weibo User Dataset
The experiment used Precision, Recall, and F1 Score to validate the performance of MAPM on the Weibo user dataset. Precision represents the model's ability to correctly predict positives, Recall measures the model's ability to identify all positives, and the F1 Score combines Precision and Recall. Table 2 shows the Precision, Recall, and F1 Score for the three categories. When manually annotating user categories, we often encountered cases that were difficult to judge: a minority of repost users and lottery users exhibit signs of artificial manipulation under certain circumstances. We categorized such cases uniformly as automated users. Because manual annotation introduces subjective judgment, it may lead to a small amount of label bias. Table 3 highlights the scores of the best-performing models in each category. The clustering indicators show the most significant decrease after disabling SIDN, indicating the crucial role of the time-series features mined by the SIDN network in detecting fake users. The overall clustering scores of the variant without behavioral features are lower than those of the full MAPM model, demonstrating that introducing the multi-model joint representation method effectively improves the clustering performance of the classifier.
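The per-category metrics can be computed directly with scikit-learn. The labels below are hypothetical toy annotations, not results from the paper.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical ground-truth and predicted labels for the three categories.
y_true = ["normal", "repost", "lottery", "normal", "repost", "lottery"]
y_pred = ["normal", "repost", "repost", "normal", "repost", "lottery"]

classes = ["normal", "repost", "lottery"]
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=classes, zero_division=0)

for cls, pi, ri, fi in zip(classes, p, r, f1):
    print(f"{cls}: P={pi:.2f} R={ri:.2f} F1={fi:.2f}")
```

Here one lottery user is misclassified as a repost user, so repost Precision drops below 1.0 while lottery Recall drops to 0.5, illustrating how the two metrics penalize different error types.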

Exploration of Implicit Features and Hyperparameter Selection Effects
To further validate the contribution of implicit user features to user classification decisions and the optimality of the spectral clustering parameter selection, we present visual results of the experiments. Figure 8 displays the effect of disabling individual deep learning modules on the weighted average precision and recall of samples; the vertical axis represents the variation in precision. In spectral clustering, the affinity parameter determines the similarity or connectivity between samples. The choice of this parameter directly impacts the final effectiveness of the clustering algorithm, and its proper selection can significantly enhance the algorithm's performance and efficiency. Figure 9 illustrates the impact of selecting different affinity parameters on the model's classification performance, measured using macro-averaged Precision, Recall, and F1 score, while Figure 10 demonstrates the impact of different parameters on the model's accuracy across the different categories. We chose "nearest neighbors" as the strategy for quantifying similarity between samples, as it yielded the best clustering performance for the classifier. From the experimental results, we observe that, compared to disabling the behavior analysis module, the full MAPM performs better in terms of prediction accuracy and recall. For example, on the user classification dataset, the performance of MAPM is nearly 27.0% higher than that of the classification model using only explicit features. Therefore, compared to traditional fake user detection schemes, the multi-model joint representation strategy is more suitable for fine-grained detection of fake users.
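The affinity comparison above can be sketched as follows. The two-moons data is a synthetic stand-in for the user feature space; it illustrates why a k-NN affinity graph can outperform a dense RBF kernel on non-convex cluster shapes, though the real feature distribution in the paper may differ.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Non-convex synthetic data with known labels for scoring.
X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

scores = {}
for affinity in ("rbf", "nearest_neighbors"):
    # 'nearest_neighbors' builds a k-NN connectivity graph between samples;
    # 'rbf' uses a dense Gaussian kernel over all pairwise distances.
    labels = SpectralClustering(n_clusters=2, affinity=affinity,
                                assign_labels="kmeans",
                                random_state=0).fit_predict(X)
    scores[affinity] = adjusted_rand_score(y, labels)
```

On this data the nearest-neighbors affinity recovers the two interleaved clusters almost perfectly, mirroring the paper's choice of "nearest neighbors" as the similarity strategy.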

Conclusions and Future Work
This paper proposes a scheme for detecting fake users on social media platforms based on multi-model joint representation and introduces a multi-modal aggregation portrait model (MAPM) for detecting fake users. It combines multiple models that perform best in their respective fields to characterize users' blogging behavior on social media platforms. Specifically, we introduce the SIDN network module to capture the temporal patterns of fake user postings. Experimental results on the Weibo user dataset demonstrate the significant role of the SIDN architecture in extracting user behavior features for fake user identification, and the MAPM model performs well in the fake user detection task. This study also has some limitations. Since the training corpora for BERT and cn_clip contain only Chinese text, MAPM cannot be applied to non-Chinese data samples. The use of multiple deep learning models for feature extraction slows model inference. Furthermore, constrained by insufficient data samples, MAPM is still unable to detect a broader range of user types.
Discriminant analysis technology that integrates the results of multiple large models provides an excellent opportunity for upgrading artificial intelligence technology for online content analysis and user mining on social platforms. In this sense, vertical-domain analysis technology that aggregates large models represents the vertical development of machine learning and can be applied by more people to a wider range of scenarios. We hope that this framework design provides a new idea for multi-modal social content analysis and may open up new research space for more general false information detection and mining tasks.

3.1. The Window User Feature Space
Let the user θ = [U, B_n] ∈ D, where U denotes the user characteristics, B_n denotes the collection of blog content published by the user, n denotes the number of blogs contained in B_n, and D denotes the dataset.

Information 2024, 15, x FOR PEER REVIEW

Figure 4. Heatmap of sample features for different classes.


Figure 5. Box plots of time interval data.


Figure 6. Kernel density plot of time intervals for different user categories.

Figure 7. T-SNE data distribution plot when excluding certain features as clustering features.


Figure 8. The visualization results of the ablation experiment's accuracy and recall.


Figure 9. The impact of different parameters on model performance.




Figure 10. The accuracy across different categories.

Table 1. Data set statistics.

Table 2. Precision, Recall, and F1 Scores for Three Categories.

To further demonstrate the superiority of multi-model joint representation in extracting user behavior features, this paper designed an ablation experiment. By removing certain modules and comparing the results against common internal measurement metrics used in clustering algorithms, we obtained the ablation results shown in Table 3.

Table 3. Internal metrics for ablation experiments.