Detection of LLM-Generated Text vs. Human Text via DeepSeek-R1 Multi-Feature Fusion

Bu, Xuan; Tang, Minghu; Wang, Junjie; Zhang, Jiayi; Luo, Peng

doi:10.3390/info17040320


by Xuan Bu ^1,2,3, Minghu Tang ^1,2,3,*, Junjie Wang ^1,2,3, Jiayi Zhang ^1,2,3 and Peng Luo ^4

^1 School of Intelligent Science and Engineering, Qinghai Minzu University, Xining 810007, China
^2 School of Cyberspace Security, Qinghai Minzu University, Xining 810007, China
^3 Joint Laboratory of Cyberspace Security, Qinghai Minzu University, Xining 810007, China
^4 School of Computer Science and Technology, Qinghai Normal University, Xining 810007, China
^* Author to whom correspondence should be addressed.

Information 2026, 17(4), 320; https://doi.org/10.3390/info17040320

Submission received: 7 February 2026 / Revised: 19 March 2026 / Accepted: 20 March 2026 / Published: 25 March 2026

(This article belongs to the Topic Advanced Development and Applications of AI-Generated Content (AIGC))


Abstract

The development of generative artificial intelligence has brought convenience to many industries, but it has also made the origin of a text harder to establish. Content generated by large language models is now extremely similar to human writing, which makes it difficult in many fields (for example, the screening of graduation theses in schools) to quickly identify whether a text was written by a human or generated by a large language model. Building on the DeepSeek-R1 language model, this paper combines natural language features with a judgment mechanism to detect text generated by large language models. Experimental results show that the method improves accuracy over conventional methods on the Reuters, WP and HC3 datasets.

Keywords:

large language models; DeepSeek-R1; generative artificial intelligence; detection of text generated by large language models; language features

1. Introduction

With the iteration and updating of large language models, their performance in the field of natural language processing has become increasingly outstanding. From simple text classification and sentiment analysis to complex text generation and intelligent question answering, large language models have demonstrated strong capabilities. With the continuous improvement of computing power, the number of parameters in large language models has also grown exponentially, and models with different architectures and application scenarios, such as DeepSeek-R1, GPT-4, BERT, and Llama, continue to emerge. These models have learned rich language knowledge and semantic expression through pre-training on massive text data, and can generate text that is extremely similar to human writing styles, which has been widely used in intelligent customer service, content creation assistance, intelligent writing and other fields. However, the wide application of large language models has also brought a series of serious problems. In educational scenarios, students use large language models to complete homework and write papers, which seriously undermines academic integrity and hinders the construction of students’ own knowledge systems and the cultivation of their thinking abilities. Studies have shown that by continuously tracing the sources of text generated by large language models, cheating behaviors in students’ homework can be effectively detected [1]. In the field of academic research, some researchers use large language models to generate paper content, which interferes with the normal academic evaluation system and affects the authenticity and reliability of academic research [2]. At the same time, search engines and other platforms are facing the challenge of identifying large-scale AIGC and need to combine knowledge integration and feature pyramid networks to build efficient detection systems [3].

Currently, text detection methods for large language models (LLMs) have significant limitations: traditional approaches mostly rely on single features such as n-gram frequency or lexical diversity, which fail to capture the complex patterns of LLM-generated text and yield large errors [4]; deep learning detectors, though accurate, are opaque, resource-hungry and hard to generalize to new models. Recent work further exposes the theoretical incompleteness of probabilistic methods: Fast-DetectGPT shows that negative log-likelihood alone cannot distinguish distributional shapes and is insensitive to over-confident low-entropy regions, while rewriting experiments reveal that human-vs-AI probability gaps vary markedly across syntactic positions, a sensitivity that current models ignore [5]. These gaps, together with the challenges of adversarial attacks and cross-lingual transfer faced by both black-box statistical classifiers and white-box watermarking schemes [6], make accurate detection a central yet unsolved problem in LLM research, and point to future directions such as multimodal fusion and dynamic updating mechanisms [7].

To address the above issues, this study extracts three types of features based on the DeepSeek-R1 pre-trained language model: logarithmic features of text generation probability, sequence joint probability features, and vocabulary probability ranking features. Key information is obtained from multiple aspects, such as text generation probability distribution and overall generation patterns, to achieve the detection of generated text. This research method enriches the theoretical system of text detection generated by large language models. By integrating multiple feature extraction strategies and judgment mechanisms, it provides new ideas and methods for subsequent related research, and helps promote the in-depth development of the text detection direction in the field of natural language processing. The research results have important practical significance for regulating the application of large language models and ensuring the authenticity of information in various industries, and are expected to provide reliable technical support for content authenticity verification in multiple fields.

The main contributions of this paper are as follows:

(1): Innovatively integrating logarithmic features of text generation probability, sequence joint probability features, and vocabulary probability ranking features to construct a multi-dimensional feature extraction framework, comprehensively capturing text differences and improving detection accuracy.
(2): Designing a feature extractor based on DeepSeek-R1 model fine-tuning, which has strong interpretability, high computational efficiency, and is suitable for practical applications.
(3): Powered by the compact DeepSeek-R1 backbone, the framework delivers rapid, low-resource detection while preserving detection accuracy.

2. Related Work

Research on detecting text content generated by large language models can currently be divided into methods based on statistical features, methods based on deep learning, methods based on pre-trained language models, and hybrid methods.

2.1. Methods Based on Statistical Features

In the early stage, researchers mainly distinguished between human-generated text and text generated by large language models through statistical features of text. Such methods focus on features such as word frequency, syntactic structure, and vocabulary diversity. In a published report, OpenAI [8] proposed detecting GPT-2-generated text with a simple classifier based on TF-IDF features. TF-IDF is a statistical method that measures how often a word occurs in a single document (term frequency) and how rare the word is across the entire document collection (inverse document frequency). The product of the two is used to evaluate the importance of a word to a document. On this basis, a simple classifier over TF-IDF unigram and bigram features is trained to detect model output text. Another example is GLTR technology [9], which conducts detection by quantifying indicators such as vocabulary density and repetition rate. This method can achieve good results on specific datasets, but its generalization ability is insufficient when facing diverse generation models, making it difficult to capture the complex patterns of text generated by large language models. Relying solely on a single statistical feature makes it difficult to cope with differences in the generation characteristics of different models. Ippolito [10] found that decoding strategies optimized to deceive human readers (such as top-k sampling) introduce statistical anomalies, which make it easier for automatic detectors based on probability features (such as GLTR) to identify generated text.
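
To make the TF-IDF weighting concrete, the following minimal sketch computes unigram TF-IDF scores for a toy corpus. The whitespace tokenizer, the unsmoothed IDF variant log(N/df), and the function name `tf_idf` are illustrative assumptions, not the configuration used in [8].

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf-idf} dict per document (toy unigram version)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # Term frequency times (unsmoothed) inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus, invented for illustration only.
corpus = [
    "the model writes text",
    "the student writes an essay",
    "the essay is long",
]
weights = tf_idf(corpus)
```

A word that occurs in every document (here "the") receives a weight of zero, while a word confined to one document (here "model") keeps a positive weight; these per-term weights are what a downstream classifier would consume.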

2.2. Methods Based on Deep Learning

With the development of deep learning technology, Zellers [11] took the lead in using neural network models to detect text generated by large language models. Early RNNs and their variants, LSTMs and GRUs, can capture contextual information, but they struggle with long-range dependencies when processing long texts, which limits detection performance; CNNs perform well in short text classification but have shortcomings in long text modeling. In recent years, Transformer-based models (such as BERT and GPT) have improved detection performance with the help of self-attention mechanisms, but deep learning methods generally suffer from complex model structures, high computational resource requirements, and poor interpretability, which limit their wide application in practical scenarios. Mo et al. [12] proposed a Transformer-based detection method that integrates LSTM, Transformer, and CNN layers to detect text generated by large language models through multi-dimensional feature extraction. For detecting ChatGPT-generated text (GPT-3.5-turbo), Alshareef et al. [13] proposed GOA-DLDC, which uses a convolutional gated recurrent unit classifier with the gannet optimization algorithm for parameter tuning and improves text source recognition through a deep learning architecture, achieving high detection accuracy on datasets composed of human-generated text and ChatGPT-generated text. In the Chinese scenario, Xiong Pan et al. [14] constructed an academic text dataset including abstracts, introductions, and conclusions, and proposed the DeBERTa-BiGRU model, which achieves a detection accuracy of 91% and provides an effective solution for detecting text generated by Chinese large language models.

2.3. Methods Based on Pre-Trained Language Models

The excellent performance of pre-trained language models in text processing tasks has prompted their application in detecting text generated by large language models. For example, DetectGPT [15] judges the source of text generation by calculating the negative log-likelihood score, which has high accuracy on multiple datasets. The core of such methods is to use the language knowledge learned by pre-trained models to calculate text generation probability, but they are highly dependent on the performance of the underlying pre-trained models and adapt poorly to new generation models. When facing language patterns not covered by their training data, their detection performance tends to fluctuate. Chakraborty et al. [16] focused on maintaining text authenticity and proposed a BERT-based method for detecting AI-generated content. Using BERT's contextual embeddings, it mines the patterns that large language models leave in text, thereby accurately distinguishing human-generated text from large language model-generated text. Experimental results verify the effectiveness of this method in identifying text generated by large language models, providing a new technical path for protecting the integrity of human-created content in the digital age. Mao et al. [17] proposed the RAIDAR method, which prompts large language models to rewrite text and calculates the edit distance. It exploits the fact that text generated by large language models requires fewer modifications when rewritten, which significantly improves the F1 score in multiple fields and suits black-box model scenarios. Bao et al. [5] proposed Fast-DetectGPT, which speeds up DetectGPT by a factor of 340 through conditional probability curvature and sampling strategies and achieves an AUROC of 0.9338 on GPT-4-generated text, significantly better than DetectGPT. Sun et al. [18] proposed a zero-shot detection method based on text reordering, which analyzes changes in semantic coherence to achieve detection.

2.4. Hybrid Methods Based on Pre-Trained Language Models

In addition to the above common methods, some researchers have proposed hybrid methods that integrate the advantages of statistical features, deep learning models, and pre-trained language models. Liu [19] combined text detection with related tasks such as text classification and sentiment analysis to construct a multi-task learning framework. This framework exploits the correlation between different tasks to share knowledge, thereby improving detection performance. Other studies focus on the detection of text generated by large language models in specific fields. Xiang Hui et al. [20] proposed the EBF Detection method, which integrates fine-tuned pre-trained language models and high-level natural language statistical features (log-probability, log-rank, etc.). By integrating six weak detectors and adopting a majority voting decision mechanism, it detects text generated by large language models and addresses the poor generalization ability or insufficient detection performance of single detection methods, without affecting the natural-sounding tone of the text. However, hybrid methods usually suffer from high model complexity and difficult training, and the fusion of different technologies may lead to performance bottlenecks, making it difficult to achieve a linear improvement in detection performance. At the same time, hybrid methods for specific fields lack universality. Macko et al. [21] constructed the MULTITuDE dataset covering 11 languages, revealing the performance bottleneck of existing detectors in cross-lingual scenarios and emphasizing the importance of multilingual training for improving generalization ability. Krishna [22] pointed out that paraphrase attacks can significantly reduce the performance of detection tools, but that retrieval-based defense strategies can effectively identify rewritten text generated by large language models by comparing it against a database of generated content. Bhattacharjee et al. [23] proposed the ConDA framework, which learns domain-invariant features through contrastive domain adaptation; with only a small amount of labeled data, its detection performance on text generated by large language models is close to that of fully supervised models.

However, existing probability-based detectors suffer from a deeper theoretical flaw. Fast-DetectGPT argues that relying solely on first-order statistics (negative log-likelihood) fails to capture the curvature of the softmax surface, yet this curvature is precisely what separates low-entropy machine text from high-entropy human text. A text-rewriting study further shows that probability differences are strongly position-dependent, but DetectGPT and similar methods weight every token equally, drowning out syntactic-position cues. Moreover, GLTR only uses low-order percentiles to describe the rank distribution, whereas neural fake news exhibits heavy-tail asymmetry that traditional percentiles cannot encode. Collectively, these findings imply that entropy shape, position sensitivity, and higher-order moments must be explicitly injected into the probability framework to achieve a theoretically complete detector.

3. Detection of Text Generated by Large Language Models Based on DeepSeek-R1 Multi-Feature Fusion

To quickly and automatically distinguish between human-generated and AI-generated text, this paper proposes a detection method (hereinafter referred to as the Dk method) based on a DeepSeek-R1 pre-trained language model and logistic regression, combining multi-dimensional features such as text generation probability. The core of this method includes three main steps: feature extraction, model training, and classification.

3.1. Feature Extraction

Feature extraction is a key step in this method, aiming to extract features from the target text that can effectively distinguish between human-generated and large language model-generated text. Based on the DeepSeek-R1 pre-trained language model, this method mainly extracts three types of features: TextLogScore, SeqProbScore, and TokenRankScore. The design of these features is based on an in-depth analysis of text generated by large language models and can capture subtle differences between human-generated text and text generated by large language models.

3.1.1. Extracting TextLogScore Feature

The log-likelihood is a fundamental indicator of how well a model fits observed data. In the context of large language models, the generation probability of a text sequence X = {x_{1}, \dots, x_{n}} is calculated from the model's conditional probability distributions. The probability of the entire sequence is expressed as

P (X) = \prod_{i = 1}^{n} P (x_{i} | x_{1}, \dots, x_{i - 1})

where P (x_{i} | x_{1}, \dots, x_{i - 1}) is the conditional probability that the model generates the next token x_{i} given the context {x_{1}, x_{2}, \dots, x_{i - 1}}.

To normalize for sequence length and improve numerical stability, Negative Log-Likelihood (NLL) is adopted as the baseline metric.

L_{N L L} (X) = - \frac{1}{n} \sum_{i = 1}^{n} \log P (x_{i} | x_{1}, \dots, x_{i - 1})
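
As a small worked illustration of the two formulas above, the sketch below evaluates the chain-rule log-probability and the length-normalized NLL for a four-token text. The per-token probabilities are invented for the example, and the function names are hypothetical.

```python
import math


def sequence_log_prob(token_probs):
    """log P(X) = sum_i log P(x_i | x_<i), i.e. the chain rule in log space."""
    return sum(math.log(p) for p in token_probs)


def mean_nll(token_probs):
    """Length-normalised negative log-likelihood L_NLL(X)."""
    return -sequence_log_prob(token_probs) / len(token_probs)


# Hypothetical conditional probabilities P(x_i | x_<i) for a 4-token text.
probs = [0.5, 0.25, 0.5, 0.125]

log_p = sequence_log_prob(probs)   # equals log(1/128)
nll = mean_nll(probs)              # equals 7 * log(2) / 4
```

Summing logarithms rather than multiplying raw probabilities avoids numerical underflow on long sequences, and dividing by n makes scores comparable across texts of different lengths.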

Information-theoretic entropy measures distributional uncertainty and forms a complementary pair with Log-Likelihood [24]: the former quantifies the spread of probability mass, whereas the latter evaluates the overall sequence probability. To encode the characteristic over-confidence of AI models, an entropy-penalty term λ·H is introduced. For a given −log P value, this term assigns a higher detection score to low-entropy text, which is a typical trait of AI-generated content [17]. Concretely, during the token-by-token accumulation of the NLL, the local predictive entropy H_{i} of the current softmax distribution is simultaneously computed. The integrated TextLogScore is subsequently defined as

TextLogScore = L_{N L L} (X) + λ (n) \cdot \bar{H}

where \bar{H} represents the mean entropy of the sequence.

Crucially, to ensure the detector remains robust across varying text lengths, the coefficient λ (n) is adaptively determined by the sequence length n:

λ (n) = \frac{α}{\log_{10} (n + β)}

where α and β are scaling and smoothing hyperparameters, respectively. This design ensures that as the sequence length increases, the entropy penalty is appropriately scaled to maintain a stable separation margin between human and AI-generated text. The specific values of these hyperparameters are empirically determined during the training phase to optimize classification performance.

The algorithm flow of TextLogScore feature extraction is shown in Figure 1.
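
The TextLogScore definition can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the next-token distributions are passed in as plain lists, natural logarithms are assumed for the entropy, and the defaults `alpha=1.0` and `beta=10.0` are placeholders because the tuned values of α and β are not reported here.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def adaptive_lambda(n, alpha=1.0, beta=10.0):
    """lambda(n) = alpha / log10(n + beta); alpha and beta are placeholders."""
    return alpha / math.log10(n + beta)


def text_log_score(dists, token_ids, alpha=1.0, beta=10.0):
    """L_NLL(X) + lambda(n) * mean entropy, accumulated token by token.

    dists[i] is the model's distribution over the vocabulary at step i and
    token_ids[i] is the index of the token actually observed at that step.
    """
    n = len(token_ids)
    nll = -sum(math.log(d[t]) for d, t in zip(dists, token_ids)) / n
    mean_h = sum(entropy(d) for d in dists) / n
    return nll + adaptive_lambda(n, alpha, beta) * mean_h


# Toy two-word vocabulary: a confident model versus an undecided one.
confident = text_log_score([[0.9, 0.1], [0.9, 0.1]], [0, 0])
undecided = text_log_score([[0.5, 0.5], [0.5, 0.5]], [0, 1])
```

In this toy case the confident distributions yield the smaller score, because both the NLL term and the mean-entropy term shrink when probability mass concentrates on the observed token.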

3.1.2. Extracting SeqProbScore Feature

Maximum Likelihood is a statistical indicator utilized to quantify the probability of text generation by accumulating the conditional probabilities of each token. While the joint probability of a sequence is defined as P(X|θ) in the preceding section, human text generation typically exhibits a pronounced “first-token uncertainty” effect. As demonstrated by Ippolito [10] through surprisal analysis, the edit distance at prefix positions is significantly larger than at suffixes, evidencing non-uniform positional probability characteristics.

Inspired by the sinusoidal positional encoding in Transformers, a cosine-decay weight ω_{i} is heuristically introduced to account for these characteristics:

ω_{i} = \max (\cos (\frac{π \cdot i}{2 n}), 0.1)

where n represents the sequence length and i is the current token position. This mechanism assigns sentence-initial tokens a weight of ≈1 and decays to ≈0.1 for tail tokens. Such a design forces the cumulative process to be more sensitive to low-probability events in the prefix, which often harbor the model’s stylistic “fingerprints.”
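
The weight schedule is easy to inspect numerically. The sketch below evaluates ω_i for a short sequence; the function name and the choice n = 8 are illustrative only.

```python
import math


def cosine_decay_weights(n, floor=0.1):
    """omega_i = max(cos(pi * i / (2n)), floor) for positions i = 1..n."""
    return [max(math.cos(math.pi * i / (2 * n)), floor) for i in range(1, n + 1)]


weights = cosine_decay_weights(8)
first, last = weights[0], weights[-1]
```

For n = 8 the first weight is cos(π/16), close to 1, and the last position hits the 0.1 floor because cos(π/2) is zero; the floor keeps tail tokens from being ignored entirely.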

Consequently, unlike traditional maximum-likelihood approaches that attend only to global probability values, the proposed SeqProbScore incorporates a “sequence-probability path smoothness” metric. By dynamically re-weighting the conditional probabilities, the importance of contexts at different positions is strategically allocated:

SeqProbScore = \sum_{i = 1}^{n} ω_{i} \cdot \log P (x_{i} | x_{< i}; θ)

where ω_{i} is the dynamically adjusted path weight coefficient. This improvement enables SeqProbScore to not only reflect the overall generation probability of the text but also capture the characteristics of local probability fluctuations. This is particularly vital in long text detection tasks where global averages may dilute critical local anomalies.

SeqProbScore maintains an inherent connection with information entropy; a larger value indicates a higher generation probability. Text generated by large language models, especially reasoning models like DeepSeek-R1, usually exhibits higher SeqProbScore values because their generation strategies (e.g., chain-of-thought) tend to consistently follow high-probability token paths.

The algorithm flow of SeqProbScore feature extraction is shown in Figure 2.
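
A minimal sketch of the SeqProbScore sum follows, assuming the per-token conditional probabilities have already been read off the model. The two probability lists are invented and contain the same values in a different order, which isolates the effect of position.

```python
import math


def seq_prob_score(token_probs, floor=0.1):
    """Position-weighted log-probability: sum_i omega_i * log P(x_i | x_<i)."""
    n = len(token_probs)
    score = 0.0
    for i, p in enumerate(token_probs, start=1):
        weight = max(math.cos(math.pi * i / (2 * n)), floor)  # cosine-decay weight
        score += weight * math.log(p)
    return score


# Same hypothetical probabilities, with the surprising token placed early vs. late.
early_surprise = seq_prob_score([0.01, 0.9, 0.9, 0.9])
late_surprise = seq_prob_score([0.9, 0.9, 0.9, 0.01])
```

The early-surprise sequence receives the lower (more negative) score, since a low-probability token in the prefix is multiplied by a weight near 1 while the same token at the tail is multiplied by 0.1.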

3.1.3. Extracting TokenRankScore Feature

GLTR is a tool based on statistical methods for detecting and analyzing text generated by large language models. Its core strategy is to analyze the generation probability distribution of each token in the text and calculate its percentile rank in the probability distribution, thereby revealing the generation pattern of the text. The GLTR score can effectively distinguish between human-generated text and text generated by large language models.

The GLTR score is calculated through the following steps:

Let X_{1 : N} be the token sequence of the input text, and let P_{\det} (X_{i} | X_{1 : i - 1}) be the conditional probability distribution from which the model generates token X_{i} at position i.

First, use DeepSeek-R1 to encode the input text and generate the conditional probability distribution of each token:

P_{\det} (X_{i} | X_{1 : i - 1}) = Model (X_{i} | X_{1 : i - 1})

Then, sort the conditional probability distribution at each position in descending order to obtain the relative rank of the token’s probability in the distribution. The higher the rank, the stronger the model’s probability preference for generating the token:

Rank (X_{i}) = sort (P_{\det} (X_{i} | X_{1 : i - 1}))

where sort returns the position of the token’s probability after the distribution is sorted in descending order.

Next, based on Rank (X_{i}), calculate the percentile rank of the token’s probability in the distribution. The percentile reflects the proportion of the current token’s probability exceeding other candidates in the distribution:

P_{X_{i}} = \frac{Rank (X_{i}) - 1}{N - 1} \times 100\%

where N is the total number of candidates when the model generates the token at this position. This formula converts the rank into a ratio of 0% to 100%. The higher the rank (smaller Rank), the higher the percentile, reflecting the more prominent performance of the model’s probability confidence in the token. Finally, statistical analysis is performed on the percentile ranks of all tokens to obtain TokenRankScore.
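
The rank and percentile steps can be sketched on a single toy distribution. The code follows the percentile formula exactly as printed, so the most probable candidate (Rank = 1) maps to 0 and the least probable to 100; the orientation can be flipped if the opposite convention is intended. Giving tied probabilities the same (best) rank is an assumption, as are the function names.

```python
def token_rank(dist, token_id):
    """1-based rank of token_id when dist is sorted in descending order."""
    target = dist[token_id]
    # Rank 1 means no other candidate has a strictly higher probability.
    return 1 + sum(1 for p in dist if p > target)


def rank_percentile(rank, n_candidates):
    """(Rank - 1) / (N - 1) * 100, following the formula as printed."""
    return (rank - 1) / (n_candidates - 1) * 100.0


# Hypothetical next-token distribution over a 5-word vocabulary.
dist = [0.05, 0.60, 0.20, 0.10, 0.05]
rank_of_top = token_rank(dist, 1)                    # most probable candidate
percentile_of_third = rank_percentile(token_rank(dist, 3), len(dist))
```

In practice N is the vocabulary size of the model, so counting candidates with a strictly higher probability is usually cheaper than a full sort at every position.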

GLTR only reports the low-order percentiles P25/P50/P75, yet the rank distribution of neural fake news exhibits heavy-tail asymmetry. Conventional percentiles cannot capture either the direction of distributional skewness or its temporal dynamics. Skewness γ (the standardized third central moment) is the lowest-order statistic that encodes tail asymmetry, whereas volatility V (the L1-mean of first-order differences) serves as an O(T)-complexity proxy for temporal evolution [25]. Being orthogonal to GLTR’s three percentiles, γ and V jointly supply complementary information on distributional shape and dynamics.

TokenRankScore = [P_{25}, P_{50}, P_{75}, γ, V]

This can effectively distinguish between human-generated text and text generated by large language models by analyzing the text generation pattern. Studies have shown that text generated by large language models usually has a higher rank percentile because the generation process tends to select a high confidence interval in the probability distribution, while human-generated text tends to select low-probability but semantically reasonable vocabulary. To capture distributional asymmetry and temporal dynamics, skewness γ and volatility V are further computed:

γ = \frac{\frac{1}{N} \sum_{i = 1}^{N} {(r_{i} - \bar{r})}^{3}}{{(\frac{1}{N} \sum_{i = 1}^{N} {(r_{i} - \bar{r})}^{2})}^{3 / 2}}

V = \frac{1}{N - 1} \sum_{i = 1}^{N - 1} | r_{i + 1} - r_{i} |

where r_{i} denotes the rank percentile at position i, and \bar{r} is the mean rank percentile across the sequence. The final TokenRankScore is then constructed as a 5-dimensional feature vector [P_{25}, P_{50}, P_{75}, γ, V].
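
The five-dimensional vector can be assembled from a list of per-token rank percentiles as sketched below. The quartile interpolation rule (`method="inclusive"`) and the convention of returning zero skewness for a constant sequence are assumptions; the function name is hypothetical, and the input needs at least two positions.

```python
import statistics


def token_rank_score(r):
    """Return [P25, P50, P75, skewness, volatility] for rank percentiles r."""
    n = len(r)
    # Quartiles; the 'inclusive' interpolation rule is an assumption.
    p25, p50, p75 = statistics.quantiles(r, n=4, method="inclusive")
    mean = sum(r) / n
    m2 = sum((x - mean) ** 2 for x in r) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in r) / n   # third central moment
    # Standardised third moment; defined as 0 for a constant sequence.
    gamma = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    # Mean absolute first-order difference between neighbouring positions.
    volatility = sum(abs(b - a) for a, b in zip(r, r[1:])) / (n - 1)
    return [p25, p50, p75, gamma, volatility]


# Hypothetical percentiles: mostly top-ranked tokens with one late outlier.
features = token_rank_score([0, 0, 0, 0, 100])
```

For this toy input all three quartiles are 0, so the percentiles alone cannot see the outlier; the skewness (1.5) and volatility (25) are the components that register it, which is the complementary role the text assigns to γ and V.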

TokenRankScore can be used to detect whether the generation pattern of the text conforms to the distribution characteristics of natural language. Let D_{human} and D_{llm} represent the distributions of human-generated text and text generated by large language models, respectively:

D_{human} ⊇ D_{llm}

This indicates that the distribution of human-generated text covers a wider range of vocabulary choices, while the distribution of text generated by large language models tends to concentrate in high-probability regions. The algorithm flow of TokenRankScore feature extraction is shown in Figure 3.

3.2. Feature Extractor Setup

DeepSeek-R1-7B is selected as the feature extraction backbone. First, large-scale reinforcement-learning post-training at the 7B level yields longer reasoning chains, making the softmax distribution more sensitive to the “high-confidence, low-entropy” pattern and amplifying the curvature gap between AI and human text. Second, the vocabulary and pre-training corpus are Chinese-centric, providing finer-grained rank distributions for Chinese long-tail words and domain-specific terms; prior experiments on the HC3 Chinese corpus show an 18% larger human-vs-machine rank-skewness gap than LLaMA-7B. The same checkpoint meets both Chinese and English detection requirements without additional model switching. The official HuggingFace checkpoint deepseek-ai/deepseek-r1-7b is used in torch.float16 with a 32 k context window; texts are truncated to 512 tokens, and a single RTX 4070 GPU performs one forward pass to obtain the full probability matrix within ≤8 GB memory. All sampling-related parameters (do_sample, temperature) are disabled, leaving only deterministic forward passes to ensure reproducible features.
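
Reading per-token probabilities off a single forward pass requires a one-position shift, because the scores at position i describe the token that follows it. The sketch below shows that alignment on stand-in logits given as nested lists; with a Hugging Face causal language model the same matrix would typically come from the `logits` field of the model output. The function names are hypothetical and the numbers are invented.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one logit vector."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def observed_log_probs(logits, token_ids):
    """log P(x_i | x_<i) for i = 2..n from one deterministic forward pass.

    logits[i] scores the token that follows position i, so row i is paired
    with token_ids[i + 1]; the first token has no context and is skipped.
    """
    return [
        log_softmax(row)[next_id]
        for row, next_id in zip(logits[:-1], token_ids[1:])
    ]


# Stand-in logits for a 3-token input over a 2-word vocabulary.
logits = [[0.0, 0.0], [math.log(3.0), 0.0], [5.0, -5.0]]
token_ids = [1, 0, 0]
log_probs = observed_log_probs(logits, token_ids)
```

Because no sampling is involved, repeated calls on the same input give identical values, which is the reproducibility property the setup above relies on.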

The three types of features extracted based on the DeepSeek-R1 model need to be merged first, and then combined with a classifier to complete the text detection task. This process combines multi-feature fusion with a classic classifier to effectively use model-generated features for detection. The overall algorithm flow is as follows:

Input: Text sequence X, DeepSeek-R1 model
Output: y
1. Perform feature extraction using the DeepSeek-R1 model:
   T = TextLogScore
   S = SeqProbScore
   R = TokenRankScore
2. Feature merging:
   F = [T, S, R] # Spliced into feature vector
3. Standardization processing:
   F_norm = Standardization(F) # Use training set statistics
4. Classification prediction:
   y = C.predict(F_norm) # C is the trained classifier

The overall principle of the Dk method is shown in Figure 4.
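
Steps 2 and 3 of the flow above can be sketched as follows. Interpreting "Standardization" as a per-feature z-score is an assumption, as are the helper names; the key point carried over from the text is that the mean and standard deviation come from the training set only. The feature values are invented.

```python
import statistics


def fuse(text_log_score, seq_prob_score, token_rank_score):
    """F = [T, S, R]: two scalars plus the 5-dimensional rank vector (7 values)."""
    return [text_log_score, seq_prob_score, *token_rank_score]


def fit_standardizer(train_rows):
    """Per-feature mean and standard deviation, estimated on the training set only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(c) for c in columns]
    # Population std; fall back to 1.0 so constant features do not divide by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds


def standardize(row, means, stds):
    """Z-score one feature vector with the stored training statistics."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


# Two hypothetical training samples.
train = [
    fuse(1.0, -10.0, [0, 0, 0, 0.0, 0.0]),
    fuse(3.0, -20.0, [0, 0, 0, 0.0, 0.0]),
]
means, stds = fit_standardizer(train)
f_norm = standardize(train[0], means, stds)
```

At test time the same `means` and `stds` are reused unchanged, so no information from the test set leaks into the feature scaling.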

3.3. Model Training

After completing feature extraction, this study uses a logistic regression model to train the extracted features. As a classic binary classification algorithm, logistic regression has the characteristics of strong interpretability, simple model structure, and high computational efficiency, which can adapt to practical application scenarios and is highly consistent with the goal of efficient and interpretable text detection pursued in this study.

In the experiment, the dataset is divided into a training set and a test set, with the test set accounting for 20%. When dividing the dataset, a stratified sampling strategy (stratify = labels) is adopted to ensure that the proportion of human-generated text and text generated by large language models in the training set and test set is consistent, and the random seed is set to 42 (random_state = 42) to ensure the reproducibility of the experiment. This helps to avoid model bias caused by unbalanced data distribution, allowing the model to fully learn the features of different types of text during training, thereby improving the generalization ability and stability of the model.

When using the logistic regression model for classification in this study, L2 regularization (penalty = ‘l2’) is configured to control model complexity, and the optimizer selects the L-BFGS algorithm [26] (solver = ‘lbfgs’) to efficiently solve convex optimization problems, with the maximum number of iterations set to 100. The regularization strength C was optimized via a grid search (C ∈ {0.01, 0.1, 1, 10, 100}) with 5-fold cross-validation, yielding optimal values of 1 for the Essay dataset and 0.1 for the Reuters and WP datasets. The algorithm of the logistic regression model is as follows:

Input: Text feature matrix F, label vector y, DeepSeek-R1 model, Tokenizer
Output: Trained logistic regression classifier lr_model
1. Divide training set and test set:
   F_train, F_test, y_train, y_test = Stratified sampling division (test set accounts for 20%, random seed = 42)
2. Initialize logistic regression model:
   lr_model = LogisticRegression(
       penalty = 'l2',
       C = regularization_strength,
       solver = 'lbfgs',
       max_iter = max_iterations)
3. Model training:
   lr_model.fit(F_train, y_train)
4. Return the trained model:
   return lr_model
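
The training procedure can be sketched with scikit-learn as below. This is an illustration under stated assumptions, not the authors' script: the features are synthetic Gaussian clusters standing in for the 7-dimensional fused vectors, and wrapping the scaler and classifier in one pipeline is a design choice made here so that scaling statistics are refitted inside each cross-validation fold. The split ratio, seed, solver, iteration cap, and C grid follow the text; the L2 penalty is left at scikit-learn's default rather than passed explicitly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 7-dimensional fused features (not real data).
rng = np.random.default_rng(0)
F = np.vstack([rng.normal(0.0, 1.0, (200, 7)), rng.normal(3.0, 1.0, (200, 7))])
y = np.array([0] * 200 + [1] * 200)  # 0 = human, 1 = LLM

# Stratified 80/20 split with the seed reported in the text.
F_train, F_test, y_train, y_test = train_test_split(
    F, y, test_size=0.2, stratify=y, random_state=42
)

# L2 regularisation is scikit-learn's default penalty for LogisticRegression.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=100),
)

# Grid search over the regularisation strength C with 5-fold cross-validation.
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
)
search.fit(F_train, y_train)

best_C = search.best_params_["logisticregression__C"]
test_accuracy = search.score(F_test, y_test)
```

After fitting, `search.best_estimator_` holds the refitted pipeline, and the coefficients of its logistic regression step can be inspected to see how much each of the seven features contributes.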

3.4. Classification and Evaluation

After the model training is completed, the test set is used to comprehensively evaluate the model’s performance. The evaluation process mainly includes two key steps: classification and performance indicator calculation.

The trained logistic regression model is used to classify the text samples in the test set. In the classification process, a probability threshold method is used for judgment. The logistic regression model will output a predicted probability value for each sample, which reflects the possibility that the sample belongs to text generated by large language models. In this study, samples with a predicted probability greater than 0.5 are considered to be text generated by large language models, and samples with a predicted probability less than or equal to 0.5 are considered to be human-generated text. This classification method is simple and intuitive, has high operability in practical applications, and can quickly classify a large amount of text, providing an efficient solution for subsequent text detection work.

Next, a series of performance indicators is calculated to evaluate the model’s performance, including Accuracy, Precision, Recall, F1 score, and AUC. Through comprehensive analysis of these indicators, the performance of the model in the generated text detection task can be comprehensively and objectively evaluated, providing a strong basis for model improvement and optimization, thereby continuously improving the effectiveness and reliability of the model in practical applications.
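
The decision rule and the threshold-based indicators can be written out directly. The sketch below treats LLM-generated text (label 1) as the positive class, which matches the labeling used in this paper; the probabilities and labels are invented, and AUC is omitted because it depends on the ranking of the probabilities rather than on a single threshold.

```python
def classify(probabilities, threshold=0.5):
    """Label 1 (LLM) when the predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with LLM (label 1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical classifier outputs and ground-truth labels.
y_pred = classify([0.9, 0.5, 0.7, 0.2])
metrics = binary_metrics([1, 1, 0, 0], y_pred)
```

A probability of exactly 0.5 is assigned to the human class, in line with the "less than or equal to 0.5" rule stated above.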

4. Experiment

4.1. Dataset

The dataset comprises human-generated and LLM-generated texts. Human texts are sourced from the public Reuters news corpus and WP dataset [27]. Reuters covers diverse news topics with standardized expressions, while WP contains user-contributed creative writing prompts and corresponding stories with personalized language styles. LLM-generated texts are produced by six representative models (ChatGPT-turbo, Claude, ChatGLM, Dolly, ChatGPT, and GPT4All) responding to identical prompts, ensuring architectural diversity and generalizability of results.

Each dataset contains 1000 human texts and 6000 LLM outputs (1000 prompts × 6 models). Human texts are paired with every model’s corresponding output, yielding 6000 labeled instances per dataset (0 = human, 1 = LLM). The HC3 dataset follows its original domain-specific 8:2 train-test partitioning (Table 1). For Reuters and WP, 5-fold stratified cross-validation is employed with an 80:20 split per fold. All texts are truncated to 512 tokens.
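A stratified 5-fold assignment of the kind described above can be sketched as follows. The sketch assumes one detection task per generator model, with 1000 human texts paired against that model's 1000 outputs; the helper name stratified_folds and the seed are illustrative choices.

```python
import random


def stratified_folds(labels, n_folds=5, seed=42):
    """Return a fold index per sample so that each fold keeps the class ratio."""
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled samples of this class round-robin across the folds.
        for position, i in enumerate(idx):
            fold_of[i] = position % n_folds
    return fold_of


# Assumed setup for one generator model: 1000 human texts (0) and 1000 LLM outputs (1).
labels = [0] * 1000 + [1] * 1000
fold_of = stratified_folds(labels)

# Fold 0 is held out (20%); the other four folds form the training part (80%).
test_idx = [i for i, f in enumerate(fold_of) if f == 0]
train_idx = [i for i, f in enumerate(fold_of) if f != 0]
```

Rotating the held-out fold from 0 to 4 gives the five train and test pairs, each with an 80:20 ratio and balanced classes.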

In addition, to test the generalization ability of the model, this experiment uses Chinese text from HC3, in which the machine-generated answers are produced by ChatGPT and the data cover seven different fields. The relevant information of the HC3 dataset is shown in Table 1 [28].

Additionally, the Essay dataset from MGTBench [27] is incorporated to evaluate cross-dataset generalization capabilities. This dataset consists of 1000 academic essays across high school and university levels, serving as both training and testing sets alongside Reuters and WP in cross-dataset experiments.

4.2. Experimental Setup

Comparison methods:

Log-likelihood [3,25]: Distinguish between large language models and human-generated text by calculating the log-likelihood of the text under a pre-trained language model. A high average log-likelihood, meaning the text is highly predictable to the model, usually indicates text generated by large language models.

Rank [4]: Based on the ranking features of vocabulary in the generation probability distribution (such as the GLTR method), text generated by large language models tends to use high-ranking vocabulary, and detection is realized by counting ranking percentiles.

Entropy [3]: Use the entropy value of the text probability distribution to measure uncertainty. Text generated by large language models usually has lower entropy due to concentrated probabilities.

GLTR [4]: Capture the high-confidence vocabulary selection preference of text generated by large language models by visualizing and quantifying the vocabulary generation probability ranking distribution.

NPR [3]: Normalized perplexity indicator, evaluating the prediction uncertainty of the language model on the text. Text generated by large language models usually has lower perplexity.

DetectGPT [9]: Based on the probability curvature difference after text perturbation, text generated by large language models has a more significant probability drop after perturbation, realizing zero-shot detection.
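The likelihood, rank, and entropy statistics that underlie several of these baselines can be computed from per-position next-token distributions. The sketch below uses made-up distributions over a three-word vocabulary in place of a real language model; the function name and data layout are assumptions for illustration.

```python
import math


def token_statistics(distributions, observed):
    """Average log-likelihood, log-rank and entropy over a token sequence.

    distributions: one dict per position mapping candidate token -> probability.
    observed: the token that actually appears at each position.
    """
    log_likelihoods, log_ranks, entropies = [], [], []
    for dist, token in zip(distributions, observed):
        p = dist[token]
        log_likelihoods.append(math.log(p))
        # Rank 1 is the most probable candidate at this position.
        rank = 1 + sum(1 for q in dist.values() if q > p)
        log_ranks.append(math.log(rank))
        entropies.append(-sum(q * math.log(q) for q in dist.values() if q > 0))
    n = len(observed)
    return {
        "mean_log_likelihood": sum(log_likelihoods) / n,
        "mean_log_rank": sum(log_ranks) / n,
        "mean_entropy": sum(entropies) / n,
    }


# Toy next-token distributions (made up for illustration).
dists = [
    {"the": 0.7, "a": 0.2, "one": 0.1},
    {"the": 0.5, "a": 0.3, "one": 0.2},
]
confident = token_statistics(dists, ["the", "the"])   # always the top-ranked token
surprising = token_statistics(dists, ["one", "one"])  # always the lowest-ranked token
```

The sequence that always picks the top-ranked token has the higher mean log-likelihood and a mean log-rank of zero, while the entropy depends only on the distributions and is therefore identical for both sequences.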

Experimental configuration: In this experiment, the open-source DeepSeek-R1 model from HuggingFace is used for feature extraction. This model has undergone large-scale pre-training and can capture semantic and structural information in the text. The classifier is the logistic regression model from the Scikit-learn library, chosen for its computational efficiency and interpretability. The hardware environment is an NVIDIA RTX 4090 GPU, which accelerates feature extraction and model training.

5. Results

5.1. Experimental Results

In this paper, Accuracy, Precision, Recall, and F1 are used as evaluation indicators to test the performance of the Dk model. For each dataset, the experimental results of each indicator of the Dk method on this dataset are presented first, and then the F1 score is compared with other methods.

The detection performance of the Dk method on text generated by six large language models on the Reuters dataset is shown in Table 2:

Each row in Table 2 corresponds to one large model and reports how well the Dk model separates that model’s generated text from human-generated text on each evaluation indicator.

The comparison results of F1 scores between the Dk method and six comparison methods are shown in Table 3:

Table 3 compares the detection methods, including Dk, on the Reuters dataset, using the F1 score as the core indicator of how well each method separates human-generated text from text generated by large language models.

On the Reuters dataset, the Dk method achieves strong detection on large-scale transformers (F1 = 0.965 for ChatGPT-turbo, 0.970 for ChatGLM), yet performance drops markedly for lightweight models (F1 = 0.649 for Dolly, 0.765 for GPT4All). This gap reflects suppressed distributional signatures: smaller models exhibit lower generation confidence and reduced lexical sophistication, making their outputs statistically ambiguous between AI and human patterns. Compared to GLTR (Table 3), the Dk method improves F1 by a relative 3.1% on GPT4All (0.765 vs. GLTR’s 0.742), suggesting that skewness and volatility capture subtle signals missed by percentile-only approaches. The slight underperformance on ChatGLM (ΔF1 = −0.017 vs. GLTR) suggests its rank distribution is already well-captured by basic percentiles. The heat map comparison of F1 scores of the six methods on the Reuters dataset is shown in Figure 5:

The heat map in Figure 5 shows that the F1 score of the Dk method remains competitive across the tested models. Despite variations in individual model performance, the Dk method consistently ranks among the top three, which suggests robustness and effective adaptation to the diverse text distributions in the news domain. This consistency further supports the advantages of the multi-dimensional feature fusion strategy in handling complex, domain-specific text patterns.
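The F1 comparisons against GLTR quoted for the Reuters dataset mix two kinds of difference: the 3.1% figure is a relative change, while ΔF1 is an absolute difference. The short sketch below reproduces both from the Table 3 values; the helper name f1_comparison is illustrative.

```python
def f1_comparison(f1_ours, f1_baseline):
    """Absolute difference and relative change (in percent) between two F1 scores."""
    delta = f1_ours - f1_baseline
    relative_pct = 100.0 * delta / f1_baseline
    return round(delta, 3), round(relative_pct, 1)


# F1 values taken from Table 3 (Reuters): Dk versus GLTR.
gpt4all = f1_comparison(0.765, 0.742)   # (0.023, 3.1)
chatglm = f1_comparison(0.970, 0.987)   # (-0.017, -1.7)
```

On GPT4All the absolute gain is 0.023, which corresponds to the relative 3.1%; on ChatGLM the absolute difference of −0.017 corresponds to a relative change of about −1.7%.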

The detection performance of the Dk method on text generated by six large language models on the WP dataset is shown in Table 4:

The comparison results of F1 scores between the Dk method and six comparison methods are shown in Table 5:

The comparison of the F1 scores of the six methods on the WP dataset is shown in Figure 6.

On the WP creative writing dataset, the Dk method achieves F1 = 0.992 for ChatGLM and 0.962 for ChatGPT-turbo, demonstrating robust discrimination despite high stylistic variability in creative texts. Its performance on Claude (F1 = 0.782) remains competitive even though Claude’s strong semantic coherence typically masks statistical artifacts. As shown in Table 5 and Figure 6, the Dk method outperforms GLTR by a relative 20.3% on ChatGPT-turbo (0.962 vs. 0.800), while maintaining consistent advantages over Log-Likelihood and Entropy on GPT4All. These results suggest that multi-feature fusion captures patterned traces resistant to stylistic perturbation, supporting the framework’s suitability for creative writing detection.

In addition, to study the generalization ability of the Dk method, it is applied to Chinese text to test whether it can detect LLM-generated content in Chinese. The HC3 dataset contains seven domain categories: finance, open_qa, baike, nlpcc_dbqa, medicine, psychology, and law. The results of the Dk method on all domain categories are shown in Table 6.

Each row of data in Table 6 corresponds to a domain category, showing the performance of the Dk method in each domain of the HC3 dataset.

The comparison of F1 scores of classification by the Dk method on the HC3 dataset is shown in Figure 7.

Experiments on the HC3 Chinese dataset show that the Dk method exhibits stable performance in professional fields such as medicine (F1 = 0.9721) and psychology (F1 = 0.9899), with AUC values exceeding 0.99, suggesting successful capture of domain-specific semantic patterns. Performance decreases in baike, the encyclopedia category (F1 = 0.8580), and nlpcc_dbqa (F1 = 0.8147), likely due to terminology sparsity and specialized syntax that reduce statistical regularity. Overall F1 of 0.8964 and AUC of 0.9550 support cross-domain generalization within Chinese text, even with 512-token truncation.

5.2. Feature Correlation Analysis Across Datasets

To examine the relationships among the proposed features, a correlation analysis is conducted on three benchmark datasets: Reuters, Essay, and WP. For each dataset, three features—TextLogScore, SeqProbScore, and TokenRankScore—are extracted from a fixed subset of human-written and model-generated texts. Pearson correlation coefficients are then computed to quantify the pairwise relationships among these features and to assess their dependency and complementarity. The correlation results are shown in Table 7:

The results demonstrate a consistent correlation structure among the three features across different datasets. The strong negative correlation between TextLogScore and SeqProbScore arises from their inherently inverse definitions, indicating that they capture opposing aspects of model confidence. In contrast, TokenRankScore shows moderate correlations with both features, suggesting that while it is partially related to likelihood-based information, it also captures additional token-level distribution characteristics. Therefore, TokenRankScore is neither redundant nor fully independent, but instead provides complementary information that enhances the overall feature representation.
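For readers who wish to reproduce this kind of analysis, the Pearson coefficient between two feature columns can be computed as in the sketch below. The feature values are made up for illustration and are not taken from the datasets.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Made-up feature values for five texts (illustrative only).
text_log = [1.0, 2.0, 3.0, 4.0, 5.0]
seq_prob = [5.0, 4.1, 2.9, 2.2, 1.0]
token_rank = [1.2, 1.9, 3.4, 3.6, 5.1]

r_log_seq = pearson(text_log, seq_prob)      # negative: the two move in opposite directions
r_log_rank = pearson(text_log, token_rank)   # positive: the two move together
```

A coefficient near −1 or +1 signals that one feature is almost a linear function of the other, whereas moderate values such as those reported for TokenRankScore leave room for complementary information.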

5.3. Ablation Study

To systematically evaluate the contribution of each feature component, comprehensive ablation experiments are conducted on the WP dataset. The multi-feature fusion framework is decomposed into three baseline single-feature extractors and their combinations:

TextLogScore: Only uses TextLogScore feature

SeqProbScore: Only uses SeqProbScore feature

TokenRankScore: Only uses TokenRankScore feature

TextLogScore + SeqProbScore: Dual-feature combination

TextLogScore + TokenRankScore: Dual-feature combination

SeqProbScore + TokenRankScore: Dual-feature combination

All_Features: Complete 7-dimensional feature vector as described in Section 3.1

All experiments adopt consistent experimental settings: DeepSeek-R1 as the backbone model, logistic regression classifier, 512-token truncation length, and a stratified 8:2 train–test split. Representative models (ChatGPT-turbo, Claude, and ChatGLM) are selected to demonstrate the generalizability of the observations.
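The seven configurations listed above are exactly the non-empty subsets of the three feature groups, which can be enumerated programmatically. The following sketch is illustrative; the function name is hypothetical, and each configuration would then be passed to the same training and evaluation routine.

```python
from itertools import combinations

FEATURE_GROUPS = ["TextLogScore", "SeqProbScore", "TokenRankScore"]


def ablation_configurations(groups):
    """All non-empty subsets of the feature groups, smallest subsets first."""
    configs = []
    for size in range(1, len(groups) + 1):
        for subset in combinations(groups, size):
            configs.append(subset)
    return configs


configs = ablation_configurations(FEATURE_GROUPS)
names = [
    "All_Features" if len(c) == len(FEATURE_GROUPS) else " + ".join(c)
    for c in configs
]
```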

Table 8 presents the F1 scores of different feature combinations for detecting text generated by three typical LLMs on the WP dataset.

Single-Feature Performance: TextLogScore performs well on ChatGPT-turbo (0.922) and ChatGLM (0.975), but poorly on Claude (0.742), reflecting model-specific generative differences. SeqProbScore shows limited effectiveness across all models (0.740–0.923), validating our cosine-decay weighting design.

Feature Fusion Insights: Most dual-feature combinations improve upon single features, confirming complementarity. Notably, SeqProbScore + TokenRankScore (0.804) outperforms All_Features (0.782) on Claude, suggesting feature redundancy in the complete set—certain combinations generalize better than comprehensive fusion for specific models. All_Features achieves optimal performance on ChatGPT-turbo (0.962) and ChatGLM (0.992), demonstrating multi-dimensional fusion’s robustness for mainstream LLMs.

5.4. Feature Importance Analysis

To evaluate the contribution of each feature, a coefficient-based importance analysis is conducted using logistic regression with standardized inputs. This experiment aims to examine whether the TextLogScore, SeqProbScore, and TokenRankScore feature groups provide complementary signals for distinguishing human-written and machine-generated text. The averaged coefficients and corresponding rankings are summarized in Table 9.

As shown in Table 9, TextLogScore consistently exhibits the highest importance across all settings, indicating that token-level likelihood serves as the primary discriminative signal. SeqProbScore also contributes significantly; although it is strongly correlated with TextLogScore, their combination leads to improved performance, suggesting complementary effects under the classifier. The TokenRankScore further enhances performance as a group, indicating that distributional characteristics provide additional information beyond likelihood-based measures. Overall, the results support the effectiveness of combining these features for robust detection.
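With standardized inputs, the magnitude of a logistic regression coefficient can be read as a rough importance score, while its sign gives the direction of the effect. The sketch below ranks the averaged coefficients from Table 9 by absolute value; treating importance as the absolute coefficient is an assumption made here for illustration.

```python
def rank_by_magnitude(coefficients):
    """Order features by the absolute value of their standardized coefficient."""
    return sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)


# Averaged standardized coefficients as reported in Table 9.
avg_coefficients = {
    "TextLogScore": -5.67,
    "SeqProbScore": 3.12,
    "TokenRankScore": -1.02,
}
ranking = rank_by_magnitude(avg_coefficients)
```

The resulting order (TextLogScore, SeqProbScore, TokenRankScore) matches the average ranks in Table 9.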

5.5. Cross-Dataset Generalization Analysis

To systematically evaluate the domain-independence and robustness of the Dk framework, cross-dataset transfer experiments are conducted across three distinct linguistic domains: Reuters (News), WP (Creative Writing), and Essay (Academic Writing). The model is trained on a single source dataset and directly evaluated on the remaining target datasets without further fine-tuning. This experimental setup identifies whether the detector captures universal statistical signatures or merely overfits to domain-specific vocabulary.
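The transfer protocol amounts to a double loop over source and target datasets, with the model frozen after training on the source. The sketch below shows the loop structure only; train_fn and eval_fn are hypothetical placeholders for the feature extraction, classifier training, and F1 computation, and the stub functions exist solely so that the loop can run.

```python
def cross_dataset_matrix(datasets, train_fn, eval_fn):
    """Train once per source dataset and score the unchanged model on every dataset."""
    results = {}
    for source_name, source_data in datasets.items():
        model = train_fn(source_data)  # no further fine-tuning after this point
        for target_name, target_data in datasets.items():
            results[(source_name, target_name)] = eval_fn(model, target_data)
    return results


def train_stub(data):
    """Stand-in for training: the 'model' is just a number."""
    return sum(data)


def eval_stub(model, data):
    """Stand-in for evaluation: returns a dummy score."""
    return model + len(data)


toy_datasets = {"Reuters": [1, 2, 3], "WP": [4, 5], "Essay": [6]}
matrix = cross_dataset_matrix(toy_datasets, train_stub, eval_stub)
```

Entries with identical source and target names correspond to in-domain results, and the remaining entries are the transfer results.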

As illustrated in Table 10, the multi-feature fusion maintains stable performance across diverse domains. Specifically, the model trained on Reuters achieves an F1 of 0.8666 on WP and 0.9467 on Essay. From an analytical perspective, this cross-domain stability is attributed to the invariance of underlying rank-based statistical fingerprints relative to the LLM’s decoding strategy, even as semantic distributions (word choice and topic) shift between datasets. The high accuracy in the Reuters → Essay transfer reflects that both academic and news texts exhibit high structural constraints, leading to consistent rank volatility and skewness patterns. Conversely, the observed performance dip in the WP → Essay transfer (F1 = 0.7758) suggests that creative prompts introduce higher stochasticity in token selection, which partially perturbs the SeqProbScore but is mitigated by the stability of TokenRank features.

Furthermore, this mechanism may reduce vulnerability to paraphrasing attacks [23], although robustness to such attacks is not evaluated empirically in this study. While surface-level rewriting can perturb individual probability scores, the rank-based signatures inherent to the generative process are expected to be more difficult to eliminate. By fusing these dimensions, the Dk detector is intended to maintain a resilient decision boundary that captures the “physical essence” of machine-generated text rather than superficial linguistic patterns.

5.6. Cross-Model Generalization (LOMO)

To evaluate the zero-shot detection robustness against previously unobserved generative patterns, a Leave-One-Model-Out (LOMO) evaluation protocol is implemented. In this setup, the detection framework is subjected to multiple independent iterations corresponding to the number of heterogeneous LLMs in the dataset. During each iteration, one specific LLM is designated as the “unseen target” and is entirely excluded from the training phase. The classification head is trained exclusively on a composite dataset consisting of human-authored texts and generated samples from all remaining LLMs. Subsequently, the model’s performance is validated on a testing set containing only the excluded LLM’s samples and a hold-out set of human texts. This rigorous separation ensures that the resulting metrics, such as the weighted F1-score and AUC, reflect the capacity of the DeepSeek-R1-Distill-Qwen-7B feature extractor to distill invariant structural artifacts that transcend individual model architectures and specific training distributions.
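The LOMO protocol can be sketched as a generator of train and test splits. In this sketch the human hold-out fraction of 20% and the data layout (pairs of text identifier and source name) are assumptions for illustration; the text specifies only that a hold-out set of human texts is used.

```python
def lomo_splits(samples, human_holdout_fraction=0.2):
    """Yield (held_out_model, train, test) for each generator model.

    samples: list of (text_id, source) pairs, where source is "human" or a model name.
    """
    humans = [s for s in samples if s[1] == "human"]
    n_holdout = int(len(humans) * human_holdout_fraction)
    human_test, human_train = humans[:n_holdout], humans[n_holdout:]
    models = sorted({src for _, src in samples if src != "human"})
    for held_out in models:
        # Training: remaining human texts plus every model except the held-out one.
        train = human_train + [s for s in samples if s[1] not in ("human", held_out)]
        # Testing: held-out human texts plus only the unseen model's samples.
        test = human_test + [s for s in samples if s[1] == held_out]
        yield held_out, train, test


# Toy corpus: 10 human texts and 4 texts from each of three hypothetical models.
toy_samples = [(f"h{i}", "human") for i in range(10)]
for model in ("model_a", "model_b", "model_c"):
    toy_samples += [(f"{model}_{i}", model) for i in range(4)]

splits = list(lomo_splits(toy_samples))
```

Keeping the human hold-out set fixed across iterations means that differences between iterations come from the unseen model rather than from a different sample of human texts.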

The empirical performance of the proposed framework under the LOMO protocol is detailed in Table 11, which summarizes the F1-score, Area Under the Curve (AUC), and Recall across seven heterogeneous LLM architectures.

As illustrated in Table 11, the proposed framework generalizes to unseen generators across heterogeneous LLM architectures, maintaining a mean AUC of 0.9128. The high AUC observed for ChatGLM (0.9964) and ChatGPT (0.9801) indicates that the features extracted by the DeepSeek-R1-Distill-Qwen-7B backbone are sensitive to the fundamental statistical properties inherent in large-scale auto-regressive generation, which remain consistent despite variations in model parameters. The lower AUC recorded for Claude (0.8236) and StableLM (0.8243), together with Recall of roughly 0.54 to 0.55 for both, is likely attributable to specific sampling strategies or alignment-tuning techniques that introduce higher local stochasticity into the token distribution. These results suggest that the framework is largely model-agnostic, capturing statistical discrepancies shared across generators rather than overfitting to individual model fingerprints, although detection remains weaker for some unseen models.

6. Conclusions

This study constructs a method for detecting text generated by large language models based on DeepSeek-R1 model feature extraction and logistic regression. This method integrates three types of features (TextLogScore, SeqProbScore, and TokenRankScore) and realizes efficient and interpretable text classification with the help of a logistic regression model. Experimental results demonstrate competitive or superior performance relative to conventional methods across the Reuters, WP and HC3 datasets, with particularly strong results on GPT-architecture models. This result provides new technical ideas for the detection of text generated by large language models and lays a foundation for subsequent research.

The core contribution of this study is to propose an innovative multi-dimensional feature extraction framework. By integrating TextLogScore, SeqProbScore, and TokenRankScore features, it can effectively capture the subtle differences between human-generated text and text generated by large language models from three levels: text generation probability distribution, overall sequence generation pattern, and vocabulary selection preference. This multi-dimensional feature extraction method more comprehensively characterizes text features compared with single-feature detection methods, improving detection accuracy and reliability. At the same time, the feature extractor designed based on DeepSeek-R1 fine-tuning has high interpretability, making it possible to deeply understand the model’s decision-making process; its high computational efficiency also provides convenience for practical applications.

In addition, the current method has not been tested for its detection effects on long texts (more than 1000 tokens), nor has it been tested for robustness against adversarial paraphrasing (such as rewriting or obfuscation attacks), nor has it been verified for the ability to distinguish mixed texts (partially manually edited + AI-generated). Theoretically, manual editing and adversarial rewriting in human–AI collaborative writing may alter the original statistical characteristics by introducing unpredictable linguistic variations. While empirical testing on such datasets was not conducted at this stage due to data limitations, the current framework provides a theoretical basis for capturing these anomalies. In the future, text length adaptive processing (such as segment detection), adversarial defense mechanisms, and hybrid text detection can be added to improve model robustness.

Author Contributions

Conceptualization, X.B. and M.T.; methodology, X.B.; software, X.B.; validation, X.B., J.W. and J.Z.; formal analysis, X.B.; resources, M.T., J.W. and P.L.; data curation, X.B. and J.Z.; writing—original draft preparation, X.B.; writing—review and editing, X.B., M.T., J.W., J.Z. and P.L.; visualization, X.B.; supervision, M.T.; project administration, M.T.; funding acquisition, M.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Kunlun Talent Project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Essay, Reuters, WP, and HC3 datasets used in this study are publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, Q.; Li, H. On Continually Tracing Origins of LLM-Generated Text and Its Application in Detecting Cheating in Student Coursework. Big Data Cogn. Comput. 2025, 9, 50–58.
2. DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948.
3. Wang, F.; Wang, A.; Pan, M.; Deng, S.; Qian, Q.; Jia, R.; Zheng, R. Recognizing Large-Scale AIGC on Search Engine Websites Based on Knowledge Integration and Feature Pyramid Network. Proc. Assoc. Inf. Sci. Technol. 2024, 61, 679–684.
4. Ma, J.; Wang, Q.; Zhang, W. Taking ChatGPT as an Example to Explore the New Challenges of Network Security in the AIGC Era. Ind. Inf. Secur. 2025, 2, 62–72.
5. Bao, G.; Zhao, Y.; Teng, Z.; Yang, L.; Zhang, Y. Fast-DetectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; pp. 1–23.
6. Tang, R.; Chuang, Y.; Hu, X. The science of detecting LLM-generated text. Commun. ACM 2024, 67, 50–59.
7. An, B. AI-Generated Text Detection: Challenges and Future Directions. Int. J. Asian Lang. Process. 2023, 33, 2330002–2330008.
8. Solaiman, I.; Brundage, M.; Clark, J.; Askell, A.; Herbert-Voss, A.; Wu, J.; Radford, A.; Krueger, G.; Kim, J.W.; Kreps, S.; et al. Release Strategies and the Social Impacts of Language Models. arXiv 2019, arXiv:1908.09203.
9. Gehrmann, S.; Strobelt, H.; Rush, A.M. GLTR: Statistical Detection and Visualization of Generated Text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations; Association for Computational Linguistics: Florence, Italy, 2019; pp. 111–116.
10. Ippolito, D.; Duckworth, D.; Callison-Burch, C.; Eck, D. Automatic Detection of Generated Text is Easiest when Humans are Fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Florence, Italy, 2020; pp. 1808–1822.
11. Zellers, R.; Holtzman, A.; Rashkin, H.; Bisk, Y.; Farhadi, A.; Roesner, F.; Choi, Y. Defending Against Neural Fake News. In Proceedings of the 33rd International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 9054–9065.
12. Mo, Y.; Qin, H.; Dong, Y.; Zhu, Z.; Li, Z. Large language model (llm) ai text generation detection based on transformer deep learning algorithm. Int. J. Eng. Manag. Res. 2024, 14, 154–159.
13. Alshareef, A.M.; Alsobhi, A.; Khadidos, A.O.; Alyoubi, K.H.; Khadidos, A.O.; Ragab, M. Automated detection of ChatGPT-generated text vs. human text using gannet-optimized deep learning. Alex. Eng. J. 2025, 124, 495–512.
14. Xiong, P.; Yang, X.; Zheng, X.F.; Wu, X.L. Research on the Detection of Elements in AI Generation and Scholar Writing Papers. Artif. Intell. Sci. Eng. 2024, 4, 21–30.
15. Mitchell, E.; Lee, Y.; Khazatsky, A.; Manning, C.D.; Finn, C. DetectGPT: Zero-shot machine-generated text detection using probability curvature. In Proceedings of the 40th International Conference on Machine Learning; PMLR: Vienna, Austria, 2023; pp. 24950–24962.
16. Chakraborty, U.; Gheewala, J.; Deegadwala, S.; Vyas, D.; Soni, M. Safeguarding authenticity in text with BERT-powered detection of AI-generated content. In Proceedings of the International Conference on Inventive Computation Technologies; IEEE: New York, NY, USA, 2024; pp. 34–37.
17. Mao, C.; Vondrick, C.; Wang, H.; Yang, J. Raidar: geneRative AI Detection viA Rewriting. In Proceedings of the International Conference on Learning Representations; OpenReview: Vienna, Austria, 2024; pp. 1–18.
18. Sun, J.; Lv, Z. Zero-shot detection of LLM-generated text via text reorder. Neurocomputing 2025, 63, 129829.
19. Liu, P.; Qiu, X.; Huang, X. Adversarial Multi-task Learning for Text Classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL); Association for Computational Linguistics: Florence, Italy, 2017; pp. 1–10.
20. Xiang, H.; Xue, Y.; Hao, L. Large Language Model-Generated Text Detection Based on Linguistic Feature Ensemble Learning. Netinfo Secur. 2024, 24, 1098–1109.
21. Macko, D.; Moro, R.; Uchendu, A.; Lucas, J.; Yamashita, M.; Pikuliak, M.; Srba, I.; Le, T.; Lee, D.; Simko, J.; et al. Multitude: Large-scale multilingual machine-generated text detection benchmark. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Florence, Italy, 2023; pp. 9960–9987.
22. Krishna, K.; Song, Y.; Karpinska, M.; Wieting, J.; Iyyer, M. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. In Proceedings of the Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 1–32.
23. Bhattacharjee, A.; Kumarage, T.; Moraffah, R.; Liu, H. Contrastive domain adaptation for AI-generated text detection. In Proceedings of the International Joint Conference on Natural Language Processing and the Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Florence, Italy, 2023; pp. 598–610.
24. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
25. Hansen, L.K. Higher-Order Statistics in Machine Learning; MIT Press: Cambridge, MA, USA, 2022; pp. 45–72.
26. Niu, Y.; Fabian, Z.; Lee, S.; Soltanolkotabi, M.; Avestimehr, S. mL-BFGS: A Momentum-based L-BFGS for Distributed Large-scale Neural Network Optimization. Trans. Mach. Learn. Res. 2023, 2023, 967.
27. He, X.; Shen, X.; Chen, Z.; Backes, M.; Zhang, Y. MGTBench: Benchmarking Machine-Generated Text Detection. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security; Association for Computing Machinery: New York, NY, USA, 2024; pp. 2251–2265.
28. Guo, B.; Zhang, X.; Wang, Z.; Jiang, M.; Nie, J.; Ding, Y.; Yue, J.; Wu, Y. How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection. arXiv 2023, arXiv:2301.07597.

Figure 1. Flowchart of TextLogScore feature extraction algorithm. Orange arrows indicate the sequential stages of the primary data processing. Purple arrows represent the integration of an adaptive penalty term into the accumulation (Acc) layer to optimize the final detection score.

Figure 2. Flowchart of the SeqProbScore feature extraction algorithm. Broad blue arrows represent the primary stages from input sentence processing to score generation. Thin green lines indicate the aggregation of adjusted token probabilities into the multiplication (Mul) layer.

Figure 3. Flowchart of TokenRankScore feature extraction. Broad orange arrows represent the primary stages of data transformation, including the calculation of model probability, token ranking, and percentile normalization. Thin teal lines indicate the aggregation of individual token percentiles and sequence-level statistical features into the Merge layer.

Figure 4. Schematic diagram of Dk method. Orange arrows denote the sequential data flow through the three parallel feature extraction modules: SeqProbScore, TextLogScore, and TokenRankScore. Purple brackets indicate the concatenation of these diverse features into a unified feature vector F.

Figure 5. Heat map comparison of F1 score results of six comparison methods on Reuters dataset.

Figure 6. Comparison of F1 scores of various methods on WP dataset.

Figure 7. Comparison of F1 scores in each domain of HC3 dataset.

Table 1. Information of HC3 Chinese dataset.

Category	Total Samples	Training Set Samples	Test Set Samples
finance	689	551	138
open_qa	3293	2634	659
baike	4617	3694	923
nlpcc_dbqa	1709	1367	342
medicine	1074	859	215
psychology	1099	879	220
law	372	298	74
Overall	13,255	10,604	2651

Table 2. Experimental results of Dk method on Reuters dataset.

Model Name	Accuracy	Precision	Recall	F1
ChatGPT-turbo	0.965 ± 0.010	0.960 ± 0.011	0.965 ± 0.010	0.965 ± 0.010
Claude	0.790 ± 0.018	0.790 ± 0.019	0.790 ± 0.018	0.789 ± 0.018
ChatGLM	0.972 ± 0.008	0.960 ± 0.009	0.972 ± 0.008	0.970 ± 0.008
Dolly	0.652 ± 0.022	0.658 ± 0.021	0.652 ± 0.022	0.649 ± 0.022
ChatGPT	0.930 ± 0.012	0.930 ± 0.013	0.920 ± 0.014	0.930 ± 0.012
GPT4All	0.767 ± 0.016	0.760 ± 0.017	0.770 ± 0.016	0.765 ± 0.016

Table 3. Comparison of F1 scores between Dk and other methods on Reuters dataset.

Model\Method	Log-Likelihood	Rank	Entropy	GLTR	NPR	DetectGPT	Dk
ChatGPT-turbo	0.926	0.847	0.703	0.946	0.284	0.27	0.965 ± 0.010 ***
Claude	0.798	0.648	0.694	0.772	0.560	0.558	0.789 ± 0.018 ^†
ChatGLM	0.972	0.650	0.477	0.987	0.950	0.866	0.970 ± 0.008 ***
Dolly	0.381	0.413	0.553	0.556	0.790	0.782	0.649 ± 0.022 ***
ChatGPT	0.659	0.635	0.620	0.75	0.751	0.75	0.930 ± 0.012 ***
GPT4All	0.697	0.665	0.668	0.742	0.84	0.821	0.765 ± 0.016 ***

Note: Values are Mean ± SD over 5-fold stratified cross-validation. *** p < 0.001, ^† p < 0.10. p-values are derived from t-tests comparing our method’s fold results to the baseline values.

Table 4. Experimental results of Dk method on WP dataset.

Model Name	Accuracy	Precision	Recall	F1
ChatGPT-turbo	0.962 ± 0.011	0.963 ± 0.012	0.962 ± 0.011	0.962 ± 0.011
Claude	0.787 ± 0.019	0.721 ± 0.025	0.787 ± 0.019	0.782 ± 0.019
ChatGLM	0.992 ± 0.005	0.993 ± 0.006	0.992 ± 0.007	0.992 ± 0.007
Dolly	0.815 ± 0.020	0.801 ± 0.021	0.813 ± 0.020	0.810 ± 0.020
ChatGPT	0.879 ± 0.014	0.879 ± 0.014	0.878 ± 0.015	0.878 ± 0.014
GPT4All	0.938 ± 0.013	0.939 ± 0.013	0.938 ± 0.013	0.938 ± 0.013

Table 5. Comparison of F1 scores between Dk and other methods on WP dataset.

Model\Method	Log-Likelihood	Rank	Entropy	GLTR	NPR	DetectGPT	Dk
ChatGPT-turbo	0.841	0.797	0.770	0.800	0.352	0.608	0.962 ± 0.011 ***
Claude	0.773	0.709	0.731	0.733	0.521	0.517	0.782 ± 0.019 *
ChatGLM	0.980	0.840	0.800	0.983	0.970	0.812	0.992 ± 0.007 ***
Dolly	0.794	0.760	0.662	0.766	0.801	0.719	0.810 ± 0.020 ***
ChatGPT	0.786	0.781	0.644	0.861	0.764	0.695	0.878 ± 0.014 ***
GPT4All	0.934	0.891	0.766	0.935	0.905	0.808	0.938 ± 0.013 ***

Note: Values are Mean ± SD over 5-fold stratified cross-validation. * p < 0.05, *** p < 0.001. p-values are derived from t-tests comparing our method’s fold results to the baseline values.

Table 6. Detection effect of Dk method on Chinese text dataset (HC3).

Category	Accuracy	Precision	Recall	F1	AUC
finance	0.9372 ± 0.0085	0.9374 ± 0.0084	0.9372 ± 0.0085	0.9371 ± 0.0085	0.9785 ± 0.0042
open_qa	0.9505 ± 0.0072	0.9504 ± 0.0073	0.9505 ± 0.0072	0.9505 ± 0.0072	0.9831 ± 0.0035
baike	0.8581 ± 0.0156	0.8598 ± 0.0154	0.8581 ± 0.0156	0.8580 ± 0.0156	0.9271 ± 0.0098
nlpcc_dbqa	0.8246 ± 0.0182	0.8229 ± 0.0185	0.8246 ± 0.0182	0.8147 ± 0.0191	0.8727 ± 0.0143
medicine	0.9721 ± 0.0058	0.9721 ± 0.0058	0.9721 ± 0.0058	0.9721 ± 0.0058	0.9972 ± 0.0018
psychology	0.9899 ± 0.0032	0.9899 ± 0.0032	0.9899 ± 0.0032	0.9899 ± 0.0032	0.9975 ± 0.0015
law	0.9298 ± 0.0098	0.9303 ± 0.0097	0.9298 ± 0.0098	0.9299 ± 0.0098	0.9646 ± 0.0059
Overall	0.8963 ± 0.0121	0.8967 ± 0.0120	0.8963 ± 0.0121	0.8964 ± 0.0121	0.9550 ± 0.0073

Table 7. Correlations of three features across datasets.

Dataset	TextLogScore–SeqProbScore	TextLogScore–TokenRankScore	SeqProbScore–TokenRankScore
Reuters	−0.70	0.72	−0.52
Essay	−0.66	0.68	−0.45
WP	−0.65	0.64	−0.44
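
Table 7 does not state which correlation coefficient was used. As a minimal sketch, assuming Pearson correlation over per-text feature values, the pairwise coefficients could be computed as follows. The feature values are hypothetical and serve only to make the example runnable; they are not meant to reproduce the signs or magnitudes in the table.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-text feature values, for illustration only.
features = {
    "TextLogScore": [-2.1, -1.4, -3.0, -0.9, -2.6],
    "SeqProbScore": [0.30, 0.22, 0.41, 0.15, 0.35],
    "TokenRankScore": [5.2, 3.9, 7.5, 3.1, 6.0],
}

# Print one coefficient per feature pair, in the column order of Table 7.
names = list(features)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a} vs {b}: {pearson(features[a], features[b]):+.2f}")
```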

Table 8. Ablation study results (F1 Score) on WP dataset.

Feature Configuration	ChatGPT-turbo	Claude	ChatGLM
TextLogScore	0.922 ± 0.008	0.742 ± 0.018	0.975 ± 0.005
SeqProbScore	0.900 ± 0.010	0.740 ± 0.020	0.923 ± 0.006
TokenRankScore	0.922 ± 0.009	0.727 ± 0.022	0.945 ± 0.006
TextLogScore + SeqProbScore	0.944 ± 0.007	0.755 ± 0.015	0.990 ± 0.004
TextLogScore + TokenRankScore	0.945 ± 0.006	0.756 ± 0.014	0.992 ± 0.003
SeqProbScore + TokenRankScore	0.944 ± 0.008	0.804 ± 0.012	0.991 ± 0.003
All_Features	0.962 ± 0.011	0.782 ± 0.019	0.992 ± 0.007
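
The seven rows of Table 8 correspond to every non-empty subset of the three features. A short sketch of how such an ablation grid could be enumerated is given below; the function name `feature_configurations` is illustrative, and the training and scoring of each configuration is left out because the article's classifier code is not shown here.

```python
from itertools import combinations

FEATURES = ["TextLogScore", "SeqProbScore", "TokenRankScore"]


def feature_configurations(features):
    """Yield every non-empty feature subset, smallest subsets first."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset


configs = list(feature_configurations(FEATURES))
for subset in configs:
    # A real ablation run would train and evaluate a classifier on `subset` here.
    print(" + ".join(subset))
```

The enumeration order (single features, then pairs, then all three) matches the row order of Table 8.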

Table 9. Feature importance based on standardized coefficients.

Feature	Avg. Coefficient	Avg. Rank
TextLogScore	−5.67	1.0
SeqProbScore	3.12	2.0
TokenRankScore	−1.02	3.1

Table 10. Cross-dataset generalization results.

Train Set\Test Set	Essay	Reuters	WP
Essay	0.9583	0.9433	0.8368
Reuters	0.9467	0.9433	0.8666
WP	0.7758	0.7771	0.9567
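
One way to read Table 10 is to compare each in-domain score (the diagonal) with the mean score on the two datasets not seen during training. The sketch below does this arithmetic on the values copied from the table; the name `generalization_gap` and the choice of a simple mean are assumptions made for illustration.

```python
# Scores copied from Table 10: outer key = training set, inner key = test set.
scores = {
    "Essay":   {"Essay": 0.9583, "Reuters": 0.9433, "WP": 0.8368},
    "Reuters": {"Essay": 0.9467, "Reuters": 0.9433, "WP": 0.8666},
    "WP":      {"Essay": 0.7758, "Reuters": 0.7771, "WP": 0.9567},
}


def generalization_gap(scores, train):
    """In-domain score minus the mean score on the other (unseen) datasets."""
    in_domain = scores[train][train]
    out = [v for test, v in scores[train].items() if test != train]
    return in_domain - sum(out) / len(out)


for train in scores:
    print(f"train on {train}: gap = {generalization_gap(scores, train):.4f}")
```

On these numbers the gap is largest when training on WP (about 0.18) and smallest when training on Reuters (about 0.04).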

Table 11. Cross-model generalization (LOMO) results.

Target Model (Unseen)	F1-Score	AUC	Recall
ChatGLM	0.9418	0.9964	0.9990
ChatGPT	0.9259	0.9801	0.9600
GPT4All	0.9025	0.9559	0.9080
ChatGPT-turbo	0.8744	0.9392	0.8490
Dolly	0.7919	0.8658	0.6767
StableLM	0.7250	0.8243	0.5401
Claude	0.7219	0.8236	0.5500
Mean	0.8405	0.9128	0.7833
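
In a leave-one-model-out (LOMO) protocol, the detector is trained on text from all generators except one and tested on the held-out generator. The sketch below shows one way to build such splits. The data layout (a list of `(text, source)` pairs), the `"human"` label, and the round-robin assignment of human texts to test folds are all assumptions; the article does not specify how human texts were divided.

```python
def lomo_splits(samples):
    """Yield (held_out_model, train, test) for leave-one-model-out evaluation.

    `samples` is a list of (text, source) pairs, where source is "human" or
    the name of the generating model (assumed layout).
    """
    generators = sorted({src for _, src in samples if src != "human"})
    human = [s for s in samples if s[1] == "human"]
    for k, held_out in enumerate(generators):
        # Assumption: human texts are dealt round-robin so that each fold
        # tests on human texts that were not used for training in that fold.
        test_human = [s for i, s in enumerate(human) if i % len(generators) == k]
        train_human = [s for i, s in enumerate(human) if i % len(generators) != k]
        train = train_human + [s for s in samples if s[1] not in ("human", held_out)]
        test = test_human + [s for s in samples if s[1] == held_out]
        yield held_out, train, test


# Tiny hypothetical corpus: six human texts and two texts per generator.
samples = [(f"human text {i}", "human") for i in range(6)]
samples += [(f"{m} text {i}", m) for m in ("ChatGLM", "Claude", "Dolly") for i in range(2)]

for held_out, train, test in lomo_splits(samples):
    print(f"held out {held_out}: {len(train)} train, {len(test)} test")
```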

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
