3. Detection of Text Generated by Large Language Models Based on DeepSeek-R1 Multi-Feature Fusion
To quickly and automatically distinguish between human-generated and AI-generated text, this paper proposes a detection method (hereinafter referred to as the Dk method) based on a DeepSeek-R1 pre-trained language model and logistic regression, combining multi-dimensional features such as text generation probability. The core of this method includes three main steps: feature extraction, model training, and classification.
3.3. Model Training
After feature extraction, this study trains a logistic regression model on the extracted features. Logistic regression is a classic binary classification algorithm with strong interpretability, a simple model structure, and high computational efficiency. These properties suit practical application scenarios and match this study's goal of efficient, interpretable text detection.
In the experiment, the dataset is divided into a training set and a test set, with the test set accounting for 20%. When dividing the dataset, a stratified sampling strategy (stratify = labels) is adopted to ensure that the proportion of human-generated text and text generated by large language models in the training set and test set is consistent, and the random seed is set to 42 (random_state = 42) to ensure the reproducibility of the experiment. This helps to avoid model bias caused by unbalanced data distribution, allowing the model to fully learn the features of different types of text during training, thereby improving the generalization ability and stability of the model.
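The stratified 80/20 split described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' code: the synthetic `features` and `labels` arrays and the 30/70 class ratio are assumptions made only so that the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 3 features per text,
# 300 samples labelled 1 and 700 samples labelled 0.
rng = np.random.default_rng(0)
labels = np.array([1] * 300 + [0] * 700)
features = rng.normal(size=(labels.size, 3))

# 20% test set, class proportions preserved, fixed seed for reproducibility.
F_train, F_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)

# The share of label 1 should be the same in both subsets.
print(len(y_train), len(y_test))
print(y_train.mean(), y_test.mean())
```

Comparing the two printed class shares is a quick check that stratification kept the label distribution consistent between the training and test sets.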
For classification, the logistic regression model is configured with L2 regularization (penalty = 'l2') to control model complexity and uses the L-BFGS algorithm [26] (solver = 'lbfgs') to solve the convex optimization problem efficiently, with the maximum number of iterations set to 100. The regularization strength C was optimized via a grid search (C ∈ {0.01, 0.1, 1, 10, 100}) with 5-fold cross-validation, yielding optimal values of 1 for the Essay dataset and 0.1 for the Reuters and WP datasets. The training procedure of the logistic regression model is as follows:
Input: Text feature matrix F, label vector y, DeepSeek-R1 model, Tokenizer
Output: Trained logistic regression classifier lr_model
1. Divide training set and test set:
F_train, F_test, y_train, y_test = Stratified sampling division (test set accounts for 20%, random seed = 42)
2. Initialize logistic regression model:
lr_model = LogisticRegression(
    penalty = 'l2',
    C = regularization_strength,
    solver = 'lbfgs',
    max_iter = max_iterations)
3. Model training:
lr_model.fit(F_train, y_train)
4. Return the trained model:
return lr_model
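The procedure above, together with the grid search over C, might look like the following in scikit-learn. This is a hedged sketch rather than the authors' implementation: the helper name `train_lr`, the synthetic three-column feature matrix, and the choice of F1 as the cross-validation scoring metric are assumptions, since the text does not state which metric guided the search.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split


def train_lr(F, y, c_grid=(0.01, 0.1, 1, 10, 100), max_iterations=100):
    """Split the data, tune C by 5-fold CV on the training split, return the model."""
    F_train, F_test, y_train, y_test = train_test_split(
        F, y, test_size=0.2, stratify=y, random_state=42
    )
    # L2 is scikit-learn's default penalty for LogisticRegression,
    # so it is not passed explicitly here.
    base = LogisticRegression(solver="lbfgs", max_iter=max_iterations)
    # scoring="f1" is an assumption; the paper does not name the CV metric.
    search = GridSearchCV(base, {"C": list(c_grid)}, cv=5, scoring="f1")
    search.fit(F_train, y_train)
    lr_model = search.best_estimator_  # refit on the whole training split
    return lr_model, search.best_params_["C"], lr_model.score(F_test, y_test)


# Synthetic stand-in for the three-feature matrix F (assumption).
rng = np.random.default_rng(0)
y = np.array([0] * 200 + [1] * 200)
F = rng.normal(size=(400, 3)) + y[:, None]  # shift class 1 by +1 in each feature

lr_model, best_c, test_acc = train_lr(F, y)
print(best_c, round(test_acc, 3))
```

Tuning C only on the training split keeps the 20% test set untouched until the final evaluation, which is one common way to avoid leaking test information into model selection.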
Author Contributions
Conceptualization, X.B. and M.T.; methodology, X.B.; software, X.B.; validation, X.B., J.W. and J.Z.; formal analysis, X.B.; resources, M.T., J.W. and P.L.; data curation, X.B. and J.Z.; writing—original draft preparation, X.B.; writing—review and editing, X.B., M.T., J.W., J.Z. and P.L.; visualization, X.B.; supervision, M.T.; project administration, M.T.; funding acquisition, M.T. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by the Kunlun Talent Project.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The Essay, Reuters, WP, and HC3 datasets used in this study are publicly available.
Conflicts of Interest
The authors declare no conflicts of interest.
References
1. Wang, Q.; Li, H. On Continually Tracing Origins of LLM-Generated Text and Its Application in Detecting Cheating in Student Coursework. Big Data Cogn. Comput. 2025, 9, 50–58.
2. DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948.
3. Wang, F.; Wang, A.; Pan, M.; Deng, S.; Qian, Q.; Jia, R.; Zheng, R. Recognizing Large-Scale AIGC on Search Engine Websites Based on Knowledge Integration and Feature Pyramid Network. Proc. Assoc. Inf. Sci. Technol. 2024, 61, 679–684.
4. Ma, J.; Wang, Q.; Zhang, W. Taking ChatGPT as an Example to Explore the New Challenges of Network Security in the AIGC Era. Ind. Inf. Secur. 2025, 2, 62–72.
5. Bao, G.; Zhao, Y.; Teng, Z.; Yang, L.; Zhang, Y. Fast-DetectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; pp. 1–23.
6. Tang, R.; Chuang, Y.; Hu, X. The science of detecting LLM-generated text. Commun. ACM 2024, 67, 50–59.
7. An, B. AI-Generated Text Detection: Challenges and Future Directions. Int. J. Asian Lang. Process. 2023, 33, 2330002–2330008.
8. Solaiman, I.; Brundage, M.; Clark, J.; Askell, A.; Herbert-Voss, A.; Wu, J.; Radford, A.; Krueger, G.; Kim, J.W.; Kreps, S.; et al. Release Strategies and the Social Impacts of Language Models. arXiv 2019, arXiv:1908.09203.
9. Gehrmann, S.; Strobelt, H.; Rush, A.M. GLTR: Statistical Detection and Visualization of Generated Text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations; Association for Computational Linguistics: Florence, Italy, 2019; pp. 111–116.
10. Ippolito, D.; Duckworth, D.; Callison-Burch, C.; Eck, D. Automatic Detection of Generated Text is Easiest when Humans are Fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Florence, Italy, 2020; pp. 1808–1822.
11. Zellers, R.; Holtzman, A.; Rashkin, H.; Bisk, Y.; Farhadi, A.; Roesner, F.; Choi, Y. Defending Against Neural Fake News. In Proceedings of the 33rd International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 9054–9065.
12. Mo, Y.; Qin, H.; Dong, Y.; Zhu, Z.; Li, Z. Large language model (LLM) AI text generation detection based on transformer deep learning algorithm. Int. J. Eng. Manag. Res. 2024, 14, 154–159.
13. Alshareef, A.M.; Alsobhi, A.; Khadidos, A.O.; Alyoubi, K.H.; Khadidos, A.O.; Ragab, M. Automated detection of ChatGPT-generated text vs. human text using gannet-optimized deep learning. Alex. Eng. J. 2025, 124, 495–512.
14. Xiong, P.; Yang, X.; Zheng, X.F.; Wu, X.L. Research on the Detection of Elements in AI Generation and Scholar Writing Papers. Artif. Intell. Sci. Eng. 2024, 4, 21–30.
15. Mitchell, E.; Lee, Y.; Khazatsky, A.; Manning, C.D.; Finn, C. DetectGPT: Zero-shot machine-generated text detection using probability curvature. In Proceedings of the 40th International Conference on Machine Learning; PMLR: Vienna, Austria, 2023; pp. 24950–24962.
16. Chakraborty, U.; Gheewala, J.; Deegadwala, S.; Vyas, D.; Soni, M. Safeguarding authenticity in text with BERT-powered detection of AI-generated content. In Proceedings of the International Conference on Inventive Computation Technologies; IEEE: New York, NY, USA, 2024; pp. 34–37.
17. Mao, C.; Vondrick, C.; Wang, H.; Yang, J. Raidar: geneRative AI Detection viA Rewriting. In Proceedings of the International Conference on Learning Representations; OpenReview: Vienna, Austria, 2024; pp. 1–18.
18. Sun, J.; Lv, Z. Zero-shot detection of LLM-generated text via text reorder. Neurocomputing 2025, 63, 129829.
19. Liu, P.; Qiu, X.; Huang, X. Adversarial Multi-task Learning for Text Classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL); Association for Computational Linguistics: Florence, Italy, 2017; pp. 1–10.
20. Xiang, H.; Xue, Y.; Hao, L. Large Language Model-Generated Text Detection Based on Linguistic Feature Ensemble Learning. Netinfo Secur. 2024, 24, 1098–1109.
21. Macko, D.; Moro, R.; Uchendu, A.; Lucas, J.; Yamashita, M.; Pikuliak, M.; Srba, I.; Le, T.; Lee, D.; Simko, J.; et al. Multitude: Large-scale multilingual machine-generated text detection benchmark. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Florence, Italy, 2023; pp. 9960–9987.
22. Krishna, K.; Song, Y.; Karpinska, M.; Wieting, J.; Iyyer, M. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. In Proceedings of the Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 1–32.
23. Bhattacharjee, A.; Kumarage, T.; Moraffah, R.; Liu, H. Contrastive domain adaptation for AI-generated text detection. In Proceedings of the International Joint Conference on Natural Language Processing and the Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Florence, Italy, 2023; pp. 598–610.
24. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
25. Hansen, L.K. Higher-Order Statistics in Machine Learning; MIT Press: Cambridge, MA, USA, 2022; pp. 45–72.
26. Niu, Y.; Fabian, Z.; Lee, S.; Soltanolkotabi, M.; Avestimehr, S. mL-BFGS: A Momentum-based L-BFGS for Distributed Large-scale Neural Network Optimization. Trans. Mach. Learn. Res. 2023, 2023, 967.
27. He, X.; Shen, X.; Chen, Z.; Backes, M.; Zhang, Y. MGTBench: Benchmarking Machine-Generated Text Detection. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security; Association for Computing Machinery: New York, NY, USA, 2024; pp. 2251–2265.
28. Guo, B.; Zhang, X.; Wang, Z.; Jiang, M.; Nie, J.; Ding, Y.; Yue, J.; Wu, Y. How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection. arXiv 2023, arXiv:2301.07597.
Figure 1.
Flowchart of TextLogScore feature extraction algorithm. Orange arrows indicate the sequential stages of the primary data processing. Purple arrows represent the integration of an adaptive penalty term into the accumulation (Acc) layer to optimize the final detection score.
Figure 2.
Flowchart of the SeqProbScore feature extraction algorithm. Broad blue arrows represent the primary stages from input sentence processing to score generation. Thin green lines indicate the aggregation of adjusted token probabilities into the multiplication (Mul) layer.
Figure 3.
Flowchart of TokenRankScore feature extraction. Broad orange arrows represent the primary stages of data transformation, including the calculation of model probability, token ranking, and percentile normalization. Thin teal lines indicate the aggregation of individual token percentiles and sequence-level statistical features into the Merge layer.
Figure 4.
Schematic diagram of the Dk method. Orange arrows denote the sequential data flow through the three parallel feature extraction modules: SeqProbScore, TextLogScore, and TokenRankScore. Purple brackets indicate the concatenation of these features into a unified feature vector F.
Figure 5.
Heat map of the F1 scores of the six comparison methods on the Reuters dataset.
Figure 6.
Comparison of the F1 scores of the various methods on the WP dataset.
Figure 7.
Comparison of F1 scores in each domain of the HC3 dataset.
Table 1.
Information on the HC3 Chinese dataset.
| Category | Total Samples | Training Set Samples | Test Set Samples |
|---|---|---|---|
| finance | 689 | 551 | 138 |
| open_qa | 3293 | 2634 | 659 |
| baike | 4617 | 3694 | 923 |
| nlpcc_dbqa | 1709 | 1367 | 342 |
| medicine | 1074 | 859 | 215 |
| psychology | 1099 | 879 | 220 |
| law | 372 | 298 | 74 |
| Overall | 13,255 | 10,604 | 2651 |
Table 2.
Experimental results of the Dk method on the Reuters dataset.
| Model Name | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| ChatGPT-turbo | 0.965 ± 0.010 | 0.960 ± 0.011 | 0.965 ± 0.010 | 0.965 ± 0.010 |
| Claude | 0.790 ± 0.018 | 0.790 ± 0.019 | 0.790 ± 0.018 | 0.789 ± 0.018 |
| ChatGLM | 0.972 ± 0.008 | 0.960 ± 0.009 | 0.972 ± 0.008 | 0.970 ± 0.008 |
| Dolly | 0.652 ± 0.022 | 0.658 ± 0.021 | 0.652 ± 0.022 | 0.649 ± 0.022 |
| ChatGPT | 0.930 ± 0.012 | 0.930 ± 0.013 | 0.920 ± 0.014 | 0.930 ± 0.012 |
| GPT4All | 0.767 ± 0.016 | 0.760 ± 0.017 | 0.770 ± 0.016 | 0.765 ± 0.016 |
Table 3.
Comparison of F1 scores between Dk and other methods on the Reuters dataset.
| Model\Method | Log-Likelihood | Rank | Entropy | GLTR | NPR | DetectGPT | Dk |
|---|---|---|---|---|---|---|---|
| ChatGPT-turbo | 0.926 | 0.847 | 0.703 | 0.946 | 0.284 | 0.27 | 0.965 ± 0.010 *** |
| Claude | 0.798 | 0.648 | 0.694 | 0.772 | 0.560 | 0.558 | 0.789 ± 0.018 † |
| ChatGLM | 0.972 | 0.650 | 0.477 | 0.987 | 0.950 | 0.866 | 0.970 ± 0.008 *** |
| Dolly | 0.381 | 0.413 | 0.553 | 0.556 | 0.790 | 0.782 | 0.649 ± 0.022 *** |
| ChatGPT | 0.659 | 0.635 | 0.620 | 0.75 | 0.751 | 0.75 | 0.930 ± 0.012 *** |
| GPT4All | 0.697 | 0.665 | 0.668 | 0.742 | 0.84 | 0.821 | 0.765 ± 0.016 *** |
Table 4.
Experimental results of the Dk method on the WP dataset.
| Model Name | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| ChatGPT-turbo | 0.962 ± 0.011 | 0.963 ± 0.012 | 0.962 ± 0.011 | 0.962 ± 0.011 |
| Claude | 0.787 ± 0.019 | 0.721 ± 0.025 | 0.787 ± 0.019 | 0.782 ± 0.019 |
| ChatGLM | 0.992 ± 0.005 | 0.993 ± 0.006 | 0.992 ± 0.007 | 0.992 ± 0.007 |
| Dolly | 0.815 ± 0.020 | 0.801 ± 0.021 | 0.813 ± 0.020 | 0.810 ± 0.020 |
| ChatGPT | 0.879 ± 0.014 | 0.879 ± 0.014 | 0.878 ± 0.015 | 0.878 ± 0.014 |
| GPT4All | 0.938 ± 0.013 | 0.939 ± 0.013 | 0.938 ± 0.013 | 0.938 ± 0.013 |
Table 5.
Comparison of F1 scores between Dk and other methods on the WP dataset.
| Model\Method | Log-Likelihood | Rank | Entropy | GLTR | NPR | DetectGPT | Dk |
|---|---|---|---|---|---|---|---|
| ChatGPT-turbo | 0.841 | 0.797 | 0.770 | 0.800 | 0.352 | 0.608 | 0.962 ± 0.011 *** |
| Claude | 0.773 | 0.709 | 0.731 | 0.733 | 0.521 | 0.517 | 0.782 ± 0.019 * |
| ChatGLM | 0.980 | 0.840 | 0.800 | 0.983 | 0.970 | 0.812 | 0.992 ± 0.007 *** |
| Dolly | 0.794 | 0.760 | 0.662 | 0.766 | 0.801 | 0.719 | 0.810 ± 0.020 *** |
| ChatGPT | 0.786 | 0.781 | 0.644 | 0.861 | 0.764 | 0.695 | 0.878 ± 0.014 *** |
| GPT4All | 0.934 | 0.891 | 0.766 | 0.935 | 0.905 | 0.808 | 0.938 ± 0.013 *** |
Table 6.
Detection performance of the Dk method on the Chinese text dataset (HC3).
| Category | Accuracy | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|
| finance | 0.9372 ± 0.0085 | 0.9374 ± 0.0084 | 0.9372 ± 0.0085 | 0.9371 ± 0.0085 | 0.9785 ± 0.0042 |
| open_qa | 0.9505 ± 0.0072 | 0.9504 ± 0.0073 | 0.9505 ± 0.0072 | 0.9505 ± 0.0072 | 0.9831 ± 0.0035 |
| baike | 0.8581 ± 0.0156 | 0.8598 ± 0.0154 | 0.8581 ± 0.0156 | 0.8580 ± 0.0156 | 0.9271 ± 0.0098 |
| nlpcc_dbqa | 0.8246 ± 0.0182 | 0.8229 ± 0.0185 | 0.8246 ± 0.0182 | 0.8147 ± 0.0191 | 0.8727 ± 0.0143 |
| medicine | 0.9721 ± 0.0058 | 0.9721 ± 0.0058 | 0.9721 ± 0.0058 | 0.9721 ± 0.0058 | 0.9972 ± 0.0018 |
| psychology | 0.9899 ± 0.0032 | 0.9899 ± 0.0032 | 0.9899 ± 0.0032 | 0.9899 ± 0.0032 | 0.9975 ± 0.0015 |
| law | 0.9298 ± 0.0098 | 0.9303 ± 0.0097 | 0.9298 ± 0.0098 | 0.9299 ± 0.0098 | 0.9646 ± 0.0059 |
| Overall | 0.8963 ± 0.0121 | 0.8967 ± 0.0120 | 0.8963 ± 0.0121 | 0.8964 ± 0.0121 | 0.9550 ± 0.0073 |
Table 7.
Correlations of three features across datasets.
| Dataset | TextLogScore–SeqProbScore | TextLogScore–TokenRankScore | SeqProbScore–TokenRankScore |
|---|---|---|---|
| Reuters | −0.70 | 0.72 | −0.52 |
| Essay | −0.66 | 0.68 | −0.45 |
| WP | −0.65 | 0.64 | −0.44 |
Table 8.
Ablation study results (F1 score) on the WP dataset.
| Feature Configuration | ChatGPT-Turbo | Claude | ChatGLM |
|---|---|---|---|
| TextLogScore | 0.922 ± 0.008 | 0.742 ± 0.018 | 0.975 ± 0.005 |
| SeqProbScore | 0.900 ± 0.010 | 0.740 ± 0.020 | 0.923 ± 0.006 |
| TokenRankScore | 0.922 ± 0.009 | 0.727 ± 0.022 | 0.945 ± 0.006 |
| TextLogScore + SeqProbScore | 0.944 ± 0.007 | 0.755 ± 0.015 | 0.990 ± 0.004 |
| TextLogScore + TokenRankScore | 0.945 ± 0.006 | 0.756 ± 0.014 | 0.992 ± 0.003 |
| SeqProbScore + TokenRankScore | 0.944 ± 0.008 | 0.804 ± 0.012 | 0.991 ± 0.003 |
| All_Features | 0.962 ± 0.011 | 0.782 ± 0.019 | 0.992 ± 0.007 |
Table 9.
Feature importance based on standardized coefficients.
| Feature | Avg. Coefficient | Avg. Rank |
|---|---|---|
| TextLogScore | −5.67 | 1.0 |
| SeqProbScore | 3.12 | 2.0 |
| TokenRankScore | −1.02 | 3.1 |
Table 10.
Cross-dataset generalization results.
| Train Set\Test Set | Essay | Reuters | WP |
|---|---|---|---|
| Essay | 0.9583 | 0.9433 | 0.8368 |
| Reuters | 0.9467 | 0.9433 | 0.8666 |
| WP | 0.7758 | 0.7771 | 0.9567 |
Table 11.
Cross-model generalization (LOMO) results.
| Target Model (Unseen) | F1-Score | AUC | Recall |
|---|---|---|---|
| ChatGLM | 0.9418 | 0.9964 | 0.9990 |
| ChatGPT | 0.9259 | 0.9801 | 0.9600 |
| GPT4All | 0.9025 | 0.9559 | 0.9080 |
| ChatGPT-turbo | 0.8744 | 0.9392 | 0.8490 |
| Dolly | 0.7919 | 0.8658 | 0.6767 |
| StableLM | 0.7250 | 0.8243 | 0.5401 |
| Claude | 0.7219 | 0.8236 | 0.5500 |
| Mean (Average) | 0.8405 | 0.9128 | 0.7833 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.