Authorship Attribution of Social Media and Literary Russian-Language Texts Using Machine Learning Methods and Feature Selection
Abstract
1. Introduction
- Authorship identification. The task reduces to a multi-class, single-label text classification problem. The classic version is closed-set attribution: the true author of the disputed text is known to be among the candidate authors. In the more complicated open-set case, the attribution system must first establish whether the true author is present in the list of candidates and, if so, determine who it is.
- Authorship verification. The task reduces to a one-class classification problem: deciding whether two documents were written by the same person.
- Authorship clustering. This is the most difficult task: many texts must be grouped by author when the number of candidate authors is unknown.
- Authorship profiling. Classification by additional characteristics of the author, such as gender, age, or educational level.
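The difference between the closed-set and open-set variants of identification can be illustrated with a minimal sketch. The candidate names, the scores, and the 0.6 rejection threshold are hypothetical; the text does not specify how open-set rejection is implemented, so a simple score threshold is assumed here.

```python
from typing import Optional


def identify_closed_set(scores: dict[str, float]) -> str:
    """Closed-set attribution: the true author is assumed to be a candidate,
    so the highest-scoring candidate is always returned."""
    return max(scores, key=scores.get)


def identify_open_set(scores: dict[str, float], threshold: float) -> Optional[str]:
    """Open-set attribution (assumed thresholding scheme): return None when
    no candidate scores high enough to be accepted as the author."""
    best = identify_closed_set(scores)
    return best if scores[best] >= threshold else None


# Hypothetical classifier scores for one disputed text
scores = {"author_a": 0.46, "author_b": 0.31, "author_c": 0.23}
closed_result = identify_closed_set(scores)             # always names a candidate
open_result = identify_open_set(scores, threshold=0.6)  # may reject all candidates
```

In the closed-set call a candidate is always returned, while the open-set call can report that none of the candidates is a convincing match.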
2. Related Works
2.1. Related Works on Classical Machine Learning Methods and Deep Neural Networks for Authorship Attribution
2.2. Related Works on Identifying the Author of a Russian-Language Text
2.3. Related Works on Using FastText for Authorship Attribution
2.4. Related Works on Feature Selection
3. Methods Used for Attribution
3.1. Classical Machine Learning Methods
3.1.1. Support Vector Machine
3.1.2. K-Nearest Neighbors Algorithm
3.1.3. Logistic Regression
3.1.4. Naive Bayes Classifier
3.1.5. Decision Trees
3.1.6. Random Forest
3.2. Deep Neural Networks
3.2.1. LSTM and BiLSTM
3.2.2. CNN
3.2.3. Hybrid Neural Network Models
3.2.4. BERT
3.3. FastText
4. Experimental Setup
4.1. Datasets Description
4.2. Text Preprocessing and Encoding
- Converting all letters to lowercase;
- Removal of stop-words;
- Removal of digits (numbers) and special characters;
- Normalizing whitespace.
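The preprocessing steps listed above can be sketched in a few lines of standard-library Python. The stop-word set is a tiny illustrative placeholder, not the list used in the experiments, and the order of operations (special characters are removed before stop-words so that punctuation attached to a word does not hide it) is an assumption of this sketch.

```python
import re

# Tiny illustrative stop-word list (assumption); a real run would use a full Russian list.
STOP_WORDS = {"и", "в", "на", "не", "что", "это"}


def preprocess(text: str) -> str:
    # Convert all letters to lowercase
    text = text.lower()
    # Replace digits and special characters with spaces, keeping Latin and Cyrillic letters
    text = re.sub(r"[^a-zа-яё\s]", " ", text)
    # Remove stop-words
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    # Joining on single spaces normalizes the whitespace
    return " ".join(tokens)


cleaned = preprocess("Это  было в 1999 году, и Он не знал!")
# cleaned == "было году он знал"
```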
4.3. Parameters of Methods
- SVM was trained with the sequential optimization method and a linear kernel. The regularization parameter was 1, and the acceptable error rate was 0.00001. Normalization and compression heuristics were enabled as additional options.
- For KNN, the following values of the parameter k were used: 3, 5, 7, 15, and 25.
- LR was trained with the liblinear optimization algorithm, a regularization parameter of 1, a stopping tolerance of 1 × 10⁻⁴, and a limit of 100 iterations.
- For DT training, Gini impurity was used as the split quality function, and the maximum tree depth was 8.
- For RF training, 5, 15, 25, 35, and 50 decision trees were used.
- LSTM and Bidirectional LSTM used 128 filters. Dropout and recurrent dropout were both 0.3. The rectified linear unit (ReLU) was selected as the activation function.
- The CNN used 1024 convolution filters, with GlobalMaxPooling as the pooling layer. The CNN with CNN hybrid also included a network with 512 convolution filters. To prevent overfitting, a spatial dropout of 0.2 was used. The activation function was ReLU, as for LSTM.
- The LSTM with CNN and CNN with LSTM hybrids used 256 convolution filters, 128 recurrent filters, and a kernel size of 3. The activation function was ReLU, and dropout was applied as for LSTM and BiLSTM.
- For fastText, the n-gram range was 2–4. The learning rate was 0.6, and the vector dimension was 50 for short texts and 500 for long texts. The loss function was ‘ova’ (one-vs-all loss, which fastText provides for multi-label classification). The maximum number of allocated memory segments was 2,000,000. The remaining parameters were left at their defaults.
- BERT was trained with the tokenizers “bert-base-multilingual-cased” and “rubert-base-cased”. ReLU was the activation function for the hidden layers, Softmax the activation function for the output layer, and Adam the optimization algorithm. Dropout of 0.1 was used for regularization. The learning rate (lr) was 4 × 10⁻⁵, and the number of epochs was 5.
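Of the methods configured above, KNN is simple enough to sketch without external libraries. The example below is a minimal illustration, not the implementation used in the experiments: the two-dimensional feature vectors, the author labels, the Euclidean distance, and the plain majority vote are all assumptions.

```python
from collections import Counter
import math


def knn_predict(train, query, k):
    """train: list of (feature_vector, author) pairs; query: a feature vector.
    Returns the most common author among the k nearest training vectors."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(author for _, author in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-dimensional stylometric feature vectors for two authors
train = [
    ((0.10, 0.80), "author_a"), ((0.15, 0.75), "author_a"), ((0.20, 0.90), "author_a"),
    ((0.80, 0.20), "author_b"), ((0.85, 0.10), "author_b"), ((0.90, 0.25), "author_b"),
]
query = (0.18, 0.82)

# The k values listed above, capped at the size of this toy training set
for k in (3, 5, 7, 15, 25):
    print(k, knn_predict(train, query, min(k, len(train))))
```

Larger k values smooth the decision over more neighbors, which is why several values are typically compared; on a training set this small, values above the set size are only meaningful after capping.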
5. Results
5.1. Results Obtained on Literary Texts
5.2. Results Obtained on the Social Media Texts Dataset
6. Feature Selection Using Genetic Algorithm
- population size: 200;
- crossover ratio: 0.5;
- mutation rate: 0.2;
- number of populations: 20.
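A genetic algorithm with the parameters above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: "number of populations" is read as the number of generations, the crossover ratio is read as the per-gene probability in uniform crossover, the mutation rate is read as the probability of flipping one bit in a child, and the toy `fitness` function stands in for the classifier accuracy that a real wrapper would measure. Truncation selection and elitism are likewise choices made for this sketch.

```python
import random

POP_SIZE = 200          # population size
CROSSOVER_RATIO = 0.5   # assumed: probability of taking each gene from the first parent
MUTATION_RATE = 0.2     # assumed: probability that a child has one random bit flipped
GENERATIONS = 20        # "number of populations" read as number of generations
N_FEATURES = 30         # toy value, far smaller than a real stylometric feature set

# Hypothetical ground truth for the toy fitness: features 0..9 are informative.
INFORMATIVE = set(range(10))


def fitness(mask):
    """Toy stand-in for classifier accuracy: reward informative features,
    penalize uninformative ones."""
    chosen = {i for i, bit in enumerate(mask) if bit}
    return len(chosen & INFORMATIVE) - 0.5 * len(chosen - INFORMATIVE)


def crossover(a, b, rng):
    """Uniform crossover of two binary feature masks."""
    return [x if rng.random() < CROSSOVER_RATIO else y for x, y in zip(a, b)]


def mutate(mask, rng):
    """With probability MUTATION_RATE, flip one randomly chosen bit."""
    if rng.random() < MUTATION_RATE:
        i = rng.randrange(len(mask))
        mask = mask[:i] + [1 - mask[i]] + mask[i + 1:]
    return mask


def select_features(seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(POP_SIZE)]
    history = []  # best fitness at the start of each generation
    for _ in range(GENERATIONS):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: POP_SIZE // 2]        # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(POP_SIZE - 1)
        ]
        population = [population[0]] + children      # elitism: keep the best mask
    best = max(population, key=fitness)
    return best, history


best_mask, history = select_features()
```

Each individual is a binary mask over the feature set, so the selected subset is simply the positions where the best mask holds a 1. Because the best mask is carried over unchanged, the best fitness never decreases from one generation to the next.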
7. Limitations of the Proposed Methodology
- The training dataset should include only texts in the author’s own writing style. Material not written by the author should be removed from the text.
- The author of each training text should be known for certain. If this condition is not met, the text should be excluded from the training set.
- The training set should contain at least three texts of 15,000 characters each for literary texts and at least 50 texts of 50 characters each for social media texts. In both cases, increasing the number of texts or their length has a positive effect on the accuracy of author identification.
- The specifics of the problem should be considered. For short texts and/or limited resources, the classical ML methods with GAs or fastText should be used. When deliberate distortion of the text or an attack such as anonymization is possible, deep NNs are more suitable because they can automatically identify informative features of the author’s style.
- The proposed methodology is intended only for authorship identification of Russian-language texts in the closed-set case.
- Finally, the fewer the candidate authors, the higher the accuracy of author identification techniques. For the developed technique, the limit is ten authors for the literary texts dataset and five authors for the short texts dataset.
- The most critical limitation in real-world scenarios is the lack of training data. However, the minimum amount of data required by the technique can be reduced. In future work, it is planned to apply confidence metrics and calibration curves, which should lower the threshold further. This would make the presented methodology applicable even to small amounts of training data without losing accuracy.
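The minimum training-set sizes from the list above can be checked mechanically before training. The sketch below is hypothetical: the function name is illustrative, and applying the thresholds per author (rather than per dataset) is an assumption.

```python
def meets_minimum(texts, min_texts, min_length):
    """True if at least `min_texts` texts have `min_length` or more characters."""
    long_enough = [text for text in texts if len(text) >= min_length]
    return len(long_enough) >= min_texts


# Thresholds as read from the list above (per-author interpretation is an assumption)
LITERARY = {"min_texts": 3, "min_length": 15_000}
SOCIAL_MEDIA = {"min_texts": 50, "min_length": 50}

# Three synthetic long texts satisfy the literary requirement
literary_ok = meets_minimum(["а" * 20_000] * 3, **LITERARY)
# Sixty posts of 13 characters each are too short for the social media requirement
social_ok = meets_minimum(["короткий пост"] * 60, **SOCIAL_MEDIA)
```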
8. Discussion and Conclusions
- Unconsciousness. A feature that is poorly controlled by the author’s consciousness is less likely to be deliberately distorted.
- Immutability. The value of the feature stays within a limited range for a given author. Such features make it possible to distinguish an author from others with a similar writing style or from someone who is trying to imitate that style.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Dataset Characteristic | Value of the Characteristic |
---|---|
Number of authors | 100 |
Number of texts | 1100 |
Dataset size, symbols | 375,618,852 |
Dataset size, words | 62,603,142 |
Dataset size, sentences | 5,216,929 |
The average length of text, symbols | 973,342 |
The average length of sentence, words | 14.3 |
Maximum number of texts per author | 20 |
Minimum number of texts per author | 9 |

Dataset Characteristic | Value of the Characteristic |
---|---|
Number of authors | 3075 |
Number of texts | 202,892 |
Dataset size, symbols | 30,652,109 |
Dataset size, words | 4,708,619 |
The average length of text, symbols | 151.1 |
The average length of text, words | 23.7 |
The average number of texts per author | 115.37 |

Accuracy of Models, %

Number of Authors | SVM | LR | NB | DT | RF | KNN |
---|---|---|---|---|---|---|
2 | 95.4 ± 1.7 | 95.1 ± 4.4 | 91.9 ± 3.4 | 95.1 ± 1.1 | 97.4 ± 1.7 | 94.0 ± 2.6 |
5 | 94.6 ± 2.1 | 92.8 ± 3.1 | 87.1 ± 3.6 | 92.4 ± 2.3 | 92.1 ± 3.7 | 81.1 ± 2.1 |
10 | 81.9 ± 4.6 | 74.2 ± 5.1 | 74.4 ± 5.6 | 84.2 ± 4.4 | 67.6 ± 2.6 | 79.9 ± 3.3 |
20 | 63.3 ± 4.8 | 61.9 ± 5.1 | 58.3 ± 4.5 | 52.2 ± 2.3 | 62.2 ± 3.7 | 51.9 ± 5.2 |
50 | 37.7 ± 6.3 | 34.7 ± 7.1 | 29.8 ± 5.6 | 17.8 ± 2.7 | 35.9 ± 2.5 | 41.9 ± 4.0 |
Avg. accuracy | 74.7 ± 3.9 | 71.2 ± 4.9 | 68.3 ± 4.5 | 68.7 ± 2.6 | 70.6 ± 2.8 | 69.8 ± 3.4 |

Accuracy of Models, %

Number of Authors | SVM | LR | NB | DT | RF | KNN |
---|---|---|---|---|---|---|
2 | 92.4 ± 1.1 | 93.2 ± 2.3 | 85.7 ± 2.7 | 90.1 ± 1.1 | 92.5 ± 2.2 | 86.9 ± 3.3 |
5 | 84.5 ± 3.4 | 82.2 ± 4.3 | 72.7 ± 4.8 | 81.1 ± 3.2 | 88.5 ± 3.1 | 78.0 ± 2.6 |
10 | 72.2 ± 5.6 | 68.7 ± 5.5 | 59.4 ± 4.9 | 75.3 ± 2.2 | 70.6 ± 1.1 | 71.4 ± 4.9 |
20 | 55.2 ± 4.2 | 52.3 ± 4.8 | 49.3 ± 5.3 | 33.6 ± 3.1 | 59.0 ± 3.4 | 57.4 ± 2.6 |
50 | 33.2 ± 4.8 | 27.7 ± 3.3 | 22.8 ± 4.1 | 16.1 ± 5.1 | 31.4 ± 4.1 | 40.2 ± 3.4 |
Avg. accuracy | 67.5 ± 3.9 | 64.8 ± 4.1 | 57.9 ± 4.6 | 59.3 ± 2.9 | 68.4 ± 2.9 | 66.8 ± 3.5 |

Accuracy of Models, %

Number of Authors | LSTM | BiLSTM | CNN | CNN + LSTM | LSTM + CNN | CNN + CNN | fastText | RuBERT | MultiBERT |
---|---|---|---|---|---|---|---|---|---|
2 | 94.3 ± 5.5 | 95.5 ± 4.5 | 97.1 ± 3.6 | 94.5 ± 6.0 | 98.5 ± 5.6 | 98.8 ± 4.1 | 98.2 ± 4.5 | 95.2 ± 2.6 | 93.6 ± 2.8 |
5 | 86.7 ± 6.3 | 82.5 ± 4.8 | 95.9 ± 2.8 | 86.9 ± 4.8 | 95.5 ± 3.9 | 94.7 ± 3.4 | 95.0 ± 3.7 | 90.4 ± 4.1 | 89.8 ± 3.9 |
10 | 75.4 ± 3.3 | 70.2 ± 5.3 | 81.3 ± 5.9 | 78.2 ± 5.0 | 82.2 ± 5.4 | 86.5 ± 6.1 | 92.2 ± 6.3 | 84.3 ± 3.3 | 81.8 ± 3.5 |
20 | 63.9 ± 5.7 | 58.7 ± 4.9 | 71.2 ± 5.8 | 65.8 ± 3.2 | 62.2 ± 5.8 | 72.8 ± 4.4 | 69.9 ± 4.3 | 67.4 ± 4.1 | 64.9 ± 3.1 |
50 | 44.2 ± 6.1 | 41.1 ± 5.1 | 51.1 ± 5.1 | 55.0 ± 5.2 | 41.3 ± 4.5 | 56.9 ± 4.2 | 54.8 ± 6.2 | 52.2 ± 3.6 | 46.1 ± 4.0 |
Avg. accuracy | 72.9 ± 5.4 | 69.6 ± 5.1 | 79.3 ± 4.8 | 76.1 ± 4.9 | 75.9 ± 5.1 | 82.3 ± 4.8 | 82.1 ± 6.0 | 77.9 ± 3.5 | 75.2 ± 3.5 |

Training Time on Feature Vector, Sec.

SVM | LR | NB | DT | RF | KNN |
---|---|---|---|---|---|
1582 | 1082 | 677 | 714 | 1243 | 1134 |

Training Time on TF-IDF, Sec.

SVM | LR | NB | DT | RF | KNN |
---|---|---|---|---|---|
1823 | 1418 | 746 | 1371 | 2334 | 2871 |

Training Time of Neural Networks, Sec.

LSTM | BiLSTM | CNN | CNN + LSTM | LSTM + CNN | CNN + CNN | fastText | RuBERT | MultiBERT |
---|---|---|---|---|---|---|---|---|
58,133 | 65,284 | 43,191 | 52,638 | 50,452 | 50,679 | 26,723 | 48,634 | 49,629 |

Accuracy of Models, %

Number of Authors | SVM | LR | NB | DT | RF | KNN |
---|---|---|---|---|---|---|
2 | 72.2 ± 4.0 | 67.1 ± 3.1 | 62.9 ± 2.3 | 69.1 ± 2.1 | 71.2 ± 3.9 | 68.1 ± 4.2 |
5 | 69.9 ± 3.5 | 59.5 ± 4.2 | 58.6 ± 2.6 | 43.5 ± 2.1 | 56.1 ± 2.8 | 65.4 ± 3.7 |
10 | 66.3 ± 3.8 | 48.2 ± 2.9 | 45.9 ± 3.5 | 24.2 ± 3.6 | 37.6 ± 2.7 | 61.9 ± 4.0 |
20 | 55.3 ± 3.1 | 34.3 ± 3.4 | 38.8 ± 4.1 | 19.9 ± 4.1 | 32.2 ± 1.7 | 43.9 ± 4.1 |
50 | 32.1 ± 3.9 | 28.6 ± 3.6 | 27.1 ± 3.3 | 15.9 ± 3.3 | 25.9 ± 2.4 | 33.6 ± 3.4 |
Avg. accuracy | 59.2 ± 3.6 | 47.6 ± 3.4 | 46.8 ± 3.2 | 34.5 ± 3.0 | 44.6 ± 2.7 | 54.6 ± 3.9 |

Accuracy of Models, %

Number of Authors | SVM | LR | NB | DT | RF | KNN |
---|---|---|---|---|---|---|
2 | 61.1 ± 3.1 | 69.4 ± 5.2 | 70.8 ± 1.0 | 69.1 ± 2.1 | 68.4 ± 2.9 | 57.9 ± 1.7 |
5 | 54.4 ± 6.0 | 58.1 ± 4.4 | 66.1 ± 7.3 | 43.5 ± 2.1 | 60.4 ± 3.2 | 54.2 ± 3.1 |
10 | 39.4 ± 3.9 | 44.7 ± 0.7 | 49.6 ± 5.4 | 24.2 ± 3.6 | 42.6 ± 1.9 | 45.3 ± 1.9 |
20 | 32.0 ± 1.0 | 36.1 ± 2.2 | 46.1 ± 2.3 | 15.4 ± 3.6 | 33.6 ± 2.2 | 40.1 ± 2.3 |
50 | 17.3 ± 2.6 | 24.8 ± 3.1 | 34.1 ± 5.0 | 11.0 ± 4.8 | 23.5 ± 1.9 | 32.7 ± 2.45 |
Avg. accuracy | 40.9 ± 3.3 | 46.6 ± 3.1 | 53.3 ± 4.2 | 32.5 ± 3.2 | 45.7 ± 2.4 | 46.1 ± 2.3 |

Accuracy of Models, %

Number of Authors | LSTM | BiLSTM | CNN | CNN + LSTM | LSTM + CNN | CNN + CNN | fastText | RuBERT | MultiBERT |
---|---|---|---|---|---|---|---|---|---|
2 | 93.0 ± 1.9 | 94.6 ± 2.1 | 95.6 ± 2.0 | 95.5 ± 3.5 | 92.3 ± 2.1 | 91.3 ± 2.4 | 94.0 ± 1.2 | 93.3 ± 2.1 | 90.2 ± 1.9 |
5 | 89.7 ± 1.9 | 92.5 ± 2.4 | 93.3 ± 1.5 | 90.9 ± 2.2 | 90.2 ± 1.0 | 89.2 ± 2.2 | 87.2 ± 2.2 | 88.6 ± 1.8 | 87.1 ± 2.2 |
10 | 73.0 ± 2.8 | 71.3 ± 2.4 | 72.4 ± 2.7 | 77.1 ± 3.9 | 64.2 ± 3.3 | 76.6 ± 4.5 | 76.1 ± 3.5 | 76.6 ± 3.2 | 69.5 ± 3.0 |
20 | 68.8 ± 2.4 | 59.3 ± 1.3 | 67.9 ± 3.3 | 62.2 ± 3.4 | 61.3 ± 3.2 | 73.7 ± 2.5 | 68.4 ± 2.3 | 66.8 ± 3.3 | 63.4 ± 2.7 |
50 | 50.1 ± 2.6 | 49.6 ± 3.9 | 48.8 ± 3.6 | 47.3 ± 1.9 | 47.4 ± 2.8 | 50.2 ± 1.4 | 55.6 ± 2.8 | 50.0 ± 2.9 | 47.1 ± 2.8 |
Avg. accuracy | 74.9 ± 2.3 | 73.5 ± 2.4 | 75.6 ± 2.6 | 74.6 ± 3.0 | 71.1 ± 2.5 | 76.2 ± 2.6 | 76.3 ± 2.4 | 75.0 ± 2.7 | 71.5 ± 2.5 |

Training Time on Feature Vector, Sec.

SVM | LR | NB | DT | RF | KNN |
---|---|---|---|---|---|
589 | 397 | 308 | 236 | 804 | 604 |

Training Time on TF-IDF, Sec.

SVM | LR | NB | DT | RF | KNN |
---|---|---|---|---|---|
717 | 584 | 372 | 416 | 1393 | 955 |

Training Time of Neural Networks, Sec.

LSTM | BiLSTM | CNN | CNN + LSTM | LSTM + CNN | CNN + CNN | fastText | RuBERT | MultiBERT |
---|---|---|---|---|---|---|---|---|
30,190 | 32,980 | 25,380 | 28,397 | 26,467 | 25,874 | 15,926 | 26,547 | 27,117 |

Number of Authors | 1168 features | 500 features | 400 features | 300 features | 200 features | 100 features | 50 features |
---|---|---|---|---|---|---|---|
2 | 72.2 ± 4.0 | 75.3 ± 5.2 | 80.3 ± 3.3 | 75.2 ± 4.7 | 67.2 ± 4.5 | 64.9 ± 3.8 | 65.3 ± 5.5 |
5 | 69.9 ± 3.5 | 70.4 ± 2.5 | 77.1 ± 2.8 | 72.4 ± 1.9 | 63.8 ± 3.9 | 59.0 ± 4.2 | 49.8 ± 3.9 |
10 | 66.3 ± 3.8 | 67.8 ± 3.7 | 71.9 ± 2.6 | 66.6 ± 3.1 | 60.2 ± 3.8 | 57.2 ± 3.9 | 47.2 ± 2.1 |
20 | 55.4 ± 3.1 | 62.4 ± 2.9 | 64.8 ± 3.0 | 59.3 ± 2.8 | 52.9 ± 4.1 | 49.4 ± 4.2 | 43.7 ± 3.7 |
50 | 32.1 ± 3.9 | 35.1 ± 6.3 | 37.3 ± 4.1 | 33.5 ± 2.5 | 27.4 ± 4.3 | 26.8 ± 3.0 | 22.4 ± 4.0 |
Avg. accuracy | 59.2 ± 3.4 | 62.2 ± 4.1 | 66.3 ± 3.2 | 61.4 ± 3.0 | 54.3 ± 4.1 | 51.5 ± 3.8 | 45.7 ± 3.8 |

Number of Authors | 1168 features | 500 features | 400 features | 300 features | 200 features | 100 features | 50 features |
---|---|---|---|---|---|---|---|
2 | 95.4 ± 1.7 | 96.1 ± 2.2 | 98.6 ± 2.7 | 94.5 ± 1.9 | 96.2 ± 1.6 | 98.3 ± 3.9 | 95.9 ± 3.1 |
5 | 94.6 ± 2.1 | 94.7 ± 3.3 | 97.5 ± 3.1 | 90.6 ± 1.8 | 95.0 ± 3.5 | 97.4 ± 1.9 | 94.8 ± 2.0 |
10 | 81.9 ± 4.6 | 83.7 ± 4.1 | 88.0 ± 2.9 | 82.1 ± 1.2 | 87.1 ± 3.4 | 85.9 ± 3.8 | 84.3 ± 3.2 |
20 | 63.3 ± 4.8 | 69.1 ± 2.5 | 73.7 ± 3.3 | 70.7 ± 2.4 | 72.9 ± 2.8 | 63.2 ± 2.5 | 60.7 ± 2.6 |
50 | 37.7 ± 6.3 | 40.2 ± 2.0 | 44.4 ± 2.6 | 40.0 ± 3.7 | 42.4 ± 4.2 | 38.1 ± 3.7 | 33.8 ± 2.3 |
Avg. accuracy | 74.7 ± 3.9 | 76.8 ± 2.9 | 80.4 ± 2.9 | 75.6 ± 2.2 | 78.3 ± 3.1 | 76.6 ± 3.2 | 73.9 ± 2.6 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Fedotova, A.; Romanov, A.; Kurtukova, A.; Shelupanov, A. Authorship Attribution of Social Media and Literary Russian-Language Texts Using Machine Learning Methods and Feature Selection. Future Internet 2022, 14, 4. https://doi.org/10.3390/fi14010004