Complex Cases of Source Code Authorship Identification Using a Hybrid Deep Neural Network
Abstract
1. Introduction
- Factors that arise during or as a result of writing source code. This group includes various transformations of the source code. Examples are the previously mentioned case of obfuscation, in which the source code is modified into an unclear form that is difficult to analyze but retains its functionality, and writing source code according to coding standards, in which development follows the conventions and general rules adopted by a group of programmers.
- Factors that result from the specifics of the development process. This group includes other cases that complicate determining the author of source code: for example, identifying an author whose code samples are written in different programming languages (mixed data), and distinguishing between source code written by a human and by a generative model. Another complex case is determining authorship from source code samples written as part of group development.
- Source code formed from separate code fragments (commits);
- Artificially generated source code;
- Source code whose author writes in two or more programming languages;
- Obfuscated source code;
- Source code written according to coding standards.
2. Our Earlier Research
3. Related Works
- Creative element. Each programmer has their own preferences for patterns and structures. It is impossible to formalize every rule, so writing code retains a creative component.
- Number of languages. In most cases, well-known datasets include no more than three languages and do not provide data for evaluating code written by the same programmer in two or more programming languages. In practice, however, it is quite normal to use two or more programming languages for everyday routine tasks. An author's habits and favorite practices can carry over from one language to another, and an optimal approach should take this into account.
- Experience. With professional growth, a programmer improves their skills and gradually writes better code. Training data should therefore include several samples from different time intervals for the same programmer.
- Team development. Code review and discussion are commonly used to establish a project's practices and act as strong recommendations. Both procedures are also subject to human factors and depend on the personal experience of the reviewer, so some code specifics change from one team to another, but features that are implicit and not controlled by the programmer do not. These features could be helpful in identifying the author of the source code.
- Advanced cases. At present, commits, mixed data, and generated source code are inseparable from development, and methods should be robust to these complicated cases. Careful preprocessing and removal of noise puts the data into ideal condition but does not give a realistic picture. A novel and accurate approach should keep up with modern development techniques and tools.
4. Formal Task Statement
5. Technique for Determining the Author of a Source Code
- An input layer whose dimension corresponds to the vector length. In this case, the length is 256: each character is represented by a vector of 255 zeros and a single one at the position equal to the character code, according to the ASCII encoding.
- Inception-v1 layer group. This group includes convolutional layers with kernel dimensions of 1, 3, and 5. The convolutions act as filters that pass only informative features, and they work in parallel. To avoid overfitting, a Dropout layer with a rate of 0.2 is added after each convolution, which randomly zeroes 20% of its inputs. The results of the convolutions are concatenated into a single vector.
- Bidirectional Gated Recurrent Units (GRU): a layer consisting of two independent GRUs. The output of the Inception-v1 layers is fed in forward order to the first network and in reverse order to the second. The outputs of both networks are combined into one vector.
- Layers with a feedforward connection. The output of the Bidirectional GRU is passed to two sequential feedforward layers. Both layers have 512 neurons in order to map the outputs to a higher-dimensional space that makes classification easier. As in the Inception-v1 group, Dropout layers are applied to the feedforward layers.
- Output layer. Softmax is used as the output layer. It yields a probability distribution over the classes to which the input sample may belong. The dimension depends on the number of prediction classes for a particular case. In the figure, the dimension of the layer is 10, which corresponds to 10 potential authors.
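The input and output ends of the network described above can be illustrated without any deep learning framework. The following is a minimal sketch, not the authors' implementation: `one_hot_encode` and `softmax` are hypothetical helper names, and leaving characters with codes outside 0 to 255 as all-zero vectors is an assumption made here, since the text does not say how such characters are handled.

```python
import math

VECTOR_LENGTH = 256  # input dimension stated in the text


def one_hot_encode(source_code):
    """Turn each character into a 256-long vector with a single 1 at its character code."""
    vectors = []
    for char in source_code:
        vector = [0] * VECTOR_LENGTH
        code = ord(char)
        # Assumption: characters with codes >= 256 are left as all-zero vectors.
        if code < VECTOR_LENGTH:
            vector[code] = 1
        vectors.append(vector)
    return vectors


def softmax(logits):
    """Convert raw scores into a probability distribution over candidate authors."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


encoded = one_hot_encode("int x;")
probabilities = softmax([0.0] * 10)  # 10 potential authors with equal raw scores
```

Each element of `encoded` is one 256-dimensional input vector, and `probabilities` has one entry per potential author and sums to one. The convolutional, recurrent, and feedforward layers between these two ends are not reproduced here.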
6. Experimental Data
6.1. Mixed Data
6.2. Artificially Generated Source Codes
6.3. Source Code Commits
7. Experiment Setup and Results
- The length of the source code must be at least 30 symbols and no more than 3000 symbols.
- No more than 30 samples of source code per author are selected.
- Sample selection is random.
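The three selection rules above can be sketched as a small filter-and-sample routine. This is an illustrative sketch only: the function name `select_samples`, the dictionary layout of the input, and the fixed random seed are assumptions introduced here and are not taken from the paper.

```python
import random

MIN_LENGTH = 30              # minimum source code length in symbols
MAX_LENGTH = 3000            # maximum source code length in symbols
MAX_SAMPLES_PER_AUTHOR = 30  # upper bound on samples kept per author


def select_samples(samples_by_author, seed=0):
    """Filter samples by length, then randomly keep at most 30 per author.

    samples_by_author maps an author name to a list of source code strings
    (a hypothetical input layout chosen for this sketch).
    """
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    selected = {}
    for author, samples in samples_by_author.items():
        eligible = [s for s in samples if MIN_LENGTH <= len(s) <= MAX_LENGTH]
        count = min(len(eligible), MAX_SAMPLES_PER_AUTHOR)
        selected[author] = rng.sample(eligible, count)
    return selected


demo = {
    "author_a": ["x" * n for n in range(10, 100)],  # 90 samples, lengths 10..99
    "author_b": ["y" * 5000, "z" * 40],             # one too long, one valid
}
chosen = select_samples(demo)
```

In this toy input, `author_a` has 70 samples within the length bounds, of which 30 are drawn at random, while `author_b` keeps only the single sample that satisfies the length rule.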
- Mixed data, including two programming languages (language pairs);
- Mixed data, including three or more languages;
- Distinction of authorship between a human and generative models;
- Determining whether a pair of samples was produced by the same generative model;
- Distinction of authorship between generative models;
- Data obtained from contributors’ commits.
- For the case of five authors, the difference in efficiency between HNN and BERT is insignificant. However, HNN has a higher rank, which means that it is able to achieve more accurate results in solving the problem.
- For the case of 10 authors, HNN is the most accurate model. BERT differs only slightly from both HNN and fastText. However, fastText has the lowest rank, so BERT is considered a less efficient model than HNN.
- For the case of 20 authors, the difference between the classification results of HNN and BERT is not significant, and fastText shows significantly lower efficiency.
8. Conclusions
- Identification of the author of source code with high accuracy.
- Independence from the programming language and the programmer's qualification.
- Robustness to deliberate source code transformations through the use of obfuscators or coding standards.
- The ability to train the model on source code created by a development team.
- The ability to identify informative features indicative of source code created by a particular generative model.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Kurtukova, A.V.; Romanov, A.S. Identification author of source code by machine learning methods. Tr. SPIIRAN 2019, 18, 741–765.
- Kurtukova, A.; Romanov, A.; Shelupanov, A. Source Code Authorship Identification Using Deep Neural Networks. Symmetry 2020, 12, 2044.
- Abuhamad, M.; AbuHmed, T.; Mohaisen, A.; Nyang, D. Large-Scale and Language-Oblivious Code Authorship Identification. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, Toronto, ON, Canada, 15–19 October 2018; pp. 101–114.
- Zhen, L.; Chen, G.; Chen, C.; Zou, Y.; Xu, S. RoPGen: Towards Robust Code Authorship Attribution via Automatic Coding Style Transformation. In Proceedings of the 2022 IEEE 44th International Conference on Software Engineering (ICSE), Pittsburgh, PA, USA, 25–27 May 2022; pp. 1906–1918.
- Holland, C.; Khoshavi, N.; Jaimes, L.G. Code authorship identification via deep graph CNNs. In Proceedings of the 2022 ACM Southeast Conference (ACM SE '22), Virtual, 18–20 April 2022; pp. 144–150.
- Google Code Jam. Available online: https://codingcompetitions.withgoogle.com/codejam (accessed on 18 August 2022).
- Bogdanova, A.; Romanov, V. Explainable source code authorship attribution algorithm. J. Phys. 2021, 2134, 012011.
- Bogdanova, A. Source code authorship attribution using file embeddings. In Proceedings of the 2021 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, Zurich, Switzerland, 17–22 October 2021; pp. 31–33.
- Bogomolov, E.; Kovalenko, V.; Rebryk, Y.; Bacchelli, A.; Bryksin, T. Authorship attribution of source code: A language-agnostic approach and applicability in software engineering. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, 23–28 August 2021; pp. 932–944.
- Ullah, F.; Wang, J.; Jabbar, S.; Al-Turjman, F.; Alazab, M. Source code authorship attribution using hybrid approach of program dependence graph and deep learning model. IEEE Access 2019, 7, 141987–141999.
- Bayrami, P.; Rice, J.E. Code authorship attribution using content-based and non-content-based features. In Proceedings of the 2021 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), Ottawa, ON, Canada, 12–17 September 2021; pp. 1–6.
- Caldeira, R.S. A Deep Learning Approach to Recognize Source Code Authorship. Available online: https://maups.GitHub.io/papers/tcc_008.pdf (accessed on 18 August 2022).
- Codeforces. Available online: https://codeforces.com/ (accessed on 18 August 2022).
- Mateless, R.; Tsur, O.; Moskovitch, R. Pkg2Vec: Hierarchical package embedding for code authorship attribution. Future Gener. Comput. Syst. 2021, 116, 49–60.
- Gorshkov, S.; Nered, M.; Ilyushin, E.; Namiot, D.; Sukhomlin, V. Source code authorship identification using tokenization and boosting algorithms. In Proceedings of the International Conference on Modern Information Technology and IT Education, Moscow, Russia, 29 November–2 December 2018; Springer: Cham, Switzerland, 2018; pp. 295–308.
- Suman, C.; Raj, A.; Saha, S.; Bhattacharyya, P. Source Code Authorship Attribution using Stacked classifier. In Proceedings of the Forum for Information Retrieval Evaluation, FIRE (Working Notes), Hyderabad, India, 16–20 December 2020; pp. 732–737.
- García-Díaz, J.A.; Valencia-García, R. UMUTeam at AI-SOCO '2020: Source Code Authorship Identification based on Character N-Grams and Author’s Traits. In Proceedings of the Forum for Information Retrieval Evaluation, FIRE (Working Notes), Hyderabad, India, 16–20 December 2020; pp. 717–726.
- GitHub. Available online: https://GitHub.com/ (accessed on 18 August 2022).
- Gitlab. Available online: https://gitlab.com/ (accessed on 18 August 2022).
- Rothe, S.; Narayan, S.; Severyn, A. Leveraging pre-trained checkpoints for sequence generation tasks. Trans. Assoc. Comput. Linguist. 2020, 8, 264–280.
- Du, Z. All nlp tasks are generation tasks: A general pretraining framework. arXiv 2021, arXiv:2103.10360.
- Floridi, L.; Chiriatti, M. GPT-3: Its nature, scope, limits, and consequences. Minds Mach. 2020, 30, 681–694.
- Lee, J.S.; Hsiang, J. Patent claim generation by fine-tuning OpenAI GPT-2. World Pat. Inf. 2020, 62, 101983.
- Dusheiko, A. Lead Generation of News Texts using the ruGPT-3 Neural Network. Master’s Thesis, 2022.
- Pisarevskaya, D.; Shavrina, T. WikiOmnia: Generative QA corpus on the whole Russian Wikipedia. arXiv 2022, arXiv:2204.08009.
- Cruz-Benito, J. Automated source code generation and auto-completion using deep learning: Comparing and discussing current language model-related approaches. AI 2021, 2, 1–16.
- Open AI. Available online: https://openai.com/blog/openai-codex (accessed on 18 August 2022).
- GitHub Copilot. Available online: https://copilot.GitHub.com (accessed on 18 August 2022).
- AlphaCode. Available online: https://deepmind.com/blog/article/Competitive-programming-with-AlphaCode (accessed on 18 August 2022).
- Sber AI ruGPT-3. Available online: https://developers.sber.ru/portal/tools/rugpt-3 (accessed on 18 August 2022).
- Polycoder. Available online: https://venturebeat.com/2022/03/04/researchers-open-source-code-generating-ai-they-claim-can-beat-openais-codex/ (accessed on 18 August 2022).
- Frantzeskou, G.; Stamatatos, E.; Gritzalis, S. Identifying authorship by byte-level n-grams: The source code author profile (SCAP) method. Int. J. Digit. Evid. 2007, 1, 1–18.
- Wisse, W.; Veenman, C.J. Scripting DNA: Identifying the JavaScript Programmer. Digit. Investig. 2015, 15, 61–71.
- FastText. Available online: https://fasttext.cc/ (accessed on 18 August 2022).
- BERT. Available online: https://huggingface.co/docs/transformers/model_doc/bert (accessed on 18 August 2022).
- VGCN-BERT. Available online: https://arxiv.org/abs/2004.05707 (accessed on 18 August 2022).
- Bag of Tricks for Efficient Text Classification. Available online: https://aclanthology.org/E17-2068/ (accessed on 18 August 2022).
- Caliskan-Islam, A. Deanonymizing programmers via code stylometry. In Proceedings of the 24th USENIX Security Symposium, Washington, DC, USA, 12–14 August 2015; pp. 255–270.











| Dataset Characteristic | Value of the Characteristic | 
|---|---|
| Number of code lines | 20,967,040 | 
| Number of projects | 569 | 
| Number of projects with one author | 71 | 
| Average commit length in lines | 13 | 
| Maximum commit length in lines | 119,892 | 
| Average number of source codes in project | 231 | 
| Average source code length in lines | 169 | 
| Total number of symbols in code | 212,336,637 |
| Total number of source codes | 102,908 |
| Total number of authors in all projects | 383 |

| Author | Method | Complex Cases | Dataset | Programming Language | Average Accuracy |
|---|---|---|---|---|---|
| Ours | HNN | Obfuscation, Coding standards, Mixed data, Artificially generated data, Commit-based data | GitHub (ours) | C++ | 92% |
| | | | | Java | 97% |
| | | | | JS | 92% |
| | | | | Python | 95% |
| | | | | C | 96% |
| | | | | C# | 96% |
| | | | | Ruby | 95% |
| | | | | PHP | 92% |
| | | | | Swift | 98% |
| | | | | Go | 93% |
| | | | | Groovy | 99% |
| | | | | Kotlin | 91% |
| | | | | Perl | 96% |
| | | | GCJ | C++ | 98% |
| | | | | Java | 99% |
| | | | | Python | 98% |
| Abuhamad M., AbuHmed T., Mohaisen A., Nyang D. [3] | DL-CAIS | Obfuscation | GCJ | C++ | 97% |
| | | | | Java | 100% |
| | | | | Python | 100% |
| Zhen L., Chen G., Chen C., Zou Y., Xu S. [4] | RoPGen | - | GCJ | C++ | 92% |
| | | | | Java | 98% |
| | | | GitHub | C | 84% |
| | | | | Java | 90% |
| Holland C., Khoshavi N., Jaimes L.G. [5] | GNN | - | GCJ | C#, C++, Java | 60% (avg.) |
| Bogdanova A., Romanov V. [7,8] | XAI | - | GCJ | C++ | 74% |
| | | | | Java | 77% |
| | | | | Python | 72% |
| Bogomolov E., Kovalenko V., Rebryk Y., Bacchelli A., Bryksin T. [9] | PbRF | - | GCJ | Java | 98% |
| Caliskan-Islam A., Harang R. [38] | FuzzyAST, RF | Obfuscation | GCJ | C | 93% |
| | | | | C++ | 98% |
| | | | | Python | 88% |
| Ullah F., Wang J., Jabbar S., Al-Turjman F. [10] | PDGDL | - | GCJ | C#, C++, Java | 99% (avg.) |
| Bayrami P., Rice J.E. [11] | RF, n-grams | - | GitHub | C++ | 75% |
| Caldeira R.S. [12] | LSTM | - | GCJ | C++ | 75% |
| | | | Codeforces | C++ | 71% |
| Mateless R. et al. [14] | Pkg2Vec | - | APKs | Java | 79% |
| Gorshkov et al. [15] | StyleIndex | - | GitHub | C++ | 94% |
| | | | | Java | 95% |
| | | | | JavaScript | 94% |
| Suman C., Raj A., Saha S. [16] | Stacked models | - | AI-SOCO | C++ | 82% |
| García-Díaz J. A., Valencia-García R. [17] | RF | - | AI-SOCO | C++ | 91% |

| Author | Method | Programming Language | Obfuscator | Dataset | Accuracy |
|---|---|---|---|---|---|
| Ours | HNN | JS | JS Obfuscator Tool | GitHub | 86% |
| | | | | GCJ | 91% |
| | | | JS-obfuscator | GitHub | 86% |
| | | | | GCJ | 90% |
| | | Python | Opy | GitHub | 87% |
| | | | | GCJ | 91% |
| | | | Pyarmor | GitHub | 70% |
| | | | | GCJ | 77% |
| | | PHP | Yankpro-po | GitHub | 89% |
| | | | | GCJ | 92% |
| | | | PHP Obfuscator | GitHub | 82% |
| | | | | GCJ | 89% |
| | | C++ | C++ Obfuscator | GitHub | 71% |
| | | | | GCJ | 79% |
| | | C | Tigress | GitHub | 90% |
| | | | | GCJ | 95% |
| Abuhamad M., AbuHmed T., Mohaisen A., Nyang D. [3] | DL-CAIS | C | Tigress | GCJ | 93% |
| Caliskan-Islam A., Harang R. [38] | FuzzyAST, RF | C | Tigress | GitHub | 67.2% |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kurtukova, A.; Romanov, A.; Shelupanov, A.; Fedotova, A. Complex Cases of Source Code Authorship Identification Using a Hybrid Deep Neural Network. Future Internet 2022, 14, 287. https://doi.org/10.3390/fi14100287
 
        



 
       