Source Code Authorship Identification Using Deep Neural Networks

Abstract: Many open-source projects are developed by the community and share a common basis. The more open the source code is, the more open the project is to contributors. The possibility that someone else's source code is used, accidentally or deliberately, as closed functionality in another project (even a commercial one) cannot be excluded. This situation can create copyright disputes. Adding a plagiarism check to the project lifecycle during software engineering solves this problem. However, not all code samples for comparison can be found in the public domain. In this case, methods of identifying the source code author can be useful. Identifying the source code author is therefore an important problem in software engineering, and it is also a research area in symmetry. This article discusses the problem of identifying the source code author and modern methods of solving this problem. Drawing on the experience of researchers in the field of natural language processing (NLP), the authors propose a technique based on a hybrid neural network and demonstrate its results both for simple cases of determining code authorship and for cases complicated by obfuscation and the use of coding standards. The results show that the proposed technique successfully addresses the essential problems of its analogs and can be effective even in cases where there are no obvious signs indicating authorship. The average accuracy obtained across all programming languages was 95% in the simple case and exceeded 80% in the complicated ones.


Introduction
Identifying the source code author is the task of determining the most likely author of a computer program based on available code samples from the candidate authors. The process of such identification consists of determining the author's style, identifying the programmer's individual habits, professional techniques, and approaches to writing program code. The solutions to this problem can be used as computer forensics tools [1][2][3][4].
Modern technologies, software systems, and other digitalization tools have great advantages that simplify a person's professional activity. However, there is a significant drawback behind the many advantages: the vulnerability of such developments to malfunctions and failures. Sources of such problems are often design defects or incorrect operation of the software. However, this does not exclude the possibility that an attacker deliberately disrupts their operation.
The most common cause of hardware and software failures is the injection of malicious code. Although the creation, use, and distribution of malicious software are prohibited by law and entail criminal liability in almost all states, there are still no technical solutions capable of speeding up and improving computer audits aimed at establishing source code authorship. Authorship investigations of malware source code are carried out manually by experts.

The Task of Identifying the Source Code Author
Let there be a set of authors A = {a_1, . . . , a_l} and a set of program source codes S = {s_1, . . . , s_k}, where each source code is considered as a feature vector X = {x_1, . . . , x_n}. For the subset S′ = {s_1, . . . , s_m} ⊆ S, where m < k, the authors are known, so there is a set of "author-source code" pairs (s_i, a_j) ∈ D ⊆ S′ × A, where s_i ∈ S′, a_j ∈ A. The solution is to identify the author belonging to set A who is the true author of the anonymous source code samples of the subset S″ = S \ S′. This problem should be considered a multi-class classification problem: set A defines the predefined classes and their labels, set D includes the source codes for training the classifier, and the codes to be classified are included in set S″.
The goal is to develop a model that solves the problem of finding the objective function F : S × A → [0, 1], which assigns some source code from the set S to its true author. The function value is described as the degree to which the object belongs to a class, where 1 corresponds to full membership in the target class and 0, on the contrary, to a complete mismatch. So, the output is a set of probabilities with length equal to the number of authors. The decision-making model being developed must classify and describe the process of determining whether the author of an anonymous source code sample belongs to a specific class, based on the training data submitted to the model input.
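A minimal sketch of how such an objective function can produce per-author probabilities, assuming a softmax over raw scores; the scores and the three-author setup are illustrative, not taken from the paper:

```python
import math

def author_probabilities(scores):
    """Convert raw per-author scores into a probability distribution
    via softmax, matching F : S x A -> [0, 1]."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate authors.
probs = author_probabilities([2.0, 0.5, -1.0])
predicted_author = probs.index(max(probs))  # argmax picks the most likely author
```

The probabilities sum to 1, and the index of the largest value identifies the most likely author.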

Modeling Neural Network Architectures
When analyzing source code, the features of artificial-language text as an object of study must be taken into account: in particular, semantic and syntactic rules, structural features, and programming language paradigms.
There are two approaches to training models based on the described aspects. The first implies prior manual determination of the set of informative features by the researcher. However, this approach has two significant drawbacks: the need for deep expert knowledge of programming languages, as well as the possibility of deliberate distortion of obvious features by the author-programmer. The second approach consists of extracting both obvious and unobvious informative features by the model itself during training and involves the use of deep NN architectures. A comparison of these approaches, as well as of traditional and deep machine learning methods, was carried out in [2]. The results obtained allow us to conclude that it is appropriate to use deep NNs to identify unobvious features that are not consciously controlled by the programmer.
The most common models among researchers are based on recurrent and convolutional architectures.
Their popularity is due to the peculiarities of how they process text sequences [17][18][19].
Convolutional NNs (CNNs) are known for their effectiveness in solving computer vision problems. They are modeled on the visual cortex of the brain and can focus on certain areas of an image and highlight important features in them. This advantage makes them a powerful tool in NLP, where it is necessary to focus on individual text fragments.
In general, the convolutional layer of the CNN architecture can be described as follows:

x^l = f(x^(l−1) * k^l + b^l),

where x^l is the output of layer l, f() is the activation function, b^l is the shift coefficient for layer l, and * is the convolution operation of the input x with the kernel k. Moreover, due to edge effects, the size of the original matrices is reduced:

x_j^l = f(Σ_i x_i^(l−1) * k_j^l + b_j^l),

where x_j^l is feature map j (the output of layer l), f() is the activation function, b_j^l is the shift coefficient of layer l for feature map j, k_j^l is convolution kernel j of layer l, and * is the convolution operation.
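The convolutional layer described above can be sketched as a plain 1-D valid convolution with a ReLU activation; the input, kernel, and bias values are illustrative:

```python
def conv1d_valid(x, k, b):
    """Valid 1-D convolution of input x with kernel k plus bias b,
    followed by ReLU; due to edge effects the output shrinks
    by len(k) - 1 elements."""
    n = len(x) - len(k) + 1
    out = []
    for i in range(n):
        s = sum(x[i + j] * k[j] for j in range(len(k))) + b
        out.append(max(0.0, s))  # ReLU as the activation f()
    return out

# A length-4 input convolved with a length-2 kernel gives a length-3 map.
feature_map = conv1d_valid([1.0, 2.0, 3.0, 4.0], [0.5, 0.5], 0.0)  # -> [1.5, 2.5, 3.5]
```

The shrinking output length (4 − 2 + 1 = 3) illustrates the edge-effect reduction mentioned in the text.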
A recurrent NN (RNN) accepts data sequences both as separate, unrelated tokens and as linked, complete sequences: voluminous texts with a specific meaning. The order in which data are fed to the model input directly influences the changes in the weight matrices and hidden state vectors during training. By the end of RNN training, the information obtained in the previous steps is accumulated in the hidden state vectors. In this way, recurrent models make it possible to capture context dependencies in the text.
Deep LSTM is a modification of RNN. Its special feature is the ability of LSTM to learn long-term dependencies and successfully extract them even from large texts. The LSTM architecture is often superior to the classical RNN in terms of efficiency on related tasks [23,24], so this study decided to evaluate its effectiveness. Formally, the states of the LSTM cell at time t can be described as:

f_t = σ(W_f X_t + U_f H_(t−1) + b_f),
i_t = σ(W_i X_t + U_i H_(t−1) + b_i),
o_t = σ(W_o X_t + U_o H_(t−1) + b_o),
C̃_t = tanh(W_c X_t + U_c H_(t−1) + b_c),

where f_t is the state of the forget gate layer, X_t is the input vector, H_(t−1) is the hidden state of the previous cell, C_(t−1) is the memory state of the previous cell, H_t is the hidden state of the current cell, C_t is the memory state of the current cell, W_f, U_f are the weights for the forget gate layer, W_c, U_c are the weights for the gate candidates, W_i, U_i are the weights for the input gate, W_o, U_o are the weights for the output gate, σ is the sigmoid function, and tanh is the hyperbolic tangent.
The key feature that distinguishes LSTM from RNN is the memory state C_t. This element determines the need to delete or, on the contrary, add information, and it is controlled by the forget gate, whose value lies in the range [0; 1], where 0 means forgetting all information from the previous state and 1 means remembering all of it. The current memory state C_t is calculated as:

C_t = f_t ⊙ C_(t−1) + i_t ⊙ C̃_t,

where C_t is the current state of memory at step t and C̃_t is the candidate state.
The output value of the LSTM cell can be formally described as:

H_t = o_t ⊙ tanh(C_t).

The obtained C_t and H_t are passed to the next step, and the process is repeated. Another non-standard modification of the classical model, which became an innovation in the field of NLP, is CNN with self-attention. Experiments show that this model can replace RNN in some tasks due to its ability to identify key dependencies regardless of their length. This is the main difference from simple CNN models, which can only process sequences of fixed lengths according to the dimensions of their convolutions [25]. The choice of CNN with attention for implementation is due to its high results in sentiment analysis, text generation, and automatic summarization tasks.
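The LSTM cell equations above can be sketched as a single time step; the random weights, dimensions, and input sequence are illustrative, not the paper's trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the parameters of the
    forget (f), input (i), candidate (c), and output (o) gates."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])
    c_hat = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])
    c_t = f_t * c_prev + i_t * c_hat   # memory state C_t
    h_t = o_t * np.tanh(c_t)           # hidden state H_t
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 3, 2                       # toy input and hidden sizes
W = {g: rng.standard_normal((d_h, d_in)) for g in 'fico'}
U = {g: rng.standard_normal((d_h, d_h)) for g in 'fico'}
b = {g: np.zeros(d_h) for g in 'fico'}
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((5, d_in)):  # a sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, W, U, b)    # C_t and H_t carried forward
```

Because H_t is an output-gated tanh of the memory state, its components always stay within (−1, 1).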
In general, the principle of the self-attention mechanism is as follows. Query vectors (Q) and key-value pairs (K-V) are fed to the input of the self-attention layer. Each vector is then converted by a linear transformation. The resulting Q values are multiplied by the dot product with each K in turn. The result passes through the activation function, and all vectors V, weighted by the obtained coefficients, are combined into a single vector.
It should be noted that the self-attention mechanism can be used either independently or in an ensemble of such mechanisms, forming a multi-headed approach, where different weights are used to obtain the projections. In the second case, the heads are trained in parallel, their results are then concatenated into a single vector and, after a linear transformation, passed to the output.
The value of head q can be formally represented as:

H_q = Σ_k A_(q,k) X_k W_val,

where A_q is the attention value for head q, X_k are the input values, W_val is the head value matrix, and k is the key position. The attention value for head q is calculated as follows:

A_q = softmax(X W_qry (X W_key)^T),

where W_qry is the query matrix, W_key is the key matrix, and T is the transpose operation. The output value of the self-attention mechanism with one or more heads is then:

Y = concat(H_1, . . . , H_N) W_out + b,

where N is the number of heads, W_out is the output projection matrix, and b is the offset vector of the output dimension. As part of this study, it was decided to consider a self-attention mechanism with a single head. The idea of creating a hybrid NN (HNN) is based on the hypothesis that not only obvious and unobvious informative features (such as the naming of variables, classes, and methods, comments, and whitespaces) need to be highlighted, but also long-term dependencies. The HNN combines the advantages of both recurrent and convolutional architectures. Combinations of the classic CNN with the LSTM, GRU, BiLSTM, and BiGRU models were implemented to evaluate their effectiveness in determining the author of the source code.
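A minimal sketch of the single-head self-attention computation described above; the random projection matrices and token embeddings are illustrative assumptions:

```python
import numpy as np

def self_attention(X, W_qry, W_key, W_val):
    """Single-head self-attention: queries and keys are compared by
    scaled dot products, and the softmax weights are applied to the
    values."""
    Q, K, V = X @ W_qry, X @ W_key, X @ W_val
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over each query's scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 8))                       # 4 tokens, 8-dim embeddings
W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)                # one attended vector per token
```

Each of the four tokens receives a weighted combination of all value vectors, so the output keeps the input's sequence length.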
The GRU architecture was added to the experiments as a simplified version of LSTM. This model has fewer parameters, yet achieves accuracy comparable to that of the more complex LSTM. Moreover, due to its simplicity, GRU is less prone to overfitting, has better generalization capacity for new data, and trains faster.
The last two combinations involve the use of bidirectional architectures. The only difference from simple models is the ability of these architectures to take into account not only the previous context but also the one following a particular token, which can positively affect the identification of the source code author.

Estimating the Accuracy of Neural Network Models
The training of any NN architecture and its further evaluation involves the use of representative data. Data were collected from the largest web service for hosting IT projects, GitHub [26]. For the experiment, we selected languages included in the ratings of the most popular and fastest-growing programming languages. Source codes were collected from random repositories of GitHub users using a Python script.
The extended source code base of program authors included more than 200,000 samples written by 898 authors. The corpus included simple source codes in 13 different programming languages (C, C++, Java, Python, C#, JavaScript, PHP, Perl, Swift, Groovy, Ruby, Go, and Kotlin). Detailed information on the data is presented in Table 1 (languages are listed in descending order of release date). This dataset allows for extensive research and an objective evaluation of the effectiveness of the developed model. It contains a variety of source codes that differ from each other not only in the authors' writing styles but also implicitly reflect the qualifications, experience, age, and education of the developers.
The quality of the models was evaluated with accuracy and 10-fold cross-validation. This procedure avoids the problem of excessive scattering of scores during verification and provides a reliable result. Accuracy in each of the blocks is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN),

where TP is a true positive, TN a true negative, FP a false positive, and FN a false negative decision. The TP, TN, FP, and FN parameters are calculated on the basis of the confusion matrix, which contains information on how many times the system made the right decision and how many times it made the wrong decision on samples of a given class.
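The accuracy computation from a confusion matrix can be sketched as follows; the 3-author matrix is a hypothetical example, not data from the paper:

```python
def accuracy_from_confusion(cm):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN). In the multi-class
    case, correct decisions are the diagonal of the confusion matrix
    and the denominator is the total number of decisions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Hypothetical 3-author confusion matrix: rows = true author, cols = predicted.
cm = [[8, 1, 1],
      [0, 9, 1],
      [2, 0, 8]]
acc = accuracy_from_confusion(cm)  # 25 correct out of 30 decisions
```

In a 10-fold cross-validation, this value would be computed per fold and then averaged.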
The validation results of the models described in the previous section on a corpus of 10 authors are presented in Table 2. The least efficient architecture was D-LSTM, which showed an average accuracy of 74% across all programming languages. This is because it is the only architecture that processes text purely sequentially and is not able to select local informative features built from character n-grams.
High accuracy was achieved by the combinations of CNN with BiGRU (88%), CNN with BiLSTM (84%), and CNN with attention (82%). All these models are distinguished by the ability to identify the context of a symbol and to assess its importance relative to the contexts of other symbols. Thus, these architectures perform the key functions for identifying the author of the source code: identifying local features of the context and determining short-term and long-term dependencies.
The obtained results allow us to conclude that CNN in combination with BiGRU is particularly effective: it showed an accuracy of more than 80% for all languages and had a significant advantage over the other implemented architectures. This combination is the basis of the author's hybrid NN, which is described in detail in the next section.

Author's Technique of Identifying the Author of the Source Code
The technique for identifying the author of the source code includes all steps necessary for the model to make a decision about whether an anonymous sample of the source code belongs to one of the training classes. Since the basic decision-making algorithm is a deep NN, capable of extracting informative features by itself from non-preprocessed text in vector form, the technique includes only five steps:
1. Conversion of the training and test datasets into vector form.
2. Training a deep model on the vectorized training dataset.
3. Validation of the deep model on the vectorized test dataset.
4. Conversion of an anonymous source code sample into vector form.
5. Authorship identification of the vectorized anonymous source code sample by the model that showed the best accuracy during validation.
The author's technique involves using a deep NN. Therefore, it was necessary to take into account that such a model can accept input data only in vector form, so a method for data vectorization had to be found. It was decided to use the one-hot encoding method. This choice was made because the categorical features of the data used are nominal, not ordinal, and therefore require the encoder to "binarize" values. Encoding is performed by creating, for each character, a vector of 255 zeros and a single 1 at the position of the element whose index equals the character code (in accordance with the ASCII encoding). Thus, the training and test datasets, as well as the anonymous source code, are transformed into vector form character by character.
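The character-level one-hot encoding described above can be sketched as follows; the modulo guard for characters outside the ASCII range is an assumption added for robustness, not stated in the paper:

```python
def one_hot_encode(source_code):
    """Encode each character as a 256-long one-hot vector:
    255 zeros and a single 1 at the index equal to the
    character's ASCII code."""
    vectors = []
    for ch in source_code:
        v = [0] * 256
        v[ord(ch) % 256] = 1  # modulo guards against non-ASCII input (assumption)
        vectors.append(v)
    return vectors

encoded = one_hot_encode("int x;")  # one 256-long vector per character
```

The result is a sequence of sparse vectors, one per character, ready to be fed to the network input.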
The vectorized training set is fed to the input of the deep NN. The results of the experiment were presented in the previous section. They allow us to conclude that the hybrid of CNN and BiGRU (HNN) is the most effective for solving the problem of identifying the source code author. However, before this combination was included in the technique, it was improved with additional layers. The final model is shown in Figure 1. These are convolutional layers modeled on components of the Inception-v1 architecture [27]. This decision is due to the need to analyze not only each individual symbol of the source code but also n-grams of symbols. Convolutions of dimensions 1 × 1, 1 × 3, and 1 × 5 allow parallel analysis of unigrams, trigrams, and 5-grams, respectively. The model obtained by combining Inception-v1 and BiGRU is able to perform a deep analysis of the source code, taking into account local and global features, the context, and various time dependencies. However, finding the best architecture is not enough to obtain high decision-making accuracy with the HNN; it is also necessary to find effective regularization, activation, and loss functions, as well as a function for error minimization.
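The parallel 1 × 1, 1 × 3, and 1 × 5 convolution branches can be sketched with a NumPy mock-up; the random kernels, channel counts, and "same" padding are illustrative assumptions rather than the exact HNN configuration:

```python
import numpy as np

def ngram_branch(X, width, n_filters, rng):
    """One Inception-style branch: a 1 x width convolution over the
    character sequence X (seq_len x channels) with 'same' padding
    and a ReLU activation."""
    seq_len, channels = X.shape
    K = rng.standard_normal((n_filters, width, channels)) * 0.1
    pad = width // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    out = np.empty((seq_len, n_filters))
    for t in range(seq_len):
        window = Xp[t:t + width]  # width x channels slice at position t
        out[t] = np.maximum(0.0, np.tensordot(K, window, axes=([1, 2], [0, 1])))
    return out

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 16))  # 20 characters, 16-dim vectors (toy sizes)
# Parallel unigram / trigram / 5-gram branches, concatenated as in Inception.
features = np.concatenate([ngram_branch(X, w, 8, rng) for w in (1, 3, 5)], axis=1)
```

The concatenated feature sequence (here 20 × 24) would then be passed on to the BiGRU layers for sequence modeling.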
Optimizing the biases of synaptic weights and local parameters makes it possible to minimize the error obtained while training the NN. Many popular optimizers suffer from paralysis or greedy behavior of the algorithm. To avoid this, it was decided to use adaptive-step optimization, Adadelta [28].
The first experiments with the default settings used in [1] for the Adadelta optimizer (in the PyTorch framework) did not allow achieving high accuracy in complex cases. The following experimentally found parameters allow identifying the source code author equally well in all cases (up to a 15% increase for complex cases):
• lr = 10^(−4) (coefficient that scales delta before it is applied to the parameters);
• rho = 0.95 (coefficient used for computing a running average of squared gradients);
• eps = 10^(−7) (term added to the denominator to improve numerical stability).
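The Adadelta update with the parameters listed above can be sketched as follows, following Zeiler's formulation with lr scaling the delta as in the PyTorch implementation; the toy quadratic loss is illustrative:

```python
import numpy as np

def adadelta_step(grad, state, lr=1e-4, rho=0.95, eps=1e-7):
    """One Adadelta update: running averages of squared gradients
    (Eg2) and squared deltas (Edx2), with lr scaling the delta
    before it is applied (as in PyTorch)."""
    state['Eg2'] = rho * state['Eg2'] + (1 - rho) * grad ** 2
    delta = -np.sqrt(state['Edx2'] + eps) / np.sqrt(state['Eg2'] + eps) * grad
    state['Edx2'] = rho * state['Edx2'] + (1 - rho) * delta ** 2
    return lr * delta

w = np.array([1.0, -2.0])
state = {'Eg2': np.zeros_like(w), 'Edx2': np.zeros_like(w)}
for _ in range(3):
    grad = 2 * w                      # gradient of the toy loss ||w||^2
    w += adadelta_step(grad, state)   # each step moves w toward the minimum
```

Because the per-parameter step adapts to the gradient history, no manually tuned global learning-rate schedule is needed.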
To prevent possible overfitting of the model, it is necessary to carry out regularization. In this case, 20% of the vector elements are zeroed out, and the model does not face the problem of overfitting. As in the previous work [1], Softmax [29], a generalization of the logistic function for the multidimensional case, was used as the function of the HNN output layer. Cross-entropy was used as the loss function, respectively.
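The 20% dropout regularization can be sketched as inverted dropout; the rescaling by 1/(1 − p) is a common implementation detail assumed here, not stated in the paper:

```python
import random

def dropout(vector, p=0.2, training=True, seed=None):
    """Inverted dropout: zero a fraction p of elements during
    training and rescale the survivors so the expected activation
    is unchanged; at inference time the vector passes through."""
    if not training:
        return list(vector)
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else v / (1 - p) for v in vector]

regularized = dropout([1.0] * 10, p=0.2, seed=42)
```

Zeroing random elements on each pass prevents the network from relying on any single feature, which counteracts overfitting.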
Mini-batches of 8-16 examples were used in the process of HNN training. This approach reduces the variance of the gradients of individual examples, because they can differ greatly from each other [16]. A mini-batch computes the average gradient over a number of examples at a time instead of computing the gradient for a single example.
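Mini-batch gradient averaging can be sketched as follows; the toy examples and gradient function are illustrative:

```python
def minibatch_gradient(examples, grad_fn, batch_size=8):
    """Average per-example gradients over a mini-batch, which reduces
    the variance of single-example gradients before a weight update."""
    batch = examples[:batch_size]
    grads = [grad_fn(x) for x in batch]
    return [sum(g[i] for g in grads) / len(batch) for i in range(len(grads[0]))]

# Toy per-example gradients that differ a lot from each other:
# 2*x gives [2.0], [-6.0], [10.0], [18.0]; their average is [6.0].
g = minibatch_gradient([1.0, -3.0, 5.0, 9.0], lambda x: [2 * x], batch_size=4)
```

The averaged gradient is far less noisy than any single example's gradient, which stabilizes the updates.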
The effectiveness of the presented author's technique was evaluated using examples of simple (see Section 6) and complex (see Sections 7 and 8) cases of identifying the author of the source code.

Identifying the Author of the Source Code in Simple Cases
Simple cases of author identification do not involve such factors as obfuscation and code formatting by coding standards. For such cases, it was decided to conduct some experiments aimed at determining the capabilities of the author's technique for identifying the true author based on HNN for further recommendations on its application.
The first step was to evaluate the technique on corpora with different numbers of authors. Subsets of 5, 10, and 20 authors were used for the experiment. The subsets of authors were selected at random. Authors whose number and length of source codes were approximately equal were placed in one training subset. The results are presented in Table 3. Based on the analysis of the approaches to identifying the source code author presented in Section 2, we can conclude that the proposed technique demonstrates the best results for all programming languages. Moreover, it turned out to be highly effective for languages not considered in the works of other researchers: C, C#, Ruby, PHP, Swift, Kotlin, Go, Groovy, and Perl. The peculiarities of the language in which the programmer writes affect performance, but in general the technique can be considered universal.
For the most popular programming language, Java, the experiment was extended to corpora of 2, 30, 40, and 50 authors. The results are shown in Figure 2. The results of the experiments aimed at evaluating the dependence of accuracy on the size of the corpus show a gradual decrease in accuracy as the number of target classes increases. Increasing the number of authors leads to a multi-class classification problem in a high-dimensional space and, as a result, to the problem of data imbalance.
To evaluate the influence of developer qualifications on the effectiveness of the author's technique, a small experiment was carried out on a corpus of source codes written by technical students.
The corpus included the works of 75 second- and third-year students. The programs were written by them to solve problems of the same type in C++. This fact excludes an increase in the generalization ability due to the specificity of the functional purpose of the programs. The subsets were limited to 15 source codes of no more than 500 lines each. The obtained results are presented on a histogram in Figure 3. The presented results indicate that the proposed technique is effective for identifying code authorship not only for professional programmers but also for beginners: qualification does not affect the classification accuracy. This suggests that the technique can be applied in the educational process for evaluating programming assignments.

Evaluation of the Technique Based on HNN on Obfuscated Data
Obfuscation of the source code [30][31][32][33] is a process aimed at modifying the source code to hide its algorithm and semantic meaning and to make it harder for a human to understand.
The most common application of obfuscation is protecting software code from plagiarism, which is increasingly common in the field of commercial software development. Programmers obfuscate code with special tools. However, the possibility of using such tools to hide unique techniques, habits, and coding style, and, as a result, to anonymize the author of a program, cannot be excluded. Virus writers often use obfuscation for this very purpose. In such situations, manual expert analysis of the source code becomes impossible, and it is necessary to use technical tools for identifying the source code author that are robust to obfuscation.
Despite the large number of studies devoted to the problem of determining the author of the source code, the topic of obfuscation is hardly covered in them. The only work that takes into account the effect of obfuscation on the identification of the source code author is [12]. As mentioned in Section 2, its authors examined the influence of the Tiggers obfuscator [34] on the identification of the source code author. The loss in accuracy reached almost 30%. This is because the features on the basis of which the model makes a decision lose their informative value due to renaming, encryption, and other code transformations used to hide the author's identity.
So, it was decided to evaluate the impact of obfuscation on the process of identifying the author of the program code using the HNN-based technique. Table 4 provides information about the obfuscators that were used to modify the datasets. The JavaScript (JS) Obfuscator Tool was used to obfuscate source codes in the JS language, performing lexical transformations, replacing variable and function names, deleting comments and whitespace, and converting strings into hexadecimal sequences and further encoding them to base64. The second tool was JS-obfuscator, which, in addition to transformations similar to the first one, obfuscates embedded HTML, PHP, and other code.
Pyarmor was used for the Python language; it obfuscates the code of functions at runtime. Its principle of operation is to obfuscate the bytecode of each object and to clean up local variables immediately after the function finishes. An alternative tool was Opy, which performs simple lexical obfuscation: changing the names of variables and functions to sequences of "1" and "l" characters and converting strings into random character sets.
For PHP, two lexical obfuscators, Yakpropo and PHP Obfuscator, were used. Both tools remove whitespace and comments. In addition, they allow scrambling the names of variables, functions, classes, etc., as well as replacing conditional statements with the "if... go" construction.
Obfuscators for the compiled languages C and C++ were also considered. Cpp Guard removes single-line and multi-line comments, adds pseudo-complex trash code that does not affect the computational complexity of the entire program, removes spaces and line breaks, processes all preprocessor directives and single-line "else" conditions where brackets are missing, and substitutes strings with their hexadecimal representation. AnalyseC replaces variable names with a sequence of random characters. Table 5 presents the results of the experiments with interpreted programming languages. The accuracy of HNN model identification, both with and without obfuscation, is given here.
Table 5. Results with interpreted languages (obfuscation case).

It can be noted that simple lexical obfuscation has a minor effect on identification, while more complex code modifications (as in the case of PyArmor) reduce accuracy by an average of 30%. Table 6 presents the results of the experiments with the compiled programming languages (C, C++). It also shows the identification accuracy of the HNN model both with and without obfuscation. Similar to the PyArmor case, the C++ obfuscator significantly reduces identification efficiency, while lexical transformations that do not operate at the bytecode level have only a minor effect.
The results of the experiments allow us to conclude that the developed HNN model is resistant to lexical obfuscation: Removing whitespace characters, transforming strings, converting and encoding them, etc. The accuracy of identifying the author of a lexically obfuscated source code is on average 7% lower than that of the original one, which is acceptable for this type of task.
In the case of more complex obfuscation performed by tools such as PyArmor and C++ Obfuscator, which operate on the bytecode of objects, add pseudo-complex code, etc., the difference in identification accuracy reaches 30%.

Evaluation of the Technique Based on HNN on Source Codes, Designed in Accordance with Coding Standards
A programmer's compliance with coding standards is also a factor that makes it difficult to identify the author.
Source code quality is one of the key aspects of commercial software development. The importance of this aspect is supported by studies [43,44] aimed at determining how an ordinary developer's time is spent. They confirm that the ratio of time spent reading code to time spent writing it is on average 10:1. This time can be reduced by improving the quality of the software code developed in the company. The main way to achieve this goal is a coding standard: a set of rules and conventions that describe how to design source code and that is shared by the development team. It improves the readability of the code and reduces the load on the developer's memory and vision.
A coding standard may comprise both basic rules and restrictions and detailed recommendations for the design of interfaces, the implementation of methods and classes, formatting conventions, the naming of variables, etc. An illustrative example of the latter is the Linux kernel [45], a project implemented jointly by a large number of developers.
Coding standards are the subject of research [46,47]. However, their authors do not pay enough attention to how standards influence the author's style, in particular the programmer's professional techniques and habits, which carry many unique informative features. Evaluating this impact is essential for the task of identifying the source code author, as it makes it possible to draw conclusions about the applicability of identification algorithms to software code written according to internally defined rules and corporate agreements.
Data from the Linux kernel project developers' repositories were collected to conduct experiments on identifying the author of source code written in accordance with coding standards. The dataset comprised over 250,000 source codes written in the C, C++, and Assembler programming languages by the five most active developers. The data obtained from GitHub were transformed into training sets of different sizes (10, 20, and 30 files) with sample lengths from 1000 to 10,000 characters. The histogram in Figure 4 shows the dependence of accuracy on the training corpus volume in cases when programmers follow coding standards and when they do not. The graph demonstrates a direct dependence of accuracy on the volume of the training set: an increase in the number of files in the set and/or in the maximum length of the samples entails an increase in identification accuracy.
The results of the experiments demonstrate the negative impact of source code samples written in accordance with coding standards on the author-identification process. This is due to the unification of code design among developers in one company, which leads to the disappearance of such informative features as:
• the author's alignment of logical blocks of the program, including the preferred type of whitespace;
• the style of naming variables, classes, methods, etc. (e.g., CamelCase, snake_case);
• the formatting of comments on source lines;
• "code smells" [11].
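As a minimal illustration (not the paper's actual feature set), surface-level features of this kind can be counted directly from raw source text; the feature names, regular expressions, and thresholds below are our own assumptions:

```python
import re

def style_features(source: str) -> dict:
    """Count a few surface-level stylistic features of the kind that
    coding standards tend to erase (illustrative sketch)."""
    lines = source.splitlines()
    identifiers = re.findall(r"\b[a-zA-Z_][a-zA-Z0-9_]*\b", source)
    return {
        # indentation: does the author indent with tabs or spaces?
        "tab_indented_lines": sum(1 for l in lines if l.startswith("\t")),
        "space_indented_lines": sum(1 for l in lines if l.startswith(" ")),
        # naming style: camelCase vs snake_case identifiers
        "camel_case": sum(
            1 for n in identifiers
            if re.fullmatch(r"[a-z]+(?:[A-Z][a-z0-9]*)+", n)
        ),
        "snake_case": sum(1 for n in identifiers if "_" in n.strip("_")),
        # comment formatting: lines carrying a line comment
        "comment_lines": sum(
            1 for l in lines if l.lstrip().startswith(("#", "//"))
        ),
    }
```

Once a team-wide standard fixes indentation, naming, and comment style, such counters become nearly identical across all of the team's developers, which is exactly why the HNN must fall back on less obvious features.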
Thus, the aim of the HNN is to search for unobvious informative features that make it possible to distinguish between author-programmers. The obtained graph (Figure 4) shows that such a search can be effective only with a sufficient amount of training data and requires more computing resources. When these conditions are met, the author's technique can be used to identify the source code author in all the cases mentioned.

Conclusions and Future Works
As part of this study, the problem of identifying the source code author was defined, and approaches to its solution were analyzed. The HNN-based technique developed by the authors solves a number of the previously identified problems, namely: dependence on the programming language, instability to obfuscation, and poor performance on source codes written according to coding standards.
The technique shows an average accuracy of 95% in the simple case of identifying the source code author. Other researchers engaged in the same problem also tested their methods on source codes from github.com, selecting data at random. However, there are no publicly available training corpora, so a direct comparison of results would be incorrect, since our corpus most likely does not intersect with theirs. Moreover, there are no implementations of their methods in the public domain, so it is impossible to evaluate them on our corpus. Nevertheless, judging only by published results, ours exceed those of competitors. In addition, the developed technique achieves high accuracy in the more difficult cases of obfuscation and the use of coding standards; competitors did not conduct such research. The author's technique demonstrated resistance to lexical obfuscation of source codes: the loss in accuracy for all obfuscators did not exceed 10%. The programmer's commitment to certain coding standards and source code formatting reduced the identification accuracy; however, increasing the volume of training data improved accuracy from 30% to 85%, which is an excellent result for such a complex case.
It should be taken into account that the technique does not require any prior data preparation: the HNN model is able to find new patterns and dependencies in the data that are unobvious to a human, and it indirectly takes into account the intellectual content and implementation of the program code.
The developed technique is universal, i.e., independent of the programming language preferred by the author: the distinctive features of programming languages did not have a critical impact on classification accuracy. The differences in accuracy across programming languages are caused by several factors. The history of a language's development and its original philosophy [48] strongly influence program structure and language semantics. Long-established programming languages (C, C++, Perl, Python, Java, PHP, and C#) have already passed their phase of active development and contain built-in tools for solving specific tasks. Recently created languages (Groovy, Go, Swift, and Kotlin) were designed for fast and efficient solving of entirely new classes of tasks and are still in the active development phase: new modules are released frequently and are immediately adopted by the community, so authors often have to invent new solutions, relying only on their own skills. As a result, their program code contains more of the author's distinctive features. Furthermore, the difference in accuracy can be explained by the degree of creative freedom in code writing. For example, in some languages whitespace can be placed arbitrarily, while in others it marks the boundaries of logical blocks. The paradigm of a specific language also constrains the author's style. For example, built-in associative arrays make text search easier because the search algorithm is encapsulated in the array implementation; if no such data type exists, the programmer has to implement it himself. These facts can partly explain the lower accuracy for the JS and Ruby languages. Thus, the number of informative features extracted by the NN differs across programming languages, and accuracy depends on the number of such features.
Performance degradation on a large number of authors is not a problem in practice, since high generalization ability is not required when solving real-life tasks. The set of possible authors can first be reduced by filtering source codes by functionality, used frameworks and libraries, etc.; the authors from the resulting set are then used as target classes in the identification procedure. This approach can improve the accuracy of author identification without the need for additional training data.
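A minimal sketch of this pre-filtering step, assuming the candidate authors' known samples are available as plain strings and that shared imported libraries serve as the filter criterion (both are our assumptions, not part of the original technique):

```python
import re

def filter_candidates(unknown_source: str, author_samples: dict) -> list:
    """Reduce the candidate-author set before classification by keeping
    only authors whose known samples import at least one library that
    the unknown code also imports. `author_samples` maps an author name
    to a list of source strings (hypothetical data layout)."""
    def imports(src):
        # top-level module names from `import x` / `from x import ...`
        return set(re.findall(r"^\s*(?:import|from)\s+(\w+)", src, re.M))

    target = imports(unknown_source)
    keep = []
    for author, samples in author_samples.items():
        used = set().union(*(imports(s) for s in samples)) if samples else set()
        if target & used:  # shares at least one library with the unknown code
            keep.append(author)
    return keep
```

Only the surviving authors would then be used as target classes for the HNN classifier, shrinking the output space without retraining on new data.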
Further research is planned. In particular, we plan to address the more complicated task of identifying the source code author when multiple programmers work on the same file or project. Moreover, the task needs to be generalized to the case when it is unclear whether the code was written by one programmer or several. We are already working on this problem and are developing a deep-learning-based approach for one-class classification.