As digitization spreads into all areas of business and social life, the pressure on software development organizations is growing. The sheer amount of code being created, and the increasing complexity of software systems, fuels the need for new methods and tools to support the software development process.
A widely adopted framework addressing the challenges of the modern software delivery lifecycle is the DevOps model [1
], which is founded on the principles of continuous integration, continuous delivery, and continuous testing. Both the wisdom of the crowd and academic evidence [2
] speak for the efficiency of DevOps practice, but adopting DevOps brings its own challenges, including a significant increase in the volume and frequency of testing. In fact, on a large-scale project it is not feasible to implement DevOps without test automation, and writing automated test cases is time- and resource-consuming. Not surprisingly, automated test case generation methods are being actively studied. In general, to generate unit test cases, existing approaches use information extracted from other software artifacts, such as code under test, specification models, or execution logs [3].
State-of-the-art test generation tools can significantly improve test coverage; however, it has been shown that their fault detection potential is problematic: many faulty code portions are never executed, or are executed in such a way that defects are not detected [4
]. These tools stem from the tradition of research on code analysis and code generation that is concerned with formal semantics and structural information about the code. Such research takes advantage of the formality, consistency, and unambiguity of programming languages, that is, the properties that distinguish source code from natural languages. A more recent research trend switches the focus to statistical semantics. This exciting alternative can now be fully explored thanks to the much-increased availability of source-code resources stored in online open source repositories. It has been argued that large source-code corpora exhibit statistical properties similar to those of natural language corpora [5
], and indeed, statistical language models developed for Natural Language Processing (NLP) have proved to be efficient when applied to programming languages [6].
However, even billions of lines of code scraped from online repositories are not sufficient to satisfy the training requirements for some types of tasks. Many applications—such as code generation from natural language descriptions, code search by natural language queries, or automated code documentation—require joint processing of natural languages and programming languages. This means that the source-code corpora used for training these systems need to be appropriately annotated with natural language descriptions. The main challenge here is the acquisition of fine-grained natural language annotations that accurately and consistently describe the semantics of code.
As a concrete example, statistical translation models used in NLP require training on parallel corpora—also called bi-texts
—in which many sentences written in a source language are aligned with their translations in a target language. A schematic example of a parallel corpus with aligned sentences in natural languages is presented in Figure 1
a. To apply statistical translation models to source code—that is, to train a model capable of mapping textual sequences to sequences of source code—it is necessary to obtain a text-code
parallel corpus in which a large number of code units are aligned with their descriptions in natural language (Figure 1b).
Aligned corpora for statistical machine translation (MT) of two natural languages (bi-text datasets) can be gathered from existing collections of translated texts. However, obtaining a parallel corpus for a natural language coupled with a programming language (a text-code dataset) is much less straightforward.
The question of what the nature and level of detail of the natural language descriptions provided in such a corpus should be does not have a definite answer and requires more investigation. Nonetheless, it seems reasonable to assume that the practical value of a text-code corpus depends on the following properties:
Size, and the potential to scale in size in the future, is of particular importance for deep learning models which require large amounts of training data.
Acquisition cost, in terms of human effort and the complexity of the procedure.
Level of noise in the acquired data.
Granularity of the natural language descriptions.
Several recent studies have proposed more or less sophisticated methods of obtaining text-code corpora (see Section 2
). The proposed methods vary in terms of the properties listed above, but regardless of their practical value, none of them is applicable to the testing domain. The main contribution of this paper is a novel method of automatically synthesizing large text-code datasets containing textual descriptions of single testing tasks, each matched with the code implementing that task. Moreover, in this paper we demonstrate that machine learning models trained on our datasets can generate complete, compilable, and semantically relevant automated test cases based on quasi-natural language descriptions of testing tasks. These results were obtained using a neural MT model [7
] designed for learning from bi-text corpora, in which the degree of equivalence between source and target languages is very high. We find that this off-the-shelf neural MT architecture performs well on our text-code corpora, which suggests that the quasi-natural language descriptions obtained using our approach are precise and consistent enough to allow direct translation to code.
There are two aspects of the potential implications of the presented work. First, from the perspective of the testing community, we present an efficient, inexpensive, and scalable method for annotating test code with textual descriptions. The availability of such annotated datasets can accelerate the application of the latest advances in machine learning to the testing domain. Second, from the perspective of research on applying statistical models to source code, our datasets may provide better insight into the desired characteristics of text and code sequences in a training corpus. Understanding what type of annotations works well or what is the optimal translation unit for the source code may be valuable for researchers concerned with synthesizing text-code datasets.
The remainder of this article is organized as follows. Section 2
provides an overview of existing solutions for text-code corpora acquisition. In Section 3
we provide the rationale for our approach and explain how it works. Section 4
describes in detail the procedure of synthesizing training corpora used in our experiments. Section 5
presents the experimental setup and the results of training a neural MT model on a text-code dataset generated using our method. In Section 6
we discuss the results, and Section 7
concludes the paper.
2. Related Work
In this section, we do not attempt to show the full range of techniques of matching natural language descriptions to code that have been proposed throughout the literature. Rather, our aim is to investigate which approaches can yield datasets that meet the training needs of statistical text-code language models. Thus, the focus of this review is on studies which are explicitly concerned with applying language models to source code, and which provide some evidence for the performance of text-code language models trained on the proposed datasets.
The performance of many of the models covered in this review, and indeed of the models we present later in the paper, is evaluated using BLEU [8
]. BLEU is a de facto standard measure for MT. BLEU compares machine output with human-authored ground-truth translation, and scores how close they are to each other, on a scale from 0 to 100, with 100 indicating a hypothetically perfect translation. In the context of source-code generation from text input, BLEU is calculated by comparing the output of the model to the source-code ground truth from the corpus.
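To make the metric concrete, the following is a minimal sketch of a sentence-level BLEU computation on token sequences. It is a simplification for illustration only: it assumes a single reference, n-grams up to length 4, and no smoothing, whereas BLEU is normally computed at corpus level. The function names and the example token sequences are hypothetical and do not come from the evaluated systems.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU (0-100) for one candidate and one reference."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: an empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # The brevity penalty discourages candidates shorter than the reference.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return 100 * penalty * math.exp(sum(log_precisions) / max_n)


# Hypothetical ground-truth and generated code sequences (already tokenized).
reference = "assertEquals ( 2 , list . size ( ) ) ;".split()
candidate = "assertEquals ( 3 , list . size ( ) ) ;".split()
score = bleu(candidate, reference)
```

An output identical to the ground truth scores 100, while a single wrong token lowers every n-gram precision that covers it, which is one reason scores on short code sequences can be sensitive to small differences.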
Perhaps the most straightforward solution to creating a dataset of aligned text and code is reported in [9
], where a software engineer was hired to manually annotate 18,000 Python statements with pseudo-code. This approach is neither scalable nor cheap, but the study provides interesting insights. The dataset was used to train phrase-based and tree-to-string SMT models to generate pseudo-code from source code. The tree-to-string model outperformed the phrase-based model by a large margin (BLEU score of 54 compared to 25), suggesting that correct code-to-text mapping necessitates parsing program structure, and that even line-by-line, noise-free descriptions are not sufficient to support a plain phrase-based translation model.
For the work reported in [10
] two text-code datasets, one containing Java and the other Python code, were created. In both datasets the code units were aligned with descriptions that combine structured and unstructured text. These datasets were used to train a neural model which generated code from a mix of natural language and structured inputs. The model achieved an impressive BLEU score of 65.6. Furthermore, the authors trained two neural translation models as baselines, one augmented with their structured attention mechanism. The augmented translation model outperformed the plain translation model on both datasets (BLEU scores of 50 and 44 compared to 34 and 29).
The remaining papers included in this review use a Big Data approach. This research follows two main directions: one toward exploiting Big Code (primarily GitHub (https://github.com
)), and the other toward mining programming-related Q&A websites, primarily StackOverflow (https://stackoverflow.com).
The Big Code route involves scraping API documentation (Javadoc
, Python docstrings
) from online source-code repositories, and using it as natural language description of code fragments. The research reported in [11
] created a massive parallel corpus of over 7.5 million API sequences annotated with excerpts from Javadoc
. These API sequences are not raw code sequences; rather, they are parsed representations of general API usage. Consequently, a neural MT model trained on this corpus would not generate code; instead, given a description of the required functionality, it would produce hints on the APIs that can be used. This model was augmented with information on API importance, and achieved a BLEU score of 54.
Another corpus exploiting API documentation [12
] consists of over 150,000 Python function bodies annotated with docstrings
. The authors used a back-translation approach [13
] to extend this corpus with a synthetic dataset of 160,000 entries. The performance of a (non-augmented) neural translation model trained on the extended corpus was low (BLEU score of 11).
The second route—assembling datasets from user queries matched to code fragments mined from Q&A websites—has recently attracted a lot of attention. In [14
], two training corpora were created from C# snippets extracted from responses to StackOverflow and Dot Net Perls questions, and matched with the titles of these questions. Furthermore, general-purpose search engine queries that produced clicks to the questions were added as alternative natural language descriptions. A bi-modal source-code language model trained on the resulting dataset was evaluated in terms of retrieval capability. The model performed much better when retrieving natural language descriptions (with code snippets as queries) than when retrieving code snippets (with NL descriptions as queries): Mean Reciprocal Rank of 0.44 as compared to 0.18.
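For readers unfamiliar with the retrieval metric, Mean Reciprocal Rank averages, over all queries, the reciprocal of the rank at which the first correct item is returned, so its value always lies between 0 and 1. The sketch below is illustrative only; the function name and the convention of passing None for a query whose correct item was not retrieved are assumptions.

```python
def mean_reciprocal_rank(ranks):
    """Compute MRR from 1-based ranks of the first correct result per query.

    A rank of None means the correct item was not retrieved and contributes 0.
    """
    if not ranks:
        raise ValueError("at least one query is required")
    return sum(1.0 / rank for rank in ranks if rank is not None) / len(ranks)


# Three hypothetical queries whose correct answers were ranked 1st, 2nd, and 4th.
example_mrr = mean_reciprocal_rank([1, 2, 4])
```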
Datasets collected from Q&A websites are large, and have the potential to grow as new questions and answers are added to the websites, but the level of noise in the data is very high. Queries can have irrelevant or very informal titles, and the code snippets are often incomplete and non-compilable. This problem was partly addressed in [14
], by applying simple heuristics, but other researchers deemed this approach insufficient and proposed extracting quality datasets from noisy Q&A corpora by applying machine learning models, trained on human-annotated seed datasets, as filters. For example, in one study after collecting C# and SQL snippets produced in response to questions posted on StackOverflow and paired with the titles of these questions, the authors manually annotated a small subset of data and trained a semi-supervised classifier to filter out titles that were irrelevant to the corresponding code snippet [15
]. The resulting cleaned corpora (containing over 66,000 pairs for C# and 32,000 for SQL) were used to train a neural MT model for code summarization (that is, for generating text from code, not code from text), which achieved BLEU scores of 20.5 (C#) and 18.4 (SQL).
Systematic mining of question-code datasets retrieved from Stack Overflow was the main focus of two other studies. In [16
], user queries matched with Python and SQL code snippets were subjected to a series of cleaning steps. First, a logistic regression classifier (with human-engineered features) was trained to select questions of type "how-to-do-it", in which the user provides a scenario and asks how to implement it. Next, a subset of over 5000 question-code pairs was manually annotated by hired students who judged whether a snippet constitutes a standalone solution to the corresponding question. A novel model, called the Bi-View Hierarchical Neural Network, was trained on the annotated data and used to select over 147,500 Python and 119,500 SQL question-code pairs, to be included in the final dataset (https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset).
Another complex method to mine high-quality aligned data from Stack Overflow was described in [17
]. First, the authors manually engineered a set of code structure features needed to determine the syntactic validity of a code snippet. Second, a subset of collected StackOverflow posts was manually annotated, using a carefully designed annotation procedure, to label specific elements in each post (intent, context, snippet), and to filter out non "how-to-do-it" questions. Next, a neural translation model was trained to learn "correspondence features"—that is, to learn the probability of the intent given a snippet, and the probability of the snippet given an intent. Finally, the calculated probabilities for each intent and snippet were fed to a logistic regression classifier as the correspondence features, together with the manually engineered structural features. After training, the classifier was used to create an automatically mined training corpus containing almost 600,000 intent-snippet pairs (https://conala-corpus.github.io/
). Given the small set of annotated data used for training the correspondence features, the authors acknowledged existing threats to validity, and provided an additional dataset of 3,000 examples manually curated by annotators. A baseline neural MT model trained on 100,000 examples from the automatically mined corpus combined with the curated data achieved a BLEU score of 14.26. The performance of the model trained on curated data only was even lower (BLEU score of 10.58).
3. Text-Code Corpora Acquisition from Self-Documenting Code
In Section 2
we outlined two main approaches to the large-scale acquisition of annotated source code: collecting developer-defined descriptions extracted from API documentation, and collecting user-defined descriptions, extracted from users’ questions posted on Q&A websites and matched with code snippets posted as answers. Neither of these approaches is applicable to the testing domain. Javadoc
exists only for code that is part of a public API, and this type of documentation is not available for test automation code. Data collected from Q&A websites contains code snippets that can help in solving concrete programming issues but not in writing specific test cases. Only a small fraction of questions on Stack Overflow are related to testing, and dedicated websites on software quality are far behind in terms of popularity (for example, the Software Quality Assurance & Testing website (https://sqa.stackexchange.com/) stores 8500 questions, as compared to 17,000,000 at Stack Overflow). Furthermore, the performance of machine learning models trained on the existing large-scale text-code datasets is low.
In our approach we take advantage of a programming practice known as self-documenting code. Although research on applying statistical language models to source code is relatively new, software developers have long been aware that source code is written for two recipients: one is the machine, and the other is a software developer who will be reviewing, extending, or maintaining the code. The need to secure the interests of the second recipient has been embodied in the Clean Code paradigm [18], a well-known set of best practices focused on making programming code readable and easy to understand for humans. One of the key Clean Code principles is to create variable identifiers and function names that are meaningful and reveal the programmer’s intent. To that end, the developer uses multiple words to formulate an adequate description of a function, and then squeezes all the words into the function name, using some convention that helps the reader to recognize individual words (such as camelCase
in Java, or snake_case
in Python). Code comments are discouraged, as they carry a high risk of becoming outdated and are often redundant; instead, it is recommended that the code should be self-explanatory.
The method we propose is based on the observation that self-documentation through descriptive method names is widely adopted in test automation, in particular unit testing. Figure 2
shows real-life examples drawn from the spring-framework
, an open source Java project stored on GitHub (https://github.com/spring-projects/spring-framework
). Each of these self-documenting test function names is a concise summary of a specific testing task, written in a quasi-natural language, and observing a consistent naming convention. The body of each function is the implementation of that task in a programming language. Thus, a parallel text-code corpus can be formed from function names split into individual words and aligned with function bodies.
The text-code dataset creation method we propose exploits self-documenting code and can, in principle, be applied to test cases written in any high-level programming language, provided that the code has been written with consideration for readability. The generic language-independent procedure can be summarized as follows:
Collect automated test cases from a source-code repository. Depending on the use case, datasets can be assembled from data within a single repository (for training custom, project- or organization-specific models) or from multiple repositories (for training generic models).
For each collected test case:
Split function identifiers into individual words, according to the adopted naming convention.
Tokenize the function body, preserving punctuation marks.
Add the split function name to the corpus as a source sequence. Add the tokenized function body to the corpus as the corresponding target sequence.
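The generic procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, identifiers are assumed to follow camelCase or snake_case, words are lowercased, and the body is tokenized with a simple regular expression that would also break up string literals containing spaces.

```python
import re


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = []
    for part in name.split("_"):
        # Acronyms, capitalized or lowercase words, and digit runs become separate words.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [word.lower() for word in words]


def tokenize_body(body):
    """Tokenize code, keeping every punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", body)


def build_corpus(test_cases):
    """Map (function name, function body) pairs to (source, target) sequences."""
    return [(split_identifier(name), tokenize_body(body)) for name, body in test_cases]


# A hypothetical test case, used only to show the shape of one corpus entry.
cases = [("shouldReturnEmptyListWhenNoMatch", "assertTrue(result.isEmpty());")]
corpus = build_corpus(cases)
```

Each corpus entry pairs the word sequence derived from the function name (the source side) with the token sequence derived from the function body (the target side).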
In the experiments we use three datasets synthesized from open source code stored on GitHub. Two of them (labeled sf and rx) were each extracted from a single large Java repository, and the third one (multi) was assembled from test code pulled from over 700 repositories. In the following paragraphs we first describe how we acquired data from a single repository, and then describe the procedure used to assemble the multi-repository corpus.
4.1. Processing Data from a Single Repository
To create the project-specific datasets, sf and rx, we chose two actively maintained GitHub repositories, one of which is the spring-framework (https://github.com/spring-projects/spring-framework). This choice was guided by two criteria: the size of the repository and the popularity of the repository, as indicated by the number of stars assigned by users and the number of created forks (repository copies created by members of the GitHub community). We assume that the popularity of a repository is correlated with the quality of code. Table 1
summarizes the properties of the two selected repositories.
We crawl each repository to retrieve all test class files, identified as those containing the @Test
annotation string. We process every test class using the JavaParser library (http://javaparser.org/
) to extract and parse the test functions. Each test function is an automated test case. A parsed test function is represented as a JSON object whose properties comprise function name, function body, parent class name, and some metadata for identification. Any inline comments are separated out from the function body and rejected. The result of this step is a JSON array containing all parsed test functions from the repository.
The next step is the actual synthesis of the corpus. From each JSON object, we take the class name and the function name, which are both camel case strings, convert them into space-separated strings, and prepend the special tokens #class and #method, respectively. The quasi-natural language description is created by concatenating the two resulting strings. The programming language sequence is created by simply tokenizing the function body. We assume that tokens such as parentheses, punctuation marks, or mathematical symbols should be treated as individual words, and tokenize the code by surrounding each such character with whitespace.
Figure 3 shows schematically how the quasi-natural language sentence and the corresponding programming language sequence are derived from a single unit test case.
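As a rough illustration of this synthesis step, the sketch below builds one sequence pair from a parsed test function. The helper names, the lowercasing of words, and the example class, method, and body are assumptions made for illustration; only the #class and #method tokens and the whitespace-based treatment of symbols come from the description above.

```python
import re


def camel_to_words(identifier):
    """Insert a space at each lowercase-to-uppercase boundary, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", identifier).lower()


def describe(class_name, method_name):
    """Build the quasi-natural language description with the special tokens."""
    return f"#class {camel_to_words(class_name)} #method {camel_to_words(method_name)}"


def space_out_symbols(body):
    """Surround every symbol character with whitespace, then normalize spacing."""
    return " ".join(re.sub(r"([^\w\s])", r" \1 ", body).split())


# Hypothetical parsed test function.
source = describe("AccountServiceTests", "withdrawFailsWhenBalanceIsTooLow")
target = space_out_symbols("assertFalse(service.withdraw(100));")
```

Here source becomes "#class account service tests #method withdraw fails when balance is too low" and target becomes "assertFalse ( service . withdraw ( 100 ) ) ;", so that a whitespace tokenizer sees each symbol as an individual word.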
When adding sequence pairs to the corpus, we apply two filters:
4.2. Collecting Data from Multiple Repositories
The data used to create the multi-repository corpus was collected using the github3 (https://pypi.org/project/github3.py/
) library—a Python wrapper for the GitHub API. The search query included two parameters: language:java
. We were not concerned with the size of individual repositories. The query returned almost 2000 repositories. Over 700 of them contained JUnit test cases, which we used for dataset creation.
The repositories were processed one by one according to the procedure described in Section 4.1
. In addition to applying the duplicate and length filters to each individual repository, we also checked for and removed any duplicates across the repositories. We also applied some simple heuristics to exclude functions with meaningless names, such as test1().
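The filtering steps mentioned above could be combined along the following lines. This is a sketch under explicit assumptions: the length threshold, the regular expression used to detect meaningless names, and the record layout are all hypothetical, since the exact filter settings are not given here.

```python
import re

# Hypothetical heuristic: names such as test(), test1(), or test_2() carry no description.
MEANINGLESS_NAME = re.compile(r"^test_?\d*$", re.IGNORECASE)


def filter_pairs(records, max_target_tokens=100):
    """Keep unique, reasonably short pairs whose function name is descriptive.

    records: iterable of (function_name, source_tokens, target_tokens).
    max_target_tokens: assumed length limit, not a value taken from the paper.
    """
    seen = set()
    kept = []
    for name, source, target in records:
        if MEANINGLESS_NAME.match(name):
            continue  # meaningless-name heuristic
        if len(target) > max_target_tokens:
            continue  # length filter
        key = (tuple(source), tuple(target))
        if key in seen:
            continue  # duplicate filter
        seen.add(key)
        kept.append((source, target))
    return kept


# Hypothetical records; the second and third are exact duplicates.
records = [
    ("test1", ["test", "1"], ["foo", "(", ")", ";"]),
    ("addsItem", ["adds", "item"], ["list", ".", "add", "(", "x", ")", ";"]),
    ("addsItem", ["adds", "item"], ["list", ".", "add", "(", "x", ")", ";"]),
    ("tooLong", ["too", "long"], ["x"] * 500),
]
filtered = filter_pairs(records)
```

Running the same function over the concatenated records of all repositories would also remove duplicates that occur across repositories, because the set of seen pairs is shared.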
Table 2 lists the metadata for the three datasets used in the experiments. The source-code vocabulary sizes are very large. This is not surprising, given that each programming API brings in new source-code tokens. For the sf and rx corpora we kept the full PL vocabularies. For the multi-repository dataset, we limited the PL vocabulary size by discarding all tokens that occurred fewer than n times in the corpus. In the dataset, words excluded from the vocabulary were mapped to a special unknown token. We created two versions of the multi-repository dataset, one with n = 5 (multi-min5) and the other with n = 10 (multi-min10).
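The frequency-based vocabulary cut-off can be illustrated with a short sketch. The spelling of the unknown token and the function name are assumptions; the toy corpus is hypothetical and far smaller than any real dataset.

```python
from collections import Counter

UNK = "<unk>"  # assumed spelling of the special unknown token


def limit_vocabulary(sequences, min_count):
    """Keep tokens seen at least min_count times and map all others to UNK."""
    counts = Counter(token for sequence in sequences for token in sequence)
    vocabulary = {token for token, count in counts.items() if count >= min_count}
    replaced = [
        [token if token in vocabulary else UNK for token in sequence]
        for sequence in sequences
    ]
    return vocabulary, replaced


# Hypothetical target-side sequences; "a" and "b" each occur only once.
sequences = [
    ["assertEquals", "(", "a", ")"],
    ["assertEquals", "(", "b", ")"],
]
vocabulary, replaced = limit_vocabulary(sequences, min_count=2)
```

Raising the threshold shrinks the vocabulary the decoder has to predict over, at the cost of more unknown tokens in the target sequences.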
The purpose of the experiments reported in this section is to investigate whether test code annotated with descriptions extracted from meaningful function names provides good quality data for training statistical language models. To do this, we use the corpora described in Section 4
to train a well-known neural translation model to generate test code from quasi-natural language descriptions.
Neural translation models are an example of end-to-end learning: these models take a source language sentence as an input, encode it into a dense vector (inter-lingual) representation, and decode this representation to generate the translation target language sentence. The model used in our study belongs to the class of sequence-to-sequence models [19
], and is a TensorFlow implementation of the attention-based architecture proposed in [7].
We trained three models, one on each of the datasets. In the preliminary phase we performed trial runs to establish a set of reasonable hyperparameters, and then kept them constant throughout the experiments. We used 2-layered LSTMs for both the encoder and the decoder, with the Bahdanau attention mechanism [20
]. For the optimizer we used Adam with a starting learning rate of 0.001. For regularization, we set the dropout probability to 0.2, as recommended for LSTMs [21].
Each model was evaluated on a test set that was randomly sampled from the dataset the model was trained on. These test sets were sampled prior to training and were held out from the training process. In the case of the multi-repository corpus, the test set is the concatenation of test sets sampled from each contributing repository. The results are presented in Table 3.
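The held-out sampling scheme for the multi-repository corpus can be sketched as follows. The test fraction, the fixed random seed, and the function name are assumptions made for illustration; the proportions actually used are not stated here.

```python
import random


def split_by_repository(pairs_by_repo, test_fraction=0.1, seed=0):
    """Sample a held-out test set from each repository and concatenate the samples.

    pairs_by_repo: dict mapping a repository name to its list of sequence pairs.
    test_fraction and seed are assumed values, chosen only for this sketch.
    """
    rng = random.Random(seed)
    train, test = [], []
    for repo in sorted(pairs_by_repo):  # fixed order keeps the split reproducible
        pairs = list(pairs_by_repo[repo])
        rng.shuffle(pairs)
        n_test = int(len(pairs) * test_fraction)
        test.extend(pairs[:n_test])
        train.extend(pairs[n_test:])
    return train, test


# Hypothetical corpus with two repositories of 10 and 20 pairs.
pairs_by_repo = {
    "repo-a": [("a", i) for i in range(10)],
    "repo-b": [("b", i) for i in range(20)],
}
train_pairs, test_pairs = split_by_repository(pairs_by_repo)
```

Sampling per repository before concatenating ensures that every contributing repository is represented in the test set in proportion to its size.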
BLEU scores on code generation are not directly comparable with scores on natural language translation due to the differences in the length of the translated strings and the inherent differences in the (natural vs. programming) language structures. Moreover, it is well known in the MT community that BLEU scores are not necessarily a true reflection of translation quality. These caveats aside, however, the results from our experiments are promising and indicate that our approach of generating test case source code from parsed class and function names is feasible. (The BLEU score obtained on multi-min10 is somewhat higher than the score obtained on multi-min5, which may seem surprising, given that in the context of natural language translation, a higher number of unknown words has been shown to have a negative impact on performance [20]. However, since the aim of our experiments was to confirm the usefulness of the parallel corpora built from self-documenting code (rather than to evaluate a specific machine learning model), we put limited effort into hyperparameter optimization, and it is possible that a more extensive search of the hyperparameter space would provide slightly different results. Investigating the impact of the vocabulary size in the context of programming languages would be useful, but remains outside the scope of this paper.) In the following section we present a qualitative, example-based analysis of the performance of our models with a view to better understanding and contextualizing the results of training an NMT model on the three datasets extracted from self-documenting code.
7. Conclusions and Future Work
In this paper, we have presented a method that exploits the availability of source code in open software repositories to automatically construct an aligned text-code dataset. This method leverages self-documenting code—the software engineering practice of using meaningful class and function names to describe the functionality of program modules. Furthermore, we have demonstrated how datasets constructed in this way can be used to train (MT-inspired) text to source-code generation systems. Moreover, we have shown that it is feasible to use this machine translation-based code generation approach for automatic test case generation.
A key difference between our approach and the methods discussed in Section 2 is that the textual descriptions in our parallel corpora are not expressed in true natural language. In NLP, a lack of naturalness is often considered a weakness, but in the context of the software testing domain, the quasi-natural language nature of the text in our generated datasets does not affect usability, because this language has been devised by the software developer community. This form of communication has a lot in common with Controlled Natural Languages [23
]: it uses simplified syntax but preserves most of the natural properties of its base language and can be intuitively understood. As elaborated on in Section 6
, this compliance by software developers with naming conventions results in consistent repeatable patterns within the generated datasets which make the learning task feasible. Furthermore, unlike previous attempts to leverage developer-defined descriptions of code [11
], our approach can be applied to generating code that is not exposed as a public API and therefore lacks inline documentation. Admittedly, escaping one limitation (the restriction to code sourced from public APIs) comes at the cost of another limitation (the restriction to self-documenting code only).
We have demonstrated the feasibility of our approach within the software testing domain. Specifically, our experiments have focused on the generation of unit test cases, which can be described in a single quasi-natural language sentence and which have relatively short code bodies. As a result, one open question that remains is whether our approach can be generalized to other (potentially more complex) code domains. This is a question we will address in future work. However, even with this potential limitation, the current results are worthwhile because the demand for automated tests in the modern software development cycle is very high, and we believe it is important to fill the gap in the availability of training data for developing test automation tools, even if the approach is not universal. The fact that existing neural translation models trained on this type of data can achieve satisfactory performance is evidence of the high quality of the text-code parallel corpora synthesized from class and function names. Indeed, we believe that the performance of these initial systems is at a level that permits the immediate application of our approach in the area of software engineering.
That said, the evaluation of the true value of the generated code requires far more effort. As pointed out in Section 5
, a BLEU score is an approximate indicator of the quality of translation of natural languages. We have not identified any empirical studies investigating the applicability of BLEU to source code, and therefore the results reported in this and other papers need to be treated with caution.
A more reliable evaluation would involve retrieving feedback from human users. Our current efforts are focused on developing a Test Recommendation Engine trained on the corpora extracted from self-documenting code. The Engine, which is a part of a Horizon 2020 project (https://elastest.eu/
), will be released publicly, and we plan to collaborate with several industry test teams to gather their feedback.
We envisage two use cases benefiting from automated test generation. The first one involves training a project-specific model, similar to the ones trained on the rx and sf datasets. The tools built using such a specialized model are likely to produce accurate test cases for the new code written as developers add or modify features in an already mature project. The second use case involves a generic model trained on multiple repositories. In this case, the predictions are likely to be less accurate (and so require some editing) but the model would still be of value for test teams working on new projects with a minimal codebase.