Information
  • Article
  • Open Access

17 February 2019

Automatic Acquisition of Annotated Training Corpora for Test-Code Generation

1 Innovation Exchange, IBM Ireland, Dublin 4, Ireland
2 ADAPT Centre & ICE Research Institute, Technological University Dublin, Dublin 2, D08 X622, Ireland
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Artificial Intelligence—Methodology, Systems, and Applications

Abstract

Open software repositories make large amounts of source code publicly available. Potentially, this source code could be used as training data to develop new, machine learning-based programming tools. For many applications, however, raw code scraped from online repositories does not constitute an adequate training dataset. Building on the recent and rapid improvements in machine translation (MT), one possibly very interesting application is code generation from natural language descriptions. One of the bottlenecks in developing these MT-inspired systems is the acquisition of parallel text-code corpora required for training code-generative models. This paper addresses the problem of automatically synthesizing parallel text-code corpora in the software testing domain. Our approach is based on the observation that self-documentation through descriptive method names is widely adopted in test automation, in particular for unit testing. Therefore, we propose synthesizing parallel corpora comprised of parsed test function names serving as code descriptions, aligned with the corresponding function bodies. We present the results of applying one of the state-of-the-art MT methods on such a generated dataset. Our experiments show that a neural MT model trained on our dataset can generate syntactically correct and semantically relevant short Java functions from quasi-natural language descriptions of functionality.

1. Introduction

As digitization spreads into all areas of business and social life, the pressure on software development organizations is growing. The sheer amount of code being created, and the increasing complexity of software systems, fuels the need for new methods and tools to support the software development process.
A widely adopted framework addressing the challenges of the modern software delivery lifecycle is the DevOps model [1], which is founded on the principles of continuous integration, continuous delivery, and continuous testing. Both the wisdom of the crowd and academic evidence [2] speak for the efficiency of DevOps practice, but adopting DevOps brings its own challenges, including a significant increase in the volume and frequency of testing. In fact, on a large-scale project it is not feasible to implement DevOps without test automation—and writing automated test cases is time- and resource-consuming. Not surprisingly, automated test case generation methods are being actively studied. In general, to generate unit test cases, existing approaches use information extracted from other software artifacts, such as code under test, specification models, or execution logs [3].
State-of-the-art test generation tools can significantly improve test coverage; however, it has been shown that their fault detection potential is problematic: many faulty code portions are never executed, or are executed in such a way that defects are not detected [4]. These tools stem from the tradition of research on code analysis and code generation that is concerned with formal semantics and structural information about the code. Such research takes advantage of the formality, consistency, and unequivocalness of programming languages—that is, the properties that distinguish source code from natural languages. A more recent research trend switches the focus to statistical semantics. This exciting alternative can now be fully explored thanks to the much-increased availability of source-code resources stored in online open source repositories. It has been argued that large source-code corpora exhibit similar statistical properties to those of natural language corpora [5], and indeed, statistical language models developed for Natural Language Processing (NLP) have proved to be efficient when applied to programming languages [6].
However, even billions of lines of code scraped from online repositories are not sufficient to satisfy the training requirements for some types of tasks. Many applications—such as code generation from natural language descriptions, code search by natural language queries, or automated code documentation—require joint processing of natural languages and programming languages. This means that the source-code corpora used for training these systems need to be appropriately annotated with natural language descriptions. The main challenge here is the acquisition of fine-grained natural language annotations that accurately and consistently describe the semantics of code.
As a concrete example, statistical translation models used in NLP require training on parallel corpora—also called bi-texts—in which many sentences written in a source language are aligned with their translations in a target language. A schematic example of a parallel corpus with aligned sentences in natural languages is presented in Figure 1a. To apply statistical translation models to source code—that is, to train a model capable of mapping textual sequences to sequences of source code—it is necessary to obtain a text-code parallel corpus in which a large number of code units are aligned with their descriptions in natural language (Figure 1b).
Figure 1. Schematic examples of parallel corpora: (a) for NLP translation (source English sentences aligned with target Vietnamese sentences); (b) for joint processing of natural and programming languages.
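Since Figure 1 is only schematic, the minimal sketch below shows, as Python data, how both corpus types reduce to the same structure: a list of aligned (source, target) sequence pairs. The sentence and code examples are invented purely for illustration.

```python
# Purely illustrative: both corpus types reduce to lists of aligned
# (source sequence, target sequence) pairs.

# (a) Bi-text corpus for natural-language MT (the target side is elided here).
bitext_corpus = [
    ("how are you ?", "<Vietnamese translation>"),
]

# (b) Text-code corpus: a description in (quasi-)natural language aligned with
#     the code unit implementing it (an invented example).
text_code_corpus = [
    ("return the larger of two integers",
     "public int max ( int a , int b ) { return a > b ? a : b ; }"),
]
```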
Aligned corpora for statistical machine translation (MT) of two natural languages (bi-text datasets) can be gathered from existing collections of translated texts. However, obtaining a parallel corpus for a natural language coupled with a programming language (a text-code dataset) is much less straightforward.
The question of what the nature and level of detail of the natural language descriptions provided in such a corpus should be does not have a definite answer and requires more investigation. Nonetheless, it seems reasonable to assume that the practical value of a text-code corpus depends on the following properties:
  • Size, and the potential to scale in size in the future, is of particular importance for deep learning models which require large amounts of training data.
  • Acquisition cost, in terms of human effort and the complexity of the procedure.
  • Level of noise in the acquired data.
  • Granularity of the natural language descriptions.
Several recent studies have proposed more or less sophisticated methods of obtaining text-code corpora (see Section 2). The proposed methods vary in terms of the properties listed above, but regardless of their practical value, none of them are applicable to the testing domain. The main contribution of this paper is a novel method of automatically synthesizing large text-code datasets containing textual descriptions of single testing tasks, each matched with the code implementing that task. Moreover, in this paper we demonstrate that machine learning models trained on our datasets can generate complete, compilable, and semantically relevant automated test cases based on quasi-natural language descriptions of testing tasks. These results were obtained using a neural MT model [7] designed for learning from bi-text corpora, in which the degree of equivalence between source and target languages is very high. We find that this off-the-shelf neural MT architecture performs well on our code-text corpora, which suggests that the quasi-natural language descriptions obtained using our approach are precise and consistent enough to allow direct translation to code.
There are two aspects of the potential implications of the presented work. First, from the perspective of the testing community, we present an efficient, inexpensive, and scalable method for annotating test code with textual descriptions. The availability of such annotated datasets can accelerate the application of the latest advances in machine learning to the testing domain. Second, from the perspective of research on applying statistical models to source code, our datasets may provide better insight into the desired characteristics of text and code sequences in a training corpus. Understanding what type of annotations works well or what is the optimal translation unit for the source code may be valuable for researchers concerned with synthesizing text-code datasets.
The remainder of this article is organized as follows. Section 2 provides an overview of existing solutions for text-code corpora acquisition. In Section 3 we provide the rationale for our approach and explain how it works. Section 4 describes in detail the procedure for synthesizing the training corpora used in our experiments. Section 5 presents the experimental setup and the results of training a neural MT model on a text-code dataset generated using our method. In Section 6 we discuss the results, and Section 7 concludes the paper.

3. Text-Code Corpora Acquisition from Self-Documenting Code

In Section 2 we outlined two main approaches to the large-scale acquisition of annotated source code: collecting developer-defined descriptions extracted from API documentation, and collecting user-defined descriptions extracted from users’ questions posted on Q&A websites and matched with code snippets posted as answers. Neither of these approaches is applicable to the testing domain. Javadoc or docstrings exist only for code that is part of a public API, and this type of documentation is not available for test automation code. Data collected from Q&A websites contains code snippets that can help in solving concrete programming issues, but not in writing specific test cases. Only a small fraction of questions on Stack Overflow are related to testing, and dedicated websites on software quality are far behind in terms of popularity (for example, the Software Quality Assurance & Testing website (https://sqa.stackexchange.com/) stores 8500 questions, as compared to 17,000,000 at Stack Overflow). Furthermore, the performance of machine learning models trained on the existing large-scale text-code datasets is low.
In our approach we take advantage of a programming routine known as self-documenting code. Although research on applying statistical language models to source code is relatively new, software developers have long been aware that source code is written for two recipients: one is the machine, and the other is a software developer who will be reviewing, extending, or maintaining the code. The need to secure the interests of the second recipient has been embodied in the Clean Code paradigm [18]—a well-known set of best practices focused on making programming code readable and easy to understand for humans. One of the key Clean Code principles is to create variable identifiers and function names that are meaningful and reveal the programmer’s intent. To that end, the developer uses multiple words to formulate an adequate description of a function, and then squeezes all the words into the function name, using some convention that helps the reader to recognize individual words (such as camelCase in Java, or snake_case in Python). Code comments are discouraged, as they carry a high risk of becoming outdated and are often redundant; instead, it is recommended that the code should be self-explanatory.
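As a minimal illustration of how such self-documenting identifiers can be mapped back to word sequences, the following Python sketch splits camelCase and snake_case names with a simple regular expression; the example identifiers are invented, and edge cases (acronyms, digits) are deliberately ignored.

```python
import re

def split_identifier(name: str) -> str:
    """Turn a self-documenting identifier into a space-separated,
    lower-cased word sequence (handles camelCase and snake_case)."""
    # Insert a space before every upper-case letter, then treat
    # underscores as word separators.
    spaced = re.sub(r"([A-Z])", r" \1", name).replace("_", " ")
    return " ".join(word.lower() for word in spaced.split())

# Invented example identifiers:
print(split_identifier("shouldReturnEmptyListWhenNoMatchesFound"))
# -> should return empty list when no matches found
print(split_identifier("test_complete_triggers_subscription"))
# -> test complete triggers subscription
```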
The method we propose is based on the observation that self-documentation through descriptive method names is widely adopted in test automation, in particular in unit testing. Figure 2 shows real-life examples drawn from the spring-framework, an open source Java project stored on GitHub (https://github.com/spring-projects/spring-framework). Each of these self-documenting test function names is a concise summary of a specific testing task, written in a quasi-natural language and observing a consistent naming convention. The body of each function is the implementation of that task in a programming language. Thus, a parallel text-code corpus can be formed from function names split into individual words and aligned with function bodies.
Figure 2. Examples of long, descriptive function names from the spring-framework project.

4. Methodology

The text-code dataset creation method we propose exploits self-documenting code and can, in principle, be applied to test cases written in any high-level programming language, provided that the code has been written with consideration for readability. The generic language-independent procedure can be summarized as follows:
(1) Collect automated test cases from a source-code repository. Depending on the use case, datasets can be assembled from data within a single repository (for training custom, project- or organization-specific models) or from multiple repositories (for training generic models).
(2) For each collected test case:
  (a) Split the function identifier into individual words, according to the adopted naming convention.
  (b) Tokenize the function body, preserving punctuation marks.
  (c) Add the split function name to the corpus as a source sequence, and add the tokenized function body to the corpus as the corresponding target sequence.
On a lower level, however, preprocessing details become language-specific. All the examples presented in this article are based on code written in Java—the second most popular programming language on GitHub (https://blog.github.com/2018-11-15-state-of-the-octoverse-top-programming-languages/). Furthermore, all the test cases we used for dataset synthesis were created with JUnit (https://junit.org/)—an open source unit testing framework for Java.
In the experiments we use three datasets synthesized from open source code stored on GitHub. Two of them (labeled sf and rx) were each extracted from a single large Java repository, and the third one (multi) was assembled from test code pulled from over 700 repositories. In the following paragraphs we first describe how we acquired data from a single repository, and then describe the procedure used to assemble the multi-repository corpus.

4.1. Processing Data from a Single Repository

To create the project-specific datasets, sf and rx, we chose two actively maintained GitHub repositories: the spring-framework and RxJava (https://github.com/ReactiveX/RxJava). This choice was guided by two criteria: the size of the repository and the popularity of the repository, as indicated by the number of stars assigned by users and the number of created forks (repository copies created by members of the GitHub community). We assume that the popularity of a repository is correlated with the quality of code. Table 1 summarizes the properties of the two selected repositories.
Table 1. Properties of the repositories used for creating project-specific datasets, as of January 2019.
We crawl each repository to retrieve all test class files, identified as those containing the @Test annotation string. We process every test class using the JavaParser library (http://javaparser.org/) to extract and parse the test functions. Each test function is an automated test case. A parsed test function is represented as a JSON object whose properties comprise function name, function body, parent class name, and some metadata for identification. Any inline comments are separated out from the function body and rejected. The result of this step is a JSON array containing all parsed test functions from the repository.
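The exact JSON schema is not reproduced here, but a parsed test function might look roughly like the following (a hypothetical example; the field names and metadata are ours and may differ from those used in the actual pipeline):

```python
# Hypothetical shape of one parsed test function (field names are ours;
# the actual JSON schema in the pipeline may differ).
parsed_test_function = {
    "repository": "ReactiveX/RxJava",
    "file_path": "src/test/java/io/reactivex/SomeTest.java",   # made-up path
    "class_name": "FlowableDelaySubscriptionOtherTest",        # parent class
    "function_name": "testCompleteTriggersSubscription",
    "function_body": "{ Flowable.just(1).test().assertComplete(); }",
}
```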
The next step is the actual synthesis of the corpus. From each JSON file, we take the class name and the function name, which are both camelCase strings, convert them into space-separated strings, and prepend them with the special tokens #class and #method, respectively. The quasi-natural language description is created by concatenating the two resulting strings. The programming language sequence is created by simply tokenizing the function body. We assume that tokens such as parentheses, punctuation marks, or mathematical symbols should be treated as individual words, and parse the code by surrounding each such character with whitespace.
Figure 3 shows schematically how the quasi-natural language sentence and the corresponding programming language sequence are derived from a single unit test case.
Figure 3. Extracting a text-code sequence pair from a test case.
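A minimal Python sketch of this transformation, reflecting our reading of the procedure (the exact token handling in the original pipeline may differ), is shown below.

```python
import re

def camel_to_words(identifier: str) -> str:
    """Convert a camelCase identifier into space-separated, lower-case words."""
    return " ".join(w.lower() for w in re.sub(r"([A-Z])", r" \1", identifier).split())

def make_description(class_name: str, function_name: str) -> str:
    """Build the quasi-natural language source sequence with the special tokens."""
    return (f"#class {camel_to_words(class_name)} "
            f"#method {camel_to_words(function_name)}")

def tokenize_code(body: str) -> str:
    """Surround every non-alphanumeric, non-space character with spaces so that
    punctuation and operators become individual tokens, then collapse whitespace."""
    return " ".join(re.sub(r"([^\w\s])", r" \1 ", body).split())

# Example modeled on the test case discussed in Section 6.1:
print(make_description("FlowableDelaySubscriptionOtherTest",
                       "testCompleteTriggersSubscription"))
# -> #class flowable delay subscription other test #method test complete triggers subscription
print(tokenize_code("assertEquals(1, counter.get());"))
# -> assertEquals ( 1 , counter . get ( ) ) ;
```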
When adding sequence pairs to the corpus, we apply two filters (a code sketch of both filters follows this list):
  • Duplicates filter. Any repeating sequence pairs are skipped, although we allow repeating code sequences, as long as they have different natural language descriptions.
  • Length filter. Any sequence pair with a programming language sequence longer than a predefined length (see Table 2) is rejected. This filter has a dual purpose. First, it removes code sequences whose size drastically deviates from the average. For example, before applying the length filter, sequences of up to 300 tokens comprised over 97% of the test cases extracted from the RxJava repository; the sizes of the remaining 3% ranged from 300 to more than 1000 tokens. Second, it mitigates the disproportion between the sizes of the quasi-natural language sequences and the programming language sequences.
    Table 2. Properties of the datasets used in the experiments described in Section 5. NL refers to natural language and PL to programming language.
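A minimal sketch of the two filters, assuming each sequence pair is represented as a (description, code) tuple of space-separated token strings; the default length threshold is illustrative, and the actual limits are those reported in Table 2.

```python
def apply_filters(pairs, max_code_tokens=300):
    """pairs: iterable of (description, code) strings of space-separated tokens.
    Drops exact duplicate pairs and pairs whose code side exceeds the length
    limit; repeated code with different descriptions is kept."""
    seen = set()
    for description, code in pairs:
        if (description, code) in seen:
            continue                          # duplicates filter
        if len(code.split()) > max_code_tokens:
            continue                          # length filter
        seen.add((description, code))
        yield description, code
```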

4.2. Collecting Data from Multiple Repositories

The data used to create the multi-repository corpus was collected using the github3 library (https://pypi.org/project/github3.py/)—a Python wrapper for the GitHub API. The search query included two parameters: language:java and stars:>1000. We were not concerned with the size of individual repositories. The query returned almost 2000 repositories, over 700 of which contained JUnit test cases that we used for dataset creation.
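As a sketch of this collection step: the paper used the github3.py wrapper, whereas the example below calls the underlying GitHub REST search endpoint directly, so it approximates rather than reproduces the pipeline. Note that a single search query returns at most 1000 results, so retrieving the full set of nearly 2000 repositories may require splitting the query, for example by star ranges.

```python
import requests

def search_java_repositories(min_stars=1000, pages=10):
    """Return full names of Java repositories with more than min_stars stars,
    using the GitHub search API. Unauthenticated requests are heavily
    rate-limited, and the search API caps results at 1000 items per query."""
    names = []
    for page in range(1, pages + 1):
        resp = requests.get(
            "https://api.github.com/search/repositories",
            params={"q": f"language:java stars:>{min_stars}",
                    "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        names.extend(item["full_name"] for item in resp.json()["items"])
    return names
```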
The repositories were processed one by one according to the procedure described in Section 4.1. In addition to applying the duplicate and length filters to each individual repository, we also checked for and removed any duplicates across the repositories. We also applied some simple heuristics to exclude functions with meaningless names, such as test1(), test2() etc.
Table 2 lists the metadata for the three datasets used in the experiments. The source-code vocabulary sizes are very large. This is not surprising, given that each programming API brings in new source-code tokens. For the rx and sf corpora we kept the full PL vocabularies. For the multi-repository dataset, we limited the PL vocabulary size by discarding all tokens that occurred fewer than n times in the corpus. In the dataset, words excluded from the vocabulary were mapped to a special unknown token. We created two versions of the multi-repository dataset, one with n = 5 and the other with n = 10.
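A minimal sketch of this vocabulary truncation; the spelling of the unknown token is an assumption on our part.

```python
from collections import Counter

UNK = "<unk>"  # assumed spelling of the unknown token

def truncate_vocabulary(code_sequences, min_count):
    """Keep only tokens occurring at least min_count times in the corpus and
    map every other token to the unknown token."""
    counts = Counter(tok for seq in code_sequences for tok in seq.split())
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    truncated = [" ".join(tok if tok in vocab else UNK for tok in seq.split())
                 for seq in code_sequences]
    return truncated, vocab

# min_count = 5 and min_count = 10 correspond to the multi-min5 and
# multi-min10 dataset variants.
```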

5. Experiments

The purpose of the experiments reported in this section is to investigate whether test code annotated with descriptions extracted from meaningful function names provides good quality data for training statistical language models. To do this, we use the corpora described in Section 4 to train a well-known neural translation model to generate test code from quasi-natural language descriptions.
Neural translation models are an example of end-to-end learning: these models take a source language sentence as input, encode it into a dense vector (inter-lingual) representation, and decode this representation to generate the corresponding sentence in the target language. The model used in our study belongs to the class of sequence-to-sequence models [19], and is a TensorFlow implementation of the attention-based architecture proposed in [7] (https://github.com/tensorflow/nmt/).
We trained three models, one on each of the datasets. In the preliminary phase we performed trial runs to establish a set of reasonable hyperparameters, and then kept them constant throughout the experiments. We used 2-layered LSTMs for both the encoder and the decoder, with the Bahdanau attention mechanism [20]. For the optimizer we used Adam with a starting learning rate of 0.001. Regarding regularization, we set dropout probability to 0.2, as recommended for LSTMs [21].
Each model was evaluated on a test set randomly sampled from the dataset the model was trained on. These test sets were sampled prior to training and were held out from the training process. In the case of the multi-repository corpus, the test set is the concatenation of the test sets sampled from each contributing repository. The results are presented in Table 3.
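This per-repository, held-out sampling can be sketched as follows; the test fraction and random seed are illustrative choices, not values taken from the paper.

```python
import random

def split_train_test(pairs, test_fraction=0.1, seed=13):
    """Randomly hold out a test set before training.
    (test_fraction and seed are illustrative, not from the paper.)"""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_test = int(len(pairs) * test_fraction)
    return pairs[n_test:], pairs[:n_test]          # (train, test)

def split_multi_repository(per_repo_pairs, test_fraction=0.1):
    """Sample a test set from each contributing repository and concatenate them."""
    train, test = [], []
    for pairs in per_repo_pairs:
        tr, te = split_train_test(pairs, test_fraction)
        train.extend(tr)
        test.extend(te)
    return train, test
```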
Table 3. BLEU scores from training on three datasets.
BLEU scores on code generation are not directly comparable with scores on natural language translation, due to the differences in the lengths of the translated strings and the inherent differences between natural and programming language structures. Moreover, it is well known in the MT community that BLEU scores are not necessarily a true reflection of translation quality. These caveats aside, however, the results from our experiments are promising and indicate that our approach of generating test case source code from parsed class and function names is feasible. (The BLEU score obtained on multi-min10 is somewhat higher than the score obtained on multi-min5, which may seem surprising, given that in the context of natural language translation a higher number of unknown words has been shown to have a negative impact on performance [20,22]. However, since the aim of our experiments was to confirm the usefulness of parallel corpora built from self-documenting code, rather than to evaluate a specific machine learning model, we put limited effort into hyperparameter optimization, and it is possible that a more extensive search of the hyperparameter space would yield slightly different results. Investigating the impact of vocabulary size in the context of programming languages would be useful, but remains outside the scope of this paper.) In the following section we present a qualitative, example-based analysis of the performance of our models, with a view to better understanding and contextualizing the results of training an NMT model on the three datasets extracted from self-documenting code.
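For reference, a corpus-level BLEU computation over tokenized code sequences can be approximated as in the sketch below (using NLTK; the implementation used to produce Table 3 is not specified here, so absolute scores may differ slightly).

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def code_bleu(references, hypotheses):
    """references, hypotheses: lists of space-separated token strings,
    one reference and one hypothesis per test example.
    Returns corpus-level BLEU on the 0-100 scale."""
    refs = [[ref.split()] for ref in references]   # a single reference per example
    hyps = [hyp.split() for hyp in hypotheses]
    smooth = SmoothingFunction().method1            # guard against zero n-gram counts
    return 100 * corpus_bleu(refs, hyps, smoothing_function=smooth)
```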

6. Discussion

Acknowledging the limitations of BLEU metrics as a tool for evaluating the quality of generated code, we carefully reviewed the outputs of our models and analyzed them in the context of the examples available in the training corpora. In this section, we discuss the findings from this qualitative analysis of the three datasets.

6.1. Dataset rx

The unusually high performance of the model trained on the rx dataset becomes less surprising after a closer look into the RxJava project test code. RxJava features several base classes exhibiting standardized behaviors that can be—and are—tested by analogous test functions. For instance, a large body of analogous test code exists for the base classes Observable and Flowable. Examples of highly equivalent functions testing these two classes are shown in Figure 4. Each test function is presented in the parsed format, as a text-code pair. The parts forming textual descriptions are marked in italics. The differences between equivalent function versions (a, b) are highlighted in bold.
Figure 4. Examples of functions written to validate analogous behavior of Observable (1a, 2a, 3a) and Flowable (1b, 2b, 3b).
The replacement patterns emphasized in Figure 4 (Observable vs. Flowable, PublishSubject vs. PublishProcessor, TestObserver vs. TestSubscriber, BackpressureStrategy and blockingGet vs. the absence of these patterns) are highly repeatable throughout the corpus, making the learning task much easier. In fact, the model was able to fully generate the unseen long function body shown in 3a, based on the NL description (#class flowable delay subscription other test #method test complete triggers subscription) which was a part of our test set.
After training of the rx model was complete, we reviewed all the code generated from the NL descriptions in the test set. Out of 2014 examples in the test set, 965 (47.9%) of the generated sequences fully matched the unseen ground-truth source code. Clearly, the repeatable patterns occurring in analogous test functions strongly contributed to this result, but the model was also able to exploit other regularities in the training data.
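The "fully matched" figures reported here and in Sections 6.2 and 6.3 amount to exact-match accuracy over token sequences, which can be computed as in the following sketch; the whitespace normalization is our assumption.

```python
def exact_match_rate(references, hypotheses):
    """Fraction of generated sequences identical to the ground truth,
    comparing token sequences after whitespace normalization."""
    matches = sum(ref.split() == hyp.split()
                  for ref, hyp in zip(references, hypotheses))
    return matches / len(references)

# For the rx test set: 965 exact matches out of 2014 examples -> about 0.479.
```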
In the case of short code sequences, it is relatively easy to trace back at least some of the pieces of information used by the model to generate test functions. Table 4 presents the analysis of four correctly predicted test functions. The top text-code pair (bold font) in each section of the table is pulled from the test set. The remaining text-code pairs (plain font) are pulled from the trainset. For the rx dataset, we selected a function (rxe) that—in contrast to the patterns depicted in Figure 4—did not have an analogous counterpart in the trainset. The training examples (rxt) reveal how the model learned the code sequence representing the string timeout in the description.
Table 4. Examples of correctly predicted unseen functions, together with seen instances that provided the necessary information. Each row in the table contains the following data: Lbl (label)—created by concatenating the dataset label (rx, sf, m for multi) with a suffix indicating whether the text-code pair comes from the test set (suffix e) or the trainset (suffix t); Text—the quasi-natural description (input sequence); Code—the test code (output sequence). Text-code pairs coming from the test set (rxe, sfe, me) are additionally highlighted in bold.

6.2. Dataset sf

The phenomenon of analogous functions is not observed in the test code pulled from spring-framework, which probably explains why the NMT model trained on the sf dataset achieved less spectacular performance than the one trained on the rx dataset. Nevertheless, the BLEU score of 58 obtained by the sf model indicates that consistent naming conventions are useful for learning mappings between textual descriptions and code sequences. As we did previously with the rx dataset, we analyzed the examples generated by the model from the descriptions in the test set. Table 4 shows an interesting example of a generated function (sfe), in which the model was able to correctly predict mathematical symbols corresponding to the string ge (greater-than-or-equal-to) based on the seen examples gt, lt, le (greater-than, less-than and less-than-or-equal-to).
The total number of model predictions that fully matched the ground-truth was 224 out of 2759 examples in the test set (8%).

6.3. Dataset Multi

The BLEU score obtained by the model trained on the multi-repository dataset was significantly lower than the score of the sf model, but the percentage of fully matched predictions was actually slightly higher (9.7%, or 2908 out of 30,039). On analyzing these fully matched outputs of the model trained on the multi-repository dataset, we did not find evidence that the model learned to use information across projects originating from different repositories. For example, the correct predictions produced for the two functions from the multi-repository test set shown in Table 4 (labeled me) were based on examples originating from the same project as the generated sequence (i.e., the same parent project). However, it seems that generating a correct code sequence does not require a large parent project. The parent projects of the two examples in Table 4 are google/guava (https://github.com/google/guava) and dropwizard/metrics (https://github.com/dropwizard/metrics), represented in the trainset by 288 and 355 examples, respectively.
Although the model trained on the multi dataset performed worse than the project-specific models, it is worth remembering that vanilla neural translation models trained on parallel corpora extracted from API documentation [12] or Stack Overflow queries [17] achieved BLEU scores of 11 and 14, respectively. (It is not possible to make direct comparisons with results obtained using models dedicated to code processing tasks, i.e., those exploiting structural/syntactic information about the code, building Abstract Syntax Tree representations, etc. Comparisons of the potential of training corpora only make sense for experiments that used a similar ML model. Of the studies discussed in Section 2, four papers report the results of training a pure NMT model; however, the results reported in [11,15] do not refer to the code generation task: the first paper focuses on generating parsed API usage examples, and the second is concerned with code summarization (generating text from code).) The relatively high score of the model trained on data assembled from 700 GitHub repositories seems to confirm that the principle of self-documenting code is popular among test automation developers and can be relied upon as a source of training data.

7. Conclusions and Future Work

In this paper, we have presented a method that exploits the availability of source code in open software repositories to automatically construct an aligned text-code dataset. This method leverages self-documenting code—the software engineering practice of using meaningful class and function names to describe the functionality of program modules. Furthermore, we have demonstrated how datasets constructed in this way can be used to train (MT-inspired) text to source-code generation systems. Moreover, we have shown that it is feasible to use this machine translation-based code generation approach for automatic test case generation.
A key differentiation between our approach and the methods discussed in Section 2 is that the textual descriptions in our parallel corpora are not expressed in true natural language. In NLP, a lack of naturalness is often considered a weakness, but in the context of the software testing domain, the quasi-natural language nature of the text in our generated datasets does not affect usability, because this language has been devised by the software developer community. This form of communication has a lot in common with Controlled Natural Languages [23]: it uses simplified syntax but preserves most of the natural properties of its base language and can be intuitively understood. As elaborated on in Section 6, this compliance by software developers with naming conventions results in consistent, repeatable patterns within the generated datasets, which make the learning task feasible. Furthermore, unlike previous attempts to leverage developer-defined descriptions of code [11,12], our approach can be applied to generating code that is not exposed as a public API and therefore lacks inline documentation. Admittedly, escaping one limitation (the restriction to code sourced from public APIs) comes at the cost of another (the restriction to self-documenting code only).
We have demonstrated the feasibility of our approach within the software testing domain. Specifically, our experiments have been on the generation of unit test cases, which can be described in a single quasi-natural language sentence and which have relatively short code bodies. As a result, one open question that remains is whether our approach can be generalized to other (potentially more complex) code domains. This is a question we will address in future work. However, even with this potential limitation the current results are very worthwhile because the demand for automated tests in the modern software development cycle is very high, and we believe it is important to fill the gap in the availability of training data for developing test automation tools, even if the approach is not universal. The fact that existing neural translation models trained on this type of data can achieve satisfactory performance is evidence of the high quality of the text-code parallel corpora synthesized from class and function names. Indeed, we believe that the performance of these initial systems is at a level that permits the immediate application of our approach in the area of software engineering.
That said, the evaluation of the true value of the generated code requires far more effort. As pointed out in Section 5, a BLEU score is an approximate indicator of the quality of translation of natural languages. We have not identified any empirical studies investigating the applicability of BLEU to source code, and therefore the results reported in this and other papers need to be treated with caution.
A more reliable evaluation would involve retrieving feedback from human users. Our current efforts are focused on developing a Test Recommendation Engine trained on the corpora extracted from self-documenting code. The Engine, which is part of a Horizon 2020 project (https://elastest.eu/), will be released publicly, and we plan to collaborate with several industry test teams to gather their feedback.
We envisage two use cases benefiting from automated test generation. The first one involves training a project-specific model, similar to the ones trained on the rx and sf datasets. The tools built using such a specialized model are likely to produce accurate test cases for the new code written as developers add or modify features in an already mature project. The second use case involves a generic model trained on multiple repositories. In this case, the predictions are likely to be less accurate (and so require some editing) but the model would still be of value for test teams working on new projects with a minimal codebase.

Author Contributions

Conceptualization, M.K. and J.D.K.; methodology, M.K.; software, M.K.; investigation, M.K.; writing—original draft preparation, M.K.; writing—review and editing, J.D.K.; supervision, J.D.K.

Funding

This work was partly funded by the EC project ElasTest, 731535 (H2020-ICT-2016-2017/H2020- ICT-2016-1, Research and Innovation Action). The work was also supported by the ADAPT Centre which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

Acknowledgments

Magdalena Kacmajor would like to give special thanks to Micael Gallego Carrillo from Universidad Rey Juan Carlos in Madrid, who first drew her attention to self-documenting code as a possible source of training data.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNL	Controlled Natural Language
DevOps	Development and Operations
ML	Machine Learning
MT	Machine Translation
NL	Natural Language
NLP	Natural Language Processing
NMT	Neural Machine Translation
PL	Programming Language
SMT	Statistical Machine Translation

References

  1. Virmani, M. Understanding DevOps & bridging the gap from continuous integration to continuous delivery. In Proceedings of the 5th International Conference on the Innovative Computing Technology (INTECH 2015), Galicia, Spain, 20–22 May 2015; pp. 78–82. [Google Scholar] [CrossRef]
  2. Perera, P.; Silva, R.; Perera, I. Improve software quality through practicing DevOps. In Proceedings of the 2017 Seventeenth International Conference on Advances in ICT for Emerging Regions (ICTer), Colombo, Sri Lanka, 6–9 September 2017; pp. 1–6. [Google Scholar] [CrossRef]
  3. Anand, S.; Burke, E.K.; Chen, T.Y.; Clark, J.; Cohen, M.B.; Grieskamp, W.; Harman, M.; Harrold, M.J.; McMinn, P.; Bertolino, A.; et al. An orchestrated survey of methodologies for automated software test case generation. J. Syst. Softw. 2013, 86, 1978–2001. [Google Scholar] [CrossRef]
  4. Shamshiri, S.; Just, R.; Rojas, J.M.; Fraser, G.; McMinn, P.; Arcuri, A. Do automatically generated unit tests find real faults? An empirical study of effectiveness and challenges (t). In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 9–13 November 2015; pp. 201–211. [Google Scholar] [CrossRef]
  5. Hindle, A.; Barr, E.T.; Gabel, M.; Su, Z.; Devanbu, P. On the naturalness of software. Commun. ACM 2016, 59, 122–131. [Google Scholar] [CrossRef]
  6. Allamanis, M.; Barr, E.T.; Devanbu, P.T.; Sutton, C.A. A Survey of Machine Learning for Big Code and Naturalness. ACM Comput. Surv. 2018, 51, 81. [Google Scholar] [CrossRef]
  7. Luong, T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421. [Google Scholar] [CrossRef]
  8. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
  9. Oda, Y.; Fudaba, H.; Neubig, G.; Hata, H.; Sakti, S.; Toda, T.; Nakamura, S. Learning to generate pseudo-code from source code using statistical machine translation (T). In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 9–13 November 2015; pp. 574–584. [Google Scholar] [CrossRef]
  10. Ling, W.; Blunsom, P.; Grefenstette, E.; Hermann, K.M.; Kočiský, T.; Wang, F.; Senior, A. Latent Predictor Networks for Code Generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; pp. 599–609. [Google Scholar] [CrossRef]
  11. Gu, X.; Zhang, H.; Zhang, D.; Kim, S. Deep API Learning. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Seattle, WA, USA, 13–18 November 2016; pp. 631–642. [Google Scholar] [CrossRef]
  12. Miceli Barone, A.V.; Sennrich, R. A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, Taipei, Taiwan, 27 November–1 December 2017; pp. 314–319. [Google Scholar]
  13. Sennrich, R.; Haddow, B.; Birch, A. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; pp. 86–96. [Google Scholar]
  14. Allamanis, M.; Tarlow, D.; Gordon, A.D.; Wei, Y. Bimodal Modelling of Source Code and Natural Language. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2123–2132. [Google Scholar]
  15. Iyer, S.; Konstas, I.; Cheung, A.; Zettlemoyer, L. Summarizing Source Code using a Neural Attention Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; pp. 2073–2083. [Google Scholar] [CrossRef]
  16. Yao, Z.; Weld, D.S.; Chen, W.P.; Sun, H. StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 1693–1703. [Google Scholar] [CrossRef]
  17. Yin, P.; Deng, B.; Chen, E.; Vasilescu, B.; Neubig, G. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the 15th International Conference on Mining Software Repositories, Gothenburg, Sweden, 28–29 May 2018; pp. 476–486. [Google Scholar] [CrossRef]
  18. Martin, R.C. Clean Code: A Handbook of Agile Software Craftsmanship, 1st ed.; Prentice Hall PTR: Upper Saddle River, NJ, USA, 2008. [Google Scholar]
  19. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 3104–3112. [Google Scholar]
  20. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv. 2014. Available online: https://arxiv.org/abs/1409.0473 (accessed on 15 February 2019).
  21. Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent Neural Network Regularization. arXiv. 2014. Available online: https://arxiv.org/abs/1409.2329 (accessed on 15 February 2019).
  22. Cho, K.; van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of the SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014; pp. 103–111. [Google Scholar]
  23. Kuhn, T. A Survey and Classification of Controlled Natural Languages. Comput. Linguist. 2014, 40, 121–170. [Google Scholar] [CrossRef]
