Open Access Article
Information 2019, 10(2), 66; https://doi.org/10.3390/info10020066

Automatic Acquisition of Annotated Training Corpora for Test-Code Generation

1 Innovation Exchange, IBM Ireland, Dublin 4, Ireland
2 ADAPT Centre & ICE Research Institute, Technological University Dublin, Dublin 2, D08 X622, Ireland
* Author to whom correspondence should be addressed.
Received: 21 January 2019 / Revised: 9 February 2019 / Accepted: 13 February 2019 / Published: 17 February 2019
(This article belongs to the Special Issue Artificial Intelligence—Methodology, Systems, and Applications)

Abstract

Open software repositories make large amounts of source code publicly available. Potentially, this source code could be used as training data to develop new, machine learning-based programming tools. For many applications, however, raw code scraped from online repositories does not constitute an adequate training dataset. Building on the recent and rapid improvements in machine translation (MT), one potentially valuable application is code generation from natural language descriptions. One of the bottlenecks in developing these MT-inspired systems is the acquisition of the parallel text-code corpora required for training code-generative models. This paper addresses the problem of automatically synthesizing parallel text-code corpora in the software testing domain. Our approach is based on the observation that self-documentation through descriptive method names is widely adopted in test automation, in particular for unit testing. Therefore, we propose synthesizing parallel corpora composed of parsed test function names, which serve as code descriptions, aligned with the corresponding function bodies. We present the results of applying one state-of-the-art MT method to a dataset generated in this way. Our experiments show that a neural MT model trained on our dataset can generate syntactically correct and semantically relevant short Java functions from quasi-natural language descriptions of functionality.
Keywords: test automation; code generation; neural machine translation; naturalness of software; statistical semantics
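
The corpus construction described in the abstract (parsed test function names aligned with their function bodies) can be illustrated with a minimal sketch. This is not the authors' pipeline, whose details the abstract does not give. The helper names `describe` and `extract_pairs`, the regular expressions, the choice to drop a leading "test" token, and the sample `CalculatorTest` class are all assumptions made for illustration. The brace matching is deliberately naive and ignores braces inside string literals and comments.

```python
import re

# Hypothetical tokenizer: splits camelCase, acronyms, digits and underscores.
TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")

# Hypothetical header pattern for JUnit-style methods annotated with @Test.
TEST_HEADER = re.compile(
    r"@Test\s+(?:public\s+)?void\s+(\w+)\s*\([^)]*\)\s*"
    r"(?:throws\s+[\w.,\s]+?)?\s*\{"
)


def describe(method_name):
    """Turn a descriptive test method name into a lowercase token sequence."""
    tokens = [t.lower() for t in TOKEN.findall(method_name)]
    # Assumption: a leading "test" prefix carries no descriptive content.
    if tokens and tokens[0] == "test":
        tokens = tokens[1:]
    return " ".join(tokens)


def extract_pairs(java_source):
    """Return (description, body) pairs for each @Test method found."""
    pairs = []
    for match in TEST_HEADER.finditer(java_source):
        depth, pos = 1, match.end()
        # Naive brace counting to find the end of the method body.
        while pos < len(java_source) and depth > 0:
            if java_source[pos] == "{":
                depth += 1
            elif java_source[pos] == "}":
                depth -= 1
            pos += 1
        if depth == 0:  # skip methods whose braces never balance
            body = java_source[match.end():pos - 1].strip()
            pairs.append((describe(match.group(1)), body))
    return pairs


# Invented sample input, used only to exercise the sketch.
SAMPLE = """
public class CalculatorTest {
    @Test
    public void testAddReturnsSumOfTwoNumbers() {
        assertEquals(5, new Calculator().add(2, 3));
    }

    @Test
    public void pushThreeItems_sizeIsThree() throws Exception {
        for (int i = 0; i < 3; i++) { stack.push(i); }
        assertEquals(3, stack.size());
    }
}
"""

if __name__ == "__main__":
    for description, body in extract_pairs(SAMPLE):
        print(description, "=>", body)
```

Each resulting pair plays the role of one source-target example for an MT-style model: the description is the input sequence and the function body is the output sequence. A production version would more plausibly use a real Java parser than regular expressions.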
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).
MDPI and ACS Style

Kacmajor, M.; Kelleher, J.D. Automatic Acquisition of Annotated Training Corpora for Test-Code Generation. Information 2019, 10, 66.

Information, EISSN 2078-2489, published by MDPI AG, Basel, Switzerland