This section aims to answer research questions RQ2, RQ3, RQ4, and RQ5. There are three main perspectives that must be considered if we want to investigate, classify, analyze, design, or construct a program synthesis system, namely, users (developers, programmers), programs of interest, and search techniques [101]. Dealing with these perspectives helps researchers and developers to draw a comprehensive view of program synthesis frameworks and systems. In previous works presented in [101,102,135], three dimensions of program synthesis that tackle these perspectives were discussed, namely, user intent, search space, and search technique. In the following subsections, these dimensions are reintroduced as top-level features that reflect critical points in the variation of program synthesis systems in the feature model (Figure 6).

#### 6.1. Developer & User Intent Specifications

Describing intent, or the specifications of the desired program [102], is the first significant dimension, and it relates to users and developers. When applying the adopted data extraction strategy, which is based on the proposed RQs, IC, EC, and Classification Scheme, a total of 50 papers were grouped under the Features and Techniques category with a focus on user intent. Publications belonging to reference groups 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10 are included in Table 3.

According to the domain analysis conducted on these groups of papers to start answering RQ2, there are different ways of expressing the user intent adopted in the various program synthesis approaches, such as input/output examples, formal specifications, logical relations or formulas, demonstrations, test cases, partial programs, (restricted) natural languages, and traces [101,102].

Figure 7 shows the distribution of publications across the various methods for expressing user intent. After reviewing the selected 50 papers and evaluating the publication dates and the methods of expressing user intent, RQ3 was fully answered; the changes in user intent expression trends between 2003 and 2019 are summarized in Figure 8.

Common features of these forms of user intent were derived after reviewing each paper; the mechanisms for describing user intent were named and determined during the second level of domain analysis as syntax-based, semantics-based, symbolic-based, and example-based mechanisms. The following feature diagram (Figure 9) illustrates the mechanisms used to describe the user intent.

Frameworks that fall under the syntax-based subfeature focus on a different format of syntax provided to the synthesis framework as user intent to produce the target code [179]. A mandatory (at least one) notation is used here to indicate that any combination of subfeatures is possible. The following points briefly highlight, compare, and distinguish program synthesis approaches in terms of the subfeatures illustrated in Figure 9.

**Natural Languages:** Synthesis frameworks that fall under this subcategory use natural language descriptions to express the user's intended program. Synthesizers then work to produce program code from the NL description using a learning algorithm such as reinforcement learning or maximum marginal likelihood. The approaches presented in [116,139], and [19] are examples of synthesis frameworks that use an NL description for expressing user intent. Additionally, other frameworks, such as Tellina, adopt Recurrent Neural Networks to translate a program described in natural language into an executable program [15].

**Program Sketches:** According to the frameworks presented in [25], synthesis frameworks allow a user to write an incomplete program (a program with holes or missing details); a synthesizer then derives the low-level implementation details from the sketch by filling all given holes based on previously specified assertions.
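The hole-filling idea can be illustrated with a toy sketch. The multiplicative template and the constant-valued hole below are invented for illustration; the SKETCH tool itself works on C-like programs and delegates hole resolution to a SAT solver.

```python
# Toy sketch-based synthesis: the user supplies a program template with a
# "hole", and the synthesizer searches for a hole value that makes every
# user-supplied assertion pass.

def make_candidate(hole):
    """Instantiate the (invented) sketch 'f(x) = x * HOLE' with a constant."""
    return lambda x: x * hole

def synthesize(assertions, hole_values):
    """Return the first hole value whose instantiation satisfies every assertion."""
    for hole in hole_values:
        f = make_candidate(hole)
        if all(f(x) == expected for x, expected in assertions):
            return hole
    return None

# The user's intent: f(2) == 6 and f(5) == 15, which forces the hole to be 3.
assertions = [(2, 6), (5, 15)]
print(synthesize(assertions, range(10)))  # -> 3
```

A real sketching system encodes the assertions and the template symbolically instead of enumerating hole values one by one.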

**Domain-Specific Languages**: A Domain-Specific Language (DSL) is a restricted subset of a programming language that is designed to be understood and adopted for a particular domain. Similar to the structure of general-purpose programming languages (like C++ and Java), a DSL is a set of typed and annotated symbol definitions that form the DSL terminology [97]. These symbols can be either terminals or non-terminals that are defined using some high-level specification rules (e.g., a context-free grammar). Each rule describes the transformation of a non-terminal into another non-terminal or a terminal token of the language. All possible transformation operators and source symbols (tokens) are typed and located on the right-hand side of the rules. Every symbol in the grammar is annotated with a corresponding output. The PROgram Synthesis using Examples (PROSE) approach [31] is an example of a framework that falls under the deductive synthesis paradigm (explained previously in Section 5), where the synthesis problem is solved using transformation rules and version space algebra, as in FlashExtract [44] and FlashMeta [32]. Additionally, the solver-aided DSL Rosette, which is based on a theorem proving technique, is designed in another DSL-based approach for solving the synthesis problem [33].

In addition, some synthesis frameworks focus on learning their synthesizers using different forms of examples given as user intent, instead of a syntactic representation of the desired code. These frameworks are classified under the example-based category (Figure 9). The types of examples can be one or a combination of I/O examples and counterexamples, or even traces. Programming by examples, as done in [16,39,41,44,47], is a common approach where a user expresses a desired code behavior using a set of I/O example pairs, and the synthesis tool constructs an executable implementation from these examples.
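A minimal flavor of this enumerative programming-by-examples loop can be sketched as follows. The four string operations form a hypothetical DSL invented for this illustration, not the DSL of any cited tool.

```python
# Toy programming-by-examples: enumerate compositions of simple string
# operations until one sequence is consistent with every I/O example pair.
from itertools import product

# Tiny, invented DSL of unary string transformations.
OPS = {
    "upper":   str.upper,
    "lower":   str.lower,
    "strip":   str.strip,
    "reverse": lambda s: s[::-1],
}

def run(seq, s):
    """Apply a sequence of DSL operations to a string."""
    for name in seq:
        s = OPS[name](s)
    return s

def synthesize(examples, max_depth=3):
    """Enumerate operation sequences of increasing length and return the
    first one consistent with every input/output example pair."""
    for depth in range(1, max_depth + 1):
        for seq in product(OPS, repeat=depth):
            if all(run(seq, i) == o for i, o in examples):
                return seq
    return None

examples = [("  abc ", "CBA"), (" xy", "YX")]
print(synthesize(examples))  # a sequence combining strip, upper, and reverse
```

Practical PBE systems such as the version-space-algebra tools cited above replace this blind enumeration with compact representations of all consistent programs.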

**Input/Output (I/O) Example:** Frameworks under this subcategory adopt the use of I/O examples as an alternative strategy for expressing the user intent for a desired program. This kind of synthesis approach provides an interactive interface between the user and the synthesizer that allows the user to provide input/output example pairs until the desired program is reached, such as the approaches provided in [41,44,47].

**Counterexample:** Synthesizers in frameworks under this subfeature adopt the so-called "counterexample-guided" inductive synthesis strategy to produce possible candidate implementations from concrete examples of program behavior, whether this behavior is correct or not [27]. A verifier, acting as an Oracle in some approaches like [94,95], performs a validation process on the candidate implementation code and generates counterexamples from its context to be used in the following iteration as input fed to the synthesizer. The counterexamples in this mechanism are used, iteratively, instead of new knowledge-free I/O examples generated for each solving iteration [27].
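The counterexample-guided loop can be sketched in a few lines. The linear candidate template f(x) = x + c and the toy specification below are invented for illustration; real CEGIS systems use symbolic solvers for both roles.

```python
# Toy counterexample-guided inductive synthesis (CEGIS) loop.
# Illustrative goal: find a constant c such that f(x) = x + c satisfies
# the spec "f(x) > x for all x in 0..9, and f(0) == 1".

def spec(f, x):
    return f(x) > x and (x != 0 or f(0) == 1)

def synthesize(examples):
    """Propose the first candidate constant consistent with the examples seen so far."""
    for c in range(-5, 6):
        f = lambda x, c=c: x + c
        if all(spec(f, x) for x in examples):
            return f, c
    return None, None

def verify(f):
    """Exhaustively check the spec; return a counterexample or None."""
    for x in range(10):
        if not spec(f, x):
            return x
    return None

examples = []             # start with no behavioral knowledge
while True:
    f, c = synthesize(examples)
    cex = verify(f)
    if cex is None:
        break
    examples.append(cex)  # feed the counterexample back to the synthesizer

print(c)  # the spec forces c == 1
```

Note how the first, example-free iteration proposes a wrong candidate, and a single counterexample (x = 0) is enough to steer the next proposal to the correct constant.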

Moreover, frameworks that fall under the symbolic-based (computational) approach treat program synthesis as a computational problem. Constraints, logic formulas, finite-state machines, and context-free grammars are examples of symbolic notations that might be used to solve various computational problems in computer science. They can be adopted in a symbolic-guided synthesis framework as a representation of the synthesis problem to obtain the target code.

**Logic Formulas:** The use of logical formulas is considered to be one of the classic methods for expressing high-level specifications of programs. There are two kinds of specifications considered for describing programs: semantic specifications and syntactic specifications. Frameworks that follow these subcategories use logic to describe semantic specifications, whereas they use grammar (e.g., a context-free grammar) to describe constraints of syntactic specifications. Together, grammar and syntactic constraints provide a comprehensive template for the desired program. Using a template benefits the synthesizer by reducing the program search space [50,85].

**Constraints:** Approaches that fall under this subfeature use a formal language, such as a context-free (attribute) grammar, for describing rich, structured constraints over desired programs. This kind of synthesis approach tackles the difficult problem of learning, from the provided data, a rich set of constraints that the synthesized program must satisfy. The work presented in [54] is an example of this kind of synthesis framework.

**Finite Automata (FA):** Frameworks under this subcategory allow a user to describe the desired program partially using finite-state machines (FSMs) or, in some approaches, Extended FSMs with execution specifications and invariants to construct an FSM skeleton of the program. The synthesizer then completes the FSM skeleton from the supplied specifications and invariants using an inference technique [55]. In the TRANSIT tool [55], for instance, a computational-guided synthesis approach is adopted for reactive systems, where each process is expressed as an Extended FSM. The description of a process consists of a collection of internal state variables, control states, and the transitions between them. The synthesis approach works by specifying these transitions through a set of guard conditions and an update code, inferring expressions from symbolic forms of functions, variables, and examples (concolic snippets) to achieve a consistent system behavior.

**Grammar:** In addition, there are other frameworks that may involve program synthesis activities, such as program analysis and debugging. They use semantic information about a program, such as a bug report or memory address, instead of the actual program syntax or a formal representation of it (e.g., a grammar). These frameworks are categorized under the semantics-based subcategory, demonstrated in Figure 9. It is commonly known that execution traces of programs contain rich semantic information about the code. Thus, the use of execution traces has become widely accepted in the domains of program analysis and synthesis, and it has brought remarkable results [57]. According to [57], many learning processes of (learning-based) synthesizers [58,59,152] are improved when using execution traces generated from I/O graphical image examples. Based on this idea, the approach presented in [57] uses execution traces that contain no control flow constructs as the specification of a desired program, along with I/O examples, to train the proposed (neural) program synthesis model. As a result, the accuracy improved to 81.3% from the 77.12% of their prior work. Frameworks that fall under the semantics-based subfeature, shown in Figure 9, use execution traces that contain semantic information about I/O values rather than using the I/O values themselves.

**Traces:** Frameworks with this subfeature provide a set of execution traces for learning synthesizers instead of a collection of I/O examples or logic rules [62]. This is because execution traces have been widely used for program analysis [63,64], where the traces are given as input to identify detailed (technical) characteristics of a program. Trace information may contain significant detail about the program, including dependencies, control flows (paths), values, memory addresses, and the inter-relationships between them. Reverse engineering techniques and tools are used to analyze traces and understand all possible scenarios and dynamic behaviors related to the code [64].

#### 6.2. Search Space of the Program

The search space of a program is considered to be the domain of programs over which the desired program will be searched. Expressiveness and efficiency are two significant characteristics that must be considered when designing the search space. On one hand, the expressiveness of the space should be adequate to describe all programs that users require. On the other hand, the space should be restricted enough to allow an efficient search [148]. Reaching this balance between expressiveness and efficiency allows developers to create a good code synthesizer.

When applying the adopted data extraction strategy, it was found that a total of six papers were grouped under the program search space category. This is illustrated as group 16 (Table 3). According to the analysis conducted on these groups to continue answering RQ4, there is a variety of ways in which the search space can be expressed, for example, as a subset of an existing programming language, a domain-specific language, a context-free grammar, a deterministic/non-deterministic FA, or logics [84,148].

During the second level of systematic domain analysis on the search space, it was found that templates are widely used across almost all kinds of program synthesis approaches [85,87,88,89]. Templates are a common technique that enables developers (users) to provide high-level insights about target programs to a synthesis framework, using a generic programming or meta-programming feature available in some programming languages, such as C++, to create a template of a desired program. Template-based synthesis approaches can reduce the search problem and optimize solving performance. The details of possible types of solvers are covered later in other sections, where the synthesis frameworks are categorized based on the different adopted search strategies [88]. The creation of templates using programming languages, formal specification languages (e.g., Z, Petri Nets, or Abstract Syntax Trees (ASTs)), or logic is considered a critical and difficult task, as the solver needs to translate the template back into an appropriate form for performing formal reasoning, such as logics or grammar [88], and then produce the complete target code.

After reviewing the selected six papers, it was found that the search space of a program can be expressed using four alternatives, namely, programming languages, logic, grammar, and domain-specific languages. From that, RQ4 was completely answered. At the completion of this level of the domain analysis on the search space, the results were documented using a feature diagram (Figure 10). Some language combinations may be composed to form the final search space template; that is why the mandatory (at least one) notation is used in Figure 10.

#### 6.3. Search Strategy

As mentioned earlier, the program synthesis problem is defined as the problem of finding an executable program that satisfies some high-level specifications and constraints. The process of searching over a program space to solve this problem is considered one of the three critical dimensions of any program synthesis approach. Various search techniques and algorithms might be adopted when designing code synthesizers, depending on whether the user intent specification is expressed via examples, partial program code, example pairs, or formal specifications [84,148].

This section answers both RQ5 and RQ6. To answer RQ5 first, the data extraction strategy was applied, in which all publications belonging to reference groups 11, 12, 13, 14, and 15 were included (Table 3). The total number of publications considered under the search strategy category at this step was 21. These papers were evaluated based on the methods adopted for dealing with the synthesis problem and its variations over the period between 2005 and the middle of 2019. It was found that the program synthesis problem is tackled and treated from different perspectives as five kinds of computational problem, namely, the verification problem, the constraint satisfaction (solving) problem, the machine approximation problem, the combinatorial optimization problem, and the learning (statistical) problem (Figure 11).

The alternative search techniques used for solving the synthesis problem are documented in the feature diagram demonstrated in Figure 12. Additionally, in order to highlight the changes in this issue, Figure 13 summarizes the changes in handling the synthesis problem between 2005 and the middle of 2019. A remarkable increase in the adoption of machine learning (ML) and its related techniques, as well as optimization techniques, as search techniques for solving synthesis problems can be observed. Secondly, to answer RQ6, the data extraction strategy was applied once more, in which all publications belonging to reference groups 11, 12, 13, 14, 15, 18, 19, 20, 21, and 22 were included (Table 3).

The total number of publications considered under the search strategy category at this step was 95, after eliminating some publications that did not mainly cover search techniques. These papers were evaluated based on the adopted technique used for solving the synthesis problem.

Figure 14 demonstrates the distribution of publications across the solving techniques used for searching the program space to solve the synthesis problem. It is worth mentioning that the findings represented in Figure 14 were also used to answer RQ8, as described in the following section (Section 7). The following subsections compare and distinguish program synthesis approaches based on the features of the search technique, as demonstrated in the above top-level feature diagram (Figure 12).

This caused the emergence of two widely adopted resolution methods of combinatorial optimization, namely, Stochastic Optimization and Deterministic Optimization (Figure 15). In some approaches, both techniques may be used together. This is illustrated in the FD as a mandatory (at least one) subfeature.

Stochastic Optimization: The Stochastic Optimization method involves solving combinatorial optimization problems that involve uncertainties, whereas the deterministic method focuses on finding solutions for combinatorial optimization problems by evaluating a finite set of discrete variables. For each method, several efficient algorithms have been designed and successful search techniques have been adopted for solving many real-world problems, including program synthesis (demonstrated in the detailed feature diagram in Figure 16). According to the domain analysis, the Evolutionary Algorithm (Genetic Programming), the Dynamic Programming Algorithm (e.g., the Viterbi algorithm), and Simulated Annealing are stochastic algorithms that are used in several program synthesis frameworks as techniques for deriving the target code constructs from high-level specifications [68,155].

- ○ Dynamic Programming: Dynamic Programming is an optimization technique that simplifies complex problems by breaking them down into many overlapping subproblems. Solutions to these simpler subproblems are combined to provide an optimal solution to the complicated problem. In a nested problem structure, a relation between the value of the larger problem and the values of the subproblems is specified, and each computed value of a subproblem's solution is used recursively to find the overall optimal solution to the problem [70].

It is worth mentioning that the synthesis problem must be described in such a way that its solution can be constructed from solutions to overlapping subproblems [71]. Many implementation algorithms that improve the overall performance of the optimization process are based on dynamic programming, such as the divide-and-conquer [70] and linear-time dynamic programming algorithms presented in [71].
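As a concrete instance of combining solutions to overlapping subproblems, the classic edit-distance computation fills a table bottom-up, with each cell derived from three previously computed cells:

```python
# Minimal dynamic-programming example: Levenshtein edit distance.
# dp[i][j] holds the optimal solution to the subproblem (a[:i], b[:j]),
# and larger subproblems are composed from smaller, overlapping ones.

def edit_distance(a, b):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i            # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j            # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # -> 3
```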

- ○ Simulated Annealing: The principles of simulated annealing were inspired by the physical properties of annealing in solids. In physics, defects in solids are removed by first heating the solids up to a high temperature and then transforming them into crystalline materials by a slow cooling process. At the highest temperature, the material is considered to be at the highest (max) energy state, whereas the minimum energy state is the frozen state [72].

Simulated annealing was introduced into the domain of computer science as a probabilistic strategy for solving combinatorial optimization problems with a large search space. For example, simulated annealing is used in the Real-Time Software System Generator (RT-Syn) framework [72] to minimize the resource costs of software applications, including design and maintenance, by synthesizing the implementation details of the design. When considering program synthesis as an optimization problem, some crucial implementation decisions must be involved during problem resolution, such as data structures, control flows, and algorithms. In simulated annealing, the program space is treated as a configuration space that encompasses all legal decisions.

Iteratively, a random perturbation (from the move set) of the current feasible design is proposed; in the limit, this move set must be able to reach all feasible designs in the design space. In each iteration, a cost function is used to measure the goodness of the current design in order to find the best design that can be reached. The last characteristic of the simulation is the cooling schedule, which mimics the cooling process of materials in physics: while the energy state (temperature) is high, moves that worsen the cost function may still be accepted, and the temperature is then decreased gradually. A slow cooling schedule tends to yield a near-optimal solution, whereas a quick decrease results in a suboptimal one [72].
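A minimal simulated-annealing sketch over a one-dimensional integer cost function follows. The cost function, move set, and cooling constants are illustrative only, not those of RT-Syn.

```python
# Toy simulated annealing: minimize a cost function over integers using
# random moves, a temperature-dependent acceptance probability, and a
# geometric cooling schedule.
import math
import random

def cost(x):
    return (x - 7) ** 2                          # global minimum at x == 7

def anneal(start, t0=10.0, cooling=0.95, steps=2000, seed=0):
    rng = random.Random(seed)
    x, t = start, t0
    for _ in range(steps):
        move = x + rng.choice([-1, 1])           # the "move set"
        delta = cost(move) - cost(x)
        # Always accept improvements; accept worse moves with prob e^(-delta/t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x = move
        t *= cooling                             # cooling schedule
    return x

print(anneal(start=50))  # converges to the minimum at 7
```

At high temperature, uphill moves keep the search from freezing in a poor region; as the temperature drops, the process degenerates into greedy descent toward the minimum.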

- ○ Evolutionary Algorithm: An evolutionary algorithm is a kind of generic population-based optimization that is inspired by biological evolution mechanisms, such as reproduction, mutation, recombination, and selection. In biology, changes in characteristics, or evolution, occur when evolutionary mechanisms and genetic recombination react to these changes, resulting in different characteristics becoming more common or hidden in the population in the following generations [96,169]. In the domain of computer science, an algorithm for solving an optimization problem is applied to a population of individuals, where fitness functions are used iteratively over the population to evolve the quality of the final solution [73].

- ○ Genetic Programming (GP): GP is a kind of evolutionary algorithm that uses genetic operations, namely, mutation, crossover, and selection, to evolve its populations iteratively until the best solutions to a given optimization problem are achieved [154]. It performs better than exhaustive search when searching a problem (program) space that is too broad, because the search over the space is guided by the measures produced by the fitness function [96,169]. GP is considered one of the common techniques applied in the domains of program synthesis and automatic program repair [73,96,154,155,169].
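A compact sketch of the idea follows, using mutation and fitness-based selection only (crossover is omitted for brevity) over an invented arithmetic expression language; the target behavior x*x + 1 and all constants are illustrative.

```python
# Toy genetic programming: evolve arithmetic expression trees over
# +, *, x, and small constants, guided by an error-based fitness function.
import random

rng = random.Random(42)
TARGET = [(x, x * x + 1) for x in range(-3, 4)]   # intended behavior: x*x + 1

def random_expr(depth=3):
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(["x", 0, 1, 2])         # terminals
    return (rng.choice(["+", "*"]), random_expr(depth - 1), random_expr(depth - 1))

def evaluate(e, x):
    if e == "x":
        return x
    if isinstance(e, int):
        return e
    op, a, b = e
    va, vb = evaluate(a, x), evaluate(b, x)
    return va + vb if op == "+" else va * vb

def fitness(e):
    return sum(abs(evaluate(e, x) - y) for x, y in TARGET)   # lower is better

def mutate(e):
    if rng.random() < 0.3:
        return random_expr(2)                     # replace a subtree
    if isinstance(e, tuple):
        op, a, b = e
        return (op, mutate(a), mutate(b))
    return e

def evolve(pop_size=60, generations=60):
    pop = [random_expr() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        if fitness(pop[0]) == 0:
            break
        survivors = pop[: pop_size // 3]          # selection
        pop = survivors + [mutate(rng.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```

The fitness measure, rather than exhaustive enumeration, steers the population toward expressions equivalent to x*x + 1.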

Deterministic Optimization: On the other hand, the simpler alternative to the stochastic method is the Deterministic Optimization method, which is used in some synthesis frameworks as the implemented search technique (Figure 16). Exhaustive Enumeration is considered to be a very general search-based problem-solving technique in which all possible alternatives are examined during the problem resolution process in order to find the optimal solution [68]. The brute-force algorithm is a common technique of exhaustive enumeration optimization, as noted during the conducted domain analysis. In this optimization, three kinds of input are given: a formal representation of candidate expressions E, a logical specification and constraints S, and a finite set of examples X. The targeted problem to be solved must satisfy the following (First-Order Predicate Logic with equality) formal rule:
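The rule itself appears to have been lost during extraction. Under the usual formulation of enumerative synthesis over these three inputs, the intended rule is presumably of the following existential-universal form (a reconstruction, not necessarily the authors' exact notation):

```latex
\exists e \in E \;.\; \forall x \in X \;.\; S(e, x)
```

That is, there exists a candidate expression e in E such that the specification S holds on every example x in X; the brute-force algorithm checks the candidates of E one by one against this condition.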

However, the rapid (exponential) growth of the search space, which occurs due to the program's size or other reasons, is a crucial problem that deterministic-based synthesizers may face, even when a powerful optimization technique is adopted. To mitigate this problem, the synthesis framework can be improved to guide the search, using a weighted directed graph and a decision tree in the approaches mentioned in [55] and [69], respectively. A probabilistic model is used as guidance for the search-based synthesizer.

The model takes a set of program tokens, including terminal and non-terminal ones, and produces a probability for each production rule. A weighted directed graph, with a sentential form for each node and a calculated weight for each edge, is then derived from the model. The enumeration search based on this improved structure reduces the search effort by considering the shortest path from the source node via graph search algorithms such as Dijkstra's algorithm [69].
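This guided enumeration can be sketched as a shortest-path search over sentential forms, where each production rule is weighted by the negative log of its model-assigned probability. The grammar and the probabilities below are invented for illustration.

```python
# Probability-guided enumeration: Dijkstra's algorithm expands the cheapest
# (i.e., most probable) sentential forms first.
import heapq
import math

# Hypothetical grammar: nonterminal -> (replacement, probability).
RULES = {
    "E": [("E+E", 0.3), ("x", 0.5), ("1", 0.2)],
}

def expand(sentential):
    """Yield (edge weight, successor) for expansions of the leftmost nonterminal."""
    for i, sym in enumerate(sentential):
        if sym in RULES:
            for replacement, p in RULES[sym]:
                succ = sentential[:i] + replacement + sentential[i + 1:]
                yield -math.log(p), succ
            return

def best_first(start, goal, limit=10000):
    """Dijkstra-style search from the start symbol to a target program string."""
    frontier = [(0.0, start)]
    seen = set()
    while frontier and limit:
        limit -= 1
        cost, s = heapq.heappop(frontier)
        if s == goal:
            return cost
        if s in seen:
            continue
        seen.add(s)
        for w, succ in expand(s):
            if len(succ) <= len(goal):        # prune forms longer than the goal
                heapq.heappush(frontier, (cost + w, succ))
    return None

print(best_first("E", "x+1"))  # total weight of the cheapest derivation
```

High-probability derivations have low total weight, so Dijkstra's algorithm reaches likely programs before unlikely ones.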

Constraint Solving: The theory behind constraint-solving program synthesis begins by expressing the semantics of a given program in logic formulas. Instead of compiling the program into low-level executable machine code, it is compiled into logical constraints (formulas) as an intermediate representation of the given program. A solver-based strategy is then applied, via solver-aided verification or synthesis tools, to solve the constraint satisfaction problem by proving the correctness of the given program. The solver tries to find an input that makes the program fail (if one exists) when such a constraint is unsatisfied in an automatically generated test. Here, the program synthesis problem is treated as a Constraint Satisfaction Problem (CSP). The CSP can be defined as a collection of mathematical questions posed as objects that must satisfy some constraints. Intensive research has been conducted in the artificial intelligence (AI) and operational research domains on solving the CSP.

The feature diagram shown in Figure 17 classifies program synthesis frameworks with respect to those approaches that solve the synthesis problem as a CSP using theorem provers (logical reasoning techniques). The solving approach can be achieved by adopting either a Boolean Satisfiability Problem (SAT) solver, a Satisfiability Modulo Theories (SMT) solver, or a combination of both. A common solving strategy based on logical reasoning first reduces the second-order search problem to (first-order) constraint solving. A solver (SAT or SMT) is then used to solve the constraint problem. The solving-based tool can be integrated within some program synthesis approaches, like syntax-based synthesis, as discussed in [78]. This is expressed in the following FD by the mandatory (at least one) notation.

Solvers can be implemented using two strategies, namely, solver-aided programming and an algorithmic-based approach. In the algorithmic-based approach, the written implementation is often complex and hard to understand, with an informal correctness proof. It is normally described using a high-level specification language (e.g., Hoare logic) supported by a theorem proving system (e.g., Isabelle [74]) to prove its correctness [79].

Another approach for implementing solvers is adopting an appropriate solver-aided domain-specific language (e.g., Rosette [33,48]) and tools. DSLs are used to package the insights and knowledge of domain experts and allow other people who are interested in that application domain to effectively solve problems in that domain [30]. Rosette is a solver-aided DSL built on top of a programmable programming language called Racket to enable the development of tools based on program verification and synthesis concepts [33,48].

Unlike the algorithmic-based approach, where a compiler from a programming language into the constraint solving system must be built, which is an extremely hard task, a solver-aided DSL simplifies the task: instead of building this special kind of compiler, one builds an interpreter for the DSL, or just a library or an API when using an embedded type of DSL. The interpreter of the language relies on a so-called symbolic virtual machine to translate the given program semantics into constraints. When using a solver-aided DSL, the synthesis framework becomes simpler and better, as the translation from the language into constraints is obtained automatically [33,48].

- ○ Boolean Satisfiability Problem (SAT solver): The Boolean Satisfiability Problem (SAT) can be defined as the problem of checking whether or not a formula expressed in Boolean logic is satisfiable. SAT solving is considered the cornerstone of several software engineering applications, such as system design, model checking, hardware debugging, pattern generation, and software verification [33]. The SAT problem was the first problem proven to be NP-complete; nevertheless, modern solving algorithms handle practical instances involving thousands of variables and millions of constraints [33,75]. There are several program synthesis frameworks that use a SAT solver to resolve the synthesis problem, implemented based on an algorithmic approach using C++ or Python. For instance, the SKETCH framework utilizes the SAT solving technique in a counterexample-guided iteration that interacts with a verifier to check the candidate program against the specification and generates counterexamples until the final program that meets the complete specifications is found [27,94]. Additionally, SAT solving and the so-called gradient-based numerical optimization technique are combined and used for solving program synthesis problems in the Real Synthesis (REAS) framework [76]. The search space in REAS is explored using the SAT solver for solving constraints on discrete variables to fix the set of Boolean expressions that appear in the program structure. This allows better tolerance of approximation errors, which leads to efficient approximation results. The REAS technique is implemented within the SKETCH framework. The end user, a programmer, writes their program with a set of unknowns using the high-level SKETCH language to express the intent. These unknowns are Boolean expressions (constraints) that need to be solved [76].
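The Boolean-unknown-solving step can be illustrated without a real SAT solver by enumerating assignments directly. The two-hole sketch template and the XOR specification below are invented for illustration; SKETCH delegates this search to a SAT solver rather than enumerating.

```python
# Toy SAT-style synthesis: find Boolean "unknowns" (holes h1, h2) so that a
# candidate circuit matches a reference specification on all inputs. A real
# system encodes this as a SAT instance; here we enumerate assignments.
from itertools import product

def spec(a, b):
    return a != b                      # desired behavior: XOR

def candidate(a, b, h1, h2):
    # Invented sketch: each hole selects one of two subterms.
    left = (a and not b) if h1 else (a and b)
    right = (b and not a) if h2 else (a or b)
    return left or right

def solve():
    for h1, h2 in product([False, True], repeat=2):
        if all(candidate(a, b, h1, h2) == spec(a, b)
               for a, b in product([False, True], repeat=2)):
            return h1, h2
    return None

print(solve())  # -> (True, True): XOR = (a and not b) or (b and not a)
```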

- ○ Satisfiability Modulo Theories (SMT solver): Satisfiability Modulo Theories (SMT) is a technique used to find satisfying solutions for First-Order Logic (FOL) with equality formulas. FOL formulas include the Boolean operations of Boolean logic but allow more complicated expressions than plain variables, including functions, predicates, and constants; sometimes, applying SAT solvers to a program synthesis problem requires such richer logic formulas. Thus, in SMT formulas, some propositional variables of the SAT formula are replaced with First-Order predicates, i.e., Boolean functions over some variables [77]. The use of SMT solvers has emerged as a useful tool for verification, symbolic execution, theorem proving, and program synthesis approaches. There are many available SMT solvers, such as Z3 and the Cooperating Validity Checker (CVC4), that are used for solving the program synthesis problem. These frameworks are implemented based on an algorithmic approach using general-purpose programming languages [77].

For instance, Z3 is an SMT solver implemented in C++ and produced by Microsoft Research to tackle software analysis and verification problems. It works as a reasoning engine that proves the correctness of programs or discovers their errors by analyzing verification conditions. Additionally, Z3 acts as a test-case generation tool, producing new test cases with different behaviors from the execution traces of the program [77]. Some techniques integrate SMT solving tools with various synthesis approaches. According to [78], the syntax-guided approach to program synthesis allows users to provide hints that guide the synthesizer toward solutions to its synthesis conjectures, and SMT solvers are used to solve these conjectures. The CVC4 SMT solver, as shown in the work presented in [78], is extended with capabilities that make it efficient for synthesis conjectures using two embedded techniques, namely, Quantifier Instantiation and Syntax-Guided Enumeration.
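To make the SAT-versus-SMT distinction concrete, here is a minimal pure-Python stand-in (not a real SMT solver such as Z3 or CVC4): a propositional skeleton whose atoms are first-order predicates over an integer variable, with a bounded search playing the role of the theory decision procedure. The formula and predicates are invented for illustration:

```python
# Toy SMT-style satisfiability check (illustrative stand-in; real solvers
# use far more sophisticated decision procedures than bounded search).
def atoms(x):
    # Propositional atoms replaced by first-order predicates over x.
    return {"p": x > 0, "q": x % 2 == 0, "r": x < 10}

def formula(a):
    # Boolean skeleton: (p or q) and (not p or r)
    return (a["p"] or a["q"]) and (not a["p"] or a["r"])

# Bounded search over an integer domain stands in for the theory solver:
# find a model, i.e., a value of x making the formula true.
model = next((x for x in range(-5, 6) if formula(atoms(x))), None)
```

In a genuine SMT solver the propositional skeleton is handled by a SAT core while a theory solver checks the consistency of the predicate assignments, rather than enumerating values of x.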

Machine Learning: Machine learning (ML) is an application of the artificial intelligence (AI) branch of computer science that enables machines to learn from massive amounts of data without being explicitly programmed. In the context of software engineering, ML techniques have brought great advances in program synthesis, where they may be used to create automated tools with better code comprehension that help developers understand and modify their code using knowledge extraction or recognition techniques [81]. Thus, the synthesis problem is introduced here as a machine learning problem. Developers who follow this approach to the synthesis problem face a variety of independent choices, expressed in FD 17 by some optional notations. Different learning techniques are used to guide the synthesis search and to automatically decompose the synthesis problem, such as deep learning, neural networks, reinforcement learning, and version space learning. These learning styles are illustrated in Figure 18.

- ○
Version Space Learning: Version Space Learning is commonly used in programming-by-demonstration (PBD) synthesis applications. In the PBD approach, a programmer demonstrates how to perform a task, and the system learns an appropriate representation of the procedure for that task. Version Space is considered a logical approach to machine learning in which the concepts to be learned are described in some logical language.

The learning process can be seen as a search over a hypothesis space, where each hypothesis maps a complex object to a binary classification. Different learning algorithms may be used to search this space. The version space is the subset of hypotheses (expressible as a disjunction of logical formulas) that are consistent with the training data D, which can be defined as

VS(H, D) = { h ∈ H | h(x) = y for every example (x, y) ∈ D }.

In this approach, the learning algorithm uses a number of training examples to restrict the hypothesis space: each hypothesis inconsistent with a given example is removed from the space. This refinement process of the hypothesis space is called the candidate elimination algorithm.
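A minimal sketch of candidate elimination, assuming a hypothetical hypothesis space of threshold classifiers h_t(x) = (x >= t):

```python
# Toy candidate elimination (hypothetical hypothesis space): each
# hypothesis is a threshold t classifying x as positive iff x >= t.
def consistent(t, x, label):
    return (x >= t) == label

def eliminate(space, examples):
    # Remove every hypothesis inconsistent with some labelled example;
    # what survives is the version space for the given data.
    for x, label in examples:
        space = [t for t in space if consistent(t, x, label)]
    return space

space = list(range(11))               # candidate thresholds 0..10
examples = [(7, True), (3, False)]    # labelled training examples
remaining = eliminate(space, examples)
# surviving thresholds satisfy 3 < t <= 7
```

Each additional example can only shrink the surviving set, mirroring the refinement process described above.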

According to [62], a version space extended with algebraic operations is used to learn a synthesis approach from the execution traces of programs, in addition to their inputs and outputs. Algebraic operators, such as the union, intersection, join, and transformation operators, are used to construct complex version spaces. This allows an efficient, exhaustive search of the space of programs consistent with the training execution traces. The designed learner is able to recognize the control structures of a program, such as IF and WHILE statements, as well as the array data structure. In the evaluation of this approach, it was shown to provide correct results from a small number of training examples.

Additionally, the SMARTedit framework was introduced as a PBD application based on version space algebra [180]. Reusable version space components were designed, alongside the version space algebra, for the domain of text editing, supporting a subset of the Emacs command language. Any type of mapping between inputs and outputs, including Boolean values and structured objects, is considered.

- ○
Reinforcement Learning: Reinforcement learning (RL) is a subfield of machine learning that aims to teach an agent how to perform a specific task and achieve a goal in an uncertain, potentially complex environment. Many RL applications have emerged with the rapid advancement of games technology and robotics. In the context of program synthesis, reinforcement learning algorithms are applied within various frameworks to maximize the likelihood of generating semantically correct programs, as well as to tackle program aliasing issues, where different programs may satisfy a given specification [19,139].

Other, supervised, machine learning techniques show a clear limitation when dealing with, for example, program aliasing. With the synthesis approach presented in [139], the generation of any program consistent with the given I/O examples is directly encouraged by using policy gradient reinforcement learning instead of only optimizing the maximum likelihood. Furthermore, a syntax checker is used to prune the space of possible programs, which helps to generate better programs.
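The policy-gradient idea can be illustrated with a minimal REINFORCE sketch over a tiny space of candidate programs; the programs, I/O examples, and hyperparameters below are all hypothetical, chosen only to keep the example self-contained:

```python
import math
import random

random.seed(0)  # deterministic toy run

# Four candidate one-step programs; the I/O examples are consistent
# only with "multiply by two" (hypothetical setup for illustration).
PROGRAMS = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 1, lambda x: x]
EXAMPLES = [(1, 2), (4, 8)]

def reward(i):
    # Binary reward: 1 if program i satisfies every I/O example.
    return 1.0 if all(PROGRAMS[i](a) == b for a, b in EXAMPLES) else 0.0

def softmax(t):
    m = max(t)
    e = [math.exp(v - m) for v in t]
    s = sum(e)
    return [v / s for v in e]

theta = [0.0] * len(PROGRAMS)  # policy parameters: one logit per program
lr = 0.5

for _ in range(200):
    probs = softmax(theta)
    i = random.choices(range(len(PROGRAMS)), weights=probs)[0]
    r = reward(i)
    # REINFORCE update: grad of log pi(i) w.r.t. theta_j is 1[j == i] - probs[j]
    for j in range(len(theta)):
        theta[j] += lr * r * ((1.0 if j == i else 0.0) - probs[j])

best = max(range(len(PROGRAMS)), key=lambda j: theta[j])
```

Because any consistent program earns the same reward, this objective rewards all aliases of the target behavior equally, which is the property the approaches above exploit; real systems sample token-by-token from a neural policy rather than from a fixed program list.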

Reinforcement learning has also been applied to mapping natural language into executable programs. In the approach presented in [19], reinforcement learning (RL) is integrated with the maximum marginal likelihood (MML) paradigm. This resulted in a new learning algorithm that can be applied to a neural semantic parser and showed significant results. It can deal with spurious program bias by adopting an exploration strategy based on approximating the policy gradients of both RL and MML, which guide the exploration task [19].

- ○
Neural-Network-Based Learning: A neural network can be defined as an interconnected group of artificial neurons that use a mathematical or computational model for information processing. Neural networks are used to solve AI problems by building classification and prediction systems. According to the domain analysis, neural-network-based approaches to program synthesis have gained growing attention from the software engineering research community, reflecting the popularity of NNs for machine learning in recent years. Several recent research works have introduced neural-network-based frameworks and approaches to program synthesis from I/O examples [57].

- ○
Deep Learning: As mentioned earlier in this paper, deep learning (DL) can be defined as a branch of machine learning where the architecture of a learning approach consists of multiple layers of data processing units. A variety of synthesis frameworks adopt deep learning techniques, such as deep neural networks (convolutional and recurrent NNs) and deep reinforcement learning [116]. The RobustFill framework [82], for instance, is a neural program synthesis framework based on RNNs that allows variable-length sets of input/output examples (pairs) to be encoded.

- ○
A Domain-Specific Language (DSL) is used in RobustFill to express the collection of transformation rules for different textual operations, such as substring extractions, constant strings, and text conversions. The adopted DSL can express complex textual expressions (strings) by employing an effective regular-expression extraction technique: a DSL program takes a given string as input and returns another string as output. The synthesis system is trained on a number of I/O examples and has been shown to achieve 92% accuracy. It is worth mentioning that, during the conducted domain analysis, we found various deep learning techniques adopted in different program synthesis applications, such as DeepCom [126] and CRAIC [127] for code comment generation, the CDE-Model [118] for code summarization, DeepRepair [156] for code repair, and RobustFill [82] and DLPaper2Code [145] for code translation and generation.
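The flavor of such a string-transformation DSL can be sketched as follows, assuming a hypothetical two-operation vocabulary (constant strings and substring extraction) searched by naive enumeration rather than by RobustFill's neural model, purely for illustration:

```python
from itertools import product

# Miniature string-transformation DSL (hypothetical, far smaller than
# RobustFill's): a program is a sequence of operations, each either a
# constant string or a substring extraction from the input.
def run(program, inp):
    out = []
    for op in program:
        if op[0] == "const":
            out.append(op[1])
        else:                      # ("substr", i, j)
            out.append(inp[op[1]:op[2]])
    return "".join(out)

def synthesize(examples, max_len=2):
    # Enumerate short programs over a small operation vocabulary and
    # return the first one consistent with every I/O example.
    n = min(len(i) for i, _ in examples)
    ops = [("const", " ")] + [("substr", i, j)
                              for i in range(n) for j in range(i + 1, n + 1)]
    for length in range(1, max_len + 1):
        for prog in product(ops, repeat=length):
            if all(run(prog, i) == o for i, o in examples):
                return prog
    return None

examples = [("hello", "he"), ("world", "wo")]
prog = synthesize(examples)   # finds a "first two characters" program
```

A neural synthesizer replaces this exhaustive enumeration with a learned decoder that emits likely DSL programs conditioned on the encoded examples.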

Further comparison and details of these frameworks are beyond the scope of this paper. For each mentioned framework, only the kind of DL technique adopted was extracted to be used in developing the lower level of the above-mentioned feature diagram (Figure 18).