1. Introduction
Controlled experiments are lacking in many fields of computer science. For example, systematic mapping studies (SMSs) on domain-specific languages (DSLs) [1] and on modeling languages for the Internet of Things (IoT) [2] show a clear lack of evaluation research papers [3]. In these fields, the lack of replications of controlled experiments is an even bigger concern, since study replications can (1) increase the trustworthiness of the results from empirical studies, or (2) extend the theory and yield new knowledge. To achieve the first objective, an experiment performed as closely as possible to the original one is needed, while, to accomplish the second objective, a differentiated replication experiment must be performed (i.e., one with some intentional changes to, for example, the design, hypotheses, context, or measurements) [4].
There is a lack of experimental studies (e.g., controlled experiments) on how code bloat and unexpected solutions influence the comprehension of automatically generated genetic programming (GP) solutions (specifications, models, and programs) [5,6,7,8,9,10,11]. Nevertheless, there is a common belief that code bloat and unexpected solutions impact the comprehension of GP solutions negatively [12,13,14]. In our earlier study [15], we examined this belief in a more scientific manner. By designing and implementing a controlled experiment [15], we tested the comprehension correctness and efficiency of automatically generated GP solutions in the area of semantic inference [16,17]. We found that the statistically significantly lower comprehension correctness and efficiency can be attributed to unexpected solutions and code bloat [15].
One of the primary threats to validity in the original study [15] was that it was unknown how much code bloat alone contributed to the statistically significantly lower comprehension correctness and efficiency. Hence, we performed a new study, this time an internal differentiated replication experiment, focused solely on the impact of code bloat on genetic program comprehension. To do so, we needed to change the design and hypotheses of the experiment.
Our additional motivation for conducting this replication was twofold:
- 1. To address the limitation of the original study; and
- 2. To conduct the first replicated study on the comprehension of GP solutions.
We found no examples of replicated studies on the comprehension of GP solutions.
Both the original [15] and the replicated experiment were performed using attribute grammars as the programming artefacts to be comprehended. An attribute grammar is a triplet AG = (G, A, R), where G is a context-free grammar; A is a set of attributes, inherited and synthesized, that carry semantic information; and R is a set of semantic equations defining the synthesized attributes attached to the non-terminal on the left-hand side of a production and the inherited attributes attached to the non-terminals on the right-hand side of a production. As such, code bloat can appear in every semantic equation. Note also that attribute grammars are a declarative specification of a programming language's semantics [18,19,20,21]. The order of the semantic equations' evaluation is not specified; it is controlled by the dependencies between attributes. If a semantic equation depends on an attribute whose value has not yet been determined, the evaluation of that equation is postponed until all the attributes it uses have been determined.
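This postponement can be illustrated with a small worklist evaluator. The following Python sketch is our own illustration, not the evaluation algorithm of LISA or any other attribute grammar tool: an equation is evaluated only once every attribute it reads has been determined.

# Dependency-driven evaluation of semantic equations (illustrative sketch):
# an equation is evaluated only when every attribute it reads is already
# determined; otherwise its evaluation is postponed to a later pass.
def evaluate(equations):
    # equations: list of (target, reads, fn); fn maps the current
    # attribute environment to the value of the target attribute.
    env = {}
    pending = list(equations)
    while pending:
        ready = [eq for eq in pending if all(r in env for r in eq[1])]
        if not ready:                       # no equation can proceed
            raise ValueError("circular attribute dependencies")
        for target, reads, fn in ready:
            env[target] = fn(env)           # all inputs are determined
        pending = [eq for eq in pending if eq[0] not in env]
    return env

# For example, START.outx = COMMANDS.outy (cf. Listing 1) must wait
# until COMMANDS.outy has been evaluated:
env = evaluate([
    ("START.outx", ["COMMANDS.outy"], lambda e: e["COMMANDS.outy"]),
    ("COMMANDS.outy", [], lambda e: 0),
])
print(env["START.outx"])                    # prints 0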
An example of an attribute grammar describing a simple robot movement is shown in Listing 1 [15]. In fact, it is an unexpected solution generated by GP: the previous position of the robot is not propagated to the next command, as in manually written attribute grammars, but is always the position (0, 0) (see the semantic equations COMMAND.inx = 0; and COMMAND.iny = 0;). To obtain the correct position of the robot, the individual positions need to be summed up in the automatically generated attribute grammar. The computation is complicated further by the exchange of the x and y coordinates. Such an automatically found GP solution was very unusual, and nobody from the treatment group solved this task successfully (a worked trace is given after Listing 1). The automatically generated attribute grammar for the robot language also contains examples of simple code bloat (e.g., the semantic equation COMMAND.outy = 0+0;). Therefore, the statistically significantly lower comprehension correctness and efficiency can be attributed mostly to the unexpected solutions. Namely, the code bloat was not extensive and was controlled by limiting the depth of the expression tree, which was set to 2.
The main goal of the original controlled experiment [15] was to provide empirical data regarding the common belief in the GP community [12,13,14] that solutions generated automatically by GP are difficult to understand due to code bloat and unusual solutions. The controlled experiment was a between-subjects study comparing manually written attribute grammars [22,23] and attribute grammars generated automatically using GP [17,24]. In [15], we formulated the following hypotheses:
H1: Automatically generated attribute grammars decrease the correctness of the participants' specification comprehension over manually written attribute grammars significantly.
H2: Automatically generated attribute grammars worsen the efficiency of the participants' specification comprehension over manually written attribute grammars significantly.
Listing 1. Automatically generated attribute grammar for the robot language.
language Robot {
lexicon {
keywords begin | end
operation left | right | up | down
ignore [\0x0D\0x0A\ ]
}
attributes int *.inx; int *.iny;
int *.outx; int *.outy;
rule start {
START ::= begin COMMANDS end compute {
START.outx = COMMANDS.outy;
START.outy = COMMANDS.iny+COMMANDS.outx;
COMMANDS.inx=0;
COMMANDS.iny=0;
};
}
rule commands {
COMMANDS ::= COMMAND COMMANDS compute {
COMMANDS[0].outx = COMMANDS[1].outx+COMMAND.outx;
COMMANDS[0].outy = COMMANDS[1].outy-COMMAND.outy;
COMMAND.inx = 0;
COMMAND.iny = 0;
COMMANDS[1].inx = 0+COMMANDS[1].outy;
COMMANDS[1].iny = COMMAND.iny;
}
| epsilon compute {
COMMANDS[0].outx = COMMANDS[0].iny-COMMANDS[0].outy;
COMMANDS[0].outy = 0;
};
}
rule command {
COMMAND ::= left compute {
COMMAND.outx = COMMAND.iny-0;
COMMAND.outy = 1+0;
};
COMMAND ::= right compute {
COMMAND.outx = COMMAND.inx-COMMAND.iny;
COMMAND.outy = 0-1;
};
COMMAND ::= up compute {
COMMAND.outx = 1;
COMMAND.outy = 0+0;
};
COMMAND ::= down compute {
COMMAND.outx = 0-1;
COMMAND.outy = COMMAND.iny;
};
}
}
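To see how this unexpected solution nevertheless computes the correct robot positions, the following minimal Python simulation of the semantic equations in Listing 1 (our own sketch, not an artefact of either experiment) traces a few command sequences:

def command(op, inx, iny):
    # COMMAND ::= left | right | up | down (returns (outx, outy)).
    return {
        "left":  (iny - 0, 1 + 0),
        "right": (inx - iny, 0 - 1),
        "up":    (1, 0 + 0),
        "down":  (0 - 1, iny),
    }[op]

def commands(ops, iny):
    # COMMANDS ::= COMMAND COMMANDS | epsilon. COMMANDS.inx is defined
    # in Listing 1 but never read, so it is omitted here; that equation
    # is itself a small piece of neutral code.
    if not ops:                                # epsilon production
        outy = 0
        return iny - outy, outy
    c_outx, c_outy = command(ops[0], 0, 0)     # COMMAND.inx = COMMAND.iny = 0
    r_outx, r_outy = commands(ops[1:], 0)      # COMMANDS[1].iny = COMMAND.iny
    return r_outx + c_outx, r_outy - c_outy

def start(ops):
    # START ::= begin COMMANDS end; note the exchanged x and y coordinates.
    outx, outy = commands(ops, 0)
    return outy, 0 + outx

print(start(["up"]))           # (0, 1): one step up
print(start(["right"]))        # (1, 0): one step right
print(start(["up", "right"]))  # (1, 1): the steps are summed up

Each command contributes a single step that is summed up along the COMMANDS list, and the final coordinates are swapped back in the START rule, which is exactly what made this solution so hard to comprehend.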
The participants of the controlled experiment [15] had to solve tasks from which attribute grammar comprehension can be measured. The participants in the control group received attribute grammars written manually by a language designer, whilst the participants in the treatment group received attribute grammars generated automatically by the LISA.SI tool [17,24] using GP, where the automatically generated attribute grammars were evaluated with the LISA tool [25,26]. The results of the original study supported both hypotheses. The participants' comprehension correctness was significantly lower for automatically generated attribute grammars than for manually written attribute grammars (H1). The participants were significantly less efficient at comprehending automatically generated attribute grammars than manually written attribute grammars (H2). As mentioned above, we found in [15] that the statistically significantly lower comprehension correctness and efficiency can be attributed to unexpected solutions and code bloat.
To study the impact of code bloat on the comprehension of attribute grammars, we designed a new controlled experiment that is a differentiated replication of the previous study [15]. The problem tasks were the same as in the original experiment [15]; however, the provided attribute grammars now contain more realistic examples of code bloat. The comprehension correctness and efficiency were then compared with those for the manually written attribute grammars without code bloat (control group, UM FERI, from [15]) and the automatically generated attribute grammars with possible unexpected solutions (treatment group, UL FRI, from [15]). As such, this study can be regarded as an internally differentiated replication [4]: the same team of experimenters who conducted the original experiment also conducted the replication, and some intentional changes were made to the original experiment with respect to design, hypotheses, context, and measurements.
The remainder of this paper is organized as follows. Section 2 describes related work, and Section 3 describes the replicated experiment. Section 4 presents the replication results and data analysis. Section 5 describes the threats to validity, and Section 6 summarizes the key findings.
2. Related Work
Code bloat has been identified as an undesired property since the inception of GP [5]. It reduces the efficiency of search and execution performance, and bloated solutions are less robust and generalizable and are harder to understand [13,14,24]. Hence, a lot of research has been done in GP to limit code bloat [12,27,28,29,30,31,32]. Our goal was not to prevent or limit code bloat, and the various approaches to limiting code bloat are not discussed further. On the contrary, our goal was to study how code bloat influences the comprehension correctness and efficiency of GP solutions (specifications, models, and programs). There is a serious lack of such research in GP; to the best of our knowledge, our previous study [15] is the only such study. Why does this matter? The field of evolutionary computation (EC) has matured and is used daily in industrial applications [33,34,35]. However, engineers in industry do not use automatically generated GP solutions blindly. They need to understand how things work and why things are better. They need to learn about design options and alternatives. For them, automatically generated solutions are frequently inspirational [35]. In this respect, our study aims at identifying understandable attribute grammars and thus at improving the understandability of GP-generated solutions in general. In a wider context, it can therefore be considered a modest step towards explainable artificial intelligence (XAI) [36], but with an emphasis on a detailed understanding of the solutions (which must be verified per se) rather than of the model deriving the solutions, since programmers, the predominant users of GP, understand at least the basics of the GP approach anyway. In terms of programming artefacts, interpretability naturally derives from understandability, one of the five dimensions of XAI effects [37].
As mentioned previously, GP has been used extensively in many studies where unexpected or rare solutions have been reported. For example, attribute grammars are used to specify the syntax and semantics of computer languages formally [19,20,21,38,39] and to generate language-based tools automatically (e.g., editors, debuggers, and visualizers) [40]. In our previous work, we used GP to generate context-free grammars (CFGs)/attribute grammars automatically, and some unexpected solutions were reported in [17,41]. Furthermore, GP has been used to repair defects in C programs [42]. The authors reported some rare patches where the difference between the original source and the repair was more than 200 lines [42]. GP has been used for graphics processing unit (GPU) code optimization [43]. The authors reported that GP accidentally rediscovers approximate solutions and improvements by relaxing the memory synchronization requirements. Their conclusion was that epistatic optimizations can be hard for humans to find [43]. GP has been used to solve a robotic task in unpredictable environments [44], where the authors reported complicated solutions. In [45], the evolution of shape grammars was used for the generative design process, where the authors reported surprising and innovative solutions. In [46], the author shows that, by mimicking Darwinian mechanisms, novel, complex, and previously unknown patterns can be discovered. Last, but not least, in [47], the authors collected numerous examples where EC and artificial life (AL) produced clever solutions that humans did not consider or had thought impossible. The authors classified those examples into four categories: misspecified fitness functions, unintended debugging, exceeded experimenter expectations, and convergence with biology. In the misspecified fitness function cases, digital evolution exploits loopholes in fitness function measures (e.g., why walk when you can somersault?). In unintended debugging, digital evolution exploits bugs or minor flaws in the implemented laws of physics (e.g., why walk around the wall when you can walk over it?). In exceeded experimenter expectations, digital evolution produced legitimate solutions that went beyond experimenter expectations (e.g., impossibly compact solutions). In convergence with biology, digital evolution, surprisingly, converges with biological evolution (e.g., the evolution of parasitism). However, none of these studies measured how unexpected/rare solutions influence the comprehension correctness and efficiency of such solutions.
On the other hand, the comprehension of various artefacts has been studied and measured extensively, for example, the comprehension of block-based programming versus text-based programming [48], the comprehension of DSL programs versus GPL (general-purpose language) programs [49,50,51], the influence of visual notation on requirement specification apprehension [52], and the comprehension of charts by visually impaired individuals [53].
The replication of experiments is extremely important, since it increases the trustworthiness of the results, and the derived conclusions can be more generalizable [54,55]. The work in [49] can be regarded as the first in a family of experiments measuring the comprehension correctness and efficiency of DSLs [56,57]. It inspired others to test similar hypotheses using different DSLs, different control and treatment groups, and different assignments. Studies (e.g., [50,58,59,60]) showed that the original hypotheses [49] can be generalized; indeed, the comprehension correctness and efficiency of DSLs are better than those of GPLs. The same authors as in [49] also performed a replication study [61] whose results expanded on the original study, since it additionally showed that programmers' comprehension of programs written in DSLs is more accurate and efficient than with GPLs, even when working with appropriate integrated development environments (IDEs). This study was later replicated with slightly different goals [62]. Extensive knowledge, experience, and trust in using DSLs have been built in such a manner.
3. Method
This study is an internally differentiated replication [4] of the previous controlled experiment [15], with the aim of testing how code bloat alone influences comprehension correctness and efficiency in GP. Code bloat is a phenomenon of variable-length solution representations in GP, where code growth does not change the meaning of the code [5,6,8] (e.g., the semantic equation COMMANDS[0].outy = COMMANDS[1].outy + 0; in Listing 2). Such code is often called neutral code, or an intron. As such, code bloat is a dynamic behavior of a GP run, where fitness stagnates but the average solution size increases. As a result of this process, the final solution might contain code bloat; such a solution is called a bloated solution. In this work, we did not differentiate between a bloated solution and an attribute grammar with code bloat (both terms are used interchangeably).
The study in [15] showed that comprehension of automatically generated attribute grammars decreases comprehension correctness and efficiency statistically significantly compared with manually written attribute grammars. Automatically generated attribute grammars were harder to comprehend than manually written attribute grammars due to unexpected solutions and code bloat. However, the examples of code bloat were not extensive, and the effect can be explained mainly by the unexpected solutions. Nevertheless, the effect of code bloat was still present. It is reasonable to assume that increasing the code bloat will increase this effect as well. On the other hand, we wanted to know how code bloat alone hampers the comprehension of attribute grammars.
For the purpose of this study, functional code bloat [63] was inserted manually into the manually written attribute grammars from Questionnaire 2 (Test Execution 2; see Table 1) of the original experiment. The manual insertion of code bloat can also be seen as obfuscating the attribute grammars. Code obfuscation [64] is a technique that transforms an original program into a semantically equivalent program that is much harder to understand. Code obfuscation has been used primarily to protect proprietary software from unauthorized reverse engineering. However, there are other potential applications of obfuscation (e.g., [65,66]). In our case, obfuscation was used to mimic code bloat in GP. We relied on the assumption that attribute grammars with code bloat generated automatically by GP are equivalent to our manually obfuscated attribute grammars. To find out how code bloat alone hampers attribute grammar comprehension correctness and efficiency, an internally differentiated replicated experiment was designed by the same authors as in [15], whilst some changes were made intentionally to the original experiment (see Table 1). Regarding statistical tests, this replication used a between-subjects design. If the data were not normally distributed (based upon the Shapiro–Wilk normality test), we used the Mann–Whitney U test to compare the results; otherwise, the independent-samples t-test was used (a sketch of this decision procedure is given below).
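The following Python sketch summarizes this test-selection procedure (our illustration only; the variable names are placeholders, not the study's actual data):

from scipy import stats

# Select and run the appropriate two-sample test, as described above:
# Shapiro-Wilk on each group first, then the Mann-Whitney U test or the
# independent-samples t-test.
def compare_groups(control, treatment, alpha=0.05):
    normal = (stats.shapiro(control).pvalue > alpha and
              stats.shapiro(treatment).pvalue > alpha)
    if normal:
        return stats.ttest_ind(control, treatment)    # independent-samples t-test
    return stats.mannwhitneyu(control, treatment)     # Mann-Whitney U test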
The following experiment's context, design, and measurements were kept the same as in the original experiment [15]:
- The participants were undergraduate students in the computer science programs at the University of Maribor (UM FERI, second-year students) and at the University of Ljubljana (UL FRI, third-year students), attending the compiling programming languages course (UM FERI) and the compilers course (UL FRI), respectively. No students participated in both experiments, the original and the replicated.
- The replicated experiment was performed simultaneously for both groups and close to the end of the course.
- The students' participation was optional and rewarded. The reward was based on the correctness of the answers.
- The experiment consisted of a background questionnaire and a feedback questionnaire after the test. Both questionnaires lasted approximately 5 min.
- The test consisted of the same seven tasks (Q1–Q7) as in the second test of the original study [15] (Test Execution 2 in Table 1):
- Q1 Identification of a correct attribute grammar for the language.
- Q2 Identification of a correct attribute grammar for simple expressions.
- Q3 Identification of a correct attribute grammar for the robot language.
- Q4 Identification of a correct attribute grammar for a simple where statement.
- Q5 Correct a wrong semantic equation in the attribute grammar for the language.
- Q6 Correct a wrong semantic equation in the attribute grammar for simple expressions.
- Q7 Correct a wrong semantic equation in the attribute grammar for the robot language.
- Only manual comprehension of the attribute grammars was possible, since code execution in the LISA tool [25,26] was not allowed.
- The duration of the test was 75 min.
- The same measures were applied for correctness and efficiency. Comprehension correctness is measured as the percentage of correctly answered questions. Comprehension efficiency is measured as the ratio of the percentage of correctly answered questions to the amount of time spent answering the questions [49,61,67]. Both measures are restated in formula form below.
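For clarity, the two measures can be restated as follows (our notation; the time unit is the one recorded in the experiment):

\[
  \mathrm{Correctness} = \frac{\text{number of correctly answered questions}}{\text{number of questions}} \times 100\%,
  \qquad
  \mathrm{Efficiency} = \frac{\mathrm{Correctness}}{\text{time spent answering}}
\]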
With the change of focus to the impact of code bloat in particular, the following changes from the original experiment [15] were made intentionally.

Table 1 provides an overview of the experimental steps and a brief description of the changes made in the replication compared with the original experiment. During the lectures phase, we introduced the basic ideas and concepts of attribute grammars. The students learned about attribute grammars during four lectures, each lasting 90 min. Next, in the original experiment, we gave the participants a background survey, in which they answered questions about their background knowledge and abilities. In the replication study, we decided to combine the background and feedback questionnaires. In the original experiment, as described earlier in the paper, the first test identified a group with better knowledge of attribute grammars, and the results were used to assign students to the manually written or the automatically generated tasks. In the replication study, all the students solved the same tasks, so there was no need to execute the first test. Both tests, in the original and the replication study, started with a short introduction covering the questionnaire, the task definition (multiple-choice questions with one correct answer), and the study goals. This included a short presentation supported by slides that lasted at most 5 min. During the second test (Test Execution 2), the participants answered seven questions in both the original and the replicated study. Finally, the replicated study concluded with a combined background and feedback survey. In the latter, the participants provided feedback regarding the simplicity of the attribute grammar exercises, which enabled us to understand the participants' opinions on the effect of code bloat on the understanding of attribute grammar specifications.
Table 2 provides information about the participants. The original experiment was conducted in 2022, while the replication was completed in 2023. Both studies were conducted at the University of Maribor (UM FERI) and the University of Ljubljana (UL FRI). In the original research, the students from these two universities participated in the experiment separately. At UM FERI, the students solved manually written attribute grammar tasks (Questionnaire 2), while the students from UL FRI solved tasks on understanding automatically generated attribute grammars (Questionnaire 3). In the replicated study, we combined students from both universities to solve tasks that included code bloat (Questionnaire 4). The original experiment included 42 students (combined from both universities), while the replicated study included 46 students.
The inclusion of code bloat in the experiment tasks is presented in Table 3. Questionnaire 4 was newly defined in this replicated study, and its tasks included code bloat. We compare the results of Questionnaire 4 (tasks with code bloat) to the results from Questionnaire 2 (tasks without code bloat) and Questionnaire 3 (tasks with possible unexpected solutions and limited code bloat). As stated earlier in this paper, the Questionnaire 1 results were used during the original experiment to define the study groups and are therefore not compared with the results from Questionnaire 2.
An example of an attribute grammar with code bloat is presented in Listing 2, where a few examples of code bloat in semantic equations can be observed (these equivalences are checked symbolically in the sketch after this list):
- START.outx = COMMANDS.outx * (COMMANDS.inx+1); in the production START ::= begin COMMANDS end. Since the value of the attribute COMMANDS.inx is 0 (see the semantic equation COMMANDS.inx = 1-1;), this semantic equation is equivalent to START.outx = COMMANDS.outx;
- START.outy = COMMANDS.outy + COMMANDS.iny; in the production START ::= begin COMMANDS end. Since the value of the attribute COMMANDS.iny is 0 (see the semantic equation COMMANDS.iny = 1+1-1+1-1-1;), this semantic equation is equivalent to START.outy = COMMANDS.outy;
- COMMAND.outx = (1+1) * (COMMAND.inx+1) - COMMAND.inx-1; in the production COMMAND ::= right. After equation simplification, this semantic equation is equivalent to COMMAND.outx = COMMAND.inx+1;
- COMMAND.outy = (COMMAND.iny+1) * (COMMAND.iny-COMMAND.iny) + COMMAND.iny; in the production COMMAND ::= right. After equation simplification, this semantic equation is equivalent to COMMAND.outy = COMMAND.iny;
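The claimed equivalences are easy to verify symbolically. The following Python check (our own, using sympy; not part of the experimental material) confirms the two simplifications for the production COMMAND ::= right:

import sympy as sp

inx, iny = sp.symbols("inx iny")

# (1+1)*(inx+1) - inx - 1 simplifies to inx + 1 ...
assert sp.simplify((1 + 1) * (inx + 1) - inx - 1 - (inx + 1)) == 0
# ... and (iny+1)*(iny-iny) + iny simplifies to iny.
assert sp.simplify((iny + 1) * (iny - iny) + iny - iny) == 0
print("bloated right-hand sides simplify as claimed")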
A simple way to measure code bloat is to count the number of operands and operators. Note that the Halstead Program Length [68], one of the earliest measures of software complexity, is defined in exactly this way: as the total number of operator and operand occurrences in a code. A small counting sketch is given below.
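The following rough Python tokenizer (our illustration; the counts in Table 4 may have been obtained differently) shows how such counting can be done for a single semantic equation:

import re

OPERATOR = re.compile(r"[+\-*/=]")
OPERAND = re.compile(r"[A-Za-z_][\w.\[\]]*|\d+")

# Halstead Program Length: total operator occurrences plus total
# operand occurrences in the given equation.
def halstead_length(equation: str) -> int:
    return len(OPERATOR.findall(equation)) + len(OPERAND.findall(equation))

# The bloated equation from Listing 2 versus its simplified form:
print(halstead_length("COMMAND.outx = (1+1) * (COMMAND.inx+1) - COMMAND.inx - 1"))  # 13
print(halstead_length("COMMAND.outx = COMMAND.inx + 1"))                            # 5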
Table 4 presents the number of operands and operators in the original [15] and the replicated experiment (this study). It can be observed that both the manually written attribute grammars without code bloat (2022 UM FERI) and the automatically generated attribute grammars with possible unexpected solutions, where code bloat was controlled (2022 UL FRI), have a much lower Halstead complexity than the attribute grammars with code bloat (2023 UM FERI + UL FRI). Table 4 is further evidence that the statistically significant decrease in comprehension correctness and efficiency found in the original experiment [15] can be attributed mainly to unexpected solutions rather than to the simple versions of code bloat. Tasks Q1-Q4 were close-ended questions with provided options, where Qia-Qid represented the predefined responses for task i. The number of operands and operators in the automatically generated attribute grammars with possible unexpected solutions, where code bloat was controlled (2022 UL FRI), was 401, whilst this number was 341 in the manually written attribute grammars without code bloat: an increase of only about 18% in Halstead complexity. On the other hand, the number of operands and operators in the attribute grammars with code bloat (2023 UM FERI + UL FRI) was 869, which represents an increase of about 155% over the manually written attribute grammars without code bloat and an increase of about 117% over the automatically generated attribute grammars with possible unexpected solutions where code bloat was controlled.
Listing 2. Attribute grammar for the robot language with code bloat.
language Robot {
lexicon {
keywords begin | end
operation left | right | up | down
ignore [\0x0D\0x0A\ ]
}
attributes int *.inx; int *.iny;
int *.outx; int *.outy;
rule start {
START ::= begin COMMANDS end compute {
START.outx = COMMANDS.outx * (COMMANDS.inx + 1);
START.outy = COMMANDS.outy + COMMANDS.iny;
COMMANDS.inx = 1-1;
COMMANDS.iny = 1+1-1+1-1-1; };
}
rule commands {
COMMANDS ::= COMMAND COMMANDS compute {
COMMANDS[0].outx = COMMANDS[1].outx;
COMMANDS[0].outy = COMMANDS[1].outy + 0;
COMMAND.inx = COMMANDS[0].inx * 1;
COMMAND.iny = COMMANDS[0].iny;
COMMANDS[1].inx = COMMAND.inx + COMMAND.outx - COMMAND.inx;
COMMANDS[1].iny = COMMAND.outy + COMMAND.outy - COMMAND.outy; }
| epsilon compute {
COMMANDS[0].outx = COMMANDS[0].inx;
COMMANDS[0].outy = COMMANDS[0].iny; };
}
rule command {
COMMAND ::= left compute {
COMMAND.outx = 1 + COMMAND.inx - 1 - 1;
COMMAND.outy = 1 + COMMAND.iny - 1; };
COMMAND ::= right compute {
COMMAND.outx = (1 + 1) * (COMMAND.inx + 1) - COMMAND.inx - 1;
COMMAND.outy = (COMMAND.iny + 1) * (COMMAND.iny - COMMAND.iny) +
COMMAND.iny;
};
COMMAND ::= up compute {
COMMAND.outx = (COMMAND.inx + 1) * (COMMAND.inx - COMMAND.inx) +
COMMAND.inx;
COMMAND.outy = COMMAND.iny + 1;
};
COMMAND ::= down compute {
COMMAND.outx = COMMAND.inx + COMMAND.inx - COMMAND.inx;
COMMAND.outy = ((COMMAND.iny - 1) + 0);
};
}
}
4. Results
The aim of this study was to investigate how code bloat alone influences comprehension correctness and efficiency in understanding attribute grammar specifications for describing programming language semantics [19,20,21]. The results from the replicated experiment are compared to the results from the original experiment [15].
Table 5 shows the results from the background questionnaire, where the participants were asked about their perception of their acquired knowledge of programming, compilers, and attribute grammars. We used a five-point Likert scale, with 1 representing low and 5 representing high knowledge and interest. The results were compared to the original experiment performed in 2022, whilst the replicated experiment was performed one year later. Not all the participants submitted every questionnaire in the replicated study, so the number of participants N differs between measures (e.g., N = 46 for "Programming", but N = 45 for "Correctness" and "Efficiency" in the following tables). Similarly, in the original study, the background questionnaire was submitted by more participants than were later included in Test Execution 2 (49 vs. 42).
From the results, it can be observed that there was no statistically significant difference in acquired knowledge. The same is true for the participants' interest in programming and compilers (see Table 6). It can be concluded that there are no major differences between the original and replicated experiments in the participants' perceived knowledge of and interest in programming, compilers, and attribute grammars.
The result of the Mann–Whitney U test is shown in Table 7, where the comprehension correctness on attribute grammars without code bloat was compared to the comprehension correctness on attribute grammars with code bloat. Clearly, comprehension correctness is statistically significantly lower on attribute grammars with code bloat. The mean correctness on attribute grammars without code bloat was slightly below 80%, whilst the mean correctness on attribute grammars with code bloat was slightly below 50%. On the other hand, the result of the Mann–Whitney U test (Table 8) on comprehension correctness between automatically generated attribute grammars with possible unexpected solutions and attribute grammars with code bloat shows that there was no statistically significant difference. The mean correctness in both cases was around 50%. Hence, the comprehension of attribute grammars with code bloat was comparable to the comprehension of automatically generated attribute grammars with possible unexpected solutions. We can conclude that code bloat and unexpected solutions contribute equally to lower comprehension. The results from Table 7 and Table 8 allow us to accept the alternative hypothesis and the null hypothesis formulated in Section 3:
There is a significant difference in the correctness of the participants’ comprehension of attribute grammars without code bloat vs. attribute grammars with code bloat.
There is no significant difference in the correctness of the participants’ comprehension of automatically generated attribute grammars with possible unexpected solutions vs. attribute grammars with code bloat.
Table 9 shows the average correctness of the tasks on attribute grammars without code bloat (2022), automatically generated attribute grammars with possible unexpected solutions (2022), and attribute grammars with code bloat (2023). By delving deeper into particular tasks, we might better understand the different contributions of code bloat and unexpected solutions to comprehension correctness. The goal of task Q1 was to identify a correct attribute grammar for the language, and it can be regarded as a simple specification of language semantics. The automatically generated attribute grammars contained only controlled code bloat with simple cases, and correctness was slightly below 40%. Hence, the result that the correctness on attribute grammars with realistic code bloat (2023) was much lower, below 10%, did not surprise us. The goal of task Q2 was to identify a correct attribute grammar for simple expressions. The automatically generated attribute grammar found an unexpected solution in which an expression was transformed into an equivalent but structurally different form, and correctness was only 25%. Interestingly, the correctness on attribute grammars with code bloat was even lower, below 10%. This might indicate that extensive code bloat can hamper comprehension to a greater extent than modest unexpected solutions. The goal of task Q3 was to identify a correct attribute grammar for the robot language. The automatically generated attribute grammar found an unexpected solution, which was hard to comprehend due to the exchange of the x and y coordinates (see the additional explanation in Section 1, Listing 1). Nobody solved this task correctly, so correctness was 0%, whilst the correctness on attribute grammars with code bloat was slightly below 40%. This might indicate that very unexpected solutions are even harder to comprehend than extensive code bloat. As expected, the correctness on attribute grammars without code bloat was much higher for these tasks than in the other two cases. The goal of task Q4 was to identify a correct attribute grammar for a simple where statement, where the attribute grammar needs to be absolutely non-circular and the dependencies among attributes are more complicated. In this case, the automatically generated attribute grammar found the same solution as the attribute grammar manually written by a language designer [15]; there was no code bloat and no unexpected solution. Since the knowledge of the treatment group (UL FRI) in the original experiment [15] was statistically significantly better than that of the control group (UM FERI), the correctness on automatically generated attribute grammars was higher (slightly below 90%) than the correctness on attribute grammars without code bloat (slightly below 70%). Again, the correctness on attribute grammars with code bloat was much lower (slightly below 20%) than in the other two cases.
The goal of tasks Q5-Q7 was to correct a wrong semantic equation in the attribute grammar for the language, for simple expressions, and for the robot language, respectively. Tasks Q5-Q7 were simpler than tasks Q1-Q4, since only one semantic equation needed to be corrected. For example, the attribute grammar for the robot language consisted of 20 semantic equations. The results from Table 9 show that the correctness of tasks Q5-Q7 on attribute grammars with code bloat was always higher than the correctness on automatically generated attribute grammars with possible unexpected solutions. It is interesting that the results for Q5 and Q7 were even better for attribute grammars with code bloat than for attribute grammars without code bloat. Q5 and Q7 were relatively easy tasks. For example, Q7 was about correcting a wrong semantic equation in the attribute grammar for the robot language. The robot's initial position was wrong, and finding the correct semantic equation was not so difficult, even in the attribute grammar with code bloat. Another reason might be that the participants in the replicated experiment were students from both universities. As shown in the previous study [15], the participants from UL FRI had better knowledge of attribute grammars, and their results for Questionnaire 1 in the original experiment were statistically significantly better than those of the participants from UM FERI. We did not repeat Questionnaire 1 to verify that this was also the case in the replicated experiment in 2023, so this explanation is merely plausible.
The results from Table 10 and Table 11 allow us to accept the alternative hypothesis and the null hypothesis formulated in Section 3:
There is a significant difference in the efficiency of the participants’ comprehension of attribute grammars without code bloat vs. attribute grammars with code bloat.
There is no significant difference in the efficiency of the participants’ comprehension of automatically generated attribute grammars with possible unexpected solutions vs. attribute grammars with code bloat.
Table 12 shows that there is a statistically significant difference in the time needed for the participants' comprehension of attribute grammars without code bloat vs. attribute grammars with code bloat. The results indicate that understanding attribute grammars with code bloat took more time than understanding attribute grammars without code bloat. On the other hand (Table 13), there was no statistically significant difference in the time needed for the participants' comprehension of automatically generated attribute grammars with possible unexpected solutions vs. attribute grammars with code bloat. The results indicate that understanding automatically generated attribute grammars with possible unexpected solutions took a similar amount of time as understanding attribute grammars with code bloat.
Table 14 shows the results from the feedback questionnaire, which captured the participants' individual perspectives on the simplicity of attribute grammars with code bloat. We used a five-point Likert scale, with 1 representing low and 5 representing high simplicity. It is interesting that the participants in the replicated experiment, where all the attribute grammars contained code bloat, perceived these specifications at a simplicity level similar to that perceived by the participants in the original experiment for attribute grammars without code bloat. The independent-samples t-test did not exhibit statistically significant differences (Table 14). Note that there was a statistically significant difference in comprehension correctness between attribute grammars without code bloat and attribute grammars with code bloat (Table 7). On the other hand, there was a statistically significant difference in the participants' perception of simplicity between the automatically generated attribute grammars with possible unexpected solutions and the attribute grammars with code bloat. The participants' perspective was that attribute grammars with code bloat are simpler than automatically generated attribute grammars with possible unexpected solutions. Again, the participants' perspectives did not corroborate the comprehension correctness results, which showed no statistically significant difference in comprehension correctness between automatically generated attribute grammars with possible unexpected solutions and attribute grammars with code bloat (Table 8).
5. Threats to Validity
This study is an internally differentiated replication [4] of the previous controlled experiment [15], where much of the experiment's context, design, and measurements were kept the same as in the original experiment [15] (see Section 3). Therefore, many threats to the validity of the original study remained the same.

Construct validity represents a threat concerning how well the properties under consideration can be captured and measured [69,70]. The participants had to solve several tasks regarding the understanding of the provided attribute grammars. Given the variety of tasks, we believe that construct validity has been addressed well, since solving various tasks measures participants' comprehension ability regarding attribute grammars indirectly. However, in the replicated experiment, we introduced a new threat to construct validity. We assumed that obfuscating [64,65,66] attribute grammars by mimicking code bloat is equivalent to attribute grammars with code bloat generated automatically by GP. Although this assumption appears reasonable, since in our obfuscation approach the operators and operands were the same as the sets F and T in GP [5,6,8], with no effect on the meaning of the expression on the right-hand side of an assignment, there was still a chance that, in the obfuscated attribute grammars, the mimicked code bloat was under- or over-represented.
Internal validity represents a threat concerning inferences between the treatment and the outcome. Did other confounding factors influence the outcome? Guessing the correct answer remains an internal threat carried over from the original study, as does the fact that the participants attended two different courses, at the University of Maribor and at the University of Ljubljana, although the results from Table 15 show no statistical difference in comprehension correctness between those two groups. These threats to validity have not been addressed in the replicated experiment. However, we did address an important threat to validity from the original experiment [15], namely, how code bloat alone contributes to the correctness and efficiency of the comprehension of attribute grammars. In the replicated experiment, the provided attribute grammars contained only code bloat, without unexpected solutions. Hence, the lower comprehension correctness and efficiency can be attributed to code bloat alone.
The main threat to validity in the original as well as in the replicated experiment was external validity. Can we generalize the derived conclusions? Are the results also valid in GP applications outside attribute grammars, such as Lisp programming [5,8], event processing rules [71], trading rules [72], and model-driven engineering artefacts [73]? To answer this research question, controlled experiments regarding the comprehension of GP solutions must be performed outside of attribute grammars. Our results are valid only for the comprehension of automatically generated attribute grammars. Furthermore, additional controlled experiments are needed involving not only students but also professional programmers and practitioners.
6. Conclusions
It has long been observed that solutions generated by genetic programming (GP) are often littered with code bloat or contain unexpected solutions [12,13,14]. This affects the comprehension of these solutions, which is vital for validating the automatically produced code before deployment. By performing a controlled experiment, our previous study [15] showed that automatically generated attribute grammars are significantly harder to comprehend than those written manually, whether comprehension correctness or comprehension efficiency is regarded. However, while unexpected solutions generated with GP or some other AI technique [74,75] might be beneficial and thus welcome, code bloat is certainly not.
In this study, the focus was on the impact of code bloat in particular. Based on a controlled experiment in the field of semantic inference [16,17], the main findings were twofold. First and foremost, attribute grammars with code bloat are significantly harder to comprehend than attribute grammars without code bloat. This holds for both comprehension correctness and comprehension efficiency. For many, this finding comes as no surprise, but, being based on an experiment, it crosses the border between what one believes and what one knows. Second, there was no significant difference in comprehension between attribute grammars with code bloat and attribute grammars with unexpected solutions. Again, this holds for both comprehension correctness and comprehension efficiency.
As this study was based on an experiment that is a replication of the previous one [15], a Mann–Whitney test was performed to verify that there were no major differences in the participants' perceived knowledge of and interest in the field of attribute grammars. Great care was taken to alleviate any other possible threats to validity but, as with any other study yielding statistically based results, some remained. Thus, further studies of this kind are necessary within the fields of semantic inference and attribute grammars, as well as in related fields.
Realizing that there is no difference in comprehension between attribute grammars with code bloat and those with unexpected results, the question remains how to distinguish code bloat from unexpected results correctly and efficiently, especially when the latter are improvements, no matter how small [74,75]. This probably remains the most important issue left for future work, especially as automatically generated code, produced not only by genetic programming but also by a number of new AI approaches, is on the rise.