1. Introduction
Controlled experiments are lacking in many fields of computer science. For example, systematic mapping studies (SMSs) on domain-specific languages (DSLs) [1] and on modeling languages for the Internet of Things (IoT) [2] show a clear lack of evaluation research papers [3]. In these fields, the lack of replications of controlled experiments is an even bigger concern, since study replications can (1) increase the trustworthiness of the results from empirical studies, or (2) extend the theory and yield new knowledge. To achieve the first objective, an experiment performed as closely as possible to the original one is needed, while, to accomplish the second objective, a differentiated replication experiment must be performed (i.e., one with some intentional changes to, for example, the design, hypotheses, context, or measurements) [4].
There is a lack of experimental studies (e.g., controlled experiments) on how code bloat and unexpected solutions influence the comprehension of automatically generated genetic programming (GP) solutions (specifications, models, and programs) [5,6,7,8,9,10,11]. Nevertheless, there is a common belief that code bloat and unexpected solutions impact the comprehension of GP solutions negatively [12,13,14]. In our earlier study [15], we examined this belief in a more scientific manner. By designing and implementing a controlled experiment [15], we tested the comprehension correctness and efficiency of automatically generated GP solutions in the area of semantic inference [16,17]. We found that the statistically significantly lower comprehension correctness and efficiency can be attributed to unexpected solutions and code bloat [15].
One of the primary threats to validity in the original study [15] was that it was unknown how much code bloat alone contributed to the statistically significantly lower comprehension correctness and efficiency. Hence, we performed a new study, this time an internal differentiated replication experiment, focused solely on the impact of code bloat on genetic program comprehension. To do so, we needed to change the design and hypotheses of the experiment.
Our additional motivation for conducting this replication was twofold:
- 1. To address the limitation of the original study; and
- 2. To conduct the first replicated study on the comprehension of GP solutions.
We found no examples of replicated studies on the comprehension of GP solutions.
Both the original [15] and the replicated experiment were performed using attribute grammars as the programming artefacts to be comprehended. An attribute grammar is a triplet AG = (G, A, R), where G is a context-free grammar; A is a set of attributes, inherited and synthesized, that carry semantic information; and R is a set of semantic equations defining the synthesized attributes attached to the non-terminal on the left-hand side of a production and the inherited attributes attached to the non-terminals on the right-hand side of a production. As such, code bloat can appear in every semantic equation. Note also that attribute grammars are a declarative specification of a programming language's semantics [18,19,20,21]. The order of the semantic equations' evaluation is not specified; it is controlled by the dependencies between attributes. If a semantic equation depends on an attribute whose value has not yet been determined, the evaluation of that equation is postponed until all the attributes it uses have been determined.
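This postponement can be illustrated with a small worklist evaluator. The following Python sketch is our own illustration, not the evaluation algorithm of LISA or any other attribute grammar tool: an equation is evaluated only once every attribute it reads has been determined.

# Dependency-driven evaluation of semantic equations (illustrative sketch):
# an equation is evaluated only when every attribute it reads is already
# determined; otherwise its evaluation is postponed to a later pass.
def evaluate(equations):
    # equations: list of (target, reads, fn); fn maps the current
    # attribute environment to the value of the target attribute.
    env = {}
    pending = list(equations)
    while pending:
        ready = [eq for eq in pending if all(r in env for r in eq[1])]
        if not ready:                       # no equation can proceed
            raise ValueError("circular attribute dependencies")
        for target, reads, fn in ready:
            env[target] = fn(env)           # all inputs are determined
        pending = [eq for eq in pending if eq[0] not in env]
    return env

# For example, START.outx = COMMANDS.outy (cf. Listing 1) must wait
# until COMMANDS.outy has been evaluated:
env = evaluate([
    ("START.outx", ["COMMANDS.outy"], lambda e: e["COMMANDS.outy"]),
    ("COMMANDS.outy", [], lambda e: 0),
])
print(env["START.outx"])                    # prints 0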
An example of an attribute grammar describing a simple robot movement is shown in Listing 1 [15]. In fact, it is an unexpected solution generated by GP: the previous position of the robot is not propagated to the next command, as in manually written attribute grammars, but is always the position (0, 0) (see the semantic equations COMMAND.inx = 0; and COMMAND.iny = 0;). To obtain the correct position of the robot, the individual positions need to be summed up in the automatically generated attribute grammar. The computation is complicated further by the exchange of the x and y coordinates. Such an automatically found GP solution was very unusual, and nobody from the treatment group solved this task successfully (a worked trace is given after Listing 1). The automatically generated attribute grammar for the robot language also contains examples of simple code bloat (e.g., the semantic equation COMMAND.outy = 0+0;). Therefore, the statistically significantly lower comprehension correctness and efficiency can be attributed mostly to the unexpected solutions. Namely, the code bloat was not extensive and was controlled by limiting the depth of the expression tree, which was set to 2.
The main goal of the original controlled experiment [15] was to provide empirical data regarding the common belief in the GP community [12,13,14] that solutions generated automatically by GP are difficult to understand due to code bloat and unusual solutions. The controlled experiment was a between-subjects study comparing manually written attribute grammars [22,23] and attribute grammars generated automatically using GP [17,24]. In [15], we formulated the following hypotheses:
H1: Automatically generated attribute grammars decrease the correctness of the participants' specification comprehension over manually written attribute grammars significantly.
H2: Automatically generated attribute grammars worsen the efficiency of the participants' specification comprehension over manually written attribute grammars significantly.
Listing 1. Automatically generated attribute grammar for the robot language.
language Robot {
lexicon {
keywords begin | end
operation left | right | up | down
ignore [\0x0D\0x0A\ ]
}
attributes int *.inx; int *.iny;
int *.outx; int *.outy;
rule start {
START ::= begin COMMANDS end compute {
START.outx = COMMANDS.outy;
START.outy = COMMANDS.iny+COMMANDS.outx;
COMMANDS.inx=0;
COMMANDS.iny=0;
};
}
rule commands {
COMMANDS ::= COMMAND COMMANDS compute {
COMMANDS[0].outx = COMMANDS[1].outx+COMMAND.outx;
COMMANDS[0].outy = COMMANDS[1].outy-COMMAND.outy;
COMMAND.inx = 0;
COMMAND.iny = 0;
COMMANDS[1].inx = 0+COMMANDS[1].outy;
COMMANDS[1].iny = COMMAND.iny;
}
| epsilon compute {
COMMANDS[0].outx = COMMANDS[0].iny-COMMANDS[0].outy;
COMMANDS[0].outy = 0;
};
}
rule command {
COMMAND ::= left compute {
COMMAND.outx = COMMAND.iny-0;
COMMAND.outy = 1+0;
};
COMMAND ::= right compute {
COMMAND.outx = COMMAND.inx-COMMAND.iny;
COMMAND.outy = 0-1;
};
COMMAND ::= up compute {
COMMAND.outx = 1;
COMMAND.outy = 0+0;
};
COMMAND ::= down compute {
COMMAND.outx = 0-1;
COMMAND.outy = COMMAND.iny;
};
}
}
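To see how this unexpected solution nevertheless computes the correct robot positions, the following minimal Python simulation of the semantic equations in Listing 1 (our own sketch, not an artefact of either experiment) traces a few command sequences:

def command(op, inx, iny):
    # COMMAND ::= left | right | up | down (returns (outx, outy)).
    return {
        "left":  (iny - 0, 1 + 0),
        "right": (inx - iny, 0 - 1),
        "up":    (1, 0 + 0),
        "down":  (0 - 1, iny),
    }[op]

def commands(ops, iny):
    # COMMANDS ::= COMMAND COMMANDS | epsilon. COMMANDS.inx is defined
    # in Listing 1 but never read, so it is omitted here; that equation
    # is itself a small piece of neutral code.
    if not ops:                                # epsilon production
        outy = 0
        return iny - outy, outy
    c_outx, c_outy = command(ops[0], 0, 0)     # COMMAND.inx = COMMAND.iny = 0
    r_outx, r_outy = commands(ops[1:], 0)      # COMMANDS[1].iny = COMMAND.iny
    return r_outx + c_outx, r_outy - c_outy

def start(ops):
    # START ::= begin COMMANDS end; note the exchanged x and y coordinates.
    outx, outy = commands(ops, 0)
    return outy, 0 + outx

print(start(["up"]))           # (0, 1): one step up
print(start(["right"]))        # (1, 0): one step right
print(start(["up", "right"]))  # (1, 1): the steps are summed up

Each command contributes a single step that is summed up along the COMMANDS list, and the final coordinates are swapped back in the START rule, which is exactly what made this solution so hard to comprehend.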
The participants of the controlled experiment [15] had to solve tasks from which attribute grammar comprehension can be measured. The participants in the control group received attribute grammars written manually by a language designer, whilst the participants in the treatment group received attribute grammars generated automatically by the LISA.SI tool [17,24] using GP, where the automatically generated attribute grammars were evaluated with the LISA tool [25,26]. The results of the original study supported both hypotheses. The participants' comprehension correctness was significantly lower for automatically generated attribute grammars than for manually written attribute grammars (H1). The participants were significantly less efficient at comprehending automatically generated attribute grammars than manually written attribute grammars (H2). As mentioned above, we found in [15] that the statistically significantly lower comprehension correctness and efficiency can be attributed to unexpected solutions and code bloat.
To study the impact of code bloat on the comprehension of attribute grammars, we designed a new controlled experiment that is a differentiated replication of the previous study [15]. The problem tasks were the same as in the original experiment [15]; however, the provided attribute grammars now contain more realistic examples of code bloat. The comprehension correctness and efficiency were then compared with those for the manually written attribute grammars without code bloat (control group, UM FERI, from [15]) and the automatically generated attribute grammars with possible unexpected solutions (treatment group, UL FRI, from [15]). As such, this study can be regarded as an internally differentiated replication [4]: the same team of experimenters who conducted the original experiment also conducted the replication, and some intentional changes were made to the original experiment with respect to design, hypotheses, context, and measurements.
The remainder of this paper is organized as follows. Section 2 describes related work, and Section 3 describes the replicated experiment. Section 4 presents the replication results and data analysis. Section 5 describes the threats to validity, and Section 6 summarizes the key findings.
2. Related Work
Code bloat has been identified as an undesired property since the inception of GP [5]. It reduces the efficiency of search and execution performance, and bloated solutions are less robust and generalizable and are harder to understand [13,14,24]. Hence, a lot of research has been done in GP to limit code bloat [12,27,28,29,30,31,32]. Our goal was not to prevent or limit code bloat, and the various approaches to limiting code bloat are not discussed further. On the contrary, our goal was to study how code bloat influences the comprehension correctness and efficiency of GP solutions (specifications, models, and programs). There is a serious lack of such research in GP; to the best of our knowledge, our previous study [15] is the only such study. Why does this matter? The field of evolutionary computation (EC) has matured and is used daily in industrial applications [33,34,35]. However, engineers in industry do not use automatically generated GP solutions blindly. They need to understand how things work and why things are better. They need to learn about design options and alternatives. For them, automatically generated solutions are frequently inspirational [35]. In this respect, our study aims at identifying understandable attribute grammars and thus at improving the understandability of GP-generated solutions in general. In a wider context, it can therefore be considered a modest step towards explainable artificial intelligence (XAI) [36], but with an emphasis on a detailed understanding of the solutions (which must be verified per se) rather than of the model deriving the solutions, since programmers, the predominant users of GP, understand at least the basics of the GP approach anyway. In terms of programming artefacts, interpretability naturally derives from understandability, one of the five dimensions of XAI effects [37].
As mentioned previously, GP has been used extensively in many studies where unexpected or rare solutions have been reported. For example, attribute grammars are used to specify the syntax and semantics of computer languages formally [19,20,21,38,39] and to generate language-based tools automatically (e.g., editors, debuggers, and visualizers) [40]. In our previous work, we used GP to generate context-free grammars (CFGs)/attribute grammars automatically, and some unexpected solutions were reported in [17,41]. Furthermore, GP has been used to repair defects in C programs [42]. The authors reported some rare patches where the difference between the original source and the repair was more than 200 lines [42]. GP has been used for graphics processing unit (GPU) code optimization [43]. The authors reported that GP accidentally rediscovers approximate solutions and improvements by relaxing the memory synchronization requirements. Their conclusion was that epistatic optimizations can be hard for humans to find [43]. GP has been used to solve a robotic task in unpredictable environments [44], where the authors reported complicated solutions. In [45], the evolution of shape grammars was used for the generative design process, where the authors reported surprising and innovative solutions. In [46], the author shows that, by mimicking Darwinian mechanisms, novel, complex, and previously unknown patterns can be discovered. Last, but not least, in [47], the authors collected numerous examples where EC and artificial life (AL) produced clever solutions that humans did not consider or had thought impossible. The authors classified those examples into four categories: misspecified fitness functions, unintended debugging, exceeded experimenter expectations, and convergence with biology. In the misspecified fitness function cases, digital evolution exploits loopholes in fitness function measures (e.g., why walk when you can somersault?). In unintended debugging, digital evolution exploits bugs or minor flaws in the implemented laws of physics (e.g., why walk around the wall when you can walk over it?). In exceeded experimenter expectations, digital evolution produced legitimate solutions that went beyond experimenter expectations (e.g., impossibly compact solutions). In convergence with biology, digital evolution, surprisingly, converges with biological evolution (e.g., the evolution of parasitism). However, none of these studies measured how unexpected/rare solutions influence the comprehension correctness and efficiency of such solutions.
On the other hand, the comprehension of various artefacts has been studied and measured extensively, for example, the comprehension of block-based programming versus text-based programming [48], the comprehension of DSL programs versus GPL (general-purpose language) programs [49,50,51], the influence of visual notation on requirement specification apprehension [52], and the comprehension of charts by visually impaired individuals [53].
The replication of experiments is extremely important, since it increases the trustworthiness of the results, and the derived conclusions can be more generalizable [54,55]. The work in [49] can be regarded as the first in a family of experiments measuring the comprehension correctness and efficiency of DSLs [56,57]. It inspired others to test similar hypotheses using different DSLs, different control and treatment groups, and different assignments. Studies (e.g., [50,58,59,60]) showed that the original hypotheses [49] can be generalized; indeed, the comprehension correctness and efficiency of DSLs are better than those of GPLs. The same authors as in [49] also performed a replication study [61] whose results expanded on the original study, since it additionally showed that programmers' comprehension of programs written in DSLs is more accurate and efficient than with GPLs, even when working with appropriate integrated development environments (IDEs). This study was later replicated with slightly different goals [62]. Extensive knowledge, experience, and trust in using DSLs have been built in such a manner.
3. Method
This study is an internally differentiated replication [4] of the previous controlled experiment [15], with the aim of testing how code bloat alone influences comprehension correctness and efficiency in GP. Code bloat is a phenomenon of variable-length solution representations in GP, where code growth does not change the meaning of the code [5,6,8] (e.g., the semantic equation COMMANDS[0].outy = COMMANDS[1].outy + 0; in Listing 2). Such code is often called neutral code, or an intron. As such, code bloat is a dynamic behavior of a GP run, where fitness stagnates but the average solution size increases. As a result of this process, the final solution might contain code bloat; such a solution is called a bloated solution. In this work, we did not differentiate between a bloated solution and an attribute grammar with code bloat (both terms are used interchangeably).
The study in [15] showed that comprehension of automatically generated attribute grammars decreases comprehension correctness and efficiency statistically significantly compared with manually written attribute grammars. Automatically generated attribute grammars were harder to comprehend than manually written attribute grammars due to unexpected solutions and code bloat. However, the examples of code bloat were not extensive, and the effect can be explained mainly by the unexpected solutions. Nevertheless, the effect of code bloat was still present. It is reasonable to assume that increasing the code bloat will increase this effect as well. On the other hand, we wanted to know how code bloat alone hampers the comprehension of attribute grammars.
For the purpose of this study, functional code bloat [63] was inserted manually into the manually written attribute grammars from Questionnaire 2 (Test Execution 2; see Table 1) of the original experiment. The manual insertion of code bloat can also be seen as obfuscating the attribute grammars. Code obfuscation [64] is a technique that transforms an original program into a semantically equivalent program that is much harder to understand. Code obfuscation has been used primarily to protect proprietary software from unauthorized reverse engineering. However, there are other potential applications of obfuscation (e.g., [65,66]). In our case, obfuscation was used to mimic code bloat in GP. We relied on the assumption that attribute grammars with code bloat generated automatically by GP are equivalent to our manually obfuscated attribute grammars. To find out how code bloat alone hampers attribute grammar comprehension correctness and efficiency, an internally differentiated replicated experiment was designed by the same authors as in [15], whilst some changes were made intentionally to the original experiment (see Table 1). Regarding statistical tests, this replication used a between-subjects design. If the data were not normally distributed (based upon the Shapiro–Wilk normality test), we used the Mann–Whitney U test to compare the results; otherwise, the independent-samples t-test was used (a sketch of this decision procedure is given below).
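The following Python sketch summarizes this test-selection procedure (our illustration only; the variable names are placeholders, not the study's actual data):

from scipy import stats

# Select and run the appropriate two-sample test, as described above:
# Shapiro-Wilk on each group first, then the Mann-Whitney U test or the
# independent-samples t-test.
def compare_groups(control, treatment, alpha=0.05):
    normal = (stats.shapiro(control).pvalue > alpha and
              stats.shapiro(treatment).pvalue > alpha)
    if normal:
        return stats.ttest_ind(control, treatment)    # independent-samples t-test
    return stats.mannwhitneyu(control, treatment)     # Mann-Whitney U test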
The following experiment's context, design, and measurements were kept the same as in the original experiment [15]:
- The participants were undergraduate students in the computer science programs at the University of Maribor (UM FERI, second-year students) and at the University of Ljubljana (UL FRI, third-year students), attending the compiling programming languages course (UM FERI) and the compilers course (UL FRI), respectively. No students participated in both experiments, the original and the replicated.
- The replicated experiment was performed simultaneously for both groups and close to the end of the course.
- The students' participation was optional and rewarded. The reward was based on the correctness of the answers.
- The experiment consisted of a background questionnaire and a feedback questionnaire after the test. Both questionnaires lasted approximately 5 min.
- The test consisted of the same seven tasks (Q1–Q7) as in the second test of the original study [15] (Test Execution 2 in Table 1):
- Q1 Identification of a correct attribute grammar for the language.
- Q2 Identification of a correct attribute grammar for simple expressions.
- Q3 Identification of a correct attribute grammar for the robot language.
- Q4 Identification of a correct attribute grammar for a simple where statement.
- Q5 Correct a wrong semantic equation in the attribute grammar for the language.
- Q6 Correct a wrong semantic equation in the attribute grammar for simple expressions.
- Q7 Correct a wrong semantic equation in the attribute grammar for the robot language.
- Only manual comprehension of the attribute grammars was possible, since code execution in the LISA tool [25,26] was not allowed.
- The duration of the test was 75 min.
- The same measures were applied for correctness and efficiency. Comprehension correctness is measured as the percentage of correctly answered questions. Comprehension efficiency is measured as the ratio of the percentage of correctly answered questions to the amount of time spent answering the questions [49,61,67]. Both measures are restated in formula form below.
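For clarity, the two measures can be restated as follows (our notation; the time unit is the one recorded in the experiment):

\[
  \mathrm{Correctness} = \frac{\text{number of correctly answered questions}}{\text{number of questions}} \times 100\%,
  \qquad
  \mathrm{Efficiency} = \frac{\mathrm{Correctness}}{\text{time spent answering}}
\]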
With the change of focus to the impact of code bloat in particular, the following changes from the original experiment [15] were made intentionally.

Table 1 provides an overview of the experimental steps and a brief description of the changes made in the replication compared with the original experiment. During the lectures phase, we introduced the basic ideas and concepts of attribute grammars. The students learned about attribute grammars during four lectures, each lasting 90 min. Next, in the original experiment, we gave the participants a background survey, in which they answered questions about their background knowledge and abilities. In the replication study, we decided to combine the background and feedback questionnaires. In the original experiment, as described earlier in the paper, the first test identified a group with better knowledge of attribute grammars, and the results were used to assign students to the manually written or the automatically generated tasks. In the replication study, all the students solved the same tasks, so there was no need to execute the first test. Both tests, in the original and the replication study, started with a short introduction covering the questionnaire, the task definition (multiple-choice questions with one correct answer), and the study goals. This included a short presentation supported by slides that lasted at most 5 min. During the second test (Test Execution 2), the participants answered seven questions in both the original and the replicated study. Finally, the replicated study concluded with a combined background and feedback survey. In the latter, the participants provided feedback regarding the simplicity of the attribute grammar exercises, which enabled us to understand the participants' opinions on the effect of code bloat on the understanding of attribute grammar specifications.
Table 2 provides information about the participants. The original experiment was conducted in 2022, while the replication was completed in 2023. Both studies were conducted at the University of Maribor (UM FERI) and the University of Ljubljana (UL FRI). In the original research, the students from these two universities participated in the experiment separately. At UM FERI, the students solved manually written attribute grammar tasks (Questionnaire 2), while the students from UL FRI solved tasks on understanding automatically generated attribute grammars (Questionnaire 3). In the replicated study, we combined students from both universities to solve tasks that included code bloat (Questionnaire 4). The original experiment included 42 students (combined from both universities), while the replicated study included 46 students.
The inclusion of code bloat in the experiment tasks is presented in Table 3. Questionnaire 4 was newly defined in this replicated study, and its tasks included code bloat. We compare the results of Questionnaire 4 (tasks with code bloat) to the results from Questionnaire 2 (tasks without code bloat) and Questionnaire 3 (tasks with possible unexpected solutions and limited code bloat). As stated earlier in this paper, the Questionnaire 1 results were used during the original experiment to define the study groups and are therefore not compared with the results from Questionnaire 2.
An example of an attribute grammar with code bloat is presented in Listing 2, where a few examples of code bloat in semantic equations can be observed (these equivalences are checked symbolically in the sketch after this list):
- START.outx = COMMANDS.outx * (COMMANDS.inx+1); in the production START ::= begin COMMANDS end. Since the value of the attribute COMMANDS.inx is 0 (see the semantic equation COMMANDS.inx = 1-1;), this semantic equation is equivalent to START.outx = COMMANDS.outx;
- START.outy = COMMANDS.outy + COMMANDS.iny; in the production START ::= begin COMMANDS end. Since the value of the attribute COMMANDS.iny is 0 (see the semantic equation COMMANDS.iny = 1+1-1+1-1-1;), this semantic equation is equivalent to START.outy = COMMANDS.outy;
- COMMAND.outx = (1+1) * (COMMAND.inx+1) - COMMAND.inx-1; in the production COMMAND ::= right. After equation simplification, this semantic equation is equivalent to COMMAND.outx = COMMAND.inx+1;
- COMMAND.outy = (COMMAND.iny+1) * (COMMAND.iny-COMMAND.iny) + COMMAND.iny; in the production COMMAND ::= right. After equation simplification, this semantic equation is equivalent to COMMAND.outy = COMMAND.iny;
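The claimed equivalences are easy to verify symbolically. The following Python check (our own, using sympy; not part of the experimental material) confirms the two simplifications for the production COMMAND ::= right:

import sympy as sp

inx, iny = sp.symbols("inx iny")

# (1+1)*(inx+1) - inx - 1 simplifies to inx + 1 ...
assert sp.simplify((1 + 1) * (inx + 1) - inx - 1 - (inx + 1)) == 0
# ... and (iny+1)*(iny-iny) + iny simplifies to iny.
assert sp.simplify((iny + 1) * (iny - iny) + iny - iny) == 0
print("bloated right-hand sides simplify as claimed")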
A simple way to measure code bloat is to count the number of operands and operators. Note that the Halstead Program Length [68], one of the earliest measures of software complexity, is defined in exactly this way: as the total number of operator and operand occurrences in a code. A small counting sketch is given below.
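The following rough Python tokenizer (our illustration; the counts in Table 4 may have been obtained differently) shows how such counting can be done for a single semantic equation:

import re

OPERATOR = re.compile(r"[+\-*/=]")
OPERAND = re.compile(r"[A-Za-z_][\w.\[\]]*|\d+")

# Halstead Program Length: total operator occurrences plus total
# operand occurrences in the given equation.
def halstead_length(equation: str) -> int:
    return len(OPERATOR.findall(equation)) + len(OPERAND.findall(equation))

# The bloated equation from Listing 2 versus its simplified form:
print(halstead_length("COMMAND.outx = (1+1) * (COMMAND.inx+1) - COMMAND.inx - 1"))  # 13
print(halstead_length("COMMAND.outx = COMMAND.inx + 1"))                            # 5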
Table 4 presents the number of operands and operators in the original [15] and the replicated experiment (this study). It can be observed that both the manually written attribute grammars without code bloat (2022 UM FERI) and the automatically generated attribute grammars with possible unexpected solutions, where code bloat was controlled (2022 UL FRI), have a much lower Halstead complexity than the attribute grammars with code bloat (2023 UM FERI + UL FRI). Table 4 is further evidence that the statistically significant decrease in comprehension correctness and efficiency found in the original experiment [15] can be attributed mainly to unexpected solutions rather than to the simple versions of code bloat. Tasks Q1-Q4 were close-ended questions with provided options, where Qia-Qid represented the predefined responses for task i. The number of operands and operators in the automatically generated attribute grammars with possible unexpected solutions, where code bloat was controlled (2022 UL FRI), was 401, whilst this number was 341 in the manually written attribute grammars without code bloat: an increase of only about 18% in Halstead complexity. On the other hand, the number of operands and operators in the attribute grammars with code bloat (2023 UM FERI + UL FRI) was 869, which represents an increase of about 155% over the manually written attribute grammars without code bloat and an increase of about 117% over the automatically generated attribute grammars with possible unexpected solutions where code bloat was controlled.
Listing 2. Attribute grammar for the robot language with code bloat.
language Robot {
lexicon {
keywords begin | end
operation left | right | up | down
ignore [\0x0D\0x0A\ ]
}
attributes int *.inx; int *.iny;
int *.outx; int *.outy;
rule start {
START ::= begin COMMANDS end compute {
START.outx = COMMANDS.outx * (COMMANDS.inx + 1);
START.outy = COMMANDS.outy + COMMANDS.iny;
COMMANDS.inx = 1-1;
COMMANDS.iny = 1+1-1+1-1-1; };
}
rule commands {
COMMANDS ::= COMMAND COMMANDS compute {
COMMANDS[0].outx = COMMANDS[1].outx;
COMMANDS[0].outy = COMMANDS[1].outy + 0;
COMMAND.inx = COMMANDS[0].inx * 1;
COMMAND.iny = COMMANDS[0].iny;
COMMANDS[1].inx = COMMAND.inx + COMMAND.outx - COMMAND.inx;
COMMANDS[1].iny = COMMAND.outy + COMMAND.outy - COMMAND.outy; }
| epsilon compute {
COMMANDS[0].outx = COMMANDS[0].inx;
COMMANDS[0].outy = COMMANDS[0].iny; };
}
rule command {
COMMAND ::= left compute {
COMMAND.outx = 1 + COMMAND.inx - 1 - 1;
COMMAND.outy = 1 + COMMAND.iny - 1; };
COMMAND ::= right compute {
COMMAND.outx = (1 + 1) * (COMMAND.inx + 1) - COMMAND.inx - 1;
COMMAND.outy = (COMMAND.iny + 1) * (COMMAND.iny - COMMAND.iny) +
COMMAND.iny;
};
COMMAND ::= up compute {
COMMAND.outx = (COMMAND.inx + 1) * (COMMAND.inx - COMMAND.inx) +
COMMAND.inx;
COMMAND.outy = COMMAND.iny + 1;
};
COMMAND ::= down compute {
COMMAND.outx = COMMAND.inx + COMMAND.inx - COMMAND.inx;
COMMAND.outy = ((COMMAND.iny - 1) + 0);
};
}
}
4. Results
The aim of this study was to investigate how code bloat alone influences comprehension correctness and efficiency in understanding attribute grammar specifications for describing programming language semantics [19,20,21]. The results from the replicated experiment are compared to the results from the original experiment [15].
Table 5 shows the results from the background questionnaire, where the participants were asked about their perception of their acquired knowledge of programming, compilers, and attribute grammars. We used a five-point Likert scale, with 1 representing low and 5 representing high knowledge and interest. The results were compared to the original experiment performed in 2022, whilst the replicated experiment was performed one year later. Not all the participants submitted every questionnaire in the replicated study, so the number of participants N differs between measures (e.g., N = 46 for "Programming", but N = 45 for "Correctness" and "Efficiency" in the following tables). Similarly, in the original study, the background questionnaire was submitted by more participants than were later included in Test Execution 2 (49 vs. 42).
From the results, it can be observed that there was no statistically significant difference in acquired knowledge. The same is true for the participants' interest in programming and compilers (see Table 6). It can be concluded that there are no major differences between the original and replicated experiments in the participants' perceived knowledge of and interest in programming, compilers, and attribute grammars.
The result of the Mann–Whitney U test is shown in Table 7, where the comprehension correctness on attribute grammars without code bloat was compared to the comprehension correctness on attribute grammars with code bloat. Clearly, comprehension correctness is statistically significantly lower on attribute grammars with code bloat. The mean correctness on attribute grammars without code bloat was slightly below 80%, whilst the mean correctness on attribute grammars with code bloat was slightly below 50%. On the other hand, the result of the Mann–Whitney U test (Table 8) on comprehension correctness between automatically generated attribute grammars with possible unexpected solutions and attribute grammars with code bloat shows that there was no statistically significant difference. The mean correctness in both cases was around 50%. Hence, the comprehension of attribute grammars with code bloat was comparable to the comprehension of automatically generated attribute grammars with possible unexpected solutions. We can conclude that code bloat and unexpected solutions contribute equally to lower comprehension. The results from Table 7 and Table 8 allow us to accept the alternative hypothesis and the null hypothesis formulated in Section 3:
There is a significant difference in the correctness of the participants’ comprehension of attribute grammars without code bloat vs. attribute grammars with code bloat.
There is no significant difference in the correctness of the participants’ comprehension of automatically generated attribute grammars with possible unexpected solutions vs. attribute grammars with code bloat.
Table 9 shows the average correctness of the tasks on attribute grammars without code bloat (2022), automatically generated attribute grammars with possible unexpected solutions (2022), and attribute grammars with code bloat (2023). By delving deeper into particular tasks, we might better understand the different contributions of code bloat and unexpected solutions to comprehension correctness. The goal of task Q1 was to identify a correct attribute grammar for the language, and it can be regarded as a simple specification of language semantics. The automatically generated attribute grammars contained only controlled code bloat with simple cases, and correctness was slightly below 40%. Hence, the result that the correctness on attribute grammars with realistic code bloat (2023) was much lower, below 10%, did not surprise us. The goal of task Q2 was to identify a correct attribute grammar for simple expressions. The automatically generated attribute grammar found an unexpected solution in which an expression was transformed into an equivalent but structurally different form, and correctness was only 25%. Interestingly, the correctness on attribute grammars with code bloat was even lower, below 10%. This might indicate that extensive code bloat can hamper comprehension to a greater extent than modest unexpected solutions. The goal of task Q3 was to identify a correct attribute grammar for the robot language. The automatically generated attribute grammar found an unexpected solution, which was hard to comprehend due to the exchange of the x and y coordinates (see the additional explanation in Section 1, Listing 1). Nobody solved this task correctly, so correctness was 0%, whilst the correctness on attribute grammars with code bloat was slightly below 40%. This might indicate that very unexpected solutions are even harder to comprehend than extensive code bloat. As expected, the correctness on attribute grammars without code bloat was much higher for these tasks than in the other two cases. The goal of task Q4 was to identify a correct attribute grammar for a simple where statement, where the attribute grammar needs to be absolutely non-circular and the dependencies among attributes are more complicated. In this case, the automatically generated attribute grammar found the same solution as the attribute grammar manually written by a language designer [15]; there was no code bloat and no unexpected solution. Since the knowledge of the treatment group (UL FRI) in the original experiment [15] was statistically significantly better than that of the control group (UM FERI), the correctness on automatically generated attribute grammars was higher (slightly below 90%) than the correctness on attribute grammars without code bloat (slightly below 70%). Again, the correctness on attribute grammars with code bloat was much lower (slightly below 20%) than in the other two cases.
The goal of tasks Q5-Q7 was to correct a wrong semantic equation in the attribute grammar for the language, for simple expressions, and for the robot language, respectively. Tasks Q5-Q7 were simpler than tasks Q1-Q4, since only one semantic equation needed to be corrected. For example, the attribute grammar for the robot language consisted of 20 semantic equations. The results from Table 9 show that the correctness of tasks Q5-Q7 on attribute grammars with code bloat was always higher than the correctness on automatically generated attribute grammars with possible unexpected solutions. It is interesting that the results for Q5 and Q7 were even better for attribute grammars with code bloat than for attribute grammars without code bloat. Q5 and Q7 were relatively easy tasks. For example, Q7 was about correcting a wrong semantic equation in the attribute grammar for the robot language. The robot's initial position was wrong, and finding the correct semantic equation was not so difficult, even in the attribute grammar with code bloat. Another reason might be that the participants in the replicated experiment were students from both universities. As shown in the previous study [15], the participants from UL FRI had better knowledge of attribute grammars, and their results for Questionnaire 1 in the original experiment were statistically significantly better than those of the participants from UM FERI. We did not repeat Questionnaire 1 to verify that this was also the case in the replicated experiment in 2023, so this explanation is merely plausible.
The results from Table 10 and Table 11 allow us to accept the alternative hypothesis and the null hypothesis formulated in Section 3:
There is a significant difference in the efficiency of the participants’ comprehension of attribute grammars without code bloat vs. attribute grammars with code bloat.
There is no significant difference in the efficiency of the participants’ comprehension of automatically generated attribute grammars with possible unexpected solutions vs. attribute grammars with code bloat.
Table 12 shows that there is a statistically significant difference in the time needed for the participants' comprehension of attribute grammars without code bloat vs. attribute grammars with code bloat. The results indicate that understanding attribute grammars with code bloat took more time than understanding attribute grammars without code bloat. On the other hand (Table 13), there was no statistically significant difference in the time needed for the participants' comprehension of automatically generated attribute grammars with possible unexpected solutions vs. attribute grammars with code bloat. The results indicate that understanding automatically generated attribute grammars with possible unexpected solutions took a similar amount of time as understanding attribute grammars with code bloat.
Table 14 shows the results from the feedback questionnaire, which captured the participants' individual perspectives on the simplicity of attribute grammars with code bloat. We used a five-point Likert scale, with 1 representing low and 5 representing high simplicity. It is interesting that the participants in the replicated experiment, where all the attribute grammars contained code bloat, perceived these specifications at a simplicity level similar to that perceived by the participants in the original experiment for attribute grammars without code bloat. The independent-samples t-test did not exhibit statistically significant differences (Table 14). Note that there was a statistically significant difference in comprehension correctness between attribute grammars without code bloat and attribute grammars with code bloat (Table 7). On the other hand, there was a statistically significant difference in the participants' perception of simplicity between the automatically generated attribute grammars with possible unexpected solutions and the attribute grammars with code bloat. The participants' perspective was that attribute grammars with code bloat are simpler than automatically generated attribute grammars with possible unexpected solutions. Again, the participants' perspectives did not corroborate the comprehension correctness results, which showed no statistically significant difference in comprehension correctness between automatically generated attribute grammars with possible unexpected solutions and attribute grammars with code bloat (Table 8).
5. Threats to Validity
This study is an internally differentiated replication [4] of the previous controlled experiment [15], where much of the experiment's context, design, and measurements were kept the same as in the original experiment [15] (see Section 3). Therefore, many threats to the validity of the original study remained the same.

Construct validity represents a threat concerning how well the properties under consideration can be captured and measured [69,70]. The participants had to solve several tasks regarding the understanding of the provided attribute grammars. Given the variety of tasks, we believe that construct validity has been addressed well, since solving various tasks measures participants' comprehension ability regarding attribute grammars indirectly. However, in the replicated experiment, we introduced a new threat to construct validity. We assumed that obfuscating [64,65,66] attribute grammars by mimicking code bloat is equivalent to attribute grammars with code bloat generated automatically by GP. Although this assumption appears reasonable, since in our obfuscation approach the operators and operands were the same as the sets F and T in GP [5,6,8], with no effect on the meaning of the expression on the right-hand side of an assignment, there was still a chance that, in the obfuscated attribute grammars, the mimicked code bloat was under- or over-represented.
Internal validity represents a threat concerning inferences between the treatment and the outcome. Did other confounding factors influence the outcome? Guessing the correct answer remains an internal threat carried over from the original study, as does the fact that the participants attended two different courses, at the University of Maribor and at the University of Ljubljana, although the results from Table 15 show no statistical difference in comprehension correctness between those two groups. These threats to validity have not been addressed in the replicated experiment. However, we did address an important threat to validity from the original experiment [15], namely, how code bloat alone contributes to the correctness and efficiency of the comprehension of attribute grammars. In the replicated experiment, the provided attribute grammars contained only code bloat, without unexpected solutions. Hence, the lower comprehension correctness and efficiency can be attributed to code bloat alone.
The main threat to validity in the original as well as in the replicated experiment was external validity. Can we generalize the derived conclusions? Are the results also valid in GP applications outside attribute grammars, such as Lisp programming [5,8], event processing rules [71], trading rules [72], and model-driven engineering artefacts [73]? To answer this research question, controlled experiments regarding the comprehension of GP solutions must be performed outside of attribute grammars. Our results are valid only for the comprehension of automatically generated attribute grammars. Furthermore, additional controlled experiments are needed involving not only students but also professional programmers and practitioners.
6. Conclusions
It has long been observed that solutions generated by genetic programming (GP) are often littered with code bloat or contain unexpected solutions [12,13,14]. This affects the comprehension of these solutions, which is vital for validating the automatically produced code before deployment. By performing a controlled experiment, our previous study [15] showed that automatically generated attribute grammars are significantly harder to comprehend than those written manually, whether comprehension correctness or comprehension efficiency is regarded. However, while unexpected solutions generated with GP or some other AI technique [74,75] might be beneficial and thus welcome, code bloat is certainly not.
In this study, the focus was on the impact of code bloat in particular. Based on a controlled experiment in the field of semantic inference [16,17], the main findings were twofold. First and foremost, attribute grammars with code bloat are significantly harder to comprehend than attribute grammars without code bloat. This holds for both comprehension correctness and comprehension efficiency. For many, this finding comes as no surprise, but, being based on an experiment, it crosses the border between what one believes and what one knows. Second, there was no significant difference in comprehension between attribute grammars with code bloat and attribute grammars with unexpected solutions. Again, this holds for both comprehension correctness and comprehension efficiency.
As this study was based on an experiment that is a replication of the previous one [15], a Mann–Whitney test was performed to verify that there were no major differences in the participants' perceived knowledge of and interest in the field of attribute grammars. Great care was taken to alleviate any other possible threats to validity but, as with any other study yielding statistically based results, some remained. Thus, further studies of this kind are necessary within the fields of semantic inference and attribute grammars, as well as in related fields.
Realizing that there is no difference in comprehension between attribute grammars with code bloat and those with unexpected results, the question remains how to distinguish code bloat from unexpected results correctly and efficiently, especially when the latter are improvements, no matter how small [74,75]. This probably remains the most important issue left for future work, especially as automatically generated code, produced not only by genetic programming but also by a number of new AI approaches, is on the rise.