An Expert Judgment in Source Code Quality Research Domain—A Comparative Study between Professionals and Students

Abstract: In scientific research, evidence is often based on empirical data. Scholars tend to rely on students as participants in experiments in order to validate their theses. Students are an obvious choice when it comes to scientific research: they are usually willing to participate and are often themselves pursuing an education in the experiment's domain. The software engineering domain is no exception. However, readers, authors, and reviewers do sometimes question the validity of experimental data gathered from students in controlled experiments. This is why we address this difficult-to-answer question: are students a proper substitute for experienced professional engineers in a typical software engineering experiment? As we demonstrate in this paper, there is no simple "yes or no" answer. In some aspects, students were not outperformed by professionals, but in others, students would not only give different answers compared to professionals, but their answers would also diverge. In this paper, we show and analyze the results of a controlled experiment in the source code quality domain by comparing student and professional responses. We show that authors have to be careful when employing students in experiments, especially when complex and advanced domains are addressed. However, students may be a proper substitute in cases where only non-advanced aspects are required.


Introduction
Software quality is a well-defined and accepted term within the software engineering domain. Within software development teams, there needs to be a strong consensus not only about what software quality is, but also on how to measure it. Several formal [1] and de-facto standards list different software quality attributes and prescribe procedures and metrics for software quality evaluation. They range from relatively simple indirect methods for measuring source code in order to be confident in the expected quality (e.g., the comment-to-code ratio) to relatively complex processes that require several ceremonies and many people, are time-consuming, and are described in standards such as ISO/IEC 25040 [2].
While software quality is a strong research domain, the evaluation of new or improved approaches against existing ones is essential in order to ensure a high level of quality within the software engineering domain [3]. This can be done using different empirical research approaches, including experiments, whose importance and widespread use within software engineering have already been highlighted in many studies [3][4][5].
Moreover, source code quality, in addition to being measured and compared against certain threshold values, is usually assessed by ad-hoc or systematic code reviews. With the industry-wide introduction of such review practices, expert judgment has become an everyday activity in software development. We therefore pose the following research questions:

1. Are participants consistent in expert judgment evaluations regarding the source code quality research domain?
(a) What is the level of agreement between students in their expert judgments regarding the source code quality research domain?
(b) What is the level of agreement between professionals in their expert judgments regarding the source code quality research domain?

2. Are the level of agreement between students and the level of agreement between professionals comparable?
(a) In which aspects of source code quality assessment are the levels of agreement comparable?
By answering our research questions, we would like to gain a clear insight into the dilemma of employing students in experiments in the source code quality domain without compromising the external validity of the results. In addition, we would also like to provide advice on when experience (professionals) is crucial and when a formal education with minor experience (students) is enough.
The paper is organized as follows. In the next section, we summarize the research background and outline the most important related work in the domain. In Section 3, we present the hypotheses and the research method in detail. Section 4 gives an overview of our experiment, presents the tasks that participants were exposed to, and summarizes the experiment results. In Section 5, we discuss the experiment results, outlining the differences between students and professionals. We answer the research questions and conclude the paper in Section 6.

Research Background and Related Work
The application of empirical research methods has been examined in many papers. Zhang et al. [9] present a mapping study in which, in the majority of selected studies, the experiment is the empirical method used. As the results show, experiments are followed by case studies, surveys, literature reviews, replication experiments, pilot studies, and simulations [9]. Furthermore, expert judgment is not a frequently used approach within software engineering experiments.
The scope of using experiments within software engineering and its subdomains has spread over the years. In the literature review by Sjoeberg et al. [10], only one study was detected that implemented an experiment in the domain of software metrics and measurement. Thirteen years later, on the other hand, Zhang et al. [9] identified 48 studies (16.2%) implementing an experiment in the software quality domain. The use of experiments in software engineering started in the 1960s [3]. The authors [3] list four dimensions characterizing the context of an experiment, including students vs. professionals, opening up a challenging domain. As the literature review by Sjoeberg et al. [10] shows, professionals were used as the experimental subjects in only 9% of studies, whereas 86.8% of studies used students. Undergraduate students are used much more frequently than graduate students, who are used in 10.8% of studies [10]. Sjoeberg et al. [10] also reported that, of the studies that used both students and professionals, only three of seven measured the differences between the groups; none of them detected differences. This was also confirmed by Daun et al. [4], who conducted a systematic mapping study looking into the state of the art of controlled experiments with students in software engineering. The results indicate that the majority of controlled experiments are done using student participants; 42.33% used only graduate students. In 15.95% of experiments, students received evaluation tasks, and the experiments most frequently investigated students' comprehension skills [4].
Falessi et al. [7] presented the positive and negative aspects of using students and professionals in experiments. Professionals are more challenging to acquire than students, and when they are willing to cooperate, the sample size is usually small [7]. However, the external validity of the results is better than when using students [7]. As the results show, the choice of appropriate subjects depends upon understanding which portion of the developer population is represented by the participants [7]. As added by Feldt et al. [11], student samples could efficiently stand in for a specific subset of professionals. Falessi et al. [7] also propose a characterization scheme dividing subjects based on their experience, which can be described along three dimensions: real, relevant, and recent [7]. The fact that the use of the terms professionals and students may be misleading was also pointed out by Feitelson [8], who at the same time exposed the problem of using years of experience as a metric. Feitelson [8] lists some drawbacks of using students as experiment subjects. As the results of the literature review show, students cannot always represent professionals, namely due to a lack of experience, differences in the use of technology, learning misconceptions, and an academic orientation that may not be aligned with professional practice [8].
Within experiments, expert judgment could also be used for evaluation. Expert judgment is frequently used as one of the estimation techniques within project management [12,13], wherein experts have specialized knowledge [14]. Expert judgment within software quality represents the use of developers' experience for reliable evaluation [15]. Boehm [12] defines expert judgment as the consultation of one or more experts. Hughes [13] added that experts possess experience relevant to the judged domain. Since the uniqueness of software products makes quality evaluation a challenging task [15], the domain of the experts' experience is crucial.
Expert judgment is also used in the code smells domain, for example, when investigating the impact of code smells on system-level maintainability [16]. Bigonha et al. [17] used manual inspection in order to validate the code smells reported by a tool. Oliveira et al. [18] validated derived thresholds with the help of developers. Moreover, Rosqvist et al. [19] presented a method for software quality evaluation using expert judgment. Within expert judgments, evaluators form their opinion based on past experience and knowledge, which can result in subjectivity of the provided assessments [13,19]. Rosqvist et al. [19] claim that each expert judgment is based on a participant's mental model that is used to interpret the assessed quality aspect. However, when the mentioned challenges are considered and properly addressed, experts' assessments, despite being based on participants' personal experience, can constitute a good and valuable supplement to empirical evidence [19].
While many studies address the use of students in software engineering experiments in general, the characteristics of the subjects involved in expert judgment have not been addressed yet. As Falessi et al. [7] note, comparing the performance of professionals and students is only possible when experiments are implemented with both types of participants. This was done in our study, where expert judgment in the source code quality research domain was performed by both students and experts. Additionally, we did not find research that addresses the question of substituting professional participants with students in empirical research, as is the case with the research presented in this paper.

Research Method
In order to answer the research questions, an experiment was designed. We experimented separately with professionals and students. A bird's-eye view of our approach is shown in Figure 1. As illustrated, we wanted to capture individual as well as coordinated evaluations for the same set of source code entities: once by employing professionals, and again with the student group. The details of a single experiment run are shown in Figure 2.
Before beginning, all participants provided their profile. We designed our questionnaire based on the practices set forth by Chen et al. [20]. We asked participants to enter their perceived level of knowledge of the programming languages and to provide the number of years of their professional experience. Since knowledge self-assessment can be biased and subjective, the years-of-experience criterion was added in order to objectify participants' experience. In addition to providing a record of each participant's classification as a student or professional, this also gives us an opportunity to check whether participant profiles are comparable. The participants were asked to evaluate several aspects of source code quality, i.e., class size, class complexity, class cohesion, coupling with other classes, and general quality. An example of an evaluation form for a software class is presented in Table 1. The participants evaluated each aspect using the scale "very poor", "poor", "good", "very good". The scale aims at gathering their opinion about the quality of the assessed software entity; e.g., during the source code size assessment, the evaluators assessed whether, in their opinion, the size of a software class is poor (i.e., inappropriate) or good (i.e., appropriate). A software class whose source code size is evaluated as poor contains too many or too few lines of code, resulting in an unmanageable size and opacity or, on the other hand, in inappropriately short content. Conversely, a software class that is assessed as good in terms of source code size is composed of a manageable and acceptable number of lines of code. An example of a software class assessment is shown in Tables 2 and 3. Table 2 depicts the assessment and coordination of the chosen software class.
After the individual evaluations of each participant were made (shown in Table 1 as Assessor 1 and Assessor 2), the participants were asked to coordinate their evaluations with the assigned co-assessor, providing a coordinated and agreed-upon final evaluation (shown in Table 1 as Coordinated). For example, if Assessor 1 assessed the source code size as "very good" and Assessor 2 assessed the same quality aspect as "poor", they had to coordinate their assessments; based on the exchanged views, they assessed the source code size of the evaluated software class as "poor". We required participants to coordinate their assessments in order to obtain assessments that are as objective as possible on the one hand, and to address possible inconsistencies in the assessors' subjective views on the other. Table 3. An example of the cross-section between students and between students and professionals in the categories general assessment (1), source code size (2), class complexity (3), class cohesion (4), and coupling with other classes (5).
Entity: net.sf.jasperreports.engine.export.JRTextExporter (columns: students ∩ students, students ∩ professionals). A subset from the same repository of source code entities was given to every participant. In this way, each participant had to evaluate several, but not all, entities. At the same time, each entity was evaluated by several participants. Participants were asked to evaluate how appropriate they found the provided source code in terms of size, complexity, cohesion, and coupling, and how they evaluated the overall source code quality. They had to judge each aspect on the 4-step scale from "very poor" to "very good".
After all the participants finished their judgments, groups were formed based on source code entities. This means that each participant was a member of several groups. One member per group was elected group leader, who, once an agreement on a joint entity evaluation was reached, confirmed the agreed-upon evaluation. Group members had to collaborate and discuss only those source code quality aspects where they did not provide the same judgment during their individual evaluation session. Since the experiments with students and professionals were executed separately, there was no coordination group with mixed participants (students and professionals). This is why the coordinated evaluations, which keep student and professional participants separate, are also a useful source of data.
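The coordination step described above can be sketched in a few lines. Assuming, purely for illustration, that each individual assessment is stored as a mapping from quality aspect to the chosen scale value (the names `ASPECTS`, `SCALE`, and `aspects_to_coordinate` are ours, not part of the experiment tooling), the aspects a group must discuss are exactly those on which its members' judgments differ:

```python
# Quality aspects and the 4-step scale used in the experiment.
ASPECTS = ["size", "complexity", "cohesion", "coupling", "overall quality"]
SCALE = ["very poor", "poor", "good", "very good"]

def aspects_to_coordinate(assessments):
    """Return the aspects on which the assessors of one entity disagree
    and which therefore have to be discussed in the coordination step."""
    return [aspect for aspect in ASPECTS
            if len({judgment[aspect] for judgment in assessments}) > 1]
```

For example, if two assessors agree on every aspect except "size", only "size" is put up for discussion; identical assessments require no coordination at all.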
In the end, all data records, including participant profiles, individual evaluations, and coordinated evaluations (which clearly marked which evaluators changed their decision during the collaborative evaluation step), were combined into a report. Please note that the experiment was performed separately for professionals and students, but the source code entity repository was the same. This allowed us not only to observe inconsistencies within the groups, but also to observe them between groups and, more importantly, to compare student and professional efforts, results, and group agreement.

Source Code Quality Judgment Tool
The expert judgments were performed using the developed source code assessment tool. The tool supported coordination between assessors in order to reduce bias and achieve greater reliability for evaluations. Figure 3 demonstrates examples of a cross-section between participants. In the first step of the evaluation process, individual assessments are made. Individual assessments of entities are coordinated between linked assessors within the coordination step.
The architecture of the developed tool is presented in Figure 4. The tailor-made IT solution consists of four parts. An external Single Sign-On provider provides authentication. The components are containerized: the front-end, back-end, and persistent storage are placed into Docker containers, communicating via REST web interfaces. The Evaluators Rich Web Application can be extended with additional functionalities. The back-end system covers entity and assessment management, assessment implementation, and reporting, and is connected to a sustainable data repository covering safe and distributed data storage.
The tool was designed in an extendable and adjustable way. The concept allows the use of the tool in different domains by including additional components and making the needed adjustments. Therefore, instead of source code, the tool could also be used for the evaluation of any other entities, since the assessment criteria and number of assessors could be freely adapted.

The Results
In order to get a more detailed profile of the participating experts and students, we gathered their experience and perceived knowledge. The perceived knowledge of the programming language was self-assessed, since Feitelson [8] presents this as a good option for assessing proficiency. The profile of the participants is presented in Table 4. In the study, 54 students and 11 experts participated. Participating students evaluated their experience with software development with an average score of 4.0 on a scale from 1 to 10, while experts evaluated their experience with an average of 8.1. On average, students evaluated their knowledge of Java with 4.9 and experts with 8.6. The difference between students and experts can also be seen in their years of experience with Java. While most of the students had less than three years of experience, the majority of experts had more than ten years of experience. Table 4. Profile of participating students and experts.

Students: 54 participants; experience with software development (1-10): 4.0; knowledge of Java (1-10): 4.9; years of experience with Java: less than 3 years (69%), between 3 and 6 years (31%).
Experts: 11 participants; experience with software development (1-10): 8.1; knowledge of Java (1-10): 8.6; years of experience with Java: more than 10 years (81%), between 6 and 10 years (19%).

In the implemented study, participants evaluated 33 different program entities in the form of Java classes. Among the experts, 16 entities were evaluated by three pairs of assessors, 15 entities by two pairs, one entity by four pairs, and one entity by one pair. Among the students, 10 entities were evaluated by two pairs of assessors, 7 by three pairs, 6 by four pairs, 5 by five pairs, 3 by six pairs, and 2 by one pair. The assessors were randomly divided into pairs, with each assessor assessing between 7 and 9 entities.

Results Analysis
As we are interested in examining and comparing the students' and experts' judgments of source code quality and software metrics, computing the agreement between participants (in our case, the agreement in the judgments of pairs of students and experts) is the most suitable approach for the analysis. Several measures of inter-participant agreement exist; however, they are mainly limited to estimating the agreement between two participants. In our case, more than two participants gave their assessment of the source code, and they assessed it on a scale with more than two categories. To enable the estimation of inter-participant agreement in such cases (more than two participants), Fleiss's kappa was used, which was first developed in 1971 [22]. Fleiss's kappa has since proven itself in the literature to be a useful tool for estimating agreement between several participants [23].
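As a sketch of the calculation, and assuming the standard formulation of Fleiss's kappa [22] with N subjects, a fixed number of raters n per subject, and k categories, the statistic contrasts the mean observed agreement with the agreement expected by chance:

```python
def fleiss_kappa(ratings):
    """Fleiss's kappa for inter-participant agreement.

    ratings[i][j] holds the number of raters who assigned subject i
    to category j; every subject is assumed to have the same number
    of raters (as in a paired-assessor design).
    """
    N = len(ratings)                 # number of assessed subjects
    n = sum(ratings[0])              # raters per subject
    k = len(ratings[0])              # number of categories

    # Proportion of all assignments that fall into category j.
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]

    # Extent of agreement reached for each individual subject.
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]

    P_bar = sum(P) / N               # mean observed agreement
    P_e = sum(pj * pj for pj in p)   # chance-expected agreement
    return (P_bar - P_e) / (1 - P_e)
```

Complete agreement yields a value of 1.0, while values near 0 indicate chance-level agreement; note that the formula is undefined when every assessment falls into a single category (P_e = 1).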

Discussion
Based on the established interpretation of Fleiss's kappa values [22], it is obvious not only that strong agreement was achieved inside the professional group (the highest level, i.e., "almost perfect agreement", was reached on practically all evaluated aspects), but also that the agreement is consistent throughout all quality aspects that we included in the experiment.
On the other hand, the Fleiss's kappa values for student participants show worse performance in terms of the achieved agreement in practically all aspects. We can also find entities (E30-E33) with one or several aspects where poor agreement (the lowest rate according to [22]) was recorded.
To support our findings further, the aggregated values of Fleiss's kappa for all entities were also calculated, as shown in Table 7.
Based on the aggregated Fleiss's kappa values in Table 7, it is even more evident that there is a lack of agreement inside the student group (the student group diverges in their answers).
As the Fleiss's kappa values and their established interpretation clearly show, professionals are consistent in their expert judgments. Furthermore, as shown in Table 7, professionals can reach almost perfect agreement (the highest value, 1.0; see Tables 5 and 7) on all given source code quality aspects. In the "cohesion" aspect, professionals reached their worst agreement (a value of 0.8), which is just below "almost perfect agreement" and is the highest value in the interpretation of "substantial agreement". However, as we will show, the "cohesion" aspect turned out to be the most complex, since students also had their lowest agreement on it. The situation, when comparing agreement levels, is different when looking at the student groups. As shown in Table 7, almost perfect agreement is found only when we take into account source code size. Furthermore, even in this case, the value is the lowest one that can still be interpreted as "almost perfect agreement" (0.81). In the other aspects, agreement barely achieves a substantial level (complexity with a value of 0.65, overall quality with the lowest value for this agreement class, 0.61). Cohesion (0.50) and coupling (0.53) reach only the "moderate agreement" level. This is why we cannot state that students are consistent in their expert judgments within the source code quality research domain. However, we can identify a particular aspect in which student answers do not diverge.
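The agreement bands applied above can be sketched as a small lookup. The cut-offs below reproduce the bands as used in this discussion (they match the widely cited Landis-and-Koch-style interpretation of kappa values); the function name `interpret_kappa` is ours, for illustration only:

```python
def interpret_kappa(kappa):
    """Map a Fleiss's kappa value to its verbal agreement level."""
    if kappa < 0.00:
        return "poor agreement"
    if kappa <= 0.20:
        return "slight agreement"
    if kappa <= 0.40:
        return "fair agreement"
    if kappa <= 0.60:
        return "moderate agreement"
    if kappa <= 0.80:
        return "substantial agreement"
    return "almost perfect agreement"
```

Under these bands, the students' value of 0.81 for source code size just reaches "almost perfect agreement", while 0.50 for cohesion falls into "moderate agreement", in line with the discussion above.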
Based on this evidence, we can answer research question 1: "Are participants consistent in expert judgment evaluations regarding the source code quality research domain?". In the case of the professional participants, where strong experience is reported, based on the interpretation of the Fleiss's kappa values, we can answer research question 1b positively: professionals are consistent in their expert judgment regarding the source code quality research domain, and their level of agreement is almost perfect. In the case of the student participants, where domain education seems strong but a lack of experience is reported, the answer is not straightforward. The interpretation of the Fleiss's kappa values shows lower agreement levels for all aspects (research question 1a). In the "source code size" aspect, the level of agreement inside the student groups reached the same level (based on the interpretation of Fleiss's kappa values) as in the professional groups (i.e., "almost perfect agreement"). The lowest levels of agreement were seen in the "cohesion" and "coupling" aspects.
This brings us to the final discussion on research question 2: "Are the levels of agreement between students and the level of agreement between professionals comparable?" Based on the interpretation of the Fleiss's kappa values (see Table 7), we can report important differences in the majority of aspects. The exception is when participants judged the "source code size" aspect. This is why we cannot easily answer research question 2: the level is comparable only in certain aspects; for others, it is not. This is also why we can distinguish between quality aspects while answering research question 2a: "In which aspects of source code quality assessment are the levels of agreement comparable?".
When we observe the "size" and "complexity" aspects, students performed best in terms of converging to the same answer. Our interpretation of this result is that those aspects are examples of simple questions, where experience does not play an important role ("Do you find this source code to be of an appropriate size?"/"Do you find this source code to be of appropriate complexity?"). However, to judge more complex aspects, such as "overall quality", "coupling" and, even more apparently, "cohesion", experience seems to be necessary; thus, students are not an appropriate substitute in such experiments.
To sum up: our research shows that students are an appropriate substitute for professional participants in experiments in the source code quality research domain only when the experiment deals with simple aspects (e.g., source code size, source code complexity). When dealing with complex aspects, participants should have a certain level of professional experience in order to keep the experiment sound and relevant in terms of external validity.

Conclusions
In this paper, we reported our research questioning whether students are a comparable substitute for professionals in experiments in the source code quality research domain. On several occasions, scholars have tended to use students as an important source of participants in experiments to show and validate their theses. As we showed throughout the paper, there are numerous reasons for this, one of the most important being that students are usually willing to participate and are pursuing an education in the experiment's domain. Other authors have also expressed their doubts about the external validity of student-based experiments. This is why we addressed that question in this paper. By performing an experiment, gathering the results, interpreting them, and answering our research questions, we showed where and why students can or cannot replace professionals. Professionals are participants who boast a higher level of experience, in contrast to students, whose education level might be high but who often lack a comparable level of experience.
We designed and performed an experiment in order to answer the research questions. We experimented separately with professionals and students. The tool, which was tailor-made for this experiment, supported coordination between assessors in order to reduce bias and achieve greater reliability of the evaluations. Based on the results analysis, we showed that professionals achieved almost perfect agreement on all given source code quality aspects. On the other hand, in the student groups, almost perfect agreement was found only when we took into account source code size, which represents asking an experiment participant a simple question.
Based on our findings, students were not outperformed by professionals in certain aspects, but in others, students would not only give different answers compared to professionals, but their answers would also diverge. By calculating Fleiss's kappa values and interpreting the results, we showed that students might be an appropriate substitute for professionals when simple aspects are in question (e.g., source code size, and also source code complexity). In the case of investigating more complex aspects (e.g., the cohesion of a class), where day-to-day practical experience might help, students are not appropriate participants.
Based on the presented paper, we would encourage authors in the software quality research domain to employ professional participants in their experiments. In cases when simple answers are expected, students can also be appropriate. However, based on the approach demonstrated in this paper, we would also encourage authors dealing with mixed participants, in terms of students and experienced professionals, to compare the student-based and professional-based results in order to verify whether the student-based data are valid.
As a side effect, we clearly showed in this paper that in the area of software quality research, experience (professionals) in addition to formal education (students) is crucial.