Ensuring Sustainable Evaluation: How to Improve Quality of Evaluating Grant Proposals?

The principle of sustainable development is an obligation placed on all entities involved in the implementation and delivery of the structural funds made available not only by the European Commission but also by grant donors from all over the world. For this reason, when applying for a grant, proposals need to demonstrate the positive or neutral impact of the project on sustainable development. To be able to select projects that will ensure sustainability, we need to ensure the effective evaluation of the proposals. The process of their evaluation should be objective, unbiased and transparent. However, current processes have several limitations. The process by which grants are awarded and proposals evaluated has come under increasing scrutiny, with a particular focus on the selection of reviewers, fallibility of their assessments, the randomness of assessments and the low level of common agreement. In our studies, we demonstrated how some of those limitations may be overcome. Our topic of interest is the work of reviewers/experts who evaluate scientific grant proposals. We analyse data coming from two prominent scientific national grant foundations, which differ in terms of the expert selection procedure. We discuss the problems associated with both procedures (rating style of the reviewers, lack of calibration and serial position effect) and present potential solutions to prevent them. We conclude that, to increase the unbiasedness and fairness of the evaluation process, reviewers' work should be analysed. We also suggest that, within a certain panel, all grant proposals should be evaluated by the same set of reviewers, which would help to eliminate the distorting influence of the selection of a very severe or very lenient expert. Such effective assessment and moderation of the process would help ensure the quality and sustainability of evaluations.


Introduction
A vital part of academic work relies on evaluation. One of the most important elements of evaluation is peer review, acting as both a filter for selection and a quality control mechanism [1]. However, the system is not flawless, and the disadvantages of the peer review system have been widely discussed [2][3][4][5][6][7]. Assessing grant proposals is one area where peer review is often used; however, the empirical evidence on the effects of grant-giving peer review is limited [8,9]. The process of grant proposal evaluation has also come under discussion, with particular criticism focused on the selection of reviewers and the fallibility of their assessments [10]; the occurrence of conflicts of interest between the reviewee and the reviewer [11]; randomness of assessments and low level of agreement among experts [3]; bias against innovative research [8,12]; theft of ideas [13]; and favouring a particular group of applicants [14]. These biases may distort the process of evaluation and influence the distribution of public funding, so that the financed projects fail to achieve "development that meets the needs of the present without compromising the ability of future generations to meet their own needs" [15] (p. 20). In other words, proposals promoted through a biased evaluation may not meet the criteria of sustainable development, support the best science, or contribute to funders' desired outcomes and the most effective achievement of sustainability.
Our topic of interest is the work of reviewers/experts who evaluate scientific grant proposals in governmental grant-making agencies. Their work is critical because decisions on which proposals will receive funding are based on their evaluations; hence, reviewers indirectly shape and affect the production of knowledge, as well as societal progress, the economy, health and well-being.
Literature reviews present examples of many biases accompanying the review process [16][17][18]; unfortunately, the work of evaluators has rarely been analysed in a systematic way and little research has been done into the peer review of grant applications [19]. This is a problem, as such analysis could help to counteract some of the biases that have been identified.
In our paper, we analyse the work of the evaluators reviewing grant proposals for national bodies. Our datasets come from two prominent scientific national grant foundations, which differ in terms of the expert selection procedure. We focus on the first step of the evaluation, where grant proposals are evaluated by a small group of reviewers (usually two or three) and then those with the higher scores are discussed at a full panel meeting. We present the problems associated with both systems as well as potential solutions to prevent them. This paper is structured as follows. In the Theoretical Section, we discuss the evaluation process, concentrating on its main elements: reference pattern, calibration, rating style and multiple object evaluation. In Section 2, we provide empirical support for our theoretical considerations. First, we compare two sets of reviews coming from two scientific national grant foundations. Then, in the experimental study, we test the bias called the serial position effect, demonstrating that the position of a proposal in the evaluation order influences its evaluation. In Section 3, we present the results and discussion and, in Section 4, the conclusions.

Evaluation Process
When considering a grant proposal, reviewers use an evaluation system provided by the contracting institution, which often takes the form of rating scales (e.g., rating the scientific experience of the applicant on a scale of 1-5 and rating the innovative character of the project and the impact of its implementation on the development of the scientific discipline on a scale of 1-3). Conducting such an evaluation requires the reviewers to perform two tasks: (1) compare the evaluated grants with their internal quality standard; and (2) transform their own, private response scale into the one provided by the contracting institution.

Activating Reference Pattern
The results of many tasks are easily measurable, e.g., number of items sold or the number of attracted customers. In such scenarios, a simple algorithm can be used to assess task completion. While for most material objects there are precisely defined standards (e.g., technical parameters), the process of objectivising the assessment of mental products (e.g., evaluating a grant proposal) is a major challenge. This activity requires reviewers to make a comparison with a certain reference pattern/standard which is tacit [20]. This reference pattern may have the form of: (1) a typical object (an averaged representation of all grant proposals previously reviewed); (2) an ideal object (the vision of an ideal grant proposal, not necessarily existing in reality); or (3) an exemplary object (e.g., a representation of a paper that was recently reviewed by the given expert).
Exactly which pattern will be activated depends on the reviewer's habitual choices (e.g., some always focus on the ideal pattern) and changing situational factors (e.g., last written review) as well as the mental and physical status of the reviewer: mood [21] and cognitive and psychoenergetic resources [22]. It might happen, however, that reviewers do not have a ready reference standard, e.g., when they are reviewing research proposals for the first time.

Calibration
A reference pattern begins to emerge during subsequent evaluations, so the previous grant proposal becomes a reference for the next one. It can be assumed that, at some point in the series, the reference pattern will be crystallised/calibrated. This in turn may change the average value of the pattern. This process is called calibration [23] and can be described as creating similar patterns in the reviewer's mind. Experts learn how to use an available evaluation system (a rating scale, grading system or binary accept/reject decisions) to rate objects. The anchoring effect also plays a fundamental role in the calibration process. In the classic sense, in the anchoring effect, the number specified at the beginning plays the role of a reference point (anchor) for further evaluations [24][25][26].
An attempt to control a reviewer's pattern seems to be important to ensure similar conditions for all evaluations. At the beginning of a rating series, experts often cannot predict the quality range of the objects they will have to rate (e.g., how good or bad the next grant proposal is). For this reason, they avoid giving the highest and lowest grades. The process of calibration manifests itself in an initial avoidance of extreme categories [27,28]. This is particularly problematic for very good or very bad proposals, especially when they are evaluated in sequence and reviewers cannot change their assessment by going back to a proposal and re-evaluating it [29].

Rating Style
Another key factor in evaluation processes is the reviewer's rating style. When evaluating grant proposals, the reviewer needs to assess the quality of a grant proposal, often using numeric scales. Even though the scale is precise (e.g., 1-5), reviewers might transform their own, private rating scale into the one provided by the contractor. We can identify reviewers who use the whole range of the scale (using all points, i.e., 1, 2, 3, 4 and 5), those who use only two values (e.g., 2 or 3) and others who avoid the extreme ends. This is a manifestation of a person-specific rating style that can be expressed through: (1) a tendency towards lenient/severe assessments [20,30]; and (2) a lack of differentiation across dimensions, namely the halo effect [31][32][33]. Lenient evaluators are referred to as zealots (they accept a lot), while severe ones are referred to as assassins (they reject a lot) [34]. The reasons for the leniency/severity of evaluators are differences in the way they process information, their personality traits, individual features and situational conditions [16].
The influence of rating style on the assessment has been presented in many studies: in the work of examiners evaluating the work of medical candidates [16], in ratings received from police officers, nurses and social workers [35] and aircraft mechanics [36]. It should also be noted that evaluator rating styles can be assessed only when we have a sufficient number of evaluations provided by an individual.

Single Versus Multiple Evaluation
All grant proposals within a certain panel can be evaluated by the same set of reviewers or by a random set of reviewers (a different set of reviewers for each grant proposal). With a random set of reviewers, we cannot control what reference standard is activated in the mind of an expert, which may influence their evaluation.
Marsh, Jayasinghe and Bond [37] compared assessments of reviewers evaluating one or more proposals. They found that ratings of those who evaluated at least three projects were more cohesive with the evaluations of other reviewers of the same project and more correct (i.e., similar to the final grades). This suggests that conducting more than one evaluation helps a calibrated pattern to emerge. Interestingly, ratings of those who evaluated only one project were more severe on average than ratings of those who evaluated more proposals. In a group of evaluators who have evaluated ten or more projects, a more consistent rating style (thus, their tendency for severe/lenient evaluations) was observed.
Although previous studies have established that the assessment of grants is often subject to relatively low inter-rater agreement, the impact this variability has on actual decisions about grant funding has received less attention [19]. To determine the inter-rater agreement, we need to have access to multiple evaluations from the same reviewers. A lack of information about the reviewers' rating styles may lead to wrong decisions. This is an example of adverse selection of reviewers [38].
Thus, we suggest that, to minimise the negative effects of rating style and differing comparison patterns, all projects within one panel should be evaluated by the same set of reviewers.

Study 1: Comparison of Two Evaluation Systems
The aim of the first study was to compare two datasets of grant reviews coming from two national foundations: (1) The National Centre for Research and Development awards grants within the LEADER programme of up to half a million dollars per project ($13 million in total). (2) The Foundation for Polish Science grants one-year scholarships to young researchers within the START 100 programme, amounting to around $8500 per project.
The main difference between the two evaluation systems is the process by which reviewers are selected. In the START programme, the majority of reviewers (67%) evaluated one project only. In the LEADER programme, the same three reviewers evaluated all projects in a certain panel.

START Programme
In the START programme, 126 proposals were competing for the scholarships. The evaluation procedure was structured as follows: First, a set of three reviewers were selected separately to evaluate each application on a 100-point scale. Then, another set of experts formed a committee that read the reviews and conducted an interview with the leaders of the projects. To be interviewed, the project needed to receive 65 points (an average across all three reviews). Overall, 252 reviewers were involved in the evaluation process of 126 projects (among whom, 67% evaluated one project only).
When we looked at the difference between the maximum and minimum ratings given to the same project, we found that for almost 40% of the projects the difference between expert evaluation exceeded 30 points on a 100-point scale. This discrepancy between the ratings is an indicator of the rating style and a lack of an agreed basis for reviews (see Figure 1).
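The discrepancy measure described above can be sketched in a few lines. This is a minimal illustration only: the function names and the ratings in `example_projects` are hypothetical and are not taken from the START dataset.

```python
def rating_spread(ratings):
    """Difference between the highest and lowest rating given to one project."""
    return max(ratings) - min(ratings)


def share_discordant(projects, limit=30):
    """Fraction of projects whose rating spread exceeds `limit` points."""
    spreads = [rating_spread(r) for r in projects.values()]
    return sum(s > limit for s in spreads) / len(spreads)


# Hypothetical ratings on a 100-point scale (three reviewers per project)
example_projects = {
    "P1": [82, 75, 48],  # spread 34
    "P2": [70, 68, 66],  # spread 4
    "P3": [90, 55, 85],  # spread 35
    "P4": [60, 62, 71],  # spread 11
    "P5": [40, 72, 65],  # spread 32
}

print(share_discordant(example_projects))  # 3 of the 5 spreads exceed 30 points
```

A high share of discordant projects signals that the reviewers did not work from an agreed basis, which is the pattern reported for almost 40% of the START projects.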
Based on the 65 points criterion, only 64 applicants were invited to present their projects to the committee. If, for a scientific reason, we excluded the ratings of the most severe reviewer out of three, the ranking would change. When the mean was computed based on the rating of two reviewers, an additional 25 project leaders would be able to present their project to the committee, which would increase the number of invited presenters from 64 to 89.
We repeated the same analysis with a changed threshold, increasing the criterion to 70 points (see Figure 2). Excluding the most severe reviewer would result in 17 more invitations to present proposals. Of course, the strict reviewer could be right and the two others mistaken in their evaluation, but such fallible evaluation undermines confidence that projects deserving funding will receive it.
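The threshold analysis above can be reproduced with a short sketch. The data below are hypothetical, the function name `invited` is ours, and we assume that a project qualifies when its mean rating is at least equal to the threshold; the foundation's exact rule may differ.

```python
from statistics import mean


def invited(projects, threshold, drop_most_severe=False):
    """Return ids of projects whose mean rating reaches the threshold.

    With drop_most_severe=True the lowest of the ratings is excluded
    before averaging (assumption: 'most severe' = lowest rating).
    """
    result = []
    for pid, ratings in projects.items():
        used = sorted(ratings)[1:] if drop_most_severe else ratings
        if mean(used) >= threshold:
            result.append(pid)
    return result


# Hypothetical ratings on a 100-point scale
example = {
    "A": [70, 68, 40],
    "B": [80, 75, 72],
    "C": [66, 64, 50],
    "D": [55, 50, 45],
    "E": [90, 72, 30],
}

for threshold in (65, 70):
    all_three = invited(example, threshold)
    without_severe = invited(example, threshold, drop_most_severe=True)
    print(threshold, all_three, without_severe)
```

In this toy example a single low rating keeps projects A, C and E below the 65-point criterion, which mirrors the mechanism behind the additional invitations reported above.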

LEADER Programme
In the LEADER programme, all proposals were reviewed by the same set of reviewers for each panel. The number of evaluated projects depended on the field (mathematics, physics, economics, chemistry and biology) and ranged from 16 to 59. There were generally 18 experts.
Having ratings of many research proposals from each reviewer allowed us to determine their rating styles, which was indicated by mean, median and variability [16]. In Figure 3, we can see that the experts differed in terms of the mean ratings (some were significantly more lenient than others) and variability (some differentiated more than others).
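As an illustration of how a rating style can be summarised by mean, median and variability, the following sketch uses Python's standard `statistics` module. The reviewer labels and ratings are hypothetical, and the sample standard deviation is only one possible measure of variability.

```python
from statistics import mean, median, stdev


def rating_style(ratings_by_reviewer):
    """Summarise each reviewer's ratings by mean, median and sample SD."""
    return {
        reviewer: {"mean": mean(r), "median": median(r), "sd": stdev(r)}
        for reviewer, r in ratings_by_reviewer.items()
    }


# Hypothetical ratings on a 1-5 scale for the same five proposals
example_ratings = {
    "R1": [4, 5, 4, 5, 4],  # lenient, differentiates little
    "R2": [1, 2, 1, 2, 2],  # severe, differentiates little
    "R3": [1, 3, 5, 2, 4],  # uses the whole scale
}

styles = rating_style(example_ratings)
for reviewer, style in styles.items():
    print(reviewer, style)
```

Such a summary is only meaningful when each reviewer has rated a sufficient number of proposals, which is why it could be computed for LEADER but not for most START reviewers.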
The presence of very lenient or very severe reviewers did not distort the assessments, as all competing proposals were evaluated by the same experts, i.e., severe reviewers influenced all the proposals in the same way.
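The argument can be made concrete with a toy model. We assume, purely for illustration, that each rating equals a proposal's underlying quality plus a constant reviewer-specific offset; the quality values, offsets and names below are hypothetical.

```python
def rank_by_mean(scores):
    """Order proposal ids from the highest to the lowest mean rating."""
    return sorted(scores, key=lambda p: sum(scores[p]) / len(scores[p]), reverse=True)


# Hypothetical underlying quality and reviewer offsets (additive model)
true_quality = {"P1": 80, "P2": 70, "P3": 60}
severity = {"lenient": 10, "neutral": 0, "severe": -15}

# Same panel: every reviewer rates every proposal
same_panel = {
    p: [q + offset for offset in severity.values()]
    for p, q in true_quality.items()
}

# Different reviewers: P1 happens to draw the severe reviewer, P2 the lenient one
mixed_panel = {
    "P1": [true_quality["P1"] + severity["severe"]],
    "P2": [true_quality["P2"] + severity["lenient"]],
    "P3": [true_quality["P3"] + severity["neutral"]],
}

print(rank_by_mean(same_panel))   # ranking follows underlying quality
print(rank_by_mean(mixed_panel))  # P2 overtakes P1
```

Under this assumption the offsets shift every proposal's mean by the same amount when the panel is shared, so the ranking is preserved; real rating styles also differ in variability, which a constant offset does not capture.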
When evaluating an object, an evaluator needs to compare it with another object which can take different forms (ideal proposal that exists in the reviewer's mind, the last proposal that reviewer evaluated or the average of all previously evaluated proposals). After performing a couple of evaluations, a comparison pattern stabilises in the reviewer's mind. The lack of such stabilisation poses a threat to evaluations, especially when reviewers conduct multiple reviews in a sequence. This risk is a cognitive bias called the serial position effect.

Study 2: Experimental Research on Serial Position Effect
When evaluating multiple objects, a comparison pattern emerges during the evaluation in the process of calibration. When calibrating, evaluators avoid extreme evaluations at the beginning of a series. As a result, "good" objects are evaluated less favourably, and "poor" objects more favourably, at the beginning of an evaluation series than the same objects evaluated at the end of a series. The serial position effect has been demonstrated in several studies [27,39,40,41]. It is worth noting that this effect occurs regardless of the degree of knowledge of the object under evaluation and the rating criteria [28].
In Study 2, we conducted an experiment demonstrating the occurrence of serial position effect in the context of conference abstracts evaluations. We also examined whether introducing a small break during the evaluation process would minimise the occurrence of this effect.

Participants
Eighty-six management students volunteered to participate in the study (70% women, aged 19-29, M = 21.26; SD = 1.09) in exchange for bonus points in one of their courses. The results of the experiment were later discussed during classes.

Procedure and Materials
An email with a link to an online survey was sent to students, inviting them to participate in the study. The participants took on the role of the evaluators of conference abstracts submitted to a conference on stress in the workplace, which was aimed at HR specialists and scientists. The abstracts concerned studies on the problems of occupational stress for different work groups (doctors, nurses, journalists, teachers, white-collar workers, etc.).
Based on the ratings collected in the preliminary study, the abstracts were classified into three categories: good, bad and average. A computer program randomly assigned the participants who filled out their consent form to take part in the study to one of the three experimental conditions. The scheme of the scientific procedure is in Table 1. Group E3, during a break, was asked to evaluate six projects of conference logos on four dimensions on a seven-point scale: (1) originality; (2) thematic compliance; (3) precision of delivery; and (4) recommendation. After evaluating a whole series of abstracts, the participants indicated their age and gender.

Results
To test the impact of the order and the abstract quality on the evaluation process, a two-way ANOVA with repeated measures on the last factor was performed, followed by Bonferroni post-hoc tests. The results are presented in Table 2. All effects were statistically significant at the 0.05 significance level. The hypothesis predicting an interaction effect of the position and the quality of the abstract was supported. Poor abstracts gained and good ones lost when evaluated at the beginning of the series, while good abstracts gained and poor ones lost when evaluated at the end of the series.
In the next step, we examined the second hypothesis, which stated that the serial position effect could be minimised by a break that interrupts the serial evaluation and clears working memory. To assess the impact of the break, the ratings given by Groups E2 and E3 for the best abstracts were compared. If the break caused the reference pattern developed in the evaluators' minds to vanish, then the ratings of good abstracts in Group E3 would be lower than those in Group E2. Student's t-test for independent groups revealed no significant difference between the ratings of good abstracts given by Group E3 after a break with an additional task and those given by Group E2 working without any break.
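For readers who wish to see the comparison spelled out, a pooled-variance Student's t statistic for two independent groups can be computed as in the sketch below. The ratings are hypothetical and do not reproduce the data of Groups E2 and E3; the function name is ours.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Hypothetical ratings of good abstracts on a seven-point scale
group_e2 = [5, 6, 5, 7, 6]  # no break
group_e3 = [6, 5, 6, 6, 5]  # break with an additional task

t, df = student_t(group_e2, group_e3)
print(round(t, 3), df)
# |t| is then compared with the two-sided critical value for df degrees
# of freedom (2.306 for df = 8 at the 0.05 level).
```

A value of |t| below the critical value, as in this toy example, corresponds to the "no significant difference" outcome described above.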

Results and Discussion
The aim of our study was to underline the need for systematic analysis of the work of experts evaluating grant proposals in order to increase the quality of their assessments. We compared evaluations from two datasets originating from the scientific grant competitions of two national agencies. The two datasets differed in terms of the reviewer selection procedure. In the first case, grant proposals were evaluated by different sets of reviewers, while, in the second case, all grant proposals from a certain domain were evaluated by the same set of reviewers. The first method does not allow controlling the comparison pattern that is activated in the reviewer's mind while reviewing. This might result in a lack of inter-rater reliability, understood as the extent of agreement between independent reviewers [4]. As a consequence, this undermines confidence that projects deserving funding will receive it. This approach also does not allow for assessment of a reviewer's rating style, for example the tendency towards severe/lenient assessments, which, as has been presented in many studies, influences the final evaluation [16,20,36]. Discordant evaluations among referees could make the review system unfair to authors whose manuscripts happened to be sent to an assassin or zealot reviewer [42] and thus should be controlled.
There are different methods for minimising the distorting impact of rating style, e.g., training aimed at increasing awareness of basic errors occurring during the review process. However, their usefulness is limited [43]. Giraudeau [44] proposed a simple method to identify discordant proposals, aiming to help track the proposals where evaluators disagree and further discussion is required. The first element of their method requires discarding any proposal with only one rating. One rating does not allow controlling the rating style of the evaluator or applying an algorithm that helps track evaluations that require further discussion.
To control the negative impact of an evaluator's rating style, we recommend a system where: (1) we have multiple evaluations from one evaluator; and (2) all grant proposals in a certain panel are evaluated by the same set of reviewers. If there are strict reviewers (assassin type), they will be strict for all of the evaluations in the process, and the same applies to those who are lenient (zealot type).
However, this approach also has limitations. If the evaluation is conducted in a series, reviewers are susceptible to the occurrence of the serial position effect. This distortion causes reviewers to avoid extreme ratings at the beginning of the series, which influences the evaluations of good proposals that are evaluated less favourably at the beginning of a series and bad proposals that are evaluated more favourably at the beginning of a series.
The serial position effect is explained by the calibration theory: evaluators learn how to use an available category system to evaluate objects [23]. In their first evaluations, evaluators avoid using very high or low grades as they do not want to violate the "internal consistency" of their evaluations. In other words, they are afraid to give maximum points to a proposal as they do not know the quality of the next one. After performing a couple of evaluations, a new comparison pattern is created in their mind, based on previous evaluations. We think that evaluators, prior to evaluating proposals, should receive a few proposals (e.g., the best and the worst from the previous competition), which would help to "calibrate" their minds and create a similar pattern for all. At the same time, it would give some insight into the reviewer's rating style and help to compute it. Similar solutions are used in the evaluation of applications in the American Graduate Research Fellowships programme of the National Science Foundation (GRFP). The calibration method raises some future research questions, for example: (1) How many proposals do reviewers need to evaluate to calibrate? (2) Is there a difference in reference patterns between zealots/assassins and more moderate reviewers?
Our attempt to minimise the serial position effect by introducing a small break task failed. Our intention was to clear evaluators' working memory and change the reference pattern in their mind. As a breaking task, we asked participants to evaluate six logo proposals. It might be that the task was too short, and the content of their working memory remained the same. In our future studies, we would like to examine if longer and different types of breaks would reduce the serial position effect. In our experiment, we used an online evaluation system that did not randomise the order of the evaluated abstracts and did not allow re-evaluating the proposals. It seems that allowing the two could help minimise the effect occurring when evaluating multiple proposals.
Finding ways to minimise the distortive effect of the described cognitive biases is important in the presence of an uncontrollable flood of information [45]. Over several decades, the number of scientific papers has climbed by 8-9% each year [46]. An explosion of scientific publications is accompanied by an increase in the number of cited references.
As reviewers are overwhelmed by the volume of data to be parsed, the ability to discriminate accurately is increasingly important. Overflow shortens the time available to concentrate on a single stimulus and makes people succumb to the influence of attention-grabbing stimuli, which in turn adversely affects information processing [47]. One of the consequences is heightened susceptibility to cognitive biases and reduced ability to defend against them. Tools and strategies that help to prevent the occurrence of biases are therefore much needed.

Conclusions
If we want to ensure that the process of evaluation promotes proposals that harmonise social, environmental and economic interests and reflect values and ethics, the process of evaluation itself needs to be sustainable [48]. To achieve that, reviewers' work should be analysed. To prevent the distortive effect of evaluators' rating style and the selection of a lenient or severe evaluator, we suggest that: (1) evaluators should perform multiple evaluations, which would allow donors to control for their rating style and identify the zealots and assassins; (2) within a panel, all proposals should be evaluated by the same set of reviewers, so that all proposals are equally influenced by the reviewers' rating styles, which would help to eliminate the distorting influence of the selection of a very severe expert (who rejects a lot) or a very lenient one (who accepts a lot); and (3) to create similar evaluation conditions, the evaluators should be calibrated prior to evaluation: to ensure the emergence of a similar comparison pattern, they should review the few best and weakest proposals from past competitions.
Such effective assessment and moderation of the process would help ensure a more robust review of grant applications, providing for better use of resources and better outcomes.
Author Contributions: Conceptualisation, G.W. and K.K.; methodology, G.W. and K.K.; software, K.K.; formal analyses, G.W. and K.K.; writing-original draft preparation, G.W. and K.K.; and writing-review and editing, G.W. and K.K. All authors have read and agreed to the published version of the manuscript.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: Data supporting the findings of this study are available from the corresponding author K.K. on request.

Conflicts of Interest:
The authors declare no conflict of interest.