Inadequate Standards in the Valuation of Public Goods and Ecosystem Services: Why Economists, Environmental Scientists and Policymakers Should Care

Surveys of stated willingness to pay (WTP) are increasingly used to assess the value of public goods and ecosystem services. However, the currently dominant survey protocols and practices are inadequate. This is most evident from robust findings that the valuations are heavily influenced by the response options or response scales used in the valuation questions. The widely accepted survey guidelines do not require the validity tests and reporting standards that would be needed to make the uncertainty of the results transparent. The increasing use of inadequate survey results by policymakers threatens to undermine trust in environmental valuation, environmental policies, and political institutions. The objective of this paper is to explain the relevant issues and the available alternatives in survey-based valuation to a wider audience of non-specialized economists, environmental scientists, and policymakers.


Introduction
Policymakers are increasingly interested in assessments of the monetary value of public goods and services including, more recently, ecosystem services. The primary way in which these values are assessed today is through surveys of stated willingness to pay (WTP). Almost all researchers working in this field today follow one particular survey paradigm, the so-called contingent valuation method [1]. The more general term for survey-based valuation, "stated preferences" (as opposed to "revealed preferences"), is often used synonymously with the contingent valuation approach [2,3].
While behavioral decision researchers and most economists are skeptical of contingent valuation, the approach has become popular among economists and environmental scientists specialized in environmental valuation and among policymakers who commission the studies as a basis for environmental policy decisions. In addition, large environmental research consortia funded by national and international science foundations often involve valuation research to assess economic implications or demonstrate the societal relevance of the research. These projects often apply the contingent valuation approach. This paper argues that the currently dominant survey approach and reporting standards in this field are inadequate and that the widely accepted guidelines support flawed best practices of a self-regulated industry rather than sound methodological standards. It is concluded that economists, environmental scientists, and policymakers who use or tolerate these practices are on a dangerous path that undermines public trust in science and political institutions. The objective of the present paper is to explain the relevant issues in an accessible way and illustrate them using recent examples.
The paper is structured as follows. The next section explains the key issues with the current standards and practice in contingent valuation. Section 3 explains how the issues are currently addressed in the discipline. Section 4 illustrates the issues with examples, followed by a description of perceptions in the wider scientific community (Section 5). Section 6 presents available alternatives in survey-based valuation, and Section 7 concludes.

Current Survey Paradigm
In the current survey paradigm [1,4], respondents are informed about proposed environmental or other improvements and asked about their willingness to pay for measures to implement these improvements. The willingness-to-pay questions may be formulated in any of several possible ways. The preferred format is to "offer" the same proposed policy at different hypothetical prices to different respondents. The respondents are asked whether they would approve of the policy at the specified costs. Statistical methods are then used to estimate parameters of WTP such as mean WTP from the distribution of "yes" and "no" responses at different price levels. Other popular formats include open-ended questions about maximum willingness to pay and payment cards showing a range of amounts and asking respondents to pick their maximum WTP. In a variant called "choice experiment", the respondents are asked to choose their preferred option among two or more policy options, which vary in terms of several attributes including the price.
What is common to all formats is that the valuation questions do not contain information about the actual (expected) cost of the policy. The objective of the survey is not to measure what percentage of the respondents would approve of the policy at its actual cost but to measure the maximum the respondents would be willing to pay for the policy in question.
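The estimation step behind the preferred format can be sketched in a few lines. The following is a minimal illustration, not a reproduction of any cited study: simulated respondents answer "yes" when their latent WTP exceeds a randomly assigned bid, and a logit model recovers mean WTP from the yes/no pattern. All bid levels and distributional parameters are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical bid design (illustrative): five price levels, 200 respondents
# each. A respondent answers "yes" if their latent WTP exceeds the bid offered.
bids = np.repeat([10.0, 25.0, 50.0, 100.0, 200.0], 200)
latent_wtp = rng.normal(60.0, 40.0, size=bids.size)
yes = (latent_wtp > bids).astype(float)

# Logit model: P(yes | bid) = 1 / (1 + exp(-(a + b * bid))), with b < 0.
def neg_log_lik(theta):
    a, b = theta
    p = 1.0 / (1.0 + np.exp(-(a + b * bids)))
    p = np.clip(p, 1e-9, 1.0 - 1e-9)
    return -np.sum(yes * np.log(p) + (1.0 - yes) * np.log(1.0 - p))

res = minimize(neg_log_lik, x0=[1.0, -0.01], method="Nelder-Mead")
a_hat, b_hat = res.x

# For this linear-in-bid logit, mean (= median) WTP is -a/b.
mean_wtp = -a_hat / b_hat
print(f"Estimated mean WTP: {mean_wtp:.1f}")
```

In this synthetic setup the estimate recovers the latent median of about 60; in a real survey, of course, the analyst never observes the latent WTP, which is precisely why the design choices discussed below matter so much.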

Cognitive Limitations
A key issue with this survey paradigm is cognitive limitations. After all, the respondents are asked about unfamiliar policy proposals which have never been publicly discussed. They are required to answer without knowing any of the positions or arguments of trusted political parties or interest groups that are typically available in elections or referendums. Lacking any useful advice that would help them make a choice that is reasonably in line with their interests and values, they resort to problematic response heuristics. A robust finding is that the respondents use the hypothetical prices at which a policy is offered as a starting point for guessing their willingness to pay. The valuations are therefore systematically influenced by the hypothetical prices or response options used in the survey instrument. The higher the prices, the higher the resulting WTP estimates. As methodological choices affect the valuations, the valuations are easy to manipulate.
The influence of hypothetical prices on WTP responses has been intensively investigated in studies of anchoring effects [5][6][7]. A recent review of such studies found that when hypothetical prices were increased by 1 percent, the WTP increased by 0.3 percent on average [8]. For instance, if a policy is offered at hypothetical prices in the range of 10 to 1000 Euros, the resulting WTP is thirty times higher than if the same policy is offered at prices in the range of 0.10 to 10 Euros. For public goods, the effects of the hypothetical prices or, more generally, response scales [9,10], can be even stronger [11,12]. As researchers typically use relatively high prices, they often measure (too) high mean WTP for public goods and ecosystem services (see Example 1 below).
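The arithmetic behind the thirty-fold figure can be made explicit. The sketch below extrapolates the 0.3%-per-1% finding linearly across the two bid designs; this linear extrapolation is an assumption (a constant-elasticity reading of the same finding would yield a smaller factor), and the price ranges are the ones quoted above.

```python
# Back-of-envelope check of the figures above. Bids in the 10-1000 Euro design
# are 100 times those in the 0.10-10 Euro design.
price_ratio = 100
pct_price_increase = (price_ratio - 1) * 100.0    # a 9900% price increase
pct_wtp_increase = 0.3 * pct_price_increase       # 0.3% of WTP per 1% of price
wtp_factor = 1.0 + pct_wtp_increase / 100.0
print(f"WTP factor: {wtp_factor:.1f}")            # about thirty times higher
```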
Anchor effects, response scale effects, and a range of other response phenomena in WTP surveys have led many researchers to question the very existence of well-articulated preferences for non-market goods and services [7,9,[13][14][15][16]. In comparison, preferences in the marketplace and in the political domain are found to be relatively well-defined and stable [17][18][19][20]. The reason is that important contextual information that is available in actual policy decisions is lacking in the current WTP survey approach.

Hypothetical Prices
Another difference between the current WTP survey approach and actual policy decisions concerns the incentives for truthful responses [21,22]. WTP questions with hypothetical prices may invite strategic responses. Assume, for instance, that the true cost of a policy to a respondent is much lower than the price at which it is offered. Furthermore, assume the person welcomes the proposed policy and would like to see it implemented, although not at the unrealistically high hypothetical price specified in the survey. How would the respondent decide? A sophisticated respondent may act strategically and approve of the policy in order to drive up mean WTP in the study and thereby increase the chances that the policy will be implemented.
Proponents of the current survey approach argue that they could influence respondents' beliefs about the hypothetical prices in ways that prevent strategic responses [1,23]. Unfortunately, the existing evidence on survey respondents' beliefs does not support these hopes [24,25]. Using different question formats does not solve the problem, either. In open-ended questions, for instance, the potential gains from answering strategically are particularly easy to see.
A further problem of WTP questions involving hypothetical prices is their lack of clarity. To see this, assume a respondent is offered the policy at an unreasonably high price, which may be a future tax payment, for instance. What could the high price mean? Will the policy involve overpriced contracts? The respondents simply cannot know based on the available information. The true (expected) price, an important piece of policy information, is missing, which has important implications. Unclear questions mean that the interpretation of the responses will be unclear as well. The analyst cannot know which additional considerations, such as utter disbelief or perceptions of an inefficient policy approach, may have influenced the WTP responses.
Due to the peculiar way of asking questions, a significant portion of respondents typically refuse to respond or willfully misstate their preferences. This gives rise to yet another problem. Survey researchers need to make awkward choices about which responses to keep and which to sort out or recode. There is no obvious or correct way to make these choices. The consequence is added uncertainty and additional influence of methodological choices.

Lack of Powerful Validity Tests
The systematic influence of hypothetical prices, response options or response scales on WTP estimates is highly problematic for the credibility of policy-oriented survey research. Obviously, these influences would need to be routinely measured and reported for any useful judgments about survey validity. In applied research, however, this hardly ever happens. Powerful examinations of anchoring or scale effects as demonstrated in academic studies are not difficult to implement. However, survey researchers and the organizations commissioning studies do not seem to be interested in such tests. They appear to be happy obtaining "some number" [26].
What some studies do implement is a so-called scope test, which examines how sensitive the valuations are to the quantity or 'scope' of the policy in question (e.g., [27], cf. Example 2 below). The survey guidelines most commonly cited before 2017 explicitly required a scope test [28]. Unfortunately, the standard examinations are not very useful, as they cannot reveal whether or not the valuations are adequately sensitive to scope. In many cases, the survey researchers merely examine if there is any sensitivity to scope at all (cf. Example 2 below). A particularly prominent example is the scope test in a survey to assess the damage of the Deepwater Horizon oil spill [29][30][31].
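The weakness of the standard scope test can be illustrated with simulated data. In the sketch below (all numbers illustrative), one split sample values a programme and another values a programme ten times as large; the valuations respond to scope, but far less than proportionally. The significance-based test passes anyway, while a simple comparison of the WTP ratio with the scope ratio reveals the problem.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative split-sample data: the second group values ten times the scope
# of the first, yet states only about 1.3x the WTP.
wtp_small = rng.normal(50.0, 20.0, 300)
wtp_large = rng.normal(65.0, 20.0, 300)

# Weak test as commonly applied: any statistically significant difference
# counts as "sensitivity to scope".
t_stat, p_value = stats.ttest_ind(wtp_large, wtp_small)
passes_weak_test = p_value < 0.05      # passes despite near-flat valuations

# More informative: compare the WTP ratio with the scope ratio (here, 10).
wtp_ratio = wtp_large.mean() / wtp_small.mean()
print(passes_weak_test, round(wtp_ratio, 2))
```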

Flawed Survey Guidelines
The widely accepted guidelines for survey-based valuation research support the current practices [4]. Most importantly, they do not require researchers to examine anchoring or response-scale effects or even to acknowledge the possible distortions. Even studies by co-authors of the guidelines lack such tests or acknowledgments (e.g., [32] (cf. Example 1 below)). Study authors may neglect these uncertainties and report the estimates without further comment (see Example 3 below). Hence, the guidelines [4] do not support sound scientific standards but flawed best practices.
What the guidelines do require is a statistical analysis of the sampling error. This analysis can be highly elaborate, especially in highly ranked academic journals, while fundamental issues of the survey approach are ignored entirely.

Unsubstantiated and Misleading Claims
The case for the validity of survey estimates in Johnston et al. [4] and elsewhere involves unsubstantiated or misleading claims in field journals in which the reviewers are themselves proponents of the method. One key argument is that some comparisons with voting data show that the contingent valuation estimates are unbiased (e.g., [3] (p. 16); [4] (p. 371)). These comparisons, however, involve surveys that are not contingent valuation surveys but a kind of pre-election poll in which, due to the ongoing actual policy debate, the available information was very different from that in WTP surveys (e.g., [33] (p. 647); cf. [19] (p. 46)). Another unsubstantiated claim is that valuation questions following the preferred referendum format are incentive compatible [28,[34][35][36] or, more recently, that they can be made incentive compatible [2,3,23,37].

Editorial Power
The mentioned survey guidelines are published in a refereed journal. However, one of the authors of the guidelines is also an Editor of that same journal. The proponents of the current survey paradigm are also well represented among the chief editors of environmental economics journals more generally, including those of Environmental & Resource Economics, the Journal of the Association of Environmental and Resource Economists, the Journal of Environmental Economics and Policy, Resource and Energy Economics and the International Review of Environmental and Resource Economics (as of November 2020).

Example 1: Choice-Experiment Study Published in the Journal of Environmental Management
This study [32] exemplifies how flawed and misleading WTP research is funded by respectable organizations, carried out by respectable authors (including, in this case, a co-author of the current survey guidelines), and published by respectable journals. The study was financed and carried out by the Swiss Federal Institute of Aquatic Science and Technology (Eawag) and published in the Journal of Environmental Management.
The study mainly uses a choice-experiment approach but also reports estimates from open-ended WTP questions. A sample of the population in the Swiss canton of Zurich was asked about their willingness to pay for river restoration projects to be funded by taxpayers at the cantonal level. The mean WTP results for the restoration of river sections of one kilometer length along the rivers Thur and Töss were, respectively, CHF 144 and CHF 196 per person and year. The respective estimates from the open-ended WTP questions were CHF 52 and CHF 59. One important cause of these extremely high valuations is clear (cf. Section 2.2). The restoration projects were offered at extremely high prices, between CHF 25 and CHF 500 per person and year for restoration of small stretches of the rivers.
To scale up to the population, the authors used the values from the open-ended WTP questions which, unlike those of the choice experiment questions, did not relate to clearly specified restoration outcomes. (The reason they provide for using these values rather than those of the much more carefully executed choice experiment approach is that the estimates are "more conservative".) Extrapolation of these values to the entire population (and not only the rural population as preferred by the authors without any plausible reason) and subsequent division by the costs of the restoration projects (CHF 4.1 million and CHF 2.8 million, respectively) yields cost-benefit ratios of 288 and 502, respectively. The benefits of restoration are therefore 288 and 502 times higher than the costs.
A voter initiative in the canton of Bern provides a basis to examine how the survey estimates relate to taxpayer preferences. In 1997, the voters of the canton of Bern were called to vote on a proposal to spend CHF 3 million annually on river restoration projects. This amount corresponded to about CHF 5 per person and year (see [38,39] for a detailed description). The parliament opposed the measure. The voters approved of it by a narrow margin (54% yes). The river restoration issue has not fundamentally changed over the past twenty years. Another comparison of a WTP survey and a cantonal vote, with the survey only a few months before the vote, shows a similar discrepancy [40].
A validity test was not implemented in the survey. The presented sensitivity analysis generously overlooks the potentially most important source of error: the respondents' cognitive limitations and associated response heuristics. No comparison is made with the outcomes of past voting decisions, and there is no useful discussion of the high valuations. The authors simply explain them with a "high purchasing power" and "high level of environmental awareness" of the local population [32] (p. 1084).
When asked to comment on the high hypothetical prices used in the survey, the lack of validity checks, and other issues, one author of the study responded: "I am aware of the various biases that may enter WTP responses, but we tried to adhere as much as possible to existing international guidelines" (personal communication, e-mail, 10 November 2020). The response highlights the problematic role of the current guidelines (cf. Section 3.1 above).

Example 2: Meta-Analysis Published by the OECD (2012)
This study [27] illustrates how even authoritative government institutions apply low standards in their analysis of survey-based WTP. The authors of the OECD study compiled the available survey-based WTP estimates for reducing mortality risks in the areas of health, transport, and the environment. The estimates were screened for quality based on sample size (at least 200 respondents) and representativeness (only samples "representative of a broad population"). For each study, the authors also recorded whether it conducted a scope test (cf. Section 2.4) and, if it did, whether the valuations passed the test. The employed scope test was extremely weak, however. The estimates passed as long as there was any statistically significant effect of scope.
Based on the available dataset with a total of 856 estimates from 76 studies, only 85 estimates (from 10 studies) passed both the quality screening and the scope test. (The corresponding numbers for only European studies are: 5 estimates from 2 studies.) Since these numbers were so small, the authors ended up using the entire quality screened sample of 405 estimates (163 for Europe), including those which did not conduct or pass the scope test. Public policies in many OECD countries are now based on these values.

Example 3: Meta-Study Published by IPBES (2018)
This example is chosen to illustrate that even authoritative reports by intergovernmental organizations publish extremely misleading figures. The meta-study was published as part of a major report on the state of biodiversity and ecosystem services in Europe and Central Asia [41]. In the Summary for Policymakers, the first paragraph of the key messages states: "In Europe and Central Asia, which has an area of 31 million square kilometers, the regulation of freshwater quality has a median value of $1965 per hectare per year". Numbers for other ecosystem services follow. A table on page 209 of the 1151-page report (excluding supporting material) shows that the value of $1965 is the median value of only three estimates for the ecosystem service "Regulation of freshwater and coastal water quality". The three studies are not identified in the report. The responsible authors were contacted for this information. The data file was available three weeks after the first request.
As it turned out, the value of $1965 originates from a survey study that examined values of ecosystem services based on agricultural management options in a 10-m wide buffer strip along watercourses in a study region in Ireland [42]. The authors asked a sample of farmers how much compensation they demanded for managing the buffer strips without fertilizer. The resulting value is several times higher than typical agri-environmental payments for buffer strips in the European Union. How the compensation for agri-environment measures on buffer strips relates to the ecosystem services in question is not clarified. The validity of the estimate is not examined in the study, and it remains unclear how representative the figure may be of other areas in Europe and Central Asia. A request to the main author of the meta-study for comment on the validity and representativeness of the result (e-mail, 10 November 2020) was not answered (as of 4 January 2021).

Perceptions of the Current Survey Paradigm in the Scientific Community
The perceptions of the paradigm in the scientific community are only partly reflected in published work. The following paragraphs are therefore based partly on personal communications and subjective experience in this field of research.
Psychologist decision researchers are skeptical of the current methods. They have produced a large body of evidence on how a range of psychological phenomena influence survey-based WTP estimates [5,7,11,43]. Most of this research was conducted in the 1990s, and much of it was academic rather than applied to real-world policy issues and was therefore conducted with non-representative samples. Economist survey practitioners often argue that the biases demonstrated by the psychologists would be weaker or absent in well-designed applied work. Many psychologists have given up work in this field. A common perception is that the economist practitioners are not interested in their findings.
Environmental economists who use the current WTP survey methods in applied work defend the current survey paradigm with very few exceptions. In personal communications, however, the issues are widely acknowledged. Some of those who agree that the respondents may not have well-articulated preferences experiment with deliberative monetary valuation (DMV). In DMV, small groups of individuals participate in workshops to learn about policy issues before the survey-based valuation [44,45]. Even fewer researchers pursue alternative survey approaches with large, representative samples (including our own research, see Section 6).
Environmental economists who do not use surveys to value public goods and services tend to be skeptical of the method. In the 1990s, when the contingent valuation method became popular, some of them participated in the debate (e.g., [46][47][48]). Later on, only a few remained active (e.g., [30]), and no young researchers with other areas of specialization have joined the debate.
The broader economics profession remains skeptical (e.g., [16,26,49]). However, most economists do not know the specific issues. Their skepticism may be rooted mainly in traditional concerns about the collection of subjective data (cf. [50] (p. 132)) rather than in specific concerns about the current WTP approach.
Environmental scientists and policymakers are mostly ignorant of the issues. The responsibility for the quality of the numbers is delegated to the environmental economists who deliver the values (see Examples 2 and 3), referring to established survey guidelines (Examples 1 and 2).

Available Alternatives
Advocates of the current approach often claim that, when it comes to values of non-market goods and services, their approach is "the only game in town" [1,3]. There are, however, alternatives, and there are also promising avenues for future progress. Table 1 presents an overview of the issues described in Section 2 and how these issues have been addressed in alternative survey approaches. Regarding the issue of cognitive limitations, much can be learned from decisions in the political domain [51,52]. Valuation tasks can be made easier by providing arguments and issue positions of competing interest groups as available in actual political decisions [19,[53][54][55][56][57][58][59].
Regarding the issue of hypothetical prices, WTP questions about policy issues may specify credible prices that reflect individual tax factors such as income [53,56] or other factors [12]. In addition, the credibility of the scenarios can be empirically examined [24].
Regarding validity, powerful validity tests developed by psychologists [6,7,11] can easily be implemented in applied economic work [12,53,56,59]. If required in applied work, these tests would unlock a healthy competition for survey quality.
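One such validity test can be sketched concretely. In the split-sample design below (all numbers illustrative), respondents are randomized to a low or a high response scale for the same policy; if stated WTP tracks the scale, the responses anchor on the instrument rather than reflecting stable preferences, which is exactly the red flag a required test would surface.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Illustrative split-sample data: same policy, two randomized payment-card
# scales. A scale effect of this size signals anchoring on the instrument.
wtp_low_scale = rng.normal(20.0, 10.0, 250)    # small amounts shown
wtp_high_scale = rng.normal(60.0, 25.0, 250)   # large amounts shown

t_stat, p_value = stats.ttest_ind(wtp_high_scale, wtp_low_scale,
                                  equal_var=False)
scale_effect_detected = p_value < 0.05
scale_ratio = wtp_high_scale.mean() / wtp_low_scale.mean()
print(scale_effect_detected, round(scale_ratio, 1))
```

Reporting the estimated scale effect alongside the WTP figures, rather than the sampling error alone, is the kind of routine disclosure the current guidelines do not require.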
Another avenue is deliberative monetary valuation (DMV) approaches [45,[60][61][62][63]. Non-representative samples and concerns about the democratic legitimacy of information provision in the valuation settings, however, remain a challenge. In addition, the more conventional DMV approaches such as Lienhoop et al. [44] may suffer from some of the same issues as conventional WTP surveys [63].
Another interesting alternative in non-market valuation is the analysis of collective decisions [48], [64] (p. 48), [65] (p. 18), [66]. Willingness to pay information can be derived from voting decisions about tax-financed public goods [39,40,67]. Even financing decisions by parliaments may provide relevant information about the value of public goods, provided that certain institutional requirements are met [68] or public support can be established from auxiliary analyses of public opinion [69].
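The logic of deriving WTP information from voting decisions is simple and can be illustrated with the Bern referendum described in Example 1: when a majority approves a proposal with a known per-capita cost, the median voter's WTP must be at least that cost. The sketch below only formalizes this bound; it is not a full estimation procedure.

```python
# WTP bound from a referendum: 54% of voters approved a river restoration
# programme costing about CHF 5 per person and year, so the median voter's
# willingness to pay was at least the per-capita cost.
cost_per_person_chf = 5.0    # annual per-capita cost of the proposal
approval_share = 0.54
median_wtp_lower_bound = (cost_per_person_chf
                          if approval_share > 0.5 else 0.0)
print(median_wtp_lower_bound)   # CHF per person and year
```

With several votes at different cost levels, such bounds can be tightened into interval estimates, which is the approach taken in the cited analyses of collective decisions.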
The key to future innovation in the field of survey-based valuation is validity testing. These tests are the basis for much-needed competition and innovations. With a requirement for powerful validity tests, survey researchers will have a natural incentive to focus on survey approaches with clearly formulated and less cognitively demanding questions, which produce better results.

Conclusions
Contingent valuation research as currently practiced does not follow scientific standards but undermines them in important ways. Known issues and uncertainties of the estimates are generously overlooked. Clarity of valuation questions and mitigation of strategic responses are sacrificed for the statistical convenience of hypothetical prices. Standard economic assumptions of individual rationality are privileged over descriptive insights about the kinds of information people need to make decisions in the political domain.
The resulting WTP estimates are systematically influenced by methodological choices. Although this is well understood, these influences are rarely measured or appropriately discussed. Since the influence of the methodological choices is so well understood today, the approach is, at its core, paternalistic. WTP surveys as done today neither take citizens and their preferences seriously, nor are they scientifically sound. This applies even to the best of studies, since the problems are inherent in the currently established methods and guidelines.
The current survey paradigm threatens to undermine people's trust in economics, in science more generally, and in the political institutions that use the survey-based valuations. The approaches may even pose a threat to democracy, as they may provoke fundamental opposition to how political decisions are made by elites who bolster their preferred choices with the results of flawed approaches.
National and supranational ministries for environment, transport and agriculture and intergovernmental initiatives such as The Economics of Ecosystems and Biodiversity [70] or the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES) [41] would be well advised to reconsider their blind reliance on the well-published but nevertheless flawed standards and practices of a self-regulated survey industry.
The key to future progress is increased attention to the (lack of) validity tests in applied work as a driver of quality and innovation. However, the current state of affairs suggests that the needed attention is unlikely to come from the survey practitioners or from those who commission the surveys. Interventions would have to come from outside: from journalists and politicians who point out the embarrassing lack of validity tests and the fundamental uncertainty of the results. Such interventions would be most effective if they focused on authoritative studies that were commissioned and funded by large public agencies or other organizations that are publicly accountable.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.