Failure to CAPTCHA Attention: Null Results from an Honesty Priming Experiment in Guatemala

We report results from a large online randomised tax experiment in Guatemala. The trial involves short messages and choices presented to taxpayers as part of a CAPTCHA pop-up window immediately before they file a tax return, with the aim of priming honest declarations. In total our sample includes 627,242 taxpayers and 3,232,430 tax declarations made over four months. Treatments include: an honesty declaration; information about public goods; information about penalties for dishonesty; questions allowing a taxpayer to choose which public good they think tax money should be spent on; or questions allowing a taxpayer to state a view on the penalty for not declaring honestly. We find no impact of any of these treatments on the average amount of tax declared. We discuss potential causes for this null effect and implications for ‘online nudges’ around honesty priming.


Introduction
Improving the fairness, transparency, efficiency and effectiveness of tax systems will be critical to achieving the sustainable development goals [1]. Where countries such as Guatemala fail to collect tax revenue effectively from their citizens, there are consequences for the ability to deliver public services and, where tax compliance is patchy, general confidence in the government's competence and fairness may also be compromised. This natural field experiment involved priming honesty among Guatemalan taxpayers filing online declarations for income tax or value-added tax (VAT). Treatment messages were included as part of a pop-up CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) window on the tax declaration website (Declaraguate) immediately prior to individuals reaching a declaration form. The large-scale trial involves all people who completed an online tax return between August 2014 and January 2015 (627,242 individuals, completing a total of 3.2 million tax returns).
Randomised controlled trials in a natural context, commonly referred to as natural field experiments in the social sciences [2,3], are becoming the gold standard for evidence-based policy-making, and their increased use is being advocated in governments and organisations [4][5][6][7]. Field experiments encouraging honest tax reporting, in particular, have a long tradition among economists, psychologists and policy scholars [8][9][10][11][12]. Some have found that the mention of increased audits or penalties has a positive impact on tax compliance, e.g., [13][14][15]. These interventions are in part based on laboratory or survey experiments that have studied what structural factors promote contributions to public goods in general (e.g., [7,16,17]) and paying taxes in particular (e.g., [18,19]). Others have been able to prompt more compliance using behaviourally inspired messages (often referred to as 'nudges' [20]). For example, Hallsworth et al. [21] found in a large-scale field experiment in the United Kingdom that letters communicating the widespread social norm of paying taxes on time increased the fraction of taxpayers who paid their taxes on time. In a similar experiment in Guatemala, Kettle et al. [12] showed that behaviourally informed letters more than tripled receipts from late income tax filers. Perez-Truglia and Troiano [22] demonstrated in a field experiment that a 'shaming' policy can be effective in reducing tax delinquency. Taxpayers in our trial were randomly allocated either to see one of six messages or to be part of a control group, which saw the original CAPTCHA without any behavioural message. The trial finds no differential impact between the original CAPTCHA (which contains no behavioural prompts) and various prompts inspired by behavioural science.
Treatments included an honesty declaration, information about public goods paid for by taxes and punishment for noncompliance, and treatments that allowed participants to choose what they believe to be a good use of public funds, or an appropriate punishment. The results show that none of our treatments had a significant impact on tax declarations in this context. The fact that all of the six treatments were found to be ineffective (rather than some) supports the hypothesis that the setting in which the information was conveyed may have been crucial here, rather than the content of the messages.

Tax Regime Selection
The trial involves Guatemalan taxpayers making declarations for income tax and value added tax (VAT), of which there are two regimes for each. For income tax the default option is a gross income tax, which entails a six percent direct tax on gross revenue. This option is also associated with simplified accounting standards. The alternative is for taxpayers to self-select into a profits tax at the beginning of the year, which entails a tax rate of 28 percent (in 2014) on taxable income or profit.
The two regimes for VAT both entail a 12 percent tax on the value of products. They differ, however, in terms of eligibility and payment procedures. Taxpayers with an annual turnover of more than GTQ 150,000 (approx. $19,200 USD at the time of the experiment) are part of the general regime, while other taxpayers are part of the small taxpayers' regime. VAT is charged on sales of goods within the country, sale of services in the country, any imported goods, leasing contracts, transfers of real estate, and insurance and bond sales. Exports, banking activities, payments in-kind, mergers, trade in financial instruments, and trust arrangements are exempt from VAT. We consider taxpayers under both regimes.
The trial involves all Guatemalans who declare their tax online. For income tax and VAT general, online declaration is mandatory, while for VAT small taxpayers over 50 percent of declarations are made online. However, in conversations with government officials, we learned that a large fraction of Guatemalans are likely to be unregistered and thus not captured by the tax system. Our interventions will thus only affect the fraction of citizens who have already made an ethical decision to register for tax filing, although we note that this limitation is common to all tax compliance field experiments.
There are reasons to expect different treatment effects depending on the tax being declared. The profits regime is charged at a higher rate than the simplified regime and so the incentive to behave dishonestly may be higher (for a more thorough discussion of this, see Kettle et al. [12]). Individuals and businesses that complete this kind of return may, however, be more affluent, which in turn could reduce their incentive to be dishonest due to a lower marginal value of money, consistent with the model put forward by Allingham and Sandmo [23]. The simplified option is, as its name implies, a less complex form. This may lead to greater honesty as there are fewer dimensions across which it is possible to obfuscate tax liability. Additionally, the differences in timing for declarations may affect behaviour. For VAT small taxpayers, and the gross income tax regime, declarations are made monthly. These smaller and more frequent declarations may create more of an incentive to behave dishonestly as the declarations are so common, or less so as the payments are smaller.

Sample Selection and Random Treatment Assignment
Our initial sample involves the universe of taxpayers that made a declaration for any of the four tax regimes outlined above between August 2014 and January 2015. There are no restrictions to eligibility; however, due to missing observations for some variables (and in particular for the lagged value of declarations), our sample size for analysis is reduced from 715,190 participants to 627,242 participants. In total the trial involves 3,232,430 declarations by these 627,242 individuals. Table 1 shows the random assignment of individuals across the seven trial arms as well as summary statistics on demographics.
Randomisation is conducted at the individual level with no stratification. Randomisation was built into www.declaraguate.gt, the website for tax declaration owned by the Guatemalan Tax Authority (SAT), using a JavaScript random number generator that randomly draws an integer between 1 and 7 inclusive, using a standard (publicly available) algorithm. When an individual begins their tax return this provides an instruction to the declaration website to display one of the seven treatment arms. Randomisation is recorded by SAT as a field in the tax return data.
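As a rough illustration of the assignment step described above, the following Python sketch draws an integer between 1 and 7 inclusive and maps it to a trial arm. The production system used a JavaScript random number generator on the Declaraguate site; the function names, the ordering of arms and the use of Python here are assumptions made purely for illustration.

```python
import random
from collections import Counter

# Arm labels follow the paper; which integer maps to which arm is an assumption.
ARMS = {
    1: "Control",
    2: "Honesty Declaration",
    3: "Public Good",
    4: "Enforcement",
    5: "Choice Public Good",
    6: "Choice Enforcement",
    7: "Self-Select 'I Am Honest'",
}

def assign_arm(rng: random.Random) -> str:
    """Draw an integer in 1..7 (inclusive) and return the matching arm."""
    return ARMS[rng.randint(1, 7)]

def simulate_visits(n_visits: int, seed: int = 0) -> Counter:
    """Assign an arm afresh on every visit, with no stratification."""
    rng = random.Random(seed)
    return Counter(assign_arm(rng) for _ in range(n_visits))

counts = simulate_visits(70_000)
for arm, n in sorted(counts.items()):
    print(f"{arm}: {n}")
```

Because every visit triggers a fresh draw, a taxpayer who files several returns can see a different arm on each one, which is why first exposures and repeated exposures are analysed separately.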
Individual randomisation was chosen as it is the most statistically efficient mechanism for randomisation, in terms of both analysis and the quality of randomisation. In practice, this means that individuals submitting multiple returns within the trial period are treated by a randomly assigned treatment for each declaration. Contamination between individuals was not believed to be a significant enough issue to warrant clustering the randomisation at a level above the individual. As individuals are identified after treatment, individuals are assigned afresh on each individual visit. We discuss the effects of first-time exposure to any condition and the repeated exposure to multiple conditions separately.
Table 1. Descriptive statistics of our sample (mean and standard deviation). Column 1 shows the proportion of taxpayers who are women; column 2 lists the mean age in years; and columns 3-6 show the proportion of taxpayers in each of the four tax types.

Experimental Design and Treatments
Taxpayers are randomly assigned to one of seven arms: a control arm where taxpayers receive the original CAPTCHA from the Declaraguate website, and six adapted versions that entail viewing or interacting with a behavioural prompt in addition to typing the original CAPTCHA. The CAPTCHA appears after the taxpayer selects the tax form to fill in on the main Declaraguate website, and before the form page. Table 2 presents an overview of the seven trial arms, which we then describe in further detail below. The original Spanish versions of the treatment CAPTCHAs (and translations) are included in Appendix A. The 'customer journey' of taxpayers through the website is included in Appendix B (Figures A7-A9, i.e., the website and forms between which the CAPTCHAs appear).
The treatment arms are placed directly below where the characters are typed for the CAPTCHA, but above the 'fill form' button which the taxpayer must press to continue to the next form. Some of the treatment CAPTCHAs involve specific actions that the taxpayer has to complete before they are able to press the 'fill form' button. These are described further below.
Table 2. Summary of treatment arms and behavioural prompts. Number of forms submitted (n) is cumulative over the entire trial period.

Message and Procedure
Control Group (number of forms submitted = 585,872): Typical CAPTCHA website design. Box states 'please type the characters that you see in the picture'. Once the characters from the picture are typed, the taxpayer is then able to click on the 'fill form' button to take them to the declaration form.
Honesty Declaration (number of forms submitted = 529,397): Includes an honesty declaration that translates as: 'Declaration: I will fill out this form honestly. Please sign your name to confirm this declaration'. The taxpayer must then enter their name in a box below this statement before being able to press the 'fill form' button.
Public Good (number of forms submitted = 573,676): Includes an image of the Guatemalan flag, and the following public good message: 'In 2013 your taxes helped pay for schools, hospitals and policemen.' Gives the taxpayer a choice of public goods that they would like to see tax money spent on: 'Please choose what you want us to direct your tax money to:' The taxpayer must choose one of the options: schools; hospitals; or policemen, before they are able to press the 'fill form' button.
Choice Enforcement (number of forms submitted = 539,542): Gives the taxpayer a choice of the punishment that they think people should receive for fraudulently declaring their tax: 'Please tell us what you think should happen to people who fill out their forms dishonestly'. The taxpayer must choose one of the options: pay a fine; confiscate your assets; or go to jail, before they are able to press the 'fill form' button.
Self Select 'I am Honest' (number of forms submitted = 545,168): Allows the taxpayer to self-select into being honest: 'Which of the following do you identify with?' The taxpayer then has to select one of the following two options: 'I am an honest taxpayer who declares truthfully' or 'I'm a busy taxpayer who declares quickly'.

Control Group
This CAPTCHA is the original used in the Declaraguate website. On the website the CAPTCHA appears as a pop up box. The text states 'please type the characters that you see in the picture'. Below this is a picture containing jumbled but legible characters which need to be typed out by the taxpayer. Once the characters from the picture are typed, the taxpayer is then able to click on the 'fill form' button to take them to the declaration form.

Honesty Declaration
This CAPTCHA includes an honesty declaration that translates as: 'Declaration: I will fill out this form honestly. Please sign your name to confirm this declaration'. The taxpayer must then enter their name in a box below this statement before being able to press the 'fill form' button. The treatment can be seen in Figure A1.
The Honesty Declaration CAPTCHA is based around the idea that most people see themselves as honest and act according to the ethical codes of society, but need reminders that these ethical codes should apply to all their decision-making. For instance, research has found that people are more honest when they are asked to recall the Ten Commandments just before making decisions where they have an option to be dishonest [24].
Specifically in the context of filing forms, Shu et al. [25] show that moving an honesty declaration to the start of a car insurance form increases truthful reporting of mileage driven each year. The intervention was simple but very effective: while traditionally people would fill out their annual insurance form and then sign their name at the bottom of the form, researchers moved both the honesty declaration and signature box to the top of the insurance form before they would fill it out. The intervention increased the number of miles declared by 10%. Other studies, for example Bhanot [26], who investigated the effect of pledges on loan repayment rates, found somewhat smaller effects, suggesting that this effect may have limitations.
The 'honesty declaration CAPTCHA' applies the same considerations to tax declarations. We hypothesise that when participants sign their full name below an honest declaration, their moral code is made more salient and hence they will be more likely to comply with it.

Public Good
This CAPTCHA involves an image of the Guatemalan flag and the following public good message: 'In 2013 your taxes helped pay for schools, hospitals and policemen.' The treatment can be seen in Figure A2.
The Public Good CAPTCHA draws on the 'non-deterrence' approach to taxation: that citizens are fundamentally predisposed to cooperate with tax authorities. The approach contends that taxpayers' decisions are not based on utility maximisation alone, but are influenced by morality, social norms and public goods, amongst other concepts [11,27,28,29]. In this view, the relationship between tax authority and taxpayer is basically cooperative, with the latter asking 'what should I do?' rather than 'what can I get away with?' [30]. The approach we take here is similar to Hallsworth [17].
In terms of administrative policy, the non-deterrence approach recommends that taxpayers are treated fairly and respectfully. They should be provided with clear and helpful information in order to make decisions. The authority should create a helpful service that makes it easy to comply. The route to reducing non-compliance is to persuade taxpayers by emphasizing that tax compliance is an ethical activity that is practised by the great majority of people, and that it creates valued public goods [28]. Following from this, the CAPTCHA is designed to prime taxpayers to think about the public goods that tax money is spent on. We hypothesise that people who consider the benefits of their own contribution will weigh this more heavily, relative to their own private consumption than they would do otherwise, and so will be more likely to make an honest declaration.

Enforcement
This CAPTCHA involves an image of a gavel and the following text: '5060 taxpayers in early 2014 had legal proceedings for breach of their tax obligations.' The treatment can be seen in Figure A3.
The Enforcement CAPTCHA is based on the idea that signals of punishment enforcement can be a powerful deterrence against unlawful actions [31]. Knowledge that others have been caught for a crime can lead people to overestimate the likelihood that they could be the next to be caught [32].
Enforcement messages to increase taxation are based on the original economic model used to analyse tax compliance developed by Allingham and Sandmo [23]. The model sees taxpayers as rational utility maximisers and suggests that a taxpayer's decision of whether to pay or evade tax is based on the trade-off between the monetary cost of complying and the expected cost of evading. The expected cost of evading is in turn based on the probability of getting caught, the probability of punishment if caught, and the punishment for which they would be liable. Accordingly, this 'deterrence' model predicts that the way to deter tax evasion is through increased sanctions for noncompliance.
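The trade-off described above can be sketched in a few lines of Python. This is a deliberately simplified, risk-neutral reading of the paragraph, not the full expected-utility model of Allingham and Sandmo [23]; the function names and all numerical values are hypothetical and chosen only to show how a larger sanction flips the decision.

```python
def expected_cost_of_evading(p_caught: float,
                             p_punished_if_caught: float,
                             penalty: float) -> float:
    """Expected cost = P(caught) * P(punished | caught) * penalty."""
    return p_caught * p_punished_if_caught * penalty

def evasion_pays(tax_saved: float,
                 p_caught: float,
                 p_punished_if_caught: float,
                 penalty: float) -> bool:
    """Risk-neutral rule: evade only if the tax saved exceeds the expected cost."""
    return tax_saved > expected_cost_of_evading(
        p_caught, p_punished_if_caught, penalty
    )

# Hypothetical values, not estimates from the trial.
low_sanction = evasion_pays(tax_saved=1_000, p_caught=0.05,
                            p_punished_if_caught=0.5, penalty=10_000)
high_sanction = evasion_pays(tax_saved=1_000, p_caught=0.05,
                             p_punished_if_caught=0.5, penalty=50_000)
print(low_sanction, high_sanction)
```

Under these illustrative numbers, raising the penalty (or either probability) increases the expected cost of evading until compliance becomes the cheaper option, which is the deterrence prediction the Enforcement CAPTCHA tries to make salient.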
The enforcement CAPTCHA draws on making enforcement more salient to the taxpayer. Similarly to the above, our hypothesis is that information on potential costs makes these costs more salient and so adjusts relative weights of private consumption and taxpaying for the participant. Enforcement messages on tax letters have been shown to be effective in a number of countries including Argentina and Venezuela [33,34].

Choice Public Good
This CAPTCHA gives the taxpayer a choice of public goods that they would like to see tax money spent on. The text of the CAPTCHA reads: 'Please choose what you want us to direct your tax money to:' The taxpayer is then given the choice of selecting 'schools', 'hospitals', or 'policemen'. The treatment can be seen in Figure A4.
The design of this CAPTCHA incorporates a choice design with a public goods message. Research shows that people like to have a say in the design of a product or policy [35]. It can thus be effective to allow people to give an opinion on the cause towards which their tax money should go. This hypothesis is supported by Lamberton, De Neve & Norton [36] who find that participants who have had a chance to express non-binding preferences are 16% more likely to comply with tax instituted in a laboratory setting.
The intervention aims to prime the taxpayer into thinking about the public goods that tax goes towards, giving them the option to voice their preference. Our hypothesis is that even a cursory sense of control over public spending provides a sense of ownership and hence leads to taxpaying being more highly valued. The number of choices is limited to three in order to avoid potential cognitive exhaustion before filling out the form [37].

Choice Enforcement
This CAPTCHA gives the taxpayer a choice of the punishment that they think people should receive for fraudulently declaring their tax: 'Please tell us what you think should happen to people who fill out their forms dishonestly'. The taxpayer is then given the choice of selecting 'pay a fine', 'confiscate your assets', or 'go to jail'. The treatment can be seen in Figure A5.
This design is based on the same premise as above but incorporates punishment choices rather than public good choices. This CAPTCHA aims to prime the taxpayer into thinking about what should happen to people who do not pay their tax honestly.
The message also aims to invoke cognitive dissonance [38] in taxpayers who would have filled out the form dishonestly. We hypothesise that participants will suffer from cognitive dissonance due to the clash between their beliefs that dishonest declarers should be punished, and their dishonest actions. Under this hypothesis, participants would seek to resolve this dissonance by bringing their actions in line with their beliefs.

Self-Select 'I Am Honest'
This CAPTCHA allows the taxpayer to self-select into being honest. The CAPTCHA reads: 'Which of the following do you identify with?' The taxpayer then has to select one of the following two options: 'I am an honest taxpayer who declares truthfully' or 'I am a busy taxpayer who declares quickly'. The treatment can be seen in Figure A6.
The principle applied in the Self Select 'I am honest' CAPTCHA is that individuals are asked to voluntarily declare their identity. The signals in our prompts are costless; thus choosing an option that signals a positive and honest indication (despite whatever declaration is actually made) should be expected of all individuals. On the one hand, this intervention might trigger an identity of an honest taxpayer [39][40][41][42]. Benjamin et al. [43] find that triggering identity as a member of a religious group has significant impacts on the tendency to donate to the public good, but that this varies depending on the group identity being invoked. On the other hand, self-selection of the other (non-honesty-promoting) option is based on research that has found that people may reveal their identity (or 'type'), even when there is no need to do so and that self-identification is not beneficial [44].
People are asked which of the two categories they would put themselves in: one is clearly positive while the other is not overly negative. The latter category is more likely to be chosen by people who would want an excuse for not filing their taxes properly and would not easily be moved by a nudge to act honestly [45]. However, everyone else will choose the former category, and they not only self-select into this category but are also nudged into being honest (as in the 'honesty declaration' CAPTCHA above).
Conversely, however, the potential for 'adverse nudging' could apply to those who choose the second category: they might feel more justified in making a dishonest declaration. This might open the door for increased fraudulent behaviour, but might equally be offset by more targeted audits of those who self-select into the second category. It enables law enforcement to more easily identify people who might not have filed their taxes completely honestly as data will be gathered on those that self-categorised as those with too little time to do their taxes correctly to avoid the 'honesty' nudge.
The specific writing of this design also involves a further insight. Recently, evidence has emerged that subtle linguistic cues can have a substantial impact on our maintenance of our self-image and, consequently, on our ethical decision-making. In a study by Bryan et al. [46], the authors found that the phrase 'I am a voter' is more effective than 'voting' in inducing voter turnout because the first one feels to people like an association of their self-image (not just an act they do).
Similarly, Bryan et al. [47] find that people identify themselves with 'I am' more than non-identity invoked primes in the context of cheating. In three experiments people were asked not to cheat on a task. In the instructions cheating was either referred to in the context of an actor's identity ('Please don't be a cheater') or using language that referred to the action ('Please don't cheat'). People who had been asked not to be a cheater were less likely to cheat compared to those where the action of cheating was mentioned.
For this treatment, we hypothesise a similar cognitive dissonance to that described in the previous treatment. In addition, we hypothesise that participants will be primed with a sense of identity as an 'honest' individual, to which they will feel the desire to conform.

Outcome Variables
This trial was pre-registered in the American Economic Association's Trial Registry (#AEARCTR-0000424) in advance of trial execution. The publicly available trial protocol included the main analysis plan, including the primary outcome variables, the regression model, and power calculations for sample size estimation based on our estimates of how the population might respond to the interventions.
The objective of this trial can be framed in one of two ways: to increase tax revenues in Guatemala, or to encourage more honesty in the declaration of tax liabilities. The first outcome variable is a continuous non-negative number of tax declared, while the latter is a binary variable that is 1 if a citizen declared any non-zero tax, and 0 otherwise. Total tax liability declared is our primary outcome measure, followed by any tax declared in our secondary analysis.
Following Shu et al. [25], we assume that if a participant is reporting their tax liability dishonestly (as opposed to making an error), they will systematically understate their liability relative to the truth. We further assume that misreporting through error is statistically similar to classical measurement error, and so is randomly distributed around the true value according to some unknown distribution. We assume that this error is (a) classical, (b) orthogonal to treatment, and (c) does not interact with treatment, i.e., that the measurement error does not affect treatment. If (a) does not hold, we may experience a loss of power, and any treatment effect detected would be a lower bound on the true effect size; (b) holds by design of the trial, and the validity of (c) is irrelevant if (b) holds.

Sample Size Estimation
Based on previous research [12,17] we estimated the possible effect of our interventions to be around a 2% increase in total tax liability declared. We estimated the sample standard deviation to be 369.87 Quetzales (US $48.28) and based our sample size calculations on 80% power. Given these estimates, we arrived at a total sample size of 60,087 participants per arm, for a total minimum sample size requirement of 420,000 participants, although our ultimate sample size was larger. We are powered to detect effects of between 1 and 2% increases in tax declared (the sample size requirement for 1% is 241,000 per arm).
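The scaling between the two sample size figures above can be checked with a standard two-sample formula. The sketch below assumes a two-sided test at the 5% level (the significance level is not stated in the text) and uses a hypothetical absolute effect of 6 Quetzales to stand in for a 2% increase, since the mean declaration used in the original calculation is not reported here. Only the standard deviation and the 80% power come from the text.

```python
from statistics import NormalDist

def n_per_arm(sd: float, delta: float,
              alpha: float = 0.05, power: float = 0.80) -> float:
    """Per-arm sample size for a two-sided comparison of two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2

SD = 369.87        # Quetzales, from the text
DELTA_2PCT = 6.0   # hypothetical absolute size of a 2% effect (assumption)

n_2pct = n_per_arm(SD, DELTA_2PCT)
n_1pct = n_per_arm(SD, DELTA_2PCT / 2)
print(round(n_2pct), round(n_1pct), n_1pct / n_2pct)
```

Whatever absolute effect is assumed, halving it quadruples the required sample, which matches the roughly fourfold gap between the 60,087 and 241,000 per-arm figures quoted above.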

Estimation
The outcome measure for this trial is the total tax liability declared by individuals. Although the outcome measure is relatively straightforward, the structure of the trial and data present potential complications, which we seek to address here.
As described above, participants taking part in the trial will be subject to at least one of four different forms of tax. The rates and levels of these taxes are different, and so, as discussed above, we may see differential impacts of our treatments by tax type. As such, pooled analysis across all four tax types is likely to be statistically inefficient, as the increased variance due to pooling is likely to be more detrimental to statistical power than the increased number of observations is beneficial. Hence, for each model specified below, we produce four results, one for each type of tax in our trial. Primary analysis is conducted only on 'clean' observations, that is, where the observation is the first time an individual has received any treatment. Primary use of all four months' data risks underestimation or overestimation of treatment effects due to repeated exposure and ordering effects. Given the dynamic structure of our data, we remain confident that conservative effect sizes may still be detected.
Our primary model of interest is:
$$Y_{it} = \alpha + \beta_1 Y_{it-x} + \beta_2 C_{it} + \beta_3 X_i + \beta_4 t + \beta_5 d_{it} + u_i$$
where $Y_{it}$ is our outcome measure, the amount of tax declared by individual $i$ in month $t$; $\alpha$ is a constant term, capturing the amount of tax declared by participants in the control group in the first month contained in the data; and $Y_{it-x}$ is a vector of lagged values of the outcome measure. Where participants have failed to make declarations in a given past period, these are set to 0. $C_{it}$ is a vector of binary treatment variables, one for each of our treatment CAPTCHAs (i.e., excluding the control CAPTCHA), set to 1 if a participant sees that CAPTCHA and 0 otherwise; $X_i$ is a vector of time-invariant participant characteristics, including age, gender and region (see Appendix C, Tables A1-A3, for the balance check of the control variables); $t$ is a linear time trend in trial-months; $d_{it}$ is the day within-month that the declaration is received; $u_i$ is the error term; and $\beta_1$ to $\beta_5$ are the corresponding coefficients.
Table 3, columns 1 to 5 show the parameter estimates for the Intention-to-Treat (ITT) model described above. The results show no significant impact of any of our treatments on the amount of tax declared by individuals. For reference, the mean level of the outcome measure in the control group for all participants is included at the bottom of the table. Thus, using the model specified in our trial protocol, we find no impact of any of the treatment conditions. All of the secondary analysis below falls outside of our original protocol and so results are treated accordingly.
Table 3. Linear regression ITT estimates of treatment impacts on tax declaration (in Quetzales) with the control condition as baseline. First exposure to any treatment only. Robust standard errors in parentheses.
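To make the logic of the ITT comparison concrete, the sketch below estimates treatment effects on simulated data in which no arm has any true effect. It is a simplification of the model described above: it omits the lagged outcomes, participant characteristics, time trend and day-of-month terms, in which case the OLS coefficient on each treatment dummy reduces to the difference between that arm's mean and the control mean. The arm names, the outcome distribution and the sample size are all hypothetical.

```python
import random
from statistics import mean

ARMS = ["control", "honesty", "public_good", "enforcement",
        "choice_public_good", "choice_enforcement", "self_select"]

def simulate(n: int, seed: int = 1):
    """Simulated first-exposure declarations under a true null effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        arm = rng.choice(ARMS)
        # Hypothetical outcome: non-negative tax declared, identical in every arm.
        tax = max(0.0, rng.gauss(300.0, 370.0))
        rows.append((arm, tax))
    return rows

def itt_estimates(rows):
    """Difference in mean tax declared between each treatment arm and control."""
    by_arm = {arm: [] for arm in ARMS}
    for arm, tax in rows:
        by_arm[arm].append(tax)
    control_mean = mean(by_arm["control"])
    return {arm: mean(values) - control_mean
            for arm, values in by_arm.items() if arm != "control"}

estimates = itt_estimates(simulate(70_000))
for arm, est in estimates.items():
    print(f"{arm}: {est:+.2f}")
```

Adding the lagged outcome and other covariates, as the specified model does, leaves the interpretation of the treatment coefficients unchanged under random assignment but typically tightens their standard errors.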

Secondary Analysis-Full Sample
Despite our expectation that people filing different tax types would respond differently, we do not see any differences across tax types. Thus, in Table 3 column 5, we pool tax types in order to see if increasing our power allows us to detect any effects. The results show that the parameter estimates all remain insignificant other than public good choice, which becomes negative and significant at the five percent level. This result is unlikely to be a robust effect as we now have 30 treatment condition parameter estimates, making a false positive likely.
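The concern about 30 parameter estimates can be quantified with a back-of-the-envelope calculation. The sketch below assumes the 30 tests are independent, which they are not strictly (estimates within a column share a control group and the pooled column reuses the same data), so the figure is indicative only. The function names are hypothetical.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across n independent tests under the null."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests

fwer = familywise_error_rate(30)
threshold = bonferroni_threshold(30)
print(f"{fwer:.3f}")       # 0.785
print(f"{threshold:.5f}")  # 0.00167
```

With six treatments across five columns, at least one coefficient crossing the five percent threshold by chance alone would be expected more often than not, so a single significant estimate carries little evidential weight.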
Next we conduct further analysis using the full dataset available to us. That is, instead of looking at first exposures to the treatment conditions only, we expand our analysis to all interactions of individuals with the treatment conditions. This gives us over 400,000 observations for three out of the four tax types and over 3 million observations when we pool tax types. Table 4 shows the results of this analysis. The results again show no impact of any of our treatments on tax declarations. We therefore have strong evidence that none of our interventions have had an impact on the amount of tax declared by individuals. It is perhaps worth noting, however, that in our most highly powered analysis (column 5 of Table 4), which makes use of all observations of all tax types, the most effective treatment, signing first (the Honesty Declaration), is statistically significant at the 10% level in a two-tailed test (5% in a one-tailed test), although in samples of this size it is far from conclusive.
Table 4. Linear regression ITT estimates of treatment impacts on tax declaration (in Quetzales) with the control condition as baseline and control variables. Full sample of all tax declarations. Robust standard errors clustered at the individual level in parentheses.

Impact of Treatments on Propensity to Declare
Finally, we consider that our treatments may have an impact on the propensity to declare at all. We therefore run regressions with a binary variable as our outcome variable. Similar to our primary results, we find no impact of any of our treatments on the propensity to declare for each individual tax type. Table 5 shows the coefficients of our treatments on the propensity to declare with our taxes pooled, suggesting there was no impact on the propensity to declare. The only exception is that 'enforcement choice' during first exposure only has a negative impact of 0.5 percent on the propensity to declare. The combination of the relatively small effect size and the number of statistical comparisons necessary in this trial make it likely that this is a spurious effect.
Table 5. Treatment impacts on propensity to declare a non-zero amount; all tax types are pooled.

Discussion
The results of this large-scale field experiment show that none of our treatments in this context had a significant impact on tax declaration. Interventions that have been shown to be successful elsewhere are found to have no impact on the amount declared in this context. Given the previous successful replication of various honesty declarations in a different field context (e.g., [26,48]), it is important to consider what contextual factors could have caused this failure to replicate.
We speculate that this could be for a number of reasons. First, the interventions were not a part of the form themselves (as, for example, they were in Shu et al. [25]). As we were not able in this trial to alter the forms themselves, the interventions were placed in a pop-up window as part of a CAPTCHA. This could have meant that the messages were too far separated from the filling in of the tax declaration, which appears to be in keeping with the findings of Bhanot [26]. Second, the setting of the CAPTCHA itself could have negated the impact of the messages, as users completing a CAPTCHA may simply ignore the extraneous prompts in a bid to progress to the main form. This could have made our treatments seem like a task to be completed rather than an invitation to think about their content. The fact that all six treatments were found to be ineffective (rather than some) supports the hypothesis that the setting in which the information was conveyed was crucial here, rather than the content of the messages. Third, online nudges in some settings, including honesty primes with e-signatures, have been shown to be ineffective in prompting honesty [49]. Fourth, the Guatemalans who choose to declare their tax might already do so honestly. In Guatemala there are limited consequences for not paying tax; failure to declare is highly unlikely to result in prosecution. The taxpayers who do declare, and who therefore form our sample, may be the subset of Guatemalans who declare their taxes more honestly than the rest of the population, whom this online intervention did not reach. Fifth, the intervention designs might simply not have had an impact on honesty in this context. Further research will be needed to investigate to what extent and under what circumstances the interventions studied here do or do not promote honest tax declarations.
It is important to place this result in the context of the wider and growing behavioural science literature around tax compliance. Much of this literature is concerned with either binary payment/declaration decisions, or the timeliness of those decisions (for example Hallsworth et al. [50]), or with honesty measured after an audit. Our paper differs from these in two respects. First, as stated above, the auditing regime in Guatemala, as well as the consequences for non-compliance, are very weak, and so follow-up to measure truth-telling is not possible. Second, the context of this study means that it takes place 'downstream' of papers such as Hallsworth et al. [50], and the binary compliance decision has already taken place. We therefore consider our interventions to be an application of the findings of Shu et al. [25], who in their field experiment consider the decision whether or not to lie on car insurance forms, and hence to be arguably more closely related to the substantive literature on honesty in behavioural science than to the tax compliance literature. As with Shu et al. [25], the (lack of a) monitoring regime requires the assumption that people systematically lie in a direction that aligns with their financial incentive not to pay tax, but we believe this assumption not to be onerous.

Conclusions
Our null results have important implications for both academic researchers and practitioners. While previous research on tax compliance [13,18,50] has shown that messaging and framing around tax choices can be an effective means of promoting more honesty in tax declarations, our experiment demonstrates that previous findings cannot simply be applied in any context. Differences between countries, institutions, and settings matter and require careful consideration before similar interventions are scaled up in a new context. For instance, our setting differs notably from previous studies in that the interventions to increase tax compliance were delivered online, not in a letter as in past research. Work by Chou [49] suggests one potential reason for the failure of our sign-before intervention: signing one's name digitally does not have the same symbolic meaning to signers and consequently does not invoke one's identity in the same way. More research is required to understand which interventions work, and how they work (or occasionally fail to work), in real policy contexts. Our experiment also highlights the importance of testing before implementing a new policy idea, providing more evidence for the need for evidence-based policy-making [5].
Finally, publishing null results is becoming increasingly common in psychology, economics, policy and other disciplines. Doing so helps reduce the file-drawer problem [51] and adds to our understanding of the tested interventions and their boundary conditions [52]. All in all, this helps evidence-based policy-makers make more informed decisions.

Appendix C. Balance Checks
The primary assumption of randomisation is that it creates balance on both observable and unobservable covariates. Given that randomisation was conducted by an automated process, in real time, on the tax website of Guatemala, in this appendix we present a basic analysis of the balance of covariates prior to the experiment.
The tables below estimate a simple regression model, in which baseline tax payments (prior to the experiment), age and gender are each regressed in turn on the full set of binary treatment variables. In each table, column 1 assesses these variables for the first time an individual is treated, while column 2 considers them for the full sample of observations.
Table A1. Balance of treatments with respect to age.
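When a covariate is regressed on a constant and a full set of mutually exclusive treatment dummies with the control condition as baseline, each coefficient equals the difference between that arm's mean and the control mean. The sketch below uses this equivalence to illustrate what a balance check computes. It is not the code behind the balance tables: the function name, arm labels and ages are invented for illustration, and standard errors are omitted.

```python
from collections import defaultdict
from statistics import mean


def balance_coefficients(covariate, arm, control="control"):
    """Hypothetical helper: coefficients from regressing a covariate on
    treatment dummies, computed as arm means minus the control mean."""
    groups = defaultdict(list)
    for value, label in zip(covariate, arm):
        groups[label].append(value)
    base = mean(groups[control])
    return {label: mean(values) - base
            for label, values in groups.items() if label != control}


# Invented ages and arm labels, for illustration only.
ages = [30, 40, 36, 38, 50, 20]
arms = ["control", "control", "t1", "t1", "t2", "t2"]
print(balance_coefficients(ages, arms))
```

Under successful randomisation these differences should be close to zero, apart from sampling noise, for every pre-treatment covariate.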
