Animal-Assisted Interventions for the Improvement of Mental Health Outcomes in Higher Education Students: A Systematic Review of Randomised Controlled Trials

Background: The aim of this systematic review was to evaluate the effectiveness of Animal-Assisted Interventions (AAIs), particularly Animal-Assisted Therapy (AAT) and Animal-Assisted Activity (AAA), in improving mental health outcomes for students in higher education. The number of students in higher education reporting mental health problems and seeking support from universities’ student support services has risen over recent years. Therefore, providing engaging interventions, such as AAIs, that are accessible to large groups of students is attractive. Methods: MEDLINE, PsycINFO, Embase and the Cochrane Library were searched from their respective inception to the end of April 2020. Additionally, a grey literature search was undertaken. Independent screening, data extraction and risk of bias assessment were completed, with varying percentages, by two reviewers. Results: After de-duplication, 6248 articles were identified, of which 11 studies were included in the narrative synthesis. The evidence from randomised controlled trials suggests that AAIs could provide short-term beneficial results for anxiety in students attending higher education, but with limited evidence for stress, and inconclusive evidence for depression, well-being and mood. For the non-statistically significant results, the studies either did not include a power calculation or were under-powered. Conclusions: Potential emerging evidence was found for the short-term benefits of AAIs for anxiety, and possibly stress, for students in higher education.


Introduction
Attending higher education commonly represents a major life transition for young people, with it often being the first time living away from the family home, which can bring social, financial and academic stressors [1,2]. The true prevalence of mental health problems for students in higher education is hard to estimate accurately. For example, in the UK, there is a scarcity of large-scale studies that are truly representative of the UK student population in higher education or that apply a weighted adjustment to accommodate for the lack of representativeness [3,4]. However, the number of students in higher education disclosing mental health problems and accessing higher education institutions' (HEI) support services has risen in recent years [1]. Disclosure and requesting support can result in long waiting lists for more traditional individualised therapy sessions, while stigma around seeking help for mental health and well-being is still present [5,6]. Therefore, a possible solution may be the provision of interventions aimed at reducing stress and anxiety as well as boosting mental health and well-being that are appealing, effective, and accessible to large groups of students [6]. In this respect, part of the solution could be Animal-Assisted Interventions (AAIs).
• emotional contagion (transmitting the animal's positive emotions onto humans)
• facilitating social interaction
• opportunities for reinforcement (by partaking in pleasurable activities and experiencing positive emotions)
• evoking expectations that participation will reduce stress (expectancy that the intervention will work)

Beck [22] describes the human-animal bond as rooted in evolutionary as well as physiological and psychological processes, with significant health benefits for both humans and animals. Furthermore, the importance, in the psychosocial model, of social support for health, and how social support can function as a buffer against stress, are also relevant [23,24]. The animal-human bond can be considered a type of social relationship, which can offer this type of support. Some individuals may form an animal-human bond more readily than a human-human bond, as animals are considered to be indifferent and non-judgemental towards an individual's appearance, social skills or socioeconomic status (SES) [25]. Furthermore, the Biophilia Theory proposes that humans are drawn to interact with animals due to an innate desire to connect with living organisms and nature [25,26]. Additionally, distraction as a cognitive refocus may also contribute, though this research has mostly focused on anxiety and pain whilst awaiting or receiving medical treatment [27].
Interestingly, the prevalence of programmes using AAIs at HEIs has increased; for example, by 2015 over 900 existed in the USA [6]. In the UK, these types of programmes have also risen in popularity, with various forms being implemented, ranging from one-day events to specific sessions [28][29][30][31][32]. Additionally, the evidence-base for using AAIs with this population is growing, with persuasive descriptive and anecdotal reports of the benefits [2,6,33,34]. Over recent years, more randomised controlled trials (RCTs) have been published, evaluating the effectiveness of AAIs in respect of various outcomes for students in higher education [35][36][37]. Nonetheless, a lack of completed systematic reviews for AAIs and this specific population exists. Therefore, to our knowledge, no completed systematic reviews were identified that primarily evaluate the effectiveness of AAIs, particularly AAT and AAA, in improving mental health outcomes for this population.

Aim and Objectives
The aim of this systematic review was to evaluate the effectiveness of AAIs, particularly AAT and AAA, in improving mental health outcomes for students in higher education.
The objectives were to:
• systematically search and critically appraise the relevant published and unpublished literature on the effectiveness of AAIs, particularly AAT and AAA, in improving mental health outcomes for this particular population.
• provide evidence-based recommendations for policy, practice and further research.

Protocol and Ethics
A scoping review defined the focus of this systematic review by identifying gaps in the literature. This included searches for published literature in MEDLINE, The Cochrane Library, PsycINFO and the Campbell Collaboration, as well as PROSPERO and the Joanna Briggs Institute's Systematic Review Register. The protocol for this systematic review was peer-reviewed and registered on the PROSPERO database on 25 June 2020 (registration number CRD42020186541) [38]. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed [39]. Ethics approval was not required.

Search Strategy
The search strategy was independently peer-reviewed by both an information specialist and an experienced librarian at Newcastle University. The full search strategies are included in Appendix A. The search strategies were not limited by year, study design, language or publication status. MEDLINE, PsycINFO, Embase and the Cochrane Library (including the Central Register of Controlled Trials (CENTRAL), the Cochrane Database of Systematic Reviews and Cochrane Clinical Answers) were searched from their respective inception to week three of April/week commencing 27 April 2020. Additionally, an Advanced Google search (using the first four pages, due to Google sorting by relevance) was completed during the week ending 1 May 2020, as well as a further search in the PROSPERO database. To identify additional studies, the reference lists of all full manuscripts meeting the eligibility criteria were reviewed, and a "cited in" search was undertaken using the Science Citation Index/Science Citation Index Expanded via Web of Science.

Inclusion and Exclusion Criteria
A description of the inclusion and exclusion criteria, according to PICOS (Population, Intervention, Comparator, Outcome and Study design), has been provided below and summarised in Table 1 [40]. During study selection, no restrictions for study geographical location, date or language were applied.
The population was students in higher education with no age, course or location restrictions. Higher education was operationalised as delivered beyond secondary school leading to a degree [46]. Studies that only included students with an already established diagnosis were excluded, as this would have substantially affected the clinical heterogeneity of the studies being compared. Consequently, the intervention's true effect might have been affected by differences in the population and not the intervention itself, thereby potentially compromising the generalisability of the results [40].
Differences regarding the definitions, corresponding terminology and operating practices of the various types of AAIs can lead to difficulties assessing and comparing the interventions [7][8][9][47][48][49]. AAA and AAT are often used interchangeably in the literature, leading to ambiguity [12,49]. To overcome these identified discrepancies, both AAT and AAA were included. The definitions provided by the International Association of Human-Animal Interaction Organisations (IAHAIO) and the American Veterinary Medical Association (AVMA) to classify the various types of AAI were used [7,8]. In summary, AAT involves a specifically trained live animal in a planned, structured and goal-directed intervention, designed to improve socio-emotional, physical, behavioural and/or cognitive functioning of the individual(s) as part of the treatment process [7,8]. AAT is delivered and/or directed by a trained human professional (from education, health or human services) with specific expertise, and progress is measured/evaluated [7,8]. AAA is a planned informal interaction with trained animal-human teams for recreational, motivational and educational opportunities [7,8]. For the purposes of this systematic review, in accordance with IAHAIO, AAA is goal-orientated [8]. It was anticipated that studies might provide insufficient detail to objectively assess and classify the type of intervention (for example to distinguish clearly between AAT and AAA). Consequently, studies were included if the live animal was called/considered a therapy animal, a therapeutic aim/goal was identified, and the outcomes of interests were evaluated. If the term "therapy animal" had not been used, the authors had to explicitly mention that the animal had, at least, had introductory training and an assessment/evaluation [8,9]. If the intervention was part of a multi-component programme, isolating the effectiveness of the AAT/AAA had to be possible; otherwise, the study was excluded.
Additionally, if a study used a stressor, the stressor had to be an aspect of study, training or education, or be student-specific. An exam, or an experimental cognitive test used to emulate evaluative testing, are examples of included stressors.
Any type of comparator was included: active intervention, attention control, placebo/sham therapy, usual care/treatment as usual, or alternative active intervention. If there was more than one comparator, one was chosen according to a hierarchy that was established to assess the intervention's effectiveness [40,50]:
1. control (no-treatment, attention, usual care, or wait-list)
2. validated sham treatment (where known not to be efficacious)
3. other active intervention with known efficacy
4. sham/alternative treatments (where efficacy is unknown)

Specific psychological mental health outcomes, assessed using various established or published standardised measures before and after the intervention, were included. The primary outcomes of effect focused on anxiety and/or stress, using a range of established or published standardised measures, including, but not limited to, the Perceived Stress Scale (PSS) or State-Trait Anxiety Inventory (STAI) [51,52]. These outcomes were chosen as students in higher education may experience a significant amount of stress [6,[53][54][55]. Additionally, anxiety can occur as a reaction to stress, with stress and anxiety being closely linked [56]. Differences in anxiety and/or stress scores from baseline pre-intervention to directly post-intervention and/or final follow-up were included. Secondary outcomes of effect focused on mood/affect, depression and well-being, using a range of published or established standardised measures, including, but not limited to, the Warwick-Edinburgh Mental Wellbeing Scale (WEMWBS) or Positive and Negative Affect Schedule (PANAS) [57,58]. The time-point immediately after the intervention was considered to have the biggest potential health benefit; the next best alternative was the time-point closest to the end of the intervention.
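The comparator hierarchy described above amounts to a ranked lookup. A minimal sketch follows; the function name, arm labels and fallback rank are illustrative assumptions, not part of the review's protocol:

```python
# Ranked from most to least preferred when a trial had multiple comparison arms
COMPARATOR_HIERARCHY = [
    "control",              # no-treatment, attention, usual care, or wait-list
    "validated sham",       # known not to be efficacious
    "active intervention",  # other active intervention with known efficacy
    "sham/alternative",     # sham/alternative treatment, efficacy unknown
]

def choose_comparator(arms):
    """Return the comparison arm highest in the pre-specified hierarchy.

    Arms not in the hierarchy rank last (an assumption for illustration).
    """
    rank = {name: i for i, name in enumerate(COMPARATOR_HIERARCHY)}
    return min(arms, key=lambda arm: rank.get(arm, len(rank)))
```

For example, for a trial with both a wait-list control and an active intervention arm, `choose_comparator(["active intervention", "control"])` selects the control arm.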
RCTs represent the gold standard for measuring an intervention's effectiveness with high internal validity [59,60]. Therefore, only RCTs were included. Studies were excluded if allocation to the respective groups was not objectively randomised, for example if randomisation was according to participants' availability, student number or date of birth with no random component. Any pilot/exploratory studies that met all inclusion criteria and analysed the outcomes of interest were included, unless the full RCT had been reported.

Study Selection
Following the electronic database and grey literature searches, titles and abstracts (n = 8036) were de-duplicated. The remaining titles and abstracts (n = 6248) were cautiously screened for relevance, erring on over-inclusivity, by the first author (CPC) in Rayyan to remove obviously irrelevant studies or duplicated studies not identified by the automated systems [40,61]. Subsequently, 100% of the articles deemed potentially relevant (n = 928) were reviewed independently by two reviewers (CPC and ML, a fellow Master's student) against the pre-specified eligibility criteria using Rayyan [61]. Of these, 132 articles were identified as requiring full manuscript review, which was undertaken in full by CPC and for 20% by ML, both blinded to the other's decisions. To identify a random 20% for ML, the titles were arranged alphabetically in Endnote. Subsequently, a random number (n = 114) was generated by a true random number generator website [62]. Every fifth article (as 20%) was chosen, starting from the 114th article. If any discrepancies arose at any stage, discussion occurred between the two reviewers. If consensus was not achieved, agreement was obtained by discussing with a third reviewer (EM). If the full manuscripts were not available, title/abstract/keywords were reviewed. To meet the eligibility criteria to request an inter-library loan, at least three elements of the PICOS criteria had to be fulfilled. Keywords were identified from Rayyan, Endnote and the MeSH analyser [61,63,64].
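The every-fifth selection described above can be sketched as follows. This is a hypothetical illustration only: the function name and the wrap-around behaviour past the end of the alphabetical list are assumptions not stated in the text:

```python
import random

def select_double_review_sample(titles, step=5, start_position=None):
    """Select roughly 1/step of the articles for independent double review.

    Titles are sorted alphabetically (as in Endnote); selection starts from a
    randomly generated 1-based position and takes every `step`-th title,
    wrapping around the list (wrap-around is an assumption).
    """
    ordered = sorted(titles)
    n = len(ordered)
    if start_position is None:
        # a true random number generator website was used in the review;
        # random.randint stands in for it here
        start_position = random.randint(1, n)
    start = start_position - 1  # convert 1-based position to 0-based index
    return [ordered[(start + i * step) % n] for i in range(n // step)]
```

With 132 titles, a step of 5 and start position 114 (the values reported above), this yields 26 articles, i.e. roughly 20%.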

Data Extraction
A structured Microsoft Excel data extraction form was adapted, with the permission of Dr. C Marshall, after being piloted with two studies. The TIDieR checklist was incorporated as AAI is a complex intervention [65]. The data extraction form drew on published guidance [13,66]. Data extraction, with rigorous double-checking, was primarily undertaken by CPC. ML independently extracted data for 18% (n = 2 out of 11) of the studies. The same previously described strategy for addressing disagreements was followed, as required.

Risk of Bias Assessment and Strength of Evidence
A validated tool, the revised Risk of Bias 2.0 tool (RoB2) for individually randomised parallel-group trials, was used for the risk of bias assessment [67]. As this systematic review aimed to inform a health policy question, the effect of interest was the effect of assignment to AAIs [68]. For each study that reported more than one of the relevant primary or secondary outcomes, a risk of bias assessment was completed for each relevant outcome. For each study that reported multiple time-points for the assessment of outcomes or had more than one comparator, one time-point and one comparator were chosen according to the hierarchies previously described. Where the RoB2 guidance did not cover specific situations found in this review, decision rules were developed and applied in a standardised manner (Appendix B). The RoB2 assessment was completed in full by CPC, with 18% (n = 2) undertaken independently by ML. The same previously described strategy for addressing disagreements was followed. Authors were contacted to request clarifications or to access missing data and were given a two-week period to reply.
A narrative synthesis based on the Economic and Social Research Council (ESRC) Methods Programme was planned [69]. Meta-analysis was not appropriate as all studies were at high risk of bias for the outcomes of interest, most had high or some concerns regarding missing data, and there was substantial clinical heterogeneity. Since meta-analysis was judged to be inappropriate, a harvest plot using vote counting, based on direction of effect, was used, with categorisation of the studies by their effect (detrimental, no or beneficial effect) [40]. Effect size and statistical significance were not included in this categorisation as they can be misleading [40]. Vote counting was used for both the primary and secondary outcomes using the mean change score for one comparator and one time-point according to the hierarchy previously described. If the outcomes were measured both immediately after the intervention and at an additional time with a stressor, only the former was included in the vote counting. A set of decision criteria was created for interventions with a stressor and those without, to standardise interpretation of the expected response to the intervention (Appendix B).
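As an illustration, the vote-counting step can be sketched as a simple tally by direction of effect. This is a hypothetical sketch, not the review's actual analysis code: the function name is an assumption, and lower mean change scores are assumed to indicate improvement (as on anxiety/stress scales):

```python
def vote_count(mean_changes):
    """Tally studies by direction of effect only, ignoring effect size and
    statistical significance (as in the harvest plot categorisation).

    `mean_changes` maps a study to a pair of mean change scores
    (post minus pre) for the AAI arm and the chosen comparator arm.
    """
    tally = {"beneficial": 0, "no effect": 0, "detrimental": 0}
    for study, (aai_change, comparator_change) in mean_changes.items():
        # a larger reduction in the AAI arm than the comparator arm
        # counts as a beneficial direction of effect
        diff = aai_change - comparator_change
        if diff < 0:
            tally["beneficial"] += 1
        elif diff > 0:
            tally["detrimental"] += 1
        else:
            tally["no effect"] += 1
    return tally
```

For example, a study whose AAI arm fell by 3 points while its comparator fell by 1 would be counted as beneficial, regardless of whether that difference reached statistical significance.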
The Evidence for Policy and Practice Information (EPPI) approach was implemented to assess strength of evidence as described in the ESRC Guidance [69,70]. In accordance with the ESRC guidance, trustworthiness was measured by Jadad's scale [69,71].

Study Selection
Eleven articles, describing eleven studies, met the inclusion criteria. The PRISMA flowchart is displayed in Figure 1. Ten studies were journal articles and therefore classed as published literature [35,37,[72][73][74][75][76][77][78][79]. One study was a dissertation in partial fulfilment of a PhD and classed as grey literature [80]. All 11 studies were individually randomised parallel-group trials. Of these, three were described as pilot or exploratory studies but nonetheless presented data on interventions' effectiveness [37,75,80]. The excluded full-text published studies (with reasons) are listed in Appendix C. (Figure 1: PRISMA flow diagram [39]; * "other" included: not available due to COVID-19 (n = 2), did not meet criteria for requesting an inter-library loan (n = 10), and ongoing trial (n = 1).)
Nine studies allowed free interaction with the animals [35,37,72,73,[75][76][77][78][79] and two used a structured format [74,80]. Most interventions (n = 8) corresponded to the definition of AAA described in Section 3.3 [37,72,73,[75][76][77][78][79]. Two studies were classified as AAT [35,74] and one combined AAT and AAE together [80]. These classifications were mostly based on an objective assessment of the description provided vis-à-vis the definitions outlined in Section 3.3. Where there was insufficient detail for an objective assessment, the classification used by the primary authors was kept [35]. The length of the interventions ranged from unspecified to 90 min, with the modal length being between 10 and 20 min. Seven studies used a single session [35,37,72,73,[78][79][80]. Four studies used multiple sessions [74][75][76][77]; of these, two used once-weekly sessions for four weeks [76,77]; one implemented three sessions without reporting the time interval between them [74]; and one allowed various lengths and frequencies over a 15 to 16-week period at the participant's choice [75]. None of the studies reported monitoring or measuring the intervention's fidelity. The theoretical frameworks were clearly stated in three studies [35,37,80]. More information regarding the theoretical frameworks is included in Appendix D.

Outcomes
The primary outcomes were self-reported anxiety and stress measured before and after the intervention. Seven studies measured self-reported anxiety [37,[74][75][76][77]79,80] with most using the State-Trait Anxiety Inventory (STAI), apart from one study which used the Hospital Anxiety and Depression Scale (HADS) [75]. Two studies measured stress using the Perceived Stress Scale (PSS) on the original 5-point Likert scale [35,78].
Regarding the secondary outcomes, two studies measured depression [75,77] using the HADS-depression subscale and Beck Depression Inventory II (BDI II), respectively. Five studies measured mood/affect, with most using Positive and Negative Affect Schedule (PANAS) [72,73,76,78] and one used the University of Wales Institute of Science and Technology Mood Adjective Checklist (UMACL) [37]. Well-being was measured in three studies [35,37,78] using various tools.
The timing of outcome measurement after the intervention varied substantially between studies (with some reporting multiple time-points outlined in Table 2) and encompassed: immediately after the intervention without a stressor (n = 6); after the intervention but before an exam (n = 2); after the intervention and an experimental stressor (n = 4); within 24 h of the intervention (n = 1); 2 weeks after the intervention (n = 1); up to 1 month after the intervention (n = 1); up to 15-16 weeks since the start of the intervention (n = 1).

Regarding adverse events reporting, Hall [75] described ten individuals who reported increased anxiety and stress due to being in the control group and unable to participate with the animal. These participants were subsequently removed from the study to allow interaction. No other adverse events were reported. None of the studies specifically reported any adverse events for the animals involved.
Characteristics are summarised in Table 2. Figure 2 summarises the risk of bias for each domain with an overall assessment for each study for all the relevant outcomes. All studies were at overall high risk of bias for the outcomes of interest as susceptible to high risk of bias in the domain "measurement of the outcome". This was mainly due to the nature of the intervention as participants could not be blinded and the measures were self-reported, therefore were not objective.

Strength of Evidence
The quality and relevance of evidence was appraised using the 'Weight of Evidence' approach [70], as described in the ESRC guidance on narrative synthesis [69]. The strength of evidence of the included studies is summarised in Table 3. One study had high overall weight [74], eight [35,37,73,75,76,[78][79][80] had medium overall weight and two had low overall weight [72,77].

Narrative Synthesis: Interventions' Effect
As all the studies were at high risk of bias for the outcomes of interest, with most having high or some concerns regarding missing data, alongside the substantial clinical heterogeneity, a meta-analysis was not appropriate as it was unlikely to provide meaningful results.
All the included measurement scales provided continuous data; results were presented in various formats such as pre-and post-values, mean change or adjusted estimates of the intervention's effects (for example, using analysis of covariance (ANCOVA) with baseline measurements included as a covariate). Appendix D provides a summary of the studies' results with mean change (post-scores minus pre-scores) as the common parameter using the most immediate measurement of outcomes after the intervention (or the next best alternative).
Primary Outcome: Anxiety

Six studies tested anxiety outcomes without a stressor, with four showing a statistically significant improvement in favour of the AAI [37,74,75,77]. Four studies tested the outcomes immediately after the intervention without a stressor, with three showing a statistically significant reduction in favour of the AAI [37,74,77] and one showing a non-statistically significant reduction, though no power calculation was included [76]. Of the two studies which tested the outcome after a longer time interval, Meola [80] found a non-statistically significant reduction in anxiety when measured up to one month after the intervention, but this study was under-powered. Hall [75] found a statistically significant reduction in the post-intervention scores 15-16 weeks after the intervention had started.
Four studies involved a stressor: two assessed the outcomes after the intervention but prior to an exam [74,79] and two assessed the outcomes after an experimental cognitive test (WAIS-IV IQ test) designed to imitate the evaluative testing that occurs in higher education [76,77]. Two studies found no statistically significant difference [74,77] but did not include a power calculation. Hunt et al. [76], despite no significant effect of condition on anxiety after a stressor, still undertook paired comparisons, which showed a statistically significantly higher level of anxiety in the AAI group compared to the control group. Williams et al. [79] showed an increase in anxiety for both groups after the intervention but before an exam, as might be expected. However, the control group had statistically higher anxiety levels than the AAI group.
Overall risk of bias was high for all seven studies regarding the outcomes of interest. Regarding overall strength of evidence, one study had high [74], one had low [77] and the remaining five had medium strength. Using vote counting, according to direction of effect and not statistical significance as described in the methods, all seven studies showed a beneficial effect in favour of the intervention compared to the comparator (demonstrated in harvest plot in Figure 3).

Primary Outcome: Stress
Two studies, using single group sessions with dogs, measured self-reported stress [35,78]. Binfet [35] described the intervention as AAT whilst the intervention in Ward-Griffin et al. [78] was consistent with AAA. Session length varied from 20 min [35] to up to 90 min (on average 30 min) [78]. Neither study used a stressor, but one study [78] took place during mid-term exam season. Despite the differences between the two studies, both showed a statistically significant reduction in stress when measured within 24 h of the intervention. This effect was not sustained at the two-week follow-up for the one study that included longer follow-up [35]. Therefore, these studies showed cautious preliminary evidence of a short-term, statistically significant, beneficial effect of AAIs on stress for students at Canadian universities. However, with such a small number of studies (both at high risk of bias for the outcomes of interest), caution is needed regarding generalisability to other countries and settings.

Secondary Outcomes: Depression, Mood/Affect and Well-Being
The evidence for depression is based on only two studies with different study characteristics and mixed results [75,77]. One study, with low strength of evidence, showed a non-statistically significant beneficial effect on depression (with a stressor) [77], but we are unable to state whether this is the true effect as the study could have been underpowered. The other study showed a reduction in depression scores for both intervention and comparator, but by a larger degree for the control (without a stressor) [75]. However, caution is required due to the skewed distribution of the data and how the results were provided (as means; personal communication) [81].
The evidence for mood is mixed, with five studies measuring this outcome [37,72,73,76,78]. A non-statistically significant detrimental effect appeared to occur particularly when a stressor was applied. Two studies showed statistically significant beneficial results without a stressor, although for one study this result had not lasted by the fourth session. Where non-statistical significance was found, a power calculation had not been included; therefore, the studies may have been underpowered to demonstrate the true effect (which may or may not be similar to the results explored here). Further investigation is required before any conclusions can be made for this outcome.

Three studies measured well-being using various tools (for example, Sense of Belonging in School, Warwick-Edinburgh Mental Well-being Scale (WEMWBS), Satisfaction with Life Scale (SWLS), Subjective Happiness Scale and Medical Outcomes Study Social Support) [35,37,78]. Overall, tentative but mostly beneficial effects were found on varying measures of well-being when measured immediately after or within 24 h of the AAI. Where a non-statistically significant effect was found, the outcomes had been measured within 24 h of the intervention during mid-term exam season. Additionally, no power calculation had been included for the non-statistically significant findings; therefore, distinguishing between a true "no effect" and being underpowered was not possible.

Statement of Principal Findings
This systematic review included 11 RCTs assessing the effect of AAIs on mental health outcomes for students attending higher education in a variety of settings and countries. The evidence suggests that AAIs could provide short-term beneficial results for anxiety in students attending higher education. There is limited evidence for stress, and inconclusive evidence for depression, well-being and mood. These results are from studies at high risk of bias for the outcomes of interest, with mostly medium strength of evidence.

Strengths and Weaknesses of the Review
The strengths of this review include a comprehensive search strategy incorporating both grey and published literature. Additionally, screening, data extraction and risk of bias assessment were conducted independently, with the inclusion of a third reviewer to resolve any discrepancies. This reduced the introduction of random and systematic errors [82]. The development of decision rules, where required, increased transparency and rigour. Furthermore, this review included only RCTs, which represent the gold standard study design for evaluating a causal relationship and for measuring an intervention's effectiveness [83]. A meta-analysis was not appropriate for the reasons already described; therefore, vote counting, based on direction of effect and not statistical significance, was employed. Vote counting using statistical significance can be misleading, especially in studies where no power calculation was reported and no significance was found [40]: the lack of a statistically significant effect may either be due to the study being underpowered or may reflect a true lack of effect [40]. However, when using vote counting on direction of effect without considering statistical significance, the effect seen may be due to chance. Additionally, vote counting methods are unable to provide a precise estimate of the overall effect size.
Due to COVID-19, some resources were inaccessible (list included in Appendix C). Additionally, a pragmatic approach to the Advanced Google search was taken, using only the primary outcomes. Well-being is a particularly broad concept; proxies strongly related to well-being were included but may not have formed an exhaustive list. Therefore, some articles that could have met the eligibility criteria may have been missed. Due to resource and time constraints, only a proportion (above a pre-specified minimum) of the screening, data extraction and risk of bias assessment was undertaken in duplicate, which represents a limitation. Using the strength of evidence approach, overall weight was influenced mainly by how well the authors described randomisation and/or the presence of withdrawals/missing data, as all the studies were relevant RCTs where double blinding was difficult. Where these elements were not reported, concluding whether they simply had not occurred, or had occurred but were not described for reasons such as word-count limits, was impossible unless clarified by the authors through correspondence. Furthermore, different tools to assess strength of evidence or risk of bias may have produced different findings.

Strengths and Weaknesses of the Studies
The limitations of the primary evidence reviewed included variable levels of descriptive reporting in the included studies, such as participant characteristics, delivery of the intervention and theory of change. For example, the descriptions of the included AAIs varied considerably, from a high level of detail to only a single sentence or short paragraph. Therefore, it was not possible to conclude whether certain types of AAIs were more effective than others or to isolate the "active ingredient" of effective AAIs. Most of the participants were female (ranging from 57% to 85%), and authors rarely commented on how well the sample represented the target population. Furthermore, all the participants volunteered to participate and, therefore, may not be representative of the target population [84]. For example, those volunteering (and therefore self-selecting) may be more motivated and/or have a different mental health status than those who did not [84,85]. Additionally, individuals who are afraid of animals, allergic, or who have a medical condition precluding participation are unlikely to have volunteered. Therefore, building a comprehensive picture of the type of individual attending higher education who would benefit from AAIs was challenging. These reporting issues also made assessing the generalisability of the results problematic.
The outcomes reviewed were self-reported and can be subject to unconscious and/or conscious ascertainment bias. With this type of intervention, blinding of participants or of individuals delivering the intervention was difficult, if not impossible. This resulted in all the studies having an overall high risk of bias for the outcomes of interest for this systematic review. Additionally, participant expectancy bias could have been introduced, as highlighted in Williams et al. [79], where 90% of the control group sampled (n = 15) stated that they thought an interaction with the dog would have reduced their stress prior to an exam. Despite these limitations, capturing how an individual feels after an intervention/comparator is important. To aid corroboration of, and confidence in, self-reported results, triangulation would be beneficial, for example with objective measures (such as physiological outcomes) and/or blinded behavioural observations [86]. Indeed, some of the studies reported physiological outcomes, but these were beyond the scope of this systematic review, which is a limitation.
Furthermore, for the outcomes where no statistical difference was found, either no power calculation was included, or the study was underpowered. In those studies, no definitive conclusion can be derived about the effectiveness of that particular intervention for that specific outcome as it is not possible to state if the non-statistically significant difference was the true effect or not [87].

Study Meaning: Possible Mechanism and Implications for Policymakers
The Fogg Behaviour Model (FBM) is a theory of change model that could be considered for AAIs and student engagement to improve mental health outcomes [88]. In summary, FBM requires three elements for the intended behaviour of student engagement with AAIs to occur: motivation, ability and triggers [88]. Triggers promote the intended behaviours and may be achieved by advertisement/promotion of the sessions [88]. Motivation for attendance is proposed as the animals' presence addressing three core motivators: (1) hope of an experience that is likely to be (2) pleasurable and (3) socially acceptable for the majority [88]. Finally, to optimise the students' ability to participate in this intended behaviour, the model's six elements of simplicity should be addressed [88]:
1. time (sessions to be short)
2. money (sessions to be cost-neutral for students)
3. physical effort (sessions to be offered in an accessible location)
4. brain cycles (the process by which to attend the sessions should be easy)
5. social acceptance (as offered by activities with animals)
6. routine (regular sessions to be offered)

For policymakers considering implementing AAIs in a higher educational setting, a logic model should be developed alongside the intervention to assist in clarifying the active ingredients and causal assumptions [89]. Key stakeholders, such as students from varying backgrounds, staff from student support services and animal/handler teams, should be involved during the design stage. A formative evaluation, including both process and implementation assessments, with a mixed-methods pilot study would be a useful first step [89][90][91]. The feasibility, acceptability and fidelity of the AAI in the target population can therefore be assessed, with adaptation if required. Thereafter, a summative evaluation, using a larger mixed-methods RCT evaluating effectiveness, with a nested process evaluation, would be recommended [90].

Future Research Recommendations
Further research recommendations are detailed in Box 1. In particular, the overall quality of reporting by authors should be improved, with facilitation from journals. For example, authors should provide enough information to allow replication of the interventions or expansion of the existing research [65]. These steps will help identify the effective or ineffective interventions and facilitate further evaluation of the active ingredients. Furthermore, potentially conflicting evidence, albeit not statistically significant, is present for mood, which needs further evaluation to ensure that AAIs do not have unintended negative consequences [92]. Additionally, including animal welfare and economic evaluations is important to help secure support and funding from commissioners.

• Use of standardised and internationally recognised definitions when describing AAIs
• Use of sample sizes that provide adequate power
• Clarity regarding the randomisation procedure (including description of allocation, and whether concealed allocation occurred) and provision of an adequate description of the participants' characteristics separated by group
• Clear reporting of the participants' flow through the trial, with reasons for any missing data for each respective group and at each time-point
• Use of explicit comparators to establish the relative effects of the co-interventions (e.g., appropriate attention controls)
• Adequate descriptions of the interventions implemented to facilitate replication
• Clear reporting of the outcome measurement procedure (particularly when multiple time-points or stressors are present), including any adaptations made to the scales used
• Provision of access to publicly available pre-specified statistical analysis plans by authors, including justification for the choice of target differences
• Clear reporting of adverse events for both humans and animals

Conclusions
Animal-assisted interventions (AAIs) were considered as potential interventions to help improve the mental health and well-being of students in higher education who are willing to, and can, engage with animals. The synthesised evidence suggests that AAIs could provide short-term beneficial results for anxiety, and possibly stress, in this population, which is known to be at risk of mental health issues. However, caution is required, as these results were from studies at high risk of bias for the outcomes of interest for this systematic review, with mostly medium-strength evidence, and in various cultural settings. Subsequent implementation of AAIs in this setting requires both formative and summative evaluation to measure both the intended and unintended consequences. Furthermore, consideration of alternatives for students unable to participate due to fear of animals or medical contraindications is recommended to prevent widening of health inequalities.

Appendix A

Search Strategies
The search strategy was independently peer-reviewed by both an information specialist and an experienced librarian at Newcastle University.

Appendix B
Additional decision rules for the RoB2 assessment and vote counting. RoB2 assessment: The decision rules were developed and applied to the RoB2 assessment to standardise the approach where the RoB2 guidance lacked direction on how to proceed. The specific domains and questions are taken directly from the RoB2 tool [67,68].
"Domain 1: Risk of bias arising from the randomisation process. 1.3: Did baseline differences between intervention groups suggest a problem with the randomisation process?" To allocate "No"/"Probably No" or "Yes"/"Probably Yes": the baseline characteristics of the participants had to be split by group to allow comparison. Additionally, it had to be clearly stated (either in the narrative or in Tables/Figures) that the characteristics were those of the randomised participants. Providing only a statement regarding the characteristics of the whole group, or a statement regarding statistical differences between the groups, was not enough.
Otherwise, the grading of "No Information" was allocated. "Domain 2: Risk of bias due to deviations from the intended interventions (effect of assignment to intervention). 2.6: Was an appropriate analysis used to estimate the effect of assignment to intervention?" "Probably Yes" was allocated if all missing information was specified with reasons and the trialists did not state whether deviations had occurred. The rationale was that trialists may not state that deviations had not occurred due to word-count issues. Deviations are reportable to the relevant Institutional Review Boards (IRBs) in accordance with ethical approval, and trialists are expected to state deviations (if they occurred) according to good research practice. "2.7: Was there potential for a substantial impact (on the result) of the failure to analyse participants in the group to which they were randomised?" "Probably No" was allocated if excluded or missing data amounted to less than 5%, with appropriate reasons given.
"Domain 3: Risk of bias due to missing outcome data. 3.1: Were data for this outcome available for all, or nearly all, participants randomised?" For the allocation of "Yes" or "Probably Yes": the authors had to clearly state the amount of missing data, either in the narrative or in a Table/Figure, and this amount had to be less than 5%.
"No" was allocated if the missing data were more than 5%. "No Information" was allocated if the amount of missing data was not clearly stated in either the narrative or Tables/Figures. "3.3: Could missingness in the outcome depend on its true value?" "No" was allocated if reasons were provided for the missing data and the reasons were not related to the outcome.
"Yes" was allocated if reasons were given and were related to the outcome. "No Information" was allocated if no reasons were provided. If 3.4 required a response: "Is it likely that missingness in the outcome depended on its true value?" "No Information" was allocated if no reasons were provided for the missing data and the amount missing in each group was not clearly stated.
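The Domain 3 rules above amount to a simple threshold-and-reasons scheme. As a minimal sketch (the function name, argument encoding and return format are illustrative, not from the review):

```python
def rob2_domain3(missing_pct=None, reasons_given=False, outcome_related=False):
    """Sketch of this review's decision rules for RoB2 Domain 3.

    missing_pct: percentage of missing outcome data, or None when the
    amount was not clearly stated in the narrative or a Table/Figure.
    reasons_given: whether reasons for the missing data were provided.
    outcome_related: whether those reasons relate to the outcome itself.
    Returns the judgements for questions 3.1 and 3.3 as a tuple.
    """
    # Q3.1: were data available for all, or nearly all, participants?
    if missing_pct is None:
        q3_1 = "No Information"
    elif missing_pct < 5:
        q3_1 = "Yes/Probably Yes"  # data available for nearly all participants
    else:
        q3_1 = "No"

    # Q3.3: could missingness in the outcome depend on its true value?
    if not reasons_given:
        q3_3 = "No Information"
    elif outcome_related:
        q3_3 = "Yes"  # missingness could depend on the true value
    else:
        q3_3 = "No"
    return q3_1, q3_3
```

A study clearly reporting 3% missing data with reasons unrelated to the outcome would therefore be graded "Yes/Probably Yes" on 3.1 and "No" on 3.3.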
"Domain 4: Risk of bias in measurement of the outcome: 4.5: Is it likely that assessment of the outcome was influenced by knowledge of intervention received?" Due to the nature of the intervention, even if the participants were blinded to the study's purpose: "Yes" or "Probably Yes" was allocated.
"Domain 5: Risk of bias in selection of the reported results. 5.1: Were the data that produced this result analysed in accordance with a pre-specified analysis plan that was finalised before unblinded outcome data were available for analysis?" For "Yes" to be allocated, the authors were required to have published their pre-specified analysis protocol in the public domain.
For "No" to be allocated, the authors had to confirm both that a pre-specified analysis protocol was not published and that the analysis was, indeed, not in accordance with an unpublished pre-specified plan.
"No Information" was allocated if no pre-specified plan was published. "5.2: Is the numerical result being assessed likely to have been selected, on the basis of the results, from multiple eligible outcome measurements (e.g., scales, definitions, time-points) within the outcome domain?" "No" was allocated if the outcomes were measured at similar times, in the same order, with the same measures and the same mode of administration for the groups involved.
"Yes" was allocated if the above elements varied. "No information" was allocated if not enough information was given to assess adequately. "5.3: Is the numerical result being assessed likely to have been selected, on the basis of the results, from multiple eligible analyses of the data?" For "Yes" to be allocated, evidence was required to suggest multiple analyses had occurred either through the available information or through correspondence with the authors.
"No" was allocated if a pre-specified analysis protocol was published or the authors confirmed that the analysis was in accordance with a pre-specified unpublished analysis plan which was then made available for reviewers to examine.
"No Information" was allocated if the above criteria were not met. Vote counting: Table A1 summarises the decision criteria both without and with a stressor. Specifically, in situations where a stressor was present, the expected response without the intervention was considered. For example, after a stressor (e.g., an experimental cognitive test) or before a known stressor (e.g., an exam), the expectation is that anxiety scores are likely to be worse than baseline. For the intervention to have a beneficial effect when a stressor was present, three situations could occur:
(1) anxiety was worse after the intervention (as expected due to the stressor) but not by as much as in the control group;
(2) no change was seen after the intervention, but anxiety was worse in the control group;
(3) anxiety was better after the intervention and better than in the control group.

Table A1. Decision criteria to aid interpretation of vote counting.

Without a stressor
Beneficial: Post-intervention assessment shows improvement in scores (direction relative to the measure used) compared to pre-assessment and better than control.
No effect: Post-intervention assessment (1) shows no change in scores compared to pre-assessment, or (2) shows improvement in scores (direction relative to the measure used) compared to pre-assessment but not as much as control.
Detrimental: Post-intervention assessment shows worsening in scores (direction relative to the measure used) compared to pre-assessment and worse than control.

With a stressor (either before the post-assessment or present during the post-assessment, e.g., occurring prior to an exam)
Beneficial: Post-intervention assessment shows (1) worsening in scores (direction relative to the measure used) compared to pre-assessment, as would be expected due to the stressor, but less than control, or (2) no change, with the control scores being worse, or (3) an improvement in scores which is better than control.
Detrimental: Post-intervention assessment shows worsening in scores (direction relative to the measure used) compared to pre-assessment, as would be expected due to the stressor, and worse than control.

Appendix C

Table A2. Full text excluded studies with reasons (n = 122).

References of Excluded Studies from Full Manuscript Search Reason Excluded
Adamle et al. [93]: wrong study design
Adams et al. [94]: wrong study design
Adams et al. [95]: wrong study design
Alonso [96]: criteria for inter-library loan not met
Anderson [97]: wrong outcome measures
Anonymous [98]: criteria for inter-library loan not met
Anonymous [99]: wrong study design
Ashton [100]: wrong study design
Baghain et al. [101]: wrong study design
Bajorek [102]: wrong outcomes
Barker et al. [103]: wrong outcomes

Perry et al. [176]: wrong study design
Picard [177]: wrong intervention
Polking et al. [178]: wrong study design
Quinn et al. [179]: wrong study design
Ralston et al. [180]: wrong study design
Renne et al. [181]: wrong study design
Robino et al. [182]: wrong study design
Robson [183]: wrong study design
Rose [184]: wrong study design
Sanford [185]: wrong study design
Silas et al. [186]: wrong study design
Sola-Perkins [187]: wrong population
Stewart et al. [188]: wrong intervention
Stewart et al. [189]: wrong study design
Straatman et al. [190]: wrong intervention
Swan [191]: criteria for inter-library loan not met
Taylor et al. [192]: wrong study design
Thelwell [193]: wrong intervention
Thew [194]: wrong outcomes
Tobin [195]: wrong population
Tomaszewska et al. [196]: wrong study design
Trammell [197]: wrong outcomes
Turner et al. [198]: wrong study design
Voelpel et al. [199]: wrong study design
Walsh [200]: wrong study design
Wheeler et al. [201]: wrong intervention
Williams et al. [202]: wrong outcomes
Wilson [203]: not available due to COVID-19
Wilson [204]: not available due to COVID-19
Wood et al. [205]: no comparator
Young [206]: wrong reporting of outcomes
Zents et al. [207]: wrong population
(Exclusion criteria were applied according to the methods. Additionally, a record that was not the correct study design and was reported in a chapter, book or letter was categorised as wrong study design; wrong outcomes included the wrong type of outcome reporting (e.g., anecdotal with no empirical data) as well as the wrong sequence of measuring outcomes; wrong intervention also applied if the stressor was not student-specific.)

Appendix D. Summary of the Theoretical Frameworks and Results
Theoretical frameworks: For a study to be classed as clearly stating the theoretical framework that underpinned the included intervention, one of the following criteria was required:
• the mechanism of action was stated and directly linked back to the intervention's development before implementation; or
• a proposal was offered for the intervention's mechanism of action on the outcomes before the intervention was implemented; or
• the mechanism of action was stated with a pre-specified assessment to distinguish the different co-interventions' relative effects.

Three of the included studies were assessed as articulating the theoretical framework according to the above [35,37,80]:
• Binfet [35] used Biophilia and Cobb's Social Support Theory [24,26].
• Shearer et al. [77] described the intervention of interest as an active control and, in the discussion, considered unconditional positive regard and the social environment.
• Ward-Griffin et al. [78] suggested that therapy animals may act as a source of social support, with no further expansion.
• Williams et al. [79] reported that therapy animals are trained to provide comfort and affection and can be calming, with no further expansion.

One study [72] was assessed as not articulating the theoretical framework according to the above criteria:
• Banks et al. [72] discussed the benefits of canine interaction, for example the cognitive changes of reduced mind wandering and increased sustained attention, but did not explore how and why mental health outcomes would be improved.
Results: Box A1 aids interpretation of the mean change scores and Table A3 provides a summary of the results. For an inferred 12-min individual single AAA session delivered prior to an exam, anxiety levels increased in both groups, with the control group showing statistically significantly higher anxiety levels than the dog group.