An empirical study of the impact of Systems Thinking and Simulation on Sustainability Education

Education for Sustainable Development (ESD) is considered vital to the success of the United Nations’ Sustainable Development Goals. Systems Thinking has been identified as a core competency necessary to incorporate into ESD. Systems Thinking orientated ESD learning tools, established methods of assessment of sustainability skills, and studies to demonstrate effectiveness of such learning tools, are all lacking. There is a wealth of experience in the System Dynamics field regarding the application of Systems Thinking and simulation to environmental problems, sustainability and systems education. Many System Dynamicists regard simulation as essential for teaching Systems Thinking. The substantial body of research into the design of effective simulation-based learning environments (SBLEs) can also inform ESD initiatives. This research describes a randomised controlled trial (n=106) to investigate whether an online sustainability learning tool that incorporates Systems Thinking and System Dynamics simulation increases understanding of a specific problem and supports transfer of knowledge to a second problem with a similar systemic structure. The effects of Systems Thinking and simulation were tested separately and in combination. The learning tool was designed for a single online learning session. Simulation was found to increase ESD learning outcomes significantly, and also to support transfer of skills, although less significantly. Adjustments were estimated, but more generally, limitations to findings and recommendations for further experimental studies were reported.


Introduction
Sustainability has become an increasingly important topic of discussion over the last several decades as the harmful long-term consequences of unsustainable human activities and lifestyles have become ever more apparent. However, the concept of sustainability is complex and is often used in an imprecise way [1]. Since the 1980s the UN has been instrumental in developing the related concepts of sustainability, sustainable development and sustainability education. It founded The World Commission on Environment and Development (WCED) in 1983, which was responsible for the influential 1987 Brundtland Report [2]. The definition of sustainability in that report is the one most frequently quoted, namely that 'Sustainable development is development that meets the needs of the present without compromising the ability of future generations to meet their own needs.' The UN also led efforts to formulate concrete targets for action towards sustainability. In 2000 the UN defined the 8 Millennium Development Goals (MDGs) for 2015, of which goal 7 was 'To ensure environmental sustainability'. The MDGs were developed further in the 17 Sustainable Development Goals (SDGs), set in 2015 and to be achieved by 2030, and adopted by all 193 United Nations member states.
The UN adopted the Decade of Education for Sustainable Development (DESD) from 2005 to 2014. Education for Sustainable Development (ESD) is explicitly recognized in the SDGs as part of Target 4.7 of the SDG on education. It is seen as 'crucial for the achievement of sustainable development' [3] (p. 63). The Council of the European Union sees ESD as 'essential for the achievement of a sustainable society and is therefore desirable at all levels of formal education and training, as well as in non-formal and informal learning' 1 . Thus ESD is seen as a form of lifelong learning, and necessary for all citizens.
Sustainability education seeks to address the considerable challenge of training learners not only to solve or understand existing complex problems but also to equip them with skills that they can transfer to new problems as they arise. In the last few years there has been an urgent call for innovative sustainability pedagogies [4] (p. 58).
O'Flaherty and Liddy provide a useful summary of approaches so far taken to ESD, including blended learning, drama, simulation exercises, multi-media, problem-based learning and discussion forums [5]. They describe methodological and pedagogical questions that remain open and highlight the need for assessment frameworks and formal trials for evaluating the effectiveness of different approaches to ESD.

The need for Systems Thinking in Sustainability Education
Sustainability education is an emerging field. In her review, Maria Hofman-Bergholm explores reasons for problems with its implementation [6]. She draws out the commonalities between the literatures on Sustainability Education and Systems Thinking, in that both require critical thinking, real-world complex problem-solving skills and action. She finds that Systems Thinking is required to comprehend the intricate connections in sustainable development [7] (p. 27). Complex reasoning skills must be taught, as they are not inherent. Humans have well-known limitations in cognitive ability to reason about complex systems that must be overcome [8] (p. 599).
Similar observations have been made in the related field of Ocean Literacy [9]. A pilot study we conducted to investigate the effectiveness of a systems-orientated online Ocean Literacy learning tool gave promising results [10].
According to Frisk and Larson, sustainability education will only be effective if it incorporates Systems Thinking, long-term thinking, collaboration and engagement, and action-orientation [11]. Sustainability, they say, is fundamentally a call to action, and Sustainability Education therefore requires experiential, practical and flexible learning methods.
According to Wiek et al., 'Sustainability education should enable students to analyse and solve sustainability problems' [12] (p. 204). This requires a particular set of interlinked and interdependent key competencies. Wiek et al. review the literature and identify five key competencies, the first being Systems Thinking competence (the others are anticipatory, normative, strategic and interpersonal competence).
According to Soderquist and Overakker, the discipline of Systems Thinking provides a process, set of thinking skills and 'technologies' that can improve the systemic understanding that is required for sustainability education [13]. These include stock and flow mapping, computer simulation, and simulation-based learning environments. They claim that simulation-based learning environments build mental simulation capacity, if they are designed carefully.
Cavana and Forgie describe a number of well-established systems education programs and review teaching approaches for Sustainability Education [14]. They explore the strong links between systems approaches and sustainability goals, illustrating that the two are so entwined as to be inseparable. They describe the need for, and the lack of, simulation-based learning environments for Systems Thinking orientated sustainability education.

Relevant work in the field of System Dynamics
A substantial body of knowledge focused on modelling and simulation of complex human-environmental systems has accumulated in the field of System Dynamics since the 1970s. This is potentially useful for informing efforts to develop effective, innovative Systems orientated sustainability education tools.
System Dynamics modelling was first used to address sustainability in the 'World3' model, which built on Jay Forrester's earlier world models and formed the basis for the influential book, 'Limits to Growth' [15]. There have been many subsequent examples, from environmental models [16,17], models for water supply, waste management, air quality, land use [18], fisheries [19,20], climate change [21], models of social and economic development [22], reindeer pasture management [23] and many more.
Furthermore, the System Dynamics community has identified education as a priority for a long time. The Creative Learning Exchange was founded in 1991 by Jay Forrester 'to encourage the development of systems citizens who use systems thinking and system dynamics to meet the interconnected challenges that face them at personal, community, and global levels' 2 . They provide resources representing three decades of experience of teaching Systems Thinking and System Dynamics for real-world problem-solving to school children [24].
System Dynamics simulation has frequently been employed for the purpose of environmental education. There are flight simulators for sustainability [25] and simulation-based learning environments to teach sustainability [26,27]. System Dynamics models and simulations have also been used to try to explain why renewable resources are so often over-utilised; the explanation offered is faulty reasoning and systematic misperception of the dynamics of complex systems [23]. Simulation has been shown to improve understanding and performance in a natural resource management task [28]. Simulation can serve effectively as the 'problem' in problem-based learning [29], and as an experiential activity it can both increase retention and have a stronger influence on behaviour than declarative learning [11] (p. 11).
There is debate about whether simulation based on stock and flow models is an essential, advanced part of Systems Thinking or an extension of it [30]. According to Richmond, 'System thinkers use diagramming languages to visually depict the feedback structures of… systems. They then use simulation to play out the associated dynamics' [31]. Because simulation is seen as an essential by some, but not by all, systems thinkers, in our study the effect of adding Systems Thinking and simulation was evaluated separately and in combination.
System Dynamics scholars have also long been interested in the transfer of insights between management situations that share common structural characteristics, going back to Forrester [32] (p. 355). According to Sterman, perhaps counterintuitively given the immensely rich and varied complex systems in the world around us, 'most dynamics are instances of a fairly small number of distinct patterns of behaviour' [8] (p. 108). Senge describes Systems Archetypes as 'nature's templates' [33] (p. 92). They reveal an elegant simplicity underlying complex issues. Mastering them represents putting Systems Thinking into practice. Indeed, Richmond includes what he calls 'generic thinking' in his list of eight critical Systems Thinking skills [34]. Once an archetype is identified, 'it will always suggest areas of high- and low-leverage change.' For this reason, Kim views archetypes as diagnostic tools [35][36][37].
Because of the importance of Systems Archetypes to many systems thinkers, and because of the potential benefits of their use to sustainability education, their effect on transfer of sustainability skills is explored in this study. The choice of two sustainability problems that share a common systems archetype was made to test the hypothesis that learners can recognise similar patterns in different contexts, and therefore transfer their learning. If successful, this approach would make a strong case for a patterns-based approach to sustainability education, which would build systems and environmental literacy, obviating the need to teach each sustainability challenge in a piecemeal fashion.

Design considerations for Simulation-Based Learning Environments
The field of sustainability education can also benefit from the accumulated body of knowledge relating to design aspects and best practice in the development of simulation-based learning environments (SBLEs).
Landriscina advises that learners need guidance with simulations in the form of explanations, background information, tasks to perform, hints and feedback [38]. Kopainsky and Sawicka [28] (p. 143) cite Yasarcan [39] who holds that a 'gradual-increase-in-complexity approach helps improve performance in an inventory management simulation game'. In their critical review of 61 studies to evaluate effectiveness of simulations used for science instruction, Smetana and Bell report that 'simulations used in isolation were found to be ineffective', and that they should encourage reflection and promote cognitive dissonance, meaning that learners confront their erroneous assumptions and reconstruct their beliefs [40]. Cannon-Bowers and Bowers identified the importance of using case studies as a context for instruction and setting goals for the learner [41].
Ghaffarzadegan, et al. [42] argue that simulations based on small System Dynamics models offer advantages for learning in a public policy context. By small models they mean 'models that consist of a few significant stocks and at most seven or eight major feedback loops'. These small models can 'yield accessible, insightful lessons for policy making' without overwhelming participants with too much detail.
There are two main approaches to simulation-based learning: learning by building a simulation, or learning by using an existing one. Reimann and Thompson assert that while learning by modelling may result in better long term learning outcomes, positive results have also been found in studies examining the effect of learning with pre-built models [43] (p. 115). Gobert and Buckley concur [44], stating that learners can gain more insight from building models, but considerable time and skills are required. If this is not feasible, manipulation of an existing simulation offers an alternative. The approach can vary from the simplest, where learners can change a few variable values and see the consequences of their decisions on graphs, to the more complex, where learners can restructure the model. Reimann and Thompson believe that, given the greater amount of time needed to train students to use modelling software, and for them to produce a working model, 'learning with prebuilt models may be a more realistic option in an environmental education context'.
The ESD learning tool developed for this study was designed in line with these general guidelines. It was designed for a single online learning session and therefore interaction with the simulation model was limited to manipulation of a few key variables.

Aims of the study
Summarising the themes identified in the reviewed literature, the following research areas were identified and motivated the work described in this paper:
1. There is a need for Systems-based sustainability learning tools that can be shown to increase the effectiveness of sustainability education.
2. It would be useful to evaluate the effect of Systems Thinking (theory, tools and techniques) separately from that of interactive simulation, so that the effect of each factor on learning outcomes, and their combined effect, can be compared.
3. If Systems Thinking can facilitate recognition of similar systemic structures in different sustainability problems, this could make a useful contribution to the development of transferable sustainability skills.
4. Formal trials to evaluate the effectiveness of approaches to ESD, specifically a Systems Thinking approach, are needed.

Hypothesis and Research Questions
The general hypothesis underlying our research was: Incorporating Systems Thinking increases the effectiveness of sustainability education.
The specific research questions, all regarding sustainability education, were:
RQ1: Does Systems Thinking enhance the learner's practical understanding of sustainability?
RQ2: Does interacting with System Dynamics simulations enhance the learner's practical understanding of sustainability?
RQ3: Does adding both Systems Thinking and System Dynamics simulation enhance learning more than Systems Thinking only, simulation only, or a non-systemic treatment?
RQ4: Do Systems Thinking and/or simulation support the transfer of sustainability understanding from one problem to another with a similar systemic structure?
A brief account of the initial design of the study was published before the study was conducted [45]. A fuller account of the design, together with the results and analysis, is given in the following sections.

Study Design
The study concerned comparison of educational outcomes, so the design was drawn from established practices in Social Sciences research [46]. The investigation was an experimental study using a two-by-two factorial design. The two factors, Systems Thinking and simulation, each had two levels, present or absent. To answer the research questions, the study aimed to discover the main effects, i.e. the effect of each factor on the learning outcome, and the interaction effect, or the combined effect of both factors.
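As an illustration of what main effects and an interaction effect mean in a two-by-two design, the following minimal sketch computes them from cell means. The function name, the data layout and the scores are hypothetical and invented for this example; they are not the study's data or analysis code.

```python
from statistics import mean


def factorial_effects(cells):
    """Return main effects and interaction for a 2x2 design.

    `cells` maps (systems_thinking, simulation) -> list of scores,
    where both keys are booleans (factor present or absent).
    """
    m = {key: mean(scores) for key, scores in cells.items()}
    # Main effect of a factor: average of the cell means where the factor
    # is present minus the average where it is absent.
    st = (m[(True, False)] + m[(True, True)]) / 2 \
        - (m[(False, False)] + m[(False, True)]) / 2
    sim = (m[(False, True)] + m[(True, True)]) / 2 \
        - (m[(False, False)] + m[(True, False)]) / 2
    # Interaction: does the effect of simulation change when Systems
    # Thinking is also present?
    interaction = (m[(True, True)] - m[(True, False)]) \
        - (m[(False, True)] - m[(False, False)])
    return {"st": st, "sim": sim, "interaction": interaction}


# Made-up percentage scores, for illustration only.
example = {
    (False, False): [40, 50],  # control
    (True, False): [50, 60],   # ST
    (False, True): [60, 70],   # Sim
    (True, True): [70, 80],    # ST + Sim
}
effects = factorial_effects(example)
```

In this invented example the two factors add up without interacting, so the interaction term is zero; a non-zero value would indicate that the combined treatment does more (or less) than the sum of its parts.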
The study was conducted in the summer of 2020. Participants were randomly assigned to one of four groups: a control group, a Systems Thinking (ST) group, a Simulation (Sim) group, and a Systems Thinking and Simulation (ST + Sim) group (see Table 1). They were then given access to an online learning tool. The control group saw only standard, non-systemic content. The other groups saw additional content according to their group, either a Systems Thinking section, a simulation section, or both. All groups took the same two quizzes, and the performance of the groups in these quizzes was compared using statistical methods.
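The allocation described above can be sketched as follows. This is a minimal illustration assuming simple (unblocked) random assignment with equal probability; the section names and their ordering are hypothetical labels chosen for the example and do not reproduce the actual gateway page.

```python
import random

# The four cells of the two-by-two design.
GROUPS = {
    "control": {"systems_thinking": False, "simulation": False},
    "ST": {"systems_thinking": True, "simulation": False},
    "Sim": {"systems_thinking": False, "simulation": True},
    "ST + Sim": {"systems_thinking": True, "simulation": True},
}


def assign_group(rng):
    """Pick one of the four groups with equal probability."""
    return rng.choice(list(GROUPS))


def sections_for(group):
    """List the content a participant in `group` would see (assumed order)."""
    flags = GROUPS[group]
    sections = ["deer_introduction"]
    if flags["systems_thinking"]:
        sections.append("systems_thinking")
    if flags["simulation"]:
        sections.append("simulation")
    # Every group takes the same two quizzes.
    sections += ["quiz_1", "fisheries_introduction", "quiz_2"]
    return sections


rng = random.Random(2020)  # fixed seed only to make the example repeatable
participant_group = assign_group(rng)
```

Simple random assignment does not guarantee equal group sizes; a blocked scheme would be one alternative if balance were required.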

Teaching method
The learning tool was originally planned to be used in a small group classroom context, with the researcher delivering an overview to the whole group before each participant would then engage with the learning tool individually. The researcher would have been available in person to answer questions about how to use or navigate the tool or to resolve any technical issues that might have arisen. However, due to Covid-19 restrictions, the training was re-designed as a single online unsupervised individual session. Support was available from the researcher via email.
The learning session lasted from approximately 50 minutes for the control group to approximately 100 minutes for the full treatment (ST + Sim) group.

The Sustainability Learning Tool
An open access version of the learning tool is available here: https://exchange.iseesystems.com/public/carolineb/sustainabilitylearning-tool 3 . It is not a simulation game or a flight simulator, in that learners are not asked to take the role of an actor in the scenarios. The Systems Thinking and simulation elements in the learning tool offer the 'big picture' of the systems underlying two sustainability problems and offer insights into their essential structure and dynamics. The learning tool also explores sustainable solutions to the problems. The emphasis is thus on systemic understanding and policy making.
The learning tool consists of two main sections, one for deer herd management and one for fisheries, as shown in Figure 1. The deer section contains additional sections for Systems Thinking and simulation. The learning tool features are summarised in Table 2. The design elements are outlined briefly in the following sections, except for SBLE design principles, which were discussed in the introduction.

Table 2 (excerpt). Learning tool features
• Identify leverage points (places to intervene in a system)
• Understand system equilibrium (a dynamic and sustainable state)
Simulation Exercises
• Simulate deer herd growth in first four years (exponential increase)
• Simulate deer herd growth in first ten years (exponential increase and then decline)
• Simulate deer herd growth, this time with vegetation added to the graph (vegetation decline explains decline in deer population)
• Simulate to find the estimated vegetation level after one year, given vegetation growth and simultaneous consumption by deer (interacting stock levels are hard to calculate without simulation)
• Try lowering initial deer population to avert collapse (this only delays it)
• Try increasing initial vegetation level to avert collapse (this only delays it)
• Try changing deer birth and death rates to obtain a stable population (birth and death rates must be equal)
• Try to make the deer herd sustainable (stabilise deer population AND ensure it does not exceed the carrying capacity)

Standard non-systemic introductory pages
Introductory pages consist of text, images, graphs and short embedded video clips. The general concept of sustainability is first explained and explored, then each sustainability theme is described using standard domain-specific terminology. See Figure 2 and Figure 3 for sample pages, one for each case study.

Embedded Quizzes and Surveys
Quiz 1 (deer management) and quiz 2 (fisheries) were tests of ability, used to measure sustainability understanding of the deer and fisheries sustainability problems. These quizzes were embedded in the tool as shown in Figure 4. All participants took these quizzes and the pre-survey, which contained the consent form. Some groups saw short post-ST and simulation feedback surveys, which were also embedded in the tool. Table 3 summarises the quizzes and surveys seen by each group.
All quizzes and surveys were refined by pilot testing. They are openly available along with the study data in the Zenodo dataset (URL: https://zenodo.org/record/5569508).

Case Studies, Systems Archetype and System Dynamics model
The learning tool supports a teaching approach that combines sustainability topics with case studies, Systems Thinking and simulation. Exploring two specific problems increases understanding of sustainability in context. The two problems, deer herd management and sustainable fisheries, are both examples of renewable resource management. Each problem is illustrated with a historic case of overexploitation of the renewable resource, leading to overshoot and collapse.
The catastrophic unsustainable growth of the Kaibab deer herd in the US in the 1920s has been the subject of analysis by System Dynamicists including Donella Meadows 4 and Andrew Ford [16] (p. 267). If natural predators are removed, deer will go on breeding until they overgraze and risk exceeding the carrying capacity of their environment.
The collapse of the Grand Banks cod fishery in 1992 is a famous example of disastrous unsustainable fishing practices [48]. Once one of the richest fishing grounds in the world, in 1992 the fishery collapsed completely, devastating the local community and economy. The collapse was caused by serious overfishing, which began in the late 1950s, together with poor management. Damage done to the coastal ecosystem proved irreversible and the cod fishery remains closed.
The Limits to Growth archetype, also known as Overshoot and Collapse [8] (p. 123), describes the behaviour of both these case studies well. The generic structure underlying this archetype consists of two stocks. The first stock grows exponentially while depending on a second stock, which is a renewable resource. Here, a fast-growing deer herd is eating ever more vegetation, and a growing fishing industry is exploiting fish stocks more and more heavily. This systemic structure will tend to cause the following behaviour. The first stock grows so rapidly that it overshoots, depleting the resource more rapidly than it can renew itself, leading to the collapse of the resource, and then the stock that depends on it. The deer herd overgrazes, causing collapse of the vegetation supply, and then the herd. The fishing industry overfishes, so that the fish population cannot reproduce itself, destroying the industry.
The remedy for this problematic dynamic is that the exponential growth of the first stock must be checked, so that the resource on which it depends will not be consumed faster than it can regenerate. If limits (e.g., carrying capacity or maximum sustainable yield) are respected, then the system can become sustainable, meaning that the second stock, the renewable resource, remains available to the first stock indefinitely because it is not overexploited and there is time for it to renew itself. A description of this strategy for sustainability was seen by all participants, including the control group.
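The two-stock structure and its remedy can be illustrated with a very small simulation. The sketch below is a toy model written for this purpose: its equations, parameter names and values are assumptions, and it is not the Breierova model used in the learning tool. Vegetation regrows by a fixed amount each year, so the carrying capacity here is `regrowth / food_per_deer` (400 deer with the defaults).

```python
def simulate(years, deer, vegetation, birth_rate, death_rate,
             regrowth=400.0, max_vegetation=5000.0,
             food_per_deer=1.0, starvation_rate=0.9):
    """Step a two-stock model forward one year at a time (Euler steps)."""
    deer_history, vegetation_history = [deer], [vegetation]
    for _ in range(years):
        demand = food_per_deer * deer
        available = vegetation + regrowth
        eaten = min(demand, available)
        # Fraction of the herd's food demand that is met this year.
        fed_fraction = eaten / demand if demand > 0 else 1.0
        vegetation = min(max_vegetation, available - eaten)
        # Unmet demand adds starvation deaths to the normal death rate.
        net_rate = birth_rate - death_rate \
            - starvation_rate * (1.0 - fed_fraction)
        deer = max(0.0, deer + net_rate * deer)
        deer_history.append(deer)
        vegetation_history.append(vegetation)
    return deer_history, vegetation_history


# Unchecked growth: births exceed deaths, so the herd overshoots.
growth_deer, growth_veg = simulate(30, 100.0, 5000.0,
                                   birth_rate=0.5, death_rate=0.1)

# Remedy: balanced rates and a herd within the carrying capacity.
steady_deer, steady_veg = simulate(30, 300.0, 5000.0,
                                   birth_rate=0.2, death_rate=0.2)
```

In the first run the herd grows exponentially, exhausts the vegetation stock and then falls sharply; in the second run both stocks stay constant, which is the sustainable state described above.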
The System Dynamics deer herd model used in the learning tool is slightly adapted from that documented by Breierova from the MIT System Dynamics in Education Project [47]. It is available in the Zenodo dataset published for this study. In her article, Breierova explains the usefulness of generic structures in helping transfer knowledge among different systems.

Systems Thinking principles
The essential Systems Thinking concepts, tools and techniques listed in Table 2 were chosen from the literature [49][50][51] as suitable tools for analysis of the two sustainability problems under consideration. These concepts, tools and techniques are explained in general terms in the first pages of the Systems Thinking section, and then they are used to analyse the deer herd population dynamics. Two sample pages from the Systems Thinking section are shown in Figure 5 and Figure 6. The section takes about 30 minutes to work through.

Simulation
Simulations of the deer herd population model provided the basis for a series of six exercises and two tasks, listed in Table 2, which explore the dynamics that lead to overshoot and collapse, and how those dynamics can be changed so that the herd size can become sustainable. The exercises explore in stages the interplay between the two stocks, deer and vegetation, and the key role of deer birth and death rates and vegetation regeneration and consumption rates. The exercises increase in complexity as the first stock and then the second stock is added, then the interaction between the two stocks is considered, then learners are given control of key variables so that they can explore their effects on the dynamics of the deer herd. The simulation section is presented using text and embedded simulations, with a pop-up comment, hint or explanation available for each exercise to provide feedback to the learner. The section takes about 20 minutes to work through.
The aim of this sequence of challenges is to demonstrate that a sustainable deer population can result if the birth and death rates are balanced and the population remains within the carrying capacity of the available land, so that the herd size will remain stable and will consume no more than the regenerated vegetation. Alternative strategies that might seem attractive, such as starting with a lower population or increasing the amount of vegetation (or size of the park), are shown to be ineffective, since the powerful exponential deer population growth dynamic dominates and will reach the limits of the park, albeit a little later. This exponential growth behaviour is seen to persist as long as the birth rate is greater than the death rate. This learning process is designed to encourage reflection, confronting erroneous assumptions and reconstructing beliefs.
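The point that a lower starting population or a larger resource only delays the problem follows directly from exponential growth, and can be checked with a few lines of arithmetic. The function and the numbers below are hypothetical and chosen only to illustrate the reasoning; they are not taken from the learning tool.

```python
def years_until_limit(initial, birth_rate, death_rate, limit, max_years=200):
    """Return the first year in which the herd exceeds `limit`.

    Growth is purely exponential (no resource feedback). Returns None if
    the limit is not exceeded within `max_years`.
    """
    population = initial
    for year in range(1, max_years + 1):
        population += (birth_rate - death_rate) * population
        if population > limit:
            return year
    return None


baseline = years_until_limit(100, 0.5, 0.1, limit=400)
half_start = years_until_limit(50, 0.5, 0.1, limit=400)
double_limit = years_until_limit(100, 0.5, 0.1, limit=800)
balanced = years_until_limit(100, 0.3, 0.3, limit=400)
```

With 40% net growth per year, halving the initial herd or doubling the limit each buys only about two extra years, whereas equal birth and death rates mean the limit is never reached.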
A sample simulation exercise is shown in Figure 7.
The sustainability principles and topics listed in Table 2 and tested in the quizzes were selected from the general literature on sustainability [52,53], renewable resource management [16], and a systems view of sustainability [54] (p. 214). They were chosen as necessary skills for analysing the two cases under consideration, guided by Harris [48] for the Grand Banks fishery collapse and Meadows' analysis of the Kaibab deer dynamics in her lectures, already cited. This list forms the framework for operationalising sustainability understanding using quiz 1 and quiz 2, making use of marking schemes to obtain quantitative percentage scores.
Note that these topics are limited to the cognitive aspect of sustainability understanding, not the affective, behavioural or other aspects [12].

Platform
Figure 8 shows the architecture of the learning tool. A gateway web page was used to allocate users to groups randomly and to provide a link to a Stella Architect interface with authentication and data collection enabled. The group id passed to the Stella interface determined conditional pathways according to group. The Stella interface was published to the ISEE Exchange. Quizzes and surveys were embedded in the learning tool using SurveyMonkey surveys, and these employed custom variables to allow user identification taken from Stella logins. Desktop or laptop computers were recommended, not mobile devices, because the simulations required a reasonably large screen.
The learning tool was divided into two sections, facilitating two experiments (see Figure 9). Experiment 1 was concerned with the effect of Systems Thinking and/or simulation on sustainability learning outcomes, and was designed to answer RQs 1, 2 and 3. Experiment 2 was concerned with the transfer of sustainability understanding from the deer problem to the fisheries problem and was designed to answer RQ4. Quiz 1 data were captured for experiment 1, and quiz 2 data for experiment 2.
In experiment 1, a significant increase for non-control group members in quiz 1 performance would suggest that Systems Thinking and/or simulation improved sustainability learning outcomes (RQ1, RQ2 and RQ3).
In experiment 2, a significant increase for non-control group members in quiz 2 performance would suggest that insights from Systems Thinking and/or simulation applied to the deer problem resulted in a transfer of sustainability skills to the fisheries problem (RQ4), since only a standard non-systemic description of the fisheries problem was provided.
Figure 9. How the pathways through the learning tool and the experiments were designed to answer the research questions

Participants and sampling methods
According to UNESCO, ESD is necessary for all 'citizens, voters, workers, professionals, and leaders' [55]. This is a very large population globally, so random selection was not possible because of resource and access constraints. Subjects were instead selected using non-probability sampling techniques: a combination of two forms of convenience sampling with self-selection [46] (p. 113). Convenience sampling means that participants chosen were those most easily accessible. Invitations to members of the public over 18 were sent out through emails, social media or website invitations, word of mouth etc. Individuals and groups targeted included university student societies, postgraduate students, environmental organisations and political parties, friends, acquaintances and colleagues.
Those contacted were also invited to pass the invitation on to others. This is known as snowball sampling and is a form of convenience sampling. In this way the sample was extended, repeating until the required number of valid datasets was collected. Those who signed up were self-selected from this large network. A two-by-two factorial design requires a minimum of 20 participants per group [56] (p. 87), so at least 80 subjects needed to be recruited.
The Covid-19 restrictions led to a decision to deliver the learning tool for unsupervised online use. This meant that people could participate from anywhere in the world.
Randomisation was by random assignment. Whilst it is a valid method for cancelling out the effects of extraneous variables, random assignment reduces generalisability across populations when compared to random selection.

Data collection, validation and anonymisation
The following data were collected from participants:
• In the pre-survey, basic information such as age, gender, degree subject and/or occupation, and prior knowledge of sustainability.
• Quiz 1 and quiz 2 answers comprised a mix of quantitative and qualitative data, for example, numeric answers to questions about population growth, and textual answers to questions about the meaning of sustainability in context. Each quiz question was scored numerically and included in the overall percentage results. Some qualitative answers were also analysed separately.
• The short surveys, appropriate for each treatment group, collected subjective feedback about the usefulness of the simulation and Systems Thinking sections. At the end of quiz 2, participants were also asked for overall feedback about the learning tool.
• The email address used to log in to the learning tool was captured by ISEE and used to allow identification of survey, simulation and page analytics data.
  o Simulation data were used to verify that users had interacted with the simulation exercises.
  o Learning tool page analytics were used to judge whether participants engaged adequately with the learning tool.
Once the data were collected, datasets were validated. Validation rules used to define acceptable engagement with sections and delay in recording quiz answers are detailed in the codebooks available in the Zenodo dataset. Datasets were also checked for completeness against the surveys and quizzes expected of each group member, summarised in Table 3. Participants were asked to complete promptly any feedback surveys that failed to record. Some cross-checking between SurveyMonkey data and ISEE data was necessary where items of data failed to record, for unknown reasons. Login email ids in the data were replaced with anonymised participant ids to avoid rater bias [57] (p. 209). In this study, since the researcher knew some participants personally and had access to demographic data collected in the pre-survey, there was a risk of rater bias. The researcher may also have been influenced, consciously or unconsciously, by knowing the participant's treatment group. To reduce these risks, quizzes were marked 'blind', i.e. the researcher knew neither the participant's identity nor the group to which they had been allocated.
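The replacement of login email ids with anonymised participant ids, and the hiding of the treatment group for blind marking, could be sketched in Python as follows. This is a minimal sketch under stated assumptions: the record layout, the field names (`email`, `group`, `quiz`, `answer`), the `P001`-style id format and both function names are hypothetical and are not taken from the study's scripts.

```python
import random

def build_id_map(emails, seed=None):
    """Map each distinct login email to an opaque participant id such as 'P002'."""
    rng = random.Random(seed)
    unique = sorted({e.strip().lower() for e in emails})
    rng.shuffle(unique)  # ids should not reveal alphabetical or sign-up order
    return {email: f"P{i:03d}" for i, email in enumerate(unique, start=1)}

def blind_for_marking(records, id_map):
    """Return copies of the records without the email or treatment group."""
    blinded = []
    for rec in records:
        rec = dict(rec)  # copy so the raw data are left untouched
        email = rec.pop("email").strip().lower()
        rec.pop("group", None)  # hide the treatment group from the marker
        rec["participant_id"] = id_map[email]
        blinded.append(rec)
    return blinded

# Hypothetical raw records; the same person appears with different letter case.
raw = [
    {"email": "ann@example.org", "group": "Sim", "quiz": 1, "answer": "42"},
    {"email": "bob@example.org", "group": "Control", "quiz": 1, "answer": "40"},
    {"email": "Ann@example.org", "group": "Sim", "quiz": 2, "answer": "15 years"},
]
id_map = build_id_map(r["email"] for r in raw)
blinded = blind_for_marking(raw, id_map)
print(blinded)
```

The id map itself would need to be stored separately from the marked data, since anyone holding it can reverse the anonymisation.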
Quiz scores were calculated, and background variable values recorded, using predetermined marking schemes and scales. Quiz answers, marking schemes and code books for recording results are all included in the published Zenodo dataset.

Data Analysis
A Factorial ANOVA is an appropriate overall test for exploring the causal relationship between two categorical independent variables and one quantitative dependent variable [58]. It detects whether any group differs significantly from the others. Factorial ANOVA differs from the standard (one-way) ANOVA test, in which there is only one independent variable. Certain assumptions about the distribution of the data must be met in order to conduct either form of ANOVA test. Kruskal-Wallis is a suitable non-parametric overall test that can be used instead of the standard ANOVA if these assumptions are not met. If the overall test finds a difference between the groups, individual post-hoc tests can be conducted. In this study, unpaired two-sample Wilcoxon tests (a non-parametric post-hoc test that can be performed after Kruskal-Wallis) and an independent two-sample t-test were conducted.
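To make the unpaired two-sample Wilcoxon (rank sum) test concrete, the following Python sketch computes the U statistic and a one-tailed p-value using the large-sample normal approximation. It is an illustration only and does not reproduce the study's R scripts: the function names are hypothetical, no tie or continuity correction is applied, and statistical packages that use exact methods or corrections may report somewhat different p-values.

```python
from statistics import NormalDist

def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def rank_sum_greater(treatment, control):
    """One-tailed test of whether treatment scores tend to exceed control scores."""
    n1, n2 = len(treatment), len(control)
    ranks = midranks(list(treatment) + list(control))
    w = sum(ranks[:n1])                  # rank sum of the treatment group
    u = w - n1 * (n1 + 1) / 2            # Mann-Whitney U for the treatment group
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mean_u) / sd_u
    p = 1 - NormalDist().cdf(z)          # right-tailed p-value
    return u, p

# Tiny made-up example: every treatment score exceeds every control score.
u, p = rank_sum_greater([4, 5, 6], [1, 2, 3])
print(u, round(p, 4))
```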
The statistical programming language R was used to create descriptive statistics such as graphs and summary statistics, to check assumptions for parametric tests, to carry out all the inferential statistics tests and to calculate effect sizes [59]. The R scripts necessary to reproduce all the results in detail are openly available in the Zenodo dataset (URL: https://zenodo.org/record/5569508).
Each significance test result documented in later sections includes a p-value, but arguably this does not measure the strength of the relationship. An effect size such as Cohen's d is a useful complement [60]. Cohen provided basic guidelines for interpreting the effect size, namely 0.2 as small, 0.5 as medium, and 0.8 as large [61]. However, he advised that his benchmarks were recommended for use only when no better basis is available. In education research, the average effect size is d = 0.4, with 0.2, 0.4 and 0.6 considered small, medium and large effects [62].
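The calculation of Cohen's d and its interpretation against the two sets of benchmarks quoted above can be sketched in Python. The pooled-standard-deviation form of d is used here as an assumption; the function names and the 'very small' label for values under the lowest threshold are illustrative choices, and the sample scores are made up.

```python
from statistics import mean, variance

def cohens_d(treatment, control):
    """Standardised mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * variance(treatment)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / pooled_var ** 0.5

# (small, medium, large) thresholds as quoted in the text.
COHEN = (0.2, 0.5, 0.8)
EDUCATION = (0.2, 0.4, 0.6)

def label(d, thresholds):
    """Describe |d| relative to a (small, medium, large) set of thresholds."""
    small, medium, large = thresholds
    size = abs(d)
    if size >= large:
        return "large"
    if size >= medium:
        return "medium"
    if size >= small:
        return "small"
    return "very small"

d = cohens_d([2, 4, 6], [1, 3, 5])  # made-up scores; d = 0.5
print(d, label(d, COHEN), label(d, EDUCATION))
print(label(0.6, COHEN), label(0.6, EDUCATION))
```

The same value of d can therefore receive different labels: 0.6 is 'medium' on Cohen's general benchmarks but 'large' on the education benchmarks.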
Randomisation in the study design aims to generate comparable groups and so eliminate the effect of extraneous variables. However, it is always possible that unidentified confounding variables exist, that confounding is introduced by inappropriate adjustments, or that the effects of confounders are not entirely removed [63]. The approach taken to known extraneous variables in this study was to check whether they were evenly distributed across groups and/or to examine their effect on scores using stratification. Where effects were found, data were not formally adjusted to compensate; instead, adjustments were sometimes estimated but, more generally, limitations to findings and recommendations for further experimental studies were reported.
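The two checks described above, the distribution of a background variable across groups and its effect on scores by stratum, might look like the following Python sketch. The rows, the field names (`group`, `age_band`, `score`) and the function name are hypothetical; the study's own analysis is in the R scripts in the Zenodo dataset.

```python
from collections import Counter, defaultdict
from statistics import mean

def stratified_means(rows, stratum_key):
    """Mean score for each (group, stratum) cell."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row["group"], row[stratum_key])].append(row["score"])
    return {cell: mean(scores) for cell, scores in sorted(cells.items())}

# Made-up rows for illustration only.
rows = [
    {"group": "Control", "age_band": "under 65", "score": 60},
    {"group": "Control", "age_band": "65+", "score": 62},
    {"group": "ST + Sim", "age_band": "under 65", "score": 64},
    {"group": "ST + Sim", "age_band": "65+", "score": 66},
    {"group": "ST + Sim", "age_band": "65+", "score": 68},
]

# Check 1: is the background variable evenly distributed across groups?
counts = Counter((r["group"], r["age_band"]) for r in rows)

# Check 2: within each group, do scores differ between strata?
means = stratified_means(rows, "age_band")
print(counts)
print(means)
```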

Profile of participants
Of 227 people who signed up to participate, 80 did not follow up, and 8 started but withdrew. There were 33 incomplete or invalid datasets. Some participants experienced technical issues or otherwise needed support to complete the learning experiment. After data validation there were 106 complete datasets, one dataset per participant.
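The participant flow can be checked with simple arithmetic. The short Python sketch below uses only the counts stated above, plus the minimum of 80 participants required by the factorial design; the variable names are illustrative.

```python
signed_up = 227
did_not_follow_up = 80
withdrew = 8
incomplete_or_invalid = 33

complete = signed_up - did_not_follow_up - withdrew - incomplete_or_invalid
completion_rate = complete / signed_up

minimum_required = 4 * 20  # four groups of at least 20 participants

print(complete, f"{completion_rate:.1%}", complete >= minimum_required)
```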
The majority of participants (58.5%) were female, and 41.5% were male. The average age was around 50 years. The great majority (85%) resided in Ireland or the UK (60% and 25% respectively), and 15% elsewhere. Most were graduates or postgraduates (77%), with the average education level a little below Master's degree level. The majority (62%) of participants had little or no prior knowledge about sustainability. The vast majority (87%) of participants had no prior knowledge of Systems Thinking or System Dynamics. For nearly two-thirds (64%) of participants, their occupation and/or education had little or no relevance to sustainability or Systems Thinking.

Experiment 1: Quiz 1 (Deer Herd Management) Scores
The simulation group, highlighted in Table 4, performed best. All treatment groups performed better than the control group. The distribution of quiz scores is shown in the boxplot in Figure 10, which reveals an outlier in the control group. Unexpectedly, the mean score of the full treatment group (ST + Simulation) was lower than that of both the Systems Thinking group and the Simulation group. This raised the question of why the mean score obtained when both factors were combined was not at least as high as that obtained with either factor alone.
Scores for individual quiz questions were compared to find out which groups performed best on specific sustainability topics. The ST + Sim group scored lower than expected. The distribution of known background variables between groups was explored in case any were confounding, and this group was found to differ from the others in three ways: it contained far more over-65s, its members had far less prior sustainability knowledge, and it experienced far more delays in both quizzes due to technical issues or interruptions. However, closer analysis using stratification found that neither higher age nor more delays were associated with lower quiz scores. A lower average prior sustainability knowledge score did affect scores a little: estimating the effect on the group mean score of raising the average level of prior knowledge to that of the other groups suggests an increase of 1.2%, not enough to create a significant result for that group. Nevertheless, further studies could use techniques such as restriction or matching in the study design to eliminate any possible effect. A design that simply excluded people with high prior sustainability knowledge could be sufficient.
A much more likely explanation for the poor performance of this group, the full treatment group, was found in the significant negative interaction effects uncovered by Factorial ANOVA testing and described in the Inferential Statistics section.

Experiment 2: Quiz 2 (Sustainable Fisheries Management) Scores
Again, the simulation group, highlighted in Table 5, performed best. The other treatment groups performed worse than the control group when comparing the mean, median and mode of group scores. The distribution of quiz scores is shown in the boxplot in Figure 11. There are two outliers in the control group, one of which (the lowest score) belonged to the same participant as the outlier in Figure 10. The other outlier was less extreme, and was an outlier only for the control group, not for the participants as a whole. No obvious errors or unusual circumstances gave rise to this outlier.
Scores for individual questions were compared by group to find out which groups performed best on specific sustainability topics. The Sim group outperformed all other groups in the calculation of years to maximum fishery capacity, in understanding maximum sustainable yield, and in identifying sustainable graph patterns. It performed a little better than other groups in single stock exponential growth and maximum capacity calculations, and in defining sustainability in the context of fisheries.

Inferential Statistics
Table 6 summarises the process followed when conducting the inferential statistical tests on quiz 1 and quiz 2 data. The main findings are highlighted. All R scripts necessary to reproduce the results outlined here, including the analysis of possible confounding variables, are available in the Zenodo published dataset. The following paragraphs provide explanatory notes.
From the literature, there is a clear expectation that Systems Thinking and/or simulation will increase understanding of sustainability problems. Therefore, alternative hypotheses tested asserted that scores for treatment groups would be greater than those of the control group, leading to right-tailed (one-tailed) significance tests. The null hypotheses were that there were no differences between the groups.
Where a parametric test was conducted, the appropriate assumptions for the test were first checked. The assumptions for ANOVA tests are the independence of observations, the homogeneity of variances and the normality of residuals. The first condition is satisfied since participants in this study were randomly allocated to treatment groups. Levene's test for homogeneity of variance and the Shapiro-Wilk test for normality of residuals were both carried out using R on the appropriate datasets to check the other two assumptions. The assumptions for the two-sample independent t-test are similar: independence of the observations, an approximately normal distribution for each group, and homogeneity of variances. If the assumptions were not met, a suitable non-parametric test was used instead.
Parametric tests do not work well when there are outliers [46] (p. 592). The outlier score in quiz 1 was removed, and the more extreme outlier in quiz 2 removed, before ANOVA testing. Finally, two quiz 2 datasets were removed prior to analysis, as page analytics logs revealed that these participants did not engage with the fisheries section of the learning tool. They both spent no more than two minutes on the fisheries section, whereas the minimum acceptable time was 5.5 minutes, and recommended time was 15 minutes.
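The engagement rule applied to the fisheries section can be expressed as a simple filter. In the Python sketch below, the 5.5 minute minimum comes from the text, while the participant ids, the times and the function name `split_by_engagement` are hypothetical; the full validation rules are those in the codebooks in the Zenodo dataset.

```python
MIN_MINUTES = 5.5  # minimum acceptable time on the fisheries section (from the text)

def split_by_engagement(minutes_by_participant, minimum=MIN_MINUTES):
    """Separate participants who met the minimum time from those who did not."""
    kept, excluded = {}, {}
    for pid, minutes in minutes_by_participant.items():
        target = kept if minutes >= minimum else excluded
        target[pid] = minutes
    return kept, excluded

# Made-up page-analytics times, in minutes.
fisheries_minutes = {"P001": 14.0, "P002": 1.5, "P003": 5.5, "P004": 2.0}
kept, excluded = split_by_engagement(fisheries_minutes)
print(sorted(kept), sorted(excluded))
```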
The Factorial ANOVA tests revealed that both in quiz 1 and quiz 2 the presence of both factors (Systems Thinking and simulation) created a negative, or 'antagonistic' interaction effect. Interaction plots are shown in Figure 12 and Figure 13. This means that adding a second treatment reduced the quiz scores. The interaction effect partly cancelled out the main effects of each factor alone. This refutes RQ3.
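A negative interaction in a two-by-two design can be read as a 'difference of differences': the effect of one factor is smaller when the other factor is present than when it is absent. The Python sketch below computes this contrast from four group means. The means are invented to mirror the pattern described for quiz 1 (every treatment group above the control group, and the combined group below either single-factor group); they are not the study's results, and the function name is hypothetical.

```python
def interaction_contrast(control, st, sim, both):
    """Effect of simulation with ST present minus its effect with ST absent."""
    sim_effect_without_st = sim - control
    sim_effect_with_st = both - st
    return sim_effect_with_st - sim_effect_without_st

# Hypothetical group mean scores (percent), for illustration only.
means = {"control": 60.0, "st": 66.0, "sim": 70.0, "both": 62.0}

contrast = interaction_contrast(**means)
print(contrast)  # a negative value indicates an antagonistic interaction
```

A contrast of zero would mean the two effects are purely additive; a Factorial ANOVA tests whether the observed contrast differs from zero by more than chance would explain.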

Effect Size
For quiz 1, the best-performing group was the Sim group, with a Cohen's d effect size of 0.6, a large effect in an educational context. ST improved learning outcomes but had a weaker effect (Cohen's d 0.4, a medium effect). ST + Sim had a still weaker effect (Cohen's d 0.1, a very small effect).
For quiz 2, the Sim group was the only group that performed better than the control group, so effect sizes for the other groups were not calculated. Cohen's d for the Sim group was 0.4, a medium effect in an educational context.
Table 7 provides a formal summary of the results of the inferential tests and effect sizes and provides answers to the research questions and main hypothesis.
Sim scores were better than the control group scores in quiz 2, but the result is weaker than for quiz 1.
A Wilcoxon rank sum test (one-tailed, and using the Bonferroni correction to adjust p) showed that there was no significant increase in mean scores for the Sim group (p = .0787). However, the mean scores were significantly better at the 90% confidence level (α = 0.10). The effect size was medium in the educational context (Cohen's d 0.4).
Other treatment groups did not perform better than the control group.
A Factorial ANOVA test found a significant interaction effect at the 90% confidence level between the two factors on score (p = .052). The interaction effect was negative, since the presence of each factor reduced the effect of the other factor. Systems Thinking reduced scores.
Hypothesis: Incorporating Systems Thinking increases the effectiveness of sustainability education.
Result: Only simulation was found to significantly increase sustainability quiz scores. Systems Thinking increased scores in quiz 1, but not significantly.
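The Bonferroni correction used to adjust p-values when several post-hoc comparisons are made can be sketched in Python. Each p-value is multiplied by the number of comparisons and capped at 1. The p-values below are invented for illustration and are not the study's results; the function name is hypothetical.

```python
def bonferroni(p_values):
    """Adjust p-values for multiple comparisons (multiply by m, cap at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Made-up unadjusted p-values for three treatment-versus-control comparisons.
raw_p = [0.012, 0.030, 0.400]
adjusted = bonferroni(raw_p)

significant_at_05 = [p <= 0.05 for p in adjusted]
significant_at_10 = [p <= 0.10 for p in adjusted]
print(adjusted, significant_at_05, significant_at_10)
```

As the second comparison in this example shows, an adjusted p-value can fall short of significance at α = 0.05 while still being significant at α = 0.10.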

Feedback from Participants
The quantitative and qualitative feedback data summarised in this section are available in full in the published Zenodo dataset. The main points are summarised below.

General Feedback on the Learning Tool
Comments were generally very positive. The most frequent words used are visualised in the word cloud in Figure 14. The most frequent evaluative words used were 'interesting' and 'informative'.
Stop words: found, can, made, used, ones, second, seems, taking, back, way, make, done, quite, first, two, around, section, without, also, use, one, content, without, will
Summarising the 65 comments, some people commented favourably on the benefits of interactive learning with simulation. It helped them to better understand cause and effect and the consequences of policy decisions, allowed experimentation and knowledge construction, and made learning enjoyable. Case studies were found useful.
Some people remarked that they understood the first problem better than the second, because of the Systems Thinking analysis and/or simulation provided for the first problem and not the second. Perhaps they struggled to transfer their understanding from the first problem to the second, unlike this person, who saw the connection: 'The similar nature of the two problems meant that lessons learned in simulations on the first problem applied to the second problem without simulations'. Another person said that they internalised the lessons from simulation and were able to apply those lessons to the second problem without explicit simulation material.
This comment suggests the usefulness of combining both factors: 'The challenge in education is that often times these two valuable tools (systems thinking and simulation) are parsed but their combined use here has been excellent in facilitating knowledge construction'.
Some people found the mathematical aspects of the learning material challenging, and another cautioned about the complexity of the Systems Thinking section.

Feedback on the usefulness of Systems Thinking
Participants with access to the Systems Thinking section were asked to rate how useful they found it, on a 5-point Likert scale (see Figure 15). About three-quarters (74.1%) of participants said it helped quite a lot or really transformed the way they saw the problem. The most frequent words found in the 24 optional comments given about the usefulness of Systems Thinking are visualised in Figure 16. The most frequent evaluative words were 'helped', 'helpful' and 'excellent'. Participants remarked on the usefulness of graphs, diagrams, videos and highlighted words for simplifying a complex topic and making it more memorable. Systems Thinking was found useful for understanding interrelationships and how systems interlink, identifying the point at which systems become unsustainable, clarifying cause and effect, identifying patterns of behaviour and changes over time, and making decisions. A few expressed concern about remembering the complex terminology.
Stop words: see, use, see, use, can, need, often, gives

Feedback on the usefulness of simulation
Participants with access to the simulation section were asked to rate how useful they found it on a 5-point Likert scale (see Figure 17). The great majority (84.6%) of participants felt that simulation helped quite a lot or really transformed the way they saw the problem, higher even than the 74.1% of participants who felt the same about Systems Thinking. When analysed by group (figures not shown), 92% of Sim group participants felt that simulation helped quite a lot or really transformed the way they saw the problem, compared with 78.6% of ST + Sim group participants. The most frequent words found in the 32 optional comments about the usefulness of simulation are visualised in Figure 18. The most frequent evaluative or descriptive words used were 'useful', 'interactive', 'learning', 'remember', 'understanding' and 'thinking'. Participants found simulation useful for increasing clarity and understanding by adjusting variables, experimenting with strategies, assessing impacts and informing policy decisions, for seeing how quickly resources can be depleted, for finding sustainable limits, and for teaching responsibility. Some said that interactivity helps learning and retention, and that seeing graphs change dynamically is more effective for understanding complexity and real-world problems than reading text and performing the mathematical work themselves.
Number of comments scanned: 32. Minimum word frequency: 3. Stop words: see, also, made, might, much, seeing, thing, can, will

Discussion and Conclusions
The main findings were that System Dynamics simulation has a strong effect on understanding a sustainability problem, and a weaker effect, significant at the 90% confidence level, on transfer of understanding to another problem with a similar systemic structure. Systems Thinking did not make a significant difference to mean scores in either case, and combining Systems Thinking and simulation in the full treatment group produced a negative interaction effect. This could be evidence that the additional learning material, or perhaps its abstract complexity, pushed participants over a limit with respect to 'cognitive load' [64] in this experimental setting (a single learning session). It could also be evidence that quantitative simulation produces better learning outcomes than more qualitative approaches. Interactive simulation provides an opportunity for learners to perform actions (operations) and build their understanding of a system through 'operational thinking' [31,65].
Feedback from participants was very positive, with a large majority reporting that they found both Systems Thinking and simulation useful.
In conclusion, simulation is a highly effective tool for enhancing sustainability understanding in a single short learning session, even when learning is done remotely online without supervision.

Limitations of the study
Conclusions are limited to the effect of the factors in a learning environment designed for a single individual learning session. Because participants were not randomly selected, they may not represent the whole population, so the study should be repeated to check external validity. The findings of the study are limited to one aspect of sustainability, namely knowledge. The study did not evaluate affective or behavioural aspects.

Suggestions for Future work
The medium effect size and positive feedback from learners suggest that Systems Thinking may be useful if presented differently. It may have a stronger effect on understanding and development of transferable skills if taught in an interactive classroom or group situation and not limited to a single session. Care should be taken not to impose excessive cognitive load on learners.
Furthermore, effect sizes in educational research are often categorised as small by Cohen's standards [66]. This is because there are typically many other important factors affecting results, such as prior education and skills in numeracy, literacy, science and so on. Interpreting effect sizes in educational interventions is a complex and evolving matter [ibid]. Where effect sizes are modest, the sample size must be increased in order to increase the power of the study. Increasing statistical power decreases the probability of a Type II error, in which the researcher wrongly concludes that there is no effect when one actually exists [60].
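The relationship between effect size, power and sample size can be illustrated with a standard normal-approximation formula for comparing two group means, n per group ≈ 2((z_α + z_β) / d)². The Python sketch below is a rough planning aid under that assumption, not a calculation performed in this study; the function name and the default values (one-tailed α = 0.05, power = 0.80) are illustrative, and an exact t-based calculation would give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80, one_tailed=True):
    """Approximate participants needed per group to detect effect size d."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha) if one_tailed else z(1 - alpha / 2)
    z_beta = z(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

for d in (0.2, 0.4, 0.6):
    print(d, n_per_group(d))
```

Because the required sample size grows with the inverse square of d, halving the expected effect size roughly quadruples the number of participants needed per group.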
Since Systems Thinking and simulation were the factors under investigation, and sustainability understanding was the dependent variable, an improvement to a future study design would be to exclude people already knowledgeable in those areas.
The methodology (factorial study design coupled with Factorial ANOVA significance testing) could be usefully employed for further studies to investigate various styles of learning intervention, such as multiple learning sessions, role play and group model building, or it could be used to evaluate existing virtual worlds, games and simulators. The effect of other factors (independent variables) could be evaluated using a similar factorial design.

Funding Information
This research was undertaken for the PhD studies of the corresponding author at the National University of Ireland Galway (NUIG) and was supported by funding from ResponSEAble (EU Horizon 2020 project number 652643), Ireland's Higher Education Authority (through the IT Investment Fund and ComputerDISC, and the Covid-19 Costed Extension), and the NUIG PhD Write-Up Bursary.

Conflict of Interest
The authors declare no conflict of interest.

Research Ethics Statements
All subjects were adults aged over 18 and gave their informed consent for inclusion before they participated in the study. For this non-interventional study, all participants were fully informed why the research was being conducted, how their data would be used, that confidentiality was assured, that data would be anonymised before publishing, of any associated risks, and of their right to withdraw.

Data Availability Statement
Data available in a publicly accessible repository. The anonymised data presented in this study, together with the R Scripts necessary to reproduce all results, are openly available in Zenodo. DOI: https://doi.org/10.5281/zenodo.5569508 URL: https://zenodo.org/record/5569508.