Leaderboard Positions and Stress—Experimental Investigations into an Element of Gamification

Gamification, i.e., the use of game elements in non-game contexts, aims to increase people’s motivation and productivity in professional settings. While previous work has shown both positive and negative effects of gamification, barely any studies so far have investigated the impact different gamification elements may have on perceived stress. The aim of the experimental study presented in this paper was thus to explore the relationship between (1) leaderboards, a gamification element which exchanges and compares results, (2) heart rate variability (HRV), used as a relatively objective measure for stress, and (3) task performance. We used a coordinative smartphone game, a manipulated web-based leaderboard, and a heart rate monitor (chest strap) to investigate respective effects. A total of n = 34 test subjects participated in the experiment. They were split into two equally sized groups so as to measure the effect of the manipulated leaderboard positions. Results show no significant relationship between the measured HRV and leaderboard positions. Neither did we find a significant link between the measured HRV and subjects’ task performance. We may thus argue that our experiment did not yield sufficient evidence to support the assumption that leaderboard positions increase perceived stress and that such stress may negatively influence task performance.


Introduction
Gamification, the use of game elements outside a game context ( [1], p. 2), enjoys an ever-growing interest in the workplace as well as in educational contexts. The main goal of gamification is to improve a process through playful experiences in such a way that the personal added value of those carrying it out increases ( [2], p. 19). Furthermore, it has been shown to have positive effects on motivation and productivity [3]. On the negative side, however, aspects such as off-task behavior [4,5], negative feedback, as well as physical and psychological damage [6] have also been linked to gamification.
Game elements are different components that are used in games ( [1], p. 3). Reeves and Read [7] define the ten most common game elements, highlighting examples such as time pressure, feedback, stories, reputation in the form of levels, awards, rankings and competition according to set rules. Dale uses the term game mechanics for game elements and also lists luck and the exchange of results within a community as the most frequently used game mechanics ( [3], p. 82). The effects of gamification are highly dependent on both the environment used and the users of the system ( [8], p. 3027), where it is important to have a group of users who are pursuing the same goal ( [9], p. 8). Yet, if gamification is applied correctly, it offers a captivating experience accompanied by a noticeable learning effect ( [3], p. 85).

Recent Work on Gamification
In a meta-analysis of gamification in learning settings, Sailer & Homner found a small effect of gamification on motivational learning outcomes [10]. A more recent study (n = 205) further showed a connection between the use of gamification and the subsequent application of knowledge, which was moderated by students' learning process performance [11]. Zainuddin and colleagues, on the other hand, found gamification to be a promising and effective alternative to formative assessments [12]. To this end, however, it was also shown that the effects of gamification often do not last and that lower performing students are less likely to benefit from its use [13]. Yet, it seems to enhance a learner's user experience and thus may improve more traditional educational contexts [14]. Finally, Groening & Binnewies' work showed that achievements as an element of gamification may foster motivation and subsequently impact positively on performance [15].

The Leaderboard Element
One specific element of gamification is the so-called leaderboard. Leaderboards are tools which can be used to exchange and compare the results within a community [3,16]. They serve as a feedback mechanism for social competition and may promote the engagement of participants. They show an overall ranking and are regularly updated so that players can compete for higher placements ( [17], pp. 94-95). In contrast to other game elements such as badges or levels, leaderboards allow immediate and direct comparisons [18]. However, they can be harmful for users with poor performance ( [4], p. 179). That is, if users feel forced into a competitive situation, this may have negative effects on skills and attitudes, or as Lopez puts it: "Being consistently at the bottom of the ranking list is negative feedback." (online: https://www.latimes.com/health/la-xpm-2011-oct-19-la-me-1019-lopez-disney-20111018-story.html (accessed on 26 March 2021)). Users who feel that they will never be at the top are inclined to let go of their ambition ( [19], p. 565). Hence, leaderboard-triggered transparency may not always be the ideal solution, as users can easily feel weakened compared to better placed persons ( [6], p. 166), and consequently may encounter higher levels of stress.

Gamification and Stress
Jamal ([20], p. 728) defines stress as the reaction to elements of the environment which appear threatening. Stress factors can, for example, result from the tasks to be carried out or from the environment itself ( [21], p. 5). Abbe's [21] model assumes that subjective stress can lead to conditions such as anxiety, hostility, and depression. These conditions have a negative effect on task performance. It is also assumed that subjective stress is caused by events occurring during task performance. The more frequent and more intense the events are for the individual, the greater the subjectively perceived level of stress. The environment determines, in part, the frequency with which these events occur. Individual characteristics such as prior experience, Type-A behavioral patterns and fear of negative evaluation determine the frequency and intensity of the stress burden on individuals ( [22], pp. 618-620). While stress can also be measured by self-report questionnaires [23,24], a person's heart rate variability (HRV) is a physical measure of stress, which may be used to measure the level of stress experienced more objectively ( [25], p. 236). Jobbágy et al. also recommend HRV to assess the current stress level ( [26], p. 238).
Although there is a significant body of work on the positive effects of gamification, little is known about the impact gamification elements may have on perceived stress. A recent study by Paniagua and colleagues, for example, found that gamification reduces stress levels and improves the academic performance of chemical engineering students [27]. Teenakoon & Wanninayake [28], on the other hand, highlighted a moderating effect on the relationship between work stress and work performance of non-managerial bank employees in Sri Lanka. Yet, to our knowledge, there have not been any investigations into the effects of leaderboards.

Research Question
The fact that (1) little is known regarding the effect of leaderboards on stress experience, and (2) that most previous studies concerning stress in gamification focused on self-report scales of stress, inspired the following research question guiding our investigations: How is the leaderboard position related to (physical measures of) stress and changes in task performance?
In the following, we start our report with Section 2 discussing the Physical Measures of Stress. Next, Section 3 outlines the Hypotheses, Materials and Procedure we used for our study. Section 4 summarizes the Results obtained and Section 5 provides a Discussion of the respective conclusions which may be drawn. Finally, Section 6 highlights the work's Limitations and Section 7 provides an overall Summary and Future Recommendations.

Physical Measures of Stress
In order to investigate the research question outlined above, we designed an experimental setting. We focused on factors influencing task performance while controlling for other variables potentially influencing stress as well as task performance. The following outlines the measurement variables used and subsequently explains the research model.

Heart Rate Variability and Its Measurement
HRV was first studied by Hon and Lee [29], who found that fetal distress was preceded by a change in the intervals between heartbeats before any appreciable change in heart rate occurred. For respective studies, Castaldo et al. ([30], pp. 376-377) propose the following:
• Define the length of the HRV measurements and the anticipated stressors to the best of available knowledge
• If at all possible, avoid physical activity
• Perform the analysis of the HRV values according to standardized guidelines
• Carry out the stress measurement according to the aim of the study, i.e., before, during or after the session
HRV indicators are determined by various differences between a series of ordered heartbeats ( [25], p. 231). In order to record a series of heartbeats, a continuous measurement of the heart rate is necessary (ibid.). After the acquisition, the series must be corrected for abnormal beats, so-called artifacts. An artifact is given if the interval between two heartbeats deviates from the measured median by more than a certain limit. The removal of artifacts must be carried out carefully, as even individual artifacts can have a significant influence on the analysis results [31]. Artifacts are of technical or physiological origin and can be caused by poorly attached measuring devices or by movement of the test subjects ( [32], p. 1). Artifact correction can either be carried out manually by removing individual heartbeats or automatically using a software program ( [33], p. 7). When the raw data is cleaned up by a software program, a limit value for the removal of artifacts must be determined ( [31], p. 4). In the next step, the interbeat intervals of the heartbeats are determined ( [25], p. 231). The interbeat intervals are defined as the time intervals between successive R-peaks, which reflect the contraction of the heart chambers.
Since the interbeat intervals are determined by successive "normal" heartbeats, these intervals are often referred to as "normal to normal interval" (NN interval) or "R to R interval" (RR interval) (ibid.) ( [34]
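The median-based artifact criterion described above can be illustrated with a minimal Python sketch. It is not the correction algorithm of any particular software package: the function names, the use of a single global median and the 250 ms limit are assumptions chosen purely for illustration.

```python
from statistics import median


def flag_artifacts(rr_ms, threshold_ms=250.0):
    """Mark RR intervals (in ms) that deviate from the series median
    by more than threshold_ms. The threshold value is an assumption."""
    centre = median(rr_ms)
    return [abs(rr - centre) > threshold_ms for rr in rr_ms]


def remove_artifacts(rr_ms, threshold_ms=250.0):
    """Return the RR series with all flagged intervals dropped."""
    flags = flag_artifacts(rr_ms, threshold_ms)
    return [rr for rr, is_artifact in zip(rr_ms, flags) if not is_artifact]


# Hypothetical RR series with one missed beat (1620 ms) and one extra beat (400 ms)
rr = [812, 798, 805, 1620, 790, 801, 400, 808]
clean = remove_artifacts(rr)
print(clean)
```

Dropping intervals is only one option; correction procedures may instead replace flagged intervals by interpolated values, which keeps the time axis intact.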

HRV Analysis Methods
The RR intervals can be analyzed using various methods [25,33,35]. Below we discuss two common analysis methods, i.e., the time domain analysis and the frequency analysis.
• Time domain analysis: The time domain analysis describes fluctuations in the RR intervals ( [33], p. 2). In the time domain analysis, among other things, the square root of the mean of the squared differences between consecutive RR intervals (RMSSD), the number of pairs of adjacent NN intervals that differ by more than 50 ms (NN50), and the percentage (pNN50) obtained by dividing NN50 by the total number of NN intervals are calculated [25,26]. Castaldo et al. carried out a systematic review of acute psychological stress assessment through HRV analysis. The reviewed studies showed that pNN50 and RMSSD decreased during the stress measurement compared to the rest measurement [30].
• Frequency analysis: The frequency analysis calculates the distribution of the absolute and relative power across different frequency bands ( [33], p. 2). The bands are divided into "very low frequency" (VLF), "low frequency" (LF) and "high frequency" (HF) [25,35]. The ratio of the relative power values (i.e., LF/HF) allows for a direct comparison between people despite large differences in the absolute power ( [33], p. 2). There is an inverse correlation between the HR value and stress [30,33], where an increase in the LF values points to less stress ( [30], p. 373). HF values are strongly influenced by artifacts, since R-peaks occurring in the data increase the power at higher frequencies ( [31], p. 9).
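To make the time domain indicators concrete, the following minimal Python sketch computes RMSSD, NN50 and pNN50 from a short series of RR intervals. The function names and the example values are hypothetical; pNN50 is divided by the total number of NN intervals as described above, although some tools divide by the number of successive differences instead.

```python
import math


def successive_differences(rr_ms):
    """Differences between consecutive RR intervals (ms)."""
    return [b - a for a, b in zip(rr_ms, rr_ms[1:])]


def rmssd(rr_ms):
    """Root mean square of successive RR interval differences."""
    diffs = successive_differences(rr_ms)
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def nn50(rr_ms):
    """Number of adjacent interval pairs differing by more than 50 ms."""
    return sum(1 for d in successive_differences(rr_ms) if abs(d) > 50)


def pnn50(rr_ms):
    """NN50 as a percentage of the total number of NN intervals."""
    return 100.0 * nn50(rr_ms) / len(rr_ms)


# Hypothetical RR intervals in milliseconds
rr = [800, 860, 790, 795, 850, 845]
print(rmssd(rr), nn50(rr), pnn50(rr))
```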

Duration of HRV Measurements
HRV measurements are divided by duration into three categories: 24-h, short-term and ultra-short-term measurements. Short-term measurements have a duration of approx. five minutes, with ultra-short-term measurements comprising all measurements shorter than five minutes ( [33], p. 2). Although HRV values are traditionally calculated from five-minute to 24-h recordings, ultra-short-term recordings can also determine cardiac activity ( [35], p. 355). Shaffer and colleagues recorded five-minute rest measurements from 38 students and correlated ultra-short-term measurements with the complete measurement. They found that measurements require 60 s for RMSSD and pNN50, 90 s for LF and 180 s for HF and LF/HF ( [36], p. 231). Nussinovitch et al. [37] compared ten-second and one-minute rest recordings with five-minute recordings from 70 healthy test subjects, and found that ultra-short-term RMSSD measurements achieved acceptable correlations. When examining 467 volunteers, Baek, Cho, Cho & Woo recorded five-minute resting measurements. They found that measurements of 20 s for HF, 30 s for RMSSD, 60 s for pNN50 and 90 s for LF are needed in order to be able to make valid statements ( [38], p. 413). In order to investigate ethnic and gender-specific differences in HRV measurements, Li et al. conducted three 30 s measurements to infer RMSSD and HF [39]. To validate short-term measurements, Salahuddin, Cho, Jeong & Kim divided a 24-h measurement into 30-min intervals. These half-hour intervals were divided into three ten-minute measurements, from which 10 to 150 s measurements were taken at random. They found that conclusions about acute psychological stress can be drawn from 10 s measurements for RMSSD and from 20 s measurements for pNN50, HF and LF/HF ( [40], p. 4658).

AI-Driven Physiological Measures
While the above describes HRV as a previously used and commonly accepted measure of stress, recent work in artificial intelligence and Big Data has led to alternative approaches for measuring this and other types of physiological data streams. Massaro et al., for example, presented an AI-driven diagnostics platform connected to a wearable device to predict individual physiological data [41,42]. Kim et al., on the other hand, used facial image threshing to recognize emotions, from which one may also deduce individual stress levels [43]. Finally, Gonzalez-Viejo and colleagues demonstrated non-contact heart rate and blood pressure measures using the photoplethysmography technique and machine learning [44].

Hypotheses, Measures and Procedure
In the following, we outline the hypotheses assumed for our investigations (subdivided into core and control hypotheses), present the measures used and explain our procedure.

Core Hypotheses: Leaderboard, Stress and Task Performance
Lopez (online: https://www.latimes.com/health/la-xpm-2011-oct-19-la-me-1019-lopez-disney-20111018-story.html (accessed on 26 March 2021)) examined the use of digital leaderboards in Disneyland hotels in Anaheim, California. There, digital leaderboards were implemented to compare the domestic staff with one another. In some cases, this created such fear and shame among workers that some even skipped toilet breaks because of the fear of losing their jobs. Lazarus (1990, p. 3) explains that stress, besides individual triggers, needs environmental triggers in order to occur [45]. As shown by Lopez (ibid.), the mere introduction of a leaderboard can actually be interpreted as an environmental stressor. Thus, used in an educational context, students may even find the mere idea of a leaderboard introduction stressful, while being placed in the lower ranks of such a leaderboard in particular might also have an impact on academic performance [46]. However, leaderboards can also encourage people to compete for higher placements, and thus have a positive effect on task performance ( [17], p. 95). Consequently, we may deduce the following hypotheses:
Hypothesis 1 (H1). The use of a leaderboard leads to increased levels of stress.
Hypothesis 2 (H2). Individuals experiencing higher levels of stress show lower task performance.

Hypothesis 3 (H3). The use of a leaderboard leads to higher task performance.
In order to enhance gaming experience, Bowey et al. manipulated the leaderboard ranking. They found that an emphasis of the best and worst placements by color intensified the felt experience ( [16], p. 116). However, forcing users into such competitive situations can also have negative effects on skills and attitudes. For example, employees who feel that they will never be at the top may be inclined to abandon their work ( [19], p. 565). Hence, it is actually sometimes recommended to avoid displaying the lowest leaderboard ranks ( [47], p. 1958) as such may be harmful for users with poor performance ( [4], p. 179). Furthermore, Lazarus et al. found that subjects who perform well on the first test tend not to improve, whereas subjects who perform badly tend to improve on a second test ( [48], pp. 299-300). Hjortskov et al., on the other hand, claim that psychological stress reduces performance ( [49], pp. 87-88). Hence, we may argue that:
Hypothesis 4 (H4). Individuals ranked among the 'Bottom 3' experience more stress than individuals ranked among the 'Top 3'.
Hypothesis 5 (H5). Individuals ranked among the 'Bottom 3' show a higher increase in task performance than individuals ranked among the 'Top 3'.

Control Hypotheses: Individual Behavior and Personality
One area of stress research focuses on the understanding of stress-related behavior, which can also be referred to as the Type-A behavior pattern ( [20], p. 728). Perlman and colleagues ( [50], p. 6) define the Type-A behavior pattern as an intense pursuit of performance, easily provoked hostility and impatience in combination with excessive competitive pressure and a permanent feeling of time pressure. Individuals not showing these characteristics are likely to possess a Type-B behavior pattern ( [20], pp. 728-729). Type-A people act in a way that causes more stressful events for them and they also experience these events as more stressful ( [22], p. 620). Using a sample of 215 nurses, Jamal examined whether workplace stress, workplace stressors and Type-A behavioral patterns are connected with job satisfaction, commitment and turnover. It was found that Type-A nurses experience significantly more stress and strain at work than Type-B nurses [20]. Hence, we may argue:
Hypothesis 6 (H6). Type-A behavior increases stress.
Individuals who are afraid of negative evaluations are more likely to experience higher levels of stress [22,51]. Fear of negative evaluation is one of the most widely used scales to measure social phobias ( [52], p. 982). Carleton, McCreary, Norton & Asmundson define fear of negative evaluation as a scale for measuring fears and anxieties that arise from the fear of being judged negatively by others ( [53], p. 297). Motowidlo et al. [22] and Packard & Motowidlo [51] found that nurses with a high fear of negative evaluation are more likely to experience high levels of stress. Consequently, we may argue:

Hypothesis 7 (H7). Fear of negative evaluation increases stress.
The Big Five model is used to describe personality ( [54], p. 7). The model distinguishes between five dimensions: extraversion, agreeableness, conscientiousness, neuroticism, and openness [54,55]. A higher value for extraversion stands for characteristics such as sociability, talkativeness and assertiveness, whereas a lower value stands for characteristics such as being quiet and withdrawn. Furthermore, people with a low score in the dimension of agreeableness can be described as cool, critical and suspicious. A high score indicates high interpersonal trust, cooperativeness and compliance. Conscientiousness, on the other hand, distinguishes determined, disciplined and reliable people from negligent, inconsistent and indifferent people. Neuroticism describes a person's emotional behavior. A low value indicates serenity and relaxation. A high value means uncertainty, nervousness and fear. People with a high degree of openness are considered imaginative, inquisitive and intellectual. A low score stands for firm views, conservatism and traditionalism (ibid.).
Ebstrup et al. examined the relationship between personality types and perceived stress on the basis of 3471 individuals. They found a significant negative relationship between perceived stress and extraversion, conscientiousness, as well as agreeableness. A significant positive relationship was found between perceived stress and neuroticism. Neuroticism had the greatest impact on perceived stress, followed by conscientiousness, extraversion, and agreeableness. Openness did not show a significant relationship to perceived stress ( [56], pp. 414-416). Thus, we hypothesise:
Hypothesis 8 (H8). High levels of extraversion reduce stress.

Hypothesis 11 (H11). High levels of neuroticism increase stress.
A positive, significant correlation with stress was demonstrated for Negative Affect (NA) [57-59]. Affect is the experience of feelings. By dividing it into positive and negative affect, it is used to investigate sensations and feelings ( [60], p. 6). Both factors can be measured either as traits, i.e., general persistent properties, or as current states. Positive Affect (PA) reflects the extent to which a person feels active, excited, and in tune with the environment. High PA can be described with terms related to high energy, mental alertness, high concentration and determination. Lower PA stands for sadness and indolence [57-59]. Negative Affect (NA) is a factor of subjective distress and describes a wide range of aversive mood states, including despair, fear, guilt, and nervousness. Low NA indicates a state of calmness and serenity ( [59], p. 1063).
Watson examined the relationship between affect and health complaints, perceived stress and daily activities. A positive significant correlation between NA and stress could be demonstrated in 75% of the test subjects. The relationship between stress and NA was stronger than the relationship between stress and PA. Consequently, Watson concluded that the PA scale is related to social activities and the NA scale is significantly related to perceived stress ( [58], p. 1024). This is also in line with the results of a study by Kanner et al., who found that stress levels are significantly related to affect, especially the NA scale ( [57], p. 21). Hence, finally we may argue that:
Hypothesis 12 (H12). Negative Affect (NA) increases stress.
An overview of the different variables and their respective connections is depicted in Figure 1. Note: Our investigations did not aim to confirm or disprove this model based on Motowidlo et al. ([22], p. 619). We merely show it because we believe it nicely outlines the connection between construct variables and thus helps in understanding potential influences.

Measures

Leaderboard
We designed a leaderboard using common Internet technologies (i.e., HTML, CSS and JavaScript). Based on the findings of Bowey et al. ( [16], p. 116), the best and worst rankings on the leaderboard were coloured in order to reinforce participants' experienced feelings. The other scores displayed were calculated randomly, depending on the score a participant achieved. Furthermore, the exact placement, which was dependent on the assigned group, was randomly displayed in the respective colour range. With this participant-dependent presentation of the scores, an attempt was made to present the manipulated results as realistically as possible, which according to Lazarus et al. is a key point to consider when manipulating results [48]. Figure A1 in Appendix A shows the manipulated leaderboard for a top ranking. In this example, the second place was randomly assigned. Based on the achieved score of 134, which was entered in a pop-up window before the leaderboard was displayed (cf. Figure A2 in Appendix A), the scores for the remaining placements were randomly calculated.
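The manipulation logic described above can be sketched as follows. This is a hypothetical Python reconstruction, not the JavaScript used in the study: the function name, the score spread of 60 points and the entry labels are assumptions, while the board size of 20 players and the top and bottom ranges of three places follow the description in this paper.

```python
import random


def build_leaderboard(score, group, n_players=20, band=3, spread=60, rng=None):
    """Place the participant at a random rank inside the top or bottom band
    and fill all other places with random scores consistent with that rank.
    The spread of the random scores is an assumption of this sketch."""
    if score < 1:
        raise ValueError("this sketch assumes a positive score")
    rng = rng or random.Random()
    if group == "top":
        rank = rng.randint(1, band)
    elif group == "bottom":
        rank = rng.randint(n_players - band + 1, n_players)
    else:
        raise ValueError("group must be 'top' or 'bottom'")
    # Better placed entries score strictly higher than the participant
    above = sorted((score + rng.randint(1, spread) for _ in range(rank - 1)),
                   reverse=True)
    # Worse placed entries score strictly lower, but never below zero
    low = max(0, score - spread)
    below = sorted((rng.randint(low, score - 1) for _ in range(n_players - rank)),
                   reverse=True)
    entries = ([("Player", s) for s in above] + [("You", score)]
               + [("Player", s) for s in below])
    return [(place, name, s) for place, (name, s) in enumerate(entries, start=1)]


# Example call for a participant of the top group who scored 134 points
for place, name, points in build_leaderboard(134, "top", rng=random.Random(7)):
    print(place, name, points)
```

Deriving the other scores from the participant's own score keeps the displayed ranking plausible regardless of how well the participant actually played.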

Type-A Behavior Pattern
The Framingham Type-A Scale was used to measure the Type-A behaviour pattern [61]. The scale contains ten questions about personal characteristics and feelings experienced at the end of a productive day. The questions cover topics such as competitive pressure and feelings of time urgency. The first five questions use a Likert scale containing the following four answer options: Very Good (1.00), Quite Good (0.67), Somewhat (0.33) and Not at all (0.00). Questions six to ten use a binary scale, i.e., Yes (1.00) and No (0.00) (please refer to Figure A6 in Appendix B for a copy of the complete scale in German). All answers were added up so as to calculate a total score (referred to as TypeA-Sum). Higher scores indicate a Type-A behaviour pattern. According to Perlman et al. ([50], p. 16) the scale shows an internal consistency of Cronbach's α = 0.71 for men and α = 0.70 for women (n = 3000).
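The scoring rule just described amounts to a simple sum over the ten coded answers. A minimal Python sketch, in which the dictionary keys, the function name and the example answers are hypothetical:

```python
# Answer codings as stated in the text
LIKERT = {"very good": 1.00, "quite good": 0.67, "somewhat": 0.33, "not at all": 0.00}
BINARY = {"yes": 1.00, "no": 0.00}


def type_a_sum(likert_answers, binary_answers):
    """Total score over questions 1-5 (Likert) and 6-10 (yes/no)."""
    if len(likert_answers) != 5 or len(binary_answers) != 5:
        raise ValueError("expected five answers per block")
    return (sum(LIKERT[a] for a in likert_answers)
            + sum(BINARY[a] for a in binary_answers))


# Hypothetical participant
total = type_a_sum(
    ["very good", "somewhat", "quite good", "not at all", "somewhat"],
    ["yes", "no", "yes", "yes", "no"],
)
print(round(total, 2))
```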

Fear of Negative Evaluation
The Brief Fear of Negative Evaluation-Revised Scale (BFNE-R) [53], translated into German by Reichenberger et al. [62] and validated on the basis of four studies, was used to measure fear of negative evaluation. The BFNE-R consists of 12 positively phrased questions evaluated on a five-point Likert scale ranging from 1 = not at all a characteristic of me to 5 = absolutely a characteristic of me (please refer to Figure A4 in Appendix B for a copy of the complete BFNE-R in German). Its cumulative value may take on numbers between 12 and 60 (referred to as Fear-Sum). Higher values correspond to a higher fear of negative evaluation.
There are four questions per personality dimension. Except for the dimension Openness, which has three negatively and one positively directed question, all dimensions of the model have two positively and two negatively directed questions. The positive questions are rated from 1 = very inaccurate to 5 = very accurate. Negatively directed questions use the reverse order. For interpretation, the values for each dimension are added up (referred to as Extra-Sum for extraversion, Conscient-Sum for conscientiousness, Neuro-Sum for neuroticism and Agree-Sum for agreeableness). The German translation for this study was taken from the official IPIP website (online: https://ipip.ori.org/newItemTranslations.htm (accessed on 27 March 2021)) and a copy of it can be found in Figure A3 of Appendix B.

Affect
In order to measure affect we used the German translation of the short form of the Positive Affect Negative Affect Scale (PANAS) [59,60]. The scale uses 20 adjectives to describe feelings and sensations, which are to be rated on a five-point Likert scale ranging from 1 = not at all to 5 = absolutely. Half of the adjectives concern PA, the other half NA (please refer to Figure A5 in Appendix B for a copy of the PANAS in German). Evaluations use the respective means. Previous studies have shown high internal consistency for the scale, with Cronbach's α values ranging from 0.86 to 0.90 for PA and from 0.84 to 0.87 for NA [59,60].
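Since internal consistency is reported as Cronbach's α for several of the scales used here, a minimal Python sketch of the standard formula may be helpful. The function name and the response matrices are hypothetical, and sample variances are assumed.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)"""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical answers of five respondents to four items (1-5 Likert scale)
answers = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
print(cronbach_alpha(answers))

# Perfectly consistent items yield alpha = 1
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
print(cronbach_alpha(consistent))
```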

Task Performance
In our study we were interested in whether stress has an effect on task performance. In order to measure task performance we used the smartphone app Dots: The coordination game (online: https://www.dots.co/ (accessed on 30 March 2021)). The number of points achieved in this game increases by vertically or horizontally connecting points of the same color. Extra points are given when similar colored squares are connected. Our test participants played the one-minute game mode Time play twice after they had completed the game introduction. Their first round was played under normal conditions. After displaying the leaderboard, which indicated their rank being either at the top or at the bottom of the leaderboard, the second round was played under stress conditions. The difference in the number of points achieved in both rounds was used to interpret the change in task performance (referred to as points-DIF).

Stress
We used RMSSD to objectively measure stress levels of all participants throughout the 60 s they were playing. The relative change from the first to the second measurement was used to interpret the changed stress level (referred to as RMSSD-DIF). As in Baumgartner et al. [34], a chest strap equipped with a heart rate sensor was used for recording. The measurement data was recorded with the Suunto Ambit3 Run heart rate monitor. The standard version of Kubios HRV was used to evaluate the data. Measurement quality was ensured via the artefact correction integrated in Kubios HRV. Since the recording of the HRV measurement had to be stopped manually, all measurement data in Kubios HRV was shortened to the 60 s relevant for playing Dots: The coordination game. For the measurement of HRV values, the points recommended by Castaldo et al. ( [30], pp. 376-377) were considered as follows: The length of the HRV measurements was set to 60 s, i.e., the duration of the game mode Time play. The leaderboard position was set as the stressor. Test subjects played the smartphone game in a sitting position, which avoided additional physical activity.
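The two change scores used in this study can be written down in a few lines. The sketch below is an assumption about their exact form: it takes the relative change as (second − first) / first and the point difference as second minus first, and all names and example values are hypothetical.

```python
def relative_change(first, second):
    """Relative change from the first to the second measurement
    (assumed form of RMSSD-DIF)."""
    if first == 0:
        raise ValueError("first measurement must be non-zero")
    return (second - first) / first


def points_difference(first_round, second_round):
    """Change in points between the two rounds (assumed sign: second minus first)."""
    return second_round - first_round


# Hypothetical participant: RMSSD drops from 40 ms to 30 ms, score rises from 134 to 150
rmssd_dif = relative_change(40.0, 30.0)
points_dif = points_difference(134, 150)
print(rmssd_dif, points_dif)
```

Under this convention a negative RMSSD-DIF means that RMSSD decreased in the second round, which, following the reviewed literature, would point to increased stress.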

Procedure
The effects of the leaderboard position on stress as well as on task performance were investigated using a between-subject experimental design. We used the measures outlined in Section 3.3 to form two homogeneous groups. Questionnaires containing the 62 questions were provided online and completed by participants prior to starting the experiment. In addition, an ID was generated with each questionnaire, which was then used to link the questionnaire with the HRV measurement during the experiment.
Participants were offered a twenty-minute time window for carrying out the experiment in a prepared room at our institution. To assure a standardised procedure, the experimental procedure was pre-tested and validated. At the start of the experiment, participants were told which smartphone game they would play. They were instructed that a one-minute timed mode would be played twice and that the goal was to score as many points as possible. Before playing the intro, they were asked to put on the chest strap (note: they were shown how to correctly wear such a chest strap). The start of the HRV measurement was triggered by our experiment facilitator via the Suunto Ambit3 Run heart rate monitor. The mobile game was played by all subjects on a provided Honor 9 Lite smartphone.
After the first round of the game, the HRV measurement was stopped. It was explained to the participants that their performance would now be displayed on a leaderboard and compared to 19 other players. The achieved score was entered into the pop-up window (cf. Figure A2) and the (manipulated) placement subsequently displayed on the leaderboard. Afterwards, the game was played for a second time, before participants were asked to take off the chest strap and eventually were debriefed (i.e., told about the manipulation) and asked not to inform other participants about the experiment procedure and conditions. The experiment was conducted in spring 2019 and was approved by the school's Research Ethics group in terms of ethical considerations regarding research with human participation.

Results
From March 2019 to April 2019, a total of 48 people completed the online questionnaire and were consequently asked to participate in the experimental study. We received 40 responses, for which we then used the measurements outlined above (cf. Section 3.3) to split them evenly into two groups of 20 participants each. The experiment was conducted in May 2019 at our institution.

Data Cleansing
Eighty heart rate measurements, two from each individual, were exported from the Suunto Ambit3 Run and analysed in the Kubios HRV Standard software. Measurements were cropped to a duration of 60 s and artefact correction was carried out. For the artefact correction the respective threshold was set to Medium. The data of five participants had to be removed due to unrealistic RMSSD values.
In order to remove RMSSD − DIF outliers we applied Hoaglin et al.'s outlier labelling rule [65], which led to the exclusion of one more dataset, leaving us with a final sample size of n = 34 (21 female). Nine of the participants were born before 1995; all others were younger. Normality of RMSSD − DIF was evaluated using the Kolmogorov-Smirnov test (K-S test), and internal consistency was assessed using Cronbach's α (cf. Table 1).
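The outlier labelling step can be sketched as follows. Several details are assumptions rather than reported facts: RMSSD − DIF is taken as the Round 2 value minus the Round 1 value, the multiplier `g = 2.2` is a commonly cited choice for this rule (the text does not state the value used), and quartiles follow the "inclusive" method of Python's `statistics.quantiles`. All numbers are hypothetical.

```python
import statistics


def rmssd_dif(rmssd_round1, rmssd_round2):
    """Per-participant change in RMSSD (assumed: Round 2 minus Round 1)."""
    return [r2 - r1 for r1, r2 in zip(rmssd_round1, rmssd_round2)]


def outlier_fences(values, g=2.2):
    """Lower and upper fences: Q1 - g*(Q3 - Q1) and Q3 + g*(Q3 - Q1)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = g * (q3 - q1)
    return q1 - spread, q3 + spread


def label_outliers(values, g=2.2):
    """Return the values lying outside the fences, in their original order."""
    low, high = outlier_fences(values, g)
    return [v for v in values if v < low or v > high]


# Hypothetical RMSSD values (ms) for nine participants, not study data.
round1 = [30, 30, 30, 30, 30, 30, 30, 30, 30]
round2 = [26, 27, 28, 29, 30, 31, 32, 33, 70]
dif = rmssd_dif(round1, round2)
print(label_outliers(dif))  # [40]
```

A larger multiplier widens the fences and labels fewer observations, so the choice of `g` directly affects how many datasets are excluded.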

Hypothesis Evaluation
Tables 2 and 3 summarize our experiment results with respect to the earlier stated hypotheses. The following Section 4.3 will discuss some additional subsequent analyses.

Additional Analyses
Additionally, we conducted some exploratory analyses using variance-based partial least squares structural equation modelling (PLS-SEM) with SmartPLS (v. 3.3.2) [66,67] (online: https://www.smartpls.com/ (accessed on 7 April 2021)). The exploratory character of this additional analysis led to the decision to use PLS-SEM instead of co-variance-based SEM [66,67]. PLS-SEM has been applied in a variety of different research settings [68], and is especially suited to predict complex relationships on an exploratory level due to its block-wise estimation process [69]. Hence, it is considered an appropriate choice for analyzing the proposed hypotheses despite its lack of global model fit indices such as the Root Mean Square Error of Approximation (RMSEA), which are available in co-variance-based SEM. Although Tenenhaus, Vinzi, Chatelin, and Lauro [70] developed a Goodness of Fit index for PLS, this has been widely criticized and is only applicable when comparing models in limited contexts, such as multigroup analysis [67]. Compared with the advantages of using PLS for this study, this shortcoming seems acceptable, as the intention of this additional analysis is to test relationships on an exploratory level.
We were especially interested in finding out whether the manipulated leaderboard position accounts for different outcomes of the stress-performance relationship above and beyond the other indicators. Since the setting of our study did not really put participants in a competitive situation, we decided to focus especially on fear of negative evaluation as a predictor of task performance besides stress for these exploratory analyses. In the first step of the analysis, the reliability of indicators and constructs, as well as construct and discriminant validity, were examined more closely following the common recommendations for reliability and validity in SmartPLS [67,71]. The majority of the indicators load above the threshold of 0.700. Two indicators load at or above 0.600. The remaining two load below 0.600 (more precisely, at 0.575 and 0.543). In line with Hulland [72], these indicators are kept in the model, as only values of 0.400 or lower should definitely be excluded. This means that indicator reliability can be assumed for all scales. Composite reliability can also be assumed, as it is above the threshold of 0.700 for all constructs. In addition, all constructs show Average Variance Extracted (AVE) values above 0.500. In order to assess discriminant validity, item-level cross-loadings were analyzed [67]. As no high cross-loadings were found, discriminant validity can be assumed at the item level.
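To make the two construct-level criteria concrete, the following sketch computes AVE and composite reliability from standardized indicator loadings using their usual textbook definitions. It is not SmartPLS output, and the loadings are hypothetical rather than those of our constructs.

```python
def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized loadings of one construct."""
    return sum(l * l for l in loadings) / len(loadings)


def composite_reliability(loadings):
    """Composite reliability: (sum of loadings)^2 divided by itself plus error variance."""
    total = sum(loadings)
    error_variance = sum(1.0 - l * l for l in loadings)
    return total * total / (total * total + error_variance)


# Hypothetical standardized loadings for a three-indicator construct.
loadings = [0.8, 0.7, 0.6]
ave = average_variance_extracted(loadings)
cr = composite_reliability(loadings)
print(round(ave, 3), round(cr, 3))  # 0.497 0.745
```

The example illustrates that the two criteria can diverge: this hypothetical construct passes the 0.700 threshold for composite reliability while falling just short of 0.500 for AVE.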
After analyzing the reliability and validity of the model, we evaluated the structural model based on 5000 bootstraps while controlling for age and gender. The respective path model is depicted in Figure 2. In order to identify potential group differences in this exploratory approach, we calculated the structural model for the different settings. For Round 1, we calculated the relationships for the entire sample; for Round 2, we added two groups indicating the random selection for the leaderboard manipulation. Group 0 included the subjects randomly assigned to the 'Top 3', while Group 1 included the subjects randomly assigned to the 'Bottom 3'. This led to the following results: In summary, this means that prior to the introduction of a leaderboard, the experience of stress seems to play a role in terms of task performance. That is, those test participants who experienced higher levels of stress achieved a lower number of points during the first round. After the introduction of a leaderboard, however, stress as well as Fear of Negative Evaluation led to lower task performance during Round 2 for Group 0 (those participants who were randomly assigned to the 'Top 3'). For Group 1 (those participants who were randomly assigned to the 'Bottom 3'), on the other hand, only Fear of Negative Evaluation led to lower task performance, whereas higher levels of stress even led to a higher number of points achieved by members of this group.
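The idea behind bootstrapping a path estimate can be illustrated with a deliberately simplified stand-in: a percentile bootstrap interval for the Pearson correlation between two variables, which equals the standardized slope of a simple regression. This is not the PLS-SEM algorithm used in SmartPLS and includes no control variables; the function names, the seed, and the data are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation; returns None if either variable has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(xs, ys, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval from resampling participants with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson_r([xs[i] for i in idx], [ys[i] for i in idx])
        if r is not None:  # skip degenerate resamples without variance
            estimates.append(r)
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical stress scores and game points for ten participants.
stress = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
points = [50, 47, 45, 40, 38, 35, 30, 29, 24, 20]
print(bootstrap_ci(stress, points))
```

An interval that excludes zero would suggest that the estimated relationship is unlikely to be a resampling artefact, which mirrors how bootstrapped path coefficients are typically read.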

Discussion
A manipulated leaderboard was used in order to analyze the respective effects on stress and task performance. During the experiment participants played the smartphone game Dots: The coordination game in a supervised setting. By using the one-minute time mode, the game element time pressure, which can increase task-related stress, was used ([48], p. 298). In order to compare the test participants' task performance under stress, they were, as recommended by Lazarus et al. ([48], p. 299), tested twice. All participants carried out the first measurement under the same conditions. The second measurement was carried out after the respective manipulated leaderboard positions were announced. During the experiment the guidelines for HRV experiments according to Castaldo et al. ([30], pp. 376-377) were applied and implemented.
To examine the effect of using a leaderboard and its impact on stress, the mean values of the first measurement were compared to the mean values of the second measurement. There was no leaderboard in use for the first measurement. The second measurement was carried out after the leaderboard position was announced. However, the results do not indicate any significant differences in stress levels between Rounds 1 and 2, which means that H1 cannot be confirmed. Also, individuals experiencing higher levels of stress did not show lower task performance, thus providing no support for H2. In a next step the mean scores of the two game rounds were compared. In the second round of the game, after the leaderboard position was displayed, a significant increase in task performance was expected. The results indicate an increase in the mean score, but no significant difference between the two rounds could be determined, contradicting the assumption put forward by H3.
For H4, we wanted to find out whether the leaderboard position leads to different levels of stress in Round 2. In order to make the different RMSSD values comparable, we used RMSSD − DIF. The leaderboard position was interpreted as a stress factor. The non-significant result may thus be interpreted in light of the statement expressed by Lazarus et al. ([48], pp. 299-300) that some people are stressed by the danger of failure, while others seem not to be affected.
In contrast to Lazarus et al. ([48], pp. 299-300), however, no significant difference between the two populations could be identified (cf. H5). That is, it could not be determined with sufficient significance that test persons' results, depending on the leaderboard position shown to them, improved or deteriorated. Yet, in contrast to the hypothesis, an increase in the mean value was found in both groups, where the increase in the 'Top 3', at 15.688 points, was considerably higher than the increase in the 'Bottom 3' (3.611 points).
Finally, contrary to H6-H12, none of the individual variables accounted for significant differences in stress during our experiment.
Thus, in summary, based on the 12 hypotheses evaluated during this experiment, we were unable to find a significant relationship between stress and the leaderboard position. In addition, no significant relationship between stress and task performance could be demonstrated.
Being confronted with these somewhat unsatisfactory results, we decided to perform some additional exploratory analyses using PLS-SEM. Specifically analyzing the parallel effects of Fear of Negative Evaluation and stress on task performance, we found that stress was indeed a significant predictor of task performance in Round 1. Fear of Negative Evaluation, however, did not have an effect on task performance without the use of a leaderboard.
The effect size for the negative relationship between Fear of Negative Evaluation and task performance was even higher in Round 2, after the introduction of the leaderboard. In addition, we found a difference in the direction of the relationship between stress and task performance for the 'Top 3' and 'Bottom 3' groups. For individuals ranked among the 'Top 3', stress led to lower task performance, while for those individuals ranked among the 'Bottom 3', higher levels of stress led to even higher levels of task performance.
These results are in line with the considerations of King [17], according to which individuals might be encouraged by a leaderboard to compete for higher placements, which can have a positive effect on task performance. The Transactional Theory of Stress [73] also includes the concept of positive stressors. Thus, a stressful event, such as being ranked among the 'Bottom 3' of a leaderboard, might spur ambition and therefore even lead to increased performance, specifically for those individuals who do not experience high levels of Fear of Negative Evaluation.

Limitations
Our investigations into the effects leaderboard positions may have on people's heart rate variability and respective task performance were driven by Motowidlo et al.'s work on occupational stress and its causes and consequences for job performance [22]. To this end, our results show no significant connection between how people were told to be ranked on a leaderboard and their perceived stress level (expressed by a change in heart rate variability). Lack of significance, however, may primarily be owed to our rather small sample size (i.e., n = 34 after data cleansing), which should thus be considered a significant limitation of our study.
Also, the sample of participants was not only small but also consisted of students and colleagues. While the small size was subject to the given limitations in time and available resources, the inclusion criteria were intended, since we wanted to limit any effects related to age and/or previous experience with the used stimulus (i.e., Dots: The coordination game). This, however, considerably limits the generalizability of our results.
Furthermore, it should be noted that not all individual variables affecting stress had been taken into account. For example, we focused solely on measuring stress via heart rate variability measured through a chest strap, where recordings were cropped to frames of 60 s. No subjective assessments of stress were collected, nor did we consider medical reasons and their potential effects on stress, both of which could have considerably improved validity with respect to participants' perceived stress.
Also, it has to be highlighted that the experimental setting was rather artificial and did not incorporate any consequences with respect to participants' performance. That is, they had nothing to gain and nothing to lose. Our goal was to create stress by introducing a manipulated leaderboard. When introducing these types of false results, test participants need to be convinced that the presented information is plausible ([48], p. 297). Yet, during the experiment, four of our participants asked whether the results on the leaderboard were correct, which suggests that at least some of them questioned the sequence shown and suspected manipulation. Hence, although we aimed at creating a realistic setting, the artificial nature of our experiment may have considerably limited the validity of results.
Finally, personal motivation influences how much a person aims to prevent failure ([48], p. 296), which in turn might influence perceived levels of stress. Here it may be that participants did not perceive a low leaderboard position to be a failure and therefore did not experience failure-related stress symptoms. Also, it might be that playing a game was not perceived as a particularly stressful task. As Lazarus et al. ([48], p. 297) point out, the motivation of test subjects in the case of task-related stress depends on how the requirements are interpreted and whether there is ultimately a noticeable increase in one's performance level. The fact that the number of achieved points was not publicly visible may have added to this feeling of indifference. Also, according to Hamari and Koivisto ([9], p. 8), it is important in gamification that users pursue the same goal. In the context of our study, participants' own goals were, however, not taken into account.

Summary and Future Recommendations
We reported on the results of an experimental study investigating the effects leaderboard positions would have on people's heart rate variability and respective task performance. We used a smartphone game, a manipulated leaderboard and a chest strap heart rate monitor to investigate potential connections. Although our experiment, with a final sample of n = 34 participants, did not yield statistically significant results, findings hint towards a link between the use of a leaderboard and increased levels of stress (p = 0.052). Connections between concrete leaderboard positions and perceived stress, consequent effects concerning task performance, or individual characteristics such as Type-A behavior or affect, however, could not be identified. Although a lack of connection as such may not be considered an insignificant finding, we have discussed a number of limitations that could have affected the validity of our experiment (e.g., sample size, measurement instruments etc.; cf. Section 6 for a respective discussion). Hence, for future studies we would recommend a number of alterations.
First, we recommend that they focus on a larger and more stratified sample, and investigate potential differences in target populations. Here, it is further advised to search for more stable personality-type measures, as in our case the internal consistency of both the Framingham Type-A Scale and the PANAS was not satisfactory. To this end, future studies may also control for individual motivation as a predictor for task performance.
Second, we suggest the use of additional/other potentially less intrusive technologies to collect physiological parameters such as heart rate or blood pressure as, e.g., demonstrated by Gonzalez-Viejo and colleagues [44] or Massaro et al. [42].
Finally, future studies should expand on the 'realness' of the experimental condition and thus control for potential contextual influence factors.
Author Contributions: The article was a collaborative effort by all three authors. M.S. was the lead researcher in the presented study, which was supervised by T.S. T.S. furthermore wrote the original draft of the article, supported by S.S. who helped writing, reviewing and editing the paper. All authors have read and agreed to the published version of the manuscript.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to informed consent restrictions.