Music and Time Perception in Audiovisuals: Arousing Soundtracks Lead to Time Overestimation No Matter Their Emotional Valence

One of the most tangible effects of music is its ability to alter our perception of time. Research on waiting times and time estimation of musical excerpts has attested its veritable effects. Nevertheless, there exist contrasting results regarding several musical features' influence on time perception. When considering emotional valence and arousal, there is some evidence that positive affect music fosters time underestimation, whereas negative affect music leads to overestimation. Instead, contrasting results exist with regard to arousal. Furthermore, to the best of our knowledge, a systematic investigation has not yet been conducted within the audiovisual domain, wherein music might improve the interaction between the user and the audiovisual media by shaping the recipients' time perception. Through the current between-subjects online experiment (n = 565), we sought to analyze the influence that four soundtracks (happy, relaxing, sad, scary), differing in valence and arousal, exerted on the time estimation of a short movie, as compared to a no-music condition. The results reveal that (1) the mere presence of music led to time overestimation as opposed to the absence of music, and (2) the soundtracks that were perceived as more arousing (i.e., happy and scary) led to time overestimation. The findings are discussed in terms of psychological and phenomenological models of time perception.

In the tradition of studies conducted regarding music's influence, a promising thread of the research focused on time perception, starting from the seminal work of Rai [14]. As we will see, aside from the research traditionally focusing on music (i.e., perceived duration of musical excerpts), some work has pointed at the domain of audiovisual stimuli [15] and film-induced mood [16], with some interest in the dynamic interaction between the auditory (i.e., musical) and visual elements [17]. For example, it has been found that music is capable of modulating and altering visual perception [18][19][20] due to phenomena such as auditory driving [21].
In this study, we are interested in the modalities through which music affects time perception. We firmly believe that this peculiar research theme has the merit of meeting the needs of a growing number of scientists, artists, and professionals who are interested in shaping the interaction between users and audiovisuals with some background music, whether in movies [22], educational videos [23], interactive games [24], or videogames [25,26]. For instance, there may be some utility in decreasing the self-perceived passage of time in interactive educational or tutorial contexts, especially considering that the majority of educational or tutorial videos have background music within. In such contexts, the challenge is to reach a balance between the entertaining effects of music that improve learning [3,27] and its distracting properties [28] that dampen attention. We are confident that a proficient management of the background music could act upon the recipients' time perception, thus improving the interaction between the users and the audiovisual devices at hand.
Before discussing the existing tradition of research, a clarification is needed on the assessment of time estimation. As suggested by [29], in the literature, two main paradigms exist for the study of time estimation: prospective and retrospective; the former implies that the participants are aware that, after the presentation of a given stimulus, there will be questions on the perceived duration of elapsed time, thus studying the experienced time (i.e., the subject attends to the passage of time itself), whereas the latter does not have such an implication, that is, the experimental participants are not informed about the questions that will follow concerning time perception, thus analyzing the remembered time (i.e., the subject's attention is not focused on time perception) (for a detailed discussion of the different cognitive processing of prospective and retrospective time judgement, see also [30][31][32]). In the current work, as we are interested in analyzing the soundtrack's influence on the time estimation of an audiovisual experience, we focus on the retrospective paradigm (i.e., remembered time). We opted for this paradigm because we wanted our experimental viewers to be completely unaware of the task at hand. In other words, we wanted to avoid conscious time counting.
To begin with, in Section 2, a summary of previous studies on the relationship between music and time perception is presented, alongside a focus on the musical parameters that have been shown to play a role in affecting time perception in several contexts. In Section 3, more attention is devoted to two psychological models (i.e., Dynamic Attending Theory and Scalar Expectancy Theory) that attempt to explain how time perception works within the audiovisual domain. In Section 4, we present our online experiment, which we discuss in Section 5, where we also list some of the limitations of this work and a few suggestions for future research. A brief conclusion is presented in Section 6.

Previous Works on Music and Time Perception
A variety of studies focused on how music alters our subjectively perceived time in an indirect fashion, for instance, considering the waiting times [33] in retail settings [34], restaurants [35], queue contexts [36,37], and on-hold waiting situations [38].
North and Hargreaves [33] compared the waiting times of four groups of participants who were waiting for an experiment to begin; three of the groups were provided with background music differing in complexity, while the last group was not provided with music (control condition). They found no differences between the music conditions, but the controls showed a significantly lower waiting time. Areni and Grantham [39] reported that when waiting for an important event to begin, their participants tended to overestimate the waiting time when they disliked the background music that was playing, whereas they underestimated the waiting time in the presence of background music they liked. Fang [35] found that slow-paced background music extended the customer waiting time in a randomly selected restaurant, whereas fast-paced background music shortened the waiting time of customers, i.e., they decided to leave earlier. Guéguen and Jacob [38] shed further light on the issue by analyzing the cognitive mechanisms that come into play in an on-hold telephone scenario. Their results proved that in comparison with the no music condition, the simple presence of music led to both an underestimation of the time elapsed and an overestimation of the projected time passed before a person would hang up. Finally, the most up-to-date meta-analytic review of the effects of background music in retail settings [34] (p. 761) concludes that, "A higher volume and tempo, and the less liked the music, the longer customers perceive time duration. Tempo has the greatest effect on arousal." In most of these studies, the dependent variables are indirect measures of time perception because participants do not explicitly report their time awareness, but their behavior is simply annotated. In some other cases [40,41], an actual self-report of the wait length was assessed.
Evidence that music alters the representation of time also stems from qualitative research on altered states of consciousness (ASCs) [42]. In such studies, subjects' reports often mention feelings of timelessness, time dilation, and time-has-stopped in correspondence to music listening activities [43].
To sum up, there exists a consensus that music has a robust role in influencing how we perceive time [15,25,37,40,44,45,46], although fostering small-to-moderate effects, as Garlin and Owen [34] pointed out in their meta-analytic review.
On the contrary, less consensus exists on the musical parameters that are responsible for the alteration of time perception. Research with both direct and indirect measures of time estimation has focused on diverse music parameters.

Musical Parameters and Time Perception
Several studies have investigated time estimation as a function of several music parameters and types of music, primarily by assessing the perceived length of musical excerpts.
To begin with, the musical structure complexity has been found to increase the time estimation [47]; on the contrary, the results on tempo are not consistent: while [44] found no evidence, other works proposed that a slower tempo seems to lead to time underestimations [37,48,49]. Coherently, [37] found temporal perception (perceived minus actual wait duration) to be a positive function of musical tempo. In relation to musical modes, a study [50] found that the Locrian mode (diminished, thus more likely to be unpleasant) led to time overestimation as opposed to the Ionian and Aeolian modes. The modes of ancient Greece can be described as a kind of musical scale coupled with a set of particular melodic behaviors. The Ionian mode is equal to the modern-day major scale. The Aeolian mode is today's natural minor scale (i.e., 1, 2, ♭3, 4, 5, ♭6, ♭7). The Locrian mode is a minor scale with the second and fifth scale degrees lowered a semitone (i.e., 1, ♭2, ♭3, 4, ♭5, ♭6, ♭7). Arguably, the Locrian mode tends to yield more negative valence than the minor mode due to the different composition of the tonic triads (i.e., the principal chord that determines the tonality of the mode). When compared with the minor mode, wherein the tonic triad is constituted by a minor 3rd and a perfect 5th (i.e., minor chord), the Locrian mode's tonic triad is composed of a minor 3rd and a diminished 5th (i.e., diminished chord). Because of their composition, the diminished chords are considered dissonant and prove to be responsible for conveying the lowest valence and very high tension [51].
The music volume also might play a role; indeed, [45] proposed that people listening to quieter music tend to underestimate the time passed. Lastly, [52] reported that an overestimation of passed time was observed for pop music played in a major vs. minor mode, while [53] concluded that listening to familiar music leads to an underestimation of passed time.

Time Perception in Audiovisuals-Models and Mechanisms
When it comes to audiovisuals (i.e., complex multimodal stimuli that present at least two channels of information: auditory and visual), it can be assumed that different mechanisms come into play, especially since the viewer engages with a conscious elaboration of the overall stimulus by integrating its various interacting parts. The integration process can be easy or difficult depending on the stimulus' internal congruency, and this can, in turn, impact the time estimation. For instance, as elaborated by [54] (p. 504), "Because of its effects on information processing, stimulus congruity may influence the retrospective estimation of event duration. Specifically, underestimation of lapsed time might be expected when the elements comprising the event are incongruent, because incongruent information tends to be more difficult to encode and retrieve because of the absence of a preexisting cognitive schema [ . . . ], and because of weaker linkages between unrelated nodes in an associative network. [ . . . ] exposure to incongruent information, like elevated arousal states, may create a distraction that reduces attention to one's internal "cognitive timer [55]".
In their recent review on time perception in audiovisual perception, Wang and Wöllner [15] clarify that two Internal Clock models can account for the effects of music on time perception in the audiovisual context, which are the Dynamic Attending Theory (DAT), also known as the oscillator model [56], and the Scalar Expectancy Theory (SET), also referred to as pacemaker-counter model [57]. Contrarily to the SET, which postulates regularly emitted pulses by an independent internal clock, the DAT claims that the time estimation of the duration of past events depends on the coupling between attentional pulses and the occurrences of external events [56]. For this reason, this model is sometimes referred to as an attention-based model. Crucial to this theory is the idea that the emission of attentional pulses or oscillations is a non-linear (i.e., dynamic) process, and that the attention regulates the pulses to a greater extent than the working memory (on the relationships among time perception, attention, and working memory, see [32]). Indeed, the adjective "dynamic" stems from the fact that, contrary to the SET model, the emission of the attentional pulses is not a static process, rather, it varies depending on the salience of the external events (i.e., stimuli).
The DAT model also suggests that when the attention toward a stimulus is low, fewer pulses are emitted, thus leading to an underestimation of time. Furthermore, other results agreeing with the DAT model indicate a positive correlation between musical tempo and time estimation in both auditory [58][59][60] and visual perception [61].
Conversely, the latter model (i.e., Scalar Expectancy Theory) proposes a linear cumulation of regularly emitted attentional pulses and a pacemaker that counts them. According to this memory-based theoretical framework, more akin to that of [62], a more important role is assigned to the working memory as it is a three-step model in which the memorization constitutes the second and central phase, following the clock phase and followed by the time judgement. In particular, the storage-size model of memory of time perception [62] claims that richer stimuli (i.e., with a high amount of information) lead to the perception that a greater number of events occur in a given interval, thus favoring time overestimation.
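As a toy illustration of the pacemaker-counter idea described above, the following sketch accumulates pulses linearly over an interval and converts the count back to seconds using a remembered reference rate. It is a simplification under stated assumptions, not a model fitted in this study: the function names, the 5 Hz baseline rate, and the 10% speed-up attributed to higher arousal are all hypothetical.

```python
def counted_pulses(duration_s: float, pulse_rate_hz: float) -> float:
    """Pulses accumulated over an interval (linear accumulation)."""
    return duration_s * pulse_rate_hz


def judged_duration(duration_s: float, pulse_rate_hz: float,
                    reference_rate_hz: float) -> float:
    """Convert the pulse count back to seconds using the remembered rate."""
    return counted_pulses(duration_s, pulse_rate_hz) / reference_rate_hz


# Hypothetical values: a 90 s clip, a 5 Hz baseline pacemaker, and a
# pacemaker running 10% faster (assumed effect of higher arousal).
baseline = judged_duration(90.0, 5.0, 5.0)
aroused = judged_duration(90.0, 5.5, 5.0)
print(baseline, aroused)
```

In this sketch a faster pacemaker yields more counted pulses for the same physical interval, so the judged duration exceeds the actual one.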
To sum up, among the models that account for the influence of music on time perception, two share the hegemony: the attention-based (DAT) and the memory-based (SET and storage-size model of memory of time perception). Nevertheless, various results exist within the literature because various studies claim that different features of the stimuli influence time estimates (Table 1).
Table 1 (partial): major mode, overestimation [52]; minor mode, underestimation [52].
As we are interested in the time estimation of an audiovisual, we focus our attention on two basic parameters, namely valence and arousal of the emotions conveyed by music, for two reasons:
1. It is known that certain pieces of music can, through their emotional valence, foster positive affective states, to the point that music has traditionally been considered as a valid mood inductor [68]. Therefore, in accordance with previously collected results from outside of the audiovisual domain [66], we can hypothesize that the positive affect experienced by the recipients while viewing may be negatively correlated with the estimation of the time elapsed [31], that is, the better the viewers feel as they watch the scene (i.e., positive affective state), the less they perceive the passing of time.
2. A great deal of research suggests that the arousal (i.e., the physiological and psychological state of activation) conveyed by music might lead to time overestimation [34,54], possibly due to an effect on the internal clock system speed (both in attention- and memory-based models of time perception). Nevertheless, no one, to our knowledge, has ever shown such a phenomenon in an audiovisual domain.
These two points underpin our research questions, which are introduced in the following section.

The Present Study
As stated above, in the literature on the influence of music on time perception, one cannot draw definitive conclusions about a variety of factors. With this study, we aim to investigate how the perceived length of a visual scene is affected by the background music (i.e., soundtrack), and, more specifically, the following two particular features conveyed by the music: emotional valence and arousal.

Research Questions
We hypothesize that both the emotional valence and arousal conveyed by music have a key role in the time estimation of an audiovisual piece, although in contrasting ways.
First, consistent with the studies on waiting times [38], we expect the mere presence of music to lead to a decrease in the perception of the elapsed time (i.e., time underestimation) (Hypothesis 1).
Secondly, considering the literature on music pieces [66], we expect positively valenced music to result in time underestimation, and negatively valenced music to induce time overestimation (Hypothesis 2).
Third, in accordance with both the attention and memory-based models of time perception, we hypothesize that the arousal level should lead to time overestimation (Hypothesis 3).
Below, we describe the experimental paradigm, each construct, and its related measurement separately.

Method
We designed a between-subjects experiment wherein the participants watched a modified version (01′30″) of a short movie by Calum Macdiarmid [69] (Figure 1). Using Reaper 6.29, we created five versions of the short movie, varying under the five experimental conditions, with the video accompanied respectively by a happy piece (Appalachian Spring, VII: Doppio movimento, by A. Copland), a sad cello melody accompanied by a piano (After Celan by D. Darling and K. Bjørnstad), a frightening track from the Original Motion Picture Soundtrack of the film Proxy (Murder by the Newton Brothers), a relaxing piece specifically composed to control anxiety [70], or by no music at all (i.e., control condition). This method allowed us to present all the possible combinations of valence and arousal (Table 2). Similarly to [20], the four pieces were chosen by considering the findings of [71], and the subsequent studies enumerated by [72] concerning a plethora of psychoacoustic parameters associated with emotional expression in music. Two of the pieces evoked negative affects but differed in the arousal dimension: the After Celan track's soft tone and morbid intensity foster sadness and tenderness [73]. Conversely, the Newton Brothers' track's great sound level variability and the rapid changes in its sound level could be associated with the experience of fear [71], while its increasingly louder volume can evoke restlessness, agitation, tension [74] or rage, fear [75] and scariness [76].
In a similar way, the two other pieces both foster positive feelings, but with a marked difference in the arousal dimension: whereas the orchestration, fast tempo, and high pitch of Copland's piece all cultivate a sense of highly exciting joy, the relaxing piece was specifically composed with the goal of controlling anxiety. To elaborate, it presents a relatively constant volume, narrow melodic range, legato articulation, and regular beat [70]. To mitigate any loudness perception effects, the perceived loudness of all the tracks was normalized according to the Loudness, K-weighted, relative to Full Scale (LKFS) measure [77].
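The arithmetic behind loudness normalization can be sketched as follows, assuming each track's integrated loudness has already been measured by a loudness meter. The measured values, the −23 LKFS target, and all names below are hypothetical placeholders, not the settings used in this study. To a first approximation, applying a gain of g dB shifts the measured loudness by g.

```python
def gain_to_target_db(measured_lkfs: float, target_lkfs: float) -> float:
    """Gain in dB that moves a track from its measured loudness to the target."""
    return target_lkfs - measured_lkfs


def db_to_linear(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


# Hypothetical measured integrated loudness values (LKFS) for the four tracks.
measured = {"happy": -18.0, "relaxing": -26.0, "sad": -24.5, "scary": -20.0}
TARGET_LKFS = -23.0  # assumed common target

gains_db = {name: gain_to_target_db(value, TARGET_LKFS)
            for name, value in measured.items()}
scale_factors = {name: db_to_linear(g) for name, g in gains_db.items()}
```

Each track's samples would then be multiplied by its linear scale factor so that all four soundtracks sit at the same perceived loudness.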
We also aimed at ecological validity; thus, in order to allow people to participate in a less detached situation than a lab, we built an online procedure on Qualtrics.com. The participants accessed a single-use link (an anti-ballot-box-stuffing measure was employed to avoid multiple participations from the same device) through which they could run the experiment. As a result of the online procedure, they were able to participate directly from home on their laptops, smartphones, or tablets, just as if they were watching an actual movie. An introductory screen summarily presented the task to the participants without mentioning the question about the time estimation. Immediately after this introductory screen, the informed consent statement was presented. After viewing the scene, a questionnaire was administered with three questions: the first two, which might be considered as a manipulation check, were designed to verify whether the emotional valence and arousal self-reported by participants were the same as those expected for each music condition. The last question aimed to assess the dependent variable, namely the participants' perception of elapsed time (i.e., time estimation). To avoid sequence effects (i.e., the theoretical possibility that a previous question could affect the following one in any possible way), the order of questions was completely randomized for each participant.

Affective States of the Recipients
To measure the affective state of the viewers, we needed to identify what we might call the emotional nuclei of the viewing session and the emotional nuances each soundtrack could add to the narration. To this aim, as we were interested in a fast and immediate answer that caught the gist of the emotional content of the viewing, we decided against a Likert scale with several emotions as the items, because this would have resulted in increased fatigue for the recipients. Instead, we resorted to Plutchik's wheel of emotions [78]. In brief, we presented our participants with the image of the wheel (Figure 2), asking them to select with a click the region that best represented the emotion they were experiencing while viewing the video.

Arousal
To assess emotional arousal, we used a 100-point slider, asking our participants how active they felt while viewing the scene. The slider was initially set to 0; the recipients were required to place it at their desired point. As it would have been suboptimal to use a single adjective to refer to the concept of arousal unambiguously, in this assessment, we provided a note in the question to our participants that read: "When we say active, we also mean awake or ready".

Time Estimation
We asked our participants to indicate the length of the video by dragging a slider that ranged between 60 and 120 s (i.e., minimum and maximum values admitted); the slider was initially placed at the center of the bar (i.e., 90 s). Later, as was the case with [37], we created a measure of the gap between the estimated time and the actual time, according to the formula:
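A minimal sketch of such a gap measure, assuming it is the signed difference between the estimated and the actual duration in seconds (in line with the "perceived minus actual" measure attributed to [37] above). The constant and function names are hypothetical, and the exact formula used in the study may differ.

```python
ACTUAL_DURATION_S = 90  # length of the clip (01'30"), as stated in the Method


def time_gap(estimated_s: float, actual_s: float = ACTUAL_DURATION_S) -> float:
    """Signed gap: negative means underestimation, positive means overestimation."""
    return estimated_s - actual_s


# A participant who places the slider at 75 s underestimates by 15 s.
example_gap = time_gap(75)
```

Under this assumption, a negative group mean indicates that the group judged the video to be shorter than it actually was.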

Participants and Preliminary Sample Data Analysis
As a first step, six hundred and three (n = 603) Italian participants were recruited by sharing the link of the study on social media and through university mailing lists (i.e., snowball procedure). Their participation was provided on a voluntary basis, and the participants were not incentivized with any reward.
Before our data analysis, to improve the reliability of our sample, we performed exclusions based on the following pre-established criteria:
• An attention check question in which a short Likert scale was presented with the explicit instruction that asked participants to avoid completing it; we excluded all those participants who completed such a scale.
• A time counter on the screen displaying the video was incorporated (it was visible to the experimenters only) so as to exclude all participants who had not watched the whole video (i.e., time spent on that screen < 90 s).
• All those participants whose task completion time fell outside the mean duration ± 3 SD were excluded.
• All participants who did not complete the questionnaire in all its parts were also excluded.
After the above exclusions, our sample size decreased from 603 to 565 valid participants (mean age = 26.01, SD = 10.53; 339 females, 60%). The five experimental groups were comparable in the number of participants (range 104-119) and were gender-balanced (p = 0.41).
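The four exclusion criteria listed above can be sketched as a single filter. The record fields and function name are hypothetical, and the sketch assumes that the mean and standard deviation of the task duration are computed on the full sample before any exclusion (the text does not specify this).

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Response:
    completed_attention_scale: bool  # True if the "do not fill in" scale was filled in
    video_screen_time_s: float       # time spent on the screen showing the video
    task_duration_s: float           # total time taken to complete the task
    questionnaire_complete: bool     # True if every question was answered


def apply_exclusions(responses, video_length_s=90.0, sd_cutoff=3.0):
    """Return only the responses that pass all four criteria."""
    durations = [r.task_duration_s for r in responses]
    centre, spread = mean(durations), stdev(durations)
    low, high = centre - sd_cutoff * spread, centre + sd_cutoff * spread
    return [
        r for r in responses
        if not r.completed_attention_scale
        and r.video_screen_time_s >= video_length_s
        and low <= r.task_duration_s <= high
        and r.questionnaire_complete
    ]
```

Applying the criteria jointly, as here, is order-independent; applying them sequentially with a recomputed mean and SD at each step could exclude a slightly different set of participants.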

Results
For the statistical analyses, IBM SPSS 26.0 was used; the path analysis was processed through Mplus 8.5 [79]. The violin plots were made by means of R (ggplot2 package). For each test, the effect size is provided by employing η² (eta squared, for chi-square and ANOVA statistics). In the ANOVA tests, the post-hoc computed observed power is provided in terms of (1 − β). In the results of the model (Section 4.5.2), for each path, we provide the standardized path coefficient (β), the relative Standard Error (S.E.), the level of statistical significance (p value), and a 95% Confidence Interval (95% CI). In the case of the indirect effects, 95% Bias-Corrected Confidence Intervals are indicated (BCa).

Affective States of the Recipients
The heatmaps of Figure 2 provide a first and intuitive point of view of the participants' affective states. A common emotional nucleus emerges in all conditions, specifically the bottom region of Plutchik's wheel of emotions, which is the axis that includes pensiveness, sadness, and grief. It is worth mentioning that the other soundtracks add or subtract diverse emotional nuances in comparison with the control condition. For instance, comparing the controls with the happy group, the region of the serenity/joy becomes more populated. When considering the scary condition, the serenity/joy axis loses relevance, while the expectancy area remains active, and apprehension and awe gain saliency. Conversely, when considering the sad condition, all the other axes aside from the pensiveness/sadness/grief axis become unnoticeable.
Upon further analyses, considering that our participants simply clicked once on the Plutchik's wheel of emotions image in correspondence with the emotion they were feeling, we created an emotional score by assigning 1 point to the participants who chose a positively valenced emotion (21.9%), 0 points to non-valenced emotions (expectation, interest, surprise, and distraction, 17.3%), and −1 point to negatively valenced emotions (60.7%). We then performed a chi-square test to evaluate the distribution of the emotion valence across conditions, finding it to be significant, χ²(8, N = 565) = 101.34, p < 0.001, η (condition dependent) = 0.08, η (affective state dependent) = 0.40 (Table 3).
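The Pearson chi-square statistic for a valence-by-condition table can be sketched by hand as follows. The counts are hypothetical placeholders (not the study's data), and the function name is an assumption. With three valence levels and five conditions, the degrees of freedom are (3 − 1) × (5 − 1) = 8.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Rows: positive (+1), non-valenced (0), negative (-1) emotional score.
# Columns: five experimental conditions. Counts are hypothetical.
hypothetical_counts = [
    [30, 35, 10, 12, 20],
    [20, 18, 15, 25, 22],
    [60, 55, 85, 75, 70],
]
statistic, dof = chi_square(hypothetical_counts)
```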

Time Estimation
Before proceeding with the analysis of variance, we studied the descriptive statistics. The first aspect to consider is that the majority of participants in our sample (71.3%) underestimated the actual length of the scene (M = −14.98, SD = 27.01, min = −62, max = 60). We then proceeded to the verification of our hypotheses (Section 4.1).

Hypothesis 1 (H1). Does the presence of music lead to time underestimation?
To verify Hypothesis 1, that is, whether the mere presence of music negatively influenced time estimation, a one-way ANOVA was performed, which revealed a main effect of the music [F(1, 563) = 6.46, p = 0.011, η² = 0.011, (1 − β) = 0.72]. Contrary to the hypothesis, the control group reported the video to be shorter (M = −21.03, SD = 26.10) than the music group (M = −13.62, SD = 27.01) (Figure 3). We can therefore state that Hypothesis 1 was not supported.
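As a quick consistency check on the reported effect size, η² in a one-way design can be recovered from the F statistic and its degrees of freedom through the standard identity η² = (F · df_effect) / (F · df_effect + df_error). The helper name below is hypothetical; the sketch only re-derives the reported value and is not the authors' SPSS procedure.

```python
def eta_squared_from_f(f_value: float, df_effect: int, df_error: int) -> float:
    """Eta squared for a one-way ANOVA, derived from F and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Music vs. no music: F(1, 563) = 6.46
eta_sq_h1 = eta_squared_from_f(6.46, 1, 563)
print(round(eta_sq_h1, 3))  # 0.011

# Difference between the two group means, in the units of the gap measure.
mean_difference = -13.62 - (-21.03)
```

The recovered value agrees with the reported η² = 0.011, and the music group's mean estimate is about 7.4 units longer than that of the control group.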

After analyzing all the groups in greater detail, we still found an effect of the music [F(4, 560) = 4.93, p = 0.001, η² = 0.034, (1 − β) = 0.96]. Subsequent custom hypothesis contrasts revealed the significant differences against the control condition to be those of the happy (M = −14.00, SD = 24.74, p = 0.050), scary (M = −7.37, SD = 29.08, p < 0.001), and relaxation conditions (M = −13.28, SD = 27.88, p = 0.031) (Table 4 and Figure 4). As concerns the specific roles of valence and arousal, we resorted to a path analysis that we describe in the following paragraph.

Hypothesis 3 (H3). Arousal in time estimation.
As for the verification of Hypotheses 2 and 3, a path analysis was performed to analyze the role of the valence and arousal as conveyed by the music and self-reported by our participants with regard to time estimation. The model presents two exogenous variables, namely the valence and the arousal conveyed by music (i.e., the experimental conditions). Both variables were operationalized on three levels: the valence denoted as −1 (negative valence: sad and scary), 0 (neutral valence/no music), and 1 (positive valence: happy and relaxation); and the arousal denoted as −1 (low arousal: relaxation and sad), 0 (neutral arousal/no music), and 1 (high arousal: happy and scary). For the next step (i.e., order of the model), the endogenous variables were the self-reported affective state and arousal. The first part of our model can be considered as a manipulation check that is conducted to ensure that our participants' affective state and arousal were effectively and coherently affected by the pieces of music that we selected. Finally, the last endogenous variable was the time estimate.
To avoid normality issues, Robust Maximum Likelihood (MLR) was used as the estimator.
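The three-level coding of the two exogenous variables can be written down as a simple lookup. The Python sketch below is illustrative only: the condition labels, the dictionary, and the helper function are hypothetical names, not taken from the analysis scripts of the study.

```python
# Three-level coding of the experimental conditions, as described in the text.
# Each tuple is (valence, arousal); all names here are illustrative assumptions.
CONDITION_CODES = {
    "happy": (1, 1),        # positive valence, high arousal
    "relaxation": (1, -1),  # positive valence, low arousal
    "sad": (-1, -1),        # negative valence, low arousal
    "scary": (-1, 1),       # negative valence, high arousal
    "no_music": (0, 0),     # control condition, neutral on both variables
}


def code_condition(condition: str) -> dict:
    """Return the valence and arousal codes for one experimental condition."""
    valence, arousal = CONDITION_CODES[condition]
    return {"valence": valence, "arousal": arousal}


for name in CONDITION_CODES:
    print(name, code_condition(name))
```

Under this coding, the four soundtracks occupy the four corners of the valence-by-arousal plane, with the no-music condition at the origin.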

Discussion
Firstly, our results suggest that the mere presence of music causes an increase in time estimation in an audiovisual context. This finding seems to contradict other studies, which found that music presence, as opposed to music absence, led to longer waiting times, therefore suggesting a decrease in the perception of elapsed time (i.e., time underestimation) [33,38,81], because music leads listeners to perceive time as passing more slowly. We can account for such an apparent contradiction by considering that music has a twofold nature: on the one hand, when it is reproduced in the background (as in most of the studies on waiting times mentioned above), it may be conceived as a distractor that draws the focus of attention away from the conscious perception of time. On the other hand, when music is paired with a visual stimulus (as in the film music domain), it becomes a key part of the meaning of that scene, the integration of which requires added attentional and memory resources.
Indeed, concerning the above-mentioned models of time perception (Section 3), this effect of the presence of music might point toward a memory-based phenomenon related to added complexity. In further detail, an audiovisual stimulus requires more information to be processed than a visual stimulus alone. Not only do audiovisuals require several parallel levels of processing, such as visual, music, kinesthetic, and, possibly, speech, sound FX (i.e., sounds recorded and presented to make a specific storytelling or creative point without the use of words or soundtrack, e.g., sounds of real weapons or fire), and text [82], but they also require their coherent integration aimed at building a working narrative, namely the subjective interpretation of a scene. Such integration involves both bottom-up (sensory-perceptual) and top-down (expectation-driven) processes: on the one hand, a recipient perceives information using their senses; on the other, they integrate this information using previous knowledge and cognitive schemas stored in long-term memory [82]. Other studies have already revealed that, under the influence of differently valenced soundtracks for the same video, not only do the viewers generate diverse plot expectations [20,83] and alter their recall of the scene [84,85] (i.e., high-level processing), but they can also be driven and even deceived in a way that impacts their visual perception (i.e., low-level processing) [18][19][20][21]. Therefore, it can be assumed that such integration, required only in the music conditions, could be the cause of time overestimation, in accordance with the memory-based model of time perception.
Elaborating on the differences between the soundtracks in more detail, when considering the four music conditions separately, both the positively valenced soundtracks (i.e., happy and relaxing) and the highly arousing ones (i.e., happy and scary) seemingly result in time overestimation (Figure 4).
Nevertheless, to better clarify the roles of valence and arousal, we implemented a more sophisticated path analysis that considered not just the experimental conditions but the subjectively perceived affective state and arousal level as plausible predictors of time estimation. The results of the model (Figure 5) clarified that the soundtracks' impact on the study participants was coherent with our hypotheses and, more importantly, that only the subjectively perceived level of arousal positively predicted the time estimation (in contradiction with [54,65]).
It appears that our results contrast with the well-known adage that "time flies when you're having fun", or at least they correct this adage in quite a counter-intuitive fashion, that is: "Time flies when you're not activated".
This outcome is coherent with those studies that examined several music parameters, showing that fast musical tempi [37,58,60] and high musical structure complexity [47], all features present only within the highly arousing soundtracks, led to overestimations. Similarly, our findings also agree with those studies that found that music in major mode (widely associated with positive affect) and music in minor mode (largely associated with negative affect) do not differ in their influence on time perception [64]. Had the mode made a difference, the music valence should have behaved as a negative predictor, given that the two positively valenced pieces were both in the major mode, whereas the two negatively valenced ones were both in the minor mode.
It is also worth noting that these findings are in contradiction with those of [66], where systematic overestimation in the judgment of the duration of joyful musical excerpts was found, and the opposite was noticed for the sad tracks. We may account for such a difference by bringing attention to two significant differences between their procedure and ours. First, although both studies employ a retrospective paradigm, Bisson and colleagues [66] inserted a cognitive task between the two musical excerpts, thus fostering a relevant change in the participants' foci of attention that could have created a bias in the internal clock mechanism. Secondly, and most importantly, it should be considered that their results (i.e., positive valence music in major key fosters time overestimation as opposed to negative valence music in minor key) might also be explained in terms of arousal. In fact, the positive valence musical piece that was used by Bisson et al. [66] (i.e., the 1st movement of Johann Sebastian Bach's Brandenburg Concerto No. 2 in F major, BWV 1047) can undoubtedly be considered an arousing composition, far more arousing than the negative valence musical piece that was used (Samuel Barber's Adagio for strings in B minor from the 2nd movement of String Quartet, Op. 11).
As for the debate between the attention-based and memory-based models, it is worth mentioning that no safe conclusion can be drawn from the current study. The attention-based model posits that time overestimations are due to a higher number of attentional pulses emitted in highly arousing situations, whereas the memory-based model posits time overestimation to be a phenomenon caused by the stimulus complexity. The more complex the stimulus, the more processing is required, and a greater number of traces remain in memory, thus leading to overestimations. To put this in terms of the Scalar Expectancy Theory, the pacemaker regularly emits attentional pulses, but the counter device, in the presence of richer stimuli, counts an increased number of pulses. The issue here is that the two arousing soundtracks both present, apart from the faster tempi, greater perceived complexity compared to both the relaxing and sad tunes.
To disambiguate between these two differently oriented models, in future works, it could be profitable to compare, for instance, two arousing soundtracks differing in the degree of harmonic and melodic complexity (for example, a very fast bebop jazz tune with a techno track), and the same might be done with two scarcely arousing pieces.
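The confound between the two accounts can be illustrated with a toy pacemaker-counter sketch. The Python code below is a deliberately simplified illustration under stated assumptions, not a model fitted to our data: the function, its parameters, and the gain value of 1.25 are all hypothetical. An arousal-driven increase in the pulse rate (attention-based account) and a complexity-driven increase in the number of pulses registered by the counter (memory-based account) yield the same inflated estimate when only the final duration judgment is observed.

```python
def estimated_duration(
    true_seconds: float,
    base_rate: float = 1.0,   # pulses per second at rest (hypothetical value)
    rate_gain: float = 1.0,   # arousal-driven pacemaker speed-up (attention-based account)
    count_gain: float = 1.0,  # complexity-driven counter inflation (memory-based account)
) -> float:
    """Toy pacemaker-counter model: convert counted pulses back into seconds."""
    emitted = true_seconds * base_rate * rate_gain  # pulses emitted by the pacemaker
    counted = emitted * count_gain                  # pulses registered by the counter
    return counted / base_rate                      # read-out against the resting rate


baseline = estimated_duration(100.0)
arousal_only = estimated_duration(100.0, rate_gain=1.25)
complexity_only = estimated_duration(100.0, count_gain=1.25)

print(baseline, arousal_only, complexity_only)
```

Because the two gains enter the estimate multiplicatively, they cannot be told apart from the duration judgment alone, which is why stimuli that vary arousal and complexity independently would be needed.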
Rather than through the attention-based and memory-based models, a phenomenological approach appears to be promising in explaining this and other aforementioned results. Such a phenomenological approach, promoted by Flaherty [86], has philosophical roots in the thought of Heidegger, Husserl [87], and Merleau-Ponty [88,89]. It proposes that time consciousness cannot be fully analyzed through perception because of the intrinsic nature of time, which is considered a construction more than an objectively perceived entity. As such, no fixedly emitted pulses of any sort can exist; on the contrary, our experience of the now arises from the integration of diverse perceivable stimuli into a single unit of content within consciousness. Yet, the number (i.e., how many of these stimuli we process) and the saliency of these stimuli vary depending on several factors, including memory, personality, affect, and physiological conditions. Two of Flaherty's forms of temporal experience (i.e., "temporal compression" and "protracted duration") deal with retrospective time judgement. Temporal compression happens when the listening activity is not so engaging (e.g., in our sad and relaxing conditions); in these cases, the listener's brain works almost automatically, so that "time will be experienced and retroactively constructed as having flowed quickly" [90] (p. 256) [31]. Conversely, the protracted duration phenomenon arises in cases of intense, novel, or extraordinary experiences (e.g., the highly arousing soundtracks of our study might belong to this category to a greater extent than the less arousing ones), and, similarly to the memory-based models, it is mainly due to a more complex structure of information that needs to be processed. Such a difference is eminently important as it accounts for our results in a coherent fashion.

Limitations
Lastly, the four main limitations of this study must be highlighted. Firstly, all of the measures employed are self-reported. On the one hand, self-reported measures in psychological studies on music have been consistently applied to the study of musically elicited emotions over the last 35 years and presented good reliability as long as they stem from validated theories or models of emotion [91]. Moreover, there is some evidence suggesting a consistent overlap between self-reported and psychophysiological measures such as skin conductance levels [92][93][94][95], heart rate [94,96], finger temperature, and zygomatic facial muscle activity [95]. On the other hand, it must be acknowledged that these two sets of measures cannot always be considered as equally valid in all contexts, and other studies found more complex relationships between them [97], even in the audiovisual domain [98]. For these reasons, it would be good practice to replicate these findings in a laboratory setting by employing one or several psychophysiological measures [99].
Secondly, some studies insist on the role of music preference [25] and familiarity [100] in time perception. In our study, to construct a more condensed online task (and to avoid further losses in participation), we did not ask our participants to express their musical preferences, nor did we measure the extent to which they were previously exposed to the genre of the soundtrack they were listening to. Similarly, we did not ask for their movie preferences. All these personal characteristics could have slightly biased our findings.
Thirdly, as regards the stimulus complexity in audiovisuals, we need to mention that an assessment of the subjectively perceived musical fit [101], that is, the degree to which, according to a viewer, musical and visual information overlap each other with no semantical frictions, could have been profitable. Nevertheless, so far, such an assessment has been validated for audiovisual advertising only [101,102].
Lastly, the short movie we used as the stimulus was not completely neutral from an affective standpoint; indeed, we found that 64.42% of the viewers in the no-music condition reported a negative affect during their viewing. Although we are confident that such an "affect negativity" of the visual stimulus could not have jeopardized the validity of the results per se, we are less certain that it did not impact the perceived congruity of the audiovisual; namely, the fact that a negative visual stimulus was in some conditions paired with a positive soundtrack could have led to a decrease in the stimulus congruity, thus eliciting a slightly different (and perhaps more complex) processing. Indeed, in this design, we did not include a measure of the musical experience per se. In other words, we did not assess the self-reported valence and arousal of the musical pieces separately from the video. As a consequence, what we have referred to as a self-reported measure of the affective state of the participants is a measure of the overall audiovisual stimulus, subsequent to the aforementioned cognitive process of integration of the visual and auditory channels of information. On the one hand, the results of our model (Section 4.5.2) support the claim that the musical pieces were representative of the desired valence; on the other, there is also evidence that visual information influences the perception and memory of music [103]. In future studies, we plan to include the investigation of the bi-directional influence between music and video stimuli, especially with reference to musical fit [101,102] interactions with time perception.

Conclusions
To conclude, two main results have been found in this study. The first is that the mere presence of music, regardless of its valence and arousal, leads to time overestimation in an audiovisual context, possibly due to the cognitive process of integrating the visual and auditory information. The second, and most important, is that the subjectively perceived level of arousal, which is in turn increased by faster musical tempi and greater stimulus complexity (i.e., happy and scary soundtracks), positively predicts the time estimation of an audiovisual (i.e., arousal leads to time overestimation). In the light of the studies mentioned in Section 3, the supposedly causal role of arousal in time overestimation appears to be solid. Further studies need to identify the cause by distinguishing between attention-based and memory-based models of time perception.
It is our intention to underline the potential that these findings and this research niche might present in the audiovisual domain. The notion that the interaction between the soundtrack and the moving image can affect the viewers' time perception should receive further attention from media psychologists, video content creators, filmmakers, and, in general, any scholars or professionals interested in shaping and improving the interaction between viewers and an audiovisual. As the development of new technologies continues, their interactive uses become more and more explored and exploited. It is reasonable to claim that improved management of the background music within audiovisuals could improve the interaction between the user and the audiovisual devices by shaping the recipients' time perception.
Our results confirm previously collected evidence [16,59,64] revealing that the musically conveyed arousal, and not specific emotions, fosters time overestimation within a narrative audiovisual scene.
We are aware that, from a naïve point of view, the fact that arousing music steers the listeners towards time overestimations might appear paradoxical. Yet this is far from being unknown among music composers. For instance, it is said that Maurice Ravel was very disappointed by Wilhelm Furtwängler's rendition of his Boléro, which was so fast that he thought it would have lasted forever [104].

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.