1. Introduction
Cultural stage events like operas or theater performances not only offer social elements that enhance live experience, but also convey an ambience, a unique atmosphere and a sense of being “live” that have proven difficult to experience via traditional screen-based streaming [1,2]. With the ever-increasing accessibility of low-cost head-mounted displays (HMDs), many possibilities arise that might change this situation.
Users of HMDs are able to experience an extension of the real world from home by accessing three-dimensional artificial worlds. Virtual environments (VEs) are based on powerful computing technology and algorithms to create immersive experiences [
3]. These immersive experiences can be defined as “the acceptance of one’s involvement in the moment that is conceived through multiple senses, creating fluent and uninterrupted physical, mental, and/or emotional engagements with a present experience” [
4] (p. 633). Depending on the purpose, VEs may involve the simultaneous participation of multiple users who are in different physical locations, which enables social interaction [
5,
6]. Due to their immersive features, VEs can create a sense of presence, colloquially described as a “sense of being there” [
3].
This opens up great opportunities for both cultural creators, such as theater and opera professionals, and cultural enthusiasts. On the one hand, VEs provide new business models for the provision of and access to entertainment and cultural content, e.g., by allowing real-world regional events to be shared with a wider audience. They thus help cultural stage formats move beyond their fixed locations and into the homes of their audiences [
7]. On the other hand, cultural VE formats enable expanded access opportunities for audiences—referred to as “users” in the context of virtual events—who have previously been unable or unwilling to experience traditional cultural events, such as individuals with physical disabilities, neurodivergent individuals, or those who simply prefer to experience cultural content from the comfort of their own homes. Immersive technologies, therefore, have the potential to foster greater diversity and inclusion in cultural participation [
8].
The original Reality-Virtuality Continuum by Milgram and Kishino [9] characterizes the degrees of extending the real world with virtual content. In their original taxonomy, Virtual Reality (VR) is presented as the endpoint of the continuum: it blocks out the real world and fully immerses the user in an artificial virtual one. So far, research on VR in the cultural context has primarily focused on musical performances [
10,
11,
12] (e.g., under the keyword “Music 3.0” [
13]) and cinema [
14,
15]. However, as the technology continues to evolve, new areas of application are emerging. While VR has been argued to show potential for innovative theatrical performances, including enhanced artistic creativity [
8], concerns about preserving the “performance’s ephemeral nature” [
16] make fully artificial virtual environments challenging for cultural stage events. Moreover, developing immersive VR applications requires significant financial resources, which cultural institutions rarely have at their disposal.
Between the endpoints of pure reality and virtuality, Mixed Reality (MR) describes technologies that blend real and virtual elements to varying degrees, including Augmented Reality (AR) and Augmented Virtuality (AV) [
9]. To overcome the limitations of VR in the specific case of virtual stage events, AV has particular potential as it allows real-time transmission of cultural performances like concerts or theater from the physical stage to a virtual environment. AV comprises two elements: a three-dimensional virtual environment and real-world content embedded within it [
9,
17]. In a virtual opera, this can be achieved, for example, by recording the real-world stage performance and integrating the recordings in a multiplayer application that consists of a virtual 3D model of a concert hall or theatre, and includes a virtual stage as well as a virtual auditorium. Although the implementation of pre-recorded video content naturally implies a lower level of direct interaction between performers and audience, this aligns well with the traditionally passive nature of opera and theatre experiences, making AV particularly well-suited for these cultural formats. As the recordings are embedded in an interactive, immersive VE, AV also holds significant advantages over 360-degree video streaming. While the latter merely enables content to be streamed, AV enables multiplayer setups and, therefore, social interaction and connection over distance.
Despite being the most underexplored research area within the Reality-Virtuality Continuum [
9,
18], AV therefore offers several advantages for the implementation of stage performances: (i) The virtual environment only needs to be designed and developed once, so complex methods for digitizing real spaces, e.g., photogrammetry [
19], can be applied to ensure a high level of immersion. (ii) Recording performers’ emotions is beneficial, as even by applying advanced AI algorithms, replicating emotional nuances in performers’ facial expression in VR based on avatar representations remains challenging [
20]. (iii) Further, the technical implementation based on real-time recordings enables hybrid formats that extend the performance’s reach beyond the real auditorium to the virtual audience.
When designing an AV application, the perennial challenge is to determine which elements and features contribute to an immersive user experience and should therefore be implemented. A recent systematic review on attendance motives has shown that ‘artistic content’ is a main motivator for both real-world and virtual art performances [
21]. This attendance motive category highlights ‘aesthetics’, which results from sensory appeal as well as the visual and auditory design of the performance. Therefore, the representation of artistic content must be carefully considered to meet the needs and expectations of the audience in virtual stage performances. In the special case of AV applications, the representation of the artistic content is highly dependent on the recording methods used. The event environment (e.g., theater, concert hall) is a multiplayer, three-dimensional environment, and empirical findings have shown that increasing users’ feeling of presence requires the quality of the various implemented elements to be matched [22]. Consequently, audio and video recordings would require spatial features to match the three-dimensional event environment. Both features show promise in creating immersive audio-visual experiences that match users’ real-world perception and from which users could benefit.
However, implementing these spatial features introduces significant technical constraints for real-time, distributed AV applications. The increased technical requirements can reduce the number of users who can simultaneously access a distributed AV application or negatively impact system performance, e.g., through increased latency or reduced bandwidth, which are crucial factors for user experience [
23]. Furthermore, empirical findings on the effects of stereoscopic video and spatial audio in virtual environments show contradictory results: Some studies demonstrate positive effects on presence [
24,
25,
26,
27,
28], while others show no effects or even negative effects such as motion sickness [
29,
30].
To the best of our knowledge, there are no empirical studies that examine the relationship between specific recording features and the user experience in cultural AV environments. Although prior work has examined the effects of video and audio features in VR and 360-degree video applications, these results cannot be easily transferred to AV applications, which, as mentioned above, differ in their levels of virtuality and interactivity and, crucially, in the integration of spatial audio and video. The context-dependency of the effects [
31] and the lack of studies specifically investigating AV applications [
18,
32] make it unclear whether the benefits of spatial features justify their technical costs. Drawing from the need to balance user experience and technical performance, such as system stability and accessibility for distributed users, our interest is in exploring the relation of audio and video features to the perceived user experience in an AV application. We chose the context of an opera performance for this research, as it places demands on both the video and audio features. Our study is the first to investigate the absolute and relative effects of video and audio recording features for the design of AV applications for cultural institutions. The study results are not only an empirical contribution but also provide practical recommendations that act as a resource allocation framework. This enables designers and developers to make informed decisions about effective and efficient AV applications. This is important for cultural institutions, which have limited resources at their disposal.
Because theoretical models and constructs of user experience in virtual environments are highly diverse and debated in the VR community, the following section first briefly summarizes key dimensions of user experience to define terms and establish a common understanding of the constructs relevant to this paper.
1.1. Key Dimensions of User Experience in Virtual Environments
There is no consensus in MR research on how to operationalize user experience, and different theoretical models subsume different constructs under this umbrella term [
33]. Within the field of human factors research, user experience can be defined as “a person’s perceptions and reactions resulting from the usage or expected use of a product, system, or service” (ISO 9241-210, 2019; [
34]). These perceptions and reactions include all emotions, ideas, preferences, perceptions, feelings of well-being or discomfort, behaviors, and performance that occur throughout the entire usage cycle. The selection and measurement of relevant constructs typically depend on the specific application context and research objectives. For the present study of a virtual opera performance, we identified several relevant constructs that capture different key dimensions of user experience in an AV application for cultural event experiences. Reference [
35] suggests several constructs to be relevant to capture different key dimensions of user experience in VR, including presence, VR sickness, emotions, aesthetics, and usability. Our study’s primary objective was to examine the audio and video features of embedded recordings in AV. Given that these recordings do not impact the application’s usability, it appears unreasonable to consider usability as a metric for assessing their effects. The same argument can be applied to aesthetics, as this dimension is operationalized as aesthetic appreciation of the application design. It is measured using adjectives such as ‘beautiful’, ‘attractive’, and ‘elegant’, and therefore represents an overall impression and evaluation of the entire VR experience. It is not assumed to be sensitive to the spatial features of embedded recordings in an AV application. However, the three remaining dimensions of user experience seem applicable to an AV application and are further described.
The dimension of (tele)presence is one of the most widely used and established concepts in VR research, often considered a key indicator of the overall “quality” of a virtual experience [22]. It has gained growing attention as researchers and practitioners from different fields seek to understand the social and psychological effects of immersive, interactive technologies [36]. According to a meta-analysis by Skarbez et al. (2017) [22], presence can be broadly understood as “the perceived realness of a mediated or virtual experience” (p. 16). This differs from the commonly used definition of presence, which describes the subjective ‘sense of being there’ [3], an experience in which users react to the mediated space as if it were real, even though they are cognitively aware of its artificial nature. However, as [22] argue, this particular sensation of being located in a remote or virtual place is more accurately captured by the term place illusion. Place illusion, rather than presence in its broader sense, refers specifically to the sensation of being physically situated in a space that is not actually real. For this paper, we follow the model of Skarbez et al. (2017) [22] that defines place illusion as closely related to pure spatial presence, resulting from the sensory immersive characteristics of the application and being a subdimension of the broader and higher-order construct of presence. From this, it can be assumed that improvements in sensory immersive characteristics, such as spatial features in an AV application, directly affect the place illusion and therefore contribute positively to overall presence. Still, Skarbez et al. (2017) [22] do not address MR applications, thereby limiting the transfer of relationships and hierarchical orders of constructs to AV to an exploratory basis. As the overall presence as well as the place illusion are relevant for the user experience of a virtual opera experience, both constructs should be considered separately as outcomes in empirical studies.
Compared to live performances, the dimension VR sickness or motion sickness represents a potential negative consequence of using an HMD, with symptoms such as eye strain, blurred vision, or nausea that can significantly impair user experience [37]. Given that various characteristics of content in VEs, including visual features such as stereoscopy, can influence these sources of discomfort, it is essential to monitor motion sickness when evaluating different media integration approaches in an AV application.
The emotions dimension is characterized by the occurrence of basic feelings that positively or negatively contribute to the overall user experience. In VR research, enjoyment [38] is an important outcome measure, as it represents an overarching goal of the experience of cultural stage events. Enjoyment captures the affective or emotional dimension of user experience, reflecting the pleasure and emotional engagement users derive from the virtual experience. As enjoyment was found to be a central attendance motive for performing arts events, both in real-world and virtual events [21], the concept of enjoyment serves as a critical indicator of whether the application successfully delivers emotionally meaningful content. Enjoyment and presence were found to be positively related [39], and findings of [35] even suggest that emotions partially mediate the relationship between presence and user experience. Study results of [40] indicate that presence increases enjoyment, and [41] even defined enjoyment as a ‘presence effect’. This suggests that enjoyment is a rather distal outcome, whereas presence might be a more direct and proximal outcome resulting from VE design.
Additionally, we assume Quality of Experience (QoE) to be a relevant criterion. In contrast to enjoyment, QoE represents the cognitive-evaluative dimension of user experience. It refers to users’ overall evaluation and judgment of the quality of their experience, encompassing factors such as perceived technical quality, satisfaction with the application, and the degree to which the experience meets their expectations [42]. Therefore, while enjoyment focuses on emotional responses, QoE captures users’ reflective assessment of the experience.
Summing up, a substantial amount of MR literature highlights the relevance of the above-mentioned key dimensions of user experience. While there is, to our knowledge, no model or framework that explains the relationships of these dimensions in AV, we still assume these dimensions to be a useful guide to select relevant outcome criteria for evaluating design features in AV.
1.2. Video Features
The visual sense is the most valued of the five human senses [
43]. Therefore, it is not surprising that a major effort in the development of a VE is put into the visual design. The majority of humans can perceive depth through natural stereoscopic vision, which is based on binocular perception. This enables them to view the world in three dimensions. Stereopsis contributes to a vivid experience of the world and improves perceptual quality [
44]. From this, it can be deduced that depth perception is crucial in VEs, as it enhances the sense of realism (presence) by creating an experience that visually resembles the real world, and simultaneously increases place illusion. To conclude, the format of video recording transferred to an AV application should be designed carefully. Stereoscopic videos, recorded with two cameras and an appropriate stereo base [
45], can mirror this natural perception and potentially enhance presence in VEs.
As discussed above, findings from TV, VR, or 360-degree video applications may not be directly applicable to AV applications. Nevertheless, we want to provide a brief review of the current state of research in these related fields below, as the reported findings can offer insights into the potential effects of spatial audio and video rendering features within AV applications. Regarding the effect of stereoscopy, contradictory results can be found in the literature, with some studies showing positive effects, while others show no or even negative effects. Consistent with expectations based on human natural stereoscopic vision, several studies on TV screens [
25,
26,
27] as well as one study on HMDs [
28] have shown positive effects of stereoscopy on the sense of presence. Furthermore, a meta-analysis on immersive system features [
24] emphasizes stereoscopic vision as a major factor in enhancing presence, recommending its prioritization in immersive system development. On the other hand, [
31] shows that stereoscopy does not automatically improve the sense of presence and that the context of use has to be taken into account. In a study using 360-degree videos, the comparison of 2D and 3D videos showed significant differences neither for presence nor, surprisingly, for the subscale realness [
46]. Furthermore, it was shown that 3D visuals might lead to motion sickness [
29,
30]. This effect is enhanced even more in case of low update rates [
23] resulting from an insufficient available data rate. The risk of insufficient data rates is higher when streaming stereoscopic videos into the virtual environment than with 2D videos. To the best of our knowledge, no study has investigated the effects of video features in an actual AV application. In summary, the effects of video features in VEs in general and for AV applications in particular remain unclear as a result of limited research in this field. However, since (i) depth information may enhance an immersive experience, (ii) stereoscopic recordings mirror familiar real-world perception, and (iii) stereoscopic recordings create visual quality that matches the three-dimensional virtual environment [
22], it can be assumed that stereoscopic recordings positively affect both place illusion and overall presence in AV applications. As a result of a vivid experience and with reference to the above effects of presence (cf.
Section 1.1), it can further be assumed that spatial video features enhance enjoyment and QoE but simultaneously increase the risk of motion sickness.
1.3. Audio Features
Audio features represent the second characteristic that can be intentionally influenced when integrating recordings in an AV application.
Sound is an important aspect of an immersive and realistic experience as well, and the increasing demand for immersive virtual environments has given the topic of spatial sound a new relevance [
47,
48,
49]. Humans localize sound through binaural hearing by the comparison of signals received by the two ears [
50]. Spatial audio rendering in VEs can recreate this three-dimensional auditory experience [
47,
48,
49]. Therefore, binaural hearing, like stereoscopic vision, should lead to a higher perception of vividness and mirror the real-world experiences (presence) of people in virtual environments. Additionally, auditory stimuli from different spatial directions increase the impression of being in a three-dimensional space. As audio contributes to the overall immersion of a system [
24] and immersion is the precondition for the development of place illusion [
22], spatial audio features need to be considered in the design of an AV application.
In virtual environments, binaural audio rendering is created by estimating the spatial position of the user relative to the sound source(s) and accordingly adapting the auditory information to both ears based on direction, distance or sound barriers [
51]. Achieving spatial audio in virtual environments requires a complex rendering process [
47]. Ideally, each sound source is recorded separately by microphones, which further increases technical requirements.
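To illustrate the cues named above, the following minimal Python sketch derives direction- and distance-dependent values for a single sound source. It is a simplified textbook model (Woodworth's spherical-head approximation for the interaural time difference, constant-power panning, and inverse-distance attenuation) and not the rendering pipeline of any particular engine; the function name `binaural_cues`, the head radius, and the speed of sound are assumptions made for this example.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature
HEAD_RADIUS_M = 0.0875      # assumed average head radius (illustrative)


def binaural_cues(listener_xy, head_yaw_rad, source_xy):
    """Return (itd_seconds, gain_left, gain_right) for one sound source.

    Coordinates are in meters; with yaw 0 the listener faces +y and +x is
    to the right. Positive yaw turns the head to the right.
    """
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    distance = max(math.hypot(dx, dy), 1.0)  # clamp so the gain never exceeds 1

    # Source direction relative to the nose, wrapped to [-pi, pi).
    azimuth = math.atan2(dx, dy) - head_yaw_rad
    azimuth = (azimuth + math.pi) % (2.0 * math.pi) - math.pi

    # Fold rear directions onto the front hemisphere (front/back ambiguity).
    lateral = math.asin(math.sin(azimuth))

    # Woodworth approximation: a positive ITD means the right ear leads.
    itd = HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * (lateral + math.sin(lateral))

    # Constant-power panning combined with inverse-distance attenuation.
    pan_angle = (math.sin(lateral) + 1.0) * math.pi / 4.0
    attenuation = 1.0 / distance
    return itd, math.cos(pan_angle) * attenuation, math.sin(pan_angle) * attenuation


# A source 7 m straight ahead, then the same source after the listener
# turns the head 90 degrees to the left (the source is now to the right).
front = binaural_cues((0.0, 0.0), 0.0, (0.0, 7.0))
turned = binaural_cues((0.0, 0.0), -math.pi / 2.0, (0.0, 7.0))
```

In this toy model, a centered source yields equal gains and no time difference, whereas a head turn shifts both the level and the arrival time toward one ear. Recomputing these values for every source on every head movement is the kind of per-frame work that makes spatial rendering more demanding than a fixed stereo mix.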
In an empirical study by Shin et al. (2019) [
11] including 360-degree videos of a live concert, spatial audio was found to increase presence, which further positively affected enjoyment. The effect was higher than the effect of stereoscopic vision. However, spatial sound can lead to motion sickness and thus negatively affect the user experience [
52]. Contrary to Shin et al. (2019) [
11], Cummings and Bailenson (2016) [
24] argue that stereoscopic vision should be preferred over auditory stimuli in the development of immersive applications from a cost–benefit perspective, as the positive effect of audio on presence measured in their meta-analysis was comparatively low. To the best of our knowledge, no study has investigated audio effects in an actual AV application.
In summary, the effects of audio features in VEs in general and for AV applications in particular remain unclear as a result of limited research in this field. Still, it can be assumed that spatial audio positively affects place illusion, presence, enjoyment, and QoE due to the following reasons: (i) Spatial audio has the potential to induce a vivid experience, (ii) binaural rendering mirrors familiar real-world experiences, and (iii) spatial sound matches the three-dimensional virtual environment regarding the level of audio quality. Simultaneously, the risk of motion sickness was found to increase when applying spatial audio.
1.4. Scope of Study
Facing inconsistent results from VR research and insufficient empirical data from AV research, the present study investigates the relevance of the spatial aspects of video and audio features for key dimensions of user experience. We evaluate which components of an AV application actually enhance the user experience and compare, in an exploratory manner, the relative importance of video and audio features, especially taking into account the limited system performance resources of AV applications. Therefore, our research question is as follows:
How do video dimensionality and audio rendering features of an opera performance recording affect the user experience of an AV opera experience?
We operationalize the user experience based on specific key dimensions derived from the literature (cf.
Section 1.1). Due to the lack of specific theoretical models and frameworks for AV, we explore the derived key dimensions as separate outcome criteria. Based on the theoretical considerations on the effects of spatial features of recordings in AV applications outlined above, we derive the following hypotheses, which will be tested in the present study:
H1. Compared to 2D videos, integrating 3D videos into a virtual environment will result in higher levels of presence (H1a), place illusion (H1b), enjoyment (H1c), and perceived QoE (H1d), but will also lead to increased motion sickness (H1e).
H2. Compared to stereophonic audio, the integration of spatial audio recordings into a virtual environment will result in higher levels of presence (H2a), place illusion (H2b), enjoyment (H2c), and perceived QoE (H2d), but will also lead to increased motion sickness (H2e).
To answer the research question and test the hypotheses, a quantitative laboratory study was conducted with participants experiencing opera scenes in an AV application, applying a within-subjects design. The following sections describe the study method in detail, followed by the results and implications for the design of AV applications.
2. Materials and Methods
2.1. Participants
A total of 30 participants (17 female and 13 male) aged between 19 and 43 years (M = 26.27; SD = 7.66) took part in the experiment. The sample consisted of interdisciplinary researchers affiliated with the Chemnitz University of Technology, as well as interested participants from the surrounding area. They had no involvement in the project or prior knowledge of the study. All participants were informed about the possible occurrence of motion sickness due to the use of an HMD. Diagnosed epilepsy was an exclusion criterion.
Of the participants, 25 (83.3%) had used an HMD before; however, only two (6.7%) reported using one regularly (at least once a month). Only five of the 30 participants (16.7%) regularly attended real-world stage performances (such as theater and opera), while eight (26.7%) reported visiting cultural stage events once a year or less. Eleven participants (36.7%) wore visual aids during the trial, each reporting full visual acuity.
2.2. Virtual Environment
Using photogrammetry techniques, the VE with six degrees of freedom was constructed as a digital replica of a German opera house. Participants were seated in the center of a row, positioned seven meters from the stage, the same distance at which the videos were recorded. This seating position remained unchanged throughout the experiment. Neither non-player character avatars nor a characteristic soundscape of the opera hall were integrated into the seating area, to avoid distracting participants and to ensure full focus on the video and audio recordings. The digital replica of the stage was extended by a virtual screen object. On this screen, video recordings from the operetta “The Merry Widow”, shot at the real opera house, were seamlessly integrated so that the stage content looked like part of the VE rather than a projection on a cinema screen.
Figure 1 shows a schematic visualization of the AV application. The first-person perspective of the participants (visualized by the avatar) included both the virtual model of the opera and the real-world video recording [
53].
2.3. Technical Setup
The prototype AV application was developed using Unity 2021.3.0f1 (Unity Technologies, San Francisco, CA, USA). It ran on a Fujitsu Celsius H980 laptop (Fujitsu Limited, Tokyo, Japan) equipped with an Intel i7-8850H processor (Intel Corporation, Santa Clara, CA, USA) and an Nvidia Quadro P4200 graphics card (Nvidia Corporation, Santa Clara, CA, USA). The experiment was conducted in 2023 using the Meta Quest 2 HMD (Meta Platforms Inc., Menlo Park, CA, USA), as it was the most widely used VR headset for home use. It provided a render resolution of 1832 × 1920 pixels per eye and was connected to the laptop via an Oculus Link cable. The VE was rendered at 72 frames per second. Due to the field of view of the Meta Quest 2, the entire stage was visible, with part of the surrounding VE also within the user’s field of vision, without the need to turn one’s head.
The video was recorded using eight Sony RX0 II cameras (Sony Corporation, Tokyo, Japan), which record in 4K resolution, arranged in a stereoscopic camera array at a height of 140 cm. Afterward, the recordings were rendered to create a 3D video (stereo base = 12 cm). Because of the opera use case, the videos were shot at a large distance from the stage (seven meters), which required hyperstereoscopic video to create a sufficient depth effect. Therefore, a stereo base of 12 cm, which exceeds the natural human stereo base, was chosen to ensure appropriate depth perception.
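The effect of the enlarged stereo base can be made concrete with a short geometric calculation. The sketch below compares the angle subtended at a point on stage by two viewpoints 12 cm apart with the angle for a typical human interocular distance. The value of 6.5 cm and the function name `convergence_angle_deg` are assumptions for illustration and are not taken from the study.

```python
import math


def convergence_angle_deg(stereo_base_m, distance_m):
    """Angle (in degrees) subtended at a point by two viewpoints.

    The viewpoints are separated by `stereo_base_m` and the point lies
    `distance_m` straight ahead of their midpoint.
    """
    return math.degrees(2.0 * math.atan(stereo_base_m / (2.0 * distance_m)))


HUMAN_BASE_M = 0.065     # assumed typical interocular distance (not from the study)
CAMERA_BASE_M = 0.12     # stereo base reported for the recordings
STAGE_DISTANCE_M = 7.0   # recording and seating distance reported for the study

human_angle = convergence_angle_deg(HUMAN_BASE_M, STAGE_DISTANCE_M)
camera_angle = convergence_angle_deg(CAMERA_BASE_M, STAGE_DISTANCE_M)
ratio = camera_angle / human_angle

print(f"human: {human_angle:.3f} deg, camera: {camera_angle:.3f} deg, ratio: {ratio:.2f}")
```

Under these assumptions, both angles are below one degree at seven meters, and the 12 cm base enlarges the angle by a factor of roughly 1.8, which is the geometric rationale for choosing a base wider than the human one at this distance.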
Both videos—2D and 3D—were then integrated into the VE by mapping them to the texture of the virtual screen object with a resolution of 1920 × 1080 pixels. The frame rate of the video rendering was set to 30 frames per second. Both settings—screen resolution and frame rate—were based on pre-tests that showed a subjectively satisfactory level of video detail and no apparent restrictions on the fluidity of the video, while ensuring manageable file sizes and good overall VE performance.
With regard to the audio recordings, all performers were recorded individually. In addition, the orchestra and the hall were picked up via stereo miking at the orchestra pit. The final stereo mix was created via Samplitude Pro X7 (Magix Software GmbH, Berlin, Germany) and then rendered in Unity in stereo and in spatial audio, resulting in two audio rendering versions. For the stereo condition, a consistent auditory experience was created regardless of head movement, whereas in the spatial audio condition, sound sources were rendered with head tracking, allowing for realistic spatial localization and direction-dependent sound intensity. To ensure high audio quality despite the relatively limited audio capabilities of the Meta Quest 2, external Bluetooth headphones (JBL Live 660; JBL, Stamford, CT, USA) were used for audio playback within the experiment.
2.4. Experimental Design
To test the hypotheses, a 2 × 2 within-subjects design was applied for the study. We varied the recordings of an opera that were integrated on the virtual stage by two independent variables: video dimensionality (2D (V_2D) vs. 3D (V_3D)) and audio rendering (stereophonic (A_ST) vs. spatial (A_SP)). This experimental design resulted in four scenarios that all participants experienced in randomized order after a baseline condition (cf. Section 2.6). For this purpose, a Latin square was applied, resulting in 24 sequences of the four experimental scenarios. With 30 participants, six sequences had to be used twice; these were selected to minimize sequence overlap. Apart from the recording features, neither the VE nor the content of the integrated real-world recordings changed between the scenarios. The study design was reviewed and approved by the ethics committee of Chemnitz under the reference number #101558791.
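The counterbalancing scheme can be sketched in a few lines of Python. The example enumerates the 24 possible orders of the four scenarios and distributes them over 30 participants, so that six orders occur twice. Note the assumptions: the scenario labels are shorthand, the helper `assign_orders` is hypothetical, and the six repeated orders are drawn at random here, whereas the study selected them to minimize sequence overlap by a criterion that is not specified in detail.

```python
import random
from collections import Counter
from itertools import permutations, product

VIDEO_LEVELS = ("V_2D", "V_3D")   # video dimensionality
AUDIO_LEVELS = ("A_ST", "A_SP")   # audio rendering

# Four scenarios of the 2 x 2 design and all 4! = 24 presentation orders.
SCENARIOS = [f"{v}+{a}" for v, a in product(VIDEO_LEVELS, AUDIO_LEVELS)]
ALL_ORDERS = list(permutations(SCENARIOS))


def assign_orders(n_participants, seed=0):
    """Return one presentation order per participant.

    Every order is used equally often as far as possible; the remaining
    participants receive distinct orders drawn at random (an assumption
    made for this sketch, not the selection rule used in the study).
    """
    rng = random.Random(seed)
    full_cycles, remainder = divmod(n_participants, len(ALL_ORDERS))
    plan = ALL_ORDERS * full_cycles + rng.sample(ALL_ORDERS, remainder)
    rng.shuffle(plan)  # randomize which participant receives which order
    return plan


plan = assign_orders(30)
usage = Counter(plan)
```

With 30 participants, `usage` contains 18 orders that occur once and six that occur twice, which matches the distribution described above.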
2.5. Data Collection
Several dependent variables were measured. In addition to the subjective quantitative measurement tools, qualitative data was collected at the end of each experiment in the form of a short guided interview with open-ended questions.
2.5.1. Quantitative Measures
To measure the dependent variables, and therefore the relevant key dimensions of user experience as presented in Section 1.1, we aimed to use well-established, validated scales. However, due to the specific context of our AV application, we occasionally had to omit individual items or develop items on our own for content-related reasons. Through reliability analysis, we ensured internal consistency of the factors. Since we used a 2 × 2 within-subjects design, all participants had to complete four questionnaires. To minimize the burden and effort for the participants, we therefore included several single-item measures. Across all scales, participants answered 14 items following each of the four scenarios. Responses were given on a 7-point Likert scale, with varying anchor labels depending on the specific question. A complete list of the items is provided in Appendix A. German translations of the scales were used.
Presence: To measure the overarching sense of presence, we used the slightly adapted single-item measure by [
54] as it has demonstrated good reliability, validity, and sensitivity across different levels of presence. We thereby followed the recommendation of a meta-analysis [22], which proposed this item as an efficient direct measure of presence due to its brevity and strong correlations with established questionnaires.
Place Illusion: To measure Place Illusion, we used three items from the TPI [
36] subscale Spatial Presence. Given the specific characteristics of our AV environment, where real video recordings of an opera performance are integrated into a virtual opera house, we identified a need for additional items that specifically address the spatial integration of this video content. To address this specific aspect of Place Illusion, we developed two additional items. Both items directly addressed whether participants perceived the stage performance as being located within the same spatial environment as themselves: “How much did it seem as if you and the actors were together in the same place?” and “How much did it seem as if the actors and requisites were in the virtual space?” As the third item of the TPI subscale already addresses the impression of a shared space based on auditory integration, the strong emphasis on audio design is offset by the two additional items. To ensure adequate scale quality, we undertook reliability and validity assessments as recommended by [
55]. Confirmatory factor analysis (CFA) was used to evaluate the factor structure of the scale (cf. Section 2.7). As an example, we report the results for scenario ‘V_3D + A_SP’. The suitability of the data for factor analysis was checked using the Kaiser–Meyer–Olkin (KMO) test. Following Kaiser [56], the KMO value indicated that the data were suitable (‘meritorious’) for factor analysis (KMO = 0.81). Analysis of the eigenvalues and the scree plot supported a one-factor structure for the five items. The specified measurement model with one latent factor showed excellent fit to the data (p(χ²) = 0.347, CFI = 0.992, TLI = 0.983, RMSEA = 0.063). We assessed the internal consistency of the scale by calculating Cronbach’s alpha and McDonald’s omega [57] for each of the scenarios. The lowest internal consistency measured was α = 0.83 and ω = 0.81, both indicating good internal consistency [58].
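The internal-consistency check can be illustrated with Cronbach’s alpha, which for k items is k/(k − 1) · (1 − Σ item variances / variance of the sum score). The following Python sketch applies this formula to made-up 7-point ratings; the data and the function name `cronbach_alpha` are hypothetical and do not reproduce the study’s values. McDonald’s omega additionally requires factor loadings from a fitted measurement model and is not shown.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])  # number of items in the scale
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up 7-point ratings: six respondents, five items (illustration only).
ratings = [
    [5, 6, 5, 6, 5],
    [3, 3, 4, 3, 3],
    [6, 7, 6, 6, 7],
    [2, 3, 2, 3, 2],
    [4, 4, 5, 4, 4],
    [6, 5, 6, 6, 6],
]
print(round(cronbach_alpha(ratings), 3))
```

Values close to 1 indicate that the items vary together; perfectly identical items yield exactly 1.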
Enjoyment: We included the Enjoyment scale by [11], which was developed based on [38], to measure the overall affinity for the experiences. Reliability analysis demonstrated high internal consistency, with the lowest values being α = 0.92 and ω = 0.92.
QoE: To measure the direct effect of the different audio and video representations on the perceived quality of experience, we used a single item: “How would you rate the overall quality of the experienced stage show?” The item was designed based on the work of [59, 60], who both used single-item measures to examine QoE.
Motion Sickness: Motion Sickness was measured with a single item capturing general physical discomfort, which allows for a concise yet comprehensive assessment of motion-related unease across all scenarios. It was developed based on the first item of the cybersickness scale by [61].
2.5.2. Qualitative Evaluation
After the experiment, short semi-structured interviews were conducted. As we aimed for unaffected opinions to achieve a better understanding of user requirements, we did not ask directly about stereoscopy or spatial audio. The participants answered, amongst others, the following questions (translated here from German to English) verbally:
How would the application have to change to give you added value?
How important was the video quality of the performance to you?
How important was the audio quality of the performance to you?
When respondents indicated that video or audio quality was relevant to their user experience, follow-up questions were asked to better understand which aspects of video and audio design were particularly relevant.
2.6. Procedure
After arriving at the laboratory, participants received written and oral instructions describing the study procedure and were asked to sign consent forms. Subsequently, they completed a pre-survey that collected demographic data and asked about prior experience with XR technologies. Afterwards, spatial-hearing and stereo-vision tests were conducted to ensure that participants were able to perceive the differences between scenarios. To test stereo vision, we used the Lang Stereo Test II [
62]. As far as we know, there is no standardized test for measuring spatial-hearing. We therefore tested participants’ spatial-hearing by presenting them with three different 3D audio recordings in which the position of the sound source changed in three-dimensional space. The test was passed if the subjects were able to identify the direction of the moving audio source. JBL Live 660 headphones were used for the spatial-hearing test. Participants were then instructed on the use of HMDs and were given the opportunity to explore and familiarize themselves with the VE in a baseline scenario. In this scenario, participants were immersed in the virtual opera environment, including an empty virtual stage (
Figure 2). In this condition, the brightness of the virtual opera hall was higher than in the later experimental scenarios, allowing participants to familiarize themselves with the virtual space. The primary goal of this baseline scenario was to mitigate the novelty effect and allow novice XR users to adapt to the new experience. Following this introduction, which lasted a minimum of 90 s but could be extended at the participants’ request, participants completed a brief follow-up questionnaire regarding motion sickness on a tablet.
They were then asked to put on the HMD again and watch the first of the four video clips of the opera scene, which were presented in randomized order. As mentioned above, all four video clips showed the same scene of the opera and differed only in their recording modalities (video dimensionality and audio rendering; cf. Section 2.4). Each video lasted three minutes, a duration chosen in line with [63], whose results suggest that subjects need at least three minutes for a sense of presence to emerge. Participants were reminded that they were free to move and look around during the performance while remaining seated.
After each of the four scenarios, participants completed a follow-up questionnaire assessing their subjective impressions (cf.
Section 2.5.1) and were able to take a break to reduce the possibility of motion sickness. Towards the end of the experiment, a short semi-structured interview was conducted in which participants answered open-ended questions about their perception of the audio and video recordings and possibilities for enhancing the experience. In total, the experiment lasted approximately 50 min. Participants were paid 15 euros for their participation.
2.7. Data Analysis
Statistical analyses were conducted using IBM SPSS Statistics (version 30.0; IBM Corporation, Armonk, NY, USA). Prior to the main analyses, the dataset was screened for serious outliers and unexpected patterns within individual items. Although the Shapiro–Wilk test indicated that the data for the variables Presence, QoE and Motion Sickness deviated from perfect normality, two-way repeated measures ANOVA was performed. This approach is justified by simulation-based evidence suggesting that repeated-measures ANOVA is robust to violations of the normality assumption as long as the sphericity assumption is met [
64]. An a priori power analysis for a two-way repeated measures ANOVA (effect size f = 0.25, α = 0.05, power 1 − β = 0.90) indicated a minimum sample size of 30 participants, which was achieved. The effect size used for the calculation was based on a meta-analysis [24], which showed a medium-sized effect of both stereoscopy and audio stimuli on the user’s presence in virtual environments. Two-way repeated measures ANOVAs with the factors ‘video’ (2D (V_MO) vs. 3D (V_ST)) and ‘audio’ (stereophonic (A_ST) vs. spatial (A_SP)) were conducted for all dependent variables (see Section 2.5.1) to test for differences between the audio and video formats as described in our hypotheses and to compare the relative effects of the two independent variables. Significant effects were further examined using paired-samples t-tests to evaluate simple effects.
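The structure of a 2 × 2 repeated measures ANOVA can be illustrated without statistical software: because every effect has one numerator degree of freedom, its F value equals the squared one-sample t statistic of a per-participant contrast score, and partial η² follows as F / (F + df_error). The Python sketch below shows this equivalence on invented ratings. It is not the SPSS analysis reported here; the cell order, the data, and the function name `rm_anova_2x2` are assumptions made for illustration.

```python
from statistics import mean, stdev


def rm_anova_2x2(scores):
    """F and partial eta squared for a 2 x 2 within-subject design.

    `scores` holds one tuple per participant in the (hypothetical) cell
    order (V_MO+A_ST, V_MO+A_SP, V_ST+A_ST, V_ST+A_SP).
    """
    n = len(scores)
    df_error = n - 1
    # Contrast weights; each effect has one degree of freedom.
    contrasts = {
        "video": (-1, -1, 1, 1),
        "audio": (-1, 1, -1, 1),
        "video x audio": (1, -1, -1, 1),
    }
    results = {}
    for name, weights in contrasts.items():
        # One contrast score per participant.
        c = [sum(w * x for w, x in zip(weights, row)) for row in scores]
        # One-sample t-test of the contrast against zero; F(1, n-1) = t^2.
        t = mean(c) / (stdev(c) / n ** 0.5)
        f = t * t
        results[name] = {
            "F": f,
            "df": (1, df_error),
            "partial_eta_sq": f / (f + df_error),
        }
    return results


# Invented 7-point ratings for six participants (illustration only).
ratings_by_cell = [
    (4, 5, 4, 6),
    (3, 5, 4, 5),
    (5, 6, 5, 6),
    (2, 4, 3, 4),
    (4, 4, 5, 6),
    (3, 5, 3, 6),
]
for effect, stats in rm_anova_2x2(ratings_by_cell).items():
    print(effect, round(stats["F"], 2), round(stats["partial_eta_sq"], 3))
```

The sketch assumes complete data and non-constant contrast scores; a full analysis would also report p values, which require the F distribution.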
The statistical software R (version 4.5.0; R Foundation for Statistical Computing, Vienna, Austria) was used to specify the measurement model via structural equation modeling. The dimensionality of the scale was determined using the eigenvalue > 1 criterion and scree plot analysis. The package “lavaan” [65] was applied using maximum likelihood estimation.
The qualitative data were transcribed using MAXQDA 2020 analysis software (VERBI Software, Berlin, Germany) and afterwards coded and categorized using qualitative content analysis [
66]. The categories were initially created deductively and are based on the research and interview questions. These formed the main categories of the analysis and the basis for the first coding run. In a second step, all text passages in the main category were assigned to subcategories according to their content. The subcategories were therefore created inductively, as they were derived from the collected material and thus contribute to the differentiation of the category catalog. For example, the different aspects of video and audio features mentioned by the participants served as subcategories. Finally, the interviews were coded in their entirety using the differentiated category system. Text passages were assigned to several categories if it seemed appropriate based on their content. As the interviews also included questions relevant to the user-centered development process of the AV application, only answers to the relevant questions are reported in the results.
4. Discussion
This study explored the effects of video dimensionality and audio rendering on user experience in an augmented virtuality (AV) application that integrates real-life recordings into a virtual environment. For this purpose, a virtual opera house was created using photogrammetry techniques. In a 2 × 2 within-subjects design, we measured the impact of video dimensionality (2D vs. 3D videos) and audio rendering (stereophonic vs. spatial) on key dimensions of the user experience. We found no significant differences between the 2D and 3D videos in their effects on most dimensions of users’ subjective experience, although a medium effect size was observed for Enjoyment. Because this effect did not reach statistical significance, it needs to be replicated in further studies. Overall, the results did not support H1a–e, which predicted that stereoscopic visual features significantly improve key dimensions of user experience. The video dimensionality of the integrated recordings thus had no significant effect on user experience in the AV stage event setting, in which users were seated seven meters from the embedded video recordings. This result is supported by the analysis of the qualitative data.
It is conceivable that, because of the relatively large distance to the integrated videos in the VE (seven meters), which is a realistic seating position within an opera hall, parallax cues were too weak for stereoscopy to take effect, even though we applied a hyper-stereoscopic condition that should have produced stronger depth perception than natural human stereoscopic vision. Another reason why we did not find an effect of video dimensionality on presence may be the already high immersion of the VE experienced through the HMD. The VE used in our study was created by photogrammetry, giving participants the feeling of being in a real opera house. This could explain why our results differ from those of previous studies, which were conducted in front of TV screens [24, 25, 26, 27]. Within the VE, the dimensionality of integrated real-life video recordings does not appear to be as influential as it is on TV screens. This is in line with the results of [31], who did not find a significant difference between monoscopic and stereoscopic visualization in the feeling of presence within VEs.
To answer H2a–e, we compared stereophonic audio (A_ST) with spatial audio (A_SP). In scenarios with A_SP, participants could experience spatial hearing as the sound adapted to their (head) movements. Our results showed that spatial sound resulted in higher Place Illusion than stereophonic sound alone. The partial eta squared indicated a large effect of spatial sound on Place Illusion. This supported hypothesis H2b, demonstrating the theoretically derived importance of spatial audio for creating a sense of “being there” in the virtual opera house. It is also consistent with theoretical predictions that binaural rendering mirrors real-world auditory experiences and that sound should match the three-dimensional characteristics of the virtual environment [50]. The qualitative data further emphasized the importance of spatial audio for participants’ user experience. The remaining hypotheses regarding spatial sound were, however, not supported by the data, as no other significant differences between stereophonic and spatial sound were found in our study. Our results therefore empirically underline the theoretical considerations of [22], indicating that Place Illusion is directly affected by immersive features such as spatial audio, while overall Presence, as a higher-order construct, is affected by additional subdimensions such as social presence illusion and plausibility illusion. This probably lowers the direct effect of a single predictor, such as immersion, on overall Presence. Accordingly, it is unsurprising that the impact of a single immersive feature on even more distal and broader constructs like Enjoyment or QoE was also not significant.
Comparing the main effects of video dimensionality and audio rendering on user experience in an AV application, the effect of audio rendering exceeded the effect of video dimensionality regarding Place Illusion.
The analysis of the interviews suggested that personal preferences are a key factor in how users differ in their perceptions of content consumption in VE. The overall picture that emerges is that the design choice of appropriate audio and video features should be based on both context and requirements, as needs may vary. Overall, the participants’ responses clearly indicate a demand for high video quality to ensure an optimal and immersive opera experience in the AV application. In the use case of opera, where both musical performance and performers’ acting contribute to the experience, high-resolution recordings may be more relevant than stereoscopic video features. Due to the distance to the stage from a typical seating position within an opera concert hall, visual details of performers appear relatively small, reducing the perceptual salience of 3D video effects. In contrast, the results show that spatial audio cues are perceptually effective even at large viewing distances, as they dynamically respond to head movements and provide directional information that scales with the listener’s position. This suggests that, in distributed AV applications with fixed seating, auditory spatialization has a greater impact on Place Illusion than visual dimensionality, emphasizing the comparatively stronger contribution of spatial audio in this context.
4.1. Design Recommendations
Our findings have implications for the resource-efficient design of AV applications in cultural settings. It is important to emphasize that these recommendations apply specifically to the opera use case examined in this study, as well as to applications with a similar technical viewing geometry. The integration of real-world content into virtual environments must be tailored to the specific use case and user requirements of each cultural application. Different performing arts contexts, such as music performances, dance productions, or experimental theater, may place different demands on video and audio features, and audience needs and expectations may vary accordingly [
21].
The lack of significant effects for stereoscopic video suggests that the considerable effort and resources required for 3D video production and post-processing may not be justified in this context. High-quality 2D recordings appear to be sufficient for delivering an immersive opera experience in AV, particularly when combined with a highly immersive virtual environment. This was also emphasized by the qualitative analysis, in which participants expressed a greater demand for video quality than for stereoscopic effects. However, these findings contradict those of a meta-analysis [24], which reported a medium-sized effect of stereoscopy on presence. We assume that the meta-analytic results cannot be transferred to the specific context of cultural AV applications, as the majority of the studies included in the meta-analysis used large-screen displays, projection systems, or desktop monitors rather than HMDs.
Therefore, several practical advantages can be derived from our findings: First, 2D production is substantially less resource-intensive in terms of recording equipment, post-production workflow, and computational requirements for real-time rendering, which offers even smaller cultural institutions with little budget the chance to benefit from this technology solution. Second, and perhaps most importantly, 2D video enables live streaming capabilities, allowing for real-time transmission of performances into the virtual event venues. This would be significantly more challenging with stereoscopic 3D content due to the necessary data rates. Third, this approach enables a sustainable production model where resources can be invested once in creating a high-quality, interactive virtual environment (e.g., through photogrammetry of the venue), which can then be reused for multiple performing arts productions over time.
For audio content, our findings present a more nuanced picture. From a practical significance perspective, spatial audio demonstrated a large effect on Place Illusion (partial η² = 0.329), representing a substantial, user-perceptible enhancement. However, it did not significantly affect other key dimensions of user experience, such as Presence, Enjoyment, or QoE. This suggests that while spatial audio contributes to an immersive experience, its impact may be more specific than previously assumed. From a resource allocation perspective, we recommend implementing spatial audio for the performance content, as it was shown to enhance Place Illusion, a core component of a virtual experience. The investment in spatial audio rendering appears justified given this positive effect, particularly as spatial audio can serve several purposes within the application at once: enhancing the performance content, supporting social features in multi-user environments through spatially accurate voice communication, and strengthening the impression of being in an actual event location through characteristic soundscapes. This multi-purpose use maximizes the return on development investment.
In summary, our findings suggest that for opera AV applications, resources should be prioritized as follows: (i) invest in creating a high-quality, reusable VE (e.g., through photogrammetry), (ii) implement high-resolution monoscopic video recording with live streaming capability, and (iii) incorporate spatial audio rendering to enhance Place Illusion and, where applicable, to support social interaction among multiple users, thereby enhancing both cultural and social participation.
Beyond the audio and video features focused on in this paper, there are, of course, other factors that can influence and enhance the experience of a virtual cultural application. As the qualitative analysis (cf.
Section 3.2) and the literature [
4,
67] show, these include opportunities for interaction with the VE as well as social interaction and embodiment through avatars, and, last but not least, the control of navigation. We therefore strongly recommend adopting a user-centered design approach [
34] when developing AV applications for cultural contexts. Through iterative user research methods such as interviews, focus groups, and usability testing with target audiences, developers can identify the specific requirements and preferences that should guide design decisions. This approach ensures that technical choices regarding media integration are grounded in empirical evidence about what matters most to users in a given context, rather than being solely technology-driven.
4.2. Limitations and Future Research
Our study has some limitations that should be considered when interpreting the results and planning future research on the topic. We examined the use of AV in the context of cultural stage events by integrating real-world opera recordings into a virtual environment. The results therefore may not generalize to other formats, such as virtual concerts, where users stand and/or dance. Future research may investigate how different audio and video features affect user experience in AV settings where participants can move freely and thus change their position relative to the visual and auditory cues. In these interactive contexts, it may be useful to additionally integrate objective measures, such as position tracking and interaction patterns, to complement self-report assessments. We also did not integrate social interactivity or an environmental soundscape. This may have influenced the results, as these aspects would generate new sound sources at different positions within the virtual room, which in turn could further increase the impact of spatial audio [
48]. Further research should therefore use multi-user social AV applications.
We used hyper-stereoscopic videos with a stereo base of 12 cm to compensate for the relatively large distance of seven meters to the stage, and thus to the integrated video recordings. The ANOVA showed that these hyper-stereoscopic videos did not influence motion sickness. The observed effects of 2D vs. 3D videos may nonetheless have been smaller than in comparable studies, where stereoscopic video content filled the entire field of view [25, 26, 27]. In addition, the qualitative interview data indicated that the quality of the embedded videos, which had a resolution of 1920 × 1080, was not satisfactory in either the monoscopic or the stereoscopic condition when viewed on the Meta Quest 2. We chose these technical specifications because they represent a realistic implementation scenario for the use case, reflecting the hardware capabilities commonly available for private home use at the time of the study. In this context, it is noteworthy that while the video format showed no significant main effect on Enjoyment (F(1, 29) = 3.365, p = 0.081, partial η² = 0.101), the medium effect size observed in this exploratory comparison suggests a potential influence of 3D video that may become detectable with more advanced HMDs or larger sample sizes. To replicate and validate the effects of video dimensionality and audio rendering found in this study, more advanced HMDs with better lenses and correspondingly higher image and video quality should therefore be used. Future research could investigate the effects of higher-resolution recordings to determine whether improved video quality reveals effects that were not detected under the current technical constraints. When using the Meta Quest 2 with the external over-ear headphones, some participants reported a sense of confinement. The hardware may therefore have caused some discomfort. For future studies, we plan to use HMDs with superior sound integration and reduced weight.
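The role of viewing distance and stereo base can be made concrete with simple geometry: two viewpoints separated by a baseline b see a point at distance d under the angle 2 · atan(b / (2d)), and the depth cue between two objects is the difference of these angles. The Python sketch below compares a natural baseline with the 12 cm stereo base at the seven-meter seating distance. It is a rough illustration under stated assumptions: the interpupillary distance of 6.3 cm, the one-meter depth gap, and the function names are hypothetical, and the recording geometry is treated as identical to the viewing geometry.

```python
import math


def vergence_angle_deg(baseline_m, distance_m):
    """Angle subtended at a point by two viewpoints separated by baseline_m."""
    return math.degrees(2 * math.atan(baseline_m / (2 * distance_m)))


def relative_disparity_arcmin(baseline_m, near_m, far_m):
    """Difference in vergence angle between two depths, in arcminutes."""
    return 60 * (vergence_angle_deg(baseline_m, near_m)
                 - vergence_angle_deg(baseline_m, far_m))


STAGE_DISTANCE_M = 7.0   # seating distance reported in the study
STEREO_BASE_M = 0.12     # hyper-stereoscopic stereo base reported in the study
TYPICAL_IPD_M = 0.063    # assumed typical adult interpupillary distance
DEPTH_STEP_M = 1.0       # hypothetical depth gap between two performers

for label, base in (("natural", TYPICAL_IPD_M), ("hyper-stereo", STEREO_BASE_M)):
    disparity = relative_disparity_arcmin(
        base, STAGE_DISTANCE_M, STAGE_DISTANCE_M + DEPTH_STEP_M
    )
    print(f"{label}: {disparity:.1f} arcmin")
```

At such distances the disparity scales almost linearly with the baseline, so the wider stereo base roughly doubles the cue but leaves it at a few arcminutes.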
With regard to the measurement of constructs, single items were used for some constructs in the present study. Single items may have reduced content validity when used to represent a multidimensional construct [68]. We therefore used single items that queried the overall impression rather than specific indicators of the respective constructs. In a within-subjects design with repeated measurement, this approach avoids several drawbacks of multi-item scales, e.g., it reduces the effort and time participants need to answer the items [68]. However, finer-grained assessment might be useful to detect small effects and should be used in future research when the trade-off with data minimization is deemed acceptable.
Finally, although participants with special needs are expected to benefit from the accessibility afforded by AV opera performances, the current study did not include this group in its sample. Future research should systematically include this user group to understand its specific requirements and to ensure that the potential benefits of the technology can be fully realized.