Uses of Metaverse Recordings in Multimedia Information Retrieval

Steinert, Patrick; Wagenpfeil, Stefan; Frommholz, Ingo; Hemmje, Matthias L.

doi:10.3390/multimedia1010002

¹ Faculty of Mathematics and Computer Science, University of Hagen, Universitätsstrasse 1, D-58097 Hagen, Germany

² Faculty of Business Computing and Software Engineering, PFH University of Applied Science, D-37073 Goettingen, Germany

³ School of Engineering and Computer Science, Bern University of Applied Sciences, Falkenplatz 24, 3021 Bern, Switzerland

* Author to whom correspondence should be addressed.

Multimedia 2025, 1(1), 2; https://doi.org/10.3390/multimedia1010002

Submission received: 29 April 2025 / Revised: 16 July 2025 / Accepted: 27 July 2025 / Published: 10 August 2025

Abstract

Metaverse Recordings (MVRs), screen recordings of user experiences in virtual environments, represent a mostly underexplored field. This article addresses the integration of MVR and Multimedia Information Retrieval (MMIR). Unlike conventional media, MVRs can include additional streams of structured data, such as Scene Raw Data (SRD) and Peripheral Data (PD), which capture graphical rendering states and user interactions. We explore the technical facets of recordings in the Metaverse, detailing diverse methodologies and their implications for MVR-specific Multimedia Information Retrieval. Our discussion not only highlights the unique opportunities of MVR content analysis, but also examines the challenges they pose to conventional MMIR paradigms. It addresses the key challenges around the semantic gap in existing content analysis tools when applied to MVRs and the high computational cost and limited recall of video-based feature extraction. We present a model for MVR structure, a prototype recording system, and an evaluation framework to assess retrieval performance. We collected a set of 111 MVRs to study and evaluate the intricacies. Our findings show that SRD and PD provide significant, low-cost contributions to retrieval accuracy and scalability, and support the case for integrating structured interaction data into future MMIR architectures.

Keywords:

multimedia; content analysis; Metaverse Recordings; Multimedia Information Retrieval; time-series content analysis

1. Introduction

Since the invention of smartphones, the growth rate of multimedia generation has been enormously high. Digital cameras are ubiquitous [1], and social networks have led to vast media generation [2,3,4,5]; in recent years, the amount of short-form video content has increased significantly [6]. Furthermore, the COVID crisis has significantly accelerated the adoption of remote communication and collaboration technologies, exemplified by the surge in video and virtual conferencing usage [7]. Another trend that has emerged in recent years is the idea of an everlasting virtual space, where people meet and live together: the Metaverse [8]. Following the definition by Mystakidis, the Metaverse is a space of multiple connected, perpetual, and immersive worlds for people [8]. The increasing usage of Metaverse platforms [9,10] like Roblox [11] or Minecraft [12] shows that people are heavily using virtual worlds, even beyond gaming [13]. Trend reports predict even higher usage in the future [14,15]. Therefore, it is likely that people will create recordings of their experiences in virtual worlds, similar to how they do so in the real world. Early versions of such videos can be seen on YouTube [16] for entertainment purposes. As many games span virtual worlds, e-sports contribute a significant amount of recordings, live-streams, and screencasts. We call such recordings Metaverse Recordings (MVRs) [17], i.e., screen recordings of virtual world experiences. In addition to personal and entertainment use, Virtual Reality (VR) training [18] is a relevant application area. Furthermore, with the rise of the Metaverse, cybercrime, such as drug trafficking or child grooming, has been recognized as a risk by Interpol [19], and such violations require recordings for documentation purposes. Similar to evidence being captured in photos and videos in the real world, MVRs can be used in the Metaverse context. Finally, the industrial Metaverse [20] leverages simulations of real-world scenarios, e.g., simulations of road driving situations as input for autonomous driving training [21] or complete simulations of factories and their production processes. Such simulations generate single images or a series of continuous images from different angles or perspectives, resulting in thousands of images for a single scene. In a field study [22], several existing application scenarios for creating, and later searching, MVRs were revealed.

Multimedia Information Retrieval (MMIR) [23] is a research area in computer science that particularly addresses the indexing and retrieval of Multimedia Content Objects (MMCOs), i.e., audio, video, text, or images. As the Metaverse is based on virtual worlds that are computer-generated multimedia scenes, we examine the integration of MVRs in MMIR and its differences from classical MMIR [17,24], particularly focusing on the technical integration of MVRs.

Compared to other types of media, user sessions in the Metaverse can be recorded inside the technical space of a computer rendering a virtual world. Technically, Metaverse virtual worlds are based on Massively Multiplayer Online games [25], which use 3D graphics technology. This allows for the recording of much more than just the perceivable video/audio stream as an output of the rendering process: the rendering input, referred to as Scene Raw Data (SRD), and additional data from peripheral devices, referred to as Peripheral Data (PD), e.g., hand controller movement data or biosensor data. Figure 1 describes a simplified version of such a recording. While the rendering process renders each frame for the video stream, SRD can also be captured, e.g., if the position of an object in the scene changes or if a user input was processed. The information captured using this procedure can be used for indexing and querying MVRs. For example, a VR trainer can search for MVRs in which the person being trained showed hectic activity or encountered another avatar. However, despite the richness of data, practical usage remains limited: a VR police training supervisor scrolling through 3 TB of headset video still cannot easily locate the trainee’s high-stress moments. This highlights an essential gap in the literature: existing multimedia content analysis methods are insufficiently adapted to MVRs, as they mainly target conventional audiovisual streams and do not include SRD and PD. Without incorporating time-mapped SRD and PD, current multimedia retrieval systems are unable to efficiently index or query the full scope of information contained within MVRs.

In this article, we present research that addresses the question of how non-video/audio data, in particular SRD and PD, can be used in MVR-specific retrieval. We formulate the following research question: Can SRD and PD support MVR content analysis? To answer this research question, we employ the research framework of Nunamaker et al. [27] in the remainder of this article. The framework connects the four research disciplines, observation, theory building, systems development, and experimentation, to answer research questions. In prior work [26], the 111 Metaverse Recordings dataset was introduced to support research to generate images based on scene graphs. However, the methods used to generate and annotate this dataset, as well as the supporting models and software, have not yet been described in detail. In this article, we present the full pipeline used to construct the dataset to address the aforementioned research question.

The following sections present our observation results as an overview of the Metaverse and related technologies, the differences of MVRs, and the implications for MMIR in Section 2. In Section 3, we examine the structure of recordings and model a recording process, which is our theory-building process. The recording is demonstrated based on a prototypical implementation (systems development) described in Section 4. Section 5 presents our experiments, which include an evaluation of the recording. Finally, Section 6 summarizes the work presented and discusses future work.

2. State of the Art and Related Work

This section presents our observations on Metaverse-based multimedia, its differences with other multimedia, options to produce MVRs, and ways to analyze MVR content. In addition, it describes the current state of the art in related approaches to image generation.

2.1. Metaverse Recordings

The Metaverse has become a platform for a wide range of applications, as highlighted by recent studies [10].

Within this evolving environment, recent research [17] has identified four main domains in which MVRs are actively used. These are personal media collections, where users create digital memories within the Metaverse; the entertainment domain, where users create and share MVRs [16]; the education and training domain, where MVRs are used to enhance learning and skills development in VR-based settings [18]; and, finally, the domain of scientific research, where MVRs provide valuable data for studying behavior and other phenomena.

These diverse application areas highlight the growing importance of developing MMIR solutions tailored to the unique characteristics of MVRs. Unlike conventional multimedia content, MVRs contain rich contextual and experiential data, requiring specialized handling and retrieval mechanisms. A field study [22] identified application scenarios in the domains of education, video conferencing/collaboration, entertainment/video games, the industrial Metaverse, and personal use. They are partly located in the domains described above, but also in other domains.

These scenarios reveal a growing interest in additional forms of recorded information, such as PD from fitness tracking or eye movement data, which can support retrieval tasks like similarity search. The following section will examine the ways in which MVRs differ from other types of media.

However, the extensive detail captured in MVRs introduces serious privacy issues [28,29]. SRD may include interactions with avatars or virtual objects that are sensitive, while PD can contain biometric information such as heart rate, facial expressions, or movement patterns. If not properly managed, this data can be used for identity tracking or behavioral profiling without the user’s consent [28]. Therefore, the implementation of ethical frameworks and protective technologies is essential. This includes methods like data anonymization, secure storage, and user-defined access controls. Although these concerns are acknowledged, they fall outside the scope of this work and are suggested for future research.

2.2. Differences in MVRs Compared to Real-World Capturing Media

MVRs are created within virtual worlds, which can mirror or completely redefine reality using computer-generated imagery, leading to differences from traditional multimedia [17]. MVRs are unique in terms of content, structure, and format [17]. Unlike multimedia that captures real-world scenes, MVRs document artificial environments with different physical laws, appearances, and behaviors. Real-world assumptions often do not apply to virtual contexts, making it unclear whether real-world solutions are transferable.

As discussed in [17], MVR content consists of actions and events similar to movies or news shows, but represents virtual concepts, which can vary in realism and abstraction. Avatars, a fundamental element of virtual worlds [30], can take on diverse shapes, not necessarily human-like [8]. Thus, MVR concepts differ significantly from real-world videos.

While traditional videos have clear structures, e.g., scenes in a movie, MVRs lack such predefined sequences and capture continuous random actions and events. This makes their temporal structure unique and typically undefined.

Lastly, MVRs record digitally rendered scenes, unlike real-world videos, allowing for logging information like object positions [26]. These distinctions in content, structure, and format make MVRs a unique type of multimedia.

2.3. Approaches to Metaverse Session Recording

The range of available recording formats for MVRs—including Multimedia Content Objects (MMCOs), SRD, PD, or their combinations [17]—raises an important question about which formats are most suitable for specific applications. This section discusses the strengths, limitations, and potential uses of each group of formats.

The three main categories of MVR recording formats—MMCOs, SRD, and PD—have been examined in prior work [26]. Each offers distinct characteristics and use potential. The first group, MMCOs, includes audio-visual recordings, such as videos captured directly within Metaverse platforms, e.g., Roblox, either through built-in functions or screen-recording tools. These recordings are easy to generate and replay, but require significant storage and offer limited semantic insight. The second group, SRD, captures scene composition data used in rendering, such as the Rendering Scene Graph (RSG) [31] or network-transmitted player inputs [32]. These recordings contain structured information that supports semantic interpretation, but are technically more demanding to record and reconstruct. The third group, PD, involves peripheral sensor data, such as biometric signals or controller input, captured during or alongside the virtual session. This data provides valuable behavioral and emotional context, but is difficult to collect and synchronize without specialized infrastructure.

The preceding options present a number of advantages, disadvantages, and opportunities, which are summarized in Table 1. Regarding data storage, the recorded content can be saved on the user’s device, at the service provider’s end, or at an intermediary location. Storing high-quality renderings from all users presents a significant data challenge. However, considering the need to observe and analyze Metaverse interactions, providers are likely to adopt strategies for efficient data storage and utilization.

At present, there is no definitive answer as to which recording format is the most effective. Audio-visual recordings are widely used due to their ease of playback and compatibility with existing media tools. However, they offer limited support for detailed content analysis, as semantic information must be manually or computationally extracted [33,34,35]. In contrast, the technical structure of the Metaverse enables the use of system-generated data, such as event logs or controller-based interaction records, which can provide direct insights into user activity without requiring complex video interpretation.

Each recording type offers distinct advantages depending on the context of use. Therefore, we propose a model that integrates multiple data formats, as illustrated in Figure 2. When different types of data are combined, the result is referred to as a Composite Multimedia Content Object (CMMCO) [17,24]. If these data sources share a common temporal reference—such as a synchronized timestamp or frame index—they form a Time-mapped CMMCO (TCMMCO) [17,24].
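To make the idea of a time-mapped composite object more concrete, the following minimal Python sketch models MMCO, SRD, and PD as tracks that share one timestamp axis. The class names, field names, and sample values are illustrative assumptions and are not taken from the models in [17,24].

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TrackEntry:
    t: float      # seconds since recording start (assumed shared clock)
    payload: Any  # frame reference, event dictionary, sensor value, ...


@dataclass
class Track:
    kind: str     # "MMCO", "SRD", or "PD"
    name: str
    entries: list = field(default_factory=list)


@dataclass
class TimeMappedRecording:
    """Hypothetical container: several tracks on one common time axis."""
    tracks: list = field(default_factory=list)

    def window(self, start: float, end: float) -> dict:
        """Collect the payloads of every track that fall into [start, end)."""
        result = {}
        for track in self.tracks:
            hits = [e.payload for e in track.entries if start <= e.t < end]
            if hits:
                result[track.name] = hits
        return result


# Invented sample data for illustration only.
mvr = TimeMappedRecording(tracks=[
    Track("MMCO", "video", [TrackEntry(0.0, "frame_0000"), TrackEntry(5.0, "frame_0150")]),
    Track("SRD", "netcode", [TrackEntry(4.8, {"event": "avatar_joined"})]),
    Track("PD", "heart_rate", [TrackEntry(5.1, 128)]),
])
around_join = mvr.window(4.5, 5.5)
```

Because all tracks share the same temporal reference, a single time window returns the video frame, the netcode event, and the sensor value that belong together.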

No industry standard is known that describes how to record user sessions with MMCO, SRD, and PD. There is also no standardized rendering process with MMCO, SRD, and PD, but there are some standards to be considered for rendering engines, e.g., Vulkan [36] (video/audio stream) or IEEE 802.15-based protocols [37] (controller input). Also, some tools exist to intercept rendering processes, primarily for debugging purposes. There are mechanisms to record a virtual world beyond a simple screen capture, e.g., Nvidia Ansel [38]. However, none of these standards aim to provide a way to record semantic content information.

2.4. Multimedia Information Retrieval

Multimedia is the combination of any media formats [39]. In the case of MMIR, it refers to the different types of media that can be retrieved using an MMIR system. The types can be any perceivable media, such as images, audio, or biometric sensor data. The general model for MMIR includes indexing, querying, and result presentation [40]. For indexing, the features of an MMCO are extracted, usually by content analysis, also known as feature extraction [40]. The query interface lets users formulate their search query, and the result presentation displays the matching results.

MMIR is driven by the information need of a searching user, which could be finding MVRs containing a certain person. While the presence of a person could be identified in the content analysis of the MMCO, the presence could also be logged in SRD as a digital identity. Thus, MVR-specific MMIR can benefit from SRD and PD.

The growing amount of multimedia is a challenge for MMIR in terms of the integration of object types and analysis methods, scalability, and semantic integration [41]. These challenges have been addressed by introducing the Generic Multimedia Analysis Framework (GMAF) [41] and the Multimedia Feature Graph (MMFG) [41]. GMAF is an extensible and integrable framework for content analysis and MMIR. MMFGs are directed graphs, where the nodes and edges represent special features and their relationships.

In order for MMIR to support the emerging multimedia type, MVR, it is necessary that the content analysis be capable of processing MVRs. The current status of the content analysis is presented below.

2.5. Multimedia Content Analysis

In MMIR, multimedia content analysis is performed to understand the content of the MMCO to index it and respond to queries later. Techniques can extract low-level features, such as color or format, but they result in a lack of a semantic understanding of the content, known as the semantic gap [40]. Computer vision, such as object recognition, is used to create a semantic understanding and bridge the semantic gap. Computer vision is a well-researched field with many tools. Some experiments have evaluated the effectiveness of existing tools, such as the You Only Look Once (YOLO) algorithm [42], which are inefficient for MVR-specific content analysis because of the differences around the concepts in the videos [17]. However, we tried to adapt algorithms, using transfer learning, to also recognize Metaverse-specific objects such as avatars. An object recognition system for avatars [30] was developed based on YOLO [42]. Despite the advances in computer vision, automated content analysis is limited in robustness [43], particularly if data and context change, which is the case for Metaverse virtual worlds.

Multimedia content analysis utilizes computer vision to understand image content. Additional data, such as metadata or localization data [44], also contributes to understanding and indexing. Metadata, such as Exchangeable Image File Format (EXIF) data, is used to store the camera model and settings. Furthermore, MPEG-7 [45] provides a format for storing additional data about MMCOs, such as shapes. Another useful source of additional data is subtitles or transcripts, which provide a textual description of the audio, particularly dialog. A different example is data originating from fitness trackers, which is researched in the context of lifelog retrieval [46]. For example, ref. [47] incorporates wearable technology, such as Electroencephalography (EEG) devices, in this process. Another example is measuring the quality of experience in multimedia applications [48] using heart rate or EEG.

In conclusion, peripheral data is a relevant source of information for retrieval and should be included in the content analysis.

2.6. Summary

Overall, several approaches and tools exist for MMIR. Since MVRs are a novel type of multimedia, formats and optimal ways to record MVRs are unknown, yet they could provide advantages for MVR-specific MMIR. Content analysis of MVRs is unexplored, and its impact on effectiveness and efficiency is yet to be studied. Furthermore, the production of human-visible media from SRD, specifically the RSG, is unexplored. Hence, the research questions remain open challenges. For these open challenges, we present our modeling work in the next section.

3. Modeling and Design

This section presents our modeling work, which follows the User-Centered System Design (UCSD) [49] and employs the Unified Modeling Language (UML) [50]. Basically, MMIR can be separated into indexing and retrieval. As shown in Figure 3, in this paper, we focus on the use cases of MVR content analysis as part of the indexing process.

Our assumption is that SRD and PD can support MVR content analysis. To validate this, we first address the content analysis of MVRs and model a system to record MVRs. Based on the research in lifelog retrieval and video retrieval, it can be assumed that activities play a distinct role in retrieval. For example, the search for specific activities is relevant for VR training and cybersecurity. Hence, further use cases are focused on the identification of activities, i.e., activity analysis.

3.1. Content Analysis of Metaverse Recordings

With regard to the research question of this article, three hypotheses can be defined: First, Feature Extraction (FE) of SRD and/or PD can extract features that cannot be extracted by FE of MMCO. Second, because MMCO is limited to the visual and audio modalities, the results of FE for MMCO should improve with the contribution of an FE of SRD and PD, or FE of MMCO could even be replaced by FE of SRD and PD if these deliver enough features to fulfill user queries. Third, FE of SRD and PD requires substantially less computing power. Each hypothesis is addressed in the following paragraphs.

Content analysis is an important initial step in MMIR to build a semantic understanding to process user queries effectively [51]. It extracts the features of the content, in our case MVRs, and integrates them with an index with which user queries are matched. For MVR retrieval, content analysis can analyze all elements of an MVR. MVRs can consist of the elements MMCO, SRD, and PD.

An exemplary MVR can therefore contain an MMCO, which itself can consist of video and audio, while audio can contain multiple tracks, such as the game audio and the user’s microphone. SRD can comprise many elements of different types, e.g., an RSG, 3D models, or network data transmitted between clients, called netcode [32]. PD can include actor data, such as controller inputs, and sensor data, e.g., heart rate (HR). All listed elements can contain relevant information, which can be extracted in the content analysis.

All elements of this example MVR can be modeled according to the well-established Strata model [52] as tracks in a recording, visualized in Figure 4. Different actions and events in the MVR are reflected in several tracks. For example, if another avatar joins the world, this is visible in the video and is also contained in the exchanged data, i.e., the SRD netcode. Another example is a boxing activity of a user, which is visible as a hand movement in the video, but also as movements recorded from the controller and as a high HR on the HR sensor.

Let $F$ be the universe of all features that a feature extraction pipeline can emit from any MVR. A standard reference set $R_{standard} \subseteq F$ contains those features whose semantics are agreed upon during system design. Each element $r \in R_{standard}$ is a search factor and is represented in the index without additional learning.

The content analysis aims to extract all relevant information from the MVR. An MVR contains an ordered set of features $E = \{e_{1}, \dots, e_{n}\}$. If any available feature can be detected using a feature extraction pipeline, we call this Optimal MMCO Feature Extraction (FE). Optimal MMCO FE would result in $E_{Max} \subseteq R_{standard}$, $FE_{MMCO}(MVR) = E_{Max}$, but achieving optimal FE is often difficult. Some difficulties are evident in the form of erroneous predictions, such as ghost predictions or mispredicted classes, undetected objects because of errors, or undetected objects of unknown classes. Typically, only a subset of $E_{Max}$ can be extracted, resulting in $FE_{MMCO}(MVR) = E_{Real}$, where $E_{Real}$ are the realistically extractable features by the method and a proper subset of $E_{Max}$, $E_{Real} \subsetneq E_{Max}$.

With regard to the hypothesis of this article, the modeling work explains that SRD and PD contain relevant additional data for the content analysis. Hence, FE of SRD and PD can support MVR content analysis.

Even if limited in robustness (see Section 2.5), $FE_{MMCO}(MVR)$ can deliver a significant set of relevant features $R_{standard}$. In theory, $FE_{SRD}(MVR)$, or $FE_{SRD,PD}(MVR)$, is able to deliver relevant features $R_{standard}$ if SRD contains the information. When a standardized format for SRD and PD is missing, the data can be captured as structured or unstructured data. Feature extraction for SRD, $FE_{SRD}(MVR)$, or SRD in combination with PD, $FE_{SRD,PD}(MVR)$, is unknown. In a simple example, SRD and PD are a structured data object, such as a JSON object, containing the recorded data. Another example could be a log file that contains SRD and PD. An $FE_{SRD,PD}(MVR)$ could be a simple filter algorithm, searching for relevant log entries and reading the data.
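The filter idea can be sketched in a few lines of Python. The sketch assumes a log with one JSON object per line; the field names (`t`, `source`, `event`) and the sample entries are invented for illustration and do not describe the log format of the prototype.

```python
import json

# Hypothetical log excerpt: one JSON object per line, mixing SRD and PD entries.
LOG = """\
{"t": 1.2, "source": "SRD", "event": "avatar_joined", "avatar": "guest_17"}
{"t": 1.9, "source": "PD", "event": "button_pressed", "device": "right_controller"}
{"t": 2.4, "source": "PD", "event": "heart_rate", "bpm": 141}
not a json line
{"t": 7.0, "source": "SRD", "event": "object_moved", "object": "door_3"}
"""


def extract_features(log_text, wanted_events):
    """Return (timestamp, event) pairs for log entries whose event is wanted."""
    features = []
    for line in log_text.splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured log entries
        if isinstance(entry, dict) and entry.get("event") in wanted_events:
            features.append((entry["t"], entry["event"]))
    return features


features = extract_features(LOG, {"avatar_joined", "heart_rate"})
```

Such a filter only parses and compares text, which illustrates why this kind of extraction is expected to be far cheaper than frame-by-frame video analysis.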

Another form of support is the enhancement of $FE_{MMCO}(MVR)$ by $FE_{SRD}(MVR)$. For example, detecting a boxing gesture based on images is difficult. However, if the visual data is combined with HR and hand controller movement data, the effectiveness of such a detection can be increased.

$$\left| \{E \mid E = FE_{MMCO}(MVR)\} \cup \{E \mid E = FE_{SRD,PD}(MVR)\} \right| > \left| \{E \mid E = FE_{MMCO}(MVR)\} \right| \tag{1}$$

Overall, the theoretical target quality $Q$ of the FE of MVR, which extracts exactly all of the desired features $R_{standard}$, represents the fraction of desired features that are correctly detected, with 1 as optimum. The difference between optimum precision 1 and the actual precision can be defined as the overall error $oe$, $oe = Q - P$, where $P$ denotes the actual precision. Since the calculation of $oe$ over any possible $E$ with any thinkable class in an MVR is unknown, we remove the unknown classes from the equation. Hence, $Q$ and $oe$ can be described with precision and recall or can be combined as the F1 score. This provides strong metrics to compare different FE approaches.
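As a worked illustration of these metrics, the following Python sketch computes precision, recall, and F1 for feature sets. The feature labels follow the example events of Figure 4, while the extraction outcomes (including the ghost prediction) are invented for demonstration and are not measured results.

```python
def precision_recall_f1(extracted, reference):
    """Compare a set of extracted features against the desired reference set."""
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Desired features (labels taken from the Figure 4 example).
reference = {"boxing", "avatar_joins", "avatar_gesture", "avatar_talking"}

# Invented outcomes: the video-based extraction misses two features and
# emits one ghost prediction; the SRD/PD-based extraction finds all four.
fe_mmco = {"avatar_gesture", "avatar_talking", "ghost_object"}
fe_srd_pd = {"avatar_joins", "avatar_gesture", "avatar_talking", "boxing"}

p_mmco, r_mmco, f1_mmco = precision_recall_f1(fe_mmco, reference)
p_all, r_all, f1_all = precision_recall_f1(fe_mmco | fe_srd_pd, reference)
```

In this toy case, the union raises recall from 0.5 to 1.0, while the ghost prediction from the video-based extraction keeps precision below 1.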

Given the recorded features Boxing $E_{B}$, Avatar joins $E_{AJ}$, Avatar gesture $E_{AG}$, and Avatar Talking $E_{AT}$ from Figure 4, the feature $E_{AJ}$ can be detected by $FE_{MMCO}$ only if the join happened in the field of view of the recording. But it can be detected by $FE_{SRD}$ with high confidence. Similarly, $E_{AG}$ can be recognized by $FE_{MMCO}$. In case of $E_{B}$, if reduced to high heart rate activity, neither $FE_{MMCO}$ nor $FE_{SRD}$ can extract a high heart rate, but $FE_{PD}$ can. However, a high heart rate alone does not indicate boxing, but a combination of $FE$ methods can extract features, which can be analyzed or fused to produce higher quality features. We assume that SRD contains more features than MMCO, since MMCO contains the same concepts, but only the ones in the recorded field of view. PD delivers only additional features and contributes to the other FE, but alone achieves a lower recall value. Hence, the ranking of information is assumed to be

$$\begin{aligned} Recall_{MMCO,SRD,PD} &\geq Recall_{SRD,PD} \\ &\geq Recall_{SRD} \\ &\geq Recall_{MMCO,PD} \\ &\geq Recall_{MMCO} \\ &\geq Recall_{PD}. \end{aligned} \tag{2}$$

After all, this theoretical model illustrates a simplified formalization of the FE process. In real scenarios, FE methods do not recognize features deterministically, but provide, for example, ambiguous results or results with low recognition reliability. Further processing, such as multimodal feature fusion [53] or fuzzy logic [54], could be applied to improve the overall result, which is left to future work.

During querying, a user statement $q$ is mapped by the query interface to a subset of factors:

$$\phi(q) \subseteq R_{standard}. \tag{3}$$

Retrieval, therefore, becomes a problem of finding every recording whose extracted feature multiset contains $\phi(q)$.
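The containment check can be illustrated with a small Python sketch. The keyword table standing in for $\phi$, the recording identifiers, and the index contents are hypothetical; a real query interface would map user statements to search factors in a more elaborate way.

```python
from collections import Counter

# Search factors agreed upon at design time (labels from the Figure 4 example).
R_STANDARD = {"avatar_joins", "avatar_gesture", "avatar_talking", "boxing"}

# Hypothetical keyword table standing in for the mapping phi.
KEYWORDS = {
    "join": "avatar_joins",
    "gesture": "avatar_gesture",
    "talk": "avatar_talking",
    "boxing": "boxing",
}

# Invented index: one feature multiset per recording.
INDEX = {
    "mvr_001": Counter({"avatar_joins": 2, "avatar_talking": 1}),
    "mvr_002": Counter({"boxing": 3}),
    "mvr_003": Counter({"boxing": 1, "avatar_joins": 1}),
}


def phi(query):
    """Map a user statement to a subset of R_STANDARD via keyword prefixes."""
    words = query.lower().split()
    return {
        factor
        for keyword, factor in KEYWORDS.items()
        if any(word.startswith(keyword) for word in words)
    }


def retrieve(query):
    """Return every recording whose feature multiset contains phi(query)."""
    factors = phi(query)
    if not factors:
        return []
    return sorted(
        rec_id
        for rec_id, feats in INDEX.items()
        if all(feats[factor] > 0 for factor in factors)
    )


hits = retrieve("boxing while another avatar joins")
```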

Regarding the computational effort of the MVR content analysis, the hypothesis is that $FE_{SRD}$ is less computationally intensive than $FE_{MMCO}$. A simple comparison of the runtime produces an indication if they differ by an order of magnitude. Hence, the performance $P$ measured in runtime is expected to be $P(FE_{SRD}) < P(FE_{MMCO})$ and $P(FE_{SRD,PD}) < P(FE_{MMCO})$. A further hypothesis is that SRD is much smaller in file size than MMCO and, hence, is more efficient to store. Because feature extraction and indexing are executed independently for each MVR, the pipeline can be parallelized across large collections and needs to run only when new recordings arrive. The small log-file footprint of SRD and PD and the CPU-bound extraction contribute to reducing input–output and compute costs, reinforcing system-wide scalability.

3.2. MVR Process

As described previously, no standard for the acquisition of rendering data to document MVRs is available. To explore the opportunities and challenges for MVRs, we describe an initial approach to recording SRD and PD for MVRs.

For the SRD, the RSG is particularly interesting, since it contains all objects in a scene with many attributes, such as size, position, and materials. Hence, the scene graph could provide equivalent data for object detection. A change in attribute values in the scene graph can be used to detect events, such as the movement of an object by a change of position data.
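A minimal Python sketch of such event detection compares two scene graph snapshots and reports objects whose position changed. The flat dictionary layout, the object names, and the distance threshold are assumptions for illustration; an actual RSG is hierarchical and carries many more attributes.

```python
import math


def detect_movements(previous, current, threshold=0.01):
    """Report objects whose position differs between two snapshots."""
    events = []
    for obj_id, attrs in current.items():
        if obj_id not in previous:
            continue  # newly created objects are not movement events
        distance = math.dist(previous[obj_id]["position"], attrs["position"])
        if distance > threshold:
            events.append({
                "event": "object_moved",
                "object": obj_id,
                "distance": round(distance, 3),
            })
    return events


# Invented snapshots, e.g., taken a few seconds apart.
snapshot_a = {
    "door_3": {"position": (0.0, 0.0, 0.0)},
    "table_1": {"position": (2.0, 0.0, 1.0)},
}
snapshot_b = {
    "door_3": {"position": (0.0, 0.0, 0.0)},
    "table_1": {"position": (2.0, 0.0, 4.0)},
    "avatar_9": {"position": (1.0, 0.0, 1.0)},
}
events = detect_movements(snapshot_a, snapshot_b)
```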

For PD, the input controllers are a first point of interest, because, with these, a player interacts with a scene. Hence, capturing the main input devices, such as mouse and keyboard on a computer and VR hand controllers and head movements on a VR headset, is required. The taxonomy of interaction techniques for immersive worlds [55] provides a categorization of modalities, which we use as categories of input controller data to record. In the same way that input controllers provide a source of information about activity, other players in virtual worlds also influence the scene and interaction flow. Hence, the netcode is a relevant source of information. For VR worlds, the recorded hand controller data (PD) and netcode (SRD) data provide relevant interaction information, which can be easily extracted by FE. Hence, this supports the thesis that PD supports FE for MMCO and SRD.

The rendering process typically incorporates a render loop, which renders every single frame based on the events and modifications of elements in the scene. In order to capture information in the rendering loop, a script can be inserted into the loop that retrieves and saves certain information. Figure 5 shows the process, where the RendererLoop is modified to check if a button is pressed on an InputDevice, and, if yes, the information is sent to the MvrRecorder. Also, if an avatar joins the scene (event), this event is sent to the MvrRecorder and recorded. At a regular interval, for example every 5 s, the scene graph is captured by the MvrRecorder.
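The interplay of render loop and recorder can be sketched in Python as follows. This is a simplified simulation under stated assumptions, not the prototype's code: the frame dictionaries, the method names, and the entry fields are hypothetical, and only the name MvrRecorder and the 5 s interval are taken from the description of Figure 5.

```python
import json


class MvrRecorder:
    """Simplified recorder: collects PD and SRD entries with timestamps."""

    def __init__(self, snapshot_interval=5.0):
        self.snapshot_interval = snapshot_interval
        self.entries = []
        self._last_snapshot = None

    def record(self, t, source, event, **data):
        self.entries.append({"t": t, "source": source, "event": event, **data})

    def maybe_snapshot(self, t, scene_graph):
        # Capture the scene graph on the first frame and then once per interval.
        if self._last_snapshot is None or t - self._last_snapshot >= self.snapshot_interval:
            self.record(t, "SRD", "scene_graph", graph=scene_graph)
            self._last_snapshot = t


def render_loop(recorder, frames):
    """Simulated render loop with the recording hooks inserted."""
    for frame in frames:
        t = frame["t"]
        for button in frame.get("buttons_pressed", []):
            recorder.record(t, "PD", "button_pressed", button=button)
        for avatar in frame.get("avatars_joined", []):
            recorder.record(t, "SRD", "avatar_joined", avatar=avatar)
        recorder.maybe_snapshot(t, frame["scene"])


# Invented frames standing in for a running application.
frames = [
    {"t": 0.0, "scene": ["room"]},
    {"t": 2.0, "scene": ["room"], "buttons_pressed": ["trigger"]},
    {"t": 5.0, "scene": ["room", "guest_17"], "avatars_joined": ["guest_17"]},
]
recorder = MvrRecorder()
render_loop(recorder, frames)
log_lines = [json.dumps(entry) for entry in recorder.entries]
```

Serializing each entry as one JSON line keeps the recording appendable during the session and easy to filter afterwards.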

The PD from external devices can be captured by reading the sensor values during recording, as described, or by temporarily storing them on the device and consolidating them with the MVR later.

The persistence of the captured data as an MVR can be achieved by serialization and logging with standard methods, such as writing to the system standard log, writing to a separate log file, or transmitting to a log service over the network. A hierarchical graph representing the scene can be easily converted into a textual format by indenting each node according to its hierarchy level and listing the nodes sequentially.
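The indentation-based conversion of a hierarchical graph into text can be illustrated as follows. The `Node` type, the `to_indented_text` function, and the two-space indent are assumptions made for this sketch and are not taken from the prototype.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical scene graph node with a name and child nodes."""
    name: str
    children: List["Node"] = field(default_factory=list)


def to_indented_text(node: Node, depth: int = 0, indent: str = "  ") -> List[str]:
    """Depth-first traversal that emits one line per node,
    indented according to its hierarchy level."""
    lines = [f"{indent * depth}{node.name}"]
    for child in node.children:
        lines.extend(to_indented_text(child, depth + 1, indent))
    return lines


scene = Node("Scene", [
    Node("Plane"),
    Node("Person", [Node("Head"), Node("Hand")]),
    Node("Tree"),
])
print("\n".join(to_indented_text(scene)))
```

Because the depth is encoded only by the indent, the resulting text can be appended to any line-oriented log and later parsed back into a tree.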

Based on the modeling work, the relevant points for a standard for MVR recording can be described. A crucial point is whether the app decides which data it provides or whether the recorder specifies the data it requests. The standard should define the interface for the transfer of the data to record from the app to the recorder. The standard should include a container format for persisting MVRs, a definition of data structures for scene graphs, and data structures for PD.

In summary, the recording of MVRs can be achieved using deep integration into the application.

3.3. Summary

In this section, we presented the anatomy of an MVR and explained how content analysis, specifically feature extraction, differs for MMCO, SRD, and PD. We have also discussed how these formats could substitute each other or support each other to enhance the results. To evaluate the hypotheses regarding MVR content analysis, we modeled a recording process for MVRs, which is necessary due to the lack of a standard.

Next, we describe our prototypical implementation of recording and content analysis, as well as the graph conversion process, to create an MVR.

4. Implementation

To enable the recording of the described data in a technology used to create Metaverse and virtual worlds and to validate our modeling, we implemented a prototypical application. Lacking a standard interface for MVRs, an individual implementation was created, which is described in this section.

4.1. Prototype MVR

The implementation of this prototype is based on the models presented in Section 3.2. This prototype implementation was developed using Unity [32]. For further experiments, the applications can be executed on a standard PC and a VR device.

Unity provides a scripting interface to attach application logic to Unity elements. The scripts can, e.g., register event listeners for button presses or make use of the Unity internal event system, which calls script functions during the application lifecycle. All captured data is recorded by logging to the standard output.

The developed MVR recorder is tested with three different virtual worlds. The first scene, referred to as Man-Tree, is a self-created and simplified scene as a reference, with a simple plane and two objects, a tree and a person, shown in Figure 6a. The second scene, referred to as XR, is a small scene of several interactive elements, hence being of medium complexity, from the XR Interaction Toolkit, shown in Figure 6b. The third scene, referred to as City, is a feature-rich scene with 7354 elements in the RSG, shown in Figure 6c.

Table 2 provides an overview of the recorded data. The following sections explain the implementation in detail.

4.1.1. Recording Rendering Scene Graphs

The current scene RSG is accessed using Unity’s GetRootGameObjects method. Our implementation periodically retrieves the scene graph and logs each object’s name, depth, and attributes, e.g., visibility, position, and bounding box. Recursive iteration captures all elements, and the resulting log file can be parsed to create a sequence of scene views.
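How such a log could be parsed into a sequence of scene views is sketched below. The line format (`RSG;<timestamp>;<depth>;<name>;<visible>;<x>,<y>,<z>`) is a hypothetical layout invented for this example; the actual log format of the prototype is not specified in this section.

```python
def parse_scene_views(log_lines):
    """Group RSG log lines into per-timestamp scene views.

    Assumed (hypothetical) line format:
    RSG;<timestamp>;<depth>;<name>;<visible>;<x>,<y>,<z>
    """
    views = {}
    for line in log_lines:
        parts = line.strip().split(";")
        if len(parts) != 6 or parts[0] != "RSG":
            continue  # skip other records, such as button or player events
        _, timestamp, depth, name, visible, position = parts
        views.setdefault(timestamp, []).append({
            "depth": int(depth),
            "name": name,
            "visible": visible == "true",
            "position": tuple(float(v) for v in position.split(",")),
        })
    # Dictionaries keep insertion order, so views stay in log order.
    return list(views.items())


log = [
    "RSG;1700000000;0;Plane;true;0.0,0.0,0.0",
    "RSG;1700000000;1;Tree;true;2.0,0.0,5.0",
    "EVENT;1700000003;button_pressed",
    "RSG;1700000005;0;Plane;true;0.0,0.0,0.0",
    "RSG;1700000005;1;Tree;false;2.0,0.0,5.0",
]
for timestamp, objects in parse_scene_views(log):
    print(timestamp, [obj["name"] for obj in objects])
```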

4.1.2. Netcode Scene Raw Data

A second example for SRD is the information about other players entering the virtual world, which can be obtained from the netcode. Adding limited multiplayer capabilities enabled the prototype to log data on specific events. As described in the model, in the case of a player-join event, a message is logged. The joining network player is represented by different 3D models, hence referred to as avatars.

4.1.3. Capturing of Inputs, Gestures, and Sensors

To demonstrate the recording of PD, log information for simple push buttons, gestures, and biosensors is recorded. To capture simple button inputs, a log function was implemented and registered as an action listener to the buttons of the hand controller. To demonstrate the recognition of complex movements, the prototype implements hand gesture recognition. The implementation measures the movements when a button is pressed and compares them to predefined movement patterns. If a gesture is recognized, an action is triggered to log the event.
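Comparing a recorded movement with predefined patterns can be sketched with a simple template matcher. This is only one possible approach and not necessarily the one used in the prototype: the 2D templates, the distance measure, and the threshold are all assumptions, and paths are assumed to have the same number of sample points as the templates.

```python
import math

# Hypothetical 2D movement templates; the gesture names follow Table 3.
TEMPLATES = {
    "wave": [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0)],
    "w": [(0, 1), (1, 0), (2, 1), (3, 0), (4, 1)],
}


def normalize(path):
    """Translate a path so that it starts at the origin."""
    x0, y0 = path[0]
    return [(x - x0, y - y0) for x, y in path]


def mean_distance(path_a, path_b):
    """Mean Euclidean distance between corresponding points of two paths."""
    return sum(math.dist(a, b) for a, b in zip(path_a, path_b)) / len(path_a)


def recognize_gesture(path, templates=TEMPLATES, threshold=0.5):
    """Return the best-matching gesture name, or None if nothing is close enough."""
    if not path:
        return None
    candidate = normalize(path)
    best_name, best_score = None, threshold
    for name, template in templates.items():
        if len(template) != len(candidate):
            continue  # this sketch only compares paths of equal length
        score = mean_distance(candidate, normalize(template))
        if score < best_score:
            best_name, best_score = name, score
    return best_name


recorded = [(5.0, 5.0), (6.1, 6.0), (7.0, 5.1), (8.0, 6.0), (9.0, 5.0)]
print(recognize_gesture(recorded))
```

A recognized gesture would then be written to the log as an event, in the same way as a button press.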

Together with the video and the scene graph log, HR can be recorded with a peripheral HR tracker. Based on the timestamp information, the HR values can be approximately mapped to the other recorded data.
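The approximate timestamp-based mapping can be illustrated with a nearest-neighbour lookup. The function name `nearest_hr`, the sample layout, and the maximum allowed gap of 5 s are assumptions made for this sketch.

```python
from bisect import bisect_left


def nearest_hr(hr_samples, event_time, max_gap=5):
    """Return the HR value closest in time to an event.

    hr_samples: list of (unix_seconds, bpm), sorted by time.
    Returns None if the closest sample is further away than max_gap seconds.
    """
    if not hr_samples:
        return None
    times = [t for t, _ in hr_samples]
    i = bisect_left(times, event_time)
    # Only the neighbours directly before and after the event can be nearest.
    candidates = hr_samples[max(i - 1, 0): i + 1]
    t, bpm = min(candidates, key=lambda sample: abs(sample[0] - event_time))
    return bpm if abs(t - event_time) <= max_gap else None


hr_samples = [(100, 80), (105, 95), (110, 120)]
print(nearest_hr(hr_samples, 106))
print(nearest_hr(hr_samples, 200))
```

The mapping is only approximate because the devices are not synchronized beyond their wall-clock timestamps.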

Based on the implementation, the prototype application generates a log file with SRD and PD. A simple screen-recording program can generate the MMCO in parallel at runtime. In addition, heart rate can be recorded with the separate device. All parts together form an MVR. Next, we describe the FE of the MVR generated by the prototype.

4.1.4. Postprocessing and Prototypical Content Analysis

The implementation of $FE_{SRD,PD}$ of the MVRs produced by the prototype is based on simple Python scripts. After the recording process, the files of the screen recording, device log file, and FIT file from the HR sensor were collected and post-processed. The HR values from the FIT file are extracted and stored in a Comma-Separated Values (CSV) file.

The content analysis for the log file is straightforward: it iterates through the lines of the log and CSV files and text-matches them against the expected keywords defined in the logging statements of the implementation. For the HR, values between 0 and 110 are extracted as normal HR, and values above 110 are extracted as high HR. After the extraction is completed, the detected features of each file are converted into an MMFG. The MMFG can be processed by the GMAF for querying.
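The keyword matching and the HR thresholding described above can be sketched as follows. The log line layout, the CSV column names (`timestamp`, `heart_rate`), and the function names are assumptions; the feature names follow Table 3, and the conversion of the features into an MMFG is not shown.

```python
import csv
import io

# Feature keywords as listed in Table 3; the log line layout is hypothetical.
KEYWORDS = ("button_pressed", "player_joined")
HIGH_HR_THRESHOLD = 110  # threshold stated in the text above


def extract_log_features(log_lines, keywords=KEYWORDS):
    """Collect every keyword that occurs in at least one log line."""
    features = set()
    for line in log_lines:
        for keyword in keywords:
            if keyword in line:
                features.add(keyword)
    return features


def extract_hr_features(csv_text, threshold=HIGH_HR_THRESHOLD):
    """Classify each HR sample as normal (0..threshold) or high (> threshold)."""
    features = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        bpm = int(row["heart_rate"])  # assumed column name
        if bpm > threshold:
            features.add("high_heart_rate")
        elif 0 <= bpm <= threshold:
            features.add("normal_heart_rate")
    return features


log_lines = [
    "12:00:01 button_pressed trigger",
    "12:00:04 player_joined Avatar_2",
    "12:00:05 RSG Tree",
]
hr_csv = "timestamp,heart_rate\n1700000000,92\n1700000001,118\n"

features = extract_log_features(log_lines) | extract_hr_features(hr_csv)
print(sorted(features))
```

Since both functions only scan text line by line, the extraction runs on a CPU without any model inference.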

4.2. Summary

The prototypical implementation demonstrates the feasibility to record SRD and PD using the same technology that is used to build virtual worlds. The recorded information can be processed by feature extraction and ingested in an MMIR system to be retrieved. The implementation shows that relevant information, such as gestures or HR data, can be captured and used for retrieving MVRs, and supports the hypothesis regarding the usability of SRD and PD. The next section evaluates the effectiveness of retrieval based on the extracted data.

5. Evaluation

This section discusses the evaluation results of our models and the implementation to prove the hypothesis and answer the research questions. We performed quantitative experiments on leveraging the different media types from the MVRs within MMIR.

The described prototype enables the production of MVRs, which include SRD and PD, thus allowing experiments to be executed.

5.1. Evaluation Methods

Methodology overview: We created an MMIR setup defined by four elements: dataset, feature extraction pipelines, retrieval tasks, and metrics.

For the evaluation, a dataset comprising 111 MVRs from three distinct virtual environments was used [26]. Each MVR consists of a video, a log file with extractable information, and an HR log file, all created by a single user. Different scenarios of activities, listed in Table 3, were performed during the recordings. Most of the recordings contain at least one of the following: a visible object, a gesture, a pressed trigger, a joined player, or normal or high HR. The 111 videos have a total length of 41.21 min, as shown in Table 4. The dataset is available in [56,57] for reproduction and further experiments. Each recording was performed on a Meta Quest 2 [58] device running the prototypical implementation, in combination with a Garmin Forerunner 910XT [59] that recorded the HR. The screen capture was recorded with the built-in screen-recorder feature and downloaded from the device. The HR was recorded as a FIT file and converted to a CSV file. The HR data and the log files contain timestamps with a resolution of one second; the screen recording contains the start timestamp in the file name, which enables time-mapping. The recordings were performed by a single 40-year-old male in an office setup, following a predefined list of recordings, each with a specified combination of features. This person is experienced in the present research and hence knew how to produce the defined features. This allowed us to use the predefined list as ground truth for the evaluation.

For the experiments, three pipelines were used within the GMAF: a CPU-only flow that analyzes SRD+PD logs; a GPU-utilizing flow that applies the described YOLO-based processing of the MMCOs; and the third pipeline combines the processing of the SRD, PD, and MMCO of the previous pipelines.

As retrieval tasks, query by keyword [60] and query by example ([40], p. 20) were executed on all three pipelines.

As metrics, the common metrics precision, recall, F1 for keywords, and mAP@k for query by example were used. Additionally, file size and runtime duration for each pipeline were measured.

5.2. Shape of the MVR and Performance of the Content Analysis

The efficiency of the different content analysis methods can be measured in file size and processing time. Table 4 shows a comparison of the file sizes of the recordings in the 111 MVR dataset described in the previous section. The log file size is influenced by the number of scene graph elements recorded. For instance, the RSG of the “Man-Tree” world (recording 64) has 198 elements, resulting in a smaller log size compared to the more complex “City” world (recording 111). On average, the logs are about 51% of the size of their video counterparts (11.52 MByte versus 22.4 MByte in Table 4), and compression can reduce the size further. The processing time of the video depends on the selected algorithms and the hardware used. We performed the tests using a PC with an Intel i5 8500 (6 cores), 32 GB RAM, and an Nvidia GeForce RTX 3090 Ti. The runtime for $FE_{MMCO}$ of the 111 video files, using YOLOv3 and YOLOv7 object detection trained for avatar detection [30], was 5:45 h. In comparison, $FE_{SRD,PD}$ ran in under 1 min.

Overall, the processing of SRD+PD only is much more efficient in storage and processing. When added to the MMCO processing, there was minimal overhead in storage and processing times. Hence, the evaluation results support the hypothesis that SRD and PD are more efficient to process and store.

5.3. Leveraging SRD and PD from MVRs

To address our research question, ‘Can SRD and PD support MVR content analysis?’, we examined the recorded MVRs containing SRD and PD and measured the effectiveness of FE in MVR retrieval. For each MVR, the described log file analysis and HR CSV file reading were performed, generating a simple MMFG. The MVRs were imported as MMFGs into the GMAF; in this pipeline, no content analysis of the MMCO was performed. For the MMCO pipeline, a YOLOv3 object recognition based on COCO labels and an avatar detection [30], using a YOLOv7 model trained with avatar images from the 256 Metaverse Recordings [57], were performed. All features found were merged, deduplicated, and saved without time information.

5.3.1. Experiment Using Query by Keyword

The retrieval results were evaluated employing the ‘query by keyword’ pattern. The 111 MVR dataset collection was processed by all three pipelines. Three different keyword queries were used to search the collection: keyword avatar, which should be matched by all three pipelines; keyword car, which should be detected by the MMCO pipeline; and keyword high_heart_rate, which should be detected by the SRD/PD pipeline. The results were compared with the ground truth and counted as True Positive (TP), False Positive (FP), and False Negative (FN).

Table 5 presents the measured results in the common metrics precision, recall, and F1. The results for the keyword avatar are expected to be precise for SRD/PD: due to the nature of the implementation, precision and recall were perfect. The MMCO avatar detector is sensitive, especially for the examples where static persons are present, i.e., the world Man-Tree, and is not precise for non-human avatars. When features beyond SRD and PD were added, specifically features from YOLO, the results differed. The results for the keyword car are as expected: the MMCO pipeline delivers good results, while the SRD/PD method has no evidence of these objects and scores zero. In contrast, high HR is detected perfectly by the SRD/PD method, while the MMCO method has no means to obtain it. When used in combination, both methods complement each other and provide superior results.
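For reference, the three metrics follow directly from the TP, FP, and FN counts. The helper below is a generic sketch (the function name is an assumption) and is applied to the counts of the Combined row for the keyword avatar in Table 5.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    A metric is returned as None when its denominator is zero.
    """
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else None
    return precision, recall, f1


# Counts of the "Avatar / Combined" row in Table 5: TP = 23, FP = 78, FN = 1.
p, r, f1 = precision_recall_f1(23, 78, 1)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.228 0.958 0.368
```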

These results confirm that SRD and PD can reliably detect structured activities with perfect precision and recall, while MMCO content analysis struggles with non-standard visual representations.

5.3.2. Experiment Using Query by Example

The second experiment searches for MVRs similar to an example video; hence, it is a query by example, using a similarity metric for the result ranking. This query includes not only the designed features but also the features detected by YOLO-based computer vision, and hence shows whether feature extraction can benefit from SRD/PD. As a collection of MVRs, the 111 MVR dataset is used, processed by the three described pipelines. The comparison of results of an example query is shown in Table 6 and demonstrates that the features extracted by YOLO and the Avatar Detector are not beneficial to the desired outcome. A similarity threshold of 0.75 only matches the query object itself, while a threshold of 0.5 shows that combining computer vision with SRD/PD increases the F1 score compared to MMCO alone.

As an example, the metrics for recording98 are presented in Table 6. The FE for SRD and PD delivered exact results. The object detection of MMCO, in comparison, performed worse. It was not trained in activity or gesture detection, but it also delivered inaccurate results on the detected objects of avatars and other classes.
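The effect of the similarity threshold can be illustrated with a small query-by-example sketch. The similarity metric of the GMAF is not described in this section, so the Jaccard similarity over feature sets is used here purely as an assumed stand-in; the recording names and feature sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets (assumed stand-in metric)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def query_by_example(example_features, collection, threshold=0.5):
    """Return (name, score) pairs with score >= threshold, best match first."""
    scored = [(name, jaccard(example_features, features))
              for name, features in collection.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


collection = {
    "rec_a": {"wave", "button_pressed", "high_heart_rate"},
    "rec_b": {"wave", "button_pressed"},
    "rec_c": {"car"},
}
example = collection["rec_a"]

print(query_by_example(example, collection, threshold=0.75))
print(query_by_example(example, collection, threshold=0.5))
```

With the stricter threshold, only the query object itself is returned, while the lower threshold also admits recordings that share a subset of the features.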

Precision at k measures the precision at each occurrence of a relevant result within the ranked list. Average Precision at k (AP@k) computes the average of these precision values for every relevant result in the search query list. Consequently, the mean Average Precision at k (mAP@k) [61] is derived from averaging the AP@k values across all queries in the dataset. The overall comparison of the retrieval results has been measured as mAP@k for the entire collection, based on the feature similarity metric and the ground truth of the SRD/PD. Table 7 presents the evaluation results with the whole collection. As expected, retrieval based on the analyzed SRD/PD provides better results than retrieval based on the MMCO-analyzed data, and adding SRD/PD improves the MMCO result when both are combined.
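The metric described above can be sketched as follows. Several normalizations of AP@k are in use; this sketch averages over the relevant results found within the top k, which matches the verbal definition given here, and is an assumption about the exact variant. The function names and the toy rankings are hypothetical.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k):
    """AP@k: mean of the precision values at each rank where a relevant
    result occurs within the top k; 0.0 if no relevant result is found."""
    hits = 0
    precisions = []
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision_at_k(queries, k):
    """mAP@k: mean of AP@k over (ranked_ids, relevant_ids) pairs."""
    if not queries:
        return 0.0
    return sum(average_precision_at_k(r, rel, k) for r, rel in queries) / len(queries)


queries = [
    (["a", "x", "b", "y", "c"], {"a", "b", "c"}),  # AP@5 = (1/1 + 2/3 + 3/5) / 3
    (["x", "a"], {"a"}),                           # AP@5 = 1/2
]
print(round(mean_average_precision_at_k(queries, k=5), 4))
```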

The result is limited by the ground truth data of the activities. The YOLOv3 analysis extracts features that may be correct but are not contained in the ground truth data and hence are counted as not relevant.

For similarity search, SRD+PD features yielded higher overall precision and better ranked results (mAP@k), especially when queries involved activities rather than static objects. Overall, the evaluation shows an improved multimedia analysis by including SRD and PD. However, it cannot be stated that $FE_{MMCO}$ or $FE_{SRD,PD}$ is superior.

5.4. Summary and Discussion

In summary, the results were as expected and thus demonstrate that SRD and PD provide value by adding relevant features in the content analysis. Despite the small dataset and the low number of queries, this supports the hypothesis that SRD and PD support MVR content analysis. The hypothesis that $FE_{SRD,PD}$ alone in general has a smaller loss than $FE_{MMCO}$ could not be validated. Furthermore, the evaluation results show that low-level features can be extracted from the MVRs with little computational effort.

Restricting analysis to SRD and PD reduces the file size to around 51% of the video size and completes feature extraction in under a minute on a 6-core CPU, versus 5 h 45 min for GPU-based MMCO processing. Because SRD+PD are structured logs, the pipeline is CPU-bound, highly parallelizable, and deployable on ordinary clusters, with no GPU limitation issues. Adding MMCO improves accuracy but re-introduces GPU costs due to the computer vision tasks. The trade-off between efficiency and accuracy is likely to be defined by the task at hand. Large-scale tests on multi-million-session datasets remain for future work, but near-linear cloud scaling is expected given the GPU-free per-record independence of the SRD with PD option.

The low-level features themselves do not provide semantic meaning. For example, high HR is just information showing high HR, but this alone does not indicate what the reason for it is. However, as an indicator of activities, it can support activity detection in combination with other computer vision techniques. Hence, as is also the case with MVRs, the semantic gap is still relevant and either requires more processing on the information given in the recording or reasoning in the content analysis, such as feature fusion or data mining.

6. Conclusions

In this article, we presented our results on the opportunities SRD and PD provide to MVR retrieval. We addressed the question of whether SRD and PD can support MVR content analysis. We discussed how MVRs can be structured and how SRD and PD provide value to the content analysis of MVRs. We modeled and implemented a prototype, which shows how MVRs with MMCO, SRD, and PD can be created.

The evaluation indicates that SRD and PD support the overall retrieval process. An evaluation with a larger dataset and more queries is left to future work. Future research could also address the missing standard interface for the MVRs. The present work focused on the technical opportunities of SRD and PD for MVRs. Future work will address their relevance and benefits for users.

Overall, our results encourage the use of SRD and PD in MVR content analysis for MVR retrieval. In particular, VR training could highly benefit from the indexing of features from PD, which are relevant for quality assurance of the training. However, the results should be validated in a larger MVR collection.

Author Contributions

Conceptualization and methodology: P.S., S.W. and M.L.H. Software, validation, formal analysis, investigation, resources, data curation, and writing: P.S. and S.W. Review, editing, and supervision: S.W., I.F. and M.L.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in [56,57].

Conflicts of Interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

References

1. Richter, F. Smartphones Cause Photography Boom. 2017. Available online: https://www.statista.com/chart/10913/number-of-photos-taken-worldwide/ (accessed on 31 August 2017).
2. Cassidy, F. A Day in Data. 2019. Available online: https://www.raconteur.net/infographics/a-day-in-data (accessed on 10 February 2024).
3. Jenik, C. A Minute on the Internet in 2021. 2022. Available online: https://www.statista.com/chart/25443/estimated-amount-of-data-created-on-the-internet-in-one-minute/ (accessed on 17 October 2022).
4. Austin, D. 2023 Internet Minute Infographic. 2023. Available online: https://ediscoverytoday.com/2023/04/20/2023-internet-minute-infographic-by-ediscovery-today-and-ltmg-ediscovery-trends/ (accessed on 10 February 2024).
5. Andre, L. 53 Important Statistics About How Much Data Is Created Every Day in 2024. 2024. Available online: https://financesonline.com/how-much-data-is-created-every-day/ (accessed on 10 February 2024).
6. Statista. YouTube: Hours of Video Uploaded Every Minute 2022. 2023. Available online: https://www.statista.com/statistics/259477/hours-of-video-uploaded-to-youtube-every-minute/ (accessed on 10 February 2024).
7. Karl, K.A.; Peluchette, J.V.; Aghakhani, N. Virtual Work Meetings During the COVID-19 Pandemic: The Good, Bad, and Ugly. Small Group Res. 2022, 53, 343–365.
8. Mystakidis, S. Metaverse. Encyclopedia 2022, 2, 486–497.
9. Metaversed Consulting. The Metaverse Reaches 600m Monthly Active Users. 2023. Available online: https://metaversed.webflow.io/blog/the-metaverse-reaches-600m-monthly-active-users (accessed on 8 October 2023).
10. KZero Worldwide. Exploring the Q1 24’ Metaverse Radar Chart: Key Findings Unveiled-KZero Worldswide. 2024. Available online: https://kzero.io/2024/02/06/2633/ (accessed on 10 February 2024).
11. Wikipedia. Roblox, 2023. Available online: https://en.wikipedia.org/w/index.php?title=Roblox&oldid=1177660840 (accessed on 8 October 2023).
12. Mojang. Minecraft Official Website. 2023. Available online: https://www.minecraft.net/de-de (accessed on 19 February 2023).
13. Gunkel, S.; Stokking, H.; Prins, M.; Niamut, O.; Siahaan, E.; Cesar, P. Experiencing Virtual Reality Together: Social VR Use Case Study. In Proceedings of the 2018 ACM International Conference on Interactive Experiences for TV and Online Video, Seoul, Republic of Korea, 26–28 June 2018; pp. 233–238.
14. Gartner Inc. Gartner Predicts 25% of People Will Spend At Least One Hour Per Day in the Metaverse by 2026. 2022. Available online: https://www.gartner.com/en/newsroom/press-releases/2022-02-07-gartner-predicts-25-percent-of-people-will-spend-at-least-one-hour-per-day-in-the-metaverse-by-2026 (accessed on 28 October 2022).
15. Statista. Metaverse Worldwide Market Forecast. 2024. Available online: http://frontend.xmo.prod.aws.statista.com/outlook/amo/metaverse/worldwide (accessed on 28 April 2025).
16. Bestie Let’s Play. Wir Verbringen Einen Herbsttag mit der Großfamilie!!/Roblox Bloxburg Family Roleplay Deutsch. 2022. Available online: https://www.youtube.com/watch?v=sslXNBKeqf0 (accessed on 28 February 2022).
17. Steinert, P.; Wagenpfeil, S.; Frommholz, I.; Hemmje, M.L. Integration of Metaverse Recordings in Multimedia Information Retrieval. In Proceedings of the ICSCA 2024, Bali Island, Indonesia, 1–3 February 2024; pp. 137–145.
18. Uhl, J.C.; Nguyen, Q.; Hill, Y.; Murtinger, M.; Tscheligi, M. xHits: An Automatic Team Performance Metric for VR Police Training. In Proceedings of the 2023 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering (MetroXRAINE), Milano, Italy, 25–27 October 2023; pp. 178–183.
19. INTERPOL. Grooming, Radicalization and Cyber-Attacks: INTERPOL Warns of ‘Metacrime’. 2024. Available online: https://www.interpol.int/en/News-and-Events/News/2024/Grooming-radicalization-and-cyber-attacks-INTERPOL-warns-of-Metacrime (accessed on 10 February 2024).
20. Zheng, Z.; Li, T.; Li, B.; Chai, X.; Song, W.; Chen, N.; Zhou, Y.; Lin, Y.; Li, R. Industrial Metaverse: Connotation, Features, Technologies, Applications and Challenges. In Methods and Applications for Modeling and Simulation of Complex Systems; Communications in Computer and Information Science; Fan, W., Zhang, L., Li, N., Song, X., Eds.; Springer Nature Singapore: Singapore, 2022; Volume 1712, pp. 239–263.
21. Sholingar, G.; Alvarez, J.M.; Choe, T.E.; Joo, J. Using Synthetic Data to Address Novel Viewpoints for Autonomous Vehicle Perception. 2023. Available online: https://developer.nvidia.com/blog/using-synthetic-data-to-address-novel-viewpoints-for-autonomous-vehicle-perception/ (accessed on 19 May 2024).
22. Steinert, P.; Mischkies, J.; Wagenpfeil, S.; Frommholz, I.; Hemmje, M.L. Information Need in Metaverse Recordings—A Field Study. arXiv 2024, arXiv:2411.09053.
23. Rüger, S. What is Multimedia Information Retrieval? In Multimedia Information Retrieval; Synthesis Lectures on Information Concepts, Retrieval, and Services; Springer International Publishing: Cham, Switzerland, 2010; pp. 1–12.
24. Steinert, P.; Wagenpfeil, S.; Frommholz, I.; Hemmje, M.L. Towards the Integration of Metaverse and Multimedia Information Retrieval. In Proceedings of the 2023 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering (MetroXRAINE), Milano, Italy, 25–27 October 2023; pp. 581–586.
25. Wikipedia. Massively Multiplayer Online Game. 2023. Available online: https://en.wikipedia.org/w/index.php?title=Massively_multiplayer_online_game&oldid=1138213454 (accessed on 7 March 2023).
26. Steinert, P.; Wagenpfeil, S.; Frommholz, I.; Hemmje, M.L. Artificial-Intelligence-Based Image Generation from Scene Graphs for Metaverse Recording Retrieval. Electronics 2025, 14, 1427.
27. Nunamaker Jr, J.F.; Chen, M.; Purdin, T.D. Systems development in information systems research. J. Manag. Inf. Syst. 1990, 7, 89–106.
28. Huang, Y.; Li, Y.J.; Cai, Z. Security and Privacy in Metaverse: A Comprehensive Survey. Big Data Min. Anal. 2023, 6, 234–247.
29. Wang, Y.; Su, Z.; Zhang, N.; Xing, R.; Liu, D.; Luan, T.H.; Shen, X. A Survey on Metaverse: Fundamentals, Security, and Privacy. IEEE Commun. Surv. Tutor. 2023, 25, 319–352.
30. Becker, F.; Steinert, P.; Wagenpfeil, S.; Hemmje, M.L. Avatar Detection in Metaverse Recordings. Virtual Worlds 2024, 3, 459–479.
31. Wang, R.; Qian, X. OpenSceneGraph 3.0; Packt Open Source; Packt Publishing: Birmingham, UK, 2010.
32. Unity Technologies. About Netcode for GameObjects|Unity Multiplayer Networking. 2024. Available online: https://docs-multiplayer.unity3d.com/netcode/current/about/ (accessed on 15 February 2024).
33. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276.
34. Vahdani, E.; Tian, Y. Deep Learning-Based Action Detection in Untrimmed Videos: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 4302–4320.
35. Liu, M.; Nie, L.; Wang, Y.; Wang, M.; Rui, Y. A Survey on Video Moment Localization. ACM Comput. Surv. 2023, 55, 1–37.
36. Wikipedia. Vulkan. 2024. Available online: https://en.wikipedia.org/w/index.php?title=Vulkan&oldid=1216825726 (accessed on 7 April 2024).
37. IEEE 802.15; Standard Working Group for Wireless Specialty Networks. IEEE: New York, NY, USA, 2002.
38. Güngör, A. Video: NVIDIA Ansel Architecture Explained. 2016. Available online: https://www.technopat.net/sosyal/konu/video-nvidia-ansel-architecture-explained.329263/ (accessed on 20 May 2024).
39. Vaughan, T. Multimedia: Making It Work, 8th ed.; McGraw-Hill: New York, NY, USA, 2011.
40. Rüger, S.; Marchionini, G. Multimedia Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2010.
41. Wagenpfeil, S.; Kevitt, P.; Hemmje, M. Smart Multimedia Information Retrieval. Analytics 2023, 2, 198–224.
42. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
43. Shenkman, C.; Thakur, D.; Llansó, E. Do You See What I See? Capabilities and Limits of Automated Multimedia Content Analysis. arXiv 2022, arXiv:2201.11105.
44. Hürst, W.; Ouwehand, K.; Mengerink, M.; Duane, A.; Gurrin, C. Geospatial Access to Lifelogging Photos in Virtual Reality. In Proceedings of the 2018 ACM Workshop on The Lifelog Search Challenge, Yokohama, Japan, 11 June 2018; pp. 33–37.
45. Manjunath, B.S.; Salembier, P.; Sikora, T. (Eds.) Introduction to MPEG-7: Multimedia Content Description Interface; Wiley: Berlin, Germany; New York, NY, USA, 2002.
46. Ribeiro, R.; Trifan, A.; Neves, A.J.R. Lifelog Retrieval From Daily Digital Data: Narrative Review. JMIR mHealth uHealth 2022, 10, e30517.
47. Jiang, S.; Li, Z.; Zhou, P.; Li, M. Memento: An Emotion-driven Lifelogging System with Wearables. ACM Trans. Sens. Netw. 2019, 15, 1–23.
48. Zhang, Y.; Su, Y.; Sun, X. A QoE Physiological Measure of VR With Vibrotactile Feedback Based on Frontal Lobe Power Asymmetry. IEEE Trans. Multimed. 2024, 26, 2932–2942.
49. Norman, D.A.; Draper, S.W. (Eds.) User Centered System Design: New Perspectives on Human-Computer Interaction; L. Erlbaum Associates: Hillsdale, NJ, USA, 1986.
50. Object Management Group. Unified Modeling Language 2.5.1. 2017. Available online: https://www.omg.org/spec/UML/ (accessed on 29 April 2022).
51. Feng, D.D. (Ed.) Multimedia Information Retrieval and Management: Technological Fundamentals and Applications; Signals and Communication Technology; Springer: Berlin/Heidelberg, Germany, 2003.
52. Smith, T.G.A.; Pincever, N. Parsing Movies in Context. In Proceedings of the USENIX, Nashville, TN, USA, 10–14 June 1991; pp. 157–168.
53. Natarajan, P.; Wu, S.; Vitaladevuni, S.; Zhuang, X.; Tsakalidis, S.; Park, U.; Prasad, R.; Natarajan, P. Multimodal feature fusion for robust event detection in web videos. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1298–1305.
54. Abdulghafour, M.; Chandra, T.; Abidi, M. Data fusion through fuzzy logic applied to feature extraction from multi-sensory images. In Proceedings of the [1993] Proceedings IEEE International Conference on Robotics and Automation, Atlanta, GA, USA, 2–6 May 1993; pp. 359–366.
55. Hertel, J.; Karaosmanoglu, S.; Schmidt, S.; Braker, J.; Semmann, M.; Steinicke, F. A Taxonomy of Interaction Techniques for Immersive Augmented Reality based on an Iterative Literature Review. In Proceedings of the 2021 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Bari, Italy, 4–8 October 2021; pp. 431–440.
56. Steinert, P. 111 MVR Dataset. 2024. Available online: https://patricksteinert.de/256-metaverse-records-dataset/mvr-dataset/ (accessed on 20 July 2024).
57. Steinert, P. 111-Metaverse-Recordings Repository. 2024. Available online: https://github.com/marquies/111-Metaverse-Recordings (accessed on 3 June 2024).
58. Wikipedia. Meta Quest 2. 2023. Available online: https://de.wikipedia.org/w/index.php?title=Meta_Quest_2&oldid=240623939 (accessed on 15 February 2024).
59. Wikipedia. Garmin Forerunner. 2024. Available online: https://en.wikipedia.org/w/index.php?title=Garmin_Forerunner&oldid=1202848733 (accessed on 15 February 2024).
60. Mhawi, D.N.; Oleiwi, H.W.; Saeed, N.H.; Al-Taie, H.L. An Efficient Information Retrieval System Using Evolutionary Algorithms. Network 2022, 2, 583–605.
61. Rahman, M.M.; Roy, C.K. TextRank based search term identification for software change tasks. In Proceedings of the 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Montreal, QC, Canada, 2–6 March 2015; pp. 540–544.

Figure 1. Illustrative rendering process, generating frames for video stream and scene raw data for capturing [26].

Figure 2. Elements and packages of MVRs [26].

Figure 3. UML Use Cases of MVR Retrieval.

Figure 4. Visualization of events detectable in the tracks of an example MVR.

Figure 5. Sequence diagram of recording elements in the rendering process.

Figure 6. Samples of the 111 recordings [26]. (a) shows an example of a simple scene of a tree and a man. (b) shows an example of the XR Interaction Toolkit of Unity. (c) shows a complex scene of a cityscape.

Table 1. Advantages, disadvantages, and opportunities of different formats of MVRs.

Type	Example	Advantages	Disadvantages	Opportunities
MMCO	Video	Easy to record, easy to play back, available	Semantic understanding hard to achieve, high volume of data to store	Can be processed by existing MMIR
SRD	Scene Graph	Medium amount of data to store, contains computer-processable information	Hard to record, hard to play back, not available	Easy to use for semantic understanding
PD	Pressed Controller Buttons	Provides relevant data, contains computer-processable information	Hard to record, hard to play back	Easy to use for semantic understanding

Table 2. Overview of recorded data.

Type	Information
MMCO	Screen and audio via OS-level screen recorder
SRD	Rendering Scene Graphs
SRD	Netcode recorder for joining players
PD	Heart Rate (via external device)
PD	Controller action data

Table 3. Overview of the activities in the dataset.

Activity	Values	Events
Gestures	smile, w, wave	0–2
Activity by HR (moving to get pulse > 100)	high_heart_rate, normal_heart_rate	0–∞
Push the trigger button on the controller	button_pressed	0–∞
Another player joins the game	player_joined	0–1

Table 4. Efficiency comparison of the file size for SRD+PD logs compared to MMCO videos.

MVR ID	Virtual World	Video Duration (s)	Filesize SRD+PD (MByte)	Filesize MMCO (MByte)	Filesize SRD+PD Relative to MMCO (%)
1	XR	61	4.0	23.5	17
2	XR	108	5.3	43.1	12
3	XR	98	5.4	40.7	13
…
64	Man-Tree	28	2.1	13.8	15
…
111	City	24	29.1	12.5	233
Avg		2472	11.52	22.4	51

Table 5. Metrics for query by keyword result sets with the following keywords: avatars, car, high_heart_rate.

Keyword	Analysis	TP	FP	FN	P	R	F1
Avatar	SRD/PD	24	0	0	1	1	1
	MMCO	18	60	6	0.231	0.750	0.353
	Combined	23	78	1	0.228	0.958	0.368
Car	SRD/PD	0	0	41	-	-	0
	MMCO	41	0	0	1	1	1
	Combined	39	0	2	1	0.951	0.975
High HR	SRD/PD	42	0	0	1	1	1
	MMCO	0	0	42	-	-	0
	Combined	42	0	0	1	1	1
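
The precision (P), recall (R), and F1 columns follow from the TP, FP, and FN counts. The following is a minimal Python sketch of that arithmetic, not the authors' evaluation code; the function name `precision_recall_f1` is hypothetical. It returns `None` where a ratio has a zero denominator and computes F1 in the equivalent form 2TP / (2TP + FP + FN), which yields 0 when there are no true positives but some misses.

```python
from typing import Optional, Tuple


def precision_recall_f1(
    tp: int, fp: int, fn: int
) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    """Compute precision, recall, and F1 from confusion counts.

    A value is None when its denominator is zero (ratio undefined).
    """
    # Precision: share of returned results that are relevant.
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    # Recall: share of relevant items that were returned.
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    # F1 as 2TP / (2TP + FP + FN); equals the harmonic mean of
    # precision and recall whenever both are defined and non-zero.
    denom = 2 * tp + fp + fn
    f1 = (2 * tp) / denom if denom > 0 else None
    return precision, recall, f1


if __name__ == "__main__":
    # Counts taken from the "Avatar / Combined" row of Table 5.
    p, r, f1 = precision_recall_f1(tp=23, fp=78, fn=1)
    print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Applied to the Avatar/Combined row (TP = 23, FP = 78, FN = 1), this gives P ≈ 0.228, R ≈ 0.958, and F1 = 0.368, matching the reported values.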

Table 6. Metrics for query by similarity result with similarity threshold 0.5.

	SRD/PD	MMCO (YOLO)	Combined
Precision	1	0.857	1
Recall	1	0.261	0.478
F1	1	0.4	0.647

Table 7. mAP@k for each MVR in the collection.

Data/k	5	10	20	50	100
SRD/PD	0.237	0.429	0.670	0.887	0.934
MMCO	0.093	0.126	0.172	0.263	0.406
Combined	0.161	0.246	0.356	0.529	0.629
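
For readers unfamiliar with mAP@k, the following Python sketch shows one common way to compute it. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each ranked list is assumed to contain no duplicates, and AP@k is normalized by min(k, number of relevant items), which is only one of several conventions in use.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def average_precision_at_k(
    ranked: Sequence[Hashable], relevant: Set[Hashable], k: int
) -> float:
    """AP@k for a single query: mean of precision@i over the hit ranks
    i <= k, normalized by min(k, |relevant|) (assumed convention)."""
    if not relevant or k <= 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(k, len(relevant))


def mean_average_precision_at_k(
    runs: List[Tuple[Sequence[Hashable], Set[Hashable]]], k: int
) -> float:
    """mAP@k: the mean of AP@k over all (ranked list, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Two toy queries with made-up identifiers, for illustration only.
    runs = [
        (["a", "x", "b"], {"a", "b"}),
        (["x", "y"], {"y"}),
    ]
    print(round(mean_average_precision_at_k(runs, k=3), 4))
```

In a query-by-example setting such as the one summarized in Table 7, each MVR in the collection would serve once as the query, and its ranked result list and set of relevant MVRs would form one entry in `runs`.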

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
