Introduction
It is still common practice to store gaze information belonging to video stimuli next to the video file in custom file formats. Besides proprietary formats, in which the data structure of the gaze information is unknown, the data structures are strongly customized, sometimes unique, and stored in a diversity of formats, e.g., plain text, xml, Matlab mat format, or even binary. As a consequence, special tools are needed for accessing, visualizing, and analyzing these data and stimuli. For the general audience, specialized software tools that are expensive or require compilation are a major obstacle and often prevent access to the data.
An open and standardized exchange format will overcome this problem and will form the basis for visualization and visual analytics (VA) software. Thus, why not encapsulate gaze data alongside the stimulus material in a joint container? For storing text plus metadata, this has become common practice; a well-known example is the pdf container. For the encapsulation of multimedia content comprising video, audio, and metadata, such as subtitles, audio comments, and interactive features, multimedia container formats were specified. Examples are the open container format ogg (Xiph.org, 2016), mpeg-4 (ISO/IEC, 2003), or the Matroška container mkv (Matroska, 2017). Housing multimedia content in a single file means that it can be easily archived, has fixed and validated synchronization of the different data streams, can be played by standard multimedia players, and can be streamed via the Internet. If gaze data or other participant-related data, such as EEG traces, are encapsulated in such container formats, accessibility will be improved for both experts and the general audience, in scientific publications, or in popular science video demonstrations (a demonstration video, the software including all tools, and converted eye-tracking data sets can be downloaded at https://ikw.uos.de/%7Ecv/projects/mm-mkv).
We present concepts and first software tools for creating, analyzing, and decomposing multimedia containers of eye-tracking research data. Our aim is that video eye-tracking data can be stored in common multimedia containers, carrying (i) the video stimuli, (ii) the gaze trajectories of multiple participants, and (iii) other video-related data. With a strong focus on gaze data, we evaluate current multimedia containers that support a variety of video, audio, and subtitle data formats. We want to find out whether such containers can provide instantaneous visualization and basic VA support in standard media players. By contributing our code extensions to the official development repositories, our long-term aim is to establish a standard format. Based on that format, research data could be encoded uniformly and would benefit various fields, ranging from training neural networks on human sensory data, to highlighting objects in movies for visually impaired people, to creating auditory displays for blind people. Further, VA on gaze data may experience a boost once a standard format exists that allows sharing data and results and combining gaze data with other metadata.
Building on previous work on merging video and eye-tracking data (Schöning, Faion, Heidemann, and Krumnack, 2016), we focus on the following aspects in the present paper: (i) a more detailed discussion of feasible data formats for metadata description, timed data representation, and multimedia encapsulation (cf. Sections Data Formats and Suitable Data Formats for Gaze Data), (ii) the description and implementation of software tools for creating and decomposing multimedia containers with gaze data, as well as our first two add-ons for VA on gaze data using the VLC media player (cf. Section Software Tools for Gaze Data in Multimedia Containers), and (iii) a discussion of the benefits of our approaches, highlighted in user interviews, together with a list of available data sets (cf. Section User Study). Based on a comparison of available container formats suitable for housing sensory and other relevant data, we conclude by discussing possible future work in the fields of VA and computer vision, where gaze data in multimedia containers have a strong impact.
Data Formats
A list of available video eye-tracking data sets (Winkler and Ramanathan, 2013, cf. Section 2.2) reveals an impressive diversity of storage formats, e.g., plain text, xml, Matlab mat format, or even binary. None of these formats provides instantaneous visualization, nor can the stimulus sequence be streamed together with metadata such as gaze trajectories and object annotations. In contrast, the domains of DVD, Blu-ray, and video compression provide formats with efficient mechanisms for visualization and streamability via the Internet. Therefore, it is reasonable to evaluate the formats used in these domains and to analyze whether one of them fits the requirements for storing research data. These requirements include, e.g., storing the stimuli as well as all relevant metadata, providing opportunities for visualization, and, if possible, VA within a standard multimedia player.
Metadata Formats
Computer vision algorithms still have great difficulties extracting semantic content from video sequences. Hence, VA of the gaze trajectories of subjects performing, e.g., a visual search task may provide insights for designing better algorithms. Gaze trajectories might also serve as training data for deep learning approaches if the data representations of several data sets are compatible. However, combining, extending, exchanging, and analyzing gaze data from different repositories is tedious, time-consuming, and inefficient due to the lack of a standardized format.
With the rise of the semantic web, standards for metadata were introduced and have become quite popular. The most general standard for expressing metadata is the resource description framework (RDF) (W3C, 2014). Its well-defined formal syntax and distributed representation allow statements about resources, i.e., virtually everything that can be uniquely described. The sparse predefined vocabulary of RDF requires extension whenever RDF is used in a new domain. By now, several vocabulary schemes for various domains exist, but there is, to our knowledge, no scheme for video content. Further, the temporal and spatial structure of video sequences differs from that of the majority of vocabularies. Therefore, a new vocabulary scheme would have to be defined to fit eye-tracking data sets.
For annotating time-continuous data with different types of metadata, the Continuous Media Markup Language (cmml) (Pfeiffer, Parker, and Pang, 2006) was developed. Similar to an HTML description, it provides markup facilities for timed objects, which are ideal for the integration of multimedia content on the Internet. With its predefined vocabulary, cmml provides textual descriptions, hyperlinks, images (e.g., keyframes), and, in the form of value pairs, other unspecified metadata. Using this vocabulary, temporal metadata of video and audio sequences can be represented. Unfortunately, the spatial structure, i.e., references to pixels, regions, and areas, is not provided by cmml.
Probably the most advanced metadata framework for multimedia content is mpeg-7, which has been developed by the Moving Picture Experts Group. This Multimedia Content Description Interface, defined in the ISO/IEC 15938 standard, specifies a set of mechanisms for describing as many types of multimedia information as possible. However, the specification has been criticized and has lost support because of its lack of formal semantics, which causes ambiguity, leads to interoperability problems, and hinders both widespread implementation and a standard interpreter (Mokhtarian and Bober, 2003; Arndt, Staab, Troncy, and Hardman, 2007).
Nevertheless, mpeg-7 can describe multimedia content at different degrees of abstraction and is designed as an extensible format. Using its own Description Definition Language (DDL, an extended form of xml Schema), it defines connected vocabularies on all levels of abstraction. The basic structure of the vocabulary focuses on Grid Layout, Time Series, Multiple View, Spatial 2D Coordinates, and Temporal Interpolation. For object labeling, as well as for continuous gaze trajectories, the feature Temporal Interpolation could be useful. Using connected polynomials, it allows temporal interpolation of metadata over time and forms the basis of the Spatio-Temporal Locator, which can be used to describe, e.g., the bounding box of an object over the complete video sequence.
Timed Text
Like a video transcript, timed text (W3C, 2017) assigns text labels to certain time intervals based on time-stamps and is typically used for subtitling movies, as well as for captioning for hearing-impaired people or people lacking audio devices. The simplest grammar of timed text consists of a time interval (start and end time) and the text to be displayed. The popular SubRip format (srt) sticks to exactly that grammar. Enhancing this temporal grammar with spatial features, the second edition of the Timed Text Markup Language 1 (TTML1) (2013) introduces a vocabulary for regions. Nowadays, the amount of variation within timed text formats makes them hardly interoperable. Therefore, a large set of decoders and rendering frameworks has emerged to maintain compatibility across media players.
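To make this minimal grammar concrete, the following sketch writes gaze samples as srt cues; the cue layout (an index line, a HH:MM:SS,mmm --> HH:MM:SS,mmm time line, and a payload line) follows the common SubRip convention, while the sample data and function names are purely illustrative.

```python
def ms_to_srt(ms: int) -> str:
    """Format a millisecond timestamp in the HH:MM:SS,mmm notation used by SubRip."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def gaze_to_srt(samples, interval_ms=40):
    """samples: list of (timestamp_ms, x, y) gaze points; one cue per sample."""
    cues = []
    for i, (t, x, y) in enumerate(samples, start=1):
        cues.append(f"{i}\n{ms_to_srt(t)} --> {ms_to_srt(t + interval_ms)}\n"
                    f"gaze: x={x:.0f} y={y:.0f}\n")
    return "\n".join(cues)

print(gaze_to_srt([(0, 512, 384), (40, 520, 380)]))
```

Such a cue can carry the gaze position only as display text; the spatial rendering itself is left to richer formats, as discussed below.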
Suitable Data Formats for Gaze Data
What kind of data format would be best suited for the gaze trajectories of several subjects, the video stimulus itself, and optional metadata like object annotations? In the previous section, we gave a brief overview of the diversity of metadata formats and multimedia containers. Fostered by the advance of the semantic web and other digital technologies, this diversity keeps growing. While general formats like RDF (W3C, 2014) are well established, they are not geared towards the description of temporal content like video sequences and hence miss the required vocabulary. To avoid the introduction of yet another new format, we suggest three approaches to a standard format for gaze data on video stimuli. The first approach is the extension of a general metadata formalism like RDF with a vocabulary for video, audio, eye-tracking, and gaze data. The second approach uses a well-defined standard for metadata like mpeg-7 along with the implementation of proper multimedia player extensions. Third, we present an ad hoc approach that utilizes existing technologies, e.g., for subtitles or captions, to store gaze data and object annotations.
The first approach, i.e., the development of a specialized eye- and gaze-tracking format on the basis of a well-established metadata framework like RDF, has the obvious advantage that it can build on the vast number of software libraries and tools that support storage, management, exchange, and a certain type of reasoning over such data. The main drawback is that these formats lack any form of support for the video and audio domains and do not even provide basic spatial or temporal concepts. The development of a specialized vocabulary to describe gaze-tracking scenarios would mean a huge effort, and it would additionally have to include the implementation of tools for visualization. Desirable features, like streamability, do not fit the originally static nature of these data formats well. Furthermore, given the specificity of the field of eye-tracking, wide support by common multimedia players seems unlikely.
Using a specialized, detailed standard for video metadata like mpeg-7 is the second possible approach. mpeg-7 provides a well-defined and established vocabulary for various annotations, but lacks a vocabulary for gaze data. Nevertheless, mpeg-7 supports the description of points or regions of interest in space and time, and hence, seems well suited to store the essential information of eye-tracking data. Unfortunately, no standard media player (like VLC, MediaPlayer, and Windows Media Player) currently seems to support the visualization of mpeg-7 video annotations. Hence, one would have to extend these players to visualize embedded eye-tracking data—fortunately, one can build on existing mpeg-7 libraries (Bailer, Fürntratt, Schallauer, Thallinger, and Haas, 2011). We think that, when implementing such multimedia player extensions, one should aim at a generic solution that can also be used to visualize other mpeg-7 annotations, as this would foster development, distribution, and support.
Even though the first two approaches seem better suited in the long run, they appear realizable neither with manageable effort nor on a short time scale. Hence, in the remainder of this paper, we focus on the third approach, which allows for the quick development of a prototype to demonstrate the idea, its concepts and possibilities, and to gain experience in its application. The idea is to adopt existing technologies, like subtitles, captions, audio tracks, and online links (Bertellini and Reich, 2010), which are already supported by existing media players. Although these technologies are not yet geared towards the storage and presentation of gaze data, all important features can be realized based on these formats. Reusing or "hijacking" such formats has the benefit that they are widely supported by current multimedia players. Even though there appears to be no general drawing framework, some subtitle formats include drawing commands that allow highlighting regions in a scene. These commands can also be used for visualizing gaze data or for the auditory display of EEG data. Using a player's standard methods for displaying subtitles and switching between different languages, one can then display gaze data and switch between the data of different subjects. Using player add-ons, simple VA tasks become possible by reusing tracks of the multimedia container.
When putting metadata into a reused format, one must be careful that no information is lost. Additionally, one should bear in mind that the original format was designed for a different purpose, so it may not have out-of-the-box support for desirable features, e.g., the simultaneous display of the gaze points of multiple subjects. We chose usf for three main reasons: First, its specification considers possible methods for drawing shapes, a prerequisite for instantaneous visualization with a normal multimedia player. Second, it allows for storing additional data, a necessity for carrying all eye-tracking data so that expert visualization with specialized tools is possible from the same single file. Finally, and most importantly, usf is, like the preferable mpeg-7, an xml-based format and thereby capable of holding complex data structures. However, although basic usf is supported by some existing media players, the drawing commands are not implemented; we therefore provide an extension for the VLC media player that implements some of them. In addition, we provide a converter to the ass format, which, thanks to the free ass library, is widely supported including its drawing commands, thereby allowing out-of-the-box visualization of eye-tracking data with many current multimedia players. However, its plain-text format is too restricted to hold all desirable information. Both approaches are discussed in more detail in the next section.
Software Tools for Gaze Data in Multimedia Containers
In this section, we present two prototypes of multimedia containers in which gaze data and object annotations are embedded using two different subtitle formats, together with our new tool for converting metadata along with the stimulus video sequence into a mkv container. After a brief introduction of a tool for decomposing a multimedia container, we present add-ons to VLC that allow simple VA on multimodal data, here gaze data, using a standard media player.
Container Prototypes
For the encapsulation of metadata, we implement two kinds of prototypes in which the subtitle track is reused as a carrier of metadata. Our first prototype, based on usf, encapsulates the complete gaze metadata without loss and can be visualized in a modified version of the VLC media player. The second one, based on the ass format, can only carry selected metadata for visualization, but these visualizations can be displayed by current media players, as well as by standalone DVD players.
Metadata as usf. In order to use usf for encapsulating eye-tracking data, we analyzed which features of usf are available in the latest releases of common multimedia players. One of the most common media players is the VLC media player. Its current developer version 3.0.0 already supports a variety of usf attributes, namely text, image, karaoke, and comment. The latest usf specification introduces an additional attribute shape that is still marked as under development, although the specification is already quite old. Since gaze data are commonly visualized with simple geometric shapes, like circles and rectangles, using the shape attribute for instantaneous gaze visualization of subjects seems quite appropriate.
Listing 1. Section of the usf specification (Paris et al., 2010); attributes marked with * are added to the specification and implemented in our altered VLC player. [Listing not reproduced here.]
Since the exact specification of the shape attribute is, as mentioned above, not complete, we particularized it for rectangles, polygons, and points, as illustrated in Listing 1. These simple geometric shapes were chosen as first components in order to visualize a multitude of different types of elements. Point-like visualizations are useful to describe locations without area information, e.g., for gaze point analysis in eye-tracking studies. Rectangles are most commonly used as bounding boxes for object-of-interest annotations, whereas polygons provide a more specific, but more complex, way of describing the contour of an object. Further, as can be seen in Figure 9, one can use different geometric shapes to differentiate between, e.g., saccades marked by circles and fixations marked by rectangles.
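To give an impression of such a shape annotation, the following sketch emits one usf subtitle entry containing a point and a rectangle shape for a single gaze sample. Since Listing 1 is not reproduced here, the element and attribute names (shape, type, x, y, width, height, and the start/stop attributes of subtitle) are illustrative guesses at the extended schema rather than its exact definition.

```python
def usf_shape_entry(start, stop, gaze_x, gaze_y, box=None):
    """Return one <subtitle> element with shape children (hypothetical schema)."""
    shapes = [f'    <shape type="point" x="{gaze_x}" y="{gaze_y}"/>']
    if box is not None:  # optional object-of-interest bounding box
        x, y, w, h = box
        shapes.append(f'    <shape type="rectangle" x="{x}" y="{y}" width="{w}" height="{h}"/>')
    return (f'  <subtitle start="{start}" stop="{stop}">\n'
            + "\n".join(shapes)
            + "\n  </subtitle>")

print(usf_shape_entry("00:00:01.000", "00:00:01.040", 512, 384, box=(480, 350, 64, 64)))
```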
The so-called subsusf codec module of the VLC player handles the visualization of all usf content. In detail, this codec module receives streams of the subtitle data for the current frame from the demuxer of VLC and renders a frame overlay at runtime. We extended this module with additional parsing capabilities for our specified shape data, which is then drawn into the so-called subpictures and passed on to the actual renderer of VLC. Since the thread is called for every frame, the implementation is time-critical, and we decided to use the fast rasterization algorithm of Bresenham (1965). Additionally, we added an option to fill the shapes, which is implemented with the scan-line algorithm (Wylie, Romney, Evans, and Erdahl, 1967). In order to ensure that our enhancements of the subsusf module will be available within the next VLC release, we submitted our changes to the official VLC 3.0.0 developer repository and suggested an ongoing development of usf support. Figure 2, Figure 7 and Figure 9 show the instantaneous visualization of gaze data by the VLC player.
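Our changes to the subsusf module are written in C; purely as a compact illustration of the rasterization step, the sketch below draws line segments into a simple greyscale pixel buffer using Bresenham's algorithm. The buffer layout and function names are our own and do not correspond to VLC internals.

```python
def bresenham_line(buf, width, x0, y0, x1, y1, value=255):
    """Rasterize a line into a flat row-major pixel buffer (list of ints)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        buf[y0 * width + x0] = value   # set the current pixel
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy

# draw the outline of a 10x10 gaze marker into a 640x480 buffer
buf = [0] * (640 * 480)
for (ax, ay), (bx, by) in [((100, 100), (110, 100)), ((110, 100), (110, 110)),
                           ((110, 110), (100, 110)), ((100, 110), (100, 100))]:
    bresenham_line(buf, 640, ax, ay, bx, by)
```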
For testing all features of our implementation, we created a mkv container containing a dummy video file, as well as usf files with all supported attributes. VLC can open these containers and yields the desired visualization of geometric object annotations (Schöning, Faion, Heidemann, and Krumnack, 2017). This proves our concept that the incorporation of metadata into usf is possible and that streaming, using mkv as a container format, is also possible, since both content and metadata are integrated as temporal payload. Opening the container in a released version of VLC (2.2.4) without the additional adjustments in the subsusf module does not conflict with normal video playback, but does not visualize the incorporated annotations, either.
Metadata as ass. Since the usf-based prototype requires a modified version of the VLC media player, a broad audience is still excluded from watching the gaze data without that player. Therefore, we provide a second prototype based on ass, a subtitle format with drawing commands that is widely supported thanks to the free ass library and that is also supported by many standalone DVD players. In contrast to usf, the ass subtitle format cannot carry all of the desired metadata, as it is not capable of representing complex data structures. Due to its design, it can only embed content that can be visualized; thus, any other content is lost in the conversion from, e.g., gaze data to ass.
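For readers unfamiliar with ass drawing commands, the following sketch converts a single gaze sample into a Dialogue event that paints a small filled square at the gaze position using the vector drawing mode ({\p1} ... {\p0}) supported by the ass library. The style name and the surrounding script header are assumed, and the exact set of override tags honored may vary between renderers.

```python
def ass_time(ms: int) -> str:
    """Format milliseconds in the H:MM:SS.cc notation used by ass."""
    cs = round(ms / 10)               # centiseconds
    h, cs = divmod(cs, 360_000)
    m, cs = divmod(cs, 6_000)
    s, cs = divmod(cs, 100)
    return f"{h:d}:{m:02d}:{s:02d}.{cs:02d}"

def gaze_to_ass_event(t_ms, x, y, duration_ms=40, half=8, style="Gaze"):
    """One Dialogue line drawing a filled square centred on the gaze point."""
    square = f"m {-half} {-half} l {half} {-half} {half} {half} {-half} {half}"
    return (f"Dialogue: 0,{ass_time(t_ms)},{ass_time(t_ms + duration_ms)},{style},,0,0,0,,"
            f"{{\\pos({x},{y})\\p1}}{square}{{\\p0}}")

print(gaze_to_ass_event(1000, 512, 384))
```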
For generating the lossy ass files from the lossless usf files, a simple translation stylesheet using xslt (Extensible Stylesheet Language Transformations) has been written. After the conversion, a mkv container is created that includes the video and one ass track for each subject. The resulting container makes the metadata accessible to a broad audience, as the ass visualization can be displayed by many video players without any modification.
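Such a container can be assembled with standard tooling; as an illustration, the sketch below calls the mkvmerge command-line tool from Python to mux the stimulus video with one ass track per subject, naming each track after the subject. The file names are placeholders, and the options used should be checked against the locally installed MKVToolNix version.

```python
import subprocess

def mux_gaze_container(video, subject_tracks, output="experiment.mkv"):
    """subject_tracks: mapping of subject id -> path to that subject's .ass file."""
    cmd = ["mkvmerge", "-o", output, video]
    for subject, ass_path in subject_tracks.items():
        # --track-name applies to track 0 of the following input file (the ass track)
        cmd += ["--track-name", f"0:{subject}", ass_path]
    subprocess.run(cmd, check=True)

mux_gaze_container("stimulus.mp4", {"P1A": "P1A.ass", "P2B": "P2B.ass"})
```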
VA Using Standard Multimedia Players
The use of gaze data presented in multimedia containers is quite intuitive, as the user interface builds on metaphors known from entertainment content. Hence, the user can change the visualizations like subtitles, and different types of auditory display can be selected like audio languages (Schöning, Gert, et al., 2017), as shown in Figure 2b. The general subtitle system in VLC allows for only one active subtitle track at a time, which unfortunately limits the range of possibilities for VA tasks significantly. Since it is often important to visualize gaze-object relations, e.g., to compare the reactions of subjects to a given stimulus, we developed two VLC add-ons for the simultaneous and synchronized playing of different subtitle tracks.
Since such a feature is rarely needed in general contexts, the VLC player does not provide the required abilities. To improve its usability as a scientific tool, we extend it with this operation, which is commonly needed in the domain of visual analytics. In the following, two solutions for providing this additional VA functionality are discussed.
On the one hand, one might directly extend the source code of the chosen software if the proposed usf subtitle format is used. Even though the required changes are quite complex and low-level, they pay off in high performance and in access to a broad range of internal components of the application. However, the increase in complexity, with the resulting consequences for performance and security, is hardly acceptable if only a small fraction of users will need the feature. Moreover, such a custom patch would require custom compilation of the software until it is sufficiently tested and has found its way into the binary releases of the player. Translating VLC's human-readable code into binary executables and linking it to more than 50 required dependencies on the user's machine requires advanced knowledge of the system and its infrastructure, not to mention that compilation may require root rights.
On the other hand, one might use runtime processing to avoid these drawbacks. By default, the VLC multimedia player supports custom add-ons, which are executed on demand and do not need any code compilation. Such pieces of code are therefore easy to install, to use, and to remove without problems or residues, much like applications on current smartphones. Technically, these add-ons are small text files with the extension lua, stored in scope-dependent directories. These scripts contain code written in a slightly outdated version of the cross-platform scripting language LUA, which was designed to provide a fast interface for extending and customizing complex software projects, like video games, after their release. Therefore, LUA extensions match our needs perfectly. We think that the advantages in usage outweigh the poorer performance of scripting solutions in comparison to binary code. Therefore, we built two analysis tools using this dynamic approach. Both add-ons can fulfill the same goal of interactively visualizing and analyzing different data set items at the same time, like the object-of-interest annotation and the gaze points of participant P1A. However, they differ not only in their mode of presentation, one stimulus window versus several stimulus windows, but also in the advantages and disadvantages of their graphical representation, depending on the context of the VA task and the content of the data set.
Our first extension (SimSub), shown in Figure 5a, allows visualizing different eye-tracking data sets in multiple windows. It may be used to provide a rough overview of the experimental results of multiple participants. A simple click on the requested subtitle track creates another instance that behaves exactly like the previous one. The parallel video streams are synchronized within a certain hierarchy. The VLC player started by the researcher works as a controller: changes in its playback are transmitted in real-time to the other windows, guaranteeing an exact comparison on a frame basis. The other instances have neither control elements nor timelines, but depend purely on the "master" window. Technically, the extension uses the fact that the VLC multimedia player provides an interface for remote access. If an eye-tracking data set is to be visualized in parallel, the extension starts another instance in the background, sets the proper subtitle track, and synchronizes it with the main window. Any modification of the current state, e.g., pausing or stopping, is immediately sent via interprocess communication to all other instances. This approach allows using all the embedded features of the VLC player, including its many supported subtitle formats. Moreover, the isolated instances cannot influence each other in exceptional circumstances, such as parsing errors, and they allow a straightforward VA workflow through a fully customizable arrangement of the visualizations, with the ability to show and hide them on demand. However, the multiple instances may have a performance impact if used excessively. This restriction arises from limitations of the provided LUA interface, which make it necessary to handle some commands in a very cumbersome manner. Hence, our current implementation sometimes lacks correct synchronization of the video and the subtitles between all open views, especially when working with three or more views.
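The add-on itself is implemented in LUA inside VLC; purely to illustrate the master/slave synchronization idea, the following sketch drives slave instances through VLC's remote-control (rc) interface over local TCP sockets. The command-line options (--intf rc, --rc-host, --sub-track) and the rc commands pause and seek exist in current VLC releases, but the port numbers, file names, and helper functions are our own assumptions.

```python
import socket
import subprocess
import time

def start_slave(container, sub_track, port):
    """Launch a background VLC instance with a given subtitle track and an rc control port."""
    subprocess.Popen(["vlc", container,
                      "--sub-track", str(sub_track),      # pre-select the subject's track
                      "--intf", "rc",                     # remote-control interface ...
                      "--rc-host", f"localhost:{port}"])  # ... reachable via TCP
    time.sleep(2)                                         # crude wait for the instance to come up
    return socket.create_connection(("localhost", port))

def broadcast(slaves, command):
    """Forward a playback command (e.g. 'pause' or 'seek 42') to every slave instance."""
    for conn in slaves:
        conn.sendall(command.encode() + b"\n")

slaves = [start_slave("experiment.mkv", sub_track=2, port=4212),
          start_slave("experiment.mkv", sub_track=3, port=4213)]
broadcast(slaves, "seek 42")   # keep all views at the same position
broadcast(slaves, "pause")
```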
Our second extension (MergeSub) allows visualizing eye-tracking data from different participants within a single instance of the VLC player, which also avoids the synchronization problems (cf. Figure 5b). It is designed to find tiny differences in the experimental results of multiple participants, which might become obvious only if played directly side by side. The researcher selects an arbitrary number of subtitle tracks, which are then displayed together on the video sequence. Technically, this ability is provided by automatically merging the selected subtitle tracks into a single track, saved in a temporary file (we do not influence the renderer directly). As the merged track is imported afterwards just like any other subtitle, the researcher has full access to all capacities and advanced functionalities of the VLC player. However, this approach limits the usable subtitle formats dramatically, because it requires a custom merging routine for every format. Our extension so far covers only the formats that are sophisticated enough to visualize eye-tracking data. As a further improvement of this add-on, we plan to assign different colors to the subtitle tracks, which will drastically improve usability.
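As a rough illustration of such a merging routine, the sketch below combines several ass tracks into a single temporary track by keeping the header of the first file and concatenating the Dialogue lines of all files, sorted by their start time. A real implementation would additionally have to reconcile styles; the file names and the naive time-based sort are assumptions of this sketch.

```python
import tempfile

def merge_ass_tracks(paths):
    """Merge several .ass subtitle files into one temporary track (naive sketch)."""
    header_lines, events = [], []
    for i, path in enumerate(paths):
        with open(path, encoding="utf-8-sig") as f:
            for line in f:
                if line.startswith("Dialogue:"):
                    events.append(line.rstrip("\n"))
                elif i == 0:                 # keep header/styles of the first file only
                    header_lines.append(line.rstrip("\n"))
    events.sort(key=lambda e: e.split(",")[1])   # sort by the start-time field
    out = tempfile.NamedTemporaryFile("w", suffix=".ass", delete=False, encoding="utf-8")
    out.write("\n".join(header_lines + events) + "\n")
    out.close()
    return out.name

merged = merge_ass_tracks(["P1A.ass", "P2B.ass"])
```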
Combining both techniques enables a simple workflow for basic VA tasks, including reasoning. The dynamic extensibility of the VLC multimedia player allows further developments tailored to the needs of researchers. Thus, an out-of-the-box multimedia player can become a powerful piece of VA software. With their support for easily available and usable custom extensions, today's multimedia players are suitable hosts for a variety of VA add-ons, which can be used in parallel or one by one.
Conclusion
The importance of gaze data in general, and especially the number of VA applications, will increase significantly once gaze data, as well as other metadata, are provided in multimedia containers. We have demonstrated instantaneous visualization of such multimedia containers with common video players. By extending today's multimedia players with add-ons, such as our add-ons for multi-subtitle visualization, basic VA is also possible. Hence, no specific VA software is needed and no new interaction metaphors must be learned; our add-ons rely on familiar metaphors, e.g., switching subtitles. In the long term, and as a result of the data format comparison in Table 1, the research community and the industry should seek to establish a standard based on mpeg-7 for recording, exchanging, archiving, streaming, and visualizing gaze as well as other metadata. Due to its thorough specification, mpeg-7 provides a platform for consistent implementations in all media players, but for the same reason its implementation and testing will require many resources. By muxing mpeg-7 into mp4 containers, streaming, visualization, and auditory display are guaranteed. Unfortunately, we recognize a lack of mpeg-7 standard libraries and of integration in current media players. In order to quickly close this gap, we have presented ad hoc prototypes that allow us to promote embedding metadata into multimedia containers and lead to the immediate use of the metadata. Our approach reuses the usf subtitle format to encode eye-tracking data, allowing them to be visualized in a patched version of the popular VLC media player. For other media players, as well as for standalone DVD players, we provide the ass-based multimedia container. To motivate VA software designers to develop VA add-ons based on existing software as well, we provide two VLC player add-ons, which allow reasoning about whether, e.g., the gaze trajectories of two different subjects are focused on the same object of interest. In the future, we plan to expand the VA add-ons to support almost every kind of scientific metadata and to use the insights gained for bio-inspired, as well as sensory-improved, computer vision (Schöning, Faion, and Heidemann, 2016).