Assessment of QoE for Video and Audio in WebRTC Applications Using Full-Reference Models

WebRTC is a set of standard technologies that allows exchanging video and audio in real time on the Web. As with other media-related applications, the user-perceived audiovisual quality can be estimated using Quality of Experience (QoE) measurements. This paper analyses the behavior of different objective Full-Reference (FR) models for video and audio in WebRTC applications. FR models calculate the video and audio quality by comparing an original media reference with the degraded signal. To compute these models, we have created an open-source benchmark in which different types of reference media inputs are sent browser to browser while simulating different kinds of network conditions in terms of packet loss and jitter. Our benchmark provides capabilities for recording the impaired WebRTC streams. Then, we compute different existing FR metrics for video (VMAF, VIFp, SSIM, MS-SSIM, PSNR, PSNR-HVS, and PSNR-HVS-M) and audio (PESQ, ViSQOL, and POLQA) using the recordings together with their references. Moreover, we use the same recordings to carry out a subjective analysis in which real users rate the video and audio quality using a Mean Opinion Score (MOS). Finally, we calculate the correlations between the objective and subjective results to find the objective models that better correspond with the subjective outcome, which is considered the ground truth QoE. We find that some of the studied objective models, such as VMAF, VIFp, and POLQA, show a strong correlation with the subjective results in packet loss scenarios.


Introduction
Web Real-Time Communications (WebRTC) is a technology that allows sharing media between web browsers in a standard fashion. The purest form of a WebRTC application follows a peer-to-peer (P2P) architecture, in which a web browser accesses the camera and microphone of a host to send its media (video and audio) in real time to a remote browser.
Practitioners use standard WebRTC Application Programming Interfaces (APIs) in JavaScript to implement this kind of web application, in which HTML5 video tags play WebRTC streams. In addition to P2P decentralized real-time media applications, WebRTC has a significant impact on the development of other kinds of web applications, such as conferencing or broadcasting systems. In conferencing systems, several browsers share their media using the concept of room (i.e., an N to N communication). In broadcasting systems, a browser shares its media to several receivers (i.e., a 1 to N communication). These applications typically require some kind of server-side infrastructure.
Quality of Experience (QoE) is a measurement of the level of satisfaction of the users of an application or system. Amendment 5 of the ITU-T Recommendation P.10/G.100 defines QoE as "the degree of delight or annoyance of the user of an application or service." Therefore, QoE is a user-centric concern, and so, it is not trivial to measure or estimate. This paper analyses the behavior of different objective Full-Reference (FR) QoE models for video and audio in WebRTC applications. To carry out this work, we have created and released as open-source a benchmark in which we assess a WebRTC application in an automated fashion. This benchmark allows recording the received WebRTC streams, which act as impairment signals for the FR models. Then, using the same recordings, we carry out a subjective analysis with real users to try to find correlations between the objective and subjective outcomes.
This piece of research is part of ElasTest (https://elastest.io/), an H2020 European Union project aimed at creating a comprehensive solution to ease the process of development, operation, and maintenance of end-to-end tests for different kinds of applications, including web and mobile [1].

Background
This section provides a summary of the technologies supporting this work, i.e., WebRTC and QoE. Moreover, this section also provides a summary of the latest related works available in the literature.

WebRTC
WebRTC is a Voice over Internet Protocol (VoIP) technology that expands on and integrates existing real-time communications technologies, such as Session Initiation Protocol (SIP)-based architectures [2]. WebRTC allows the development of high-quality applications with media capabilities in web browsers by leveraging three types of JavaScript APIs: (i) MediaStream, to acquire audio and video streams; (ii) RTCPeerConnection, to communicate audio and video in real time; and (iii) RTCDataChannel, to communicate arbitrary application data in real time.
The WebRTC stack takes care of the media plane, providing different capabilities for encoding, decoding, or NAT (Network Address Translation) traversal. However, WebRTC is signaling agnostic, and therefore, there are no standard ways to set up or terminate WebRTC calls, which must be implemented by developers using custom mechanisms. Figure 1 shows the WebRTC protocol stack in the media plane, both for the RTCPeerConnection and RTCDataChannel APIs [3]. As depicted in this figure, browsers send their user media using the RTCPeerConnection API through SRTP (Secure Real-Time Transport Protocol) and custom data using the RTCDataChannel API through SCTP (Stream Control Transmission Protocol). The next protocol layer is DTLS (Datagram Transport Layer Security), which allows securing data exchange between WebRTC peers using encryption techniques. Regarding connectivity, WebRTC peers are usually connected to the Internet behind firewalls and NAT devices. The WebRTC stack relies on the ICE (Interactive Connectivity Establishment) protocol to achieve connectivity between peers. ICE gathers all the available candidates for the connection. Depending on the nature of the NAT devices, STUN (Session Traversal Utilities for NAT) or TURN (Traversal Using Relay around NAT) servers might be required [4].
In WebRTC, low latency and timeliness are more important than reliability. For this reason, UDP (User Datagram Protocol) is the transport protocol of the whole stack. UDP offers no reliability or order of the data and delivers each packet to the upper layers at the moment it is available. All in all, different aspects may affect the quality of WebRTC communications: packet loss (due to using best-effort protocols at transport and network level) and jitter (due to the variation in the latency on a packet flow between peers) [5].
Another relevant mechanism implemented in WebRTC which may affect the perceived quality is the bandwidth estimation (BWE). Modern BWE algorithms implemented in browsers predict congestion by analyzing the delay between packets. These algorithms progressively increase the bitrate used to encode the media according to the estimated bandwidth. When a receiver detects congestion, it sends REMB (Receiver Estimated Max Bitrate) messages to the sender, which adapts the transmission bitrate accordingly [6].
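To make the adaptation loop concrete, the following toy sketch illustrates the general idea of a delay-based rate controller. It is not the actual GCC/REMB implementation; the function name, constants, and thresholds are all illustrative assumptions. The controller probes for more bandwidth while the inter-packet delay stays flat and backs off when a growing delay gradient suggests congestion:

```javascript
// Toy delay-based rate controller (illustrative only, not the real GCC).
function createRateController(initialBitrateKbps) {
  let bitrate = initialBitrateKbps;
  return {
    // delayGradientMs: variation of inter-packet delay measured at the receiver.
    // Returns the new target bitrate, i.e., the value a REMB-like feedback
    // message would report back to the sender.
    onDelayGradient(delayGradientMs) {
      if (delayGradientMs > 1) {
        // Growing delays suggest queue build-up: back off multiplicatively
        bitrate = Math.max(50, bitrate * 0.85);
      } else {
        // No congestion signal: probe for more available bandwidth
        bitrate = bitrate * 1.05;
      }
      return Math.round(bitrate);
    },
  };
}
```

In the real stack, the receiver-side estimate is carried back to the sender in REMB messages, and the sender's encoder adapts its target bitrate accordingly.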

QoE
The term QoE and its underlying concepts were discussed in the context of the European Network on Quality of Experience in Multimedia Systems and Services (Qualinet) starting in 2003. As a result, Le Callet et al. created the Qualinet White Paper on definitions of QoE [7]. This report defines an Influence Factor (IF) as a characteristic that may influence the QoE for the final user and establishes three categories of IFs:

• Human IFs: properties related to the user as a human being, such as demographic and socioeconomic background or emotional state.

• System IFs: technical properties that determine the quality of an application or service, such as media capture, encoding, transmission, or rendering.

• Context IFs: factors of the user's environment in terms of physical (e.g., location, space, and movement), temporal (e.g., time of day, duration, and frequency of use), or economic characteristics (e.g., costs, subscription type, or brand of the service).
QoE assessment is not trivial because user satisfaction is challenging to quantify. Typically, we classify the QoE assessment methods into two categories. First, subjective QoE models quantify the quality perceived by the final users by soliciting their evaluation scores. This score typically follows a Mean Opinion Score (MOS) scale, for instance, from 1 (bad quality) to 5 (excellent quality). On the one hand, subjective methods are costly and time-consuming, but on the other hand, they are considered the most accurate way to measure QoE since the actual users are involved in producing the metric.
Second, objective QoE models compute a metric as a function of QoS parameters and external factors. These methods were developed to avoid involving the users in the assessment process. Objective models are supposed to correlate with subjective values, which are considered the ground truth of QoE. A relevant type of objective assessment is known as media-layer, in which the media signals themselves are used as input to calculate the QoE. There are three categories of media-layer models:

• Full-Reference (FR): these models use a reference for audio or video together with the distorted signal (impairment) to compute the value of QoE.

• Reduced-Reference (RR): these methods have access to partial information regarding the original reference (typically statistical) together with the impairment signal.

• No-Reference (NR): these methods estimate the value of QoE without any knowledge of the original reference.
Due to its importance in this paper, we analyze some of the most relevant FR models available in the literature. Table 1 shows a comprehensive summary of these models for video and audio:

VIF (Visual Information Fidelity)
Image assessment method based on the characteristics of the Human Visual System (HVS) [9]. There are some variants, such as VIFp (VIF in the pixel domain) [10].

SSIM (Structural Similarity)
It measures the structural difference between an original image and its impaired version in terms of luminance, contrast, and structure [11]. There are several models derived from SSIM, for example, MS-SSIM (Multi-Scale SSIM), computed over multiple scales of subsampling [12].
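For reference, the standard formulation of the SSIM index (a fact about the metric's definition, not taken from the paper) combines these three components for two image windows x and y:

```latex
\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
```

where μx and μy are the mean luminances, σx² and σy² the variances, σxy the covariance, and C1 and C2 small constants that stabilize the division.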

MSE (Mean Squared Error)
It is a straightforward metric to compute differences between two video signals [13].
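As a reminder of its standard definition (not taken from the paper), for two m × n frames I (reference) and K (impairment):

```latex
\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i,j) - K(i,j) \right]^2
```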

PSNR (Peak Signal-to-Noise Ratio)
It is the ratio between the maximum possible signal power and the power of the impairment noise [14]. PSNR has several variants, such as PSNR-HVS (extension of PSNR using properties of the HVS) [15] or PSNR-HVS-M (improvement of PSNR-HVS using visual masking) [16].
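Its standard definition (a fact about the metric, not taken from the paper) relates the maximum possible pixel value MAX_I (255 for 8-bit video) to the MSE between the frames:

```latex
\mathrm{PSNR} = 10 \, \log_{10} \left( \frac{\mathit{MAX}_I^2}{\mathrm{MSE}} \right) = 20 \, \log_{10}(\mathit{MAX}_I) - 10 \, \log_{10}(\mathrm{MSE})
```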
VQM (Video Quality Metric)
Linear combination of several impairment parameters [17].

Audio

PESQ (Perceptual Evaluation of Speech Quality)
An audio quality assessment model standardized as the ITU-T recommendation P.862 [18].

ViSQOL (Virtual Speech Quality Objective Listener)
FR model for audio quality assessment that models human speech quality perception using a spectro-temporal measure of similarity [19].

POLQA (Perceptual Objective Listening Quality Analysis)
The successor of PESQ, standardized as the ITU-T recommendation P.863 [20].

Related Work
Reference [21] presents a comprehensive review of QoE assessment for WebRTC applications. This paper first provides an analysis of how WebRTC topologies affect QoE. Second, it proposes a group of Key Performance Indicators (KPIs) to estimate the QoE in WebRTC applications. Finally, it presents a Systematic Literature Review (SLR) on QoE assessment in the WebRTC arena carried out in 2017. Table 2 contains a summary of this review, presenting the aspects of Quality of Service (QoS) and QoE covered in each paper.
We find new works in the field of QoE in WebRTC published after this SLR. For example, Vučić and Skorin-Kapov investigate how different video encoding parameters (e.g., bit rate, resolution, or frame rate) influence QoE in Reference [22]. This paper presents a subjective evaluation of WebRTC calls carried out with mobile devices. Participants reported ratings for different test scenarios of video encoding parameters. At the same time, the authors collected WebRTC statistics using the Google Chrome WebRTC-internals tool. They concluded that old smartphones should use a maximum resolution of 640 × 480 px. Besides, when bandwidth is limited, both resolution and video bitrate should also be reduced. The same authors extended their work by investigating how the Google Congestion Control (GCC) algorithm impacts QoE when different scenarios of packet loss and video codec configuration are applied [23].
Reference [24] analyses the effect on the QoE of Multi-View Video and Audio (MVV-A) contents (e.g., the Olympic Games) transmitted over WebRTC. To do that, the authors carried out a subjective experiment on the perceived QoE of these contents under various network conditions. This work concludes that a high number of viewpoints tends to provide better QoE, while content that requires frequent viewpoint changes degrades the QoE. The same authors extended their work in Reference [25], comparing the QoE of MVV-A contents with transmission methods using MPEG-DASH (Moving Picture Experts Group Dynamic Adaptive Streaming over HTTP) video and audio streaming.
Husić et al. investigate how different context IFs influence the QoE in WebRTC video calls [26]. Using subjective experimentation and statistical analysis, the authors conclude that the task complexity and the duration of WebRTC calls influence the QoE.

Motivation
The real-time nature of WebRTC imposes specific characteristics that might affect the perceived quality for end-users. For the different WebRTC application types (1 to 1, 1 to N, or N to N), we can find a common factor: there is always at least one browser that sends its media (we refer to it as the presenter) and at least one browser that receives that media (we refer to it as the viewer). For this reason, this paper focuses on media-layer FR metrics for calculating the QoE of WebRTC applications, both for video and audio. We aim to evaluate the QoE of video and audio by comparing the media in the viewer (impairment signal) with the media in the presenter (original reference).
Reference [21], introduced in the Related Work section, concludes that an open challenge in the domain of QoE is identifying a group of optimal models to assess the audio and video quality in WebRTC applications. This paper contributes to this vision, providing a continuation of our previous work [42], which presented preliminary results on the evaluation of several video quality models for WebRTC applications. The first aspect in which the current paper improves our previous work is the evaluation of audio and video instead of video only. Some of the participants of the subjective analysis in our previous paper pointed out that assessing video sequences without audio was unnatural. Therefore, it is a possible source of bias when evaluating video quality subjectively. Moreover, in Reference [42], we used a single video showing a close-up of a woman being interviewed, simulating a regular WebRTC call. This video had a size of 640 × 480 px, which is a typical resolution for WebRTC streams. To improve and confirm our previous results, in this paper, we use the same video sequence but, this time, also with audio. Besides, we use another video sequence that is much more complex in terms of video content: a clip about the famous online game "Fortnite" with a resolution of 1280 × 720 px, i.e., High-Definition (HD). Thanks to the use of this reference, we have detected a problem in our previous approach related to video assessment. We realized that the alignment of the impairment's video frames with respect to the reference is not perfect after sending the video with WebRTC. This fact introduces an error factor in the video quality calculation using FR methods. To solve this problem, we have designed and implemented an alignment method for the impairment signal based on overlaying the frame number on the video at the presenter side.
Then, we use an Optical Character Recognition (OCR) mechanism to adjust the video properly at the receiver side before computing the FR video scores. Last but not least, we have changed the System Under Test (SUT) in this paper. In our previous work, we used a WebRTC application built with the OpenVidu framework, which is a Selective Forwarding Unit (SFU) videoconferencing system based on Kurento Media Server [43]. The motivation to use this SUT was simplicity, since OpenVidu allows accessing the RTCPeerConnection JavaScript objects from the global scope of the browser. This feature was required to record the WebRTC stream. As explained later, we have enhanced this process by using the so-called Monkey Patching technique. All in all, we use a simple P2P WebRTC application instead of an SFU-based one. This change reduces the complexity of the SUT, allowing us to analyze the raw capabilities of WebRTC in terms of QoE.

Contributions
The first contribution of this paper is the implementation of an experimental benchmark aimed at calculating different objective FR metrics for audio and video in WebRTC applications. The methodology followed by our benchmark has three stages. First, we generate a proper media input (reference) with video and audio for the QoE evaluation of a WebRTC application. Second, we carry out a WebRTC call between two browsers, using the reference stream previously generated in one browser while recording the received media in the other browser (impairment) under different network conditions in terms of packet loss and jitter. We also considered simulating delay in the network. Delay has a direct influence on the end-to-end latency, which in turn has a significant impact on the conversational QoE. Nevertheless, we did not carry out experimentation based on delay, since this parameter does not affect the media quality calculated with FR models: these models use only one direction of the communication (reference signal in the presenter and impairment in the viewer) to calculate the quality. Third and last, using the reference and impairment streams, we use existing FR models to calculate the QoE both for video and audio. We enable other practitioners to follow the proposed methodology by releasing the benchmark as an open-source GitHub repository.
The second contribution of this paper is a subjective evaluation carried out with real users. We reuse the impairment sequences obtained with the benchmark during the objective analysis. Subjective assessment is considered the actual value of QoE [44]. For this reason, we want to find out the perceived video and audio quality of the obtained impairment sequences subjectively. Then, we analyze the data gathered in this subjective survey to find correlations with the objective results obtained in the experimental benchmark. As a result, we determine which of the selected FR models provide better estimations of video and audio quality in WebRTC applications for different types of input sequences and network conditions.
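To make the correlation step concrete, a Pearson correlation coefficient between paired objective scores (e.g., VMAF per scenario) and subjective MOS values can be computed as in the following sketch (the function name and the sample data in the usage note are illustrative, not taken from the paper's results):

```javascript
// Pearson's r between two equally sized vectors of scores.
// Returns a value in [-1, 1]; values near 1 indicate that the objective
// metric ranks the scenarios the same way the subjective MOS does.
function pearson(x, y) {
  const n = x.length;
  const mean = (v) => v.reduce((a, b) => a + b, 0) / v.length;
  const mx = mean(x), my = mean(y);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (x[i] - mx) * (y[i] - my);  // covariance term
    dx += (x[i] - mx) ** 2;            // variance of x
    dy += (y[i] - my) ** 2;            // variance of y
  }
  return num / Math.sqrt(dx * dy);
}
```

For example, `pearson(vmafScores, mosScores)` over the same set of network scenarios yields a single correlation value per metric, which is how the objective models can then be ranked against the subjective ground truth.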

Benchmark for Objective Evaluation
As introduced before, objective media-layer FR metrics calculate the QoE by comparing an original signal with its degraded version. The first contribution of this paper is the design and implementation of a benchmark to obtain these signals (reference and impairment) for WebRTC applications.
As explained next, we have created this benchmark on top of well-known testing frameworks and tools. Besides, we released the resulting artifacts within an open-source repository (Apache 2 license) available on GitHub (https://github.com/elastest/elastest-webrtc-qoe-meter). The workflow implemented in this repository has three stages: (i) input generation, (ii) SUT evaluation, and (iii) QoE calculation. Figure 2 illustrates this process. The next sections provide fine-grained details of each step.

Input Generation
In the first place, we need a proper reference media sequence (video and audio) to carry out a WebRTC call. This input is used in the second stage to feed the WebRTC stream of a browser. In theory, we could use any modern browser implementing the WebRTC stack, but in practice, Chrome is the browser that provides the widest variety of capabilities to assess WebRTC applications automatically. Specifically, we use the following configuration options to feed custom video and audio for WebRTC:

• --use-file-for-fake-video-capture=/path/to/test.y4m: this flag allows feeding a custom video file (instead of using the actual media from the camera) for WebRTC communications. Chrome requires the format of this file to follow the YUV color encoding system (raw format).

• --use-file-for-fake-audio-capture=/path/to/test.wav: this flag allows feeding a custom audio file (instead of using the actual audio from the microphone) for WebRTC communications. Chrome requires a WAV file format for this (raw format).
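As an illustration, the launch arguments can be assembled with a small helper like the one below. The helper function itself is not part of the benchmark's API, and the paths are placeholders; the additional --use-fake-device-for-media-capture switch is the Chrome flag that replaces real capture devices with fake ones so that the file-based switches take effect:

```javascript
// Builds the Chrome command-line switches that replace the camera and the
// microphone with the reference media files (paths are placeholders).
function fakeMediaArguments(videoY4mPath, audioWavPath) {
  return [
    '--use-fake-device-for-media-capture',
    `--use-file-for-fake-video-capture=${videoY4mPath}`,
    `--use-file-for-fake-audio-capture=${audioWavPath}`,
  ];
}
```

These switches can then be passed to the browser through the automation framework when the presenter browser is started.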
A relevant aspect we need to take care of is the synchronization of the reference and impairment signals. As explained before, to calculate the objective FR metrics for video and audio, the original and impairment sequences need to be perfectly aligned. To achieve this, we generate the reference signal with the following constraints:

• We place the media source between two padding signals, which are discarded in the third stage (QoE calculation).

• We encode the reference video in YUV and the audio in WAV, which are the formats supported by Chrome.

• We use a fixed value of the video frame rate (24 fps). Then, we overlay the frame number within the media to ensure perfect alignment in the third stage (see more details in Section 2.1.3).
We have selected a couple of open video sources available in the Internet Archive (https://archive.org/) released with a Creative Commons license (https://creativecommons.org/licenses/by-sa/3.0/). First, a conversational-type video has been used (https://archive.org/details/e-dv548_lwe08_christa_casebeer_003.ogg). In the video, we can see a close-up of a woman answering the questions of a man behind the camera. This video has a resolution of 640 × 480 px, which is very common in laptops and mobile devices. Then, we use a more complex video in terms of encoding. It is a fragment of the famous online game "Fortnite" explained by a male narrator (https://archive.org/details/ForniteBattle8/fornite+battle+2.mp4). This video has a resolution of 1280 × 720 px.
We have created a shell script, called generate_input_video.sh, to carry out this first stage automatically. To implement this script, we use FFmpeg (https://www.ffmpeg.org/), a powerful open-source toolbox for handling multimedia streams. The steps followed in this script are as follows:

1. Download the video sample from the Internet Archive (using the command-line tool wget).
2. Cut the original video. As explained later, we also use the impairment sequences for subjective analysis. To avoid a very long experiment time, we limit the duration of the reference sequences to 30 s.
3. Overlay each frame number on the video content. As explained later, each frame number is recognized directly from the video content using an OCR tool. For this reason, we need to attach each frame number to the video at this point.
4. Create a padding video based on a test pattern. This padding video sequence is a test source pattern composed of different colored bars (see Figure 3a).
5. Concatenate the final video, attaching the padding at the beginning and the end of the sequence, and the actual content (interview or game) in the middle.
6. Convert the final video to Y4M and the final audio to WAV, as required by Google Chrome.

SUT Evaluation
The next stage consists of carrying out a WebRTC call between a couple of browsers (presenter and viewer). There are several important decisions to be taken in this step. First of all, we need to select the SUT. As introduced before, this paper improves our previous work [42] by employing simple P2P WebRTC calls instead of media-server based conferences. The objective is to evaluate the behavior of the WebRTC stack without any intermediate element, which could affect the final QoE.
To select a proper SUT, we have analyzed the official collection of samples provided by the WebRTC team, the so-called "WebRTC Samples" (https://webrtc.github.io/samples/). Among other test applications, this collection contains a full-featured WebRTC video chat called AppRTC. This application is well-known in the WebRTC community since Maire Reavy (Firefox Media Product Lead) and Serge Lachapelle (Chrome Product Manager) used its early version to carry out the first call between Chrome and Firefox in 2013 [45]. AppRTC follows a P2P architecture, in which the media flows from browser to browser, which is perfect for our needs. This application uses Collider (a WebSocket-based signaling server implemented in Go language) to establish and release the WebRTC call between browsers. All in all, we use the online version of AppRTC (https://appr.tc/) as our SUT.
The next relevant choice is the type of browser to be used. As introduced before, any modern WebRTC-compliant browser, such as Chrome, Firefox, or Edge, could be selected. Nevertheless, in practice, the browser providing the most comprehensive toolbox of capabilities for automatically testing WebRTC applications is Google Chrome. In particular, we use the Chrome options to feed custom video and audio (the reference sequence previously generated) as WebRTC input. This media is encoded in the browser and sent through WebRTC to another browser.
For repeatability, we need to exercise the SUT using the browsers in an automated fashion, and therefore, we need a way to drive Chrome automatically. Selenium WebDriver is a testing framework that allows creating end-to-end tests for assessing the client side of web applications. It allows driving real browsers (such as Chrome or Firefox) automatically from a language binding (such as Java, JavaScript, or Python) [46]. Selenium WebDriver is the de facto standard to interact automatically with a web application through its Graphical User Interface (GUI). Besides, it allows executing custom JavaScript code in the browsers.
We use JavaScript to record the WebRTC stream in the viewer browser. Using Selenium WebDriver, at the beginning of the test, we inject the RecordRTC library (https://github.com/muaz-khan/RecordRTC) in the SUT. RecordRTC is an open-source JavaScript library that acts as a wrapper of the MediaStream Recording API (https://www.w3.org/TR/mediastream-recording/), allowing WebRTC streams to be recorded using JavaScript in a friendly manner. In the WebRTC JavaScript API, RTCPeerConnection objects provide access to MediaStream objects. Selenium WebDriver executes JavaScript from the global scope of the browser. As a result, we need to access the RTCPeerConnection objects in order to record the WebRTC stream with the MediaStream Recording API. To achieve this, we use the so-called Monkey Patching technique, which allows modifying the default behavior of a given piece of code at runtime without changing its source code. In particular, we override the original RTCPeerConnection constructor to collect any RTCPeerConnection in a global array. This array is later accessed with Selenium WebDriver to start, stop, and get the WebRTC streams. We have implemented this technique using the JavaScript snippet below. This snippet forms part of a custom JavaScript library called elastest-remote-control.js. We inject this library in the SUT using Selenium WebDriver before starting the WebRTC call.
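A minimal sketch of the Monkey Patching technique is shown below. The actual snippet in elastest-remote-control.js may differ; the function name and the name of the global array (peerConnections) are illustrative:

```javascript
// Monkey-patch the RTCPeerConnection constructor of the given scope (the
// browser's window object) so that every peer connection created by the SUT
// is also collected in a global array, which can later be read from Selenium
// WebDriver to start and stop the recording of the WebRTC streams.
function patchPeerConnectionConstructor(scope) {
  scope.peerConnections = []; // global array of collected connections
  const OriginalRTCPeerConnection = scope.RTCPeerConnection;
  scope.RTCPeerConnection = function (...args) {
    // Delegate to the original constructor, then register the new instance
    const pc = new OriginalRTCPeerConnection(...args);
    scope.peerConnections.push(pc);
    return pc;
  };
  // Preserve the original prototype so instanceof checks keep working
  scope.RTCPeerConnection.prototype = OriginalRTCPeerConnection.prototype;
}
```

In the browser, calling `patchPeerConnectionConstructor(window)` before the SUT opens its connections is enough: the application keeps using RTCPeerConnection as usual, while the test gains access to every connection through the global array.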
Thanks to the use of this technique, practitioners can reuse our benchmark to assess any kind of WebRTC application following the proposed methodology. To illustrate this, the open-source repository implementing our methodology also provides test examples using different SUTs, for example, based on the Janus WebRTC server (https://janus.conf.meetecho.com/), OpenVidu SFU (https://openvidu.io/), and several WebRTC samples. We use JUnit version 5 to implement an automated test that exercises the SUT through Selenium WebDriver. JUnit version 5 is the latest version of the popular general-purpose testing framework for Java. It provides a brand-new architecture and a programming and extension model called Jupiter [21].
Another relevant technology supporting our benchmark is Docker. Docker allows us to deploy and run any application as a lightweight and portable container on top of a Linux kernel [47]. We use Docker as it allows applications and their dependencies to be shipped as isolated containers. As introduced before, we use Google Chrome as the preferred browser. In particular, we use instances of Chrome executed in Docker containers. This fact allows manipulating the network of the browser without affecting the rest of the elements in the local host. This way, we can execute NetEm (Network Emulator) commands in the browser container in a straightforward manner. NetEm is a Linux package that allows customizing several network parameters to simulate different network conditions, such as delay, jitter, or packet loss [48].
To simplify the use of browsers in Docker containers through Selenium and JUnit 5, we have created and released as open-source the tool Selenium-Jupiter (https://bonigarcia.github.io/selenium-jupiter/). Selenium-Jupiter is an extension for Selenium in JUnit 5 which provides seamless integration with Docker. Figure 3 shows different screenshots of the SUT being controlled automatically by Selenium WebDriver (note the text below the address bar stating that the browser is being controlled by automated test software). All in all, we have implemented a JUnit test case (called AppRtcAdvancedTest.java) that uses Selenium to assess AppRTC automatically. This test accepts different parameters that allow us to configure the packet loss and jitter in the network of the Docker container of the presenter browser. We repeat the experiment 7 times:

• 1 for the ideal scenario, in which neither packet loss nor jitter happens.
• 3 for scenarios with packet loss: 15%, 30%, and 45%, respectively.
• 3 for scenarios with jitter: 25 ms, 50 ms, and 75 ms, respectively (75 ms is the maximum recommended value to avoid jitter distortions [49]).
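For illustration, the NetEm command executed inside the browser's container for each scenario can be built as in the sketch below. The helper is not part of the benchmark's code, and the network interface name (eth0) is an assumption; note that NetEm expresses jitter as a variation over a base delay:

```javascript
// Builds a NetEm command line impairing the container's network with the
// given packet loss (%) and jitter (ms), matching the scenarios above.
function netemCommand({ iface = 'eth0', packetLoss = 0, jitterMs = 0 } = {}) {
  const parts = ['tc', 'qdisc', 'add', 'dev', iface, 'root', 'netem'];
  if (packetLoss > 0) parts.push('loss', `${packetLoss}%`);
  if (jitterMs > 0) parts.push('delay', '0ms', `${jitterMs}ms`); // base delay + jitter
  return parts.join(' ');
}
```

For example, the 15% packet loss scenario corresponds to `netemCommand({ packetLoss: 15 })`, which would be executed inside the presenter's Docker container.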

Survey for Subjective Evaluation
After the objective experimentation, we carried out a subjective evaluation, reusing the impairment sequences previously obtained as input. The objective is twofold. On the one hand, we want to find correlations between the quality perceived (in video and audio) by real users and the objective results previously obtained. On the other hand, we want to verify that the implemented alignment process does not affect the perceived video quality in the eyes of real users.
We have designed this subjective study following the general perceived performance estimation process defined in ITU-T Recommendation G.1030 [50] and BT.500-11 [51]. Thus, we ask a group of users to watch the audiovisual impairment sequences and to rate these sequences using a five-point MOS scale as follows: 1 = bad, 2 = poor, 3 = fair, 4 = good, and 5 = excellent. To balance out the bias produced by expectations regarding quality (for example, when sequences are presented in decreasing order, from excellent to poor quality), the sequences are presented in random order in the survey.
Before releasing the actual survey, we carried out a pilot survey using a smaller group of users. This pilot allows us to check the correctness of the survey. Besides, we use the pilot to evaluate how the alignment process is perceived by this group of users. Thus, we present each sequence together with its equivalent after the alignment process. We assume that, if the alignment process is correct, the perceived video quality of both sequences should be the same. We select a small group of participants for the pilot, made up of expert engineers in the research group of Universidad Rey Juan Carlos (URJC).
After the pilot, in the general availability survey, we use only the impairment sequences. This survey contains a total of 16 video sequences of 30 s each: 2 video references (game and interview) and 7 impairments per video type (game and interview): ideal network conditions, packet loss (15%, 30%, and 45%), and jitter (25 ms, 50 ms, and 75 ms). We ask each participant to rate the 14 impairment sequences, with an estimated completion time of 10 minutes. To obtain a heterogeneous user group and limit the bias due to human or context IF [7], we asked for participation in the survey through different channels: the public forum about WebRTC discuss-webrtc (https://groups.google.com/forum/#!forum/discuss-webrtc), the Reddit channel about WebRTC (https://www.reddit.com/r/WebRTC/), and our Twitter accounts.

Results
This section provides a comprehensive summary of the experimentation results, both for the objective and the subjective analysis. The raw data for this analysis are publicly available in the GitHub repository (https://github.com/elastest/elastest-webrtc-qoe-meter), together with the rest of the artifacts of the benchmark.

Objective Analysis
Regarding objective assessment, we have compiled different data for video (VMAF, VIFp, SSIM, MS-SSIM, PSNR, PSNR-HVS, and PSNR-HVS-M) and audio (PESQ, ViSQOL, and POLQA) for two video sequences with different contents (interview and game). To interpret the results correctly, it is essential to be aware of the value ranges of each objective model:
• The VMAF score ranges from 0 (lowest quality) to 100 (the distorted and reference videos are identical) [52].
• The VIFp score ranges from 0 (lowest quality) to 1 (the distorted and reference videos are the same) [10].
• The SSIM and MS-SSIM indices range from 0 (no structural similarity) to 1 (two identical sets of data) [11].
• PSNR, PSNR-HVS, and PSNR-HVS-M range between 20 dB (lower quality) and 60 dB (good quality) [53].
• PESQ, ViSQOL, and POLQA are perceptual models for audio quality whose scores range from 1 (bad) to 5 (excellent) [18,20,54].
Figure 4 shows the average of the objective outcomes for the interview sequence. The first chart (labeled as a) shows the evolution of the video quality for different packet loss scenarios (0%, 15%, 30%, and 45%). On the one hand, we display the results for VMAF, PSNR, PSNR-HVS, and PSNR-HVS-M on the primary y-axis of the chart, since these values range from 0 to 100. On the other hand, we display VIFp, SSIM, and MS-SSIM on the secondary y-axis, since their values range from 0 to 1. As expected, the video quality measured with these objective models decreases when the packet loss ratio is higher. We observe an equivalent behavior in Figure 4c, in which we calculate the video quality of the interview sequence in different jitter scenarios (0 ms, 25 ms, 50 ms, and 75 ms). Figure 4b shows the evolution of the audio quality for the interview sequence using the same packet loss scenarios. Again, the results are as expected, since the audio quality decreases as the packet loss increases.
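Because the models report on heterogeneous scales (VMAF 0–100, VIFp/SSIM 0–1, PESQ/ViSQOL/POLQA 1–5), plotting them on one chart requires mapping each score onto a common range, as done with the primary and secondary y-axes above. A minimal sketch of such a mapping (the `toPercent` helper and the `RANGES` table are our own illustration, not part of the benchmark):

```javascript
// Nominal score ranges for each objective model, taken from the
// descriptions above (illustrative; not part of the benchmark itself).
const RANGES = {
  VMAF: [0, 100], // 0 = lowest quality, 100 = identical to reference
  VIFp: [0, 1],
  SSIM: [0, 1],
  "MS-SSIM": [0, 1],
  PSNR: [20, 60], // dB; approximate practical range
  PESQ: [1, 5],   // MOS-like perceptual scale
  ViSQOL: [1, 5],
  POLQA: [1, 5],
};

// Map a raw score onto a common 0-100 scale for side-by-side plotting.
function toPercent(model, score) {
  const [min, max] = RANGES[model];
  return (100 * (score - min)) / (max - min);
}
```

With this helper, a VIFp of 0.5 and a PESQ of 3 both map to 50, which makes it possible to plot all the metrics against one shared y-axis.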
Figure 4d shows the evolution of the audio quality for the same sequence but, this time, in the different jitter scenarios. It is relevant to observe that, this time, the three objective models under study provide from fair to good audio quality results: PESQ averages 3.72 (standard deviation of 0.27), ViSQOL 4.28 (standard deviation of 0.06), and POLQA 4.46 (standard deviation of 0.33). According to these results, jitter does not greatly impact the quality estimated by the objective FR models for audio under study. It is also remarkable that the audio quality can even increase as the jitter rises (this happens in the case of POLQA in Figure 4d). As discussed in Reference [54], this effect might occur in ViSQOL and POLQA due to the alignment and warping within their algorithms to deal with the perceptual impact of playout adjustments caused by jitter buffers. Figure 5 shows the average of the objective results for the second audiovisual input under study, i.e., the game sequence (1280 × 720 px). The results are coherent with the behavior observed in the interview sequence. Video and audio quality are monotonically decreasing functions in the different packet loss scenarios (Figure 5a,b). The same happens with the video quality in the different jitter scenarios, but again, the audio quality remains almost constant when calculated with different values of jitter: PESQ averages 3.65 (standard deviation of 0.2), ViSQOL 4.46 (standard deviation of 0.27), and POLQA 4.5 (standard deviation of 0.38). In light of the results, we can confirm that the packet loss rate is a relevant parameter for the objective QoE in WebRTC: when this rate is higher, the computed objective video and audio quality is lower. This behavior is consistent across the two sequences under study. On the other hand, we verified that the jitter scenarios (0 ms, 25 ms, 50 ms, and 75 ms) affect the objective video quality; however, they do not affect in the same way the audio quality computed with PESQ, ViSQOL, and POLQA.
Even though the shapes of the resulting charts are similar in the two sequences under study (interview and game), a relevant difference between them is the absolute values of video quality.
For example, the average VMAF score of the interview sequence is 76.18, with a standard deviation of 10.45, in the packet loss scenarios, and 69.3, with a standard deviation of 15.5, in the jitter scenarios. For the game sequence, the VMAF averages in these scenarios (packet loss and jitter) are 52.23 (standard deviation of 15.25) and 42.43 (standard deviation of 17.9), respectively. This is another relevant finding, since it confirms that WebRTC is suitable for conversational video calls; however, the QoE is impacted when delivering complex videos. As a final remark on the objective experimentation, we have compared the video quality results using the video impairment sequence before and after the alignment process. The objective of this part is to quantify the level of improvement obtained thanks to the alignment process. To simplify it, we focused on a single objective model for video: VMAF. As explained next, this model shows from notable to strong correlation with the subjective results. Focusing on the game sequence in ideal network conditions (i.e., without packet loss or jitter), the VMAF obtained using the sequence before the alignment process as input is 46.17, while the VMAF obtained after the alignment process is 66.52. This means that the alignment process improves the objective result obtained with VMAF by 20.35%. As shown in Figure 6a, this level of improvement decreases as the packet loss rate increases (5.43% for 15% packet loss, 3.33% for 30% packet loss, and 1.67% for 45% packet loss). We observe the same behavior in the case of jitter when assessing the game sequence (Figure 6b). Nevertheless, the results are very different when using the interview sequence. Focusing again on the ideal scenario (neither packet loss nor jitter), the VMAF obtained before the alignment process is 84.83, while the result after the alignment process is 88.47 (i.e., an improvement of only 3.64%). We find equivalent outcomes in the rest of the packet loss scenarios and also in the jitter scenarios (Figure 6b).
To understand the reason behind these figures, we need to consider the content of the video under test. On the one hand, the interview sequence, which always shows a close-up of a woman, does not change significantly from frame to frame. For this reason, if a frame is skipped or duplicated, the resulting VMAF does not suffer from significant biases. On the other hand, the rapid movement of the game sequence causes relevant changes in the VMAF score when frames are not appropriately compared with the reference. All in all, we can conclude that the alignment process is especially relevant when assessing the QoE using FR objective models with non-conversational sequences, such as online games or sports.
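The alignment idea can be illustrated in a few lines. Given the frame numbers recovered from the recording (the paper's alignment uses OCR of a frame counter stamped in the video track), a recorded frame can be matched to each reference frame, ignoring duplicates and re-using the last received frame where the recording skipped one. This is a simplified sketch of the idea, not the benchmark's actual implementation:

```javascript
// ocrFrames: frame numbers read from the recorded (impaired) video,
// possibly containing duplicates and gaps, e.g., [1, 1, 3, 4, 6].
// totalFrames: number of frames in the reference sequence.
// Returns, for each reference frame 1..totalFrames, the position of the
// recorded frame to compare against it (the nearest previous one seen).
function alignToReference(ocrFrames, totalFrames) {
  // Remember the first recorded position of each frame number.
  const firstSeen = new Map();
  ocrFrames.forEach((frame, pos) => {
    if (!firstSeen.has(frame)) firstSeen.set(frame, pos);
  });
  const aligned = [];
  let lastKnown = 0; // fall back to the last frame actually received
  for (let f = 1; f <= totalFrames; f++) {
    if (firstSeen.has(f)) lastKnown = firstSeen.get(f);
    aligned.push(lastKnown);
  }
  return aligned;
}
```

For example, `alignToReference([1, 1, 3, 4, 6], 6)` yields `[0, 0, 2, 3, 3, 4]`: the skipped frames 2 and 5 are replaced by the last received frame, and the duplicate of frame 1 is ignored, so each reference frame is compared against the closest valid recorded frame.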

Subjective Analysis
As explained before, we use the media recordings obtained in the objective experimentation to also carry out a subjective analysis. The first part of the subjective analysis is a pilot, executed with a group of 7 experts. The objective of this pilot is twofold. First, we want to detect potential issues in the instrument before releasing it to the community. Second, we want to evaluate the accuracy of the alignment process. To that aim, we ask the participants to evaluate the video and audio quality of the recorded test sequences before and after the alignment process. The hypothesis is that, if the alignment process is correct, the scores for both sequences should be equivalent. Figures 7 and 8 show the evolution of the video and audio quality scored in the pilot for the interview and game sequences, respectively. We can see at first glance that the shape of the functions is quite similar in all cases. For the interview sequence, the average difference between both results (original and aligned sequences) is 0.27, with a standard deviation of 0.46, for video quality, and also 0.27 on average (standard deviation of 0.49) for audio. For the game sequence, the average difference is 0.19 (standard deviation of 0.29) for video quality and 0.25 (standard deviation of 0.44) for audio.
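The pilot comparison boils down to the mean and standard deviation of the per-scenario score differences between the two versions of each sequence. A small sketch of that computation (the helper names and the example arrays in the usage note are hypothetical, not the pilot data):

```javascript
// Mean of an array of numbers.
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Sample standard deviation.
function stdDev(xs) {
  const m = mean(xs);
  const variance =
    xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
  return Math.sqrt(variance);
}

// Absolute per-scenario difference between the MOS given to the
// original version and to the aligned version of the same sequence.
function scoreDifferences(original, aligned) {
  return original.map((score, i) => Math.abs(score - aligned[i]));
}
```

For instance, with hypothetical MOS scores `[4, 3, 2, 4]` (original) and `[4, 3.5, 2, 3]` (aligned), `scoreDifferences` gives `[0, 0.5, 0, 1]`, whose mean of 0.375 quantifies how much the alignment changes the perceived quality.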
In light of this result, we concluded that our alignment process does not affect the perceived video and audio quality, because the expert users did not observe relevant differences between the two versions of each sequence. All in all, in the next part of the subjective study, we only use the aligned sequences. This way, we ask for participation in a survey to rate the WebRTC impairment sequences in the ideal scenario and also in the case of network problems (packet loss and jitter), giving a total of 14 videos. We obtained a total of 32 valid answers (12.5% women, 87.5% men) from 11 different countries (Colombia, Germany, India, Ireland, Mexico, Pakistan, Russia, Spain, Turkey, the United Kingdom, and the United States). The age of the participants ranges from 24 to 51 years (average 32.53 years, standard deviation of 8.26 years). Figure 9 illustrates the results. These charts show the evolution of the average MOS for video and audio for both sequences (interview and game) in the different network scenarios.
Inspecting these charts, we can observe that the video and audio scores in the packet loss scenarios, as well as the video scores in the jitter scenarios, are monotonically decreasing functions. In contrast, the audio quality scores in the jitter scenarios remain almost constant. In other words, the shape of the functions is equivalent to the results obtained in the objective part (Figures 4 and 5). To check the extent to which the subjective and objective outcomes change together, we calculate the correlation coefficient between the MOS results for video and audio and the data obtained in the objective experimentation. To that aim, we use the Pearson correlation, which evaluates the strength and direction of the linear relationship between pairs of continuous variables [55]. The Pearson correlation coefficient (r) varies from 0 (no correlation) to ±1 (perfect correlation). A positive correlation indicates a direct relationship between the variables (i.e., they change together), while a negative correlation implies an inverse relationship (i.e., they change in opposite directions).
We use the IBM SPSS (Statistical Package for the Social Sciences) Statistics toolbox to calculate the Pearson correlation between the MOS scores obtained in the subjective survey and the data from the objective experiment. This tool also provides the statistical significance (p) of the correlation. We use the value of p to reject a null hypothesis when p is lower than a given threshold. In our case, the null hypothesis (H0) is that the subjective and objective outcomes are not correlated. Typically, researchers claim statistical significance when p ≤ 0.05. Nevertheless, to avoid controversy [56], we use a more stringent threshold in our research, i.e., p ≤ 0.01 [57]. Table 3 shows a summary of the Pearson correlation (r) and the statistical significance (p) of our dataset. Each column in this table corresponds to an experimental network scenario (packet loss or jitter) for each test sequence (interview and game). Each cell contains the values of r and p for the objective models (VMAF, VIFp, SSIM, MS-SSIM, PSNR, PSNR-HVS, PSNR-HVS-M, PESQ, ViSQOL, and POLQA) crossed with the MOS score from the subjective survey. A relevant conclusion of this analysis is that none of the objective models for video and audio correlates strongly in all the studied scenarios. Regarding video, we only find a strong correlation between VMAF (r = 0.991, p = 0.009) and VIFp (r = 0.992, p = 0.010) and the MOS score in the case of the interview sequence in the different packet loss scenarios. This fact is coherent with our previous research [42]. Regarding audio, we only find a strong correlation in the same situation (interview with packet loss), between POLQA and MOS (r = 0.993, p = 0.007).
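The Pearson coefficient itself is straightforward to compute. The study used SPSS; the following self-contained sketch just illustrates the formula applied to pairs of subjective and objective values:

```javascript
// Pearson correlation coefficient r between two equal-length samples,
// e.g., MOS scores and VMAF values across the four network scenarios.
function pearson(xs, ys) {
  const n = xs.length;
  const avg = (a) => a.reduce((s, v) => s + v, 0) / n;
  const mx = avg(xs);
  const my = avg(ys);
  let num = 0, dx2 = 0, dy2 = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx;
    const dy = ys[i] - my;
    num += dx * dy;   // covariance term
    dx2 += dx * dx;   // variance of x
    dy2 += dy * dy;   // variance of y
  }
  return num / Math.sqrt(dx2 * dy2);
}
```

An r close to +1 means the two series change together (as found for VMAF and MOS on the interview sequence under packet loss), while an r close to −1 means they change in opposite directions.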
Another significant finding is the difference in the correlation results for audio with jitter. While we find reliable results in the rest of the scenarios (video and audio with packet loss, and video with jitter), the statistical correlation depicted in Table 3 reflects a poor correlation for audio with jitter in both sequences (interview and game). For example, the results of PESQ in the interview sequence are r = 0.445, p = 0.555, while POLQA for the same sequence gives r = −0.160, p = 0.840. The reason for this heterogeneous correlation seems to be related to user perception. From an objective point of view, in light of the results presented in Section 3.1, the audio quality in the jitter scenarios remains tightly clustered. Nevertheless, the video quality gets worse as the jitter increases. This fact seems to affect the scores given by the participants to the audio quality, and as a result, the correlation for audio does not show any linear relationship.

Limitations
We have detected several limitations in the presented approach. First, the proposed alignment process for the impairment sequences is based on OCR recognition of the frame number labeled within the video track. This process helps to detect and compensate for skipped frames in the video track. Nevertheless, the audio remains the same before and after the alignment process, which might cause slight differences in the duration of the video and audio tracks of the resulting aligned media.
Second, the participant recruitment method (i.e., asking for participation in the WebRTC discuss forum, Twitter, and Reddit) is not ideal because, in practice, the demographics are not as heterogeneous as we would like in order to avoid biases. Different participant recruitment models could be considered in the future to achieve a more widespread sample.
Lastly, the chosen parameters for the jitter tests (0–75 ms) have significantly less impact on the subjective MOS scores for audio than for video. This is likely a result of the WebRTC jitter buffer using playout adjustments to deal with jitter of this magnitude, so that the jitter is largely imperceptible [58]. As a result, the ability to evaluate the audio quality under jitter conditions is limited. A future study could test a wider range of jitter values to determine whether it yields a more comprehensive range of quality ratings.

Discussion
This paper presents a research effort in the field of QoE assessment of WebRTC applications using FR models for video (VMAF, VIFp, SSIM, MS-SSIM, PSNR, PSNR-HVS, and PSNR-HVS-M) and audio (PESQ, ViSQOL, and POLQA). We created and released as open source a benchmark aimed at evaluating these models. Practitioners can reuse this benchmark to assess any WebRTC application, following a three-tier methodology: (i) input generation; (ii) SUT evaluation; and (iii) QoE calculation.
We have used the AppRTC video chat as SUT. Nevertheless, our benchmark is prepared to evaluate any other WebRTC application in an automated fashion. Thanks to the monkey patching technique, we can collect the RTCPeerConnection objects created with the JavaScript WebRTC API and then record the media streams using the MediaStream Recording API. All in all, we can record any WebRTC stream using this capability. Then, we use these recordings together with the reference as input to different FR models to calculate the video and audio QoE. In addition to AppRTC, our GitHub repository contains test examples using other SUTs, such as Janus, OpenVidu, or the WebRTC samples.
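The monkey patching idea is to wrap the `RTCPeerConnection` constructor so that every connection the application creates is captured before the application code uses it. A simplified sketch of the technique (the actual benchmark additionally attaches the MediaStream Recording API to the captured streams; `patchConstructor` and `capturedConnections` are our own illustrative names):

```javascript
// Collected RTCPeerConnection instances, available later for inspecting
// or recording their media streams.
const capturedConnections = [];

// Replace a constructor on a target object (window in the browser) with
// a wrapper that records every instance created by the application.
function patchConstructor(target, name) {
  const Original = target[name];
  target[name] = function (...args) {
    const instance = new Original(...args);
    capturedConnections.push(instance);
    return instance;
  };
  // Preserve the prototype so instanceof checks keep working.
  target[name].prototype = Original.prototype;
}

// In a browser, the patch would be installed with
//   patchConstructor(window, "RTCPeerConnection");
// before the application under test creates any connection.
```

Because the wrapper is installed before the application runs, no change to the SUT's code is needed, which is what makes the benchmark applicable to any WebRTC application.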
In WebRTC, browsers can change the bitrate used to encode the media according to the estimated bandwidth. As a result, received WebRTC streams can differ from the original reference in the number of frames. For this reason, we have designed an alignment process to compare the impairment sequence accurately with the reference. In light of the results, we have checked that this alignment process is required when the media sent over WebRTC is non-conversational, for example, online games or sports.
Finally, we have carried out a subjective analysis reusing the impairment recordings obtained in the objective assessment. We aimed to determine the correlations between the MOS scores obtained in the subjective survey and the objective results. We find that none of the studied FR models shows a strong correlation with the MOS outcomes across all the network scenarios (packet loss and jitter) and the different types of media sequences (conversational and online game). We believe this is an opportunity for future research, since there is a lack of FR models for video and audio specific to real-time communications, such as WebRTC.

Conflicts of Interest:
The authors declare no conflict of interest.