by Abdul Wahab*, Nafi Ahmad and John Schormans

Reviewer 1: Anonymous
Reviewer 2: Anonymous

Round 1

Reviewer 1 Report

The paper systematically investigates the impact of QoS sampling issues and user diversity (in ratings) on QoE confidence intervals. Most prominently, it reveals that confidence intervals (i.e. the degree of uncertainty in the user ratings) are maximal in the most prominent range of packet loss rates (between 0.001 and 0.01), which is some kind of bad news for anybody evaluating QoE by means of QoS.

The discussion of the shapes in Figures 2-5 deserves some more attention in the explanations. IMHO, it is not a coincidence that this behaviour is observed. It is natural that variations are largest in the middle of the scale (cf. reference [21]), and with your models, this coincides with PLR between 0.001 and 0.01. While for small PLR the sensitivity is high (an inherent property of the exponential model), for large PLR the ratings are just low.

Section 4.3, paragraph 3: Talking about errors in the context of user ratings is not appropriate -- it is natural that different users have different views, in particular when quality is debatable (i.e. neither top nor crappy). And -- the user is always right ;)

However, it has to be realised that this particular behaviour is so far confirmed only for the particular application and model under study. Thus, the above conclusion cannot be more than a warning flag. More applications and models would have to be studied in order to reach more generic conclusions (even if, according to the reviewer's experience, 1% packet loss seems to be some kind of critical limit for many applications). Indeed, the validity of the results is conditioned on the QoE-QoS model used. The authors have used models with R-squared values of more than 90%, which is good, but not optimal. Furthermore, the models are for loss and jitter only, not for a combination of both. Surely, jitter is not as critical as loss, but the shown correlation values (weak positive) indicate that loss and jitter have a slight tendency to appear together. So to which extent can they strengthen each other's impact when they appear together? The authors might want to comment on these limitations, and take them as inspiration for follow-up work.

Regarding jitter, it would be good to state properly _which_ jitter definition is used here (there are several jitter definitions around -- maximal deviation, standard deviation of inter-packet times, etc.), including the underlying time window, in order to facilitate replicability of the results.

The other challenge -- according to the reviewer's experience -- is how to generate realistic jitter patterns in the user experiments, in particular if the shaper changes the order of packets... NetEm is not (only) to blame for this; the user is responsible for the modelling aspects, i.e. for matching reality and emulation. (BTW, the same goes for random packet loss, which may not be the case in reality -- burst loss can have a rather big additional impact on QoE.)

Regarding the explanations of Table 4: the absolute value of the standard error (in %) increases with the PLR, but the relative value decreases from 0.31 (at PLR = 0.05%) to 0.031 (at PLR = 5%). As with all emulated and simulated processes, approaching the tails increases the relative variability. This implies a larger risk of deviations from the configured values at small PLRs, unfortunately in the realm of greatest sensitivity of QoE estimation to QoS (according to the model). Some clarifications would be welcome.
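The quoted relative standard errors are consistent with a Bernoulli loss model; this can be checked with a short sketch (the sample size n = 20,000 is an assumption chosen here for illustration, not necessarily the paper's actual sample size):

```python
import math

def rel_se(p: float, n: int) -> float:
    """Relative standard error of a Bernoulli PLR estimate:
    SE(p_hat) / p = sqrt(p * (1 - p) / n) / p."""
    return math.sqrt(p * (1 - p) / n) / p

n = 20_000  # assumed number of probe packets
for p in (0.0005, 0.05):  # PLR = 0.05% and 5%
    print(f"PLR = {p:.2%}: relative SE = {rel_se(p, n):.3f}")
```

With this n, the relative standard error drops from roughly 0.32 at PLR = 0.05% to roughly 0.03 at PLR = 5%, matching the trend the review describes.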

A detail w.r.t. the confidence interval (CI) formula in (6): strictly, Z = 1.96 applies for an infinite number of samples n; please remind the reader of the number of samples that was used to calculate the CI. (If n < 40, the Student t-distribution could be used, Z => t_{n-1,0.975}, which is typically larger than 2. However, for "sufficiently large n", the Normal approximation works well.)
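The gap between the Normal and Student t quantiles can be illustrated numerically (a sketch using scipy; the sample sizes are examples, not the study's actual n):

```python
from scipy.stats import norm, t

# Two-sided 95% confidence: quantile at 0.975.
z = norm.ppf(0.975)        # Normal approximation, valid for large n
print(round(z, 3))

# For small n, the t quantile with n - 1 degrees of freedom is larger,
# so the Normal approximation understates the CI width.
for n in (10, 40, 1000):
    t_crit = t.ppf(0.975, df=n - 1)
    print(n, round(t_crit, 3))
```

For n = 10 the t quantile is about 2.26, i.e. the Normal approximation would make the CI roughly 15% too narrow; by n = 1000 the two are practically indistinguishable.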

List of abbreviations: Please consider using alphabetical order.

When proofreading the manuscript for the next time, the authors are welcome to pay specific attention to missing/superfluous articles.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

- Abstract: "delay, delay jitter, and Packet Loss Ratio" -> I think these are more generally known as latency, jitter and packet loss.
- Abstract: "Prior literature has studied sampling errors in QoS measurements; however, there is no account of propagation of these sampling errors to QoE evaluation." -> First, these should be two sentences. Secondly, while I understand the sentence itself, I think everything that is after "however" should be re-phrased, as it is quite hard to read. Probably, something like: "However, the propagation of these errors in QoE has not been evaluated before or quantified"
- Abstract: What is MOS?
- Introduction: "Huawei and Telefonica are collaborating on projects such as" -> can you add a reference?
- Introduction: "of VoD applications.[3]." -> extra dot before citation
- Introduction: "Authors in [4,5]" -> do not use citation numbers instead of words. Add the name of the authors and the citation. Also, why is this sentence in the introduction instead of related work, and what is the purpose of this sentence? Why do we care about this? This should be clarified.
- Introduction: "There are no accounts that report the propagation of the errors" -> ok, but why do we care about this? Why hasn't anyone cared about this before this work, and what would be the value of finding an answer to this question? I think this needs to be clarified.
- Introduction: "we use data acquired by our industrial collaborator Teragence using sampling" -> has Teragence performed the sampling or the authors? This seems unclear to me.
- Introduction: I suspect paragraphs 4 and 5 are meant to be the contribution of this work. I find it very poorly organised and unclear. What QoE functions? What technique? What earlier work? Was this introduced before?
- Introduction: "The sampling error is expressed as 95% confidence intervals" -> so what? why do I care as a reader?
- Methodology: In the first paragraph of this section it says "The QoS parameters considered in this paper are PLR and jitter.", then in the first paragraph of Section 3.1 it says "captured data consists of measurements of PLR, delay, jitter, the location of the mobile device and the mobile operator". Why is this different? Why are only two features in the first sentence and then four in the next? Why are delay and location not considered, even though they are captured?
- Conclusion: What is the significance of the results? Why are they important? Is there any criticism of the results?
- Conclusion: What is some future work?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

The paper has improved a lot; however, please focus on making the following small changes:

- Abstract: "Previously research" -> "Previous research"
- Abstract: "Previously research has focussed on sampling errors in QoS measurements. However, the propagation of these sampling errors in QoS through to the QoE values has not been evaluated before." -> Why do we care about this? Why do we need an answer to this? Why do we need this evaluation? What happens if we don't know the effect of the propagation of sampling errors in QoS?
- Introduction: "How the error in QoS propagates to QoE depends on the mathematical function that links them." -> which error? also, this is phrased like a question. Re-phrase to a statement.
- Introduction: "All these studies concluded with a mathematical relationship between QoE and QoS." -> which is?
- Results: "points for jitter, We presented the" -> should be a full stop instead of comma

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

The manuscript raises an interesting issue, namely how errors in the QoS sampling obstruct QoE estimates. The reviewer is happy to see the intense use of QoE models and their parameters. While the overall insights appear very interesting, there are some question marks and issues that need clarification in order to better understand the results and the underlying phenomena.

Equation (1) expresses a standard error rather than a standard deviation. The same applies to equation (8), for the term behind Z. The reviewer could not see an immediate problem, as the factor n is captured, but the notions should be used in the right way in order to avoid confusion.
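The distinction can be made concrete with a short sketch (the ratings below are illustrative values, not data from the paper):

```python
import math
import statistics

samples = [4.1, 3.9, 4.3, 3.8, 4.0, 4.2, 3.7, 4.1]  # illustrative ratings
n = len(samples)

sd = statistics.stdev(samples)  # standard deviation: spread of the data
se = sd / math.sqrt(n)          # standard error: uncertainty of the mean
print(f"sd = {sd:.3f}, se = {se:.3f}")
```

The standard error shrinks as n grows, while the standard deviation does not; confusing the two changes the meaning of a confidence interval.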

Actually, the use of PLP for _measured_ packet loss "probabilities" is confusing. Apart from the fact that probabilities cannot be measured, only estimated, it might be helpful to flag the nature of the data in the name, e.g. call it "Packet Loss Ratio" (PLR). BTW, having 800+/-400 packet volleys à 20 packets (i.e. 8e3 to 2e4) to reliably estimate PLP (rather PLR) in the order of 1e-4..1e-3 sounds rather courageous in itself (and demonstrating the implications of this is probably one of the main messages of the manuscript, right?). 
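The scepticism about estimating such small loss ratios can be quantified with a quick Bernoulli calculation, using the packet counts quoted above (a sketch, not the paper's actual measurement procedure):

```python
import math

def prob_zero_losses(p: float, n: int) -> float:
    """Probability that n independent packets show no loss at all."""
    return (1 - p) ** n

for n in (8_000, 20_000):       # 800 +/- 400 volleys of 20 packets
    for p in (1e-4, 1e-3):
        print(f"n = {n}, PLR = {p}: P(no losses) = {prob_zero_losses(p, n):.4f}")
```

At PLR = 1e-4, even the largest sample (20,000 packets) yields an estimate of exactly zero loss in roughly one run out of seven, which underlines how courageous such small-PLR estimates are.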

There is another concern when it comes to the use of netem for low PLP values. According to the experience of the reviewer, shapers are anything but optimal in their operation (unfortunately, s/he cannot give references, as that might reveal her/his identity). The smaller the (configured) PLP becomes, the greater the deviation between PLP and the (actual) PLR might become. It would be interesting to know how strongly this effect impacts the results -- i.e. not only the sampling, but even the generation of the loss is suboptimal.

Equation (4): it might be worthwhile to even test a multiplicative approach, and compare to the additive approach. 

It would be nice to incorporate the units right into equations (5) (Jitter => Jitter/ms). The same goes for Tables 2 and 3.

Equation (7): why is it given if it is not used at all?

Figures 4-7 deserve more explanations. How much of the "big bump" between PLP/PLR (?) values of 1e-4 and 1e-3 can be attributed to perception, how much to sampling issues, and how much to shaper (mal-)function? It would be helpful to explain the curves together with the absolute average QoE (= MOS) values, or -- maybe even better -- take a look at the QoE distributions for the different settings. BTW, there is a very interesting paper, "SOS: The MOS is not enough" by Hossfeld et al., discussing the standard deviation of opinion scores, which is maximal in the middle of the scale and (naturally) reduced towards the edges.

The related work on network measurement techniques (second-to-last paragraph in Section 2) is limited/biased towards one co-author's own work.

Reviewer 2 Report

In general, a clear, research-based structure of the manuscript is missing. Please provide clear research questions, maybe also hypotheses, which are answered/evaluated explicitly in the paper. Also, the (empirical) methods used are not well explained/defined, e.g., why did you use field measurements and user tests? How are they linked?

A lab study with only 11 participants does not meet methodological standards. Please use at least 25 participants.

The subsections in Section 4 are not connected in a way that lets the reader understand why these actions take place.

It is unclear why it was necessary to group the network data by provider. What is the benefit here?

Please also provide the significance of the Pearson correlation coefficient.
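For what it's worth, scipy reports a two-sided p-value alongside the coefficient (a sketch with made-up PLR/jitter values, not the paper's measurements):

```python
from scipy.stats import pearsonr

# Hypothetical PLR / jitter-in-ms samples, loosely mimicking a positive link.
plr    = [0.001, 0.004, 0.002, 0.008, 0.003, 0.006, 0.005, 0.009]
jitter = [2.1,   3.0,   2.4,   3.8,   2.2,   3.1,   2.9,   4.0]

r, p_value = pearsonr(plr, jitter)
print(f"r = {r:.3f}, p = {p_value:.4f}")  # report both, not just r
```

Reporting the p-value lets the reader judge whether the weak positive correlations in the paper are distinguishable from noise.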

It is unclear if equation 6 is based on previous work. Was it your idea?

Please provide Figure 1 in higher quality, i.e., the resolution is too low.

In the caption of Table 2: what does "operation points" mean?

The conclusion section is not comprehensive. Please add a discussion section in which you critically discuss your approach, list shortcomings, etc. Furthermore, clear answers to the stated research questions are needed.

If the Pearson correlation is 0, it means that there is no LINEAR correlation. It does not mean that there is no correlation in general.
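A minimal sketch of this point (illustrative data only): a variable can be perfectly determined by another and still have zero Pearson correlation.

```python
def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]   # y is fully determined by x, yet...
print(pearson_r(xs, ys))   # ...the linear correlation is exactly 0
```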

Please provide more details about the platform/tools of your industrial collaborator Teragence, e.g., about the state-of-the-art platform for cellular measurement.

There are lots of grammatical and general English language errors and flaws. I strongly recommend asking a native speaker to review the whole text. Some examples:

“The data constitute the measurement of packet loss”
“video data transmission is variable-bit-rate and hence is regarded as bursty”
“QoE is an exponential of these QoS parameters”

Many references are missing:

1. Introduction: “Considerable research …” => please provide a detailed and comprehensive list of papers. “The technique we use id in line with earlier work.” In the first chapter: “QoE is expressed as 95% confidence interval”. Why not 90%? You assume that the PLP process follows a Bernoulli distribution. Why? Are there any references?

Section 2 is hard to read/understand, a more systematic structure is needed. In the current manuscript, technical statements are made, but it is unclear how these statements are linked together.

There are some statements which are not in line with the state of the art. If there are references available, please quote them, e.g., for “results can be used to map QoE to PLP …”. It would make more sense to map PLP to QoE. Also, in Section 2: what are operating points?

Reviewer 3 Report

Related work is unclear and ambiguous in several places. For instance: "the author of [21] attempted to measure". I find this a binary action: he either did it or he didn't. A comparison with other baselines is also missing.