Engineering Proceedings
  • Proceeding Paper
  • Open Access

9 May 2023

Quality of Labeled Data in Machine Learning: Common Sense and the Controversial Effect for User Behavior Models †

Faculty of Automation and Computer Engineering, Novosibirsk State Technical University, Pr. K. Marksa 20, 630073 Novosibirsk, Russia
* Author to whom correspondence should be addressed.
Presented at the 15th International Conference “Intelligent Systems” (INTELS’22), Moscow, Russia, 14–16 December 2022.
This article belongs to the Proceedings of the 15th International Conference “Intelligent Systems” (INTELS’22)

Abstract

Intelligent systems today are increasingly required to predict or imitate human perception and behavior. Feature-based Machine Learning (ML) models are still common here, since collecting appropriate training data from human subjects for data-hungry Deep Learning models is costly. Considerable effort is put into ensuring data quality, particularly on crowd-annotation platforms (e.g., Amazon MTurk), where the fees of top workers can be several times higher than the median. It is common knowledge that the quality of input data benefits the end quality of ML models, though quantitative estimations of the effect are rare. In our study, we investigate how labeled data quality affects the accuracy of models that predict users’ subjective impressions per the scales of Complexity, Aesthetics and Orderliness, as assessed by 70 subjects. The material, about 500 web page screenshots, was also labeled by 11 workers of varying diligence, whose work quality was validated by another 20 verifiers. Unexpectedly, we found significant negative correlations between the workers’ precision and the R² values of the models for two of the three scales (r₁₁ = −0.768 for Aesthetics, r₁₁ = −0.644 for Orderliness). We speculate that this controversial effect might be explained by a bias in the indiligent labelers’ output that corresponds to the subjectivity of human perception of visual objects.

1. Introduction

One of the implicit assumptions in Machine Learning (ML) is that the data that make it through the preliminary screenings and tweaks to the model training stage are appropriate. For ML models that seek to predict or simulate human behavior, such as user behavior models (UBMs) in the field of Human–Computer Interaction (HCI), the situation is more complicated. The actual interaction-related data, which generally form the input of predictive UBMs [1], arguably cannot be “bad”, as long as they reflect human “imperfection”. However, there are also increasingly important subjective dimensions, from the perceptual “how pleasant is our website design” in HCI to “how likely is it that you would recommend our service to a friend” in marketing. By definition, subjective impressions are usually provided directly by human subjects, although indirect methods do exist, e.g., facial emotion recognition. Correspondingly, Deep Learning has been slow to take off in this field, and a large share of the models are feature-based and rely on labeled data and subjective assessments.
There is a general consensus that inaccurately annotated data are a hindrance and that labeled data quality does not come for free. In micro-task platforms, such as Amazon Mechanical Turk (MTurk), crowdworkers can be filtered by reputation, which is principally based on the Approval Rate supplied by task requesters [2]. The fees charged by higher-paid workers in MTurk are about four times the median [3], even though it has been shown that even top workers can be indiligent [4]. Reputation might have seemed an easy solution to crowd-labeled data quality a decade ago [2], but the arsenal of methods and tools has been expanding rapidly since then [5], as we outline in Section 2.1. The currently mainstream data quality control methods are majority/group consensus and ground truth, which necessarily imply redundancy (several workers performing the same task) and waste up to 33% of the output.
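To make the redundancy argument concrete, the sketch below shows a minimal majority-consensus aggregation over hypothetical annotations; the item IDs, labels and the three-workers-per-item setup are illustrative assumptions and not the procedure used in this study.

```python
# Minimal sketch of majority-consensus label aggregation (hypothetical data).
# With three workers per item, at least one of the three annotations becomes
# redundant once the consensus label has been established.
from collections import Counter

# worker_labels[item_id] -> labels assigned by several independent workers
worker_labels = {
    "ui_001": ["button", "button", "link"],
    "ui_002": ["image", "image", "image"],
    "ui_003": ["text", "link", "image"],
}

def majority_label(labels):
    """Return the most frequent label, or None if there is no clear majority."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: the item would need additional annotations
    return counts[0][0]

consensus = {item: majority_label(labels) for item, labels in worker_labels.items()}
print(consensus)  # {'ui_001': 'button', 'ui_002': 'image', 'ui_003': None}
```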
Even if the data labeling work is carried out by volunteers and is technically free, their limited effort should be used efficiently too. Although volunteers generally have higher motivation than crowdworkers, redundancy might still be necessary to reach the certainty thresholds [6]. Setting the latter is actually a major problem for a requester, and we believe it is not adequately covered in existing studies. As with software debugging, more is always better: there is no hard threshold for improving data quality, only the one dictated by practicability. Many developments to improve input data for UBMs, e.g., the enhanced version of the robust Aalto Interface Metrics (https://github.com/aalto-ui/aim, accessed on 1 June 2022) [7], are underway with the best intentions. Unfortunately, estimating the concrete “return on investment” in data quality remains problematic, as quantitative studies of its end effect in ML are scarce.
In our paper, we explore the relation between the completeness and precision of the input data produced by 11 human labelers and the quality of the ensuing 33 user behavior models built for 487 web page screenshots assessed by another 70 participants. Rather unexpectedly, we find that the significant correlation between the labelers’ precision and the quality of the models constructed for the subjective scales of aesthetics and orderliness is negative. We attribute this counterintuitive result to a bias in indiligent labelers that brings their output closer to some subjective dimensions of human visual perception. We did not find any significant correlations for labeling completeness, not even for complexity, which is known to be affected by the number of visual elements. Our results call into question the applicability of traditional data quality measures to human-related data, although further research is necessary.
The outcome was preliminarily reported and discussed at the 2021 Fall Conference of the Ergonomic Society of Korea (ESK). In the current paper, we present an extended version of our results, referencing some of our previous related publications, such as [8,9]. In Section 2, we briefly review the research relevant to human behavior data quality in ML and describe our experiment. In Section 3, we construct the models and analyze the effects of the input data on their quality. In the final section, we discuss the findings and their possible causes, and outline directions for further research.

3. Results

3.1. Descriptive Statistics

In total, we collected 12,705 assessments for the 497 UIs. Further, the 11 labelers specified 42,716 elements in 495 UIs (see Table 1 in [9]), and the quality of their work was evaluated by 20 verifiers. Some UIs had technical problems or incomplete evaluations, so we were left with 487 valid UIs (98.0%), for which the descriptive statistics are presented in Table 1. The labelers’ IDs are abbreviations of their first and last names.
Table 1. The descriptive statistics per the labelers (M ± SD).
To check the homogeneity of the UI assessments across the 11 labelers, we ran ANOVA tests for all three scales. We found a barely significant effect of ID only on ScaleO (F₁₀,₄₇₆ = 1.87, p = 0.047), but not on ScaleC (F₁₀,₄₇₆ = 1.21, p = 0.284) or ScaleA (F₁₀,₄₇₆ = 1.63, p = 0.096). The post-hoc test for ScaleO (Tukey HSD, since the independent variable had many levels) found a significant difference (at α = 0.05) only between labelers PV and PE (p = 0.012). The variances did not differ significantly (p = 0.372), so the ANOVA assumptions were met. Pearson correlations for the assessments per UI were highly significant between ScaleA and ScaleO (r₄₈₇ = 0.771, p < 0.001), as well as between ScaleC and ScaleO (r₄₈₇ = −0.145, p = 0.001), but not between ScaleC and ScaleA.
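This homogeneity check can be reproduced with standard statistical tooling; the sketch below runs the same steps (one-way ANOVA, a variance-homogeneity check, and a Tukey HSD post-hoc test) on synthetic data, as the column names and toy ratings are assumptions rather than the study’s dataset.

```python
# Sketch of the homogeneity check on synthetic data (column names and values
# are hypothetical stand-ins for the actual per-labeler ScaleO assessments).
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "labeler_id": np.repeat([f"L{i:02d}" for i in range(11)], 45),  # 11 labelers
    "scale_o": rng.normal(4.0, 1.0, 11 * 45),                       # orderliness ratings
})

groups = [g["scale_o"].to_numpy() for _, g in df.groupby("labeler_id")]
f_stat, p_value = stats.f_oneway(*groups)      # one-way ANOVA: effect of labeler ID
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")
print(stats.levene(*groups))                   # equal-variance assumption check

# Post-hoc pairwise comparisons; Tukey HSD suits a factor with many levels
print(pairwise_tukeyhsd(endog=df["scale_o"], groups=df["labeler_id"], alpha=0.05).summary())
```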
In the verification, 37,053 labeled elements were specified as correct and 4,967 as incorrect, and the mean Precision per labeler was 88.7%, which indicates a reasonably good work quality. The Pearson correlation between Precision and SC per labeler was not significant (p = 0.727), which suggests that these two aspects of UI labeling quality are distinct. The correlation between SC and the average number of correct objects was significant (r₁₁ = 0.622, p = 0.041), unlike that for the number of all labeled objects (r₁₁ = 0.170, p = 0.618), which reinforces the meaningfulness of the verification.
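As an illustration of how the per-labeler Precision and its relation to SC can be computed, the sketch below uses hypothetical counts for three labelers; the IDs and numbers are invented, whereas the actual study had 11 labelers with the counts reported in Table 1.

```python
# Sketch of the per-labeler Precision calculation and its correlation with SC.
# All IDs and counts below are hypothetical placeholders.
from scipy.stats import pearsonr

correct   = {"AB": 3900, "CD": 3500, "EF": 3700}   # elements confirmed by verifiers
incorrect = {"AB": 420,  "CD": 610,  "EF": 380}    # elements rejected by verifiers
sc        = {"AB": 0.80, "CD": 0.74, "EF": 0.79}   # SC (completeness) values per labeler

precision = {k: correct[k] / (correct[k] + incorrect[k]) for k in correct}
ids = sorted(correct)
r, p = pearsonr([precision[k] for k in ids], [sc[k] for k in ids])
print(precision)
print(f"Pearson r = {r:.3f}, p = {p:.3f}")
```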

3.2. The Effect of the Input Data Quality in the Models

To construct the UBMs, we relied on simple linear regression, since we only had a limited number of data samples (41–54) per labeler. Thus, we built 33 models, each with the same 8 factors calculated from the respective labeler’s output. The R² values obtained for the models are presented in Table 2, together with the labelers’ mean quality parameters obtained from the UI verifications.
Table 2. The labelers’ and the models’ quality.
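The modelling step can be sketched as follows; the per-labeler data frames, factor names and scale names are synthetic assumptions standing in for the 8 metrics and the three assessment scales, but the structure (one OLS model per labeler and per scale, 33 in total) mirrors the procedure described above.

```python
# Sketch of the per-labeler modelling: one simple linear regression per labeler
# and per scale, collecting R^2 (factor/scale names and data are synthetic).
import numpy as np
import pandas as pd
import statsmodels.api as sm

factors = [f"f{i}" for i in range(1, 9)]            # 8 factors computed per UI
scales = ["scale_c", "scale_a", "scale_o"]          # Complexity, Aesthetics, Orderliness

rng = np.random.default_rng(1)
per_labeler = {}
for labeler in [f"L{i:02d}" for i in range(11)]:    # 41-54 valid UIs per labeler
    n = int(rng.integers(41, 55))
    per_labeler[labeler] = pd.DataFrame(
        rng.normal(size=(n, len(factors) + len(scales))), columns=factors + scales)

r2 = {}                                             # (labeler, scale) -> R^2 of the model
for labeler, data in per_labeler.items():
    X = sm.add_constant(data[factors])
    for scale in scales:
        r2[(labeler, scale)] = sm.OLS(data[scale], X).fit().rsquared

print(f"{len(r2)} models built; mean R^2 = {np.mean(list(r2.values())):.3f}")
```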
Since the number of screenshots processed by each labeler (UI) was not exactly the same (see Table 1), we checked its correlations with the R² values for each of the three scales. None of the Pearson correlations was significant at α = 0.05, so treating all the labelers’ models alike is justified.
The subsequent Pearson correlation analysis revealed that SC did not have a significant correlation (at α = 0.05) with the models’ quality parameter (R²) for any of the scales. Even for ScaleC, the correlation was only r₁₁ = 0.062 (p = 0.856), although the visual complexity of a user interface is known to be influenced by the number of elements [25]. To check the conceptual validity of our SC variable, we also examined the association between the factual average number of elements per UI for each labeler and the R² values. Again, none of the Pearson correlations was significant (at α = 0.05), the correlation for ScaleC being r₁₁ = 0.274 (p = 0.415).
For Precision, we found significant negative correlations with the R² values for ScaleA (r₁₁ = −0.768, p = 0.006) and ScaleO (r₁₁ = −0.644, p = 0.032), but not for ScaleC (r₁₁ = 0.051, p = 0.883). Recognizing the possible inaccuracy of our quality measures, we also tried treating R² and Precision as ordinal variables, which is rather practical, since task requesters are often interested in accepting the output from the best labelers only. However, the results did not change much for Kendall’s tau-b correlation measure: τ₁₁ = −0.491, p = 0.036 for ScaleA and τ₁₁ = −0.418, p = 0.073 for ScaleO.
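The correlation analysis itself is straightforward to reproduce; the sketch below computes Pearson’s r and Kendall’s tau-b between per-labeler Precision and per-model R² on placeholder values (the 11 numbers are invented to show a negative association and are not the Table 2 figures).

```python
# Sketch of the Precision-vs-R^2 correlation analysis (placeholder values only;
# the actual per-labeler Precision and R^2 figures are given in Table 2).
from scipy.stats import kendalltau, pearsonr

precision  = [0.93, 0.91, 0.90, 0.89, 0.89, 0.88, 0.88, 0.87, 0.86, 0.85, 0.84]
r2_scale_a = [0.18, 0.22, 0.21, 0.25, 0.27, 0.26, 0.30, 0.29, 0.33, 0.35, 0.34]

r, p_r = pearsonr(precision, r2_scale_a)        # parametric (interval-scale) association
tau, p_tau = kendalltau(precision, r2_scale_a)  # rank-based; tau-b also handles ties
print(f"Pearson r = {r:.3f} (p = {p_r:.3f})")
print(f"Kendall tau-b = {tau:.3f} (p = {p_tau:.3f})")
```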

4. Discussion and Conclusions

Seeking to explore the effect of input data quality, we undertook an experimental study with 101 human participants and 497 web UIs. Our assumption was that better quality of the UI labeling should result in better quality of UBMs.
Contrary to our expectations, we found significant negative correlations between the labeling quality parameters and the resulting models’ quality (see Table 2) for the subjective impression dimensions of aesthetics (r₁₁ = −0.768) and orderliness (r₁₁ = −0.644). Before deciding to report these negative research results in the current paper, we revisited the possible biases. However, the following considerations reinforce the validity of our findings:
  • Invalid UI assessment: there were almost no significant differences in the distributions of the ratings across the labelers.
  • Invalid UI labeling: the dimensions of Precision (88.7%) and SC (77.8%) indicated high work quality and were distinct from each other.
  • Invalid verification: SC was correlated (r₁₁ = 0.622) with the number of correct objects, but not with the number of all labeled objects.
  • Invalid subjective impression scales: as expected, ScaleA and ScaleO had a significant positive correlation (r₄₈₇ = 0.771), while ScaleC and ScaleO had a significant negative correlation (r₄₈₇ = −0.145). The relation between ScaleA and ScaleC is known from the literature to be more controversial [20], and we did not find a significant correlation between them.
  • Imperfection in quality measurement: we tried an objective measure for SC (elements per UI) and an ordinal correlation measure (Kendall’s tau-b) for Precision, but there were no major changes in the outcomes.
  • Uncontrolled differences in the models: the sample sizes varied from 41 to 45 (and up to 54 for one of the labelers), but there was no correlation between UI and the models’ R² values.
The discovered negative correlations between the labelers’ precision and the quality of the resulting models are not entirely clear to us, and we do not yet have a convincing explanation. We note that the effect was found for the scale of aesthetics and the related scale of orderliness, but not for the less subjective scale of complexity. Aesthetics judgements for visual objects are believed to be rather high-level, involving the factors of layout, visual hierarchy, colors, etc. Individual elements are grouped according to Gestalt principles, and imprecisions and omissions might even contribute to that (think of an Impressionist painting). Correspondingly, we might speculate that the indiligent workers were biased towards picking and labeling the UI elements in a way that matches actual human perception. However, a much closer look at their output would be required before making any justified conclusions.
Among the limitations of our study, we see the relative minimalism of the linear regression UBMs. We only employed eight factors, and often they were not even significant in the models. Correspondingly, the absolute quality levels of some models were rather modest, while the average R² over the 33 models turned out to be 0.281. The latter is arguably acceptable for our small-scale study, which deliberately incorporated potentially low-quality input data. For instance, in another of our studies with the same set of university website screenshots, the R² values ranged from 0.105 to 0.248 (similarly, aesthetics had the highest R² of the three scales) [8], although there the number of factors was smaller and the number of samples was higher. In any case, in the current study we were interested in the relative values and never intended to use the models in production.
Our further research prospects involve experimentation with more labelers and a more diverse set of the web UIs. Having collected more data, we plan to employ artificial neural network (ANN) models, instead of the simple linear regression ones. ANNs are known as universal approximators and can naturally handle systematic bias in data.

Author Contributions

Conceptualization, M.B.; methodology, M.B.; software, V.K.; validation, M.B. and V.K.; formal analysis, M.B.; investigation, M.B.; resources, M.B. and V.K.; data curation, M.B. and V.K.; writing—original draft preparation, M.B.; writing—review and editing, M.B.; visualization, M.B.; supervision, M.B.; project administration, M.B.; funding acquisition, M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by RFBR, grant number 19-29-01017.

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of Faculty of Humanities of Novosibirsk State Technical University (protocol code 7_02_2019).

Data Availability Statement

The data presented in this study are available on request from M.B. (the first author). The data are not publicly available due to privacy reasons.

Acknowledgments

We would like to thank those who contributed to the project: Sebastian Heil, Martin Gaedke, Anna Stepanova and Galina Hamgushkeeva, as well as Alyona Bakaeva, who provided inspiration for this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Oulasvirta, A. User interface design with combinatorial optimization. Computer 2017, 50, 40–47.
  2. Peer, E.; Vosgerau, J.; Acquisti, A. Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behav. Res. Methods 2014, 46, 1023–1031.
  3. Hara, K.; Adams, A.; Milland, K.; Savage, S.; Callison-Burch, C.; Bigham, J.P. A data-driven analysis of workers’ earnings on Amazon Mechanical Turk. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 21–26 April 2018; pp. 1–14.
  4. Saravanos, A.; Zervoudakis, S.; Zheng, D.; Stott, N.; Hawryluk, B.; Delfino, D. The hidden cost of using Amazon Mechanical Turk for research. In International Conference on Human-Computer Interaction; Springer International Publishing: New York, NY, USA, 2021; pp. 147–164.
  5. Daniel, F.; Kucherbaev, P.; Cappiello, C.; Benatallah, B.; Allahbakhsh, M. Quality control in crowdsourcing: A survey of quality attributes, assessment techniques, and assurance actions. ACM Comput. Surv. (CSUR) 2018, 51, 1–40.
  6. Salk, C.; Moltchanova, E.; See, L.; Sturn, T.; McCallum, I.; Fritz, S. How many people need to classify the same image? A method for optimizing volunteer contributions in binary geographical classifications. PLoS ONE 2022, 17, e0267114.
  7. Oulasvirta, A.; De Pascale, S.; Koch, J.; Langerak, T.; Jokinen, J.; Todi, K.; Laine, M.; Kristhombuge, M.; Zhu, Y.; Miniukovich, A.; et al. Aalto Interface Metrics (AIM): A service and codebase for computational GUI evaluation. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology Adjunct Proceedings, Berlin, Germany, 14–17 October 2018; pp. 16–19.
  8. Boychuk, E.; Bakaev, M. Entropy and compression based analysis of web user interfaces. In International Conference on Web Engineering; Springer International Publishing: New York, NY, USA, 2019; pp. 253–261.
  9. Heil, S.; Bakaev, M.; Gaedke, M. Assessing completeness in training data for image-based analysis of web user interfaces. CEUR Workshop Proc. 2019, 2500, 17.
  10. Thakkar, D.; Ismail, A.; Kumar, P.; Hanna, A.; Sambasivan, N.; Kumar, N. When is Machine Learning Data Good? Valuing in Public Health Datafication. In Proceedings of the CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 29 April–5 May 2022; pp. 1–16.
  11. Gudivada, V.; Apon, A.; Ding, J. Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. Int. J. Adv. Softw. 2017, 10, 1–20.
  12. Wiemer, H.; Dementyev, A.; Ihlenfeldt, S. A Holistic Quality Assurance Approach for Machine Learning Applications in Cyber-Physical Production Systems. Appl. Sci. 2021, 11, 9590.
  13. Batini, C.; Cappiello, C.; Francalanci, C.; Maurino, A. Methodologies for data quality assessment and improvement. ACM Comput. Surv. (CSUR) 2009, 41, 1–52.
  14. Bakaev, M.; Avdeenko, T. Intelligent information system to support decision-making based on unstructured web data. ICIC Express Lett. 2015, 9, 1017–1023.
  15. Taleb, I.; Serhani, M.A.; Dssouli, R. Big data quality: A survey. In Proceedings of the IEEE International Congress on Big Data (BigData Congress), San Francisco, CA, USA, 2–7 July 2018; pp. 166–173.
  16. Bakaev, M.; Khvorostov, V.; Heil, S.; Gaedke, M. Web intelligence linked open data for website design reuse. In International Conference on Web Engineering; Springer International Publishing: New York, NY, USA, 2017; pp. 370–377.
  17. Ehrlinger, L.; Wöß, W. A survey of data quality measurement and monitoring tools. Front. Big Data 2022, 5, 850611.
  18. Alwan, A.A.; Ciupala, M.A.; Brimicombe, A.J.; Ghorashi, S.A.; Baravalle, A.; Falcarin, P. Data quality challenges in large-scale cyber-physical systems: A systematic review. Inf. Syst. 2022, 105, 101951.
  19. Swazinna, P.; Udluft, S.; Runkler, T. Measuring Data Quality for Dataset Selection in Offline Reinforcement Learning. In Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI), Orlando, FL, USA, 5–7 December 2021; pp. 1–8.
  20. Miniukovich, A.; Marchese, M. Relationship between visual complexity and aesthetics of webpages. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–13.
  21. Jonietz, D. A concept for fitness-for-use evaluation in Machine Learning pipelines. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Sydney, Australia, 6–14 December 2021.
  22. Lee, Y.W.; Pipino, L.L.; Funk, J.D.; Wang, R.Y. Journey to Data Quality; The MIT Press: Cambridge, MA, USA, 2006.
  23. Hagendorff, T. Linking Human And Machine Behavior: A New Approach to Evaluate Training Data Quality for Beneficial Machine Learning. Minds Mach. 2021, 31, 563–593.
  24. Ciarochi, J. Racist robots: Eradicating algorithmic bias. Triplebyte Compil. Blog. 2020. Available online: https://triplebyte.com/blog/racist-robots-detecting-bias-in-ai-systems (accessed on 1 June 2022).
  25. Bakaev, M.; Heil, S.; Khvorostov, V.; Gaedke, M. Auto-extraction and integration of metrics for web user interfaces. J. Web Eng. 2018, 17, 561–590.
  26. Geiger, R.S.; Cope, D.; Ip, J.; Lotosh, M.; Shah, A.; Weng, J.; Tang, R. “Garbage in, garbage out” revisited: What do machine learning application papers report about human-labeled training data? Quant. Sci. Stud. 2021, 2, 795–827.
  27. Sambasivan, N.; Kapania, S.; Highfill, H.; Akrong, D.; Paritosh, P.; Aroyo, L.M. “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 8–13 May 2021; pp. 1–15.
