Quality of Labeled Data in Machine Learning: Common Sense and the Controversial Effect for User Behavior Models †
Abstract
1. Introduction
2. Method and Related Work
2.1. Data Quality Control in ML
2.2. Human Factor in Data Quality
2.3. The Experiment Description
2.3.1. Material
2.3.2. Procedure
UI Assessment
- How visually complex is the UI: Complexity;
- How aesthetically pleasant is the UI: Aesthetics;
- How orderly is the UI: Orderliness.
UI Labeling
The Labeling Verification
2.3.3. Subjects
- The UI assessment was performed by 70 participants (43 female, 27 male), whose ages ranged from 18 to 29 (mean = 20.86, SD = 1.75).
- The UI labeling was performed a few months later by another 11 participants (6 male, 5 female), whose ages ranged from 20 to 24 (mean = 20.5, SD = 0.74).
- The verification of the labelers’ output was performed a few months later by yet another 20 participants (10 male, 10 female), whose ages ranged from 20 to 22 (mean = 21.1, SD = 0.45).
2.3.4. Design
- number of all UI elements,
- number of text elements,
- share of the text elements’ area in the screenshot,
- number of image elements,
- share of the image elements’ area,
- number of background image elements,
- share of the background image elements’ area,
- share of whitespace (the screenshot area minus all the other labeled elements).
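The features listed above can be derived from a UI's labeled elements roughly as follows. This is a minimal sketch, not the authors' tooling: the `Element` class, the class names (`"text"`, `"image"`, `"background_image"`), and the assumption that labeled bounding boxes do not overlap are all hypothetical choices made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Element:
    """One labeled UI element (hypothetical structure, pixel units)."""
    kind: str    # assumed class names: "text", "image", "background_image", or other
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


def ui_features(elements, screenshot_width, screenshot_height):
    """Compute the eight count/share features for one screenshot."""
    total_area = screenshot_width * screenshot_height
    counts = Counter(e.kind for e in elements)
    area = Counter()
    for e in elements:
        area[e.kind] += e.area
    labeled_area = sum(area.values())
    return {
        "n_elements": len(elements),
        "n_text": counts["text"],
        "share_text": area["text"] / total_area,
        "n_image": counts["image"],
        "share_image": area["image"] / total_area,
        "n_background": counts["background_image"],
        "share_background": area["background_image"] / total_area,
        # Screenshot area minus all labeled elements; clamped at 0 because
        # overlapping boxes (not handled here) could sum past the total.
        "share_whitespace": max(0.0, (total_area - labeled_area) / total_area),
    }


# Illustrative input only: a 1000 x 1000 px screenshot with five elements.
example = [
    Element("text", 200, 100),
    Element("text", 100, 100),
    Element("image", 500, 200),
    Element("background_image", 1000, 300),
    Element("button", 100, 50),
]
features = ui_features(example, 1000, 1000)
```

Summing box areas is the simplest reading of "the screenshot area minus all the other labeled elements"; if labeled elements can overlap, a pixel mask would give a more faithful whitespace share.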
3. Results
3.1. Descriptive Statistics
3.2. The Effect of the Input Data Quality in the Models
4. Discussion and Conclusions
- Invalid UI assessment: there was almost no significant difference in the distributions of the ratings across the labelers.
- Invalid UI labeling: the Precision (88.7%) and SC (77.8%) dimensions both indicated high work quality and were distinct from each other.
- Invalid verification: SC was correlated with the number of correct objects, but not with the number of all objects.
- Invalid subjective impression scales: as expected, ScaleA and ScaleO had a significant positive correlation, while ScaleC and ScaleO had a significant negative correlation. The relation between ScaleA and ScaleC is more controversial, as known from the literature [20], and we did not find a significant correlation.
- Imperfection in quality measurement: we tried an objective measure for SC (elements per UI) and an ordinal-scale correlation (Kendall’s tau-b) for Precision, but the outcomes did not change substantially.
- Uncontrolled differences in the models: the sample sizes varied from 41 to 45 (and reached 54 for one of the labelers), but there was no correlation between the number of UIs and the models’ quality.
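For readers unfamiliar with the ordinal-scale correlation mentioned in the list above, Kendall’s tau-b can be sketched in plain Python as below. This is an illustrative implementation of the standard tie-corrected formula, not the authors' analysis code, and the sample values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((n0 - n1) * (n0 - n2)).

    C and D are the concordant and discordant pair counts, n0 is the number
    of all pairs, n1 and n2 are the numbers of pairs tied in x and in y.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    n0 = len(x) * (len(x) - 1) // 2
    denominator = sqrt((n0 - ties_x) * (n0 - ties_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denominator


# Hypothetical ordinal ratings with one tie in the first variable.
tau = kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 4])
```

Unlike Pearson's coefficient, tau-b depends only on the ordering of the values, which is why it is a reasonable check when a quality measure is treated as ordinal.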
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Oulasvirta, A. User interface design with combinatorial optimization. Computer 2017, 50, 40–47.
2. Peer, E.; Vosgerau, J.; Acquisti, A. Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behav. Res. Methods 2014, 46, 1023–1031.
3. Hara, K.; Adams, A.; Milland, K.; Savage, S.; Callison-Burch, C.; Bigham, J.P. A data-driven analysis of workers’ earnings on Amazon Mechanical Turk. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 21–26 April 2018; pp. 1–14.
4. Saravanos, A.; Zervoudakis, S.; Zheng, D.; Stott, N.; Hawryluk, B.; Delfino, D. The hidden cost of using Amazon Mechanical Turk for research. In International Conference on Human-Computer Interaction; Springer International Publishing: New York, NY, USA, 2021; pp. 147–164.
5. Daniel, F.; Kucherbaev, P.; Cappiello, C.; Benatallah, B.; Allahbakhsh, M. Quality control in crowdsourcing: A survey of quality attributes, assessment techniques, and assurance actions. ACM Comput. Surv. (CSUR) 2018, 51, 1–40.
6. Salk, C.; Moltchanova, E.; See, L.; Sturn, T.; McCallum, I.; Fritz, S. How many people need to classify the same image? A method for optimizing volunteer contributions in binary geographical classifications. PLoS ONE 2022, 17, e0267114.
7. Oulasvirta, A.; De Pascale, S.; Koch, J.; Langerak, T.; Jokinen, J.; Todi, K.; Laine, M.; Kristhombuge, M.; Zhu, Y.; Miniukovich, A.; et al. Aalto Interface Metrics (AIM): A service and codebase for computational GUI evaluation. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology Adjunct Proceedings, Berlin, Germany, 14–17 October 2018; pp. 16–19.
8. Boychuk, E.; Bakaev, M. Entropy and compression based analysis of web user interfaces. In International Conference on Web Engineering; Springer International Publishing: New York, NY, USA, 2019; pp. 253–261.
9. Heil, S.; Bakaev, M.; Gaedke, M. Assessing completeness in training data for image-based analysis of web user interfaces. CEUR Workshop Proc. 2019, 2500, 17.
10. Thakkar, D.; Ismail, A.; Kumar, P.; Hanna, A.; Sambasivan, N.; Kumar, N. When is Machine Learning Data Good? Valuing in Public Health Datafication. In Proceedings of the CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 29 April–5 May 2022; pp. 1–16.
11. Gudivada, V.; Apon, A.; Ding, J. Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. Int. J. Adv. Softw. 2017, 10, 1–20.
12. Wiemer, H.; Dementyev, A.; Ihlenfeldt, S. A Holistic Quality Assurance Approach for Machine Learning Applications in Cyber-Physical Production Systems. Appl. Sci. 2021, 11, 9590.
13. Batini, C.; Cappiello, C.; Francalanci, C.; Maurino, A. Methodologies for data quality assessment and improvement. ACM Comput. Surv. (CSUR) 2009, 41, 1–52.
14. Bakaev, M.; Avdeenko, T. Intelligent information system to support decision-making based on unstructured web data. ICIC Express Lett. 2015, 9, 1017–1023.
15. Taleb, I.; Serhani, M.A.; Dssouli, R. Big data quality: A survey. In Proceedings of the IEEE International Congress on Big Data (BigData Congress), San Francisco, CA, USA, 2–7 July 2018; pp. 166–173.
16. Bakaev, M.; Khvorostov, V.; Heil, S.; Gaedke, M. Web intelligence linked open data for website design reuse. In International Conference on Web Engineering; Springer International Publishing: New York, NY, USA, 2017; pp. 370–377.
17. Ehrlinger, L.; Wöß, W. A survey of data quality measurement and monitoring tools. Front. Big Data 2022, 5, 850611.
18. Alwan, A.A.; Ciupala, M.A.; Brimicombe, A.J.; Ghorashi, S.A.; Baravalle, A.; Falcarin, P. Data quality challenges in large-scale cyber-physical systems: A systematic review. Inf. Syst. 2022, 105, 101951.
19. Swazinna, P.; Udluft, S.; Runkler, T. Measuring Data Quality for Dataset Selection in Offline Reinforcement Learning. In Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI), Orlando, FL, USA, 5–7 December 2021; pp. 1–8.
20. Miniukovich, A.; Marchese, M. Relationship between visual complexity and aesthetics of webpages. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–13.
21. Jonietz, D. A concept for fitness-for-use evaluation in Machine Learning pipelines. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Sydney, Australia, 6–14 December 2021.
22. Lee, Y.W.; Pipino, L.L.; Funk, J.D.; Wang, R.Y. Journey to Data Quality; The MIT Press: Cambridge, MA, USA, 2006.
23. Hagendorff, T. Linking Human And Machine Behavior: A New Approach to Evaluate Training Data Quality for Beneficial Machine Learning. Minds Mach. 2021, 31, 563–593.
24. Ciarochi, J. Racist robots: Eradicating algorithmic bias. Triplebyte Compil. Blog. 2020. Available online: https://triplebyte.com/blog/racist-robots-detecting-bias-in-ai-systems (accessed on 1 June 2022).
25. Bakaev, M.; Heil, S.; Khvorostov, V.; Gaedke, M. Auto-extraction and integration of metrics for web user interfaces. J. Web Eng. 2018, 17, 561–590.
26. Geiger, R.S.; Cope, D.; Ip, J.; Lotosh, M.; Shah, A.; Weng, J.; Tang, R. “Garbage in, garbage out” revisited: What do machine learning application papers report about human-labeled training data? Quant. Sci. Stud. 2021, 2, 795–827.
27. Sambasivan, N.; Kapania, S.; Highfill, H.; Akrong, D.; Paritosh, P.; Aroyo, L.M. “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 8–13 May 2021; pp. 1–15.

ID | UI Labeling: UIs | UI Labeling: Elements | UI Assessment: ScaleC | UI Assessment: ScaleA | UI Assessment: ScaleO |
---|---|---|---|---|---|
AA | 54 | 4802 | 3.67 ± 0.55 | 4.20 ± 0.86 | 4.43 ± 0.63 |
GD | 44 | 3520 | 3.55 ± 0.56 | 4.07 ± 0.74 | 4.47 ± 0.53 |
KK | 44 | 3927 | 3.33 ± 0.59 | 4.32 ± 0.71 | 4.59 ± 0.57 |
MA | 44 | 5349 | 3.60 ± 0.63 | 4.02 ± 0.76 | 4.36 ± 0.56 |
NE | 44 | 4994 | 3.57 ± 0.65 | 3.97 ± 0.90 | 4.34 ± 0.64 |
PV | 43 | 4544 | 3.69 ± 0.74 | 4.34 ± 0.74 | 4.66 ± 0.58 |
PE | 42 | 2569 | 3.69 ± 0.64 | 3.79 ± 1.07 | 4.16 ± 0.80 |
SV | 43 | 3737 | 3.54 ± 0.63 | 4.22 ± 0.90 | 4.46 ± 0.68 |
ShM | 41 | 1675 | 3.55 ± 0.71 | 4.05 ± 0.88 | 4.43 ± 0.56 |
SoM | 45 | 3266 | 3.62 ± 0.73 | 4.25 ± 0.91 | 4.44 ± 0.68 |
VY | 43 | 3630 | 3.47 ± 0.61 | 4.07 ± 0.83 | 4.52 ± 0.67 |
Total | 487 | 42,013 | 3.57 ± 0.64 | 4.12 ± 0.86 | 4.44 ± 0.64 |
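
The labeling columns of the table above are enough to recompute the totals and to derive the number of elements per UI for each labeler. The sketch below only restates the table's values; the variable names are illustrative, and it is not the authors' analysis code.

```python
# (UIs, labeled elements) per labeler ID, copied from the table above.
labeling = {
    "AA": (54, 4802),
    "GD": (44, 3520),
    "KK": (44, 3927),
    "MA": (44, 5349),
    "NE": (44, 4994),
    "PV": (43, 4544),
    "PE": (42, 2569),
    "SV": (43, 3737),
    "ShM": (41, 1675),
    "SoM": (45, 3266),
    "VY": (43, 3630),
}

total_uis = sum(uis for uis, _ in labeling.values())
total_elements = sum(elements for _, elements in labeling.values())

# Average number of labeled elements per UI for each labeler.
elements_per_ui = {
    labeler: elements / uis for labeler, (uis, elements) in labeling.items()
}

densest = max(elements_per_ui, key=elements_per_ui.get)
sparsest = min(elements_per_ui, key=elements_per_ui.get)

for labeler, value in sorted(elements_per_ui.items(), key=lambda kv: kv[1]):
    print(f"{labeler:>4}: {value:6.1f} elements per UI")
print(f"Total: {total_uis} UIs, {total_elements} elements")
```

With these values the totals match the table's last row (487 UIs and 42,013 elements), and the per-labeler density ranges from about 41 elements per UI (ShM) to about 122 (MA).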

ID | UI Labeling Quality: SC | UI Labeling Quality: Precision | Models’ Quality: ScaleC | Models’ Quality: ScaleA | Models’ Quality: ScaleO |
---|---|---|---|---|---|
AA | 73.0% | 89.0% | 0.108 | 0.149 | 0.114 |
GD | 84.3% | 89.9% | 0.261 | 0.345 | 0.222 |
KK | 82.5% | 95.5% | 0.261 | 0.252 | 0.152 |
MA | 75.1% | 72.0% | 0.362 | 0.486 | 0.295 |
NE | 78.3% | 85.1% | 0.316 | 0.488 | 0.416 |
PV | 81.7% | 91.6% | 0.363 | 0.289 | 0.199 |
PE | 72.0% | 77.9% | 0.165 | 0.568 | 0.611 |
SV | 80.4% | 97.4% | 0.277 | 0.176 | 0.213 |
ShM | 77.5% | 89.5% | 0.337 | 0.324 | 0.215 |
SoM | 56.0% | 95.9% | 0.304 | 0.309 | 0.198 |
VY | 95.5% | 92.8% | 0.204 | 0.110 | 0.169 |
Avg. | 77.8% | 88.7% | 0.269 | 0.318 | 0.255 |
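
One way to inspect the relation between labeling quality and model quality is to correlate the table's columns across the 11 labelers. The sketch below uses a plain-Python Pearson coefficient on the values exactly as printed above; choosing Pearson here is an assumption made for illustration, and the result is not claimed to reproduce the authors' analysis.

```python
from math import sqrt

# Per labeler: SC %, Precision %, model quality for ScaleC, ScaleA, ScaleO,
# copied from the table above.
rows = {
    "AA": (73.0, 89.0, 0.108, 0.149, 0.114),
    "GD": (84.3, 89.9, 0.261, 0.345, 0.222),
    "KK": (82.5, 95.5, 0.261, 0.252, 0.152),
    "MA": (75.1, 72.0, 0.362, 0.486, 0.295),
    "NE": (78.3, 85.1, 0.316, 0.488, 0.416),
    "PV": (81.7, 91.6, 0.363, 0.289, 0.199),
    "PE": (72.0, 77.9, 0.165, 0.568, 0.611),
    "SV": (80.4, 97.4, 0.277, 0.176, 0.213),
    "ShM": (77.5, 89.5, 0.337, 0.324, 0.215),
    "SoM": (56.0, 95.9, 0.304, 0.309, 0.198),
    "VY": (95.5, 92.8, 0.204, 0.110, 0.169),
}
columns = ("SC", "Precision", "ScaleC", "ScaleA", "ScaleO")
data = {name: [row[i] for row in rows.values()] for i, name in enumerate(columns)}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Each labeling-quality dimension against each model-quality column.
correlations = {
    (quality, scale): pearson(data[quality], data[scale])
    for quality in ("SC", "Precision")
    for scale in ("ScaleC", "ScaleA", "ScaleO")
}

for (quality, scale), r in correlations.items():
    print(f"{quality:>9} vs {scale}: r = {r:+.3f}")
```

On the values as printed, the Precision versus ScaleA coefficient comes out negative. With only 11 labelers, any such coefficient is a rough indication and says nothing about significance.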
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Bakaev, M.; Khvorostov, V. Quality of Labeled Data in Machine Learning: Common Sense and the Controversial Effect for User Behavior Models. Eng. Proc. 2023, 33, 3. https://doi.org/10.3390/engproc2023033003