Dataset Construction through Ontology-Based Data Requirements Analysis
Abstract
1. Introduction
2. Related Work
3. Ontology-Based Approach to Data Requirements Analysis
- Data Labeling: Data requirements help determine the types of labeled data the system needs to collect, such as the category, location, and size of products. This guides the labeling work and ensures that the system receives accurately labeled data for training machine learning models.
- Dataset Diversity: Data requirements ensure that the system accepts diverse data, including various types of products, videos captured under different lighting conditions, and images of products from different angles and scales, which helps ML systems adapt to various real-world scenarios.
- Feature Engineering: Data requirements can help to select and extract the most relevant features, thereby improving the performance of the system.
3.1. Ontology-Based Domain Model
- C denotes a set of concepts
- L = {is, has} denotes a set of relation names
- φ: C × L → 2^C denotes the relations among concepts, where φ(c, l) indicates the set of concepts related to concept c through label l.
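The definition above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class name `Ontology`, the methods `relate` and `related`, and the example concepts are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Domain ontology: concepts, relation names, and the relation map phi."""
    concepts: set[str] = field(default_factory=set)
    labels: frozenset[str] = frozenset({"is", "has"})
    # phi maps (concept, label) to the set of concepts reached through that label.
    phi: dict[tuple[str, str], set[str]] = field(default_factory=dict)

    def relate(self, source: str, label: str, target: str) -> None:
        """Record that `source` is related to `target` through `label`."""
        if label not in self.labels:
            raise ValueError(f"unknown relation name: {label}")
        self.concepts.update({source, target})
        self.phi.setdefault((source, label), set()).add(target)

    def related(self, concept: str, label: str) -> set[str]:
        """phi(c, l): concepts related to `concept` through `label`."""
        return self.phi.get((concept, label), set())


# Hypothetical concepts, chosen only for illustration.
onto = Ontology()
onto.relate("traffic sign", "has", "shape")
onto.relate("traffic sign", "has", "color")
onto.relate("stop sign", "is", "traffic sign")
```

Storing φ as a dictionary keyed by (concept, label) keeps the lookup φ(c, l) a single access and returns an empty set for pairs that were never related.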
3.2. Criteria-Based Data Requirements Specification
- Attribute Coverage Criterion. Given a domain ontology O and a set of crucial entity concepts, for each value combination of the attributes of the crucial entities and of the entity nodes directly related to a crucial entity, there must exist at least one data point in the dataset that covers it.
- Relation Coverage Criterion. Given a domain ontology O, a set of crucial entity concepts, and the entity nodes directly related to a crucial entity, for each possible pair of entities and each value of a relation concept r that relates the two entities, there must exist at least one data point in the dataset that covers it.
Algorithm 1: Data Requirements Generation.
Input: domain ontology O, entity nodes
Output: the data requirements derived from the attribute coverage criterion and from the relation coverage criterion
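As a rough sketch of how requirements could be enumerated from the two coverage criteria, the following Python code builds one requirement per attribute value combination and one triple per entity pair and relation value. It is not the authors' Algorithm 1. The function names, the dictionary-based inputs, and the example values are assumptions made for illustration.

```python
from itertools import product


def attribute_requirements(attributes):
    """One requirement per value combination of the given attributes."""
    names = sorted(attributes)
    return [
        frozenset(zip(names, combo))
        for combo in product(*(attributes[name] for name in names))
    ]


def relation_requirements(crucial, neighbours, relation_values):
    """One (subject, relation, object) triple per entity pair and relation value."""
    return [
        (other, value, entity)
        for entity in crucial
        for other in neighbours[entity]
        for value in relation_values
    ]


# Illustrative attribute values and entities (assumed, not taken from the algorithm).
attrs = {
    "shape": ["circle", "square"],
    "color": ["red", "blue"],
    "leaf": ["flourish", "withered"],
}
attr_reqs = attribute_requirements(attrs)
rel_reqs = relation_requirements(
    crucial=["Traffic Sign"],
    neighbours={"Traffic Sign": ["Tree", "Vehicle", "Pedestrian", "Traffic Light"]},
    relation_values=["under", "behind"],
)
```

With three binary attributes the attribute criterion yields 2 × 2 × 2 = 8 requirements, and four neighbouring entities with two relation values yield 8 triples, so the number of requirements grows multiplicatively with the number of attributes and values.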
4. The Supporting Tool
5. Experiments and Analysis
5.1. Experimental Setup
5.2. Experiments Results
- Experiments 1a and 1b: In Experiments 1a and 1b, we evaluated YOLO’s performance in detecting the label ‘sign’. We filtered the candidate dataset to obtain 321 data points that do not satisfy the data requirements and an equal number of data points that satisfy them, and used each set as a training dataset. The dataset in Experiment 1a does not satisfy the characteristics mentioned in the data requirements, while the dataset in Experiment 1b is constructed to satisfy the data requirements derived from the attribute criterion. When images in the training dataset satisfy the characteristics described by the data requirements, the recognition accuracy of the image recognition system is higher than with a dataset that does not satisfy them (mAP@0.5 of 0.8018 versus 0.7748).
- Experiments 2a and 2b: As in the previous experiment, we assessed YOLO’s performance in detecting the label ‘sign’ in images. We filtered the candidate dataset to obtain 1316 data points that do not satisfy the data requirements and an equal number of data points that satisfy the data requirements derived from the relation criterion, and used each set as a training dataset. If the training dataset satisfies the data requirement , the recognition accuracy of the image recognition system for traffic signs will decrease compared to the system trained on the dataset that satisfies the triple .
- Experiments 3a and 3b: In Experiments 3a and 3b, we filtered the candidate dataset to obtain 1768 data points that do not satisfy the data requirements and an equal number of data points that satisfy the data requirements derived from the relation criterion, and used each set as a training dataset. The results show that when the training dataset satisfies the selected data requirements at the same time, the recognition accuracy of the detection system for traffic signs is higher than with a dataset built without the data requirements (mAP@0.5 of 0.8712 versus 0.8695).
- Experiments 4a and 4b: In Experiments 4a and 4b, we filtered the candidate dataset to obtain 3260 data points that do not satisfy the data requirements and an equal number of data points that satisfy the data requirements derived from the relation criterion, and used each set as a training dataset. The results show that when the training dataset satisfies the selected data requirements at the same time, the recognition accuracy of the detection system for traffic signs is higher than with a dataset built without the data requirements (mAP@0.5 of 0.7839 versus 0.7754).
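The paired setup used in these experiments (an equal number of data points that violate and that satisfy the requirements) could be reproduced along the following lines. This is a hedged sketch only: the helper `split_by_requirements`, the annotation format, and the toy candidate pool are assumptions, and the paper's actual filtering procedure may differ.

```python
import random


def split_by_requirements(candidates, satisfies, size, seed=0):
    """Return two equally sized training sets: one violating, one satisfying."""
    meets = [d for d in candidates if satisfies(d)]
    fails = [d for d in candidates if not satisfies(d)]
    if min(len(meets), len(fails)) < size:
        raise ValueError("not enough candidates for the requested size")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    return rng.sample(fails, size), rng.sample(meets, size)


# Hypothetical annotations: each data point records the relation triples in its image.
required = ("Tree", "behind", "Traffic Sign")
candidates = [
    {"id": i, "triples": {required} if i % 2 else set()}
    for i in range(20)
]
base, constructed = split_by_requirements(
    candidates, lambda d: required in d["triples"], size=5
)
```

Keeping both sets the same size means that a difference in accuracy between the two trained models is less likely to be explained by the amount of training data alone.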
5.3. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
| Crucial Entity | Attribute | Value | |
|---|---|---|---|
| | shape | circle | {{(, circle), (, red), (, flourish)}, |
| | | square | {(, circle), (, red), (, withered)}, |
| | color | red | {(, circle), (, blue), (, flourish)}, |
| | | yellow | {(, circle), (, blue), (, withered)}, |
| | leaf | flourish | {(, square), (, red), (, flourish)}, |
| | | withered | {(, square), (, red), (, withered)}, |
| | | | {(, square), (, blue), (, flourish)}, |
| | | | {(, square), (, blue), (, withered)}} |
| Critical Relation | Value | Critical Entity | |
|---|---|---|---|
| | under | | {{(Tree, under, Traffic Sign)}, |
| | behind | | {(Tree, behind, Traffic Sign)}, |
| | | | {(Vehicle, under, Traffic Sign)}, |
| | | | {(Vehicle, behind, Traffic Sign)}, |
| | | | {(Pedestrian, under, Traffic Sign)}, |
| | | | {(Pedestrian, behind, Traffic Sign)}, |
| | | | {(Traffic Light, behind, Traffic Sign)}, |
| | | | {(Traffic Light, under, Traffic Sign)}} |
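To make the relation coverage idea concrete, the sketch below checks which of the eight triples listed in the table are not yet covered by a dataset. The function `uncovered` and the two annotated images are hypothetical and serve only to illustrate the check.

```python
def uncovered(requirements, dataset_triples):
    """Requirements that no data point in the dataset contains."""
    seen = set().union(*dataset_triples) if dataset_triples else set()
    return [r for r in requirements if r not in seen]


# The eight relation requirements from the table.
requirements = [
    (subject, relation, "Traffic Sign")
    for subject in ("Tree", "Vehicle", "Pedestrian", "Traffic Light")
    for relation in ("under", "behind")
]

# Hypothetical annotations: one set of triples per image.
dataset_triples = [
    {("Tree", "behind", "Traffic Sign")},
    {("Vehicle", "under", "Traffic Sign"), ("Tree", "under", "Traffic Sign")},
]
missing = uncovered(requirements, dataset_triples)
```

Here three of the eight triples are covered, so `missing` lists the five triples for which additional data would have to be collected.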
Subject | Object | Predicates |
---|---|---|
car | footpath | [’intervened’, ’walked’, ’dumped’, …] |
pedestrian | walker | [’hit’, ’travels’, ’make’, ’walking’, ’gets’, …] |
pedestrian | vehicle | [’identified’, ’say’, ’closed’, ’charged’, ’rendering’, …] |
motorist | intersection | [’improve’, ’kills’, …] |
bus | bridge | [’add’, ’run’, ’stop’, ’delivering’, ’work’, ’see’, …] |
… | … | … |
Entities | Relations | E-R-E Triples | Entities We Choose | Relations We Choose |
---|---|---|---|---|
building | above | tree behind sign | building | above |
sign | holding | street has sign | tree | in front of |
tree | has | sign above car | sign | behind |
car | behind | building in front of sign | | under |
street | walking on | woman walking on street | | |
woman | | | | |
| Id | Data Requirements |
|---|---|
| | {{(color, )), (shape, )), (size, ))} |
| | {(color, )), (shape, ), (size, )} |
| | {(color, ), (shape, ), (size, )} |
| | {(color, ), (shape, ), (size, )} |
| | {(color, ), (shape, ), (size, )} |
| | {(color, ), (shape, ), (size, )} |
| | {(color, ), (shape, ), (size, )} |
| | {(color, ), (shape, ), (size, )} |
| | {(color, ), (shape, ), (size, )} |
| | {(color, ), (shape, ), (size, )} |
| … | … |
| | Exp. | Datasets Satisfy the Data Requirements | Data Size | mAP@0.5 |
|---|---|---|---|---|
| Constructing Datasets Randomly (Base Case) | 1a | - | 321 | 0.7748 |
| | 2a | - | 1316 | 0.8717 |
| | 3a | - | 1768 | 0.8695 |
| | 4a | - | 3260 | 0.7754 |
| Constructing Datasets Using Data Requirements | 1b | | 321 | 0.8018 |
| | 2b | | 1316 | 0.9015 |
| | 3b | | 1768 | 0.8712 |
| | 4b | | 3260 | 0.7839 |
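The size of the improvement in each experiment pair can be read off the table with a short calculation. The snippet below only restates the reported mAP values; the variable names are illustrative.

```python
# mAP values copied from the results table: (base case, with data requirements).
results = {
    1: (0.7748, 0.8018),
    2: (0.8717, 0.9015),
    3: (0.8695, 0.8712),
    4: (0.7754, 0.7839),
}

# Absolute gain per experiment, rounded to the table's four decimal places.
gains = {exp: round(b - a, 4) for exp, (a, b) in results.items()}
```

The gain is largest in Experiment 2 (0.0298) and smallest in Experiment 3 (0.0017), so the benefit of satisfying the data requirements varies considerably across the four settings.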
Exp. | Base Case | Satisfying Data Requirements |
---|---|---|
Experiment 3. | 1768 | 1300 |
Experiment 4. | 3260 | 2500 |
Dataset Distribution | mAP@0.5 |
---|---|
= 20%, = 30%, = 50% | 0.7615 |
= 40%, = 40%, = 20% | 0.7832 |
= 20%, = 30%, = 50% | 0.7963 |
= 40%, = 40%, = 20% | 0.7398 |
= 40%, = 40%, = 20% | 0.7216 |
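A dataset with a prescribed distribution, such as the 20%/30%/50% split in the table, needs integer counts per group that still sum to the target size. One possible way to compute them is the largest-remainder method sketched below. The function `allocate`, the group names, and the total of 1001 are assumptions chosen for illustration and do not come from the experiments.

```python
def allocate(total, shares):
    """Split `total` items across groups by share, using largest remainders."""
    raw = {group: total * share for group, share in shares.items()}
    counts = {group: int(value) for group, value in raw.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining items to the groups with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda g: raw[g] - counts[g], reverse=True)
    for group in by_remainder[:leftover]:
        counts[group] += 1
    return counts


# Hypothetical group names and total size.
counts = allocate(1001, {"group_a": 0.2, "group_b": 0.3, "group_c": 0.5})
```

Rounding each share independently could leave the total one item short or over; assigning the leftover items by remainder keeps the total exact while staying as close as possible to the requested percentages.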
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Jiang, L.; Wang, X. Dataset Constrution through Ontology-Based Data Requirements Analysis. Appl. Sci. 2024, 14, 2237. https://doi.org/10.3390/app14062237