Matrix-Based Method for Inferring Elements in Data Attributes Using a Vector Space Model
Abstract
1. Introduction
2. Proposed Method
2.1. Models
2.2. Preprocessing Steps
2.3. Similarity of Data from ODs (Term-Element Matrix)
2.4. Co-Occurrence of Elements (Element-Element Matrix)
3. Experimental Details
3.1. Purpose and Method
3.2. Attributes of Training Data (Corpus)
4. Result and Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Symbol | Description |
---|---|
DJ | Summary of data in natural language (data jacket) |
OD | Outline of data described in DJs (data outline) |
VL | Name/meaning of variables in data (variable label) |
# DJs | |
# VLs | |
# terms in ODs | |
Set of elements | |
# elements | |
Attribute | Example Elements |
---|---|
variable | latitude, longitude, address, weather, time, year |
type | number, text, image, table |
format | CSV, PDF, JSON, TXT, MOV |

Statistic | Value |
---|---|
Number of DJs | 1502 |
Total number of terms in DJs | 38,722 |
Unique terms in DJs | 7886 |
Total number of formats in DJs | 1421 |
Unique formats in DJs | 48 |
Total number of types in DJs | 2956 |
Unique types in DJs | 18 |
Total number of VLs in DJs | 7552 |
Unique VLs in DJs | 5559 |

Method | F-Measure | Precision | Recall |
---|---|---|---|
Matrix | 0.266 ± 0.190 | 0.664 ± 0.443 | 0.174 ± 0.141 |
Matrix | 0.250 ± 0.203 | 0.619 ± 0.463 | 0.164 ± 0.152 |
TSM | 0.046 ± 0.162 | 0.034 ± 0.124 | 0.078 ± 0.281 |
Doc2vec | 0.003 ± 0.104 | 0.089 ± 0.261 | 0.026 ± 0.071 |

Method | F-Measure | Precision | Recall |
---|---|---|---|
Matrix | 0.530 ± 0.247 | 0.816 ± 0.304 | 0.423 ± 0.243 |
Matrix | 0.463 ± 0.269 | 0.712 ± 0.364 | 0.371 ± 0.251 |
TSM | 0.089 ± 0.203 | 0.079 ± 0.178 | 0.120 ± 0.313 |
Doc2vec | 0.179 ± 0.182 | 0.276 ± 0.309 | 0.144 ± 0.151 |

Method | F-Measure | Precision | Recall |
---|---|---|---|
Matrix | 0.110 ± 0.210 | 0.165 ± 0.311 | 0.095 ± 0.191 |
Matrix | 0.089 ± 0.180 | 0.131 ± 0.264 | 0.078 ± 0.169 |
TSM | 0.041 ± 0.096 | 0.034 ± 0.078 | 0.060 ± 0.145 |
Doc2vec | 0.001 ± 0.008 | 0.001 ± 0.013 | 0.001 ± 0.009 |
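
The result tables report F-measure, precision, and recall as a mean ± a spread. The evaluation protocol is not spelled out here, so the following is only a minimal sketch of how such figures could be produced. It assumes each test item is scored by comparing the set of inferred elements with the set of true elements, and that the spread is a population standard deviation. The function names and the sample elements are hypothetical.

```python
from statistics import mean, pstdev


def precision_recall_f(predicted, actual):
    """Set-based precision, recall, and F-measure for one test item.

    Assumption: an inferred element counts as correct only when it
    matches a true element exactly.
    """
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def summarize(scores):
    """Mean and population standard deviation of per-item scores."""
    return mean(scores), pstdev(scores)


# Hypothetical test items: (inferred elements, true elements).
examples = [
    (["latitude", "longitude", "time"],
     ["latitude", "longitude", "address", "year"]),
    (["weather"], ["weather", "time"]),
]

per_item = [precision_recall_f(pred, true) for pred, true in examples]
f_mean, f_std = summarize([f for _, _, f in per_item])
print(f"F-measure: {f_mean:.3f} ± {f_std:.3f}")
```

Averaging per-item scores in this way can give a standard deviation that is large relative to the mean when many items score zero, which is one possible reading of the wide spreads in the tables.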
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Hayashi, T.; Ohsawa, Y. Matrix-Based Method for Inferring Elements in Data Attributes Using a Vector Space Model. Information 2019, 10, 107. https://doi.org/10.3390/info10030107