SNMatch: An Unsupervised Method for Column Semantic-Type Detection Based on Siamese Network
Abstract
1. Introduction
- (1) An unsupervised schema matching method called SNMatch is proposed for column semantic-type matching on unlabeled tabular data.
- (2) A cell text encoder and a column text-embedding method are proposed for clustering column text by semantic type; both consider cell format features and semantic features.
- (3) We incorporate positive-unlabeled (PU) learning into the column semantic-type detection model.
- (4) We show that SNMatch achieves better performance than existing methods on column semantic-type detection tasks without training data.
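To make the notion of a cell format feature concrete, the following is a minimal sketch under stated assumptions, not the encoder used by SNMatch. The helper names `format_signature` and `column_format_profile` are hypothetical, and the character-class mapping (digits to `9`, letters to `a`/`A`, punctuation kept) is only one plausible way to capture format.

```python
from collections import Counter


def format_signature(cell: str) -> str:
    """Map each character to a coarse class; punctuation is kept as-is.

    Assumption: this mapping is illustrative only, not the paper's encoder.
    """
    out = []
    for ch in cell:
        if ch.isdigit():
            out.append("9")          # any digit
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")  # letter case
        else:
            out.append(ch)           # separators such as '-', '@', '.'
    return "".join(out)


def column_format_profile(cells):
    """Relative frequency of each format signature within one column."""
    counts = Counter(format_signature(c) for c in cells)
    total = sum(counts.values())
    return {sig: n / total for sig, n in counts.items()}


# Hypothetical usage with phone-number-like cells.
phones = ["123-45678", "135-79864", "246-80975"]
profile = column_format_profile(phones)
print(profile)
```

Columns whose cells share a signature, such as phone numbers, produce a concentrated profile, whereas free-text columns spread over many signatures. A semantic embedding would still be needed to separate columns that share a format but differ in meaning.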
2. Related Works
2.1. Column Semantic-Type Detection
2.2. Semantic Features
3. Problem Definition
4. Methodology
4.1. The Architecture of SNMatch
4.2. Training Data Generation
4.3. The Siamese Network
4.4. Cell Text Encoder
4.5. Clustering Column Semantic Types
5. Experiments
5.1. Datasets
5.2. Baselines
5.3. Experiment Metrics
5.4. Experiment Results
5.4.1. Comparison with the Baselines
5.4.2. Siamese Network
5.4.3. Efficiency of Featured Vectors
5.4.4. Ablation Experiments
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Name | Post | | Phone Number | Available |
---|---|---|---|---|
Nick Bayley | HR Assistant | nickbaley@fmb.com | 123-45678 | Mon.–Fri. |
Arthur Fritch | Program Manager | arthur@fmb.com | 135-79864 | Mon.–Sat. |
Patrick Rutherford | Team Leader | prutherford@fmb.com | 246-80975 | Mon.–Sat. |
Dataset | Tables | Columns | Cells | Semantic Types | SD of Semantic-Type Count |
---|---|---|---|---|---|
MACST | 362 | 1191 | 50,311 | 58 | 25.08 |
VizNet-Manyeyes | 2821 | 9469 | 1,333,693 | 77 | 184.74 |
Method | Macro Precision | Macro Recall | Macro F1 Score | Micro F1 Score |
---|---|---|---|---|
MACST | ||||
LDA | 0.194 | 0.212 | 0.162 | 0.216 |
BTM | 0.255 | 0.271 | 0.213 | 0.317 |
TF-IDF | 0.343 | 0.125 | 0.127 | 0.120 |
FastText | 0.343 | 0.254 | 0.207 | 0.307 |
BERT | 0.357 | 0.312 | 0.267 | 0.317 |
SNMatch | 0.361 | 0.357 | 0.290 | 0.377 |
VizNet-Manyeyes | ||||
LDA | 0.212 | 0.225 | 0.186 | 0.450 |
BTM | 0.263 | 0.260 | 0.212 | 0.444 |
FastText | 0.257 | 0.170 | 0.126 | 0.264 |
BERT | 0.289 | 0.216 | 0.164 | 0.264 |
SNMatch | 0.292 | 0.287 | 0.220 | 0.380 |
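The macro and micro columns in the table above can be read with the following sketch, which assumes the usual definitions: macro scores average per-type precision, recall, and F1 with equal weight per semantic type, while micro F1 pools true positives, false positives, and false negatives over all columns. The function names and the toy labels are hypothetical, and the sketch assumes each predicted cluster has already been mapped to a semantic-type label.

```python
def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {}
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        counts[lab] = (tp, fp, fn)
    return counts


def macro_micro_scores(y_true, y_pred):
    """Compute macro precision/recall/F1 and micro F1."""
    counts = per_class_counts(y_true, y_pred)
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(counts)
    # Micro F1 pools the counts before computing a single score.
    tp_sum = sum(c[0] for c in counts.values())
    fp_sum = sum(c[1] for c in counts.values())
    fn_sum = sum(c[2] for c in counts.values())
    denom = 2 * tp_sum + fp_sum + fn_sum
    return {
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,  # mean of per-type F1, not F1 of the means
        "micro_f1": 2 * tp_sum / denom if denom else 0.0,
    }


# Hypothetical toy labels for six columns.
y_true = ["name", "name", "phone", "phone", "email", "email"]
y_pred = ["name", "phone", "phone", "phone", "email", "name"]
scores = macro_micro_scores(y_true, y_pred)
print(scores)
```

Because macro F1 here is the mean of per-type F1 scores, it can fall below both macro precision and macro recall, which is consistent with the pattern in the reported rows. Micro F1 weights frequent semantic types more heavily, so the two can diverge on datasets with skewed type counts.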
Dataset | Precision | Recall | F1 Score |
---|---|---|---|
MACST | 0.84 | 0.63 | 0.72 |
VizNet-Manyeyes | 0.66 | 0.69 | 0.68 |
Method | Accuracy |
---|---|
+ clustering | 0.93 |
Auto-encoder + clustering | 0.72 |
Auto-encoder + one layer neural network | 0.82 |
Auto-encoder + multi-class SVM | 0.75 |
Method | Macro Precision | Macro Recall | Macro F1 Score | Micro F1 Score |
---|---|---|---|---|
MACST | ||||
without FastText | 0.314 | 0.302 | 0.240 | 0.340 |
with FastText | 0.361 | 0.357 | 0.290 | 0.377 |
VizNet-Manyeyes | ||||
without FastText | 0.282 | 0.214 | 0.090 | 0.220 |
with FastText | 0.292 | 0.287 | 0.220 | 0.380 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Nie, T.; Mao, H.; Liu, A.; Wang, X.; Shen, D.; Kou, Y. SNMatch: An Unsupervised Method for Column Semantic-Type Detection Based on Siamese Network. Mathematics 2025, 13, 607. https://doi.org/10.3390/math13040607