BERT-Based Schema Matching for Integrating Heterogeneous Flood Data: A Case Study in Korea
Abstract
1. Introduction
- Proposes an automated schema matching method using BERT that reduces data preparation costs for flood data integration.
- Develops a model capable of standardizing heterogeneous variable names without requiring explicit domain expertise.
- Introduces generative text augmentation that enriches the model’s ability to learn diverse expressions and leads to higher accuracy in matching.
- Demonstrates that the proposed method outperforms existing embedding-based approaches, reaching a Hit@1 of 0.526 and an MRR of 0.614 compared with 0.144 and 0.200 for a general KoSBERT model.
- Validates the practical utility of the model through a case study that applies the method to real Korean flood datasets for simulation-based disaster response.
2. Related Work
2.1. Explicit Knowledge-Driven Approaches
2.1.1. Rule-Based Methods
2.1.2. Dictionary-Based Methods
2.1.3. Ontology-Based Methods
2.2. Data-Driven Learning Approaches
2.2.1. Embedding-Based Models
2.2.2. Pre-Trained Language Models
2.2.3. Generative Models and Data Augmentation
3. Schema Matching on Flood Data with LLMs
3.1. Preliminary
3.2. Overview
3.3. Text Augmentation Module
3.4. BERT-Based Embedding Module
3.5. Automatic Matching Module
Algorithm 1: LLM-based Schema Matching for Flood Data

Input:
- S: Set of standard column names.
- T: Set of target column names.
- G: Generative language model.
- B: Fine-tuned BERT model.

Output:
- M: Mapping of target column names to standard column names.

Steps:
1. Text Augmentation: For each s ∈ S:
   (a) Generate augmented variations A_s using G.
   (b) Add A_s to S_augmented.
2. Embedding Generation:
   (a) Compute embeddings E_s for each s ∈ S_augmented using B.
   (b) Compute embeddings E_t for each t ∈ T using B.
3. Similarity Calculation: For each t ∈ T:
   (a) Calculate cosine similarity between E_t and E_s.
   (b) Identify the top-k similar s based on similarity scores.
4. Schema Mapping: For each t ∈ T:
   (a) Assign t to s with the highest similarity score.
   (b) Update M with the mapping t → s.

Return: M.
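The control flow of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the generative model G and the fine-tuned BERT model B are replaced by hypothetical stand-ins (`augment` produces simple surface variants, `embed` builds a character-bigram count vector) so that the example runs with the standard library alone. The function names, the example column names, and the choice of k = 3 are assumptions made for illustration.

```python
import math
from collections import Counter


def augment(name):
    """Hypothetical stand-in for G: return simple surface variants of a name."""
    variants = {name, name.lower(), name.replace(" ", ""),
                name.replace(" ", "_").lower()}
    return sorted(variants)


def embed(text, n=2):
    """Hypothetical stand-in for B: sparse vector of character n-gram counts."""
    padded = " " + text.lower().replace("_", " ") + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_schema(standard, targets, k=3):
    # Step 1: augment each standard name, remembering its origin.
    augmented = [(variant, s) for s in standard for variant in augment(s)]
    # Step 2: embed every augmented standard name.
    std_embeddings = [(embed(variant), s) for variant, s in augmented]

    mapping, top_k = {}, {}
    for t in targets:
        e_t = embed(t)
        # Step 3: keep the best score per standard name over its variants.
        best = {}
        for e_s, s in std_embeddings:
            score = cosine(e_t, e_s)
            if score > best.get(s, -1.0):
                best[s] = score
        ranked = sorted(best, key=lambda s: (-best[s], s))
        top_k[t] = ranked[:k]
        # Step 4: assign the target to the highest-scoring standard name.
        mapping[t] = ranked[0]
    return mapping, top_k


# Illustrative inputs only; these are not the paper's datasets.
standard = ["Conduit Length", "Conduit Diameter", "Roughness Coefficient"]
targets = ["pipe_diameter", "length"]
mapping, top_k = match_schema(standard, targets)
print(mapping)
print(top_k)
```

In a realistic setting, `embed` would call a fine-tuned BERT encoder and `augment` would query a generative model. Scoring each standard name by the maximum over its augmented variants is one possible aggregation; Algorithm 1 does not specify how the variants are combined.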
4. Empirical Study
4.1. Experimental Setting
4.2. Experimental Evaluation
4.3. Case Study
4.4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest

| English Description | Korean Variable Name (Busan City) | Korean Variable Name (Incheon City) |
|---|---|---|
| Unique identifier | Gaebyeol Beonho (개별번호) | Goyu Sikbyeolja (고유식별자) |
| Installation date | Seolchi Ilja (설치일자) | Seolchi Ilja (설치일자) |
| Sewer structure type | Gwangeo Hyeongtae (관거형태) | Gujomul Hyeongtae (구조물형태) |
| Manhole diameter | Ttukkeong Gugyeong (뚜껑구경) | Ttukkeong Gugyeong (뚜껑구경) |
| Manhole material | Ttukkeong Jaejil (뚜껑재질) | Ttukkeong Jaejil (뚜껑재질) |
| Administrative district | Haengjeong-dong (행정동) | Haengjeong-dong (행정동) |

| English Name | Korean Name/Explanation |
|---|---|
| Name | Conduit Name (관거 이름) |
| FromNode | Start Node of Conduit (관거 시작노드) |
| ToNode | End Node of Conduit (관거 종점노드) |
| Shape | Conduit Shape (관거 모양) |
| Thickness | Conduit Thickness (관거 두께) |
| Diameter | Conduit Diameter (관거 지름) |
| SedimentHeight | Sediment Height (퇴적물의 높이) |
| Roughness | Roughness Coefficient (거칠기 계수) |
| Length | Conduit Length (관거 길이) |
| InitialFlow | Initial Flow (초기 유량) |
| MaximumFlow | Maximum Flow (최대 유량) |
| SeepageLossRate | Seepage Loss Rate (누수 속도) |
| FlapGate | Presence of Flap Gate (플랩 게이트 존재 여부) |

| Target Variable Name | Transformed Name (Top-3 Similarity) | Standard Variable Name | Correct Matching |
|---|---|---|---|
| Unique Identifier (고유 식별자) | (1) Conduit Name (관의 이름); (2) End Node of Conduit (관의 종점노드); (3) Roughness Coefficient (거칠기 계수) | Conduit Name (관의 이름) | Y |
| Structure Type (구조물 형태) | (1) Conduit Shape (관의 모양); (2) Sediment Height (퇴적물의 높이); (3) Roughness Coefficient (거칠기 계수) | Conduit Shape (관의 모양) | Y |
| Pipe Diameter (관경) | (1) Roughness Coefficient (거칠기 계수); (2) Conduit Length (관의 길이); (3) Conduit Shape (관의 모양) | Conduit Diameter (관거 지름) | N |
| Material texture (재질) | (1) Roughness Coefficient (거칠기 계수); (2) Conduit Thickness (관의 두께); (3) Conduit Shape (관의 모양) | Roughness Coefficient (거칠기 계수) | Y |
| Length (연장) | (1) Roughness Coefficient (거칠기 계수); (2) Conduit Thickness (관의 두께); (3) Conduit Shape (관의 모양) | Conduit Length (관의 길이) | N |
| Dry-weather Flow Velocity (청천시 유속) | (1) Initial Flow (초기 유량); (2) Maximum Flow (최대 유량); (3) Sediment Height (퇴적물의 높이) | Initial Flow (초기 유량) | Y |

| Target Variable Name | Transformed Name (Top-3 Similarity) | Standard Variable Name | Correct Matching |
|---|---|---|---|
| Unique Identifier (고유 식별자) | (1) Conduit Name (관의 이름); (2) Conduit Shape (관의 모양); (3) Conduit Thickness (관의 두께) | Conduit Name (관의 이름) | Y |
| Structure Type (구조물 형태) | (1) Conduit Shape (관의 모양); (2) Sediment Height (퇴적물의 높이); (3) Conduit Diameter (관거 지름) | Conduit Shape (관의 모양) | Y |
| Pipe Diameter (관경) | (1) Roughness Coefficient (거칠기 계수); (2) Conduit Diameter (관거 지름); (3) Conduit Shape (관의 모양) | Conduit Diameter (관거 지름) | Y |
| Material texture (재질) | (1) Roughness Coefficient (거칠기 계수); (2) Conduit Thickness (관의 두께); (3) Initial Flow (초기 유량) | Roughness Coefficient (거칠기 계수) | Y |
| Length (연장) | (1) Roughness Coefficient (거칠기 계수); (2) Conduit Length (관의 길이); (3) Conduit Shape (관의 모양) | Conduit Length (관의 길이) | Y |
| Dry-weather Flow Velocity (청천시 유속) | (1) Initial Flow (초기 유량); (2) Maximum Flow (최대 유량); (3) Sediment Height (퇴적물의 높이) | Initial Flow (초기 유량) | Y |

| Model | Hit@1 | Hit@3 | Precision | Recall | MRR | Note |
|---|---|---|---|---|---|---|
| Rule-based Method | 0.044 | 0.133 | 0.066 | 0.255 | 0.107 | Dictionary rules (synonyms) |
| KoSBERT | 0.144 | 0.267 | 0.101 | 0.255 | 0.200 | General KoSBERT model |
| BERT_fin | 0.450 | 0.650 | 0.217 | 0.590 | 0.533 | Fine-tuned on flood corpus |
| Ours (fin+aug) | 0.526 | 0.737 | 0.246 | 0.700 | 0.614 | Data augmentation (3 types) |
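The ranking metrics reported above (Hit@1, Hit@3, and MRR) follow their standard definitions, which the sketch below makes concrete. As input it uses the six rows of the second top-3 case-study table, written with their English names. These six rows are only a worked illustration and are not the evaluation set behind the reported scores, so the resulting values are not expected to match the table. The function names are assumptions, and the reciprocal rank is taken as zero when the correct name is missing from the candidate list.

```python
def hit_at_k(ranked_lists, gold, k):
    """Fraction of targets whose correct name appears in the top-k candidates."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the correct name (0 when it is not in the list)."""
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)


# Top-3 candidates per target, copied from the second case-study table.
ranked = [
    ["Conduit Name", "Conduit Shape", "Conduit Thickness"],
    ["Conduit Shape", "Sediment Height", "Conduit Diameter"],
    ["Roughness Coefficient", "Conduit Diameter", "Conduit Shape"],
    ["Roughness Coefficient", "Conduit Thickness", "Initial Flow"],
    ["Roughness Coefficient", "Conduit Length", "Conduit Shape"],
    ["Initial Flow", "Maximum Flow", "Sediment Height"],
]
# Standard variable name for each target, in the same row order.
gold = [
    "Conduit Name",
    "Conduit Shape",
    "Conduit Diameter",
    "Roughness Coefficient",
    "Conduit Length",
    "Initial Flow",
]

print(f"Hit@1 = {hit_at_k(ranked, gold, 1):.3f}")
print(f"Hit@3 = {hit_at_k(ranked, gold, 3):.3f}")
print(f"MRR   = {mean_reciprocal_rank(ranked, gold):.3f}")
```

For these six rows, four targets are correct at rank 1 and the remaining two at rank 2, so Hit@1 is 4/6, Hit@3 is 1, and MRR is 5/6.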

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

Choe, T.; Shin, M.; Kim, K.; Yang, M.; Man, K.L.; Kim, M. BERT-Based Schema Matching for Integrating Heterogeneous Flood Data: A Case Study in Korea. Systems 2026, 14, 267. https://doi.org/10.3390/systems14030267

