A Self-Adaptive LLM-Based Framework for Automated Extraction and Structuring of Earthquake Information from Heterogeneous Web Sources
Abstract
1. Introduction
2. Related Work
2.1. Traditional Web Parsing and Information Extraction Approaches
2.2. Transformer-Based Models for Structural and Semantic Understanding
2.3. Self-Learning and Adaptive Schema Generation Techniques
2.4. Automated Extraction Pipelines for Environmental and Hazard Monitoring
3. Materials and Methodology
3.1. Overview of the Data Acquisition and Processing Pipeline
- If the extracted output satisfies the validity criteria (e.g., minimum length threshold, structural plausibility), the schema is considered valid and stored in the repository for future reuse.
- If the schema is deemed invalid, the system proceeds with a two-stage decision process:
  - First, schema reuse is attempted by:
    - Generating an embedding representation of the input HTML;
    - Comparing it with stored schema profile embeddings;
    - Selecting candidate schemas that exceed a predefined similarity threshold η.
  - If the resulting extraction satisfies the validity criteria → schema reuse is accepted.
  - Otherwise → the system proceeds to schema regeneration, in which the LLM produces a new schema:
    - If valid → it is added to the schema repository.
    - If invalid → the regeneration process may be repeated, or the document is marked as an extraction failure.
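The decision flow above can be sketched in a few lines of Python. Every helper here (the embedding, extraction, and validity stubs, the threshold defaults, and the `Schema` type) is an illustrative placeholder standing in for the paper's components, not the authors' implementation:

```python
# Sketch of the two-stage schema selection loop: try reusing a stored
# schema whose profile embedding is similar to the incoming page, then
# fall back to (simulated) LLM-based schema regeneration.
from dataclasses import dataclass
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class Schema:
    selector: str
    profile_vec: list

# Hypothetical stand-ins: a real system would embed the raw HTML and
# call an LLM; here extraction simply succeeds when the schema's
# selector occurs in the page.
def embed(html):
    return [sum(c.isalpha() for c in html), sum(c.isdigit() for c in html)]

def extract(html, schema):
    return html if schema.selector in html else ""

def is_valid(text, min_len=10):
    return len(text) >= min_len  # length threshold stands in for validity

def generate_schema(html):  # stands in for the LLM schema-generation call
    return Schema(selector="<article", profile_vec=embed(html))

def process_document(html, repo, eta=0.9, max_regen=2):
    """Two-stage decision: reuse a similar stored schema, else regenerate."""
    doc_vec = embed(html)
    candidates = [s for s in repo if cosine(doc_vec, s.profile_vec) >= eta]
    for schema in sorted(candidates, key=lambda s: -cosine(doc_vec, s.profile_vec)):
        text = extract(html, schema)
        if is_valid(text):
            return text, schema          # schema reuse accepted
    for _ in range(max_regen):
        schema = generate_schema(html)   # LLM-based regeneration
        text = extract(html, schema)
        if is_valid(text):
            repo.append(schema)          # store for future reuse
            return text, schema
    return None, None                    # marked as extraction failure
```

A first page triggers regeneration and populates the repository; a structurally similar second page is then served by schema reuse without another LLM call.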
3.2. Keyword-Based Search Query Construction
3.3. HTML Retrieval and Sanitization
LLM-Based Schema Generation
3.4. Python-Based Content Extraction
3.5. Structured Storage of Parsed Content
3.6. Adaptive Self-Learning Mechanism
4. Mathematical Model of the Schema Extraction Mechanism
4.1. Schema Representation
4.2. Extraction and Utility Score
1. Length adequacy;
2. Structural confidence;
3. Semantic adequacy;
4. Validity Criterion and Decision Function;
5. Schema Matching with a Repository;
6. LLM-Based New Schema Generation;
7. Overall Optimization View.
- LLM parameters θ, which determine the quality of generated schemas.
- The schema repository S, which determines the effectiveness of reusing previously validated extraction patterns.
- The schema matching rule, via the utility score U and the similarity filter S_cand.
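Assembled into a single formulation, the quantities above might read as follows. The weights, thresholds, and exact functional forms are assumptions for illustration, consistent with the components named in Section 4 (adequacy terms, validity decision, similarity filter with threshold η):

```latex
% Utility of applying schema s to document d, combining the three
% adequacy terms (weights w_l, w_c, w_s are assumed, not from the paper):
U(s, d) = w_l \, L(s, d) + w_c \, C(s, d) + w_s \, A(s, d)

% Validity decision: accept the extraction iff utility clears a threshold \tau:
V(s, d) = \mathbb{1}\!\left[ U(s, d) \ge \tau \right]

% Candidate set for reuse: stored schemas in repository S whose profile
% embedding e_s is within similarity threshold \eta of the document
% embedding e_d:
S_{\mathrm{cand}} = \{\, s \in S : \mathrm{sim}(e_d, e_s) \ge \eta \,\}
```

Under this view, the LLM parameters θ shape the quality of newly generated schemas, while S and the η-filter govern how often generation can be avoided through reuse.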
5. Results
5.1. Test Set and Data Acquisition Strategy
5.1.1. Data Acquisition Strategy
Keyword Construction
Site Filtering
Search Depth
Deduplication Strategy
- Exact URL matching;
- Normalization techniques (removal of duplicate or tracking parameters).
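The two deduplication steps above can be combined into a single canonical-URL key. A minimal stdlib sketch follows; the list of tracking parameters is an assumption, not the paper's exact set:

```python
# URL normalization for deduplication: lowercase scheme/host, drop
# fragments, strip common tracking parameters, and sort the remaining
# query parameters so equivalent URLs map to one key.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
            "utm_content", "fbclid", "gclid"}  # assumed blacklist

def normalize_url(url):
    parts = urlsplit(url)
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k not in TRACKING)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/") or "/", urlencode(query), ""))

def deduplicate(urls):
    seen, unique = set(), []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

With this key, `https://Tengrinews.kz/news/?utm_source=x&id=5` and `https://tengrinews.kz/news?id=5` deduplicate to a single entry.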
Crawling Time Window
Page Selection Criteria
- Presence of domain-specific keywords;
- Non-empty textual content after preprocessing.
Failure Handling and Retry Mechanism
- If extraction results do not satisfy validity criteria, the schema is rejected.
- Schema reuse from the repository is attempted.
- If unsuccessful, schema regeneration is triggered.
- In case of retrieval failure, alternative queries may be executed.
5.1.2. Dataset Characteristics
- Structural diversity (variation in DOM complexity and layout);
- Content variability (differences in length and detail);
- Metadata inconsistency (explicit vs. implicit publication data);
- Multilingual content.
5.1.3. Preprocessing and Normalization
- Removal of scripts and non-informative elements;
- Normalization of malformed HTML;
- Reduction in boilerplate content;
- Conversion to a machine-readable format.
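The sanitization steps above can be approximated with the standard library alone. The element blacklist is an illustrative assumption; the paper's pipeline may use a dedicated HTML library instead:

```python
# Stdlib-only sketch of the sanitization step: skip text inside
# <script>/<style> and other non-informative containers, and return
# the remaining visible text as a single normalized string.
from html.parser import HTMLParser

class Sanitizer(HTMLParser):
    SKIP = {"script", "style", "noscript", "nav", "footer", "iframe"}

    def __init__(self):
        super().__init__()
        self.depth = 0          # >0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def sanitize(html):
    parser = Sanitizer()
    parser.feed(html)
    return " ".join(parser.chunks)
```

This also tolerates malformed input (unclosed tags) because `HTMLParser` is forgiving by design, which matters for the heterogeneous sources targeted here.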
5.1.4. Benchmark Design
- Main textual content;
- Optional metadata (e.g., publication date).
- Quantitative metrics (extraction accuracy and success rate);
- Qualitative assessment (completeness, relevance, and coherence).
5.1.5. Validity and Limitations
- All models are tested under identical conditions.
- The dataset captures high structural variability.
- Performance differences are consistent across metrics.
- Absence of fully annotated ground truth;
- Partial reliance on automated evaluation;
- Moderate dataset size.
5.2. Compared LLM Models
- Lightweight and efficiency-oriented architectures (GEMMA-3-27B);
- High-capacity generative models with strong reasoning capabilities (GPT_OSS-20B);
- Widely adopted open-weight models designed for general-purpose language understanding (LLAMA-3-8B).
5.3. Experimental Protocol
- CPU: Intel Core i9-14700K;
- GPU: 2 × NVIDIA RTX 5090 (32 GB VRAM each);
- RAM: 128 GB;
- Storage: 4 TB SSD;
- Operating System: Ubuntu 24.04 LTS.
5.4. Metrics and Scoring
- Content Completeness—the extent to which the main textual content of the webpage was fully captured;
- Semantic Correctness—the degree of alignment between the extracted content and the original source;
- Relevance—the absence of unrelated, duplicated, or boilerplate content;
- Coherence—readability, structural consistency, and logical flow of the extracted text.
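The four criteria appear to aggregate by an unweighted mean: for GPT_OSS, (9.3 + 9.2 + 9.0 + 9.1)/4 = 9.15, matching the "Avg. Score" column in the qualitative results table. A minimal scorer under that assumption:

```python
# Aggregate the four qualitative criteria into one score. An unweighted
# mean is assumed; it reproduces the reported "Avg. Score" values
# (e.g. GPT_OSS -> 9.15, LLAMA -> 6.2).
CRITERIA = ("completeness", "correctness", "relevance", "coherence")

def average_score(scores):
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    return round(sum(scores[c] for c in CRITERIA) / len(CRITERIA), 2)
```

The weighting is an assumption; if the authors used criterion weights, only the coefficients in the sum would change.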
6. Discussion
- Stronger selector robustness mechanisms. The prevalence of fragility-related failures suggests that schema generation should avoid overly specific selectors (e.g., deep DOM chains and nth-child patterns) and instead prefer stable anchors such as semantic tags, consistent class patterns, and content-aware constraints.
- Hybrid extraction with fallback. When LLM-generated selectors fail or produce noisy outputs, integrating lightweight fallback mechanisms, such as boilerplate removal heuristics, content-density rules, or template-based extraction may improve reliability for structurally complex domains.
- Improved validation beyond length thresholds. Extending validation to include semantic plausibility checks (e.g., language consistency, earthquake keyword coverage for the extracted main text, or date format constraints) can reduce false positives where irrelevant blocks pass minimal checks.
- Expanded evaluation and reproducibility. The current evaluation would be strengthened by (i) a larger test set (more pages per domain and more domains), (ii) manual ground truth for at least a subset, and (iii) reporting confidence intervals or variance across repeated samples.
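The third recommendation (validation beyond length thresholds) can be made concrete. The sketch below checks keyword coverage and a date-format constraint in addition to minimal length; the keyword list and thresholds are illustrative assumptions, not the paper's configuration:

```python
# Validity check combining a length threshold with semantic plausibility:
# domain-keyword coverage and a simple date-format constraint.
import re

QUAKE_TERMS = {"earthquake", "quake", "magnitude", "epicenter",
               "seismic", "tremor", "aftershock"}  # assumed keyword set
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{1,2}\s+\w+\s+\d{4})\b")

def is_plausible(text, date=None, min_len=200, min_hits=2):
    words = set(re.findall(r"[a-z]+", text.lower()))
    keyword_ok = len(QUAKE_TERMS & words) >= min_hits
    date_ok = date is None or bool(DATE_RE.search(date))
    return len(text) >= min_len and keyword_ok and date_ok
```

A boilerplate block that merely clears the length threshold would fail the keyword test, addressing the false positives described above.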
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
| LLM | Large Language Model |
| GPT | Generative Pre-Trained Transformer |
| GEMMA | General Efficient Multimodal Model Architecture |
| LLAMA | Large Language Model Meta AI |
| HTML | HyperText Markup Language |
| CSS | Cascading Style Sheets |
| XPath | XML Path Language |
| DOM | Document Object Model |
| JSON | JavaScript Object Notation |
| API | Application Programming Interface |
| UMAP | Uniform Manifold Approximation and Projection |
| NLP | Natural Language Processing |
| RAG | Retrieval-Augmented Generation |
| CPU | Central Processing Unit |
| GPU | Graphics Processing Unit |
| Extraction Acc | Extraction Accuracy |
| GPT Score | Large Language Model-Based Evaluation Score |
References
| LLM Model | Success Rate | Extraction Acc | GPT Quality | Final Score |
|---|---|---|---|---|
| GPT_OSS | 85.0% | 96.5% | 71.1% | 84.26 |
| GEMMA | 85.0% | 92.4% | 63.2% | 80.68 |
| LLAMA | 85.0% | 52.4% | 44.8% | 63.13 |
| Model | Precision | Recall | F1-Score |
|---|---|---|---|
| GPT_OSS | 0.965 | 1.000 | 0.982 |
| GEMMA | 0.924 | 0.941 | 0.933 |
| LLAMA | 0.524 | 0.706 | 0.601 |
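The F1 column above is the harmonic mean of precision and recall. Recomputing from the rounded P/R values reproduces the table to within one unit in the third decimal (GEMMA comes out as 0.932 rather than 0.933, suggesting the published figure used unrounded inputs):

```python
# F1 as the harmonic mean of precision and recall, as used in the
# per-model table above.
def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```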
| Model | Content Completeness | Semantic Correctness | Relevance | Coherence | Std. Dev. | Avg. Score |
|---|---|---|---|---|---|---|
| GPT_OSS | 9.3 | 9.2 | 9.0 | 9.1 | 0.6 | 9.15 |
| GEMMA | 8.5 | 8.4 | 8.2 | 8.1 | 0.7 | 8.30 |
| LLAMA | 6.8 | 6.5 | 6.0 | 5.5 | 0.9 | 6.20 |
| Model | N (Documents) | Mean Extraction Score | Mean GPT Score (0–10) | Mean Main Text Quality (0–10) | Mean Date Quality (0–10) | Zero-Text Pages | Pages with Extraction Score < 1 |
|---|---|---|---|---|---|---|---|
| GEMMA | 17 | 0.924 | 6.32 | 6.69 | 6.79 | 1 | 3 |
| GPT_OSS | 17 | 0.965 | 7.11 | 7.48 | 9.11 | 0 | 2 |
| LLAMA | 17 | 0.524 | 4.48 | 5.04 | 4.32 | 5 | 14 |
| Domain | Total Errors | Penalty (Points) | Missing Content | Wrong Element | Excessive Noise | Fragility | Minor |
|---|---|---|---|---|---|---|---|
| https://tengrinews.kz | 50 | −64.5 | 21 | 6 | 17 | 5 | 1 |
| https://www.volcanodiscovery.com | 28 | −44.5 | 6 | 15 | 2 | 5 | 0 |
| https://dknews.kz | 20 | −25.5 | 11 | 2 | 3 | 4 | 0 |
| https://rus.azattyq.org | 14 | −4.5 | 3 | 4 | 2 | 5 | 0 |
| https://voshod-solnca.ru | 12 | −19.0 | 6 | 4 | 2 | 0 | 0 |
| https://www.gov.kz | 12 | −20.5 | 1 | 10 | 0 | 1 | 0 |
| https://kndc.kz | 10 | −10.5 | 1 | 4 | 3 | 2 | 0 |
| https://el.kz | 9 | −14.0 | 0 | 6 | 1 | 2 | 0 |
| https://prg.kz | 9 | −7.0 | 1 | 6 | 2 | 0 | 0 |
| https://www.zakon.kz | 9 | −12.5 | 3 | 3 | 2 | 0 | 1 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Turarbek, A.; Rakhimova, D.; Adetbekov, Y.; Nurgali, A. A Self-Adaptive LLM-Based Framework for Automated Extraction and Structuring of Earthquake Information from Heterogeneous Web Sources. Computers 2026, 15, 294. https://doi.org/10.3390/computers15050294

