Article

A Self-Adaptive LLM-Based Framework for Automated Extraction and Structuring of Earthquake Information from Heterogeneous Web Sources

1 Faculty of Information Technology and Artificial Intelligence, Farabi University, Almaty 050040, Kazakhstan
2 Institute of Physics, Mathematics and Digital Technologies, Kazakh National Women’s Teacher Training University, Almaty 050000, Kazakhstan
* Authors to whom correspondence should be addressed.
Computers 2026, 15(5), 294; https://doi.org/10.3390/computers15050294
Submission received: 16 March 2026 / Revised: 25 April 2026 / Accepted: 27 April 2026 / Published: 5 May 2026

Abstract

The rapid growth of heterogeneous web sources has created significant challenges for the automated extraction and structuring of critical domain-specific information, particularly in real-time seismic monitoring scenarios. Despite the existence of official governmental reporting systems, relevant earthquake-related data are often distributed across diverse online platforms with highly variable and dynamically evolving HTML (HyperText Markup Language) structures, leading to incomplete, delayed, or inconsistent information retrieval. Existing rule-based and semi-automated approaches lack scalability and robustness under such conditions. To address this gap, this study proposes a self-adaptive framework based on large language models (LLMs) for the automated extraction and structuring of earthquake-related web content. The proposed approach integrates transformer-based schema generation, repository-guided schema matching, and an iterative refinement mechanism, enabling the system to dynamically adapt to heterogeneous document structures. A formal utility-based decision mechanism is introduced to optimize schema selection and reuse, while embedding-based similarity modeling facilitates efficient transfer of extraction patterns across structurally related webpages. The experimental evaluation was conducted on a heterogeneous benchmark dataset comprising multiple web domains with diverse structural characteristics. The results demonstrate that the proposed framework achieves a success rate of 85% across all evaluated models, with the best-performing configuration reaching an extraction accuracy of 96.5% and a final composite score of 84.26. Additional analysis reveals significant improvements in extraction completeness, reduction in false positives and false negatives, and effective reuse of a compact set of robust schemas. 
Error analysis indicates that the primary challenges are associated with noisy HTML structures and incorrect DOM (Document Object Model) element selection, rather than deficiencies in textual content. The findings confirm that combining lightweight transformer models with adaptive memory and schema reuse mechanisms enables the development of scalable, robust, and high-performance web extraction systems. The proposed approach is particularly suitable for real-time information retrieval in safety-critical domains, where timely and accurate data aggregation from heterogeneous sources is essential.

1. Introduction

In Kazakhstan, seismic monitoring and public alerting are carried out by established governmental institutions, including the Ministry of Emergency Situations of the Republic of Kazakhstan, the National Scientific Center for Seismological Observations and Research, and regional emergency departments. These entities rely on specialized infrastructure to detect seismic activity and provide timely notifications to the population.
In current practice, public communication is primarily implemented through centralized alerting mechanisms, such as SMS notifications, official governmental portals (e.g., https://www.gov.kz/), mass media broadcasting, and official social media channels. While these channels are essential for rapid response, they are inherently designed for short, standardized, and operational messages, typically limited to basic event information and safety instructions.
However, such formats do not provide comprehensive contextual information, structured data representations, or analytical capabilities required for advanced research and decision-support systems. In parallel, a substantial volume of earthquake-related information is continuously disseminated across heterogeneous and non-centralized web sources, including news agencies, regional and international information portals, and scientific platforms. These sources are not integrated into the official alerting infrastructure and are characterized by high structural variability, fragmentation, and lack of standardization, making them difficult to process using conventional methods.
As a result, despite the presence of official monitoring systems, there exists a clear research and technological gap between event detection and notification, and the systematic extraction, structuring, and integration of distributed web-based information.
From a broader perspective, this research aligns with the United Nations Sustainable Development Goals (SDGs), particularly SDG 11 (Sustainable Cities and Communities) and SDG 13 (Climate Action), which emphasize the importance of resilient infrastructure, effective disaster risk management, and timely access to reliable information. The ability to automatically extract and structure earthquake-related data from heterogeneous web sources contributes to improving situational awareness, supporting emergency response systems, and enhancing data-driven decision-making in the context of natural hazards.
To address this challenge, this study proposes a self-adaptive LLM-based framework designed to automatically extract and structure earthquake-related information from heterogeneous web sources. The proposed approach operates as a complementary analytical layer, enabling the transformation of unstructured web content into structured data suitable for downstream tasks such as monitoring, risk analysis, and decision support.
The rapid expansion of digital information ecosystems has intensified the demand for automated methods capable of extracting, structuring, and interpreting large volumes of heterogeneous web content. Modern information retrieval pipelines, particularly those designed for real-time monitoring of natural hazards such as earthquakes, face the challenge of processing highly diverse HTML structures and inconsistent data formatting across online sources [1]. Traditional rule-based parsers and manually crafted scraping heuristics often lack the scalability and adaptability necessary to accommodate the dynamic and frequently evolving layouts of contemporary webpages [2]. Moreover, the increasing complexity of web interfaces introduces additional obstacles for deterministic extraction methods, resulting in incomplete retrieval, low generalizability, and significant maintenance burdens [3].
Recent advances in natural language processing and representation learning provide new opportunities for addressing these limitations. Transformer-based architectures, including lightweight variants such as LLMs, have demonstrated strong capabilities in capturing semantic relations, performing contextual inference, and generalizing across previously unseen inputs [4]. These models enable the automatic induction of extraction schemas, allowing systems to infer structural patterns directly from HTML content instead of relying on handcrafted selectors [5]. Such data-driven schema generation methods have shown substantial promise in reducing manual intervention and enhancing the robustness of automated parsing systems [6]. Integrating these models into end-to-end pipelines allows for iterative refinement, where extraction errors become feedback signals for generating updated and more accurate schemas [7].
In the context of seismic event monitoring, reliable content extraction represents a critical prerequisite for timely analysis and dissemination of situational information [8]. Earthquake-related reports are published across a wide range of media outlets, scientific platforms, governmental portals, and real-time alerting services, each presenting data in different formats and structural compositions [9]. Consequently, a self-adaptive parsing framework capable of learning, adjusting, and accumulating extraction schemas is essential for ensuring consistent and high-quality data acquisition [10]. The development of such frameworks aligns with broader trends in intelligent information systems, where autonomous, model-driven decision mechanisms replace rigid, handcrafted pipelines [11]. This research contributes to the ongoing evolution of automated extraction methodologies by presenting a self-improving LLM-based system designed to retrieve, parse, and structure earthquake-related information from heterogeneous web sources.
This study is not intended as a large-scale benchmark but rather as a controlled experimental investigation of LLM robustness under structurally heterogeneous HTML conditions. The primary objective is to analyze extraction behavior across diverse layouts rather than to provide statistically generalizable performance estimates.

2. Related Work

A comprehensive examination of prior research is essential for contextualizing the development of intelligent, self-adapting web parsing systems. Existing studies span several methodological domains, including traditional rule-based extraction, transformer-driven semantic modeling, adaptive schema generation frameworks, and automated data acquisition pipelines for hazard monitoring. By synthesizing insights from these areas, the current work positions itself at the intersection of structural understanding, machine learning–based automation, and real-time information retrieval, highlighting the technological progression that enables robust extraction of heterogeneous earthquake-related web data [3,7].

2.1. Traditional Web Parsing and Information Extraction Approaches

Classical approaches to web data extraction primarily relied on deterministic rule-based mechanisms, including handcrafted XPath (XML Path Language) expressions, CSS (Cascading Style Sheets) selectors, and static DOM traversal pipelines. These strategies offered strong precision for stable, unchanging websites, yet their performance degraded rapidly when confronted with dynamic layouts or frequently updated page structures [12]. Research has shown that manual rule engineering becomes increasingly unsustainable as the number of monitored sources expands, particularly in applications requiring high temporal responsiveness, such as disaster-monitoring systems [13]. Early semi-automated extraction frameworks attempted to mitigate variability by incorporating template detection and heuristic adaptation, but these solutions remained incapable of generalizing across heterogeneous HTML patterns [14]. More advanced extraction systems employed shallow machine learning methods to classify page elements and assign semantic labels, though these models struggled with the high structural diversity that characterizes modern web platforms [15]. Subsequent work utilizing statistical learning improved robustness but continued to depend on predefined patterns, limiting scalability in real-world environments [16].

2.2. Transformer-Based Models for Structural and Semantic Understanding

The emergence of transformer architectures transformed the landscape of natural language understanding, enabling models to capture long-range dependencies, hierarchical relationships, and contextual cues with unprecedented accuracy [17]. Studies demonstrate that lightweight transformer variants, such as LLMs [5], maintain strong representational power while significantly reducing computational overhead, making them suitable for large-scale extraction systems [18]. Investigations into neural schema induction reveal that transformer-based embeddings effectively map heterogeneous structural features into a unified latent space, facilitating the prediction of CSS selectors and extraction rules directly from unstructured HTML [19]. Prior research indicates that attention mechanisms provide superior robustness to noise and visual clutter in web documents, enhancing their ability to differentiate between primary and auxiliary content blocks [20]. Additional studies highlight the capacity of transformers to generalize across unknown domains, an essential property for parsing systems encountering unpredictable HTML formats [21]. These advancements underscore the potential of transformer-driven pipelines for building self-adapting extraction systems with minimal human supervision [22].

2.3. Self-Learning and Adaptive Schema Generation Techniques

Progress in adaptive extraction frameworks has increasingly focused on mechanisms that enable models to iteratively refine parsing schemas based on real-time feedback. Early adaptive systems relied on simple error detection heuristics to trigger rule regeneration but lacked the capacity to incorporate learned structural patterns [23]. More recent research introduced schema repositories, where extraction rules generated for one domain could be transferred across similar page structures, significantly improving coverage and reducing model invocation frequency [24]. Studies in incremental schema evolution emphasize that effective adaptation requires continuous accumulation of successful extraction patterns, enabling systems to build a progressively richer understanding of structural variability [25]. Investigations also demonstrated that multi-stage extraction workflows, where a model first predicts a schema and then validates it through fine-grained content checks, can dramatically increase extraction accuracy [26]. Further contributions reveal that feedback loops, enabled through model-driven re-parsing and schema correction, produce self-improving systems capable of achieving long-term stability even in unstable web environments [27]. Research also points to the importance of synthesizing training datasets for schema generation models, as manually collecting large-scale HTML–schema pairs remains prohibitively expensive [28].

2.4. Automated Extraction Pipelines for Environmental and Hazard Monitoring

Automated data acquisition plays a pivotal role in environmental surveillance, natural hazard modeling, and disaster risk assessment. Several studies highlight the necessity of real-time information retrieval from diverse online sources to support rapid decision making during seismic events [29]. Research in earthquake information systems demonstrates that the accuracy of downstream analytics, such as epicenter localization and impact forecasting, heavily depends on the quality and completeness of extracted web data [30]. Modern hazard-monitoring platforms increasingly integrate machine learning-driven extraction methods to reduce the delays associated with manual data processing and to ensure consistency across heterogeneous data streams [31]. Investigations show that adaptive extraction pipelines significantly outperform static scrapers in contexts where information appears across news agencies, governmental portals, and citizen reporting platforms with different structural patterns [32]. Additional work underscores the importance of enriching extracted data with metadata such as timestamps, contextual descriptions, and reporting reliability for accurate seismic event classification [33]. Studies on cross-platform data integration reveal that automated extraction mechanisms must accommodate multilingual sources and structurally divergent content representations, further motivating the use of transformer training paradigms. Finally, research indicates that self-learning extraction architectures contribute to long-term sustainability of seismic-monitoring systems by reducing maintenance overhead and improving resilience to evolving web structures [34].

3. Materials and Methodology

This section outlines the complete technical workflow underpinning the proposed automated extraction system, covering each stage of the pipeline from data acquisition to schema generation, validation, and structured storage. It details the algorithms, models, database structures, and computational procedures employed to retrieve heterogeneous earthquake-related web content, transform raw HTML into machine-interpretable representations, and iteratively refine extraction accuracy through adaptive learning mechanisms. The figures referenced throughout this section illustrate the architectural components, data models, and processing stages that collectively enable the system’s robust performance across diverse and dynamically evolving web sources.

3.1. Overview of the Data Acquisition and Processing Pipeline

The proposed system consists of a fully automated multi-stage pipeline designed to retrieve, parse, and structurally organize earthquake-related online content. The workflow integrates keyword-driven web search, HTML collection, LLM-based schema generation, Python-driven extraction procedures, and structured database storage. Each stage is modular and optimized for adaptability, ensuring high robustness to heterogeneous webpage structures encountered across different news, scientific, and governmental portals. The overall process is illustrated through figures, which represent the major computational components and corresponding database schemas. Figure 1 provides a high-level overview of the automated LLM-based earthquake data extraction pipeline, illustrating the sequential flow of operations that transform unstructured web content into a structured information dataset. The process begins with the specification of keyword lists and optional domain-level site filters, which are supplied to the DuckDuckGo query executor responsible for retrieving relevant webpages. Retrieved links are then processed by the HTML downloader and sanitizer module, which ensures consistent and clean input for downstream analysis. The sanitized HTML documents are passed to the extraction engine, where an initial parsing attempt is made using either previously stored schemas or newly generated ones. A schema validation step evaluates whether the extracted content meets predefined structural and semantic criteria. If the schema is deemed invalid, the system consults the schema repository or triggers regeneration, creating a feedback loop that enhances system adaptability. Valid schemas lead to the execution engine, which finalizes the extraction and populates the structured earthquake information dataset. This flowchart emphasizes the modularity, adaptability, and self-improving nature of the system, aligning with the methodological framework described in this section.
Schema Selection and Regeneration Strategy
To ensure robust extraction performance across heterogeneous web sources, the system implements a decision-making mechanism for selecting between schema reuse and schema regeneration.
After the initial extraction step, each candidate schema is evaluated using the predefined validity function:
  • If the extracted output satisfies the validity criteria (e.g., minimum length threshold, structural plausibility), the schema is considered valid and stored in the repository for future reuse.
  • If the schema is deemed invalid, the system proceeds with a two-stage decision process:
Step 1: Schema Repository Matching
The system attempts to identify a suitable schema from the existing repository based on structural similarity between the current document and previously processed documents.
The schema repository serves as a persistent storage of previously validated extraction schemas. When structural similarity between the current HTML document and previously processed documents is detected, a suitable schema is retrieved and reused. This mechanism enables the system to bypass schema regeneration and directly apply an existing extraction pattern, thereby improving efficiency and ensuring consistent extraction across structurally similar web sources.
This is achieved by:
  • Generating an embedding representation of the input HTML;
  • Comparing it with stored schema profile embeddings;
  • Selecting candidate schemas that exceed a predefined similarity threshold η.
If a matching schema is found, it is applied to the document:
  • If the resulting extraction satisfies the validity criteria → schema reuse is accepted.
  • Otherwise → proceed to schema regeneration.
Step 2: Schema Regeneration
If no suitable schema is found in the repository, or if all candidate schemas fail validation, a new schema is generated by the LLM and then evaluated:
  • If valid → it is added to the schema repository.
  • If invalid → the regeneration process may be repeated or the document is marked as extraction failure.
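The two-stage decision process above can be sketched as follows. This is an illustrative sketch, not the actual implementation: the function and repository interfaces (`extract_with_fallback`, `repository.match`, `generate_schema`, `is_valid`) are hypothetical names standing in for the components described in the text.

```python
def extract_with_fallback(html, repository, generate_schema, apply_schema, is_valid):
    """Try repository reuse first (Step 1); regenerate only when reuse fails (Step 2)."""
    # Step 1: schema repository matching by structural similarity
    for schema in repository.match(html):
        result = apply_schema(html, schema)
        if is_valid(result):
            return result, schema          # schema reuse accepted
    # Step 2: schema regeneration via the LLM
    schema = generate_schema(html)
    result = apply_schema(html, schema)
    if is_valid(result):
        repository.add(schema, html)       # store validated schema for future reuse
        return result, schema
    return None, None                      # document marked as extraction failure
```

On the second encounter with a structurally similar page, the loop in Step 1 succeeds and the LLM is never invoked, which is the efficiency gain the repository is designed for.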

3.2. Keyword-Based Search Query Construction

The pipeline begins with the creation of a curated set of keywords optimized for earthquake-related content retrieval. These include general terms (e.g., “earthquake,” “seismic event,” “magnitude,” and “aftershock”) as well as region-specific and domain-specific descriptors.
As shown in Figure 2, each keyword is optionally combined with a domain-level site filter, producing a structured query:
Q = { (k_i, s_j) | k_i ∈ K, s_j ∈ S ∪ {∅} },
where:
Q — the set of generated search queries used for web data acquisition;
k_i — the i-th keyword from the keyword list, where i = 1, 2, …, |K|;
s_j — the j-th domain filter, where j = 1, 2, …, |S|, or s_j = ∅, indicating the absence of a domain constraint;
K — the set of keywords used for query construction (e.g., earthquake, seismic event, aftershock);
S — the set of domain-specific filters (e.g., site:tengrinews.kz, site:gov.kz);
∅ — the empty element, indicating that a query is formed without any domain restriction;
(k_i, s_j) — an ordered pair representing a query composed of a keyword and an optional domain filter;
|K|, |S| — the cardinalities of the sets K and S, respectively.
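As a minimal illustration of the query-set construction Q = K × (S ∪ {∅}), the following sketch uses a few of the example keywords and site filters mentioned above; the actual keyword and filter lists used by the system are larger.

```python
from itertools import product

# Example keyword list K and domain-filter list S; `None` represents the
# empty element (a query with no domain restriction).
K = ["earthquake", "seismic event", "aftershock"]
S = ["site:tengrinews.kz", "site:gov.kz", None]

# Cartesian product K x (S ∪ {∅}) rendered as search-engine query strings
queries = [f"{k} {s}" if s else k for k, s in product(K, S)]
# |Q| = |K| * (|S| + 1) = 3 * 3 = 9 queries in this toy example
```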
The constructed queries are passed to the DuckDuckGo (DuckDuckGo Inc., Paoli, PA, USA) API interface, which returns a list of URLs and corresponding page titles. The results are stored in the search results table. Each entry includes the query used, discovered link, page title, domain filter, and timestamps:
R = { (q, ℓ, t, f, τ_c, τ_u) },
where:
R — the set of retrieved search results obtained from the query execution process;
q — the search query used to retrieve the webpage (constructed as a combination of a keyword and an optional domain filter);
ℓ — the URL (link) of the retrieved webpage;
t — the title of the webpage returned by the search engine;
f — the applied domain filter (e.g., site:gov.kz), or f = ∅ if no domain restriction is used;
τ_c — the timestamp of query execution (creation time of the record);
τ_u — the timestamp of the last update or access to the retrieved result.

3.3. HTML Retrieval and Sanitization

For every collected URL, the system downloads the raw HTML content. Sanitization includes script removal, normalization of malformed tags, and removal of irrelevant styling structures. This preprocessing ensures that the downstream LLM receives uniform, machine-readable input. Sanitized HTML samples form the dataset used in the subsequent schema induction step.
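A minimal, stdlib-only sketch of the sanitization step just described (script removal and whitespace normalization); the actual module may additionally repair malformed tags, which is omitted here.

```python
import re

def sanitize_html(raw_html: str) -> str:
    """Strip non-content markup so the LLM receives compact, uniform input."""
    # Remove <script> and <style> blocks, which carry no extractable content
    cleaned = re.sub(r"<(script|style)\b[^>]*>.*?</\1>", "", raw_html,
                     flags=re.DOTALL | re.IGNORECASE)
    # Strip HTML comments left by templating engines
    cleaned = re.sub(r"<!--.*?-->", "", cleaned, flags=re.DOTALL)
    # Collapse runs of whitespace to keep the model input compact
    return re.sub(r"\s+", " ", cleaned).strip()
```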

LLM-Based Schema Generation

As shown in Figure 3, the LLM transformer model is used to generate a JSON extraction schema that identifies HTML regions associated with the main textual content and optional metadata such as publication date.
The model receives raw HTML as input:
S = f θ ( H )
where H is the HTML document and fθ is the LLM encoder parameterized by θ. The output schema S consists of predicted CSS selectors or XPath expressions:
S = { S t e x t , S d a t e }
The model is initially trained on a synthetic dataset composed of paired HTML–schema examples, allowing it to generalize to websites not included in the training set.
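A hypothetical example of consuming the model's JSON output. The key names (`text`, `date`) are assumptions for illustration; the paper specifies only that the schema S = {s_text, s_date} contains predicted CSS selectors or XPath expressions for the main text and an optional publication date.

```python
import json

def parse_schema(llm_output: str) -> dict:
    """Parse and minimally validate the JSON extraction schema emitted by the model."""
    schema = json.loads(llm_output)
    if "text" not in schema:            # the main-content selector is mandatory
        raise ValueError("schema lacks a main-text selector")
    return schema

# Example of what a generated schema might look like (selectors are invented):
example = parse_schema('{"text": "div.article-body p", "date": "time.published"}')
```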

3.4. Python-Based Content Extraction

The predicted schema is passed to the Python extraction engine, which applies the selectors to the sanitized HTML. As illustrated in Figure 3, the Python code executes:
T_main, T_raw = g(H, S)
where T_main is the extracted meaningful content and T_raw is the unfiltered text blob. If no meaningful text is retrieved, the schema is considered invalid.
A validation criterion is defined as
valid(S) = { 1, if |T_main| ≥ δ; 0, otherwise }
where δ is a minimal text-length threshold. Invalid schemas trigger an automated regeneration cycle, improving robustness across structurally diverse pages.
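The length-based validity criterion can be sketched directly; the threshold δ = 200 characters below is an assumed example value, not the one used in the experiments.

```python
DELTA = 200  # assumed minimal acceptable length of T_main, in characters

def valid(t_main: str, delta: int = DELTA) -> int:
    """valid(S) = 1 iff the extracted main text meets the length threshold."""
    return 1 if len(t_main) >= delta else 0
```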

3.5. Structured Storage of Parsed Content

Validated extraction results are stored in the parsed content table defined in Figure 4, which maintains direct referential linkage to the originating search result via a foreign key constraint: parsed_content.search_result_id → search_results.id. Each entry contains the raw extracted text, the cleaned main text, and metadata timestamps:
C = (id, r, T_raw, T_main, τ_p, τ_c, τ_u)
where r references the originating search result, τ_p is the parsing timestamp, and τ_c, τ_u are the record creation and update timestamps.
This structured repository serves as the foundation for downstream analytics, natural language understanding, and seismic information modeling.
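An illustrative SQLite rendering of the two linked tables. The column names are assumptions derived from the tuples R and C defined above, not the actual DDL shown in Figure 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE search_results (
    id          INTEGER PRIMARY KEY,
    query       TEXT,   -- q: the executed search query
    link        TEXT,   -- l: URL of the retrieved webpage
    title       TEXT,   -- t: page title from the search engine
    site_filter TEXT,   -- f: applied domain filter, NULL if none
    created_at  TEXT,   -- tau_c
    updated_at  TEXT    -- tau_u
);
CREATE TABLE parsed_content (
    id               INTEGER PRIMARY KEY,
    search_result_id INTEGER REFERENCES search_results(id),  -- foreign key r
    raw_text         TEXT,   -- T_raw: unfiltered text blob
    main_text        TEXT,   -- T_main: cleaned main content
    parsed_at        TEXT,   -- tau_p
    created_at       TEXT,   -- tau_c
    updated_at       TEXT    -- tau_u
);
""")
```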

3.6. Adaptive Self-Learning Mechanism

Every newly validated schema is added to the schema repository and reused for future pages with similar structural signatures. Over time, the system builds a growing collection of reliable extraction patterns, reducing inference time and improving long-term stability. The pipeline thus behaves as a semi-supervised self-improving system with cumulative learning characteristics.

4. Mathematical Model of the Schema Extraction Mechanism

Let each HTML page be represented by a rooted ordered tree (DOM):
D = (V, E, r, τ, α)
where
V is the set of nodes (HTML elements and text nodes);
E ⊆ V × V is the parent–child relation;
r ∈ V is the root node;
τ : V → T assigns an HTML tag type (div, p, span, etc.);
α : V → A assigns attributes (id, class, href, etc.).
For each node v ∈ V, define its path from the root as
path(v) = (r = v_0, v_1, …, v_k = v)
and the corresponding tag sequence
tag(v) = (τ(v_0), τ(v_1), …, τ(v_k)).
Let text(v) denote the concatenated textual content under node v.

4.1. Schema Representation

Assume we want to extract a fixed set of fields
F = {f_1, …, f_m}
(e.g., the main article text and the publication date).
A schema S is a collection of field-specific structural patterns:
S = { s_f | f ∈ F }
Each s_f is a selector pattern over DOM nodes. For abstraction, treat each selector as a predicate over nodes:
V_f(S, D) = { v ∈ V | s_f(v) = 1 }
The extracted text for field f is
T_f(S, D) = ⊕_{v ∈ V_f(S, D)} text(v)
where ⊕ denotes concatenation with canonical normalization.

4.2. Extraction and Utility Score

Define an extraction operator
φ(S, D) = { T_f(S, D) | f ∈ F }
For each field f, a utility score is defined to quantify the quality of the extracted content:
U_f(S, D) = λ_1 u_f^len(S, D) + λ_2 u_f^struct(S, D) + λ_3 u_f^sem(S, D)
with nonnegative weights λ_1, λ_2, λ_3.
1. Length adequacy.
u_f^len(S, D) = min(1, |T_f(S, D)| / L_f^min)
where |T_f| is the number of characters or tokens, and L_f^min is a task-specific minimal acceptable length.
2. Structural confidence.
Let ρ(s_f) ∈ [0, 1] be a prior confidence of the selector for field f (e.g., estimated from previous successful uses). Then
u_f^struct(S, D) = ρ(s_f)
3. Semantic adequacy.
A classifier h_f (e.g., a small language model head) determines whether the extracted text looks like a valid instance of field f:
u_f^sem(S, D) = h_f(T_f(S, D)) ∈ [0, 1]
The total schema utility on document D is
U(S, D) = (1 / |F|) Σ_{f ∈ F} U_f(S, D)
4. Validity Criterion and Decision Function.
Define a validity function
U_valid(S, D) = { 1, if U(S, D) ≥ γ; 0, otherwise }
where γ ∈ (0, 1) is a threshold.
This formulation formalizes the schema validation mechanism, whereby schemas that produce insufficient, structurally inconsistent, or semantically implausible outputs are rejected.
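A numerical sketch of the utility and validity computations defined above; the weights λ_1, λ_2, λ_3 and the threshold γ are assumed example values, not the ones used in the experiments.

```python
def field_utility(u_len, u_struct, u_sem, lambdas=(0.4, 0.3, 0.3)):
    """U_f = lambda_1 * u_len + lambda_2 * u_struct + lambda_3 * u_sem."""
    l1, l2, l3 = lambdas
    return l1 * u_len + l2 * u_struct + l3 * u_sem

def schema_utility(field_scores):
    """U(S, D): mean of per-field utilities over the field set F."""
    return sum(field_scores) / len(field_scores)

def u_valid(utility, gamma=0.7):
    """U_valid(S, D) = 1 iff U(S, D) >= gamma (gamma is an example value)."""
    return 1 if utility >= gamma else 0
```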
5. Schema Matching with a Repository.
Assume a repository 𝒮 = {S^(1), …, S^(N)} of previously validated schemas, each associated with a set of documents for which the schema demonstrated reliable extraction performance.
For a new document D*, schema matching is the problem
S* = argmax_{S ∈ 𝒮} U(S, D*)  subject to  U(S, D*) ≥ γ
If such an S* exists, the repository is considered to contain a suitable matching schema for D*. Otherwise, a new schema is generated by the LLM.
To approximate this argmax efficiently, one can introduce a similarity metric between documents and schema “profiles”.
Schema Profile Embeddings
Let each schema S be associated with a profile vector e_S ∈ R^d constructed, for example, by averaging LLM embeddings of documents where S was valid:
e_S = (1 / |D_S|) Σ_{D ∈ D_S} ψ(D)
where D_S is the set of documents successfully parsed by S, and ψ maps an HTML document to a dense vector (the LLM representation of the HTML or its main text).
For a new document D* we compute
e* = ψ(D*)
and define similarity, for example, cosine similarity:
sim(S, D*) = ⟨e_S, e*⟩ / (‖e_S‖ ‖e*‖)
A fast candidate set is
S_cand(D*) = { S ∈ 𝒮 | sim(S, D*) ≥ η }
with similarity threshold η. The full utility is then evaluated only on S_cand:
S* = argmax_{S ∈ S_cand(D*)} U(S, D*)
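The similarity-based candidate filtering can be sketched as follows. Here the embedding function ψ is treated as external (its output vectors are given), and η = 0.8 is an assumed example threshold.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def candidate_schemas(profiles, doc_embedding, eta=0.8):
    """Return schemas whose profile similarity to psi(D*) meets the threshold eta.

    `profiles` maps each schema identifier to its profile vector e_S.
    """
    return [s for s, e in profiles.items() if cosine(e, doc_embedding) >= eta]
```

Only the surviving candidates are scored with the full utility U, which avoids running the expensive extraction-and-validation loop over the entire repository.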
6. LLM-Based New Schema Generation.
If no existing schema satisfies the validity constraint, the LLM produces a new schema:
S_new = G_θ(H*)
where G_θ is the generation function parameterized by the LLM weights and decoding logic, and H* is the sanitized HTML of D*.
After extraction, if valid(S_new, D*) = 1, then
𝒮 ← 𝒮 ∪ {S_new},  D_{S_new} ← D_{S_new} ∪ {D*}
and the corresponding profile embedding e_{S_new} is initialized from ψ(D*).
7. Overall Optimization View.
Over time, the system aims to maximize the expected utility over the distribution 𝔇 of future documents D:
max_{𝒮, θ} E_{D ∼ 𝔇} [ max_{S ∈ 𝒮 ∪ {G_θ(H)}} U(S, D) ]
subject to repository size or computational constraints if desired.
This formulation captures three interacting components:
  • LLM parameters θ, which determine the quality of generated schemas.
  • The schema repository S, which determines the effectiveness of reusing previously validated extraction patterns.
  • The schema matching rule via utility U and similarity filter Scand.

5. Results

This section reports the experimental evaluation of the proposed self-adaptive extraction framework. The goal is to quantify (i) the robustness of schema induction across heterogeneous web sources, (ii) the quality of extracted main text and publication date fields, and (iii) the typical failure modes of selector-based extraction under real-world HTML variability.

5.1. Test Set and Data Acquisition Strategy

The evaluation was conducted on a heterogeneous set of webpages collected from multiple domains that publish earthquake-related or hazard-related information. The tested sources include, but are not limited to, https://tengrinews.kz, https://dknews.kz, https://rus.azattyq.org, https://ru.sputnik.kz, https://www.volcanodiscovery.com, https://www.gov.kz, https://zakon.kz, https://el.kz, https://prg.kz, https://kndc.kz, https://voshod-solnca.ru. The dataset consists of 17 documents (webpages) used for per-domain quality analysis, while selector-related errors are aggregated across domains with the highest observed failure frequency.

5.1.1. Data Acquisition Strategy

To ensure reproducibility and robustness, the web data collection process was implemented as a structured pipeline including query construction, retrieval, filtering, and validation.
Keyword Construction
Search queries were constructed using a predefined set of domain-specific keywords related to seismic activity (e.g., earthquake, seismic event, and aftershock). These keywords were combined with optional domain filters, enabling both general and targeted retrieval scenarios.
Site Filtering
Domain-specific filtering was applied using search operators (e.g., site: restrictions limited to https://www.gov.kz or https://tengrinews.kz), allowing the system to prioritize authoritative or relevant sources while preserving diversity of information.
Search Depth
For each query, a fixed number of top-ranked results (top-k) returned by the search engine were considered, ensuring consistency across experiments and limiting ranking variability.
Deduplication Strategy
To avoid redundant processing, retrieved URLs were filtered using:
  • Exact URL matching;
  • Normalization techniques (removal of duplicate or tracking parameters).
This ensures that each document is processed only once.
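A minimal sketch of this deduplication step, assuming a small blocklist of common tracking parameters (the actual parameter list used by the system is not specified):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative blocklist of tracking parameters to strip during normalization.
TRACKING = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    """Canonicalize a URL: lowercase scheme/host, drop trailing slash,
    tracking parameters, and fragments."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(query), ""))

def deduplicate(urls):
    """Keep only the first occurrence of each normalized URL."""
    seen, unique = set(), []
    for u in urls:
        key = normalize_url(u)
        if key not in seen:
            seen.add(key)
            unique.append(u)
    return unique
```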
Crawling Time Window
Data collection was performed within a defined time interval, ensuring temporal consistency of the retrieved content and reducing variability due to dynamic web updates.
Page Selection Criteria
Only pages containing relevant earthquake-related textual content were retained. Filtering was based on:
  • Presence of domain-specific keywords;
  • Non-empty textual content after preprocessing.
Irrelevant or non-informative pages were excluded.
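The two selection criteria could be combined in a filter along the following lines; the keyword list and length threshold are illustrative assumptions rather than the exact values used in the experiments.

```python
# Illustrative keyword set; the paper names earthquake, seismic event,
# and aftershock as examples of the domain vocabulary.
KEYWORDS = ("earthquake", "seismic", "aftershock")

def is_relevant(text: str, min_length: int = 50) -> bool:
    """Retain a page only if, after preprocessing, it still contains
    non-empty text and at least one domain-specific keyword."""
    cleaned = " ".join(text.split())
    if len(cleaned) < min_length:
        return False
    lowered = cleaned.lower()
    return any(k in lowered for k in KEYWORDS)
```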
Failure Handling and Retry Mechanism
The system incorporates a failure-handling mechanism to improve robustness:
  • If extraction results do not satisfy validity criteria, the schema is rejected.
  • Schema reuse from the repository is attempted.
  • If unsuccessful, schema regeneration is triggered.
  • In case of retrieval failure, alternative queries may be executed.

5.1.2. Dataset Characteristics

The dataset represents a heterogeneous collection of 17 webpages, each corresponding to a unique HTML document.
The dataset exhibits the following properties:
  • Structural diversity (variation in DOM complexity and layout);
  • Content variability (differences in length and detail);
  • Metadata inconsistency (explicit vs. implicit publication data);
  • Multilingual content.
The terms documents and webpages are used interchangeably.

5.1.3. Preprocessing and Normalization

All HTML documents undergo a standardized preprocessing pipeline, including:
  • Removal of scripts and non-informative elements;
  • Normalization of malformed HTML;
  • Reduction of boilerplate content;
  • Conversion to a machine-readable format.
Importantly, preprocessing preserves structural noise to evaluate model robustness under realistic conditions.
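A stdlib-only sketch of such a sanitization step, removing scripts and styles while leaving the remaining markup (including structural noise) intact; a production pipeline would more likely rely on a dedicated HTML library.

```python
from html.parser import HTMLParser

class Sanitizer(HTMLParser):
    """Drop script/style/noscript subtrees, keep all other markup."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            attr_str = "".join(f' {k}="{v}"' for k, v in attrs if v is not None)
            self.parts.append(f"<{tag}{attr_str}>")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

def sanitize(html: str) -> str:
    parser = Sanitizer()
    parser.feed(html)
    return "".join(parser.parts)
```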

5.1.4. Benchmark Design

The dataset is designed as a task-oriented benchmark for schema-based extraction. Each document includes:
  • Main textual content;
  • Optional metadata (e.g., publication date).
Evaluation follows a hybrid approach, combining:
  • Quantitative metrics (extraction accuracy and success rate);
  • Qualitative assessment (completeness, relevance, and coherence).

5.1.5. Validity and Limitations

Although the dataset size is limited (17 documents), it is sufficient for comparative evaluation because:
  • All models are tested under identical conditions.
  • The dataset captures high structural variability.
  • Performance differences are consistent across metrics.
Limitations include:
  • Absence of fully annotated ground truth;
  • Partial reliance on automated evaluation;
  • Moderate dataset size.
Future work will focus on constructing a large-scale annotated benchmark and applying stricter evaluation metrics.

5.2. Compared LLM Models

To ensure a representative and controlled evaluation, three LLM backends were selected: GEMMA, GPT_OSS, and LLAMA. The choice was guided by the need to cover different classes of modern transformer-based architectures, reflecting variations in model capacity, efficiency, and generalization capabilities.
Specifically, the selected models represent:
  • Lightweight and efficiency-oriented architectures (GEMMA-3-27B);
  • High-capacity generative models with strong reasoning capabilities (GPT_OSS-20B);
  • Widely adopted open-weight models designed for general-purpose language understanding (LLAMA-3-8B).
All models are publicly accessible or reproducible and are commonly used in real-world NLP applications, which ensures the practical relevance and reproducibility of the experimental results.
To enable a fair comparison, all backends were evaluated under identical experimental conditions, including preprocessing, schema generation, and validation procedures. This setup isolates the impact of the LLM architecture on extraction performance and robustness to heterogeneous HTML structures.
For each model, the same pipeline stages were executed (HTML sanitization → schema generation → extraction → validation), enabling direct comparison of extraction quality and robustness under identical preprocessing and validation constraints.

5.3. Experimental Protocol

The system was implemented in Python (version 3.11.4), leveraging standard libraries for web processing, data handling, and model integration.
The experimental evaluation was conducted on a high-performance computing system with the following configuration:
  • CPU: Intel Core i9-14700K;
  • GPU: 2 × NVIDIA RTX 5090 Ti (32 GB VRAM each);
  • RAM: 128 GB;
  • Storage: 4 TB SSD;
  • Operating System: Ubuntu 24.04 LTS.
This configuration ensured efficient processing of large HTML documents and stable execution of LLM-based schema generation and extraction tasks.
For each input webpage, the pipeline proceeded as follows:
Acquisition and preprocessing: HTML documents were downloaded and sanitized to remove scripts and reduce irrelevant structures and rendering noise.
Schema induction: The LLM generated a JSON (JavaScript Object Notation)-like schema that identifies selectors (CSS/XPath) for the main textual content block and (when applicable) the publication date field.
Execution: The Python extraction engine applied the predicted selectors and produced: (i) extracted main text and (ii) extracted date (if available).
Validation and regeneration: If the extracted output failed basic validity checks (e.g., minimum length threshold, structural plausibility), the schema was treated as invalid, and the system triggered an automatic regeneration/refinement cycle.
Logging of failures: When extraction degraded, an error logger categorized the selector failure mode into one of the predefined categories described in Section 5.4 and summarized in Table 1, Table 2 and Table 3.
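The schema-induction, execution, and validation stages above can be illustrated with a minimal sketch; the JSON field names, selector values, and the length threshold are assumptions for illustration, not the system's exact schema format.

```python
import json

# Illustrative JSON-like schema with CSS selectors for the two target fields.
schema_json = """{
  "main_text": {"selector": "article .content", "type": "css"},
  "publication_date": {"selector": "time[datetime]", "type": "css"}
}"""

def validate(extracted: dict, min_length: int = 200) -> bool:
    """Basic validity check of the kind described in the validation stage:
    non-empty main text above a minimum length threshold. Failing this
    check would trigger schema regeneration."""
    text = (extracted.get("main_text") or "").strip()
    return len(text) >= min_length

schema = json.loads(schema_json)
```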

5.4. Metrics and Scoring

The evaluation employed complementary quantitative and qualitative metrics:
Success Rate (%): fraction of pages for which the framework produced a non-empty extraction output that passed the validity criteria.
Extraction Acc (%): a correctness-oriented score representing extraction accuracy (as reported in Table 1).
Extraction Score (0–1): per-page extraction outcome score used to evaluate the quality of extraction at the document level. The score is discretized (e.g., 1.00, 0.70, 0.30, 0.00), reflecting different levels of completeness and validity of the extracted content.
GPT Score (0–10): an automated quality score assigned by a judging model to the extracted content.
Main Text Quality (0–10): quality of the extracted main content (readability, topicality, and absence of boilerplate).
Date Quality (0–10): correctness and usability of the extracted publication date field.
Text Length: length of the extracted main content as reported in the tables (in characters or tokens; the unit is not stated explicitly).
To summarize model performance, the experimental log reports GPT quality (%) and a final score (Table 1). Based on the provided values, the final score is consistent with a weighted combination of success, accuracy, and judged quality. The final score is interpreted as a composite performance indicator derived from extraction success, extraction accuracy, and model-based quality assessment:
$$\mathrm{Final\ Score} \approx 0.4 \cdot \mathrm{Success\ Rate} + 0.3 \cdot \mathrm{Extraction\ Acc} + 0.3 \cdot \mathrm{GPT\ Quality}$$
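Written out as code, the weighted combination of success, accuracy, and judged quality (all inputs in percent) is:

```python
def final_score(success_rate: float, extraction_acc: float,
                gpt_quality: float) -> float:
    """Composite performance indicator: weighted combination of success
    rate, extraction accuracy, and model-judged quality (weights as
    reported: 0.4 / 0.3 / 0.3)."""
    return 0.4 * success_rate + 0.3 * extraction_acc + 0.3 * gpt_quality
```

Under these weights, the reported GPT_OSS values (success rate 85.0%, extraction accuracy 96.5%, final score 84.26) imply a judged GPT quality of roughly 71%.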
Table 1 reports the overall performance across the three LLM backends. All models achieved the same success rate of 85.0% but differed substantially in extraction accuracy and judged output quality. GPT_OSS achieved the best overall results (extraction acc = 96.5%, final score = 84.26), followed by GEMMA (extraction acc = 92.4%, final score = 80.68). LLAMA showed a markedly lower extraction accuracy (52.4%) and the lowest overall score (63.13), indicating reduced robustness to heterogeneous HTML structures.
While Table 1 provides an overview of overall system performance across the evaluated LLM backends, a more detailed analysis is required to better understand the quality and robustness of the extraction process at the document level.
To this end, additional evaluation metrics were introduced to capture both correctness and completeness of the extracted content. In particular, approximate precision, recall, and F1-score were derived from extraction-level indicators, providing a more granular assessment of model performance.
The resulting metrics are summarized in Table 2, followed by expert-based evaluation results in Table 3 and aggregated per-page quality indicators in Table 4, which together provide a comprehensive view of extraction accuracy, consistency, and variability across documents.
Across the 17-document evaluation subset, GPT_OSS achieved the highest mean extraction score (0.965) and the best mean date quality (9.11/10). In contrast, LLAMA produced 5 out of 17 zero-text outputs and exhibited a substantially lower mean extraction score (0.524), which is consistent with the reduced extraction accuracy reported in Table 1 and illustrated in Figure 4.
To provide additional insight into model performance, we report the approximate precision, recall, and F1-score derived from task-specific metrics.
Precision is approximated using the mean extraction score, reflecting the correctness of extracted content, while recall is estimated as the proportion of documents with non-empty extraction outputs. The resulting F1-score provides a combined measure of extraction quality and robustness. The approximate precision, recall, and F1-score metrics are summarized in Table 2.
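The derivation just described amounts to the following computation over per-document extraction scores (a sketch; the exact aggregation used in the paper may differ in detail):

```python
def approx_prf(extraction_scores):
    """Approximate precision/recall/F1 from per-document scores in [0, 1],
    where 0.0 marks an empty (failed) extraction:
    precision ~ mean extraction score, recall ~ non-empty fraction,
    F1 = harmonic mean of the two."""
    n = len(extraction_scores)
    precision = sum(extraction_scores) / n
    recall = sum(1 for s in extraction_scores if s > 0) / n
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```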
The results indicate that GPT_OSS achieves the highest overall performance, with near-perfect recall and the highest precision, followed by GEMMA. In contrast, LLAMA demonstrates significantly lower precision and recall, primarily due to a higher number of zero-text outputs and incomplete extractions.
Expert-Based Evaluation of Extraction Quality.
To complement automated evaluation metrics and mitigate potential bias associated with LLM-based scoring, an expert-based qualitative assessment was conducted.
A panel of five independent experts with research experience in natural language processing, information extraction, and web data analysis participated in the evaluation. All experts had prior experience in evaluating text processing systems and were familiar with structured and semi-structured data extraction tasks.
The evaluation was performed on a representative subset of extracted outputs generated by all evaluated models. The subset included samples from different domains (government portals, news websites, and scientific resources) to ensure coverage of diverse HTML structures. All outputs were evaluated under identical experimental conditions using a comparative and blinded protocol, where the identity of the generating model was not disclosed to the evaluators.
Each extracted result was assessed according to the following criteria:
  • Content Completeness—the extent to which the main textual content of the webpage was fully captured;
  • Semantic Correctness—the degree of alignment between the extracted content and the original source;
  • Relevance—the absence of unrelated, duplicated, or boilerplate content;
  • Coherence—readability, structural consistency, and logical flow of the extracted text.
Each criterion was scored on a 10-point Likert scale (1—very poor, 10—excellent). The final expert score for each sample was computed as the average across all criteria and all annotators.
To assess the consistency of expert judgments, inter-annotator agreement was estimated using a simplified agreement measure based on score variance across evaluators. The observed agreement was high (average deviation < 0.8 points), indicating stable and consistent evaluation across experts.
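One plausible reading of this simplified agreement measure is the per-sample mean absolute deviation of expert scores, averaged over all samples; the exact formula is not specified in the text, so the sketch below is an assumption.

```python
def mean_score_deviation(score_matrix):
    """score_matrix: list of per-sample score lists, one score per expert.
    Returns the average (over samples) of the mean absolute deviation of
    the experts' scores from their per-sample mean."""
    deviations = []
    for scores in score_matrix:
        mean = sum(scores) / len(scores)
        deviations.append(sum(abs(s - mean) for s in scores) / len(scores))
    return sum(deviations) / len(deviations)
```

A value below 0.8 points, as reported above, would indicate closely clustered expert judgments.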
The aggregated expert evaluation results are presented in Table 3. The results indicate that GPT_OSS achieved the highest average expert score (9.1), followed by GEMMA (8.3), while LLAMA showed significantly lower performance (6.2), primarily due to incomplete extraction and frequent structural errors.
These findings are consistent with automated metrics and further support the reliability of the proposed hybrid evaluation framework.
In Table 3, Std. Dev. represents the standard deviation of expert scores across annotators and serves as an indicator of inter-annotator agreement. Lower values correspond to higher consistency in expert judgments.
The expert scores reflect four complementary dimensions of extraction quality: completeness of the main content, semantic alignment with the source, relevance (absence of boilerplate or unrelated text), and textual coherence. GPT_OSS consistently achieves the highest scores across all criteria, indicating stable and accurate extraction. GEMMA demonstrates moderately high performance with minor losses in completeness and coherence, while LLAMA shows substantial degradation, primarily due to incomplete extraction and inclusion of irrelevant content. The relatively low standard deviation across models confirms consistent expert judgments and supports the reliability of the evaluation.
Finally, the selector failure modes were analyzed on the domains with the highest number of observed selector errors. Table 5 shows that the dominant error sources are Wrong Element and Missing Content, indicating that the primary difficulty lies in identifying the correct DOM region for the main article content under layout diversity and template boilerplate.
Figure 5 presents a comparative analysis of selector-related error types across different web domains, highlighting the distribution and frequency of extraction failures. The results indicate that the most dominant error categories are Wrong Element and Missing Content, which are primarily associated with the incorrect identification of content-bearing regions in complex and heterogeneous HTML structures.
In the upper plot, the color intensity of the bars encodes the magnitude of extraction errors across domains. A continuous color gradient from lighter to darker red is used, where darker shades correspond to higher error counts and lighter shades indicate lower error frequencies. This visual encoding allows for rapid comparative assessment of domain-specific extraction difficulty, highlighting sources with the highest concentration of errors.
In particular, domains with high structural complexity and dense boilerplate content exhibit a significantly higher number of errors, reflecting the difficulty of distinguishing relevant textual content from layout and navigation elements. Overall, the experiments demonstrate that schema induction quality is strongly model-dependent: higher-performing LLM backends produce selectors that more reliably isolate the main content block and accurately recover publication dates across heterogeneous sources.
To improve methodological rigor, the evaluation was extended with approximate precision, recall, and F1-score metrics derived from extraction-level indicators. Precision reflects extraction correctness, while recall corresponds to the proportion of successfully processed documents. The F1-score provides a balanced measure of extraction quality and robustness.

6. Discussion

The results of this study should be interpreted within the scope of a controlled experimental evaluation rather than a large-scale, statistically validated benchmark. Although the dataset size is limited (N = 17), it was intentionally constructed to maximize structural diversity across domains, including government portals, news sites, and scientific resources. The objective was to evaluate robustness under heterogeneous conditions rather than to estimate population-level performance.
Given this experimental setup, the results highlight several important observations about self-adaptive, schema-based web extraction for earthquake information.
Due to the limited dataset size, formal statistical significance testing (e.g., confidence intervals or hypothesis testing) was not applied. However, performance differences between models are consistently observed across multiple independent metrics, suggesting stable comparative trends.
First, although the reported success rate is identical across models (85%), the gap in extraction acc and judged quality indicates that successful extraction (i.e., non-empty outputs passing validation) is not sufficient to guarantee faithful content capture. In practice, schemas with insufficient structural specificity may satisfy basic validation criteria while still extracting incomplete or irrelevant DOM segments. This observation is supported by the prevalence of Wrong Element and Missing Content errors in the selector error analysis (Table 5), which directly affect completeness and semantic correctness.
Second, the model comparison suggests that selector induction is highly sensitive to the LLM’s ability to (i) parse long and noisy HTML, (ii) identify content-bearing regions within boilerplate-heavy templates, and (iii) generate robust selectors that generalize across small layout variations. GPT_OSS consistently achieved stronger performance than GEMMA and especially LLAMA, including substantially higher mean date quality (Table 4). This indicates that date extraction may require more nuanced structural reasoning and better disambiguation of multiple timestamp-like elements present on many news and governmental pages.
Third, the domain-level failure analysis (Table 5) reveals that the most problematic domains tend to be those with either (a) high boilerplate density (navigation menus, recommended items, and advertising blocks) or (b) structurally complex templates where the main content is nested within multiple similar containers. For instance, tengrinews.kz exhibits both high Missing Content and Excessive Noise, indicating difficulties in separating signal from boilerplate. In contrast, www.gov.kz shows a high rate of Wrong Element errors, suggesting that pages may contain multiple similarly structured content containers (e.g., press releases vs. banners vs. navigation content), and the selector prediction frequently targets an incorrect region.
These findings motivate several directions for improving the robustness and generalizability of the framework:
  • Stronger selector robustness mechanisms. The prevalence of fragility-related failures suggests that schema generation should avoid overly specific selectors (e.g., deep DOM chains and nth-child patterns) and instead prefer stable anchors such as semantic tags, consistent class patterns, and content-aware constraints.
  • Hybrid extraction with fallback. When LLM-generated selectors fail or produce noisy outputs, integrating lightweight fallback mechanisms, such as boilerplate removal heuristics, content-density rules, or template-based extraction may improve reliability for structurally complex domains.
  • Improved validation beyond length thresholds. Extending validation to include semantic plausibility checks (e.g., language consistency, earthquake keyword coverage for the extracted main text, or date format constraints) can reduce false positives where irrelevant blocks pass minimal checks.
  • Expanded evaluation and reproducibility. The current evaluation would be strengthened by (i) a larger test set (more pages per domain and more domains), (ii) manual ground truth for at least a subset, and (iii) reporting confidence intervals or variance across repeated samples.
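As one concrete instance of the third direction (validation beyond length thresholds), a semantic plausibility check could combine keyword coverage with a date-format constraint; the keyword list and ISO date pattern below are assumptions, not part of the evaluated system.

```python
import re

# Illustrative domain vocabulary and date-format constraint.
KEYWORDS = ("earthquake", "seismic", "magnitude", "aftershock", "epicenter")
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}")

def semantically_valid(main_text: str, date: str,
                       min_length: int = 200, min_hits: int = 1) -> bool:
    """Reject extractions that pass a bare length check but lack
    earthquake-related vocabulary or a parseable publication date."""
    if len(main_text) < min_length:
        return False
    lowered = main_text.lower()
    hits = sum(1 for k in KEYWORDS if k in lowered)
    return hits >= min_hits and bool(ISO_DATE.match(date))
```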
The use of an LLM-based evaluation mechanism introduces potential bias and is therefore not treated as a standalone validation approach. Instead, it is used as a complementary component within a hybrid evaluation framework that includes expert-based qualitative assessment. Future work will further strengthen evaluation reliability through the incorporation of annotated ground-truth datasets and inter-annotator agreement metrics.
Overall, the experimental results confirm the feasibility of the proposed framework within a controlled experimental setting while indicating that high-quality schema induction remains a critical dependency.

7. Conclusions

This study proposes a self-adaptive LLM-based framework for the automated extraction and structuring of earthquake-related information from heterogeneous web sources by generating and validating extraction schemas (selectors) directly from HTML. The experimental evaluation compares three LLM backends (GEMMA, GPT_OSS, and LLAMA) and analyzes both aggregate performance and domain-level failure modes.
The results indicate that, although all tested models achieved the same reported success rate, extraction quality differs substantially across model backends. GPT_OSS achieved the highest extraction accuracy and the highest model-assisted quality assessment score, including more reliable extraction of publication date metadata. In contrast, LLAMA exhibited frequent degradation in extraction quality, including multiple zero-text outputs and reduced content and date quality scores, indicating lower robustness under real-world HTML variability conditions. Error analysis further revealed that the most common failure modes involve selecting incorrect DOM elements and missing key content regions, highlighting the difficulty of reliably identifying the primary content block in boilerplate-heavy or structurally complex templates.
In this study, the evaluation framework was refined to follow a hybrid assessment paradigm, combining automated metrics with expert-based qualitative analysis. Expert evaluation was conducted by researchers with expertise in natural language processing, information extraction, and web data analysis. The assessment was performed on a representative subset of model outputs under identical experimental conditions using a comparative and blinded protocol to minimize bias.
The evaluation criteria included content completeness, semantic correctness, relevance, and textual coherence, allowing for a comprehensive validation of extraction quality. These expert judgments were used to complement automated metrics, including extraction accuracy and success rate, thereby improving the robustness of the overall evaluation.
While the current study is limited by dataset size and the absence of fully annotated ground truth, it establishes a reproducible foundation for future large-scale evaluation. Future work will focus on improving selector robustness, extending validation beyond minimal-length criteria, incorporating hybrid fallback extraction strategies for structurally complex templates, and expanding evaluation with larger datasets and partially annotated ground truth. In addition, future studies will aim to introduce statistically grounded validation protocols to further improve reliability and reproducibility.

Author Contributions

Conceptualization, A.T. and D.R.; methodology, A.T. and D.R.; software, Y.A. and A.N.; validation, Y.A., D.R. and A.T.; formal analysis, D.R. and A.T.; data curation, Y.A. and A.N.; investigation, Y.A. and D.R.; writing—original draft preparation, A.T. and Y.A.; writing—review and editing, D.R., A.T. and Y.A.; visualization, Y.A. and A.N.; supervision, D.R.; project administration, A.T. and D.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Ministry of Science and Higher Education of the Republic of Kazakhstan within the framework of the scientific project Grant No. AP26197729.

Data Availability Statement

The data presented in this study are openly available in https://github.com/FarabiUniversity/EarthquakesParser (accessed on 4 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
  • LLM: Large Language Model
  • GPT: Generative Pre-Trained Transformer
  • GEMMA: General Efficient Multimodal Model Architecture
  • LLAMA: Large Language Model Meta AI
  • HTML: HyperText Markup Language
  • CSS: Cascading Style Sheets
  • XPath: XML Path Language
  • DOM: Document Object Model
  • JSON: JavaScript Object Notation
  • API: Application Programming Interface
  • UMAP: Uniform Manifold Approximation and Projection
  • NLP: Natural Language Processing
  • RAG: Retrieval-Augmented Generation
  • CPU: Central Processing Unit
  • GPU: Graphics Processing Unit
  • Extraction Acc: Extraction Accuracy
  • GPT Score: Large Language Model-Based Evaluation Score

Figure 1. Flowchart of the automated LLM-based earthquake data extraction system.
Figure 2. Flowchart of the automated LLM-based earthquake data extraction system.
Figure 3. Flowchart of the automated LLM-based earthquake data extraction system.
Figure 4. Model comparison by key metrics.
Figure 5. Error analysis by source.
Table 1. Overall performance across LLM models.

| LLM Model | Success Rate | Extraction Acc. | GPT Quality | Final Score |
|-----------|--------------|-----------------|-------------|-------------|
| GPT_OSS   | 85.0%        | 96.5%           | 71.1%       | 84.26       |
| GEMMA     | 85.0%        | 92.4%           | 63.2%       | 80.68       |
| LLAMA     | 85.0%        | 52.4%           | 44.8%       | 63.13       |
Table 2. Approximate precision, recall, and F1-score derived from extraction performance metrics.

| Model   | Precision | Recall | F1-Score |
|---------|-----------|--------|----------|
| GPT_OSS | 0.965     | 1.000  | 0.982    |
| GEMMA   | 0.924     | 0.941  | 0.933    |
| LLAMA   | 0.524     | 0.706  | 0.601    |
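The F1-scores in Table 2 follow from the reported precision and recall values as their harmonic mean. A minimal sketch of this re-derivation (the function name and the literal values below are for illustration only):

```python
# Re-derive the approximate F1-scores in Table 2 from the reported
# precision and recall values. F1 is the harmonic mean of the two.
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision/recall pairs as reported in Table 2.
models = {
    "GPT_OSS": (0.965, 1.000),
    "GEMMA": (0.924, 0.941),
    "LLAMA": (0.524, 0.706),
}

for name, (p, r) in models.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```

For GPT_OSS this gives 2 × 0.965 × 1.000 / 1.965 ≈ 0.982, matching the table; small rounding differences for the other models reflect that the table values are approximate.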
Table 3. Expert-based evaluation results across LLM backends (mean scores on a 10-point scale).

| Model   | Content Completeness | Semantic Correctness | Relevance | Coherence | Std. Dev. | Avg. Score |
|---------|----------------------|----------------------|-----------|-----------|-----------|------------|
| GPT_OSS | 9.3                  | 9.2                  | 9.0       | 9.1       | 0.6       | 9.15       |
| GEMMA   | 8.5                  | 8.4                  | 8.2       | 8.1       | 0.7       | 8.30       |
| LLAMA   | 6.8                  | 6.5                  | 6.0       | 5.5       | 0.9       | 6.20       |
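The Avg. Score column in Table 3 is consistent with the unweighted mean of the four criterion scores; a minimal check (assuming equal weighting of the criteria, and noting that the Std. Dev. column presumably spans the underlying expert ratings, which are not reproduced here):

```python
# Verify that the Avg. Score column in Table 3 equals the unweighted
# mean of the four per-criterion scores (completeness, correctness,
# relevance, coherence).
from statistics import mean

criteria = {
    "GPT_OSS": [9.3, 9.2, 9.0, 9.1],
    "GEMMA": [8.5, 8.4, 8.2, 8.1],
    "LLAMA": [6.8, 6.5, 6.0, 5.5],
}

for model, scores in criteria.items():
    print(f"{model}: avg = {mean(scores):.2f}")
```

All three averages (9.15, 8.30, 6.20) reproduce the table exactly.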
Table 4. Aggregated per-page quality metrics (N = 17 documents per model).

| Model   | N (Documents) | Mean Extraction Score | Mean GPT Score (0–10) | Mean Main Text Quality (0–10) | Mean Date Quality (0–10) | Zero-Text Pages | Pages with Extraction Score < 1 |
|---------|---------------|-----------------------|-----------------------|-------------------------------|--------------------------|-----------------|---------------------------------|
| GEMMA   | 17            | 0.924                 | 6.32                  | 6.69                          | 6.79                     | 1               | 3                               |
| GPT_OSS | 17            | 0.965                 | 7.11                  | 7.48                          | 9.11                     | 0               | 2                               |
| LLAMA   | 17            | 0.524                 | 4.48                  | 5.04                          | 4.32                     | 5               | 14                              |
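The per-model aggregates in Table 4 can be computed from individual page results. A hypothetical sketch of that aggregation step (the record layout and the sample values below are invented for illustration, not taken from the benchmark):

```python
# Illustrative aggregation of per-page results into Table 4-style
# metrics. Each record: (extraction_score in [0, 1], extracted
# main-text length in characters). Sample values are hypothetical.
from statistics import mean

pages = [
    (1.00, 1200),
    (1.00, 800),
    (0.00, 0),     # a zero-text page: nothing was extracted
    (0.86, 450),
]

mean_extraction = mean(score for score, _ in pages)
zero_text_pages = sum(1 for _, length in pages if length == 0)
low_score_pages = sum(1 for score, _ in pages if score < 1)

print(mean_extraction, zero_text_pages, low_score_pages)
```

The "Pages with Extraction Score < 1" column then counts every page where extraction was less than perfect, while "Zero-Text Pages" counts the complete failures.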
Table 5. Top domains with the highest selector error rates and error-type distribution.

| Domain                           | Total Errors | Penalty (Points) | Missing Content | Wrong Element | Excessive Noise | Fragility | Minor |
|----------------------------------|--------------|------------------|-----------------|---------------|-----------------|-----------|-------|
| https://tengrinews.kz            | 50           | −64.5            | 21              | 6             | 17              | 5         | 1     |
| https://www.volcanodiscovery.com | 28           | −44.5            | 6               | 15            | 2               | 5         | 0     |
| https://dknews.kz                | 20           | −25.5            | 11              | 2             | 3               | 4         | 0     |
| https://rus.azattyq.org          | 14           | −4.5             | 3               | 4             | 2               | 5         | 0     |
| https://voshod-solnca.ru         | 12           | −19.0            | 6               | 4             | 2               | 0         | 0     |
| https://www.gov.kz               | 12           | −20.5            | 11              | 0             | 0               | 1         | 0     |
| https://kndc.kz                  | 10           | −10.5            | 1               | 4             | 3               | 2         | 0     |
| https://el.kz                    | 9            | −14.0            | 0               | 6             | 1               | 2         | 0     |
| https://prg.kz                   | 9            | −7.0             | 1               | 6             | 2               | 0         | 0     |
| https://www.zakon.kz             | 9            | −12.5            | 3               | 3             | 2               | 0         | 1     |
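A per-domain breakdown like Table 5 can be produced by folding a flat log of selector errors into counters. A hypothetical sketch (the log entries, error-type labels, and penalty values below are invented for illustration; the paper's actual scoring rules are not reproduced):

```python
# Aggregate a flat log of selector errors into a Table 5-style
# per-domain error-type distribution with a running penalty total.
from collections import Counter, defaultdict

# Each log entry: (domain, error_type, penalty_points). Sample data.
error_log = [
    ("https://tengrinews.kz", "missing_content", -2.0),
    ("https://tengrinews.kz", "excessive_noise", -0.5),
    ("https://www.gov.kz", "missing_content", -2.0),
    ("https://tengrinews.kz", "missing_content", -2.0),
]

by_domain: dict[str, Counter] = defaultdict(Counter)
penalty: dict[str, float] = defaultdict(float)
for domain, err_type, points in error_log:
    by_domain[domain][err_type] += 1
    penalty[domain] += points

print(by_domain["https://tengrinews.kz"]["missing_content"])  # 2
print(penalty["https://tengrinews.kz"])                       # -4.5
```

Summing each domain's error-type counts recovers its Total Errors column, and the accumulated points give the Penalty column.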
Share and Cite

Turarbek, A.; Rakhimova, D.; Adetbekov, Y.; Nurgali, A. A Self-Adaptive LLM-Based Framework for Automated Extraction and Structuring of Earthquake Information from Heterogeneous Web Sources. Computers 2026, 15, 294. https://doi.org/10.3390/computers15050294