1. Introduction
Research on safety accidents has long been hindered by the lack of unified, standardized, and reusable public datasets. Constructing high-quality datasets is fundamental for advancing related research. However, manual collection and annotation of data are costly and time-consuming and do not scale to large datasets. Moreover, privacy-preserving regulations often remove or generalize key quasi-identifiers in accident reports, making it impossible to directly align structured accident records with narrative reports. This challenge falls within the domain of privacy-preserving record linkage (PPRL), which aims to link records while protecting sensitive information [1]. Consequently, accident-related corpora face additional challenges compared to open-domain text tasks, including limited scale, diverse genres, and sparse annotations. Specifically, the challenges are as follows:
Diverse data sources and structures: Authoritative sources (e.g., accident statistics and direct reporting systems) provide accurate information but are limited in scope and volume. In contrast, semi-structured or unstructured texts (e.g., news reports, online content, internal documents) are abundant and rich in detail but exhibit wide variation in writing style, focus, and information granularity.
Cross-document event alignment difficulties: Structured records typically contain key attributes (event time, location, losses, responsible units), yet many public reports obscure or omit such identifiable cues due to privacy or compliance requirements, hindering alignment with structured data.
Long-tail label distribution: Accident categories follow a long-tail distribution, with a few common types dominating the dataset while many rare categories have few instances.
Therefore, effectively integrating heterogeneous multi-source data and leveraging structured information for scalable automatic annotation has become a core challenge in constructing accident analysis datasets. In the field of natural language processing (NLP), event linking aims to connect event mentions in text to corresponding event records in a knowledge base [2] and has long been a focal point in evaluation tasks such as ACE and TAC. Compared to entity coreference, event linking faces greater technical challenges, and research in this area remains relatively underdeveloped. The core difficulties include cross-source narrative differences, lexical diversity, local information loss, and global consistency constraints.
The development of event linking has evolved from early rule-based methods [3,4,5,6,7,8,9,10,11] to traditional machine learning techniques [12,13,14,15,16] and, more recently, to deep learning methods [17,18,19,20,21]. With the rise of pre-trained large language models (LLMs), significant progress has been made in event linking methods based on reading comprehension [22], knowledge distillation [23], and instruction fine-tuning [24]. However, existing LLM-based approaches typically assume observable anchors and stable surface forms; faced with anonymized, fine-grained accident texts, they suffer from granularity mismatch (e.g., city vs. district), missing anchors (suppressed entities), and over-generalization—yielding confident yet poorly calibrated scores without structural guarantees. In contrast, a lattice-constrained formulation can recover interpretable alignment signals from residual spatiotemporal structure and maintain calibrated probabilities suitable for audit.
To address these challenges, this paper proposes an event linking method based on spatiotemporal containment consistency. The method uses structured records from accident statistics and reporting systems as anchors to accurately link event descriptions in open-domain texts. It models administrative hierarchies and temporal granularity in a unified spatiotemporal lattice, leveraging a subset of the Region Connection Calculus (RCC-8) [25] for spatial containment consistency and a subset of Allen’s interval algebra [26] for temporal consistency. A probability fusion mechanism is introduced to integrate multiple confidence factors and output cross-document alignment probabilities with uncertainty measures, thereby effectively separating samples suitable for automatic annotation from those requiring human review. Additionally, based on these spatiotemporal constraints, we introduce a controlled data augmentation strategy using an instruction-tuned large language model (LLM) to automatically generate a large number of high-confidence labeled samples. Experimental results show that the proposed method can effectively align structured accident records with multi-source public reports, support scalable automatic annotation, and provide an expandable solution for efficiently constructing high-quality, auditable accident corpora. The main contributions of this study are summarized as follows:
A spatiotemporal containment consistency-based cross-document event linking method. By incorporating domain knowledge, we define a set of spatiotemporal and type consistency criteria, compute anchor weights using smoothing and monotonic projection, and develop a multi-feature fusion scoring model to calculate the alignment probability between structured records and event mentions in online texts.
A sample selection strategy combining maximum alignment probability and probability gap. Under an active-learning framework, we introduce a selection criterion that considers both the maximum alignment probability and the probability gap, enabling high-confidence automatic annotation while maintaining high recall. This provides quantitative support for human–machine collaborative annotation.
The proposed method achieves performance comparable to the strongest baseline models on accident dataset construction tasks, demonstrating its effectiveness and practicality across several key metrics.
The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 describes the proposed model and methods in detail; Section 4 introduces the experimental design and datasets; Section 5 analyzes the experimental results; Section 6 discusses the findings; and Section 7 outlines the limitations and future directions and concludes the paper.
2. Literature Review
Event linking is a core task that aims to associate event mentions in text with corresponding event nodes in a knowledge base [2]. The key challenges lie in handling cross-source narrative differences, lexical diversity, local information loss, and global consistency constraints. Research in this field has generally progressed through several stages: rule-based and template-driven methods, graph models with global inference, supervised learning with neural representations, and end-to-end cross-document modeling methods driven by pre-trained language models (PLMs). However, applying these PLM-driven methods to the specialized context of accident reports—which often involve anonymized fields, hierarchical spatiotemporal granularity, and domain-specific terminology—introduces additional challenges that general PLM approaches may not fully address.
Early research mainly relied on rule-based and heuristic matching methods. For example, Humphreys et al. [3] proposed a cross-document event alignment method based on template slots and manually defined rules, using “trigger word consistency + key argument overlap” as criteria for matching and suppressing erroneous matches through domain knowledge. Ahn [4] emphasized, from a task decomposition viewpoint, that the underlying “trigger word–argument” structure plays a crucial role in event coreference and linking. He highlighted that errors in initial event recognition can propagate through the pipeline and negatively affect downstream linking performance. Raghunathan et al. [5] introduced a multi-round filtering algorithm in entity coreference, in which rules are applied progressively from strict to broader criteria to merge mentions. Lu and Ng [6] later adapted this concept to event linking, significantly improving performance under limited data conditions. Cybulska and Vossen [7,8,9] proposed the “event bag” representation method, encoding events as sets of attribute slots and introducing a weighted slot matching mechanism that balances interpretability and robustness. Hovy et al. [10,11] further explored the theoretical boundaries of event identity and proposed a three-tier classification system (“same, different, quasi-same”), providing a theoretical foundation for subsequent studies on coreference and hierarchical event relationships. Although rule-based approaches can achieve high precision and are highly interpretable, they tend to generalize poorly across different domains and struggle to capture implicit semantic relationships.
Graph-based approaches represent event mentions as nodes in a graph, with edges encoding semantic similarity between mentions. They achieve globally consistent clustering by using algorithms such as spectral clustering, minimum-cut partitioning, or graph propagation. Chen and Ji [12] proposed a two-stage framework (“feature combination → graph construction → graph partitioning”) and clearly defined the process for cross-document event linking as “local scoring followed by global partitioning.” In subsequent work [13], they applied a minimum-cut criterion to explicitly partition event clusters, thereby alleviating inconsistencies between local scoring and global clustering. Lee et al. [14] constructed a bipartite graph structure for entities and events, performing joint reasoning through shared arguments and optimizing clustering quality with an incremental merging strategy, significantly improving the accuracy and interpretability of cross-document alignment. Choubey and Huang [15] proposed an iterative expansion strategy guided by argument consistency. In each iteration, high-confidence event pairs are merged and similarity scores are updated, significantly improving recall—especially in multi-source news scenarios. Yang et al. [16] proposed a hierarchical generative model using non-parametric Bayesian methods. By incorporating prior constraints, this model guides the clustering structure and avoids the need to predefine the number of clusters. Graph-based methods provide strong global consistency and efficient information propagation. However, their performance heavily depends on the accuracy of the initial similarity estimates, and they often face high computational complexity when applied to large datasets.
With the introduction of annotated corpora such as ECB+ [27] and KBP [28], data-driven approaches have gradually become mainstream. Supervised learning methods typically treat event linking as a mention classification or clustering task and integrate features such as trigger words, semantic classes, and syntactic structures. Chen et al. [29] proposed a multi-parser ensemble framework to enhance the robustness of mention determination. However, these corpora cover open-domain or newswire texts, which may differ significantly from the specialized accident reports considered here, potentially affecting the direct applicability of these methods. Lu et al. [30,31] systematically explored joint reasoning and multi-task learning approaches. They used Markov logic networks and multi-task learning to improve global consistency and introduced a probabilistic ranking mechanism to mitigate global inconsistencies. Bejan and Harabagiu [32,33] developed an unsupervised event coreference model using non-parametric Bayesian methods. Leveraging external semantic resources such as FrameNet, their model achieved performance comparable to supervised methods on open-domain texts. Araki and Mitamura [34] jointly trained trigger identification and event coreference using a structured perceptron, significantly reducing error propagation. Upadhyay et al. [35] critically examined cross-document evaluation metrics and advocated using multiple evaluation measures to provide a comprehensive assessment of model performance. Together, supervised and generative models have improved feature fusion and global reasoning. Techniques such as joint learning and ranking optimization have been key to transitioning from local mention matching to coherent global clustering.
Neural network methods significantly reduce the dependence on manual feature engineering by learning representations end-to-end. For example, Kenyon-Dean et al. [17] employed clustering-based regularization to train a BiLSTM encoder, drawing similar events closer in the embedding space and thus enhancing model robustness. Peng et al. [18] addressed event detection and coreference jointly under weak supervision, offering a feasible solution to the high annotation costs of event tasks. Barhom et al. [19] extended the span-based representation and cascade clustering mechanism from entity coreference to event linking. This approach achieved state-of-the-art performance on the ECB+ dataset while simplifying the traditional pipeline. Cremisini and Finlayson [20] conducted a systematic analysis and found that careful negative sampling and threshold tuning could achieve strong baseline performance, while some complex model components contributed minimally. Allaway et al. [21] proposed a sequential cross-document coreference method, reducing inference complexity through state memory and historical consistency constraints, making it suitable for large-scale scenarios. The main advantage of neural methods lies in their ability to learn transferable semantic representations and support end-to-end optimization. Incorporating sequential reasoning further improves computational efficiency.
The emergence of pre-trained language models such as BERT has significantly enhanced semantic alignment and context modeling capabilities. For instance, Joshi et al. [36] applied BERT to entity coreference and demonstrated the model’s ability to capture cross-sentence context. They later improved phrase-level representations in event modeling by using span masking strategies [37]. In cross-document coreference resolution tasks, Cattan et al. [38] proposed a transformer-based framework for linking events and entities across long documents. Yu et al. [39] utilized a shared encoder to jointly learn representations for events and their arguments, highlighting the importance of capturing document-level structural information in event linking. Zeng et al. [40] introduced paraphrasing resources and argument-aware embeddings to alleviate matching issues caused by lexical diversity. Caciularu et al. [41] proposed a Cross-Document Language Model (CDLM) that significantly improved consistency via multi-document masking pre-training. Longformer [42] and other long-text models have also been widely used for document-level encoding. Taken together, these key advances in the pre-trained language model era, including cross-document pre-training, span-level semantic enhancement, and long-context encoding, have significantly improved handling of lexical diversity and semantic equivalence.
In recent years, large language models (LLMs) have introduced new approaches to event linking. Some studies [43,44] evaluated the performance of GPT-series models in zero-shot and few-shot information extraction tasks, finding that while LLMs perform well in simple scenarios, they still significantly lag behind specialized models on more complex tasks. To combine the broad capabilities of LLMs with the strengths of smaller models, several works have attempted to integrate LLMs’ semantic summarization ability with the task-specific fine-tuning of small language models (SLMs). For example, Min et al. [45] used GPT-4 to generate event summaries, then employed an SLM for coreference resolution. Nath et al. [23] generated reasoning explanations with an LLM and distilled these into a smaller model to improve interpretability and performance. Wang et al. [24] used instruction fine-tuning to unify multiple extraction tasks, while Ding et al. [46] enhanced the model’s sensitivity to semantic differences by utilizing counterfactual samples.
However, despite their strong performance on several benchmarks, LLM-based methods still have limitations. Research [22,44,45] indicates that prompt-based methods struggle to cover complex annotation standards, and LLMs still show lower accuracy in event coreference compared to supervised models. Bugert et al. [47] found that feature-engineered systems perform more robustly in cross-corpus testing, whereas purely neural models generalize poorly across domains. Together, these results suggest that effective annotation systems should incorporate corpus-specific features and appropriate constraint mechanisms to achieve better cost-effectiveness in annotation tasks.
3. Methodology
To address the issues of weakened explicit cues in publicly available accident texts due to anonymization, as well as significant genre and structural differences in cross-source texts, we frame our problem as a privacy-preserving record linkage (PPRL) task. Because anonymization often removes or generalizes key quasi-identifiers such as time and location, standard record linkage methods cannot directly align structured accident records with narrative reports. We therefore propose an event linking method based on spatiotemporal containment consistency to link accident descriptions across documents while preserving privacy. Structured records from accident statistics and direct reporting systems serve as anchors for accurate matching with event description fragments in open-domain texts. The overall process, as illustrated in Figure 1, includes four core modules: feature extraction, consistency calculation, probability fusion, and decision output.
First, we model administrative hierarchies and temporal granularity as a spatiotemporal lattice based on a subset of the Region Connection Calculus (RCC-8) [25] and Allen’s interval algebra [26]. In this lattice, spatial containment relations (e.g., province–city–district–town) and temporal interval relations (e.g., contains, overlaps, before/after) are formalized as primitive operations. A set of spatiotemporal and type consistency criteria is defined based on domain knowledge. Next, anchor weights for each criterion are calculated through smoothing estimation and monotonic projection. Then, a multi-feature fusion scoring model is constructed, which integrates multi-factor confidence and outputs cross-document alignment probabilities with uncertainty measures. Finally, an active-learning framework combines a maximum-probability and probability-gap sample selection strategy, effectively delineating the boundary between automatically labeled samples and those requiring human review.
Additionally, to alleviate data sparsity, we propose a controlled data augmentation strategy leveraging an instruction-tuned large language model (LLM). Augmented samples are generated under strict spatiotemporal constraints to preserve privacy and semantic fidelity; each generated example is re-evaluated by the alignment model, and only those whose consistency scores remain within a predefined tolerance are retained. We also discuss potential risks of LLM-based augmentation, including semantic drift, fabrication, and bias, and incorporate manual or semi-automatic filters to mitigate these issues. The subsequent sections of this chapter will formalize the definition of each component of this method and provide a detailed explanation.
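The retention filter for augmented samples can be sketched in a few lines. This is a minimal illustration, not the paper's exact procedure: the tolerance `tol` and floor `min_score` are hypothetical values standing in for the "predefined tolerance" mentioned above.

```python
def keep_augmented(orig_score, aug_score, tol=0.05, min_score=0.8):
    """Retain an LLM-augmented sample only if its re-evaluated alignment
    score stays high and within a tolerance of the source sample's score.

    tol and min_score are illustrative; the paper leaves exact values open.
    """
    return aug_score >= min_score and abs(aug_score - orig_score) <= tol
```

Samples failing either condition are routed to the manual or semi-automatic filters described above rather than silently added to the corpus.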
3.1. Symbol and Object Definitions
Let the set of accident records be $E = \{e_1, e_2, \dots, e_N\}$. Due to privacy regulations, these structured records often have sensitive quasi-identifiers removed or generalized, so missing or aggregated values must be handled explicitly. Each record $e_i$ represents a potential anchor event from the structured accident database and comprises three types of elements:
- (1)
Spatiotemporal elements $(\ell_i, \tau_i)$. The spatial location $\ell_i$ records the administrative hierarchy: province, city, district (or county), and street/town. The temporal attribute $\tau_i = [t_i^{s}, t_i^{e}]$ is a time interval indicating the event’s occurrence window.
- (2)
Enumerative non-spatiotemporal elements $a_i$, such as accident severity level, accident type, and work scenario. Each field corresponds to a hierarchical tree defined by industry standards; for example, accident types may be arranged as “fire > electric fire > transformer fire”.
- (3)
Structural elements, including numeric attributes such as casualties or economic losses. These are stored in a vector $v_i$; if a component is missing due to anonymization, a mask is applied to prevent penalizing the absence of information.
Let the set of accident texts from the internet and industry channels be $X = \{x_1, x_2, \dots, x_M\}$, where each text $x_j$ consists of a narrative description and additional scattered context. Because these narratives are typically anonymized, they may omit specific times, locations, or names; therefore, we only extract candidate values and their confidences, rather than assuming a single true value. Using rule-based patterns and regular expressions, we derive multiple candidate sets from each text:
- -
A spatial candidate set $L_j = \{\ell_{j,1}, \ell_{j,2}, \dots\}$, where each candidate encodes up to four administrative levels, and missing levels are denoted by null symbols.
- -
A temporal candidate interval set $T_j = \{\tau_{j,1}, \tau_{j,2}, \dots\}$, representing possible time intervals extracted from the narrative.
- -
An enumerative field candidate set $A_j$, corresponding to possible values for fields like accident type or severity.
- -
A structured vector $v_j$, empty if no numeric information is mentioned.
Each candidate is associated with an extraction confidence score: $c^{\mathrm{spa}}_{j,k}$, $c^{\mathrm{tmp}}_{j,k}$, and $c^{\mathrm{enum}}_{j,k}$ quantify the reliability of the spatial, temporal, and enumerative candidates, respectively, as determined by the extraction algorithm.
The objective of cross-document accident linkage is to match each text $x_j$ to its most likely structured record $e_i$. To this end, we compute an alignment probability $p(e_i \mid x_j)$ for every pair of text and candidate record. A dual-threshold decision rule is then applied: if the maximum probability exceeds a confidence threshold $\tau_p$ and the gap between the top two probabilities exceeds a separation threshold $\tau_\Delta$, we automatically label $x_j$ as linked to the top-ranked record; otherwise, the case is flagged for human review. Both thresholds $\tau_p$ and $\tau_\Delta$ are dimensionless constants optimized on a validation set to balance false positives against annotation cost.
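The dual-threshold rule above can be sketched as follows; the threshold values are illustrative placeholders, since the paper tunes them on a validation set:

```python
def decide(probs, tau_p=0.9, tau_gap=0.3):
    """Dual-threshold rule: auto-link only when the best candidate record is
    both confident (max probability) and well separated from the runner-up.

    probs: alignment probabilities for one text against all candidate records.
    Returns ("auto-link", best_index) or ("human-review", None).
    """
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    best_i, p1 = ranked[0]
    p2 = ranked[1][1] if len(ranked) > 1 else 0.0
    if p1 >= tau_p and (p1 - p2) >= tau_gap:
        return ("auto-link", best_i)
    return ("human-review", None)
```

A confident, well-separated top candidate is labeled automatically; ambiguous cases (high confidence but small gap, or low confidence overall) are deferred to annotators.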
3.2. Spatiotemporal Lattice Structure and Consistency Criterion Modeling
One of our key innovations is to cast the matching problem in a formal spatiotemporal lattice that captures both spatial and temporal containment relationships. This lattice provides a common framework for evaluating the consistency of a candidate event with an anchor record across multiple dimensions. Specifically, we employ a subset of the Region Connection Calculus (RCC-8) to represent topological relations among administrative regions and a subset of Allen’s interval algebra to represent temporal interval relations. These formal systems allow us to quantify how closely a candidate’s location or time relates to the record’s location or time (e.g., exact match, containment, overlap, disjointness). We collectively refer to these criteria as spatiotemporal containment consistency.
Spatial consistency quantifies how well a candidate location matches a record location on the administrative lattice. We classify the relationship between candidate and record into four categories (a subset of the RCC-8 relations):
- −
EXACT: all four administrative levels match (province, city, district, town);
- −
ANCESTOR: the candidate is a higher-level ancestor of the record (e.g., same city but no district specified);
- −
COUSIN: the candidate and record share a higher-level ancestor but differ at a lower level (e.g., same city, different districts);
- −
DISJOINT: no common ancestor exists.
Figure 2 visualizes these four relations.
We assign a weight $w_{\mathrm{spa}}(r)$ to each relation $r \in \{\mathrm{EXACT}, \mathrm{ANCESTOR}, \mathrm{COUSIN}, \mathrm{DISJOINT}\}$ to reflect its importance. The spatial compatibility for text $x_j$ and record $e_i$ is defined as:

$$S_{\mathrm{spa}}(x_j, e_i) = \max_{k} \; c^{\mathrm{spa}}_{j,k} \, w_{\mathrm{spa}}\big(\mathrm{rel}(\ell_{j,k}, \ell_i)\big),$$

where $\mathrm{rel}(\ell_{j,k}, \ell_i)$ returns the relation class between the $k$-th spatial candidate and the record, and $c^{\mathrm{spa}}_{j,k}$ is the extraction confidence of that candidate. This formulation ensures that more compatible candidates contribute higher scores.
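The four-way classification and the confidence-weighted spatial score can be sketched as below. The relation weights are illustrative constants; in the paper they are estimated via smoothing and monotonic projection (Section 3.3).

```python
def spatial_relation(cand, rec):
    """Classify a candidate location against a record location on the
    administrative lattice; levels run (province, city, district, town)
    and None marks a level the anonymized text does not specify."""
    for level, (c, r) in enumerate(zip(cand, rec)):
        if c is None:          # candidate stops specifying -> strict ancestor
            return "ANCESTOR"
        if c != r:             # diverges after a shared prefix -> cousins
            return "COUSIN" if level > 0 else "DISJOINT"
    return "EXACT"

# Illustrative weights; the paper learns them from a calibration set.
W_SPA = {"EXACT": 1.0, "ANCESTOR": 0.7, "COUSIN": 0.3, "DISJOINT": 0.0}

def spatial_compat(candidates, rec):
    """Confidence-weighted best relation over all extracted location candidates."""
    return max((conf * W_SPA[spatial_relation(c, rec)]
                for c, conf in candidates), default=0.0)
```

An anonymized mention of only the city (district and town left unspecified) thus still earns partial credit via the ANCESTOR relation instead of being treated as a mismatch.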
Temporal consistency measures how well a candidate time interval aligns with the record’s time interval. Based on a subset of Allen’s interval algebra, we consider five relations.
- −
EXACT: the candidate interval exactly matches the record interval;
- −
CONTAINS: the candidate interval fully contains the record interval;
- −
OVERLAP: the intervals overlap, but neither contains the other;
- −
AFTER (or BEFORE): the candidate starts after the record ends (or vice versa);
- −
DISJOINT: no temporal relationship exists.
Figure 3 illustrates the five temporal relations.
We define weights $w_{\mathrm{EXACT}} \ge w_{\mathrm{CONTAINS}} \ge w_{\mathrm{OVERLAP}} \ge w_{\mathrm{AFTER/BEFORE}} \ge w_{\mathrm{DISJOINT}}$ such that higher-compatibility relations receive larger weights. If a record contains multiple possible time intervals (e.g., due to aggregated reports), we iterate over each interval to evaluate the relation and take the maximum compatibility. The temporal compatibility is then computed as:

$$S_{\mathrm{tmp}}(x_j, e_i) = \max_{k} \; c^{\mathrm{tmp}}_{j,k} \, \max_{\tau \in T_i} w_{\mathrm{tmp}}\big(\mathrm{rel}(\tau_{j,k}, \tau)\big),$$

where $c^{\mathrm{tmp}}_{j,k}$ is the confidence of the $k$-th temporal candidate.
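A sketch of the interval classification and the max-over-intervals temporal score follows. The weights are illustrative, and DISJOINT (no extractable temporal relationship) is assumed to be handled before classification, since for concrete intervals "before/after" already covers non-overlap.

```python
# Illustrative weights; the paper estimates them statistically.
W_TMP = {"EXACT": 1.0, "CONTAINS": 0.8, "OVERLAP": 0.5, "AFTER_BEFORE": 0.1}

def interval_relation(cand, rec):
    """Map a candidate (start, end) interval onto the Allen subset used here."""
    c0, c1 = cand
    r0, r1 = rec
    if (c0, c1) == (r0, r1):
        return "EXACT"
    if c0 <= r0 and r1 <= c1:
        return "CONTAINS"          # candidate fully covers the record window
    if c0 < r1 and r0 < c1:
        return "OVERLAP"
    return "AFTER_BEFORE"          # strictly before or after the record

def temporal_compat(candidates, record_intervals):
    """Best confidence-weighted relation; records with several possible
    intervals contribute their maximum compatibility."""
    return max((conf * max(W_TMP[interval_relation(c, r)] for r in record_intervals)
                for c, conf in candidates), default=0.0)
```

A news report that states only the month (a coarse interval containing the record's occurrence window) therefore scores under CONTAINS rather than being rejected outright.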
Non-spatiotemporal consistency accounts for enumerative categorical fields, such as accident severity, type, or work scenario. Each field value corresponds to a node on a hierarchical tree $T_f$, where the tree structure captures categorical relations (e.g., “fire” > “electric fire” > “transformer fire”). Given a record value $a_i^{f}$ and a candidate value $a_{j}^{f}$, we calculate the distance on the tree $d_{T_f}(a_i^{f}, a_{j}^{f})$, which is dimensionless and represents how closely the two values match. Non-spatiotemporal consistency is then defined using an exponential kernel function:

$$S_{\mathrm{enum}}^{f}(x_j, e_i) = \exp\big(-\lambda_f \, d_{T_f}(a_i^{f}, a_{j}^{f})\big),$$

where $\lambda_f > 0$ controls the decay rate of similarity.
If the field is not mentioned in the text, we set $S_{\mathrm{enum}}^{f} = 1$, a neutral value that avoids penalizing the absence of information.
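The tree distance and exponential kernel can be computed as follows; this minimal sketch represents the category tree as a child-to-parent map and assumes both values sit under a common root. The decay rate `lam` is an illustrative placeholder for $\lambda_f$.

```python
import math

def tree_distance(parent, a, b):
    """Hop distance between two nodes on a category tree, given a
    child -> parent map. Assumes both nodes share a common root."""
    def path_to_root(n):
        path = [n]
        while n in parent:
            n = parent[n]
            path.append(n)
        return path
    pa, pb = path_to_root(a), path_to_root(b)
    lca = next(n for n in pa if n in pb)   # lowest common ancestor
    return pa.index(lca) + pb.index(lca)

def enum_compat(parent, rec_val, cand_val, lam=0.5):
    """Exponential kernel over the tree distance; identical values score 1."""
    return math.exp(-lam * tree_distance(parent, rec_val, cand_val))
```

For the example hierarchy, “transformer fire” vs. “electric fire” is one hop apart and scores higher than “transformer fire” vs. “gas fire”, which must route through the shared “fire” ancestor.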
Structural information consistency measures the similarity of numeric attributes, such as casualties or economic losses. Let $v_i$ be the record’s numeric vector and $v_j$ be the candidate’s vector, with missing components masked by a binary vector $m$. We compute the masked 1-norm difference:

$$d_{\mathrm{num}}(x_j, e_i) = \lVert m \odot (v_i - v_j) \rVert_1,$$

which accounts for missing information using the mask vector $m$. This difference is then transformed into a compatibility score using an exponential decay function:

$$S_{\mathrm{num}}(x_j, e_i) = \exp\big(-\gamma \, d_{\mathrm{num}}(x_j, e_i)\big),$$

where $\gamma > 0$ controls the rate of decay.
If the candidate text does not provide numeric information (due to anonymization), we set the mask to all zeros and assign $S_{\mathrm{num}} = 1$, avoiding penalization for missing data and allowing other features, such as time, location, and type, to dominate the matching. This ensures that missing or anonymized numeric values do not produce severe mismatches.
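The masked numeric compatibility can be sketched in a few lines; the decay rate `gamma` is an illustrative placeholder for $\gamma$, and an all-zero mask yields the neutral score described above.

```python
import math

def numeric_compat(v_rec, v_cand, mask, gamma=0.1):
    """Masked L1 difference mapped through exponential decay.

    mask[k] = 1 only where both record and text report the k-th quantity;
    if nothing is comparable, return the neutral score 1.0.
    """
    if not any(mask):
        return 1.0
    d = sum(m * abs(a - b) for m, a, b in zip(mask, v_rec, v_cand))
    return math.exp(-gamma * d)
```

A report matching the record's casualty count exactly scores 1.0 on that dimension, while larger masked discrepancies decay smoothly toward 0 instead of triggering a hard rejection.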
The spatiotemporal lattice structure and consistency criteria described above enable the system to handle noisy and incomplete data effectively. By combining spatial, temporal, and non-spatiotemporal consistency measures, the model produces stable and interpretable compatibility scores, allowing reliable event linking decisions even when some information is missing or anonymized. Grounding these criteria in formal systems, namely the RCC-8 and Allen subsets, keeps the model’s decisions robust and auditable.
3.3. Weight Estimation and Monotonic Projection
To convert discrete relationships into comparable numerical weights, this study adopts the Laplace (Lidstone) smoothing method [48]. This method improves the stability of weight estimation, particularly when the sample size is small. Laplace smoothing is a regularization technique that adjusts empirical frequency estimates by introducing a prior, helping avoid extreme probability values of 0 or 1 due to insufficient samples. This technique has been widely used in tasks such as text classification, language modeling, and record linkage [49]. The core idea is to smooth empirical frequencies by introducing prior information, enhancing the robustness of the data fusion process [48].
Let $r$ be a discrete relationship category, where the number of positive examples in the calibration set is denoted as $n_r^{+}$, the number of negative examples as $n_r^{-}$, and the total number of examples as $n_r = n_r^{+} + n_r^{-}$. If we directly use the empirical frequency $\hat{w}_r = n_r^{+} / n_r$, extreme estimates may arise, especially with small sample sizes. To mitigate this, we apply the Beta–Bernoulli model for smoothing. The prior is set as $\mathrm{Beta}(\alpha, \alpha)$, and the posterior expectation is:

$$w_r = \frac{n_r^{+} + \alpha}{n_r + 2\alpha}.$$

This smoothing formula ensures that the probability estimate remains bounded and less susceptible to overfitting or underfitting due to sparse data. The parameter $\alpha$ (a pseudo-count) controls the intensity of smoothing and can be adjusted for different levels of uncertainty.
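The posterior-mean estimate amounts to a one-line function; with a symmetric Beta(1, 1) prior it reduces to classic add-one (Laplace) smoothing:

```python
def smoothed_weight(n_pos, n_total, alpha=1.0):
    """Posterior mean of a Bernoulli rate under a symmetric Beta(alpha, alpha)
    prior: (n_pos + alpha) / (n_total + 2 * alpha). With no data it falls
    back to the prior mean 0.5 instead of an undefined 0/0."""
    return (n_pos + alpha) / (n_total + 2 * alpha)
```

For a relation seen 3 times positive out of 4, the smoothed weight is 4/6 ≈ 0.667 rather than the raw 0.75, pulling sparse estimates gently toward the prior.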
Some relationships between features have an explicit order of strength. For example, temporal relationships generally follow the order “Exact match ≥ Upper-level containment ≥ General overlap ≥ Before/after ≥ Disjoint,” and spatial relationships usually satisfy “Exact match ≥ Upper-level containment ≥ Same ancestor, different branches ≥ No relation.” Due to sampling fluctuations, independently smoothed weights may violate this natural order of relations. To address this issue, we apply a monotonic projection method to ensure that the weights respect the inherent ordering of relations. The projection process is formulated as:

$$\tilde{w} = \arg\min_{u_1 \ge u_2 \ge \cdots \ge u_K} \; \sum_{r=1}^{K} n_r (u_r - w_r)^2.$$

This minimization problem is efficiently solved using the Pool-Adjacent-Violators (PAV) algorithm [50,51], which adjusts the weights while maintaining the predefined order. After projection, the final weight vector $\tilde{w}$ reflects the trend in the data and preserves the monotonicity of relationships.
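A compact PAV implementation for the non-increasing case is shown below; it merges adjacent blocks whose weighted means violate the required order. This is a standard sketch of the algorithm rather than the authors' exact code.

```python
def pav_decreasing(w, n=None):
    """Pool-Adjacent-Violators projection of w onto non-increasing sequences,
    minimizing the n-weighted squared error (n defaults to uniform)."""
    n = list(n) if n is not None else [1.0] * len(w)
    blocks = []  # each block: [weighted mean, total weight, length]
    for value, weight in zip(w, n):
        blocks.append([value, weight, 1])
        # Merge while a later block's mean exceeds the preceding one's.
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            tot = w1 + w2
            blocks.append([(v1 * w1 + v2 * w2) / tot, tot, c1 + c2])
    out = []
    for value, _, count in blocks:
        out.extend([value] * count)
    return out
```

For instance, the smoothed sequence (1.0, 0.9, 0.95, 0.5) violates the order at the third entry; PAV pools the two offending values into their mean 0.925, restoring monotonicity with minimal change.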
It should be noted that the above method is not only applicable to temporal and spatial relationships but can also be used for other discrete mappings that require weighting. For example, relationships in enumerative non-spatiotemporal fields (e.g., accident level, accident type) can be smoothed using the Beta–Bernoulli model, with the order of categories adjusted through monotonic projection. This method ensures that categorical relationships like “same level > same ancestor > different branches” are handled in a statistically robust way, providing stable weights for the fusion model. For structured fields, the numeric compatibility has already been computed using the masked distance-exponential mapping, as discussed earlier, so these can be treated directly on the same scale as discrete weights.
The output of this section is a stable and comparable compatibility table for each discrete relationship. These compatibility scores will serve as inputs for the fusion model in Section 3.4, where they will be fused with other continuous compatibility values. By ensuring statistical robustness and monotonicity at this stage, we can achieve a fair trade-off between different features and produce reliable, interpretable alignment probabilities.
3.4. Multi-Feature Scoring Model
This section aims to fuse the compatibility values of different features on the same scale and output a calibrated alignment probability. Through the weight estimation and monotonic projection methods described in the previous section, various discrete relationships have been mapped to stable and dimensionless compatibility values. Based on these pre-processed features, the fusion module learns the interaction and trade-off mechanisms across features. The model structure is lightweight, highly interpretable, and maintains a monotonic non-decreasing property for each input component. Consistent with Section 3.1 and Section 3.2, the input to the fusion module includes the spatial compatibility $S_{\mathrm{spa}}$, the temporal compatibility $S_{\mathrm{tmp}}$, the set of enumerative compatibilities $\{S_{\mathrm{enum}}^{f}\}_{f=1}^{F}$, the structural numeric compatibility $S_{\mathrm{num}}$, and the semantic compensation term $S_{\mathrm{sem}}$; all are normalized to $[0,1]$ prior to fusion.
For any candidate pair $(x_j, e_i)$, the input vector is defined as:

$$z_{ij} = \big[\, S_{\mathrm{spa}}, \; S_{\mathrm{tmp}}, \; S_{\mathrm{enum}}^{1}, \dots, S_{\mathrm{enum}}^{F}, \; S_{\mathrm{num}}, \; S_{\mathrm{sem}} \,\big],$$

where $S_{\mathrm{spa}}$ and $S_{\mathrm{tmp}}$ denote the spatial and temporal compatibilities; $S_{\mathrm{enum}}^{f}$ denotes the compatibility of the $f$-th enumerative field (with $F$ fields); $S_{\mathrm{num}}$ denotes the masked numeric (structural) compatibility; and $S_{\mathrm{sem}}$ denotes a semantic compensation term derived from normalized text–record semantic similarity. By construction, every component lies in $[0,1]$, so the vector dimensionality is $F+4$.
The semantic compensation $c_{\mathrm{sem}}$ is computed from a normalized semantic similarity between the narrative of $t$ and the description associated with $r$ (e.g., cosine similarity of sentence embeddings, linearly mapped to $[0,1]$). It increases the score when the two texts are semantically close, even if some quasi-identifiers are generalized or missing due to anonymization, but it cannot override hard spatiotemporal conflicts enforced by $c_{\mathrm{sp}}$ and $c_{\mathrm{tm}}$. This provides an interpretable balance: lattice-based consistency yields transparent constraints, while $c_{\mathrm{sem}}$ contributes robustness to wording variation.
To enhance numerical stability, a log-odds transformation is applied to each dimension:
$$z_i = \log\frac{x_i + \epsilon}{1 - x_i + \epsilon}, \qquad i = 1, \ldots, d,$$
where $\epsilon$ is a small constant. This transformation expands values near 0/1 in a controlled way, reduces saturation effects, and makes the different compatibility components more comparable before they enter the fusion layer.
The fusion module uses a single hidden-layer feedforward architecture, ensuring non-negative weights and using non-decreasing activation functions to maintain monotonicity:
$$s = \mathrm{softplus}(\mathbf{v})^{\top}\,\phi\big(\mathrm{softplus}(\mathbf{W})\,\mathbf{z} + \mathbf{b}\big) + b_0,$$
where $\mathbf{W}$, $\mathbf{b}$, $\mathbf{v}$, and $b_0$ are trainable parameters and $\phi$ is a monotone non-decreasing activation. The weights are passed through the softplus function to ensure non-negativity.
Because all effective weights are non-negative and the activation is monotone, the output score is monotone in each input component. Hence, improving any single compatibility (e.g., making time or location more consistent) cannot decrease the overall score, which preserves interpretability.
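The monotone combiner can be sketched in a few lines of Python (a minimal illustration; the class name, hidden width, tanh activation, and random initialization are assumptions for the sketch, not the trained model):

```python
import math
import random

def log_odds(x, eps=1e-4):
    # Log-odds transform of a compatibility in [0, 1] (cf. Section 3.4).
    return math.log((x + eps) / (1.0 - x + eps))

def softplus(z):
    return math.log1p(math.exp(z))

class MonotoneCombiner:
    """Single-hidden-layer combiner; softplus on the raw weights keeps every
    effective weight non-negative, so the score is non-decreasing in each input."""
    def __init__(self, d, h, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(h)]
        self.b = [rng.gauss(0, 1) for _ in range(h)]
        self.v = [rng.gauss(0, 1) for _ in range(h)]
        self.b0 = 0.0

    def score(self, x):
        z = [log_odds(xi) for xi in x]
        # tanh is monotone increasing, so monotonicity is preserved layer by layer.
        hidden = [math.tanh(sum(softplus(w) * zi for w, zi in zip(row, z)) + bi)
                  for row, bi in zip(self.W, self.b)]
        return sum(softplus(vi) * hi for vi, hi in zip(self.v, hidden)) + self.b0
```

Because every effective weight is non-negative and both activations are increasing, raising any single compatibility can only raise (or leave unchanged) the output score.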
The alignment probability is calibrated using temperature scaling:
$$p = \sigma\!\left(\frac{s - s_0}{T}\right),$$
where $\sigma$ is the sigmoid function, and $T$, $s_0$ are fitted on a validation set by minimizing log loss or the Brier score.
We also monitor ECE to ensure that the predicted probabilities match empirical hit rates, which is critical for auditable decisions in PPRL. Calibration transforms the raw score into a probability with reliable uncertainty semantics.
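A simple way to fit the temperature on a validation set is a one-dimensional search over candidate values (an illustrative sketch assuming a fixed shift of zero and a coarse grid; a gradient-based fit of both $T$ and $s_0$ would serve equally):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def fit_temperature(scores, labels, grid=None):
    """Pick the temperature T minimising validation log loss for p = sigmoid(s / T)."""
    grid = grid or [i / 10.0 for i in range(1, 51)]
    eps = 1e-12
    def log_loss(T):
        total = 0.0
        for s, y in zip(scores, labels):
            p = min(max(sigmoid(s / T), eps), 1.0 - eps)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        return total / len(scores)
    return min(grid, key=log_loss)
```

On perfectly separated scores the search favors a sharp (small) temperature, while label noise pushes the optimum toward larger temperatures, i.e., softer probabilities.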
The model is trained on a small labeled alignment dataset comprising human-verified samples and high-confidence pseudo-labeled samples. The loss function uses binary cross-entropy with an additional $L_2$ regularization term:
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big] + \lambda \lVert \boldsymbol{\theta} \rVert_2^2,$$
where $y_i \in \{0, 1\}$ is the alignment label and $\lambda$ is the regularization coefficient. The hidden-layer width is set in a small range (e.g., 8–32) to control capacity; the optimizer is Adam with early stopping. All hyperparameters (including the temperature) are selected on a validation set to avoid ad hoc choices.
To align with privacy-preserving record linkage, features exclude direct identifiers; only quasi-identifier-based compatibilities and semantic similarity (with lattice constraints) are used. After training, we assess feature contributions via local sensitivity or integrated-gradient-style attributions to support post hoc interpretability.
3.5. Inference, Decision-Making, and Uncertainty
At the inference stage, let the alignment probability between text $t$ and candidate record $r$ be $p(t, r)$. Define the Bag-Level Strength and Probability Gap as:
$$S(t) = \max_{r} p(t, r), \qquad \Delta(t) = p_{(1)}(t) - p_{(2)}(t),$$
where $p_{(1)}(t)$ and $p_{(2)}(t)$ (with $p_{(1)} \ge p_{(2)}$) denote the largest and second-largest calibrated probabilities among the candidates of $t$. Intuitively, the strength $S(t)$ summarizes the top alignment confidence for $t$, while the gap $\Delta(t)$ quantifies the separation between the best and the second-best candidates; both are computed from the calibrated probabilities in Section 3.4. Using calibrated $p(t, r)$ ensures that these decision statistics retain probabilistic meaning, which is essential for auditable, privacy-preserving record linkage (PPRL).
The actual decision adopts a dual-threshold rule: when the strength satisfies $S(t) \ge \tau_s$ and the gap satisfies $\Delta(t) \ge \tau_g$, text $t$ is automatically linked to the top-ranked record $r^{*} = \arg\max_r p(t, r)$; otherwise, it is marked for human review. Both $\tau_s$ and $\tau_g$ are dimensionless and tuned on a validation set to balance false positives and review cost; we also verify that decisions are well calibrated by monitoring the Brier score and expected calibration error (ECE). If the gap is small, manual review is required even when the absolute strength is high.
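The dual-threshold rule reduces to a few lines (a sketch; the threshold defaults are placeholders, not the tuned values):

```python
def decide(probs, tau_s=0.90, tau_g=0.20):
    """Dual-threshold rule on the calibrated candidate probabilities of one text:
    auto-link only if both the top probability and the top-two gap are large."""
    ranked = sorted(probs, reverse=True)
    strength = ranked[0]
    gap = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)
    return "auto-link" if (strength >= tau_s and gap >= tau_g) else "review"
```

Note that a high-strength text with a near-tied runner-up is still routed to review, which is exactly the ambiguity the gap threshold is meant to catch.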
For evidently conflicting candidate pairs, a one-time pre-calibration penalty is applied to prevent severely inconsistent matches from receiving high probabilities. We modify the raw fusion score as:
$$s' = s - \delta \cdot \mathbb{1}\big[\mathrm{rel}_{\mathrm{tm}} \in \{\mathrm{AFTER}, \mathrm{DISJOINT}\} \ \wedge\ \mathrm{rel}_{\mathrm{sp}} = \mathrm{DISJOINT}\big].$$
As Equation (14) indicates, if the temporal relationship is AFTER or DISJOINT and the spatial relationship is likewise DISJOINT, we subtract a fixed constant $\delta$ from the pre-calibration score and then calibrate the adjusted score as outlined in Section 3.4. The penalized probability is obtained via temperature scaling:
$$p' = \sigma\!\left(\frac{s' - s_0}{T}\right).$$
This “hard-conflict” guard leverages the RCC-8 and Allen subsets from Section 3.2 to enforce interpretable consistency when spatiotemporal evidence is contradictory.
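The guard itself is a conjunction test on the two relation labels (an illustrative sketch; the penalty constant is a placeholder):

```python
def hard_conflict_guard(score, temporal_rel, spatial_rel, delta=2.0):
    """Subtract a fixed pre-calibration penalty when the Allen relation is
    AFTER/DISJOINT *and* the RCC-8 relation is DISJOINT."""
    conflict = temporal_rel in {"AFTER", "DISJOINT"} and spatial_rel == "DISJOINT"
    return score - delta if conflict else score
```

Because only the joint conflict triggers the penalty, a single noisy relation (e.g., a generalized location yielding DISJOINT while the time interval still overlaps) does not suppress an otherwise plausible match.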
When multiple texts point to the same record and satisfy the auto-alignment condition, merging avoids duplicate counting. We define the set of auto-aligned texts for record $r$ as:
$$A(r) = \big\{\, t \ :\ S(t) \ge \tau_s,\ \Delta(t) \ge \tau_g,\ r^{*}(t) = r \,\big\}.$$
If $|A(r)| > 1$, we sort the texts by $p(t, r)$ in descending order, keep the top-confidence text as primary, and merge the others as auxiliary evidence under the same record, preserving sources and timestamps. Merging does not alter labels; it consolidates provenance for auditing.
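The merge step can be sketched as a grouping pass (an illustration; identifiers and the dictionary layout are assumptions):

```python
def merge_aligned(auto_links):
    """auto_links: (text_id, record_id, calibrated_prob) triples that passed
    the auto-link rule; keep the top-confidence text per record as primary."""
    by_record = {}
    for tid, rid, p in auto_links:
        by_record.setdefault(rid, []).append((p, tid))
    merged = {}
    for rid, items in by_record.items():
        items.sort(reverse=True)  # highest calibrated probability first
        merged[rid] = {"primary": items[0][1],
                       "auxiliary": [tid for _, tid in items[1:]]}
    return merged
```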
To support long-term operation and audits, the inference pipeline records relation categories and smoothed weights; all compatibility scores; input vectors $\mathbf{x}$; fusion scores $s$; final probabilities $p$; the statistics $S(t)$ and $\Delta(t)$; decisions; and merge identifiers. Logs use pseudonymized IDs and hashed tokens to fit PPRL constraints while retaining reproducibility. We continuously monitor the Brier score and ECE; when calibration drift is observed, the temperature parameters are re-estimated without retraining the combiner.
3.6. LLM-Based Data Augmentation
Let the set of automatically aligned samples in the current training slice be $\mathcal{A}$. From this pool we construct a stricter subset for augmentation:
$$\mathcal{A}^{+} = \big\{\, t \in \mathcal{A} \ :\ S(t) \ge \tau_s',\ \Delta(t) \ge \tau_g' \,\big\},$$
where $S(t)$ and $\Delta(t)$ are defined in Section 3.5, and $\tau_s' > \tau_s$, $\tau_g' > \tau_g$ are stricter than the auto-labeling thresholds to ensure high-precision seeds.
Augmented texts are generated under domain constraints. We compile a domain knowledge base $\mathcal{K}$ (terminology, entity hierarchies) and use the enumerative ontologies $\mathcal{O}$ from Section 3.2. For each seed $t \in \mathcal{A}^{+}$ aligned to record $r$, we derive a constraint set $\mathcal{C}(t, r)$ from the RCC-8 subset (spatial) and the Allen subset (temporal), fixing the target topological and interval relations relative to $r$. This ensures that augmented texts preserve spatiotemporal consistency.
We employ an instruction-tuned generator $G$ with prompt templates $\mathcal{P}$ to produce diversified but consistent variants. For each seed $t$ linked to $r$, we generate $n_g$ candidates:
$$\tilde{t}_1, \ldots, \tilde{t}_{n_g} = G\big(t, r;\ \mathcal{K}, \mathcal{O}, \mathcal{C}(t, r), \mathcal{P}\big).$$
We apply three controlled transformations:
- (i)
same-class entity substitution (siblings under $\mathcal{O}$);
- (ii)
parent–child rewriting (ontology level up/down) while keeping the RCC-8/Allen relations to $r$ unchanged;
- (iii)
knowledge-guided paraphrases that avoid adding direct identifiers.
For each candidate $\tilde{t}$, we re-extract features and recompute $c_{\mathrm{sp}}$, $c_{\mathrm{tm}}$, $\{c_{\mathrm{enum}}^{(j)}\}$, $c_{\mathrm{num}}$, and the calibrated probability $p(\tilde{t}, r)$ (pipeline as in Section 3.4 and Section 3.5). A candidate $\tilde{t}$ is accepted iff all of the following hold:
$$c_{\mathrm{sp}}(\tilde{t}) \ge c_{\mathrm{sp}}(t) - \varepsilon_{\mathrm{sp}}, \quad c_{\mathrm{tm}}(\tilde{t}) \ge c_{\mathrm{tm}}(t) - \varepsilon_{\mathrm{tm}}, \quad c_{\mathrm{enum}}^{(j)}(\tilde{t}) \ge c_{\mathrm{enum}}^{(j)}(t) - \varepsilon_{\mathrm{enum}} \ \ \forall j, \quad p(\tilde{t}, r) \ge \tau_{\mathrm{aug}},$$
where $\varepsilon_{\mathrm{sp}}$, $\varepsilon_{\mathrm{tm}}$, $\varepsilon_{\mathrm{enum}}$ are per-dimension tolerances (no degradation beyond a small margin) and $\tau_{\mathrm{aug}}$ is a probability floor for retention.
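The acceptance test is a per-dimension no-degradation check plus a probability floor (a sketch; the dictionary-based interface and the default floor are assumptions):

```python
def accept_candidate(seed, cand, tol, p_floor=0.85):
    """seed/cand map compatibility names to values in [0, 1] and include the
    calibrated probability under key 'p'; tol gives per-dimension tolerances."""
    for name, eps in tol.items():
        # Reject if any compatibility degrades beyond its small margin.
        if cand[name] < seed[name] - eps:
            return False
    return cand["p"] >= p_floor
```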
We enforce PPRL-compatible safeguards during augmentation: blocklisting direct identifiers (names, precise addresses), no finer-grained geocoding than the anchor’s administrative level, time expressions preserved/aggregated to keep the Allen relation, toxicity/bias filters, and manual spot checks on a fixed budget. Rejected candidates are discarded and not logged beyond aggregate statistics.
Accepted augmented texts are added only to the training set with a sample weight $w_{\mathrm{aug}} < 1$ (smaller than that of original samples, to reduce confirmation bias); the validation and test sets remain untouched. All provenance is logged: seed ID, generated candidate $\tilde{t}$, template, constraint set, pre/post compatibilities, and acceptance decisions, supporting audits and ablations.
This constrained augmentation increases coverage while preserving the spatiotemporal relations to $r$ and the calibrated alignment behavior of the model. By combining lattice-aware prompts, per-component tolerance checks, and a probability floor, we mitigate semantic drift and privacy risks while yielding diverse, label-consistent training texts.
3.7. Model Training and Implementation Details
The training process is divided into two stages. In the first stage, a small number of manually verified samples together with high-confidence samples obtained through strict rule-based filtering are used to lightly train the monotonic combiner described in
Section 3.4, producing the initial probability outputs. Combined with the dual-threshold rule of
Section 3.5, this completes the first round of large-scale automatic annotation. This stage generates a high-precision auto-labeled pool for subsequent augmentation and retraining while keeping validation/testing untouched.
Based on the accident occurrence time, the data are then split into training, validation, and test sets to avoid temporal leakage. We denote the chronological splits as $\mathcal{D}_{\mathrm{train}}$, $\mathcal{D}_{\mathrm{val}}$, and $\mathcal{D}_{\mathrm{test}}$, respectively, where all hyperparameters are selected on $\mathcal{D}_{\mathrm{val}}$ and final metrics are reported only on $\mathcal{D}_{\mathrm{test}}$. The auto-labeled pool used for augmentation is limited to $\mathcal{D}_{\mathrm{train}}$.
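A chronological split can be implemented as a sort followed by contiguous slicing (an illustrative sketch; the fraction defaults are assumptions):

```python
def chronological_split(items, val_frac=0.1, test_frac=0.1):
    """items: (occurrence_time, payload) pairs. The latest slices go to
    validation/test so no future information leaks into training."""
    items = sorted(items, key=lambda it: it[0])
    n = len(items)
    n_test, n_val = int(n * test_frac), int(n * val_frac)
    train = items[: n - n_val - n_test]
    val = items[n - n_val - n_test : n - n_test]
    test = items[n - n_test :]
    return train, val, test
```

Unlike a random split, every training example strictly precedes every validation example in time, which is what rules out temporal leakage here.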
In Stage-1, we train the monotone combiner with non-negative weights and monotone activations (see Section 3.4) to obtain raw scores $s$ and calibrated probabilities $p$. We then apply the dual thresholds $\tau_s$ and $\tau_g$ from Section 3.5 to each text $t$ to produce auto-labels, yielding a seed pool $\mathcal{A}$ and its stricter subset for augmentation $\mathcal{A}^{+}$ as defined in Section 3.6.
In Stage-2, only the fusion module is fully trained and re-calibrated on $\mathcal{D}_{\mathrm{train}}$; the validation and test sets are never augmented or re-estimated. Accepted augmented samples (see Section 3.6) are included with a weight $w_{\mathrm{aug}} < 1$ to mitigate confirmation bias; the temperature parameter $T$ is refit on $\mathcal{D}_{\mathrm{val}}$ using the Brier score and ECE.
A unified preprocessing pipeline is applied:
- (i)
Text normalization (numbers/units, temporal expressions, place names);
- (ii)
Feature extraction (temporal anchors and granularity, administrative levels, enumerative field nodes, structural vectors, semantic similarity);
- (iii)
Consistency score computation and caching (columnar storage indexed by sample–candidate pairs).
Candidate retrieval uses an inverted index keyed by time windows, administrative hierarchy, and event type; the per-text candidate cap is $K$, tuned on $\mathcal{D}_{\mathrm{val}}$ to balance recall and cost; and vectorized batch computations are used to accelerate scoring.
The first-stage training uses the Adam optimizer with a fixed learning rate and weight decay, a batch size of 256, and early stopping with a patience of five epochs. The model input consists of the log-odds transforms of the compatibility scores (see Section 3.4). A single-hidden-layer monotonic network with softplus-constrained weights ensures non-negative effective weights (see Section 3.4). Hyperparameters are selected on $\mathcal{D}_{\mathrm{val}}$ via grid or line search; final settings are fixed before $\mathcal{D}_{\mathrm{test}}$ evaluation.
With an average of $\bar{K}$ candidates per text and $N$ texts, candidate retrieval and scoring cost $O(N\bar{K})$; training the fusion layer is approximately $O(nd)$ per epoch with $n$ training pairs and input dimension $d$. Static fields and intermediate compatibilities are cached to avoid re-computation; matrix-batch operations are used end-to-end to reduce wall-clock time and memory traffic.
Experiments are conducted in Python 3.10 and PyTorch 2.x with fixed random seeds; both data and models are version-controlled. For auditability, we log relation categories and smoothed weights, all compatibility scores, input vectors $\mathbf{x}$, fusion scores $s$, probabilities $p$, thresholds $\tau_s$, $\tau_g$, decisions, and merge IDs (Section 3.5). If calibration drift is detected (Brier/ECE), $T$ is re-estimated on $\mathcal{D}_{\mathrm{val}}$ without retraining.
Final evaluation on $\mathcal{D}_{\mathrm{test}}$ includes accuracy-type measures and calibration (Brier/ECE). We also conduct ablations, disabling one component at a time (lattice constraints, smoothing/projection, temperature calibration, semantic compensation) to quantify each contribution.
4. Experimental Setup
4.1. Data Sources and Processing
The experiments rely on two complementary sources. The first is a collection of structured accident records exported from authoritative statistics and direct reporting systems, denoted by $\mathcal{R}$. The second is a multi-source narrative corpus gathered from public web and industry channels, denoted by $\mathcal{T}$. To align with privacy-preserving record linkage (PPRL), all direct identifiers are removed or generalized before access; only quasi-identifiers (time, location, type) are retained at coarsened granularity.
Each record $r \in \mathcal{R}$ comprises the spatial vector $\mathbf{g}_r$ over administrative levels (province, city, district/county, street/town); the temporal interval $[t_r^{-}, t_r^{+}]$; enumerative non-spatiotemporal fields $\mathbf{e}_r$ mapped to domain ontologies $\mathcal{O}$; and structural numeric attributes $\mathbf{u}_r$ (e.g., casualties, losses) accompanied by a mask for missing components. All schemas follow public safety standards; enumerative fields are normalized to node IDs in $\mathcal{O}$ to enable the tree-distance computations introduced in Section 3.2.
For each text $t \in \mathcal{T}$, we apply normalization (numbers/units, sentence segmentation) and then extract candidate sets using rules and regular expressions: the spatial set $G_t$ (up to four administrative levels, with missing levels left null), the temporal set $I_t$, and the enumerative set $E_t$, plus the structural vector $\mathbf{u}_t$ if present. Each candidate carries an extraction confidence for its spatial, temporal, and enumerative components. We perform web-scale de-duplication and near-duplicate detection (shingling + locality-sensitive hashing) to suppress repeated reports while preserving provenance. Direct identifiers, if any, are masked prior to storage to comply with PPRL.
Location strings are canonicalized to the administrative hierarchy and geocoded to level IDs. In preprocessing, we also derive topological relations using a subset of RCC-8 (e.g., EXACT/ANCESTOR/COUSIN/DISJOINT) between $G_t$ and $\mathbf{g}_r$, so that the spatial compatibility $c_{\mathrm{sp}}$ can be computed consistently at inference. Temporal expressions are normalized to closed intervals. We compute interval relations using a subset of Allen’s algebra (EXACT/CONTAINS/OVERLAP/BEFORE/AFTER) between $I_t$ and $[t_r^{-}, t_r^{+}]$, enabling the temporal compatibility $c_{\mathrm{tm}}$. These lattice-aware features operationalize the theoretical framework of Section 3.2 during data processing.
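The coarse Allen-subset classification over closed intervals can be sketched as follows (an illustration of the five-label subset above; treating the asymmetric "record contains text" case as OVERLAP is a simplifying assumption of this sketch):

```python
def allen_subset(text_iv, record_iv):
    """Coarse Allen subset (EXACT/CONTAINS/OVERLAP/BEFORE/AFTER) between a
    text interval and a record interval, each given as (start, end)."""
    (a0, a1), (b0, b1) = text_iv, record_iv
    if (a0, a1) == (b0, b1):
        return "EXACT"
    if a0 <= b0 and b1 <= a1:        # text interval contains the record interval
        return "CONTAINS"
    if a1 < b0:                       # text interval ends before the record starts
        return "BEFORE"
    if b1 < a0:                       # text interval starts after the record ends
        return "AFTER"
    return "OVERLAP"
```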
Enumerative fields extracted from $t$ are mapped into $\mathcal{O}$ via dictionaries and alias tables, with synonym consolidation and out-of-vocabulary back-off to parent nodes to avoid brittle mismatches. Structural attributes (if mentioned) are parsed into $\mathbf{u}_t$ with unit normalization; a binary mask is retained to indicate missing components so that $c_{\mathrm{num}}$ in Section 3.2 applies no penalty when data are anonymized or absent.
We sort all items by occurrence time and partition them into $\mathcal{D}_{\mathrm{train}}$, $\mathcal{D}_{\mathrm{val}}$, and $\mathcal{D}_{\mathrm{test}}$ to avoid temporal leakage. Only $\mathcal{D}_{\mathrm{train}}$ is used for auto-labeling and LLM-based augmentation; $\mathcal{D}_{\mathrm{val}}$ is used for hyper-parameter and temperature tuning; $\mathcal{D}_{\mathrm{test}}$ is strictly held out. In total, we collected 1328 structured accident records and 6865 narratives spanning 2014–2024, across 8 accident types. Each record includes a 4-level spatial hierarchy vector, a time interval, enumerative fields, and numeric attributes. All baselines are processed with the same normalization, candidate extraction, and annotation pipeline (symbols/objects, consistency scores, dual-threshold decision) and use the same candidate cap $K$ to match recall. Scripts are version-controlled; logs store only pseudonymized IDs and hashed tokens to comply with PPRL while preserving reproducibility.
4.2. Baseline Models
We select baseline models under a small-sample, two-stage training and calibration setting. All comparison methods share the same data splits, candidate generation pipeline, and evaluation protocol as the proposed method. On the input side, accident summary texts are uniformly used as the alignment target, denoted as $t$, while direct-report records are converted into brief descriptions by concatenating structured elements (time, location, category, casualties, etc.) as the alignment reference, denoted as $r$. On the output side, ranking quality, classification quality, and probability calibration quality are reported uniformly, using the calibrated alignment probability $p$ from Section 3.4; the confidence threshold $\tau_s$, the gap threshold $\tau_g$, and the temperature-scaling parameters ($T$, $s_0$) are determined only on the validation set.
To ensure reproducibility and fairness, the baselines adopt the same processing and annotation pipeline as the proposed method (symbol and object definitions, consistency score computation, dual-threshold decision rules, etc.) and maintain the same candidate pool size and recall target. All methods apply the same dual-threshold decision on $p$: auto-label if $p \ge \tau_s$ and the top-two probability gap satisfies $\Delta \ge \tau_g$; otherwise, send to review (cf. Section 3.1). Probability calibration quality is reported with the Brier score and ECE, consistent with Section 3.4. This unified setup guarantees comparability under the same task formulation, reducing biases introduced by differences in data handling or engineering components.
The rule–probability baseline implements event linking under the Fellegi–Sunter (F-S) probabilistic record linkage framework [52]. It maps each summary–record sample to a field-level comparison vector, separately evaluating time-interval relations, administrative-level containment relations, hierarchical distances of enumerative non-spatiotemporal elements, and consistency of structured vectors, then computing discrete or continuous evidence components. Based on a small calibration set, it estimates the conditional distributions of true matches and non-matches to generate likelihood ratios and produces “match,” “non-match,” or “undecided” judgments via upper and lower thresholds. For comparability, we convert the F-S scores to calibrated probabilities $p$ via temperature scaling with $T$ on the validation set and apply the same $\tau_s$, $\tau_g$ decision as above. This method focuses on element-level evidence, offering an interpretable performance bound for the case where only structured consistency is used. The implementation retains interpretability for each component and jointly tunes thresholds and rejection strategies on the validation set to remain consistent with the probability calibration and risk-control standards of the proposed method.
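The core Fellegi–Sunter evidence computation is a sum of field-level log-likelihood ratios (a minimal sketch under the standard m/u-probability formulation; the function name and binary agreement interface are assumptions):

```python
import math

def fs_log_likelihood_ratio(agreements, m_probs, u_probs):
    """Fellegi-Sunter field-level evidence: add log(m/u) when a field agrees
    and log((1-m)/(1-u)) when it disagrees, summed over all compared fields."""
    llr = 0.0
    for agree, m, u in zip(agreements, m_probs, u_probs):
        llr += math.log(m / u) if agree else math.log((1 - m) / (1 - u))
    return llr
```

Thresholding the summed log-likelihood ratio with an upper and a lower cut-off yields the "match" / "non-match" / "undecided" trichotomy described above.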
The retrieval–re-ranking baseline uses the Cross-Document Event Coreference Search (CDECS) method [53], which transforms cross-document event coreference into a retrieval–ranking task and supports zero-shot or small-sample processes. Candidate recall is performed by an unfine-tuned general semantic encoder, followed by lightweight threshold adjustment and temperature scaling on the validation set for calibration. In this task, the accident summary is encoded as a query vector, and the textualized description of direct-report records is encoded as a library vector; nearest-neighbor recall is performed first, then similarity scores or a lightweight classification head are used for ranking and probabilistic scoring. We report probabilities as $p$ after temperature scaling with $T$, and apply the same $\tau_s$, $\tau_g$ rule; initial recall uses the top-$K$ nearest neighbors.
The zero-shot linking baseline employs the dual-tower dense retrieval model from the BLINK framework [54]. BLINK decomposes the “mention–entity linking” task into dual-tower retrieval and cross-encoder fine ranking, relying primarily on entity description texts rather than task-specific supervision, and thus operates in zero-shot conditions. For this task, direct-report records are normalized into event-element descriptions (short texts of time, location, category, consequences, etc.) to build the vector index; accident summaries are encoded as query vectors for top-$K$ candidate retrieval and ranking, and similarity is mapped to probabilities $p$ via temperature scaling on the validation set. The same $\tau_s$, $\tau_g$ decision is used.
The generative zero-shot decision baseline adopts the Argument-Aware Event Linking (AAEL) method [23]. Without human annotation, it uses argument-aware prompts to structure the input so that the model focuses more on matching time, location, and participants during ranking. Applied to this task, summary–record pairs are filled into a prompt template, and the model outputs a judgment on whether they describe the same event together with a brief rationale, which we convert to $p$ using the same temperature-scaling calibration on the validation set, followed by the $\tau_s$, $\tau_g$ decision rule to align with the evaluation standard of this paper. This baseline tests the transferability and potential risks of large language models in cross-document event linking under low-supervision conditions.
The teacher–student distillation baseline uses an LLM-based method proposed at NAACL 2024 [55]. Its core idea is to have an autoregressive large language model generate free-text rationales as distant-supervision signals and distill them into a lightweight student model, thereby improving cross-document event coreference performance without increasing manual annotation. In this task, the teacher model generates alignment/non-alignment rationales and soft labels for summary–record samples, while the student model trains a classification head under small-sample conditions using a joint loss (hard-label cross-entropy plus soft-label KL divergence [56]). During inference, only the student model is used, with temperature scaling on the validation set. The output probabilities are reported as $p$ and thresholded with $\tau_s$, $\tau_g$ for consistency with our evaluation protocol. This baseline operates stably under few-label and pseudo-label settings and tests the feasibility and benefits of transferring teacher-model capabilities while achieving efficient student-model inference.
Together, these five baselines cover interpretable evidence-based probabilistic linking, retrieval–re-ranking under general semantic paradigms, dense-retrieval zero-shot linking, argument-aware generative re-ranking, and the “free-text rationale + distillation” low-annotation paradigm. All have peer-reviewed empirical results on public benchmarks (e.g., ECB+ [27]) and require only minimal adaptation when transferred to the one-to-one “accident summary–direct report” alignment task, making them suitable references under the small-sample and zero-shot conditions of this study. All metrics and decisions are computed on the calibrated $p$ with the temperature $T$ from Section 3.4 and the same $\tau_s$, $\tau_g$ rule.
4.3. Parameter Settings
Parameter settings include only points directly related to experiment reproducibility and adopt consistent selection principles across methods whenever possible. All methods share the same data splits, extractor versions, and candidate generation pipeline. The candidate pool size is determined on the validation set according to the criterion of “minimizing the average number of candidates while ensuring target recall” and is kept unchanged on the test set. All probability outputs are reported as the calibrated alignment probability $p$ and use temperature scaling with parameters $(T, s_0)$ fitted once on the validation set to minimize log loss or the Brier score; decisions follow the same dual-threshold rule with $\tau_s$ (confidence) and $\tau_g$ (top-two gap) as in Section 3.1, Section 3.2, Section 3.3 and Section 3.4; no adjustments are made during testing.
Text truncation strategies are kept consistent across methods: only “accident summary” and “record synopsis” are used, with priority given to preserving time, location, and key numerical expressions to avoid biases introduced by length differences.
For the Fellegi–Sunter baseline, key parameters include the prior smoothing strength of the components and the likelihood ratio thresholds. Component conditional distributions are estimated from the small calibration set using symmetric Beta–Bernoulli smoothing, and the smoothing coefficient is selected on the validation set by grid search to jointly optimize the Brier score and ECE. For time and administrative-level relationships, component weights are subjected to isotonic projection using equidistant regression to preserve the order of relation strength, applied to the spatiotemporal relation weights $w_{\mathrm{sp}}$ and $w_{\mathrm{tm}}$ for consistency with Section 3.2 and Section 3.3. The likelihood ratio adopts joint upper and lower thresholds, with threshold selection primarily maximizing F1 while controlling the false positive rate within a preset range on the validation set. The log-likelihood ratio is linearly mapped to log-odds before temperature scaling and converted to $p$ with the same $T$ to ensure comparability of probability outputs across methods.
For the CDECS baseline, the core parameters are the encoder, similarity temperature, and recall depth. The encoder uses a publicly available general semantic model and remains frozen; similarity is fitted with a single temperature parameter on the validation set to map cosine similarity stably to log-odds; recall depth matches the candidate pool size, with no separate optimization per method to avoid selection bias. If small-sample fine-tuning is required, only a few contrastive-learning steps are performed on positive–negative pairs from the calibration set, with the learning rate and number of steps chosen on the validation set; if no fine-tuning is performed, the encoder remains frozen and only the temperature and thresholds are fitted on the validation set. All reported probabilities are calibrated with $T$, and the $\tau_s$, $\tau_g$ decision is applied uniformly.
For the BLINK dense retrieval baseline, zero-shot operation is emphasized. The dual-tower encoder uses public weights and remains frozen; the entity side (here, the “record synopsis” side) builds the vector index, and the query side is the “accident summary.” Under zero-shot settings, the cross-encoder re-ranking is disabled, relying only on dual-tower similarity ranking, with temperature scaling and threshold fitting on the validation set. Approximate retrieval parameters of the index are aligned with the candidate pool size, and no additional model-specific tuning is performed. If minimal adaptation is applied, it is limited to standardizing the textual format of “record synopsis” templates within the validation set without updating encoder parameters. Outputs are presented as $p$ after calibration, and decisions use the same $\tau_s$, $\tau_g$ rule.
For the LLM-based zero-shot re-ranking baseline, a deterministic inference configuration is used to reduce variance. The generation temperature is set to zero or near zero, sampling is disabled, and the maximum generation length is sufficient to cover the judgment and a brief rationale. Inputs use a fixed pairwise template explicitly specifying time, location, and field elements; outputs are parsed into binary decision scores and then normalized to probabilities on the validation set using logistic regression or temperature scaling. To avoid prompt drift, the template remains unchanged throughout the experimental period; if refusals or “cannot judge” outputs occur, they are uniformly treated as lowest-confidence negative examples, and this rule is applied consistently on the validation set without ad hoc modifications on the test set. For comparability, final probabilities are calibrated as $p$ via $T$, followed by the $\tau_s$, $\tau_g$ decision.
For the “rationale + distillation” baseline, a teacher–student two-stage configuration is adopted. The teacher is an open-source autoregressive large model used only to generate alignment/non-alignment rationales and soft labels offline; it does not participate in online inference. The student is a long-context encoder with a lightweight classification head, trained only on the calibration set and high-confidence pseudo-labels under small-sample conditions, using a weighted sum of hard-label cross-entropy and soft-label KL divergence [56] as the loss function. The weights of the two losses, learning rate, training steps, and early-stopping criteria are all chosen by a small grid search on the validation set, aiming to maximize F1 and Hit@k without sacrificing the Brier score or ECE. The distilled student runs independently at test time and uses the same temperature scaling and thresholds without invoking teacher inference. At evaluation, we report the calibrated $p$ with $T$ and apply the shared $\tau_s$, $\tau_g$ rule.
To ensure fair comparison, the above settings follow two principles:
- (i)
any decision-related hyperparameters are chosen only on the validation set, with no temporary adjustments on the test set;
- (ii)
method-specific engineering optimizations are restricted within feasible bounds to avoid incomparability due to different candidate pool sizes, different text truncation strategies, or different external resources.
4.4. Evaluation Metrics
We adopt the following evaluation suite: Hit@k, ROC-AUC, Precision, Recall, F1, the Brier score, and Expected Calibration Error (ECE). All probability-based metrics are computed on the calibrated alignment probability $p$ (cf. Section 3.4). Ranking quality is assessed with Hit@k (top-$k$ hit rate) and ROC-AUC. Hit@k evaluates whether the ground-truth target appears within the top $k$ positions of the candidate list sorted by $p$; common choices of $k$ are 1, 5, and 10, which reflect the model’s best-match capability and the extent of candidate coverage useful during human review. Here $k$ denotes the rank cut-off for Hit@k and should not be confused with the enumerative field count $m$ in Section 3.4 or the retrieval depth $K$ used during candidate recall. ROC-AUC measures the robustness of the overall ranking, evaluated over $p$ for positive and negative pairs, with higher values indicating stronger discrimination between positive and negative instances. Precision, Recall, and F1 are computed from binary decisions obtained by the dual-threshold rule on $p$: predict “linked” if $p \ge \tau_s$ and the top-two probability gap satisfies $\Delta \ge \tau_g$ (parameters fixed from the validation set). The Brier score computes the mean squared error between predicted probabilities and the true labels, applied to $p$, where lower values indicate more accurate probability estimates; ECE is computed on $p$ using a fixed binning scheme shared across methods and assesses calibration by comparing predicted probabilities with empirical accuracies across confidence bins. These two metrics provide important guidance for setting decision thresholds between automated processing and human review in practical deployments and align with the temperature-scaling setup $(T, s_0)$ established on the validation set in Section 3.4.
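The ranking and calibration metrics can be computed as follows (a minimal sketch with equal-width bins, assuming binary 0/1 labels; the bin count is an illustrative default):

```python
def hit_at_k(ranked_ids, true_id, k):
    """Hit@k: the ground-truth record appears among the top-k candidates."""
    return true_id in ranked_ids[:k]

def ece(probs, labels, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total, err = len(probs), 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean predicted probability
        acc = sum(y for _, y in b) / len(b)    # empirical positive rate
        err += (len(b) / total) * abs(conf - acc)
    return err
```

ECE is the bin-size-weighted average gap between mean confidence and empirical accuracy, so a perfectly calibrated model scores zero regardless of its discrimination.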
5. Experimental Results and Analysis
This section reports and analyzes the main-task alignment under a few-shot, two-stage setting. To maintain evaluation consistency, all methods adopt the same data split, candidate-set generation pipeline, and input truncation strategy. Decision thresholds and temperature-based probability calibration parameters are fit on the validation set and then fixed on the test set to avoid leakage and ensure reproducibility.
5.1. Overall Performance Analysis
Table 1 summarizes the performance of each method on ranking metrics (Hit@1, Hit@5), classification metrics (Precision, Recall, F1), and probability metrics (ROC-AUC, Brier, ECE). For transparency and comparability, all metrics are computed under the same split and candidate pipeline specified in Section 4. Our method achieves the highest F1 (62.46%) and the best calibration (Brier = 0.14, ECE = 1.97%) while maintaining competitive ranking (Hit@1 = 41.51%, Hit@5 = 77.33%); its ROC-AUC of 87.34% is close to the best value (87.56%, achieved by CDECS). In other words, the proposed method attains the best F1 and calibration among all systems while keeping discrimination comparable to the strongest baseline, indicating strong accuracy with reliable probabilities for automatic labeling.
In terms of ranking, AAEL achieves the best Hit@1 (42.12%), as shown in
Figure 4a, followed by our method at 41.51% (a gap of 0.61 percentage points). The remaining methods rank as LLM (40.93%), CDECS (38.58%), BLINK (36.84%), and F-S (33.49%). For Hit@5, LLM leads with 77.88%, while our method attains 77.33% (a gap of 0.55 percentage points); AAEL (75.63%), CDECS (74.71%), BLINK (72.08%), and F-S (69.31%) follow. While Hit@k reflects top-rank retrieval quality, small deltas in ranking do not necessarily translate into more reliable automatic labeling, which depends critically on probability calibration and thresholding.
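For concreteness, the Hit@k metric used above can be computed with a short routine; the candidate IDs below are hypothetical examples, not records from the dataset.

```python
def hit_at_k(ranked_candidates, gold, k):
    """Fraction of mentions whose gold record appears in the top-k candidates."""
    hits = sum(1 for cands, g in zip(ranked_candidates, gold) if g in cands[:k])
    return hits / len(gold)

# Three mentions, each with a ranked candidate list and a gold record ID.
ranked = [["e3", "e1", "e7"], ["e2", "e9", "e4"], ["e5", "e8", "e1"]]
gold = ["e1", "e2", "e6"]
h1 = hit_at_k(ranked, gold, 1)  # only the second mention is correct at rank 1
h5 = hit_at_k(ranked, gold, 5)  # the first mention's gold also appears in top 5
```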
Regarding classification metrics, as shown in
Figure 4b, our method attains the highest F1 score (62.46%), exceeding the next-best LLM (60.65%) by 1.81 percentage points. Further analysis shows our Precision is 73.92%, second only to F-S (74.83%); our Recall is 54.07%, 1.24 percentage points lower than the best LLM (55.31%). Under a unified threshold across systems, this operating point offers a favorable Precision–Recall balance for auto-labeling, whereas F-S favors Precision at the expense of Recall and LLM does the opposite.
In probability calibration, as shown in
Figure 5, our method achieves the lowest ECE (1.97%) and Brier score (0.14) among all methods. The Brier score measures the mean squared error of probabilistic predictions, where smaller values indicate predictions closer to the true distribution. ECE reflects calibration error; smaller values indicate more reliable confidence. Good calibration enables direct mapping from predicted probabilities to expected accuracy, thereby supporting threshold selection between automatic labeling and human review. For probability-based discrimination, ROC-AUC evaluates overall separability across thresholds; CDECS performs best (87.56%), and ours is a close second (87.34%, Δ = 0.22 pp), so pairing strong discrimination with superior calibration is crucial for dependable PPRL-style deployment.
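The Brier score and binned ECE described here can be computed as in the following sketch. The 10-bin equal-width scheme mirrors the fixed binning mentioned above; the toy probabilities and labels are illustrative only.

```python
def brier(probs, labels):
    """Mean squared error between predicted probabilities and binary outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the last bin
        bins[idx].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)  # mean predicted probability
        acc = sum(y for _, y in b) / len(b)   # empirical positive rate
        err += (len(b) / total) * abs(conf - acc)
    return err

probs = [0.9, 0.8, 0.7, 0.3, 0.2]
labels = [1, 1, 0, 0, 0]
b_score = brier(probs, labels)
e_score = ece(probs, labels)
```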
Figure 6 depicts the probability calibration of each model. The x-axis denotes the mean predicted probability within a bin, and the y-axis the empirical positive rate. The ideal calibration curve (black dashed line) corresponds to perfect probability estimates. Results show clear differences across models: our method’s curve lies closest to the diagonal, indicating high agreement between predicted and true probabilities; traditional methods (e.g., F-S) fall clearly below the diagonal, revealing systematic overestimation; some models (e.g., CDECS, BLINK) underestimate in low-probability bins but overestimate in high-probability bins.
5.2. Stability Analysis Across Different Sources
To further evaluate the robustness of each model in practical settings, this subsection examines the impact of text source on model performance. We partition the test set into three categories by source: government portals (e.g., National Energy Administration, State Grid, Local Emergency Management Authorities), professional institutions (e.g., Polaris Power, Xuexi Qiang’an integrated media platform), and public media (e.g., NetEase News, Chudian News). We report Hit@1 and F1 for each source, as shown in
Table 2. For fairness, the candidate-set generation pipeline, probability calibration, and decision thresholds remain the same across sources (fixed from validation), and the test split is identical for all methods.
In the cross-source comparison, the Hit@1 curves exhibit pronounced layering and fluctuations. Specifically, all models perform best on government texts, followed by professional-institution texts, and drop to the lowest on public-media texts. As shown in
Figure 7a, the curves of AAEL and our method (Ours) remain in the top tier with relatively smooth trajectories, indicating stronger cross-domain stability; in contrast, F-S shows the largest fluctuations, with a marked decline on media data, suggesting weaker adaptability to non-standardized texts. This ordering is consistent with the F1 pattern in
Figure 7b and reflects increasing structural variability from government to public-media sources.
To better understand source-specific differences in information structure, we randomly sampled 100 accident reports from each source and computed the completeness of extracted accident information. We focused on coverage across nine key fields: time of occurrence, location, casualties, economic loss, involved project, involved organization, operational stage, accident type, and injury-causing agent. This sampling analysis is descriptive and is used solely to explain performance differences rather than to train or tune models.
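The per-field coverage computation behind this descriptive analysis can be sketched as follows; the field keys and sample records are hypothetical simplifications of the nine key fields listed above.

```python
FIELDS = ["time", "location", "casualties", "economic_loss", "project",
          "organization", "stage", "accident_type", "injury_agent"]

def coverage(reports):
    """Per-field coverage rate: share of reports with a non-empty value."""
    n = len(reports)
    return {f: sum(1 for r in reports if r.get(f)) / n for f in FIELDS}

# Two toy reports; missing keys count as uncovered fields.
sample = [
    {"time": "2023-05-01", "location": "City A", "casualties": "2"},
    {"time": "2023-06-12", "location": "City B", "accident_type": "fall"},
]
cov = coverage(sample)
```

Such per-source coverage profiles are what the radar chart visualizes.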
Figure 8 presents the coverage radar chart.
Figure 8 intuitively shows the differentiated coverage patterns. Government sources are most balanced, with broad arcs on basic fields (time, location, casualties) but no disclosure for involved organization, involved project, and injury-causing agent, consistent with standardized reporting requirements. Such structurally regular texts—with fixed information positions—benefit rule-based or traditional feature-matching models (e.g., F-S), which can achieve higher accuracy on basic fields. The lack of certain fields is largely attributable to privacy/compliance practices in public reporting, which reduces the availability of identifiers for linking.
By contrast, professional-institution texts exhibit prominent sectors on operational stage, accident type, and injury-causing agent, highlighting a focus on technical analysis. Terminology-dense, detail-rich documents favor models with semantic understanding (e.g., CDECS, LLM), which capture technical attributes and therefore perform well on cause- and agent-related fields. Traditional keyword models tend to degrade under such semantic complexity. Our method, which integrates spatiotemporal consistency with calibrated semantic fusion, maintains competitive Hit@1 and the best F1 in this source (
Table 2), indicating robustness to terminology-dense narratives.
Public-media texts cover basic information like time and location reasonably well, but the sectors for involved organization and injury-causing agent narrow significantly, reflecting timeliness and selectivity in news reporting. Their looser structure, redundant details, and narrative style pose challenges to all models. Even so, the high coverage of basic fields provides surface-matching opportunities, enabling models with stronger semantic generalization (e.g., Ours) to maintain relatively stable overall performance. Consistent with
Table 2, our method attains the best F1 (60.53%) and the top Hit@1 (39.87%) on public-media texts, suggesting that calibrated probabilities help resist noise introduced by narrative redundancy.
In sum, accident texts from different sources exhibit systematic structural differences that markedly affect model performance. Government texts are standardized and field-complete, favoring traditional matching models but limiting deep reasoning due to avoidance of sensitive details. Professional-institution texts are terminology-dense with rich technical detail, better suited to semantically capable models for deeper analysis and linking. Public-media texts are loosely structured and information-redundant, challenging for all models, yet their high coverage of basic fields still benefits models with robust generalization. The findings indicate that models relying on surface feature matching generalize poorly across sources, whereas models incorporating semantic understanding and domain adaptation better accommodate structural differences and thus maintain more stable cross-source performance. From a deployment perspective, these observations motivate reporting both ranking/classification and calibration metrics by source so that operating thresholds for auto-labeling vs. human review can be tuned to each source’s structural profile.
5.3. Stability Analysis Under Different Decision Thresholds
In practical deployments of event linking, the choice of decision threshold directly affects both predictive outcomes and utility. To comprehensively assess each model's stability across thresholds, we examine how the F1 score varies under different operating points, with the aim of verifying whether our calibration advantage yields more reliable outputs across settings. The threshold also defines the operational boundary between automatic labeling and human review, making stability across thresholds practically important.
To ensure comparability, we use a unified test set and fixed model parameters and evaluate three decision thresholds—0.3, 0.5, and 0.7—computing F1 for each model at each threshold while keeping all other experimental conditions unchanged. Temperature-scaling parameters for probability calibration remain fixed from the validation stage to avoid leakage and to ensure a fair comparison. The three thresholds correspond to lenient (0.3), default (0.5), and conservative (0.7) operating points.
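The threshold sweep can be reproduced with a simple routine that computes F1 at each operating point and the across-threshold fluctuation; the toy probabilities below are illustrative, not the paper's data.

```python
def f1_at_threshold(probs, labels, threshold):
    """Binarize calibrated probabilities at a threshold and compute F1."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

probs = [0.85, 0.75, 0.6, 0.4, 0.35, 0.1]
labels = [1, 1, 0, 1, 0, 0]
scores = {t: f1_at_threshold(probs, labels, t) for t in (0.3, 0.5, 0.7)}
delta = max(scores.values()) - min(scores.values())  # fluctuation range
```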
Table 3 and
Figure 9 report the F1 scores at each threshold and the across-threshold fluctuation range (Δ).
Our method maintains consistently high F1 across thresholds and exhibits the smallest fluctuation (only 1.74%), indicating the best threshold robustness. In contrast, baselines fluctuate between 3.25% and 4.52% (AAEL 3.25%, CDECS 3.44%, LLM 3.74%, F-S 3.91%, BLINK 4.52%), suggesting greater sensitivity to threshold selection. This aligns with our calibration advantage (lower Brier/ECE in
Section 5.1), which reduces the mismatch between predicted probabilities and empirical accuracy and thus dampens metric oscillation across operating points.
5.4. Automatic Labeling Efficacy Analysis
We evaluate automatic labeling efficacy from an active-learning perspective. The analysis is based on the “maximum probability + probability gap” fusion strategy, which identifies high-uncertainty cases and precisely delineates the boundary between auto-labeling and human review. Metrics include the proportion of auto-labeled items, the proportion sent to human review, the accuracy of high-confidence auto-labels, and the proportion of true positives within the human review set. The execution results are shown in
Table 4. Following the protocol in
Section 3.5, both the decision threshold and the probability-gap parameter are fixed on the validation set and held constant on the test set to prevent leakage and enable fair comparison. For clarity, Auto denotes the share of texts automatically labeled; Acc (HC auto) is the accuracy among high-confidence auto-labels; Review is the share routed to human assessment; and TPR (review set) is the true-positive rate within the items sent to review. This design aligns with the PPRL triage principle by separating high-confidence, low-risk cases for automation from ambiguous cases that require human scrutiny.
The results indicate that our method delivers the best automatic labeling efficacy. The auto-label proportion reaches 77.7%, while the human review proportion drops to 22.3%. The accuracy of high-confidence auto-labels is 97.51%, indicating highly reliable automated outputs; within the human review set, the true-positive rate is 81.46%, confirming that the strategy effectively routes difficult, high-value cases for manual inspection. Compared with the F-S baseline (auto-label 58.8%, human review 41.2%), our approach reduces the human review share by 18.9 percentage points. Competing models (CDECS, BLINK, AAEL, LLM) achieve auto-label shares between 62.6% and 71.4%, all below our method. Furthermore, Acc (HC auto) of 97.51% exceeds all baselines by 1.68–5.14 pp (vs. 95.83% for the strongest competitor and 92.37% for F-S), while the review-set TPR of 81.46% indicates that the human queue is enriched for genuinely positive cases—an operationally desirable triage property. These advantages stem from more accurate probability calibration coupled with the “maximum-probability + probability-gap” criterion, which together provide reliable uncertainty estimates and a principled boundary between automated processing and human review, thereby improving throughput and quality while avoiding rework from erroneous labels.
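A minimal sketch of the "maximum probability + probability gap" triage rule follows; the thresholds `p_min` and `gap_min` are hypothetical placeholders for the parameters fixed on the validation set.

```python
def triage(candidate_probs, p_min=0.7, gap_min=0.2):
    """Route a mention to auto-labeling only if its top candidate is both
    confident (max probability >= p_min) and unambiguous
    (gap to the runner-up >= gap_min); otherwise route to human review.
    Thresholds here are illustrative placeholders."""
    ranked = sorted(candidate_probs, reverse=True)
    top = ranked[0]
    gap = top - ranked[1] if len(ranked) > 1 else top
    if top >= p_min and gap >= gap_min:
        return "auto"
    return "review"

r1 = triage([0.92, 0.41, 0.10])  # confident and well separated
r2 = triage([0.91, 0.85, 0.12])  # high max probability but ambiguous gap
r3 = triage([0.55, 0.20])        # low confidence
```

The gap criterion is what routes ambiguous, cluster-adjacent cases to reviewers even when the maximum probability alone looks acceptable.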
5.5. Ablation Study
To evaluate the contribution of each proposed component, an ablation study was performed by systematically removing one module at a time from the complete model while keeping all other settings, datasets, and hyperparameters unchanged. The evaluation followed the same experimental protocol as described in
Section 4, and all results were reported on the identical test set to ensure fair comparison. The components analyzed include:
- (i)
Spatiotemporal lattice consistency module, which provides spatial–temporal constraints using the RCC-8 and Allen-subset relations;
- (ii)
Smoothing and monotonic projection module, which regularizes weight estimation and preserves the natural order among relation strengths;
- (iii)
Semantic compensation term, which enhances robustness to linguistic variation;
- (iv)
Temperature-based probability calibration, which improves the reliability of probabilistic outputs.
Each ablation variant was retrained from scratch to isolate its effect.
Table 5 summarizes the results in terms of F1 (linking accuracy), Brier score (probability error), and ECE (expected calibration error).
The full model achieves the best overall performance (F1 = 62.46%, Brier = 0.14, ECE = 1.97%). Removing the spatiotemporal lattice consistency causes the most severe degradation, reducing F1 to 57.21% (−5.25%) and worsening calibration (Brier = 0.18, ECE = 3.84%). This confirms that the RCC-8/Allen-based lattice is fundamental for interpreting anonymized quasi-identifiers and ensuring coherent matching decisions.
Eliminating the semantic compensation term results in a smaller but consistent decrease in recall, leading to F1 = 60.28% (−2.18%), Brier = 0.15, and ECE = 2.22%. The drop is most pronounced for narrative-heavy public-media sources, where anonymization causes missing quasi-identifiers. This finding demonstrates that the semantic component complements lattice-based reasoning by restoring soft contextual clues.
Without smoothing and monotonic projection, the model exhibits less stable behavior, with F1 = 59.04% (−3.42%), Brier = 0.16, and ECE = 2.91%. This indicates that unsmoothed empirical weights introduce overconfidence and disrupt the monotonic relationship among spatial and temporal relations, impairing calibration stability.
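The monotonic projection of empirical weights can be implemented with the pool-adjacent-violators algorithm. The unweighted L2 version below is a sketch under an assumed non-decreasing ordering of relation strengths; it is not necessarily the paper's exact estimator.

```python
def monotone_projection(weights):
    """Project a weight sequence onto the nearest non-decreasing sequence
    (pool-adjacent-violators, unweighted least squares)."""
    merged = []  # blocks of [mean, count]
    for w in weights:
        merged.append([w, 1])
        # Merge adjacent blocks while the ordering is violated.
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m2 = merged.pop()
            m1 = merged.pop()
            cnt = m1[1] + m2[1]
            mean = (m1[0] * m1[1] + m2[0] * m2[1]) / cnt
            merged.append([mean, cnt])
    out = []
    for mean, cnt in merged:
        out.extend([mean] * cnt)
    return out

raw = [0.2, 0.5, 0.4, 0.9]  # empirical weights violating the expected order
proj = monotone_projection(raw)
```

The projection pools the violating pair (0.5, 0.4) into a shared mean, removing the local inversion while staying as close as possible to the raw estimates.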
Finally, removing temperature-based calibration barely affects F1 (62.11%, −0.35%) but substantially degrades probability reliability (Brier = 0.17, ECE = 4.08%). The absence of calibration results in overconfident scores and higher threshold sensitivity, which undermines auditability in privacy-preserving record linkage. The ablation results highlight three main insights:
- (i)
Spatiotemporal consistency is indispensable for reasoning over anonymized accident reports, enabling structured correlation of location and time even when explicit identifiers are removed.
- (ii)
Statistical regularization via smoothing and monotonic projection significantly enhances stability and prevents overfitting to sparse relation frequencies.
- (iii)
Semantic compensation improves recall under heterogeneous textual expressions, while probability calibration ensures the model’s predictions retain auditable probabilistic meaning.
Together, these modules jointly ensure both quantitative accuracy and qualitative interpretability, validating the necessity of each design choice for reliable, privacy-preserving record linkage.
6. Results
The proposed method delivers consistently strong results across ranking, classification, and probability-based metrics. On the test set, it attains Hit@1 of 41.51% and Hit@5 of 77.33%, while achieving the best F1 of 62.46% with Precision 73.92% and Recall 54.07%. Probability-based discrimination is also competitive (ROC-AUC 87.34%). Most importantly for deployment, we follow a validation-fit/test-fix protocol for probability calibration and thresholds. The model exhibits the most reliable probability estimates among all systems evaluated, with the lowest Brier score (0.14) and the lowest ECE (1.97%). These properties are crucial in privacy-preserving record linkage (PPRL) settings, where calibrated probabilities support auditable thresholding and governance. Together, the findings indicate that small differences in Hit@k do not fully capture downstream reliability, whereas improvements in F1 and calibration translate into more trustworthy confidence scores for operational decisions.
Stratified analyses show stable performance across heterogeneous sources, including government portals, professional institutions, and public media, with the expected ordering (best on government texts, worst on public-media texts) holding across all methods. Our approach remains in the top tier for each source and shows reduced variability relative to feature-matching baselines. This robustness is attributable to the combination of structured spatiotemporal constraints and semantic modeling: the RCC-8-based spatial relations and the Allen-subset temporal relations enforce interpretable containment consistency, while the semantic compensation term recovers soft evidence when quasi-identifiers are generalized or omitted.
Threshold-sensitivity experiments further corroborate the method’s stability. Sweeping the decision threshold over 0.3, 0.5, and 0.7, the model sustains the highest F1 at each operating point and exhibits the smallest fluctuation range (Δ = 1.74%) among all systems. In practice, such reduced sensitivity is consistent with better probability calibration and the use of constrained monotonic fusion, simplifying deployment scenarios where operating points may change with workload or policy. Consequently, operating-point selection can be aligned with review budgets while maintaining consistent utility across regimes.
Finally, the automatic labeling study demonstrates clear practical gains. Using a decision rule that combines maximum probability with a probability-gap criterion, the system automatically labels 77.7% of cases with 97.51% accuracy among high-confidence outputs, while routing 22.3% to human review; within the review subset, the true-positive rate reaches 81.46%. Relative to a strong feature-matching baseline, this reduces the human review share by 18.9 percentage points, increasing throughput without sacrificing quality. These outcomes reflect a calibrated triage: high-confidence items are automated, whereas ambiguous or cluster-adjacent cases, identified via the probability gap, are preferentially escalated to review, enriching the reviewer queue with genuinely informative positives.
Overall, the results show that integrating spatiotemporal lattice constraints with monotonic probability fusion and explicit calibration yields a model that is not only accurate but also operationally reliable. The ablation study further substantiates these contributions by quantifying the role of lattice consistency, smoothing/monotonic projection, semantic compensation, and temperature-based calibration. The approach provides calibrated confidence suitable for principled thresholding, maintains robustness across sources and operating points, and enables scalable, auditable automatic labeling in real-world accident-corpus construction.
7. Discussion
Our results confirm that explicit spatiotemporal containment modeling and end-to-end probability calibration jointly yield both competitive accuracy and operational reliability for cross-document accident linking. Concretely, the method matches strong baselines on ranking and discrimination while delivering the lowest Brier and lowest ECE, and it sustains the highest F1 across decision thresholds with the smallest performance fluctuation. These properties are especially valuable under PPRL constraints for strictly anonymized accident texts, where auditable probabilities and consistent behavior across operating points are prerequisites for deployment. In practice, this translates into an auto-label share of 77.7% with 97.51% accuracy on high-confidence outputs and a focused review queue (TPR 81.46%).
Viewed against prior lines of research, the approach complements both rule/template and graph-based methods—known for interpretability and global consistency but sensitive to similarity heuristics—as well as neural and PLM/LLM paradigms that excel at semantic alignment but can struggle when anchors are obscured by anonymization. By encoding administrative hierarchies and temporal granularity as a lattice (RCC-8/Allen subset) and employing a monotone combiner, the method injects domain knowledge that constrains plausible matches, thus mitigating lexical variability and granularity mismatch that frequently degrade open-domain systems when transferred to safety texts. This design reconciles two desiderata often viewed as competing: (i) interpretable, auditable decision factors and (ii) transferable semantic representations robust to style and source shifts. Concretely, interpretability is enforced by the monotone lattice combiner (raising spatial or temporal consistency can only increase the link score) and by a hard-conflict guard that deterministically penalizes AFTER/DISJOINT temporal–spatial contradictions before calibration; both signals are logged for audit. Semantic robustness then enters as a bounded compensation term: on the most challenging public-media texts, this balance yields the best F1 (60.53%) with top-tier Hit@1 (39.87%), showing that the semantic compensation term can recover matches when quasi-identifiers are generalized, while the lattice prevents semantically plausible but topologically inconsistent links. Finally, explicit calibration converts constrained scores into trustworthy probabilities (Brier = 0.14; ECE = 1.97%), so borderline items with small probability gaps are triaged to review rather than auto-labeled—consistent with our 22.3% review share enriched with true positives (81.46%).
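The monotonicity and hard-conflict properties described here can be illustrated with a toy combiner; the weights and penalty value are assumptions for illustration, not the paper's fitted parameters.

```python
def link_score(spatial, temporal, semantic, hard_conflict,
               w=(0.4, 0.4, 0.2), conflict_penalty=0.5):
    """Monotone fusion sketch: the score is non-decreasing in each consistency
    signal, and a hard AFTER/DISJOINT contradiction applies a deterministic
    penalty before calibration. Weights and penalty are illustrative."""
    s = w[0] * spatial + w[1] * temporal + w[2] * semantic
    if hard_conflict:
        s -= conflict_penalty
    return max(0.0, min(1.0, s))

base = link_score(0.8, 0.7, 0.6, hard_conflict=False)
higher = link_score(0.9, 0.7, 0.6, hard_conflict=False)   # monotone in spatial
penalized = link_score(0.8, 0.7, 0.6, hard_conflict=True)  # conflict guard fires
```

Because each weight is non-negative, improving any consistency signal can only raise the score, which is the auditability property the text describes.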
The stratified analyses across government portals, professional institutions, and public media situate the results in a broader context of source-dependent structure. Government reports are standardized and field-complete, favoring all systems but especially those relying on surface features; professional-institution texts are terminology-dense and analytically rich, rewarding models with stronger semantic understanding; public-media reports are loosely structured and selective, challenging every method. Our system’s cross-source stability is best explained by its dual reliance on (a) lattice-based containment consistency, which regularizes inference when key fields are present, and (b) calibrated probabilities, which reduce overconfidence when fields are missing or conflicting—thereby preserving throughput without inflating false positives.
Threshold-sensitivity experiments illuminate how calibration informs governance in human-in-the-loop pipelines. The two-tier rule—requiring both a minimum probability and a probability gap—operationalizes a conservative policy: obvious cases are automated; ambiguous, cluster-adjacent items are triaged for review. Unlike systems that implicitly conflate similarity with probability, our framework explicitly calibrates outputs (via temperature scaling) and logs decision metadata (strength/gap, relation categories, penalties), enabling principled choices of operating points tied to desired review budgets or service-level constraints. This aligns with deployment needs in critical domains, where auditability matters as much as mean accuracy.
At the same time, the study reveals limitations that motivate further work. First, upstream extraction errors (e.g., time normalization, location resolution, argument parsing) propagate into fusion scores; while metadata logging and spot checks mitigate this, extraction quality remains a bottleneck. Second, small-sample calibration with monotonic projection may underrepresent rare factor combinations; when the calibration set lacks coverage, probability estimates can be conservative or biased in the tails. Third, the fixed conflict penalty may over-suppress true positives in edge cases where administrative boundaries and reporting practices diverge (e.g., cross-jurisdiction incidents), suggesting a need for context-aware penalty schedules. In addition, feature sparsity in long-tail categories can limit the reliability of categorical compatibilities; hierarchical smoothing or Bayesian shrinkage may further stabilize these estimates.
These observations point to several future directions: (1) Joint extraction–linking with uncertainty propagation (e.g., marginalizing over NER/normalization hypotheses) could reduce error cascades. (2) Adaptive calibration—such as hierarchical temperature scaling or Bayesian calibration—may better handle long-tail regimes and data drift. (3) Open-set detection and near-duplicate clustering could curb false links in bursty incident streams. (4) Multilingual and cross-regional extensions would test the lattice under diverse administrative ontologies and reporting norms. (5) Human-in-the-loop interfaces that expose calibrated probabilities, gap margins, and factor-wise contributions could improve reviewer efficiency and trust. (6) Finally, releasing auditable benchmarks with structured–unstructured pairs and calibration targets would facilitate apples-to-apples comparisons and accelerate progress on reliable event linking in safety-critical domains. Where feasible, these directions should preserve PPRL constraints by avoiding direct identifiers and documenting any relaxation via explicit governance policies.
Recent safety engineering research underscores the importance of analyzing spatiotemporal patterns to prevent accidents. Li et al. [
57] mined 174 coal mine explosion investigation reports, categorized them by time and region, and applied frequent itemset mining to identify coordination failures and risk propagation paths. Xie et al. [
58] analyzed spatial–temporal distributions and used correlation and regression models to trace causal factors in urban underground pipeline disasters. These studies rely on summarizing existing reports rather than linking anonymized events. In contrast, our lattice-constrained framework goes further by encoding administrative hierarchies and temporal granularity, instantiating relations via RCC-8 and Allen’s interval algebra, and calibrating probabilities. This cross-document event linking capability not only produces auditable matches but also enables construction of high-quality accident corpora, providing a technical foundation for data-driven accident prevention research. In sum, by fusing structured constraints with calibrated probabilistic inference, the proposed system advances beyond accuracy alone toward deployable reliability—maintaining performance across sources and thresholds, reducing manual burden, and providing transparent signals for governance and audit. This offers a concrete and extensible path for building large-scale, auditable accident corpora under privacy constraints, integrating interpretable lattice-based reasoning with well-calibrated, semantics-aware fusion.