Article

Structure-Aware and Format-Enhanced Transformer for Accident Report Modeling

by Wenhua Zeng 1,2,*, Wenhu Tang 1,*, Diping Yuan 3, Hui Zhang 2, Pinsheng Duan 4 and Shikun Hu 2,5
1 School of Electric Power Engineering, South China University of Technology, Guangzhou 510641, China
2 Shenzhen Urban Public Safety and Technology Institute, Shenzhen 518024, China
3 Shenzhen Research Institute, China University of Mining and Technology, Shenzhen 518057, China
4 School of Mechanics and Civil Engineering, China University of Mining and Technology, Xuzhou 221116, China
5 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 7928; https://doi.org/10.3390/app15147928
Submission received: 24 June 2025 / Revised: 14 July 2025 / Accepted: 14 July 2025 / Published: 16 July 2025
(This article belongs to the Special Issue Advances in Smart Construction and Intelligent Buildings)

Abstract

Modeling accident investigation reports is crucial for elucidating accident causation mechanisms, analyzing risk evolution processes, and formulating effective accident prevention strategies. However, such reports are typically long, hierarchically structured, and information-dense, posing unique challenges for existing language models. To address these domain-specific characteristics, this study proposes SAFE-Transformer, a Structure-Aware and Format-Enhanced Transformer designed for long-document modeling in the emergency safety context. SAFE-Transformer adopts a dual-stream encoding architecture to separately model symbolic section features and heading text, integrates hierarchical depth and format types into positional encodings, and introduces a dynamic gating unit to adaptively fuse headings with paragraph semantics. We evaluate the model on a multi-label accident intelligence classification task using a real-world corpus of 1632 official reports from high-risk industries. Results demonstrate that SAFE-Transformer effectively captures hierarchical semantic structure and outperforms strong long-text baselines. Further analysis reveals an inverted U-shaped performance trend across varying report lengths and highlights the role of attention sparsity and label distribution in long-text modeling. This work offers a practical solution for structurally complex safety documents and provides methodological insights for downstream applications in safety supervision and risk analysis.

1. Introduction

Accident investigation reports serve as critical vehicles for analyzing accident causation mechanisms, understanding risk evolution processes, and formulating effective prevention strategies. Systematic analyses of these reports significantly contribute to the identification of safety risks and optimization of accident prevention management systems [1,2]. Traditional methods for modeling accident reports primarily rely on rule-based templates or classic machine learning algorithms, such as decision trees and support vector machines, for shallow feature extraction [3,4,5,6,7,8]. However, these methods are constrained by high modeling costs and limited semantic understanding capabilities.
With advancements in Transformer architectures [9], deep learning approaches utilizing pretrained language models like BERT [10] have become mainstream in accident text analysis, demonstrating substantial advantages in tasks such as accident classification and risk entity recognition [11,12,13,14,15]. Nevertheless, existing studies largely concentrate on sentence-level or short-text scenarios [16,17,18], leaving effective solutions for more realistic document-level modeling relatively unexplored.
Document-level analysis enables context-aware identification of semantic associations across paragraphs and comprehensive causal chains. Such an approach precisely identifies dispersed accident risks, incident causes, contributing factors, and their interrelationships across various document sections [15,19]. Utilizing global context, this method extracts detailed accident intelligence—including event sequences, human factors, technical elements, environmental influences, and management factors [20,21]. Consequently, this analysis assists safety management authorities in building complete chains of risks, events, and evidence, uncovering hidden patterns from extensive historical accident cases. It provides a scientific basis for explaining accident causation mechanisms, studying risk evolution processes, and formulating targeted accident prevention strategies.
Accident investigation reports, typically drafted by domain experts based on on-site inspections and technical analyses, are characterized by extensive length, complex structure, and dense information. While the compilation of these reports strictly adheres to the “four-part” logical framework established by the “Regulations on Production Safety Accident Reporting and Investigation Handling” (State Council Order No. 493)—specifically, (1) accident occurrence and emergency response, (2) direct/indirect cause analysis, (3) responsibility attribution, and (4) recommended corrective measures—in practice, significant variations exist among different industries, regions, and accident types. Nonetheless, most reports consistently include sections on accident profile, event sequences, emergency response, causal analyses, responsibility attribution, and preventive measures. Similar frameworks exist in many other countries. For instance, the U.S. Occupational Safety and Health Administration (OSHA) has a set of guidelines for reporting workplace accidents, which, while differing in structure, also focus on key aspects such as causation analysis, responsible parties, and corrective actions [22]. Furthermore, the European Statistics on Accidents at Work (ESAW) methodology (Regulation n.349/2011) guides accident data practices in Europe, providing a standardized framework for occupational accident reporting [23]. Public datasets under this framework typically use alpha-numeric coding to protect privacy. In addition, the international standard ISO 45001 has significantly improved the utilization of accident datasets within occupational health and safety management systems globally, emphasizing systematic data-driven safety improvement and risk management [24,25]. These international practices underscore the broader applicability and relevance of accident data modeling research.
Modeling accident investigation reports faces three primary challenges. Firstly, report lengths frequently surpass the maximum input length constraints of pre-trained language models, prompting researchers to utilize truncation or chunking strategies [17,26,27], potentially resulting in the loss of critical semantic elements such as operational conditions, human factors, and risk factors. Secondly, traditional hierarchical models such as HAN [28], Hi-Transformer [29] and ERNIE-SPARSE [30] typically rely on implicit hierarchical structures (“word → sentence → document”), limiting their capacity to explicitly capture the unique chapter-level relationships characteristic of accident reports, thereby failing to provide sufficient domain-specific semantic anchors necessary for global semantic modeling. Lastly, downstream tasks involving accident risk evolution and causal reasoning often span multiple paragraphs, and current general-purpose sparse attention mechanisms such as BigBird [31] and Longformer [32] lack domain-specific structural priors. This limitation leads to semantic drift when balancing long-range dependency capture with computational efficiency, potentially omitting critical details and adversely impacting performance on downstream tasks.
Notably, the inherent hierarchical directory structure in accident investigation reports, such as chapter numbering like “5.1.2” and heading texts, provides natural semantic anchors that facilitate structured navigation for global information retrieval and relationship mining [33]; yet, this feature has not been fully utilized. Accordingly, the aim of this study is to develop a Transformer-based model that explicitly incorporates section numbers and heading semantics as structural priors, thereby capturing the hierarchical and semantic structures intrinsic to accident investigation reports and enhancing the extraction of meaningful accident intelligence from long and complex documents. Toward this end, we propose a novel Structure-Aware and Format-Enhanced Transformer (SAFE-Transformer). This model employs a dual-stream encoding architecture that decouples symbolic features of the hierarchical structure and heading text. The structural stream explicitly models the symbolic features of chapter numbering, while the semantic stream extracts heading textual features. Both streams dynamically and adaptively integrate with paragraph content through a gating mechanism. Unlike existing hierarchical models, SAFE-Transformer enhances semantic associations through symbolic-aware encoding of chapter numbering and utilizes hierarchical structural constraints to guide sparse attention computations, effectively reducing complexity while preserving essential structural information. To validate SAFE-Transformer, we use a multi-label classification task for accident intelligence, aiming to assign multiple safety intelligence labels, such as “unsafe human behaviors”, “unsafe material conditions”, and “management deficiencies”, to accident reports. Experimental results demonstrate that the structural anchors constructed via symbolic-aware encoding, combined with semantic streams from headings and the dynamic gating fusion mechanism, effectively achieve adaptive integration of global and local contextual semantics, significantly enhancing multi-label classification performance. The principal objectives and contributions of this work are threefold:
(1)
Proposing SAFE-Transformer as the first model to inject section numbers and heading semantics as explicit structural priors into a Transformer;
(2)
Validating the performance of SAFE-Transformer using a national-level accident dataset to demonstrate its practical effectiveness;
(3)
Showing how SAFE-Transformer alleviates key-information loss, context fragmentation, and long-tail effects through evaluation of a multi-label accident-intelligence benchmark.
The remainder of the paper is structured as follows: Section 2 reviews related research; Section 3 presents an overview of the overall methodology; Section 4 details the proposed SAFE-Transformer model; Section 5 outlines the experimental design; Section 6 analyzes experimental results; Section 7 discusses methodological limitations and future improvements; and Section 8 summarizes the research conclusions.

2. Literature Review

2.1. Accident Investigation Report Text Analysis

The systematic analysis of accident data is historically rooted in the fundamental logic of “learning from mistakes”, a philosophy underpinning safety management improvements for decades [34,35]. Early studies primarily relied on statistical methods and simple data categorization techniques to identify recurring accident patterns and causal factors [36,37,38]. However, recent advances in Artificial Intelligence (AI)—a broad suite of computational techniques designed to perform tasks typically requiring human intelligence, such as learning from data, recognizing patterns, and making reasoned decisions [39]—have dramatically enhanced the capacity to analyze accident data by automatically extracting complex textual patterns and semantic relationships from reports. Specifically, techniques like machine learning and deep learning can extract insights from large datasets, thereby automating tasks and enabling predictive analytics that can improve organizational efficiency in occupational health and safety (OHS) [40].
This technological shift has particularly revolutionized accident report analysis, a cornerstone of safety science research. Early AI-driven approaches in this domain predominantly leveraged traditional machine learning algorithms [41,42]. For instance, Goh and Ubeynarayana [41] systematically evaluated six machine learning algorithms—Support Vector Machines (SVM), Logistic Regression (LR), Random Forest (RF), K-Nearest Neighbors (KNN), Decision Trees (DT), and Naive Bayes (NB)—using 1000 construction accident reports from the U.S. Occupational Safety and Health Administration (OSHA). Similarly, Zhang et al. [42] compared models such as SVM, LR, and KNN for construction accident reports, employing Sequential Quadratic Programming (SQP) to dynamically adjust classifier weights. Comberti et al. [43] developed a Self-Organizing Map (SOM) coupled K-Means (SKM) approach to cluster over 4000 occupational injury records from Italy’s Piedmont region. This method identified distinct accident patterns, revealed underlying causal mechanisms, and quantified both the severity and frequency of each category. Lombardi et al. [23,44] applied unsupervised algorithms to ESAW-based data, identifying specific high-risk operations and environmental factors in construction sites and landfill operations.
Machine learning methods typically rely heavily on manually engineered features, limiting their capability to capture deep semantic associations. With the rapid advancement of deep learning technologies, neural network-based models have gradually been introduced into accident analysis [45,46,47]. Li and Wu [45] developed a convolutional neural network (CNN)-based text classification model that extracts local semantic features via convolutional layers, outperforming traditional algorithms in classifying construction accidents. Similarly, another study utilized CNN and Bidirectional Long Short-Term Memory networks (Bi-LSTM) for classifying construction accidents [46]. Qiao et al. [47] compared ten shallow learning methods and five deep learning methods using 4770 OSHA accident reports, finding that SVM and CNN performed best among shallow and deep models, respectively, with CNN showing superior feature-learning capabilities. Luo et al. [11] proposed a multi-layer convolutional network integrating CNN and Word2Vec embeddings for advanced analysis of complex textual patterns in accident-type classification. Liu and Yang [12] combined Hidden Markov Models, Conditional Random Fields (CRF), and Bi-LSTM to build a named entity recognition model, integrating Random Forest algorithms and knowledge graph techniques for visualizing and quantifying risk associations in railway accidents.
Recently, the emergence of pretrained large language models (LLMs) has transformed accident analysis paradigms. Fang et al. [13] applied BERT to classify near-miss construction incidents. Ray et al. [14] utilized LLMs to automate identification of construction equipment failures by classifying accident narratives and extracting failure details. Ahmadi et al. [15] employed advanced LLMs, including GPT-4.0, Gemini Pro, and LLaMA 3.1, alongside zero-shot learning and customized prompt engineering, to extract critical accident attributes such as root causes, injury causes, affected body parts, severity, and time from construction reports. However, Chalkidis [48] evaluated the ChatGPT (gpt-3.5-turbo) model on the LexGLUE legal text classification benchmark, revealing an average zero-shot micro-F1 of 49.0%, considerably lower than fine-tuned models (78.9%). Although LLMs possess extensive world knowledge, their training objective—predicting the next token—fundamentally differs from discriminative text-classification objectives [48,49,50]. Studies indicate that LLMs underperform compared to smaller fine-tuned task-specific models (e.g., BART [51], LegalBERT [52]) in complex tasks involving multi-label and long-text classifications.
In summary, significant attention has been devoted to accident report analysis. Numerous studies have extensively explored machine learning, deep neural networks, and pretrained language models within emergency safety domains, employing text classification, named entity recognition, and structured information extraction techniques to enhance accident-type classification, causation analysis, and risk assessments. Despite substantial progress, limitations persist. Existing research primarily focuses on sentence-level descriptions or short reports, neglecting comprehensive document-level analysis. Furthermore, when applying pretrained Transformer models for contextual representation, studies largely emphasize optimizing classification heads for specific tasks like accident type, cause classification, and named entity recognition. Strategies like truncation, chunking, or sampling employed to handle texts exceeding model input limits often lead to key semantic fragment loss and global context fragmentation, negatively impacting downstream task performance.

2.2. Long Text Modeling Based on Transformer Architecture

The traditional Transformer employs a self-attention mechanism, calculating associations between each token and all other tokens in a sequence, causing a notable decrease in inference speed and performance when processing lengthy texts [53,54]. Especially in specialized domains with lengthy documents, information decay weakens semantic associations between distant tokens, often leading to neglected crucial long-range dependencies [54]. Current long-text modeling methods primarily include sparse attention mechanisms, hierarchical positional encodings, and chunk-based training strategies [32]. Balancing computational efficiency and semantic integrity remains the core challenge in long-text modeling.
Early exploration of sparse attention strategies led to Child et al. [55] proposing the Sparse Transformer, which systematically reconstructed the dense self-attention matrix of conventional Transformers into sparse operations of lower computational complexity through predefined sparse patterns and sparse factorization frameworks (from O(n²) to O(n√n)). However, this fixed sparsity pattern can limit flexible modeling of multi-granularity contexts. Subsequently, Beltagy et al. [32] introduced the Longformer, which incorporates Dilated Sliding Window and Global Attention mechanisms to maintain linear computational complexity while supporting modeling sequences up to 32K characters (for language modeling tasks) or over 4K tokens (for downstream tasks). The sliding window in Longformer expands its receptive field through dilated sampling, capturing distant dependencies, while global attention highlights task-critical information alongside local context preservation.
Ainslie et al. [56] proposed the ETC (Extended Transformer Construction) model, dividing inputs into global sequences (such as paragraph summaries) and long sequences (detailed text) through a global–local attention mechanism. It restricts attention within local windows (e.g., 84 tokens), reducing computational complexity from quadratic to linear. Additionally, ETC introduced relative positional encodings, enhancing dynamic positional relationship capturing and explicitly encoding structured input, exhibiting strong performance with long texts and structured data. However, ETC faced efficiency issues and limitations in capturing local contexts for extremely long sequences. Addressing these shortcomings, Zaheer et al. [31] introduced the BigBird model, extending ETC with a comprehensive sparse attention scheme, combining global tokens for global context, local windows for neighboring dependencies, and random attention for enhanced long-range information interactions, effectively reducing complexity to linear while preserving full attention representation capabilities.
Contrasting sparse attention methods, hierarchical approaches adopt hierarchical iterative modeling architectures to aggregate semantics from local to global contexts, achieving multi-granularity representations [28,53,57]. Yang et al. [28] proposed the Hierarchical Attention Network (HAN) for document classification, exploiting the natural hierarchical structure of documents (word → sentence → document). HAN leverages bidirectional GRUs to encode contextual information layer by layer, dynamically highlighting important content through attention mechanisms. Despite integrating implicit document structures, HAN suffers from one-way information flow and limited parallelization due to sequential dependencies inherent in RNNs, impairing efficiency in processing long documents.
To overcome these limitations, Wu et al. [29] developed the Hi-Transformer model, employing a bidirectional context information transmission mechanism. Initially, sentence-level Transformers independently encode local semantics. Subsequently, document-level Transformers compute global self-attention across sentences, integrating document themes and logic into new sentence representations. Finally, a feedback mechanism propagates global context information back to word-level representations, forming hierarchical embeddings rich in local details and global context. This hierarchical decoupling and cross-layer attention significantly reduce computational complexity and enhance semantic awareness.
Recent studies have combined these paradigms. ERNIE-SPARSE [30] incorporates Hierarchical Sparse Transformer (HST) and Self-Attention Regularization (SAR) for long sequence modeling. It partitions inputs into chunks, each processed with sparse attention, aggregates chunk information via representative tokens (e.g., [CLS]), and performs global self-attention on these tokens. Updated tokens propagate global context back to the original sequences through residual connections, addressing information bottlenecks and topological inconsistencies.
He et al. [58] recently introduced the Hierarchical Document Transformer (HDT), a Transformer architecture explicitly designed for hierarchical long documents. HDT employs anchor tokens (e.g., [SENT], [SEC], [DOC]) to represent document hierarchies explicitly. A hierarchical sparse attention mechanism restricts interactions within levels or between adjacent hierarchical levels, enabling multilevel context aggregation. Hierarchical positional encodings further enhance sensitivity to structural relationships, and customized kernels implemented via Triton reduce computational complexity from O(n²) to O(n·s), where s is the longest sentence length.
Despite considerable advancements, existing methods still face limitations when modeling lengthy documents in specialized domains. Hierarchical Transformer models (HAN [28], Hi-Transformer [29], ERNIE-SPARSE [30], HDT [58]) predominantly utilize implicit “word → sentence → document” hierarchies, suitable for general text analysis but insufficient for professionally structured documents, such as accident investigation reports. Explicit hierarchical structures (e.g., chapter numbering “5.1.2” and titles) explicitly reflect logical content organization by authors [59,60], providing semantic anchors rich in domain knowledge.
Studies indicate structural awareness through implicit learning (Structural Grokking) may require extensive training data and prolonged training periods, limiting practical efficiency [61]. Conversely, explicit structure-aware mechanisms significantly enhance downstream task performances like summarization [62,63,64]. Explicit hierarchical directory structures inherent in structured, specialized technical documents like accident reports facilitate semantic understanding, global navigation, and precise information retrieval, positioning, and semantic mining [33,62].
Therefore, explicitly modeling and integrating these inherent hierarchical structures into document representation learning potentially enhances performance in tasks such as accident intelligence classification and document summarization, advancing automated analytical methods within specialized domains.

3. Overview of Methodology

This research follows a systematic methodology comprising four interconnected phases to achieve the objectives of developing, validating, and analyzing the proposed SAFE-Transformer model. The overall workflow is illustrated in Figure 1.
In the model development phase, we design a dual-stream encoding architecture to explicitly distinguish between symbolic features of section numbering and heading texts in accident reports. Through a structure-driven sparse attention mechanism, hierarchical format-aware positional encoding, and a dynamic gating unit, the model achieves adaptive fusion of global structural information and local semantic context.
In the model validation phase, we construct a specialized multi-label classification task for accident intelligence, leveraging 1632 real-world accident investigation reports at the national level across five high-risk industries: energy, construction, transportation, petrochemical, and industrial trade. This phase aims to validate the effectiveness and applicability of the proposed model in realistic scenarios.
In the performance comparison phase, SAFE-Transformer is compared with mainstream long-text baseline models, including RoBERTa [65], Longformer [32], BigBird [31], and ERNIE-SPARSE [30]. Comprehensive evaluations are conducted to assess model superiority and robustness across varying document lengths and label distributions.
Finally, in the result discussion phase, we systematically analyze performance trends from three critical perspectives: document length, label distribution, and attention mechanism design. This phase elucidates the advantages of the proposed model in capturing semantic structures within long texts, while identifying existing performance limitations and potential directions for future improvement.

4. SAFE-Transformer Model

4.1. Model Architecture

To address the unique challenges associated with modeling long-form accident investigation reports, this study proposes a Structure-Aware and Format-Enhanced Transformer (SAFE-Transformer). The model consists of three main components: an input layer, a structure parsing layer, and an encoding layer. Its effectiveness is evaluated through a multi-label classification task focused on accident intelligence, as illustrated in Figure 2.
In the structure parsing layer, we construct a rule set (RE Set) covering 38 types of document outline structures based on the hierarchical characteristics commonly found in accident reports. This rule set is used to identify and extract hierarchical numbering and section headings from the documents. These hierarchical elements are then inserted at the beginning of the corresponding structural segments using auxiliary markers and are input into the encoding layer alongside their hierarchical position information.
Within the encoding layer, we design a structure-driven sparse attention mask mechanism that allows information exchange only between tokens at the same hierarchical level (i.e., sibling tokens) and between parent and child tokens. Additionally, we incorporate a gating mechanism to perform semantic fusion between hierarchical heading nodes and paragraph content. To evaluate the effectiveness of the proposed method, we append a simple multi-label text classifier to the output of the encoder and assess the performance of SAFE-Transformer on the accident intelligence classification task. Detailed descriptions of each model component are provided in Section 4.2, Section 4.3, Section 4.4, Section 4.5, Section 4.6 and Section 4.7.

4.2. Structural Parsing Rule Set

To achieve document hierarchical structure recognition and extraction, we design a rule set for structural directory identification based on regular expressions. The rule set encompasses six foundational categories capable of recognizing 38 hierarchical syntactic patterns, including Chinese ordinal indicators, Arabic numerals, Roman numerals, alphabetic numbering, special symbols, and an unknown pattern (UNK). A representative subset of these syntactic patterns is presented in Table 1.
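To make the rule set concrete, the following Python sketch shows how a subset of these syntactic patterns could be expressed as regular expressions; the category names and patterns are a simplified illustration of Table 1 rather than the full 38-pattern RE Set.

```python
import re

# Illustrative subset of the structural parsing rule set (not the full 38 patterns).
RE_SET = {
    "zh_ordinal": re.compile(r"^第[一二三四五六七八九十百]+[章节条]"),  # e.g., "第一章", "第三条"
    "zh_enum":    re.compile(r"^[一二三四五六七八九十]+、"),            # e.g., "一、"
    "arabic":     re.compile(r"^\d+(?:\.\d+)*[\.、)）]?"),              # e.g., "5.1.2", "1.", "1)"
    "paren_num":  re.compile(r"^[\(（]\d+[\)）]"),                      # e.g., "(1)", "（1）"
    "circled":    re.compile(r"^[\u2460-\u2473]"),                      # e.g., "①"–"⑳"
    "roman":      re.compile(r"^[IVXLC]+[\.、)]", re.IGNORECASE),       # e.g., "IV."
}

def match_heading(line: str):
    """Return (category, matched numbering symbol) for a line, or ("UNK", None)."""
    line = line.strip()
    for name, pattern in RE_SET.items():
        match = pattern.match(line)
        if match:
            return name, match.group(0)
    return "UNK", None

print(match_heading("5.1.2 事故直接原因"))  # ('arabic', '5.1.2')
```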

4.3. Structural Anchor Markers

Following the approaches of [26,27,58], we insert special anchor markers at the starting positions of hierarchical units. In addition to markers describing document, section, and sentence boundaries ([DOC], [SEC], [SENT]), we introduce markers for hierarchical structural formats and titles ([LEVEL], [TITLE]). Our method supports dynamic hierarchies by maintaining a hierarchical stack S during anchor marker insertion to ensure correct nesting order:
  • The hierarchy stack S is initialized as empty and reset upon processing new documents;
  • If the current top element satisfies $S_{\mathrm{top}} \geq l$, the stack continuously pops elements until $S_{\mathrm{top}} < l$, ensuring legal hierarchical nesting (e.g., enforcing sequential patterns like “1. → (1) → ① → 1)”);
  • The current level l is then pushed onto S, with its parent relationship recorded as $\mathrm{Parent}(l) = S_{\mathrm{top}}$.
The hierarchy stack S guarantees that each child level’s parent node is its nearest enclosing higher-level node, thereby maintaining correct parent–child relationships and preventing illegal cross-level skips (e.g., “1. → ①”).
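To make the stack maintenance concrete, the following Python sketch reproduces the pop-and-push procedure described above; the function and variable names are illustrative and not taken from a released implementation.

```python
# Minimal sketch of the hierarchy stack S used during anchor-marker insertion
# (illustrative only; names are not from the released implementation).
def assign_parents(levels):
    """Given the depth l of each detected heading in document order, return the
    parent depth of each heading while enforcing legal hierarchical nesting."""
    stack, parents = [], []
    for l in levels:
        # Pop until the stack top is strictly shallower than l, preventing
        # illegal cross-level skips such as "1. -> ①".
        while stack and stack[-1] >= l:
            stack.pop()
        parents.append(stack[-1] if stack else None)  # Parent(l) = S_top (None = document root)
        stack.append(l)
    return parents

# Example: depths for the heading sequence "1." -> "(1)" -> "①" -> "(2)" -> "2."
print(assign_parents([1, 2, 3, 2, 1]))  # [None, 1, 2, 1, None]
```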
Additionally, we dynamically assign [LEVEL] and [TITLE] markers to the hierarchical symbols and titles recognized and extracted through regular expressions. Specifically, the [LEVEL] tag is instantiated as $L_l^{v_l}$, where l denotes the hierarchical depth (e.g., l = 4 corresponds to a fourth-level heading) and $v_l$ represents the normalized numbering format at level l (e.g., “1)” maps to $v_l = \mathrm{ar\_rbr}$, indicating an Arabic numeral with a right bracket). The [TITLE] tag directly encapsulates the captured heading text $\mathrm{title}_l$, extracted by the regular expressions.
The final output of the hierarchical parsing module is a document sequence annotated with anchor markers and hierarchical titles:
$$D = [\mathrm{DOC}] \oplus \bigl\{\, L_l^{v_l} \oplus [\mathrm{title}_l] \oplus t_{\mathrm{content}} \;\big|\; t_i \in D \,\bigr\}$$ (1)
where $\oplus$ denotes the concatenation of anchor markers with the original text, and $t_{\mathrm{content}}$ represents the body paragraphs subordinate to the heading, composed of [SEC] and [SENT] markers.

4.4. Hierarchical Format and Positional Encoding

To support the mixed nesting of heterogeneous symbolic formats, we propose a Format-Aware Hierarchical Positional Encoding (HPE) method. Specifically, we fuse the hierarchical symbol formats extracted by the hierarchy parsing module with positional encodings through feature transformations, enabling the model to distinguish hierarchical semantics represented by different symbolic forms (e.g., “1.”, “①”). This involves two key steps:
  • Step 1: Symbolic Format Embedding Definition and Fusion
For each hierarchical level $l \in \{1, 2, \ldots, L\}$, we define a corresponding symbol-type embedding $E_{\mathrm{sym}}^{l} \in \mathbb{R}^{d_e}$ based on its symbolic format, where $d_e$ denotes the embedding dimension. For example, in the hierarchical symbol system “1. → (1) → ① → 1)”:
The Level 1 symbol format “1.” is encoded as $E_{\mathrm{sym}}^{1}$, capturing the feature of “Arabic numeral with a period”;
The Level 2 symbol format “(1)” is encoded as $E_{\mathrm{sym}}^{2}$, representing “parenthesized numeral”;
Similarly, $E_{\mathrm{sym}}^{3}$ and $E_{\mathrm{sym}}^{4}$ encode the features of “circled numeral” and “numeral with half-bracket”, respectively.
These embeddings are initialized via a learnable parameter matrix and jointly optimized with other model parameters during training, allowing the model to capture the correlation between symbolic formats and hierarchical semantics.
  • Step 2: Symbolic Format-Enhanced Hierarchical Positional Encoding
To enable the model to handle complex hierarchical structures, we fuse the symbol-type embedding $E_{\mathrm{sym}}^{l}$ with the encoded hierarchical position index $p_l$. The symbol-aware positional encoding is defined as:
$$\mathrm{HPE}(i) = \sum_{l=1}^{L} \begin{cases} \sin\bigl(\omega_k (W_{\mathrm{pos}}\, p_l + W_{\mathrm{sym}}\, E_{\mathrm{sym}}^{l})\bigr), & \text{if } i = 2k \\ \cos\bigl(\omega_k (W_{\mathrm{pos}}\, p_l + W_{\mathrm{sym}}\, E_{\mathrm{sym}}^{l})\bigr), & \text{if } i = 2k + 1 \end{cases}$$ (2)
where L denotes the document hierarchy depth, $p_l$ represents the hierarchical position index, $E_{\mathrm{sym}}^{l} \in \mathbb{R}^{d_e}$ is the symbol-type embedding vector, and $d_e$ is the symbol embedding dimension; $\omega_k$ serves as the frequency modulation factor; $W_{\mathrm{pos}} \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{model}}}$ and $W_{\mathrm{sym}} \in \mathbb{R}^{d_{\mathrm{model}} \times d_e}$ are learnable linear projection matrices designed to map features from distinct semantic spaces into a unified vector space.
To prevent interference between positional information and symbolic semantics, the orthogonal regularization constraint $\lVert W_{\mathrm{pos}}^{\top} W_{\mathrm{pos}} - I \rVert_F^2 + \lVert W_{\mathrm{sym}}^{\top} W_{\mathrm{sym}} - I \rVert_F^2$ is applied to $W_{\mathrm{pos}}$ and $W_{\mathrm{sym}}$ during training. $\omega_k = 1/10000^{2k/d_{\mathrm{model}}}$ acts as the frequency modulation factor, and $d_{\mathrm{model}}$ denotes the positional encoding dimension. When the hierarchy depth L is large, the hierarchical position index is normalized as $p_l = l/L$ to constrain it within the [0, 1] range, mitigating numerical instability caused by vanishing or exploding gradients in high-frequency dimensions due to deep hierarchies.
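As an illustration, the PyTorch sketch below implements a simplified version of Equation (2); the module name, the treatment of $p_l$ as a scalar index, and the omission of the orthogonal regularization term are simplifying assumptions of ours.

```python
import torch
import torch.nn as nn

class FormatAwareHPE(nn.Module):
    """Simplified sketch of the format-aware hierarchical positional encoding (Eq. (2))."""
    def __init__(self, d_model=768, d_sym=64, num_formats=38):
        super().__init__()
        self.sym_emb = nn.Embedding(num_formats, d_sym)      # symbol-type embeddings E_sym^l
        self.w_pos = nn.Linear(1, d_model, bias=False)       # W_pos, applied to the scalar index p_l
        self.w_sym = nn.Linear(d_sym, d_model, bias=False)   # W_sym
        k = torch.arange(d_model) // 2
        self.register_buffer("omega", (1.0 / 10000 ** (2 * k / d_model)).float())  # ω_k

    def forward(self, path):
        """path: list of (level l, format_id v_l) pairs describing one token's hierarchy chain."""
        hpe = torch.zeros_like(self.omega)
        depth = max(len(path), 1)
        for l, fmt_id in path:
            p_l = torch.tensor([[l / depth]])                # normalized hierarchical position index
            fused = self.w_pos(p_l).squeeze(0) + self.w_sym(self.sym_emb(torch.tensor(fmt_id)))
            angle = self.omega * fused
            # even encoding dimensions use sin, odd dimensions use cos
            hpe = hpe + torch.where(torch.arange(angle.numel()) % 2 == 0,
                                    torch.sin(angle), torch.cos(angle))
        return hpe

# Example: a token nested under a three-level heading chain, all Arabic-numeral formats
hpe = FormatAwareHPE()
print(hpe([(1, 2), (2, 2), (3, 2)]).shape)  # torch.Size([768])
```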

4.5. Attention Mask

We propose a sparse attention mask generation method based on dynamic hierarchical structures, which constrains cross-level attention interactions to reduce computational complexity, as illustrated in Figure 3. The attention mask $M \in \{0, 1\}^{n \times n}$ is constructed according to the following rules:
Intra-level interaction mask: For any two tokens i and j within hierarchy level l, attention is allowed if and only if they share the same parent node across all higher levels. The mask is defined as:
$$M_l[i, j] = 1 \;\Longleftrightarrow\; \forall\, l' < l:\; p_i^{l'} = p_j^{l'}$$ (3)
where $p_i^{l}$ denotes the index of token i at level l. This condition ensures that sibling nodes can only attend to each other if they share the same structural parent.
Inter-level parent–child interaction mask: For adjacent hierarchy levels l and l + 1, bidirectional attention between parent and child nodes is defined as:
$$M_{l \to l+1}[i, j] = 1 \;\Longleftrightarrow\; \mathrm{Parent}\bigl(p_i^{l+1}\bigr) = p_j^{l} \;\;(\text{child node } i \to \text{parent node } j) \;\lor\; \mathrm{Parent}\bigl(p_j^{l+1}\bigr) = p_i^{l} \;\;(\text{parent node } i \to \text{child node } j)$$ (4)
where $\mathrm{Parent}(p_i^{l+1})$ denotes the parent index of token i at level l + 1, maintained via the hierarchical stack S described in Section 4.3.
Global mask fusion: Local masks across all levels are merged using element-wise logical OR to form the final global sparse attention mask:
$$M = \bigvee_{l=1}^{L-1} \bigl( M_l \vee M_{l \to l+1} \bigr)$$ (5)
Here, L is the maximum depth of the hierarchy, and $\vee$ denotes element-wise logical disjunction. This design reduces the attention complexity from O(n²) in the fully connected case to O(n log d), where d is the average subtree width, while preserving essential structural information flow.
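For illustration, the sketch below builds such a mask from explicit hierarchical paths; it constructs a dense Boolean matrix for readability, whereas the model generates the mask dynamically in sparse form, and the tuple-based path representation is an assumption made for this example.

```python
import torch

def build_hier_mask(paths):
    """Simplified sketch of the structure-driven sparse attention mask.
    Each entry of `paths` is a token's hierarchical path as a tuple of node
    indices, e.g. (5,) for section "5", (5, 1) for "5.1", (5, 1, 2) for "5.1.2".
    Attention is allowed between siblings (same depth, same parent path) and
    between direct parent/child nodes; all other pairs are masked out."""
    n = len(paths)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i, pi in enumerate(paths):
        for j, pj in enumerate(paths):
            sibling = len(pi) == len(pj) and pi[:-1] == pj[:-1]           # intra-level rule (Eq. 3)
            parent_child = (len(pj) == len(pi) + 1 and pj[:-1] == pi) or \
                           (len(pi) == len(pj) + 1 and pi[:-1] == pj)      # parent-child rule (Eq. 4)
            mask[i, j] = sibling or parent_child
    return mask

# Example: nodes "5", "5.1", "5.1.2", "5.2", "6"
print(build_hier_mask([(5,), (5, 1), (5, 1, 2), (5, 2), (6,)]).int())
```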

4.6. Encoder

The hierarchical encoder comprises a stack of N improved SAFE blocks, each consisting of three core sublayers: a hierarchical multi-head self-attention sublayer, a gated semantic fusion sublayer, and a feedforward sublayer.
Input preprocessing: The encoder integrates word embeddings with the Format-Aware Hierarchical Positional Encoding (HPE, Section 4.4), which encodes both hierarchical position indices and symbolic format types (Section 4.2) to provide structured positional priors for each token.
Hierarchical multi-head self-attention sublayer: Using the dynamically generated sparse attention mask from Section 4.5, this sublayer computes interaction weights between tokens, allowing information to flow only among same-level sibling nodes and direct parent–child pairs. For instance, the attention mask for the heading “5.1 Direct Cause” permits access to its parent node “5. Accident Causes” (level 1), the heading itself (level 2), and its associated paragraph content (level 3).
Gated semantic fusion sublayer: Inspired by the document context feedback mechanism proposed in [29], we implement a three-way semantic fusion via dynamic gating. Using the hierarchical index $p_{id}$, we retrieve from the encoder cache the parent node representation $h_{\mathrm{parent}}$, the current heading representation $h_{\mathrm{title}}$, and the weighted-pooled content representation $h_{\mathrm{content}}$. The fusion is controlled by a gate vector $g = (g_p, g_t, g_c) \in [0, 1]^3$, computed via a sigmoid activation:
$$g = \sigma\bigl( W_g \,[\, h_{\mathrm{parent}};\, h_{\mathrm{title}};\, h_{\mathrm{content}} \,] + b_g \bigr)$$ (6)
Here, $W_g \in \mathbb{R}^{3 d_h \times 3}$ is a learnable parameter matrix and $b_g$ the corresponding bias.
The final fused representation is computed as:
$$h_{\mathrm{fused}} = g_p \cdot h_{\mathrm{parent}} + g_t \cdot h_{\mathrm{title}} + g_c \cdot h_{\mathrm{content}}$$ (7)
This mechanism enables dynamic aggregation of hierarchical semantics based on learnable weights.
Feedforward sublayer: The fused representation is further refined through a two-layer feedforward network with ReLU activation:
$$h_{\mathrm{fused}} \leftarrow \mathrm{FFN}(h_{\mathrm{fused}}) = W_2\, \mathrm{ReLU}(W_1 h_{\mathrm{fused}} + b_1) + b_2$$ (8)
Residual connections and layer normalization are applied throughout to ensure training stability and gradient flow.
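As an illustration, the gated fusion defined in Equations (6) and (7) can be sketched in PyTorch as follows; the batching conventions and module name are our own choices, not the released code.

```python
import torch
import torch.nn as nn

class GatedSemanticFusion(nn.Module):
    """Minimal sketch of the gated semantic fusion sublayer (Eqs. (6)-(7))."""
    def __init__(self, d_h=768):
        super().__init__()
        self.gate = nn.Linear(3 * d_h, 3)   # implements W_g and b_g

    def forward(self, h_parent, h_title, h_content):
        # g = sigmoid(W_g [h_parent; h_title; h_content] + b_g), with g in [0, 1]^3
        g = torch.sigmoid(self.gate(torch.cat([h_parent, h_title, h_content], dim=-1)))
        g_p, g_t, g_c = g.unbind(dim=-1)
        # h_fused = g_p * h_parent + g_t * h_title + g_c * h_content
        return (g_p.unsqueeze(-1) * h_parent
                + g_t.unsqueeze(-1) * h_title
                + g_c.unsqueeze(-1) * h_content)

# Example: a batch of 2 heading nodes with 768-dimensional representations
fusion = GatedSemanticFusion()
h_parent, h_title, h_content = (torch.randn(2, 768) for _ in range(3))
print(fusion(h_parent, h_title, h_content).shape)  # torch.Size([2, 768])
```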

4.7. Classifier

To validate the effectiveness of the proposed method, this section designs a simple multi-label classification head. Given an accident investigation report document $D = \{t_1, t_2, \ldots, t_n\}$, where each paragraph $t_i$ can simultaneously belong to multiple predefined semantic categories $C = \{c_1, \ldots, c_K\}$ (with K denoting the number of categories), the model predicts a binary label vector $y_i \in \{0, 1\}^{K}$.
For any paragraph $t_i$ in the document, its encoder output representation $h_{\mathrm{fused}}$ is computed via Equation (8). To adapt to the multi-label classification task, a learnable projection matrix maps $h_{\mathrm{fused}}$ to the category semantic space:
$$s_i = W_{\mathrm{cls}}\, h_i^{\mathrm{fused}} + b_{\mathrm{cls}}$$ (9)
where $s_i$ is the category score vector for the paragraph $t_i$, and $W_{\mathrm{cls}} \in \mathbb{R}^{8 \times d_{\mathrm{model}}}$ and $b_{\mathrm{cls}} \in \mathbb{R}^{8}$ are the learnable projection matrix and bias, respectively.
Considering the semantic correlations between hierarchical headings and their subordinate body text in accident reports, a hierarchically constrained classification decision function is designed:
$$p(c_k \mid t_i) = \sigma\bigl( s_{i,k} + \alpha \cdot \mathbb{I}_{\mathrm{title}}(t_i) \cdot \mathrm{Attn}(c_k, h_{\mathrm{parent}}) \bigr)$$ (10)
where $\sigma$ is the sigmoid function outputting category probabilities, $\mathbb{I}_{\mathrm{title}}$ is an indicator function (equal to 1 if $t_i$ is a hierarchical heading, otherwise 0), $\mathrm{Attn}$ computes the attention weight of the parent node representation $h_{\mathrm{parent}}$ with respect to category $c_k$, and $\alpha$ is a learnable hierarchical semantic reinforcement coefficient.
For hierarchical heading nodes, Equation (10) inherits contextual semantics through the parent node attention mechanism to mitigate semantic parasitism. By contrast, non-heading paragraphs are classified solely on their local semantics, which prevents cross-hierarchy noise propagation.
To optimize the multi-label classification objective, a binary cross-entropy loss function is defined:
$$\mathcal{L} = -\sum_{i=1}^{n} \sum_{k=1}^{K} \bigl[\, y_{i,k} \log p(c_k \mid t_i) + (1 - y_{i,k}) \log\bigl(1 - p(c_k \mid t_i)\bigr) \,\bigr]$$ (11)
where L is the binary cross-entropy loss term driving multi-label classification.
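The following PyTorch sketch illustrates such a classification head and the loss in Equation (11); the hierarchically constrained term of Equation (10) is reduced here to a scalar-weighted parent-score boost, so this is a simplified illustration rather than the exact formulation.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Simplified sketch of the multi-label classification head (Eqs. (9)-(11))."""
    def __init__(self, d_model=768, num_labels=8):
        super().__init__()
        self.proj = nn.Linear(d_model, num_labels)      # W_cls and b_cls
        self.alpha = nn.Parameter(torch.tensor(0.1))    # hierarchical reinforcement coefficient
        self.loss_fn = nn.BCEWithLogitsLoss()           # binary cross-entropy over labels

    def forward(self, h_fused, parent_scores=None, is_title=None, targets=None):
        s = self.proj(h_fused)                          # category score vector s_i
        if parent_scores is not None and is_title is not None:
            # Heading nodes inherit context from their parent node's scores.
            s = s + self.alpha * is_title.unsqueeze(-1) * parent_scores
        probs = torch.sigmoid(s)
        loss = self.loss_fn(s, targets.float()) if targets is not None else None
        return probs, loss

# Example: 4 paragraphs, 8 candidate labels
head = MultiLabelHead()
probs, loss = head(torch.randn(4, 768), targets=torch.randint(0, 2, (4, 8)))
print(probs.shape, loss.item())
```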

5. Experiments

5.1. Dataset

To evaluate the effectiveness of the proposed SAFE-Transformer model, we constructed a dedicated dataset for accident investigation report text mining, referred to as the Accident Investigation Corpus (AIC). The data sources span the Ministry of Emergency Management of China, various local emergency management bureaus, and regional government information disclosure platforms. A total of 1632 accident investigation reports issued between 2010 and 2024 were collected.
The dataset covers five high-risk industries: energy, construction and housing, transportation, petrochemical, and industrial trade. It includes diverse accident types such as falls from height, electric shock, mechanical injury, and struck-by-object incidents. After data cleansing, the original reports retain key sections including accident overview, incident process, cause analysis, evidence description, responsibility attribution, corrective measures, and legal basis. Specifically, data cleansing involved removing non-textual elements—including CSS formatting codes, XML markup tags, and page layout coordinates—while preserving only plain textual content and essential logical symbols such as parentheses, enumerations, and punctuation marks, ensuring textual integrity and semantic consistency.
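As a rough illustration of this cleansing step, the patterns below strip markup and layout residue while keeping plain text and punctuation; they are example rules, not the exact preprocessing pipeline used for the AIC corpus.

```python
import re

def clean_report(raw: str) -> str:
    """Rough sketch of the cleansing step (illustrative patterns only)."""
    text = re.sub(r"<[^>]+>", " ", raw)                              # XML/HTML markup tags
    text = re.sub(r"\{[^{}]*:[^{}]*\}", " ", text)                   # inline CSS rule blocks
    text = re.sub(r"\b\d+(?:\.\d+)?\s*(?:px|pt|em)\b", " ", text)    # page-layout sizes/coordinates
    return re.sub(r"\s+", " ", text).strip()                         # collapse whitespace
```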
On average, each report contains approximately 3042 Chinese characters and exhibits complex and varied document hierarchies. Heading formats span six structural types, including Chinese ordinals, Arabic numerals, and Roman numerals. The directory structures within the reports reflect significant variability across industries and regions. As such, the dataset is characterized by high textual density and intricate structural complexity.
Annotation was conducted collaboratively by three annotators with safety engineering expertise and two specialists in natural language processing. Based on practical needs in accident report text mining, we defined nine accident intelligence labels, as detailed in Table 2. These labels are distributed across various body paragraphs within the reports.
To align with the hierarchical structure of accident reports, the annotation team developed a set of regular expression rules $R = \{R_1, R_2, \ldots, R_L\}$ for automatically identifying and matching paragraph-level identifiers. Each rule $R_l$ corresponds to a hierarchical level l, detects the associated symbolic format $v_l$, converts it into a standardized encoding, and records the complete hierarchical tree path (e.g., “1.2 → 1.2.1 → (a)”).
Annotation was conducted on all 1632 reports. On average, each report contained 62 text-level labels and 24 structural nodes, yielding a total of 12,358 annotated documents. To ensure annotation reliability, inter-annotator agreement was assessed using Krippendorff’s α coefficient, which reached 0.89 (exceeding the commonly accepted threshold of 0.8), indicating strong annotation consistency. Figure 4 shows an example of the original report structure and its annotated counterpart.
The annotated dataset was randomly split using a 7:2:1 ratio into training (8650 documents), validation (2472 documents), and test (1236 documents) sets. Stratified sampling was employed to ensure consistent label distributions across subsets. For rare labels—such as “L6 Legal Provisions”, which account for only 0.8%—the validation and test sets were augmented to include at least 5% of such low-frequency samples to enhance evaluation of the model’s generalization capability.

5.2. Baseline Models

To evaluate the effectiveness of SAFE-Transformer in modeling long-form accident reports, we adopt a multi-label classification task focused on accident intelligence extraction. A typical multi-label text classification model consists of two components: an encoder, which transforms the input document into a vector representation, and a classifier, which predicts one or more target labels based on this representation [27]. Since the primary focus of this study is on the encoder design, we select the following models as baselines: RoBERTa [65], Longformer [32], BigBird [31], and ERNIE-SPARSE [30]. Configurations of these baseline models are summarized in Table 3. We intentionally exclude SOTA models that specifically optimize the classification head, as our emphasis is on evaluating encoding performance rather than classification architecture.

5.3. Parameter Settings

To ensure fairness and reproducibility, all baseline models follow a unified parameter configuration strategy, with adaptive adjustments made as necessary based on model-specific characteristics. Each baseline adopts the optimal settings recommended in its official open-source implementation to avoid evaluation bias due to parameter discrepancies.
For the SAFE-Transformer model, the encoder consists of 12 Transformer layers with a hidden size of 768 and 12 attention heads. The maximum depth of the dynamic hierarchical stack is set to 10 to accommodate the nesting levels observed in real-world documents. The dimensionality of symbolic-type embeddings is 64, and they are integrated with positional encodings under orthogonal regularization (λ = 0.05). In the format-aware positional encoding module, both symbolic-type and hierarchical index embeddings are set to 64 dimensions. The frequency scaling factor $\omega_k$ for positional encoding is dynamically computed as $\omega_k = 1/10000^{2k/d_{\mathrm{model}}}$. Sparse hierarchical attention masks are generated dynamically based on document structure. The maximum stack depth $L_{\max}$ is set to 6 to align with the AIC dataset annotation protocol. The training process uses linear learning rate warm-up (10% of total steps) and label smoothing with a factor of 0.1 to mitigate label imbalance. The weight matrix of the gated semantic fusion sublayer is initialized using Xavier uniform initialization, while biases are set to zero. The hierarchical fusion matrix $W_g \in \mathbb{R}^{3 d_h \times 3}$ is initialized orthogonally such that $\lVert W_g^{\top} W_g - I \rVert_F^2 = 0$. The hidden size of the gating network is set to 256. The classification layer is implemented as a two-layer fully connected network with a hidden size of 512 and ReLU activation.
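For reference, the hyperparameters listed above can be collected into a single configuration object, as in the sketch below; the field names are our own.

```python
from dataclasses import dataclass

@dataclass
class SAFEConfig:
    """SAFE-Transformer hyperparameters as reported in Section 5.3."""
    num_layers: int = 12            # Transformer layers
    hidden_size: int = 768
    num_heads: int = 12
    max_stack_depth: int = 10       # dynamic hierarchical stack
    sym_embed_dim: int = 64         # symbolic-type embeddings
    pos_index_dim: int = 64         # hierarchical index embeddings
    ortho_reg_lambda: float = 0.05  # orthogonal regularization on W_pos / W_sym
    max_annotation_depth: int = 6   # L_max, aligned with the AIC annotation protocol
    warmup_ratio: float = 0.1       # linear learning-rate warm-up
    label_smoothing: float = 0.1
    gate_hidden_size: int = 256
    cls_hidden_size: int = 512
```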

5.4. Model Fine-Tuning

To achieve effective adaptation to document-level hierarchical structure modeling, all baseline models and SAFE-Transformer are fine-tuned using a two-stage strategy.
In the first stage, global fine-tuning is performed using domain-general pre-trained weights. All models employ the AdamW [66] optimizer with an initial learning rate of 5 × 10⁻⁵, weight decay of 0.01, and a gradient clipping threshold of 1.0. The batch size is dynamically adjusted based on input length, with the number of tokens per batch capped at 512. Training proceeds for 5 epochs with early stopping based on validation loss; if no improvement is observed for 5 consecutive epochs, training halts. Dropout is set to 0.1 in attention modules and 0.2 in feedforward layers.
For SAFE-Transformer, we introduce hierarchy-sensitive staged fine-tuning:
  • Stage 1 (first 30% of training steps): Freeze the base Transformer layers and only train the symbolic-aware positional encoder and the gated fusion module. The learning rate is reduced to 5 × 10⁻⁶ to stabilize hierarchical feature extraction.
  • Stage 2 (remaining 70% of steps): Unfreeze all layers for joint optimization. The AdamW optimizer is re-initialized with a dynamic learning rate, decaying from a peak of 2 × 10⁻⁵ to 1 × 10⁻⁶. Gradient sparsity enhancement is applied to the attention mask generator to prevent local optima in hierarchical stack updates. A minimal sketch of this staged schedule follows.
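The sketch below illustrates the two-stage switch; the module attribute names (`encoder`, `hpe`, `gate_fusion`) are placeholders, and the Stage-2 learning-rate decay would be attached via an external scheduler.

```python
import torch

def configure_stage(model, step, total_steps):
    """Illustrative two-stage fine-tuning switch for SAFE-Transformer."""
    if step < 0.3 * total_steps:
        # Stage 1: freeze the base Transformer; train only the symbolic-aware
        # positional encoder and the gated fusion module at a reduced LR.
        for p in model.encoder.parameters():
            p.requires_grad = False
        params = list(model.hpe.parameters()) + list(model.gate_fusion.parameters())
        return torch.optim.AdamW(params, lr=5e-6, weight_decay=0.01)
    # Stage 2: unfreeze all layers for joint optimization; peak LR 2e-5
    # (decay to 1e-6 handled by a scheduler, omitted here).
    for p in model.parameters():
        p.requires_grad = True
    return torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
```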
A dynamic hybrid regularization strategy is employed: (1) DropPath (drop probability = 0.2) is applied to parent node embeddings. (2) Zoneout (probability = 0.15) is used at the output layer of the gating network to preserve hierarchical continuity. (3) Label smoothing (0.1) is combined with Focal Loss to mitigate long-tail class imbalance.
An early stopping mechanism (patience = 5) is used alongside model checkpoint ensembling, preserving the top-3 checkpoints with the highest validation F1 scores to enhance robustness.
All experiments are conducted on four NVIDIA A100 GPUs (NVIDIA, Santa Clara, CA, USA) using mixed precision training. The random seed is fixed at 42 to ensure result consistency. SAFE-Transformer is accelerated using PyTorch 2.0 and the FlashAttention-2 kernel, with FP16 gradient clipping set to 1.0. The checkpoint with the lowest validation loss is selected for final testing. Each experiment is repeated five times, and the average result is reported to minimize the effect of randomness. Total fine-tuning time is kept within 12 h on A100 GPUs.

6. Results

6.1. Overall Performance Analysis

Table 4 presents the overall performance of each model on the multi-label accident intelligence classification task.
The experimental results demonstrate that SAFE-Transformer achieved the best performance across all five metrics. Specifically, the micro-average AUC reached 73.45%, representing a 4.71 percentage point improvement over the traditional short-text model RoBERTa [65], and a 1.56 point gain over the best-performing long-text baseline, BigBird [31]. The simultaneous improvements in macro-average AUC and F1 (70.12% and 68.73%, respectively) indicate that SAFE-Transformer maintains superior performance on frequent labels while also significantly enhancing recognition of low-frequency labels. The micro-average F1 score reached 71.23%, improving upon Longformer [32] (66.52%) and BigBird [31] (70.18%) by 4.71 and 1.05 points, respectively—validating the effectiveness of the dual-stream encoding and dynamic gating mechanism in capturing global–local semantic interactions. Notably, SAFE-Transformer achieved the highest P@5 score of 76.33%, demonstrating its superior ability to accurately rank the top five predicted labels, which is critical for extracting actionable accident intelligence.
Under comparable parameter scales and training configurations across all models, the consistent and cross-metric gains confirm that the structure-aware and format-enhanced strategies bring stable and substantial performance improvements in modeling long-form accident reports. In contrast to long-text optimized models, RoBERTa [65] adopts a bidirectional self-attention mechanism and processes long texts through truncation, which limits its capacity to capture long-range dependencies. As a result, it recorded the lowest macro-F1 (62.15%) and micro-F1 (64.79%) scores among all baselines, highlighting the limitations of general-purpose pretrained models in handling domain-specific structured texts.
Among the long-text models, the performance differences among various sparse attention implementations were also notable, reflecting distinct sensitivities to task-specific indicators. BigBird [31] combines global attention (targeting special tokens like [CLS]), sliding window attention (capturing local context), and random attention (facilitating token interaction via random sampling). It outperformed other baselines in micro-AUC (71.89%) and macro-AUC (68.92%), indicating that global block sparse attention is well-suited for modeling long-range dependencies.
ERNIE-SPARSE [30], utilizing Hierarchical Sparse Transformers and Self-Attention Regularization (SAR), excelled in micro-F1, macro-F1, and P@5, highlighting its strong sensitivity to local semantics. In comparison, SAFE-Transformer’s proposed Struct-Driven Hierarchical Sparse Attention showed a unique advantage across metrics: by enforcing attention constraints via dynamic hierarchical masking, it significantly outperformed other sparse attention schemes in both P@5 (76.33%) and macro-F1 (68.82%). This indicates its superior ability to balance global dependency modeling with local semantic precision.
Further analysis revealed that SAFE-Transformer’s symbol-sensitive sparse attention strategy—focusing on structural cues such as “Article X” or list markers like “(a)”—led to a 4.2% increase in macro-F1 for low-frequency labels (e.g., L6: legal clauses), validating the critical role of structured domain-aware sparsity in handling long-tail label distributions. Longformer [32] showed moderate performance in our experiments, suggesting that merely expanding attention span is insufficient to accommodate the structural characteristics of vertical-domain texts like accident reports.
Table 5 compares the P@5 performance of different models across accident reports of varying lengths. The data reveal a non-linear relationship between classification performance and document length for the long-text optimized models. When the sequence length is L ≤ 512, RoBERTa [65] achieves the highest P@5 score of 73.13. However, as the sequence length increases to 4096, performance across all models declines, forming an inverted U-shaped trend.
It is noteworthy that long-text-optimized models, including SAFE-Transformer, underperform the standard self-attention model RoBERTa [65] in short-text scenarios. This finding aligns with the conclusions of Xiong et al. [67]. However, in medium-length documents, long-text models consistently exhibit stronger performance. For instance, when the sequence length ranges from 1K to 2K tokens, models such as SAFE-Transformer, ERNIE-SPARSE [30], Longformer [32], and BigBird [31] all maintain P@5 scores above 70%, indicating strong robustness.
We further examined the distribution of Micro/Macro AUC and F1 scores across different document lengths, as visualized in Figure 5. The inverted U-shaped performance fluctuation is consistently observed among all long-text-optimized models. To explore the underlying causes of this phenomenon, we conducted a detailed analysis from the perspectives of label distribution and attention mechanism design.
Label Distribution Analysis: We analyzed the distribution of data labels in the test set across different document lengths, as shown in Figure 6. A pronounced long-tail distribution is observed in shorter documents, where high-frequency labels such as accident overview, emergency response, and liability attribution account for over 95% of the label occurrences (see Figure 6a). Further investigation into the sources of these short documents revealed that most of them originate from government information portals. These public-facing websites are often constrained in length and typically disclose only basic accident details, including the time, location, type, severity, casualty figures, emergency response process, and a brief statement on responsibility. Such texts tend to adopt a journalistic reporting style similar to that of open-domain news articles.
In medium-length reports (1K–2K words), the long-tail effect remains evident but is noticeably mitigated (see Figure 6b,c). Labels describing causes of accidents—such as unsafe behaviors, hazardous environments, and organizational management deficiencies—appear with greater frequency. These documents generally cover a complete narrative chain: “accident overview → cause analysis → responsibility determination → corrective actions”, and are rich in informational content. Statistical analysis shows that 86% of these reports are sourced from accident reporting and investigation systems maintained by local emergency management departments.
In long-form reports (3K–4K words), the frequency of certain labels breaks the long-tail constraint, forming local peaks—for instance, label L6 (legal clauses). Further analysis reveals that such extended reports usually correspond to major accidents characterized by complex contributing factors, multiple violations, and challenges in assigning liability. These documents often contain repeated references to incident descriptions, unlawful conduct, and legal provisions, resulting in dense cross-references and posing significant learning challenges for models.
Attention Mechanism Patterns: The baseline models evaluated in this study adopt differing attention mechanisms. RoBERTa [65] employs full attention (quadratic attention), allowing every token to interact with all others in the sequence. In contrast, long-text models reduce computational complexity through sparse attention mechanisms. To directly observe the distribution of attention weights across long-text models, we selected 200 short accident reports from the test set and visualized their attention distributions. Figure 7 presents attention heatmaps derived from this analysis.
As shown in Figure 7, BigBird [31] combines global attention (focusing on special tokens such as [CLS]), sliding window attention (for local context), and random attention (random token interactions) [31,68], resulting in several long-range focus regions within the heatmap. Longformer [32] applies a sliding window (w = 3) and task-relevant global attention; its fixed window yields broader attention coverage than BigBird [31]. During pretraining and fine-tuning, both BigBird and Longformer [32] prioritize long-range dependencies, but this may become redundant in short-text tasks and introduce instability through random attention.
ERNIE-SPARSE [30] incorporates hierarchical sparse attention (Hierarchical Sparse Transformer) and self-attention regularization. In the heatmap, ERNIE-SPARSE exhibits a distinct hierarchical structure. However, in short texts, many parameters may remain inactive, leading to overparameterization.
SAFE-Transformer adopts a hierarchical sparse attention mechanism with block-wise dynamic sparsity, constraining information flow within the same level (e.g., between words in a sentence) or across parent–child levels (e.g., between section anchors and sentence anchors), resulting in multi-level contextual aggregation. Its heatmap displays both wide attention coverage and clearly layered regions. However, this integration of local and global attention may become redundant in short-text processing [69,70].
In summary, the suboptimal performance of long-text-optimized models on short-document classification may be attributed not only to the long-tail label distribution but also to a fundamental mismatch between their attention architectures and the semantic characteristics of short texts. Traditional models such as RoBERTa [65] use full attention, enabling every token to interact directly with all others, which effectively captures multi-level dependencies in short texts. In contrast, long-text models such as BigBird [31] and Longformer [32] rely on sparse attention (e.g., sliding windows, random sampling, or fixed global foci), sacrificing some global interactions for scalability. This "attention pruning" may sever critical semantic links in short texts: if the tokens that determine a label are isolated from one another by sparsification or window partitioning, the model may misclassify the document.
Additionally, long-text models are often pretrained on document-level objectives, such as masked document modeling or inter-paragraph relationship prediction, which biases them toward capturing macroscopic semantic structures. However, short texts often depend on fine-grained local semantics—such as the tight association between modifiers and core entities—which these models may not learn adequately. Moreover, random noise introduced by sparsity mechanisms is harder to dilute in the limited token space of short documents, further weakening generalization performance compared to traditional architectures.

6.2. Ablation Study

To evaluate the contributions of each module within SAFE-Transformer, we conducted an ablation study using a controlled variable method. Specifically, we removed the structural parsing layer (Struct), symbol-aware positional encoding (SymEnc), hierarchical sparse attention (HierAttn), and gated semantic fusion (GateFuse) modules one at a time and compared performance changes. All experiments were conducted on accident reports with a sequence length of 2048. The results are presented in Table 6.
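The controlled-variable procedure can be summarized as a configuration sweep in which exactly one module flag is disabled per run while all other settings, including the 2048-token sequence length, remain fixed. The configuration fields and the train_and_evaluate interface below are hypothetical placeholders used only to make the protocol explicit; they do not reflect our actual training code.

```python
from dataclasses import dataclass, replace

# Illustrative ablation harness: each run disables exactly one module flag
# while every other setting (including max sequence length 2048) stays fixed.
# `train_and_evaluate` is a hypothetical placeholder for the actual pipeline.

@dataclass(frozen=True)
class SAFEConfig:
    max_len: int = 2048
    use_struct: bool = True      # structural parsing layer
    use_sym_enc: bool = True     # symbol-aware positional encoding
    use_hier_attn: bool = True   # hierarchical sparse attention
    use_gate_fuse: bool = True   # gated semantic fusion

ABLATIONS = {
    "SAFE w/o Struct":   {"use_struct": False},
    "SAFE w/o SymEnc":   {"use_sym_enc": False},
    "SAFE w/o HierAttn": {"use_hier_attn": False},
    "SAFE w/o GateFuse": {"use_gate_fuse": False},
}

def run_ablation(train_and_evaluate):
    base = SAFEConfig()
    results = {"SAFE-Transformer": train_and_evaluate(base)}
    for name, override in ABLATIONS.items():
        results[name] = train_and_evaluate(replace(base, **override))
    return results  # e.g. {variant: {"micro_auc": ..., "p@5": ...}}
```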
Structural Parsing Layer: Removing the structural parsing layer (SAFE w/o Struct) led to the most significant performance degradation. Micro AUC dropped by 3.63% and P@5 decreased by 5.28%, indicating that explicitly encoding document-level hierarchical syntax (e.g., numbering structures like “1.2 → (a)”) is critical for effective long-text modeling. The performance drop primarily stems from the model’s inability to distinguish the logical relationships between section headings and body paragraphs, resulting in semantic confusion. For instance, in predicting labels related to liability attribution, the model without structural parsing may mistakenly associate content under “5.1 Direct Causes” with the unrelated heading “6. Corrective Measures”.
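A simplified view of the rule-based structural parsing removed in this variant is sketched below: each heading symbol is matched against an ordered set of numbering-scheme regular expressions (cf. Table 1) and mapped to a scheme code and nesting depth. The five patterns and the first-seen depth heuristic are illustrative reductions of the full 38-pattern rule set, not the exact parser used in the model.

```python
import re

# Simplified structural parsing: classify a heading's numbering symbol into a
# scheme code (cf. Table 1) and assign a nesting depth. The depth heuristic
# (first-seen order of schemes) is an illustrative simplification.

SCHEMES = [
    ("ar",      re.compile(r"^\d+[.\uFF0E]?$")),          # 1.  2.
    ("ar_rbr",  re.compile(r"^\d+[)\uFF09]$")),           # 1)  2)
    ("ar_wbr",  re.compile(r"^[(\uFF08]\d+[)\uFF09]$")),  # (1) (2)
    ("ar_cir",  re.compile(r"^[\u2460-\u2473]$")),        # 1-20 circled, e.g. ①
    ("rom",     re.compile(r"^[IVXLC]+[.\uFF0E]?$")),     # IV. XII.
]

def classify_symbol(symbol):
    for code, pattern in SCHEMES:
        if pattern.match(symbol.strip()):
            return code
    return None

def parse_headings(symbols):
    """Map each heading symbol to (scheme_code, depth)."""
    depth_of = {}
    parsed = []
    for sym in symbols:
        code = classify_symbol(sym)
        if code is not None and code not in depth_of:
            depth_of[code] = len(depth_of) + 1   # first-seen order approximates depth
        parsed.append((code, depth_of.get(code)))
    return parsed

# parse_headings(["1.", "(1)", "①", "2."])
# -> [("ar", 1), ("ar_wbr", 2), ("ar_cir", 3), ("ar", 1)]
```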
Symbol-Aware Positional Encoding: Disabling symbol-aware encoding (SAFE w/o SymEnc) resulted in a 2.17% drop in Micro AUC and a 3.17% decrease in P@5. This supports the importance of capturing the semantic heterogeneity of symbolic cues (e.g., “1.” vs. “①”). For example, in a paragraph labeled “Qualification Review”, the model may incorrectly classify “2) Unlicensed Operation of Special Equipment” as related to “Construction Plans”, failing to recognize that “2)” corresponds to a secondary clause of liability.
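Conceptually, symbol-aware positional encoding can be sketched as adding hierarchy-depth and format-type embeddings to the standard position embedding, so that "1." and "①" occurring at the same surface position receive different positional signals. The embedding-table sizes and hidden dimension below are placeholders rather than the model's actual hyperparameters.

```python
import torch
import torch.nn as nn

# Sketch of symbol-aware positional encoding: besides the usual position
# embedding, each token receives an embedding for its hierarchy depth and one
# for the numbering-format type of its section (e.g. "1." vs "①").
# Table sizes and hidden size are illustrative placeholders.

class SymbolAwarePositionalEncoding(nn.Module):
    def __init__(self, hidden=768, max_pos=4096, max_depth=8, num_formats=38):
        super().__init__()
        self.pos = nn.Embedding(max_pos, hidden)
        self.depth = nn.Embedding(max_depth, hidden)
        self.fmt = nn.Embedding(num_formats + 1, hidden)  # index 0 = "no symbol"

    def forward(self, positions, depths, format_ids):
        # All inputs: LongTensor of shape (batch, seq_len)
        return self.pos(positions) + self.depth(depths) + self.fmt(format_ids)

# enc = SymbolAwarePositionalEncoding()
# out = enc(torch.arange(16).unsqueeze(0),
#           torch.zeros(1, 16, dtype=torch.long),
#           torch.zeros(1, 16, dtype=torch.long))   # -> (1, 16, 768)
```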
Hierarchical Sparse Attention: Replacing hierarchical sparse attention with full attention (SAFE w/o HierAttn) led to a 3.30% drop in Micro AUC and a 21.7% increase in computation time. The full attention mechanism introduces cross-hierarchical noise—such as interference from non-direct parent nodes—reducing the model’s sensitivity to deeply nested structures. For instance, in the case of the fourth-level heading “① Improper Scaffolding Setup”, the model may incorrectly associate it with the third-level heading “3.1 Management Deficiencies” rather than its immediate parent “(1) Construction Management Issues”.
Gated Semantic Fusion: Disabling the gated fusion module (SAFE w/o GateFuse) led to a 1.06% drop in Micro F1 and a 2.31% decline in P@5. This suggests that dynamically aggregating semantic signals from parent nodes, headings, and paragraph content improves the precision of key paragraph recognition. For example, under the heading “4.2 Lack of Protective Measures”, the model without gating may overlook details in the body text such as “failure to wear a safety harness”, relying too heavily on the general semantic of “protective measures” in the heading.
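The gating behavior probed by this ablation can be sketched as a sigmoid gate computed from the concatenated parent-section, heading, and paragraph representations, which then blends heading semantics with paragraph content. The parameterization shown is illustrative; the actual gated fusion unit in SAFE-Transformer may differ in detail.

```python
import torch
import torch.nn as nn

# Sketch of gated semantic fusion: a gate computed from the parent-section,
# heading, and paragraph representations decides how much heading semantics
# to blend into the paragraph vector. Illustrative parameterization only.

class GatedSemanticFusion(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        self.gate = nn.Linear(3 * hidden, hidden)

    def forward(self, h_parent, h_heading, h_paragraph):
        g = torch.sigmoid(self.gate(torch.cat([h_parent, h_heading, h_paragraph], dim=-1)))
        return g * h_heading + (1.0 - g) * h_paragraph  # element-wise blend

# fuse = GatedSemanticFusion()
# out = fuse(torch.randn(2, 768), torch.randn(2, 768), torch.randn(2, 768))  # (2, 768)
```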
SAFE-Transformer’s performance improvements rely on the synergy of its individual modules. Among them, the structural parsing layer contributed the most (32.6%), followed by hierarchical attention (28.1%) and gated fusion (22.9%), with symbol-aware encoding accounting for 16.4%. These findings confirm that effective modeling of long texts in vertical domains requires the explicit representation of document structure, symbolic semantics, and hierarchical interactions.

7. Discussion

This study systematically verifies the effectiveness of incorporating structure-aware and format-enhanced strategies into the Transformer architecture for modeling accident investigation reports, which are characterized by high structural complexity and varying document lengths. Experimental results demonstrate that SAFE-Transformer significantly outperforms representative baselines such as RoBERTa [65], Longformer [32], and BigBird [31] across overall metrics and in handling long-tail multi-label categories. Notably, its advantages in P@5 and Macro-F1 underscore its strong ability to capture critical semantic fragments and rare labels. In contrast to traditional long-text approaches based on truncation or random sparsity, SAFE-Transformer explicitly parses section numbers and headings to transform symbolic-level information into global semantic anchors, enabling it to maintain stable performance across different document lengths. This finding aligns with recent studies [62,63,64] highlighting the benefits of explicit structural modeling in enhancing text reasoning and generation quality, further confirming the synergistic value of combining structural priors with sparse attention mechanisms.
The observed inverted U-shaped performance trends across short, medium, and long documents offer important insights into long-text modeling strategies. While dense attention remains irreplaceable for fine-grained precision in short texts, structurally guided sparse attention achieves a better trade-off between efficiency and performance as sequence length increases. This suggests that future accident text analysis systems could adopt adaptive encoding strategies based on document length to balance accuracy and responsiveness. Additionally, ablation results show that the structural parsing layer and hierarchical sparse attention contribute most to performance gains, while gated fusion and symbol-aware encoding serve as stabilizers. These findings highlight the importance of multi-module synergy and provide a clear blueprint for future modular expansions.
Despite promising results, several limitations should be acknowledged. First, the rule-based parser covering 38 directory patterns may not generalize well across regions, industries, or multilingual settings. Future work could explore differentiable parsers or few-shot prompt learning to reduce template dependence. Second, although the SAFE-Transformer effectively extracts accident intelligence, it does not explicitly address privacy protection—a crucial issue in Europe and many other countries. Given that accident reports frequently contain sensitive information, future research should investigate privacy-preserving strategies such as anonymization mechanisms or differential privacy integration into the Transformer architecture. Third, the current evaluation is limited to multi-label classification. The model’s generalizability has not yet been validated in more complex applications such as causal chain extraction, evidence tracking, and abstractive summarization, which demand more sophisticated cross-hierarchical reasoning and long-range alignment. Furthermore, regarding causative agents in the “Accident profile”, the current annotation framework may not fully align with the European Statistics on Accidents at Work (ESAW) model, potentially limiting cross-national applicability. Consequently, additional comparative studies are required to evaluate and potentially enhance the data modeling approach in diverse international contexts. Fourth, although SAFE-Transformer reduces computational complexity relative to full-attention models, its inference latency on general-purpose GPUs remains higher than that of operator-optimized dense models. Further acceleration may require integration with dynamic sparse kernels, adaptive pruning, or hardware-friendly positional encoding.
The primary contribution of this work lies not only in proposing a structure-aware Transformer tailored to accident investigation reports but also in offering a generalizable paradigm for semantic modeling of domain-specific long documents. By explicitly incorporating structural signals from the domain, the model reshapes attention pathways to balance global dependency capture with computational efficiency. This approach can be extended to other hierarchically formatted technical texts, such as regulatory documents, audit reports, and clinical records. Looking forward, integrating SAFE-Transformer with retrieval-augmented generation and knowledge graph reasoning could support the development of end-to-end systems for accident intelligence extraction and decision support—thereby narrowing the gap between automated analysis and expert-level insight, and enhancing the timeliness and interpretability of safety risk monitoring and emergency response.

8. Conclusions

This study proposes the SAFE-Transformer model, a structure-aware and format-enhanced Transformer architecture tailored to the characteristics of accident investigation reports, which are typically lengthy, structurally complex, and information-dense. By explicitly parsing 38 types of hierarchical directory symbols at the input stage and incorporating symbol-aware positional encoding, hierarchical sparse attention, and gated semantic fusion mechanisms, SAFE-Transformer achieves significant improvements over representative baselines such as RoBERTa [65], Longformer [32], BigBird [31], and ERNIE-SPARSE [30] on a multi-label accident intelligence classification task. The model attains up to a 4.71-point increase in micro-averaged AUC and demonstrates particularly strong performance in recognizing low-frequency labels.
Ablation studies further confirm that the structural parsing layer and hierarchical sparse attention are the primary contributors to performance gains, while the dual-stream encoding and dynamic gating modules effectively balance global–local semantic integration. These results suggest that explicitly modeling document-level structural hierarchies and symbolic semantics not only improves classification accuracy but also enhances sensitivity to long-tail concepts and robustness across varying document lengths. SAFE-Transformer thus offers a generalizable paradigm for integrating structural priors into Transformer-based architectures and can be extended to other hierarchically formatted technical documents, such as legal codes and audit reports. Theoretically, this study highlights the significance of incorporating explicit structural priors into deep learning architectures, demonstrating that hierarchical and symbolic features substantially enhance semantic representation learning. Practically, SAFE-Transformer provides a powerful tool for safety supervision agencies, facilitating the automated extraction of accident intelligence, systematic risk assessment, and evidence-based decision-making. Future research directions include exploring the transferability of SAFE-Transformer to multilingual accident datasets, developing interactive systems for real-time safety intelligence analysis, and investigating methods to further reduce computational complexity for large-scale practical deployments. Overall, this study provides a methodological foundation for automated knowledge extraction and decision support in the safety domain.

Author Contributions

Conceptualization, methodology, and writing—original draft preparation, W.Z.; writing—review and editing, W.T., D.Y., H.Z., P.D. and S.H.; supervision and funding acquisition, W.T. and D.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by [Shenzhen Science and Technology Program], grant number [KCXFZ20230731093902005].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study can be made available by the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xu, N.; Ma, L.; Liu, Q.; Wang, L.; Deng, Y. An Improved Text Mining Approach to Extract Safety Risk Factors from Construction Accident Reports. Saf. Sci. 2021, 138, 105216. [Google Scholar] [CrossRef]
  2. Pandithawatta, S.; Ahn, S.; Rameezdeen, R.; Chow, C.W.K.; Gorjian, N. Systematic Literature Review on Knowledge-Driven Approaches for Construction Safety Analysis and Accident Prevention. Buildings 2024, 14, 3403. [Google Scholar] [CrossRef]
  3. Ha, J.; Haralick, R.M.; Phillips, I.T. Recursive XY Cut Using Bounding Boxes of Connected Components. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; IEEE: New York, NY, USA, 1995; Volume 2, pp. 952–955. [Google Scholar] [CrossRef]
  4. Lebourgeois, F.; Bublinski, Z.; Emptoz, H. A Fast and Efficient Method for Extracting Text Paragraphs and Graphics from Unconstrained Documents. In Proceedings of the 11th IAPR International Conference on Pattern Recognition. Vol. II. Conference B: Pattern Recognition Methodology and Systems, The Hague, The Netherlands, 30 August–3 September 1992; IEEE Computer Society: Washington, DC, USA, 1992; Volume 1, pp. 272–273. [Google Scholar] [CrossRef]
  5. O’Gorman, L. The Document Spectrum for Page Layout Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1993, 15, 1162–1173. [Google Scholar] [CrossRef]
  6. Huang, Y.; Lv, T.; Cui, L.; Lu, Y.; Wei, F. Layoutlmv3: Pre-training for Document AI With Unified Text and Image Masking. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; ACM: New York, NY, USA, 2022; pp. 4083–4091. [Google Scholar] [CrossRef]
  7. Shilman, M.; Liang, P.; Viola, P. Learning Nongenerative Grammatical Models for Document Analysis. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, Beijing, China, 17–21 October 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 2, pp. 962–969. [Google Scholar] [CrossRef]
  8. Dalitz, C. Kd-trees for Document Layout Analysis. In Document Image Analysis with the Gamera Framework, Schriftenreihe des Fachbereichs Elektrotechnik und Informatik, Hochschule Niederrhein; Shaker Verlag: Düren, Germany, 2009; Volume 8, pp. 39–52. [Google Scholar]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 5998–6008. [Google Scholar]
  10. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; ACL Anthology: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  11. Luo, X.; Li, X.; Song, X. Convolutional Neural Network Algorithm–Based Novel Automatic Text Classification Framework for Construction Accident Reports. J. Constr. Eng. Manag. 2023, 149, 04023128. [Google Scholar] [CrossRef]
  12. Liu, C.; Yang, S. Using Text Mining to Establish Knowledge Graph from Accident/Incident Reports in Risk Assessment. Expert Syst. Appl. 2022, 207, 117991. [Google Scholar] [CrossRef]
  13. Fang, W.; Luo, H.; Xu, S. Automated Text Classification of Near-Misses from Safety Reports: An Improved Deep Learning Approach. Adv. Eng. Inform. 2020, 44, 101060. [Google Scholar] [CrossRef]
  14. Ray, U.; Arteaga, C.; Ahn, Y. Enhanced Identification of Equipment Failures from Descriptive Accident Reports Using Language Generative Model. Eng. Constr. Archit. Manag. 2024. ahead of print. [Google Scholar] [CrossRef]
  15. Ahmadi, E.; Muley, S.; Wang, C. Automatic Construction Accident Report Analysis Using Large Language Models (LLMs). J. Intell. Constr. 2025, 3, 1–10. [Google Scholar] [CrossRef]
  16. Sun, C.; Qiu, X.; Xu, Y.; Huang, X. How to Fine-Tune Bert for Text Classification? In Proceedings of the China National Conference on Chinese Computational Linguistics, Kunming, China, 18–20 October 2019; Springer International Publishing: Cham, Switzerland, 2019; pp. 194–206. [Google Scholar] [CrossRef]
  17. Adhikari, A.; Ram, A.; Tang, R.; Lin, J. Docbert: Bert for Document Classification. arXiv 2019, arXiv:1904.08398. [Google Scholar] [CrossRef]
  18. Mosbach, M.; Andriushchenko, M.; Klakow, D. On the Stability of Fine-Tuning Bert: Misconceptions, Explanations, and Strong Baselines. arXiv 2020, arXiv:2006.04884. [Google Scholar] [CrossRef]
  19. Dong, T.; Yang, Q.; Ebadi, N.; Luo, X.R.; Rad, P. Identifying Incident Causal Factors to Improve Aviation Transportation Safety: Proposing a Deep Learning Approach. J. Adv. Transp. 2021, 1, 5540046. [Google Scholar] [CrossRef]
  20. Wang, J.; Yan, M. Application of an Improved Model for Accident Analysis: A Case Study. Int. J. Environ. Res. Public Health 2019, 16, 2756. [Google Scholar] [CrossRef] [PubMed]
  21. Shayboun, M. Toward Accident Prevention Through Machine Learning Analysis of Accident Reports. Master’s Thesis, Universidade Tecnica de Lisboa, Lisboa, Portugal, 2022. [Google Scholar]
  22. Chi, S.; Han, S. Analyses of Systems Theory for Construction Accident Prevention with Specific Reference to OSHA Accident Reports. Int. J. Proj. Manag. 2013, 31, 1027–1041. [Google Scholar] [CrossRef]
  23. Lombardi, M.; Mauro, F.; Fargnoli, M.; Napoleoni, Q.; Berardi, D.; Berardi, S. Occupational Risk Assessment in Landfills: Research Outcomes from Italy. Safety 2023, 9, 3. [Google Scholar] [CrossRef]
  24. Karanikas, N.; Weber, D.; Bruschi, K.; Brown, S. Identification of Systems Thinking Aspects in ISO 45001: 2018 on Occupational Health & Safety Management. Saf. Sci. 2022, 148, 105671. [Google Scholar] [CrossRef]
  25. Marhavilas, P.K.; Pliaki, F.; Koulouriotis, D. International Management System Standards Related to Occupational Safety and Health: An Updated Literature Survey. Sustainability 2022, 14, 13282. [Google Scholar] [CrossRef]
  26. Chalkidis, I.; Fergadiotis, M.; Kotitsas, S.; Malakasiotis, P.; Aletras, N.; Androutsopoulos, I. An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels. arXiv 2020, arXiv:2010.01653. [Google Scholar] [CrossRef]
  27. Dai, X.; Chalkidis, I.; Darkner, S.; Elliott, D. Revisiting Transformer-Based Models for Long Document Classification. arXiv 2022, arXiv:2204.06683. [Google Scholar] [CrossRef]
  28. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; ACL Anthology: Stroudsburg, PA, USA, 2016; pp. 1480–1489. [Google Scholar] [CrossRef]
  29. Wu, C.; Wu, F.; Qi, T.; Huang, Y. Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling. arXiv 2021, arXiv:2106.01040. [Google Scholar] [CrossRef]
  30. Liu, Y.; Liu, J.; Chen, L.; Lu, Y.; Feng, S.; Feng, Z.; Wang, H. Ernie-Sparse: Learning Hierarchical Efficient Transformer Through Regularized Self-Attention. arXiv 2022, arXiv:2203.12276. [Google Scholar] [CrossRef]
  31. Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Ahmed, A. Big Bird: Transformers for Longer Sequences. In Proceedings of the Advances in Neural Information Processing Systems 33, Online, 6–12 December 2020; Volume 33, pp. 17283–17297. [Google Scholar]
  32. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar] [CrossRef]
  33. Han, S.W.; Eun, H.J.; Kim, Y.S.; Kóczy, L.T. A Document Classification Algorithm Using the Fuzzy Set Theory and Hierarchical Structure of Documents. In Computational Science and Its Applications—ICCSA 2004; Laganá, A., Gavrilova, M.L., Kumar, V., Mun, Y., Tan, C.J.K., Gervasi, O., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2004; Volume 3043, pp. 1480–1489. [Google Scholar] [CrossRef]
  34. Reason, J. Managing the Risks of Organizational Accidents; Routledge: London, UK, 2016. [Google Scholar] [CrossRef]
  35. Leveson, N.; Daouk, M.; Dulac, N.; Marais, K. Applying STAMP in Accident Analysis. In NASA Conference Publication; NASA: Washington, DC, USA, 1998; pp. 177–198. [Google Scholar]
  36. Pimble, J.; O’Toole, S. Analysis of Accident Reports. Ergonomics 1982, 25, 967–979. [Google Scholar] [CrossRef] [PubMed]
  37. Abdel-Aty, M.A.; Radwan, A.E. Modeling Traffic Accident Occurrence and Involvement. Accid. Anal. Prev. 2000, 32, 633–642. [Google Scholar] [CrossRef] [PubMed]
  38. Elvik, R.; Mysen, A. Incomplete Accident Reporting: Meta-Analysis of Studies Made in 13 Countries. Transp. Res. Rec. 1999, 1665, 133–140. [Google Scholar] [CrossRef]
  39. Farahani, M.; Ghasemi, G. Artificial Intelligence and Inequality: Challenges and Opportunities. Int. J. Innov. Educ. 2024, 9, 78–99. [Google Scholar] [CrossRef]
  40. Fiegler-Rudol, J.; Lau, K.; Mroczek, A.; Kasperczyk, J. Exploring Human–AI Dynamics in Enhancing Workplace Health and Safety: A Narrative Review. Int. J. Environ. Res. Public Health 2025, 22, 199. [Google Scholar] [CrossRef] [PubMed]
  41. Goh, Y.M.; Ubeynarayana, C.U. Construction Accident Narrative Classification: An Evaluation of Text Mining Techniques. Accident Anal. Prev. 2017, 108, 122–130. [Google Scholar] [CrossRef] [PubMed]
  42. Zhang, F.; Fleyeh, H.; Wang, X. Construction Site Accident Analysis Using Text Mining and Natural Language Processing Techniques. Autom. Constr. 2019, 99, 238–248. [Google Scholar] [CrossRef]
  43. Comberti, L.; Demichela, M.; Baldissone, G.; Fois, G.; Luzzi, R. Large Occupational Accidents Data Analysis with a Coupled Unsupervised Algorithm: The SOM k-Means Method. An Application to the Wood Industry. Safety 2018, 4, 51. [Google Scholar] [CrossRef]
  44. Lombardi, M.; Fargnoli, M.; Parise, G. Risk Profiling from the European Statistics on Accidents at Work (ESAW) Accidents’ Databases: A Case Study in Construction Sites. Int. J. Environ. Res. Public Health 2019, 16, 4748. [Google Scholar] [CrossRef] [PubMed]
  45. Li, J.; Wu, C. Deep Learning and Text Mining: Classifying and Extracting Key Information from Construction Accident Narratives. Appl. Sci. 2023, 13, 10599. [Google Scholar] [CrossRef]
  46. Zhang, J.; Zi, L.; Hou, Y. A C-BiLSTM Approach to Classify Construction Accident Reports. Appl. Sci. 2020, 10, 5754. [Google Scholar] [CrossRef]
  47. Qiao, J.; Wang, C.; Guan, S. Construction-Accident Narrative Classification Using Shallow and Deep Learning. J. Constr. Eng. Manag. 2022, 148, 04022088. [Google Scholar] [CrossRef]
  48. Chalkidis, I. ChatGPT May Pass the Bar Exam Soon, but Has a Long Way to Go for the Lexglue Benchmark. arXiv 2023, arXiv:2304.12202. [Google Scholar] [CrossRef]
  49. Tan, B.; Yang, Z.; AI-Shedivat, M.; Xing, E.P.; Hu, Z. Progressive Generation of Long Text with Pretrained Language Models. arXiv 2020, arXiv:2006.15720. [Google Scholar] [CrossRef]
  50. Han, G.; Tsao, J.; Huang, X. Length-Aware Multi-Kernel Transformer for Long Document Classification. arXiv 2024, arXiv:2405.07052. [Google Scholar] [CrossRef]
  51. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Zettlemoyer, L. Bart: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar] [CrossRef]
  52. Chalkidis, I.; Fergadiotis, M.; Malakasiotis, P.; Aletras, N.; Androutsopoulos, I. LEGAL-BERT: The Muppets Straight Out of Law School. arXiv 2020, arXiv:2010.02559. [Google Scholar] [CrossRef]
  53. Huang, Y.; Xu, J.; Lai, J.; Jiang, Z.; Chen, T.; Li, Z.; Zhao, P. Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey. arXiv 2023, arXiv:2311.12351. [Google Scholar] [CrossRef]
  54. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Raffel, C. mT5: A Massively Multilingual Pre-Trained Text-to-Text Transformer. arXiv 2020, arXiv:2010.11934. [Google Scholar] [CrossRef]
  55. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating Long Sequences with Sparse Transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar] [CrossRef]
  56. Ainslie, J.; Ontanon, S.; Alberti, C.; Cvicek, V.; Fisher, Z.; Pham, P.; Yang, L. ETC: Encoding Long and Structured Inputs in Transformers. arXiv 2020, arXiv:2004.08483. [Google Scholar] [CrossRef]
  57. Zhang, H.; Liu, X.; Zhang, J. Hegel: Hypergraph Transformer for Long Document Summarization. arXiv 2022, arXiv:2210.04126. [Google Scholar] [CrossRef]
  58. He, H.; Flicke, M.; Buchmann, J. HDT: Hierarchical Document Transformer. arXiv 2024, arXiv:2407.08330. [Google Scholar] [CrossRef]
  59. Taylor, B.M.; Beach, R.W. The Effects of Text Structure Instruction on Middle-Grade Students’ Comprehension and Production of Expository Text. Reading Res. Q. 1984, 19, 134–146. [Google Scholar] [CrossRef]
  60. Guthrie, J.T.; Britten, T.; Barker, K.G. Roles of Document Structure, Cognitive Strategy, and Awareness in Searching for Information. Read. Res. Q. 1991, 26, 300–324. [Google Scholar] [CrossRef]
  61. Murty, S.; Sharma, P.; Andreas, J. Grokking of Hierarchical Structure in Vanilla Transformers. arXiv 2023, arXiv:2305.18741. [Google Scholar] [CrossRef]
  62. Li, M.; Hovy, E.; Lau, J.H. Summarizing Multiple Documents with Conversational Structure for Meta-Review Generation. arXiv 2023, arXiv:2305.01498. [Google Scholar] [CrossRef]
  63. Cao, S.; Wang, L. HIBRIDS: Attention with Hierarchical Biases for Structure-Aware Long Document Summarization. arXiv 2022, arXiv:2203.10741. [Google Scholar] [CrossRef]
  64. Ruan, Q.; Ostendorff, M.; Rehm, G.H. Improving Extractive Text Summarization with Hierarchical Structure Information. arXiv 2022, arXiv:2203.09629. [Google Scholar] [CrossRef]
  65. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Stoyanov, V. Roberta: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar] [CrossRef]
  66. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  67. Xiong, W.; Oğuz, B.; Gupta, A.; Chen, X.; Liskovich, D.; Levy, O.; Mehdad, Y. Simple Local Attentions Remain Competitive for Long-Context Tasks. arXiv 2021, arXiv:2112.07210. [Google Scholar] [CrossRef]
  68. Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. Efficient Transformers: A Survey. ACM Comput. Surv. 2022, 55, 1–28. [Google Scholar] [CrossRef]
  69. Pham, H.; Wang, G.; Lu, Y.; Florencio, D.; Zhang, C. Understanding Long Documents with Different Position-Aware Attentions. arXiv 2022, arXiv:2208.08201. [Google Scholar] [CrossRef]
  70. Tan, N.Ö.; Peng, A.Y.; Bensemann, J.; Bao, Q.; Hartill, T.; Gahegan, M.; Witbrock, M. Input-Length-Shortening and Text Generation via Attention Values. arXiv 2023, arXiv:2303.07585. [Google Scholar] [CrossRef]
Figure 1. Overall workflow.
Figure 2. Model architecture diagram.
Figure 3. Sparse attention mask based on hierarchical structures. Rule 1: Tokens at the same hierarchical level are allowed to interact only if they share the same parent node across all upper levels. As shown in (a), tokens B, C, and H share the same parent node A at level 1, thus satisfying the intra-level interaction condition and should be permitted to attend to each other. Similarly, D and E, as well as F and G, are allowed to interact. However, interactions between D and F, D and G, E and F, or E and G are prohibited since their respective parent nodes differ (B vs. C). Rule 2: Bidirectional attention is allowed between parent and child nodes across adjacent hierarchical levels. For example, A → B, A → C, B → A, and C → A are all permitted. The final sparse attention mask is illustrated in (b).
Figure 4. Example of the original report structure and its annotated counterpart. (a) Partial structure of the original report; (b) partial structured annotation information.
Figure 5. AUC and F1 performance curves across documents of varying lengths. (a) Micro-AUC scores versus document length, showing the inverted U-shaped trend with peak performance in medium-length documents (1K-2K tokens). (b) Macro-AUC scores versus document length, highlighting the impact on rare labels (e.g., L6 Legal Provisions) that improves in medium-length reports. (c) Micro-F1 scores versus document length, demonstrating similar non-linear patterns as AUC metrics. (d) Macro-F1 scores versus document length, revealing performance degradation on long-tail categories in very short (<512 tokens) and very long (>4K tokens) documents.
Figure 6. Label distribution across accident reports of varying lengths. (a) Short documents (L ≤ 512): L1–L3 labels dominate (>95% occurrence), characteristic of government portal briefings. (b) Medium-short (512 < L ≤ 1024): Long-tail distribution eases with emerging L7/L8 causal analysis. (c) Medium-long (1024 < L ≤ 2048): Balanced labels form a complete narrative chain, predominant in local emergency reports. (d) Long documents (2048 < L ≤ 4096): L6 (Legal Provisions) peaks due to complex liability cases. (e) Gini coefficient: Quantifies a U-shaped imbalance shift—extreme in shorts (L1–L3), balanced in mediums, and renewed in longs (L6-dominated).
Figure 7. Attention heatmaps of long-text optimized models on short accident reports. Color scale: dark blue represents low attention, bright yellow indicates high attention. (a) BigBird combines global, sliding window, and random attention, creating long-range focus regions; (b) Longformer applies sliding window and task-relevant global attention, yielding broader coverage; (c) ERNIE-SPARSE exhibits hierarchical sparse attention but risks overparameterization in short texts; (d) SAFE-Transformer uses hierarchical sparsity with block-wise dynamics, enabling multi-level aggregation.
Table 1. Regular expression set for hierarchical numbering scheme (partial).
Numbering Scheme | Codes | Regular Expression | Examples
Arabic numerals | ar | ^\d+[..   ]?$ | 1. 2.
Uppercase letters | alpha_s | ^[A-Z][..   ]?$ | A. B.
Lowercase letters | alpha_l | ^[a-z][..   ]?$ | a. c.
Arabic numerals with right parentheses | ar_rbr | ^\d+[))]$ | 1) 2)
Arabic numerals in full parentheses | ar_wbr | ^[((][^\d+[))]$ | (1) (2)
Circled numbers | ar_cir | [\u2460-\u2473] | ① ⑮
Roman numerals | rom | ^[IVXLC]+[..   ]?$ | IV. XII.
Table 2. The set of accident intelligence labels used in this study.
Code | Label Name | Description
L1 | Accident Profile | Summarizes basic accident information, including casualty statistics, operational context, causative agents, and accident classification.
L2 | Emergency Response | Describes the immediate on-site handling of the incident, including rescue actions and emergency response measures.
L3 | Accident Liability | Indicates the attribution of responsibility to individuals or organizations involved in the incident.
L4 | Management Deficiencies | Refers to shortcomings in safety management systems, such as inadequate supervision, training, or procedural enforcement.
L5 | Actions and Prevention | Covers both corrective measures implemented post-incident and proactive strategies to prevent similar future occurrences.
L6 | Legal Provisions | Cites specific laws, regulations, or legal clauses used as the basis for accident handling and accountability.
L7 | Unsafe Acts | Captures human-related safety violations, such as unauthorized operations, procedural breaches, or negligence.
L8 | Unsafe Conditions | Refers to hazardous physical states of tools, equipment, or materials that contributed to the accident.
L9 | Unsafe Environment | Describes environmental risks present at the worksite, such as poor weather, unstable terrain, or inadequate lighting.
Table 3. Summary of baseline model configurations.
Model | Attention Mechanism | Key Configuration
RoBERTa | Standard multi-head self-attention | 12–16 heads, head dimension 64, dropout 0.1
Longformer | Sliding window + global attention | Window size 256–512, one global token, no dilation window
BigBird | Random + window + global attention | Random connections r = 1, window size w = 3, global tokens g = 1, block size 2
ERNIE-SPARSE | Hierarchical sparse attention | Intra-window sparsity, 2–8 global representative tokens, regularization coefficient α = 0.5–10
Table 4. Overall performance metrics.
Model | Micro-AUC | Macro-AUC | Micro-F1 | Macro-F1 | P@5
RoBERTa | 68.74 | 65.21 | 64.79 | 62.15 | 67.42
Longformer | 69.35 | 66.83 | 66.52 | 64.78 | 71.15
BigBird | 71.89 | 68.92 | 70.18 | 67.33 | 74.19
ERNIE-SPARSE | 70.86 | 67.98 | 70.45 | 67.98 | 75.27
SAFE-Transformer | 73.45 | 70.12 | 71.23 | 68.73 | 76.33
Table 5. P@5 performance on reports of varying lengths.
Model | L = [0, 512] | L = [512, 1024] | L = [1024, 2048] | L = [2048, 4096]
RoBERTa | 73.13 | 68.41 | 66.75 | 59.12
Longformer | 69.43 | 70.83 | 71.32 | 69.16
BigBird | 69.15 | 73.82 | 70.22 | 71.25
ERNIE-SPARSE | 70.36 | 75.03 | 71.29 | 70.25
SAFE-Transformer | 71.48 | 75.87 | 72.21 | 71.35
Table 6. Ablation study results.
Model Variant | Micro AUC | Macro AUC | Micro F1 | Macro F1 | P@5
SAFE-Transformer | 73.45 | 70.12 | 70.23 | 68.82 | 76.33
SAFE w/o Struct | 69.82 | 67.05 | 67.12 | 65.74 | 71.05
SAFE w/o SymEnc | 71.28 | 68.93 | 68.35 | 66.82 | 73.16
SAFE w/o HierAttn | 70.15 | 68.11 | 67.84 | 66.02 | 72.47
SAFE w/o GateFuse | 71.03 | 69.24 | 69.17 | 67.35 | 74.02
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

