7.1. Topic Characterization Through Words
As illustrated in Step 2 of Figure 1, the fire investigation texts were first analyzed using topic modeling to identify latent semantic structures. Table 2 presents the resulting thematic groups, providing an overview of how the documents are organized into distinct topics. In addition, Table 3 presents the distribution of words derived from the estimated topic–word probabilities obtained in the topic modeling stage. The table lists the five terms with the highest posterior probabilities for each topic, highlighting representative keywords that characterize the semantic content of each domain.
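The selection of representative terms described above can be sketched as follows. This is a minimal illustration, assuming the topic–word probabilities are available as a list of rows (one per topic) aligned with a vocabulary list; the function name `top_terms` and the toy values are hypothetical and are not taken from the study.

```python
def top_terms(topic_word, vocab, k=5):
    """Return the k highest-probability words for each topic."""
    result = []
    for probs in topic_word:
        # Rank vocabulary indices by descending probability under this topic.
        ranked = sorted(range(len(vocab)), key=lambda j: probs[j], reverse=True)
        result.append([vocab[j] for j in ranked[:k]])
    return result


# Toy topic-word matrix: illustrative values only, not estimates from the paper.
vocab = ["solvent", "mixer", "short circuit", "duct", "forklift"]
topic_word = [
    [0.50, 0.10, 0.05, 0.30, 0.05],
    [0.05, 0.60, 0.20, 0.05, 0.10],
]
top2 = top_terms(topic_word, vocab, k=2)
print(top2)
```

With `k=5` and the full vocabulary, the same ranking would yield the kind of per-topic keyword list reported in Table 3.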
Topic 1 reflects incidents associated with oil vapors generated from organic solvents or related chemical processes, frequently occurring in cleaning operations or wastewater treatment facilities. Topic 2 captures fire risks arising from the ignition of combustible materials due to friction or static electricity during equipment operation, as suggested by keywords such as mixer and drum can. Topic 3 represents general electrical fires, characterized by terms such as electrical short circuit and circuit breaker, while Topic 5 is more narrowly related to electrical heating sources (e.g., air conditioners and heating wires). Topic 4 emphasizes ignition triggered by sludge accumulation within ventilation systems, particularly in laboratory environments.
Topic 6 highlights fires directly linked to dust collection equipment, where flames may propagate through ducts or filter systems containing combustible dust. In contrast, Topic 10, although conceptually related, places less emphasis on the dust collection equipment itself and instead indicates fire hazards involving adjacent facilities, such as ventilation ducts, plastics, and drying equipment. Topic 7 is dominated by spontaneous combustion events caused by the improper storage of combustible residues, including processed byproducts such as sesame dregs. Topic 8 involves forklift-related fires, which occur across both chemical and general factory settings, often due to battery or engine compartment failures. Topic 9 concerns reactor-related incidents in chemical plants, where abnormal reactions generate oil vapors that lead to explosions.
Other topics describe equipment-specific or context-specific fire causes. Topic 11 focuses on drying equipment, particularly in cases involving powders such as silicon. Topic 12 reflects fires ignited by the thermal oil of Banbury mixing equipment. Topics 13–15 capture more general factory-related accidents: Topic 13 refers to welding-induced ignition near cooling towers or sandwich panels; Topic 14 highlights motor-related electrical fires in air compressors; and Topic 15 illustrates landfill or waste-area fires, frequently initiated by discarded cigarette butts.
While biterm modeling provides an effective means of partitioning documents into topics, it assigns every word a probability under every topic, which limits its ability to capture direct relationships among words. Since the primary goal of this study is to identify the words that carry substantial meaning within fire investigation documents, it is essential to explore how words are grouped through their interactions. In this respect, topic–word distributions offer probabilistic information on the degree of association between words and topics, thereby serving as a valuable resource for indirectly inferring inter-word relationships. Leveraging this information enables us to examine documents not only at the level of topics but also from the perspective of word-level associations, ultimately facilitating a more interpretable summarization of fire-related narratives.
7.2. Thematic Aggregation of Topics
Building upon the topic–word distributions, we further inferred interactions among words to explore higher-level thematic structures.
Figure 2 presents a two-dimensional projection of the latent positions of the 449 words introduced in Section 3, with clusters identified using the k-means algorithm. The visualization highlights clusters by distinct colors, showing how semantically related words are grouped in close proximity. Based on model selection criteria, a total of 15 clusters were identified as valid. As shown in Figure 2, words located near one another in the latent space tend to form coherent clusters, thereby capturing meaningful associations beyond the topic-level representation.
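The clustering step can be illustrated with a small, self-contained version of Lloyd's k-means iteration on two-dimensional positions. This is a sketch under stated assumptions: the six toy points stand in for the 449 latent word positions, the random initialization and the function name `kmeans` are illustrative choices, and the study's actual implementation and model selection procedure are not reproduced here.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm for 2-D points; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialization: k random points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, centers


# Toy latent positions forming two well-separated groups (illustrative only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, k=2)
print(labels)
```

In the setting described above, `points` would hold the estimated latent positions and `k` would be set to 15.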
The interpretation of clusters requires careful consideration because the meaning of a single word is better understood in relation to its neighboring terms. A word located at the center of a cluster typically exhibits a high probability of contributing to the generation of a topic, often alongside other words in the same group. However, the semantic similarity within a cluster may arise either from the joint contribution of the words themselves or from their shared functional context with alternative terms. Thus, rather than examining individual words in isolation, it is crucial to evaluate their surrounding vocabulary to delineate the collective meaning of each cluster. In this sense, clusters serve as categorical units in which semantic coherence emerges from localized word proximities, allowing for the clarification of latent thematic structures.
Following Step 3 in Figure 1, the estimated latent positions of words were used to cluster semantically related terms based on their pairwise distances. Table 4 presents the top 10 representative keywords for each cluster, arranged according to their proximity to other words within the same cluster. This ordering enables a more precise characterization of the distinctive features of each cluster. While the identified clusters are not directly labeled as specific fire causes, their semantic coherence often reflects shared causal contexts. Accordingly, these clusters can be interpreted as data-driven representations of latent cause structures underlying fire incidents.
Cluster 1 is anchored by the central term distillation column, which suggests the risk of explosion due to chemical leaks or failures in pressure control. Surrounding words such as exposure, solvent, heptane, and silanes manufacturing reinforce the theme of hazardous material leakage leading to fires or explosions. Accordingly, Cluster 1 is labeled as Fire or explosion caused by hazardous material leakage.
Cluster 2 excludes the generic fire-related term ignition source and instead centers on polishing, which can produce ignition hazards when frictional heat contacts combustible materials. Neighboring words such as painting, floor, waste, and interior wall indicate that this cluster corresponds to Ignition from heat accumulation near combustible interior materials.
Cluster 3 is defined by the term incinerate, reflecting the inherent fire risks of incineration processes. Associated words such as waste wood, interior materials, heat, and urethane highlight ignition due to residual heat, while cutting machines and dust collection equipment suggest mechanical sources of smoldering. Thus, Cluster 3 is interpreted as Ignition due to residual heat post-work with combustibles.
Cluster 4 revolves around the term heat of reaction, but further interpretation requires contextualization with neighboring terms such as expired reagents, corn, and cooking oil. These materials, when awaiting disposal, pose risks of spontaneous combustion, particularly when combined with oxygen, heat waves, or rainwater. Hence, this cluster represents Spontaneous combustion from abandoned chemicals and oils.
Cluster 5 is structured around abnormal reaction, which links to numerous other terms and signals fires or explosions during abnormal chemical processing. Words such as pharmaceutical, film, oil, and epoxy point toward industrial settings where chemical instability can result in severe accidents. This cluster is labeled as Fires or explosions from abnormal reactions during chemical handling.
Cluster 6 includes terms like storage, cable, control box, and electrical circuit board, reflecting general electrical fires not tied to specific chemical processes. Its theme is summarized as Electrical fires in areas handling flammable materials.
Cluster 7 is characterized by coating, which signals risks associated with flammable paints. Surrounding terms such as corrosion, cutting oil, grinders, and presses suggest ignition by sparks or friction from industrial machinery. Accordingly, Cluster 7 is described as Ignition of flammable substances (e.g., machine oil) by friction heat or spark from machinery.
Cluster 8 is centered on vapor, a strong indicator of fire hazards in chemical plants. Terms such as nucleic acid, grease, and hazardous materials highlight risks arising from the ignition of volatile organic vapors, supporting the interpretation of Ignition of flammable vapor during organic solvent use.
Cluster 9 highlights the term drying room, referring to environments with elevated fire risks due to sustained high temperatures. Neighboring words, including cosmetic, ventilation fans, rotating, small amount, and decompose, suggest fire hazards associated with oil vapor and production processes. Thus, Cluster 9 is classified as Ignition from temperature rise in oil vapor areas.
Cluster 13 incorporates dilution, power outage, tracking, muller, vulcanizer, and melting fusion, suggesting interactions between hazardous materials, electrical ignition sources, and mechanical heat. This combination indicates Ignition in areas with accumulated combustible dust and oil vapors.
Cluster 14 includes terms such as micro, seat pad, deodorizing tower, absorbent pad, and gunnysack, which are not individually definitive. However, the presence of compost and sesame dregs highlights substances prone to spontaneous combustion through oxidation heat. Accordingly, this cluster represents Spontaneous combustion from improper oil residues storage or disposal.
Finally, Clusters 10, 11, 12, and 15 are each characterized by distinct keywords: Cluster 10 corresponds to Fire from ignition in accumulated combustibles, Cluster 11 to Fires from electrical factors in process equipment sites, Cluster 12 to Fires during machinery maintenance, and Cluster 15 to Ignition from sparks inside dust collection equipment with filters and debris. Importantly, these categories represent fire scenarios that extend beyond chemical plants to general factory environments.
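The proximity-based ordering of keywords used for Table 4 can be sketched as follows. The exact proximity measure is not specified in this section, so the sketch assumes that a word's centrality is its mean Euclidean distance to the other words in the same cluster; the function name `rank_by_proximity` and the toy coordinates are hypothetical.

```python
import math


def rank_by_proximity(positions, labels, cluster, top_n=10):
    """Order the words of one cluster by mean distance to their cluster mates."""
    members = [w for w in positions if labels[w] == cluster]

    def mean_distance(word):
        others = [v for v in members if v != word]
        if not others:
            return 0.0
        return sum(math.dist(positions[word], positions[v]) for v in others) / len(others)

    # Smaller mean distance means the word sits closer to the cluster core.
    return sorted(members, key=mean_distance)[:top_n]


# Toy latent positions and cluster labels (illustrative only).
positions = {
    "distillation column": (0.0, 0.0),
    "solvent": (1.0, 0.0),
    "heptane": (-1.0, 0.0),
    "exposure": (0.0, 2.0),
    "forklift": (9.0, 9.0),
}
labels = {"distillation column": 1, "solvent": 1, "heptane": 1, "exposure": 1, "forklift": 8}
ranked = rank_by_proximity(positions, labels, cluster=1)
print(ranked)
```

Under this assumption, the first word returned plays the role of the central term that anchors the interpretation of the cluster.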
Taken together, the interpretation of clusters provides meaningful categorical insights into the semantic structure of fire investigation records. However, semantic similarity alone does not capture the extent of economic severity associated with each term. To address this, we incorporated fire damage estimates into the analysis by estimating word-level coefficients via LASSO regression. This approach allows us to link clusters with the magnitude of potential financial losses. The coefficients estimated for each word quantify its relative contribution to explaining variations in property damage estimates.
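To make the estimation of word-level coefficients concrete, the following sketch fits a LASSO regression by cyclic coordinate descent with soft-thresholding. It is an illustration under several assumptions: the design matrix encodes word presence (0/1) per document, the objective is (1/2n)·||y − Xβ||² + λ·||β||₁ with centered columns and response, and the penalty value, function names, and synthetic data are all hypothetical rather than taken from the study.

```python
from itertools import product


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; the proximal step of the L1 penalty."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 on centered data."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    cols = [[row[j] - col_means[j] for row in X] for j in range(p)]
    y_mean = sum(y) / n
    resid = [yi - y_mean for yi in y]  # residuals when all coefficients are zero
    beta = [0.0] * p
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # constant column carries no information
            # Correlation of column j with the partial residual (excluding j).
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            updated = soft_threshold(rho, lam) / z
            step = updated - beta[j]
            resid = [r - c * step for c, r in zip(col, resid)]
            beta[j] = updated
    return beta


# Synthetic documents: every combination of three words being present or absent.
X = [list(bits) for bits in product([0, 1], repeat=3)]
# Synthetic damage: word 0 raises it, word 1 lowers it, word 2 is irrelevant.
y = [10.0 + 4.0 * a - 2.0 * b for a, b, _ in X]
beta = lasso_cd(X, y, lam=0.1)
print(beta)
```

In this toy example the irrelevant word receives a coefficient of exactly zero, while the other two are shrunk toward zero but keep their signs, which is the selection behavior that motivates using LASSO for word-level coefficients.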
Figure 3 presents a three-dimensional visualization, where the horizontal axes represent the latent positions of words and the vertical axis corresponds to their regression coefficients. Each colored dot denotes a single word, and the color indicates its cluster membership consistent with the grouping shown in Figure 2. This visualization allows for direct comparison between semantic proximity (latent positions) and the estimated contribution of each word to financial damage. Words located at higher positions along the coefficient axis correspond to terms associated with higher risk of financial loss, whereas clusters with generally lower coefficient values indicate groups of words linked to less severe incidents. By examining the clusters from multiple viewing angles, one can observe that several clusters exhibit a wide range of coefficient magnitudes, suggesting heterogeneous risk levels within the same semantic domain, while others maintain consistently low coefficients, reflecting more homogeneous, low-risk contexts.
As illustrated, even within the same semantic cluster, words exhibit heterogeneous patterns: some words have positive coefficients, indicating stronger associations with larger property losses, while others display negative coefficients, reflecting lower associated damage levels. This observation highlights that clusters capture thematic similarity, but do not necessarily imply uniform economic consequences. Hence, incorporating these regression coefficients into subsequent analyses provides an additional, economically grounded perspective. In particular, the integration of semantic clustering with property damage-based coefficients motivates the construction of a risk index that simultaneously reflects linguistic structure and financial severity, thereby offering a more comprehensive measure of fire-related risks.
7.3. Risk Index Estimation
The preceding analyses demonstrate that clusters derived from topic–word distributions provide semantically coherent categories of fire-related terms, while regression coefficients estimated from property damage amounts capture the associated economic severity. Importantly, these two perspectives highlight complementary aspects of risk: semantic clusters reflect the contextual mechanisms of fire occurrence, whereas coefficients quantify their financial impact. Moreover, as shown in the regression analysis, even words within the same cluster may exhibit heterogeneous patterns of association with damage amounts, indicating that semantic similarity alone is insufficient for fully characterizing fire risk.
To address this limitation, we propose the construction of a composite risk index that integrates both linguistic and economic dimensions. To quantify the relative contribution of words and clusters to fire-related incidents, we define three levels of risk indices: the word-level index, the cluster-level index, and the overall word index. Each measure captures a distinct dimension of risk, ranging from fine-grained lexical associations to broader thematic categories. By jointly considering (i) cluster-level semantic associations and (ii) word-level coefficients derived from loss data, the risk index provides a systematic measure of fire risk that is interpretable in terms of language use and grounded in economic outcomes.
As described in Step 4 of Figure 1, LASSO-based risk modeling was applied to estimate the word-level index, which measures the relative contribution of word i within cluster c to the estimation of property damage amounts. A higher value indicates that the word is more strongly associated with larger expected losses compared to other words in the same cluster. Table 5 presents the ten words with the highest word-level index values in each cluster, along with their Risk Index. For example, in Cluster 2 (Ignition from heat accumulation near combustible interior materials) and Cluster 7 (Ignition of flammable substances by friction heat or spark from machinery), words such as flame, heat, and grinder show high word-level index values, reflecting their strong linkage with higher levels of estimated property damage. Thus, the word-level index highlights words whose relative importance provides insight into the financial risk implications captured within each cluster.
The cluster-level index summarizes the risk associated with cluster c as a whole, aggregating the estimated coefficients of its constituent words. A higher value indicates that, on average, words belonging to this cluster are strongly predictive of higher property losses. Table 6 reports these values. The five clusters with the highest cluster-level index are: (i) Ignition of flammable vapor during organic solvent use, (ii) Electrical fires in areas handling flammable materials, (iii) Ignition from temperature rise in oil vapor areas, (iv) Ignition from heat accumulation near combustible interior materials, and (v) Fires or explosions from abnormal reactions during chemical handling. These clusters correspond to scenarios where ignition sources and flammable environments directly translate into severe financial consequences. By contrast, the clusters with the lowest cluster-level index, such as Ignition from sparks inside dust collection equipment with filters and debris or Fires from electrical factors in process equipment sites, represent situations with a relatively weaker association with large-scale losses. In this way, the cluster-level index provides an interpretable measure of how strongly each cluster of fire-related factors contributes to financial risk.
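The aggregation behind the cluster-level index can be sketched as follows. The precise formula is not reproduced in this section, so the sketch assumes the simplest reading of "on average": the mean of the word-level regression coefficients within each cluster, ranked in descending order. The function name `cluster_risk` and all numeric values are hypothetical.

```python
from collections import defaultdict


def cluster_risk(coefs, labels):
    """Rank clusters by the mean coefficient of their member words (assumed aggregation)."""
    grouped = defaultdict(list)
    for word, beta in coefs.items():
        grouped[labels[word]].append(beta)
    means = {cluster: sum(values) / len(values) for cluster, values in grouped.items()}
    # Highest mean coefficient first: clusters most associated with large losses.
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Toy coefficients and cluster labels (illustrative only, not values from Table 6).
coefs = {"vapor": 0.9, "grease": 0.3, "filter": -0.2, "debris": 0.1, "cable": 0.5}
labels = {"vapor": 8, "grease": 8, "filter": 15, "debris": 15, "cable": 6}
ranking = cluster_risk(coefs, labels)
print(ranking)
```

Other aggregations, such as a sum or a weighted mean, would fit the same structure by changing only the line that computes `means`.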
The overall word index extends beyond cluster membership to evaluate the global risk contribution of word i, accounting for its position across the latent embedding space. Table 7 lists the top twenty words by this index. For example, chemical material, shipping area, and hazardous material emerge as the top three words. These terms directly connect to concrete accident scenarios, such as the generation of flammable vapors from sludge leakage, electrical fires in distribution boards, and vapor leakage during hazardous material processing. The overall word index therefore highlights words that carry not only lexical salience within clusters but also broader cross-cluster risk relevance.
Collectively, these three indices provide a multi-layered framework: the word-level index identifies salient words within clusters, the cluster-level index ranks the clusters by their aggregate hazard potential, and the overall word index detects globally critical words linked to real-world accident narratives.