1. Introduction
The construction industry is characterized by labor-intensive operations and densely populated work environments, making it one of the most hazardous sectors globally [1]. Despite significant efforts to raise awareness and implement safety management practices, construction-related fatalities remain alarmingly persistent. Researchers have increasingly adopted data mining and artificial intelligence (AI)-based techniques to reduce fatalities in the construction industry. These include the integration of natural language processing with gated recurrent units and symbiotic organism search algorithms for safety assessment [2], virtual-reality-based safety training for modular construction [3], safety analysis in highway construction using GPT-3.5 [4], and machine learning-based integrated safety management systems tailored to site-specific characteristics [5].
However, the effectiveness of advancements in training, monitoring technologies, and risk evaluation strategies remains limited without a holistic understanding of the diverse attributes of construction accidents and their interrelationships. This gap is amplified by the technical, organizational, and task-level complexities inherent in construction work [6]. Another issue is that existing studies often focus on narrowly defined scenarios, for instance, formwork-related incidents [7], scaffolding-related falls [8], or highway construction [9], providing only fragmented insights into accident causation. Each construction task or material introduces distinct risks, requiring safety strategies tailored to specific operational contexts.
Understanding how multiple factors interact, such as project type, activity performed, involved objects, and causal triggers like human error or environmental conditions, is crucial for effective accident prevention. For example, a mapping of such interactions would make construction or safety managers aware of the potential risks when scheduling construction activities for a particular day. The manager could then plan safety-related activities as a distinct set of tasks within the overall schedule, incorporating safety requirements that address potential hazards and allowing safety measures to be scheduled even on a daily basis. In this context, safety scheduling refers to the systematic integration of safety tasks and preventive measures into the project timeline based on data-driven insights about risk-prone factor combinations.
This underscores the need for careful investigation of past accidents to systematically map these interactions [10], particularly under varying circumstances, and to apply the findings toward mitigating accident risks, an area largely underexplored in the existing literature [5]. In contrast to traditional predictive analytics, clustering and association-based techniques such as Multiple Correspondence Analysis (MCA) and Association Rule Mining (ARM) are better suited for grouping and analyzing the complex, multifactorial interactions inherent in construction accident data [11].
A significant number of recent studies have applied MCA and ARM independently to investigate and uncover complex interdependencies among accident-related attributes [6,11,12,13,14]. However, the independent application of each approach presents distinct limitations. For example, MCA may struggle to explain data variance, particularly when dealing with a large number of categories [15]. On the other hand, while ARM is widely recognized for its ability to reveal detailed associations among categorical variables, the high volume and diversity of attributes often result in an overwhelming number of association rules [14,16]. This makes its standalone application particularly challenging when analyzing the highly categorical and complex nature of construction accident data.
Therefore, the present study has two primary objectives. First, it addresses the lack of a framework for understanding on-site conditions that contribute to accidents, particularly the chain of mechanisms underlying construction accidents. To achieve this, the study uncovers critical associations among key accident components (such as accident type, project type, tasks performed, and objects involved) and links them to contributing causes (such as human error, procedural deficiencies, equipment malfunctions, task-specific hazards, and environmental conditions), by integrating two data mining techniques: MCA and ARM. This integrated approach enhances interpretability and analytical precision, while supporting the development of targeted intervention strategies aligned with daily construction activities. Second, the study addresses the present limitations of using MCA and ARM independently, by adopting an integrated analytical framework that applies MCA prior to ARM. In this sequential framework, MCA-based clustering is used first to reduce complexity and structure the dataset. With a more structured and focused dataset, ARM can then operate more efficiently and generate more meaningful, scenario-specific association rules.
To the best of our knowledge, this is the first study to implement this dual data-mining approach within the context of construction safety. Through an accident breakdown structure-based analysis, this study potentially offers safety professionals a robust tool for understanding how and under what conditions accidents occur based on past incident analysis, thereby facilitating the formulation of more precise and effective safety strategies tailored to specific future construction scenarios.
2. Theoretical Background to ARM & MCA in Construction Accident
2.1. MCA Application to Accident Analysis
MCA statistically visualizes relationships between two or more categorical attributes by extracting the maximum information inherent in the data. MCA dimensions are interpreted based on coordinate positions; a two-dimensional depiction is usually sufficient to explain most of the variance [
17]. Eigenvalues measure the quantity of categorical information accounted for by each dimension; a higher eigenvalue indicates larger total variance among the variables in that dimension. The largest possible eigenvalue is 1.
In MCA, data are structured in a contingency table, with rows representing observations and columns representing attributes. A binary matrix encodes the attribute categories, and row/column indices are calculated by dividing the contingency-table frequencies by the marginal frequencies. Singular value decomposition (SVD) is then used to decompose the matrix of standardized residuals into matrices containing information on row and column indices. Consider a tabular dataset of categorical variables, denoted as X, where X consists of I observations across K categorical attributes. If the total number of distinct categories across all attributes is J, then:
$$J = \sum_{k=1}^{K} J_k \tag{1}$$
where $J_k$ is the number of categories for attribute k [18]. To include all categories in the dataset, an additional data matrix with dimensions I × J is created, where each attribute is represented by multiple columns to display its possible categorical values. For example, for an attribute with two categories, the presence or absence of a category in an observation is indicated by 1 or 0, respectively.
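The indicator coding described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the records, attribute names, and the helper `indicator_matrix` are hypothetical and are not taken from the CSI database.

```python
# Toy accident records; attribute names and values are illustrative only.
records = [
    {"accident": "fall", "activity": "installation"},
    {"accident": "hit", "activity": "moving"},
    {"accident": "fall", "activity": "dismantling"},
]

def indicator_matrix(rows):
    """Return (columns, X), where X is the I x J 0/1 indicator matrix."""
    attributes = sorted({key for row in rows for key in row})
    # One column per (attribute, category) pair, in a stable order.
    columns = [
        (attr, cat)
        for attr in attributes
        for cat in sorted({row[attr] for row in rows})
    ]
    X = [[1 if row[attr] == cat else 0 for attr, cat in columns] for row in rows]
    return columns, X

columns, X = indicator_matrix(records)
J = len(columns)  # total number of categories across all attributes
```

Here each row of `X` contains exactly one 1 per attribute, so every row sums to K (the number of attributes) and J equals the sum of the per-attribute category counts.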
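If the sum of all entries in X is N, the probability matrix Z is computed as
$$Z = N^{-1} X \tag{2}$$
If r and c denote the vectors of the row and column totals of Z, respectively, and $D_r = \mathrm{diag}(r)$ and $D_c = \mathrm{diag}(c)$ are the corresponding diagonal matrices, the MCA factor scores can be obtained from the SVD of the following matrix:
$$D_r^{-1/2}\left(Z - r c^{\mathsf{T}}\right) D_c^{-1/2} = P \Delta Q^{\mathsf{T}} \tag{3}$$
Here, P and Q contain the left and right singular vectors, Δ is the diagonal matrix of singular values, and Λ is the eigenvalue matrix, such that Λ = Δ². The row and column coordinates (factor scores) are, respectively, given as [18]:
$$F = D_r^{-1/2} P \Delta \tag{4}$$
$$G = D_c^{-1/2} Q \Delta \tag{5}$$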
These equations provide the coordinates for plotting the rows and columns in a lower-dimensional space (typically two-dimensional), which reveals the relationships between the categories [
18]. The inter-point distance in this plot indicates the level of association: smaller distances suggest a stronger association, whereas larger distances indicate a weaker or no association.
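The SVD-based computation described above can be illustrated with a short NumPy sketch. It assumes a small hypothetical indicator matrix in which every category occurs at least once; the function name `mca_factor_scores` is introduced here for illustration and is not part of the study's code.

```python
import numpy as np

def mca_factor_scores(X):
    """Row/column factor scores of an indicator matrix X via SVD."""
    X = np.asarray(X, dtype=float)
    Z = X / X.sum()                      # probability matrix
    r = Z.sum(axis=1)                    # row totals (masses)
    c = Z.sum(axis=0)                    # column totals (masses)
    # Standardized residuals: D_r^{-1/2} (Z - r c^T) D_c^{-1/2}
    S = (Z - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    P, delta, Qt = np.linalg.svd(S, full_matrices=False)
    eigenvalues = delta ** 2
    F = (P * delta) / np.sqrt(r)[:, None]      # row coordinates
    G = (Qt.T * delta) / np.sqrt(c)[:, None]   # column coordinates
    return F, G, eigenvalues

# Hypothetical data: four observations, two attributes with two categories each.
X = [[1, 0, 1, 0],
     [1, 0, 0, 1],
     [0, 1, 1, 0],
     [0, 1, 0, 1]]
F, G, eigenvalues = mca_factor_scores(X)
```

The first two columns of `F` and `G` would be the coordinates plotted in a two-dimensional MCA map; the share of variance explained by a dimension is its eigenvalue divided by the sum of all eigenvalues.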
MCA is widely used in safety research, particularly in road traffic and collision studies. For example, Das and Sun [
19] applied MCA to analyze fatal run-off-road crashes in Louisiana using accident data from 2004 to 2011, while Jalayer and Zhou [
20] used it to identify patterns in motorcycle- and motorcyclist-related attributes influencing at-fault motorcycle-involved crashes based on a 2009–2013 dataset from Alabama. MCA has also been employed in the analysis of crash patterns in wrong-way driving accidents [
15], in fatal accident analyses involving wrong-way highway driving [
21], crash characteristics across different temporal levels [
22], and ship-to-ship collisions [
23].
In the construction sector, Kamardeen [
12] applied MCA to a dataset of 1048 construction accident records collected over 13 years (2002–2014) to investigate fatality patterns. The study examined variables such as worker age, sex, occupation, injury type and location, and mechanism of injury, identifying seven distinct accident clusters, including fatal falls, equipment-related deaths, and noise- and substance-related incidents. However, a major limitation of this study was its omission of specific causal trigger attributes, such as procedural failures, human errors, and environmental factors, which are crucial for understanding why such accidents occur.
While researchers across various disciplines have demonstrated the utility of MCA in uncovering latent patterns within complex categorical datasets, its application in construction accident research remains notably limited. Existing studies have shown that MCA is effective in reducing dimensionality and identifying co-occurring patterns that are often hidden in raw data. However, because MCA does not assume any specific data distribution, and given the inherently multifactorial and categorical nature of construction accident data, where tasks, objects, and environmental conditions frequently interact, this lack of assumption may interfere with the accurate interpretation of results if MCA is applied in isolation. Therefore, MCA is better positioned as a preliminary step for grouping similar accident scenarios, where it can organize complex data into structured clusters and enhance the effectiveness of subsequent analytical methods such as ARM.
2.2. ARM Application to Construction Accident Analysis
Association Rule Mining (ARM) is an unsupervised learning technique used to discover the inherent relationships and interactions between a set of attributes in a database. Unlike supervised learning, which is used for prediction or classification, ARM identifies co-occurrence associations among attributes rather than deriving a cause-and-effect relationship. The Apriori algorithm is commonly used for ARM to identify frequent itemsets. The ARM evaluation metrics are support, confidence, and lift. Support indicates the proportion of items in the database in which the antecedent (X) and consequent (Y) co-occur [24]:
$$\mathrm{Support}(X \Rightarrow Y) = P(X \cup Y) = \frac{n(X \cup Y)}{N} \tag{6}$$
Here, P(X ∪ Y) is the probability of X and Y co-occurring in a derived set of items, n(X ∪ Y) is the number of items in which both occur, and N is the total number of derived items.
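Confidence is the proportion of items containing X in which Y also occurs [24]:
$$\mathrm{Confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{P(X \cup Y)}{P(X)} \tag{7}$$
Lift compares the probability of Y occurring given X with the overall probability of Y occurring [24]:
$$\mathrm{Lift}(X \Rightarrow Y) = \frac{P(Y \mid X)}{P(Y)} = \frac{P(X \cup Y)}{P(X)\,P(Y)} \tag{8}$$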
ARM-generated rules are meaningful only when they are significant for the entire dataset. To ensure meaningful rules, the least significant rules are filtered out using minimum support and confidence values. Additionally, lift < 1 indicates a negative association, lift = 1 indicates that X and Y are independent, and lift > 1 indicates a positive association in which the presence of X makes Y more likely [21].
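As a worked illustration of the three metrics, the following sketch computes support, confidence, and lift for a single rule. The transactions and the function `rule_metrics` are hypothetical and serve only to make the definitions concrete.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_x = sum(antecedent <= t for t in transactions)    # items containing X
    n_y = sum(consequent <= t for t in transactions)    # items containing Y
    n_xy = sum((antecedent | consequent) <= t for t in transactions)
    support = n_xy / n
    confidence = n_xy / n_x if n_x else 0.0
    lift = confidence / (n_y / n) if n_y else 0.0
    return support, confidence, lift

# Hypothetical accident records, each coded as a set of attribute categories.
transactions = [
    frozenset({"scaffold", "dismantling", "fall"}),
    frozenset({"scaffold", "installation", "fall"}),
    frozenset({"excavator", "moving", "hit"}),
    frozenset({"scaffold", "moving", "hit"}),
]
support, confidence, lift = rule_metrics(transactions, {"scaffold"}, {"fall"})
```

In this toy set, the rule {scaffold} → {fall} holds in two of four records (support 0.5) and in two of the three scaffold records (confidence 2/3); since falls occur in half of all records, the lift is 4/3, indicating a positive association.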
Applications of ARM in accident-related studies have typically focused on linking accident types with various accident attributes. Table 1 summarizes the relevant studies and their focus areas. In general, most existing studies emphasize isolated factors or simple pairwise associations, failing to capture the dynamic, multifaceted nature of construction environments. Key contextual elements, such as the temporal and spatial dimensions of accidents, on-site working conditions, and the sequence of events leading to incidents, are frequently overlooked [5,25].
For example, Liao and Perng [
24] used ARM to identify injury-related attributes based on construction accident reports in Taiwan from 1999 to 2004. Cheng et al. [
25] applied ARM to analyze cause-and-effect relationships in 1347 construction accidents in Taiwan between 2000 and 2007. Shin et al. [
26] used attributes such as contract amount, number of daily workers, worker tenure, age, and occupation, along with accident content, type, day, time, consequences, and project progress rate, to analyze their interactions in relation to each accident. Similar approaches were adopted by Guo et al. [
27] and Ayhan et al. [
16], who examined the interplay between accident-related factors in an effort to develop attribute-based accident prevention systems. More recently, Kim et al. [
14] analyzed project characteristics, working objects, activities, and accident types using big data, while Shao et al. [
11] focused on associations among area, accident type, outcome, and time in collapse-related accidents in China.
Machfudiyanto et al. [
28] applied ARM to analyze 2503 cases of unsafe behavior in Indonesian construction projects, aiming to uncover the causal relationships between unsafe acts, accident types, and law violations. The results revealed two main consolidated rules linking unsafe practices, such as improper material placement and failure to use safety lines, with falls, struck-by injuries, and violations of national safety regulations. Similarly, Yao et al. [
29] proposed a comprehensive risk assessment framework for construction safety accidents by integrating ARM and Bayesian Networks (BNs) to identify causal relationships among 33 key factors derived from 166 accident reports. ARM was used to uncover strong correlations between human, technical, environmental, and management factors, while the BN model quantified accident likelihood, severity, and sensitivity, highlighting management and human factors as the dominant causes.
Rafindadi et al. [
30] investigated 302 fatal construction accidents in Malaysia by ARM and concluded that management factors, hazardous site conditions, and unsafe actions by workers are primary contributors in construction fatalities. Guo et al. [
31] examined 101 falls from height accident cases using ARM in China. They established an index of 64 causative factors based on operator condition and behavior, equipment and facility conditions, site conditions, and production operations management.
Table 1.
Studies in which ARM and MCA were applied to construction accidents (chronological order).
| Method | Authors | Objective | Finding | Enhancement/Limitation |
|---|---|---|---|---|
| ARM | Liao and Perng [24] | Characteristics of construction sites injury attributes at | Safety performance is influenced by multiple factors such as weather, age, etc. | Excessive number of generated rules; suggested utilizing statistically based pruning technologies |
| ARM | Cheng et al. [25] | Cause-and-effect relationship between construction accident factors | Insufficient awareness of safety issues and potential hazards on the part of both workers and management may contribute to accidents occurrence | Exclusive rule generation for fall or tumble-related incidents, despite consideration of five accident types |
| ARM | Shin et al. [26] | Meaningful insights, derivation from 12 set of accident attributes for enhanced safety management | Worker age and experience influence safety behavior, with scaffolding and elevated work areas presenting highest accident risk | Excessive number of rules requiring manual removal; while multiple attributes are considered, certain factors, contract amount or progress rate argued to be not directly relevant to the accident |
| ARM | Rafindadi et al. [30] | Uncover hidden cause–effect relationships in construction industry accidents | Combinations of human, equipment, site, and management factors in deadly accidents | While management issues are significant addition to the other causal factors, total of 100 rules were generated from 300 accident cases, raising concerns about the significance |
| ARM | Guo et al. [31] | Work-at-height accident cases | 64 triggering factors in fall type accidents in China | Focuses solely on fall type accident; 64 rules only from 100 cases, suggesting requirement of pruning technique |
| ARM | Guo et al. [27] | Analysis of unsafe behavior of workers | Unsafe acts by workers vary in different stages of construction | Considers only one metro construction site project, therefore not generalized; focuses only on worker’s behavior and ignored causal triggers like surrounding conditions, equipment issues, etc. |
| ARM | Machfudiyanto et al. [28] | Analyze unsafe behaviors and identify key behavioral and regulatory causes of accidents | Strong associations between unsafe acts and accident types | Demonstrates ARM’s effectiveness for behavior-based analysis; however, lacks integration with other contextual or environmental variables. |
| ARM | Yao et al. [29] | Develop a hybrid risk assessment model by integrating ARM with Bayesian Networks (BN) | Management and human factors are the most influential causes; BN enabled quantitative reasoning for both likelihood and severity of accidents | Improves interpretability by combining ARM with BN, but smaller datasets raises concerns about generalization |
| ARM | Ayhan et al. [16] | Investigations of factors involved in nine different accident types | Cause and effect relationships in occupational accidents | Excessive number of rules; focused on the accident attribute interrelationship ignoring the causal triggers, causal objects or involved tasks |
| ARM | Kim et al. [14] | Accident scenarios generation based on work type and object causing accident | 76 association rules were generated for reinforced concrete work, temporary work, and earthworks work breakdown structure | Excessive number of rules; only derived rules for specific work types and object; multiple factors need to be considered for dynamics |
| ARM | Shao et al. [11] | Accident attribute associations evaluation for collapse-type accidents based on causal factors: human, material, machine, manage., technical issue, etc. | Association of various factors between the construction scheme and organizations | Although it explores the accident frequency based on the causal factor, it does not focus on accident breakdown structure, such as activity responsible, object type, etc. |
| ARM | Yoon et al. [6] | Risk assessment in the 4-M (Material, Method, Machine, or Man) technique | Relationship between the 4-M factors with each accident type (fall, struck by, hit, crushes, and caught in-between) and improved safety management based on the analysis | Focusing only on the 4-M causal factors and did not include other prospective variables such as accident criteria, construction type, activity type, or object type |
| MCA | Kamardeen [12] | Patterns in construction fatalities | Identified 7 fatality clusters and explained the relationship between the factors triggering the incidents | Suggested improved safety management schemes based on the analysis |
| MCA + ARM | Amiri et al. [32] | Factors influencing accidents at construction sites | Analyzed the accident criteria for fall, traffic, electric shock, and burn-type accidents | Although the authors used MCA and ARM, MCA was mostly used for pattern analysis, and the results were not integrated with ARM |
While these studies provide valuable insights, they primarily focus on either specific or high-level attributes such as project characteristics, task categories, workers’ profiles or temporal information, often with the aim of predicting accident types. Although these factors are important for understanding the criteria that contribute to accidents, this outcome-oriented approach makes it difficult to examine how specific objects (e.g., heavy equipment or temporary facilities) or activities (e.g., loading, painting, cutting) directly influence the occurrence of incidents. Many of the studies provide only a partial consideration of the influencing attributes, and more critically, they often overlook the underlying causal triggers that elevate the risk of accidents in relation to other key factors. For instance, was the accident caused by inadequate supervision, lack of safety awareness, equipment malfunction, or adverse weather conditions? Although Yoon et al. [
6] categorized accident causes using the 4-M framework, material, method, machine, and man, the study did not explore how these causal triggers interact with other critical accident attributes, such as accident type, activity, or object involved.
As a result, the influence of these causes on other accident-related variables remains unclear, and their interdependencies are still vague. This lack of clarity highlights the need for further investigative analysis of past accident data using ARM. Such analysis would allow a more thorough examination of how immediate causal triggers interact with other critical accident factors, thereby helping to fill the identified gap in the literature.
2.3. Integration of MCA and ARM for Construction Risk Assessment
Few researchers have actually combined MCA and ARM in safety analysis. Amiri et al. [
32] applied multiple data-mining techniques to Iranian construction-industry accident datasets from 2007 to 2011. Using MCA, they first determined the associations among accident attributes such as worker age, marital status, time, and injured body part. Additionally, they applied decision tree ensembles to analyze the accident-influencing factors and examined the relationships between the accident criteria and consequences using ARM. Recently, Rahman et al. [
13] applied both MCA and ARM to investigate animal–vehicle crashes: first, MCA generated crash clusters, and then ARM found the frequent attribute associations in severe crash groups.
Table 1 summarizes previous applications of MCA and ARM in construction safety accident studies. A common outcome observed across these studies is the generation of a large number of association rules or repeated patterns, which often require manual filtering based on relevance or importance. While some researchers have suggested the use of statistical pruning techniques, no recent study within the construction domain has actually implemented such methods. Therefore, as proposed in this study, the potential of using MCA as an initial pruning technique warrants further investigation.
MCA and ARM both identify patterns in datasets but differ in purpose and approach: MCA focuses on visualizing relationships and reducing dimensionality among categorical attributes, while ARM extracts explicit rules to reveal frequent patterns and potential cause-and-effect links. One issue with MCA is that it can struggle with high-dimensional datasets when it comes to adequately explaining variance [
15]. When the explained variance is low, unrelated attribute categories may appear in close proximity within the MCA plot, depending on the factor scores calculated (Equations (4) and (5)), which can lead to inaccurate clustering and hinder pattern analysis of construction accidents. As a result, relying solely on MCA for detailed analysis may compromise the accuracy of findings, particularly in the context of construction accident scenarios. In contrast, ARM identifies patterns based on the attribute co-occurrence frequency (Equations (6)–(8)). Previous construction-safety research agrees that ARM can generate an overwhelming number of rules, necessitating pruning techniques or expert knowledge to distinguish significant from insignificant rules [14,16]. Additionally, when the metric thresholds are set improperly, ARM may produce too few rules, particularly for imbalanced data, as highlighted by Cheng et al. [25].
Given the distinct limitations of MCA and ARM when applied independently, a combined approach offers a promising solution that leverages the strengths of both techniques while mitigating their respective weaknesses. Applying MCA as a preliminary clustering step can serve as an effective pre-pruning technique for ARM. This sequential approach facilitates the extraction of more concise and relevant association rules by narrowing the analytical focus to context-specific accident clusters. It enables a deeper examination of how risk factors collectively influence safety outcomes across various stages of construction by mapping accident causes within specific operational contexts. These refined insights can support the formulation of more precise, data-informed safety measures and adaptive intervention frameworks tailored to the evolving conditions of construction projects.
3. Methodology
The research methodology comprises four steps: data collection and filtering (Step 1), MCA (Step 2), ARM (Step 3), and development of a guideline for implementing improved safety strategies through identification of the underlying risk drivers (Step 4). The Korea Authority of Land and Infrastructure Safety (KALIS) has implemented the Construction Safety Management Integrated Information (CSI) database, which continuously collects and updates data on construction accident cases. The CSI system provides standardized information on hazard factors, categorized by occupation, and has been widely utilized by Korean researchers for accident analysis and investigation [5,6,14]. This study also adopts the CSI database for analysis of construction accident causes.
In Step 1, data collection and preprocessing are conducted. The CSI dataset comprises a comprehensive record of construction-site events, including personal illnesses and incidents ambiguously categorized as “others” or “not classified.” To ensure the accuracy and reliability of subsequent analyses, these non-informative categories were systematically excluded using custom Python (Version 3.9) scripting, which removed all rows with the corresponding entries from the database. Additionally, rows containing missing data were eliminated to maintain consistency and integrity across the dataset. The original CSI dataset contained 14,286 entries from 2019 to 2023, and after removing the irrelevant and missing entries, the dataset was reduced to 12,484 valid samples.
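The filtering in Step 1 could look roughly like the following pandas sketch. The column names, the example rows, and the helper `clean_records` are assumptions made for illustration; the study's actual script and the CSI field names are not given in the text.

```python
import pandas as pd

# Assumed labels for non-informative categories (lower-cased for matching).
NON_INFORMATIVE = {"others", "not classified"}

def clean_records(df, columns):
    """Drop rows with missing values or non-informative categories."""
    df = df.dropna(subset=columns)
    mask = df[columns].apply(
        lambda col: col.str.strip().str.lower().isin(NON_INFORMATIVE)
    ).any(axis=1)
    return df.loc[~mask].reset_index(drop=True)

# Hypothetical raw records; real CSI data would have five attribute columns.
raw = pd.DataFrame({
    "accident": ["fall", "others", "hit", None],
    "activity": ["installation", "moving", "Not classified", "moving"],
})
clean = clean_records(raw, ["accident", "activity"])
```

Only the first toy record survives: the second and third contain a non-informative label and the fourth has a missing value.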
The triggers of construction site accidents often involve multiple causative factors and must be examined across a broad range of attributes [
11].
Figure 1 illustrates the five categorical attributes and the number of categories for each attribute considered in the present study. The attributes were selected to follow the ‘where’ (facility), ‘what’ (accident), ‘where’ (activity), ‘how’ (causal object), and ‘why’ (cause) sequence for risk factor extraction in the accident breakdown structure. “Facility” refers to the construction project type, such as a bridge, building, or water supply system. “Accident” indicates common construction site accident types in Korea (e.g., falls, hits, stuck, and cuts) [6,14]. “Activity” refers to specific tasks (e.g., dismantling, installation, maintenance, or moving). “Causal object” indicates the category of accident-associated objects (e.g., excavators, dump trucks, or cranes) as defined in the dataset.
“Cause” contains expert-identified reasons for accident triggers, with 42 categories, including worker unawareness, poor dismantling procedures, improper equipment operation, and incorrect personal protective equipment (PPE) usage, which are systematically recorded in the CSI database [6]. Each attribute and the respective categories are included in Figure 1.
In Step 2, MCA is applied to the filtered CSI database. Applying MCA to these selected five attributes generates multiple specific scenarios, such as falls, hits, stuck, and cut accidents, by clustering relevant facility, causal object, causal triggers and activity types with the corresponding accident types in close proximity. While MCA offers a useful visual representation of data, to support the identification of associations between attribute categories and to determine the optimal number of clusters, previous studies have used hierarchical clustering as a complementary technique [
33]. In this study, hierarchical clustering was applied to the MCA coordinate data points to determine the associations. Dendrograms were then used to visually determine the optimal number of clusters based on these associations. At the end of Step 2, each of the identified clusters was extracted for further analysis with ARM.
In Step 3, ARM is used to generate meaningful association rules that identify specific causal risk factors within each of the clusters identified in Step 2. From the CSI data, the 42 categories of accident causes are grouped into five major classes: (a) human behavior (e.g., worker negligence or inadequate use of PPE), (b) procedural factors (e.g., poor installation methods or improper work sequencing), (c) mechanical issues (e.g., equipment malfunctions or operational defects), (d) task-related factors (e.g., unsafe movement or work posture), and (e) environmental conditions (e.g., weather conditions or unsafe surrounding structures). ARM uncovers detailed associations between these cause classes and specific activities or objects at the construction project level for different accident types.
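The grouping of raw cause categories into the five classes can be expressed as a simple lookup, sketched below. The labels are paraphrased from the examples in the text rather than taken from the CSI database, and the full 42-category mapping is not reproduced here; `CAUSE_CLASS` and `classify_cause` are hypothetical names.

```python
# Hypothetical lookup from raw cause labels to the five cause classes.
# Labels paraphrase the examples in the text; they are not actual CSI codes.
CAUSE_CLASS = {
    "worker negligence": "human behavior",
    "inadequate use of ppe": "human behavior",
    "poor installation method": "procedural factors",
    "improper work sequencing": "procedural factors",
    "equipment malfunction": "mechanical issues",
    "operational defect": "mechanical issues",
    "unsafe movement": "task-related factors",
    "unsafe work posture": "task-related factors",
    "weather conditions": "environmental conditions",
    "unsafe surrounding structure": "environmental conditions",
}

def classify_cause(label):
    """Return the cause class for a raw label, or 'unmapped' if unknown."""
    return CAUSE_CLASS.get(label.strip().lower(), "unmapped")

classes = [
    classify_cause(c)
    for c in ["Equipment malfunction", "Worker negligence", "rockfall"]
]
```

Returning an explicit "unmapped" value, rather than raising an error, makes it easy to audit which raw categories still lack a class assignment.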
In Step 4, the findings are critically analyzed to identify specific stages within the daily activity schedule that require advance inspection, thereby enhancing the efficiency and focus of safety planning. Finally, the process for developing targeted safety guidelines and effective safety management strategies is demonstrated through an example case study.
4. Accident Pattern and Causation Analysis
4.1. MCA-Based Cluster Generation
The five MCA attributes included facility, activity, causal object, accident type, and cause. The results were plotted in two dimensions to visualize the relationships among these accident attributes (
Figure 2). MCA reduces dimensionality by projecting the original data onto new dimensions (axes) that capture the maximum variance within the dataset.
In Figure 2a, Dimensions 1 and 2 (horizontal and vertical axes, respectively) represent the first and second principal components. The eigenvalues (as described in Equation (3)) quantify the amount of variance explained by each dimension; thus, Dimensions 1 and 2 are the principal axes that capture the greatest variance. Points that appear close to each other in the plot are more similar or related within the context of the analyzed dimensions.
Due to the large number of categories, the original MCA plot appears visually congested, hindering clear interpretation of associations, a known limitation of MCA, as previously discussed. To overcome this issue and to separate closely positioned points that may carry redundant information, this study employs hierarchical clustering based on dendrogram analysis to systematically determine the optimal number of clusters and extract the corresponding data points.
Table 2 summarizes the eigenvalues for each dimension. A higher eigenvalue indicates a greater proportion of total variance explained by that dimension. The relatively low eigenvalues for the first two dimensions suggest a high degree of heterogeneity among the attributes, with each contributing unique and distinct information [
34]. This heterogeneity reflects the inherently random and complex nature of construction site accidents. As shown in
Table 2, the first two dimensions together account for approximately 2.5% of the total variance. Although the cumulative variance explained by the first two dimensions was low, and additional dimensions contributed less than 1% each, and therefore they were not included to avoid amplifying noise and overfitting. The complete cumulative variance up to the tenth dimension is presented in
Table S1. This outcome of lower values of variance supports the initial assumption that MCA alone is insufficient to capture the complete structure of construction accident data. Consequently, it should be paired with a complementary analytical method, such as ARM, to more effectively uncover the underlying associations among accident attributes.
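The low share of variance on the leading axes can be related to the number of categories. The following is a minimal sketch, assuming MCA is computed on the indicator (one-hot) matrix; the paper's Equation (3) may use a different formulation. Under that assumption, total inertia equals J/Q − 1, where J is the number of categories and Q the number of variables. The function name and the toy records are hypothetical and are not drawn from the study's dataset.

```python
from collections import Counter

def indicator_inertia(records):
    """Total inertia of an indicator-matrix MCA, summed cell by cell.

    records: list of equal-length tuples, one categorical value per variable.
    Returns (inertia, J, Q): J = number of observed categories, Q = variables.
    """
    n = len(records)
    q = len(records[0])
    # n_j: number of records falling in each (variable index, category) column.
    counts = Counter((v, cat) for rec in records for v, cat in enumerate(rec))
    inertia = 0.0
    for rec in records:
        present = set(enumerate(rec))
        for col, n_j in counts.items():
            p_ij = 1.0 / (n * q) if col in present else 0.0  # correspondence-matrix cell
            expected = (1.0 / n) * (n_j / (n * q))            # row mass * column mass
            inertia += (p_ij - expected) ** 2 / expected      # chi-square contribution
    return inertia, len(counts), q

# Hypothetical toy records: (facility, activity, object, accident, cause).
toy = [
    ("Building", "Installation", "Tool", "Cut", "Negligence"),
    ("Building", "Cutting", "Tool", "Cut", "PoorOperation"),
    ("Sewerage", "Excavation", "Slope", "Stuck", "JudgmentError"),
    ("Building", "Moving", "Formwork", "Fall", "Negligence"),
    ("Sewerage", "Excavation", "Unspecified", "Hit", "WasteRemoval"),
    ("Building", "Dismantling", "Formwork", "Hit", "Negligence"),
]

inertia, J, Q = indicator_inertia(toy)
closed_form = J / Q - 1  # identity for indicator-matrix MCA
print(f"J={J}, Q={Q}, inertia={inertia:.4f}, J/Q-1={closed_form:.4f}")
```

Because this inertia grows with J and is spread over as many as J − Q dimensions, any two axes can only account for a small fraction of it when the attributes have many categories.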
Next, hierarchical clustering was applied to the coordinates generated from MCA to uncover meaningful groupings within the data. This process employed Ward’s linkage method, which minimizes total within-cluster variance at each merging step [
35]. As an unsupervised learning technique, hierarchical clustering produces a dendrogram, a tree-like structure that visualizes how individual data points are progressively grouped into larger clusters.
Figure 2b presents the dendrogram constructed from the MCA-derived data points. The top of the dendrogram represents the full dataset, with all data points eventually merged into a single cluster. In contrast, the densely populated lower section displays the original MCA data points before clustering. Horizontal cuts at different vertical distances yield varying levels of clustering granularity; in this study, a cut was made at a height just above 40 (dashed orange line), resulting in three distinct clusters containing 720, 549, and 7366 entries, respectively. This process effectively filters out noise within the data, reducing the dataset from 12,484 to 8635 entries for more focused analysis.
The clusters identified through hierarchical clustering were then utilized to generate association rules independently. This approach narrows the rule generation scope to more homogeneous groups, thereby enhancing the specificity, relevance, and interpretability of the resulting association rules.
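The clustering step can be sketched in a few lines. This is a naive illustration of Ward's criterion on two-dimensional coordinates, not the implementation used in the study. The merge cost is defined here as the increase in within-cluster sum of squares; library implementations may scale dendrogram heights differently, so the threshold below is not comparable to the cut height of about 40 reported above. The function name, points, and threshold are hypothetical.

```python
def ward_clusters(points, max_cost):
    """Agglomerate points by Ward's criterion until the cheapest merge exceeds max_cost."""
    # Each cluster is (size, centroid, member indices).
    clusters = [(1, tuple(p), [i]) for i, p in enumerate(points)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, ca, _ = clusters[a]
                nb, cb, _ = clusters[b]
                sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
                cost = na * nb / (na + nb) * sq_dist  # increase in within-cluster SS
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        if cost > max_cost:
            break  # Ward heights never decrease, so this acts as a horizontal cut
        na, ca, ia = clusters[a]
        nb, cb, ib = clusters[b]
        centroid = tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb))
        merged = (na + nb, centroid, ia + ib)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return [sorted(members) for _, _, members in clusters]

# Hypothetical MCA coordinates (Dimension 1, Dimension 2) forming three loose groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9),
          (10.0, 0.0), (10.2, 0.1), (9.9, -0.1)]
groups = ward_clusters(points, max_cost=1.0)
print(sorted(groups))
```

Raising `max_cost` corresponds to moving the cut line up the dendrogram, which yields fewer and larger clusters. The pairwise search is cubic in the number of points, so a dataset of the size analyzed here would call for an optimized library routine.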
4.2. ARM-Based Cause Analysis
ARM considers the same five attributes as MCA: accident type, facility (structure), activity, causal object, and cause. The appropriate ARM parameters (support, confidence, and lift) for rule generation typically depend on the dataset. Setting support and confidence thresholds too low can lead to an overwhelming number of rules, whereas excessively high thresholds may produce too few, limiting their analytical value. To strike a balance between rule relevance and interpretability, minimum thresholds of 0.01 for support, 60% for confidence, and 1.0 for lift were applied consistently across all clusters, following thresholds established in previous literature [
14,
16]. The following subsections present the analysis results for each cluster, with section headings including the support, confidence, and lift values associated with each cluster.
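For readers less familiar with the three ARM metrics, the sketch below computes support, confidence, and lift for a single rule and checks them against the thresholds stated above (0.01, 60%, and 1.0). The transactions are hypothetical toy data, and the function names are illustrative; item labels follow the `Attribute_Category` naming pattern used for the network-graph nodes.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    antecedent, consequent = set(antecedent), set(consequent)
    n_a = sum(antecedent <= t for t in transactions)                   # antecedent present
    n_c = sum(consequent <= t for t in transactions)                   # consequent present
    n_ac = sum((antecedent | consequent) <= t for t in transactions)   # both present
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

def passes(metrics, min_support=0.01, min_confidence=0.60, min_lift=1.0):
    """Apply the minimum thresholds to one (support, confidence, lift) triple."""
    support, confidence, lift = metrics
    return support >= min_support and confidence >= min_confidence and lift >= min_lift

# Hypothetical accident records, one set of attribute items per record.
transactions = (
    [{"Activity_Installation", "Cause_Negligence", "Accident_Cut"}] * 4
    + [{"Activity_Installation", "Cause_Negligence", "Accident_Fall"}] * 1
    + [{"Activity_Excavation", "Cause_JudgmentError", "Accident_Stuck"}] * 2
    + [{"Activity_Cutting", "Cause_PoorOperation", "Accident_Cut"}] * 3
)

metrics = rule_metrics(
    transactions,
    antecedent={"Activity_Installation", "Cause_Negligence"},
    consequent={"Accident_Cut"},
)
print(metrics, passes(metrics))
```

In this toy set, the rule holds in 4 of 10 records (support 0.4) and in 4 of the 5 records containing the antecedent (confidence 0.8). Cut accidents occur in 7 of 10 records overall, so the lift is 0.8/0.7, slightly above 1.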
4.2.1. Cluster 1
Table 3 presents the association rules derived through ARM for Cluster 1. In ARM, an if–then relationship is established between sets of accident-related factors, where the antecedents predict the likelihood of the consequents. A lift value greater than 1 indicates a meaningful association, while high confidence values (>80%) signify reliability.
For instance, Rule 1 reveals that during installation activities, if workers show negligent behavior, there is a high probability (92% confidence, lift = 1.46) that a cut-type accident will occur, involving tool-type objects in building construction projects. Similar scenarios appear in Rules 2 and 3, where formwork and carpentry or setup activities, coupled with worker negligence, are also highly associated with cut-type injuries and the use of tools. Conversely, Rules 4 and 5 highlight how cutting activities performed in building-type facilities can lead to cut injuries when construction tools are used under unsafe conditions, such as poor equipment operation or reckless actions.
The network graph in
Figure S1 effectively illustrates the key associations within Cluster 1. The central node, Accident_Cut, is connected to a variety of activity types, causes, facilities, and object types that frequently co-occur in cut-related incidents. This visual mapping highlights how Cluster 1 specifically groups factor combinations that are strongly associated with cut-type accidents, offering valuable insights for identifying and mitigating such scenarios on construction sites.
The rules in Cluster 1 reveal two dominant themes: (1) human-behavior-related causes, particularly negligence and recklessness, and (2) mechanical or equipment-related issues, such as poor tool handling or unsafe operational procedures. These patterns reinforce known industry concerns, but the data-driven specificity provided by the model adds operational clarity. For instance, traditional safety protocols might flag “cutting” or “tool use” as generic hazards; in contrast, the model pinpoints when these become critical, e.g., during formwork tasks under negligent attitude, allowing managers to prioritize inspections.
4.2.2. Cluster 2
Table 4 presents the association rules derived through ARM for Cluster 2, which primarily includes hit- and stuck-type incidents. Only five rules were generated, all of which involve excavation-related activities. For stuck-type accidents, the likelihood of occurrence during excavation activities is high (87.5% confidence), especially when the worker makes a judgment error while working near an excavation slope (object type). This rule also shows a strong association, with a lift value of 7.51. Inadequate waste removal is identified as a key contributing factor that can trigger hit-type accidents in both excavation and drilling and blasting activities, with 100% confidence (Rules 4 and 5). However, these rules do not involve any specific object type, which may indicate the absence of a frequently co-occurring object type within these attribute combinations. Except for Rules 2 and 3, none of the rules include the facility type. This indicates that, while facility-specific associations (Buildings, Water Supply and Sewerage) appear in certain cases, their influence is less consistent and generally weaker than that of the activity and cause factors that dominate the cluster. In Rule 2, both Water Supply and Sewerage and Buildings facilities were found to be associated with hit-type accidents, and the confidence reaches 100% when the worker exhibits neglectful behavior. In Rule 3, the complexity of excavation activities is associated with hit-type accidents at 73% confidence; however, both of these rules also lack specific object-type information.
Overall, this cluster reveals more fragmented associations rather than comprehensive links across all five considered attributes.
Figure S2 presents the network graph visualizing the associations between the attribute categories, where Activity_Excavation appears as the central connecting node. In summary, the rules generated in Cluster 2 demonstrate how task complexity, judgment errors, and site management deficiencies contribute to specific accident types in excavation operations. These insights can be integrated into pre-task risk assessments (e.g., job hazard analyses). For example, supervisors should conduct real-time evaluations of terrain and assign more experienced personnel to high-risk slope zones. Additionally, given the limited influence of facility type, safety planning should prioritize activity and cause factors over project classification.
4.2.3. Cluster 3
Cluster 3 reveals multiple associations under diverse construction scenarios (
Table 5), reflecting a broad spectrum of activity–accident relationships. For example, during installation activities, both hit- and fall-type accidents are likely to occur in building-type projects when workers exhibit neglectful behavior. Among these, hit-type accidents show a stronger association, with a confidence level of 89% (Rule 7), whereas fall-type accidents are also linked to the presence of formwork as an object (Rule 6).
In the case of dismantling activities, hit-type accidents can occur under two distinct conditions: (1) due to poor dismantling procedures in building projects, even without a specific object type (confidence: 54%, lift: 7.14, Rule 1), or (2) due to worker negligence when handling formwork (Object) (confidence: 51%, lift: 1.49, Rule 5). While the confidence values are moderate, they still represent meaningful risks and should be considered in risk assessments. Fall-type accidents during installation activities caused by a poor installation method also indicate a moderately associated risk (confidence: 50%, lift: 2.21). Transportation activities are associated with both hit- and stuck-type accidents, each showing over 80% confidence within building-type projects, mostly stemming from worker negligence and without any specific object type involved (Rules 2 and 8). Meanwhile, fall-type accidents are more likely to result from worker negligence during finishing activities (confidence: 75%, lift: 2.06, Rule 3) than during moving activities (confidence: 64%, lift: 1.51, Rule 4), again in the context of building-type projects.
This cluster is notably more diverse than the others, likely due to its inclusion of the largest number of entries, which may contribute to the variation in patterns.
Figure S3 presents a network graph visualizing the associations among all accident-related factors in Cluster 3. One of the most prominent patterns across this cluster is the recurring presence of worker negligence, which is linked to multiple accident types across various activities. In contrast, object type plays a relatively minor role in this cluster, as only two rules include a specific object (formwork). Overall, Cluster 3 highlights the complex and multifactorial nature of accident causation in construction. The data reveals that across varied tasks, from installation to transport, human behavior, especially negligence, consistently emerges as a critical risk factor. As such, safety strategies should focus not only on task-specific protocols but also on reinforcing a strong safety culture that addresses behavioral risks at all levels of project execution. From an operational standpoint, these findings support the use of the MCA + ARM framework not only to detect high-risk activity–cause–accident patterns that may not be evident in traditional risk assessments, but also to encourage behavior-based safety interventions, particularly for routine tasks such as finishing, moving, or dismantling.
4.3. Cluster-Based Patterns and Risk Interpretations
A critical review of the derived association rules suggests that rule informativeness is highly dependent on contextual segmentation. In Clusters 1 and 2, ARM produced high-confidence, high-lift rules despite fewer rule generations, indicating strong intra-cluster homogeneity. In contrast, Cluster 3, though rule-rich, yielded a wider range of confidence values and weaker inter-variable consistency, likely due to greater input diversity. This reinforces that clustering granularity directly influences the signal-to-noise ratio in ARM outputs, thereby affecting the reliability of actionable insights.
Fall-, hit-, and cut-type accidents were more prevalent than stuck-type incidents, consistent with prior CSI-data-based studies [
6,
14], as reflected in
Table S2. Similar patterns have been reported in other regional studies, where fall and hit accidents outnumber other types [
6,
36], suggesting a broader trend in construction accident occurrences. Percentage tables showing the distribution of accident types within each cluster are provided in
Tables S3–S5. Among all causes, human behavior, including negligence, reckless actions, and judgment errors, appeared most frequently, highlighting persistent safety lapses across activity types. This underlines the need for enhanced supervision, particularly aligned with scheduled activities, to create a more responsive and dynamic safety management framework.
Notably, object type was often absent in high-confidence rules despite its inclusion, indicating that object-specific hazard patterns may be less stable across contexts. This challenges the emphasis on object-focused risk assessments, such as formwork, scaffolding [
7,
8], and construction tools or equipment [
37], commonly seen in previous research. Similarly, facility type appeared inconsistently, implying limited predictive value unless combined with task- and behavior-related variables. These findings underscore the importance of interaction effects, which are often overlooked in univariate or bivariate models.
Overall, the findings align with and extend previous research. Prior studies have linked cut-type accidents to tasks involving tools and small machinery [
38], which this study further associates with building projects, human behavior, and mechanical issues. Hit- and stuck-type accidents during excavation, as noted by [
38], and the role of inadequate waste removal in hit-type incidents [
39] are also supported. Unlike earlier work, the present study connects these activities to specific root causes, such as human behavior and task-related factors, improving practical relevance. Similarly, the associations of fall accidents with dismantling, moving, and installation tasks involving formwork and scaffolding [
40,
41] and hit accidents with materials during transportation and setup [
14,
36] are reaffirmed. However, the proposed approach goes further by linking such patterns to explicit causal triggers, enabling daily risk prioritization, something not feasible without this level of causal mapping.
Given the dynamic nature of construction projects, these attribute interactions should be systematically recorded. A national integrated system could support real-time risk alerts by allowing managers to input daily activity details, such as facility type, task, and object, into a database that maps these to known accident–cause associations. Additionally, a risk assessment indicator could be incorporated, categorizing risks as moderate (50–65%), high (70–85%), or very high (>85%) based on the confidence levels associated with each rule to assist the project or risk manager in decision-making or in developing necessary strategies.
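The proposed indicator amounts to a simple mapping from rule confidence to a qualitative band, sketched below. The band boundaries are those quoted above. How to treat confidence values that fall outside them (below 50%, or between 65% and 70%) is not specified, so returning "Unclassified" for those values is an assumption of this sketch, as is the function name.

```python
def risk_level(confidence_pct):
    """Map a rule's confidence (in percent) to a qualitative risk band."""
    if confidence_pct > 85:
        return "Very High"
    if 70 <= confidence_pct <= 85:
        return "High"
    if 50 <= confidence_pct <= 65:
        return "Moderate"
    # Assumption: values outside the stated bands are left unclassified
    # rather than forced into a neighboring band.
    return "Unclassified"

for pct in (64, 73, 100, 67):
    print(pct, risk_level(pct))
```

Leaving the gaps explicit makes it visible when a rule needs a manual judgment or when the bands should be redefined to be contiguous.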
5. Hypothetical Case Study: Daily Safety-Risk Assessment Using MCA + ARM Insights
This section presents a hypothetical case study demonstrating how insights from the MCA + ARM framework can be applied to real-world construction safety scenarios, highlighting its potential for enhanced risk management and accident prevention:
A safety manager (X) at a large construction company is overseeing a wastewater treatment plant construction project. Key tasks for the day include (1) heavy material moving near formwork structures (facility type: Water Supply and Sewerage); (2) excavation work for pipeline laying; and (3) tool-intensive setup of scaffolding around an overhead tank. Each task is entered into the company’s integrated safety-risk management system, which leverages insights from the MCA + ARM framework. The input consists of the activity keywords (“moving”, “excavation”, and “setup”) and the associated objects (“formwork”, “pipe”, and “tool”), all under the common facility type “Water Supply and Sewerage”.
The system cross-references each input against a pre-mapped database of MCA cluster-based association rules derived from historical construction accident data. When keywords are entered, the system first retrieves all association rules related to the input activity. Next, it filters the rules by the relevant object or facility type to ensure contextual relevance. Finally, the system outputs the associated risk factors and provides an overall risk assessment, along with a qualitative risk indication level (e.g., Moderate, High, Very High), which is determined based on the confidence level (
Table 6).
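The retrieve-then-filter lookup described above can be sketched as follows. This is an illustration of the described flow under stated assumptions, not the design of an existing system. The rule table is a simplified, hypothetical stand-in for the pre-mapped database: its confidence, lift, and risk-level values are the ones quoted in this paper for the three case-study tasks, `None` marks a field that is unspecified or unknown (treated as a wildcard when filtering, or as missing for lift), and all names are illustrative.

```python
# Illustrative rule table. None = field not specified; it matches any input
# when used as a filter key, and simply means "not reported" for lift.
RULES = [
    {"activity": "moving", "object": "formwork", "facility": None,
     "accident": "fall", "cause": "worker negligence",
     "confidence": 0.64, "lift": 1.51, "level": "Moderate"},
    {"activity": "excavation", "object": None, "facility": None,
     "accident": "hit", "cause": "task complexity",
     "confidence": 0.73, "lift": 2.38, "level": "High"},
    {"activity": "setup", "object": "tool", "facility": None,
     "accident": "cut", "cause": "worker negligence",
     "confidence": 1.00, "lift": None, "level": "Very High"},
]

def assess(activity, obj=None, facility=None, rules=RULES):
    """Retrieve rules for an activity, then filter by object and facility."""
    # Step 1: all rules whose activity matches the input keyword.
    by_activity = [r for r in rules if r["activity"] == activity]
    # Step 2: keep rules whose object/facility is unspecified or matches the input.
    relevant = [r for r in by_activity
                if r["object"] in (None, obj) and r["facility"] in (None, facility)]
    # Step 3: report the highest-confidence rules first.
    return sorted(relevant, key=lambda r: r["confidence"], reverse=True)

daily_tasks = [("moving", "formwork"), ("excavation", "pipe"), ("setup", "tool")]
for activity, obj in daily_tasks:
    for rule in assess(activity, obj, facility="Water Supply and Sewerage"):
        print(f"{activity}: {rule['accident']}-type accident, cause: {rule['cause']}, "
              f"confidence {rule['confidence']:.0%}, risk {rule['level']}")
```

Treating unspecified fields as wildcards lets an activity-only rule, such as the excavation rule that names no object, still fire when the manager enters an object like "pipe".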
Guided by the output, X implements a proactive risk mitigation plan. Each intervention targets the risk factors identified for each task, with the intensity of the intervention informed by the rule’s confidence and lift values (i.e., higher-risk scenarios prompt more stringent controls). Key measures include:
Targeted Supervision for Movement/Fall Prevention: For the task involving heavy material movement near formwork (moderate risk; 64% confidence for fall accidents), the safety manager (X) deploys a dedicated on-site supervisor to monitor worker activities throughout the shift. While the confidence and lift values are comparatively lower than other scenarios, they still highlight worker behavior, particularly negligence, as a critical contributing factor. The supervisor ensures strict adherence to fall prevention protocols, including the use of designated walkways, controlled pace of movement, and maintenance of three-point contact during vertical transitions.
Engineering Controls and PPE for Excavation Activities: For the excavation and pipeline installation task (high risk; 73% confidence for hit-type accidents), X enforces a dual-layered safety strategy comprising engineering and administrative controls. Temporary barricades are installed around open trenches to physically isolate workers from mobile equipment and to prevent inadvertent access to high-risk zones. The high lift value (2.38) indicates a strong contextual relationship between excavation activities and impact-related incidents, thus warranting immediate and stringent risk control measures. In parallel, a second supervisory team is tasked with conducting scheduled safety audits and ensuring the continuous integrity of barriers and signage.
Enhanced Safety Briefing and PPE Enforcement for Tool-Based Setup Work: For the scaffolding setup task involving hand tools (very high risk; 100% confidence for cut-type accidents), X organizes a task-specific pre-work safety briefing. The session emphasizes precision in tool handling, safe posture, and adherence to best practices during assembly operations. All personnel are required to undergo a PPE verification process, including the mandatory use of cut-resistant gloves, face shields, and protective eyewear. No worker is permitted to begin operations without full compliance. The high-confidence rule justifies a zero-tolerance policy and full enforcement of safety protocols.
Throughout the day, X systematically tracks the implementation and compliance of each intervention using a structured checklist aligned with the identified MCA–ARM risk profiles. Based on this data-driven strategy, X expects that the targeted interventions will contribute to lowering accident rates within the identified hazard categories compared to projects lacking such analytical support. In the future, data collected from each intervention could be fed back into the system to continuously refine the association rules and update the probability weighting of risk factors over time. For performance evaluation, the framework can be aligned with standard safety indicators such as the Total Recordable Incident Rate (TRIR) and Near-Miss Frequency Rate (NMFR), as well as operational metrics such as alert response time and the number of preventive actions triggered per shift. Additionally, the rules can be recalibrated at major project phases, particularly during specific activity, object, or shift combinations, depending on the frequency of new data inputs or near-miss events recorded.
The safety manager would oversee data entry and response actions, while foremen and supervisors would be responsible for on-site verification and reporting within timeframes predefined in a service-level agreement (SLA), for example, confirming mitigation within one hour for “very high” risks. This process also provides the opportunity to develop project-specific alert trigger thresholds based on feedback collected from the field.
The integration of these metrics will enable quantitative validation of the framework’s effectiveness in future pilot implementations. An example hypothetical table (
Table S6) has been added as a proposed layout for control mapping in such future applications. Although hypothetical, this case study illustrates the practical applicability and potential of integrating pattern-driven insights into proactive, task-level construction safety management.
6. Conclusions
The present study addresses a critical gap in the construction safety analytics literature, namely, the tendency to aggregate diverse accident scenarios in ways that obscure meaningful causation patterns. Without identifying immediate and actionable causes, it becomes difficult to conduct effective risk assessments or to develop proactive safety strategies, especially when the goal is to manage risks both across construction stages and in day-to-day operations. Additionally, this study responds to the limitations of prior research employing cluster-based pattern analysis techniques, such as ARM, which often generate an overwhelming number of rules with limited interpretability. By integrating MCA with ARM, this study reduces data dimensionality, thereby enabling the extraction of more focused, interpretable, and actionable safety insights.
The proposed approach accounts for multiple dimensions of construction accidents, including accident types and contextual attributes, factors that have often been underexplored in previous studies. It follows a systematic logic based on the sequence of “where” (facility), “what” (accident), “when/where” (activity), “how” (causal object), and “why” (cause), forming an accident breakdown structure that aligns with a prototype risk assessment framework for maintaining safe construction sites. A hypothetical case study was presented to demonstrate how these insights can be integrated into daily risk assessments, tailored to scheduled construction activities.
Despite its strengths, the framework has limitations. First, while ARM effectively captures direct co-occurrences between causes and outcomes, it does not infer temporal or causal directionality. For example, although “worker’s negligence” frequently appears across various accident types and activities, it remains unclear whether it serves as a primary cause or a compounding factor influenced by other latent variables, such as task complexity or environmental conditions. Additionally, even high-confidence association rules should be interpreted with caution, especially in imbalanced datasets where certain accident types (such as falls or hits) are disproportionately represented, as observed in previous studies. Second, the dataset used originates from Korea, which introduces potential regional or cultural biases in the identified patterns. The prevalence of certain accident types (such as falls and hits) and the dominance of behavioral causes may reflect localized work practices, regulatory contexts, or reporting standards. Third, while the framework reduces dimensional complexity, imbalanced data distributions can still lead to high-confidence rules for overrepresented categories. This highlights the need for data balancing techniques (such as SMOTE or under-sampling) and rule prioritization strategies in future implementations. While MCA-based pruning proved effective for this specific construction accident scenario and its broader applicability is encouraged, the present findings do not yet confirm whether it can fully replace the traditional manual rule-removal practices used to address redundancy or repetition, as noted in previous research. Fourth, the explained variance in MCA is very low due to the large number of categories, which, although a common issue in MCA, can still be problematic.
Additionally, while the hierarchical clustering cut-off point was selected based on visual inspection of the dendrogram structure, the authors acknowledge that this introduces a subjective element. Future work should incorporate quantitative stability metrics, such as silhouette scores or bootstrap-based cluster validation, to improve the robustness of cluster selection. Finally, the hypothetical case study, while illustrative, does not provide empirical evidence of real-world effectiveness. Relying solely on a single safety manager’s perspective may introduce subjective bias based on individual experience. Future implementations will therefore consider multiple managerial perspectives or team-based validation to improve generalizability. Additionally, future work should focus on pilot testing the framework in actual construction projects to validate its usability and impact.
In conclusion, the MCA–ARM integrated framework presents a promising method for uncovering latent, context-specific patterns in construction accident data. However, its broader applicability should be further investigated at a global level to support practical, cross-regional safety assessments. Moving forward, integrating these insights into adaptive safety management systems will be essential for enabling proactive, data-informed risk mitigation in construction environments.