A Study on Data Pre-Processing and Accident Prediction Modelling for Occupational Accident Analysis in the Construction Industry

Abstract: In the construction industry, occupational accidents are difficult to predict because various accident characteristics arise simultaneously and organically across different types of work. Furthermore, even when occupational accident data are analyzed, meaningful results are difficult to deduce because the records produced by incident investigators are qualitative and span a wide variety of data types and categories. Recently, numerous studies have used machine learning to analyze the correlations in such complex construction accident data; however, the focus has thus far been on predicting severity from various variables, and several limitations remain when deriving correlations between features. Thus, this paper proposes a data processing procedure that can efficiently manipulate accident data using optimal machine learning techniques and can derive and systematize meaningful variables, providing a rational approach to such complex problems. In particular, the most influential variables are identified through methods such as clustering, the chi-square test, Cramer's V, and predictor importance; the analysis is then simplified by optimally grouping the variables. For the accident data with optimized variables and elements, a predictive model is constructed between variables using a support vector machine and a decision-tree-based ensemble; the correlation between the dependent and independent variables is then analyzed through an alluvial flow diagram for several cases. A new procedure for data preprocessing and accident prediction modelling is thereby introduced to overcome the difficulties posed by complex and diverse construction occupational accident data, and effective accident prevention becomes possible by deriving the correlations of construction accidents through this process.


Introduction
In recent decades, various industrial safety management systems have been introduced and improved upon; however, occupational safety remains unstable and low. In particular, in the construction industry, various fields and types of work are undertaken simultaneously and organically, and a wide range of hazard factors are present. Thus, safety management in the construction industry is difficult, owing to the complexity of numerous activities and the involvement of various entities [1]. Moreover, most of the work is performed by humans; hence, techniques to predict and prevent accidents are required. The decision tree (DT) performs well even when the dataset is defective and small [28,29]; furthermore, it offers considerably greater applicability than competing techniques, owing to its single parameter [30]. DT instabilities can be overcome by a boosting approach, that is, by growing a forest of DTs and performing multiple verifications of a given tree's classification result [31,32]. With these advantages, DTs have been successfully applied in a variety of research fields, including medicine [33], the social sciences [25], business management [34], construction engineering and management [35], and the process industry [36]. Table 1 summarizes the latest research trends in the prediction of construction accidents using occupational accident data; here, as in other fields, ANN-, SVM-, and DT-based ensemble methods have been applied [1,8,37,38]. In accordance with the research findings in various fields and the latest trends in construction accident research, this study applied SVM- and DT-based ensemble methods, because they are more suitable for classifying and predicting construction accident data. Many researchers have conducted analysis studies across numerous fields using various past accident data; however, several limitations have been found in the analysis of occupational accident data.
First, the individual and subjective opinions of the person who prepares the occupational accident report are reflected in the data; therefore, it is difficult to process and reflect the characteristics of occupational accident data in the construction industry, which are created without a standardized composition procedure and include many types of variables and values [1]. Second, the structure of occupational accident data includes mixed variables (e.g., numerical and categorical text representations) and missing information. These numerous variable types, along with the large number of categories, create difficulties and ambiguities when interpreting results at the level of the data elements; as a result, only limited correlations can be derived between variables [8]. The following conclusions can be drawn from the existing research results. Numerous types of variables and values are present in occupational accident data, which makes it difficult to process the data, reflect their characteristics, and interpret correlations. However, if the variables are reduced excessively, their characteristics are lost, and meaningful conclusions cannot be drawn. Therefore, the types and ranges of values for suitable variables must be standardized to properly utilize data containing more construction accident information. Moreover, a process that can easily capture construction accident trends must be established, and a prediction method capable of learning from past accidents to minimize the risk of future ones is required.
Previous accident analyses have the disadvantage of evaluating a single dependent variable against single independent variables, predicting only one dependent variable (e.g., severity) from qualitatively and subjectively recorded data. The purpose of this study is to overcome these limitations: to derive correlations between objective variables without arbitrarily manipulating the data, to establish an accident prediction model on that basis, and to support accident prevention measures. Therefore, we propose an optimized data preprocessing method to minimize the major variables and elements in diverse and complex occupational accident data, and we construct an ML prediction model on the resulting dataset. Furthermore, correlation analysis is conducted via an alluvial flow diagram. Finally, accident concept analysis, employing clustering and visualization of the relationships between major variables through principal component analysis (PCA), is used to provide more extensive conclusions.

Materials and Methods
The procedure applied in this study consisted of four steps, as shown in Figure 1; each step included further details of the segmented procedure. In the first step, the elements of the occupational accident data (prepared subjectively and qualitatively beforehand) were standardized into similar elements, and an initial dataset containing 16 variables and 21 elements was constructed. In the second step, the first data preprocessing stage reduced the variables via four methods: latent class cluster analysis (LCCA), the chi-square test, Cramer's V, and predictor importance; from the results of these four methods, seven variables were selected. In the third step, a second data preprocessing stage was performed to reduce the elements within the variables. Three variables contained more than ten elements; thus, their severity was predicted by ML methods while the number of elements was decreased, and the optimal number of elements was determined by checking that the prediction performance was maintained, yielding the final dataset. In the fourth step, correlation analysis between variables using the final dataset, detailed analysis of the cluster groupings, and visualization analysis through PCA were performed. The correlations of the input variables were analyzed by varying the output variable over the final dataset, and the correlations of all variables were analyzed using the alluvial flow diagram for the major outputs. Next, detailed analyses for each group were conducted by grouping the final dataset using LCCA. Finally, the severity levels were visualized using the three major variables extracted by PCA.

Initial Data and Data Description
The occupational accident data used in this study were collected from the database of the safety management system of a large construction company in Korea for the period 2015-2020; in total, 963 occupational accident data entries for construction sites were used. Because previous accident analysis and prediction studies have used datasets of similar or smaller size, the number of samples in this study is judged sufficient for machine learning [1,39,40]. However, the initial occupational accident dataset included too many factors, with over 130 occupational categories and over 400 assailing materials. When an occupational accident occurs, the person in charge of recording the accident information differs between construction sites, and because the information is qualitative and subjective, the same content can be expressed differently owing to the non-systematic manner of information entry. Therefore, after checking the standard and general terms used by the Occupational Safety and Health Administration (OSHA) in the USA, the Health and Safety Executive (HSE) in the UK, and the Korea Occupational Safety and Health Agency (KOSHA) in Korea, we standardized the elements and merged similar ones, to reconstruct the occupational accident data. The reconstructed initial dataset consisted of 16 variables (14 categorical and two binary), which are the same as or similar to the variables used in other studies [1,8]. Terms that lack a generally accepted definition are described here for easy understanding. Brief descriptions of each variable are presented in Table 2. (i) Type of work (TW): This variable represents the victim's job role in the construction project.
It consists of 15 elements: "carpenter," "painter," "scaffolder," "stonemason," "safety officer," "welder," "equipment operator," "electric piping equipment worker," "landscaper," "window worker," "structural steel/steel frame worker," "concrete worker," "tunnel worker," "earth worker," and "woodworker." (ii) Type of accident (TA): This indicates the type of accident that the victim suffered. It consists of ten elements: "jamming," "fall down," "fall off," "hit," "collapse," "struck," "imbalance and uncontrolled motion," "occupational diseases," "mutilation/cut/puncture," and "fire/explosion/blast." (iii) Injured part (IP): This refers to the part of the body that received the injury. It consists of 12 elements in total: pelvis, ear, eye, leg, multiple head location, foot, hand, brain, mouth, nose, arm, and chest/abdomen.
(iv) Assailing material (AM): This variable is a standard used by the Korea Occupational Safety and Health Agency (KOSHA); it refers to the substance directly responsible for causing harm to the victim.
In this study, a total of 21 elements were considered: "formwork/shores," "construction and mining machinery," "stair and ladder," "metal fine particles/trace elements/dust/fumes," "other buildings/structures/etc.," "end portion and opening," "fauna and flora," "floor and ground/etc.," "scaffolding and work plate," "equipment," "machinery parts and appendages," "hand tool nonpowered," "container and pack," "transporting," "lifting equipment," "machinery," "land transportation," "manpower machinery," "processing equipment/machinery," "natural phenomena" (e.g., working environment and atmospheric conditions), "material," "electrical equipment/parts," "debris/garbage," and "hand tool powered." (v) Cause of accident (CA): This indicates the cause of the accident and contains seven elements: "unsafe work," "lack of personal protective equipment," "facility defect/collapse," "lack of safety measures," "work equipment defect," "carelessness," and "third-party liability." (vi) Severity: This represents the accident's severity categorization, based on the risk assessment criteria used by OO Construction in Korea. It is divided into three stages: Level 1 ("slight injury") describes a minor injury, including ligament injuries and fractures; Level 2 ("serious injury") includes fractures of critical areas (face, chest, and abdomen); and Level 3 ("fatal injury") includes critical-area fractures requiring surgery and permanent disabilities caused by problems such as damage to the five senses (vision, hearing, etc.). These severity criteria follow a classified list of personal damage used in Korea (steps 1-14): steps 1-7 are classified as fatal injury, steps 8-14 as serious injury, and the remainder as slight injury.

Data Preprocessing
Data preprocessing is an essential task in data mining and has been reported to consume (on average) more than 60% of the total effort of the entire process [41]. In particular, because construction accident data include numerous variables and types of values, the dataset must be preprocessed or standardized before analysis; otherwise, the presence of outliers, omissions, and term inconsistencies in the data makes interpreting the analysis results difficult; this renders the trends incomprehensible and can thereby produce misleading analysis results. Furthermore, when reducing variables and elements to facilitate meaningful interpretations, data preprocessing must be performed by cross-comparing several methods instead of one.

Latent Class Cluster Analysis (LCCA)
LCCA is an unsupervised learning algorithm based on a probability model [42]; it sorts data with similar properties into latent clusters that are maximally heterogeneous between groups. It analyzes the complex interrelationships between the observed variables and can be applied to a comprehensive dataset irrespective of the data type (categorical, binary, or continuous) [43]. To determine the number of clusters in LCCA, the optimal number is first determined by analyzing various statistical criteria (i.e., the Akaike information criterion (AIC), Bayesian information criterion (BIC), and consistent AIC (CAIC)) and the entropy R-squared value. The more stable the statistical criterion values and the larger the entropy R-squared value, the more appropriate the latent grouping [44]. In this study, LCCA was first used as a form of contribution analysis between variables; it was then used to conduct a detailed analysis of the construction accident data.
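As a rough sketch of this model-selection step: the study runs LCCA in XLSTAT, which has no direct scikit-learn equivalent for categorical data, but a Gaussian mixture model on numerically encoded records illustrates the same information-criterion-driven choice of cluster count. The data below are synthetic and only stand in for encoded accident records; nothing here is taken from the study's dataset.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy stand-in for encoded accident records: three well-separated groups in 4-D.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(60, 4)) for c in range(3)])

# Fit mixtures with 1..6 components and record BIC for each, as the study
# does with BIC/AIC/CAIC for 1..10 latent clusters.
bics = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    bics[k] = gmm.bic(X)  # lower BIC = better fit/complexity trade-off

# The "elbow" where BIC stops decreasing marks the appropriate cluster count.
best_k = min(bics, key=bics.get)
```

In the paper's data the analogous curve stabilizes at five clusters; here the synthetic data were generated with three groups, so the criterion selects three.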

Chi-Square Test
The chi-square test analyzes categorical data by using the chi-square distribution to verify the significance of the difference between observed and expected frequencies. It is primarily used to verify the goodness of fit, homogeneity, and independence of the data [8]. The goodness-of-fit and homogeneity tests compare the distributions of individual groups, and the independence test determines whether a dependency exists between two characteristics of the data. In this study, the chi-square test was applied to 16 cases, each with one dependent variable and 15 independent variables, and the variables with p-values of 0.05 or less were retained as candidate major variables.
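A minimal sketch of the independence test described above, using a hypothetical contingency table (the row and column categories are invented for illustration, not taken from the study's data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = type of accident, cols = severity level.
table = np.array([[30, 10,  5],
                  [10, 25, 15],
                  [ 5, 10, 40]])

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the expected frequencies under the independence hypothesis.
chi2, p, dof, expected = chi2_contingency(table)

# Variables with p <= 0.05 would be retained as candidate major variables.
dependent = p <= 0.05
```

For a 3x3 table the degrees of freedom are (3-1)(3-1) = 4, and the strongly diagonal counts above reject independence at the 5% level.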

Cramer's V test
The chi-square statistic increases with the number of rows and columns in the contingency table, which makes relative comparisons difficult. Cramer's V test is derived from the chi-square test and normalizes the statistic to the range [0, 1]; a value closer to 1 signifies a stronger association [45]. Cramer's V test was performed for the same cases as the chi-square test and was used as one of the common major-variable extraction methods.
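The normalization can be made concrete with the standard formula V = sqrt(chi2 / (n * (min(r, c) - 1))) for an r x c table with n observations; the two tables below are invented extremes for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V = sqrt(chi2 / (n * (min(r, c) - 1))) for an r x c table."""
    # correction=False: Yates' continuity correction would bias V for 2x2 tables.
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

strong = np.array([[50, 0], [0, 50]])  # perfect association -> V = 1
weak = np.array([[25, 25], [25, 25]])  # no association      -> V = 0
```

Unlike the raw chi-square statistic, V can be compared across tables of different sizes, which is why the study uses it alongside the chi-square test.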

Support Vector Machine (SVM)
An SVM, proposed by Cortes and Vapnik (1995), is a statistical supervised learning algorithm; it was initially developed for linear binary classification and was later extended to nonlinear classification and regression. In an SVM, the hyperplane that marks the boundary in the data space is trained to maximize the distance to the nearest data points [22]. SVMs can achieve higher performance in classification and regression problems than other statistical and ML techniques; this is because, unlike existing ML techniques (which are prediction methods based on probability estimation), they do not directly estimate probabilities but only predict classification results. The most important element of constructing an SVM model is setting appropriate parameters [23]. When inappropriate parameters are set, the prediction accuracy can drop sharply, or the model can fail to generalize owing to overfitting [21]. Furthermore, when it is difficult to classify the data within a limited number of dimensions, the SVM can map the data to a higher-dimensional space and classify them using a kernel function. The kernel function evaluates the required dot products implicitly, preventing the computational cost (which is proportional to the dimensionality of the data) from increasing; examples include linear, radial basis, and polynomial functions.
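A minimal sketch of a kernelized SVM classifier of the kind described above, on synthetic multi-class data standing in for encoded accident records (the dataset and the specific `C`/`gamma` values are illustrative assumptions, not the study's settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for encoded accident records.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# RBF kernel maps the data implicitly to a higher-dimensional space;
# C and gamma are exactly the parameters whose tuning the text stresses.
model = make_pipeline(StandardScaler(),
                      SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
```

Scaling before the RBF kernel matters because the kernel is distance-based; without it, variables on large scales dominate the decision boundary.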

Ensemble
The ensemble method guides a final learner to derive the optimal result by combining existing weak learners; bagging (bootstrap aggregating) and boosting are representative examples of such methods. The bagging method creates several partial datasets by sampling the training dataset, and it derives the final, optimal result by combining the results of weak learners trained on each partial dataset. The boosting method goes beyond bagging: using the results of the weak learners on the partial dataset, it sequentially assigns weights to the misclassified data to train the next weak learner and derive the result [27]. In this study, a DT was used as the weak learner, and the LSBoost (least-squares boosting) method was used to reflect the weights of the misclassified data. Misclassification can be compensated for by carrying a weight equal to the current misclassification in each step's partial dataset forward into the next dataset during the sequential training process [31,32]. In addition, the ensemble was used both to predict the dependent variable and to analyze the contribution of each independent variable to that prediction.
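A hedged sketch of a DT-based boosting ensemble: the study's LSBoost runs in a different toolchain, so scikit-learn's gradient boosting is used here as an analogue (each stage fits a shallow tree to the previous stages' errors, which plays the same role as re-weighting misclassified data). The data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for encoded accident records.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Shallow decision trees as weak learners; each stage corrects the
# residual errors of the ensemble built so far (boosting).
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                 learning_rate=0.1, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)

# The same fitted ensemble yields per-variable contributions, as used
# in the study's predictor-importance analysis.
importances = clf.feature_importances_
```

The `feature_importances_` vector is normalized to sum to 1, which is what allows the contribution comparison across input variables described later in the paper.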

Principal Component Analysis (PCA)
PCA expresses the independent variables as principal components through linear combinations. It selects the axes containing the eigenvectors with the largest variances (the principal components) and plots the data in a lower dimension while preserving their characteristics as much as possible. PCA was first proposed by Pearson (1901) and subsequently developed into its modern form by Hotelling (1933) and Jolliffe [46,47]. PCA derives the covariance matrix of the existing dataset and calculates its eigenvectors V and eigenvalues λ; the eigenvector with the largest variance is then used and analyzed as the main axis. However, in some cases, an eigenvector with a large variance does not necessarily indicate a high degree of separation in the data. In this study, PCA was used to visualize the categorical data and to predict severity.
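A minimal sketch of the projection step: PCA keeps the eigenvectors of the covariance matrix with the largest eigenvalues and projects the data onto them. The 5-D data below are synthetic and constructed to be nearly rank-2, so a few components capture almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated 5-D data: two latent directions plus a little noise.
base = rng.normal(size=(200, 2))
X = base @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

pca = PCA(n_components=3)
X3 = pca.fit_transform(X)                       # projection onto top eigenvectors
explained = pca.explained_variance_ratio_.sum() # variance preserved in 3-D
```

Because the study visualizes severity with the three major components, checking `explained_variance_ratio_` is the natural sanity test that a 3-D plot still represents the data faithfully.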

First Data Preprocessing for Selection of Major Variables
Since it is difficult to perform meaningful analysis by simply feeding construction accident data, with their many variables and elements, into ML methods, standardization and preprocessing of the data are essential. Therefore, the first data preprocessing step was carried out to derive the variables that have a major influence on construction accidents. First, LCCA was conducted using XLSTAT (2020) software to select key variables in the construction accident data. Datasets containing both binary and categorical variables were used for the analysis, because the data type does not affect LCCA implementation. LCCA was applied first as a data preprocessing method because it can group data without specifying a separate target. Grouping was conducted while increasing the number of clusters from 1 to 10, to determine the optimal number for classifying the accident data. The statistical values of BIC, AIC, and CAIC and the entropy R-squared values were then checked to determine the optimum number of clusters; the most suitable way to determine this number is to observe the decrease in BIC, AIC, and CAIC and select the point at which the values stabilize. The AIC value decreased as the number of clusters increased; however, the BIC and CAIC values began to stabilize once the number of clusters reached 5. The entropy R-squared value was used as an additional criterion: values close to 1.0 indicate a well-separated model. However, for large multivariate datasets, this value generally tends to increase with the number of clusters owing to the high level of heterogeneity; thus, it plays only an ancillary role in determining the optimal model. After dividing the data into 1 to 10 clusters, we found that when the number of clusters was 5, the BIC value was 18,500, the CAIC value was 17,800, and the entropy R-squared value began to stabilize at 0.93. Therefore, the appropriate number of clusters was determined to be 5.

Figure 2 shows the results of dividing the initial dataset into five clusters. Eight variables (e.g., accident classification, process rate, and month) included in the five clusters at the same rate did not affect the grouping; therefore, they were excluded from the major variables, and nine variables showing differentiation between cluster groups were selected. When using the variables selected by LCCA in the first data preprocessing step (i.e., headquarter, year, type of work, type of accident, injured part, workplace, assailing materials, cause of accident, and severity), the construction accident data could be grouped more accurately using five clusters. Although LCCA alone was able to select nine major variables [1], the reliability of the extracted major variables can be increased by selecting the variables common to several methods rather than relying on the results of one method. Therefore, predictor importance and the independence of variables were calculated using the chi-square, Cramer's V, and ML methods, which are generally used to calculate the independence of variables and identify highly correlated variables in text and categorical data analysis, and the relationships between variables in the accident data were analyzed. From the 16 variables, one was used as the output and the remaining 15 as inputs, to find the most important variables. The variables contributing to the predicted output were found to be generally similar across the 16 cases, and Table 3 shows the results for the severity prediction. Among the 15 input variables predicting severity, six were included among the eight most significant variables by all four methods.
The selected variables were year, type of work, type of accident, injured part, assailing materials, and cause of accident; in total, seven variables (including severity) were determined to exhibit strong correlations.
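The "variables common to all four methods" step can be sketched as a simple intersection of per-method rankings. The rankings below are hypothetical placeholders (the real ones come from the study's LCCA, chi-square, Cramer's V, and predictor-importance results in Table 3); only the intersection logic is the point.

```python
# Hypothetical per-method rankings, most to least significant (top 8 shown).
rankings = {
    "lcca":       ["year", "type_of_work", "type_of_accident", "injured_part",
                   "assailing_material", "cause_of_accident", "workplace", "month"],
    "chi_square": ["injured_part", "type_of_accident", "year", "cause_of_accident",
                   "assailing_material", "type_of_work", "headquarter", "month"],
    "cramers_v":  ["injured_part", "type_of_accident", "assailing_material",
                   "cause_of_accident", "year", "type_of_work", "workplace", "day"],
    "importance": ["type_of_accident", "injured_part", "cause_of_accident", "year",
                   "assailing_material", "type_of_work", "month", "headquarter"],
}

top_k = 8
# Keep only the variables that appear in the top-k list of every method.
common = set.intersection(*(set(r[:top_k]) for r in rankings.values()))
```

With these placeholder rankings the intersection yields the same six variables the study reports (year, type of work, type of accident, injured part, assailing material, cause of accident); severity is then added as the seventh.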

Second Data Preprocessing for Reduction of Elements
Through the first data preprocessing stage, the 16 variables of the initial dataset were reduced to seven important ones. However, because some variables contained up to 13 elements, numerous elements remained, which made it difficult to interpret the construction accident data and identify trends. Therefore, the second data preprocessing step was performed to reduce the number of elements while preserving the data characteristics as much as possible; in this step, elements showing similar trends in the type of work, injured part, and assailing materials variables (which each contained more than ten elements) were standardized and reduced to five or six elements. The severity was predicted for eight cases using the SVM and the DT-based ensemble method, which were reported to be more suitable for the analysis of construction accident data [8]; this was done to find the minimum number of elements that maintains the prediction accuracy.
In previous studies, injury types (bruise, ligament injury, fracture, etc.) were included to predict severity [1,8,37]. However, since most severity levels are determined based on the injury type, the two are very strongly correlated. In this study, it was confirmed that predictions including the injury-type variable achieve high accuracy and low bias, as in previous studies. However, because this study aims to extract and optimize the main variables through a simple method and to examine the correlations between the variables through the alluvial flow diagram, the injury-type variable, which significantly weakens the contribution of the other variables, was excluded. Because the variable most strongly related to severity is thereby excluded, the predictive performance may be lower than in previous studies. Table 4 shows the severity prediction results for eight cases, obtained using the ensemble- and SVM-based methods. Nested cross-validation (CV) was applied to minimize the bias of the prediction results caused by overfitting during ML training and verification. The SVM predicted severity with high accuracy before nested CV but showed low accuracy afterward; the post-CV results are considered more generalizable. In the ensemble method, the differences before and after nested CV were small, indicating relatively minimal overfitting compared with the SVM model. In the nested CV results for the eight cases, the ensemble method achieved approximately 10% higher accuracy than the SVM method, and it was found to be more suitable for datasets containing a variety of variables and elements. When predicting severity while reducing the elements of the variables, we found that when all three variables were reduced, the ensemble method showed an accuracy of 67.29%, only about 3% lower than the case without reduction.
This indicates that the characteristics of the data did not change significantly. Therefore, reducing the elements in the data made it easier to analyze the accident data and identify trends, and it simplified the analysis of complex data in which correlations are difficult to find.
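The nested CV used above can be sketched as an inner hyperparameter search wrapped in an outer scoring loop; the outer folds never see the data used for tuning, which is what limits the optimistic bias the text describes. The data, estimator, and parameter grid below are illustrative assumptions, not the study's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic 3-class data standing in for encoded accident records.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# Inner loop: 3-fold grid search tunes the weak-learner depth.
inner = GridSearchCV(GradientBoostingClassifier(n_estimators=50, random_state=0),
                     {"max_depth": [2, 3]}, cv=3)

# Outer loop: 5-fold CV scores the tuned model on held-out folds only.
outer_scores = cross_val_score(inner, X, y, cv=5)
mean_acc = outer_scores.mean()
```

A model whose plain CV score is much higher than its nested CV score is overfitting to the tuning folds, which matches the paper's observation about the SVM before and after nested CV.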
A final dataset, featuring seven variables and a maximum of ten elements, was formed by selecting the major variables through the first data preprocessing stage and standardizing the elements through the second. Similar results were obtained when each of the seven variables was selected as the output and predicted using ML; thus, we concluded that this final dataset was valid.

Prediction of Various Dependent Variables
In the second data preprocessing stage, the injured part was expected to correlate strongly with the severity prediction; in some cases, the prediction accuracy dropped significantly when elements of other variables were changed simultaneously. Therefore, the correlations between variables were analyzed, because some of the seven variables may be highly influential. In addition, an analysis was conducted to confirm the predictability of the other variables, rather than simply predicting severity from the seven variables. Correlation analysis was performed through ensemble-based prediction and predictor importance calculation, because the ensemble method is more suitable than the SVM for the data in this study. Similar to the first data preprocessing step, seven predictions were made with one variable as the output and the others as inputs, and the accuracy, precision, recall, and F1 score were calculated as reliability indicators; the results are presented in Table 5. In general, when the proportions of the elements in the output are similar, or when only two elements are analyzed, accuracy alone can be used to evaluate model performance. However, because this study's occupational accident data contained variables with more than five elements, it was difficult to evaluate model reliability accurately using accuracy alone. Therefore, the F1 score was calculated and analyzed, and the developed model's reliability evaluation results are presented in Table 5. Here, for each element, a true positive (TP) is a case correctly predicted as belonging to that class, a false positive (FP) is a case incorrectly predicted as belonging to that class, and a false negative (FN) is a case of that class incorrectly predicted as not belonging to it. The precision was calculated using Equation (1) from the TP and FP of each class, and the recall was calculated using Equation (2) from the TP and FN of each class.
After that, the average precision and recall were calculated using Equations (3) and (4), and the F1 score was calculated using Equation (5).
Average Precision = (P(A) + P(AD) + P(C) + P(CD))/4 (3)

In the SVM and ensemble predictions with nested CV, the accuracy of most results exceeded 50%; however, in the predictions for year, the accuracy was relatively low, as year was not expected to correlate strongly with the variables used as the other inputs. The F1 score, which represents the harmonic mean of precision and recall, was considerably lower than the accuracy for most outputs. This is because the data were concentrated in certain elements, and, owing to the nature of the algorithm, the predictions tended to concentrate on the elements with more data. However, in the case of severity, the accuracy and F1 score were very similar; the injured part was predicted with six elements but showed slightly lower scores than the other variables. Although high accuracy was not achieved in predicting the various outputs, prediction was possible to some extent, and because separate correlations may exist between variables and elements, a detailed analysis of each output was conducted.
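The per-class averaging in Equations (1)-(5) can be written generically as follows (the class labels A, AD, C, CD in Equation (3) are specific to the study's output, so the sketch averages over whatever classes are present; the toy labels are invented):

```python
import numpy as np

def macro_metrics(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over the classes in y_true
    (per-class precision/recall as in Eqs. (1)-(2), averages as in (3)-(4),
    harmonic mean as in (5))."""
    precisions, recalls = [], []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p, r = np.mean(precisions), np.mean(recalls)
    f1 = (2 * p * r / (p + r)) if p + r else 0.0
    return p, r, f1

# Toy example with three classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
p, r, f1 = macro_metrics(y_true, y_pred)
```

Because macro averaging weights every class equally, a model that only performs well on the majority elements scores much lower on F1 than on accuracy, which is exactly the gap observed in Table 5.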

Correlation Analysis between Variables
In most previous studies, one dependent variable was simply predicted with ML, or the relationship between one dependent variable and the remaining independent variables was analyzed through the chi-square test [8,37,48]. Moreover, current studies of causal inference rely on complex algorithms [49,50], and their inference results have not been effectively applied to the qualitative and subjective accident data of the construction field. It may be more appropriate to analyze such data in stages rather than with an algorithm that solves everything at once. Separate preprocessing requires less computational cost and effort, because more data can be pre-filtered to select and use the major variables, and the major variables can be managed in advance, enabling efficient data management. Thus, the proposed model can extract major variables in an easy and simple way from many types of data written on the basis of qualitative and subjective judgments, and it can predict accident outcomes. Figure 3 compares the contributions of the input variables to the output predicted by the ensemble method. By predicting the output, the variables that contribute strongly to the prediction can be identified. The variables with large contributions vary according to the predicted output, and larger predicted contributions indicate stronger correlations. When the severity was predicted, the injured part and type of accident were found to correlate strongly with it; when the injured part was predicted, the severity and type of accident were found to correlate strongly with it. Thus, strong correlations exist between some variables, though not all. Although Figure 3 clearly shows the contributions of the input variables to the individual outputs, it does not clearly show the overall relevance. Therefore, the network analysis results are illustrated schematically in Figure 4 to clarify the relationships between variables. The arrows indicate the direction of each contribution, and the line thickness indicates its magnitude.
Large correlations are observed between the type of accident and assailing materials, between the cause of accident and injured part, and between the injured part and severity. This confirms that separate correlations exist among the major variables most strongly associated with the occurrence of construction accidents, and these correlations should be utilized for construction accident prevention through correlation analysis.

Correlation Analysis between Elements
In order to apply the accident analysis results to safety measures at construction sites, attention must be paid to the correlations between the variables contributing to an accident, rather than simply to increasing the prediction accuracy [37,48].
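The association between two categorical accident variables can be quantified with Cramér's V, one of the measures used in this study's variable selection. A minimal implementation, using a toy perfectly associated pair for illustration:

```python
# Cramér's V: association strength between two categorical variables,
# ranging from 0 (no association) to 1 (perfect association).
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    table = pd.crosstab(x, y)                    # contingency table of counts
    chi2 = chi2_contingency(table)[0]            # chi-square statistic
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt((chi2 / n) / min(r - 1, k - 1)))

# Toy example: each accident type always pairs with the same severity,
# so the two variables are perfectly associated.
acc = pd.Series(["fall off", "fall down", "hit"] * 40)
sev = pd.Series(["fatal", "serious", "slight"] * 40)
print(cramers_v(acc, sev))  # -> 1.0
```

On real accident data, pairs of variables with high V values are the candidates for the detailed element-level analysis that follows.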
Correlations between variables can be analyzed through contribution and network analyses; however, these analyses struggle to capture the correlations between the elements included in the variables. Therefore, Figure 5 shows a detailed correlation analysis, using an alluvial flow diagram, for the top three variables in terms of F1 score (severity, injured area, and type of accident, respectively); these three variables contributed strongly to predicting the type of accident. The trends of correlation contribution to the type of accident show that injuries to the outside of the upper body occurred mostly as a result of "fall down" accidents caused by a "heavy non-fixture" or "light non-fixture (equipment)," or by the "carelessness" of the victim. Such accidents could likely be prevented if workers who handle a "heavy non-fixture" or "light non-fixture (equipment)" were made aware of accident cases through pre-work education. Figure 6 shows an alluvial flow diagram for the type of accident and severity, which both contribute strongly to the prediction of the injured part. Overall, serious injuries occurred most often in "fall off" and "fall down" accidents, and the injuries were mostly to the outside of the upper or lower body. The most fatal injuries occurred as a result of falling accidents, with injuries to the face or upper body being most common. By analyzing these overall trends, we expect to reduce the occurrence of accidents by providing customized safety training and safety protection equipment to workers in high-risk roles. The variables that primarily contribute to severity prediction are year, type of accident, and injured part; the alluvial flow diagram for these factors is shown in Figure 7. In the relationship between year and type of accident, "fall down" is the most prevalent accident across most years, followed by "fall off."
The most frequently injured areas were the outside of the upper and lower body, and most of these injuries were serious or slight; fatal injuries were most frequently caused by lower-body injuries from falling accidents. Moreover, a strong correlation was confirmed between fatal injuries and accidents in which the victim's head was struck by an object.
In the alluvial flow diagrams, the correlations were analyzed for the variables most strongly correlated with each output. Through the two data preprocessing steps, the complexity of the initial construction accident data was resolved, and the correlations of construction accidents could be readily understood from the alluvial flow diagram using the variables with a large correlation to the output. We therefore expect this analysis to help prevent construction accidents, provided appropriate safety measures are established for the specific accident types. However, in alluvial flow diagram analysis, identifying detailed trends can be difficult because the flow of the preceding variable is integrated with that of the next. Therefore, a detailed analysis was carried out by grouping the final dataset via the LCCA used in the first data preprocessing step.
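An alluvial flow diagram is drawn from counts of co-occurring categories between adjacent axes. The underlying flow widths can be derived with a simple grouping, as sketched below; the five records are invented illustrations, not the study's dataset.

```python
# Derive the flows underlying an alluvial diagram with axes
# type of accident -> injured part -> severity. Each flow width is the
# number of records sharing a pair of categories on adjacent axes.
import pandas as pd

df = pd.DataFrame({
    "type_of_accident": ["fall off", "fall off", "fall down", "hit", "fall down"],
    "injured_part": ["lower body", "upper body", "lower body", "upper body", "lower body"],
    "severity": ["fatal", "serious", "serious", "slight", "serious"],
})

# Flow widths for each adjacent pair of axes in the diagram
flow1 = df.groupby(["type_of_accident", "injured_part"]).size()
flow2 = df.groupby(["injured_part", "severity"]).size()
print(flow1, flow2, sep="\n\n")
```

Plotting libraries that render alluvial or Sankey diagrams consume exactly these (source category, target category, count) triples.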

Grouping with LCCA
LCCA can be used to identify data trends via detailed analysis of the major elements included in each group; it can also select major variables by identifying the variables that contribute heavily to the grouping, as in the first data preprocessing stage. Advantageously, this method can capture the flow when two or more variables that are difficult to represent in the alluvial flow diagram are connected. The five attributes that are most influential in differentiating the clusters are presented in Table 6, which summarizes the ratio between the total number of observations in the dataset and those in the specified cluster. Each cluster groups objects with high similarity according to the seven variables across the 963 objects. A previous study was limited to applying LCCA only to binary variables with two elements; in this study, LCCA was applied to categorical variables with many elements [1]. Cluster 1 includes data elements such as "permanent fixture" (under assailing material), "fall down" and "fall off" (under type of accident), the outside of the lower body (under injured part), and "serious injury." Cluster 2 includes data on "hit" and "struck" accidents (under type of accident), "third-party liability" (under cause of accident), "heavy non-fixture" (under assailing material), and the outside of the upper body. Occupational accident data in the construction industry can thus be distinguished through the selection of differentiated elements between clusters. However, the influential elements in each cluster can be partially duplicated in other clusters. For example, Clusters 1 and 4 both contain falls as a major attribute among the types of accident; however, they are otherwise differentiated in terms of injured part and severity.
By their nature, construction accidents can be affected by numerous variables, and identifying trends may be difficult; however, accident data can be grouped by the relationships between the other influential attributes. The construction accident concept of each cluster, based on the influential attributes in Table 6, is defined and presented in Table 7. In Table 7, Cluster 1 includes the "fall down" and "fall off" accidents that result in serious injury to the outside of the lower body from a "permanent fixture." Cluster 2 includes cases of injury to the outside of the upper body due to "hit" accidents caused by a "heavy non-fixture." Cluster 3 includes cases of injury to the outside of the upper body whilst using a "light non-fixture (equipment)" or "portable tool." Hence, each cluster contains the types of accidents that occur most frequently in the construction industry, and their proportions are also similar. By grouping construction accident data, we can quantitatively verify the existing empirical knowledge of construction managers, and we anticipate that construction site safety can be improved by establishing appropriate safety measures for the different types of construction accidents.
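The grouping-and-profiling workflow can be sketched in code. Note that this is only a rough stand-in: scikit-learn has no LCCA, so the sketch one-hot encodes the categorical accident variables and clusters them with k-means instead of fitting a true latent-class model; the per-cluster category shares it reports are analogous to, but not the same as, the LCCA cluster ratios in Table 6. The data are synthetic.

```python
# Stand-in for LCCA-style grouping (NOT latent-class analysis itself):
# one-hot encode categorical variables, cluster with k-means, then
# summarize each cluster by the share of each category it contains.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "type_of_accident": rng.choice(["fall off", "fall down", "hit"], 300),
    "assailing_material": rng.choice(["permanent fixture", "heavy non-fixture"], 300),
    "injured_part": rng.choice(["upper body", "lower body"], 300),
})

X = pd.get_dummies(df)   # one-hot encoding of the categorical variables
df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Per-cluster category shares, e.g. the fraction of "fall down" in each cluster
profile = df.groupby("cluster")["type_of_accident"].value_counts(normalize=True)
print(profile)
```

A dedicated latent-class package (or the tool used in the original study) would replace the k-means step on real data; the profiling step afterwards is the same.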

Visualization with PCA
In general, PCA identifies the variables with the largest influence among many variables and uses them to reduce the dimensionality of the data, thereby selecting major variables that can easily classify the data. PCA primarily uses numerical data [51]; categorical data are difficult to use because the variables and items have no inherent numerical values. In a study that applied PCA to construction industry survey data, scores were assigned to each item so that PCA could be conducted and the dimensions reduced using numerical values [38]. In this study, four methods were used to identify the major variables; PCA was then applied as a visualization method rather than a dimension-reduction one. To visualize the severity level, which was the variable with the highest predictive accuracy as determined through the ML analysis, the major variables (year, type of accident, and injured part) were used as the PCA inputs. The character-type categorical data were converted into numeric-type categorical data before use. Figure 8 shows the results of PCA using these three variables; each data point is displayed in red, blue, or light green, depending on the severity level. By plotting the major variables through PCA, the severity levels can be distinguished relatively clearly. Moreover, the severity level is classified by the injured part (PC3) rather than by the year (PC1) or type of accident (PC2); this confirms that the variable most strongly correlated with severity in the correlation analysis is the injured part. In other words, it is possible to predict the severity through PCA classification using the three major variables. In this study, to overcome the limitations of current accident data processing, a new data processing procedure and various ML algorithms were used to derive the correlations between the major variables in occupational accident data.
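The visualization step just described can be sketched as follows: the character-type categories are converted to numeric codes and then projected with PCA, giving one (PC1, PC2, PC3) point per record that can be plotted colored by severity. The records are random placeholders, and the encoding shown (integer factorization plus standardization) is one plausible choice, not necessarily the exact scheme used in the study.

```python
# Label-encode the three major variables (year, type of accident,
# injured part) and project them with PCA for visualization.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "year": rng.integers(2010, 2020, 200),
    "type_of_accident": rng.choice(["fall off", "fall down", "hit"], 200),
    "injured_part": rng.choice(["upper body", "lower body", "head"], 200),
})

# Convert character-type categorical data to numeric codes, column by column
encoded = df.apply(lambda col: pd.factorize(col)[0])

# Project onto three principal components; each row is one accident record
scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(encoded))
print(scores.shape)  # -> (200, 3)
```

Scatter-plotting the resulting component scores, with marker color taken from the severity label of each record, reproduces the kind of view shown in Figure 8.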
In previous studies, only one method was used to select the major variables, whereas in this study, four methods were used. A further difference is that the ML method was used to select an optimized set of elements at the point where the prediction accuracy is maintained. In addition, the most representative accident analysis method currently used in the field is risk assessment, which mainly derives hazard factors based on experience, calculates severity from intensity and frequency, and prepares countermeasures. In contrast, the method of this study is a case analysis based on the accident prediction results for each variable of the accident data, which is more suitable for actual accidents. Through the correlations between the major variables identified in this study, various construction accident data can be used to construct an accident prediction model and establish more practical accident prevention measures.

Conclusions
In this study, an efficient data preprocessing technique and ML application were developed to analyze occupational accident data in the construction industry, where deriving features is difficult owing to the large number of variables and elements in the accident data. The following conclusions were drawn:

•
For construction accident data with many variables and wide categories, the most influential variables can be identified using clustering, the chi-square test, and related procedures.

•
Because the types or categories of the major variables are numerous, it is difficult to identify meaningful relationships. Therefore, standardization and element grouping can be performed, and the accuracy can be analyzed according to the categories of the variables; through this, an optimal grouping using the fewest elements can be found.

•
The correlations between factors can be analyzed by examining the correlations between and contributions of variables, using ML analysis on the optimal variable type and category.

•
Through PCA and clustering, the distribution and combinations of variables that contribute to the prediction of each variable can be understood, and we anticipate that effective accident prevention measures can be established by utilizing these results.

•
This study predicted and analyzed the severity level from a classified list of personal damage, which imposes some limitations; more quantitative data, such as the days of convalescence for each accident, could yield more reliable results.

•
Construction accident data are managed in different forms by different countries and companies, so the variables and elements to be recorded differ. Therefore, data standardization is necessary to apply the analysis method proposed in this study.