Identification of Defect Generation Rules among Defects in Construction Projects Using Association Rule Mining

This study aims to identify the generation rules between defects to support effective defect prevention at construction sites. Numerous studies have sought to identify the relations between defect causes in order to prevent defects in construction projects. However, identifying inter-causal patterns alone does not yield proper defect mitigation strategies unless the underlying defect-to-defect generation rules are also understood. Specifically, if a defect generated in one work process is left without corrective action, additional defects can be generated in the following works. Thus, to minimize defect generation, this study analyzes defects along the sequence of construction works. To this end, the authors collected 9054 defect cases and used association rule mining to analyze the rules between the defects. Consequently, 216 rules were identified, and 152 of them were classified into three categories through interviews with four experts (71 expected rules, 22 unexpected but explainable rules, and 59 unexpected and unexplainable rules). The generation rules identified in this study are expected to help organize the various defect types and determine those that require priority management.


Introduction
In construction projects, quality defects are major factors degrading the project performance. Quality defects at a construction site are extremely important issues for management, because they cause economic losses, such as schedule delays, cost overruns, and disputes because of rework [1][2][3][4]. Love et al. [3] argued that the cost for the rework required owing to defects can reach approximately 10% of the construction cost. Therefore, it is important to identify the causes of such defects in an early stage and remove them in advance to prevent quality defects from occurring.
In this context, construction defects are recognized as factors that have to be prevented in advance, instead of being treated at a subsequent stage. However, because quality management in the construction stage relies on repairs following inspections of the work results, rather than on earlier prevention, similar defects keep recurring. Because these defects occur in numerous combinations owing to their complex relations, they are difficult to manage [3,5-8]. In addition, there are numerous defect types to be managed and various combinations of defect causes; therefore, quality managers cannot readily determine the rules by which related defects are generated.
To address this issue, numerous studies have been performed to identify the causes of quality defects and to minimize defects by eliminating those causes [1-3,7,8]. These studies have attempted to prevent defects by identifying single causes through statistical analysis and by analyzing the relationships between causes to determine the combinations that lead to defects.
These studies contributed to a better understanding of the relationships between defect causes and thereby to proactive defect mitigation strategies. Nonetheless, a sound prioritization strategy should also account for how the various types of defects generate one another. In practice, at a construction site, if a defect generated in one construction work process is left without corrective action, another defect can be generated in connection with it in the following construction works. Understanding these implications would provide grounds for measuring defect severity and would complement root cause analyses in drawing up defect mitigation strategies. Therefore, in addition to studies that identify the complex generation structure between causes, a study that seeks to prevent defects in advance by determining the generation rules between possible defects is also necessary. Based on this recognition, the authors aim to develop a framework that can identify and analyze the complex generation structure between the various causes of construction defects and their systematic classification.

Literature Review
For management, it is important to understand where and with what characteristics defects are generated in the construction phase. To understand defect generation, existing studies have attempted to identify and analyze defects by process, type, cause, and time. Kim [9] analyzed defect types and causes using the defect performance data of Korean construction companies for high-rise residential buildings and suggested a defect management measure based on the results. Kwon [10] classified the defects that can be generated in multi-family housing into those before and after occupation, and analyzed the generation type by process. He further suggested five measures for defect reduction: applying a fair construction schedule, training, improving recognition for supervisors, and appropriate manpower arrangement according to the construction scale. Lee [11] identified defect generation causes by type and process, using a survey on the annual defect generation status of multi-family housing, and proposed measures to deal with each cause and prevent defects. Song [12] determined the types and causes of defects in finishing work based on the defects generated in interior finishing works in multi-family housing. Furthermore, he divided the causes by type into defects by materials and construction workers, defects by follow-up workers, and defects by equilibrium and furniture placement, and analyzed them. Kang [13] identified priority management targets by analyzing 1.71 million defects of 105 multi-family housings supplied by large construction companies. The existing studies [9][10][11][12][13] on the types and characteristics of defect generation classified defect generation types by construction process based on collected cases, and derived preventive measures and priority management targets by estimating the impact and importance of the causes. In addition, the causes were derived according to the classification of the defect types generated, and each cause was treated as a factor that generates defects independently, because the relations between the causes were not analyzed.
Defects can be generated by the root causes with the most influence on defect generation, and also by a combination of numerous causes. Existing studies that attempted to prevent defects by analyzing the relation between the root cause and the causes of defects have provided knowledge based on which an effective performance can be realized during quality management in a construction site. This is achieved by estimating the importance and influence of the causes of defects to derive the root cause and by deriving a defect generating mechanism considering the relation between the causes.
Mireille [14] classified the causes of defects based on the condition, effect, nonconformance, and distress, to analyze the root cause of the defects that can be generated, and identified a sequential mechanism suggesting that causes are generated sequentially and not independently (Condition → Effect → Nonconformance → Distress). Love [3] proposed a causal model that causes reconstruction by focusing on the factors that are difficult for humans to identify based on defect cases, to identify the defect generation structure in terms of missing errors. Cui [15] surveyed by distinguishing the defect generation types, frequencies, and responsible agents for a defect-generated process in view of the defect cases to identify the causes of defects and the factors that affect them. In addition, each influence factor was subcategorized to derive the degree of influence on the defect generation that can be caused by the responsible agents, and the causes were classified into high, middle, and low levels, according to the degree of influence. Al Jassmi et al. [5] derived a defect generation mechanism using the "fault tree analysis," with which they analyzed the root causes of defects. Furthermore, they identified the defect generation structure that the defects are generated sequentially (organization influence → defective supervision → preconditions for defective acts → defective acts) under the assumption that they are generated not by a single cause but by the causal relations between the causes. Furthermore, each of the causal factors was classified into the mechanism, and the influence of each cause was evaluated to select the priority management targets [7].
The existing studies [3,5,7,14,15] mentioned above attempted not only to analyze the root causes of defects but also to identify the mechanism by which causal factors occur sequentially to generate defects, by analyzing the relations between the causes and determining the defect generation structure. Most of the causes were identified as man, equipment, construction methods, materials, etc.; however, if defects are generated in previous works, they will inevitably result in defects in the following works, so there is a limit to resolving unpredictable defects with the derived causes alone. In addition, because numerous combinations of causes generate the defects identified in the existing studies, it remains impractical to manage all of them at construction sites, where construction cost and duration must be considered. In particular, although numerous studies have attempted to derive the causes and identify the relations between them, studies that support effective quality control by identifying the generation structure between possible defects are still lacking.

Data Collection
To identify the defect generation structure, 9054 cases of non-conformance reports (NCRs) and field action notices (FANs) were collected from 31 construction projects in Korea. The detailed information obtained from the NCRs and FANs includes the defect cases, classification (for example, work type and facility), keyword (for example, crack and vertically faulty), and circumstantial description (see Table 1).
All 9054 NCR and FAN cases were carefully analyzed and classified by work type. The work was classified into the 14 types specified in the NCRs and FANs (for example, reinforced concrete work, brick work, carpenter work, and furniture and fixture work). The most frequent defects were generated in furniture and fixture work with 2984 cases (33.0%), followed by window work with 1803 cases (19.9%) and tile work with 1481 cases (16.4%) (Table 2).
Based on this classification, the defect types were refined. Items that were weakly related to defects generated in other work types, such as scratches and damage, were excluded from the defect types because they were unsuitable for the purpose of the present study. Finally, a total of 66 defect types were organized considering the characteristics of each work type. With these defect types by work type, a basic data set was constructed from the 9054 NCR and FAN cases.

Analysis of Defect Generation Rules Using Association Rule Mining
Association rule mining (ARM) is defined as a methodology for discovering interesting relationships hidden in large data sets [16]. ARM is one of the data mining techniques that acquire previously unknown knowledge from a specific set of data and determine patterns that can explain the acquired knowledge [17,18]. In general, it is useful for determining the association between two or more groups. In other words, ARM is useful not only for verifying common-sense and well-known rules (e.g., expected rules), but also for finding hidden new rules (e.g., unexpected rules).
ARM uses the dataset to identify items that frequently occur together. The values derived by ARM are "Support," "Confidence," and "Lift." "Support" is the probability that "X" and "Y" occur together within the dataset for a rule "X → Y," and is obtained from Equation (1). For example, it is the probability that horizontality and verticality faults in the reinforced concrete work and a verticality fault in the window work are generated together. The "Support" value is an index of how frequently two defects are generated together, and a standard for determining how important a rule is.
"Confidence" is the conditional probability that "Y" is generated given that "X" is generated in the "X → Y" relationship, and is obtained from Equation (2). For example, it is the probability that a verticality fault occurs in the window work when horizontality and verticality faults are generated in the reinforced concrete work. The "Confidence" value is a standard for how much a rule can be trusted. Because "Support" and "Confidence" are probability values, the closer they are to 1, the stronger the association.
The "Lift" value measures how much the conditional probability of "Y" given "X" increases relative to the overall probability of "Y," and is obtained from Equation (3) [19]. Thus, it shows how much "X" influences the probability of "Y" in the "X → Y" relationship. For example, it shows how much horizontality and verticality faults in the reinforced concrete work influence the probability of generating a verticality fault in the window work. The "Lift" value can be used to retain only those rules with higher correlations in the "X → Y" relationship. If the "Lift" value is greater than 1 (i.e., "Lift" > 1), X and Y are positively correlated; if it equals 1 (i.e., "Lift" = 1), they are independent; and if it is between 0 and 1 (i.e., 0 < "Lift" < 1), they are negatively correlated.
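Equations (1) to (3) are not reproduced in this text. For reference, the standard definitions of the three measures can be written as follows. The notation is assumed here rather than taken from the original equations: N is the total number of records and n(·) counts the records that contain the given items.

```latex
\begin{align}
\mathrm{Support}(X \rightarrow Y) &= P(X \cap Y) = \frac{n(X \cap Y)}{N} \tag{1}\\
\mathrm{Confidence}(X \rightarrow Y) &= P(Y \mid X) = \frac{P(X \cap Y)}{P(X)} \tag{2}\\
\mathrm{Lift}(X \rightarrow Y) &= \frac{P(Y \mid X)}{P(Y)} = \frac{P(X \cap Y)}{P(X)\,P(Y)} \tag{3}
\end{align}
```

Under these definitions, "Lift" equals 1 exactly when P(X ∩ Y) = P(X) P(Y), which is the independence case.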
ARM derives the rules in three steps: 1) users define a min-support value and select the rules whose "Support" exceeds it; 2) users define a min-confidence value and select the rules whose "Confidence" exceeds it; 3) users select the rules whose "Lift" value is greater than 1 (i.e., "Lift" > 1). No specific threshold can be shown to be correct, because the min-support and min-confidence criteria required for applying ARM are set at the discretion of researchers in consideration of the study purpose and data characteristics [8,19].
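The three selection steps can be illustrated with a minimal Python sketch. It is a simplified illustration, not the authors' implementation: it only considers rules with a single item on each side, the records and item labels (`RC:hv_fault`, `WIN:v_fault`, `RC:crack`) are hypothetical, and treating the thresholds as inclusive (`>=`) is an assumption.

```python
from itertools import permutations

def pairwise_rules(transactions):
    """Compute support, confidence and lift for every ordered pair X -> Y."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    count = {i: sum(1 for t in transactions if i in t) for i in items}
    rules = []
    for x, y in permutations(items, 2):
        both = sum(1 for t in transactions if x in t and y in t)
        if both == 0:
            continue  # X and Y never occur together, so no rule is formed
        confidence = both / count[x]          # P(Y | X)
        rules.append({
            "x": x,
            "y": y,
            "support": both / n,              # P(X and Y)
            "confidence": confidence,
            "lift": confidence / (count[y] / n),  # P(Y | X) / P(Y)
        })
    return rules

def select_rules(rules, min_support=0.0, min_confidence=0.7):
    """Apply the three steps: min-support, min-confidence, then lift > 1."""
    return [
        r for r in rules
        if r["support"] >= min_support      # step 1 (assumed inclusive)
        and r["confidence"] >= min_confidence  # step 2 (assumed inclusive)
        and r["lift"] > 1.0                 # step 3
    ]

# Hypothetical records: each set lists the defects observed together.
transactions = [
    {"RC:hv_fault", "WIN:v_fault"},
    {"RC:hv_fault", "WIN:v_fault"},
    {"RC:hv_fault", "WIN:v_fault", "RC:crack"},
    {"RC:crack"},
    {"RC:hv_fault"},
]

rules = pairwise_rules(transactions)
selected = select_rules(rules)
for r in selected:
    print(f'{r["x"]} -> {r["y"]}: support={r["support"]:.2f}, '
          f'confidence={r["confidence"]:.2f}, lift={r["lift"]:.2f}')
```

With `min_support=0.0`, step 1 keeps every rule that occurs at least once, so the selection is driven by the confidence and lift criteria alone.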
Therefore, in the present study, no min-support value was set, so that all rules are considered regardless of their frequency. For the min-confidence value, 0.7 was applied, following the value used in existing studies [20,21]. In addition, because the purpose of the present study is to identify rules by which defects in previous works lead to defects in following works, the rules in which the order of the construction work process is reversed were excluded. The process of deriving the rules using ARM is displayed in Figure 1. A total of 1237 rules were derived in the first ARM analysis stage using the basic dataset previously prepared. Subsequently, the rules were filtered in the second stage (i.e., min-confidence of 0.7), third stage (i.e., Lift > 1), and fourth stage (i.e., work process order). Through this filtering process, 216 significant rules were finally identified. Table 3 shows the 216 rules arranged according to the "Confidence" level, which measures the strength of the relationship between defects. Rule 120, identified as the strongest rule, shows that when a pockmark and surface blister (D4) occurs in the reinforced concrete work (W1), a discoloration (D66) is 89% likely to occur in the stone board work (W14). Rule 172, the second strongest rule, suggests that D2 in W11 can generate D27 in W6 with an 86.2% probability.
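The work-process filter can be sketched as follows. The sequence positions in `WORK_ORDER` and the confidence values are hypothetical placeholders chosen for illustration; only the idea of discarding rules whose consequent work precedes the antecedent work comes from the text.

```python
# Hypothetical position of each work type in the construction sequence
# (1 = earliest). The real sequence would come from the project schedule.
WORK_ORDER = {"W1": 1, "W7": 2, "W8": 3}

def follows_sequence(rule, order=WORK_ORDER):
    """Keep a rule only if the consequent work does not precede the antecedent work."""
    return order[rule["x_work"]] <= order[rule["y_work"]]

# Illustrative rules; the confidence values are made up for the example.
candidate_rules = [
    {"x_work": "W1", "x_defect": "D3", "y_work": "W8", "y_defect": "D33", "confidence": 0.80},
    {"x_work": "W8", "x_defect": "D33", "y_work": "W1", "y_defect": "D3", "confidence": 0.90},
    {"x_work": "W7", "x_defect": "D30", "y_work": "W8", "y_defect": "D33", "confidence": 0.75},
]

# Drop reversed rules, then list the remaining ones from strongest to weakest.
kept = sorted(
    (r for r in candidate_rules if follows_sequence(r)),
    key=lambda r: r["confidence"],
    reverse=True,
)
for r in kept:
    print(f'{r["x_work"]}-{r["x_defect"]} -> {r["y_work"]}-{r["y_defect"]} '
          f'({r["confidence"]:.2f})')
```

The comparison uses `<=` rather than `<` so that rules between defects of the same work type are retained, since such rules also appear among the reported results.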

Classification of Defect Generation Rules
Because the ARM technique is applied to a large amount of data, it can derive numerous rules; useful rules were therefore filtered using the min-support, min-confidence, and lift values. As a result, 216 rules were finally identified from the 9054 cases of defect data after excluding the rules with reversed construction work processes. However, analyzing the rules only by their probability (i.e., confidence) and correlation (i.e., lift) has limits. Therefore, for construction practitioners to apply the identified rules in real construction projects, it is necessary to distinguish whether they are expected or unexpected rules.
For this purpose, in-depth interviews were conducted with four experts having domain knowledge and field experience to distinguish between expected and unexpected rules among those extracted from the data through ARM [16]. The interviewees consisted of one expert (MSc and Professional Engineer) with more than 20 years of field experience, two experts (BSc) with more than 15 years in the quality management area (dealing with all work types), and one expert (Ph.D.) with more than 10 years in the construction management research area. Results obtained from interviews with four people may not be sufficient to establish statistical validity. However, given that 216 defect generation rules had to be carefully analyzed, in-depth interviews with a small number of experts were considered more effective than a survey of many respondents.
The experts were asked to classify the 216 rules into three categories of useful rules, "Expected Rule," "Unexpected but Explainable Rule," and "Unexpected and Unexplainable Rule," following the process in Figure 2.
First, the "Expected Rule" is a rule that the experts can easily expect without looking at the analysis results. For example, Rule 70 in Table 3, in which a separation (D26) is generated in the waterproofing work following a crack (i.e., D2) in the reinforced concrete work, can be referred to as an "Expected Rule." Second, the "Unexpected but Explainable Rule" is a rule that was not expected before looking at the analysis results but can be explained once the results are known. In other words, it is a piece of information that was not known before but has newly come to be known. An example is Rule 76 in Table 3, in which a separation (D26) is generated in the waterproofing work following horizontality and verticality faults (i.e., D6) in the reinforced concrete work. Finally, the "Unexpected and Unexplainable Rule" is a rule that was not expected before looking at the analysis results and remains difficult to explain afterward. Accordingly, to classify the rules into the three types, it is necessary to check whether a rule was expected before looking at the analysis results (i.e., Expected? in Figure 2(b)) and, if not, to determine whether the unexpected rule can be explained (i.e., Explainable? in Figure 2(b)). For an accurate classification, ARM and the details of the three types of useful rules were thoroughly explained to the experts beforehand. To make the classification easy during the interviews, the interview sheet was designed so that each rule could be marked with "ㆍ" (i.e., Expected Rule), "!" (i.e., Unexpected but Explainable Rule), or "?" (i.e., Unexpected and Unexplainable Rule). The results of the experts' classification of the 216 defect generation rules are listed in Table 4. The rules given the same mark ("ㆍ," "!", or "?") by all four experts (see Table 4, Rules 1, 28, 29, 54, 65, 154, 206, and 216) are those on which their opinions were in full agreement.
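The two-question decision flow (Expected? then Explainable?) can be expressed as a short Python sketch. The function and dictionary names are illustrative assumptions; the three categories and their marks are those described above.

```python
def classify_rule(expected: bool, explainable: bool = False) -> str:
    """Map an expert's two yes/no answers to one of the three rule types."""
    if expected:
        # Expected before looking at the analysis results
        return "Expected Rule"
    if explainable:
        # Not expected, but explainable once the results are known
        return "Unexpected but Explainable Rule"
    # Neither expected nor explainable
    return "Unexpected and Unexplainable Rule"

# Marks used on the interview sheet for each rule type.
MARKS = {
    "Expected Rule": "ㆍ",
    "Unexpected but Explainable Rule": "!",
    "Unexpected and Unexplainable Rule": "?",
}

for answers in [(True, False), (False, True), (False, False)]:
    label = classify_rule(*answers)
    print(answers, "->", label, MARKS[label])
```

The second question is only consulted when the first answer is "no," which mirrors the order of the two checks in the classification process.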
First, Rules 1, 29, and 54 were classified as "Expected Rule." Taking Rule 1 as an example, if a form deformation (i.e., D8) occurs in the reinforced concrete work (i.e., W1), cracks will be generated in the concrete following the formwork. This can be classified as an "Expected Rule" by anyone without the analysis results.
Further, Rule 65 was classified as "Unexpected but Explainable Rule." According to this rule, if a reinforcing bar placement/assembly fault (i.e., D10) is generated in the reinforced concrete work (i.e., W1), cracks and pockmarks can be generated, causing separations (i.e., D32) in the plastering work (i.e., W7). The rule was unexpected in view of the experts' experience but can be explained logically.
Finally, Rules 28, 154, 206, and 216 were classified as "Unexpected and Unexplainable Rule." These rules are new information derived by the ARM analysis that the experts neither expected nor could explain. However, because they obtained significant "Confidence" and "Lift" values in the ARM analysis, they require additional analysis.
Not all rules were agreed upon by the four experts; for some rules, their opinions differed. For example, expert B classified Rule 2 as "Expected," but the other three classified it as "Unexpected but Explainable." This is because their experience and knowledge levels differ. Rule 2 could be reclassified as a unanimous "Expected" after being explained by expert B. Another example of disagreement is Rule 155. Experts A and D classified the rule as "Unexpected but Explainable," while experts B and C classified it as "Unexpected and Unexplainable." This means that experts B and C were not fully persuaded by the explanations of experts A and D. We found it difficult to resolve these disagreements completely even after follow-up discussion. Furthermore, forcing the experts to reach an agreement could have introduced bias into the classification. Based on this recognition, we decided to analyze only the rules unanimously agreed upon by the four experts and to deal with the remaining rules in subsequent studies. As listed in Table 5, the experts were in full agreement on 152 rules (70.4%) and not in agreement on 64 rules (29.6%).
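The unanimity criterion and the reported agreement shares can be checked with a small Python sketch. The ratings below cover only the handful of rules whose marks are described in the text (initial marks, in the order of experts A to D); the data layout and function name are assumptions made for illustration.

```python
def unanimous_label(labels):
    """Return the shared label if all experts agree, otherwise None."""
    return labels[0] if len(set(labels)) == 1 else None

# Initial marks of experts A, B, C, D for a few rules mentioned in the text.
ratings = {
    1:   ["expected", "expected", "expected", "expected"],
    65:  ["explainable", "explainable", "explainable", "explainable"],
    28:  ["unexplainable", "unexplainable", "unexplainable", "unexplainable"],
    2:   ["explainable", "expected", "explainable", "explainable"],
    155: ["explainable", "unexplainable", "unexplainable", "explainable"],
}

agreed = {}
disagreed = []
for rule_id, labels in ratings.items():
    label = unanimous_label(labels)
    if label is None:
        disagreed.append(rule_id)
    else:
        agreed[rule_id] = label

print("agreed:", agreed)
print("disagreed:", disagreed)

# Shares reported for the full set of 216 rules.
total, n_agreed = 216, 152
print(round(100 * n_agreed / total, 1), round(100 * (total - n_agreed) / total, 1))
```

A stricter or looser criterion (for example, a three-of-four majority) would change which rules are carried forward; the sketch follows the unanimity rule stated above.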

Result and Discussion
The purpose of the present study was to identify rules by which defects generated in previous works lead to defects in following works, in order to prevent quality defects. To this end, 152 rules were finally identified by conducting the ARM analysis on 9054 cases of defect data and expert interviews. The 152 rules were arranged in a matrix so that they can be reviewed at a glance by work type and defect type (Figure 3).
As displayed in Figure 3, the "X" value of ARM is marked on the vertical axis and the "Y" value on the horizontal axis; the rules between work types (i.e., Wi) and defect types (i.e., Di) are indicated as "•," "!," and "?." As an example of reading the matrix in the "X → Y" form of ARM, the "•" at the intersection of W1-D2 (i.e., X) and W1-D3 (i.e., Y) is interpreted as an "Expected Rule" in which W1-D3 is generated when W1-D2 occurs. In Figure 3, the defect generation rules can be interpreted largely in three ranges of the matrix (see Figure 3, ①, ②, and ③).
First, in ① (i.e., red section), most of the rules are "Expected Rules," which suggests that experts in quality management understand the relations between defects generated in the same or adjacent work processes. However, in ②, where the work processes are remote from each other in time, the rules are identified as "Unexpected but Explainable Rule" and "Unexpected and Unexplainable Rule," which suggests that experts in quality management have not yet recognized them. Therefore, the generation rules between defects in distant work processes (i.e., not close in sequence) are expected to provide new information and suggest a new direction for quality management.
Next, ② (i.e., gray section) makes it possible to estimate how much the defects generated in a previous work process influence the defects in the following work processes. For example, the reinforced concrete work is a preceding process, and if a defect is generated in it, numerous defects will be generated in the following work processes. That is, the reinforced concrete work has a strong impact on the defects that can be generated in the following work processes. Therefore, this matrix is expected to help identify and manage in advance the previous work processes that strongly influence defect generation in the following processes.
Lastly, ③ (i.e., blue section) presents the defects to be managed in previous works to prevent defects in a specific work process. For example, the defects that induce W8-D33 in ③ are W1-D3, W7-D30, D32, and W8-D34, 35 (white box in section ③). Therefore, identifying the types of defects generated in previous processes is expected to help minimize the defects in a specific work process.
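The two ways of reading the matrix (forward from a defect to the defects it may generate, and backward from a defect to the defects that induce it) can be sketched as a pair of lookups. The rule list below contains only the pairs mentioned in the text, reading "W7-D30, D32" as W7-D30 and W7-D32 and "W8-D34, 35" as W8-D34 and W8-D35, which is an interpretation; the data structure and function names are assumptions.

```python
from collections import defaultdict

# (antecedent X, consequent Y) pairs, each written as (work type, defect type).
rule_pairs = [
    (("W1", "D2"), ("W1", "D3")),
    (("W1", "D3"), ("W8", "D33")),
    (("W7", "D30"), ("W8", "D33")),
    (("W7", "D32"), ("W8", "D33")),
    (("W8", "D34"), ("W8", "D33")),
    (("W8", "D35"), ("W8", "D33")),
]

by_antecedent = defaultdict(list)  # row lookup: X -> all Y
by_consequent = defaultdict(list)  # column lookup: Y -> all X
for x, y in rule_pairs:
    by_antecedent[x].append(y)
    by_consequent[y].append(x)

def effects_of(defect):
    """Defects that may follow the given defect (reading along a row)."""
    return sorted(by_antecedent.get(defect, []))

def causes_of(defect):
    """Defects to manage beforehand to prevent the given defect (reading down a column)."""
    return sorted(by_consequent.get(defect, []))

print(causes_of(("W8", "D33")))
print(effects_of(("W1", "D2")))
```

The backward lookup corresponds to the use of section ③, and the forward lookup to the use of section ②.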

Conclusions
Defects at a construction site can cause economic losses, such as schedule delays and cost overruns owing to rework. Construction defects are recognized as factors that have to be prevented in advance, rather than treated afterward. The aim of this study was to identify the rules by which defects generated in previous construction works lead to defects in the following works. To achieve this, the authors used association rule mining (ARM) to identify the relevant defect rules in a large amount of data.
A total of 1237 rules were derived through the ARM analysis of 9054 defects, and 216 rules were then selected through the filtering process. Through in-depth interviews with four experts, the selected rules were classified into 71 "Expected" rules, 22 "Unexpected but Explainable" rules, 59 "Unexpected and Unexplainable" rules, and 64 rules without agreement. Finally, the 152 agreed rules were presented in a matrix for preventive quality management in the construction phase. Using this matrix, when a defect occurs, it is easy to identify which defects are likely to be generated in the following work processes. In this way, the present study has identified the generation rules between defects, considering the characteristic of construction work that defects can overlap.
Furthermore, the study addressed the problem that quality management efforts usually rely on quality managers' experience, whereas there are inter-relations between the various types of defects that are not necessarily captured by expert judgment alone. The rules between defects identified in the present study can be used to prioritize defect prevention efforts toward defects whose severity goes unnoticed but that have the potential to propagate and create a series of other defects. In particular, the identification of rules that quality managers did not know from experience suggests a new direction for research and practice in quality management.
Meanwhile, the study has some limitations. First, deeper consideration is needed for the classification of the rules on which the experts did not agree. Delphi techniques, focus group interviews, etc., are expected to be effective in reaching agreement, which in turn is expected to yield new management measures in quality control. Second, further in-depth interviews with more experts with extensive experience and knowledge are expected to lend statistical validity to the analysis results. Finally, more research is needed on the relationship between the defect generation rules and the cost of repairing the resulting defects. This will greatly contribute to the development of a more cost-effective quality control strategy.