Real-World Failure Prevention Framework for Manufacturing Facilities Using Text Data

Abstract: In recent years, manufacturing companies have been continuously engaging in research for the full implementation of smart factories, with many studies on methods to prevent facility failures that directly affect the productivity of the manufacturing sites. However, most studies have only analyzed sensor signals rather than text manually typed by operators. In addition, existing studies have not proposed an actual application system considering the manufacturing site environment but only presented a model that predicts the status or failure of the facility. Therefore, in this paper, we propose a real-world failure prevention framework that alerts the operator by providing a list of possible failure categories based on a failure pattern database before the operator starts work. The failure pattern database is constructed by analyzing and categorizing manually entered text to provide more detailed information. The performance of the proposed framework was evaluated utilizing actual manufacturing data based on scenarios that can occur in a real-world manufacturing site. The performance evaluation experiments demonstrated that the proposed framework could prevent facility failures and enhance the productivity and efficiency of the shop floor.


Introduction
Recently, manufacturing companies have become increasingly interested in the realization of smart manufacturing [1][2][3], where operation and information technologies are combined to enhance the efficiency and productivity of the manufacturing process [4,5]. The goals of smart manufacturing include reducing costs, enhancing productivity, improving transparency in manufacturing sites, and enabling autonomous control of production [6]. These goals can be achieved by utilizing recently introduced artificial intelligence (AI) technologies and data collected from manufacturing sites [7,8]. Data are particularly essential for the successful application of AI technologies, from achieving a specific goal to realizing complete smart manufacturing [9].
Prevention of facility failure is one of the main objectives of the implementation of smart manufacturing. Facility failure means that a machine or other equipment becomes inoperable because of such factors as breakdown, material supply shortage, and operator breaks. Such malfunction is known to be an immediate cause of a decrease in productivity on a manufacturing shop floor [10]. Facilities are usually interdependent at the shop-floor level, as a product goes through multiple facilities for manufacturing [11]. Therefore, a malfunction of one facility is critical to the entire processing line, as it affects other facilities.
Several studies have been conducted to develop a method to prevent equipment breakdown using failure logs that contain information on previous failures. For example, Li et al. [12] tried to diagnose malfunctions by using bearing vibration, and Li et al. [13] attempted to predict breakdowns by analyzing text data written by operators. A failure log contains multiple factors that describe the situation of each malfunction, with each factor recorded using either sensor signal data or non-sensor signal data. Sensor signal data is recorded in a predefined format, such as vibration, voltage, and pressure collected from a sensor attached to a facility, whereas non-sensor signal data has no specific format, such as text, images, and video [14].
Previous research on the prevention of facility failure has major drawbacks. First, few studies have proposed the implementation of systems to prevent malfunctions in real-world manufacturing sites, to the best of our knowledge. Moreover, most research has presented methods to predict the occurrence of a failure without listing the categories of possible malfunctions, which is necessary information to prevent the failure [15,16]. These methods do not allow operators to actively detect breakdowns; rather, the real-time interaction between an operator and the system can prevent the operator from performing a task [17]. While a few studies exist that used text data or focused on failure prediction categories [13,18], they utilized manual approaches, which were highly time-consuming and not suitable for general applications.
In particular, significant amounts of text written by operators manually are accumulated in manufacturing sites for the following reasons. First, few facilities are systematically integrated, so events that occur in such facilities must be manually recorded. Even in facilities with well-integrated systems, some of the necessary information is not automatically collected. Second, due to frequent changes in facilities, a system cannot completely cover every incident that occurs in the facility. Many of these exceptions can only be logged through a manual process.
For example, failure logs collected from an actual manufacturing company, Woojin Industry Co. located in Ansan-si, Gyeonggi-do, Korea [19], contain many operators' texts. These logs are composed of factors such as the phenomena, causes of breakdowns, and repairs. The repairs are neither automatically recorded nor selected using a button with limited choices. Furthermore, they cannot be strictly categorized, so they are collected in text form. While the phenomena and causes of failure are provided using a selection button, operators often write them in the repairs text box because the choices provided by the selection button do not include all possible phenomena and causes.
Therefore, analyzing text data written manually by operators is essential for failure prevention, but it is more complicated than analyzing sensor signal data since it requires a high level of data analysis expertise and has high computational costs. It is even more challenging than analyzing other types of non-sensor signal data, such as images. Text data is computationally demanding for two reasons: its non-explicit structure leads to many exceptions in the analysis process [20], and it is represented as high-dimensional vectors [21].
There are three specific challenges in analyzing text data collected from manufacturing sites. First, technical terms related to the manufacturing facilities cannot be analyzed using traditional term-by-term natural language processing (NLP) because there are many cases in which multiple terms are combined to represent a specific meaning. For example, "leakage" refers to the flow of liquids or gases, and "applicator" means a machine that spreads coatings or adhesives on something, whereas "leak applicator" means a machine that seals the space between the mating surfaces of a facility to prevent leakage, which is the opposite of the simple combination of the meanings of the two words. Second, different operators use different expressions to indicate the same thing. For instance, "breaking of a wire" is expressed as both "wire is broken" and "broken wire". To cope with this problem, a method is needed to extract the words that represent important semantic aspects in texts and summarize their content [22]. Third, misspellings exist, as human errors occur in the manual writing of records.
Such a method can be used in the real world if the failure prevention system is designed considering the actual manufacturing shop floor characteristics. Since there are various facilities with a wide range of failure categories [23,24], it is impossible to manually define all possible breakdown types in the system. Therefore, automated categorization of the breakdowns is required to generalize the application and save time. If the type of predicted failure is provided, an operator can prepare for the breakdown. Furthermore, the failure type notification should be given at an appropriate time, such as before the operation begins or during a break, to avoid interrupting the operation. If the operator is notified of a potential breakdown during a time that requires their attention, they cannot respond appropriately to it. There are also failure categories that cannot be responded to unless the information is provided in advance. For instance, the operator cannot act immediately after being alerted of "parts breakage".
To this end, we propose a real-world facility failure prevention framework that will help an operator to prepare for a possible malfunction by providing a pre-failure alert list that contains information about any breakdowns that may occur during their working hours. The notification of each possible failure includes the failure category, the part which may fail, and the probability of malfunction. An operator can prevent facility failures by preparing for each possible failure noted in the pre-failure alert list before beginning work.
Specifically, the proposed framework consists of two processes: A deferred-time process and a real-time process. In the deferred-time process, the failure pattern database is constructed by analyzing past failure logs using four steps. First, to structure the text data in the facility failure logs, phrases are extracted from the verbiage contained in the logs, and the logs are vectorized utilizing the extracted phrases. Second, the failures are categorized using weighted k-means clustering [25], which considers the importance levels of various factors in the logs. Third, each log is expanded to identify failure patterns by mapping relevant information such as weather and operator expertise. Finally, the failure patterns that occur frequently are identified to build a failure pattern database. In the real-time process, an operator who is about to begin work gets a pre-failure alert list extracted from the database so they can check the equipment in advance.
The remainder of this paper is organized as follows. Section 2 comprises a literature review on facility failure prediction and prevention. In Section 3, the proposed facility failure prevention framework is introduced and explained. In Section 4, experimental results using real-world data are presented to show the effectiveness of the proposed framework. In Section 5, the proposed framework is validated by studying real-world facility failure scenarios that could occur in actual manufacturing sites. Finally, the paper is concluded in Section 6.

Literature Review
Several studies using various data analysis methods have been conducted regarding the prevention of facility failure. Diverse approaches were adopted for prevention, including failure prediction, failure diagnosis, and remaining useful life (RUL) prediction. Failure prediction is the forecasting of a failure based on historical data, and failure diagnosis is an examination of the cause of the malfunction [26]. RUL prediction refers to forecasting the time until failure by analyzing historical data [27]. A study considered the relevance analysis between breakdowns and environmental factors such as weather conditions, manufacturer, and equipment, but that study was focused on analyzing the textual failure logs rather than preventing breakdown [28]. Studies on the approaches mentioned above are listed in Table 1. The data utilized in previous research are either sensor signal or non-sensor signal. Sensor signal data have been widely used in research on failure prevention in the manufacturing domain. Previous studies used sensor signal data for pressure [15,29,30,31], temperature [29,31,32], vibration [12,16,26,33,34,35,36,37,38,39,40,41], and voltage [45] for failure diagnosis, failure prediction, and RUL prediction. The types of artificial neural networks (ANN) that were applied are auto-encoder [12,15,16,36,39,41], long short-term memory (LSTM) [12,30,31,45], convolutional neural networks (CNN) [37], deep neural networks (DNN) [28,41], and weightless neural networks (WNN) [32]. The k-nearest neighbors (k-NN) algorithm [34], support vector machine (SVM) [26], and transfer learning with high-order Kullback-Leibler (HKL) [38] were also used in a few studies, and analyses were performed using similarity measures [35].
Among non-sensor signal data, images and text were used for failure prediction, failure diagnosis, and RUL prediction. Types of neural networks such as CNN [31,37], DNN [41], and auto-encoder [41] are applied to the image data. There have been a few studies using text data [13,18,28,47]. Li et al. [13] and Bai et al. [18] predicted failure categories, which were manually labeled by operators, and Zhao et al. [47] predicted failure categories that were predefined in the data.
In addition, most studies have predicted only the facility status, that is, whether the device malfunctions or not, despite the variety of failure categories that can occur [48]. A few studies predicted failure categories, which were manually annotated [13,18] or predefined [47]. However, manually labeled failure categories cannot be generalized because failure categories vary for each process. For the same reason, predefined failure categories are not suitable for real-world scenarios.

Materials and Methods
The proposed real-world failure prevention framework consists of a deferred-time process that constructs a failure pattern database and a real-time process that alerts operators to prevent facility failures. The deferred-time process constructs a database of breakdown cause-and-effect pairs from the raw failure logs. The real-time process uses the database to extract a pre-failure alert list of breakdown types that can occur during the operator's working hours, based on each operator's real-time logs, including such factors as the equipment they are using, their proficiency, and the current time.

Deferred Time Process: Overall Process Description
In the deferred-time process, the failure pattern database is constructed from the raw facility failure logs (RFFLs), consisting of two data types: Categorical and text data. The categorical data are collected when one of the predefined categories is selected, and the text data are entered directly by operators because the failure cause is not entered as categorical data. RFFLs are parsed into phrase units that include the meaning of the phenomena and the causes of breakdowns and are categorized in facility failure logs (FFLs). Extended facility failure logs (EFFLs) include environmental information such as site temperature, date, operator, and facility information mapped to the FFL records and failure categories.
Subsequently, failure patterns are identified to create cause-and-effect rules based on the relationships found in the data. A failure pattern consists of the surrounding environment and the breakdown caused by that particular environment. These patterns are the final output of the process, and they are accumulated to create the failure pattern database through the following steps.
First, phrases are extracted from the text data to categorize RFFLs that contain the failure information. Second, facility failure categories are generated utilizing FFLs by clustering the logs. Third, three data elements (FFLs, failure categories, and environmental data) are mapped to each other to form EFFLs for pattern mining. Finally, facility failure patterns are identified by applying FP-growth to the EFFLs. The sub-steps and the specific approaches of each step are summarized in Table 2. Moreover, an overview of these steps is shown in Figure 1.

Table 2. Summary of the deferred time process.
Step | Sub-Step | Approach
Phrase extraction | Constructing the word usage-based dictionary | Defining four word usages
Phrase extraction | Extending the word usage-based dictionary with reference data | Collecting the reference data by crawling the online dictionary and by interviewing experts
Phrase extraction | Phrase extraction using the word usage-based dictionary to generate facility failure logs (FFLs) | Tagging words in text data using the dictionary and extracting phrases using the tagged words

Deferred Time Process: Phrase Extraction
As mentioned above, the text data in the RFFLs is separated into phrases, which are the minimum unit in which multiple words are combined to have meaning. For phrase extraction, we construct a word usage-based dictionary to distinguish the text data. Here, the reference data, which are composed of mechanical-related terminologies commonly used in the actual site and the text data, are used to build the dictionary.
Four word usages are adopted to build the word usage-based dictionary. Four items, WD, ST, SE, and E, are defined as follows. WD is an abbreviation for "word" and indicates a word that must be expressed in a phrase because it has meaning. ST is an abbreviation for "separate" and indicates a word in the phrase that has meaning and also separates phrases. SE and E are abbreviations for "separate and eliminate" and "eliminate", respectively, and indicate words that have no meaning and thus are unnecessary for phrase extraction. SE is also a separator between phrases. We build a dictionary tagging the above items for words that frequently appear in text data of RFFLs.
Meanwhile, the reference data collected by crawling the online mechanical terminology dictionary [49] and by interviewing experts comprise terminologies that are difficult to understand with general knowledge alone because they are used only in the field. This online terminology dictionary provides a set of terms related to the specific field. Terms and definitions from the online terminology dictionary are crawled and added to the word usage-based dictionary. Facility-related terms are tagged as WD because they should be expressed in phrases, as are expertise terms acquired from the expert interviews.
The overall steps to extract phrases using the constructed word usage-based dictionary are as follows. First, each word in the text data is extracted and tagged using the dictionary; this procedure was implemented in Python. Phrases are then formed from the words tagged as necessary, and unnecessary words are deleted. A phrase is a unit used to identify the phenomenon or cause of failure and should not include action details. Therefore, the extracted phrases that contain action descriptions such as "replacement", "exchange", "action", and "change" are deleted in post-processing. Multiple phrases can be extracted, depending on the amount of facility information included in the inputs.
Examples of phrases extracted using the word usage-based dictionary are shown in Table 3. The input in Example 1 is "Parts wear due to excessive rubbing". "Parts", "wear", "excessive", and "rubbing" are WD, and "due to" is SE because the words preceding "due to" indicate a failure phenomenon and the words following show the cause of breakdown. Two phrases in the example are derived as "parts wear" and "excessive rubbing". In Example 2, "Introduction of white foreign matter into the product protector part" is extracted from "Operation of the equipment has stopped due to the introduction of white foreign matter into the product protector part".
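The tagging and splitting logic can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the dictionary entries are a toy subset, the helper name extract_phrases is hypothetical, words missing from the dictionary are assumed to be dropped, and an ST word is assumed to close the phrase it belongs to.

```python
# Toy word usage-based dictionary (hypothetical entries, for illustration only).
WORD_USAGE = {
    "parts": "WD", "wear": "WD", "excessive": "WD", "rubbing": "WD",
    "belt": "WD", "replacement": "WD",
    "due to": "SE",  # separates two phrases and is itself removed
    "the": "E",      # removed without separating phrases
}
ACTION_WORDS = {"replacement", "exchange", "action", "change"}
MAX_ENTRY_WORDS = 2  # longest dictionary entry, counted in words


def extract_phrases(text, usage=WORD_USAGE):
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    phrases, current = [], []
    i = 0
    while i < len(tokens):
        # Greedy longest match so multi-word entries such as "due to" are found.
        width = min(MAX_ENTRY_WORDS, len(tokens) - i)
        while width > 1 and " ".join(tokens[i:i + width]) not in usage:
            width -= 1
        candidate = " ".join(tokens[i:i + width])
        tag = usage.get(candidate, "E")  # assumption: unknown words are dropped
        if tag in ("WD", "ST"):
            current.append(candidate)    # meaningful word stays in the phrase
        if tag in ("ST", "SE") and current:
            phrases.append(" ".join(current))  # separator closes the phrase
            current = []
        i += width
    if current:
        phrases.append(" ".join(current))
    # Post-processing: drop phrases that describe actions rather than failures.
    return [p for p in phrases if not ACTION_WORDS & set(p.split())]


example_1 = extract_phrases("Parts wear due to excessive rubbing")
```

With this toy dictionary, the input of Example 1 is split at "due to" into a phenomenon phrase and a cause phrase, and a phrase such as "belt replacement" would be discarded by the post-processing filter.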

Deferred Time Process: Failure Categorization
Vectorization is an essential step in transforming data from non-numeric to numeric [50][51][52]. The FFL data types, categorical and text data, are embedded into a vector space using one-hot encoding and phrase2vec [53], respectively. The one-hot encoding algorithm determines the number of dimensions by the number of categories, with the corresponding category value set to 1 and the other category values set to 0 [52]. Phrase2vec, based on word2vec [51] which expresses words having the same meaning as adjacent vectors, generates vector representations of phrases, considering their meanings. Extracted phrases are represented as vectors reflecting the calculated semantic similarities between phrases. Therefore, the phrases extracted from the text data that are written differently by each operator, even though they describe the same type of failure, are expressed as adjacent vectors using phrase2vec, implemented by the Python package Gensim [54].
Phrase2vec, also known as the continuous skip-gram model, consists of three neural network layers: The input, projection, and output layers, as shown in Figure 2. The skip-gram model maximizes the average log probability in Equation (1):

$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-s \le j \le s,\; j \ne 0} \log p\left(\varphi_{t+j} \mid \varphi_t\right) \quad (1)$$

where $\varphi_t$ and $\varphi_{t+j}$ are the center phrase and the j-th word from the center phrase, respectively. T is the number of words and phrases in the learning sentence, s is the window size of the model, and j is the window size index. The window size refers to the number of surrounding words before and after the center phrase to represent in the vector. Let u denote the number of unique words and phrases in the training dataset for the skip-gram model. In the case of the model used with Gensim, the Google News dataset [55], consisting of about 100 billion words, is utilized for model training. For the embedding vector generation, the phrase transformed into a u-dimensional one-hot vector (the input layer) is multiplied by the u × d weight matrix, producing a d-dimensional embedding vector (the projection layer). Finally, the estimated probabilities of the u unique words positioned at $\varphi_{t+j}$ are calculated by multiplying the generated embedding vector by the d × u weight matrix (the output layer). Model training proceeds so that the estimated probabilities calculated from the embedding vector become close to the one-hot vector of the answer word positioned at $\varphi_{t+j}$. Therefore, the embedding vector contains information on surrounding words since it is trained to infer the surrounding words well. Phrases with similar meanings tend to be surrounded by similar words, so their embedding vectors also become close to each other. Details on the skip-gram model can be found in [53].
A log of FFLs is created by concatenating the embedding vectors from text data and one-hot encoding from categorical data, as in Equation (2):

$$l_n = o_n^{f} \oplus v_n^{\varphi} \quad (2)$$

where $l_n$ refers to the n-th log of FFLs, $\oplus$ is the concatenation symbol, $o_n^{f}$ is the one-hot encoding of the categorical data for the n-th log, and $v_n^{\varphi}$ is the embedding vector of the extracted phrase for the n-th log.
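As a small illustration of this concatenation, the sketch below builds one log vector from two categorical factors and a phrase embedding. The facility names are taken from the failure logs discussed in this paper, but the phenomenon values, the 3-dimensional embedding, and the helper names are made up for the example; in practice the embedding would come from the phrase2vec model.

```python
def one_hot(value, categories):
    """Return a one-hot list with a 1.0 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]


# Hypothetical category values, for illustration only.
FACILITIES = ["OZ Checker 3", "OZ Application 3"]
PHENOMENA = ["stop", "noise", "leak"]


def vectorize_log(facility, phenomenon, phrase_vector):
    # One-hot part: one block per categorical factor.
    o_n = one_hot(facility, FACILITIES) + one_hot(phenomenon, PHENOMENA)
    # Concatenation of the one-hot part and the phrase embedding.
    return o_n + list(phrase_vector)


# Made-up 3-dimensional phrase embedding standing in for phrase2vec output.
log_vector = vectorize_log("OZ Checker 3", "leak", [0.12, -0.40, 0.33])
```

The resulting vector has one dimension per category value plus the embedding dimensions, so its length grows with the number of categories in each factor.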
Failure categorization using the log of FFLs generated from the vectorization step proceeds to derive significant failure patterns. Failure patterns are identified from the situations in which a breakdown occurs and the corresponding failure categories. The more diverse the situation in which the category occurs is, the more reliable are the pattern results that can be obtained. Therefore, FFLs consisting of multiple phrases should be clustered into a few failure categories.
For the categorization, weighted k-means clustering is used to assign different weights to each factor in the FFLs. K-means clustering, which is widely used because of its simple nature and intuitive advantages, minimizes the variance of the distances between the cluster center and the data point. Weighted k-means clustering aims to minimize the distance variance per cluster by grouping the given data into k clusters and assigning weights to factors. The weights are a directly adjustable parameter.
The objective function of weighted k-means clustering is defined in Equation (3):

$$\min \sum_{k=1}^{K}\sum_{n=1}^{N} i_n^{k} \sum_{m=1}^{M} w_m \left| l_n^{m} - c_k \right|^{2} \quad (3)$$

where $c_k$ is calculated by Equation (4):

$$c_k = \frac{\sum_{n=1}^{N} i_n^{k}\, l_n}{\sum_{n=1}^{N} i_n^{k}} \quad (4)$$

$i_n^{k}$ indicates whether $l_n$ belongs to the k-th cluster or not, with $i_n^{k}$ set to 1 if $l_n$ belongs to the k-th cluster and 0 otherwise. K, N, and M are the numbers of clusters, logs, and factors, respectively, in the FFLs. $w_m$ is the weight of the m-th factor in the FFLs. $l_n^{m}$ refers to the m-th factor of the log in the n-th row, and $c_k$ refers to the centroid of the k-th cluster. The entire process of clustering the vector representation of FFL logs using weighted k-means clustering is shown in Algorithm 1.
Algorithm 1. Clustering the vector representation of FFL logs using weighted k-means clustering.
Randomly select k centroids from all $l_n$;
Set limit of iterations, MaxIter;
repeat
For all $l_n$, calculate the weighted distance to each centroid $c_k$, $\sum_{m=1}^{M} w_m \left| l_n^{m} - c_k \right|^{2}$;
Assign all $l_n$ to its nearest centroid;
Compute the new centroids by using Equation (4);
Iter ← Iter + 1;
until The centroids do not change or Iter ≥ MaxIter;

Weights should be given for all factors. In our case, a total of six factors, Facility, Phenomenon, Cause, Location, Part, and Extracted phrases, were used. Here, Facility, Phenomenon, Cause, Location, and Part are categorical data in RFFLs, and Extracted phrases are generated from text data. Each factor contributes to the distance only in proportion to its assigned weight, and the weights differ across factors in the clustering process. Since the extracted phrases are represented as vectors based on semantic similarity, similar and dissimilar phrases are clearly distinguished. Therefore, extracted phrases have more influence than other factors.
We use two types of weights in our experiments. One is provided by experts, and the other emphasizes the extracted phrases. The cluster is configured according to the parameter k, which is the number of clusters. In this paper, it refers to the number of failure categories, with the FFLs clustered into a total of k failure categories. The clustering process was implemented using the Pycluster [56] package, a module that provides clustering algorithms in Python.
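The clustering loop can be illustrated with a short, dependency-free Python sketch. It is not the Pycluster API and not the authors' code: the function names, the block-wise log layout (one list per factor), the optional initial_indices argument, and the toy data are assumptions made so that the example is deterministic and self-contained.

```python
import random


def weighted_sq_distance(log, centroid, weights):
    """Sum over factors of weight * squared Euclidean distance of that factor."""
    return sum(
        w * sum((a - b) ** 2 for a, b in zip(block, c_block))
        for w, block, c_block in zip(weights, log, centroid)
    )


def mean_log(members):
    """Component-wise mean of a non-empty list of logs (centroid update)."""
    n = len(members)
    return [[sum(vals) / n for vals in zip(*blocks)] for blocks in zip(*members)]


def weighted_kmeans(logs, k, weights, max_iter=100, seed=0, initial_indices=None):
    if initial_indices is None:  # random initial centroids, as in the algorithm
        initial_indices = random.Random(seed).sample(range(len(logs)), k)
    centroids = [[list(block) for block in logs[i]] for i in initial_indices]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: weighted_sq_distance(log, centroids[j], weights))
            for log in logs
        ]
        if new_assignment == assignment:
            break  # assignments, and therefore centroids, no longer change
        assignment = new_assignment
        for j in range(k):
            members = [log for log, a in zip(logs, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_log(members)
    return assignment, centroids


# Toy logs with two factors: a one-hot facility block and a 2-d phrase embedding.
logs = [
    [[1.0, 0.0], [0.0, 0.1]],
    [[0.0, 1.0], [0.1, 0.0]],
    [[1.0, 0.0], [5.0, 5.1]],
    [[0.0, 1.0], [5.1, 5.0]],
]
by_phrase, _ = weighted_kmeans(logs, 2, [0.1, 1.0], initial_indices=[0, 2])
by_facility, _ = weighted_kmeans(logs, 2, [100.0, 0.01], initial_indices=[0, 1])
```

On this toy data, emphasizing the phrase factor groups the logs by phrase similarity, whereas emphasizing the facility factor groups them by facility, which illustrates why the choice of weights matters for the resulting failure categories.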

Deferred Time Process: Data Extension
In this step, data for identifying significant failure patterns are generated by mapping the data. Data that include various factors can lead to diverse patterns. Because many factors can form different combinations, this increases the number of pattern cases. Thus, the data are extended through mapping to extract meaningful patterns.
Extended data are mapped using three types of data: Environmental data, failure categories, and FFLs. Environmental data include the shop-floor temperature, operator proficiency, working hours, and facility information. Operator proficiency means the competence of each operator working on the shop floor. Working hours indicate the shift time and meal-break time by shift. Facility information includes items such as the process to which the facility belongs, the year it was installed on the shop floor, and the facility code.
Data extension proceeds in two steps, as follows. First, the failure categories, which were output by the failure categorization step, are mapped to the FFLs, the output of the phrase extraction step. Second, the mapped data in the first step are connected to the environmental data. The final mapped data from the above two steps are used for pattern mining.
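The two mapping steps amount to successive key-based lookups. The sketch below is a minimal illustration with made-up records; the field names, key choices (log identifier, date, operator, facility), and values are hypothetical and only indicate how the three data sources could be joined.

```python
# All records below are made up for illustration.
ffls = [
    {"log_id": 1, "facility": "OZ Checker 3", "date": "2017-07-03", "operator": "op_a"},
    {"log_id": 2, "facility": "OZ Application 3", "date": "2017-12-11", "operator": "op_b"},
]
failure_category = {1: 4, 2: 11}  # log_id -> cluster index from failure categorization
temperature_by_date = {"2017-07-03": 29.5, "2017-12-11": 3.0}
proficiency_by_operator = {"op_a": "expert", "op_b": "novice"}
facility_info = {
    "OZ Checker 3": {"process": "inspection", "installed_year": 2012},
    "OZ Application 3": {"process": "application", "installed_year": 2015},
}


def extend_log(ffl):
    """Step 1: attach the failure category. Step 2: attach environmental data."""
    extended = dict(ffl)  # copy so the original FFL record is left untouched
    extended["failure_category"] = failure_category[ffl["log_id"]]
    extended["temperature"] = temperature_by_date[ffl["date"]]
    extended["proficiency"] = proficiency_by_operator[ffl["operator"]]
    extended.update(facility_info[ffl["facility"]])
    return extended


effls = [extend_log(ffl) for ffl in ffls]
```

Each extended record then carries both the failure category and the surrounding conditions, which is the form needed for pattern mining.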

Deferred Time Process: Failure Pattern Mining
Facility failure patterns representing situations where failure categories occur are derived using the frequent pattern growth (FP-growth) algorithm [57], which mines frequent patterns. The FP-growth algorithm constructs a tree structure containing the frequency information and mines frequent itemsets through a recursive divide-and-conquer approach. Details on the FP-growth algorithm can be found in [57]. We utilized the FP-growth implementation of the SPMF open-source software [58], which specializes in pattern mining and is available at the SPMF website [59].
Patterns consist of antecedent rules, which describe the cause eliciting a consequence, and a consequent rule, which describes the result. The antecedents in the facility failure patterns include the part, breakdown situation, temporal information such as the month and day, location information such as the facility, and environmental information such as the operator and age of the facility. The consequents in the facility failure patterns are the failure categories.
There are two indicators for evaluating the mined patterns: Support and confidence, calculated using the antecedents and the consequent. Support refers to the probability that a pattern includes a specific environment and failure category in the whole pattern and is defined by Equation (5):

$$\mathrm{Support}(X \Rightarrow Y) = P(X \cup Y) \quad (5)$$

In Equation (5), X is the antecedents and Y is the consequent, and $P(X \cup Y)$ is the proportion of logs that contain every item of both X and Y. Confidence is the probability that a pattern containing a specific environment also includes a particular failure category and is defined as in Equation (6):

$$\mathrm{Confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{P(X \cup Y)}{P(X)} \quad (6)$$
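The two indicators can be computed directly by counting, as in the following sketch. It uses the standard association-rule definitions of support and confidence rather than FP-growth itself, and the item strings and the four toy transactions are made up for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by the
    support of the antecedent (which is assumed to be non-zero)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Four made-up extended logs, each written as a set of "factor=value" items.
transactions = [
    frozenset({"facility=OZ Checker 3", "day=Mon", "category=leak"}),
    frozenset({"facility=OZ Checker 3", "day=Tue", "category=leak"}),
    frozenset({"facility=OZ Checker 3", "day=Mon", "category=wire"}),
    frozenset({"facility=OZ Application 3", "day=Mon", "category=leak"}),
]
X = {"facility=OZ Checker 3"}
Y = {"category=leak"}
rule_support = support(X | Y, transactions)        # 2 of 4 logs contain both
rule_confidence = confidence(X, Y, transactions)   # 2 of the 3 logs with X
```

In this toy case, two of the four logs contain both the facility and the failure category, and two of the three logs for that facility fall into the category.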

Real-Time Process: Pre-Failure Alert Based on Failure Pattern Database
Alerting operators during their working hours through pre-failure alert lists, which include possible malfunction types, allows the operators to prevent potential failures. The operators enter their real-time logs sequentially into the alert system before beginning work, and the alert system uses the log to list the corresponding failure categories, which are the consequents in the matching-pattern rules, confidence values, and support values from the failure pattern database. The list is sorted by confidence level, with the highest-confidence failure types displayed first, so that the operator can check the facility before beginning work and prevent a potential failure.
Real-time logs include environmental information such as the facility, weather, day of the week, and the operator who will receive the list. Real-time logs vary from operator to operator and from time to time. Due to the nature of the manufacturing sites, operator shifts may occur many times each day. The pre-failure alert list, which is provided based on the real-time logs, corresponds to the characteristics of a site where the working environment changes frequently.
The operator can use the given list to act in advance to prevent a potential malfunction before starting work. Because the failure types are listed in order of confidence level, the operator can identify the most likely failure category. If the highest-confidence failure type in the pre-failure alert list given to a particular operator is "leaking", they can check any parts that can leak before they begin work. Thus, they can prevent failures by working through the list provided based on the current real-time logs. This process is illustrated in Figure 3.
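The lookup behind the pre-failure alert list can be sketched as a filter followed by a sort. The pattern records, the item strings, and the function name below are hypothetical, and breaking ties by support is an assumption added for the example; the text above only specifies ordering by confidence.

```python
def pre_failure_alert_list(real_time_log, pattern_db):
    """Return patterns whose antecedent is fully covered by the real-time log,
    ordered with the highest confidence first (ties broken by support)."""
    observed = set(real_time_log)
    matches = [p for p in pattern_db if p["antecedent"] <= observed]
    return sorted(matches, key=lambda p: (p["confidence"], p["support"]), reverse=True)


# Made-up failure pattern database entries, for illustration only.
pattern_db = [
    {"antecedent": frozenset({"facility=OZ Checker 3", "day=Mon"}),
     "category": "broken wire", "confidence": 0.35, "support": 0.02},
    {"antecedent": frozenset({"facility=OZ Checker 3"}),
     "category": "leaking", "confidence": 0.62, "support": 0.05},
    {"antecedent": frozenset({"facility=OZ Application 3"}),
     "category": "parts breakage", "confidence": 0.80, "support": 0.04},
]
real_time_log = {"facility=OZ Checker 3", "day=Mon", "operator=op_a"}
alerts = pre_failure_alert_list(real_time_log, pattern_db)
alert_categories = [p["category"] for p in alerts]
```

Only the patterns whose conditions are all present in the operator's real-time log are returned, so the pattern for the other facility is excluded even though its confidence is higher.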

Results
The proposed framework was evaluated using a real-world dataset provided by the manufacturer Woojin Industry [19]. This manufacturer collected failure logs, called RFFLs in this paper, which recorded the relevant data when a machine malfunctioned. A total of 1394 RFFLs used in this experiment were collected from the oxygen sensors between February 2014 and January 2018.
Examples of RFFLs are shown in Table 4. No. 1 in Table 4 means that the electric machine of OZ Application 3 malfunctioned due to the discharge of the sensor battery. However, there is no category for this cause, so the operator entered the breakdown cause manually in the comment. No. 3 in Table 4 is the log of a failure caused by a foreign substance contaminating the wiring of OZ Checker 3. Disconnection was not in the phenomenon category, so the operator added it to the comment. Specifically, RFFLs include the facility, phenomenon, cause, location, part, and comment. Facility is the name of the facility where the malfunction occurred and takes one of 25 category values. Phenomenon is the symptom of failure and is recorded with one of 14 values. Cause means the fundamental event causing facility failure and takes one of 17 values, although some causes are not covered by any of the provided values. Location and part are the group of parts and the component where the failure occurred, with 5 and 19 values, respectively. The comment is composed of data manually typed by operators to describe the failure because it is difficult to explain the phenomenon or cause effectively using only the given category options. A summary of the RFFLs is shown in Table 5.

Experimental Settings
There are three parameters to be determined for failure categorization. First, the dimension of the vector to represent the extracted phrases for categorization is determined. Second, the number of failure categories, k, is determined. Third, the weight for the k-means clustering of each factor in the FFLs is set.
Extracted phrases in the FFLs are embedded in a vector of 200 dimensions using phrase2vec. Vector dimensions that are too large or too small will limit the ability to fully express the meaning of the text. The vector dimension 200 is commonly used for text data [60].
The number of failure categories, k, is qualitatively predefined in the range from 15 to 20. With a larger k, the categorization loses its meaning, whereas with a smaller k, the variance within each category increases [61]. Furthermore, two reasons led us to set k in the range from 15 to 20. First, there are 14 phenomena and 17 causes in the RFFLs. Second, managers and operators at the site judged that 15 to 20 failure categories were adequate.
In categorizing FFLs into approximately k categories, a different weight was given to each factor in the FFLs, as shown in Table 6. Two sets of weights were determined: one by interviewing on-site experts and one by emphasizing the extracted phrases. In the expert weighting, the facility had the highest weight and the extracted phrases had the lowest. In the phrase-emphasizing weighting, the extracted phrases had the largest weights, and the phenomena and causes, which were not properly categorized, had the smallest. The weight values were selected for the best performance after experiments on several cases.
For the performance comparison, k-means clustering, which does not weight the factors, was selected as the baseline. K-means clustering groups the given data into k clusters so as to minimize the variance of the distances to each cluster center [62]. The method treats all factors equally in the clustering process and assigns the same weight to every feature when evaluating dissimilarity, as shown in the "Given equally" column of Table 6.
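As a minimal sketch of how factor weights can enter the dissimilarity, the snippet below computes a weighted squared Euclidean distance over factor-wise encoded logs. It also shows that scaling each factor block by the square root of its weight gives the same value under the ordinary distance, so a standard k-means routine could be reused on the scaled vectors. The factor names, encodings, and weight values are hypothetical and are not the values in Table 6; the weighted k-means used in this paper may differ in detail.

```python
import math

def weighted_sq_distance(a, b, weights):
    """Weighted squared Euclidean distance between two factor-wise encoded logs."""
    total = 0.0
    for factor, w in weights.items():
        total += w * sum((x - y) ** 2 for x, y in zip(a[factor], b[factor]))
    return total

def scale_and_flatten(log, weights):
    """Scale each factor block by sqrt(weight) and concatenate into one vector."""
    vec = []
    for factor in sorted(weights):
        s = math.sqrt(weights[factor])
        vec.extend(s * x for x in log[factor])
    return vec

def sq_distance(u, v):
    """Ordinary (unweighted) squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

# Hypothetical weights emphasizing the phrase embedding (illustrative only).
weights = {"facility": 0.2, "phenomenon": 0.1, "cause": 0.1, "phrase": 0.6}

# Hypothetical encodings: one-hot blocks for categories, a short phrase vector.
log_a = {"facility": [1, 0], "phenomenon": [0, 1], "cause": [1, 0],
         "phrase": [0.3, -0.2, 0.5]}
log_b = {"facility": [0, 1], "phenomenon": [0, 1], "cause": [0, 1],
         "phrase": [0.1, 0.4, 0.5]}

d_weighted = weighted_sq_distance(log_a, log_b, weights)
d_scaled = sq_distance(scale_and_flatten(log_a, weights),
                       scale_and_flatten(log_b, weights))
```

Setting all weights to the same value reproduces the baseline, in which every factor contributes equally to the dissimilarity.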
Popular cluster evaluation metrics, namely the adjusted Rand index and V-measure, were employed to identify performance differences between the proposed and comparison methods. Both metrics are external cluster validation measures with a maximum value of 1, which indicates perfect agreement with the labels. External measures evaluate how well the cluster assignments agree with known labels. Therefore, labeled data are required to use the two metrics to evaluate the failure categorization. Experts labeled the data through qualitative failure categorization, yielding 708 labeled FFLs assigned to seven failure categories.
The adjusted Rand index [63] addresses a problem with the Rand index [64], which tends to increase in value as the number of clusters increases. The Rand index is the ratio of the number of correctly clustered pairs to the number of all pairs. The adjusted Rand index is given in Equation (7), where n is the total number of possible pairs in the given data. TP (true positive) is a pair with the same label that is clustered in the same category. TN (true negative) is a pair with different labels that is clustered in different categories. FP (false positive) and FN (false negative) indicate incorrect clustering: FP is a pair with different labels that is clustered in the same category, whereas FN is a pair with the same label that is clustered in different categories.
V-measure [65] is defined as the harmonic mean of homogeneity and completeness. Let Q and K be the set of classes categorized qualitatively by experts and the set of clusters, respectively. Then, the homogeneity score is defined in Equation (8), where H(Q | K) and H(Q) are defined as in Equations (9) and (10), respectively.
In these equations, $b_{qk}$ is the number of logs in both categorized class q and cluster k, and N and n are the number of logs and the number of classes, respectively. As Equation (8) shows, the homogeneity score increases as each cluster comes closer to containing logs with only a single label. The completeness score increases as the logs with a given label are concentrated in a single cluster, and it is calculated by Equation (11).
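For reference, the standard V-measure definitions can be written with the symbols used above as follows. This is a sketch of the usual formulation rather than a verbatim reproduction; the equation numbers are assumed to match those cited in the text, and the sums over q run over the n classes in Q. By convention, h = 1 when H(Q) = 0 and c = 1 when H(K) = 0.

```latex
\begin{align}
h &= 1 - \frac{H(Q \mid K)}{H(Q)} \tag{8} \\
H(Q \mid K) &= -\sum_{k \in K} \sum_{q \in Q} \frac{b_{qk}}{N}
               \log \frac{b_{qk}}{\sum_{q' \in Q} b_{q'k}} \tag{9} \\
H(Q) &= -\sum_{q \in Q} \frac{\sum_{k \in K} b_{qk}}{N}
        \log \frac{\sum_{k \in K} b_{qk}}{N} \tag{10} \\
c &= 1 - \frac{H(K \mid Q)}{H(K)} \tag{11} \\
H(K \mid Q) &= -\sum_{q \in Q} \sum_{k \in K} \frac{b_{qk}}{N}
               \log \frac{b_{qk}}{\sum_{k' \in K} b_{qk'}} \tag{12} \\
H(K) &= -\sum_{k \in K} \frac{\sum_{q \in Q} b_{qk}}{N}
        \log \frac{\sum_{q \in Q} b_{qk}}{N} \tag{13} \\
V_{\beta} &= \frac{(1 + \beta)\, h\, c}{\beta\, h + c} \tag{14}
\end{align}
```

With β = 1, the score reduces to the plain harmonic mean of h and c; β greater than 1 weights completeness more strongly, and β less than 1 weights homogeneity more strongly.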
In Equation (14), the parameter β is used to adjust the weights of homogeneity h and completeness c.
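The two external measures can be illustrated with a small pure-Python sketch. It counts pairs by label and cluster agreement for the Rand index and its adjusted form, and computes V-measure from the class and cluster counts. The function names and toy data are illustrative assumptions, and the pair-counting form of the adjusted Rand index used here is the standard one, which may be written differently from Equation (7).

```python
import math
from collections import Counter
from itertools import combinations

def pair_counts(labels, clusters):
    """Count pairs by (same label?, same cluster?)."""
    both_same = both_diff = split = merged = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_label and same_cluster:
            both_same += 1
        elif not same_label and not same_cluster:
            both_diff += 1
        elif same_label:
            split += 1    # same label, placed in different clusters
        else:
            merged += 1   # different labels, placed in the same cluster
    return both_same, both_diff, split, merged

def rand_index(labels, clusters):
    """Share of pairs on which labels and clusters agree."""
    ss, dd, sp, mg = pair_counts(labels, clusters)
    total = ss + dd + sp + mg
    return 1.0 if total == 0 else (ss + dd) / total

def adjusted_rand_index(labels, clusters):
    """Chance-corrected Rand index in pair-counting form."""
    ss, dd, sp, mg = pair_counts(labels, clusters)
    denom = (ss + sp) * (sp + dd) + (ss + mg) * (mg + dd)
    return 1.0 if denom == 0 else 2.0 * (ss * dd - sp * mg) / denom

def entropy(counts, n):
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def conditional_entropy(joint, marginal, n):
    """H(X | Y) from joint counts {(x, y): b} and marginal counts {y: total}."""
    return -sum((b / n) * math.log(b / marginal[y]) for (_, y), b in joint.items())

def v_measure(labels, clusters, beta=1.0):
    """Weighted harmonic mean of homogeneity h and completeness c."""
    n = len(labels)
    q_counts, k_counts = Counter(labels), Counter(clusters)
    h_q, h_k = entropy(q_counts.values(), n), entropy(k_counts.values(), n)
    h_q_given_k = conditional_entropy(Counter(zip(labels, clusters)), k_counts, n)
    h_k_given_q = conditional_entropy(Counter(zip(clusters, labels)), q_counts, n)
    h = 1.0 if h_q == 0 else 1.0 - h_q_given_k / h_q
    c = 1.0 if h_k == 0 else 1.0 - h_k_given_q / h_k
    return 0.0 if beta * h + c == 0 else (1 + beta) * h * c / (beta * h + c)

# Toy example: two expert classes, three clusters (one class is split).
labels = [0, 0, 1, 1]
clusters = [0, 0, 1, 2]
ri = rand_index(labels, clusters)
ari = adjusted_rand_index(labels, clusters)
v = v_measure(labels, clusters)
```

In the toy example every cluster is pure, so homogeneity is 1, while splitting the second class across two clusters lowers completeness and therefore the V-measure.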
The failure patterns derived from pattern mining were evaluated using two values: support and confidence. The two values are calculated from the antecedents and consequents, which are the environment and failure categories, respectively. Higher support indicates that a pattern occurs more frequently, and higher confidence indicates that the consequent follows the antecedents more reliably; the patterns with the highest values are judged the most significant. Therefore, we evaluated the quality of the failure patterns by identifying and comparing the distributions and maximum values of confidence and support.
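The two values can be computed directly from their definitions, as in the following minimal sketch. Each log is represented as a set of items, where the antecedent items describe the environment and one item carries the failure category. The item strings and the five toy logs are hypothetical and are not taken from the dataset.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedents, consequent):
    """support(antecedents plus consequent) divided by support(antecedents)."""
    s_ante = support(transactions, antecedents)
    if s_ante == 0:
        return 0.0
    return support(transactions, set(antecedents) | {consequent}) / s_ante

# Hypothetical extended logs: environment items plus a failure-category item.
transactions = [
    frozenset({"OZ Machine 1", "Not lunch time", "category: aging"}),
    frozenset({"OZ Machine 1", "Not lunch time", "category: aging"}),
    frozenset({"OZ Machine 1", "Lunch time", "category: wiring break"}),
    frozenset({"OZ Checker 3", "Not lunch time", "category: wiring break"}),
    frozenset({"OZ Machine 1", "Not lunch time", "category: wiring break"}),
]

antecedents = {"OZ Machine 1", "Not lunch time"}
pattern_support = support(transactions, antecedents | {"category: aging"})
pattern_confidence = confidence(transactions, antecedents, "category: aging")
```

Here the pattern appears in two of five logs, and two of the three logs that satisfy the antecedents fall into the "aging" category.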

Phrase Extraction Results
In the phrase extraction step, significant phrases were extracted from RFFL comments using a word-usage dictionary. The dictionary was constructed by dividing phrases, which are the smallest units with meaning contained in the comment, by their word usage. In addition, technical jargon for the apparatus, included in the reference data or obtained by crawling, was added to the dictionary. The phrases were extracted by matching the words in the comment against the word usages defined in the dictionary. Table 7 shows examples of phrases extracted from RFFLs. Since multiple phrases could be extracted from a single comment, a total of 2446 phrases were extracted from 1394 RFFLs. Figure 4 shows the frequency of appearance (as a percentage of the total) after clustering the 2446 phrases into 17 representative types. Phrases related to wiring breaks were extracted most often.
The results of the performance measurement for failure categorization are depicted in Figure 5. As mentioned above, V-measure and rand index were used to compare k-means clustering (black line in Figure 5), weighted k-means clustering with the weights given by experts (green line in Figure 5), and weighted k-means clustering with the weights emphasizing phrases (light blue line in Figure 5). The average, minimum, and maximum values over 15 repetitions are displayed together at each data point. Weighted k-means clustering using the weights emphasizing phrases generated better results than the other methods. In particular, it shows the best performance when k is 17. The rand index tends to increase as k increases, and the range between the minimum and maximum values for weighted k-means emphasizing phrases is smaller than that of the other methods. A similar trend is found in the V-measure results.

Figure 5. Experiment results of the proposed and compared methods in failure categorization using rand index and V-measure according to diverse k. Legend: k-means; weighted k-means (experts); weighted k-means (phrases).

Data Extension Results
EFFLs, which consist of antecedents and a consequent, are generated by matching three data elements: FFLs, failure categories, and environmental data. The input data for pattern derivation and the factors are shown in Table 8. There are 21 factors in total, with 20 antecedents and one consequent. "Morning or afternoon or night" indicates the time period when the breakdown occurred. "Lunch time or not", "dinner time or not", and "midnight mealtime or not" indicate whether the facility failed during a mealtime as defined at the site. "Average temperature" is the site temperature at the time of failure. "Cumulative failure number of the facility" is the cumulative number of times the facility failed, and "cumulative failure number of the operator" is the cumulative number of times any facility failed while being controlled by the operator. "Degree of facility aging" is the time from when the facility was installed at the site to when it failed.
Failure patterns were obtained by applying FP-growth to EFFLs. Table 9 shows examples of failure patterns that have high confidence values. The pattern in the first row of the table can be described as follows. When the antecedents are "OZ Machine 1", "Person of action 1", "Air equipment", "Not shift time", "Not lunch time", "Not dinner time", and "Not midnight meal time", the failure category "aging" accounted for 29% of the total patterns. The confidence value is 97%, indicating that the failure category "aging" accounts for 97% of the patterns that satisfy the antecedents.
Table 9. Examples of the extracted failure patterns sorted by confidence value.
The maximum confidence and support values among the derived patterns are 100% and 53%, respectively. A total of 2653 patterns were derived, and the failure categories derived in these patterns accounted for 11 out of the 17 possible failure categories. Figure 6 is a scatter plot demonstrating the support and confidence values for each pattern.
The x-axis shows the support, the y-axis shows the confidence, and each point corresponds to one pattern. The more patterns that share the same support and confidence values, the darker the point. The scatter plot shows that, for the same support value, the confidence values are spread over a wide range. In particular, the dark points in the region above 80% confidence indicate that clear failure patterns were detected.
We also classified the 17 possible failure categories from a 124-dimensional one-hot vector representing the antecedents using an ANN implemented with the scikit-learn [66] package for Python. The ANN, composed of one 100-dimensional hidden layer with the rectified linear unit (ReLU) activation function, was trained using the Adam optimizer [67] with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. L2 regularization with a weight decay of $10^{-4}$ was also used to prevent overfitting. All 2653 samples were randomly split into a training set and a test set at a ratio of 8 to 2, and the ANN model was trained for 30 epochs with the cross-entropy loss. After training on the training set, the model achieved a classification accuracy of 85.2% on the test set.
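The ANN configuration described above maps naturally onto scikit-learn's `MLPClassifier`, as in the sketch below. This is an assumed reconstruction of the setup, not the authors' code. The input is synthetic random data with the same shape as the described dataset (2653 samples, 124 binary features, 17 classes), so the resulting accuracy is meaningless and does not reproduce the reported 85.2%.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): random binary antecedent vectors
# and random failure-category labels with the shapes given in the text.
rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 2653, 124, 17
X = rng.integers(0, 2, size=(n_samples, n_features)).astype(float)
y = rng.integers(0, n_classes, size=n_samples)

# Random 8:2 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One hidden layer of 100 ReLU units, Adam optimizer, L2 penalty of 1e-4.
# For the Adam solver, max_iter is the number of epochs; the classifier
# minimizes the log-loss (cross-entropy). Training for only 30 epochs may
# emit a ConvergenceWarning, which is expected here.
clf = MLPClassifier(
    hidden_layer_sizes=(100,),
    activation="relu",
    solver="adam",
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8,
    alpha=1e-4,
    max_iter=30,
    random_state=0,
)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

With real data, `X` would hold the one-hot antecedent vectors and `y` the failure category of each pattern.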

Real-World Facility Failure Scenario
The proposed framework can be applied to real-world scenarios at actual manufacturing sites. To illustrate a real-world application, we compared the pre-failure alert lists of two operators controlling the same facility. Operators 1 and 2 are assumed to have different job proficiencies: high expertise and low expertise, respectively. Table 10 shows the pre-failure alert list for Operator 1, and Table 11 shows that for Operator 2. In this scenario, the failure types in Operator 1's list are more diverse than those in Operator 2's. Given the same antecedents, a greater variety of failure types implies lower confidence values for each type, so the list for Operator 1 contains more failure types with lower confidence than the list for Operator 2.
The pre-failure alert lists for the two operators in this scenario are visualized in Figure 7 (Operator 1) and Figure 8 (Operator 2). White nodes denote the antecedents, and gray nodes denote the consequents. Each edge represents the confidence value of the pattern linking the antecedents and the consequent; the higher the value, the thicker the edge. The visualization confirms that Operator 1's list has lower confidence values and more breakdown types than Operator 2's.
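The lookup that produces a pre-failure alert list can be sketched as a search over the failure pattern database. In this illustrative version, a pattern matches when all of its antecedents are contained in the current conditions, and matches are ranked by confidence and then support. The matching rule, the `FailurePattern` structure, and all pattern values are assumptions for illustration and are not taken from Tables 10 and 11.

```python
from typing import NamedTuple

class FailurePattern(NamedTuple):
    antecedents: frozenset
    category: str
    support: float
    confidence: float

def pre_failure_alert_list(pattern_db, current_conditions,
                           min_confidence=0.0, top_n=5):
    """Return matching patterns, highest confidence first."""
    current = frozenset(current_conditions)
    matches = [
        p for p in pattern_db
        if p.antecedents <= current and p.confidence >= min_confidence
    ]
    matches.sort(key=lambda p: (p.confidence, p.support), reverse=True)
    return matches[:top_n]

# Hypothetical failure pattern database (illustrative values only).
pattern_db = [
    FailurePattern(frozenset({"OZ Machine 1", "Operator 2"}), "aging", 0.30, 0.90),
    FailurePattern(frozenset({"OZ Machine 1", "Operator 1"}), "wiring break", 0.05, 0.40),
    FailurePattern(frozenset({"OZ Machine 1", "Operator 1"}), "aging", 0.04, 0.35),
    FailurePattern(frozenset({"OZ Machine 1", "Operator 1", "Night"}),
                   "sensor error", 0.02, 0.25),
]

alerts_op1 = pre_failure_alert_list(
    pattern_db, {"OZ Machine 1", "Operator 1", "Morning"})
alerts_op2 = pre_failure_alert_list(
    pattern_db, {"OZ Machine 1", "Operator 2", "Morning"})
categories_op1 = [p.category for p in alerts_op1]
categories_op2 = [p.category for p in alerts_op2]
```

In this toy database, the first operator receives several failure types with low confidence, while the second receives a single type with high confidence, which mirrors the contrast described in the scenario.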

Conclusions
In this paper, we proposed a real-world facility failure prevention framework based on failure logs to prevent facility failures that directly affect the production rate at manufacturing sites. Specifically, operators are alerted with pre-failure alert lists at every shift time, reflecting that many manufacturing sites have multiple shifts and the work environment changes frequently. The failure logs containing text data were categorized and mapped to environmental data to build a failure pattern database to generate the list. When the real-time logs are searched in the failure pattern database via the alert system, a pre-failure alert list corresponding to the real-time logs can be provided.
To determine the failure phenomena and causes from the text entered directly by operators, phrases were extracted to capture their meaning. In the failure categorization step for pattern mining, the factors in the failure logs were weighted so that phrases indicating the breakdown phenomenon or cause received more weight. The data were then extended by mapping them to various environmental information, which diversified the pattern cases. When the lists were extracted for two operators with different proficiencies under the given environmental information, we found that breakdown types with high confidence values were extracted for the less skilled operator.
Even when a pre-failure alert list has been provided, breakdowns can occur for a variety of reasons. Therefore, in future work, we will build a system that provides repair methods for the corresponding breakdowns so that a malfunction can be managed successfully even if it occurs.