1. Introduction
The creation and application of classification category labels are essential for transforming complex information into structured knowledge, enabling consistency and supporting complex decision making and automation across a wide range of disciplines, including scientific research, commercial operations, and industrial maintenance. Classification is a supervised learning task that involves predicting a category label for given input data. Categories used for summary and reporting purposes have historically been identified by domain experts based on their past experiences and norms. Our interest lies in the general case where expert-generated category lists require improvement, and unsupervised learning, on its own, struggles to effectively identify categories for multi-class classification of human-generated texts.
As a case study, we look at identifying categories for classifying failure modes in industrial maintenance. For many organisations across domains like defence, mining, and manufacturing, asset maintenance is a multi-million dollar task. In 2023, the heavy machinery repair and maintenance industry in Australia had a revenue of USD 16.5 billion [
1]. Maintenance is essential for maximising the runtime productivity of assets, minimising repair costs, and minimising threats to human safety in the workplace. Our focus is on creating a list of codes that represent the manners in which an asset can fail, known as failure modes [
2].
In industrial maintenance, whenever a piece of equipment or an asset fails, has work performed on it, or has a request for work to be performed, a Maintenance Work Order (MWO) is written up by a maintenance technician or the equipment operator who observed the failure. Example MWOs are shown in
Table 1. These MWOs can be thought of as doctors’ records in the industrial world, a world in which the role of the patient is played by equipment. MWO records contain valuable observations by technical personnel about the deterioration and/or failure of equipment or its parts (physical objects). These records are unstructured short sentences, usually four to eight words in length. The language used to record MWOs is informal and filled with colloquialisms, acronyms, misspellings, and technical jargon [
3]. Grammatically, MWOs are diverse, using past, present, or present participle verbs (leaked, leaks, leaking); nouns (the leak); or states (e.g., has a leak) or synonyms thereof in describing how equipment is deteriorating or failing.
In industrial maintenance, accurate analysis of MWOs is vital for developing effective maintenance strategies and improving asset management. Failure Modes and Effects Analysis (FMEA) is a process in which engineers identify the function of each component of an asset and determine all potential undesirable states (failure modes), failure causes, and failure effects associated, so that improvements can be made for maintenance strategies, activities, and future designs. This information is stored in tables, such as in
Table 2. MWOs, which are essentially records of failure events and failure modes, are an integral source of information for FMEA. Analysis of historical MWOs helps identify recurring issues and their root causes and informs decision-making processes regarding preventive maintenance, deterioration detection, repairs, and equipment replacement to prevent future undesired events.
In order to facilitate data analytics, organisations want to classify undesirable behaviour observations in MWOs into a pre-determined list of standardised Failure Mode Codes (FMCs) in a reproducible way. For example, the MWO “air conditioner leaking” might be assigned an FMC
LEA that categorises leaking failures. This is a routine and time-consuming task for reliability engineers, and there is significant interest in machine-assisted classification to a set of pre-determined categories [
3,
4,
5,
6,
7,
8,
9].
Table 2.
Example FMEA table for heater system [
10].
| Component | Function | Failure Mode | Failure Effect |
|---|---|---|---|
| Heaters | To heat up unit | (a) overcurrent | loss of all heating |
| | | (b) short circuit | loss of all heating |
| | | (c) earth fault | loss of all heating |
| Terminal box | Connect supply to heaters | (a) overcurrent | loss or reduction of heating |
| | | (b) short circuit | loss of all heating |
| | | (c) cable failure | loss or reduction of heating |
However, in working with organisations, we have observed many occasions on which a category list proves unfit for purpose once actual MWO data are classified. For example, we note the extensive use of the category ‘other’ and the poor agreement between different classifiers (human or machine).
Currently, there is no agreed standard list of generic failure modes, even for commonly used equipment such as motors, compressors, and pumps, despite the limited and well-understood ways in which these assets can fail. There are several reasons for the lack of a standard list of maintenance FMCs. First, while there are domain-specific standard FMC lists, such as the FMC list in ISO14224 [
2] for the oil and gas sector, they only cover assets of core interest to offshore oil and gas. Second, because many FMCs in domain-specific lists like ISO14224 are incompatible with other domains, many organisations use their own FMC lists developed internally by domain experts. This results in inconsistent FMC coding across organisations, limiting benchmarking and information exchange between equipment vendors and asset owners. Finally, the development of FMC lists by individual companies often results in lists of poorly described and differentiated FMCs, creating issues for technicians who need to decide which FMC to select [
11,
12].
These issues motivate our investigation into unsupervised machine learning to generate a set of FMCs from the MWO data that describe the failure event. Previous works that approach category generation relying on statistical methods such as topic modelling [
13,
14,
15] or clustering [
15] have experienced challenges when applied to a dataset of MWOs. This is due to the sparse and colloquial nature of the short texts [
16]. Another study [
17] extracts different degradation states of excavator buckets, using Convolutional Neural Networks (CNNs) to perform feature extraction, with word2vec and LSA representations of the MWO as the CNN input and output, respectively. This combined embedding is then used in K-means clustering, with each cluster being a category of degradation state. A small number of clusters were formed, with many shared words characterising each cluster, particularly equipment words. When it comes to FMCs, too few codes cannot cover the ways in which equipment fails in a manner that is descriptive and helpful for analysis. Additionally, action words that are more important in describing degradation state, like repair and replace, are more limited in vocabulary and used more commonly across MWOs compared to words describing failure modes like leak, blown, disconnected, or warm. A way of prioritising failure concepts is needed.
An essential engineering task in the design phase of products and processes is Failure Modes and Effects Analysis [
10]. The design process considers first the desired function of the product or process [
18]. For each function, potential functional failures and associated failure modes (risks and controls) are identified. It is not until later in the design phase that the type of equipment (e.g., a conveyor or truck) to deliver the function is identified. We hypothesise that finding a way to incorporate knowledge of the function of the equipment into a knowledge graph will assist in clustering of failure modes since failure modes and functions are linked, at least in theory and practice, in the engineering design process. To achieve this, we look for ways to incorporate knowledge about type hierarchy into MWO text embeddings.
We use triples that include the entity and relation typing available in the annotated MaintIE MWO dataset to train a Bidirectional Long Short-Term Memory (Bi-LSTM) network to perform feature extraction between the annotations and off-the-shelf embeddings. We hypothesise that leveraging expert knowledge (captured in the annotations) to create sentence embeddings can improve unsupervised function-based clustering performance. To evaluate the success of different embedding-based approaches, we define criteria for what makes a good cluster for a failure mode category and use Normalised Mutual Information Score to assess the clusters. From an engineering perspective, success is the creation of clusters from MWO texts that provide insight into engineering functions and failure modes experienced by the equipment described in the MWO. The goal from a computer science perspective is to explore how the inclusion of additional information in the form of triples impacts the performance of an unsupervised clustering task.
The rest of the paper is structured as follows:
Section 1 explores previous works related to extracting a set of codes from work orders.
Section 2 discusses the MWO dataset used as well as any data exploration we conducted prior to our final method.
Section 2.3 explores different off-the-shelf unsupervised methods.
Section 3 and
Section 4 analyse and discuss the results and what insights they provide us.
Section 5 describes the contributions and potential future works for this study.
Related Work
In order to identify a list of generic categories from a dataset of short texts, we examine methods to extract the hidden trends or similarities within a dataset. Previous works use unsupervised learning techniques such as topic modelling and clustering to extract hidden trends and group documents.
Gibbs Sampling Dirichlet Mixture Model (GSDMM) [
19] is a topic modelling algorithm for short texts. It treats each document as only having one topic and can infer the number of clusters automatically. In one study, clustering techniques were used on MWOs before using GSDMM on each cluster (semantically similar MWOs) to extract topics [
13]. These topics were the concepts contained within each cluster and were then used to highlight possible areas that were overlooked when subject matter experts created a taxonomy for key systems, actions, and issues (failure mode). A similar approach used a hybrid of unsupervised learning techniques to automatically generate a taxonomy of terms for manufacturing capability data [
15]. This was achieved by performing clustering to increase the accuracy and efficiency of topic modelling, and the four topics produced were used as high-level labels on clusters. Two factors improved the performance of the topic modelling in this study: having a small number of topics (three or four) and having a large list of stopwords (around 3000 words) to filter out irrelevant terms [
15].
Unlike topic modelling techniques that rely solely on statistical methods, text clustering depends on how sentences are represented in vector form, and can incorporate semantic, statistical, and syntactical information when grouping texts. Clustering algorithms are a point of interest in this literature review as they provide a way to group MWOs by exploring the hidden patterns in the data without the need for expert annotations upon the dataset, creating clusters of MWOs based on similarities related to failure modes.
In text clustering, text data are represented in a multi-dimensional spatial representation, and the distance between points is used to determine clusters. In order to perform text clustering, several decisions need to be made, including what features will be represented, how to measure similarity between documents based on those features, what clustering algorithm to use, and how to measure its performance [
20]. With text clustering algorithms, documents are represented as a vector, and the words in each document represent the features of the vector. There are often two issues with this vector representation of documents, one being the huge vector size of text documents and the other being the large number of terms in the vocabulary of possible words [
21]. The first issue is less relevant due to the short text format of MWOs, though the second issue still applies, as there are many ways to describe the same maintenance work done in MWOs.
Feature engineering is a preprocessing step used to extract features from data. Statistical text feature engineering algorithms [
21] include Term Frequency-Inverse Document Frequency (TF-IDF) [
14,
22] and latent semantic analysis (LSA) [
15,
17,
23], which are designed to improve clustering by removing unnecessary or unimportant terms and to increase the interpretability of clusters [
21]. However, MWOs by nature are sparse, where each word within a document will usually only appear once, making it difficult for traditional count-based feature selection techniques to determine semantic similarity [
23]. An example is TF-IDF, which involves the term frequency as a metric. In the case of short texts, term frequency scores will likely be 1, resulting in a sparse, high-dimensional representation of documents and complexity problems [
16].
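The sparsity issue can be illustrated with a minimal sketch in plain Python. The four texts below are hypothetical MWO-style strings (not drawn from any dataset), and the weighting uses raw counts with idf = ln(N/df), which is only one of several TF-IDF variants.

```python
import math
from collections import Counter

# Hypothetical MWO-style short texts (illustrative only).
docs = [
    "air conditioner leaking",
    "replace blown hydraulic hose",
    "engine oil leak",
    "pump seal leaking oil",
]
tokenised = [d.split() for d in docs]
vocab = sorted({w for doc in tokenised for w in doc})

def tf_idf(doc_tokens, all_docs):
    """Return {term: tf * idf} using raw counts and idf = ln(N / df)."""
    counts = Counter(doc_tokens)
    n_docs = len(all_docs)
    scores = {}
    for term, tf in counts.items():
        df = sum(1 for d in all_docs if term in d)
        scores[term] = tf * math.log(n_docs / df)
    return scores

# Largest within-document term count across the corpus.
max_tf = max(max(Counter(doc).values()) for doc in tokenised)

# Fraction of the vocabulary-sized vector that each document actually fills.
density = [len(set(doc)) / len(vocab) for doc in tokenised]
weights = [tf_idf(doc, tokenised) for doc in tokenised]
```

Here `max_tf` is 1, so the term-frequency factor carries no information and the weights reduce to the idf term alone, while each document occupies only a quarter to a third of the 12 vocabulary dimensions.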
As an alternative, word embeddings have been used to represent the semantics of words and utilised in clustering maintenance records. A popular word embedding method is Word2Vec [
17,
23,
24,
25], which represents a word as a vector that takes into account the semantic, syntactical, and contextual data of the word. However, when dealing with representing MWOs, further pre-processing must be performed on each word vector due to the dataset being filled with spelling mistakes, acronyms, and other inconsistencies [
17,
26]. Statistical text feature selection and word embeddings have been used to create a deep embedding of an MWO [
17], while a hierarchical word taxonomy of hypernyms was used to form semantic features in another study [
27].
Information-theoretic measures, including mutual information (MI), have been used for representation shaping and clustering evaluation. For example, a previous study [
28] uses mutual information maximization objectives at the sequence and token level to perform clustering for short texts. Clusters were evaluated through accuracy and Normalized Mutual Information (NMI), which quantifies the amount of shared information between cluster assignments and the true labels. Because NMI requires true labels to be calculated, it is an external evaluation measure that can be applied to unsupervised clustering only when reference labels are available. Another study [
29] performed hierarchical clustering on a dataset of microblog short texts, using mutual information as a measure of similarity between formed clusters.
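As a concrete reference for how NMI compares a clustering against reference labels, the following is a minimal pure-Python sketch. The arithmetic-mean normalisation, 2·I(A;B) / (H(A) + H(B)), is an assumption; other normalisations (for example, geometric mean or maximum) are also in common use, and the text does not state which one the cited studies adopt.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label sequences of equal length."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        mi += (n_xy / n) * math.log((n_xy * n) / (count_a[x] * count_b[y]))
    return mi

def nmi(a, b):
    """NMI with arithmetic-mean normalisation (an assumed convention)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a + h_b == 0:
        return 1.0  # both labelings are constant, hence identical partitions
    return 2 * mutual_information(a, b) / (h_a + h_b)

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # cluster ids are arbitrary
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The score depends only on how items are grouped, not on the cluster identifiers, so relabelled but identical partitions score 1 and statistically independent partitions score 0.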
One commonly used clustering algorithm is K-means clustering [
30]. It is a hard clustering method where each item belongs to exactly one cluster, and it involves making clusters based on the proximity (measured using Euclidean distance) of items within a vector space to a user-specified number of cluster centroids. There are previous studies into K-means algorithm use with maintenance records, including using Convolutional Neural Networks (CNNs) to perform feature extraction prior to K-means clustering to extract clusters of MWOs containing different degradation states of equipment [
17] and for the clustering of manufacturing suppliers according to their capabilities with Latent Semantic Analysis (LSA) feature selection [
15]. K-means clustering has been shown to be useful in working with data that is not homogeneous in nature, which is applicable to MWOs.
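The assignment and update steps of K-means can be sketched in a few lines of plain Python. This is a minimal illustration of Lloyd's algorithm with user-supplied starting centroids; the 2-D points are hypothetical stand-ins for sentence embeddings, and a practical implementation would add random restarts and a smarter initialisation.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm: assign each point to its nearest centroid
    (Euclidean distance), then move each centroid to the mean of its members."""
    labels = []
    for _ in range(n_iter):
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D embeddings forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
```

Each point receives exactly one label, which is what makes this a hard clustering method; the number of clusters is fixed by the number of starting centroids.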
Hierarchical clustering involves clusters being formed iteratively by regrouping previously formed groups together to eventually form one super-cluster in a bottom-up approach, or a top-down approach in the opposite direction. The bottom-up Ward agglomerative hierarchical approach to clustering has been explored in clustering data on maintenance activities by failure mode [
9], although it is difficult to evaluate performance due to a lack of prior knowledge or training data labels.
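The bottom-up procedure can be sketched as follows. This is a naive average-linkage version written for readability rather than speed; the Ward criterion mentioned above would replace the distance function with the increase in within-cluster variance. The points are hypothetical.

```python
import math
from itertools import combinations

def average_linkage(points, n_clusters):
    """Start with one cluster per point and repeatedly merge the two clusters
    with the smallest mean pairwise Euclidean distance."""
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge b into a (a < b)
        del clusters[b]
    return clusters

# Hypothetical 2-D embeddings forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
clusters = average_linkage(points, n_clusters=2)
```

Stopping the merge loop at different values of `n_clusters` corresponds to cutting the hierarchy at different heights.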
This literature review examined multiple studies that explore the identification of a set of FMCs using unsupervised learning techniques, including topic modelling and clustering, of unstructured texts. In doing so, it was revealed that many traditional clustering and topic modelling approaches relying on statistical techniques will not have the same performance when dealing with extracting a set of FMCs from MWOs, due to the sparse nature and lack of co-occurrence patterns within the technical short texts.
2. Materials and Methods
2.1. Dataset and Knowledge Graph
Our area of focus is the process plant and equipment sector. Engineers know from experience that each type of equipment will fail in a number of ways. The challenge for organisations is to develop sensible FMCs for their data. The task of identifying FMCs for hundreds of thousands of types of equipment is too large to be carried out by hand [
31]. For example, the ISO 15926-4 [
32] Reference Data Library has a list of 2146 classes for equipment types (e.g., pump, motor, fan). There is increasing interest in looking at the information captured in the Maintenance Work Order (MWO) data to identify relevant FMCs and map these codes to the equipment that generates the failures classified by the FMC label [
33,
34,
35].
Our intuition is to map the FMCs to the inherent function of the physical object rather than its equipment type. The inherent function is defined as the “function of an object, independent of any application of the object” [
36]. Inherent function is a permanent, essential, or characteristic attribute of the object. The IEC 81346-2 Standard provides a classification structure based on inherent function and a mapping of equipment types to an inherent function hierarchy. The inherent function list (in IEC 81346-2) at the top level (L1) is a much smaller set (
n = 17) compared to the number of equipment types (
n = thousands). We hypothesise that leveraging engineering knowledge linking each physical object class to its inherent function, and by extension its functional failure (as shown in
Figure 1), will assist in the semantic identification of sensible categories for failure modes. This idea mimics the way engineers use their knowledge of the (1) inherent function of the physical object; (2) how specific physical objects can fail, which is linked to their inherent function; and (3) how that failure is usually observed when performing the FMC task.
Figure 1 illustrates the abstractions used by engineers to identify FMCs from an MWO text. In this example, the MWO text is ‘differential oil is leaking’. We show how an engineer focuses on the equipment type (differential) and ‘knows’ that a differential is part of the power transmission system for an engine. Engineers also ‘know’ that differentials have the inherent function of transmitting force and are contained by a housing with the inherent function of storing oil for lubrication. They also ‘know’ that any physical object that has a containment function (for example, a tank or a pipe) has the potential to leak. Note that none of this information is explicit in the MWO. This knowledge linking a physical object to its inherent function and an inherent function to potential failures is immutable. Some of this knowledge has been captured in engineering textbooks and tables in International Standards.
We use the MaintIE [
37] public knowledge graph (KG) dataset and schema for MWO texts. At the time of writing, MaintIE is the largest open dataset of annotated industrial MWOs. The relationship between physical objects and their inherent function class in the KG schema is based on the International Engineering Standard IEC 81346-2 [
36]. MaintIE contains a gold standard (1076 fine-grained expert-annotated texts) mapped to a multi-level hierarchy, with physical objects broken down into 17 Level 1 (L1) inherent functions (e.g., controlling, holding, protecting objects), and these further broken down into 160 Level 2 (L2) inherent functions (a diagram of the schema is available at
https://github.com/nlp-tlp/maintie/blob/main/SCHEME.md, accessed on 27 February 2026). Each physical object mentioned in the MWO is mapped to an inherent function at the L2 level, as shown in
Figure 2.
We also use a second (larger) MaintIE coarse-grained corpus (silver standard), which has MWOs annotated by a deep learning model trained on the fine-grained corpus. The annotations are then reviewed by an industry expert. This corpus consists of 7000 annotated work order texts. Unlike MaintIE Gold, which is annotated to the L2 level (for example, thermostat is annotated as
PhysicalObject/SensingObject/TemperatureSensingObject), MaintIE Silver is not annotated to the L1 or L2 levels (for example, thermostat is annotated as
PhysicalObject). Since we require engineering knowledge in the form of inherent function to use MaintIE Silver, we automatically annotated a portion of the MWOs in MaintIE Silver to the L2 level by referencing the
PhysicalObject entities in MaintIE Gold. Since a piece of equipment will generally only have one inherent function, any
PhysicalObject in MaintIE Gold that also occurs in MaintIE Silver is assigned the same inherent function. Thus, MaintIE Gold and a portion of MaintIE Silver form the dataset used in this study. A summary of the count and types of nodes and relations in MaintIE Gold and Silver is available in
Table 3.
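The automatic refinement of Silver annotations can be pictured as a dictionary lookup. The sketch below is a minimal illustration under stated assumptions: the record layout, the function names, and the choice to discard a work order when any physical object cannot be resolved are all hypothetical, and only the thermostat type path is taken from the text.

```python
# Hypothetical lookup built from MaintIE Gold: entity text -> full type path.
gold_types = {
    "thermostat": "PhysicalObject/SensingObject/TemperatureSensingObject",
}

def refine_entity(text, coarse_type, lookup):
    """Return the L2 type path from Gold if the entity text is known, else None."""
    if coarse_type != "PhysicalObject":
        return coarse_type  # only physical objects carry an inherent function
    return lookup.get(text.lower())

def refine_record(entities, lookup):
    """Refine every entity of one work order; return None if any physical
    object is absent from Gold (assumed policy: drop that work order)."""
    refined = [(text, refine_entity(text, typ, lookup)) for text, typ in entities]
    if any(typ is None for _, typ in refined):
        return None
    return refined

resolved = refine_record([("thermostat", "PhysicalObject")], gold_types)
unresolved = refine_record([("gizmo", "PhysicalObject")], gold_types)
```

Under this policy only the Silver work orders whose physical objects all appear in Gold are retained, which is one way to arrive at "a portion" of the Silver corpus.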
2.2. Introducing Synthetic Data to MaintIE
The MaintIE dataset is the largest open-source annotated MWO dataset available. Across MaintIE Gold and Silver, there is an uneven distribution of inherent functions, as shown in
Figure 3. This impacts clustering and topic modelling algorithms and can lead to classes of failure mode terms or inherent functions with lower numbers not being represented. To combat this, synthetic data generated based on MaintIE [
38] is introduced into the dataset.
The synthetic data is generated by first extracting triples from the MaintIE Gold knowledge graph to create valid engineering paths from a piece of equipment to a failure mode. An example is the path from the equipment “windscreen” to the failure mode property “crack”. Additional paths are formed by leveraging the hierarchical contains, hasPart, and isA relations in MaintIE. For example, from the path “backhoe hasPart windscreen hasProperty crack”, we can derive the additional paths “windscreen hasProperty crack” and “backhoe windscreen hasProperty crack”. Any new paths that do not already exist in the MaintIE dataset are validated for technical correctness by a subject matter expert (SME).
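The path derivation in the backhoe example can be sketched as below. This is an illustrative simplification, not the procedure of [38]: the triple layout and function name are assumptions, only the hasPart and contains relations are handled (isA is omitted for brevity), and the SME validation step is outside the scope of the code.

```python
# Triples from the worked example in the text: (head, relation, tail).
triples = [
    ("backhoe", "hasPart", "windscreen"),
    ("windscreen", "hasProperty", "crack"),
]

# Relations in which the head is the whole and the tail is the part.
WHOLE_PART_RELATIONS = {"hasPart", "contains"}

def candidate_paths(triples):
    """For each equipment-to-failure triple, emit the bare path plus, for every
    known parent of the equipment, the explicit and the compact parent path."""
    parents = {}
    for head, rel, tail in triples:
        if rel in WHOLE_PART_RELATIONS:
            parents.setdefault(tail, []).append((head, rel))

    paths = []
    for head, rel, tail in triples:
        if rel in WHOLE_PART_RELATIONS:
            continue  # structural triple, not a failure observation
        base = f"{head} {rel} {tail}"
        paths.append(base)
        for parent, parent_rel in parents.get(head, []):
            paths.append(f"{parent} {parent_rel} {base}")
            paths.append(f"{parent} {base}")
    return paths

paths = candidate_paths(triples)
```

Each resulting path would then serve as a constraint for text generation once it has been checked for technical correctness.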
A
GPT-4o mini (
https://platform.openai.com/docs/models/gpt-4o-mini, accessed on 23 September 2024) Large Language Model is tasked with generating grounded MWOs constrained by the validated paths. Few-shot prompting is used to incorporate the style of real MWOs to make the generated MWOs authentic.
The generated synthetic work orders are added, ignoring any work orders that have to do with inherent functions “guiding” and “holding”, to lessen the skew of the dataset as shown in
Figure 4. All work orders involving the inherent function “informationProcessing” were also removed from the dataset due to the small number of work orders, even after adding the synthetic dataset. Maintenance of information processing equipment is generally outsourced to external maintainers who keep their own separate records, and this equipment is maintained following a fixed schedule. Both factors mean that companies rarely produce a work order for this maintenance work, so such work orders are unlikely to occur in the context of the maintenance industry.
Another observation from prior analysis of the MaintIE dataset was the frequent occurrence of the “leak” failure mode and its variants like “leaking” or “leaks”. This also caused a skew in the dataset in terms of failure mode words in the MWO, and so synthetic data generated that included the word “leak” or similar was not added to the dataset.
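The filtering rules described in this subsection can be summarised in a short sketch. The function names and the exact list of leak variants are assumptions made for illustration; the excluded inherent functions are those named in the text.

```python
# Inherent functions for which no synthetic work orders are added (from the text).
OVERREPRESENTED_FUNCTIONS = {"guiding", "holding"}
# Inherent function removed from the whole dataset (from the text).
REMOVED_FUNCTIONS = {"informationProcessing"}
# Assumed surface variants of the "leak" failure mode.
LEAK_VARIANTS = {"leak", "leaks", "leaking", "leaked"}

def keep_synthetic(text, inherent_function):
    """Decide whether a synthetic work order is added to the dataset."""
    if inherent_function in OVERREPRESENTED_FUNCTIONS | REMOVED_FUNCTIONS:
        return False
    return not any(token in LEAK_VARIANTS for token in text.lower().split())

def keep_real(inherent_function):
    """Real work orders are dropped only for the removed inherent functions."""
    return inherent_function not in REMOVED_FUNCTIONS

# Hypothetical synthetic records: (text, L1 inherent function).
synthetic = [
    ("windscreen cracked", "protecting"),
    ("hose leaking oil", "guiding"),
    ("radiator has leak", "emitting"),
]
added = [text for text, function in synthetic if keep_synthetic(text, function)]
```

A token-level match is used here, so a variant embedded in a longer word would not be caught; a stemmer or regular expression could be substituted if that matters.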
2.3. Preliminary Experiments with Topic Modelling
We first test off-the-shelf topic modelling algorithms designed to work with short texts, including Gibbs Sampling Dirichlet Mixture Model (GSDMM), BERTopic, and Top2Vec, on the MWO dataset. Each of these techniques struggles on this dataset, for the reasons described below.
When GSDMM is performed on the MaintIE dataset, the resulting failure mode clusters are incomprehensible to subject matter experts, as shown in
Table 4. Action words like repair, replace, and change are more prevalent throughout the dataset, forming topics that are not representative of failure modes. Even upon removal of action words, the text was too sparse to form meaningful clusters. This is a limitation of using an entirely statistical method on a technical dataset.
BERTopic [
39] is another topic modelling technique we test. It uses transformers and a class-based TF-IDF [
39] to create topics. However, upon examining the topics being produced, it was determined that the topics were merely grouping together MWOs by equipment type, as seen in
Table 5, which is not useful for the purpose of creating generic failure mode classes. We considered removing all the equipment words from the dataset and performing topic modelling solely on failure state, process, and property words. However, this would lose too much information, as the physical object itself is an important feature for determining what sort of failure event is possible, so we decided against it. Any topics produced this way would ignore a key part of the MWOs, and any generic failure mode categories resulting from them would group together MWOs that, from an engineering perspective, do not belong together. An example of this could be “air leak near side of door” and “engine leak oil” being part of the same category solely due to the word “leak”.
We test a third topic modelling technique, Top2Vec [
40]. It uses Doc2Vec and Uniform Manifold Approximation and Projection to create topics. However, with the MaintIE dataset, only two topics were returned, each containing a large number of MWOs. These topics do not return any useful insights into what the set of generic FMCs could be.
The poor performance of each of these topic modelling experiments is due to the nature of the MaintIE dataset as a corpus of unstructured short texts. Even though these algorithms are designed with shorter texts in mind, MWOs are extremely short, with many being fewer than eight words long. As an attempt to overcome the lack of recurring co-occurrence patterns in the texts due to word sparsity, the inherent function of each piece of equipment was retrieved from the MaintIE dataset and appended to the MWO for each of the three topic modelling algorithms. However, the topics remained indecipherable, with a large number of different types of inherent functions present in each topic. Without semantic and engineering knowledge to inform the grouping of MWOs, these topic modelling approaches fail to return useful insights into the failure mode categories that exist within the text and are let down by relying on purely statistical means of analysis.
2.4. Method
We hypothesise that including the annotated knowledge graph (KG) in the embedding process will positively impact the clustering process to identify suitable categories for Failure Mode Codes (FMCs) from the texts of Maintenance Work Orders (MWOs).
We test four embedding methods over three different clustering approaches, as shown in
Figure 5. Two of the embedding methods (Averaged Word2Vec [
41] and Sentence-BERT [
42]) are off-the-shelf methods and do not introduce additional engineering knowledge. In contrast, Word2Vec [
41] Bidirectional Long Short-Term Memory (Bi-LSTM) [
43] embedding and Sentence-BERT [
42] dense neural network embedding, both methods inspired by a previous study [
17], allow for the introduction of a KG into the embedding process.
The clustering approaches include K-means [
30], average agglomerative hierarchical clustering [
44], and Ward agglomerative hierarchical clustering [
45]. We perform every combination of embedding and clustering approaches on the dataset and evaluate the resulting clusters from each combination. Clusters are evaluated in three ways: statistical data on cluster shape and characteristics, manual analysis of each cluster by subject matter experts with an engineering background, and a Normalized Mutual Information (NMI) score.
2.5. Sentence Embedding
To incorporate engineering knowledge from the KG within the Word2Vec Bi-LSTM and Sentence-BERT NN embedding methods, we use semantic representations of MWOs and attempt to predict the inherent function of the part of the physical object identified as having the undesirable state in the MWO. The final hidden layers of both models are extracted and used as a sentence embedding. As mentioned previously, we focus on the inherent function because the inherent function of a physical item is an important consideration for failure identification and is available to us in the form of L2-level entity annotations in the MaintIE KG.
2.5.1. Extracting the Inherent Function Labels
We create training labels for our feature extraction methods by determining which physical item is most relevant to each MWO, on which the failure has occurred. The output labels are then selected through the engineering knowledge stored in the PhysicalObject subclass entity annotations. The most relevant physical item of an MWO is decided through the KG relations by selecting the physical item with a hasParticipant relation with the undesirable behaviour node mentioned in the MWO. For example, in the MWO “air conditioner in truck has leak”, the KG triple involving the failure mode “leak” hasParticipant/hasPatient “air conditioner” suggests that “air conditioner” is the most relevant physical object. Since “air conditioner” has the inherent function emitting, the inherent function label for this MWO is emitting.
In the case where there are multiple physical items that have a hasParticipant relation with the undesirable behaviour node, the relations of the MaintIE KG, namely the hasPart and isA relations, are used to determine the most relevant physical item. The hasPart relation captures a physical object’s parts, for example, an engine hasPart radiator. Since the part is where the failure mechanism is experienced, it is more relevant to the FMC determination, and its inherent function is used in the analysis. An example of this is shown in
Figure 6. The isA relation denotes that a physical object class is a type of another physical object class, for example, a diesel engine isA(n) engine. In this case, the more specific physical item (diesel engine) is selected, and its inherent function (driving) is used as the training label.
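The selection rules for the most relevant physical item can be expressed compactly. This is a minimal sketch, assuming triples are stored as (head, relation, tail) tuples; the function name is hypothetical, and the inherent function lookup contains only the two pairings stated in the text.

```python
# Inherent functions stated in the text for two physical objects.
inherent_function = {"air conditioner": "emitting", "diesel engine": "driving"}

def most_relevant(candidates, triples):
    """Among physical items linked to the undesirable behaviour, discard any
    item that is the whole of another candidate (hasPart) or the more general
    class of another candidate (isA)."""
    dominated = set()
    for head, rel, tail in triples:
        if head in candidates and tail in candidates:
            if rel == "hasPart":
                dominated.add(head)  # head is the whole; prefer the part
            elif rel == "isA":
                dominated.add(tail)  # tail is the general class; prefer the subtype
    return [c for c in candidates if c not in dominated]

part_case = most_relevant(["engine", "radiator"], [("engine", "hasPart", "radiator")])
type_case = most_relevant(["engine", "diesel engine"], [("diesel engine", "isA", "engine")])
label = inherent_function[type_case[0]]
```

When only one candidate exists, as in the air conditioner example, the function returns it unchanged and its inherent function becomes the training label.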
2.5.2. Word2Vec Bidirectional Long Short-Term Memory (Bi-LSTM) Embeddings
In this approach, as shown in
Figure 7, Word2Vec [
41] embeddings are used as input for a Bi-LSTM [
43] model to predict the MWO’s inherent function, to learn the patterns between the semantic word embeddings and the expert knowledge in the form of inherent function. The final hidden layer is extracted and used as the sentence embedding of the MWO for the unsupervised clustering task. This is achieved using a skip-gram Word2Vec (
https://code.google.com/archive/p/word2vec/, accessed on 15 May 2025) model to transform each MWO into a $d \times m$-dimensional matrix, where $d$ is the embedding size hyperparameter set by the user, and $m$ is the maximum number of words in an MWO in the dataset. In our experiments, $d$ is set to 100 following the parameter informed by [17], while $m$ is the maximum number of words in an MWO in this study’s dataset, which is 12.
Output labels in the form of the MWO’s inherent function are created as mentioned in Section 2.5.1. The Bi-LSTM model is used to learn and extract the hidden patterns between the Word2Vec semantic inputs and the inherent function output labels in MWOs. The Bi-LSTM model begins with an input layer with a shape matching the Word2Vec embedding input ($d \times m$). This is followed by two Bi-LSTM layers.
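Bi-LSTM is a recurrent neural network that learns patterns from a sequence in both the forward and backward directions. It is composed of two LSTM layers, one that processes the input in the forward direction and one in the backward direction, to capture context. LSTM networks use gates (input, forget, and output) to retain long-term dependencies and reduce the vanishing gradient problem. LSTM networks maintain a hidden state and a cell state at each time step and perform the following calculations:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $W$ and $U$ are weight matrices, $x_t$ is the input at time $t$, $h_t$ is the hidden state at time $t$, $c_t$ is the cell state at time $t$, $b$ is a bias vector, $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates, $\tilde{c}_t$ is the candidate cell state, $\sigma$ denotes the sigmoid activation function, $\tanh$ is the hyperbolic tangent function, and ⊙ denotes element-wise multiplication.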
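To make the recurrence concrete, the following sketch runs a standard LSTM update with scalar input and scalar states, so that element-wise products reduce to ordinary multiplication. The function names and the parameter dictionary are hypothetical, and the bidirectional wrapper simply concatenates the final forward and backward hidden states, which is one common convention rather than a detail confirmed by the source.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps a gate name to its (W, U, b) scalars."""
    def pre(gate):
        W, U, b = p[gate]
        return W * x_t + U * h_prev + b
    f_t = sigmoid(pre("f"))            # forget gate
    i_t = sigmoid(pre("i"))            # input gate
    o_t = sigmoid(pre("o"))            # output gate
    c_hat = math.tanh(pre("c"))        # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat   # new cell state
    h_t = o_t * math.tanh(c_t)         # new hidden state
    return h_t, c_t

def bi_lstm_final(xs, p_fwd, p_bwd):
    """Run the sequence in both directions and concatenate the final hidden states."""
    h = c = 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, p_fwd)
    hb = cb = 0.0
    for x in reversed(xs):
        hb, cb = lstm_step(x, hb, cb, p_bwd)
    return [h, hb]
```

In a real layer the states are vectors and `W`, `U` are matrices, but the order of operations is the same.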
For this model, the first Bi-LSTM layer contains a forward and a backward LSTM with 64 units each and uses the tanh and sigmoid activation functions. The second Bi-LSTM layer contains a forward and a backward LSTM with 32 units each and also uses the tanh and sigmoid activation functions.
The two Bi-LSTM layers are followed by a fully connected layer with 64 units that uses the ReLU activation function. This is followed by another fully connected layer with $k$ units, where $k$ is the size of the sentence embeddings. The value of $k$ is set to 10, following [17]. The output of this fully connected layer is extracted after training to act as input for the clustering step, resulting in an MWO sentence embedding with a dimension of 10. Finally, there is an output layer with 16 units, 1 for each type of inherent function label in the dataset. A softmax activation function is used to produce a probability distribution over the 16 classes.
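The layer sizes above can be checked with a short back-of-the-envelope parameter count. The sketch assumes a standard LSTM parameterisation (four gate blocks, one bias vector per gate), that the first Bi-LSTM layer passes its full output sequence to the second, that the second returns only its final state, and that the forward and backward outputs are concatenated. None of these details is stated in the text, so the totals are indicative only.

```python
def lstm_params(input_dim, units):
    # Four gate blocks (input, forget, output, candidate), each with
    # input weights, recurrent weights, and one bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

def count_parameters(emb_size=100, sentence_dim=10, n_classes=16):
    """Return (layer name, trainable parameter count) pairs for the described stack."""
    return [
        ("bi_lstm_1", 2 * lstm_params(emb_size, 64)),   # 2 x 64 = 128 features per word
        ("bi_lstm_2", 2 * lstm_params(128, 32)),        # 2 x 32 = 64 features per MWO
        ("dense_relu", dense_params(64, 64)),
        ("sentence_embedding", dense_params(64, sentence_dim)),
        ("softmax_output", dense_params(sentence_dim, n_classes)),
    ]
```

Note that the maximum MWO length (12 words) does not affect the count, because recurrent weights are shared across time steps.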
2.5.3. Sentence-BERT (SBERT) Neural Network (NN) Embeddings
SBERT [
42] embeddings are created using the “all-mpnet-base-v2” pre-trained model and used as the input for training a neural network to predict the inherent function and create embeddings of MWOs.
The neural network architecture consists of five fully connected layers. The first three fully connected layers have 128, 64, and 32 nodes, and all use the tanh activation function. The fourth layer has 10 units and also uses the tanh activation function. This hidden layer is extracted after training to act as deep sentence embeddings for clustering, resulting in an MWO sentence embedding with a dimension of 10. The last layer in the neural network is a fully connected layer with 16 units and the softmax activation function; it is used to predict the inherent function of the MWO.
2.5.4. Training the Deep Embeddings
The dataset is split into a training set of 80% of MWOs and a test set of 20% of MWOs. Both the new Word2Vec Bi-LSTM and SBERT NN embedding methods are trained from scratch on the training set with the Adam optimiser, using a learning rate of 0.001 (controlling how fast weights are updated per step) and sparse categorical cross-entropy loss to compute the loss between the predicted inherent function and the actual label. The models are validated on the test set, and training runs for up to 200 epochs with early stopping at a patience of 20.
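The loss and stopping rule can be illustrated with two small helpers. This is a sketch only: it assumes that early stopping monitors the validation loss and stops once no improvement has been seen for `patience` consecutive epochs, which is a common convention but is not spelled out in the text. The function names are hypothetical.

```python
import math

def sparse_categorical_cross_entropy(probs, label):
    """Loss for one MWO: negative log-probability assigned to the true class index."""
    return -math.log(probs[label])

def early_stopping_epoch(val_losses, patience=20, max_epochs=200):
    """Return (stop_epoch, best_epoch), both 1-indexed, for per-epoch validation losses."""
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch   # no improvement for `patience` epochs
    return min(len(val_losses), max_epochs), best_epoch
```

For instance, if the validation loss last improves at epoch 3, training under these assumptions halts at epoch 23.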
2.6. Clustering Algorithms
Two types of clustering algorithms are tested: K-means and agglomerative hierarchical. K-means [
30] creates a pre-determined number of centroids in vector space. Each item closest to a centroid is considered part of that cluster. The algorithm iteratively moves the centroids around the vector space to find good clusters that maximise the similarity within the cluster and the dissimilarity with items outside the cluster.
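A minimal version of this iterative procedure (Lloyd's algorithm) is sketched below. Initialising the centroids from the first `k` points is a simplification chosen to keep the example deterministic; it is not the initialisation used in the study, which is not described here.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100):
    """Assign each point to its nearest centroid, then move centroids to cluster means."""
    centroids = [list(p) for p in points[:k]]   # simplistic initialisation (assumption)
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:                # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:                         # leave an empty cluster's centroid in place
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids
```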
In agglomerative hierarchical clustering, each item starts as a cluster of one. The algorithm repeatedly joins the two most similar clusters. The process can be halted as soon as the desired number of clusters remains. There are different ways to calculate the similarity between clusters, including average linkage [
44] (which measures the average distance between the items in the two clusters) and Ward linkage [
45] (which measures how much the sum of squares will increase if two clusters are merged). Both average and Ward linkage are tested in this research.
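The two linkage criteria and the merge loop can be sketched as follows. The Ward criterion is written in its usual closed form, the increase in within-cluster sum of squares caused by a merge, and the naive pairwise search is for illustration only. The function names are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def centroid(cluster):
    return [sum(c) / len(cluster) for c in zip(*cluster)]

def average_linkage(a, b):
    """Mean pairwise Euclidean distance between items of clusters a and b."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def ward_linkage(a, b):
    """Increase in the total within-cluster sum of squares if a and b are merged."""
    gap = sum((x - y) ** 2 for x, y in zip(centroid(a), centroid(b)))
    return len(a) * len(b) / (len(a) + len(b)) * gap

def agglomerate(points, n_clusters, linkage):
    """Start from singletons and merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters.pop(j)     # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters
```

Passing `average_linkage` or `ward_linkage` as the `linkage` argument switches between the two variants tested.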
One challenge with performing this unsupervised clustering is that the number of clusters that exist within the dataset is unknown. To combat this, clustering is performed for all numbers of clusters between 5 and 50, and from each set of clusters, the silhouette score is calculated and plotted.
Silhouette score [
46] is a metric of the quality of clusters that are produced. It calculates the similarity of members within the cluster and dissimilarity with non-member points. Silhouette score is plotted over the number of clusters as shown in
Figure 8, and the ideal number of clusters is selected from the highest silhouette score. Another method of determining the number of clusters is through the dendrogram produced during hierarchical clustering, as shown in
Figure 9, which can be manually examined to determine the number of clusters that best suits. In dendrograms, clusters that are more similar to each other are joined earlier, and thus the clusters will merge at a lower distance, which is taken into account when selecting a good number of clusters.
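For reference, the silhouette-based selection can be sketched in a few lines. The implementation below follows the standard definition (mean intra-cluster distance `a` against the mean distance to the nearest other cluster `b`), assigns a score of 0 to singleton clusters by convention, and assumes at least two clusters and no cluster made up entirely of identical points. `select_k` is a hypothetical helper that takes a mapping from candidate cluster counts to their scores.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)          # convention for singleton clusters
            continue
        # Mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def select_k(score_by_k):
    """Pick the number of clusters with the highest silhouette score."""
    return max(score_by_k, key=score_by_k.get)
```

In the setting described above, `score_by_k` would hold one score for each cluster count between 5 and 50.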
3. Results
We compare the clusters created from the new embedding methods in
Section 2.5 with off-the-shelf average Word2Vec and SBERT embeddings that do not incorporate additional engineering knowledge. Clusters are evaluated in three ways: statistical data on cluster shape and characteristics, manual analysis on each cluster by subject matter experts with an engineering background, and NMI score [
47].
3.1. Defining Good Clusters
Two of the three evaluation methods (statistical data on cluster shape and characteristics and manual analysis on each cluster by subject matter experts) rely on the notion of a “good cluster”. A good cluster is defined as a cluster in which at least 80% of work orders fall within one to three different types of inherent function and which contains at least the minimum number of MWOs (20). These criteria were derived by two maintenance engineering subject matter experts upon examination of clusters in order to identify clusters that contain equipment with similar functions and failure mode categories. Requiring at least 80% of work orders to fall within one to three inherent function types ensures that good clusters are those informed by inherent function as a feature. It also eliminates clusters that are grouped based on generic terms like “not working” that are found across many different failures. Excluding clusters with fewer than 20 MWOs removes clusters that are too small to act as a generic failure mode category.
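One way to read this definition as a check on a single cluster is sketched below. It interprets the 80% rule as "the three most frequent inherent functions together cover at least 80% of the cluster's MWOs" and treats clusters with fewer than 20 MWOs as too small. Both readings are assumptions about how the criteria are applied, and the function name is hypothetical.

```python
from collections import Counter

def is_good_cluster(functions, min_size=20, coverage=0.8, max_functions=3):
    """functions: one inherent function label per MWO in the cluster."""
    if len(functions) < min_size:
        return False                       # too small to act as a generic category
    top = Counter(functions).most_common(max_functions)
    covered = sum(count for _, count in top)
    return covered / len(functions) >= coverage
```

Applying the check to every cluster gives both statistics reported later: the share of clusters that are good and the share of documents that fall in good clusters.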
3.2. Analysis Based on Statistical Characteristics
Statistical characteristics of the clusters (mean, median, minimum, and maximum number of documents and the distribution of inherent functions across clusters) provide a direct comparison of the shapes and distribution of clusters for each embedding and clustering method used. The results of this are shown in
Table 6,
Table 7 and
Table 8.
In most cases, the averaged Word2Vec and SBERT embedding approaches place a much lower percentage of documents in good clusters than the percentage of their clusters that are good. The embedding methods incorporating engineering knowledge outperform off-the-shelf embeddings in clustering MWOs.
Table 8 shows the Word2Vec Bi-LSTM and SBERT NN embeddings achieving 75.3% and 68.6% of documents in good clusters, compared to just 40.4% and 22.4% for the averaged Word2Vec and SBERT embeddings. This suggests that off-the-shelf embeddings struggle to capture the technical meaning of the text, resulting in many unclear and poorly formed clusters. In contrast, embeddings that incorporate engineering knowledge (captured in the KG triples) improve unsupervised clustering results, producing more meaningful clusters (for the engineer), with Word2Vec Bi-LSTM consistently yielding the best results. This aligns with the study’s goal from a computer science side, exploring how the inclusion of additional information in the form of triples impacts the performance of an unsupervised clustering task.
The choice of clustering method plays a significant role. K-means produces consistently equal-sized clusters, which is reflected in the highest maximum number of documents per cluster. However, this is due to its tendency to prioritise shape over meaning and document similarity, which is problematic for datasets (like MaintIE) that have an uneven distribution of failure mode words and inherent function. Meanwhile, Ward hierarchical clustering outperforms both average hierarchical and K-means clustering in the proportion of good clusters.
3.3. Manual Analysis of Each Cluster by Subject Matter Experts
Each of the clusters was reviewed by two experienced reliability engineers. The review involves an assessment of the inherent function classification and the failure modes and equipment in each cluster. It should be noted that there is only limited room for interpretation in this expert review, as the IEC 81346-2 standard has a mapping between all equipment classes and function; the experts use this. Likewise, the mapping between function and failure mode leaves limited room for interpretation. For example, a chair cannot leak, but a pipe can. This review provides insights into latent patterns within clusters captured by the sentence embedding, such as grouping together a specific piece of equipment, or recurring inconsequential phrases such as ‘left-hand side’. This aligns with the study’s goal from an engineering perspective, where we evaluate success based on the creation of clusters from MWO texts providing insight into engineering functions and failure modes experienced by the equipment described in the MWO.
3.3.1. Averaged Word2Vec Embeddings and SBERT Embeddings
Clusters produced by averaged Word2Vec and SBERT embeddings have a lower proportion of good clusters compared to the new embedding methods. This results in clusters that are grouped by features that are irrelevant in forming a list of FMCs, for example, clusters that grouped MWOs containing the phrase ‘left-hand side’ or general failure words like ‘fault’ that are not specific to the inherent function. Word2Vec and SBERT embeddings alone are not successful in categorising inherent functions and failure modes in MWOs.
3.3.2. Word2Vec Bi-LSTM Embeddings
A majority (80.7% as shown in
Table 8) of clusters formed from Word2Vec Bi-LSTM embeddings are good clusters, and some examples of good clusters are shown in
Table 9. For example, over 80% of work orders in Cluster 1 have the inherent function of storing, covering or guiding. By looking at the top failure modes and physical objects in this cluster, experts determine that Cluster 1 groups MWOs with storing, covering or guiding functions with material or structural failures. Meanwhile, Cluster 2 captures MWOs describing o-ring failures. Cluster 4 groups MWOs describing structural and material failures in buckets. All of these good clusters can be used to form a list of potential FMCs.
3.3.3. SBERT NN Embeddings
SBERT NN embedding clusters also return a higher number of good clusters. This includes Cluster 1 in
Table 10, which is characterised by failures of generating objects like pumps and batteries. Cluster 4 is another good cluster for emitting objects with relevant (out, blown) and general failure modes (unserviceable, not working).
When examining the bad clusters, Clusters 17 and 19 group MWOs that contain general failure mode terms such as ‘fault’, ‘unserviceable’, or ‘not working’. ‘Unserviceable’, ‘not working’, and ‘fault’ occur as top failure modes in many clusters, as they are common terms used in MWOs to describe equipment when there is no obvious symptom for the problem.
3.4. Normalised Mutual Information Score (NMI)
NMI [
47] is used to measure similarity between clusters formed in each experiment and MWOs that are grouped solely by inherent function. NMI is calculated from the following function:

$$
\mathrm{NMI}(X, Y) = \frac{2\, I(X; Y)}{H(X) + H(Y)}
$$

where $X$ and $Y$ are the examined cluster labels and the true labels, respectively; $I(X; Y)$ is the mutual information metric; and $H(\cdot)$ is entropy [47]. If the cluster labels and the true labels are the same, then the NMI is 1.
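As a sanity check on how NMI behaves, the sketch below computes it from two label lists. It normalises mutual information by the arithmetic mean of the two entropies, which is one common convention; other normalisations exist, and the choice here is an assumption rather than a detail confirmed by the source.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(a, b) * log( p(a, b) / (p(a) p(b)) ), written in terms of counts.
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def nmi(x, y):
    """Normalised mutual information with arithmetic-mean normalisation (assumption)."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 1.0                  # both labelings are a single cluster
    return 2 * mutual_information(x, y) / (hx + hy)
```

The score depends only on how the items are grouped, not on the label names, so relabelled but identical clusterings score 1 and statistically independent groupings score 0.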
When calculating NMI for the clusters produced with each embedding method, the true labels
Y are cluster labels of MWOs that are grouped solely by inherent function. The inherent function of each MWO is extracted by the same method in
Section 2.5.1. Meanwhile, the examined cluster labels
X are produced from the resulting clusters for every combination of embedding and clustering method. The number of clusters is set to 16 (equal to the number of inherent functions in the dataset). This gives an idea of how much inherent function as a feature contributes to the embeddings. If inherent function is a very strong feature, the resulting clusters are similar to MWOs grouped solely by inherent function and return an NMI of close to 1.
Since Word2Vec Bi-LSTM and SBERT neural network embeddings are created by training a model to predict the inherent function of MWOs as an intermediate supervision step, they are expected to have a stronger cluster alignment with the MWO inherent functions’ labels. Consequently, they are expected to have higher NMI scores than the off-the-shelf embedding methods.
NMI is used to verify that the embedding methods have successfully internalised and encoded the inherent function constraints in the MWO embeddings. The NMI scores reflect the strength of expert knowledge captured within each embedding approach. The clusters are also considered for their purpose as category labels by expert analysis; this is described in
Section 3.3.
Table 11 shows that across all the different clustering approaches, Word2Vec Bi-LSTM and SBERT NN embedding approaches consistently had a higher NMI score than averaged Word2Vec and SBERT embeddings, with Word2Vec Bi-LSTM embeddings returning the highest NMI score.
3.5. Identifying FMCs from Good Clusters
Of the combinations tested, the best combination of embedding method and clustering method is Word2Vec Bi-LSTM and Ward hierarchical. This approach produces the highest proportion of documents in good clusters and the highest proportion of clusters that are good. The NMI score for Word2Vec Bi-LSTM embeddings is higher than that of the other embedding methods, suggesting that inherent function is best captured using that approach.
Potential FMC categories and the associated inherent function are identified by subject matter experts from the manual analysis of the Word2Vec Bi-LSTM approach in
Section 3.3 by reviewing the list of good clusters and their links. As shown in
Figure 10, some of the 23 potential FMCs identified include:
C1—objects with storing and covering function (e.g., tanks and buckets) having structural and material failures (e.g., cracks, leaks, and missing parts);
C15—objects with guiding function (universal joints, hoses, chains) with failures that are mechanical in nature and not to do with leaking;
C5—oil filters, air filters, and filters on differentials and centrifuges with matter processing function having ‘leaking’ failures;
C2—O-rings with covering function failures;
C18—lights with emitting function failures (e.g., out, unserviceable, fault, and blown);
C4—bucket teeth and adaptors with matter processing function having material or structural problems (e.g., missing, loose, or worn).
Subject matter experts examined the clusters and considered whether each cluster mapped to an inherent function. As mentioned earlier, this mapping is informed by the IEC 81346-2 standard, which has a mapping between equipment type and the 17 functions, as well as an understanding of the relationship between function and failure mode. One interesting, but not unexpected, finding is that an inherent function can map to two or three clusters because different components are involved within the same inherent function. For example, in
Figure 10, protecting function is divided into three clusters: one cluster for coolant leaks, one for breaks and switches, and one for oil and air filter failures. Additionally, some clusters have more than one inherent function type but have something else in common grouping them together (e.g., C5 has filters across the holding and matter processing inherent functions). This observation accords with engineering experience that quite different types of equipment can have the same function and, therefore, failure modes, so they form separate clusters.
4. Discussion
The results presented in
Section 3 show how semantic representation alone is insufficient to capture engineering knowledge in Maintenance Work Orders (MWOs). Off-the-shelf embedding methods struggle with the technical nature of the text and produce clusters that are poorly formed. The two novel feature extraction approaches that directly incorporate engineering knowledge extracted from the KG triples produce embeddings that represent latent relationships in the data alongside semantic meaning. These embedding methods that incorporate engineering knowledge outperform the off-the-shelf embedding methods, aligning with the study’s goal from a computer science perspective to explore how introducing expert knowledge in the form of KG triples impacts the clustering performance of the MWOs.
The best results come from the Word2Vec Bi-LSTM embedding and Ward hierarchical clustering approach. This approach produces the highest proportion of documents in good clusters (75.3%) and the highest proportion of clusters that are good (80.7%), as shown in
Section 3.2. The clusters are made up of a large number of ‘good’ clusters and return important insights between engineering functions and failure modes, making them candidates for generic failure mode categories, as seen in
Section 3.3. The NMI score for Word2Vec Bi-LSTM embeddings is higher than that of the other embedding methods, suggesting that inherent function is best captured using that approach, as seen in
Section 3.4. From these clusters, we achieve the study’s goal from an engineering perspective to create a list of potential FMCs linked to the inherent function of physical objects through a data-driven approach.
4.1. Limitations
The quality of ‘good’ clusters produced is dependent on the distribution of inherent function over the dataset. For example, from the list of potential FMCs identified, none of the clusters represents the inherent function of human interaction. This is a consequence of the MaintIE dataset being unbalanced in some inherent function categories (e.g., Human Interaction and Information Processing) and associated undesirable behaviours. This impact was partially reduced by the introduction of synthetic data.
One of the challenges with the raw MWO texts in the MaintIE dataset is that many failure descriptions are generic to all equipment; notably, ‘unserviceable’ and ‘not working’ occur in numerous clusters. A dataset drawn from, say, warranty data might contain more detailed observations of the failure, and this could be a focus of future work.
4.2. Future Work
During feature extraction, the Word2Vec Bi-LSTM model was able to perform inherent function classification with a training set accuracy of 0.806 and a test set accuracy of 0.661. The SBERT NN embeddings had a training accuracy of 0.700 and a test accuracy of 0.509. These results can likely be improved upon with further experimentation of model architecture and with a larger dataset of MWOs with coverage over more physical objects, inherent functions, and more detailed failure descriptions.
The mapping shown in
Figure 10 successfully captures the relationships between function and failure in the dataset. We are able to produce meaningful clusters that subject matter experts identify as related to failure mode categories. The results are promising, already producing useful insights from the data (the MaintIE dataset of 7000 work orders). MaintIE is currently the largest publicly available annotated dataset of MWOs. To apply this work to larger datasets as they are released, or for organisations to replicate it on their own MWOs, annotation in accordance with the MaintIE schema is required [
37]. This can be achieved by fine-tuning the existing MaintIE model with an organisation’s own MWOs annotated with the MaintIE schema using an annotation tool such as QuickGraph [
48].
5. Conclusions
We test data-driven approaches to identify categories using unsupervised clustering approaches with and without the incorporation of external knowledge. We explore the use of embeddings that combine a semantic representation of MWOs with engineering knowledge, specifically in the form of a KG. Performance is assessed by statistical analysis, manual analysis by subject matter experts, and the NMI score.
We demonstrate that embedding methods can incorporate engineering knowledge of inherent function alongside semantic meaning to identify generic failure mode clusters related to the inherent function of the physical object mentioned in the MWO. Both SBERT NN and Word2Vec Bi-LSTM embeddings significantly outperform off-the-shelf embedding methods, demonstrating the value of introducing engineering knowledge when creating embeddings for technical texts.
We identified 23 generic categories from the ‘good’ clusters. These clusters, created from unstructured texts in MWOs, have common equipment and failure modes (for example, pins, bolts, and bearings that are missing, loose, or fail to track). These are all associated with the loss of a ‘holding’ function as defined in IEC 81346 [
36]. We note that clusters with the same function but quite different equipment and failure modes are to be found in separate clusters. Thus, there are 23 clusters although there are only 17 IEC 81346-2 functions. Both of these findings accord with engineering judgement that failure modes (and the equipment that has these failure modes) should have a meaningful relationship to function. This suggests that the IEC 81346-2 functional hierarchy has value as a model for structuring equipment hierarchy for maintenance.
From the ‘bad’ clusters, we gain insights into other potential forms of similarity in MWOs beyond the inherent function, which will form the basis of future work. While we were able to gain valuable insights into potential FMCs for classification list creation through this approach, we recognise that the list of categories (from an engineering perspective) is not as definitive as we had hoped. There is no comprehensive coverage in the MaintIE dataset of all inherent function classes, and many of the failure mode descriptors (e.g., ‘unserviceable’) apply to too many physical objects. We attempt to mitigate the impact of this coverage issue by introducing synthetic data.
Future work could apply this approach to MWOs or warranty data with more detailed failure descriptions (these are generated in, for example, the aerospace or car sectors) and a wider range of physical assets. The core of this approach is datasets annotated with KG schema that capture immutable domain knowledge relevant to the clustering task. These datasets and schemas exist in other technical domains beyond engineering, and we encourage further exploration of domain knowledge embeddings to discover latent clusters for classification list development.
As organisations seek to use AI to automate routine tasks, such as the assignment of a failure mode code to each MWO, there need to be guardrails to assist in checking the accuracy. At present, this is achieved by having test sets classified by humans. The results of this work suggest that a semantic layer containing a knowledge graph mapping equipment types to inherent function and inherent function to associated failure modes could form the basis of a quality control process for failure mode classification.