Semantic and Engineering-Based Embedding for Classification List Development

Feng, Jadeyn; Lau, Allison; Hodkiewicz, Melinda; Woods, Caitlin; Stewart, Michael

doi:10.3390/make8030061

by Jadeyn Feng ¹, Allison Lau ¹, Melinda Hodkiewicz ²,*, Caitlin Woods ¹ and Michael Stewart ¹,³

¹ Department of Computer Science and Software Engineering, The University of Western Australia, Crawley, WA 6009, Australia

² School of Engineering, The University of Western Australia, Crawley, WA 6009, Australia

³ Commonwealth Scientific and Industrial Research Organisation (CSIRO), Kensington, WA 6151, Australia

* Author to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr. 2026, 8(3), 61; https://doi.org/10.3390/make8030061

Submission received: 30 January 2026 / Revised: 24 February 2026 / Accepted: 28 February 2026 / Published: 4 March 2026

(This article belongs to the Section Data)

Abstract

The creation and application of classification category labels are essential tasks for transforming complex information into structured knowledge. Categories are used for summary and reporting purposes and have historically been identified by domain experts based on their past experiences and norms. Our interest lies in the general case where expert-generated category lists require improvement, and unsupervised learning, on its own, struggles to effectively identify categories for multi-class classification of human-generated texts. We hypothesise that including an annotated knowledge graph (KG) in an embedding process will positively impact unsupervised clustering performance. Our goal is to identify clusters that can be labelled and used for classification. We look at unsupervised clustering of Maintenance Work Order (MWO) texts. MWOs capture vital observations about equipment failures in process and heavy industries. The selected KG contains a mapping of equipment types to their inherent function based on the IEC 81346-2 international standard for classification of objects in industrial systems. Performance is assessed by statistical analysis, subject matter experts, and Normalized Mutual Information score. We demonstrate that Word2Vec Bi-LSTM and Sentence-BERT NN embedding methods can leverage equipment inherent function information in the KG to improve failure mode cluster identification for the MWO. Organisations seeking to use AI to automate assignment of a failure mode code to each MWO currently need test sets classified by humans. The results of this work suggest that a semantic layer containing a knowledge graph mapping equipment types to inherent function, and inherent function to failure modes could assist in quality control for automated failure mode classification.

Keywords:

technical language processing; clustering; knowledge graph; Maintenance Work Order; FMEA; failure mode; IEC 81346; ISO 14224

1. Introduction

The creation and application of classification category labels are essential for transforming complex information into structured knowledge, enabling consistency and supporting complex decision making and automation across a wide range of disciplines, including scientific research, commercial operations, and industrial maintenance. Classification is a supervised learning task that involves predicting a category label for given input data. Categories used for summary and reporting purposes have historically been identified by domain experts based on their past experiences and norms. Our interest lies in the general case where expert-generated category lists require improvement, and unsupervised learning, on its own, struggles to effectively identify categories for multi-class classification of human-generated texts.

As a case study, we look at identifying categories for classifying failure modes in industrial maintenance. For many organisations across domains like defence, mining, and manufacturing, asset maintenance is a multi-million dollar task. In 2023, the heavy machinery repair and maintenance industry in Australia had a revenue of USD 16.5 billion [1]. It is essential for maximising the runtime productivity of assets, minimising repair costs, and minimising threats to human safety in the workplace. Our focus is on creating a list of codes to represent the manner in which a failure can occur on some asset, known as failure modes [2].

In industrial maintenance, whenever a piece of equipment or an asset fails, has work performed on it, or has a request for work to be performed, a Maintenance Work Order (MWO) is written up by a maintenance technician or the equipment operator who observed the failure. Example MWOs are shown in Table 1. These MWOs can be thought of as doctors’ records in the industrial world, a world in which the role of the patient is played by the equipment. MWO records contain valuable observations by technical personnel about the deterioration and/or failure of equipment or its parts (physical objects). These records are unstructured short sentences, usually four to eight words in length. The language used to record MWOs is informal, being filled with colloquialisms, acronyms, misspellings, and technical jargon [3]. Grammatically, MWOs are diverse: they describe how equipment is deteriorating or failing using past, present, or present participle verbs (leaked, leaks, leaking); nouns (the leak); states (e.g., has a leak); or synonyms thereof.

In industrial maintenance, accurate analysis of MWOs is vital for developing effective maintenance strategies and improving asset management. Failure Modes and Effects Analysis (FMEA) is a process in which engineers identify the function of each component of an asset and determine all potential undesirable states (failure modes), failure causes, and failure effects associated, so that improvements can be made for maintenance strategies, activities, and future designs. This information is stored in tables, such as in Table 2. MWOs, which are essentially records of failure events and failure modes, are an integral source of information for FMEA. Analysis of historical MWOs helps identify recurring issues and their root causes and informs decision-making processes regarding preventive maintenance, deterioration detection, repairs, and equipment replacement to prevent future undesired events.

In order to facilitate data analytics, organisations want to classify undesirable behaviour observations in MWOs into a pre-determined list of standardised Failure Mode Codes (FMCs) in a reproducible way. For example, the MWO “air conditioner leaking” might be assigned an FMC LEA that categorises leaking failures. This is a routine and time-consuming task for reliability engineers, and there is significant interest in machine-assisted classification to a set of pre-determined categories [3,4,5,6,7,8,9].

Table 2. Example FMEA table for heater system [10].

Component	Function	Failure Mode	Failure Effect
Heaters	To heat up unit	(a) overcurrent	loss of all heating
		(b) short circuit	loss of all heating
		(c) earth fault	loss of all heating
Terminal box	Connect supply to heaters	(a) overcurrent	loss or reduction of heating
		(b) short circuit	loss of all heating
		(c) cable failure	loss or reduction of heating

However, in working with organisations, we have observed many occasions when a category list is not fit for purpose when their actual MWO data are classified. For example, we note the extensive use of the category ‘other’ and the poor agreement between different classifiers (human or machine).

Currently, there is no agreed standard list of generic failure modes, even for commonly used equipment such as motors, compressors, and pumps, which fail in a limited number of well-understood ways. There are several reasons for the lack of a standard list of maintenance FMCs. First, while there are domain-specific standard FMC lists, such as the FMC list in ISO 14224 [2] for the oil and gas sector, they only cover assets of core interest to offshore oil and gas. Secondly, because many FMCs in domain-specific lists like ISO 14224 do not transfer to other domains, many organisations use their own FMC lists developed internally by domain experts. This results in inconsistent FMC coding across organisations, limiting benchmarking and information exchange between equipment vendors and asset owners. Finally, the development of FMC lists by individual companies often results in lists of poorly described and differentiated FMCs, creating issues for technicians who need to decide which FMC to select [11,12].

These issues motivate our investigation into unsupervised machine learning to generate a set of FMCs from the MWO data that describe the failure event. Previous works that approach category generation with statistical methods such as topic modelling [13,14,15] or clustering [15] have encountered challenges when applied to a dataset of MWOs. This is due to the sparse and colloquial nature of the short texts [16]. Another study [17] extracts different degradation states of excavator buckets, using Convolutional Neural Networks (CNNs) to perform feature extraction from word2vec and LSA representations of the MWO as the CNN input and output, respectively. This combined embedding is then used in K-means clustering, with each cluster being a category of degradation state. A small number of clusters were formed, with many shared words characterising each cluster, particularly equipment words. For FMCs, too few codes cannot cover the ways in which equipment fails at a level of detail that is descriptive and helpful for analysis. Additionally, action words that are more important in describing degradation state, like repair and replace, are more limited in vocabulary and used more commonly across MWOs compared to words describing failure modes like leak, blown, disconnected, or warm. A way of prioritising failure concepts is needed.

An essential engineering task in the design phase of products and processes is Failure Modes and Effects Analysis [10]. The design process considers first the desired function of the product or process [18]. For each function, potential functional failures and associated failure modes (risks and controls) are identified. It is not until later in the design phase that the type of equipment (e.g., a conveyor or truck) to deliver the function is identified. We hypothesise that finding a way to incorporate knowledge of the function of the equipment into a knowledge graph will assist in clustering of failure modes since failure modes and functions are linked, at least in theory and practice, in the engineering design process. To achieve this, we look for ways to incorporate knowledge about type hierarchy into MWO text embeddings.

We use triples that include the entity and relation typing available in the annotated MaintIE MWO dataset to train a Bidirectional Long Short-Term Memory (Bi-LSTM) model to perform feature extraction between the annotations and off-the-shelf embeddings. We hypothesise that leveraging expert knowledge (captured in the annotations) to create sentence embeddings can improve unsupervised function-based clustering performance. To evaluate the success of different embedding-based approaches, we define criteria for what makes a good cluster for a failure mode category and use the Normalized Mutual Information (NMI) score to assess the clusters. From an engineering perspective, success is the creation of clusters from MWO texts that provide insight into engineering functions and failure modes experienced by the equipment described in the MWO. The goal from a computer science perspective is to explore how the inclusion of additional information in the form of triples impacts the performance of an unsupervised clustering task.

The rest of the paper is structured as follows:

Section 1 explores previous works related to extracting a set of codes from work orders.
Section 2 discusses the MWO dataset used as well as any data exploration we conducted prior to our final method.
Section 2.3 explores different off-the-shelf unsupervised methods.
Section 2.4 describes our method and experiments.
Section 3 and Section 4 analyse and discuss the results and what insights they provide us.
Section 5 describes the contributions and potential future works for this study.

Related Work

In order to identify a list of generic categories from a dataset of short texts, we examine methods to extract the hidden trends or similarities within a dataset. Previous works use unsupervised learning techniques such as topic modelling and clustering to extract hidden trends and group documents.

Gibbs Sampling Dirichlet Mixture Model (GSDMM) [19] is a topic modelling algorithm for short texts. It treats each document as only having one topic and can infer the number of clusters automatically. In one study, clustering techniques were used on MWOs before using GSDMM on each cluster (semantically similar MWOs) to extract topics [13]. These topics were the concepts contained within each cluster and were then used to highlight possible areas that were overlooked when subject matter experts created a taxonomy for key systems, actions, and issues (failure mode). A similar approach used a hybrid of unsupervised learning techniques to automatically generate a taxonomy of terms for manufacturing capability data [15]. This was achieved through performing clustering to increase the accuracy and efficiency of topic modelling, and the four topics produced were used as high-level labels on clusters. Two factors improved the performance of the topic modelling in this study: having a small number of topics (between 3–4) and having a large list of stopwords (around 3000 words) to filter out irrelevant terms [15].

Unlike topic modelling techniques that rely solely on statistical methods, text clustering depends on how sentences are represented in vector form, and can incorporate semantic, statistical, and syntactical information when grouping texts. Clustering algorithms are a point of interest in this literature review as they provide a way to group MWOs by exploring the hidden patterns in the data without the need for expert annotations on the dataset, creating clusters of MWOs based on similarities related to failure modes.

In text clustering, text data are represented in a multi-dimensional spatial representation, and the distance between points is used to determine clusters. In order to perform text clustering, several decisions need to be made, including what features will be represented, how to measure similarity between documents based on those features, what clustering algorithm to use, and how to measure its performance [20]. With text clustering algorithms, documents are represented as a vector, and the words in each document represent the features of the vector. There are often two issues with this vector representation of documents, one being the huge vector size of text documents and the other being the large number of terms in the vocabulary of possible words [21]. The first issue is less relevant due to the short text format of MWOs, though the second issue still applies, as there are many ways to describe the same maintenance work done in MWOs.

Feature engineering is a preprocessing step used to extract features from data. Statistical text feature engineering algorithms [21] include Term Frequency-Inverse Document Frequency (TF-IDF) [14,22] and latent semantic analysis (LSA) [15,17,23], which are designed to improve clustering by removing unnecessary or unimportant terms and to increase the interpretability of clusters [21]. However, MWOs are sparse by nature: each word within a document will usually appear only once, making it difficult for traditional count-based feature selection techniques to determine semantic similarity [23]. An example is TF-IDF, which uses term frequency as a metric. In the case of short texts, term frequency scores will likely be 1, resulting in a high-dimensional representation of documents and complexity problems [16].
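The sparsity argument above can be illustrated with a minimal sketch. The four texts below are hypothetical MWO-style examples written for this illustration (they are not taken from MaintIE), tokenisation is a plain whitespace split, and the IDF is the unsmoothed form ln(N/df); all of these are assumptions made for brevity.

```python
from collections import Counter
import math

# Hypothetical MWO-style short texts (illustrative only, not from MaintIE).
docs = [
    "air conditioner leaking",
    "replace blown hydraulic hose",
    "engine oil leak",
    "windscreen cracked",
]
tokenised = [d.split() for d in docs]

# Largest within-document term frequency across the whole corpus.
max_tf = max(max(Counter(tokens).values()) for tokens in tokenised)

# Document frequency per word and an unsmoothed IDF, idf = ln(N / df).
df = Counter(word for tokens in tokenised for word in set(tokens))
idf = {word: math.log(len(docs) / n) for word, n in df.items()}

# Share of non-zero cells in the document-term matrix.
vocab = sorted(df)
density = sum(len(set(tokens)) for tokens in tokenised) / (len(docs) * len(vocab))

print(max_tf, len(vocab), density)
```

In this toy corpus no word repeats within a text, every word occurs in exactly one text, and "leaking" and "leak" are treated as unrelated features, so the count-based weights carry no information about which texts describe similar failures.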

As an alternative, word embeddings have been used to represent the semantics of words and utilised in clustering maintenance records. A popular word embedding method is Word2Vec [17,23,24,25], which represents a word as a vector that takes into account the semantic, syntactical, and contextual data of the word. However, when dealing with representing MWOs, further pre-processing must be performed on each word vector due to the dataset being filled with spelling mistakes, acronyms, and other inconsistencies [17,26]. Statistical text feature selection and word embeddings have been used to create a deep embedding of an MWO [17], while a hierarchical word taxonomy of hypernyms was used to form semantic features in another study [27].

Information-theoretic measures, including mutual information (MI), have been used for representation shaping and clustering evaluation. For example, a previous study [28] uses mutual information maximization objectives at the sequence and token level to perform clustering for short texts. Clusters were evaluated through accuracy and Normalized Mutual Information (NMI), which quantifies the amount of shared information between cluster assignments and the true labels. NMI requires true labels to be calculated, suiting it to supervised tasks rather than unsupervised clustering. Another study [29] performed hierarchical clustering on a dataset of microblog short texts, using mutual information as a measure of similarity between formed clusters.
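As a rough illustration of how NMI compares cluster assignments with reference labels, the sketch below computes it from label counts. Normalising by the arithmetic mean of the two entropies is an assumption made here (other normalisations exist), and the function names and example labels are hypothetical.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (nij / n) * math.log(nij * n / (count_a[i] * count_b[j]))
        for (i, j), nij in joint.items()
    )

def nmi(true_labels, cluster_ids):
    """NMI normalised by the mean of the two entropies (an assumed choice)."""
    h_true, h_pred = entropy(true_labels), entropy(cluster_ids)
    if h_true == 0.0 and h_pred == 0.0:
        return 1.0  # both labelings are a single group
    return mutual_information(true_labels, cluster_ids) / ((h_true + h_pred) / 2)

# Hypothetical reference failure modes for four texts.
true_modes = ["leak", "leak", "crack", "crack"]
perfect = nmi(true_modes, [0, 0, 1, 1])        # clusters match the labels
uninformative = nmi(true_modes, [0, 1, 0, 1])  # clusters are independent of the labels
```

The score depends only on how items are grouped, not on the cluster identifiers themselves, so relabelling the clusters leaves it unchanged.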

One commonly used clustering algorithm is K-means clustering [30]. It is a hard clustering method where each item belongs to exactly one cluster, and it involves making clusters based on the proximity (measured using Euclidean distance) of items within a vector space to a user-specified number of cluster centroids. There are previous studies into K-means algorithm use with maintenance records, including using Convolutional Neural Networks (CNNs) to perform feature extraction prior to K-means clustering to extract clusters of MWOs containing different degradation states of equipment [17] and for the clustering of manufacturing suppliers according to their capabilities with Latent Semantic Analysis (LSA) feature selection [15]. K-means clustering has been shown to be useful in working with data that is not homogeneous in nature, which is applicable to MWOs.
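The assignment and update steps of K-means can be sketched in a few lines. This is a simplified illustration rather than the configuration used in any of the cited studies: centroids are seeded with the first k points (an assumption made to keep the example deterministic), the iteration count is fixed, and the two-dimensional points stand in for sentence embeddings.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain K-means with Euclidean distance and hard assignments."""
    # Assumption: seed centroids with the first k points for determinism.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical 2-D stand-ins for embeddings: two well-separated groups.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3)]
assignment, centroids = kmeans(points, k=2)
```

Because the number of clusters is supplied by the user, the choice of k directly determines how many candidate categories the method can return.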

Hierarchical clustering involves clusters being formed iteratively by regrouping previously formed groups together to eventually form one super-cluster in a bottom-up approach, or a top-down approach in the opposite direction. The bottom-up Ward agglomerative hierarchical approach to clustering has been explored in clustering data on maintenance activities by failure mode [9], although it is difficult to evaluate performance due to a lack of prior knowledge or training data labels.
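The bottom-up procedure can be sketched as repeatedly merging the two closest clusters until the desired number remains. The example below uses average linkage because it is short to write; Ward linkage follows the same loop but merges the pair that gives the smallest increase in within-cluster variance. The function name and the sample points are hypothetical.

```python
import math
from itertools import combinations

def average_linkage(points, n_clusters):
    """Agglomerative clustering: merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with singletons

    def linkage(a, b):
        # Average pairwise Euclidean distance between two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # combinations yields index pairs with x < y, so deleting y is safe.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical 2-D stand-ins for embeddings: two well-separated groups.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3)]
clusters = average_linkage(points, n_clusters=2)
```

Stopping the merge loop at different cluster counts gives coarser or finer groupings from the same merge sequence, which is one reason hierarchical methods are attractive when the right number of categories is unknown.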

This literature review examined multiple studies that explore the identification of a set of FMCs using unsupervised learning techniques, including topic modelling and clustering, of unstructured texts. In doing so, it was revealed that many traditional clustering and topic modelling approaches relying on statistical techniques will not have the same performance when dealing with extracting a set of FMCs from MWOs, due to the sparse nature and lack of co-occurrence patterns within the technical short texts.

2. Materials and Methods

2.1. Dataset and Knowledge Graph

Our area of focus is the process plant and equipment sector. Engineers know from experience that each type of equipment will fail in a number of ways. The challenge for organisations is to develop sensible FMCs for their data. The task of identifying FMCs for hundreds of thousands of types of equipment is too large to be carried out by hand [31]. For example, the ISO 15926-4 [32] Reference Data Library has a list of 2146 classes for equipment types (e.g., pump, motor, fan). There is increasing interest in looking at the information captured in the Maintenance Work Order (MWO) data to identify relevant FMCs and map these codes to the equipment that generates the failures classified by the FMC label [33,34,35].

Our intuition is to map the FMCs to the inherent function of the physical object rather than its equipment type. The inherent function is defined as the “function of an object, independent of any application of the object” [36]. Inherent function is a permanent, essential, or characteristic attribute of the object. The IEC 81346-2 Standard provides a classification structure based on inherent function and a mapping of equipment types to an inherent function hierarchy. The inherent function list (in IEC 81346-2) at the top level (L1) is a much smaller set (n = 17) compared to the number of equipment types (n = thousands). We hypothesise that leveraging engineering knowledge linking each physical object class to its inherent function, and by extension its functional failure (as shown in Figure 1), will assist in the semantic identification of sensible categories for failure modes. This idea mimics the way engineers use their knowledge of the (1) inherent function of the physical object; (2) how specific physical objects can fail, which is linked to their inherent function; and (3) how that failure is usually observed when performing the FMC task.

Figure 1 illustrates the abstractions used by engineers to identify FMCs from an MWO text. In this example, the MWO text is ‘differential oil is leaking’. We show how an engineer focuses on the equipment type (differential) and ‘knows’ that a differential is part of the power transmission system for an engine. Engineers also ‘know’ that differentials have the inherent function of transmitting force and are contained by a housing with the inherent function of storing oil for lubrication. They also ‘know’ that any physical object that has a containment function (for example, a tank or a pipe) has the potential to leak. Note that none of this information is explicit in the MWO. This knowledge linking a physical object to its inherent function and an inherent function to potential failures is immutable. Some of this knowledge has been captured in engineering textbooks and tables in International Standards.

We use the MaintIE [37] public knowledge graph (KG) dataset and schema for MWO texts. At the time of writing, MaintIE is the largest open dataset of annotated industrial MWOs. The relationship between physical objects and their inherent function class in the KG schema is based on the International Engineering Standard IEC 81346-2 [36]. MaintIE contains a gold standard (1076 fine-grained expert-annotated texts) mapped to a multi-level hierarchy, with physical objects broken down into 17 Level 1 (L1) inherent functions (e.g., controlling, holding, protecting objects), and these further broken down into 160 Level 2 (L2) inherent functions (a diagram of the schema is available at https://github.com/nlp-tlp/maintie/blob/main/SCHEME.md, accessed on 27 February 2026). Each physical object mentioned in the MWO is mapped to an inherent function at the L2 level, as shown in Figure 2.

We also use a second (larger) MaintIE coarse-grained corpus (silver standard), which has MWOs annotated by a deep learning model trained on the fine-grained corpus. The annotations are then reviewed by an industry expert. This corpus consists of 7000 annotated work order texts. Unlike MaintIE Gold, which is annotated to the L2 level (for example, thermostat is annotated as PhysicalObject/SensingObject/TemperatureSensingObject), MaintIE Silver is not annotated to L1 or L2 levels (for example, thermostat is annotated as PhysicalObject). Since we require engineering knowledge in the form of inherent function to use MaintIE Silver, we automatically annotated a portion of the MWOs in MaintIE Silver to the L2 level by referencing the PhysicalObject entities in MaintIE Gold. Since a piece of equipment will generally only have one inherent function, any PhysicalObject in MaintIE Gold that also occurs in MaintIE Silver will have the same inherent function. Thus, MaintIE Gold and a portion of MaintIE Silver form the dataset used in this study. A summary of the count and types of nodes and relations in MaintIE Gold and Silver is available in Table 3.
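The automatic annotation step described above amounts to a lookup from Gold entity text to its L2 type. The sketch below is a minimal illustration under several assumptions: the record layout is a simplified, hypothetical stand-in for the MaintIE JSON schema, the "State" type and the "gizmo" record are invented for the example, and keeping only records in which every physical object resolves is one possible reading of "a portion of the MWOs".

```python
def build_lookup(gold_entities):
    """Map lower-cased entity text to its L2 type (two '/' separators)."""
    lookup = {}
    for ent in gold_entities:
        if ent["type"].count("/") == 2:
            lookup.setdefault(ent["text"].lower(), ent["type"])
    return lookup

def upgrade(silver_records, lookup):
    """Return copies of Silver records whose physical objects all resolve to L2."""
    upgraded = []
    for rec in silver_records:
        physical = [e for e in rec["entities"] if e["type"] == "PhysicalObject"]
        # Assumption: skip records with no physical object or an unknown one.
        if physical and all(e["text"].lower() in lookup for e in physical):
            entities = [
                {**e, "type": lookup[e["text"].lower()]}
                if e["type"] == "PhysicalObject" else e
                for e in rec["entities"]
            ]
            upgraded.append({**rec, "entities": entities})
    return upgraded

# The thermostat type is the example given in the text; the rest is hypothetical.
gold_entities = [
    {"text": "thermostat",
     "type": "PhysicalObject/SensingObject/TemperatureSensingObject"},
]
silver_records = [
    {"text": "thermostat faulty",
     "entities": [{"text": "thermostat", "type": "PhysicalObject"},
                  {"text": "faulty", "type": "State"}]},
    {"text": "gizmo broken",
     "entities": [{"text": "gizmo", "type": "PhysicalObject"},
                  {"text": "broken", "type": "State"}]},
]
upgraded = upgrade(silver_records, build_lookup(gold_entities))
```

Here only the thermostat record is retained, since "gizmo" has no counterpart in the Gold lookup.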

2.2. Introducing Synthetic Data to MaintIE

The MaintIE dataset is the largest open-source annotated dataset available. Across MaintIE Gold and Silver, there is an uneven distribution of inherent functions, as shown in Figure 3. This impacts clustering and topic modelling algorithms and can lead to classes of failure mode terms or inherent functions with lower numbers not being represented. To combat this, synthetic data generated based on MaintIE [38] is introduced into the dataset.

The synthetic data is generated by first extracting triples from the MaintIE Gold knowledge graph to create valid engineering paths from a piece of equipment to a failure mode. An example of this is the path from the equipment “windscreen” with the failure mode property “crack”. Additional paths are formed by leveraging the hierarchical contains, hasPart, and isA relations in MaintIE. For example, from the path “backhoe hasPart windscreen hasProperty crack”, we can extend additional paths “windscreen hasProperty crack” and “backhoe windscreen hasProperty crack”. Any new paths that do not originally exist in the MaintIE dataset are validated for technical correctness by an SME.
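The path construction can be sketched as follows, using the backhoe example from the text. This is an illustrative reading of the procedure, not the authors' implementation: triples are assumed to be (head, relation, tail) tuples, only the whole-to-part relations hasPart and contains are handled (isA is left out because it runs from subtype to supertype), and SME validation is outside the scope of the code.

```python
# Whole-to-part relations handled by this sketch (isA is deliberately omitted).
WHOLE_TO_PART = {"hasPart", "contains"}

def candidate_paths(triples):
    """Build equipment-to-failure paths, extended through whole/part links."""
    wholes = {}
    for head, rel, tail in triples:
        if rel in WHOLE_TO_PART:
            wholes.setdefault(tail, []).append((head, rel))

    paths = []
    for head, rel, tail in triples:
        if rel in WHOLE_TO_PART:
            continue
        # Direct path from the equipment to its failure mode.
        paths.append(f"{head} {rel} {tail}")
        for whole, link in wholes.get(head, []):
            # Path that spells out the hierarchy relation.
            paths.append(f"{whole} {link} {head} {rel} {tail}")
            # Path that names the part by its parent equipment.
            paths.append(f"{whole} {head} {rel} {tail}")
    return paths

triples = [
    ("backhoe", "hasPart", "windscreen"),
    ("windscreen", "hasProperty", "crack"),
]
paths = candidate_paths(triples)
```

Each candidate path that is not already present in the dataset would then go to a subject matter expert for validation before being used as a generation constraint.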

A GPT-4o mini (https://platform.openai.com/docs/models/gpt-4o-mini, accessed on 23 September 2024) Large Language Model is tasked with generating grounded MWOs constrained by the validated paths. Few-shot prompting is used to incorporate the style of real MWOs to make the generated MWOs authentic.

The generated synthetic work orders are added, except for those relating to the inherent functions “guiding” and “holding”, to lessen the skew of the dataset as shown in Figure 4. All work orders involving the inherent function “informationProcessing” were also removed from the dataset due to the small number of work orders, even after adding the synthetic dataset. Maintenance of information processing equipment is generally outsourced to external maintainers who keep their own separate records, and the equipment is maintained following a fixed schedule. Both factors mean that companies do not produce work orders for this maintenance work, so such work orders are unlikely to occur in the context of the maintenance industry.

Another observation from prior analysis of the MaintIE dataset was the frequent occurrence of the “leak” failure mode and its variants like “leaking” or “leaks”. This also caused a skew in the dataset in terms of failure mode words in the MWO, and so synthetic data generated that included the word “leak” or similar was not added to the dataset.

2.3. Preliminary Experiments with Topic Modelling

We first test off-the-shelf topic modelling algorithms designed to work with short texts, including Gibbs Sampling Dirichlet Mixture Model (GSDMM), BERTopic, and Top2Vec, on the MWO dataset. Every short text topic modelling technique struggles in several ways.

When GSDMM is performed on the MaintIE dataset, the resulting failure mode clusters are incomprehensible to subject matter experts, as shown in Table 4. Action words like repair, replace, and change are more prevalent throughout the dataset, forming topics that are not representative of failure modes. Even upon removal of action words, the text was too sparse to form meaningful clusters. This is a limitation of using an entirely statistical method on a technical dataset.

BERTopic [39] is another topic modelling technique we test. It uses transformers and a class-based TF-IDF [39] to create topics. However, upon examining the topics being produced, it was determined that the topics were merely grouping together MWOs by equipment type, as seen in Table 5, which is not useful for the purpose of creating generic failure mode classes. We considered removing all the equipment words from the dataset and performing topic modelling solely on failure state, process, and property words alone. However, it would lead to a loss of too much information, as the physical object itself is an important feature for what sort of failure event is possible, so it was decided against. Any topics that would be produced would ignore a key part of the MWOs, and any generic failure mode categories resulting from it would lead to grouping together MWOs that, from an engineering perspective, do not belong together. An example of this could be “air leak near side of door” and “engine leak oil” being part of the same category solely due to the word “leak”.

We test a third topic modelling technique, Top2Vec [40]. It uses Doc2Vec and Uniform Manifold Approximation and Projection to create topics. However, with the MaintIE dataset, only two topics were returned, each containing a large number of MWOs. These topics do not return any useful insights into what the set of generic FMCs could be.

The poor performance of each of these topic modelling experiments is due to the nature of the MaintIE dataset, a corpus of unstructured short texts. Even though these algorithms are designed with shorter texts in mind, MWOs are extremely short, with many fewer than eight words long. As an attempt to overcome the lack of recurring co-occurrence patterns in the texts due to word sparsity, the inherent function of each piece of equipment was retrieved from the MaintIE dataset and appended to the MWO for each of the three topic modelling algorithms. However, the topics remained indecipherable, with a large number of different types of inherent functions present in each topic. Without semantic and engineering knowledge to inform the grouping of MWOs, topic modelling approaches fail to return useful insights into the failure mode categories that exist within the text and are let down by relying on purely statistical means of analysis.

2.4. Method

We hypothesise that including the annotated knowledge graph (KG) in the embedding process will positively impact the clustering process to identify suitable categories for Failure Mode Codes (FMCs) from the texts of Maintenance Work Orders (MWOs).

We test four embedding methods over three different clustering approaches, as shown in Figure 5. Two of the embedding methods (Averaged Word2Vec [41] and Sentence-BERT [42]) are off-the-shelf methods and do not introduce additional engineering knowledge. In contrast, Word2Vec [41] Bidirectional Long Short-Term Memory (Bi-LSTM) [43] embedding and Sentence-BERT [42] dense neural network embedding, both methods inspired by a previous study [17], allow for the introduction of a KG into the embedding process.

The clustering approaches include K-means [30], average agglomerative hierarchical clustering [44], and Ward agglomerative hierarchical clustering [45]. We perform every combination of embedding and clustering approaches on the dataset and evaluate the resulting clusters from each combination. Clusters are evaluated in three ways: statistical data on cluster shape and characteristics, manual analysis of each cluster by subject matter experts with an engineering background, and a Normalized Mutual Information (NMI) score.

All code can be found in this project’s GitHub repository (https://github.com/nlp-tlp/Hons24-Jadeyn, accessed on 27 February 2026).

2.5. Sentence Embedding

To incorporate engineering knowledge from the KG within the Word2Vec Bi-LSTM and Sentence-BERT NN embedding methods, we use semantic representations of MWOs and attempt to predict the inherent function of the part of the physical object identified as having the undesirable state in the MWO. The final hidden layers of both models are extracted and used as a sentence embedding. As mentioned previously, we focus on the inherent function because the inherent function of a physical item is an important consideration for failure identification and is available to us in the form of L2-level entity annotations in the MaintIE KG.

2.5.1. Extracting the Inherent Function Labels

We create training labels for our feature extraction methods by determining, for each MWO, the physical item on which the failure has occurred; we treat this as the most relevant physical item. The output labels are then selected through the engineering knowledge stored in the PhysicalObject subclass entity annotations. The most relevant physical item of an MWO is decided through the KG relations by selecting the physical item with a hasParticipant relation with the undesirable behaviour node mentioned in the MWO. For example, in the MWO “air conditioner in truck has leak”, the KG triple involving the failure mode “leak” hasParticipant/hasPatient “air conditioner” suggests that “air conditioner” is the most relevant physical object. Since “air conditioner” has the inherent function emitting, the inherent function label for this MWO is emitting.

In the case where there are multiple physical items that have a hasParticipant relation with the undesirable behaviour node, the relations of the MaintIE KG, namely the hasPart and isA relations, are used to determine the most relevant physical item. The hasPart relation captures a physical object’s parts, for example, an engine hasPart radiator. Since the part is where the failure mechanism is experienced, it is more relevant to the FMC determination, and its inherent function is used in the analysis. An example of this is shown in Figure 6. The isA relation denotes that a physical object class is a type of another physical object class, for example, a diesel engine isA(n) engine. In this case, the more specific physical item (diesel engine) is selected, and its inherent function (driving) is used as the training label.
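The selection rules described in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the study: the `(head, relation, tail)` triple format, the function name `select_label`, and the small lookup tables are hypothetical and chosen only to mirror the examples in the text.

```python
def select_label(triples, inherent_function):
    """Return (physical_object, inherent_function) for one MWO, or None.

    triples: list of (head, relation, tail) tuples (hypothetical format).
    inherent_function: dict mapping a physical object to its inherent function.
    """
    # Candidates are objects participating in the undesirable behaviour.
    candidates = [t for (_, r, t) in triples if r.startswith("hasParticipant")]
    if not candidates:
        return None
    less_relevant = set()
    for head, relation, tail in triples:
        if relation == "hasPart":
            less_relevant.add(head)  # the whole is less relevant than its part
        elif relation == "isA":
            less_relevant.add(tail)  # the general class is less relevant
    preferred = [c for c in candidates if c not in less_relevant] or candidates
    chosen = preferred[0]
    return chosen, inherent_function.get(chosen)


# Example from the text: a single participant.
print(select_label(
    [("leak", "hasParticipant/hasPatient", "air conditioner")],
    {"air conditioner": "emitting"},
))  # ('air conditioner', 'emitting')

# Example from the text: the more specific class wins via isA.
print(select_label(
    [("fault", "hasParticipant", "engine"),
     ("fault", "hasParticipant", "diesel engine"),
     ("diesel engine", "isA", "engine")],
    {"diesel engine": "driving"},
))  # ('diesel engine', 'driving')
```

When no hasPart or isA relation separates the candidates, this sketch simply falls back to the first candidate; how such ties are resolved in practice is not specified here.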

2.5.2. Word2Vec Bidirectional Long Short-Term Memory (Bi-LSTM) Embeddings

In this approach, as shown in Figure 7, Word2Vec [41] embeddings are used as input to a Bi-LSTM [43] model that predicts the MWO’s inherent function, so that the model learns the patterns between the semantic word embeddings and the expert knowledge expressed as inherent function. The final hidden layer is extracted and used as the sentence embedding of the MWO for the unsupervised clustering task. This is achieved using a skip-gram Word2Vec (https://code.google.com/archive/p/word2vec/, accessed on 15 May 2025) model to transform each MWO into a $d_{E} \times d_{w}$-dimensional matrix, where $d_{E}$ is the embedding size hyperparameter set by the user, and $d_{w}$ is the maximum number of words in an MWO in the dataset. In our experiments, $d_{E}$ is set to 100, following the value used in [17], while $d_{w}$ is 12, the maximum number of words in an MWO in this study’s dataset.
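To make the shape of this input concrete, the sketch below builds a fixed-size matrix for one MWO from a word-vector lookup. It is illustrative only: the function name `mwo_to_matrix`, the toy vectors, zero-padding for short MWOs, and zero vectors for unknown words are all assumptions, since the text does not state how these cases are handled. The matrix is stored with one row per word position, so it has $d_{w}$ rows of $d_{E}$ values.

```python
def mwo_to_matrix(tokens, word_vectors, d_e, d_w):
    """Stack word vectors for one MWO into d_w rows of d_e values."""
    rows = []
    for token in tokens[:d_w]:
        # Unknown words map to a zero vector (an assumption of this sketch).
        rows.append(list(word_vectors.get(token, [0.0] * d_e)))
    while len(rows) < d_w:
        rows.append([0.0] * d_e)  # zero-pad short MWOs (also an assumption)
    return rows


# Toy example with d_e = 4 instead of the 100 used in the experiments.
toy_vectors = {"pump": [0.1, 0.2, 0.3, 0.4], "leak": [0.5, 0.6, 0.7, 0.8]}
matrix = mwo_to_matrix(["pump", "has", "leak"], toy_vectors, d_e=4, d_w=12)
print(len(matrix), len(matrix[0]))  # 12 4
```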

Output labels in the form of the MWO’s inherent function are created as mentioned in Section 2.5.1. The Bi-LSTM model is used to learn and extract the hidden patterns between the Word2Vec semantic inputs and the inherent function output labels in MWOs. The Bi-LSTM model begins with an input layer with a shape matching the Word2Vec embedding input ($d_{E} \times d_{w}$). This is followed by two Bi-LSTM layers.

Bi-LSTM is a recurrent neural network that learns patterns from a sequence in the forward and backward directions. It is composed of two LSTM layers, one that processes the input in the forward direction and one in the backward direction, to capture context on both sides of each word. LSTM networks use forget, input, and output gates to retain long-term dependencies and mitigate the vanishing gradient problem. LSTM networks maintain a hidden state and a cell state over each time step and perform the following calculations:

\begin{align}
f_{t} &= \sigma(W_{f} x_{t} + U_{f} h_{t-1} + b_{f}) \tag{1} \\
i_{t} &= \sigma(W_{i} x_{t} + U_{i} h_{t-1} + b_{i}) \tag{2} \\
g_{t} &= \tanh(W_{c} x_{t} + U_{c} h_{t-1} + b_{c}) \tag{3} \\
o_{t} &= \sigma(W_{o} x_{t} + U_{o} h_{t-1} + b_{o}) \tag{4} \\
c_{t} &= f_{t} \odot c_{t-1} + i_{t} \odot g_{t} \tag{5} \\
h_{t} &= o_{t} \odot \tanh(c_{t}) \tag{6}
\end{align}

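where $W$ and $U$ are weight matrices, $x_{t}$ is the input at time $t$, $h_{t}$ is the hidden state at time $t$, $c_{t}$ is the cell state at time $t$, $b$ is a bias vector, $f_{t}$, $i_{t}$, and $o_{t}$ are the forget, input, and output gate activations, $g_{t}$ is the candidate cell state, $\sigma(\cdot)$ denotes the sigmoid activation function, $\tanh(\cdot)$ is the hyperbolic tangent function, and $\odot$ denotes element-wise multiplication.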
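As an illustration of Equations (1)–(6), the following sketch computes a single LSTM time step using only the Python standard library. It is not the study's code: the function names and the `params` dictionary layout (gate name mapped to a `(W, U, b)` tuple, with matrices as lists of rows) are assumptions made for readability, and a practical model would use a deep learning framework instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-rows matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps 'f', 'i', 'c', 'o' to (W, U, b)."""
    def gate(name, activation):
        W, U, b = params[name]
        return [activation(z) for z in affine(W, x_t, U, h_prev, b)]

    f_t = gate("f", sigmoid)    # Equation (1): forget gate
    i_t = gate("i", sigmoid)    # Equation (2): input gate
    g_t = gate("c", math.tanh)  # Equation (3): candidate cell state
    o_t = gate("o", sigmoid)    # Equation (4): output gate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]  # (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                  # (6)
    return h_t, c_t
```

A Bi-LSTM layer applies such a step over the sequence once from the first word to the last and once in reverse, with separate parameters for each direction, and combines the two sets of hidden states.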

For this model, the first Bi-LSTM layer contains a forward and a backward LSTM with 64 units each and uses the tanh and sigmoid activation functions. The second Bi-LSTM layer contains a forward and a backward LSTM with 32 units each and uses the same tanh and sigmoid activation functions.

The two Bi-LSTM layers are followed by a fully connected layer with 64 units that uses the ReLU activation function. This is followed by another fully connected layer with $h_{E}$ units, where $h_{E}$ is the size of the sentence embeddings. The value of $h_{E}$ is set to 10, following [17]. The output of this fully connected layer is extracted after training to act as input for the clustering step, resulting in an MWO sentence embedding with a dimension of 10. Finally, there is an output layer with 16 units, one for each inherent function in the dataset. A softmax activation function is used to produce a probability distribution over the 16 classes.

2.5.3. Sentence-BERT (SBERT) Neural Network (NN) Embeddings

SBERT [42] embeddings are created using the “all-mpnet-base-v2” pre-trained model and used as the input for training a neural network to predict the inherent function and create embeddings of MWOs.

The neural network architecture consists of five fully connected layers. The first three fully connected layers have 128, 64, and 32 nodes, and all use the tanh activation function. The fourth layer has $h_{E} = 10$ units and also uses the tanh activation function. The output of this hidden layer is extracted after training to act as deep sentence embeddings for clustering, resulting in an MWO sentence embedding with a dimension of 10. The last layer in the neural network is a fully connected layer with 16 units and the softmax activation function; it is used to predict the inherent function of the MWO.

2.5.4. Training the Deep Embeddings

The dataset is split into a training set of 80% of MWOs and a test set of 20% of MWOs. Both the new Word2Vec Bi-LSTM and SBERT NN embedding models are trained from scratch on the training set with the Adam optimiser with a learning rate of 0.001 (controlling how fast weights are updated per step) and use sparse categorical cross-entropy to compute the loss between the predicted inherent function and the actual label. Each model is validated with the test set, and early stopping is used with a patience of 20 over a maximum of 200 epochs.
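The loss and stopping rule named above can be illustrated with a short standard-library sketch. It is not the training code used in the study: the function names are hypothetical, and monitoring the validation loss for early stopping is an assumption, since the monitored quantity is not stated.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_cross_entropy(probabilities, true_index):
    """Loss for one example: negative log-probability of the true class."""
    return -math.log(probabilities[true_index])


def epochs_run(val_losses, patience=20, max_epochs=200):
    """Return how many epochs run before early stopping ends training."""
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# An untrained model that is uniform over the 16 inherent functions
# has a loss of log(16) per example.
uniform = softmax([0.0] * 16)
print(sparse_categorical_cross_entropy(uniform, 3))
```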

2.6. Clustering Algorithms

Two types of clustering algorithms are tested: K-means and agglomerative hierarchical. K-means [30] creates a pre-determined number of centroids in vector space. Each item closest to a centroid is considered part of that cluster. The algorithm iteratively moves the centroids around the vector space to find good clusters that maximise the similarity within the cluster and the dissimilarity with items outside the cluster.

In agglomerative hierarchical clustering, each item starts as a cluster of one. The algorithm repeatedly joins the two most similar clusters. The process can be halted once the desired number of clusters remains. There are different ways to calculate the similarity between clusters, including average linkage [44] (which measures the average distance between the items in the two clusters) and Ward linkage [45] (which measures how much the sum of squares will increase if two clusters are merged). Both average and Ward linkage are tested in this research.
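The two linkage criteria can be written down directly. The sketch below is illustrative and uses hypothetical function names; it assumes Euclidean distance and computes the Ward criterion through the standard identity that the increase in within-cluster sum of squares equals $\frac{n_{a} n_{b}}{n_{a} + n_{b}}$ times the squared distance between the two centroids.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(cluster):
    n = len(cluster)
    return [sum(p[d] for p in cluster) / n for d in range(len(cluster[0]))]


def average_linkage(cluster_a, cluster_b):
    """Mean Euclidean distance over all pairs drawn from the two clusters."""
    total = sum(
        squared_distance(a, b) ** 0.5 for a in cluster_a for b in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))


def ward_increase(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares caused by merging."""
    n_a, n_b = len(cluster_a), len(cluster_b)
    gap = squared_distance(centroid(cluster_a), centroid(cluster_b))
    return n_a * n_b / (n_a + n_b) * gap


a = [[0.0, 0.0], [0.0, 2.0]]
b = [[4.0, 0.0], [4.0, 2.0]]
print(ward_increase(a, b))  # 16.0
```

At each step, the agglomerative procedure merges the pair of clusters with the smallest value of the chosen criterion.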

One challenge with performing this unsupervised clustering is that the number of clusters that exist within the dataset is unknown. To combat this, clustering is performed for all numbers of clusters between 5 and 50, and from each set of clusters, the silhouette score is calculated and plotted.

Silhouette score [46] is a metric of the quality of the clusters that are produced. It compares the similarity of each item to the other members of its cluster with its dissimilarity to non-member points. Silhouette score is plotted over the number of clusters as shown in Figure 8, and the ideal number of clusters is selected from the highest silhouette score. Another method of determining the number of clusters is through the dendrogram produced during hierarchical clustering, as shown in Figure 9, which can be manually examined to determine the number of clusters that best suits the data. In dendrograms, clusters that are more similar to each other are joined earlier, and thus merge at a lower distance, which is taken into account when selecting a good number of clusters.
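The selection procedure can be sketched as follows, assuming Euclidean distance and the common convention that an item in a singleton cluster scores 0. The function names are hypothetical, the toy data are one-dimensional, and the candidate cluster counts are far smaller than the range of 5 to 50 used in the experiments; at least two clusters are required.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def silhouette_score(points, labels):
    """Mean silhouette over all points (requires at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(euclidean(point, q) for q in other) / len(other)
            for key, other in clusters.items() if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def best_number_of_clusters(points, labelings):
    """labelings maps a cluster count k to the labels obtained with k."""
    return max(labelings, key=lambda k: silhouette_score(points, labelings[k]))


points = [[0.0], [1.0], [10.0], [11.0]]
labelings = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
print(best_number_of_clusters(points, labelings))  # 2
```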

3. Results

We compare the clusters created from the new embedding methods in Section 2.5 with off-the-shelf average Word2Vec and SBERT embeddings that do not incorporate additional engineering knowledge. Clusters are evaluated in three ways: statistical data on cluster shape and characteristics, manual analysis on each cluster by subject matter experts with an engineering background, and NMI score [47].

3.1. Defining Good Clusters

Two of the three evaluation methods (statistical data on cluster shape and characteristics, and manual analysis of each cluster by subject matter experts) rely on criteria for what constitutes a “good cluster”. A good cluster is defined as a cluster in which at least 80% of work orders fall within between one and three different types of inherent function, and which contains no fewer than the minimum number of MWOs (20). These criteria were derived by two maintenance engineering subject matter experts upon examination of clusters, in order to identify clusters that contain equipment with similar functions and failure mode categories. Requiring at least 80% of work orders to fall within between one and three different types of inherent function ensures that good clusters are those informed by inherent function as a feature. It also eliminates clusters that are grouped based on generic terms like “not working” that are found across many different failures. Excluding clusters with fewer than 20 MWOs removes clusters that are too small to act as a generic failure mode category.
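One way to read the good-cluster definition is as a check on the most frequent inherent functions in a cluster. The sketch below encodes that reading; it is an interpretation rather than the authors' code, and the function name, parameter names, and the choice to treat a cluster of exactly 20 MWOs as large enough are assumptions.

```python
from collections import Counter


def is_good_cluster(functions, max_functions=3, min_share=0.8, min_size=20):
    """functions: the inherent function label of every MWO in one cluster."""
    size = len(functions)
    if size < min_size:
        return False  # too small to act as a generic failure mode category
    top = Counter(functions).most_common(max_functions)
    covered = sum(count for _, count in top)
    # The (up to) three most common functions must cover at least 80%.
    return covered / size >= min_share


cluster = (["storing"] * 10 + ["covering"] * 5 + ["guiding"] * 2
           + ["emitting", "driving", "holding"])
print(is_good_cluster(cluster))  # True: 17 of 20 MWOs share three functions
```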

3.2. Analysis Based on Statistical Characteristics

Statistical characteristics of the clusters (mean, median, minimum, and maximum number of documents and the distribution of inherent functions across clusters) provide a direct comparison of the shapes and distribution of clusters for each embedding and clustering method used. The results of this are shown in Table 6, Table 7 and Table 8.

In most cases, the averaged Word2Vec and SBERT embedding approaches place a much lower percentage of documents in good clusters than the percentage of their clusters that are good. The embedding methods incorporating engineering knowledge outperform off-the-shelf embeddings in clustering MWOs. Table 8 shows the Word2Vec Bi-LSTM and SBERT NN embeddings achieving 75.3% and 68.6% of documents in good clusters, compared to just 40.4% and 22.4% for the averaged Word2Vec and SBERT embeddings. This suggests that off-the-shelf embeddings struggle to capture the technical meaning of the text, resulting in many unclear and poorly formed clusters. In contrast, embeddings that incorporate engineering knowledge (captured in the KG triples) improve unsupervised clustering results, producing more meaningful clusters (for the engineer), with Word2Vec Bi-LSTM consistently yielding the best results. This aligns with the study’s goal from a computer science perspective, exploring how the inclusion of additional information in the form of triples impacts the performance of an unsupervised clustering task.

The choice of clustering method plays a significant role. K-means produces consistently equal-sized clusters, which is reflected in the highest maximum number of documents per cluster. However, this is due to its tendency to prioritize shape over meaning and document similarity, which is problematic for datasets (like MaintIE) that have uneven distribution of failure mode words and inherent function. Meanwhile, Ward hierarchical clustering outperforms both average hierarchical and K-means clustering in the proportion of good clusters.

3.3. Manual Analysis of Each Cluster by Subject Matter Experts

Each of the clusters was reviewed by two experienced reliability engineers. The review involves an assessment of the inherent function classification and the failure modes and equipment in each cluster. It should be noted that there is only limited room for interpretation in this expert review, as the IEC 81346-2 standard has a mapping between all equipment classes and function; the experts use this. Likewise, the mapping between function and failure mode leaves limited room for interpretation. For example, a chair cannot leak, but a pipe can. This review provides insights into latent patterns within clusters captured by the sentence embedding, such as grouping together a specific piece of equipment, or recurring inconsequential phrases such as ‘left-hand side’. This aligns with the study’s goal from an engineering perspective, where we evaluate success based on the creation of clusters from MWO texts providing insight into engineering functions and failure modes experienced by the equipment described in the MWO.

3.3.1. Averaged Word2Vec Embeddings and SBERT Embeddings

Clusters produced by averaged Word2Vec and SBERT embeddings have a lower proportion of good clusters compared to the new embedding methods. This results in clusters that are grouped by features that are irrelevant in forming a list of FMCs, for example, clusters that grouped MWOs containing the phrase ‘left-hand side’ or general failure words like ‘fault’ that are not specific to the inherent function. Word2Vec and SBERT embeddings alone are not successful in categorising inherent functions and failure modes in MWOs.

3.3.2. Word2Vec Bi-LSTM Embeddings

A majority (80.7% as shown in Table 8) of clusters formed from Word2Vec Bi-LSTM embeddings are good clusters, and some examples of good clusters are shown in Table 9. For example, over 80% of work orders in Cluster 1 have the inherent function of storing, covering or guiding. By looking at the top failure modes and physical objects in this cluster, experts determine that Cluster 1 groups MWOs with storing, covering or guiding functions with material or structural failures. Meanwhile, Cluster 2 captures MWOs describing o-ring failures. Cluster 4 groups MWOs describing structural and material failures in buckets. All of these good clusters can be used to form a list of potential FMCs.

3.3.3. SBERT NN Embeddings

SBERT NN embeddings also yield a higher number of good clusters than the off-the-shelf embeddings. These include Cluster 1 in Table 10, which is characterised by failures of generating objects like pumps and batteries. Cluster 4 is another good cluster for emitting objects with relevant (out, blown) and general failure modes (unserviceable, not working).

When examining the bad clusters, Clusters 17 and 19 group MWOs that contain general failure mode terms such as ‘fault’, ‘unserviceable’, or ‘not working’. ‘Unserviceable’, ‘not working’, and ‘fault’ occur as top failure modes in many clusters, as they are common terms used in MWOs to describe equipment when there is no obvious symptom for the problem.

3.4. Normalised Mutual Information Score (NMI)

NMI [47] is used to measure similarity between clusters formed in each experiment and MWOs that are grouped solely by inherent function. NMI is calculated from the following function:

$$\mathrm{NMI}(X, Y) = \frac{I(X, Y)}{\sqrt{H(X)\, H(Y)}}$$

where $X$ and $Y$ are the examined cluster labels and the true labels; $I(\cdot)$ is the mutual information metric; and $H(\cdot)$ is entropy [47]. If the cluster labels and the true labels are the same, then the NMI is 1.
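The NMI formula can be evaluated directly from label counts. The sketch below is a standard-library illustration with hypothetical function names; returning 0 when either labelling has zero entropy is a convention chosen for this example, since the ratio is undefined in that case.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    # Sum of p(a, b) * log(p(a, b) / (p(a) * p(b))) over observed pairs.
    return sum(
        (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


def nmi(x, y):
    """NMI with the geometric-mean normalisation sqrt(H(X) H(Y))."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x == 0.0 or h_y == 0.0:
        return 0.0  # undefined ratio; 0 is a convention of this sketch
    return mutual_information(x, y) / math.sqrt(h_x * h_y)


# Relabelling the clusters does not change the score.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because the score depends only on how the two labellings partition the items, the cluster identifiers themselves do not need to match the inherent function names.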

When calculating NMI for the clusters produced with each embedding method, the true labels Y are cluster labels of MWOs that are grouped solely by inherent function. The inherent function of each MWO is extracted by the same method in Section 2.5.1. Meanwhile, the examined cluster labels X are produced from the resulting clusters for every combination of embedding and clustering method. The number of clusters is set to 16 (equal to the number of inherent functions in the dataset). This gives an idea of how much inherent function as a feature contributes to the embeddings. If inherent function is a very strong feature, the resulting clusters are similar to MWOs grouped solely by inherent function and return an NMI of close to 1.

Since Word2Vec Bi-LSTM and SBERT neural network embeddings are created by training a model to predict the inherent function of MWOs as an intermediate supervision step, they are expected to have a stronger cluster alignment with the MWO inherent functions’ labels. Consequently, they are expected to have higher NMI scores than the off-the-shelf embedding methods.

NMI is used to verify that the embedding methods have encoded inherent function information in the MWO embeddings. The NMI scores reflect the strength of expert knowledge captured within each embedding approach. The clusters are also considered for their purpose as category labels by expert analysis; this is described in Section 3.3.

Table 11 shows that across all the different clustering approaches, Word2Vec Bi-LSTM and SBERT NN embedding approaches consistently had a higher NMI score than averaged Word2Vec and SBERT embeddings, with Word2Vec Bi-LSTM embeddings returning the highest NMI score.

3.5. Identifying FMCs from Good Clusters

Of the combinations tested, the best combination of embedding method and clustering method is Word2Vec Bi-LSTM and Ward hierarchical. This approach produces the highest proportion of documents in good clusters and the highest proportion of clusters that are good. The NMI score for Word2Vec Bi-LSTM embeddings is higher than for the other embedding methods, suggesting that inherent function is best captured using that approach.

Potential FMC categories and the associated inherent function are identified by subject matter experts from the manual analysis of the Word2Vec Bi-LSTM approach in Section 3.3 by reviewing the list of good clusters and their links. As shown in Figure 10, some of the 23 potential FMCs identified include:

C1—objects with storing and covering function (e.g., tanks and buckets) having structural and material failures (e.g., cracks, leaks, and missing parts);
C15—objects with guiding function (universal joints, hoses, chains) with failures that are mechanical in nature and not to do with leaking;
C5—oil filters, air filters, and filters on differentials and centrifuges with matter processing function having ‘leaking’ failures;
C2—O-rings with covering function failures;
C18—lights with emitting function failures (e.g., out, unserviceable, fault, and blown);
C4—bucket teeth and adaptors with matter processing function having material or structural problems (e.g., missing, loose, or worn).

Subject matter experts examined the clusters and considered whether each cluster mapped to an inherent function. As mentioned earlier, this mapping is informed by the IEC 81346-2 standard, which has a mapping between equipment type and the 17 functions, as well as by an understanding of the relationship between function and failure mode. One interesting, but not unexpected, finding is that an inherent function can map into two or three clusters because different components are involved within the same inherent function. For example, in Figure 10, the protecting function is divided into three clusters: one cluster for coolant leaks, one for breaks and switches, and one for oil and air filter failures. Additionally, some clusters have more than one inherent function type but have something else in common grouping them together (e.g., C5 has filters across the holding and matter processing inherent functions). This observation accords with engineering experience that quite different types of equipment can have the same function and, therefore, failure modes, so they form separate clusters.

4. Discussion

The results presented in Section 3 show how semantic representation alone is insufficient to capture engineering knowledge in Maintenance Work Orders (MWOs). Off-the-shelf embedding methods struggle with the technical nature of the text and produce clusters that are poorly formed. The two novel feature extraction approaches that directly incorporate engineering knowledge extracted from the KG triples produce embeddings that represent latent relationships in the data alongside semantic meaning. These embedding methods that incorporate engineering knowledge outperform the off-the-shelf embedding methods, aligning with the study’s goal from a computer science perspective to explore how introducing expert knowledge in the form of KG triples impacts the clustering performance of the MWOs.

The best results come from the Word2Vec Bi-LSTM embedding and Ward hierarchical clustering approach. This approach produces the highest proportion of documents in good clusters (75.3%) and the highest proportion of clusters that are good (80.7%), as shown in Section 3.2. A large share of the clusters are ‘good’ clusters, and they return important insights into the relationship between engineering functions and failure modes, making them candidates for generic failure mode categories, as seen in Section 3.3. The NMI score for Word2Vec Bi-LSTM embeddings is higher than for the other embedding methods, suggesting that inherent function is best captured using that approach, as seen in Section 3.4. From these clusters, we achieve the study’s goal from an engineering perspective to create a list of potential FMCs linked to the inherent function of physical objects through a data-driven approach.

4.1. Limitations

The quality of ‘good’ clusters produced is dependent on the distribution of inherent function over the dataset. For example, from the list of potential FMCs identified, none of the clusters represents the inherent function of human interaction. This is a consequence of the MaintIE dataset being unbalanced in some inherent function categories (e.g., Human Interaction and Information Processing) and associated undesirable behaviours. This impact was partially reduced by the introduction of synthetic data.

One of the challenges with the raw MWO texts in the MaintIE dataset is that many failure descriptions are generic to all equipment; notably, ‘unserviceable’ and ‘not working’ occur in numerous clusters. A dataset drawn from, say, warranty data might contain more detailed observations of the failure, and this could be a focus of future work.

4.2. Future Work

During feature extraction, the Word2Vec Bi-LSTM model was able to perform inherent function classification with a training set accuracy of 0.806 and a test set accuracy of 0.661. The SBERT NN embeddings had a training accuracy of 0.700 and a test accuracy of 0.509. These results can likely be improved upon with further experimentation of model architecture and with a larger dataset of MWOs with coverage over more physical objects, inherent functions, and more detailed failure descriptions.

The mapping shown in Figure 10 successfully captures the relationships between function and failure in the dataset. We are able to produce meaningful clusters that subject matter experts identify as related to failure mode categories. The results are promising, already producing useful insights from the data (MaintIE dataset of 7000 work orders). MaintIE is currently the largest publicly available annotated dataset of MWOs. To apply this work to larger datasets as they are released, or for organisations to replicate it on their own datasets of MWOs, annotation in accordance with the MaintIE schema is required [37]. This can be achieved by fine-tuning the existing MaintIE model with an organisation’s own MWOs annotated with the MaintIE schema using an annotation tool such as QuickGraph [48].

5. Conclusions

We test data-driven approaches to identify categories using unsupervised clustering approaches with and without the incorporation of external knowledge. We explore the use of embeddings that combine a semantic representation of MWOs with engineering knowledge, specifically in the form of a KG. Performance is assessed by statistical analysis, manual analysis by subject matter experts, and the NMI score.

We demonstrate that embedding methods can incorporate engineering knowledge of inherent function alongside semantic meaning to identify generic failure mode clusters related to the inherent function of the physical object mentioned in the MWO. Both SBERT NN and Word2Vec Bi-LSTM embeddings significantly outperform off-the-shelf embedding methods, demonstrating the value of introducing engineering knowledge when creating embeddings for technical texts.

We identified 23 generic categories from the ‘good’ clusters. These clusters, created from unstructured texts in MWOs, have common equipment and failure modes (for example, pins, bolts, and bearings that are missing, loose, or fail to track). These are all associated with the loss of a ‘holding’ function as defined in IEC 81346 [36]. We note that clusters with the same function but quite different equipment and failure modes are to be found in separate clusters. Thus, there are 23 clusters although there are only 17 IEC 81346-2 functions. Both of these findings accord with engineering judgement that failure modes (and the equipment that has these failure modes) should have a meaningful relationship to function. This suggests that the IEC 81346-2 functional hierarchy has value as a model for structuring equipment hierarchy for maintenance.

From the ‘bad’ clusters, we gain insights into other potential forms of similarity in MWOs beyond the inherent function, which will form the basis of future work. While we were able to gain valuable insights into potential FMCs for classification list creation through this approach, we recognise that the list of categories (from an engineering perspective) is not as definitive as we had hoped. There is no comprehensive coverage in the MaintIE dataset of all inherent function classes, and many of the failure mode descriptors (e.g., ‘unserviceable’) apply to too many physical objects. We attempt to mitigate the impact of this coverage issue by introducing synthetic data.

Future work could apply this approach to MWOs or warranty data with more detailed failure descriptions (these are generated in, for example, the aerospace or car sectors) and a wider range of physical assets. The core of this approach is datasets annotated with KG schema that capture immutable domain knowledge relevant to the clustering task. These datasets and schemas exist in other technical domains beyond engineering, and we encourage further exploration of domain knowledge embeddings to discover latent clusters for classification list development.

As organisations seek to use AI to automate routine tasks, such as the assignment of a failure mode code to each MWO, there need to be guardrails to assist in checking accuracy. At present, this is achieved by having test sets classified by humans. The results of this work suggest that a semantic layer containing a knowledge graph mapping equipment types to inherent function and inherent function to associated failure modes could form the basis of a quality control process for failure mode classification.

Author Contributions

Conceptualization, M.H. and C.W.; methodology, J.F., A.L., M.H., C.W. and M.S.; software, J.F.; validation, M.H., C.W. and M.S.; formal analysis, M.H., C.W. and M.S.; investigation, J.F., A.L., M.H., C.W. and M.S.; resources, A.L., M.H., C.W. and M.S.; data curation, J.F., A.L., M.H. and C.W.; writing—original draft preparation, J.F. and M.H.; writing—review and editing, J.F., A.L., M.H., C.W. and M.S.; visualization, J.F., A.L., M.H., C.W. and M.S.; supervision, M.H., C.W. and M.S.; project administration, J.F. and M.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in Semantic and Engineering Knowledge-based Embedding for Maintenance Work Order Clustering at https://github.com/nlp-tlp/Hons24-Jadeyn (accessed on 27 February 2026). These data were derived from MaintIE: A Fine-Grained Annotation Schema and Benchmark for Information Extraction from Low-Quality Maintenance Short Texts at https://github.com/nlp-tlp/maintie (accessed on 27 February 2026) which is available in the public domain.

Acknowledgments

The authors thank the UWA NLP-TLP group for their guidance and support and are especially thankful to Tyler Bikaun, author of the MaintIE and MaintNorm datasets used in this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Reilly, M. Heavy Machinery Repair and Maintenance in Australia; Industry Report; IBIS World: Melbourne, Australia, 2023.
ISO-14224; Petroleum, Petrochemical and Natural Gas Industries—Collection and Exchange of Reliability and Maintenance Data for Equipment. Standard, International Organization for Standardization: Geneva, Switzerland, 2016.
Payette, M.; Abdul-Nour, G.; Meango, T.J.M.; Diago, M.; Côté, A. Leveraging Failure Modes and Effect Analysis for Technical Language Processing. Mach. Learn. Knowl. Extr. 2025, 7, 42.
Awasthi, P.; Thomas, M.; Junghare, D.; Bianco, M. A Machine Learning Framework for Failure Mode Identification from Warranty Data. In Proceedings of the 2025 Annual Reliability and Maintainability Symposium (RAMS), Destin, FL, USA, 27–30 January 2025; pp. 1–6.
Malan, F.; Jooste, J.L. Text Mining Techniques for Identifying Failure Modes. J. Qual. Maint. Eng. 2023, 29, 666–682.
Lee, S.; Ottermo, M.V.; Hauge, S.; Lundteigen, M.A. Towards Standardized Reporting and Failure Classification of Safety Equipment: Semi-automated Classification of Failure Data for Safety Equipment in the Operating Phase. Process Saf. Environ. Prot. 2023, 177, 1485–1493.
Stewart, M.; Hodkiewicz, M.; Li, S. Large Language Models for Failure Mode Classification: An Investigation. arXiv 2023, arXiv:2307.16699.
Hong, S.; Kim, J.; Yang, E. Automated Text Classification of Maintenance data of Higher Education Buildings using Text Mining and Machine Learning Techniques. J. Archit. Eng. 2022, 28, 04021045.
Chen, L.; Nayak, R. A Case Study of Failure Mode Analysis with Text Mining Methods. In Proceedings of the 2nd International Workshop on Integrating Artificial Intelligence and Data Mining (AIDM 2007), Gold Coast, Australia, 1 December 2007; Australian Computer Society: Sydney, NSW, Australia, 2007; pp. 49–60.
IEC-60812:2006; Analysis Techniques for System Reliability-Procedure for Failure Mode and Effects Analysis (FMEA). Standard, International Electrotechnical Commission: Geneva, Switzerland, 2006.
Sexton, T.; Hodkiewicz, M.; Brundage, M.P. Categorization Errors for Data Entry in Maintenance Work Orders. Annu. Conf. PHM Soc. 2019, 11.
Hodkiewicz, M.; Kelly, P.; Sikorska, J.; Gouws, L. A Framework to Assess Data Quality for Reliability Variables. In Engineering Asset Management; Springer: London, UK, 2008; pp. 137–147.
Sobhkhiz, S.; El-Diraby, T. A Semi-Supervised Framework for Generating Multi-Dimensional Taxonomies from Asset Maintenance Documents. Eng. Appl. Artif. Intell. 2025, 161, 112010.
Gunda, T.; Hackett, S.; Kraus, L.; Downs, C.; Jones, R.; McNalley, C.; Bolen, M.; Walker, A. A Machine Learning Evaluation of Maintenance Records for Common Failure Modes in PV Inverters. IEEE Access 2020, 8, 211610–211620.
Sabbagh, R.; Ameri, F. A Framework Based on K-Means Clustering and Topic Modeling for Analyzing Unstructured Manufacturing Capability Data. J. Comput. Inf. Sci. Eng. 2019, 20, 011005.
Kulkarni, A.; Terpenny, J.; Prabhu, V. Leveraging Active Learning for Failure Mode Acquisition. Sensors 2023, 23, 2818.
Yang, Z.; Baraldi, P.; Zio, E. A Novel Method for Maintenance Record Clustering and its Application to a Case Study of Maintenance Optimization. Reliab. Eng. Syst. Saf. 2020, 203, 107103.
Blanchard, B.S.; Fabrycky, W.J. Systems Engineering and Analysis; Prentice Hall: Englewood Cliffs, NJ, USA, 1990; Volume 4.
Yin, J.; Wang, J. A Dirichlet Multinomial Mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 233–242.
Li, G.; Zhang, Q.; Zheng, R.; Wang, C. A Fault Analysis Method Based on Text Clustering. In Proceedings of the 2020 5th International Conference on Computer and Communication Systems (ICCCS), Shanghai, China, 15–18 May 2020.
Alelyani, S.; Tang, J.; Liu, H. Feature Selection for Clustering: A Review. In Data Clustering; Chapman and Hall/CRC: New York, NY, USA, 2018; pp. 29–60.
Hajjem, M.; Latiri, C. Combining IR and LDA Topic Modeling for Filtering Microblogs. Procedia Comput. Sci. 2017, 112, 761–770.
Xu, J.; Xu, B.; Wang, P.; Zheng, S.; Tian, G.; Zhao, J.; Xu, B. Self-Taught Convolutional Neural Networks for Short Text Clustering. Neural Netw. 2017, 88, 22–31.
Xu, Z.; Chen, B.; Zhou, S.; Chang, W.; Ji, X.; Wei, C.; Hou, W. A Text-Driven Aircraft Fault Diagnosis Model Based on a Word2vec and Priori-Knowledge Convolutional Neural Network. Aerospace 2021, 8, 112.
Zhang, Y.; Wang, Y.; Gu, H.; Liu, L.; Zhang, J.; Lin, H. Defect Diagnosis Method of Main Transformer Based on Operation and Maintenance Text Mining. In Proceedings of the 2020 IEEE International Conference on High Voltage Engineering and Application (ICHVE), Beijing, China, 6–10 September 2020.
Woods, C.; Selway, M.; Bikaun, T.; Stumptner, M.; Hodkiewicz, M. An Ontology for Maintenance Activities and its Application to Data Quality. Semant. Web 2024, 15, 319–352.
Škrlj, B.; Kralj, J.; Lavrač, N.; Pollak, S. Towards Robust Text Classification with Semantics-Aware Recurrent Neural Architecture. Mach. Learn. Knowl. Extr. 2019, 1, 575–589.
Kamthawee, K.; Udomcharoenchaikit, C.; Nutanong, S. MIST: Mutual Information Maximization for Short Text Clustering. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 11309–11324.
Wang, Y.; Wu, L.; Shao, H. Clusters Merging Method for Short Texts Clustering. Open J. Soc. Sci. 2014, 2, 186.
Lloyd, S. Least Squares Quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
Sobhkhiz, S.; El-Diraby, T. Integrating Unstructured Data Analytics and BIM to Support Predictive Maintenance. In Life-Cycle of Structures and Infrastructure Systems; CRC Press: London, UK, 2023; pp. 1794–1801.
ISO/TS-15926-4:2024; Industrial Automation Systems and Integration—Integration of Life-Cycle Data for Process Plants Including Oil and Gas Production Facilities—Part 4: Core reference data. Standard, International Organization for Standardization: Geneva, Switzerland, 2024.
McArthur, J.; Shahbazi, N.; Fok, R.; Raghubar, C.; Bortoluzzi, B.; An, A. Machine Learning and BIM Visualization for Maintenance Issue Classification and Enhanced Data Collection. Adv. Eng. Inform. 2018, 38, 101–112.
Mostafa, K.; Attalla, A.; Hegazy, T. Data Mining of School Inspection Reports to Identify the Assets with Top Renewal Priority. J. Build. Eng. 2021, 41, 102404.
Naqvi, S.M.R.; Ghufran, M.; Varnier, C.; Nicod, J.M.; Javed, K.; Zerhouni, N. Unlocking Maintenance Insights in Industrial Text through Semantic Search. Comput. Ind. 2024, 157, 104083.
IEC 81346-2; Industrial Systems, Installations and Equipment and Industrial Products—Structuring Principles and Reference Designations — Part 2: Classification of objects and codes for classes. Standard, International Electrotechnical Commission: Geneva, Switzerland, 2019.
Bikaun, T.K.; French, T.; Stewart, M.; Liu, W.; Hodkiewicz, M. MaintIE: A Fine-Grained Annotation Schema and Benchmark for Information Extraction from Maintenance Short Texts. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 10939–10951.
Lau, A.; Feng, J.; Hodkiewicz, M.; Woods, C.; Stewart, M.; Polpo, A. Generating Authentic Grounded Synthetic Maintenance Work Orders. IEEE Access 2025, 13, 145888–145904.
Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794.
Angelov, D. Top2vec: Distributed representations of topics. arXiv 2020, arXiv:2008.09470.
Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781.
Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084.
Graves, A.; Mohamed, A.; Hinton, G. Speech Recognition with Deep Recurrent Neural Networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649.
Sokal, R.R.; Michener, C.D. A Statistical Method for Evaluating Systematic Relationships. Univ. Kans. Bull. 1958, 38, 1409–1438.
Ward, J.H., Jr. Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 1963, 58, 236–244.
Rousseeuw, P.J. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
Strehl, A.; Ghosh, J. Cluster Ensembles—A Knowledge Reuse Framework for Combining Multiple Partitions. J. Mach. Learn. Res. 2002, 3, 583–617.
Bikaun, T.; Stewart, M.; Liu, W. QuickGraph: A Rapid Annotation Tool for Knowledge Graph Extraction from Technical Text. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 270–278.

Figure 1. Demonstration of engineering chain of thought relating a specific physical object class (differential) to sub-systems and their inherent function, thence to each function’s associated functional failure, and the ways in which the functional failure can be observed. The list of parts and their functions is taken from the IEC 81346-2 Standard.

Figure 2. An example MWO in MaintIE, adapted from [37], showing the entities, relations, and engineering knowledge in the form of inherent function-labelled subclasses.

Figure 3. Distribution of inherent functions in MaintIE Gold and Silver.

Figure 4. Distribution of inherent functions in MaintIE Gold, MaintIE Silver, and synthetic generated data.

Figure 5. Experimental design: four sentence embedding models, each tested with three clustering models.

Figure 6. Example of inherent function extraction from the MaintIE knowledge graph, where the red text, “EmittingObject”, is the annotation containing the extracted inherent function.

Figure 7. The Word2Vec Bi-LSTM model architecture.

Figure 8. Silhouette score (y-axis) over number of clusters (x-axis) for Word2Vec Bi-LSTM embeddings and Ward agglomerative hierarchical clustering. The ideal number of clusters is set to 31, which has the highest silhouette score.
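As a rough illustration of the selection procedure behind Figure 8, the sketch below builds Ward partitions for a range of cluster counts, scores each with the mean silhouette score (Rousseeuw), and keeps the count with the highest score. It is a pure-Python toy rather than the authors' implementation: the helper names (`ward_partitions`, `mean_silhouette`), the Euclidean distance, and the nine 2-D points standing in for MWO embeddings are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def ward_cost(points, a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca = centroid([points[i] for i in a])
    cb = centroid([points[i] for i in b])
    return len(a) * len(b) / (len(a) + len(b)) * dist(ca, cb) ** 2


def ward_partitions(points):
    """Greedy Ward agglomeration; returns {k: labels} for k = n down to 1."""
    clusters = [[i] for i in range(len(points))]
    partitions = {}
    while True:
        labels = [0] * len(points)
        for label, members in enumerate(clusters):
            for i in members:
                labels[i] = label
        partitions[len(clusters)] = labels
        if len(clusters) == 1:
            return partitions
        # Merge the pair of clusters with the smallest Ward cost.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(points, clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected


def mean_silhouette(points, labels):
    """Mean of s(i) = (b - a) / max(a, b); requires at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if own is None:  # singleton cluster: silhouette is taken as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                 # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())   # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical stand-in for sentence embeddings: three tight groups.
toy_embeddings = [(0, 0), (0, 1), (1, 0),
                  (10, 10), (10, 11), (11, 10),
                  (20, 0), (20, 1), (21, 0)]
partitions = ward_partitions(toy_embeddings)
scores = {k: mean_silhouette(toy_embeddings, partitions[k]) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real embeddings a library implementation would normally replace the cubic-time loop used here; the toy version only makes the "highest silhouette score" rule explicit.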

Figure 9. Dendrogram produced for average hierarchical clustering and Word2Vec Bi-LSTM embeddings. Slicing a dendrogram horizontally at different heights produces different numbers of clusters. Each cluster is represented by a different colour.
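The relationship between cut height and cluster count in Figure 9 can be sketched with a small average-linkage example. This is an illustrative pure-Python version, assuming Euclidean distance and five made-up 1-D points; the function names `average_linkage` and `cut` are hypothetical and not taken from the paper.

```python
from itertools import combinations
from math import dist


def avg_dist(points, a, b):
    """Average pairwise distance between members of clusters a and b."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def average_linkage(points):
    """Return the merge history [(height, cluster_a, cluster_b), ...]."""
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: avg_dist(points, pair[0], pair[1]))
        merges.append((avg_dist(points, a, b), a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


def cut(merges, n_points, height):
    """Replay only the merges at or below the cut height (union-find)."""
    parent = list(range(n_points))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Average-linkage heights never decrease, so every merge that built
    # a or b has already been replayed when (a, b) is reached.
    for merge_height, a, b in merges:
        if merge_height <= height:
            parent[find(a[0])] = find(b[0])
    roots = [find(i) for i in range(n_points)]
    relabel = {root: k for k, root in enumerate(dict.fromkeys(roots))}
    return [relabel[root] for root in roots]


# Hypothetical 1-D "embeddings": two close pairs and one distant point.
toy_points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
history = average_linkage(toy_points)
heights = [h for h, _, _ in history]
counts = {h: len(set(cut(history, len(toy_points), h))) for h in (0.5, 2, 10, 20)}
print(heights)
print(counts)
```

Lower cuts leave more, smaller clusters; higher cuts merge them, which is the trade-off a horizontal slice through the dendrogram controls.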

Figure 10. Good clusters identified from Word2Vec Bi-LSTM embeddings and Ward hierarchical clustering, where each cluster is an outlined circle containing multiple coloured circles representing MWOs of the same inherent function. Each cluster has a unique label (i.e., C1–C31), and clusters have been arranged based on inherent function. The size of the coloured circle is proportional to the total number of MWOs in the cluster.

Table 1. Examples of maintenance work orders.

Example	Description
A	Air horn not working compressor awaiting
B	Replace damaged glass rear and 1/4
C	Replace U/S diff drain plugs

Table 3. Count of nodes and relations in the MaintIE Gold and Silver datasets [37]. L1 is the top level, with 17 functions (e.g., controlling, holding, and protecting objects). L2 is the lower level, with 160 functions; for example, cabinet, crankcase, and housing fall under physical object/holding object/enclosing object.

	MaintIE Gold		MaintIE Silver
Number of inherent function nodes	17 at L1; 160 at L2		17 at L1 only
	Total count	Unique count	Total count	Unique count
Physical object nodes	1994	222	13472	2379
Undesirable behaviour nodes (process, property, state)	619	52	3605	546
hasPart relations	533	417	3873	3290
hasParticipant relations	1372	1063	8550	7461

Table 4. Sample top words in GSDMM topics.

Topic No.	Top Topic Words
1	repair hand in cracked crack left side window right
2	out change engine pump universal drive cabin water shaft
3	replace unserviceable and hose in machine inverter auxilliary batteries battery
4	hand mechanical inspection hour right left roller track chain guide
5	on fault brake alarm drag park unserviceable all dash light

Table 5. Sample top words in BERTopic topics.

Topic No.	Top Topic Words
1	light, ignition, dash, switch,
2	fault, transmission, diagnose, repair,
3	tyre, position, tyres, change,
4	conditioner, air, compressor, unserviceable,
5	oil, leak, engine, leaks, text,

Table 6. Comparison of clusters formed using K-means clustering.

	Averaged Word2Vec Embeddings	SBERT Embeddings	Word2Vec BiLSTM Embeddings	SBERT Neural Network Embeddings
Number of clusters	29	47	27	49
Median number of documents per cluster	92	60	95	57
Max number of documents per cluster	266	148	220	182
Percentage of good clusters	41.4%	25.5%	74.1%	61.4%
Percentage of documents in good clusters	29.7%	21.0%	72.6%	74.1%

Table 7. Comparison of clusters formed using average hierarchical clustering.

	Averaged Word2Vec Embeddings	SBERT Embeddings	Word2Vec BiLSTM Embeddings	SBERT Neural Network Embeddings
Number of clusters	23	29	15	41
Median number of documents per cluster	66	20	146	44
Max number of documents per cluster	987	1336	851	349
Percentage of good clusters	60.9%	51.7%	73.3%	56.1%
Percentage of documents in good clusters	25.2%	4.3%	52.5%	67.0%

Table 8. Comparison of clusters formed using Ward hierarchical clustering.

	Averaged Word2Vec Embeddings	SBERT Embeddings	Word2Vec BiLSTM Embeddings	SBERT Neural Network Embeddings
Number of clusters	45	49	31	49
Median number of documents per cluster	61	51	77	52
Max number of documents per cluster	233	263	221	158
Percentage of good clusters	53.3%	34.7%	80.7%	59.2%
Percentage of documents in good clusters	40.4%	22.4%	75.3%	68.6%

Table 9. Sample of clusters formed from Word2Vec Bi-LSTM embeddings and Ward hierarchical clustering.

Cluster Number	Cluster Size	Top Failure Modes	Top Equipment	Top Functions (Number of MWOs)
1	115	cracked crack cracks leak missing	tank fuel tank cabin seat mud bucket seat	storing (70) covering (20) guiding (11) holding (5) protecting (3)
2	33	blown weeping failed	o-ring engine pump o-ring boom hose o-ring transmission hose o-ring air conditioner hose o-ring	covering (31) guiding (2)
4	62	missing worn loose broken need	tooth bucket teeth cutting edges cutting edge	matterprocessing (51) interfacing (4) holding (4) protecting (1) controlling (1)
6	114	crack cracked cracks worn found	exhaust shield rock deflector liner handrail windscreen	protecting (69) controlling (21) interfacing (13) holding (6) covering (2)

Table 10. Sample of clusters formed from SBERT NN embeddings and Ward hierarchical clustering.

Cluster Number	Cluster Size	Top Failure Modes	Top Equipment	Top Functions (Number of MWOs)
1	41	not working unserviceable leaking dropped cell error	pump grease pump battery auto-spray compressor	generating (29) storing (4) protecting (2) matterprocessing (1) covering (1)
4	135	out unserviceable not working fault blown	lights light boom vims keypad light engine air conditioner	emitting (129) protecting (2) guiding (2) driving (1) holding (1)
17	46	fault broken unserviceable needs replacing no charge	accumulator brake battery isolator heater	controlling (17) generating (10) storing (7) emitting (4) protecting (4)
19	42	unserviceable fault not working worn faults	fan drag iov filters pump cooler pump lube system	presenting (13) protecting (9) generating (8) matterprocessing (7) guiding (2)

Table 11. NMI scores across all embedding and clustering methods.

	Averaged Word2Vec Embeddings	SBERT Embeddings	Word2Vec BiLSTM Embeddings	SBERT Neural Network Embeddings
K-means	0.2484	0.1403	0.4814	0.4273
Average Hierarchical	0.2078	0.0643	0.4463	0.4095
Ward Hierarchical	0.2442	0.1337	0.4650	0.4088
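For readers unfamiliar with the metric in Table 11, the following minimal sketch computes NMI between cluster assignments and reference labels (here, inherent functions). The normalisation used for Table 11 is not restated in this part of the article, so the sketch assumes the geometric-mean form from Strehl and Ghosh; the function names and the four-document toy labels are illustrative only.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum((c / n) * log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())


def nmi(a, b):
    """NMI = I(A; B) / sqrt(H(A) * H(B)); assumed normalisation."""
    if len(a) != len(b):
        raise ValueError("labelings must cover the same items")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: a single class or a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / sqrt(h_a * h_b)


# Hypothetical example: four MWOs, two inherent functions.
functions = ["emitting", "emitting", "storing", "storing"]
aligned = nmi(functions, [0, 0, 1, 1])    # clusters match the functions
unrelated = nmi(functions, [0, 1, 0, 1])  # clusters ignore the functions
print(round(aligned, 4), round(unrelated, 4))
```

A score near 1 means the clusters carry the same information as the reference labels, and a score near 0 means they are statistically independent of them.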

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Feng, J.; Lau, A.; Hodkiewicz, M.; Woods, C.; Stewart, M. Semantic and Engineering-Based Embedding for Classification List Development. Mach. Learn. Knowl. Extr. 2026, 8, 61. https://doi.org/10.3390/make8030061

