1. Introduction
In the past few years, refurbished mobile phones have gained significant popularity as a cost-effective and sustainable alternative to expensive new smartphones. Hence, the mobile refurbishment industry is growing significantly. The refurbishment process involves the identification of defects in used smartphones, followed by their repair, with the goal of reselling the devices to new customers [
1,
2]. This process ensures that the refurbished devices meet the established quality standards as well as align with customer expectations. In the refurbishment process, the accurate identification and detection of physical and functional defects are considered key components that are essential to maintain quality standards [
3]. This phase involves a detailed inspection procedure to assess the device condition and determine its suitability for resale. However, accurately identifying defects and performing the necessary repair actions requires high-level technical expertise, including a thorough understanding of smartphone components, structural integration and functional performance [
4].
Despite the significant growth of the refurbishment industry, the majority of mobile phone companies still rely on manual defect detection methods. These methods are primarily carried out by human experts, who visually examine devices, identify faults and recommend the necessary repair procedures. However, this human-centred approach introduces variability and subjectivity, which can compromise the consistency and reliability of defect assessment [
5]. Clearly, the accuracy of any assessment data collected will be a function of the assessor’s level of expertise. For instance, novices may fail to identify subtle defects or propose incorrect rework steps or repairs, while even experienced technicians may produce inconsistent results depending on particular biases relating to device model or image quality. These challenges are compounded by the limited availability of labelled data and experts, particularly in SMEs, which makes expert judgement a critical but unstable component of the decision-making process. To mitigate these issues, it is essential to establish standardised training protocols and detailed assessment guidelines. Integrating automated, structured training solutions can enhance the consistency of expertise, reduce the training costs incurred during the upskilling of new technicians, and improve the overall consistency and scalability of defect detection within the refurbishment process.
This study investigates the augmentation of the process of assessing operator responses via Natural Language Processing techniques. Text similarity techniques have been employed in a wide variety of applications including healthcare [
6], finance [
7,
8], education [
9,
10,
11] and the manufacturing industry [
12,
13]. Text similarity techniques provide a powerful tool to compare and categorise textual information and support efficient information retrieval. In manufacturing, these approaches help in quality control, record management, technical documentation and information retrieval, and predictive maintenance [
14].
In manufacturing, there is a vast amount of unstructured textual data in the form of maintenance reports, production logs and operator notes. Text similarity techniques can help analyse these records to identify patterns, match defect descriptions, and suggest corrective actions based on past events [
15].
In the remanufacturing industry, where used products are restored to optimal condition for resale, accurate and consistent defect detection is crucial. Remanufactured items often exhibit varying levels of damage and defects, making the inspection process more challenging and subjective. Visual inspection teams typically consist of personnel with varying levels of expertise, ranging from experienced to novice. The assessment of defective products is not only critical due to their complex nature, but it also guides repair actions to ensure product quality and compliance with industrial standards [
16].
Differences in expertise, terminology used, and interpretation often lead to inconsistent defect identification and classification. Therefore, operator standards are vital to maintain quality, ensure reliability and reduce errors. This work investigates the use of text similarity techniques to compare the defect descriptions provided by experts and operators, identifying differences between them. Such insights can then help design standardised inspection guidelines and tailor the training programmes required to upskill operators, ultimately improving the overall quality control process [
1].
We have applied various short text similarity approaches to compare expert and operator responses, so that different levels of expertise can be modelled effectively within a visual inspection setting. Furthermore, a disagreement analysis was conducted to evaluate the differences between similarity scores assigned by human experts and those predicted by each text similarity model. In addition, we systematically analysed these disagreements to identify and highlight the key factors contributing to the discrepancies. This analysis helps in assessing the expertise level of individual operators and also offers valuable insight for the design of training guidelines.
This manuscript is an extended version of our paper presented at the 2025 Irish Signals and Systems conference [
1]. In this extended version, we have made the following significant additions: (i) incorporated a comprehensive disagreement analysis to identify the key factors contributing to low similarity scores, (ii) introduced a tagging framework to classify disagreement types, offering a more readily interpretable evaluation of model behaviour, (iii) added new experimental results using multiple models with preprocessing to evaluate improvements in performance robustness, and (iv) expanded the analysis with deeper discussions, improved methodology, and additional experimentation. These enhancements provide a more in-depth understanding of a model’s capabilities and limitations in real-world, domain-specific applications. A specific added focus of the work is our investigation of the potential impact of the variability in descriptions provided by small, often heterogeneous, groups of experts with non-uniform levels of expertise. Characterising such differences adds significant practical value in that it can inform the design of training programmes and quality control screening procedures, and suggests new standards for annotation guidelines that will improve consistency and accuracy in defect identification and, ultimately, the profitability of the entire process. Additionally, the work provides insights into how domain expertise influences interpretation and scoring in a more general sense. The work highlights that even a slight difference in phrasing and vocabulary can affect the overall quality assessment process, and the paper provides a new roadmap that can mitigate the effects of such differences in a practical setting.
2. Literature Review
Defect detection is a critical component in quality assurance and process efficiency, particularly where product rework or remanufacture is an integral process step. Rapid advancements in deep learning, particularly in computer vision, have significantly enhanced detection accuracy by improving repeatability while minimising errors and processing times. Once a defect is identified, further decisions regarding repair or disposal can be finalised immediately, hence minimising production costs. Developments in high-resolution image acquisition technologies and AI-based computer vision algorithms utilising deep neural networks have led to substantial automation of defect detection processes across industrial settings. Although AI-based defect detection systems serve as human-assistive tools, successful implementation of such systems requires that human operators are highly trained to work alongside automated cyber-physical systems. This human–machine collaboration is essential to ensure the provision of high-quality inputs that support timely and informed decision-making in dynamic and complex manufacturing environments [
17].
The question of applying deep learning techniques for surface defect detection using computer vision has received much attention in the literature [
18,
19,
20,
21,
22,
23]. However, most of these studies did not focus on the so-called ‘human in the loop’ factors when it comes to the integration of computer vision and image processing techniques in defect detection [
24]. For instance, Wang et al. [
25] proposed a hybrid transformer architecture for defect detection in the steel industry. Their approach integrates a Convolutional Neural Network (CNN) with a vision-based transformer to improve detection performance across multiple scales and to enhance attention to distinct image regions. Another study [
26] employed the EfficientDet model to identify defects in ultrasonic images of steel components. The model was trained to identify and detect the defects in various shapes and aspect ratios, enabling a fully automated defect detection process. In their study, Luo et al. [
27] proposed a memory-attended multi-interference network for defect detection classification at the image level. They evaluated their approach on four industrial datasets, including different textures generated by computers, steel surface, defective production items, and electrical commutators. In [
28], Singh and Desai presented a framework for image-based defect detection using a pre-trained ResNet 101 CNN with multi-class support vector machine classifier. This framework was evaluated on tapered roller defects. Saliency-based defect detection from images was analysed by Bai et al. [
29], where salient regions obtained from test images were compared against the local discrepancy of the same regions on a defect-free template image. Crack detection in materials has also been investigated, with a particular application to road pavements. Deep learning methods for crack detection fall into three broad categories: classification-based, object-detection-based, and segmentation-based [30]. The same authors also reviewed crack detection using 3D data with both traditional and deep learning techniques.
More recently, the question of deep learning modelling techniques being employed by the remanufacturing industry has been considered, where assisted inspection has been used to assess products that are disassembled, inspected, cleaned, reconditioned, or reassembled to maintain a quality level equivalent to that of new products. Nwankpa et al. [31] proposed a deep learning-based inspection framework using a deep convolutional neural network for defect detection on remanufactured mild steel plates, automating the visual inspection process. Another AI-driven approach was proposed by Kaiser et al. [32] to automate the initial visual inspection of returned cores in remanufacturing processes. Since returned cores vary in condition and quality, the proposed system integrates various deep learning techniques such as reinforcement learning and anomaly detection to automate the decision-making process and replace manual inspection with a more consistent and autonomous approach. Because these processes are complex, they require technical expertise and advanced skills for accurate analysis and inspection. In most cases, product inspection relies heavily on operator expertise. The deployment of deep learning models where human intervention can be of variable quality makes it interesting to compare operator and expert performance in such scenarios. However, the lack of experimental validation and full-scale deployment data remains a key limitation of many studies. For example, Saiz et al. [
33] proposed an automated inspection and classification system for remanufactured automotive components. They employed an ensemble learning approach to classify defects as good, rectifiable, or rejectable according to defect size, following given criteria, on a dataset of 660 images.
Although computer-vision-based deep learning techniques have proven to be an efficient solution for defect detection, the human in the loop remains a critical aspect of the manufacturing and remanufacturing industries, particularly in the context of Industry 5.0, where it is regarded as a key pillar emphasising operators’ knowledge, skills, and ability to collaborate with cyber-physical systems on the factory floor [
34].
In parallel with computer vision, Natural Language Processing (NLP) has emerged as another powerful area of Artificial Intelligence. While computer vision has been widely applied to surface defect detection in the manufacturing industry, NLP has primarily been employed for text-based tasks such as predictive maintenance, where it supports decision-making by analysing textual data, maintenance logs and operators’ reports [
35,
36]. The study conducted by Moghaddam et al. [
37] aims to develop an enhanced operator assistance system for a manufacturing environment by utilising NLP techniques to interpret open-ended and incomplete queries. The proposed system enables intent inference and provides relevant support without relying on traditional question-answer pair dependencies. Sheikh et al. [
38] explored AI and text analytics to improve defect identification in Printed Circuit Board (PCB) assembly. They applied NLP techniques such as Latent Class Analysis and Latent Semantic Analysis to generate features from unstructured textual data of operators’ observations from testing processes and employed a Naïve Bayes classification model to categorise the defects using those features. May et al. [
39] proposed a novel NLP pipeline to process and analyse the digitally recorded comments from machine operators describing manufacturing failures. They used vectorisation methods such as Bag-of-words to generate numerical representation of the words and then evaluated the performance of two Gradient Boosting Decision Tree classifiers to predict the severity of machine downtime. In a study [
40], the authors explored the application of NLP in the semiconductor manufacturing industry by using the textual data. They employed SONY’s proprietary NLP engine to analyse the quality issues and extract features of manufacturing equipment by using Bag-of-Ngrams and Chi-square tests. Ansari et al. [
41] developed an AI-driven multifunctional system for the automotive industry that employs a similarity algorithm to match maintenance faults with technician expertise, enabling the selection of the most appropriate maintenance personnel for the task. Cadavid et al. [
42] explored BERT-based language models, specifically CamemBERT and FlauBERT, to process and analyse unstructured maintenance log reports. Moreover, to enhance interpretability, they applied LIME to explain individual model predictions and proposed a method for extracting insights from diverse maintenance records. Their findings demonstrated that the finetuned pre-trained models outperformed traditional feature-based approaches, even with minimal text preprocessing. Another study conducted by Öztürk et al. [
43] utilised SBERT-based word embeddings in combination with manual keyword extraction to examine repair and maintenance documentation. Since these reports are written by different engineers, they often reflect varying interpretations and potential inconsistencies. To address this issue, semantic similarities were measured using cosine similarity, allowing the researchers to cluster similar maintenance events and support the development of optimised maintenance strategies.
The techniques and similar approaches discussed above can support operators, engineers, and technicians in executing tasks more efficiently and reduce tensions across the remanufacturing/rework process. The variability present in data records and written comments poses a challenge for the accurate interpretation of information and real-time decision-making regarding the downstream steps required to accomplish a given task. The existence of humans ‘in the loop’ can help provide contextual and perhaps nuanced interpretations of the data, thereby making the modelling of accurate decision-making even more complex. Based on the above analysis, it is clear that the use of deep learning for the modelling of operator responses is a timely addition so that training can be enhanced in a nascent and rapidly evolving sector such as the mobile refurbishing industry.
As part of a defect analysis process improvement activity in an SME-type recycling operation focussing on mobile phone refurbishment, operators were asked to respond to open-ended, short subjective questions regarding defects in mobile phone screens. Their responses were then compared to expert-provided answers and scored to assess the expertise levels of the operators. The study employed pre-trained language models, including SentenceBERT [
48] and Word2Vec [
44] to generate embeddings to determine the textual similarity with the cosine similarity measure. Initially, a comparative analysis was conducted among semantic-based, context-free, and lexical-based approaches to identify the most effective model. Subsequently, a disagreement analysis was performed to pinpoint the key factors contributing to low similarity scores between operator and expert responses generated by each model. The data for this experiment was collected through a survey (Mobile Defect Detection Survey) specifically designed for operators, aimed at capturing natural language responses related to mobile defect detection. The details of the survey are provided in
Section 3.1 below.
3. Methodology
The experiment was conducted in multiple phases, broadly divided into four component parts: Questionnaire Formulation, Dataset Generation, NLP Analysis, and Disagreement Analysis. As part of a validation study with a local remanufacturing enterprise, the responses were collected in accordance with university ethical standards. All data was anonymised appropriately to maintain GDPR compliance. The questionnaire design phase is discussed in
Section 3.1 and provides comprehensive insights into identifying defect types, selecting appropriate question types (Multiple-Choice-Questions or Open-ended), and formation of the questionnaire for effective data collection. The dataset generation phase discussed in
Section 3.2 involves collecting expert diagnoses of defects from the image data available and collecting operator responses through the survey. The third phase of NLP Analysis is discussed in
Section 3.3, which discusses the subjective assessment aspect of the experiment. The final phase of disagreement analysis is discussed in
Section 3.4, which evaluates the differences between operator and expert responses to identify key areas of divergence and inform targeted improvements in operator training.
3.1. Questionnaire Design
The first stage in this phase involved experts identifying defect types from images of 50 mobile phone screens. The experts evaluated and classified the defects according to their type and severity level, as summarised in
Table 2. In addition, recommended repair actions were determined for each defect category, such as polishing for minor defects and screen replacement for major defects.
The next step involved selecting a diverse set of question types, including objective questions with multiple-choice options and subjective questions requiring operators to respond in natural language. The questionnaire comprised various question types, covering both objective (O) and subjective (S) formats, as detailed below:
- -
Existence (O)—presence or absence of any defect
- -
Counting (O)—number of defects observed
- -
Query Object (S)—type and description of the defect
- -
Location-based (O)—specific area on the mobile screen where the defect is located
- -
Threshold (O)—severity level of the defect, ranging from low to high
- -
Action-based (O)—recommended repair action for the defect
- -
Repairability (O)—degree of repairability from low to high
After defect identification and question selection, a questionnaire was created to gather responses from experts and operators for subjective evaluation. Among the eight questions, seven are objective, with only the fourth being open-ended. Responses to this open-ended question were limited to 40 characters to minimise inappropriate content and to keep responses within the range of short answers.
The questionnaire used in this survey consisted of two main parts. In the first part, all the necessary information, such as defect types, their descriptions and their predefined severity levels, is provided as a guideline for the participating operators. The second part of the survey consisted of eight (subjective and objective) questions associated with each mobile screen image. These images provided for defect identification were also marked to specify the regions on the mobile screen so that accurate locations of the existing defect(s) could also be identified. For example, if a defect appears in the upper region of the screen on the right side, it can be mentioned as defect present at the top-right or upper right area of the screen.
The survey was designed to make it easier for operators to understand the types of defects and to respond more consistently and accurately.
3.2. Dataset Generation
The designed questionnaire was distributed to 8 operators with varying levels of expertise, along with one expert. The expert’s responses to the questions were recorded as ground truth answers for all the mobile phone images. A total of 400 responses were collected from the operators. All responses were recorded in a CSV file. As mentioned above, only Question 4 is open-ended, for which the collected responses are in natural language. Therefore, upon completion of the survey, responses to Question 4 were extracted for subjective scoring, as the other questions are objective and provide limited response options. The extracted operators’ responses were used to compare and find the semantic and syntactic similarity with Ground Truth (expert) responses.
Given below are the expert answers to Q4 (defect description). For this experiment, six distinct expert answers were identified so that the reviewer can identify the semantically closest candidate response that applies to the provided image. This expert answer set is then used as a guideline for operator training.
- 1.
Scratches, blobs on the phone screen
- 2.
Scratch on the front camera area of screen
- 3.
Random pattern scratches on the phone screen
- 4.
Minor hairline crack on the phone screen
- 5.
Long visible crack line on the phone screen
- 6.
Screen shattered from the edge
3.3. NLP Analysis
This section provides an in-depth overview of the preprocessing techniques, semantic and syntactic similarity models, and evaluation metrics utilised in the subjective scoring process.
Figure 1 provides an illustration. Our dataset contains pairs of text responses from experts and operators, which are used by the models to measure similarity. Before feeding these pairs into the models, we apply several preprocessing steps (optional) to clean and standardise the text. Once the text pairs are prepared, the models convert them into numerical representations, allowing similarity scores to be calculated using the selected similarity measures. Once the similarity scores were generated for all the text pairs, we used them to assess the model’s performance through various evaluation metrics. Additionally, these similarity scores were used to conduct disagreement analysis and tagging, to help us better understand the differences between expert and operator responses.
3.3.1. Preprocessing
It is important to normalise the text at the initial stage so that the machine learning model can learn more efficiently. The responses in our dataset are short and composed of domain-specific terminology describing mobile screen defects, with minimal presence of general language. Moreover, some responses contained special characters such as commas, periods, and hyphens. Therefore, we first applied special character removal, followed by case normalisation. Since many deep learning models are case-sensitive, variations like “Defect”, “DEFECT”, “DEFeCt” and “defect” may be treated as different tokens. Therefore, applying case normalisation ensures a consistent interpretation during text processing. Additionally, variations in word forms, particularly plurals like scratches, lines, and cracks, were also observed. To further refine the text, we applied lemmatisation to convert words to their base forms. Unlike stemming, which often generates inaccurate root forms, lemmatisation provides linguistically correct base words and was therefore preferred in this study.
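As a concrete illustration, the following is a minimal sketch of such a preprocessing step, assuming a Python implementation with NLTK’s WordNet lemmatiser (the exact tooling is not prescribed by this study):

```python
# Minimal sketch of the preprocessing pipeline described above (illustrative;
# the study does not prescribe a specific toolkit). Special characters are
# removed, the text is lower-cased, and tokens are lemmatised with NLTK.
import re

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-off download for the lemmatiser
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)                # remove special characters
    text = text.lower()                                        # case normalisation
    tokens = [lemmatizer.lemmatize(t) for t in text.split()]   # plurals -> base forms
    return " ".join(tokens)

print(preprocess("Minor scratches on the Screen, near the top-right corner."))
# -> "minor scratch on the screen near the top right corner"
```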
3.3.2. Model Selection
To measure the similarity between expert and operator answers, we used various models, including sentence-level and word-level deep learning-based models as well as word-level syntactic statistical measures. The deep learning models (SBERT and Word2Vec) first convert the ground truth and operator responses into embeddings (vectors) and measure the cosine similarity between those embeddings, whereas syntactic models (e.g., Dice) use the words shared by the two texts to measure their similarity without any contextual or semantic understanding. A minimal code sketch of the three approaches is provided after the model descriptions below.
- 1.
SBERT (Semantic):
The SBERT framework and the Sentence-Transformers library build on the pre-trained BERT model, using triplet and Siamese network structures to generate semantically meaningful sentence embeddings. These embeddings can then be compared to find similar sentences using the cosine similarity measure. A Siamese neural network consists of two identical, weight-shared neural networks. The outputs of these networks are compared using a metric such as Euclidean distance or cosine similarity. Since each network computes the same function, the weight-sharing property ensures consistent predictions. A sentence transformer takes two sentences, Sentence-A and Sentence-B, as input and produces sentence embedding vectors. Comparing these vectors then provides a similarity score [
49]. There are various pre-trained models available in the SBERT (
https://www.sbert.net/docs/sentence_transformer/pretrained_models.html (accessed on 12 March 2025)) library, which are trained on different corpora for various downstream tasks. The best-performing models for different types of natural language tasks have also been identified and presented. We used the “all-mpnet-base-v2” model, trained on a large and diverse dataset of over one billion training pairs.
- 2.
Word2Vec (Context-free):
This model produces word embeddings that are context-independent, representing each word with a single vector. It operates by taking a single word as input and generating a corresponding output vector, in contrast to BERT, which considers entire sequences to create context-aware embeddings.
- 3.
Dice coefficient (Lexical-based) [
50]:
It is a lexical similarity measure for two strings A and B and is given by:

$$\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$

It measures the word-level overlap between two texts, A and B, by comparing the number of shared tokens relative to the total number of tokens in both texts.
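The sketch below illustrates the three similarity approaches, as referenced above. It assumes Python with the sentence-transformers and gensim libraries; the Word2Vec vectors (“word2vec-google-news-300”) are an illustrative choice, whereas “all-mpnet-base-v2” is the SBERT model used in this study.

```python
# Illustrative sketch of the three similarity approaches: SBERT cosine
# similarity, averaged Word2Vec embeddings, and the Dice coefficient.
import numpy as np
import gensim.downloader as api
from sentence_transformers import SentenceTransformer, util

expert = "minor hairline crack on the phone screen"
operator = "small crack line on the display"

# 1. SBERT: context-aware sentence embeddings compared with cosine similarity.
sbert = SentenceTransformer("all-mpnet-base-v2")
emb = sbert.encode([expert, operator], convert_to_tensor=True)
sbert_score = util.cos_sim(emb[0], emb[1]).item()

# 2. Word2Vec: context-free word vectors averaged into a sentence vector
#    (pre-trained vectors; a large one-off download).
w2v = api.load("word2vec-google-news-300")

def avg_vector(text: str) -> np.ndarray:
    vecs = [w2v[t] for t in text.split() if t in w2v]
    return np.mean(vecs, axis=0)

a, b = avg_vector(expert), avg_vector(operator)
w2v_score = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 3. Dice coefficient: lexical overlap between the two token sets.
ta, tb = set(expert.split()), set(operator.split())
dice_score = 2 * len(ta & tb) / (len(ta) + len(tb))

print(f"SBERT: {sbert_score:.2f}  Word2Vec: {w2v_score:.2f}  Dice: {dice_score:.2f}")
```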
3.3.3. Evaluation Metrics
In order to evaluate the performance of the deployed models and techniques, we used three different evaluation metrics.
Pearson’s Correlation [
51]:
This measure is used to find the correlation between two numerical variables. The assigned values through this method range from −1 to 1, with 1 indicating positive correlation, 0 indicating no correlation, and −1 indicating negative correlation:
$$r = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}}\;\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}}$$

where, for two distributions X and Y, $x_i$ and $y_i$ are the $i$th values of the distributions and $\bar{x}$ and $\bar{y}$ are the mean values of the two distributions, respectively.
Pearson’s correlation is one of the most popular correlation measures for comparing human expert scores with predicted scores for short text similarity tasks.
- 2.
Root Mean Square Error [
52]:
RMSE measures the error between predicted and observed values. The RMSE score is calculated as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}}$$

Here, $\hat{y}_i$ represents the predicted value, while $y_i$ represents the observed value. A lower RMSE value indicates better results.
- 3.
Mean Absolute Error
It is a widely used metric in regression analysis that quantifies the average absolute differences between predicted values and actual observations.
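For consistency with the formulas above, MAE follows its standard definition:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$

where $\hat{y}_i$ and $y_i$ denote the predicted and observed values, respectively. As with RMSE, a lower value indicates better performance.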
3.4. Disagreement Analysis Pipeline
For in-depth analysis of the similarity scores, we additionally conducted a disagreement analysis to identify the key factors contributing to low agreement between the scores assigned by the human expert and those generated by each of the employed models. As a first step, we used a difference-based strategy to evaluate the alignment between expert-assigned and model-predicted scores. Specifically, we calculated the absolute difference between the two scores and categorised it into three agreement levels: strong agreement, moderate agreement, and low agreement.
In addition to score-based disagreement analysis, we further implemented a structured disagreement tagging scheme to better examine the underlying reasons for the differences between expert and model scores and to systematically map those mismatch patterns. Each response pair with moderate or low agreement was further examined, and a set of predefined rules was applied to categorise the nature of the disagreement, such as defect type mismatch, severity mismatch, terminology mismatch, or location mismatch. For example, if the expert and the operator identified different defect types, the pair was tagged as a defect type mismatch. These tags not only enabled us to identify the semantic aspects contributing to disagreement but also offered detailed insights into the models’ limitations. Moreover, they revealed opportunities to improve the models’ understanding of domain-specific language nuances. This multi-level analysis complements quantitative scoring by providing explainable insights into model behaviour.
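A minimal sketch of this difference-based categorisation and rule-based tagging is given below. The agreement thresholds (0.1 and 0.3) and the keyword vocabularies are illustrative assumptions, not values reported in this study.

```python
# Illustrative sketch of the disagreement analysis pipeline: score binning
# followed by rule-based tagging of moderate/low agreement pairs.
DEFECT_TERMS = {"scratch", "crack", "blob", "shatter"}                     # assumed vocabularies
SEVERITY_TERMS = {"minor", "major", "deep", "hairline", "long", "little"}
LOCATION_TERMS = {"top", "bottom", "left", "right", "corner", "edge", "camera"}

def agreement_level(expert_score: float, model_score: float) -> str:
    diff = abs(expert_score - model_score)        # absolute score difference
    if diff <= 0.1:
        return "strong"
    return "moderate" if diff <= 0.3 else "low"

def disagreement_tags(expert_ans: str, operator_ans: str) -> list[str]:
    """Rule-based tags for pairs with moderate or low agreement."""
    e, o = set(expert_ans.lower().split()), set(operator_ans.lower().split())
    tags = []
    if (e & DEFECT_TERMS) != (o & DEFECT_TERMS):
        tags.append("D")   # defect type mismatch
    if (e & SEVERITY_TERMS) != (o & SEVERITY_TERMS):
        tags.append("S")   # severity mismatch
    if (e & LOCATION_TERMS) != (o & LOCATION_TERMS):
        tags.append("L")   # location mismatch
    return tags

if agreement_level(1.0, 0.62) != "strong":
    print(disagreement_tags("minor scratch on the left upper corner",
                            "scratch on the screen"))   # -> ['S', 'L']
```

In the study, the tag set also covers terminology and screen-related mismatches; the same pattern extends directly to those categories.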
4. Experimentation
Figure 2 demonstrates the experimental setup for the subjective analysis with all the steps followed for training, testing, and performance evaluation.
We initially prepared a dataset comprising both expert and operator responses for 50 mobile phones. The dataset contains a total of 400 operator responses (eight operators, 50 phones each) and 50 expert responses, one for each mobile phone. In the first phase, the expert evaluated the operator responses and assigned each a score ranging from 0 to 1 according to its level of correctness, indicating its similarity with the expert answer. Here, a score of 0 indicates complete dissimilarity, while 1 indicates full similarity between expert and operator answers. A few samples from the dataset are provided in
Table 3 below.
In
Table 3, the column named “
pid” represents the mobile phone IDs, while “
expert_ans” contains the answers provided by the expert. The column named “
answers” includes the corresponding responses provided by operators, and the column named “
score” shows the similarity scores assigned by a human expert. It is evident that operators often use different terminologies to describe the same defect, which can significantly impact the consistency and efficiency of the repair process. This variation highlights the critical need for subjective assessment to ensure more accurate and effective task execution.
Figure 3 provides an overview of the similarity score distribution across operator responses. It is important to note that the majority of operator responses received a perfect score of 1.0, based on the human expert’s semantic understanding, despite variations in terminology or the use of short phrases that did not explicitly contain the exact information mentioned in the expert responses. However, responses that received lower scores are particularly important, as they may reveal key factors and missing information. Further analysis of these responses can offer valuable insights for designing targeted training guidelines and supporting the upskilling of operators for more consistent and accurate defect identification.
Subsequently, we applied the preprocessing techniques described in
Section 3.3.1, followed by the use of embedding models to generate sentence representations. For the SBERT model, we adopted two approaches: one using a pre-trained version without finetuning, and the other using a finetuned model to determine the cosine similarity between operator and expert responses. We finetuned the model with 75% of the responses, and the remaining 25% were used as a test set. The purpose of finetuning is to make the model learn the domain-specific vocabulary. During training, a batch size of 16 was used with a learning rate of 2 × 10⁻⁵, and the model was trained for 5 epochs. The trained model was then used to measure the cosine similarity between the semantically meaningful generated embeddings. Finally, a linear regression model was trained to evaluate the performance of the SBERT embeddings, with RMSE, MAE, and Pearson’s correlation used as evaluation metrics. In addition to this semantic analysis, we also employed syntactic similarity measures, which are context-independent and rely solely on factors such as word overlap and text length to assess similarity between the operator and expert responses.
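A minimal sketch of this finetuning and scoring step is given below, assuming the sentence-transformers fit API with a cosine-similarity loss. The file name and warmup setting are hypothetical, while the column names follow Table 3.

```python
# Illustrative sketch of SBERT finetuning on expert/operator response pairs
# labelled with expert-assigned similarity scores (75/25 train/test split).
import pandas as pd
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

df = pd.read_csv("operator_responses.csv")            # hypothetical file name
train_df = df.sample(frac=0.75, random_state=42)
test_df = df.drop(train_df.index)

train_examples = [
    InputExample(texts=[r["expert_ans"], r["answers"]], label=float(r["score"]))
    for _, r in train_df.iterrows()
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

model = SentenceTransformer("all-mpnet-base-v2")
model.fit(
    train_objectives=[(train_loader, losses.CosineSimilarityLoss(model))],
    epochs=5,
    optimizer_params={"lr": 2e-5},
    warmup_steps=10,                                   # assumed value
)

# Cosine similarity between expert and operator embeddings on the test set.
emb_e = model.encode(test_df["expert_ans"].tolist(), convert_to_tensor=True)
emb_o = model.encode(test_df["answers"].tolist(), convert_to_tensor=True)
predicted = util.cos_sim(emb_e, emb_o).diagonal().cpu().numpy()
```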
5. Results and Discussion
In this section, we discuss the results obtained by all the models, along with the agreement-level analysis. For all the models, text similarity scores were computed for both preprocessed and raw responses. Pearson’s correlation was used to assess how well the predicted scores align with expert-assigned scores; higher values indicate better alignment. In contrast, RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) measure the prediction error, where lower values indicate better model performance. These scores are used to identify the most reliable model for capturing semantic similarity in subjective responses. In addition to the evaluation of model performance, a disagreement analysis was conducted to investigate cases where model predictions diverged from human judgement. This analysis involved tagging moderate and low agreement instances with specific mismatch categories, offering deeper insight into the sources of semantic variability and the models’ limitations.
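As an illustration, the three metrics can be computed from paired expert-assigned and model-predicted scores as follows (the score values shown are hypothetical):

```python
# Illustrative computation of Pearson's correlation, RMSE, and MAE between
# expert-assigned and model-predicted similarity scores.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error

expert_scores = np.array([1.0, 0.8, 0.5, 1.0, 0.3])        # hypothetical values
predicted_scores = np.array([0.95, 0.70, 0.55, 0.90, 0.40])

pearson_r, _ = pearsonr(expert_scores, predicted_scores)
rmse = np.sqrt(mean_squared_error(expert_scores, predicted_scores))
mae = mean_absolute_error(expert_scores, predicted_scores)

print(f"Pearson r: {pearson_r:.3f}, RMSE: {rmse:.3f}, MAE: {mae:.3f}")
```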
In the text similarity task, a piece of text is compared with the reference or model text to predict the similarity score. This text similarity could be syntactical, where the number of common words in both texts determines the level of similarity. In such techniques, factors like text length, word order, and the presence of common words contribute to the similarity score. Furthermore, context-free approaches such as Word2Vec focus on capturing the semantic meaning of words based on their distributional context, rather than their position or frequency in the text. These methods typically do not account for text length, word order, or exact token overlap. In contrast, advanced approaches like BERT or SBERT capture the semantic meaning of sentences utilising context-aware embeddings regardless of the sentence length or common words.
To compare expert and operator responses using the SBERT model, we conducted two experiments: one utilising the pre-trained SBERT model and the other using a finetuned version of SBERT to evaluate cosine similarity scores. For finetuning, 75% of the response data was used for training, while the remaining 25% served as the testing set. In addition to SBERT, we also employed the Word2Vec model to analyse similarity at the word level, providing a complementary perspective on model performance. Furthermore, we incorporated the Dice similarity measure, a syntactic and surface-level technique that assesses textual overlap without considering semantic meaning. The objective of applying these diverse techniques was to identify the most suitable model for our application by comparing their effectiveness in capturing similarity across different linguistic dimensions.
Additionally, in our experiments, we initially evaluated the models using the raw data, applying only case normalisation. As preprocessing techniques are crucial for NLP models in order to optimise the performance, we subsequently prepared the dataset by applying the full range of preprocessing techniques described earlier. This allowed us to assess the impact of these techniques on model effectiveness and overall performance. Later, the results were compared for both approaches as presented in
Figure 4 and
Figure 5.
It can be observed in the above figures that this preprocessing significantly improves the performance of each model by systematically reducing noise, such as irrelevant characters, which often affects model performance. By normalising the input text through case normalisation, lemmatisation, and removal of stop words, the models were able to better focus on the core content of the data.
Furthermore, both figures show that the finetuned SBERT model outperformed the other models, largely because it was trained with domain-specific vocabulary. This highlights the significance of incorporating domain-specific language during training, as opposed to the pre-trained model, which lacks familiarity with specialised terminology. Moreover, these results emphasise the importance of finetuning models on domain-specific datasets, where larger and more diverse datasets generally lead to improved performance. Also, restricting response length is important to filter out unnecessary words, such as stop words, which do not add meaning but increase computational overhead. In our survey, the operator and expert responses consisted of short phrases with a limit of no more than ten words per response. The presence of a domain-specific and limited vocabulary in the responses also played a significant role in the model’s performance, as it reduces variability in the responses, thereby enhancing the model’s learning efficiency.
In addition to SBERT, we experimented with other models as described in
Section 3.3.2. However, because the Word2Vec model generates word-level embeddings and aggregates them to form the sentence-level representation, it lacks contextual understanding, making it less favourable. Moreover, we assessed sentence similarity using the Dice similarity measure, a syntactic technique that evaluates similarity based on both the structural arrangement and the shared words between responses. It is important to mention that, even though text pairs are semantically similar, if they do not have common words, these measures result in low similarity scores as shown in
Table 4.
It can be observed that although the responses are semantically similar, both models yielded low similarity scores. This strongly supports our point about the absence of contextual understanding and limited vocabulary in these methods. Additionally, it is important to note that Dice similarity scores tend to be higher for sentences that share common words with an expert response, as seen in the table above.
Figure 6 and
Figure 7 show the alignment between the assigned and predicted scores for raw and preprocessed input responses, respectively. In both experimental scenarios, all four models demonstrated satisfactorily distinctive patterns for text-pair similarity detection. However, the SBERT model outperformed the other two approaches even with pre-trained features, achieving predictions closer to those of the human expert. Moreover, the finetuning approach helped the model incorporate domain-specific vocabulary as well as deeper semantic understanding, resulting in further improved results for the SBERT model.
Moreover, the Word2Vec and Dice similarity models performed at a similar level to each other across both preprocessed and raw input settings. Due to the domain-specific nature of manufacturing-related tasks, where vocabulary is often limited and context plays a critical role, semantic understanding is essential for executing the similarity task efficiently. This is especially true in scenarios where textual responses are brief and contain a limited vocabulary set. Owing to their limited ability to interpret semantic meaning, both Word2Vec and Dice similarity prove less suitable for such tasks. In addition, preprocessing techniques also had a noticeable impact on the performance of these models, as illustrated in
Figure 6 and
Figure 7. The standardisation of the text through preprocessing improved results by resolving variations such as singular versus plural forms. For instance, the Dice measure, which initially treated singular and plural forms as different words, performed better after normalisation. However, this reliance on exact word matches highlights a fundamental limitation of the Dice measure, reducing its effectiveness in a real-world application like this one where semantic nuance is crucial.
In addition to measuring the semantic similarity between responses, it is also important to note the high-level inconsistencies in operator responses where defects are identified incorrectly, as shown by the examples in
Table 5.
For an accurate and effective decision-making process in quality assurance, it is essential that operators possess the ability to detect even the smallest defects. The identification of minor defects plays a crucial role in maintaining high standards and ensuring that products meet the required specifications. As shown in
Table 5, several operator responses either incorrectly identified the defect or failed to detect it altogether. Such inconsistencies in operator assessment can have a significant impact on the overall quality assurance process. When defects are overlooked or misclassified at the identification stage, it not only compromises product quality but also introduces challenges in subsequent stages, such as repair and rework. Incorrect defect detection may lead to inadequate and unnecessary repairs and increased production costs.
An important feature of this study is that the SME in question now has clear training criteria as it seeks to minimise variation in operator responses and enhance performance in defect recognition. Specialised training programmes, clearly defined guidelines, and the provision of standardised terminologies can be introduced on the basis of these similarity scores to significantly enhance the consistency and accuracy of defect identification by operators. Such a case study will be considered in a follow-on project after the successful conclusion of this pilot. While this framework has been shown to be efficient, it also exhibits certain practical limitations. This is especially the case when it comes to a single point of failure, i.e., relying on a single expert’s judgement during the assessment. Relying on a single expert to collect ground truth similarity scores may introduce bias, which can affect the model’s performance during the training phase. Moreover, annotations from a single expert may be unreliable, as seen in this case, where a few mobile phone defects were found upon review to have been missed by the expert but correctly identified by operators. Such observations motivate the introduction of a vector-based scoring system in which operator responses are compared with multiple expert answers. Clearly, incorporating multiple experts will improve the assessment of operator responses only if there is a reliable optimisation procedure for weighting the ground truth answers; such weights can be determined using inter-rater reliability measures and an appropriate ‖v‖₁-norm optimisation. However, involving multiple experts adds extra expense, and recruiting them was beyond the scope of this pilot study.
Disagreement Analysis
As part of the evaluation process, a disagreement analysis was performed to identify discrepancies between expert-assigned and model-predicted similarity scores. Score alignment was assessed by calculating the absolute difference between the two scores, which was then categorised into strong, moderate, or low agreement levels. Qualitative tags were applied to low or moderate agreement cases to highlight specific factors, including defect type, severity, terminology, or location mismatches. This integrated approach provides deeper insight into model limitations and domain-specific challenges.
Figure 8 shows the comparative performance of each model for agreement level analysis.
The results presented in the above figure clearly show that the finetuned SBERT model exhibits superior performance, with a large number of responses falling under the “strong agreement” category, indicating high alignment with the expert judgement. These results further endorse the domain-specific finetuning approach, particularly where technical expertise is essential. Moreover, finetuning enhances the model’s capacity for semantic understanding and improves its ability to capture nuanced variations in human language specific to the task. On the other hand, the pre-trained SBERT model shows a more distributed pattern across all three levels of agreement, indicating inconsistent performance in capturing the intended semantic equivalence. Conversely, the other two models, Word2Vec and Dice similarity, show a noticeable decline in performance, with a high number of low-agreement responses and relatively low counts of strong-agreement responses. Both models underperform owing to their reliance on surface-level similarity measurements and their inability to account for semantic meaning.
To gain further insights into the observed discrepancies, we also implemented a disagreement tagging framework. Each disagreed response was assigned a disagreement tag indicating the underlying causes of disagreement. These tags were derived from domain-related categories, including defect types, severity indicators, screen-related terms and defect location indicators. A list of terms under each tagging category is provided in
Table 6. This tagging process enabled a deeper analysis of error patterns and provided insight into how and why the model struggled.
These tags were applied to responses that fell into the categories of low or moderate agreement between model-generated and expert-assigned similarity scores. Instances of strong agreement were excluded from tagging, as they were assumed to reflect complete alignment between the operator and expert responses, indicating no significant mismatches. The purpose of tagging was to identify and analyse specific factors contributing to disagreement and reduced model performance.
For example, if an operator response misses a defect type that is present in the expert response, then a
defect type tag is assigned. This approach highlights the absence of a key linguistic feature contributing to the disagreement. Similarly, if a response differs from the expert annotation in its use or omission of a severity descriptor or screen-related term, the corresponding
severity or
screen tag is applied.
Figure 9 shows the categorisation of responses tagged in one or more categories based on the missing attributes in the operator responses. To better contextualise the tagging approach, let us consider an example where the expert response is “
minor scratch on the left upper corner of the display”, while the corresponding operator response is “
scratch on the screen”. In this case, the
severity and
location tags would be applied to indicate the missing details. Since very critical information is missing from the operator response, an additional inspection step would be required during the repair decision process to determine the severity and exact location of the defect, ensuring appropriate and accurate repair actions. This tagging mechanism helps to mitigate this challenge and enables the visualisation and analysis of missing or inconsistent information, thereby informing the design of effective operator training and better preparing operators for accurate defect identification.
Figure 9 illustrates the frequency distribution of disagreement tags across the four models, revealing the possible underlying reason(s) for disagreement in responses with low and moderate agreement levels. Among these models, the finetuned SBERT model showed the most balanced performance, demonstrating a fair distribution across all disagreement categories. However, it is important to note that this model produces a relatively low number of low and moderate similarity scores due to its deeper semantic understanding. As a result, disagreement tags assigned to the responses for the finetuned model arise not because the sentences are dissimilar in wording, but because critical contextual content and key elements are absent. For example, when the severity level of a defect is missing or differs between the expert and operator responses, the tag “S” indicates the missing element. Similarly, if the defect is incorrectly identified, for instance, when an operator response mentions “
crack line” to the corresponding expert response “
long and short crack line minor scratch”. In this case, the operator response is missing not only a second defect type but also the severity of the defect; therefore, the assigned tags include “S” and “D”. In contrast, for the pre-trained SBERT model, the number of responses in the “S, D” category is substantially high, indicating that the model can detect partial similarities but fails to achieve complete alignment with the expert responses. For example, an operator described the defect as “
minor crack” to the expert response “
random multiple scratch”; the reasons identified in the form of tags are “D, S”. Moreover, this model also displays frequent disagreement for the category “D, S, S_T”, highlighting its lack of interpretation of terms related to defect type, severity or screen-related terminology without further training. For example, the operator response is “
multiple scratch” to the expert response “
minor scratch line and random scratch”. Although both are semantically similar, the model inferred a low similarity score; hence, the assigned disagreement category is “S”. Additionally, for the expert response “deep
crack and shatter”, the operator response is “
major crack”; again, although both responses are largely similar semantically, the disagreement tags are “D, S”, based on the missing information and the semantic misunderstanding of the terms.
The Word2Vec model demonstrated high disagreement in the categories “S” and “S, D”, indicating that the model struggles to accurately interpret the semantic context related to severity levels and defect types. Finally, the Dice measure, which is a purely lexical model, also recorded high frequencies in categories similar to those of the Word2Vec model. For example, for the operator response “a little scratch” and the corresponding expert response “minor scratch”, the two responses share only a single word, while “a little” and “minor” are not lexically similar, so the Dice measure produced a very low score. This is due to several factors: the measure lacks semantic understanding, considers only the words common to both responses, and is sensitive to sentence length. These factors affect the overall results.
The disagreement analysis was conducted to evaluate the performance of the four models by comparing their similarity scoring against human annotation. With this analysis, we aimed to systematically examine why each model diverges from human judgement in each instance. The evaluation demonstrates that the finetuned SBERT model exhibited strong performance by consistently capturing the contextual meaning and domain-specific vocabulary. The disagreement tags assigned to its low and moderate similarity responses often reflected key omissions or mismatches in content, including an incorrect defect type or severity level. This signifies the importance of key information that may affect not only the interpretation but also the decision-making step. On the other hand, the pre-trained SBERT results showed that the model lacks the domain-specific understanding of terms and vocabulary needed to achieve full alignment. In contrast, Word2Vec and Dice similarity resulted in a higher number of disagreements. These models failed to recognise synonyms or semantically related responses, resulting in mismatches even for responses that differed only slightly in wording from the expert responses.
Ultimately, this analysis demonstrates the need for domain-specific finetuning so that models can be deployed more generally with confidence. Using finetuned models, together with domain-specific questions, to accurately interpret operator responses is of paramount importance. The disagreement analysis presented here provides valuable insights into each model’s reliability and trustworthiness, while also identifying areas for potential improvement ahead of general deployment after the pilot study.
6. Conclusions
This study explored the application of NLP techniques for evaluating operator responses in the context of mobile defect identification, with a particular focus on enhancing training processes in remanufacturing settings. By comparing a range of models, including finetuned and pre-trained SBERT, Word2Vec, and lexical approaches such as Dice similarity, we demonstrated that context-aware, domain-specific models significantly outperform generic counterparts in capturing the variable, often nuanced language used by operators.
A key contribution of this study is the proposal of a disagreement analysis and tagging framework that extends beyond traditional similarity metrics. While existing models often depend solely on numerical evaluations such as cosine similarity, our method offers interpretability by identifying specific sources of disagreement, including defect type, severity level, terminology, and defect location on the screen. This deeper linguistic insight helps bridge the gap between expert and model reasoning, providing a template for customised operator training. This framework, supported by manually curated domain vocabularies and rule-based tagging logic, adds methodological depth and underlines the importance of domain knowledge in NLP tasks.
Model transparency has been enhanced, and a new diagnostic tool has been developed that identifies weaknesses in model comprehension and can be deployed with confidence by SME-based experts. By indicating where, and also why, disagreements occur, the framework facilitates targeted improvements in model performance and directly supports the design of effective training programmes for an SME that needs to deploy AI as a cost-effective human-assistive tool in this demanding environment. The framework is also scalable and adaptable, making it applicable across various size-constrained datasets and manufacturing domains where expert interpretation is required in a cost-effective manner.
To conclude, this study demonstrates that context-aware, finetuned models are essential for achieving better performance and trust in specialised inspection-type applications where the cost of preparing data for training purposes can often be prohibitive. The qualitative insights provided by the proposed disagreement tagging not only inform better model evaluation but also serve as a foundation for improving human–AI collaboration in industrial training and decision-making processes.
Despite these key contributions, the limitations of this study should not be glossed over. The defect dataset used in this study is necessarily small and domain-specific due to the cost of data collection, which clearly affects the generalisability of the framework. Moreover, there is a tension at the heart of the analysis: incorporating expert input from various sources, for both ground truth responses and domain-specific vocabulary construction, can limit scalability. This must be balanced against the risk, often encountered by SMEs, of relying on a limited pool of experts for similarity scoring, which can introduce subjective bias and reduce the reliability of the evaluations. Such constraints are an everyday occurrence for SMEs that wish to employ AI as an assistive tool to increase productivity. In addition, this study primarily relied on well-established semantic similarity models with open-source implementations to ensure reproducibility and ease of integration. However, this choice may limit performance, as more recent models were not explored. Future work will focus on how best to expand the dataset through an evaluation procedure that can accurately incorporate more industry expert feedback ‘on the fly’ using suitably weighted vector-space optimisation. Additionally, developing more refined and structured questionnaires in collaboration with industry partners will help strengthen both the framework and the robustness of the overall training and quality assurance process. Furthermore, future research will also focus on the deployment of newer semantic similarity models, which may enhance the effectiveness of operator response comparison and improve the overall performance of the proposed automation pipeline.