Efficient Retrieval of Images with Irregular Patterns Using Morphological Image Analysis: Applications to Industrial and Healthcare Datasets

Image retrieval is the process of searching and retrieving images from a datastore based on their visual content and features. Recently, much attention has been directed towards the retrieval of irregular patterns within industrial or healthcare images by extracting features from the images, such as deep features, colour-based features, shape-based features, and local features. This has applications across a spectrum of industries, including fault inspection, disease diagnosis, and maintenance prediction. This paper proposes an image retrieval (ImR) framework to search for images containing similar irregular patterns by extracting a set of morphological features (DefChars) from images. The datasets employed in this paper contain wind turbine blade images with defects, chest computerised tomography scans with COVID-19 infections, heatsink images with defects, and lake ice images. The proposed framework was evaluated with different feature extraction methods (DefChars, resized raw images, local binary patterns, and scale-invariant feature transforms) and distance metrics to determine the most efficient parameters in terms of retrieval performance across datasets. The retrieval results show that the proposed framework using DefChars and the Manhattan distance metric achieves a mean average precision of 80% with a low standard deviation of ±0.09 across classes of irregular patterns, outperforming alternative feature-metric combinations across all datasets. The proposed ImR framework also performed better (by 8.71%) than SuperGlobal, a state-of-the-art deep-learning-based image retrieval approach, across all datasets.


Introduction
Image retrieval (ImR) is the task of searching and analysing images; some applications include face recognition, image search engines, image metadata annotation, object classification, and more. Recently, ImR has been applied to retrieve similar images based on their irregular patterns for fault inspection and disease diagnosis purposes. Irregular pattern analysis can detect industrial defects [1,2,3], chest infections in medical scans [4,5,6], and ice or snow on lakes [7,8], serving industry, healthcare, and environmental monitoring. An accurate ImR system can aid experts (e.g. manufacturing engineers, doctors, quality inspectors, security officers, etc.) during decision-making.
Many research studies have explored the retrieval of images containing irregular patterns in industrial or medical datasets using different features and similarity metrics. Image-based similarity metrics (e.g. mean squared error (MSE), universal image quality index (UIQ) [9], spectral angle mapper (SAM) [10], etc.) [11,12,13,14], which compare the similarity between image data, provide a simple and intuitive means of comparing two images in ImR tasks.
However, the similarity values computed from these metrics are sensitive to image noise and quality. Feature extraction methods can extract the hidden features of the irregular patterns within images and improve retrieval performance. These methods extract local binary pattern (LBP) features [15,16,17,18], scale-invariant feature transform (SIFT) features [19,20,21], as well as colour and shape features [22,23] to conduct retrieval of images with irregular patterns. Distance-based similarity metrics (e.g. Manhattan, Jaccard, Euclidean, Cosine, etc.) can be utilised to compute similarity values between two sets of features extracted from images with irregular patterns. Zhang et al. [24] proposed a set of morphological features, known as defect characteristics (DefChars), to characterise images with irregular patterns in terms of colour, shape, and meta aspects. Zhang et al. [24] successfully utilised the DefChars to reason about the outputs from an artificial-intelligence-based defect detection and classification model. However, using DefChars as features in an ImR task remains an open question. This paper proposes an ImR framework that extracts DefChars from images containing irregular patterns and retrieves images with similar irregular patterns by comparing their DefChars vectors using a feature-based similarity metric. Four datasets are employed in this study: wind turbine blade images with defects, chest computerised tomography (CT) scans with COVID-19 infections, heatsink images with defects, and lake images with ice. The proposed framework is evaluated with different feature-metric combinations: DefChars vectors with feature-based similarity metrics, resized raw images with image-based similarity metrics (MSE, SAM, UIQ), LBP features with feature-based similarity metrics, and SIFT features with the Euclidean metric. The retrieval results demonstrate that the combination of DefChars and the Manhattan metric within the proposed framework consistently achieves the highest mean average precision (mAP) (average 0.80), the lowest standard deviation (average 0.09) across classes of irregular patterns, and a fast retrieval time (average 0.14 seconds per query) across all datasets. Additionally, the retrieval results indicate that using DefChars within the proposed framework yields relatively high and balanced retrieval accuracy across classes despite dataset imbalances or small dataset sizes. The proposed ImR framework could be expanded to various industrial tasks in the future, including irregular pattern identification, classification, deterioration monitoring, repair prediction, and more.
There are six sections in this paper. Section 2 reviews related work concerning similarity metrics and feature extraction methods that can be employed in ImR tasks. Section 3 presents the proposed ImR framework for retrieving irregular patterns in industrial or medical datasets. Section 4 outlines the datasets, relevant feature extraction methods, similarity metrics, and the methodology used in this research, providing insight into the experimental setup. Section 5 evaluates and discusses the retrieval performance and execution time of the various features and similarity metrics explored in the paper. Section 6 summarises the key findings and conclusions drawn from the research; moreover, some future works are briefly pointed out.

Feature Extraction and Relevant Similarity Metrics for Retrieving Images
Recently, an increasing number of researchers have explored how effective feature extraction methods can enhance the performance of ImR. In 2019, Latif et al. [25] provided a comprehensive review of successful feature extraction methods used in content-based image retrieval (CBIR) tasks. There are six major types of features that can be extracted by these methods: colour-based features, shape-based features, texture-based features, spatial features, fusion features, and local features. Colour-based features [26,27,28,29,30,31,32,33] offer fundamental visual information that is similar to human vision, and they are relatively robust against image transformations. Texture-based features [34,35,36,37,38,39,40] capture repeating patterns of local variance in image intensity; these features often hold more semantic meaning than colour-based features, though they can be susceptible to image noise. Shape-based features decode an object's geometrical form into machine-readable values; Latif et al. [25] summarised that shape-based features can encompass contours, vertex angles, edges, polygons, spatial interrelations, moments, scale space, and shape transformations. Spatial features [41,42,43,44,45,46,47] convey the location information of objects within the image space. Fusion features [48,49,50,51,52] combine basic features to form high-dimensional concatenated features; often, principal component analysis is applied to reduce their dimensionality. Local features [53,54,55,56,57,58,59,60] represent distinct structures and patches in an image, providing fine-grained details for ImR tasks.
Seetharaman and Sathiamoorthy [23] applied the Manhattan similarity metric along with colour-based and shape-based features to a medical ImR task. The results demonstrated that their method achieved the highest average retrieval rate of 84.47% with a retrieval time of 2.29 seconds. Petal et al. [22] extracted both colour-based and texture-based features from images and applied them to a CBIR task using various distance measures (e.g. Euclidean, Cosine, Jaccard, Manhattan, etc.). Their approach achieved an impressive accuracy of 87.2% in retrieving similar images. In 2022, Shamna et al. [61] employed the bag-of-visual-words model as spatial features for retrieving medical images. Their method excelled in handling grayscale datasets, achieving an mAP of 69.70%; however, its performance on coloured datasets was less satisfactory.

Recent works to retrieve industrial and medical images with irregular patterns
In 2021, Boudani et al. [15] employed wavelet-based LBP features with the chi-square similarity metric to identify images containing surface defects on hot-rolled steel strips, achieving an mAP@10 score of 0.93; however, performance was not stable between classes. Mo et al. [62] proposed a concentrated hashing method with neighbourhood embedding, utilising a convolutional neural network (CNN) to extract hashing features, for retrieving fabric and textile datasets in industrial applications. Their method outperformed other methods on four fabric datasets with an average mAP of over 90%, but the precision sharply dropped by 35% when retrieving more than 8 images. In 2022, Deep et al. [63] introduced a texture descriptor based on the concept of LBP for conducting ImR tasks on three biomedical datasets. The results showed that their proposed method reached an average precision (AP) rate of 91.5%. However, the AP rate on one of the datasets was much lower than on the others (i.e. 93%, 90%, 46%). Furthermore, their proposed descriptor required longer retrieval times because of its large vector size (3 × 4 × 256). Maintaining a consistent retrieval performance using the same feature and similarity metric across different datasets is challenging in the ImR domain; Boudani et al. [15] and Deep et al. [63] both applied LBP-based methods to ImR tasks, but their mAPs differed across their respective datasets.
Zhang et al. [24] introduced a set of 38 morphological features named DefChars, which capture attributes related to defects in terms of colour, shape, and meta characteristics. This feature set explains defect attributes through visualisations, enhancing interpretability for human experts. Furthermore, Zhang et al. [24] employed DefChars in an artificial intelligence (AI) reasoning task, showcasing their capability in data explanation and reasoning. Despite this advancement, there remains a lack of research focused on evaluating retrieval performance using the DefChars.

The Proposed ImR Framework
The proposed ImR framework consists of two processes: a repository process, which extracts and indexes DefChars from a dataset of images, and a retrieval process, which extracts DefChars from a query image, computes similarities against the datastore, and ranks the retrieved images.

Repository Process
DefChars Extraction module: This module serves as a feature extraction component responsible for generating a DefChars matrix that captures the colour-based, shape-based, and meta-based features of irregular patterns within images. It is important to note that this module can be replaced by raw image data or other feature extraction methods, such as LBP or SIFT. The input to this module consists of a set of images and corresponding annotation matrices. Each image in the input set is required to contain a single irregular pattern. Additionally, the corresponding annotation is represented as a mask-based matrix outlining the irregular pattern's region within the image. This matrix matches the size of the input image, with each value indicating whether the corresponding pixel in the input image falls inside or outside the irregular pattern's region. To prepare the annotation matrices, image annotation tools like VIA [64], Labelme [65], or the drawContours function from the OpenCV package [66] can be employed if the dataset lacks annotation data. Subsequently, the module computes a DefChars vector (size: 38 × 1) for each image to represent the DefChars of the irregular pattern. This is achieved by analysing the pixel values within both the irregular pattern and background regions, as detailed in Table 2. The values in these vectors are then normalised to a range between 0 and 1. Finally, these DefChars vectors are aggregated into a DefChars matrix, which serves as the module's output.
Indexing module: This module associates each input image with its respective DefChars vector during the repository process. It takes the extracted DefChars matrix as input and assigns a unique index to each vector within the matrix. The index of a DefChars vector can assist users in locating the corresponding input image during the retrieval process. The module outputs an indexed DefChars matrix and stores it in a datastore. Additionally, this module can extend its indexing when new images are added to the datastore.
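A minimal sketch of the indexing step, assuming the DefChars matrix is held as a NumPy array; the function names (`build_index`, `add_to_index`) and the dictionary-based datastore are illustrative, not the paper's implementation:

```python
import numpy as np

def build_index(defchars_matrix):
    """Assign a unique integer index to each DefChars vector so that
    the corresponding source image can be located at retrieval time."""
    features = np.asarray(defchars_matrix)
    return {"index": np.arange(len(features)), "features": features}

def add_to_index(store, new_vectors):
    """Extend an existing datastore with newly extracted DefChars vectors."""
    new_vectors = np.asarray(new_vectors)
    store["features"] = np.vstack([store["features"], new_vectors])
    store["index"] = np.arange(len(store["features"]))
    return store
```

The returned `index` array plays the role of the indexed DefChars matrix: each position maps a feature vector back to its input image.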

Retrieval Process
DefChars Extraction module: This module is the same as the one described in Section 3.1; however, its function here is to extract a single DefChars vector from an annotated query image. The input for this module consists of an image containing a single irregular pattern and an annotation matrix showing the region of the irregular pattern within the image. The module generates a single DefChars vector that represents the features of the irregular pattern within the query image.
Similarity Computation module: This module compares the DefChars vector extracted from the query image to each DefChars vector in the datastore using a feature-based similarity metric. In this paper, the Manhattan metric was employed as the feature-based similarity metric, although it can be substituted with any distance-based metric (e.g. Cosine, Jaccard, Euclidean, etc.). The input to this module comprises the datastore (DefChars matrix) and the DefChars vector extracted from the query image. Subsequently, the module computes a set of similarity values using the selected metric to quantify the similarity between each DefChars vector in the datastore and the DefChars vector extracted from the query image. Finally, the set of similarity values is recorded and output in a Similarity Results table.
Ranking module: This module is the last step of the retrieval process; it ranks the retrieved results and presents the retrieved images. The module's input is the Similarity Results table generated by the Similarity Computation module. The Similarity Results table is ranked in order of the computed similarity values, and the module outputs the respective images according to their indices.
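The similarity computation and ranking steps above can be sketched as follows, assuming normalised DefChars vectors stored as NumPy arrays; the function names `manhattan` and `retrieve` are illustrative, not the authors' code:

```python
import numpy as np

def manhattan(u, v):
    """Manhattan (L1) distance between two normalised feature vectors."""
    return np.abs(u - v).sum()

def retrieve(query_vec, datastore, metric=manhattan, top_k=5):
    """Compare the query DefChars vector against every vector in the
    datastore, then rank by ascending distance (most similar first)."""
    query_vec = np.asarray(query_vec, dtype=float)
    distances = np.array([metric(query_vec, np.asarray(v, dtype=float))
                          for v in datastore])   # the Similarity Results
    ranking = np.argsort(distances)              # ascending distance
    return ranking[:top_k], distances[ranking[:top_k]]
```

The returned indices correspond to the datastore's index column, so the ranked images can be located directly from them.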

Datasets
The experiment applied the proposed ImR framework to four distinct datasets: wind turbine blade images with defects, chest CT images with lung infections, heatsink images with defects, and lake images with ice. The wind turbine blade images were collected from [69]; the lake ice images were collected from the work of Prabha et al. [67]. All of these datasets include mask annotations that identify the regions and classes of irregular patterns.
To ensure comprehensive irregular pattern retrieval within these datasets, each irregular pattern was cropped into an individual image based on the boundaries outlined in its respective mask annotation. The distribution of classes within each dataset is outlined in Table 1. Moreover, Figure 2 showcases selected cropped images for each class within each dataset. Table 2 lists the DefChars, their respective value ranges, and descriptions. The outcome of Zhang et al.'s method [24] is a matrix with dimensions 38 × n, where n is the number of defects present within the dataset.
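The mask-based cropping step can be sketched as follows, assuming a binary mask of the same size as the image in which nonzero values mark the irregular pattern's region; `crop_pattern` is an illustrative helper, not the authors' code:

```python
import numpy as np

def crop_pattern(image, mask):
    """Crop one irregular pattern out of `image` using the bounding box
    of its binary mask annotation (nonzero = inside the pattern)."""
    rows = np.any(mask, axis=1)          # rows touched by the pattern
    cols = np.any(mask, axis=0)          # columns touched by the pattern
    r0, r1 = np.where(rows)[0][[0, -1]]  # first and last such row
    c0, c1 = np.where(cols)[0][[0, -1]]  # first and last such column
    return image[r0:r1 + 1, c0:c1 + 1]
```

Applying this per annotated pattern yields the individual cropped images whose class distribution Table 1 summarises.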

Feature Extraction
Scale Invariant Feature Transform: Lowe [70] introduced SIFT, a local feature extraction method that maintains an image's scale invariance. SIFT identifies a set of keypoints and descriptors, capturing distinctive points within images. These keypoints and descriptors can subsequently be employed to compute the similarity between images by comparing them using the Euclidean distance metric. The SIFT extraction method contains four key steps. The first step, scale-space extrema detection, builds a Gaussian pyramid in which images are progressively blurred and downsampled; potential keypoints are identified by analysing the differences between adjacent levels of the pyramid. The second step, keypoint localisation, eliminates low-contrast keypoints, enhancing the quality of the selected keypoints. The third step, orientation assignment, computes dominant orientations for individual keypoints, ensuring invariance to image rotation. The fourth step, keypoint descriptor generation, builds a descriptor vector from the local gradient histograms around each keypoint. Lowe [70] recommended using the Euclidean similarity metric (explained in Section 4.3.2) to determine the similarity between the keypoints and descriptors of two images.
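As an illustration of the Euclidean matching between descriptor sets, the following sketch applies nearest-neighbour search with Lowe's ratio test; the toy arrays stand in for real SIFT descriptors, and `match_descriptors` is a hypothetical helper rather than the implementation used in the experiment:

```python
import numpy as np

def match_descriptors(query_desc, db_desc, ratio=0.75):
    """Match each query descriptor to its nearest database descriptor
    using Euclidean distance, rejecting ambiguous matches with Lowe's
    ratio test. `db_desc` must contain at least two descriptors."""
    matches = []
    for qi, q in enumerate(query_desc):
        dists = np.linalg.norm(db_desc - q, axis=1)  # Euclidean distances
        order = np.argsort(dists)
        best, second = dists[order[0]], dists[order[1]]
        if best < ratio * second:                    # keep unambiguous matches
            matches.append((qi, int(order[0]), float(best)))
    return matches
```

The number (or total distance) of surviving matches can then serve as the similarity score between two images.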
Local Binary Pattern: Ojala et al. [71] introduced a texture-descriptor feature extraction method designed for ImR tasks. The LBP method extracts texture descriptors by comparing the intensity value of each pixel in an image to the intensity values of its neighbouring pixels. The LBP extraction method contains four sequential steps. The first step, neighbourhood definition, identifies the eight neighbouring pixels around each pixel within the image. The second step, binary comparison, calculates a binary value for each neighbouring pixel relative to the centre pixel. The third step, binary pattern generation, concatenates all the binary values into a single vector, following either a clockwise or counter-clockwise order. The last step, decimal representation, converts the generated binary pattern into a decimal number, serving as a representation of the texture feature.
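The four LBP steps can be sketched in NumPy as follows; this is a minimal 8-neighbour variant that skips border pixels for simplicity, not necessarily the exact descriptor used in the cited works:

```python
import numpy as np

def lbp_image(img):
    """Compute the basic 8-neighbour LBP code for each interior pixel
    of a 2-D grayscale image (border pixels are skipped)."""
    # Offsets of the 8 neighbours, in clockwise order from top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    centre = img[1:h - 1, 1:w - 1]
    for bit, (di, dj) in enumerate(offsets):
        neighbour = img[1 + di:h - 1 + di, 1 + dj:w - 1 + dj]
        # Binary comparison, then place the bit in the pattern.
        codes |= (neighbour >= centre).astype(np.uint8) << bit
    return codes

def lbp_histogram(img, bins=256):
    """Normalised histogram of LBP codes, usable as a texture feature vector."""
    hist, _ = np.histogram(lbp_image(img), bins=bins, range=(0, bins))
    return hist / hist.sum()
```

The histogram of codes is what would be fed to the feature-based similarity metrics in the retrieval experiments.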

Similarity Metrics for Image Retrieval
There are two categories of similarity metrics for conducting an ImR task: image-based metrics for raw image data and feature-based metrics for extracted feature data. Image-based metrics compute the similarity or dissimilarity between images by directly comparing the raw image data; MSE, SAM, and UIQ are used in this experiment. Feature-based metrics, on the other hand, assess the similarity or dissimilarity between images by analysing the extracted features; the Euclidean, Cosine, Jaccard, and Manhattan distance metrics are utilised in this experiment. In the descriptions that follow, let X be the query image and Y_p be one of the retrieved images.

Image-based Similarity Metrics
Image similarity metrics play a crucial role in an ImR task by searching for similar images within a database. Traditional image similarity metrics, such as MSE, SAM [10], UIQ [9], and the structural similarity index (SSIM) [12], allow for a direct comparison of pixel value differences between two images using mathematical equations.
Mean Square Error (MSE) calculates the average squared difference in pixel values between two images. A higher MSE value signifies a greater dissimilarity between the two images.

\[ \mathrm{MSE}(X, Y_p) = \frac{1}{HWC} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{C} \big( x(i,j,k) - y(i,j,k) \big)^2 \]

where H represents the height of the image; W represents the width of the image; C represents the number of channels (colour components) in each pixel; x(i, j, k) represents the pixel value of the ith row, jth column, and kth channel in the query image X; and y(i, j, k) represents the corresponding pixel value in the retrieved image Y_p.
Spectral Angle Mapper (SAM) calculates the angular disparity between two spectral signatures within a high-dimensional spectral space. A higher SAM value signifies a greater dissimilarity between the two images. Treating each pixel's channel values as a spectral vector, the mean spectral angle can be written as

\[ \mathrm{SAM}(X, Y_p) = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \cos^{-1}\!\left( \frac{\sum_{k=1}^{C} x(i,j,k)\, y(i,j,k)}{\sqrt{\sum_{k=1}^{C} x(i,j,k)^2}\, \sqrt{\sum_{k=1}^{C} y(i,j,k)^2}} \right) \]

where C represents the number of channels (colour components) in each pixel; H represents the height of the image; W represents the width of the image; x(i, j, k) represents the pixel value of the ith row, jth column, and kth channel in the query image X; and y(i, j, k) represents the corresponding pixel value in the retrieved image Y_p.
Universal Image Quality Index (UIQ) takes into account the similarity between two images based on their correlation, luminance, and contrast. A higher UIQ value signifies a greater similarity between the two images; the maximum possible value of UIQ is 1, indicating that the two images are identical.

\[ \mathrm{UIQ}(X, Y_p) = \frac{1}{C} \sum_{k=1}^{C} \frac{\sigma_{x_k y_k}}{\sigma_{x_k} \sigma_{y_k}} \cdot \frac{2\, \bar{x}_k \bar{y}_k}{\bar{x}_k^2 + \bar{y}_k^2} \cdot \frac{2\, \sigma_{x_k} \sigma_{y_k}}{\sigma_{x_k}^2 + \sigma_{y_k}^2} \]

where x̄_k and ȳ_k are the means of the kth channel's pixel values in the query image X and the retrieved image Y_p, respectively; σ²_{x_k} and σ²_{y_k} are the corresponding variances; σ_{x_k} and σ_{y_k} are the corresponding standard deviations; and σ_{x_k y_k} is the covariance of the kth channel's pixel values between X and Y_p.
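A sketch of the three image-based metrics in NumPy, assuming both images share the same height, width, and channel count; `uiq` is shown for a single channel, and the exact normalisation may differ from the implementations used in the experiment:

```python
import numpy as np

def mse(x, y):
    """Mean squared pixel difference; higher = more dissimilar."""
    return np.mean((x.astype(float) - y.astype(float)) ** 2)

def sam(x, y, eps=1e-12):
    """Mean spectral angle (radians) between per-pixel channel vectors;
    higher = more dissimilar."""
    x, y = x.astype(float), y.astype(float)
    dot = (x * y).sum(axis=-1)
    norms = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return np.arccos(cos).mean()

def uiq(x, y, eps=1e-12):
    """Universal image quality index for a single channel;
    1 means the two images are identical."""
    x, y = x.astype(float).ravel(), y.astype(float).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 4 * cov * mx * my / ((vx + vy) * (mx ** 2 + my ** 2) + eps)
```

For multi-channel images, `uiq` would be averaged over the channels, matching the per-channel formulation above.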

Feature-based Similarity Metrics
Euclidean distance calculates the straight-line distance between two vectors. A higher value of the Euclidean distance signifies a greater dissimilarity between the two images.

\[ d_{\mathrm{Euclidean}}(X, Y_p) = \sqrt{ \sum_{i=1}^{N} (x_i - y_i)^2 } \]

where N represents the number of elements in the feature vector; x_i represents the ith value of the extracted feature vector from the query image X; and y_i represents the ith value of the extracted feature vector from the retrieved image Y_p.
Cosine distance calculates one minus the cosine of the angle between two vectors. A higher value of the cosine distance signifies a greater dissimilarity between the two images.

\[ d_{\mathrm{Cosine}}(X, Y_p) = 1 - \frac{\sum_{i=1}^{N} x_i y_i}{\sqrt{\sum_{i=1}^{N} x_i^2}\, \sqrt{\sum_{i=1}^{N} y_i^2}} \]

where N represents the number of elements in the feature vector; x_i represents the ith value of the extracted feature vector from the query image X; and y_i represents the ith value of the extracted feature vector from the retrieved image Y_p.
Manhattan distance calculates the sum of absolute differences between corresponding elements of two vectors. A larger value of the Manhattan distance signifies that the two images are more dissimilar.

\[ d_{\mathrm{Manhattan}}(X, Y_p) = \sum_{i=1}^{N} |x_i - y_i| \]

where N represents the number of elements in the feature vector; x_i represents the ith value of the extracted feature vector from the query image X; and y_i represents the ith value of the extracted feature vector from the retrieved image Y_p.
Jaccard distance calculates one minus the ratio of the common elements between two feature vectors to the total number of elements present in the vectors. A higher value of the Jaccard distance signifies a greater dissimilarity between the two images in terms of their shared features or values.

\[ d_{\mathrm{Jaccard}}(X, Y_p) = 1 - \frac{|x \cap y|}{|x \cup y|} \]

where x ∩ y represents the intersection of the sets of elements present in the feature vectors of the query image X and the retrieved image Y_p; and x ∪ y represents the union of the sets of elements present in the feature vectors of the query image X and the retrieved image Y_p.
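The four feature-based metrics can be sketched in NumPy as follows; `jaccard_distance` assumes binarised feature vectors, which is one common convention rather than necessarily the one used in the experiment:

```python
import numpy as np

def euclidean(u, v):
    """Straight-line (L2) distance between two feature vectors."""
    return np.sqrt(((u - v) ** 2).sum())

def cosine_distance(u, v, eps=1e-12):
    """One minus the cosine of the angle between the vectors."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

def manhattan(u, v):
    """Sum of absolute element-wise differences (L1 distance)."""
    return np.abs(u - v).sum()

def jaccard_distance(u, v):
    """Jaccard distance between binarised feature vectors."""
    u, v = u.astype(bool), v.astype(bool)
    union = (u | v).sum()
    return 0.0 if union == 0 else 1.0 - (u & v).sum() / union
```

In all four cases, a smaller value means the retrieved image is more similar to the query, so results are ranked in ascending order of distance.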

Evaluation Measure
This section introduces the evaluation measures employed within the scope of an ImR task carried out in this experiment.
In the context of an ImR task, the primary objective is to search for images with similar irregular patterns within a datastore. Precision@K is defined as the ratio of relevant images with irregular patterns correctly retrieved among the top K retrieved images. AP@K calculates the average value of Precision@K across the queries in an irregular pattern class. Subsequently, mAP@K (mean average precision) computes the average of AP@K values across all the irregular pattern classes within the dataset. Furthermore, the standard deviations of AP and mAP were computed to assess the consistency of the retrieval performance across queries and classes, respectively.
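The evaluation measures can be sketched as follows, assuming each query returns a ranked list of the retrieved items' class labels; the function names are illustrative:

```python
import numpy as np

def precision_at_k(ranked_classes, query_class, k):
    """Fraction of the top-K retrieved items that share the query's class."""
    top = ranked_classes[:k]
    return sum(c == query_class for c in top) / k

def ap_at_k(per_query_rankings, query_class, k):
    """AP@K: average of Precision@K over all queries of one class."""
    return np.mean([precision_at_k(r, query_class, k)
                    for r in per_query_rankings])

def map_at_k(class_to_aps):
    """mAP@K: mean of the per-class AP@K values."""
    return np.mean(list(class_to_aps.values()))
```

The standard deviations reported alongside AP and mAP would be computed over the same per-query and per-class values with `np.std`.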

Methodology
The experiment ran separate ImR tasks to evaluate the performance of different features and similarity metrics for each dataset. Three conventional image-based similarity metrics (i.e. MSE, SAM, and UIQ) were employed in instances where the raw image data was utilised. For scenarios involving either DefChars or LBP feature data, the experiment utilised four feature-based similarity metrics (i.e. Euclidean, Cosine, Manhattan, and Jaccard). Notably, in alignment with Lowe's recommendations [70], the Euclidean similarity metric was exclusively utilised for the SIFT feature data.
In the initial step, this experiment resizes the raw images to four distinct sizes (i.e. 100 × 100, 50 × 50, 20 × 20, 8 × 8) with the dual goals of normalisation and acceleration of the retrieval process. Subsequently, the SIFT and LBP features are extracted from these resized images. Notably, the extraction of DefChars necessitates the utilisation of raw images, owing to the potential distortion of mask annotations caused by image resizing. The next step iteratively picks the resized image or the extracted feature vector of each irregular pattern in the dataset as a query and then retrieves the remaining irregular patterns by applying the corresponding similarity metrics; the feature-based similarity metrics are utilised for DefChars, SIFT, and LBP features, and the image-based similarity metrics are utilised for resized raw images. The retrieval criterion for relevant irregular patterns is based on the class of the irregular pattern. Then, the retrieved irregular patterns are ranked according to the computed similarities, and the precision is computed for each query. Ultimately, the experiment calculates the average precision to evaluate the retrieval performance for each class within the dataset; the mean average precision is calculated to assess the overall retrieval performance.
The experiment was conducted on a high-performance computer featuring an AMD Ryzen 9 CPU and 32 GB of RAM. Notably, the utilisation of a GPU is unnecessary for executing the ImR task.

Results & Discussion
This section illustrates the retrieval performance results when using different extracted features and similarity metrics for each dataset. The full retrieval results are presented in Appendices A, B, C, and D. In each appendix, a series of tables provides insight into the average precision with the associated standard deviation for each class within the dataset; additionally, there is a table illustrating the mean average precision with its standard deviation to reflect the overall performance on the dataset. This section contains two subsections to illustrate the evaluations of the retrieval performance. Section 5.1 assesses the retrieval performance using distinct features and similarity metrics for each dataset; this assessment aims to identify noteworthy features and similarity metrics that exhibit impressive performance. Section 5.2 compares the retrieval performance of the remarkable methods mentioned in Section 5.1 for each dataset; moreover, it discusses the time used to extract the features and retrieve the query for each dataset.

Information Retrieval Performances between Different Features, Similarity Metrics and Image Sizes for Each Dataset
This section empirically evaluates the retrieval performance using four different features (i.e. DefChars, resized raw images, LBP, and SIFT) for each dataset. The DefChars-based methods applied the DefChars, extracted from the raw images, to four feature-based similarity metrics (i.e. Cosine, Euclidean, Jaccard, and Manhattan). The image-based methods applied the images at four different sizes (i.e. 8 × 8, 20 × 20, 50 × 50, 100 × 100) to three image-based similarity metrics (i.e. MSE, UIQ, and SAM). The LBP-based methods applied the LBP features, extracted from resized images, to the four feature-based similarity metrics. The SIFT-based methods applied SIFT features, extracted from resized images, to the Euclidean similarity metric. Then, this section discusses the noteworthy similarity metrics and image sizes that achieved outstanding retrieval performance (e.g. highest mAP and lowest standard deviation) when applying different features for each class within the dataset.

Chest CT Dataset
Table 3 shows the mAPs, along with standard deviations, when using different features, similarity metrics, and image sizes on the chest CT dataset; Tables 4, 5, and 6 illustrate the APs with standard deviations for each class within the dataset.

Performance using DefChars-based methods [Proposed]: There were two similarity metrics (i.e. Cosine and Euclidean) that yielded the highest mAP and the lowest standard deviation, averaging 0.85 ± 0.06. In terms of retrieval performance for each class using the Cosine or Euclidean metric, both metrics had the same AP values and standard deviations for class 1. For class 2, the Euclidean metric achieved a slightly higher AP (by 0.01) compared to the Cosine metric at @1 and @20. Conversely, the Cosine metric outperformed the Euclidean metric by 0.01 in AP at @1 and @10 for class 3. Additionally, the Manhattan metric exhibited noteworthy retrieval performance for the chest CT dataset and achieved the highest AP for classes 1 and 2. However, for class 3, the retrieval performance of the Manhattan metric was slightly lower than that of the Cosine and Euclidean metrics, with an average difference of 0.01-0.02 at @5, @10, and @15. This resulted in a higher standard deviation, despite the metrics having the same mean mAP values.

Performance using image-based methods:
The best-performing image-based method was the UIQ metric with a 20 × 20 image size, averaging 0.76 ± 0.17 in terms of mAP. The UIQ metric with 50 × 50 and 100 × 100 image sizes exhibited similar mAP values across the range from @1 to @20, but with higher standard deviations compared to the 20 × 20 image size.
The UIQ metric with a 20 × 20 image size showed a relatively small difference between the maximum and minimum AP across classes (i.e. @1: 0.94 to @20: 0.89 for class 1, and @1: 0.59 to @20: 0.58 for class 3), although it did not always achieve the highest AP.

Performance using LBP-based methods:
The performance of the LBP-based methods exhibited a correlation with the image size. The highest mAP was achieved, averaging 0.30 ± 0.22 between @1 and @20, when using images of size 100 × 100. However, it is worth noting that the retrieval performance of the best-performing LBP method was not consistent across classes.

Performance using SIFT-based methods:
The 20 × 20 image size proved to be the optimal setting for the SIFT-based method, with an average mAP of 0.52 ± 0.31. The standard deviation of the SIFT method was the highest among all methods; consequently, the SIFT method struggled to maintain consistent retrieval performance across all classes within the chest CT dataset.

Heatsink Dataset
Table 7 shows the mAPs, along with standard deviations, when using different features, similarity metrics, and image sizes on the heatsink dataset; Tables 8 and 9 illustrate the APs with standard deviations for each class within the dataset.
Performance using DefChars-based methods [Proposed]: All similarity metrics except the Jaccard metric consistently achieved the highest mAP, with an average of 0.97 ± 0.02. When analysing the performance for each class, the Manhattan metric sometimes exhibited a slightly higher standard deviation (by 0.01) compared to the Cosine and Euclidean metrics, and its AP was sometimes lower by 0.01. As a result, the performance of the Manhattan metric was slightly lower than that of the others, particularly at @10 and @15.
Performance using image-based methods: The average mAP between @1 and @20 reached its peak at 0.88 ± 0.08 when using 100 × 100 images with the UIQ metric, although its mAP@1 was slightly lower at 0.86 compared to the others. Additionally, the MSE metric with an 8 × 8 image size demonstrated relatively high performance, with an average mAP of 0.87. When considering the performance for each class, the UIQ metric with a 100 × 100 image size and the MSE metric with an 8 × 8 image size consistently maintained a high AP across all classes. In contrast, the other methods exhibited fluctuations in AP when applied to different classes.

Performance using LBP-based methods:
The LBP method exhibited consistent mAP and standard deviation across all feature-based similarity metrics. The best-performing LBP method achieved an average mAP of 0.53 ± 0.18 when utilising 8 × 8 images. However, the performance of the LBP-based method varied significantly by class: with 8 × 8 images it outperformed the others by more than 0.43 in AP for class 2, but fell behind the others for class 1, particularly compared to the LBP metric with a 20 × 20 image size.
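As background, the basic 8-neighbour LBP code underlying these features thresholds each neighbour against the centre pixel and packs the results into a byte; this minimal sketch is a generic illustration, not necessarily the exact LBP variant used in the experiments:

```python
def lbp_code(patch):
    # patch: 3x3 list of lists of grayscale values. The code is built by
    # thresholding the 8 neighbours against the centre pixel, clockwise
    # from the top-left, and reading the flags as bits of a byte.
    c = patch[1][1]
    neighbours = [patch[0][0], patch[0][1], patch[0][2],
                  patch[1][2], patch[2][2], patch[2][1],
                  patch[2][0], patch[1][0]]
    code = 0
    for bit, n in enumerate(neighbours):
        if n >= c:
            code |= 1 << bit
    return code
```

A full LBP feature is then a histogram of these per-pixel codes over the image, which is what the feature-based similarity metrics compare.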
Performance using SIFT-based methods: The best retrieval performance was an mAP of 0.54 ± 0.23, obtained when applying 100 × 100 images to the SIFT-based method. The performance of the best-performing SIFT method was not balanced across classes; for instance, the AP ranged from 0.51 at @5 to 0.45 at @20 for class 1, and from 0.59 at @5 to 0.64 at @20 for class 2.

Lake Ice Dataset
Table 10 shows the mAPs, along with standard deviations, when using different features, similarity metrics, and image sizes on the lake ice dataset; Tables 11, 12, 13 and 14 illustrate the APs with standard deviations for each class within the dataset.
Performance using DefChars-based methods [Proposed]: The highest mAP among the DefChars-based methods averaged 0.90 ± 0.07 when using the Manhattan metric. The retrieval performance was also relatively balanced across all classes within the dataset; the AP for all classes exceeded 0.94 at @1 and 0.74 at @20, higher than for the other DefChars-based methods.
Performance using image-based methods: All image-based methods showed similar performance; however, the SAM using 8 × 8 images outperformed the others with the highest mAP (0.86) and the lowest standard deviation (0.13). In terms of performance across classes, there was no significant difference between similarity metrics and image sizes. Nevertheless, the SAM with an 8 × 8 image size achieved an AP 0.04-0.06 higher than the other image-based methods for class 4, resulting in a higher mAP with a lower standard deviation.
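For reference, the SAM metric scores two images by the angle between their flattened pixel vectors, with 0 meaning identical directions; a minimal sketch:

```python
import math

def spectral_angle(v1, v2):
    # Spectral Angle Mapper: the angle (in radians) between two images
    # flattened into non-zero vectors; smaller means more similar.
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))
```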
Performance using LBP-based methods: All LBP-based methods exhibited relatively low mAP values across all classes. The best-performing LBP-based method, using any feature-based similarity metric with 100 × 100 images, only reached an average mAP of 0.26 ± 0.37. It achieved high performance for class 3 (from mAP@1: 0.99 to mAP@20: 0.68); for the other classes, however, the mAP dropped significantly, falling below 0.23 and even reaching 0.00.
Performance using SIFT-based methods: Among the SIFT-based methods, using large images (i.e., 100 × 100) achieved the highest mAP, averaging 0.64 ± 0.15. The performance of the SIFT method with a 100 × 100 image size was relatively balanced across classes, except for class 4, where the AP was 40% lower than for the other classes.

Wind Turbine Blade Dataset
Table 15 shows the mAPs, along with standard deviations, when using different features, similarity metrics, and image sizes on the wind turbine blade dataset; Tables 16, 17, 18 and 19 illustrate the APs with standard deviations for each class within the dataset.

Performance using DefChars-based methods [Proposed]:
In the DefChars-based methods, the Manhattan metric outperformed the others with the highest mAP (0.62) and the lowest standard deviation (0.17). When considering the performance for each class, the retrieval performance using the Manhattan metric generally exceeded the other metrics, except for AP@15 and AP@20 in class 2, and AP@5, AP@10, and AP@15 in class 4.
Performance using image-based methods: Among the image-based methods, two settings (MSE with an 8 × 8 image size and UIQ with a 20 × 20 image size) both achieved, on average, the highest mAP (0.44) and the lowest standard deviation (0.31). These best-performing image-based methods exhibited similar AP values across all classes, although the AP of the MSE metric occasionally exceeded that of the UIQ by 0.03-0.08 at @1 for all classes except class 3.
Performance using LBP-based methods: Any feature-based similarity metric with small images (i.e., 8 × 8) outperformed all other LBP-based methods and achieved an average mAP of 0.26 ± 0.17. However, all LBP-based methods, including the best-performing one, struggled to maintain consistent performance across classes. For instance, the AP of the best-performing method was lower than that of other LBP-based methods between @1 and @20 for classes 1 and 3.

Performance using SIFT-based methods:
The highest mAP, 0.35 ± 0.30, was achieved when utilising the Euclidean metric with a 100 × 100 image size in the SIFT-based method. Across classes, the best-performing method was more accurate than the others for class 3 by over 0.10, although it did not outperform them for the remaining classes.

Overall Performance Comparisons for the Information Retrieval Tasks
This section compares the retrieval performance of the different feature extraction methods, utilising the settings that demonstrated the best results for each dataset. These settings were selected based on the analysis presented in Section 5.1. Additionally, this section investigates the time taken by each method on each dataset in the ImR task.
Figure 3 presents a line chart illustrating the mAP of the best-performing ImR methods, described earlier, across all datasets. The DefChars-based methods consistently outperformed the other methods on all datasets. When using DefChars, the performances of the different similarity metrics were not notably distinct; however, the Manhattan metric stood out as an effective choice because it consistently reached the highest mAP across all datasets. The image-based ImR methods were the second best on all datasets. However, the choice of similarity metric and image size significantly influenced performance on different datasets, and no single image-based setting consistently maintained high retrieval performance. For instance, the UIQ metric performed relatively better with large images on the chest CT and heatsink datasets, while the SAM and MSE metrics achieved higher mAP on the lake ice and wind turbine blade datasets when utilising small images. The performance of the LBP-based method was the worst among the highlighted methods and did not differ significantly with the choice of similarity metric. Furthermore, no discernible pattern emerged regarding the impact of image size on the LBP methods: the chest CT and lake ice datasets exhibited higher mAP values with larger images, whereas the heatsink and wind turbine blade datasets yielded better results with smaller images. The SIFT-based methods exhibited better overall performance than the LBP-based methods, although they fell short of the other methods. Notably, for the SIFT methods, larger image sizes (e.g. 100 × 100) translated to better performance than smaller sizes, except on the chest CT dataset.
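As a reminder of how the reported scores are computed, AP@k averages the precision at each relevant rank within the top k retrieved results, and mAP@k averages AP@k over all queries. The sketch below uses a common definition (the paper's exact formulation may differ); `relevances` is a hypothetical list of 0/1 flags marking whether each retrieved image shares the query's class:

```python
def average_precision_at_k(relevances, k):
    # AP@k for one query: average of precision values at each relevant
    # rank among the top-k results; 0.0 if nothing relevant is retrieved.
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_ap(all_relevances, k):
    # mAP@k: mean of AP@k across all queries.
    return sum(average_precision_at_k(r, k) for r in all_relevances) / len(all_relevances)
```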
Figure 4 provides insight into the standard deviations encountered when calculating the mAP using the best-performing methods. Notably, the DefChars methods generally exhibited the lowest standard deviation across all datasets, indicating relatively stable and reliable performance. In the wind turbine blade dataset, the LBP method displayed standard deviations 0.03-0.05 lower than those of the DefChars method when retrieving more than 5 irregular patterns; however, the mAP achieved by the LBP method was the lowest among all methods. In contrast, the image-based, SIFT-based, and LBP-based methods generally exhibited higher standard deviations, often exceeding those of the DefChars-based methods by more than 0.1 on all datasets. Moreover, these methods demonstrated varying standard deviations across datasets, implying that they might be less consistent when retrieving irregular patterns from different classes within a dataset. As noted in Table 1, the datasets themselves are relatively imbalanced between classes. However, the ImR method utilising DefChars and the Manhattan metric maintained relatively high and balanced accuracy in retrieving similar irregular patterns. Additionally, the performance of the DefChars methods did not deteriorate significantly when applied to a small dataset (i.e. the wind turbine blade dataset).
Figure 5 presents the time required to extract features and retrieve images with similar irregular patterns for each query across the four datasets. Among the best-performing methods, LBP exhibited the shortest total time, taking less than 0.062 seconds per query. On the other hand, UIQ consumed the most time, especially when using 100 × 100 images, resulting in query times exceeding 5 seconds for large datasets such as chest CT, heatsink, and lake ice. For the DefChars-based method, the retrieval time ranged from 0.06 to 0.26 seconds, making it one of the faster approaches except on the wind turbine blade dataset. While the feature extraction time of the DefChars-based method was comparatively longer than that of the other methods, its retrieval times were significantly shorter. The image-based approaches generally required more time for retrieval, although image resizing reduced the feature extraction time. The SIFT-based method needed an additional 0.0003 to 0.0018 seconds to extract the SIFT features after image resizing, but its retrieval time was relatively shorter than that of most image-based methods. Regarding the impact of dataset and image size, a larger dataset or image size typically led to a longer execution time. For instance, the ImR time for the UIQ metric when using 100 × 100 images was over three times longer than when using 20 × 20 images. Likewise, the heatsink dataset, the largest dataset according to Table 1, contained 7007 irregular patterns, resulting in the longest retrieval times among all datasets. Overall, the DefChars-based method was able to complete a relatively fast and accurate ImR task.

Figure 3: Mean Average Precision of the highlighted features, similarity metrics and image sizes. The blue-based, orange-based, green-based, and purple-based lines respectively represent the ImR methods which used the DefChars, image-based features, LBP features, and SIFT features. The depth of the line colour represents different similarity metrics and image sizes.

Figure 4: Standard deviation of the highlighted features, similarity metrics and image sizes. The blue-based, orange-based, green-based, and purple-based lines respectively represent the ImR methods which used the DefChars, image-based features, LBP features, and SIFT features. The depth of the line colour represents different similarity metrics and image sizes.

Figure 5: Image retrieval time of the highlighted features, similarity metrics and image sizes for each dataset. The blue bar represents the average feature extraction time and the orange bar represents the average retrieval time. The text at the end of each bar shows the total time used in retrieving each query.

Table 1: Class distribution of irregular patterns in each dataset; "-" indicates that the dataset contains no such type.

Figure 2: Example images of irregular patterns in each class of every dataset (chest CT: lung area, ground glass opacity, consolidation; heatsink defects: scratch, stain; lake ice: water, ice, snow, clutter; wind turbine blade defects: crack, erosion, void, other).

The defect images were provided by the industrial partner Railston & Co Ltd.; the chest CT images were sourced from Ter-Sarkisov's [68] study; the heatsink defect images were gathered from Yang et al.'s experiment

Table 2: The proposed DefChars.

Method for Image Retrieval

Defect Characteristics: This feature extraction methodology requires images with associated mask-based annotations, which are essential for calculating the DefChar values corresponding to each defect within the dataset. This is particularly important since a single image may encompass multiple defects. Table 2 provides a comprehensive list of the proposed DefChars.

Table 4: Average precision and standard deviation values of class 1. Values with the highest average precision and relatively lowest standard deviation are marked in bold text.

Table 5: Average precision and standard deviation values of class 2. Values with the highest average precision and relatively lowest standard deviation are marked in bold text.

Table 6: Average precision and standard deviation values of class 3. Values with the highest average precision and relatively lowest standard deviation are marked in bold text.

Table 8: Average precision and standard deviation values of class 1. Values with the highest average precision and relatively lowest standard deviation are marked in bold text.

Table 9: Average precision and standard deviation values of class 2. Values with the highest average precision and relatively lowest standard deviation are marked in bold text.

Table 11: Average precision and standard deviation values of class 1. Values with the highest average precision and relatively lowest standard deviation are marked in bold text.

Table 12: Average precision and standard deviation values of class 2. Values with the highest average precision and relatively lowest standard deviation are marked in bold text.