Article

A Comparative Study of Image Processing and Machine Learning Methods for Classification of Rail Welding Defects

by Mohale Emmanuel Molefe 1, Jules Raymond Tapamo 1,* and Siboniso Sithembiso Vilakazi 2
1 School of Engineering, University of KwaZulu-Natal, Durban 4041, South Africa
2 Transnet Freight Rail, Johannesburg 2193, South Africa
* Author to whom correspondence should be addressed.
J. Sens. Actuator Netw. 2025, 14(3), 58; https://doi.org/10.3390/jsan14030058
Submission received: 7 April 2025 / Revised: 13 May 2025 / Accepted: 22 May 2025 / Published: 29 May 2025
(This article belongs to the Special Issue AI-Assisted Machine-Environment Interaction)

Abstract: Defects formed during the thermite welding of two rail sections require the welded joint to be inspected for quality, and the most widely used non-destructive inspection method is radiography testing. However, the conventional defect investigation process from the obtained radiography images is costly, lengthy, and subjective, as it is conducted manually by trained experts. Additionally, it has been shown that most rail breaks occur due to a crack initiated from a weld joint defect that was either misclassified or undetected. To improve the condition monitoring of rails, the railway industry requires an automated defect investigation system capable of detecting and classifying defects automatically. Therefore, this work proposes a method based on image processing and machine learning techniques for the automated investigation of defects. Histogram Equalization methods are first applied to improve image quality. Then, the extraction of the weld joint from the image background is achieved using the Chan–Vese Active Contour Model. A comparative investigation is carried out between Deep Convolution Neural Networks, Local Binary Pattern extractors, and Bag of Visual Words methods (with the Speeded-Up Robust Features extractor) for extracting features in weld joint images. Classification of features extracted by local feature extractors is achieved using Support Vector Machines, K-Nearest Neighbor, and Naive Bayes classifiers. The highest classification accuracy of 95% is achieved by the Deep Convolution Neural Network model. A Graphical User Interface is provided for the onsite investigation of defects.

1. Introduction

Throughout history, railway transportation has played a significant role in the development of the world economy. It has continued to play a vital role in the transport of heavy freight and passengers at a lower cost than other modes of transportation. It has also enabled the growth of sectors, such as agriculture and mining. Railway infrastructure is complex and requires the interaction of various engineering disciplines. It comprises overhead traction equipment, rolling stock, earthworks, bridges, and the track structure. The track structure forms the foundation of the railway infrastructure system and comprises pairs of rails positioned on top of sleepers embedded in ballast. The track structure aims to ensure safe train transportation by providing a guideway for wheels and absorbing the dynamic loading caused by the train motion.
In contrast to other components of the track structure, the rails are in direct contact with the train wheels. Therefore, they require more maintenance. Additionally, manufacturing rails in sections means they must be joined together during installation to form a continuous railway line. Some known joining methods include bolting the rails with fish plates and welding. However, welding is preferred as it allows permanent bonding of rail sections. A commonly used welding method is thermite welding. Thermite welding provides an easy and cost-effective way of joining two rail sections permanently. It uses molten iron produced from a chemical reaction between aluminum and iron oxide to join the rails, forming the weld joint. After welding, the weld joint is inspected for quality assurance purposes. Weld joint inspection is performed using non-destructive testing (NDT) methods such as acoustic emission, eddy current, ultrasonic testing, and radiography testing. However, radiography testing (RT) is favored as it allows radiography experts to identify various defect types and make decisions based on applicable radiography standards.
After inspecting the weld joint using RT, the obtained radiography images are manually investigated by trained experts to detect and classify defects which could have occurred during the welding operation. The experts accept or reject the weld joint based on the detected defect and applicable radiography standards. However, detecting and classifying welding defects using human expertise remains time-consuming, subjective, and costly. This process often takes two months to complete, leaving the weld joint exposed to the risk of a rail break. Studies indicate that most derailments occur due to a crack initiated from the weld joint, which causes a rail break [1,2]. Furthermore, some defects are left undetected or misclassified, thus posing an even greater risk of derailments and loss of lives and revenue. Although radiography experts possess invaluable skills in the detection and classification of weld defects within radiographic images, manual classification is inherently susceptible to errors, especially under conditions of high image volume or compromised image quality. Therefore, the railway industry needs a method to automate the defect detection and classification process so that defects can be detected and classified more quickly, reliably, and objectively. It is important to mention that this automated method is meant to augment, rather than replace, human expertise, enabling experts to focus on the most complex and critical decision-making aspects of weld inspection.
This work proposes image processing and machine learning techniques to detect and classify rail welding defects on radiography images automatically. The rest of this paper is structured as follows: Section 2 reviews recent image processing and machine learning applications for rail defect investigation. Section 3 discusses the techniques considered in this work to detect and classify rail welding defects. Section 4 presents the experimental results and discussion. Section 5 presents the Graphical User Interface (GUI) for the automated investigation of defects. Section 6 concludes this work.

2. Related Work

Recently, railway industry practitioners have benefited from various applications of computer vision methods. The possibility of collecting data such as rail surface defects and welding defects has proved beneficial for efficient railway transportation and the development of preventive maintenance models. Image processing and machine learning techniques have gained popularity in the railway industry since the first-generation rail monitoring systems introduced in [3]. These systems could capture, collect, and store railway track images for later review; however, they did not incorporate automated detection of defects in the captured images. As faster processing hardware became available, several researchers introduced image processing, machine learning, and deep learning frameworks with advanced automation capabilities. This section reviews image processing, machine learning, and deep learning methods found in the literature to detect and classify defects in the railway industry.

2.1. Techniques Based on Image Processing

One of the earliest and most innovative attempts to apply image processing techniques for detecting and classifying rail defects was the application of Gabor filters and texture analysis [4]. The authors first extracted the rail surface from the background using the Binary Image-Based Rail Extraction (BIBRE) algorithm. The extracted rail images were then enhanced using direct enhancement methods to achieve a uniform background between the surface defect and the rail image. Finally, the Gabor filters were applied to the enhanced images to maximize the energy difference and distinguish between defective and defect-less images. However, this approach was suited for a binary classification task and did not incorporate other types of defects in rails. Furthermore, the BIBRE algorithm assumes that the rail image has a bimodal histogram.
An improved version of the method introduced by [4] was proposed in [5] for real-time multi-class rail defect detection and classification using image processing techniques. The authors focused on four types of rail surface defects, namely headcheck, undulation, scour, and cracks. The initial step involved using the Hough Transform to segment and extract the rail from the image background. After that, defect features within the rail images were extracted using morphological operations, and defective regions were identified. Classification of the identified defect regions was carried out by comparing the areas of the defects. This method was implemented for a multi-class defect detection and classification task. However, only the geometrical property of the defects was considered; therefore, low classification accuracy was obtained. Other image processing methods for detecting and classifying rail defects can be found in [6,7].

2.2. Techniques Based on Image Processing and Machine Learning

Recently, machine learning techniques have been used along with image processing techniques to implement a robust defect detection and classification framework. Image processing involves methods for extracting defect features, while machine learning involves algorithms that use the extracted features to learn a decision rule that distinguishes different types of defects. Rajagopal et al. [8] used the Gray-Level Co-occurrence Matrix (GLCM) and the Neural Network (NN) classifier [9] to detect and classify defective and defect-less rail images. The authors initially applied image enhancement techniques before using the Gabor Transform to obtain multi-resolution images from the spatial domain. A similar approach was proposed in [10], where geometric features were combined with the gray-level features of three rail surface defects: scale peeling, crack stripping, and thread crack. Multi-class classification was achieved using the AdaBoost classifier. However, GLCM is a global feature extractor; therefore, it is not invariant to various image transformations, such as image rotation.
A method for detecting and classifying images containing scouring rail defects and defect-less rail images was proposed in [11]. Feature extraction techniques such as Principal Component Analysis (PCA) [12], Kernel Principal Component Analysis (KPCA) [13], Singular Value Decomposition (SVD) [14], and Histogram Matching (HM) were used for experiments. A comparative analysis was conducted using the Random Forest (RF) [15] as a classifier. Compared to other methods, PCA improved classification accuracy, while HM achieved faster feature extraction and training time. However, the proposed method was less effective, as foreign objects in images were detected as defects.
A surface defect called squats is usually caused by Rolling Contact Fatigue (RCF). In [16], three data sources comprising ultrasonic, eddy current, and rail images were used to detect squats more reliably. Features from these data sources were grouped using a clustering algorithm and then fed into Support Vector Machines (SVMs) [17] trained to detect squats. This method lacked accuracy, and the feature extraction was slow. Mercy and Rao [18] performed an analytical study of real-time rail surface defect prediction using three machine learning classification algorithms: NN, Decision Trees (DTs) [19], and RF. These algorithms were trained and validated using an image dataset acquired by the Track Recording Car (TRC) [20], a specialized vehicle equipped with sensors and cameras designed to capture images of the rail surface. The experimental results showed that the DT classifier outperformed the other classifiers. Even though the classification accuracy was impressive, this method's downside is in the detection stage; some defects could not be detected at high TRC speeds.
A method proposed in [21] uses the rotation-invariant Local Binary Pattern (LBP) extractor [22] and the K-Nearest Neighbors (KNN) classifier [23] for multi-class detection and classification of rail welding defects. The authors extracted the weld joint as the Region of Interest from the background of each image using the active contour model. Thereafter, the performance of the LBP extractor at increasing cell size parameters was investigated at a fixed K value of the KNN classifier. The authors considered four different classes: a defect-less class and defect classes of wormholes, inclusions, and shrinkage cavities. The study proposed in [24] uses the Speeded-Up Robust Features (SURF) extractor and an SVM classifier to add scale and illumination invariance to the method proposed in [21] using a similar dataset, and the classification accuracy improved slightly.

2.3. Techniques Based on Deep Learning

Deep learning has seen many successes in the medical and science fields, and it is currently the state-of-the-art paradigm in image recognition and speech recognition. Compared to machine learning, deep learning algorithms do not rely on hand-crafted image feature extraction techniques but automatically learn features within a given defect image. However, this happens at the cost of large data requirements. One of the earliest attempts to apply deep learning algorithms for rail defect detection is presented in [25]. The authors designed a Convolution Neural Network (CNN) model with two layers to classify defective and non-defective rails from stereo images; however, a small dataset was used for training, and thus the model was vulnerable to overfitting. James et al. [26] employed a multiphase deep learning-based technique to detect rail surface discontinuities on rail images; their approach first performed image segmentation techniques to remove the railroad from the background and then used the linear binary classifier to classify the rail as defective or intact.
Shang et al. [27] proposed a two-stage method for rail inspection using image segmentation and CNN. Their method was designed specifically for two objectives: to extract the rail surface from the background and to classify the rail as defective or defect-less. The rail surface was extracted using the Canny edge operator to detect edges. Subsequently, rail classification as either defective or defect-less was achieved using CNN based on the inception-v3 pre-trained model. This method achieved great classification results but was implemented for a binary classification task. Additionally, the Canny edge operator did not guarantee the successful detection of edges in every input image.
Roohi et al. [28] developed a Deep Convolution Neural Network (DCNN) framework to automatically detect and classify four classes of rail defects, namely, welding defects, light squats, moderate squats, and severe squats. The authors claimed that feature extraction using DCNN is more robust and accurate than the traditional feature extraction methods used on a large dataset. Their framework comprised three convolutional layers, three max pooling layers, and three fully connected layers. Subsequently, the hyperbolic tangent (Tanh) function and the rectified linear unit (ReLU) were used as activation functions. The classification accuracy achieved was impressive but could be improved with hyperparameter tuning, by adjusting parameters such as the learning rate and optimizer. Furthermore, their framework does not detect and classify different welding defects.
The method proposed by Jamshid et al. [29] detected squats and predicted their growth based on video images and ultrasonic measurement data. The ultrasonic measurement data were used to derive the general characteristics of the squats, and the video image data were used to analyze the growth of the visual length of defects. As an improvement to their previous method in [30], where an SVM classifier was used to classify fastener defects, Gibert et al. [31] trained a CNN pipeline based on five convolutional layers to classify the condition of fasteners as good, missing, or defective. To make their pipeline more robust against unusual situations, the authors used image augmentation and resampling techniques to add more “hard-to-classify” images to their training dataset.
Yanan et al. [32] developed a rail surface defect detection method using the YOLO-v3 deep learning network. Grayscale input images were initially divided into equal cells, and within each cell, the defects’ height, width, and center coordinates were calculated using the dimensional clustering method. The authors further used a logistic regression algorithm to calculate the bounding box score; meanwhile, predictions of the defect class that the bounding box contained were made using the binary cross-entropy loss function. However, the classification results were not impressive, and the learning rate was high. A high learning rate allows a model to learn faster at the cost of a sub-optimal solution.
Recurrent Neural Networks (RNNs) are another example of deep learning algorithms commonly used for sequential and time-series tasks. Long Short-Term Memory (LSTM) networks are a particular case of RNNs, and they can handle the vanishing gradient problem of the standard RNN. Using ultrasonic measurement data, Xu et al. [33] developed an LSTM model to detect and classify defective and non-defective rail surfaces. The pulse sequence from the ultrasonic data was interpreted as a sequential task in the LSTM architecture. The LSTM memory cell was used to establish the surface defect classification pipeline.
Song et al. [34] conducted a comparative study to detect and classify the severity of rail shelling defects. The dataset used to conduct the experiment included images of four levels of rail shelling defects ranging from low risk to high risk. The authors compared two pre-trained CNN models, the Residual Neural Networks (ResNet) and the VGG-16 network, with approaches based on manually extracted features, including the Histogram of Oriented Gradients (HoG) descriptor with an SVM classifier. The authors presented the results in terms of computation cost and classification accuracy. Their experimental results showed that the ResNet model required less computational cost and achieved the highest overall classification accuracy.
As shown in the literature, most researchers have used image processing, machine learning, and deep learning techniques for detecting and classifying rail surface defects. Although the condition monitoring of weld joints based on computer vision techniques has been studied by several researchers [35,36,37], little research exists in the literature for detecting and classifying rail welding defects using image processing and machine learning techniques. To the best of the authors’ knowledge, the studies presented in [21,24,38] are the only methods found in the literature for detecting and classifying rail welding defects. These three methods use a similar defect dataset of 300 images, representing a defect-less class and three defect classes of wormholes, inclusions, and shrinkage cavities. Each class comprised 75 images. Unfortunately, the dataset used by these three studies is too small to be conclusive, and only a few quantitative results are presented. Furthermore, none of these studies is based on state-of-the-art deep learning methods for image classification tasks.
Therefore, this work compares several image feature extraction and machine learning techniques for detecting and classifying rail welding defects using a much larger dataset. Furthermore, the state-of-the-art DCNN algorithm is proposed to investigate its performance against traditional methods for handling similar tasks. Machine learning algorithms use feature extraction techniques to generate a feature vector representing each image. Generally, feature extraction techniques are divided into global and local feature extractors. Local feature extractors are advantageous as they are invariant to significant image transformations compared to global feature extractors. Thus, this work compares two commonly used local feature extraction approaches, namely the Local Binary Pattern (LBP) extractor and the Bag of Visual Words (BoVW) approach with the Speeded-Up Robust Features (SURF) extractor. For defect classification, three machine learning classifiers are used for comparison: Support Vector Machines (SVMs), K-Nearest Neighbors (KNN), and Naive Bayes classifiers. These classifiers were chosen as they have proven to provide a high classification accuracy compared to other classifiers in classifying rail welding defects [21,24,38]. On the deep learning side, a DCNN pipeline is proposed to extract and classify image features automatically.

3. Proposed Methods

A methodological diagram of the proposed methods is depicted in Figure 1. Image enhancement techniques are first applied to thermite weld images to improve image quality. Image segmentation techniques are then applied to extract the weld joint from the image background. Then, the weld joint images are fed as inputs to two architectures. The first architecture applies several feature extraction and machine learning techniques to extract features and classify defects, respectively. The second architecture uses a Deep Convolution Neural Network (DCNN) to learn feature representations and classify defects. The best combination of feature extractor and classifier (in terms of classification accuracy) in the first architecture is compared to the classification accuracy of the DCNN method in the second architecture. The method that achieves the highest accuracy is integrated into the Graphical User Interface (GUI) for automated defect classification. All materials used in the development and evaluation of the proposed methods, including the code and supporting documentation, are publicly available for transparency and reproducibility. These resources can be accessed at the following GitHub repository: https://github.com/Molefe-M/thermite_weld_defect_investigation_system (accessed on 5 April 2025).

3.1. Image Enhancement

The collected thermite weld images are characterized by pixels with low-dynamic-range intensity values; thus, image enhancement techniques are required to improve the images' quality for further analysis. Several image enhancement techniques deemed suitable for enhancing radiography images have been used in the literature [39]. These enhancement techniques rely on the histogram of the image and its cumulative distribution function (CDF) to implement a transform function that produces an enhanced version of the original input image [40]. Let $I = I(x,y)$ be an input image composed of $L$ discrete gray-level intensity values, denoted as $\{g_0, g_1, \ldots, g_{L-1}\}$, where $I(x,y)$ represents the image intensity at spatial location $(x,y)$ and $I(x,y) \in \{g_0, g_1, \ldots, g_{L-1}\}$. Then, the normalized histogram of $I$ is defined as follows:
$$p(g_k) = \frac{n_k}{n}$$
where $n_k$ is the total number of pixels in $I$ with gray level $g_k$ for $k = 0, 1, \ldots, L-1$ and $n$ is the total number of pixels in $I$. From the obtained normalized histogram, the CDF of $I$ is calculated as follows:
$$c(g_k) = \sum_{j=0}^{k} p(g_j)$$
The transform function then maps the input gray levels onto the full dynamic range of intensity values using the CDF, and is calculated as follows:
$$T(g_k) = g_0 + (g_{L-1} - g_0) \times c(g_k)$$
Then, the enhanced output image $G = G(x,y)$ is computed as follows:
$$G(x,y) = T(I(x,y))$$
where $I(x,y) \in \{g_0, g_1, \ldots, g_{L-1}\}$.
There are three commonly used methods of Histogram Equalization: Global Histogram Equalization (GHE) [41], Adaptive Histogram Equalization (AHE) [42], and Contrast-Limited Adaptive Histogram Equalization (CLAHE) [43]. GHE does not perform well on images that contain local regions of low contrast or regions that are dark or bright. In such cases, the AHE technique is used. It operates by dividing an image into regions and, for each region, calculating the transform function based on the region's CDF. However, AHE has its limitations, as it results in the over-enhancement of noise in homogeneous regions [40]. Thus, the enhancement technique presented in this work is based on the CLAHE technique. CLAHE overcomes the noise enhancement artifact of AHE by clipping the histogram before using the CDF as a transform function. The clipping limit is defined as follows:
$$C.L = \frac{M \times N}{L}\left[1 + \frac{\rho}{100}\,(\delta_{max} - 1)\right]$$
where $M \times N$ is the number of pixels in each region, $L$ is the number of gray levels, $\rho$ is the clipping factor, and $\delta_{max}$ is the maximum allowable slope.
Algorithm 1 gives the steps used in this work to improve the quality of the thermite weld images using the CLAHE technique.
Algorithm 1 Image enhancement using CLAHE.
Requirements: Thermite weld images
Output: Enhanced images
1: for each image $I$ in the dataset do
2:    Divide into non-overlapping cells.
3:    for each cell do
4:       Calculate the histogram using (1).
5:       Calculate the clip limit using (5).
6:       Calculate the CDF using (2).
7:       Calculate the transform function using (3).
8:       Obtain the enhanced region using (4).
9:    end for
10: end for
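A minimal sketch of this enhancement step, using OpenCV's built-in CLAHE implementation, is given below; the file name, clip limit, and tile grid size are illustrative placeholders rather than the parameter values tuned in this work.

```python
import cv2

def enhance_weld_image(path: str):
    """Apply CLAHE to a grayscale radiography image (Algorithm 1, sketch)."""
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # radiography images are single-channel
    # clipLimit plays the role of the clipping factor and tileGridSize the cells of Algorithm 1;
    # the values below are illustrative, not the parameters tuned in this work.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(image)

enhanced = enhance_weld_image("thermite_weld.png")   # hypothetical file name
cv2.imwrite("thermite_weld_enhanced.png", enhanced)
```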

3.2. Image Segmentation and Region-of-Interest Extraction

To reduce the computational cost and eliminate noisy image backgrounds, it is recommended to extract the Region of Interest (RoI) before performing feature extraction. In this work, the RoI is the weld joint within the thermite weld image, as shown in Figure 2. Given that the weld joint is irregular in shape and the background is complex and challenging, traditional image segmentation techniques such as Thresholding [44] and the Hough Transform [45] are not suited for this task. Active Contour Models (ACMs) are segmentation techniques that make it possible to segment irregularly shaped image regions from complex image backgrounds [46,47,48]. Segmenting images with ACMs is equivalent to solving an optimization problem, where the energy function is defined such that its minimum corresponds to a contour that estimates the RoI object boundaries.
Segmenting images using ACMs takes two forms: parametrized and level set approaches [49]. In parametrized approaches, the snake ACM [50] uses a parametric contour, which, under the influence of energy forces, is driven towards the edges of the RoI to be segmented. However, the snake ACM's disadvantage is that the evolving contour must be placed close to the RoI. Furthermore, the snake ACM is not effective in segmenting images with smooth edges or where edge detection is not feasible. In level set approaches, a contour is represented implicitly by a level set function $\phi(x,y)$, where $(x,y)$ is the pixel location in the region represented by the image domain $\Omega$. The contour $C$ is defined as those pixels in $\Omega$ where the level set function is zero. This is mathematically expressed as follows:
$$C = \{(x,y) \in \Omega : \phi(x,y) = 0\}$$
The level set method used in this work is the ACM without gradient (edge) information proposed by Chan and Vese [51]. The Chan–Vese ACM is built on the Mumford–Shah segmentation model [52], which enables contours to be detected with or without edges; thus, RoI objects with smooth or discontinuous edges can be segmented. Given a weld joint image with two energy terms, $E_1$ and $E_2$, defined over the foreground $\Omega_1$ and background $\Omega_2$ image regions, the Mumford–Shah energy function $E_{MS}$ to minimize is defined as follows:
$$E_{MS} = \int_{\Omega_1} |I(x,y) - h_1|^2 \, dx \, dy + \int_{\Omega_2} |I(x,y) - h_2|^2 \, dx \, dy + v \, |\partial \Omega_1|$$
where $h_1$ and $h_2$ represent the average intensity values inside and outside $C$, and $v \, |\partial \Omega_1|$ is the contour length term, used as a regularization term to encourage smoothness in the segmentation boundary. Without it, the energy function could overfit to noise or small intensity variations in the image, yielding a very fragmented contour.
The piecewise constant Mumford–Shah model defines the Heaviside step function for two image regions in terms of the level set as follows:
$$H(\phi(x,y)) = \begin{cases} 1, & \text{if } \phi(x,y) \geq 0 \\ 0, & \text{if } \phi(x,y) < 0 \end{cases}$$
Implementing the Mumford–Shah energy function (7) in terms of the Heaviside step function (8) gives the Chan–Vese ACM energy function, which must be optimized to achieve segmentation. This energy function is defined as follows:
$$E(h_1, h_2, \phi) = \int_{\Omega} \left[ (I(x,y) - h_1)^2 - (I(x,y) - h_2)^2 \right] H(\phi(x,y)) \, dx \, dy + \int_{\Omega} (I(x,y) - h_2)^2 \, dx \, dy + v \int_{\Omega} |\nabla H(\phi(x,y))| \, dx \, dy$$
A local minimum of (9) is computed using gradient descent. It is further assumed that (8) is slightly smoothed to make it differentiable. The gradient descent formulation is obtained by minimizing (9) with respect to $\phi$, yielding the following:
$$\frac{\partial \phi}{\partial t} = \delta(\phi) \left[ v \, \mathrm{div}\left( \frac{\nabla \phi}{|\nabla \phi|} \right) + (I(x,y) - h_2)^2 - (I(x,y) - h_1)^2 \right]$$
The Heaviside function derivative and the mean intensity values $h_1$ and $h_2$ inside and outside the evolving contour, respectively, are calculated as follows:
$$\frac{dH(\phi)}{d\phi} = \delta(\phi)$$
$$h_1(\phi) = \frac{\int_{\Omega} I(x,y) \, H(\phi(x,y)) \, dx \, dy}{\int_{\Omega} H(\phi(x,y)) \, dx \, dy}$$
$$h_2(\phi) = \frac{\int_{\Omega} I(x,y) \left( 1 - H(\phi(x,y)) \right) dx \, dy}{\int_{\Omega} \left( 1 - H(\phi(x,y)) \right) dx \, dy}$$
Algorithms 2 and 3 list the steps followed in this work to segment and extract the weld joint as the RoI from the enhanced thermite weld images, respectively.
Algorithm 2 Image segmentation.
Requirements: Enhanced thermite weld images
Output: Segmented images
1: for each image $I$ in the dataset do
2:    Initialize $\phi$ and set $n$, the number of iterations.
3:    for $n = 1$ to maximum $n$ do
4:       while contour is not stationary do
5:          Calculate $h_1(\phi)$ and $h_2(\phi)$ using (12) and (13), respectively.
6:          Evolve $\phi_{n+1}$ using (10).
7:       end while
8:    end for
9:    Segment the image.
10: end for
Algorithm 3 Weld joint RoI extraction.
Requirements: Segmented thermite weld images
Output: Weld joint RoI images
1: for each segmented image in the dataset do
2:    Obtain the segmented pixel coordinates.
3:    Apply the coordinates to the original image.
4:    Apply the bounding box across the coordinates.
5:    Crop the region within the bounding box.
6:    Save the image as a weld joint.
7: end for
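A minimal sketch of Algorithms 2 and 3 is given below, based on the Chan–Vese implementation in scikit-image followed by a bounding-box crop of the segmented pixels; the file names and parameter values are illustrative, and the mask may need inverting depending on the intensity polarity of the weld joint.

```python
import numpy as np
from skimage import io, img_as_float
from skimage.segmentation import chan_vese

# Chan-Vese level-set segmentation (Algorithm 2); parameter values are illustrative.
image = img_as_float(io.imread("thermite_weld_enhanced.png", as_gray=True))
mask = chan_vese(image, mu=0.25, lambda1=1.0, lambda2=1.0, tol=1e-3)  # boolean foreground mask

# Weld joint RoI extraction (Algorithm 3): bounding box across the segmented pixel coordinates.
rows, cols = np.nonzero(mask)
top, bottom, left, right = rows.min(), rows.max(), cols.min(), cols.max()
weld_joint = image[top:bottom + 1, left:right + 1]
io.imsave("weld_joint_roi.png", (weld_joint * 255).astype(np.uint8))
```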

3.3. Architecture One: Feature Extraction

After extracting the weld joint as the RoI, feature extraction is performed to identify interesting defect information within each weld joint image, and the final output is a descriptor vector that is later used to train a classifier. The first architecture of this work compares the LBP extractor to BoVW with a dense SURF extractor (referred to as the BoVW approach hereafter) for extracting features in weld joint images. These feature extraction techniques were chosen for comparison as they are invariant to image transformations such as illumination, rotation, and scale changes [21,24,38].

3.3.1. Feature Extraction Using the LBP Extractor

The LBP extractor is a simple yet effective method for extracting texture features from images. It is invariant to illumination changes and image rotation. Due to its simplicity, feature extraction using the LBP descriptor has gained popularity in many applications where image feature extraction is essential. The principle of the original LBP extractor [53] can be understood by considering Figure 3. The weld joint image is initially divided into non-overlapping cells. Then, the center pixel to be labeled in each cell is used to threshold the pixels in its 3 × 3 neighborhood region: neighboring pixels with intensity greater than or equal to the center pixel are assigned 1; otherwise, they are assigned 0. The obtained binary code is converted into a decimal number which represents the LBP code of the center pixel. A histogram feature vector is then obtained by concatenating the histograms of LBP codes in each cell. The obtained feature vector is later used to train a classifier.
However, the drawback of the original LBP descriptor is that the 3 × 3 neighborhood region is too small to extract features at larger scales. Therefore, a modified LBP extractor proposed in [54] allows for the extraction of features at different neighborhood sizes. In the modified version, neighboring sample points used to assign the LBP code to a center pixel are evenly placed on the circumference of a circle. Additionally, the modified version allows any number of sample points in the neighborhood to be used; this is achieved by bilinearly interpolating sampling points that do not fall on exact pixel locations. Figure 4 shows some examples of the modified LBP descriptor.
The formal computation of the LBP code for any given pixel c in an image I surrounded by P neighboring pixels placed on a circle of radius R from c is defined as follows [53]:
$$LBP_{(R,P)}(c) = \sum_{i=0}^{P-1} S(g_i - g_c) \times 2^i$$
where $g_c$ and $g_i$ represent the intensity values of the center pixel and neighboring pixels, respectively. The notation $(R,P)$ denotes a neighborhood of $P$ sampling points on a circle of radius $R$. To achieve an illumination-invariant LBP extractor, the sign function $S$ is used, and it is defined as follows:
$$S(z) = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{if } z < 0 \end{cases}$$
The LBP descriptor defined by (14) is not invariant to image rotation. For instance, if the image is rotated, each pixel in the neighborhood moves accordingly along the circle’s perimeter, and a different LBP code will be produced. A rotation-invariant LBP extractor is obtained by grouping together the LBP codes that are the rotated versions of the same pattern. The rotation-invariant LBP extractor is defined as follows [54]:
$$LBP_{(R,P)}^{ri} = \min \left\{ ROR\left( LBP_{(R,P)}, i \right) \mid i = 0, 1, \ldots, P-1 \right\}$$
The function R O R ( z , i ) performs the circular stepwise right shift on the binary string of the LBP code for i number of times. The minimum between the obtained LBP codes is selected. It should be mentioned that keeping only the rotationally invariant patterns leads to a reduction in feature dimensionality. However, the number of LBP codes increases drastically with an increase in P. Thus, uniform patterns are used in the extended LBP extractor to reduce the number of LBP codes. It has been experimentally proven that more than 90% of patterns in texture images are uniform [54]. A pattern is said to be uniform if it contains at most two transitions from 0 to 1 or vice versa when the binary string is considered circular. For example, 0001000 is a uniform pattern because it has two transitions, while 0101010 is not a uniform pattern because it has six transitions. The extended, uniform LBP extractor is defined as follows:
$$LBP_{(R,P)}^{riu} = \begin{cases} \sum_{i=0}^{P-1} S(g_i - g_c) \times 2^i, & \text{if } U \leq 2 \\ P+1, & \text{otherwise} \end{cases}$$
where $U$ is the uniformity measure used to differentiate uniform patterns from non-uniform patterns. A uniform pattern is a pattern where $U \leq 2$ (at most two transitions), and a non-uniform pattern is a pattern where $U > 2$. $U$ is defined as follows:
$$U(LBP_{(R,P)}) = \left| S(g_{P-1} - g_c) - S(g_0 - g_c) \right| + \sum_{i=1}^{P-1} \left| S(g_i - g_c) - S(g_{i-1} - g_c) \right|$$
The difference between the uniform LBP and the original extractor is that the former yields a significantly smaller feature vector length in each cell. To illustrate this, consider a circular neighborhood region with eight sampling points; a total of 256 patterns are generated, of which 58 are uniform and 198 are non-uniform. The uniform LBP extractor accumulates all the non-uniform patterns into a single histogram bin, and each uniform pattern is given its own histogram bin. Thus, the feature vector in each cell has 59 histogram bins, which is a significant reduction compared with the LBP extractor that keeps a separate bin for every pattern. Therefore, feature extraction in this work is achieved using the uniform LBP extractor. The histograms of all cells in an image are normalized and concatenated to form the final feature vector, as shown in Figure 5.
The steps used in this work to extract features using the uniform LBP extractor are shown in Algorithm 4.
Algorithm 4 Feature extraction using LBP.
Requirements: Weld joint image dataset
Output: Concatenated feature vector $z$ per image
1: for each image $I$ in the dataset do
2:    Divide into cells.
3:    for each cell do
4:       Obtain the LBP patterns using (17).
5:       Obtain and normalize the histogram.
6:    end for
7:    Form a feature vector $z$.
8: end for
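A minimal sketch of Algorithm 4 is given below, using the scikit-image LBP implementation; the 'nri_uniform' method yields the 59-bin uniform histogram discussed above for P = 8, and the cell size, P, and R values are illustrative rather than the parameters tuned in this work.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_feature_vector(image: np.ndarray, cell: int = 32, P: int = 8, R: int = 1) -> np.ndarray:
    """Concatenated per-cell uniform LBP histograms (Algorithm 4, sketch)."""
    codes = local_binary_pattern(image, P, R, method="nri_uniform")
    n_bins = 59                                   # 58 uniform patterns + 1 bin for non-uniform ones
    histograms = []
    for y in range(0, image.shape[0] - cell + 1, cell):
        for x in range(0, image.shape[1] - cell + 1, cell):
            block = codes[y:y + cell, x:x + cell]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            histograms.append(hist / (hist.sum() + 1e-8))   # normalize each cell histogram
    return np.concatenate(histograms)
```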

3.3.2. Feature Extraction Using SURF Extractor

The SURF extractor is a feature extraction algorithm that is invariant to scale, rotation, illumination, and occlusion [55]. It is one of the commonly used extractors for image recognition, matching, classification, and image registration. Feature extraction using SURF involves keypoint detection and keypoint description. Keypoint detection involves finding interesting image information such as blobs or corners representing different defect characteristics. Keypoint description, on the other hand, constructs a vector representing each keypoint.
Keypoint detection: Typically, this is achieved using keypoint detectors or dense sampling-based methods. SURF uses the determinant of the Hessian matrix and the integral images’ principles to detect the location and scale of the keypoints in an image. This then enables keypoints to be detected reliably since detection includes many levels of viewpoint and illumination invariance. However, when an image is characterized by low contrast, finding keypoints becomes infeasible, making the detection task useless. On the other hand, dense sampling methods consist of patches of fixed size placed on a regular grid across an image (see Figure 6a). A center pixel in every grid (highlighted by a red circle in Figure 6a) is considered a keypoint. This has advantages as a constant number of features per image area is obtained. Furthermore, image regions with low contrast contribute equally to the overall image representation. Achieving scale-invariant keypoints involves blurring the image with increasing Gaussian scales. On the downside, dense sampling does not reach the same level of repeatability as obtained with the SURF keypoint detector. Nevertheless, only scale and illumination changes are considered important image transformations in this work. Therefore, the dense sampling method is used to detect keypoints.
Keypoint description: After detecting keypoints, the next step is to construct a keypoint descriptor vector for every keypoint in the weld joint image. As illustrated in Figure 6b, this is achieved by placing a square region centered around every keypoint and oriented along the keypoint’s dominant orientation. This region is then divided into 16 sub-regions, and within each region, the x- and y-direction Haar wavelet responses denoted by d x and d y , respectively, are calculated and summed up to form the feature vector’s first two entries. Additionally, the absolute values of the responses are calculated and added to the feature vector. Thus, each sub-region is a four-dimensional feature vector calculated as follows:
$$v = \left( \sum d_x, \; \sum d_y, \; \sum |d_x|, \; \sum |d_y| \right)$$
There are 16 sub-regions within each square region; thus, each keypoint descriptor vector in SURF has 64 dimensions. Algorithm 5 lists the steps used in this work to extract features using the SURF extractor.
Algorithm 5 Keypoint detection and description using SURF.
Requirements: Weld joint image dataset
Output: Keypoint descriptor vectors per image
1: for each image $I$ in the dataset do
2:    Split into overlapping cells.
3:    for each cell do
4:       Take the center pixel as a keypoint.
5:    end for
6: end for
7: for each potential keypoint do
8:    Construct a square region centered at the keypoint.
9:    Split into 16 sub-regions and form a feature vector.
10:    for each sub-region do
11:       Calculate the descriptor vector using (19).
12:    end for
13:    Concatenate and store the keypoint vector.
14: end for
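A minimal sketch of Algorithm 5 is given below; it places keypoints on a regular grid and describes them with the SURF implementation from the opencv-contrib package (SURF is patented and is not included in the default OpenCV build). The grid step and patch size are illustrative.

```python
import cv2
import numpy as np

def dense_surf_descriptors(image: np.ndarray, step: int = 16) -> np.ndarray:
    """Dense grid keypoints described with 64-D SURF vectors (Algorithm 5, sketch)."""
    surf = cv2.xfeatures2d.SURF_create(extended=False)   # 64-dimensional descriptors
    keypoints = [
        cv2.KeyPoint(float(x), float(y), float(step))     # grid center pixels taken as keypoints
        for y in range(step // 2, image.shape[0], step)
        for x in range(step // 2, image.shape[1], step)
    ]
    _, descriptors = surf.compute(image, keypoints)
    return descriptors                                    # one row per grid keypoint
```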

3.3.3. Image Representation Using BoVW

The SURF extractor produces highly discriminative keypoints, where a 64-dimensional feature vector represents each keypoint. This means many keypoint descriptor vectors represent each weld joint image, though these vectors are invariant to scale, illumination change, and orientation. Training a classification algorithm with many keypoint vectors for every image would mean that outliers impact the classification results. Additionally, the computational cost is increased significantly. To address these challenges while maintaining invariance to image transformations, the BoVW approach is proposed in this work, such that each weld joint image is represented by a single vector. The BoVW approach consists of three steps: Codebook construction, Coding, and Pooling. As depicted in Figure 7, Codebook construction is achieved by clustering similar keypoints from the unlabeled dataset, and each group is called a codeword. Coding aims to represent every keypoint in an image in terms of the codewords. Finally, Pooling represents every image as a global feature vector that counts the appearance of each codeword in the image. Thus, the final output is a global feature vector for every weld joint image in the dataset.
Codebook construction: Codebook construction takes the keypoint descriptor vectors in a high-dimensional feature space and clusters them based on similarity. This is achieved using the Kmeans clustering algorithm. Let $V = \{v_j \mid j = 1, 2, \ldots, N\}$ represent the unordered keypoint descriptor vectors extracted from the training dataset, where $N$ is the total number of descriptor vectors and $v_j \in \mathbb{R}^D$. Kmeans clustering randomly picks $K$ descriptors and assigns them as cluster centers $(w_1, w_2, \ldots, w_K)$. The remaining $N - K$ descriptors are assigned to the closest cluster centers. For each cluster, a new mean, $w_{new}^{(i)}$, is then calculated using the assigned descriptors. Descriptors closest to $w_{new}^{(i)}$ are again assigned; the process is iterative and only stops when the cluster centers are unchanged. The output from Kmeans clustering, as shown in Figure 7, is a codebook defined as follows: $C = \{c_k \mid k = 1, 2, \ldots, K\}$, where $c_k \in \mathbb{R}^D$.
Coding: This step aims to represent every keypoint in an image in terms of codewords. This is achieved by defining a function $\beta_j = \{\beta_{k,j} \mid k = 1, 2, \ldots, K\}$ such that any descriptor vector $v_j$ is mapped into the nearest codeword $c_k$ of the codebook. The mapping is achieved by the following hard-coding equation:
$$\beta_{k,j} = \begin{cases} 1, & \text{if } k = \arg\min_{k' \in \{1, \ldots, K\}} \| v_j - c_{k'} \|_2^2 \\ 0, & \text{otherwise} \end{cases}$$
where $\beta_{k,j}$ is the $k^{th}$ component of the encoded vector $\beta_j$.
Pooling: Pooling is the final step in BoVW, where a vector $z$ is constructed to provide a global description of an image. In this step, the encoded vectors of every keypoint in an image are summed component-wise. Therefore, given an image with $n$ keypoint vectors, the $k^{th}$ component of $z$ is defined as follows:
$$z_k = \sum_{j=1}^{n} \beta_{k,j}$$
Algorithm 6 lists the steps used in this work to construct the codebook and to represent every image as a global feature vector.
Algorithm 6 Image representation using the BoVW approach.
Requirements: SURF keypoint descriptor vectors
Output: Global feature vector per image
1: for each descriptor vector in the training dataset do
2:    Store into a dictionary $D$.
3: end for
4: Randomly choose $K$ descriptor vectors from $D$ to form codeword centers $c_1, c_2, \ldots, c_K$.
5: while codeword centers change do
6:    Assign each descriptor vector to the closest $c$.
7:    for each $c$ do
8:       Replace with the new mean of the assigned descriptor vectors.
9:    end for
10: end while
11: for every weld joint image $I$ in the dataset do
12:    Assign every descriptor vector in $I$ to the nearest codeword in $C$ using (20).
13:    Form a global feature vector $z$ using (21).
14: end for
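A minimal sketch of Algorithm 6 is given below, using the scikit-learn Kmeans implementation for codebook construction and a histogram for the coding and pooling steps; the codebook size K and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors: np.ndarray, K: int = 200) -> KMeans:
    """Cluster the training SURF descriptors into K codewords (Algorithm 6, steps 1-10)."""
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit(all_descriptors)

def bovw_vector(image_descriptors: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Coding and pooling: one K-dimensional codeword histogram per image (Algorithm 6, steps 11-14)."""
    K = codebook.n_clusters
    words = codebook.predict(image_descriptors)       # nearest codeword per keypoint, as in (20)
    z, _ = np.histogram(words, bins=K, range=(0, K))  # sum of encoded vectors, as in (21)
    return z.astype(float)
```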

3.4. Architecture One: Feature Classification

Three machine learning classification algorithms have been considered in this work for empirical comparison. The feature vectors extracted by the previous section's methods are independently given as inputs to train and validate each classifier. Each classifier's objective is to use the training feature vectors to learn a decision rule that assigns each unseen feature vector from the test dataset to the defect-less class or to one of the defect classes (wormholes, shrinkage cavities, or inclusions).

3.4.1. Classification Using the K-Nearest Neighbors

Feature classification using the KNN classifier is conceptually straightforward. During training, the feature vectors (with corresponding class labels) from the training dataset are stored in memory. When an unlabeled feature vector is presented to the feature space, its class label is obtained using a set of $K$ closest feature vectors. To explain the working principles of the KNN classifier, let the weld joint images from the training dataset be represented as $((v_1, y_1), \ldots, (v_n, y_n))$, where $v_i$ are the training feature vectors in a feature space $\mathbb{R}^m$ and $y_i$ are the class labels for each feature vector. The distance between the test feature vector $z$ and any training feature vector $v_i$ is the Euclidean distance defined as follows [22]:
$$d(z, v_i) = \sqrt{ \sum_{r=1}^{m} \left( a_r(v_i) - a_r(z) \right)^2 }$$
where $a_r(\cdot)$ denotes the $r^{th}$ component of a feature vector. Then, the class label of the unlabeled feature vector $z$ is assigned based on its $K$ closest training feature vectors $(v_1, v_2, \ldots, v_K)$ according to the following:
$$y(z) \leftarrow \arg\max_{c \in C} \sum_{i=1}^{K} w_i \, \delta(c, y(v_i))$$
where $y(z)$ is the predicted class of sample $z$, $c \in C$ is a class label, and $\delta(c, y(v_i))$ is equal to 1 if $c$ is equal to $y(v_i)$; otherwise, $\delta(c, y(v_i))$ is equal to 0. $w_i$ is the function used to weight the training feature vectors based on their distance from $z$. The weighting function used in this work is the inverse square distance, and it is calculated as follows:
$$w_i = \frac{1}{d(z, v_i)^2}$$
Algorithm 7 lists the steps used in this work to classify rail welding defects using the KNN classifier. Inputs are the feature vectors that are initially divided into training, validation, and test datasets. The training dataset is used to train the classifier, the validation dataset is used for obtaining optimal classifier parameters, and the test dataset is used to report the classification accuracy based on optimal parameters.
Algorithm 7 Defect classification using KNN.
Requirements: Training, validation, and test feature vectors
Output: Class label for each test feature vector
1: Classifier training:
2: for each feature vector $v$ in the training dataset do
3:    Save into memory.
4: end for
5: Obtain the optimal $K$ value parameter.
6: Classifier testing:
7: for each feature vector $z$ in the test dataset do
8:    Compute the distance to the training features using (22).
9:    Use the optimal $K$ value parameter.
10:   Obtain the class label using (23).
11: end for
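A minimal sketch of Algorithm 7 is given below, using the scikit-learn KNN classifier with the inverse-square-distance weighting of (24). The train/test split, the default K value, and the variable names X and y (the feature vectors of Section 3.3 and their class labels) are illustrative assumptions; K is tuned on the validation split in practice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def inverse_square(distances: np.ndarray) -> np.ndarray:
    """Inverse-square-distance weights of (24); the small constant avoids division by zero."""
    return 1.0 / (distances ** 2 + 1e-8)

def train_and_evaluate_knn(X: np.ndarray, y: np.ndarray, k: int = 5) -> float:
    """Train and score a distance-weighted KNN classifier (Algorithm 7, sketch)."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=k, weights=inverse_square, metric="euclidean")
    knn.fit(X_train, y_train)
    return knn.score(X_test, y_test)
```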

3.4.2. Feature Classification Using Naive Bayes

The Naive Bayes classifier is a family of Bayesian networks [56], where the class assignment of an unknown feature vector is based on class-conditional probabilities, each representing the probability that the unknown vector belongs to the respective class. Given the weld joint image dataset with $M$ classes $y_1, y_2, \ldots, y_M$ and an unlabeled feature vector $z$ from the test dataset in an $m$-dimensional feature space $\mathbb{R}^m$, according to Bayes' rule, the probability that the class label of $z$ belongs to any class $y_i$ is defined as follows:
$$P(y_i \mid z) = \frac{P(z \mid y_i) \, P(y_i)}{P(z)}$$
where $P(y_i \mid z)$ is the probability that the label of vector $z$ belongs to class $y_i$, $P(z \mid y_i)$ is the probability of generating sample $z$ given class $y_i$, $P(y_i)$ is the prior probability of class $y_i$, and $P(z)$ is the probability of sample $z$ occurring. Modeling $P(z \mid y_i)$ directly is impractical given that $z$ is a vector in a high-dimensional feature space. Thus, Naive Bayes assumes that the individual components $z_r$ are conditionally independent given the class. The numerator of (25) becomes the following:
$$P(z_1 \mid y_i) \cdot P(z_2 \mid y_i) \cdots P(z_m \mid y_i) \cdot P(y_i) = \prod_{r=1}^{m} P(z_r \mid y_i) \, P(y_i)$$
$P(z)$ is the same for all the classes, and it does not affect the decision. Thus, (26) simplifies to the following:
$$P(y_i \mid z) \propto \prod_{r=1}^{m} P(z_r \mid y_i) \, P(y_i)$$
$P(y_i)$ is the class-prior probability. Given $N$ feature vectors in the training dataset, of which $N_i$ belong to class $y_i$, the prior probability is calculated as follows:
$$P(y_i) = \frac{N_i}{N}$$
To assign the class label to an unknown sample, the value of (27) is computed for each class, and the class for which this value is maximal is selected. The predicted class $y^*$ for sample $z$ is computed as follows:
$$y^* \leftarrow \arg\max_{y_i} \prod_{r=1}^{m} P(z_r \mid y_i) \, P(y_i)$$
Algorithm 8 lists the steps used in this work to classify defects using the Naive Bayes classifier.
Algorithm 8 Defect classification using Naive Bayes.
Requirements: Training, validation, and test feature vectors
Output: Class label for each test feature vector
1: Classifier training:
2: for each class $y_i$ do
3:    Calculate the prior probability using (28).
4: end for
5: Classifier testing:
6: for each feature vector $z$ in the test dataset do
7:    Assign the class label using (29).
8: end for
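A minimal sketch of Algorithm 8 is given below. The form of the class-conditional densities P(z_r | y_i) is not specified in this section; a Gaussian assumption, as in scikit-learn's GaussianNB, is a common choice for continuous feature vectors and is used here purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def train_and_evaluate_naive_bayes(X: np.ndarray, y: np.ndarray) -> float:
    """Train and score a Naive Bayes classifier (Algorithm 8, sketch)."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    nb = GaussianNB()     # class priors P(y_i) are estimated from the training labels
    nb.fit(X_train, y_train)
    return nb.score(X_test, y_test)
```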

3.4.3. Feature Classification Using SVM

SVMs are a widely used classification algorithm due to many promising characteristics in terms of performance [17]. At first, SVMs were introduced for binary classification tasks; however, their use soon expanded to multi-class classification tasks. Multi-class classification is achieved by one-vs.-one or one-vs.-all SVM classifiers. To understand the principles of the SVM, let $((v_1, y_1), \ldots, (v_n, y_n))$ be the training dataset, where $v$ are feature vectors representing each weld joint image and $y \in \{-1, +1\}$ are the corresponding class labels. Then, the SVM algorithm's goal, as illustrated in Figure 8, is to construct an optimal hyperplane that separates the feature vectors in each class with the largest possible margin. An unlabeled feature vector is then assigned a class label depending on its relative position from the optimal hyperplane.
The margin is defined as the shortest distance between the feature vectors in the negative and positive classes; these feature vectors are known as support vectors. The $H_1$ and $H_2$ planes, defined by $w \cdot v_i + b = -1$ and $w \cdot v_i + b = +1$, represent the boundaries for feature vectors that belong to the negative and positive classes, respectively. The margin, which must be maximized to obtain the optimal hyperplane, is the distance $d = \frac{2}{\|w\|}$ between the $H_1$ and $H_2$ planes. Maximizing $d$ is equivalent to solving the constrained optimization problem $\min \frac{1}{2} \|w\|^2$, subject to $y_i (w \cdot v_i + b) \geq 1, \; \forall i$. Introducing the Lagrangian formulation to eliminate the constraints leads to the definition of the dual SVM, expressed as follows:
$$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (v_i \cdot v_j) \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \;\; \text{and} \;\; \alpha_i \geq 0$$
Solving the dual SVM yields the coefficients $\alpha_i$; feature vectors where $\alpha_i > 0$ are the support vectors, and they lie directly on the $H_1$ and $H_2$ planes. Feature vectors with $\alpha_i = 0$ contribute nothing to the dual objective. Thus, the SVM optimization problem is affected only by the support vectors. The optimal hyperplane for assigning a class label to a test feature vector $z$ is then defined as follows:
$$f(z) = \sum_{i} y_i \, \alpha_i \, (v_i^T \cdot z) + b$$
The formulation of the dual SVM discussed above assumes that the feature vectors are linearly separable in the feature space. However, in real applications, the data are characterized by noise and outliers, making it impossible to separate the features linearly. Nonlinear SVMs allow nonlinearly separable features to become separable by transforming them to a higher-dimensional feature space $\chi$. The transformation is computed by taking the dot product between any pair of feature vectors using a kernel function: $K(v_i, v_j) = \phi(v_i) \cdot \phi(v_j)$. The radial basis function (RBF) kernel is used in this work to achieve the transformation. The RBF kernel is defined as follows:
$$K(v_i, v_j) = \exp\left( -\frac{\| v_i - v_j \|^2}{2 \sigma^2} \right)$$
where $\sigma$ is the kernel width. To achieve multi-class classification of defects, the one-vs.-one SVM formulation is used. This is achieved by training on two defect classes at a time. For a classification task with $M$ classes, a total of $\frac{M(M-1)}{2}$ classifiers are obtained. An unknown vector is given a class label based on the class with the majority of votes.
Algorithm 9 lists the steps used in this work to classify defects using the SVM classifier.
Algorithm 9 Defect classification using SVM.
Requirements: Training, validation, and test feature vectors
Output: Class label for each test feature vector
1: Classifier training:
2: for any pair of classes $y_i$ and $y_j$ do
3:    Map the training feature vectors into $\chi$ using (32).
4:    Solve (30) to obtain the optimal hyperplane.
5: end for
6: Obtain the optimal $\sigma$ parameter value.
7: Classifier testing:
8: for each feature vector $z$ in the test dataset do
9:    Assign to the class label with the majority vote.
10: end for
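A minimal sketch of Algorithm 9 is given below, using the scikit-learn SVC classifier, which trains the M(M−1)/2 one-vs.-one RBF classifiers internally; the C and gamma grids (where gamma = 1/(2σ²)) are illustrative, not the values searched in this work.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_and_evaluate_svm(X_train: np.ndarray, y_train: np.ndarray,
                           X_test: np.ndarray, y_test: np.ndarray) -> float:
    """Train one-vs.-one RBF SVMs with a grid-searched kernel width (Algorithm 9, sketch)."""
    grid = GridSearchCV(
        SVC(kernel="rbf", decision_function_shape="ovo"),
        param_grid={"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},  # gamma = 1 / (2 * sigma^2)
        cv=5,
    )
    grid.fit(X_train, y_train)      # selects C and the kernel width on validation folds
    return grid.best_estimator_.score(X_test, y_test)
```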

3.5. Architecture Two: Deep Convolution Neural Network

In contrast to architecture one, where feature extraction techniques are applied to represent every weld joint image as a feature vector that is later used to train machine learning classifiers, the Deep Convolution Neural Network (DCNN) architecture learns the image representation automatically using deep convolution before passing the features to a fully connected neural network classifier. Figure 9 illustrates the system diagram of the proposed DCNN architecture. It is made up of four primary operations: convolution, activation function, pooling, and a fully connected neural network.

3.5.1. Convolution

The convolution component is the most critical component in the DCNN architecture. It contains a set of convolution kernels (filters) that are convolved with the input image to produce an output feature map. During the training phase of the DCNN model, the weights of the kernels are initialized with random numbers; thereafter, the kernel is slid over the input image in the horizontal and vertical directions. For every sliding operation, the dot product between the corresponding values of the kernel and the input image is computed. The obtained values from the dot product computation are then summed to obtain a single scalar value in the output feature map. This process continues until the kernel can no longer slide further on the input image. The scalar value of the feature map produced by each convolution step is computed as follows:
$$FM_j = \sum_{i=1}^{N} I_i \times K_i$$
where $I$ is the input image, $K$ is the kernel, $N$ is the total number of pixels in $K$, and $FM_j$ is the $j^{th}$ element of the output feature map. The convolution operation also involves two other essential components: strides and padding. Strides define the step size taken along the horizontal and vertical directions of the convolution process, and an increase in stride size yields a lower-dimensional feature map. Padding, on the other hand, allows the convolution process to cover the border information of the image by increasing its dimensions. Therefore, the size of the feature map produced by the convolution process is calculated as follows:
$$h' = \frac{h - k + p}{s} + 1$$
$$w' = \frac{w - k + p}{s} + 1$$
where $h'$ is the height of the feature map, $w'$ is the width of the feature map, $h$ is the height of the input image, $w$ is the width of the input image, $k$ is the filter size, $p$ is the padding of the convolution process, and $s$ is the stride.
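As a quick worked example of the two formulas above (treating p as the total padding added to a dimension), a 224 × 224 input convolved with a 3 × 3 kernel at stride 1 and no padding yields a 222 × 222 feature map; the small helper below reproduces the computation and is purely illustrative.

```python
def feature_map_size(h: int, w: int, k: int, p: int = 0, s: int = 1) -> tuple:
    """Output height and width of a convolution, following the formulas above."""
    return (h - k + p) // s + 1, (w - k + p) // s + 1

print(feature_map_size(224, 224, k=3))             # (222, 222): 3x3 kernel, stride 1, no padding
print(feature_map_size(224, 224, k=3, p=2, s=2))   # (112, 112): the stride halves each dimension
```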

3.5.2. Pooling

The pooling component of the DCNN architecture aims to perform dimension reduction on the feature map while preserving the most significant feature map features. Like convolution, pooling is performed by specifying the pooling region’s size and the operation’s stride. There are several pooling operations available in the literature including max pooling, min pooling, and average pooling. Max pooling is the widely used and most popular technique; it takes the maximum pixel value within a pooling region of the feature map. A major disadvantage of max pooling is that the pooling operator only considers the maximum pixels from the pooling region and disregards pixels. Thus, if the majority of pixels in the pooling have high pixel dynamic values (representing a defective region), the discerning features disappear after performing the max pooling operation. To tackle this challenge, this work combines the max pooling operation with the average pooling operation as follows:
I_{pool} = \max_{i,j}^{h,w} FM_{i,j} + \frac{1}{hw} \sum_{i,j=1}^{h,w} FM_{i,j}        (36)
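A minimal sketch of this combined max and average pooling, assuming non-overlapping pooling windows, is given below; the window size and stride are illustrative.

```python
import numpy as np

def combined_pool(fmap, size=2, stride=2):
    """Combined max + average pooling per Equation (36): for every pooling
    region, the output is the region's maximum plus its mean, so both the
    strongest response and the overall intensity level are retained."""
    h_out = (fmap.shape[0] - size) // stride + 1
    w_out = (fmap.shape[1] - size) // stride + 1
    pooled = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            region = fmap[i * stride:i * stride + size,
                          j * stride:j * stride + size]
            pooled[i, j] = region.max() + region.mean()
    return pooled
```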

3.5.3. Activation Function

The main objective of the activation function in the DCNN architecture is to introduce non-linearity into the network by mapping input values to output values. An input value is obtained by taking the weighted sum of a neuron's inputs and adding a bias; this value is then passed through the activation function to obtain the output value. Several activation functions are available in the literature, including the sigmoid function, the Tanh function, and the Rectified Linear Unit (ReLU) function. The sigmoid activation function takes real numbers as inputs and constrains the outputs to the [ 0 , 1 ] range. The Tanh activation function constrains the outputs to the [ − 1 , 1 ] range. The ReLU activation function is the most commonly used activation function in DCNNs; it sets all negative input values to zero while passing positive values through unchanged, according to the following equation:
f(x)_{ReLU} = \max(0, x)        (37)

3.5.4. Fully Connected Neural Network

The final component of the DCNN architecture is a fully connected network, which consists of an input layer, one or more hidden layers, and an output layer. The input layer takes as input the flattened feature vector obtained from the final convolution or pooling feature map. These inputs are then passed through the hidden layer(s) to the output layer. The size of the output layer corresponds to the number of classes to predict, and each neuron in the output layer gives the probability that a given image belongs to a specific class. The likelihood of class i is obtained as follows:
P_i = \frac{e^{\alpha_i}}{\sum_{j=1}^{N} e^{\alpha_j}}        (38)
where N is the total number of neurons in the output layer, and α_i is the unnormalized output (logit) of the i-th output neuron received from the previous layer.

3.5.5. Loss Function

During the training process of the DCNN architecture, the weights of the convolution and pooling kernels are randomly initialized. Then, several layers of convolution and pooling operations are performed to extract key image features and reduce the feature map dimensions, respectively. After that, the feature map on the final convolution or pooling layer is flattened to generate a feature vector. The obtained vector is then forwarded to the feed-forward neural network pipeline, and the class prediction of the input image is obtained. The predicted output is then compared to the actual output to obtain the loss according to the following cross-entropy loss equation, where Y is the actual output and i [ 1 , N ] .
L(P, Y) = -\sum_{i=1}^{N} Y_i \log(P_i)        (39)
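The Softmax prediction of Equation (38) and the cross-entropy loss of Equation (39) can be sketched together as follows; the logit shift and the small constant inside the logarithm are standard numerical safeguards rather than details taken from the paper, and the example logits are illustrative.

```python
import numpy as np

def softmax(logits):
    """Class probabilities from the unnormalized outputs, per Equation (38)."""
    e = np.exp(logits - logits.max())          # shift for numerical stability
    return e / e.sum()

def cross_entropy(probs, one_hot_label):
    """Cross-entropy between predicted and actual outputs, per Equation (39)."""
    return -np.sum(one_hot_label * np.log(probs + 1e-12))

# Four output neurons: defect-less, wormhole, shrinkage cavity, inclusion
logits = np.array([1.3, 0.2, -0.4, 2.1])
label = np.array([0, 0, 0, 1])                 # actual class: inclusion
print(cross_entropy(softmax(logits), label))
```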

3.5.6. Back-Propagation and Weight Update

The cross-entropy loss measured between the predicted output and the actual output is used to update the weight parameters in each layer of the DCNN architecture using the back-propagation method. Several gradient-based strategies for performing the weight update are available in the literature; these include stochastic gradient descent, batch gradient descent, and mini-batch gradient descent. Stochastic gradient descent updates the weight parameters after every single sample, which makes training slow. Batch gradient descent, on the other hand, uses the entire training dataset at once to update the weight parameters; however, this is computationally expensive, especially when the dataset is huge. The mini-batch gradient descent method is used in this work as it allows for a much faster learning process and reduces the computational cost of modeling large datasets. In mini-batch gradient descent, the weight parameters are updated by iteratively selecting small batches of data at random as follows:
w_k \leftarrow w_k - \frac{\mu}{m} \sum_{i=1}^{m} \frac{\partial L(x_i)}{\partial w_k}        (40)
where w_k is a weight parameter, μ is the learning rate, m is the number of training samples per iteration (the mini-batch size), and ∂L(x_i)/∂w_k is the gradient of the cross-entropy loss of sample x_i with respect to w_k.
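A minimal sketch of one mini-batch update according to Equation (40) is shown below; it assumes the per-sample gradients have already been obtained by back-propagation, and the batch-partitioning helper (using the batch size of 96 fixed later in Section 4.5.2) is illustrative.

```python
import numpy as np

def minibatch_update(weights, grads_per_sample, mu=0.001):
    """One update per Equation (40): average the per-sample gradients of the
    cross-entropy loss over the mini-batch and step against them."""
    mean_grad = np.mean(grads_per_sample, axis=0)   # (1/m) * sum_i dL(x_i)/dw_k
    return weights - mu * mean_grad

def iterate_minibatches(num_samples, batch_size=96, seed=0):
    """Randomly partition sample indices into mini-batches for one pass."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_samples)
    for start in range(0, num_samples, batch_size):
        yield shuffled[start:start + batch_size]
```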
Algorithm 10 shows the steps used to train and classify the test dataset using the DCNN architecture.
Algorithm 10 Defect classification using DCNN.
Requirements: Weld joint image dataset
Output: Class label for each test feature vector
1: Training stage.
2: Define the size of the convolution kernel.
3: Define the size of the pooling kernel.
4: Define the stride parameter.
5: Define the padding parameter.
6: Define the batch size parameter.
7: Randomly divide the dataset into mini-batches.
8: Randomly initialize the weight parameters of the convolution and pooling kernels.
9: while convergence is not reached do
10:     for each mini-batched weld joint image with corresponding class labels do
11:         Obtain feature map from convolution using (33).
12:         Apply the activation function using (37).
13:         Obtain feature map from pooling using (36).
14:         Apply the activation function using (37).
15:         Obtain class prediction using (38).
16:         Obtain loss function using (39).
17:         Update weight parameters using (40).
18:     end for
19: end while
20: Test stage
21: for each unlabelled weld joint image do
22:     Obtain feature map from convolution using (33).
23:     Apply the activation function using (37).
24:     Obtain feature map from pooling using (36).
25:     Apply the activation function using (37).
26:     Obtain class prediction using (38).
27: end for

4. Experimental Results and Discussion

All experiments in this work were carried out using Matlab R2019b software under the University of KwaZulu Natal (UKZN) license. The experiments were conducted on a 64-bit MSI machine with an eight-core processor, an Nvidia GPU, and 32 GB of RAM. Section 4.1 describes the dataset used to conduct the experiments. Section 4.2 presents some image enhancement results obtained with the CLAHE technique presented in Algorithm 1. Section 4.3 presents the accuracy of the Chan–Vese ACM in extracting the weld joint Region of Interest (RoI). Section 4.4 and Section 4.5 present and discuss the classification results obtained by architectures one and two, respectively. Section 4.6 identifies the optimal architecture for the rail welding defect classification task by comparing the best results obtained in each architecture.

4.1. Dataset Description

The experiments were conducted using a dataset collected from the welding department of Transnet Freight Rail (TFR). A total of 6317 radiography images representing four imbalanced classes were collected: defect-less (2210), wormhole (1800), shrinkage cavity (1340), and inclusion (967) images. In each class, the images were divided into a 70/30 split, where 70% of the images were used to train the models and 30% were used to test the models with their optimal parameters. The training set was further divided into a 70/30 split, where 30% of the training images were held out as a validation set used to identify the optimal parameters of the feature extractors and classifiers. Figure 10, Figure 11, Figure 12 and Figure 13 show sample images for each class.
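For illustration, the two-stage split could be reproduced as in the sketch below, which uses scikit-learn and stratifies by class so that each subset keeps the original class proportions; the function name and the use of scikit-learn are assumptions, not the authors' procedure.

```python
from sklearn.model_selection import train_test_split

def split_dataset(images, labels, seed=42):
    """70/30 train/test split, then 30% of the training portion held out as a
    validation set for parameter tuning, stratified per defect class."""
    x_train, x_test, y_train, y_test = train_test_split(
        images, labels, test_size=0.30, stratify=labels, random_state=seed)
    x_train, x_val, y_train, y_val = train_test_split(
        x_train, y_train, test_size=0.30, stratify=y_train, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```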

4.2. Image Enhancement

The collected thermite weld images have poor contrast; therefore, image enhancement techniques were used to improve image quality. The enhancement method used in this work is the CLAHE technique, and it was applied to every image using the steps listed in Algorithm 1. A clip factor of 0.01 was used, and each transform function was calculated on a cell size of 10 × 10 pixels. Figure 14 depicts some of the original and enhanced images.
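A comparable enhancement step can be sketched with scikit-image as shown below; mapping the clip factor of 0.01 and the 10 × 10 pixel cells onto the clip_limit and kernel_size arguments is an assumption, and the authors' own Algorithm 1 may differ in detail.

```python
from skimage import exposure, img_as_float, io

def enhance(path):
    """CLAHE on a radiograph: contrast is equalized locally within small
    cells while the clip limit bounds the amplification of noise."""
    img = img_as_float(io.imread(path, as_gray=True))
    return exposure.equalize_adapthist(img, kernel_size=(10, 10), clip_limit=0.01)
```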

4.3. Weld Joint Extraction

After enhancing the thermite weld images, the next step was to extract the weld joint as the RoI from the image background. Weld joint segmentation and extraction were achieved using Algorithms 2 and 3, respectively. The Chan–Vese ACM was applied to each image, and a contour was obtained at the weld joint boundaries (see Figure 15a). The contour was then used to segment the image, and the coordinates of the weld joint were obtained (see Figure 15b). The obtained coordinates were overlaid on the original image, and a bounding box was placed across the boundaries of the coordinates (see Figure 15c). The region within the bounding box was cropped, and the cropped image represents the weld joint RoI (see Figure 15d).
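A hedged sketch of this segment-then-crop procedure, using the Chan–Vese implementation available in scikit-image rather than the authors' Matlab code, is given below; the μ value is illustrative, and the polarity of the returned mask may need inverting depending on the image.

```python
import numpy as np
from skimage.segmentation import chan_vese

def extract_weld_roi(enhanced_img):
    """Segment the weld joint with the Chan-Vese ACM, place a bounding box
    around the segmented coordinates, and crop it as the RoI (Figure 15a-d)."""
    mask = chan_vese(enhanced_img, mu=0.25)                      # binary segmentation
    coords = np.argwhere(mask)                                   # weld joint coordinates
    (r0, c0), (r1, c1) = coords.min(axis=0), coords.max(axis=0)  # bounding box corners
    return enhanced_img[r0:r1 + 1, c0:c1 + 1]                    # cropped weld joint RoI
```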
Depicted in Figure 16 are some of the images segmented using the proposed Chan–Vese ACM method. It can be observed that some of the weld joint images, particularly those containing wormhole defects, needed post-processing to achieve accurate segmentation and, therefore, accurate weld joint extraction. This is understandable since wormhole defects are characterized by multiple worm-like dark patterns introduced by gas entrapment during the thermite welding process. On the contrary, the shrinkage cavity and inclusion defects were easily segmented as they are characterized mainly by a single shape representing the defect. Shrinkage cavities usually appear as a straight line and are caused by insufficient pre-heating of the rail ends during thermite welding. Inclusions, on the other hand, are irregular in shape and are caused by the presence of foreign objects.
The post-processing technique employed in this work to remove residual spots was based on the mathematical morphology operations presented in [57], and the results are as illustrated in Figure 17. Depicted in Figure 18 is a graph showing the percentage of successfully segmented weld joint images per class.

4.4. Architecture One: Experimental Results

The extracted weld joint RoI images were used as inputs to the first architecture, where two feature extraction methods were applied to every weld joint image. The performance of each feature extraction method was compared using three machine learning classification algorithms: KNN, SVM, and Naive Bayes. The sub-sections below report the classification accuracies achieved for each feature extractor with all three classifiers.

4.4.1. Classification of LBP Features

To investigate the impact of the LBP cell size parameter on classification accuracy, feature extraction was conducted at increasing cell sizes: [ 6 × 14 ] , [ 12 × 28 ] , [ 30 × 70 ] , and [ 60 × 140 ] . Eight neighboring pixels were used as sampling points, since uniform patterns occur more frequently with fewer sampling points and a small neighbourhood also keeps the feature vectors from becoming excessively long.
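The cell-wise uniform LBP feature construction (illustrated in Figure 5) can be sketched as below; the use of scikit-image and its 'nri_uniform' mapping for the 59 pattern bins is an assumption. For a 300 × 700 RoI and a [ 6 × 14 ] cell, the sketch yields 50 × 50 × 59 = 147,500 dimensions, which matches the feature vector length reported in Table 1.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_feature_vector(image, cell=(6, 14), P=8, R=1):
    """Uniform LBP codes (P = 8 sampling points) pooled into a 59-bin
    histogram per cell and concatenated into one feature vector."""
    codes = local_binary_pattern(image, P, R, method="nri_uniform")
    h_cells, w_cells = image.shape[0] // cell[0], image.shape[1] // cell[1]
    hists = []
    for i in range(h_cells):
        for j in range(w_cells):
            block = codes[i * cell[0]:(i + 1) * cell[0],
                          j * cell[1]:(j + 1) * cell[1]]
            hist, _ = np.histogram(block, bins=59, range=(0, 59))
            hists.append(hist)
    return np.concatenate(hists).astype(float)
```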
Classification with KNN Classifier: Table 1 shows the validation and test accuracies obtained with increasing LBP cell sizes when KNN is used as the classifier. At each cell size (e.g., [ 6 × 14 ] ), various K parameter values of the KNN classifier are used. The classification accuracies of each K value are reported utilizing the validation and test datasets. It can be observed that the optimal K value for classifying LBP features obtained at [ 6 × 14 ] is K = 40, with validation and test accuracies of 89% and 91%, respectively. The optimal K value for classifying LBP features generated at [ 12 × 28 ] is K = 25, with validation and test accuracies of 87% and 83%, respectively. At the [ 30 × 70 ] LBP cell size, the optimal K value is obtained at K = 25, with validation and test accuracies of 85%. At a much larger cell size of [ 60 × 140 ] , the optimal K value is obtained at K = 40, with validation and test accuracies of 83% and 79%, respectively. The results indicate that the classification accuracy of the KNN classifier decreases with an increase in the LBP cell size parameter. This also means the KNN classifier performs better on LBP features generated at a much smaller image spatial scale.
Classification with SVM Classifier: Table 2 shows the validation and test accuracies obtained at increasing LBP cell sizes when SVM is used as the classifier. At each cell size (e.g., [ 6 × 14 ] ), various σ parameter values of the RBF kernel are used. The classification accuracies of each σ value are reported utilizing the validation and test datasets. The optimal σ parameter value for classifying LBP features obtained at [ 6 × 14 ] is σ = 0.5 , with validation and test accuracies of 87% and 89%, respectively. The optimal σ value for classifying LBP features generated at [ 12 × 28 ] is σ = 0.25 , with validation and test accuracies of 85%. At the [ 30 × 70 ] LBP cell size, the optimal σ value is σ = 2 , with validation and test accuracies of 77%. At a much larger cell size of [ 60 × 140 ] , the optimal σ value is obtained at σ = 0.25 , with validation and test accuracies of 77%. The results indicate that the classification accuracy of the SVM classifier decreases slightly with an increase in the LBP cell size parameter. It is worth noting that this slight decrease in accuracy is not as prominent as that of the KNN classifier.
Classification with Naive Bayes Classifier: Table 3 shows the validation and test accuracies obtained at increasing LBP cell sizes when Naive Bayes is used as the classifier. It can be seen that the Naive Bayes classifier achieved validation and test accuracies of 73% and 71%, respectively, for the LBP features generated at the [ 6 × 14 ] cell size. Validation and test accuracies of 75% are obtained for LBP features generated at the [ 12 × 28 ] cell size. At the [ 30 × 70 ] LBP cell size, validation and test accuracies of 79% and 77%, respectively, are achieved by the Naive Bayes classifier. For LBP features generated at the [ 60 × 140 ] cell size, the classifier achieves validation and test accuracies of 81%. In contrast to the KNN and SVM classifiers, where the classification accuracies decrease with an increase in the LBP cell size parameter, the accuracies of the Naive Bayes classifier increase with an increase in the LBP cell size parameter.

4.4.2. Best Classifier for LBP Features

Table 4 shows the highest test accuracy achieved by each classifier (with optimal parameters) for classifying features extracted using the LBP descriptor. The best classifier for LBP features is the KNN classifier (optimal parameters: cell size = [ 6 × 14 ] and K = 40) with 91% test accuracy. The second-best classifier is the SVM classifier (optimal parameters: cell size = [ 6 × 14 ] and σ = 0.5 ) with 89% test accuracy. The Naive Bayes classifier achieved the worst results, with 81% test accuracy at the optimal cell size parameter of [ 60 × 140 ].

4.4.3. Classification of BoVW Features

Keypoint detection and description using the SURF extractor were applied to every weld joint image using Algorithm 5. To achieve scale invariance, keypoints were extracted densely every 10 pixels at sampling scales of 1.2, 3.6, and 4.8. The Kmeans clustering algorithm was then applied to over a million randomly sampled keypoint descriptor vectors to learn the codebook according to Algorithm 6. In the Kmeans clustering algorithm, the codebook size was varied from 400 to 3200 codewords to investigate its impact on classification accuracy.
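Codebook learning and histogram encoding can be sketched as follows; the sketch assumes the dense 64-dimensional SURF descriptors have already been extracted by Algorithm 5, and it uses scikit-learn's MiniBatchKMeans in place of the Kmeans implementation used in the experiments.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def learn_codebook(all_descriptors, n_codewords=400, seed=0):
    """Cluster randomly sampled SURF descriptors into a visual codebook
    (Algorithm 6); rows of all_descriptors are 64-D SURF vectors."""
    codebook = MiniBatchKMeans(n_clusters=n_codewords, random_state=seed, n_init=3)
    codebook.fit(all_descriptors)
    return codebook

def bovw_histogram(image_descriptors, codebook):
    """Encode one weld joint image as a normalized histogram of codewords."""
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)
```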
Classification with KNN Classifier: Table 5 shows the validation and test accuracies obtained at increasing codebook sizes when KNN is used as the classifier. At each codebook size, various K values of the KNN classifier are used. The classification accuracies of each K value are reported utilizing the validation and test datasets. At a codebook size of 400 codewords, the optimal K value of the KNN classifier is K = 25, with validation and test accuracies of 89%. At a codebook size of 1200 codewords, the optimal K value of the KNN classifier is K = 40, with validation and test accuracies of 87%. At a codebook size of 2000 codewords, the optimal K value of the KNN classifier is K = 40, with validation and test accuracies of 85% and 87%, respectively. At a codebook size of 3200, the optimal K value of the KNN classifier is K = 25, with validation and test accuracies of 77% and 79%, respectively. A decrease in classification accuracies is observed with an increase in codebook size.
Classification with SVM Classifier: Table 6 shows the validation and test accuracies obtained at increasing codebook sizes when SVM is used as the classifier. At each codebook size, various σ parameter values of the RBF kernel are used. The classification accuracies of each σ value are reported utilizing the validation and test datasets. At a codebook size of 400 codewords, the optimal σ is 0.5, with validation and test accuracies of 91%. At a codebook size of 1200 codewords, the optimal value of the σ parameter is 2, with validation and test accuracies of 89%. The optimal σ parameter value of 2 is obtained for a codebook size of 2000 codewords, with validation and test accuracies of 85%. At a codebook size of 3200, the optimal σ parameter value of 0.5 is obtained, with validation and test accuracies of 85% and 87%, respectively.
Classification with Naive Bayes Classifier: Table 7 shows the validation and test accuracies obtained at increasing codebook sizes when Naive Bayes is used as the classifier. Validation and test accuracies of 77% and 75%, respectively, are obtained at a codebook size of 400. At a codebook size of 1200, validation and test accuracies of 75% are obtained. At a codebook size of 2000, validation and test accuracies of 79% are obtained. Validation and test accuracies of 77% and 75%, respectively, are obtained at a codebook size of 3200.

4.4.4. Best Classifier for BoVW Features

Table 8 shows the highest test accuracy achieved by each classifier (with optimal parameters) for classifying features extracted using the BoVW approach. SVM is the best classifier (optimal parameters: codewords = 400 and σ = 0.5) with 91% test accuracy. KNN is the second-best classifier (optimal parameters: codewords = 400 and K = 25) with 89% test accuracy. The Naive Bayes classifier achieved the worst results, with 79% test accuracy at the optimal codebook size of 2000 codewords.

4.4.5. Architecture One: Best Feature Extractor and Classifier

To identify the best feature extraction and classification algorithms in architecture one, Table 9 depicts the best classifiers for classifying features extracted by the LBP extractor and BoVW approach in terms of the feature vector length, computation time, and test accuracy. Computation time is the average time to extract and classify features in a single weld joint image from the test dataset. It is worth noting that both the LBP and BoVW methods achieve the same test accuracy of 91% with KNN and SVM as classifiers, respectively. However, the BoVW approach and the SVM classifier are the best combinations for extracting and classifying weld joint image features in architecture one. This conclusion is realistic given that the BoVW approach produces a much smaller feature vector of 400 dimensions for each weld joint image, which takes a much shorter time to compute than the feature vector of 147,500 dimensions produced by the LBP extractor.

4.4.6. Architecture One: Discussion

This section presents the experimental results obtained by the feature extractors and classifiers used in architecture one. The LBP feature extractor and BoVW with SURF features were compared. The classification accuracies obtained by these two methods were reported by comparing the KNN, SVM, and NB classifiers. To identify the optimal parameters, several LBP cell size and BoVW codebook size parameters were investigated on features extracted by the LBP extractor and BoVW with SURF features, respectively. For each LBP cell size and BoVW codebook size, several classifier parameters were used to investigate their impact on classification accuracy; these parameters were the K and σ values on the KNN and SVM classifiers, respectively.
As indicated in Table 1 and Table 2, the optimal parameters of the KNN and SVM classifiers resulted in decreased test and validation accuracies as the LBP cell size parameter increased. This decrease in classification accuracies illustrates that the LBP extractor generates robust features at smaller regions of weld joint images; this observation is understandable since the shape of wormhole defects resembles the shape of inclusion defects at larger regions of the images. It is also worth mentioning that the opposite trend was observed in the classification accuracies achieved by the NB classifier, as shown in Table 3. However, the NB classifier is generally not suitable due to assumptions made on data distribution. Table 4 indicates that the optimal classifier for classifying LBP features is the KNN classifier with a test accuracy of 91%, at a 6 × 14 LBP cell size and K = 40.
As indicated in Table 5 and Table 6, the optimal parameters of the KNN and SVM classifiers yielded a slight decrease in the test and validation accuracies with an increase in the BoVW codebook size parameter. This slight decrease in classification accuracies illustrates that the codebook size parameter does not significantly impact the classification results. On the other hand, the Naive Bayes classifier shows a slight increase in classification accuracies for the first 2000 codewords. However, a decrease in classification accuracies is observed at a codebook size of 3200 codewords, as indicated in Table 7.
Table 8 indicates that the optimal classifier for classifying BoVW with SURF features is the SVM classifier with a test accuracy of 91%, at a codebook size of 400 codewords and σ = 0.5. In Table 9, the best feature extractor and classifier in architecture one is presented in terms of the feature vector length, computation cost, and classification accuracy of the test dataset. It can be observed that the combination of the LBP extractor and KNN classifier achieved a similar classification accuracy of 91% as the combination of the BoVW approach with the SVM classifier. However, the feature vector of the LBP extractor has 147,500 dimensions, which require 1.89 s to extract and classify with the KNN classifier. On the other hand, the BoVW approach produces a vector length of only 400 dimensions, which require 0.21 s to extract and classify with the SVM classifier. Therefore, the BoVW approach with the SVM classifier was chosen as the best feature extractor and classifier for architecture one; these results will then be compared to the results obtained in architecture two.

4.5. Architecture Two: Experimental Results

The weld joint Region-of-Interest (RoI) images of size 300 × 700 extracted by Algorithm 3 were used as inputs to train, validate, and test the Deep Convolution Neural Network (DCNN) architecture for the automated classification of thermite weld defects. Unlike in architecture one, where hand-crafted feature extraction techniques were first applied to extract image features before classifying defects, the DCNN architecture automatically learns the image feature representation using deep convolution, and the classification of the learned features is achieved using fully connected neural network layers.

4.5.1. Training Process

Automated feature extraction and classification of thermite weld defects using the DCNN architecture was achieved by following the steps listed in Algorithm 10. In the training stage, several convolution and pooling layers (with corresponding kernels) are applied to the input image to produce feature maps and to reduce the image dimensionality while preserving the most significant image features. The feature maps obtained at the last convolution/pooling layer are flattened and passed to several fully connected layers and then to the output layer, where the class prediction of the input image is made using the Softmax activation function. The predicted and actual class labels are compared, and the difference is measured using the cross-entropy loss function. The measured loss is used to re-adjust the kernel weights of the convolution and pooling layers and the weight parameters of the fully connected layers. Back-propagation for re-adjusting the weights is achieved using the Adam optimizer [58]. The DCNN model is trained for 1200 iterations, and after each iteration, the learning rate is adapted to move the loss toward the global minimum. The Adam optimizer was preferred over other optimization algorithms due to its adaptive learning rate capabilities combined with momentum.
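For reference, one Adam update step is sketched below; only the initial learning rate of 0.001 comes from this work, while the remaining hyperparameters are the commonly used defaults from [58] and are assumptions here.

```python
import numpy as np

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient and its
    square, with bias correction, give the adaptive per-parameter step."""
    t = state["t"] + 1
    m = beta1 * state["m"] + (1 - beta1) * grad
    v = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    state.update(m=m, v=v, t=t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# The state is initialised once per parameter tensor
state = {"m": 0.0, "v": 0.0, "t": 0}
```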

4.5.2. Fixed Parameters

Some of the DCNN parameters were kept at constant values throughout the experiments as they were deemed to have no significant impact on classification accuracy. The mini-batch size was kept at 96 weld joint images per training iteration. To capture the finer details of the input image, the stride of the convolution component was set to s = 1. The zero-padding parameter was set to p = 2 to allow for the extraction of features at the border regions of the input image. The initial learning rate was set to μ = 0.001, which was used as the first learning rate value in the Adam optimizer. The size of each hidden layer of the fully connected network was set to 80 units.
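Taking Equations (34) and (35) literally, the fixed stride and padding together with a chosen kernel size determine the feature map dimensions; the helper below is an illustrative check (treating p as the total added padding follows the equations as written and is an assumption).

```python
def fmap_size(h, w, k, s=1, p=2):
    """Feature map height and width after a convolution or pooling layer,
    following Equations (34) and (35): (dimension - k + p) / s + 1."""
    return (h - k + p) // s + 1, (w - k + p) // s + 1

# 300 x 700 weld joint RoI through a 70 x 70 convolution kernel with the
# fixed stride s = 1 and padding p = 2
print(fmap_size(300, 700, k=70))   # -> (233, 633)
```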

4.5.3. Fine-Tuned Parameters

Several DCNN parameters deemed to have a significant impact on model performance were fine-tuned: the kernel sizes of the convolution and pooling layers and the number of hidden layers in the fully connected network. Table 10, Table 11 and Table 12 report the results obtained on the validation dataset for varying convolution and pooling kernel sizes and for fully connected networks with one, two, and three hidden layers of the same size.
Table 10 shows that a fully connected network with one hidden layer achieved a validation accuracy of 91% at optimal kernel sizes of 70 × 70 , 50 × 50 , and 35 × 35 for the first, second, and third convolution layers, respectively, while the optimal kernel size of the pooling layer was 15 × 15 . Table 11 and Table 12 indicate that fully connected networks with two and three hidden layers achieved validation accuracies of 97% and 93%, respectively, using the same optimal parameters as the fully connected network with one hidden layer. The results across the three tables further indicate that the convolution and pooling kernel sizes have a strong influence on classification accuracy: in contrast to the results of architecture one, larger kernel sizes allow the DCNN architecture to capture more discriminative weld joint image features than smaller kernel sizes. At smaller kernel sizes, the features captured by the DCNN may be similar across multiple defect classes (defect-less, shrinkage cavities, and inclusions) because most image regions are not defective.

4.5.4. Best Parameters for Architecture Two

The optimal kernel sizes of the convolution and pooling layers for fully connected networks with increasing numbers of hidden layers are reported in terms of test accuracy and computation time in Table 13. Computation time is the average time it takes the DCNN architecture to assign a class label to a weld joint image from the test dataset. With increasing numbers of hidden layers in the fully connected network, the optimal kernel sizes of the convolution and pooling layers remain the same. It can also be observed that a fully connected network with two hidden layers achieved the highest test accuracy of 95% compared with networks containing one and three hidden layers, respectively. Although the computation time to process a single weld joint image with the two-hidden-layer network is slightly larger than that of the single-hidden-layer network, the gain in test accuracy outweighs this difference. Therefore, a DCNN architecture with two hidden layers in the fully connected network, using the corresponding convolution and pooling parameters, is the best method for weld joint defect classification in architecture two.

4.6. Best Architecture for Weld Joint Defect Classification

To select the best architecture for thermite weld defect classification, the best methods that achieved the highest test accuracy in each architecture are shown in Table 14. Architecture two outperforms architecture one in terms of test accuracy. The results indicate that extracting features with convolution yields a more robust image representation than local feature extractors. While local feature extractors such as BoVW and LBP extract low-level features such as blobs, points, and small patches, the convolution process of the DCNN model in this work is structured such that the first layer extracts low-level features, the second layer extracts mid-level features such as defect edges, and the third layer captures the actual defect shape. However, architecture one incurs a lower computation cost to extract features and assign a class label to a single weld joint image than architecture two. Nevertheless, the weld joint classification system does not require real-time classification of images; thus, more weight is given to classification accuracy than to computation time. Therefore, architecture two is selected as the best method for the weld joint defect classification task.

5. Graphical User Interface for Onsite Defect Investigation

A desktop Graphical User Interface (GUI) for direct use by railway personnel for automated onsite weld joint defect investigation was developed using Matlab. As shown in Figure 19, the GUI comprises five main panels: image analysis, visualization, classification, rejection criteria, and report generation.

5.1. Image Analysis Panel

The image analysis panel consists of four buttons. The user first uploads an image from a local machine by pressing the “upload image” button. The uploaded image is shown as “input” under the visualization panel. Thereafter, the user applies Algorithm 1 to enhance the image by pressing the “filter” button. The enhanced image is shown as “processed” under the visualization panel. The user then applies Algorithm 3 to extract the weld joint Region of Interest (RoI) by pressing the “weld joint RoI” button; the extracted image is shown as “weld joint” under the visualization panel.

5.2. Classification Panel

After extracting the weld joint, the user applies Algorithm 10 to extract features and obtain the predicted class label by pressing the “select classifier” button. The predicted class is displayed under the “predictions” text box area. For instances where a class is predicted as one of the defects, the user can obtain information such as defect location and sizes by pressing the “defect info” button. The location of the identified defects within the weld joint is highlighted in yellow under the classification panel, while the size of the largest defect is shown under the “largest defect” text box.

5.3. Rejection Criteria Panel

After obtaining the predicted defect type and the largest defect size, the user compares the results to the applicable railway standard that defines different rejection criteria for axle loads (20 and 26 tons) and rail location (foot or web). Therefore, a user first selects the axle load of the railway line where that specific weld joint is located and the rail location. Based on these two inputs, the user obtains the decision (accept or reject) and the corresponding maintenance activity (class 2 or class 3).

5.4. Report Generation Panel

Once the decision and maintenance activity have been obtained, the user generates a report to be sent to the maintenance manager. This is achieved by selecting the Transnet depot in charge of that railway line and then selecting the date on which radiography testing was conducted and the corresponding radiography number. The user then adds all the above information to the report by pressing the “add row” button. Thereafter, the user can either email the report to the depot's maintenance manager or save it to their local machine by pressing the “email report” or “save report” button.

6. Conclusions

This work has presented a comparative study of two architectures for the multi-class classification of rail welding defects. Radiography images were first enhanced using the CLAHE technique, which improved the image quality and defect visibility. The Chan–Vese ACM was applied to the enhanced images, and the weld joint was segmented and extracted as the RoI from the image background. The extracted weld joint images were then used as inputs to train, validate, and test the feature extraction and classification methods of two architectures. In architecture one, two traditional feature extraction methods were compared: the LBP extractor and BoVW with dense SURF features. The performance of each feature extractor was reported using the classification accuracies of the KNN, SVM, and Naive Bayes classifiers on the validation and test datasets. The cell size and codebook size parameters of the LBP extractor and the BoVW approach, respectively, were varied to investigate their impact on classification accuracy. For each cell size and codebook size parameter, various classifier parameters were used to identify the optimal ones. Experiments found that combining the BoVW approach (at a codebook size of 400 codewords) with the SVM classifier (at σ = 0.5) achieves the highest classification accuracy of 91% on the test dataset. In architecture two, experiments were conducted with several kernel sizes for the convolution and pooling layers of the DCNN and with several numbers of hidden layers in the fully connected network. Experiments found that a fully connected network with two hidden layers of the same size achieves the highest accuracy of 95% at optimal convolution kernel sizes of 75 × 75 , 50 × 50 , and 35 × 35 for the first, second, and third convolution layers, respectively; the optimal pooling kernel size was 50 × 50 for each pooling layer. Comparing the best classification accuracies obtained in both architectures, the DCNN method of architecture two is the best method for classifying rail welding defects. Although some works exist in the literature that handle a similar task, a direct comparison would not be fair given that the datasets used in the literature are very small compared to the dataset used in this work.
The DCNN method was integrated into a Graphical User Interface (GUI), which is currently being used for onsite investigation of rail welding defects in radiographic images at Transnet Freight Rail (TFR). Notably, the developed approach is universal and can be adapted for similar defect detection tasks in other countries that use Thermite welding and radiographic testing for rail inspection. While radiography testing standards for welded rail inspection may vary between countries, the developed approach allows for the incorporation of these standards to ensure broader applicability.

7. Future Work

It is important to note that while the dataset was suitable for training and evaluating the models, detailed information, such as radiography equipment specifications and exact imaging conditions, was not available. This limitation restricts the ability to fully assess the influence of acquisition parameters on model performance and may affect reproducibility in other settings. Nevertheless, the image enhancement technique proposed in this study attempts to improve image quality before training the model. To address the limitation posed by the absence of detailed image acquisition information, future work could involve constructing a curated and fully annotated radiographic image dataset with standardized documentation of key parameters such as radiography system specifications, exposure settings, and environmental conditions. Such a dataset would not only enhance reproducibility and benchmarking across studies but also allow for controlled experiments to evaluate the impact of acquisition variables on defect detection performance. Collaborating directly with equipment manufacturers or radiography teams during data collection could help ensure that this critical information is systematically captured and made available for analysis. Another focus area would be to expand the image dataset and leverage transfer learning techniques such as Residual Networks (ResNet) [59], Visual Geometry Group (VGG16) [60], and Vision Transformers. These models, when pre-trained on large-scale image datasets, are particularly effective for improving performance in image classification tasks involving limited data. Furthermore, additional model performance metrics such as precision, recall, and F1-score should be explored.

Author Contributions

Conceptualization, M.E.M., J.R.T. and S.S.V.; methodology, M.E.M. and J.R.T.; software, M.E.M. and J.R.T.; validation, M.E.M. and J.R.T.; formal analysis, M.E.M. and J.R.T.; investigation, M.E.M. and J.R.T.; resources, M.E.M., J.R.T. and S.S.V.; data curation, M.E.M., J.R.T. and S.S.V.; writing—original draft preparation, M.E.M.; writing—review and editing, J.R.T.; visualization, M.E.M. and J.R.T.; supervision, J.R.T.; project administration, J.R.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used to conduct experiments in this study are available upon request.

Acknowledgments

The authors would like to thank the welding department of Transnet Freight Rail (TFR) for providing the dataset used to conduct the experiments. The authors would also like to thank the radiography specialists of the TFR Rail Network for categorizing different defect types.

Conflicts of Interest

Author Siboniso Sithembiso Vilakazi was employed by the Transnet Freight Rail company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Kumar, S. Study of Rail Breaks: Associated Risks and Maintenance Strategies; Luleå Railway Research Center: Luleå, Sweden, 2006. [Google Scholar]
  2. Rail Accident Investigation Branch. Class Investigation into Rail Breaks on the East Coast Main Line; Rail Accident Investigation Branch: Derby, UK, 2014.
  3. Cunningham, J.; Shaw, A.; Trosino, M. Automated Track Inspection Vehicle and Methods. U.S. Patent US6356299B1, 16 May 2000. [Google Scholar]
  4. Vijaykumar, V.R.; Sangamithirai, S. Rail defect detection using Gabor filters with texture analysis. In Proceedings of the 2015 3rd International Conference on Signal Processing, Communication and Networking (ICSCN), Chennai, India, 6–28 March 2015; pp. 1–6. [Google Scholar]
  5. Min, Y.; Xiao, B.; Dang, J. Real time detection system for rail surface. EURASIP J. Image Video Process. 2018, 2018, 3. [Google Scholar] [CrossRef]
  6. Lui, Y.; Fan, L.; Zhang, S. Exploration of rail defects detection system. In Proceedings of the 2018 5th International Conference on Information Science and Control Engineering (ICISCE), Zhengzhou, China, 20–22 July 2018; pp. 1118–1122. [Google Scholar] [CrossRef]
  7. Taştimur, C.; Karaköse, M.; Akın, E.; Aydın, İ. Rail defect detection with real time image processing technique. In Proceedings of the 2016 IEEE 14th International Conference on Industrial Informatics (INDIN), Poitiers, France, 19–21 July 2016; pp. 411–415. [Google Scholar]
  8. Rajagopal, M.; Balasubramanian, M.; Palanivel, S. An Efficient Framework to Detect Cracks in Rail Tracks Using Neural Network Classifier. Comput. Sist. 2018, 22, 943–952. [Google Scholar] [CrossRef]
  9. Wang, G.; Torr, P. Traditional Classification Neural Networks are Good Generators. arXiv 2022, arXiv:1409.1556. [Google Scholar]
  10. Yue, B.; Wang, Y.; Min, Y.; Zhang, Z.; Wang, W.; Yong, J. Rail Surface Defect Recognition Method Based on AdaBoost Multi-classifier Combination. In Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China, 18–21 November 2019; pp. 391–396. [Google Scholar]
  11. Santur, Y.; Karaköse, M.; Akin, E. Random forest based diagnosis approach for rail fault inspection in railways. In Proceedings of the 2016 National Conference on Electrical, Electronics and Biomedical Engineering (ELECO), Bursa, Turkey, 1–3 December 2016; pp. 745–750. [Google Scholar]
  12. Alkanari, A.; Aljaber, S. Principle Component Analysis algorithm (PCA) for image recognition. In Proceedings of the 2015 Second International Conference on Computing Technology and Information Management (ICCTIM), Johor, Malaysia, 21–23 April 2015; pp. 76–80. [Google Scholar]
  13. Fauvel, M.; Chanussot, J.; Benediktsson, J.A. Kernel Principal Component Analysis for the Classification of Hyperspectral Remote Sensing Data over Urban Areas. EURASIP J. Adv. Signal Process. 2009, 2009, 783194. [Google Scholar] [CrossRef]
  14. Rebrov, U.; Galina, K. Using Singular Value Decomposition to Reduce Dimensionality of Initial Data Set. In Proceedings of the 2020 61st International Scientific Conference on Information Technology and Management Science of Riga Technical University (ITMS), Riga, Latvia, 15–16 October 2020; pp. 1–4. [Google Scholar]
  15. Biau, G.; Scornet, E. A Random Forest Guided Tour. arXiv 2015, arXiv:1511.05741. [Google Scholar] [CrossRef]
  16. Shulin, G.; Thorsten, S.; Ralf, A. Use of Combined Railway Inspection Data Sources for Characterization of Rolling Contact Fatigue. In Proceedings of the 12th European Conference on Non-Destructive Testing (ECNDT), Gothenburg, Sweden, 11–15 June 2018. [Google Scholar]
  17. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18. [Google Scholar] [CrossRef]
  18. Mercy, K.G.; Rao, S.K.S. A Framework for Rail Surface Defect Prediction Using Machine Learning Algorithms. In Proceedings of the 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 11–12 July 2018; pp. 972–977. [Google Scholar]
  19. Quinlan, J.R. Learning Decision Tree Classifiers; Association for Computing Machinery: New York, NY, USA, 1996. [Google Scholar]
  20. Shah, A.; Waqas, A.; Chowdhry, B. Development of a Wireless Track Recording Vehicle with a Low Environmental Impact: An Approach for Enhancing Railway Track Safety Standards. In Proceedings of the 2013 IEEE China Summit and International Conference on Signal and Information Processing, Malaga, Spain, 14–17 February 2022; pp. 1–7. [Google Scholar]
  21. Molefe, M.E.; Tapamo, J.-R. Classification of Thermite Welding Defects using Local Binary Patterns and K Nearest Neighbors. In Proceedings of the 2021 Conference on Information Communications Technology and Society (ICTAS), Durban, South Africa, 10–11 March 2021; pp. 91–96. [Google Scholar]
  22. Cunningham, P.; Delany, S. k-Nearest neighbour classifiers. ACM Comput. Surv. 2007, 54, 128. [Google Scholar]
  23. Kashvi, T.; Sanjukta, D.; Srishti, V. A Brief Review of Nearest Neighbor Algorithm for Learning and Classification. In Proceedings of the 2019 International Conference on Intelligent Computing and Control Systems (ICCS), Madurai, India, 15–17 May 2019; pp. 1255–1260. [Google Scholar]
  24. Molefe, M.; Tapamo, J.R. Classification of Rail Welding Defects Based on the Bag of Visual Words Approach. In Pattern Recognition, Proceedings of the 13th Mexican Conference, MCPR 2021, Mexico City, Mexico, 23–26 June 2021; Lecture Notes in Computer Science; Roman-Rangel, E., Kuri-Morales, Á.F., Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., Olvera-López, J.A., Eds.; Springer: Berlin/Heidelberg, Germany, 2021; Volume 12725. [Google Scholar]
  25. Soukup, D.; Huber-Mörk, R. Convolutional Neural Networks for Steel Surface Defect Detection from Photometric Stereo Images. In Advances in Visual Computing, Proceedings of the 10th International Symposium, ISVC 2014, Las Vegas, NV, USA, 8–10 December 2014; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8887. [Google Scholar]
  26. James, A.; Jie, W.; Xulei, Y.; Chenghao, Y.; Ngan, N.B.; Yuxin, L.; Yi, S.; Chandrasekhar, V.; Zeng, Z. TrackNet—A Deep Learning Based Fault Detection for Railway Track Inspection. In Proceedings of the 2018 International Conference on Intelligent Rail Transportation (ICIRT), Singapore, 12–14 December 2018; pp. 1–5. [Google Scholar]
  27. Shang, L.; Yang, Q.; Wang, J.; Li, S.; Lei, W. Detection of rail surface defects based on CNN image recognition and classification. In Proceedings of the 20th International Conference on Advanced Communication Technology (ICACT), Chuncheon, Republic of Korea, 11–14 February 2018; pp. 45–51. [Google Scholar]
  28. Faghih-Roohi, S.; Hajizadeh, S.; Núñez, A.; Babuska, R.; De Schutter, B. Deep convolutional neural networks for detection of rail surface defects. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 2584–2589. [Google Scholar]
  29. Jamshidi, A.; Hajizadeh, S.; Su, Z.; Naeimi, M.; Núñez, A.; Dollevoet, R.; De Schutter, B.; Li, Z. A decision support approach for condition-based maintenance of rails based on big data analysis. Transp. Res. Part Emerg. Technol. 2018, 95, 185–206. [Google Scholar] [CrossRef]
  30. Gibert, X.; Patel, V.M.; Chellappa, R. Robust Fastener Detection for Autonomous Visual Railway Track Inspection. In Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2015; pp. 694–701. [Google Scholar]
  31. Gibert, X.; Patel, V.M.; Chellappa, R. Deep Multitask Learning for Railway Track Inspection. IEEE Trans. Intell. Transp. Syst. 2017, 18, 153–164. [Google Scholar] [CrossRef]
  32. Yanan, S.; Hui, Z.; Li, L.; Hang, Z. Rail Surface Defect Detection Method Based on YOLOv3 Deep Learning Networks. In Proceedings of the 2018 Chinese Automation Congress (CAC), Xi’an, China, 30 November–2 December 2018; pp. 1563–1568. [Google Scholar]
  33. Xu, Q.; Zhao, Q.; Yu, G.; Wang, L.; Shen, T. Rail Defect Detection Method Based on Recurrent Neural Network. In Proceedings of the 39th Chinese Control Conference (CCC), Shenyang, China, 27–29 July 2020; pp. 6486–6490. [Google Scholar]
  34. Song, X.; Chen, K.; Cao, Z. ResNet-based Image Classification of Railway Shelling Defect. In Proceedings of the 39th Chinese Control Conference (CCC), Shenyang, China, 27–29 July 2020; pp. 6589–6593. [Google Scholar]
  35. Liu, Y.; Sun, X.; Pang, J. A YOLOv3-based deep learning application research for condition monitoring of rail thermite welded Joints. In Proceedings of the 2020 2nd International Conference on Image, Video and Signal Processing, Singapore, 20–22 March 2020. [Google Scholar]
  36. Sumesh, A.; Rameshkumar, K.; Mohandas, K.; Shyam, R. Use of machine learning algorithms for weld quality monitoring using acoustic signature. In Proceedings of the 2nd International Symposium on Big Data and Cloud Computing (ISBCC’15), Chennai, India, 12–13 March 2015; pp. 316–322. [Google Scholar]
  37. Shin, S.; Jin, C.; Yu, J.; Rhee, S. Real-time detection of weld Defects for automated welding process based on deep neural network. Metals 2020, 10, 389. [Google Scholar] [CrossRef]
  38. Molefe, M.; Tapamo, J.R. Combining Multi-Layer Perceptron and Local Binary Patterns for Thermite Weld Defects Classification. In Pan-African Artificial Intelligence and Smart Systems, Proceedings of the First International Conference, PAAISS 2021, Windhoek, Namibia, 6–8 September 2021; Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering; Ngatched, T.M.N., Woungang, I., Eds.; Springer: Berlin/Heidelberg, Germany, 2021; Volume 405. [Google Scholar]
  39. Mu, W.L.; Gao, J.M.; Jiang, H.Q.; Chen, F.M.; Gao, Z.Y.; Chen, K.; Dang, C.Y. A Method of Radiographic Image Quality Enhancement. In Proceedings of the 2013 Fifth International Conference on Measuring Technology and Mechatronics Automation, Hong Kong, China, 16–17 January 2013; pp. 29–32. [Google Scholar]
  40. Patel, S.; Goswami, M. Comparative analysis of Histogram Equalization techniques. In Proceedings of the International Conference on Contemporary Computing and Informatics (IC3I), Mysore, India, 27–29 November 2014; pp. 167–168. [Google Scholar]
  41. Yuan, H. Identification of global histogram equalization by modeling gray-level cumulative distribution. In Proceedings of the 2022 Global Conference on Wireless and Optical Technologies (GCWOT), Malaga, Spain, 14–17 February 2022; pp. 645–649. [Google Scholar]
  42. Laldinpuii, C.; Sandeep, K.; Anjali, A.; Mahak, J.; Manisha, B.; Acharya, K. Performance Analysis of Adaptive Histogram Equalization-Based Image Enhancement Schemes. In Proceedings of the 2023 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), Mysore, India, 27–29 November 2014; pp. 128–134. [Google Scholar]
  43. Garima, Y.; Saurabh, M.; Anjali, A. Contrast limited adaptive histogram equalization based enhancement for real time video system. In Proceedings of the 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Delhi, India, 24–27 September 2014; pp. 2392–2397. [Google Scholar]
  44. Tong, T.; Cai, Y.; Sun, D. Defects Detection of Weld Image Based on Mathematical Morphology and Thresholding Segmentation. In Proceedings of the 2012 8th International Conference on Wireless Communications, Networking and Mobile Computing, Shanghai, China, 21–23 September 2012; pp. 1–4. [Google Scholar]
  45. Ajmi, C.; Ferchichi, S.E.; Zaafouri, A.; Laabidi, K. Automatic Detection of Weld Defects Based on Hough Transform. In Proceedings of the 2019 International Conference on Signal, Control and Communication (SCC), Hammamet, Tunisia, 16–18 December 2019; pp. 1–6. [Google Scholar]
  46. Urata, S.; Yasukawa, H. Improvement of contour extraction precision of active contour model with structuring elements. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 1169–1172. [Google Scholar]
  47. Wang, L.; Liu, J.; Wu, T. Fast Global Active Contour Model with Local Information. In Proceedings of the IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chengdu, China, 20–22 December 2019; pp. 1506–1510. [Google Scholar]
  48. Zhang, X. A hybrid active contour model driven by global and local intensity information. In Proceedings of the 2nd International Conference on Safety Produce Informatization (IICSPI), Chongqing, China, 28–30 November 2019; pp. 425–427. [Google Scholar]
  49. Kaganami, H.G.; Beiji, Z. Region-Based Segmentation versus Edge Detection. In Proceedings of the Fifth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Kyoto, Japan, 12–14 September 2009; pp. 1217–1221. [Google Scholar]
  50. Kass, M.; Witkin, A.; Terzopoulos, D. Snakes: Active contour models. Int. J. Comput. Vis. 1988, 1, 321–331. [Google Scholar] [CrossRef]
  51. Chan, T.F.; Vese, L.A. An active contour model without edges. In Scale-Space Theories in Computer Vision, Proceedings of the Second International Conference, Scale-Space’99, Corfu, Greece, 26–27 September 1999; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 1999; Volume 1682, pp. 141–151. [Google Scholar]
  52. Mumford, D.; Shah, J. Optimal approximation by piecewise smooth functions and associatedvariational problems. Commun. Pure Appl. Math. 1989, 42, 577–685. [Google Scholar] [CrossRef]
  53. Ojala, T.; Pietikainen, M.; Harwood, D. A comparative study of texture measures with classification based on featured distributions. Pattern Recognit. 1996, 29, 51–59. [Google Scholar] [CrossRef]
  54. Ojala, T.; Pietikäinen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [Google Scholar] [CrossRef]
  55. Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded Up Robust Features. In Computer Vision—ECCV 2006, Proceedings of the 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Lecture Notes in Computer Science; Leonardis, A., Bischof, H., Pinz, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3951. [Google Scholar]
  56. Hongwei, Z.; Yuchang, L. Learning Bayesian network classifiers from data with missing values. In Proceedings of the 2002 IEEE Region 10 Conference on Computers, Communications, Control and Power Engineering, TENCOM ’02 Proceedings, Beijing, China, 28–31 October 2002; Volume 1, pp. 35–38. [Google Scholar]
  57. Ko, S.; Morales, A.; Hoon, K. A fast implementation algorithm and a bit-serial realization method for grayscale morphological opening and closing. IEEE Trans. Signal Process. 1995, 43, 3058–3061. [Google Scholar]
  58. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  59. Kaiming, H.; Xiangyu, Z.; Shaoqing, R.; Jian, S. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
  60. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
Figure 1. Methodological diagram of the proposed methods.
Figure 1. Methodological diagram of the proposed methods.
Jsan 14 00058 g001
Figure 2. Thermite weld image.
Figure 2. Thermite weld image.
Jsan 14 00058 g002
Figure 3. The computation of the original LBP descriptor for the center pixel (highlighted in red). First, the pixel value of the center pixel is compared with the gray values of pixels in its neighbourhood region ( 11 , 12 , 12 , 16 , 11 , 13 , 15 , 19 ) . Thereafter, binary code ( 0 , 0 , 0 , 1 , 0 , 0 , 1 , 1 ) is obtained, where pixels greater than the center pixels are assigned a value of “1” and pixels smaller than the center pixels are assigned a value “0”. Finally, the binary code is converted into a binary number ( 19 ) which represents the LBP feature of the center pixel.
Figure 3. The computation of the original LBP descriptor for the center pixel (highlighted in red). First, the pixel value of the center pixel is compared with the gray values of pixels in its neighbourhood region ( 11 , 12 , 12 , 16 , 11 , 13 , 15 , 19 ) . Thereafter, binary code ( 0 , 0 , 0 , 1 , 0 , 0 , 1 , 1 ) is obtained, where pixels greater than the center pixels are assigned a value of “1” and pixels smaller than the center pixels are assigned a value “0”. Finally, the binary code is converted into a binary number ( 19 ) which represents the LBP feature of the center pixel.
Jsan 14 00058 g003
Figure 4. Examples of the modified LBP descriptor: the circular (8, 1), (12, 2), (12, 3) neighborhoods, grey pixels represent the center pixels and black pixels represent the neighbours of the center pixels.
Figure 4. Examples of the modified LBP descriptor: the circular (8, 1), (12, 2), (12, 3) neighborhoods, grey pixels represent the center pixels and black pixels represent the neighbours of the center pixels.
Jsan 14 00058 g004
Figure 5. Computation of a histogram vector using the uniform LBP extractor: First, an image is divided into cells of equal size (red and blue as shown), thereafter, 59 patterns, of which 58 are uniform, are extracted on each cell (represented by histograms). Finally, the patterns from all the cells are concatenated into a single histogram to obtain the final feature vector for the given image.
Figure 5. Computation of a histogram vector using the uniform LBP extractor: First, an image is divided into cells of equal size (red and blue as shown), thereafter, 59 patterns, of which 58 are uniform, are extracted on each cell (represented by histograms). Finally, the patterns from all the cells are concatenated into a single histogram to obtain the final feature vector for the given image.
Jsan 14 00058 g005
Figure 6. (a) keypoint extraction: An image is divided into grids of equal sizes, and the center pixel (highlighted in red) in each grid is considered a keypoint. and (b) keypoint description: a square region is placed and centered around every keypoint and oriented along the keypoint’s dominant orientation. This region is then divided into 16 sub-regions.
Figure 6. (a) keypoint extraction: An image is divided into grids of equal sizes, and the center pixel (highlighted in red) in each grid is considered a keypoint. and (b) keypoint description: a square region is placed and centered around every keypoint and oriented along the keypoint’s dominant orientation. This region is then divided into 16 sub-regions.
Jsan 14 00058 g006
Figure 7. Image representation using BoVW: First, keypoints (highlighted in red) are detected, thereafter (a), the descriptor vectors are generated (b), the obtained vectors are then grouped into multiple codebooks using the Kmeans clustering algorithm (example c 1 , c 2 , c 3 and c 4 ) (c). Finally, coding and pooling steps are applied to represent each image in terms of codewords, and to provide a global feature representation, respectively (d).
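The pipeline of Figure 7 can be sketched roughly as follows, assuming opencv-contrib-python (SURF is only available in the contrib build) and scikit-learn; the grid spacing, Hessian threshold, and codebook size below are illustrative placeholders, not the study's tuned values.

import cv2
import numpy as np
from sklearn.cluster import KMeans

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

def dense_surf_descriptors(image, grid=20):
    # Place a keypoint at the center of every grid x grid cell and describe it.
    keypoints = [cv2.KeyPoint(float(x), float(y), float(grid))
                 for y in range(grid // 2, image.shape[0], grid)
                 for x in range(grid // 2, image.shape[1], grid)]
    _, descriptors = surf.compute(image, keypoints)
    return descriptors

# Codebook learning: stack the descriptors of all training images, then cluster.
# all_desc = np.vstack([dense_surf_descriptors(img) for img in train_images])
# codebook = KMeans(n_clusters=400, random_state=0).fit(all_desc)

def bovw_histogram(image, codebook):
    # Coding: assign each descriptor to its nearest codeword.
    words = codebook.predict(dense_surf_descriptors(image))
    # Pooling: accumulate the assignments into a normalized histogram.
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)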
Figure 8. SVM concept for a binary classification task: circular and square shapes represent instances belonging to two distinct class labels. The line between the H1 and H2 planes is considered the optimal hyperplane, and it is obtained by maximizing the distance d.
Figure 9. DCNN architecture: The red line illustrates the transition from the convolution layer to the pooling layer.
Figure 10. Defect-less.
Figure 11. Wormhole.
Figure 12. Inclusions.
Figure 13. Shrinkage cavities.
Figure 14. Original images (a,c,e), enhanced images (b,d,f).
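As a rough illustration of the enhancement step, the sketch below applies global histogram equalization and a contrast-limited (CLAHE) variant with OpenCV; the clip limit and tile size are placeholder values, not those used in the study.

import cv2

def enhance(gray):
    # Global histogram equalization spreads the full intensity range.
    global_eq = cv2.equalizeHist(gray)
    # CLAHE equalizes locally while limiting noise amplification.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    adaptive_eq = clahe.apply(gray)
    return global_eq, adaptive_eq

# gray = cv2.imread("radiograph.png", cv2.IMREAD_GRAYSCALE)
# eq, clahe_eq = enhance(gray)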
Figure 15. Weld joint segmentation and RoI extraction. (a) The Chan–Vese active contour model is applied and its energy is minimal at the weld joint boundaries, (b) the weld joint image is segmented, (c) a bounding box is placed on the boundaries of the segmented weld joint image, and (d) the region inside the bounding box is cropped; it represents the weld joint region of interest.
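A minimal sketch of the segmentation and cropping steps in Figure 15, assuming scikit-image; the Chan–Vese parameters below are library defaults used for illustration, not the values tuned for the radiographs.

from skimage.segmentation import chan_vese
from skimage.measure import label, regionprops
from skimage.util import img_as_float

def extract_weld_roi(gray):
    # Segment the weld joint with the Chan-Vese active contour model.
    mask = chan_vese(img_as_float(gray), mu=0.25, lambda1=1, lambda2=1,
                     max_num_iter=200)
    regions = regionprops(label(mask))
    if not regions:
        return gray  # fall back to the full image if nothing was segmented
    # Keep the largest connected region and crop its bounding box (the RoI).
    largest = max(regions, key=lambda r: r.area)
    min_r, min_c, max_r, max_c = largest.bbox
    return gray[min_r:max_r, min_c:max_c]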
Figure 16. Some of the obtained results: (a) application of Chan–Vese ACM and (b) segmented thermite weld images.
Figure 17. Some of the post-processing results applied on wormhole defect images.
Figure 18. Percentage of segmented weld joint images per each class.
Figure 19. Graphical User Interface for onsite defect investigation.
Table 1. Accuracies of LBP features with KNN classifier.
Cell Size   | Length  | Parameter | Val. Acc (%) | Test Acc (%)
[6 × 14]    | 147,500 | K = 10    | 85           | 81
            |         | K = 25    | 81           | 79
            |         | K = 40    | 89           | 91
            |         | K = 55    | 86           | 81
[12 × 28]   | 36,875  | K = 10    | 83           | 79
            |         | K = 25    | 87           | 83
            |         | K = 40    | 81           | 76
            |         | K = 50    | 79           | 77
[30 × 70]   | 5900    | K = 10    | 83           | 85
            |         | K = 25    | 85           | 85
            |         | K = 40    | 77           | 79
            |         | K = 55    | 81           | 79
[60 × 140]  | 1475    | K = 10    | 75           | 71
            |         | K = 25    | 77           | 73
            |         | K = 40    | 83           | 79
            |         | K = 50    | 77           | 73
Note: Bold values indicate best performance for each cell size and length.
Table 2. Accuracies of LBP features with SVM classifier.
Cell Size   | Length  | Parameter | Val. Acc (%) | Test Acc (%)
[6 × 14]    | 147,500 | σ = 0.25  | 83           | 85
            |         | σ = 0.5   | 87           | 89
            |         | σ = 2     | 83           | 81
            |         | σ = 4     | 79           | 81
[12 × 28]   | 36,875  | σ = 0.25  | 85           | 85
            |         | σ = 0.5   | 79           | 79
            |         | σ = 2     | 81           | 77
            |         | σ = 4     | 79           | 77
[30 × 70]   | 5900    | σ = 0.25  | 75           | 73
            |         | σ = 0.5   | 77           | 75
            |         | σ = 2     | 77           | 77
            |         | σ = 4     | 75           | 77
[60 × 140]  | 1475    | σ = 0.25  | 77           | 77
            |         | σ = 0.5   | 77           | 75
            |         | σ = 2     | 75           | 73
            |         | σ = 4     | 75           | 73
Note: Bold values indicate best performance for each cell size and length.
Table 3. Accuracies of LBP features with Naive Bayes classifier.
Cell Size   | Length  | Val. Acc (%) | Test Acc (%)
[6 × 14]    | 147,500 | 73           | 71
[12 × 28]   | 36,875  | 75           | 75
[30 × 70]   | 5900    | 79           | 77
[60 × 140]  | 1475    | 81           | 81
Note: Bold values indicate best performance for each cell size and length.
Table 4. Highest classification accuracy of each classifier for LBP features.
Method    | Optimal Parameters           | Length  | Test Acc (%)
LBP + KNN | Cell size: 6 × 14, K = 40    | 147,500 | 91
LBP + SVM | Cell size: 6 × 14, σ = 0.5   | 147,500 | 89
LBP + NB  | Cell size: 60 × 140          | 1475    | 81
Note: Bold values indicate best classifier at the corresponding parameters for LBP features.
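In outline, the classifier comparison of Tables 1–4 can be reproduced with scikit-learn as below; mapping the σ parameter to the RBF kernel via gamma = 1/(2σ²) is one common convention and is assumed here, and the feature matrices are placeholders.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

def build_classifiers(k=40, sigma=0.5):
    # The three classifiers applied to the LBP (and BoVW) feature vectors.
    return {
        "KNN": KNeighborsClassifier(n_neighbors=k),
        # gamma = 1 / (2 * sigma**2): assumed RBF parameterization of sigma
        "SVM": SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2)),
        "NB": GaussianNB(),
    }

# X_train, y_train, X_test, y_test = ...  # LBP histograms and class labels
# for name, clf in build_classifiers().items():
#     clf.fit(X_train, y_train)
#     print(name, clf.score(X_test, y_test))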
Table 5. Accuracies of BoVW approach with KNN classifier.
Codebooks | Length | Parameter | Val. Acc (%) | Test Acc (%)
400       | 400    | K = 10    | 85           | 83
          |        | K = 25    | 89           | 89
          |        | K = 40    | 85           | 87
          |        | K = 55    | 83           | 85
1200      | 1200   | K = 10    | 83           | 81
          |        | K = 25    | 83           | 85
          |        | K = 40    | 87           | 87
          |        | K = 50    | 87           | 83
2000      | 2000   | K = 10    | 81           | 83
          |        | K = 25    | 83           | 83
          |        | K = 40    | 85           | 87
          |        | K = 55    | 85           | 83
3200      | 3200   | K = 10    | 81           | 81
          |        | K = 25    | 85           | 83
          |        | K = 40    | 83           | 83
          |        | K = 50    | 77           | 79
Note: Bold values indicate best performance for each codebook and length.
Table 6. Accuracies of BoVW features with SVM classifier.
Codebooks | Length | Parameter | Val. Acc (%) | Test Acc (%)
400       | 400    | σ = 0.25  | 85           | 83
          |        | σ = 0.5   | 91           | 91
          |        | σ = 2     | 87           | 87
          |        | σ = 4     | 87           | 85
1200      | 1200   | σ = 0.25  | 87           | 89
          |        | σ = 0.5   | 87           | 85
          |        | σ = 2     | 89           | 89
          |        | σ = 4     | 83           | 83
2000      | 2000   | σ = 0.25  | 85           | 83
          |        | σ = 0.5   | 83           | 83
          |        | σ = 2     | 85           | 85
          |        | σ = 4     | 87           | 83
3200      | 3200   | σ = 0.25  | 81           | 81
          |        | σ = 0.5   | 85           | 87
          |        | σ = 2     | 85           | 83
          |        | σ = 4     | 81           | 79
Note: Bold values indicate best performance for each codebook and length.
Table 7. Accuracies of BoVW features with Naive Bayes classifier.
Codebooks | Length | Val. Acc (%) | Test Acc (%)
400       | 400    | 77           | 75
1200      | 1200   | 75           | 75
2000      | 2000   | 79           | 79
3200      | 3200   | 77           | 75
Note: Bold values indicate best performance for each codebook and length.
Table 8. Highest classification accuracy of each classifier for BoVW features.
Method     | Optimal Parameters        | Length | Test Acc (%)
BoVW + KNN | Codewords: 400, K = 25    | 400    | 89
BoVW + SVM | Codewords: 400, σ = 0.5   | 400    | 91
BoVW + NB  | Codewords: 2000           | 2000   | 79
Note: Bold values indicate best classifier at the corresponding parameters for BoVW features.
Table 9. Highest classification accuracy by each feature extractor.
Method     | Vector Length | Compt. Time (s) | Test Acc (%)
LBP + KNN  | 147,500       | 1.89            | 91
BoVW + SVM | 400           | 0.21            | 91
Note: Bold values indicate best feature extractor and classifier for architecture one.
Table 10. DCNN validation accuracies with varying convolution parameters and a single fully connected layer.
Layer       | Parameter   | First   | Second  | Third   | Acc (%)
Convolution | Filter size | 70 × 70 | 50 × 50 | 35 × 35 | 91
            | Stride      | 1       | 1       | 1       |
            | Padding     | 2       | 2       | 2       |
Pooling     | Filter size | 15 × 15 | 15 × 15 | 5 × 15  |
            | Stride      | 1       | 1       | 1       |
Convolution | Filter size | 50 × 50 | 30 × 30 | 15 × 15 | 89
            | Stride      | 1       | 1       | 1       |
            | Padding     | 2       | 2       | 2       |
Pooling     | Filter size | 10 × 10 | 10 × 10 | 10 × 10 |
            | Stride      | 1       | 1       | 1       |
Convolution | Filter size | 30 × 30 | 15 × 15 | 10 × 10 | 87
            | Stride      | 1       | 1       | 1       |
            | Padding     | 2       | 2       | 2       |
Pooling     | Filter size | 5 × 5   | 5 × 5   | 5 × 5   |
            | Stride      | 1       | 1       | 1       |
Table 11. DCNN validation accuracies with varying convolution parameters and two fully connected layers of same size.
Layer       | Parameter   | First   | Second  | Third   | Acc (%)
Convolution | Filter size | 70 × 70 | 50 × 50 | 35 × 35 | 97
            | Stride      | 1       | 1       | 1       |
            | Padding     | 2       | 2       | 2       |
Pooling     | Filter size | 15 × 15 | 15 × 15 | 5 × 15  |
            | Stride      | 1       | 1       | 1       |
Convolution | Filter size | 50 × 50 | 30 × 30 | 15 × 15 | 93
            | Stride      | 1       | 1       | 1       |
            | Padding     | 2       | 2       | 2       |
Pooling     | Filter size | 10 × 10 | 10 × 10 | 10 × 10 |
            | Stride      | 1       | 1       | 1       |
Convolution | Filter size | 30 × 30 | 15 × 15 | 10 × 10 | 93
            | Stride      | 1       | 1       | 1       |
            | Padding     | 2       | 2       | 2       |
Pooling     | Filter size | 5 × 5   | 5 × 5   | 5 × 5   |
            | Stride      | 1       | 1       | 1       |
Table 12. DCNN validation accuracies with varying convolution parameters and three fully connected layers of same size.
Layer       | Parameter   | First   | Second  | Third   | Acc (%)
Convolution | Filter size | 70 × 70 | 50 × 50 | 35 × 35 | 93
            | Stride      | 1       | 1       | 1       |
            | Padding     | 2       | 2       | 2       |
Pooling     | Filter size | 15 × 15 | 15 × 15 | 5 × 15  |
            | Stride      | 1       | 1       | 1       |
Convolution | Filter size | 50 × 50 | 30 × 30 | 15 × 15 | 91
            | Stride      | 1       | 1       | 1       |
            | Padding     | 2       | 2       | 2       |
Pooling     | Filter size | 10 × 10 | 10 × 10 | 10 × 10 |
            | Stride      | 1       | 1       | 1       |
Convolution | Filter size | 30 × 30 | 15 × 15 | 10 × 10 | 87
            | Stride      | 1       | 1       | 1       |
            | Padding     | 2       | 2       | 2       |
Pooling     | Filter size | 5 × 5   | 5 × 5   | 5 × 5   |
            | Stride      | 1       | 1       | 1       |
Table 13. Highest DCNN classification accuracies with increasing sizes of fully connected hidden layers.
No. of Layers | Layer Size   | Convolution Kernel        | Pooling Kernel | Test Acc (%) | Comp. Time (s)
1             | {80}         | 75 × 75, 50 × 50, 35 × 35 | 50 × 50        | 89           | 0.97
2             | {80, 80}     | 75 × 75, 50 × 50, 35 × 35 | 50 × 50        | 95           | 1.17
3             | {80, 80, 80} | 75 × 75, 50 × 50, 35 × 35 | 50 × 50        | 93           | 1.39
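For reference, a minimal PyTorch-style sketch of the best configuration reported above (three convolution–pooling stages followed by two fully connected layers of 80 units and a four-class output corresponding to the defect types in Figures 10–13). The tables do not specify the number of filters per layer or the input resolution, so the values below (8/16/32 filters, a 240 × 560 grayscale input) are assumptions for illustration only; the kernel sizes follow the first configuration in Table 11, and the 5 × 15 pooling entry is simplified to 15 × 15.

import torch
import torch.nn as nn

class WeldDefectDCNN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        # Kernel sizes, strides, and padding follow Table 11; filter counts are assumed.
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=70, stride=1, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=15, stride=1),
            nn.Conv2d(8, 16, kernel_size=50, stride=1, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=15, stride=1),
            nn.Conv2d(16, 32, kernel_size=35, stride=1, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=15, stride=1),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(80), nn.ReLU(),   # first fully connected layer (80 units)
            nn.Linear(80, 80), nn.ReLU(),   # second fully connected layer (80 units)
            nn.Linear(80, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# model = WeldDefectDCNN()
# logits = model(torch.randn(1, 1, 240, 560))  # assumed input resolution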
Table 14. Best architecture for weld joint defect classification.
Architecture | Method     | Test Acc (%) | Compt. Time (s)
One          | BoVW + SVM | 91           | 0.21
Two          | DCNN       | 95           | 1.17
Note: Bold values indicate best architecture for thermite weld defect classification.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
