1. Introduction
The operational continuity and safety of mining transport systems depend heavily on the integrity of their conveyor belts [1,2,3,4,5]. These belts are particularly prone to longitudinal tearing from impacts with sharp objects and from material fatigue, which can cause costly unplanned downtime and resource losses [6]. Phenomena such as belt mistracking can lead to rubbing against the conveyor structure, significantly shortening the belt’s life cycle [7,8]. Given the limitations of manual visual inspection, developing reliable automated methods for detecting longitudinal tears has become an important research challenge [9,10].
Monitoring conveyor belt condition is challenging because damage to the top cover, such as cuts and gouges caused by sharp, falling material, can propagate and cause core degradation [11]. In practice, surface and geometric assessments at many mining sites still depend largely on manual visual inspection by supervisors. Prior efforts to automate inspection with image-based methods have met limited success, largely due to adverse field conditions, including poor illumination and airborne dust, that degrade data quality and complicate reliable analysis. These limitations highlight the need for more robust, environment-resilient inspection techniques [6,12].
Automatic methods for detecting longitudinal tears in industrial conveyor belts fall into two main categories: contact and non-contact. Contact techniques use hardware that physically interacts with the belt, such as linear detectors, swing rollers, and pressure sensors [13]. Although these approaches can be fast and conceptually simple, they tend to be costly and can collide with conveyed material, producing false alarms. Non-contact approaches, including electromagnetic induction and X-ray fluoroscopy, generally yield fewer false detections but depend on precise sensor-belt coupling, an arrangement that is difficult to maintain in the dynamic conditions of mining operations [14].
Surface reconstruction through 3D scanning offers a valuable approach to evaluating conveyor belt condition, enabling the generation of precise digital representations of belt geometry. Affordable 3D scanning could therefore be a beneficial method for inspecting conveyor belts, despite the ongoing challenge of adapting these readily available technologies for reliable performance in demanding industrial environments [15]. Recent developments have expanded the availability of these technologies across a spectrum of devices, ranging from sophisticated industrial setups to more affordable scanners [16]. Although high-end systems offer greater precision, lower-cost devices frequently prove adequate and more practical when the objective is to identify significant geometric anomalies rather than minute imperfections [15].
Modern consumer-grade smart devices, including mobile phones and tablets, now incorporate sophisticated scanning technologies that extend beyond traditional photogrammetry. Among these, LiDAR (Light Detection and Ranging) operates on a Time-of-Flight (ToF) principle, determining distance by measuring the delay between emitting a light signal and receiving its reflection [17]. In contrast, Apple’s proprietary TrueDepth camera combines a vertical-cavity surface-emitting laser (VCSEL), a dot projector, a flood illuminator, and an infrared camera. Its operational core involves projecting a pattern of over 30,000 infrared dots onto a scene; the distortion of this pattern, captured by the infrared camera, is then analyzed to construct a depth map [18]. This map is subsequently processed by machine learning algorithms to generate a precise mathematical model of the environment [19].
The TrueDepth camera built into Apple devices functions as a low-cost 3D scanner, using infrared illumination to produce depth maps for applications such as facial recognition and augmented reality. Its use in scientific contexts, particularly for anthropometric data collection, has been investigated in recent studies [15,20,21]. Nevertheless, the sensor’s capacity to reliably reconstruct fine surface geometry remains an area of ongoing research.
In this paper, we present a low-cost, smartphone-driven maintenance system for intelligent condition monitoring of conveyor belt surfaces that leverages the iPhone 12 Pro Max as an integrated sensing platform. The system uses the device’s TrueDepth camera to create accurate 3D point cloud models of the moving belt. This allows for a quantitative assessment and identification of surface issues based on their geometric characteristics. This smartphone-only approach supports cost-effective, near-real-time monitoring to aid maintenance decision-making.
Deep learning models have exhibited strong efficacy in image classification, detection, and segmentation, largely due to the application of 2D convolutional neural networks (CNNs). These models effectively capture both global and spatial features, demonstrating considerable generalization capabilities, which renders them particularly well-suited for the analysis of RGB imagery. Recent developments in 3D deep learning have expanded the application of these techniques to include classification, object detection, and semantic segmentation of point cloud data. Despite this progress, 3D CNNs continue to be computationally demanding and exhibit reduced scalability compared to their 2D counterparts [22]. To address this limitation, dimension reduction methods such as projection-based techniques can be used, allowing efficient pre-trained 2D convolutional neural network (CNN) architectures to analyze 3D data.
The contributions of the proposed surface defect detection system are summarized as follows:
A smartphone-driven 3D inspection pipeline: A cost-effective system that captures point clouds with the iPhone 12 Pro Max’s TrueDepth camera and processes them through a novel 3D-to-2D projection method. The pipeline capitalizes on pre-trained 2D convolutional neural networks (CNNs) to extract deep features and incorporates efficient tree-based classifiers to facilitate robust defect detection and classification, relying exclusively on geometric data.
An industrial benchmark and empirical assessment: a specialized dataset comprising TrueDepth point clouds, encompassing various induced fault types and detailed annotations, serves as the foundation for evaluating the generalizability of the proposed method in the context of conveyor-component condition monitoring.
A lightweight, deployable detection and quantification pipeline: a computationally efficient approach suitable for edge and mobile deployment that identifies topographic defects and provides practical geometric measurements (e.g., depth, volume) to support maintenance decisions.
The remainder of this paper is structured as follows.
Section 2 reviews related work on conveyor belt damage modes and point cloud based condition monitoring methods.
Section 3 delineates the proposed methodology for geometric defect detection.
Section 4 describes the experimental setup, including the dataset, data collection process, and evaluation metrics.
Section 5 reviews the model training and validation process.
Section 6 presents and discusses the experimental results. Finally, Section 7 concludes the study and suggests directions for future work.
2. Literature Review
A wide range of techniques are used to diagnose and inspect conveyor belts, each aimed at identifying various types of faults and maintaining reliable operations. These approaches span from traditional visual and manual inspections to more advanced technologies, such as ultrasonic testing [23,24] or magnetic belt inspection [25]. Additional methods include the use of RGB and infrared cameras, commonly applied to monitor idlers [26,27,28], detect material blockages [29], or detect misalignment of the track belt [30], as well as to evaluate the belt’s own condition [31,32], alongside acoustic analysis [33,34] and X-ray imaging [35]. Multisensor systems designed for belt conveyor monitoring, such as DiagBelt+, can integrate the aforementioned sensors to good effect [36,37,38].
A LiDAR-based approach can be found in [6], where the belt was scanned with a terrestrial laser scanner (TLS) to obtain elevation data and, consequently, detect local defects. In [39], the authors used a binocular line laser stereo vision camera mounted between the upper and lower belts to obtain data for the detection of longitudinal rips. As in the previous case, suspect points are identified from fluctuations of point positions along selected directions. An example of damage detection in a multi-wedge belt can be found in [40], where the authors detected the most common pits, scratches, and cracks with a detection rate of 96%. The proposed methodology consisted of point cloud extraction, clustering of separate tooth top surfaces with DBSCAN, and final defect detection through an adaptive moving window.
Other applications of surface inspection with scanning technology can be found in building inspection, road damage detection, and quality control of various materials. In [41], the authors used TLS in a bridge structural health monitoring task. The accuracy of the scanning equipment (Faro S350) proved high in comparison with manual measurements; as expected, the differences increased with the angle between the scanning axis and the surface normal. In [42], a colored point cloud of a ship hull was utilized to detect corroded regions, allowing better estimation and optimization of maintenance routines; the authors used threshold-based detection similar to image segmentation methodologies. On a larger scale, scanning has been used for quality inspection of large prefabricated housing units [43], where the geometric dimensions (together with parameters such as straightness or flatness) of different elements were measured, with the achieved accuracy remaining below 2.3 mm. Several different scanners, including the iPhone LiDAR, have been tested for damage estimation of forest road surfaces; the quality and accuracy of the iPhone LiDAR proved sufficient for the task, although the error was strongly tied to the distance from the scanned surface [44].
In [45], the authors utilized density histograms, Euclidean clustering, and a dimension-based classifier to detect idler positions for further diagnosis. Machine vision and artificial intelligence are often used in modern methods to make fault detection more accurate and to better predict when maintenance will be needed. This category includes approaches such as support vector machines [46,47,48], neural networks [49,50], and DBSCAN [51].
In belt damage detection with machine learning, two types of damage can be distinguished, which the methodologies address separately. Both usually rely on image data (such as RGB or X-ray images), with exceptions such as magnetic detection [50]. The first branch of trained networks revolves around the belt deviation problem [28,52,53,54] and often incorporates various edge and line detection algorithms into the processing pipeline. The second focuses on surface damage [55,56]. In both cases, a robust region-of-interest reduction is usually very beneficial for the final results; for this purpose, detectors such as MobileNet SSD have been implemented [28].
The TrueDepth camera utilized here has been thoroughly tested in [57], where the authors demonstrated its usability in millimeter-range applications, resolving 0.1 mm details at a working distance of 150 to 170 mm. For stable measurement of less textured surfaces, it was recommended to stay within 300 mm of the surface, or 500 mm in the case of more textured material. Similar conclusions were reached by the authors of [58], who obtained point-to-plane deviations from 0.291 to 0.739 mm as the distance from the surface increased from 175 to 450 mm. This highlights the strong influence of the distance to the measured object, which might render the sensor unsuitable for many industrial applications. A direct comparison with existing solutions was provided in [59], where the authors evaluated the scanning accuracy of the iPad Pro TrueDepth against the Artec Space Spider high-resolution industrial scanner. In these tests, TrueDepth was outperformed by the industrial solution, although the differences were on the order of one millimeter. Additionally, a strong impact of scanner movement and measurement technique on TrueDepth accuracy was noted, which suggests that the scanner’s performance could be improved significantly with proper handling.
3. Material and Methods
This section delineates the proposed methodology for the intelligent inspection of conveyor belt surfaces based on geometric data. The overall procedure, illustrated in Figure 1, begins with the acquisition of 3D point cloud data using the integrated TrueDepth camera of an iPhone 12 Pro Max. A critical preprocessing step involves transforming the raw 3D point clouds into 2D feature projections. This transformation is essential to reduce computational complexity and to leverage pre-trained 2D CNNs for effective feature extraction. Subsequently, a hybrid framework is introduced, where deep features extracted from the 2D point cloud projections are classified using traditional machine learning models. The procedures for training and evaluating these models are detailed to identify the optimal CNN and classifier pair for defect detection.
3.1. Point Cloud Reconstruction
To leverage the TrueDepth camera for accurate geometric modeling of the conveyor belt surface, the raw sensor data must first be preprocessed. This section outlines the core computer vision principles for this processing, beginning with the camera’s intrinsic parameters. These parameters govern the transformation between 3D world coordinates and 2D image pixels, forming the foundation for converting depth maps into 3D point clouds.
The projective transform maps a 3D point $P = (X, Y, Z)^{T}$ from the camera coordinate system to a point $p = (u, v)^{T}$ in the image plane via the intrinsic matrix $K$. Here $K$ is defined as:

$$K = \begin{bmatrix} f & s & c_x \\ 0 & a f & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

The matrix contains the focal length $f$ in pixels, an aspect ratio $a$, a shear factor $s$, and the principal point $(c_x, c_y)$.
Urban et al. [60] performed a series of experiments to determine the calibration factors and report the factory-calibrated intrinsics for the iPhone 12 Pro Max. In addition, the principal point always coincides with the lens distortion center, which can also be requested from the APIs provided by the manufacturer (Apple Inc., Cupertino, CA, USA) [60,61].
In the next step, the acquired depth image is used to reconstruct the point cloud. To convert a depth image $D$ to a point cloud, the following mapping can be used:

$$P(u, v) = D(u, v)\, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$

where $u \in \{0, \ldots, W-1\}$, $v \in \{0, \ldots, H-1\}$, and $D(u, v)$ denotes the measured depth at pixel $(u, v)$. In the test smartphone, the depth image has a resolution of 640 × 480 pixels.
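The inverse mapping above can be sketched in a few lines of NumPy. The intrinsics used here (a focal length of 480 pixels and a principal point at the image center) are illustrative placeholders, not the factory-calibrated values discussed above:

```python
import numpy as np

def depth_to_point_cloud(depth, f, cx, cy, a=1.0):
    """Back-project a depth map into a 3D point cloud via the pinhole model:
    X = (u - cx) * Z / f, Y = (v - cy) * Z / (a * f), Z = measured depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinate grid
    x = (u - cx) * depth / f
    y = (v - cy) * depth / (a * f)
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Example: a flat surface 250 mm from the camera at the sensor's 640 x 480 resolution.
depth = np.full((480, 640), 250.0)
cloud = depth_to_point_cloud(depth, f=480.0, cx=320.0, cy=240.0)
print(cloud.shape)  # (307200, 3)
```

A pixel at the principal point back-projects to $X = Y = 0$, which provides a quick sanity check of the implementation.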
3.2. 3D-to-2D Point Clouds Feature Projection
The proposed architecture processes point clouds using 2D convolutional layers, necessitating the transformation of 3D point data into 2D feature maps compatible with regular grid-based processing. Since point clouds inhabit continuous 3D space, they cannot be directly processed by 2D CNNs without prior conversion to a structured 2D representation [22].
For a point $p_i$ with coordinates $(x_i, y_i, z_i)$ and associated feature $f_i$, the projection process involves normalizing the point cloud to a specified range relative to each projection plane. For the XY-plane with dimension $H \times W$, the $x$ and $y$ coordinates are normalized to the intervals $[0, H-1]$ and $[0, W-1]$, respectively. Feature projection onto the grid is accomplished through bilinear interpolation, selected for its favorable balance of computational efficiency and memory requirements. When multiple features map to the same grid cell, they are aggregated through summation. This process generates a 2D feature map of dimension $H \times W$ from the 3D point cloud, formally defined as:

$$F^{xy}_{h, w} = \sum_{i} B(x_i - h,\; y_i - w)\, f_i$$

where $F^{xy}_{h, w}$ denotes the 2D feature at grid position $(h, w)$ on the XY-plane, $f_i$ represents the 3D feature of point $p_i$, and $B(\cdot, \cdot)$ is the 2D bilinear interpolation kernel composed of one-dimensional linear kernels, $B(a, b) = \max(0, 1 - |a|)\,\max(0, 1 - |b|)$.
Figure 2 illustrates this projection mechanism.
For points that share identical $x$ and $y$ coordinates but differ in their $z$ values, such as two points $p_1 = (x, y, z_1)$ and $p_2 = (x, y, z_2)$, projecting onto the XY-plane results in identical 2D grid positions, with their features aggregated into the same cell. This characteristic ensures that the surface geometry of the conveyor belt is effectively captured while maintaining computational efficiency.
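The bilinear splatting described above can be sketched as follows; this minimal NumPy version uses each point’s $z$ value as its feature and sums overlapping contributions, as in the aggregation rule:

```python
import numpy as np

def project_to_xy_grid(points, feats, H=128, W=128):
    """Splat per-point features onto an H x W grid over the XY-plane with
    bilinear interpolation; contributions to the same cell are summed."""
    x, y = points[:, 0], points[:, 1]
    # normalize x into [0, H-1] and y into [0, W-1] (grid index ranges)
    h = (x - x.min()) / max(np.ptp(x), 1e-9) * (H - 1)
    w = (y - y.min()) / max(np.ptp(y), 1e-9) * (W - 1)
    h0, w0 = np.floor(h).astype(int), np.floor(w).astype(int)
    h1, w1 = np.minimum(h0 + 1, H - 1), np.minimum(w0 + 1, W - 1)
    ah, aw = h - h0, w - w0                      # 1D linear kernel weights
    fmap = np.zeros((H, W))
    # accumulate the four bilinear contributions of each point
    np.add.at(fmap, (h0, w0), feats * (1 - ah) * (1 - aw))
    np.add.at(fmap, (h0, w1), feats * (1 - ah) * aw)
    np.add.at(fmap, (h1, w0), feats * ah * (1 - aw))
    np.add.at(fmap, (h1, w1), feats * ah * aw)
    return fmap

rng = np.random.default_rng(0)
cloud = rng.uniform(0.0, 1.0, (1000, 3))         # synthetic point cloud
fmap = project_to_xy_grid(cloud, cloud[:, 2], H=64, W=64)
print(fmap.shape)  # (64, 64)
```

Because the four bilinear weights of every point sum to one, the total feature mass on the grid equals the sum of the input features, a useful invariant for testing the projection.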
To ensure data continuity and the reliability of the geometric profile in the captured point cloud, missing data points resulting from infrared dots that were not correctly reflected and captured by the TrueDepth camera were reconstructed using linear interpolation. For each missing point at coordinates $(x, y)$, the algorithm estimates its depth value $z$ from the values of its valid neighboring points within a defined spatial kernel. This process reconstructs a spatially consistent point cloud suitable for further analysis.
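A simple variant of this hole-filling step can be sketched as below. The exact interpolation kernel is not specified above, so this neighbourhood-mean version is an assumption for illustration:

```python
import numpy as np

def fill_missing_depth(depth, k=1):
    """Fill NaN depth pixels (IR dots that were not returned) with the mean
    of the valid neighbours in a (2k+1) x (2k+1) window around each hole."""
    filled = depth.copy()
    for r, c in np.argwhere(np.isnan(depth)):
        window = depth[max(r - k, 0):r + k + 1, max(c - k, 0):c + k + 1]
        valid = window[~np.isnan(window)]
        if valid.size > 0:
            filled[r, c] = valid.mean()
    return filled

# A 4 x 4 depth patch with one dropped measurement.
patch = np.full((4, 4), 250.0)
patch[1, 2] = np.nan
print(fill_missing_depth(patch)[1, 2])  # 250.0
```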
3.3. Deep CNN Models for Defect Feature Extraction
Convolutional Neural Networks (CNNs) are a distinct category of deep learning architectures designed for analyzing structured grid data, with a primary focus on images. They mirror the hierarchical pattern recognition processes observed in biological vision systems [62,63]. Their remarkable effectiveness in image processing, coupled with the ability to learn directly from raw pixel data, has established CNNs as the prevailing approach in computer vision applications. The fundamental CNN architecture employs a series of convolutional filters, activation functions, and pooling operations to autonomously derive hierarchical feature representations from input images. These models are typically optimized using gradient-based methods, such as backpropagation, for a variety of tasks, including image classification and feature extraction [64,65].
The standard structure of a convolutional neural network is a series of specific layers arranged in order. Each layer has a distinct computational role in a step-by-step process of extracting features. The process begins with the input layer, which receives the original image data and applies preprocessing steps, such as normalization, to standardize the data. Following this, the convolutional layers, which are the core of the architecture, use learnable filters to extract spatial features through convolution. The generated feature maps pass through activation layers, often utilizing rectified linear units (ReLU), to incorporate non-linear transformations crucial for the acquisition of intricate mappings. Pooling layers then execute spatial down-sampling, diminishing feature dimensionality while maintaining essential information, thus improving computational efficiency and offering translational invariance. The concluding phase entails flattening the extracted features and subjecting them to fully connected layers, which consolidate high-level representations to produce classification outputs.
Transfer Learning with Pre-Trained Architectures
Over the last ten years, the development of Convolutional Neural Network (CNN) architectures has produced numerous models that excel in large-scale visual recognition applications. This study utilizes transfer learning, employing four well-established CNN architectures—VGG16, ResNet50, InceptionV3, and Xception—to extract distinguishing features from the 2D point cloud representation of the conveyor belt surface. Transfer learning facilitates the transfer of knowledge from models initially trained on extensive datasets, such as ImageNet, to our specific area of interest, thereby substantially improving performance, particularly when labeled data is scarce [66].
Each architecture presents unique benefits for feature extraction. VGG16, distinguished by its consistent architecture, features 13 convolutional layers utilizing 3 × 3 filters, thereby establishing a deep yet uncomplicated structure that effectively identifies hierarchical features [67]. Conversely, ResNet50 incorporates residual connections throughout its 48 convolutional layers to address the vanishing gradient issue, thus facilitating the effective training of considerably deeper models [68]. The InceptionV3 architecture, in contrast, utilizes parallel convolutional pathways with diverse receptive fields to efficiently capture multi-scale features while managing computational complexity [69]. Lastly, Xception, an advancement of the Inception concept, is predicated on depthwise separable convolutions within its 71-layer design, which improves parameter efficiency while preserving robust representational capabilities [70].
Initially trained on the ImageNet dataset, which comprises more than 13 million images spanning 20,000 categories, these pre-trained models offer strong feature extraction abilities. We leverage these capabilities to identify surface anomalies within point cloud representations of conveyor belts. The features extracted from this process are then utilized as input for conventional machine learning classifiers, ultimately facilitating the classification of defects.
To prepare the extracted point cloud data for feature extraction with convolutional filters, the proposed projection algorithm was applied to convert the 3D point clouds captured by the TrueDepth camera into 2D top-view representations on the XY-plane (see Figure 3). The z-axis (depth) of each point cloud was normalized to the range 0 to 255 and represented as a grayscale image, enabling its direct use as input to the employed feature extraction models. To reduce computational cost while preserving the essential geometric information, all projected point clouds were resized to 128 × 128 pixels.
To use the pre-trained architectures without converting the point cloud matrices to RGB, the single-channel image was replicated across three channels. This channel duplication is a common practice for adapting single-channel data to models pre-trained on 3-channel RGB images, preserving the learned filter structures from ImageNet.
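The normalize–resize–replicate preparation above can be sketched as follows; nearest-neighbour resizing is used here only to keep the sketch dependency-free (the actual pipeline may use a library resampler):

```python
import numpy as np

def depth_map_to_cnn_input(fmap, size=128):
    """Normalize a projected depth map to 0-255, resize to size x size
    (nearest-neighbour), and replicate the single channel three times
    for CNNs pre-trained on 3-channel RGB images."""
    lo, hi = fmap.min(), fmap.max()
    gray = (fmap - lo) / max(hi - lo, 1e-9) * 255.0   # z values -> 0..255
    h, w = gray.shape
    rows = np.arange(size) * h // size                # nearest source rows
    cols = np.arange(size) * w // size                # nearest source cols
    resized = gray[np.ix_(rows, cols)]
    return np.repeat(resized[..., None], 3, axis=-1)  # (size, size, 3)

img = depth_map_to_cnn_input(np.random.default_rng(0).uniform(0, 1, (480, 640)))
print(img.shape)  # (128, 128, 3)
```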
3.4. Hybrid Deep Learning and Machine Learning Framework
To address the challenges of limited computational resources and small datasets in industrial applications, particularly in our case study, we propose a hybrid framework. This framework combines deep feature extraction, using pre-trained CNN architectures, with traditional machine learning models for classification tasks based on the extracted features.
The proposed framework functions via a two-step process. Initially, pre-trained convolutional neural network (CNN) models are employed to extract high-level features from two-dimensional point cloud representations, which are themselves derived from the original three-dimensional point cloud data. Subsequently, traditional machine learning classifiers, such as Random Forest, Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), conduct the final classification based on the features that have been extracted. This modular design facilitates the efficient optimization of each individual component while simultaneously mitigating computational requirements when contrasted with end-to-end deep learning methodologies. Consequently, the framework retains the lightweight attributes that are critical for practical implementation within industrial contexts.
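The two-step structure of the framework can be sketched as follows. To keep the sketch self-contained, the pre-trained CNN backbone is replaced by a placeholder feature extractor (pooled pixel statistics) and all data are synthetic; in the actual system, step one would be, e.g., ResNet50’s global-average-pooled activations:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(images):
    """Placeholder for the pre-trained CNN backbone (step 1):
    crude pooled statistics stand in for deep features."""
    flat = images.reshape(len(images), -1)
    return np.stack([flat.mean(1), flat.std(1), flat.max(1), flat.min(1)], axis=1)

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, (50, 32, 32))   # synthetic healthy patches
faulty = rng.normal(1.5, 1.0, (50, 32, 32))    # synthetic defect patches
X = extract_features(np.concatenate([healthy, faulty]))
y = np.array([0] * 50 + [1] * 50)

# Step 2: a conventional ensemble classifier consumes the extracted features.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.score(X, y))
```

The same two-step shape applies when the classifier is XGBoost or LightGBM: only the estimator in step 2 changes, which is what makes the components independently tunable.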
Ensemble Machine Learning Classifiers
To augment the profound feature extraction capabilities inherent in pre-trained Convolutional Neural Networks (CNNs), we utilize three ensemble machine learning classifiers, each recognized for its strong performance in classification applications: Random Forest (RF), XGBoost, and LightGBM. These algorithms each present unique benefits when processing the features derived from 2D point cloud representations.
Random Forest (RF), as introduced by Breiman [71], constitutes an ensemble learning technique that employs bootstrap aggregation (bagging) to construct numerous decision trees during the training phase. RF operates by building trees in parallel; each tree is trained on randomly selected subsets of both data and features. This methodology serves to effectively reduce the overfitting tendencies often observed in individual decision trees. The ultimate predictions are derived from the aggregation of individual tree outputs via majority voting, thereby improving both accuracy and robustness.
XGBoost, as presented by Chen and Guestrin [72], constitutes an advanced version of Gradient Boosting Decision Trees (GBDT). In contrast to the bagging methodology of Random Forests (RF), XGBoost constructs trees sequentially through boosting, wherein each subsequent tree addresses the errors of its predecessors by minimizing a specified loss function using gradient descent. The algorithm’s structure is characterized by level-wise tree growth, which promotes balanced architectures, and it incorporates regularization techniques to manage model complexity. Consequently, the ultimate prediction is derived from a weighted aggregation of all tree outputs, thereby demonstrating considerable efficacy in handling intricate datasets.
LightGBM, a variant of GBDT, is engineered for both efficiency and rapid processing of extensive datasets [73]. This algorithm introduces two principal innovations: Gradient-based One-Side Sampling (GOSS), which emphasizes instances exhibiting substantial gradients while randomly sampling those with smaller gradients, and Exclusive Feature Bundling (EFB), which diminishes feature dimensionality by bundling mutually exclusive features. Furthermore, LightGBM utilizes leaf-wise tree growth, thereby facilitating faster convergence and enhanced performance relative to conventional level-wise methodologies.
4. Experimental Setup and Data Collection
The experimental investigation employed a steel cord conveyor belt with a rubber top cover as the test specimen. The belt was maintained in excellent condition with intact edges, attributable to its exclusive use in a controlled laboratory environment, as depicted in Figure 4. All data acquisition and testing were performed in the specialized belt conveyor laboratory at the Wrocław University of Science and Technology (WUST). This controlled facility enabled the precise induction of artificial defects and the subsequent collection of high-quality data necessary to validate the proposed inspection methodology.
To assess the efficacy of the suggested non-destructive inspection (NDI) system, nine unique artificial defects were incorporated onto the surface of the test conveyor belt. These defects were engineered to simulate common failure mechanisms observed in industrial contexts, specifically the significant impact and abrasive wear typical of hard rock mining operations. As depicted in Figure 5, the induced damage exhibits variations in geometry, depth, and severity, thus creating a demanding and representative dataset for thorough system evaluation.
The simulated damage profile includes situations such as deep gouges and cuts that partially expose the underlying steel cord reinforcement. These defects are mainly characterized by localized geometric changes on the belt’s surface. The proposed method uses these geometric indicators, specifically point clouds from a TrueDepth camera, to provide precise, quantitative measurements of 3D surface changes. This approach allows for reliable defect detection based solely on measurable geometric anomalies, maintaining accuracy regardless of lighting conditions or surface contamination.
4.1. Data Acquisition and Sensor Characteristics
Data acquisition was performed with a smartphone-based sensing platform, an iPhone 12 Pro Max, which features an integrated TrueDepth camera. This system employs structured light technology; it projects a pattern of over 30,000 infrared dots onto the surface, subsequently capturing the resultant deformation with an infrared camera. This procedure yields a dense 3D point cloud, with each point characterized by its spatial coordinates in relation to the sensor. Furthermore, the system integrates a flood illuminator to facilitate low-light operation and an image sensor for the concurrent capture of 2D RGB texture data.
As demonstrated in Figure 6, which compares the LiDAR and TrueDepth cameras in capturing point clouds from an exemplary conveyor belt defect, the iPhone LiDAR shows limitations in accurately reproducing the defect’s geometrical features. In contrast, the TrueDepth camera captures precise point clouds that enable accurate depth measurement of the surface defects. The main reason behind the poor performance of the LiDAR sensor is that the LiDAR module in the iPhone 12 Pro Max is designed for distance estimation rather than precise point cloud generation. Therefore, although it handles geometrical features at longer distances better than the TrueDepth camera, it performs poorly when capturing precise point cloud data at short distances, as required in this case study.
The working distance of the TrueDepth camera was maintained within 200–300 mm, consistent with ranges validated in prior studies for reliable 3D data acquisition [57,74]. During experiments, the smartphone was positioned approximately 250 mm above the belt surface to capture samples. The TrueDepth camera operated at a frame rate of 30 frames per second, with each frame generating a corresponding point cloud. A rectangular region of interest (ROI) measuring 25 cm in width and 35 cm in height was continuously recorded from the central section of the belt. Each TrueDepth frame generated a point cloud containing 273,674 individual points (see Figure 7). To ensure smooth and consistent data acquisition, the conveyor belt was moving at a constant speed of 0.075 m/s. The test conveyor had a total length of 15 m and a belt thickness of 5 cm.
To ensure data integrity and avoid the opaque processing routines common in many 3D scanning applications, the “Record3D” application was utilized for data extraction. This application exports raw, unaltered point cloud data streams directly from Apple’s ARKit framework without applying proprietary post-processing or mesh refinement algorithms. Consequently, the dataset employed in this work consists exclusively of native ARKit outputs, establishing a transparent and reproducible foundation for analysis [58]. For validation purposes, manual measurements of maximum depth, width, and height for each defect were collected using precision rulers and calipers (see Figure 8) to enable comparative analysis with the camera-acquired results.
4.2. Performance Metrics
The performance of the proposed classifier was evaluated using standard metrics derived from the confusion matrix: accuracy, sensitivity (recall), precision, and F1 score. These metrics are formally defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN},$$
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad F1 = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$
In these formulations, TP (True Positive) represents correctly identified defective regions, TN (True Negative) denotes correctly classified non-defective areas, FP (False Positive) indicates non-defective areas misclassified as defective, and FN (False Negative) corresponds to defective areas incorrectly classified as non-defective. Sensitivity quantifies the model’s capability to detect actual defects, while precision measures the accuracy of positive predictions. The F1 score provides a balanced metric through the harmonic mean of precision and sensitivity. All metrics range from 0 to 1, with 1 representing optimal performance, which served as the optimization objective in this study.
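These definitions translate directly into code; the counts below are hypothetical, chosen only to illustrate the computation:

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics as defined above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)                      # recall
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, precision, f1

# Hypothetical counts: 90 defects caught, 10 missed, 15 false alarms.
acc, sens, prec, f1 = classification_metrics(tp=90, tn=85, fp=15, fn=10)
print(round(acc, 3), round(sens, 3), round(prec, 3), round(f1, 3))  # 0.875 0.9 0.857 0.878
```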
5. Model Training and Validation Process
The dataset for this study comprised a total of 7086 samples captured from the conveyor belt surface, representing nine distinct fault conditions as shown in
Figure 5. To ensure a robust evaluation, a structured approach was employed for data partitioning into training, validation, and test sets.
To mitigate the risk of overfitting from sequential, highly correlated samples and to maximize feature diversity for enhanced model generalizability, a strategic sample selection process was employed, leveraging the ORB (Oriented FAST and Rotated BRIEF) algorithm [
75]. ORB, established by Rublee et al. as a computationally efficient feature descriptor, is well suited to rapidly quantifying visual dissimilarity. ORB features were extracted from an initial pool of 832 healthy and 832 faulty samples, and the Hamming distance between their binary descriptors was computed for all projected point cloud pairs to generate a dissimilarity score for each sample. The 832 faulty samples correspond to defect types 1 to 5. The final balanced training dataset was constructed by selecting the 100 most dissimilar samples from each of the two categories, ensuring that the selected data encapsulated the widest possible variation in surface conditions and promoting robust model performance from a limited number of samples.
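The selection step can be sketched as follows. In the actual pipeline the binary descriptors come from an ORB detector (e.g. OpenCV's `cv2.ORB_create`); here only the ranking logic is shown, in plain Python over placeholder binary descriptors, with per-sample dissimilarity taken as the mean Hamming distance to all other samples (one reasonable reading of the procedure described above).

```python
# Hedged sketch: rank samples by mean pairwise Hamming distance of their
# binary ORB descriptors and keep the k most dissimilar ones.

def hamming(d1: bytes, d2: bytes) -> int:
    """Hamming distance between two equal-length binary descriptors."""
    return sum(bin(a ^ b).count("1") for a, b in zip(d1, d2))

def dissimilarity_scores(descriptors):
    """Mean Hamming distance of each descriptor to all the others."""
    n = len(descriptors)
    return [
        sum(hamming(descriptors[i], descriptors[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def select_most_dissimilar(samples, descriptors, k=100):
    """Keep the k samples whose descriptors are most dissimilar on average."""
    scores = dissimilarity_scores(descriptors)
    ranked = sorted(range(len(samples)), key=scores.__getitem__, reverse=True)
    return [samples[i] for i in ranked[:k]]
```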
Hyperparameter optimization, an integral part of tuning the ML-based classifiers employed in this study, was conducted using a separate validation set consisting of 274 healthy samples and 686 faulty samples encompassing two defect types (numbers 6 and 7). The optimization was performed using random search cross-validation, a technique that efficiently explores the hyperparameter space by evaluating a fixed number of parameter settings sampled from specified distributions for the RF, XGBoost, and LightGBM classifiers. Unlike an exhaustive grid search, random search cross-validation offers a more computationally efficient approach to identifying a near-optimal configuration, providing a favorable trade-off between search time and model performance.
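A random search loop of this kind can be sketched generically. The parameter ranges below are illustrative placeholders (the study does not list its exact search spaces), and `score_fn` stands in for a cross-validated scoring routine such as mean F1 over folds.

```python
import random

# Minimal random-search sketch over an illustrative RF-style parameter space.
PARAM_SPACE = {
    "n_estimators": [100, 200, 400, 800],
    "max_depth": [4, 8, 16, None],
    "min_samples_leaf": [1, 2, 4],
}

def random_search(score_fn, param_space, n_iter=20, seed=0):
    """Sample n_iter configurations and return the best (params, score) pair."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {k: rng.choice(v) for k, v in param_space.items()}
        score = score_fn(params)  # e.g. mean cross-validated F1 on the validation set
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice the same role is played by scikit-learn's `RandomizedSearchCV`, which additionally manages the cross-validation folds.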
The final performance evaluation of the model was conducted on a held-out test set, which included samples from the two remaining fault types (Faults 8–9) not seen during training or validation. This test set contained 259 healthy samples and 814 faulty samples, providing a rigorous assessment of the model’s ability to generalize to novel defect patterns.
The experimental setup utilized the following hardware configuration: a desktop computer equipped with an AMD Ryzen 7 5800H CPU (Advanced Micro Devices Inc. (AMD), Santa Clara, CA, USA), an NVIDIA GeForce RTX 3060 graphics processing unit (GPU) (NVIDIA, Santa Clara, CA, USA), and 16 GB of RAM.
6. Results and Discussion
This section presents a comprehensive evaluation of the proposed defect detection models. The analysis begins by systematically evaluating the strengths and limitations of models trained exclusively on data from the TrueDepth camera. To quantitatively validate the system’s measurement precision, we compare the physical dimensions (height, width, and depth) of identified faults against manual measurements obtained with laboratory-grade tools in
Section 6.2. This comparative analysis confirms that the geometric data derived from the point clouds provides highly accurate quantitative assessments of surface damage, moving beyond mere detection to enable precise fault characterization.
6.1. Hybrid CNN-ML Model Performance Comparison in Performing Surface Defect Classification
The performance of the trained classification models was evaluated for their capability to identify surface defects on the conveyor belt using RGB and TrueDepth cameras separately. Classifiers’ efficacy was quantified using standard performance metrics: accuracy and F1 score. An F1 score exceeding 0.9 indicates strong potential for real-world industrial deployment. The comprehensive performance results across both validation and test datasets are summarized in
Table 1.
For the RGB image modality, the Xception architecture paired with a Random Forest (Xception-RF) classifier achieved the highest F1 score (0.9813) on the test set, demonstrating its superior capability in detecting defects based on visual features. This model also showed strong consistency, with its performance on the validation set (0.9692) closely matching its test results, indicating robust generalizability. The VGG16-RF model also performed robustly, with an F1 score of 0.9769. Notably, while the Xception-XGBoost model achieved the highest validation F1 score (0.9882) for RGB, its test performance (0.9674) experienced a more significant drop, suggesting a potential for overfitting compared to the more stable RF-based variants. In contrast, models based on the ResNet50 architecture showed markedly lower performance on RGB data, with F1 scores on the test set falling below 0.78 for its XGBoost and LightGBM implementations. This suggests that the ResNet50 architecture may be less suitable for extracting discriminative features from the surface texture and color variations present in the conveyor belt images under the studied conditions.
For the 2D representation of point cloud data modality, which captures 3D topographic information, the InceptionV3-RF model achieved the highest F1 score (0.9919) on the test set. This indicates the exceptional effectiveness of geometric features for defect detection, as surface deformations like gouges and cuts manifest clearly as anomalies in the 3D point cloud. The Xception-RF and VGG16-LightGBM models also performed exceptionally well on this modality, with test F1 scores of 0.9894 and 0.9805, respectively. The consistently high performance across multiple architectures for the TrueDepth modality—with six different model-classifier combinations achieving a test F1 score above 0.97—underscores the inherent robustness of geometric data. This data is less susceptible to the visual challenges, such as lighting variations and dust, that can adversely affect RGB image analysis, a fact highlighted by the performance gap between modalities for architectures like InceptionV3, where the TrueDepth F1 score was over 0.05 higher than its RGB counterpart.
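The hybrid "frozen CNN backbone + classical classifier" design evaluated above can be illustrated with lightweight stand-ins: global average pooling plays the role of the pre-trained CNN feature extractor, and a nearest-centroid rule stands in for the RF/XGBoost/LightGBM head. This is a structural sketch only, not the study's implementation.

```python
import numpy as np

def extract_features(images):
    """Stand-in backbone: global average pooling, (N, H, W, C) -> (N, C).
    In the actual pipeline this role is played by a frozen pre-trained CNN
    such as InceptionV3 or Xception."""
    return images.mean(axis=(1, 2))

class NearestCentroid:
    """Stand-in for the ML classifier head (RF/XGBoost/LightGBM in the study)."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance of every feature vector to every class centroid.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]
```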
The confusion matrix for the top-performing TrueDepth model, InceptionV3-RF, shown in
Figure 9, reveals a conservative detection profile: no false negatives were observed on the test set (all 814 faulty instances were correctly identified), while the primary error mode consisted of 13 false positives in which intact surface areas were flagged as defective. This behaviour reduces the risk of missed critical defects, which is an important advantage for maintenance in safety-critical operations, but increases the rate of unnecessary follow-up inspections. The false-positive burden can be managed in practice by adjusting detection thresholds, applying simple post-processing filters, or introducing a lightweight secondary verification step to improve precision without substantially compromising defect detection sensitivity.
The Receiver Operating Characteristic (ROC) curves in
Figure 10 summarize the performance of models trained on TrueDepth-derived data. The models using 3D geometric features exhibit strong and consistent discriminative power, with curves rising sharply toward the top-left corner and area under the curve (AUC) values exceeding 0.98 across classifiers. These results indicate that features extracted from TrueDepth point clouds provide a robust and separable representation of defects, enabling high true positive rates with low false alarm rates across different model–classifier combinations. The consistency of ROC behaviour across architectures suggests that the projected point cloud representation delivers classifier-agnostic signal quality suitable for reliable defect detection in challenging environments.
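AUC itself admits a compact rank-based formulation (the Mann-Whitney identity): it equals the probability that a randomly chosen faulty sample receives a higher classifier score than a randomly chosen healthy one. A plain-Python sketch:

```python
def auc_score(labels, scores):
    """AUC via the rank-sum identity; labels are 0 (healthy) / 1 (faulty).
    Ties between a positive and a negative score count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```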
The results demonstrate that the TrueDepth camera captures superior data for identifying surface defects on conveyor belts. The underperformance of the model trained on RGB data is primarily due to false positives caused by surface textures, such as permanent belt patches, that were incorrectly classified as defects, as illustrated in
Figure 11. In contrast, the point cloud data from the TrueDepth camera effectively excluded these patches from consideration, as their depth measured below 1 mm. Our model was specifically trained to recognize defects with a depth exceeding 2 mm, thereby ignoring superficial variations.
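The depth-gating rule described above can be sketched as a simple filter on the point cloud. The function and plane convention here are illustrative assumptions; the text specifies only the two figures involved (patches below 1 mm are ignored, and the model is trained on defects deeper than 2 mm).

```python
import numpy as np

DEFECT_DEPTH_MM = 2.0  # training threshold from the text; patches (<1 mm) fall below it

def defect_candidate_points(points_mm, belt_plane_z=0.0):
    """Keep points whose depth below the nominal belt plane exceeds the threshold.

    points_mm: (N, 3) array of x, y, z in millimetres; z below belt_plane_z
    means the point lies beneath the belt surface."""
    depth = belt_plane_z - points_mm[:, 2]
    return points_mm[depth > DEFECT_DEPTH_MM]
```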
Furthermore, the 2D representation derived from the point cloud retains only depth-based anomalies, whereas RGB images contain all visual textures present on the belt surface. This additional complexity makes it significantly more challenging for a model to distinguish genuine defects from normal surface patterns. Consequently, using the 2D representation of point clouds substantially reduces the number of training samples required to achieve high performance. This is a crucial advantage in industrial environments, where acquiring large, accurately labeled datasets is often prohibitively expensive or logistically impractical.
6.2. Real-World Accuracy Comparison for Defect Quantification
This section analyses the model outcomes and outlines how detected faults can be subjected to further quantitative evaluation using TrueDepth-derived geometry. By training on 2D projections of the point cloud, we substantially reduce computational complexity while retaining the geometric detail needed for reliable defect detection. Only samples classified as defective are forwarded for high-resolution, full-dimensional 3D reconstruction using the TrueDepth point clouds, producing detailed geometric models that support precise dimensional measurements and localization.
Figure 12 illustrates the measured dimensions of fault sample 2 obtained from the TrueDepth point cloud. This selective reconstruction strategy optimizes computational resources by avoiding expensive processing of non-defective data and delivers actionable, geometry-focused maintenance feedback to technicians.
To validate the quantitative accuracy of the proposed vision-based system, its measurements were benchmarked against a conventional manual method using a ruler and caliper as the baseline. A comprehensive comparison between the measured height, width, and depth of the faults obtained from TrueDepth 3D point clouds and the manual measurements is presented in
Table 2.
The comparative analysis revealed a high degree of concordance, with the measurement error for defect dimensions between the TrueDepth camera and the manual method remaining within 3 mm. This result confirms the system’s reliability and precision in capturing key geometric parameters. Beyond replicating manual measurements, the TrueDepth camera offers a distinct advantage: the capability for near-real-time, quantitative assessment of complex defect morphology. While manual techniques struggle with the irregular, non-linear contours typical of surface damage, the system accurately determines the shape and calculates the actual surface area of a defect. This facilitates a more comprehensive damage assessment, including a preliminary estimation of material loss volume, derived from the product of the measured area and average depth. It is important to note that the accuracy of this volumetric estimate is contingent upon the slope and internal geometry of the surface defects.
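The preliminary material-loss estimate mentioned above (measured area times average depth) can be written out directly; as noted, its accuracy depends on the slope and internal geometry of the defect. The values below are illustrative, not measurements from the study.

```python
import numpy as np

def material_loss_volume(area_mm2, depths_mm):
    """First-order volume estimate: defect surface area times mean depth (mm^3).
    A flat-bottomed approximation; sloped or irregular cavities will deviate."""
    return area_mm2 * float(np.mean(depths_mm))
```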
7. Summary and Conclusions
The presented study successfully introduced and validated a novel smartphone-driven surface defect detection system framework for the inspection of industrial conveyor belt surfaces. Addressing the challenges of unreliable computer vision methods in harsh mining environments characterized by variable lighting and dust, the system uses the integrated TrueDepth camera of a commercial smartphone (iPhone 12 Pro Max) to simultaneously capture high-resolution visual data and precise 3D point clouds from a moving belt.
The presented methodology is based on a 3D-to-2D projection, which converts complex point cloud data into structured 2D representations. We employed a hybrid architecture in which pre-trained CNNs (VGG16, ResNet50, InceptionV3, Xception) serve as deep feature extractors, followed by machine learning classifiers (Random Forest, XGBoost, LightGBM). The InceptionV3-RF model, operating on geometric features, attained a high test F1 score of 0.9919 and maintained near-perfect recall for the fault class. This capability is critical for operational safety, as it minimizes the risk of undetected defects.
Additionally, the proposed methodology enables the quantitative assessment of surface damage. Comparative analysis against manual measurements confirmed the system’s reliability, with measurement errors for defect dimensions remaining within 3 mm for point cloud-derived depth. Such accuracy allows the system to properly determine the complex morphology of defects and calculate the defect surface area and shape. This information allows for a robust system of maintenance with tracking of the belt condition through its full life cycle and, in turn, better prediction of the time of necessary intervention.
In conclusion, this research validates a reliable and cost-effective smartphone-based sensing platform that supports near-real-time maintenance decisions. We demonstrate the distinct advantage of the TrueDepth camera over conventional RGB imaging for capturing surface geometry, a capability that proves particularly suitable for low-light industrial conditions. This study successfully established the system’s core effectiveness in a controlled laboratory setting, confirming its ability to detect geometric defects and generate accurate surface maps under simulated conditions.
Building on this validated foundation, the proposed methodology demonstrates significant potential to improve conveyor belt management, reduce maintenance costs, and enhance operational safety. In our future work, we plan to conduct field validation on an operational conveyor belt within a mining site to investigate the system’s robustness against real-world challenges such as airborne particulates and mechanical vibration. Furthermore, we intend to explore data fusion techniques that integrate complementary information from both RGB and TrueDepth cameras. This multi-modal approach aims to generate a more comprehensive condition report, potentially providing supervisors with a richer, hybrid data stream for enhanced quality monitoring and decision-making in industrial settings.