Figure 1.
Fundamental working principle of AI.
Figure 3.
Components of AI.
Figure 4.
Generative AI as a subfield of AI.
Figure 5.
Different types of AI.
Figure 6.
The underlying systems of Generative AI.
Figure 7.
Generative AI impact on manufacturing performance metrics.
Figure 8.
Side view and top view of an intact package on a conveyor belt.
Figure 9.
Normal distributions, centered at 0.35 (damaged) and 0.20 (intact), showing the brush-strength variability used during sculpting.
Figure 10.
Visualizing sample stroke-path geometries with ten random cubic Bézier curves, within the defined stroke window. (The dashed box is ±5 cm horizontally and ±2.5 cm vertically.)
Figure 11.
Stroke-path geometry with damaged vs. intact overlays within the defined stroke window. (The dashed box is ±5 cm horizontally and ±2.5 cm vertically.)
Figure 12.
Uniform distributions of jitter for X-offsets (±0.04 m), Y-offsets (±0.07 m), and yaw angles (±20.00°), depicting how packages were randomly translated and rotated on the conveyor.
Figure 13.
Discrete uniform distribution of background frame selection.
Figure 14.
A normal distribution (μ = 16, σ = 3) for f-stop values, illustrating the camera DOF variation.
Figure 15.
Foreground/Background sharpness and blur, where the package and conveyor grid both appear noticeably out of focus.
Figure 16.
DOF vs. F-Stop.
Figure 17.
Relative variation across pipeline variables (log scale).
Figure 18.
Individual vs. cumulative configuration possibilities (log scale).
Figure 19.
Differences in rendering quality.
Figure 20.
More realistic rendering options.
Figure 21.
Alternative rendering qualities.
Figure 22.
Relative impacts of quality parameters on cost metrics.
Figure 23.
Simplified two-view damage-detection architecture for MobileNetV2.
Figure 24.
Different image enhancements applied in Approach II.
Figure 25.
Applying image enhancement techniques in sequence.
Figure 26.
Simplified two-view damage-detection architecture for ResNet-18.
Figure 27.
Annotated package with side and top views.
Figure 28.
Architecture of the pretrained ViT.
Figure 29.
Simplified architecture of the GPT o4-mini-high used in package inspection.
Figure 30.
The simplified architecture of the YOLOv7 used in package inspection.
Figure 31.
Simplified architecture of the YOLOv8-seg used in package inspection.
Figure 32.
Samples of segmented packages, with side view and top view, after CLAHE.
Figure 33.
Loss and accuracy curves associated with the training and validation process utilizing the MobileNetV2 in Approach I.
Figure 34.
MobileNetV2 PR curves in Approach I.
Figure 35.
ROC curve for MobileNetV2 in Approach I.
Figure 36.
Confusion matrix for MobileNetV2 in Approach I.
Figure 37.
Comparison of the accuracy of Approach II variants (with different enhancements).
Figure 38.
Loss and accuracy curves during the training and validation process of the MobileNetV2 with CLAHE in Approach II.
Figure 39.
MobileNetV2 with CLAHE PR curves in Approach II.
Figure 40.
ROC curve for MobileNetV2 with CLAHE in Approach II.
Figure 41.
Confusion matrix for MobileNetV2 with CLAHE in Approach II.
Figure 42.
Loss and accuracy curves during the training and validation process of ResNet-18 with image enhancements in Approach III.
Figure 43.
ResNet-18 with image enhancement PR curves in Approach III.
Figure 44.
ROC curve for ResNet-18 with image enhancements in Approach III.
Figure 45.
Confusion matrix for ResNet-18 with image enhancements in Approach III.
Figure 46.
Comparison of Approach IV classifiers’ accuracies (both images as input).
Figure 47.
PR curves for default LR in Approach IV.
Figure 48.
ROC curve for default LR in Approach IV.
Figure 49.
Confusion matrix for default LR classifier in Approach IV.
Figure 50.
Confusion matrix for the GPT o4-mini-high in Approach V.
Figure 51.
Loss curves for YOLO (top view on the left vs. side view on the right) in Approach VI.
Figure 52.
PR curves for YOLOv7 in Approach VI.
Figure 53.
Comparison of LR classifiers in Approach VII.
Figure 54.
PR curve for the default LR classifier in Approach VII.
Figure 55.
ROC curve for the default LR classifier in Approach VII.
Figure 56.
Confusion matrix for the default LR classifier in Approach VII.
Figure 57.
Side-view validation metric curves for YOLOv8-seg.
Figure 58.
Top-view validation metric curves for YOLOv8-seg.
Figure 59.
Confusion matrices for the training and validation process for the side-view images and the top-view images.
Figure 60.
Training and validation loss and accuracy curves for top-view images.
Figure 61.
Training and validation loss and accuracy curves for both side- and top-view images.
Figure 62.
ROC curves during the testing of MobileNetV2 for the top-view-only images and the side- and top-view images.
Figure 63.
Confusion matrices for MobileNetV2 for the top-view images and both the side- and top-view images in Approach VIII.
Figure 64.
PR curves for the LR classifier in Approach IX.
Figure 65.
ROC curves during the testing of ViT + LR for the top-view images and the side- and top-view images.
Figure 66.
Confusion matrices of ViT + LR for the top-view images and the side- and top-view images in Approach IX.
Figure 67.
Accuracy for configurations in Approach X.
Figure 68.
PR curve for 3:1 augmented training and validation with ViT + LR in Approach X.
Figure 69.
ROC curve for 3:1 augmented training and validation with ViT + LR in Approach X.
Figure 70.
Confusion matrix for 3:1 augmented training and validation with ViT + LR in Approach X.
Figure 71.
Accuracy values for all approach variants.
Figure 72.
Performance vs. complexity, speed, training time, and computational cost.
Table 1.
The structure of the paper.
N | Section | Subsection | Subdivision | Overall Description
---|
1 | Introduction | NA | NA | Manual inspection and AI-based inspection
2 | Development of AI | NA | NA | Mechanics of Narrow AI and Generative AI
3 | AI-Enabled Sustainability | Narrow AI: Enhancing Defect Management and Waste Reduction through Data-Driven Insights | NA | How AI enables waste reduction, QC, sustainability-driven production, and environment simulation
 | | Generative AI: A Paradigm Shift Towards Proactive Quality Design and Waste Minimization | NA |
 | | Different Forms of Generative AI | 3D Synthesis via Blender |
 | | Hybrid Systems | NA |
4 | The Dataset | Pipeline Design of the Dataset | NA | Information, parameters, and specifications regarding the creation of the dataset and its quality
 | | General Constraints and Limitations of the Synthetic Images’ Datasets | NA |
 | | Evaluating the Quality of the Dataset | NA |
 | | Synopsis | NA |
5 | Methodology | Approach I | NA | Detailed description of all the models, algorithms, enhancements, and ensembles being deployed
 | | Approach II | Global Histogram Equalization |
 | | | Contrast-Limited Adaptive Histogram Equalization |
 | | | Sharpening |
 | | Approach III—Approach X | NA |
6 | Results | Results of Approach I—Approach X | NA | Shows the results of all the approaches being deployed in Section 5
7 | Discussion | Performance Measurements | Root-Cause for Misclassifications | Quantitative and qualitative comparison between the different approaches in terms of effectiveness and fidelity
 | | Future Work | NA |
 | | Limitations | NA |
8 | Conclusion | NA | NA | Lessons learned and final thoughts
Table 2.
Descriptions of different types of AI.
Narrow AI (ANI), also known as Weak AI or Predictive AI | This category encompasses AI systems designed to perform specific, well-defined tasks within limited domains. These systems operate under pre-programmed rules and lack the ability to generalize beyond their designated functions. Examples include recommendation algorithms, virtual assistants, and automated speech recognition. |
AGI (Strong AI) | This level of AI aspires to mimic human cognitive abilities, enabling machines to comprehend, learn, and apply knowledge across multiple domains, without task-specific programming. Unlike Narrow AI, General AI possesses adaptability and can autonomously solve problems, reason, and make decisions in a manner similar to human intelligence. While still theoretical, the development of such systems remains a major goal in AI research [29]. |
ASI (Super AI) | This hypothetical stage represents the surpassing of human intelligence by AI in all cognitive aspects, including reasoning, problem-solving, creativity, and emotional intelligence. Machines with Super AI would not only execute complex tasks but also possess self-awareness, consciousness, and autonomous decision-making abilities. Although currently a subject of speculation and ethical debate, advancements in AI continue to push the boundaries toward this possibility. |
Reactive Machines (RM) | The simplest form of AI, reactive machines function without memory or the ability to learn from past experiences. They operate solely based on real-time inputs, responding to specific stimuli without retaining information for future decision-making. Deep Blue, the IBM chess-playing system, exemplifies this type of AI. |
Limited Memory AI | Unlike reactive machines, these AI systems can retain and utilize past information for a limited period to make decisions. They rely on historical data and real-time inputs to improve performance. Autonomous vehicles, which use stored sensor data to navigate dynamic environments, fall into this category. |
Theory of Mind AI (ToM AI) | Representing a more advanced stage of AI, this concept envisions machines capable of understanding human emotions, beliefs, and intentions. By recognizing psychological states and adapting their responses accordingly, these systems would enable more natural and meaningful human–AI interactions. Although still in its early stages, research in this area aims to enhance AI’s ability to engage in social reasoning and cooperative tasks. |
Self-Aware AI | The most advanced and speculative form of AI, self-aware systems would possess a sense of consciousness and be capable of self-reflection and independent thought. Unlike other AI types, they would not only process information but also understand their own existence, emotions, and objectives. While theoretical, the pursuit of self-aware AI raises profound ethical, philosophical, and safety considerations regarding machine autonomy and decision-making. |
Table 3.
Comparing Generative AI and Narrow AI.
Technology | Key Features | Advantages | Disadvantages | Use Cases |
---|
Narrow AI | Rule-based systems, Limited adaptability | Established frameworks, Predictive accuracy in stable environments | Rigid, Cannot self-improve, Requires extensive data labeling | QC, Supply chain optimization, Demand forecasting |
Generative AI | Self-learning, Adaptive algorithms, Enhanced creativity | Can generate novel solutions, Continuous learning, Reduced need for labeled data | Complexity in implementation, Potential for biased outputs, Requires significant computational resources | Product design, Process optimization, QC, Inspection, Predictive maintenance (PdM) |
Hybrid AI Systems | Combines Narrow AI with Generative AI, Utilizes DT and NN | Leverages strengths of both technologies, More robust decision-making, Enhanced flexibility | Increased system complexity, Higher initial investment, Difficulties in integration | Smart manufacturing, Automated quality assurance, Adaptive supply chain management |
Table 4.
Summary of the numerical values of the parameters.
Parameter | Value/Distribution |
---|
Stroke Count (Damaged) | 3 strokes |
Stroke Count (Intact) | 2 strokes |
Stroke Position (X, Y) | X ∈ [–5 cm, +5 cm], Y ∈ [–2.5 cm, +2.5 cm] |
Stroke Strength (Damaged) | Normal (mean = 0.35, σ = 0.05) |
Stroke Strength (Intact) | Normal (mean = 0.20, σ = 0.05) |
Strength Distribution (σ) | σ = 0.05 (shared) |
Stroke Shape (Bezier Control) | Random cubic Bézier control points |
Axes of Variability | (1) Stroke count, (2) Strength, (3) Position, (4) Shape |
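For readers who want to reproduce the stroke randomization summarized in Table 4, the sampling reduces to a few lines of NumPy. The sketch below is illustrative only; the actual Blender sculpting operators are omitted, and the helper names are ours rather than the generation pipeline's. Strengths come from the class-specific normal distributions, and each stroke path is a random cubic Bézier inside the ±5 cm × ±2.5 cm window.

```python
# Illustrative NumPy sketch of the per-stroke sampling in Table 4 (the Blender
# sculpt operators themselves are omitted; helper names are assumptions).
import numpy as np

rng = np.random.default_rng(0)

def sample_strokes(damaged: bool):
    n_strokes = 3 if damaged else 2                     # stroke count per class
    mean = 0.35 if damaged else 0.20                    # brush-strength means
    strokes = []
    for _ in range(n_strokes):
        strength = rng.normal(mean, 0.05)               # shared sigma = 0.05
        # Four control points: x in +/-5 cm, y in +/-2.5 cm (metres)
        ctrl = np.column_stack([rng.uniform(-0.05, 0.05, 4),
                                rng.uniform(-0.025, 0.025, 4)])
        strokes.append((strength, ctrl))
    return strokes

def bezier(ctrl, n=50):
    """Evaluate a cubic Bézier path from its four control points."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    p0, p1, p2, p3 = ctrl
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

strength, ctrl = sample_strokes(damaged=True)[0]
path = bezier(ctrl)                                     # 50 x 2 array of (x, y) points
```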
Table 5.
Summary of the numerical values of the parameters.
Parameter | Value/Distribution |
---|
X-Offset (Lateral Jitter) | Uniform over ±0.04 m (±4 cm) |
Y-Offset (Longitudinal Jitter) | Uniform over ±0.07 m (±7 cm) |
Yaw Rotation Jitter | Uniform integer in [–20°, +20°] |
Uniform Distribution | Equal probability across each continuous or integer value |
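The pose jitter in Table 5 can be pictured the same way; the following schematic NumPy fragment (again illustrative, not the rendering script) draws the uniform offsets and the integer yaw and applies them to a nominal package pose.

```python
# Schematic NumPy version of the pose jitter in Table 5 (not the rendering script).
import numpy as np

rng = np.random.default_rng(1)

def jitter_pose(base_xy=(0.0, 0.0)):
    dx = rng.uniform(-0.04, 0.04)              # lateral jitter, +/-0.04 m
    dy = rng.uniform(-0.07, 0.07)              # longitudinal jitter, +/-0.07 m
    yaw_deg = int(rng.integers(-20, 21))       # integer yaw in [-20, +20] degrees
    yaw = np.deg2rad(yaw_deg)
    rotation = np.array([[np.cos(yaw), -np.sin(yaw)],
                         [np.sin(yaw),  np.cos(yaw)]])
    position = np.asarray(base_xy, dtype=float) + np.array([dx, dy])
    return position, rotation, yaw_deg

position, rotation, yaw_deg = jitter_pose()
```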
Table 8.
FID scores and the descriptive statistical values for SSIM and Brightness.
Inter-Category Comparison | Interpretation Based on FID Scores | FID Score
---|
Top (Intact vs. Damage) | Good (Noticeable Diff) * | 11.77
Side (Intact vs. Damage) | Good (Noticeable Diff) | 23.59
Intact (Top vs. Side) | Poor/Very Different * | 362.52
Top Intact vs. Side Damage | Poor/Very Different | 362.35
Top Damage vs. Side Intact | Poor/Very Different | 365.1
Damage (Top vs. Side) | Poor/Very Different | 364.52
Inter-Category Comparison | Interpretation Based on SSIM Scores | Mean | Std | Range
---|
Top (Intact vs. Damage) | Moderate Similarity * | 0.268 | 0.215 | 0.068–0.881
Side (Intact vs. Damage) | Moderate Similarity | 0.323 | 0.153 | 0.162–0.807
Intact (Top vs. Side) | Very Low Similarity * | 0.002 * | 0.004 * | −0.019
Top Intact vs. Side Damage | Very Low Similarity | | | −0.022
Top Damage vs. Side Intact | Very Low Similarity | | | −0.021
Damage (Top vs. Side) | Very Low Similarity | | | −0.024
Category | FID vs. SSIM Agreement
---|
Top Intact | Strong (Both Indicate Similarity) *
Side Intact | Strong (Both Indicate Similarity)
Top Intact | Strong (Both Indicate Similarity)
Top Damage | Strong (Both Indicate Similarity)
Category | Internal Consistency Based on SSIM Scores | Mean | Std | Range
---|
Top Intact | Very Good Consistency/High | 0.306 | 0.237 | 0.066–0.877
Top Damage | Good Consistency/Medium | 0.291 | 0.232 | 0.067–0.914
Side Intact | Excellent (Low Variation)/High * | 0.317 | 0.166 | 0.161–0.801
Side Damage | Excellent (Low Variation)/High | 0.326 | 0.171 | 0.183–0.812
Category | Technical Quality Based on Brightness Stats | Mean | Std | Range
---|
Top Intact | Good (Consistent) * | 137.1 * | 43.8 | 93.3–181.0
Top Damage | Good (Consistent) | 137.1 | 43.8 | 93.4–180.9
Side Intact | Excellent (Very Consistent) * | 133.4 | 30.5 | 102.8–163.9
Side Damage | Excellent (Very Consistent) | 133.7 | 30.4 | 103.3–164.1
Category | Number of Instances | Features
---|
Top Intact | 100 * | 2048 *
Top Damage | 100 | 2048
Side Intact | 100 | 2048
Side Damage | 100 | 2048
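The SSIM and brightness statistics in Table 8 follow standard definitions; the fragment below is a minimal sketch using scikit-image and NumPy. Image loading and pair selection are omitted because they depend on the dataset layout, and the FID values (which require an Inception-based feature extractor) are not reproduced here.

```python
# Minimal sketch of the SSIM and mean-brightness statistics reported in Table 8.
# Image loading/pairing is omitted; FID (Inception-based) is not reproduced here.
import numpy as np
from skimage.color import rgb2gray
from skimage.metrics import structural_similarity as ssim

def pairwise_ssim(images_a, images_b):
    """SSIM between corresponding images (RGB uint8 arrays) of two categories."""
    scores = [ssim(rgb2gray(a), rgb2gray(b), data_range=1.0)
              for a, b in zip(images_a, images_b)]
    return float(np.mean(scores)), float(np.std(scores)), (min(scores), max(scores))

def brightness_stats(images):
    """Per-image mean pixel intensity (0-255 for uint8), then summary statistics."""
    means = [float(img.mean()) for img in images]
    return float(np.mean(means)), float(np.std(means)), (min(means), max(means))
```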
Table 9.
Detailed summary of all the approaches used in the package inspection process.
Variants | Approaches I–X |
---|
Process/Model/Algorithm | I | II | III | IV | V | VI | VII | VIII | IX | X |
Polygon mask annotation in COCO format | | | | x | | | x | | | |
Bounding Box (BBox) annotation in COCO format | | | | x | | | x | | | |
Polygon mask annotation in YOLO format | | | | | | | | x | x | x |
Bounding Box (BBox) annotation in YOLO format | | | | | | x | | x | x | x |
Training set augmentation | | | x | | | | | | | x |
Validation set augmentation | | | | | | | | | | x |
Training set augmentation ratio (1:1) | | | x | | | | | | | |
Training/Validation set augmentation ratio (3:1) | | | | | | | | | | x |
Test-time augmentation (TTA) | | | | | | | | | | x |
Stratified split (50/30/20) of data | x | x | x | x | | | x | x | x | x
Stratified split (80/20) of data | | | | | | x | | | | |
Deblurring of blurry images | | | | | | | x | | | |
Transfer learning | x | x | x | x | | x | x | x | x | x
MobileNetV2 | x | x | | | | | | x | | x |
Both top- and side-view images were fed as input | x | x | x | x | x | x | x | x | x | x
Only top-view images were fed as input | x | x | | | x | | x | x | x | |
Sharpening | | x | | | | | | | | |
Global histogram equalization (GHE) | | x | | | | | | | | |
Multiscale Retinex (MSR) | | | x | | | | | | | |
Contrast-limited adaptive histogram equalization (CLAHE) | | x | x | x | | | x | x | x | x
Fast non-local means (NLM) denoising for colored images | | | x | | | | | | | |
Fast super-resolution convolutional neural network (FSRCNN) | | | x | | | | | | | |
Mean–variance normalization | | | x | | | | | | | |
ResNet-18 | | | x | | | | | | | |
Region of interest (ROI) extraction | | | | x | | | x | | | |
Vision Transformer (ViT) | | | | x | | | x | | x | x
Logistic regression (LR) | | | | x | | | x | | x | x
SVM | | | | x | | | | | | |
RF | | | | x | | | | | | |
GPT o4-mini-high | | | | | x | | | | | |
You Only Look Once (YOLO)/ YOLOv7 | | | | | | x | | | | |
YOLOv8 Segmentation/ YOLOv8-seg | | | | | | | | x | x | x |
Table 10.
Pipeline description for each approach.
# | Approach Variants | Views
---|
I | MobileNetV2 | Both, Top
II | MobileNetV2 + CLAHE | Top
III | ResNet18 + MSR + CLAHE + fast NLM denoising + FSRCNN + Normalization | Both
IV | ROI + ViT + LR | Both
V | GPT o4-mini-high | Both, Top
VI | YOLOv7 | Both
VII | ROI + Deblur + CLAHE + ViT + LR | Both, Top
VIII | YOLOv8-seg + CLAHE + MobileNetV2 | Both, Top
IX | YOLOv8-seg + CLAHE + ViT + LR | Both, Top
X | YOLOv8-seg + CLAHE + Augmentation + ViT + LR | Both
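As a concrete illustration of how the stages in Table 10 compose, the sketch below pushes a single pre-segmented package crop through CLAHE, ViT feature extraction, and a logistic-regression decision, roughly in the spirit of Approach IX. The YOLOv8-seg cropping is assumed to have happened upstream, and the checkpoint names, preprocessing constants, and helper functions are assumptions rather than the exact implementation.

```python
# Rough sketch of an Approach IX-style inference path from Table 10:
# YOLOv8-seg crop (assumed to be done upstream) -> CLAHE -> ViT features -> LR.
# Checkpoints, constants, and helper names are assumptions, not the authors' code.
import cv2
import numpy as np
import torch
from torchvision import models, transforms

def clahe_bgr(img_bgr):
    """Apply CLAHE to the L channel of a BGR crop (OpenCV convention)."""
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    l = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(l)
    return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)

# ViT-B/16 backbone used as a generic feature extractor (classification head removed).
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
vit.heads = torch.nn.Identity()
vit.eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def vit_features(img_bgr):
    rgb = cv2.cvtColor(clahe_bgr(img_bgr), cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        return vit(preprocess(rgb).unsqueeze(0)).squeeze(0).numpy()   # 768-dim vector

# A scikit-learn LR trained on concatenated side+top features then gives the decision:
# probs = lr.predict_proba(np.hstack([vit_features(side), vit_features(top)])[None, :])
```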
Table 11.
Parameter comparison between MobileNetV2 and ResNet-18.
Feature | ResNet-18 (PyTorch, Dual-View) | MobileNetV2 (TensorFlow, Dual-View) |
---|
Architecture | ResNet-18 (×2, one for each view) | MobileNetV2 (×2, one for each view) |
Pretraining | ImageNet | ImageNet |
Input Size | 224 × 224 × 3 | 128 × 128 × 3 |
Parameter Count | ~11.7 million per ResNet18 | ~3.5 million per MobileNetV2 |
Fusion | Concatenate 512-dim features from each ResNet | Concatenate pooled features from each MobileNetV2 |
Classifier Head | Linear(1024→256) → ReLU → Dropout → Linear(256→2) | Dense(128, relu) → Dropout → Dense(1, sigmoid) |
Trainable Layers | Only layer3, layer4, and classifier head | All layers frozen except dense head |
Framework | PyTorch | TensorFlow/Keras |
Loss Function | CrossEntropyLoss | Binary Crossentropy |
Optimizer | Adam | Adam |
Batch Size | 16 | 32 |
Epochs | 15 | 15 |
Augmentation | Resize, flip, rotation, color jitter | Resize, normalization (optionally, enhancement) |
Use Case | Fine-tuned, robust, slightly heavier | Lightweight, fast, efficient |
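A minimal PyTorch sketch of the dual-view ResNet-18 column in Table 11 is given below: one ImageNet-pretrained ResNet-18 per view, the two 512-dimensional feature vectors concatenated, and a Linear(1024→256) → ReLU → Dropout → Linear(256→2) head, with only layer3, layer4, and the head trainable. The dropout rate and the optimizer call beyond what the table states are assumptions.

```python
# Minimal sketch of the dual-view ResNet-18 fusion summarized in Table 11.
import torch
import torch.nn as nn
from torchvision import models

class DualViewResNet18(nn.Module):
    def __init__(self, dropout=0.5):               # dropout rate is an assumed value
        super().__init__()
        self.side = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        self.top = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        for backbone in (self.side, self.top):
            backbone.fc = nn.Identity()            # expose the 512-dim pooled features
            for name, param in backbone.named_parameters():
                # freeze everything except layer3 and layer4 (per Table 11)
                param.requires_grad = name.startswith(("layer3", "layer4"))
        self.head = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(), nn.Dropout(dropout), nn.Linear(256, 2))

    def forward(self, side_img, top_img):          # each view: (B, 3, 224, 224)
        fused = torch.cat([self.side(side_img), self.top(top_img)], dim=1)
        return self.head(fused)                    # logits: intact vs. damaged

model = DualViewResNet18()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```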
Table 12.
Tuning parameters for different classifiers, using GridSearchCV.
Model | Parameters |
---|
RF | A tuned RF with max_depth = 20 and n_estimators = 200. An ensemble of two hundred DT, each restricted to a maximum depth of twenty splits. By limiting tree depth, the model prevents any single tree from growing overly complex and overfitting to noise, while still allowing sufficient hierarchy to capture nonlinear interactions among features. With two hundred trees voting on each prediction, the ensemble mitigates the high variance typical of individual trees, yielding classification boundaries that are more stable and robust for intact versus damaged packages. The combination of a moderately deep structure and a large number of trees strikes a balance between bias and variance, ensuring that the RF generalizes well without sacrificing its capacity to model subtle patterns in the feature set. |
SVM | A tuned SVM with a Radial Basis Function (RBF) kernel and C = 10. The classifier uses the RBF (Gaussian) kernel to project input feature vectors into a higher-dimensional space where they can be separated by a nonlinear boundary. The kernel’s Gaussian shape measures similarity between points based on their Euclidean distance, allowing the model to capture complex patterns in the data. The regularization parameter C = 10 sets the trade-off between maximizing the margin around the decision boundary and minimizing misclassification errors: a larger C places greater emphasis on correctly classifying training examples (at the risk of a smaller margin and potential overfitting), while a smaller C would allow more margin violations to achieve a smoother boundary. By tuning C to 10, the SVM is configured to penalize training errors relatively strongly, helping it to fit the subtle distinctions between intact and damaged package features extracted by the upstream feature extractor. |
LR | The tuned LR model employs L2-regularized maximum-likelihood estimation with a penalty strength inversely proportional to C = 10, meaning the algorithm imposes a relatively mild regularization that allows the model coefficients to adapt more freely to the training data. By selecting the liblinear solver, an efficient coordinate-descent implementation optimized for smaller datasets and L2 penalties, the training process iteratively updates each coefficient while holding others fixed, ensuring rapid convergence even when feature dimensionality is moderate. This configuration strikes a balance between underfitting and overfitting: the higher C value prioritizes reducing classification errors by permitting larger weights on informative features, while the liblinear backend delivers stable solutions and straightforward interpretability of the decision boundary separating intact from damaged package instances. |
LR | Default parameters (including L2 regularization, C = 1.0, solver = ‘lbfgs’, max_iter = 100). |
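The tuned settings in Table 12 correspond to an ordinary GridSearchCV run over the extracted feature vectors; the sketch below restricts the grids to the parameters named in the table, while the other candidate values and the five-fold cross-validation are assumptions.

```python
# Sketch of the GridSearchCV tuning behind Table 12; grids beyond the reported
# optima (max_depth = 20, n_estimators = 200, C = 10) and cv = 5 are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

searches = {
    "RF": GridSearchCV(RandomForestClassifier(),
                       {"n_estimators": [100, 200], "max_depth": [10, 20, None]}, cv=5),
    "SVM": GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10]}, cv=5),
    "LR": GridSearchCV(LogisticRegression(solver="liblinear"), {"C": [0.1, 1, 10]}, cv=5),
}

# X_train holds the extracted feature vectors, y_train the intact/damaged labels:
# for name, search in searches.items():
#     search.fit(X_train, y_train)
#     print(name, search.best_params_)   # e.g., RF -> {'max_depth': 20, 'n_estimators': 200}
```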
Table 13.
The different parameters for LR in Approach VII.
Views | Parameters |
---|
Top view | GridSearchCV suggested parameters: The logistic-regression model was tuned to use an L1 penalty (penalty = ‘l1’) with the saga solver and a high regularization constant (C = 100). By choosing L1 regularization, the model encourages sparsity in its weight vector, effectively performing built-in feature selection by driving many coefficients to zero, while the saga optimizer is one of the few solvers in scikit-learn that supports L1 penalties. Setting C = 100 (the inverse of the regularization strength) minimizes the shrinkage effect, allowing the most informative features to retain substantial weight while still benefiting from the robustness and interpretability that sparse solutions provide. This configuration strikes a balance between expressive power (through a large C) and parsimony (through L1). |
Both views | Same as in the above cell describing the top-view-only inspection |
Top view | Default parameters (including L2 regularization, C = 1.0, solver = ‘lbfgs’, max_iter = 100). |
Both views | Default parameters (including L2 regularization, C = 1.0, solver = ‘lbfgs’, max_iter = 100). |
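For reference, the two logistic-regression configurations compared in Table 13 map directly onto scikit-learn estimators, as in the brief sketch below; the raised max_iter is an assumption added only so that saga converges on dense feature vectors.

```python
# The two LR configurations from Table 13 (Approach VII) as scikit-learn objects.
from sklearn.linear_model import LogisticRegression

# GridSearchCV-suggested: sparse (L1) weights, saga solver, weak shrinkage (C = 100).
lr_tuned = LogisticRegression(penalty="l1", solver="saga", C=100, max_iter=5000)

# Default configuration: L2 penalty, C = 1.0, lbfgs solver, max_iter = 100.
lr_default = LogisticRegression()

# Both are then fit on the same feature vectors, e.g. lr_tuned.fit(X_train, y_train).
```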
Table 14.
Augmentation variations implemented in Approach X.
Augmentation Variation | Details Following the YOLOv8-seg + CLAHE |
---|
A | ViT + LR + Using the segmented enhanced dataset with a 3:1 augmented training set, a 3:1 augmented validation set, and a non-augmented (1:1) test set. |
B | ViT + LR + Using the segmented enhanced dataset with a 3:1 augmented training set, a 3:1 augmented validation set, and an 8:1 TTA test set. |
C | MobileNetV2 + Using the segmented enhanced dataset with a 3:1 augmented training set, a 3:1 augmented validation set, and a non-augmented (1:1) test set. |
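The augmentation variations in Table 14 can be sketched as follows, reading “3:1” as three images per original and “8:1” as eight test-time views per image; the specific transforms and the probability-averaging TTA rule are assumptions, since the table only fixes the ratios and the classifier stack.

```python
# Schematic of the 3:1 offline augmentation and the 8:1 TTA in Table 14.
# Transform choices and the probability-averaging rule are assumptions;
# `classifier` and `featurize` stand in for the trained LR and the ViT extractor.
import numpy as np
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

def expand_3_to_1(pil_images):
    """Each original plus two augmented copies -> a 3:1 training/validation set."""
    out = []
    for img in pil_images:
        out.extend([img, augment(img), augment(img)])
    return out

def tta_predict(classifier, featurize, pil_image, n_views=8):
    """Average predicted probabilities over the original plus 7 augmented views."""
    views = [pil_image] + [augment(pil_image) for _ in range(n_views - 1)]
    probs = np.stack([classifier.predict_proba(featurize(v)[None, :])[0] for v in views])
    return probs.mean(axis=0)
```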
Table 15.
Performance measurements obtained for all classes and views in different variants.
Approach Variants | Views | Class | Accuracy | Precision | Sensitivity | F1 Score | Specificity |
---|
MobileNetV2 | Both | Intact | 45.00% | 65.00% | 46.43% | 54.17% | 41.67%
 | | Damaged | 45.00% | 25.00% | 41.67% | 31.25% | 46.43%
 | Top | Intact | 50.00% | 5.00% | 50.00% | 9.09% | 50.00%
 | | Damaged | 50.00% | 95.00% | 50.00% | 65.52% | 50.00%
MobileNetV2 + CLAHE | Top | Intact | 67.50% | 81.82% | 45.00% | 58.06% | 90.00%
 | | Damaged | 67.50% | 62.07% | 90.00% | 73.47% | 45.00%
ResNet18 + MSR + CLAHE + fast NLM denoising + FSRCNN + Normalization | Both | Intact | 77.50% | 73.91% | 85.00% | 79.07% | 70.00%
 | | Damaged | 77.50% | 82.35% | 70.00% | 75.68% | 85.00%
ROI + ViT + LR | Both | Intact | 92.50% | 100.00% | 85.00% | 91.89% | 100.00%
 | | Damaged | 92.50% | 86.96% | 100.00% | 93.02% | 85.00%
GPT o4-mini-high | Both | Intact | 73.50% | 71.56% | 78.00% | 74.64% | 69.00%
 | | Damaged | 73.50% | 75.82% | 69.00% | 72.25% | 78.00%
 | Top | Intact | 60.50% | 58.82% | 70.00% | 63.93% | 51.00%
 | | Damaged | 60.50% | 62.96% | 51.00% | 56.35% | 70.00%
YOLOv7 | Both | Intact | 69.00% | 88.00% | 44.00% | 58.67% | 94.00%
 | | Damaged | 69.00% | 62.67% | 94.00% | 75.20% | 44.00%
ROI + Deblur + CLAHE + ViT + LR | Both | Intact | 90.00% | 83.33% | 100.00% | 90.91% | 80.00%
 | | Damaged | 90.00% | 100.00% | 80.00% | 88.89% | 100.00%
 | Top | Intact | 70.00% | 70.00% | 70.00% | 70.00% | 70.00%
 | | Damaged | 70.00% | 70.00% | 70.00% | 70.00% | 70.00%
YOLOv8-seg + CLAHE + MobileNetV2 | Both | Intact | 82.50% | 80.95% | 85.00% | 82.93% | 80.00%
 | | Damaged | 82.50% | 84.21% | 80.00% | 82.05% | 85.00%
 | Top | Intact | 55.00% | 55.00% | 55.00% | 55.00% | 55.00%
 | | Damaged | 55.00% | 55.00% | 55.00% | 55.00% | 55.00%
YOLOv8-seg + CLAHE + ViT + LR | Both | Intact | 92.50% | 94.74% | 90.00% | 92.31% | 95.00%
 | | Damaged | 92.50% | 90.48% | 95.00% | 92.68% | 90.00%
 | Top | Intact | 67.50% | 68.42% | 65.00% | 66.67% | 70.00%
 | | Damaged | 67.50% | 66.67% | 70.00% | 68.29% | 65.00%
YOLOv8-seg + CLAHE + Augmentation + ViT + LR | Both | Intact | 90.00% | 90.00% | 90.00% | 90.00% | 90.00%
 | | Damaged | 90.00% | 90.00% | 90.00% | 90.00% | 90.00%
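For readers reconstructing Table 15 from the confusion-matrix figures, the per-class metrics follow the usual definitions. As a worked check, the counts below (TP = 18, FN = 2, FP = 1, TN = 19, with “intact” as the positive class) are consistent with the YOLOv8-seg + CLAHE + ViT + LR row for both views.

```python
# Worked check of the metric definitions behind Table 15, using counts consistent
# with the YOLOv8-seg + CLAHE + ViT + LR row (both views, 20 test images per class).
tp, fn, fp, tn = 18, 2, 1, 19            # "intact" treated as the positive class

accuracy    = (tp + tn) / (tp + tn + fp + fn)                           # 0.9250
precision   = tp / (tp + fp)                                            # 0.9474
sensitivity = tp / (tp + fn)                                            # 0.9000 (recall)
specificity = tn / (tn + fp)                                            # 0.9500
f1          = 2 * precision * sensitivity / (precision + sensitivity)   # 0.9231
```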
Table 16.
Root causes for misclassifications.
Cause | Breakdown |
---|
Subtle or Localized Defects | Small dents, creases, or punctures can occupy only a few pixels in a 128 × 128 or 224 × 224 crop. If a model’s receptive field or feature-extraction layers are tuned to more global textures (e.g., overall box color or printed logos), it may “miss” these tiny irregularities. |
The “intact” packages were given two light sculpt strokes (N(0.20, 0.05)) while the “damaged” ones received three deeper strokes (N(0.35, 0.05)). However, if a CNN’s filters are not sensitive to such a difference in depth, or if the lighting in the rendering hides shallow dents, even a ResNet18 or MobileNetV2 may not pick up the nuance. |
View-Specific Occlusions and Shadowing | A top-only view can hide side-facing punctures or crushed corners. Conversely, a side-only view can miss dents on the top face. In our earlier experiments, using both views simultaneously often boosted recall, whereas top-only sometimes dropped to ~50%. |
Harsh shading or low-contrast regions (especially in rendering settings with limited ray-bounces) can wash out small deformations. Without preprocessing (e.g., MSR, CLAHE), a classifier may confuse a shadow-induced dark patch for damage or simply ignore a low-contrast dent. |
Insufficient Variation in Training Data | The synthetic dataset had 200 pairs of top + side images. Even with random spatial jitter (±0.04 m, ±0.07 m) and yaw rotation (±20°), the total coverage of possible real-world angles, box materials, and crease patterns remains limited. |
Models like ViT or GPT o4-mini-high—pretrained on open-domain images—need enough in-domain examples to learn “what a dent looks like on a package.” If the network has never seen a subtle crease under a particular type of lighting, it cannot reliably generalize in some scenarios. |
Blur and Image Quality Degradation | Some rendered frames might be blurred (e.g., from DoF settings around f/16 ± 3). If the damage occupies just a few pixels, blur can obliterate it entirely. |
Large multimodal models such as GPT o4-mini-high typically expect crisp, high-resolution detail. It is possible that any motion blur, low-lighting noise, or compression artifact can derail their latent-space feature extraction, causing them to focus on color/layout instead of surface texture. |
Over-Reliance on Background or Framing Cues | A YOLO detector trained on COCO-style polygon masks might inadvertently learn “where the box usually sits” rather than “what damage looks like.” If the box center shifts unpredictably (±4 cm laterally, ±7 cm longitudinally), YOLO’s bounding-box proposals may become unstable, and it might classify an undamaged box as damaged simply because it is slightly off-center. |
Similarly, ViT’s patch embeddings can latch onto consistent background patterns (e.g., conveyor-belt texture) instead of focusing on subtle surface defects, especially if the damage patch covers only 1–2 patches in a 14 × 14 grid. |
Model Capacity vs. Overfitting | A very large model such as GPT o4-mini-high can overfit the small synthetic dataset unless heavily regularized. During fine-tuning, it may learn to memorize specific rendering artifacts, such as a consistent highlight on a dented edge, rather than generalizable damage features. |
A lightweight CNN (MobileNetV2) may not have enough capacity to capture all variations of damage under all lighting/pose combinations, leading to underfitting for more subtle defects. |