Article

AI-Enabled Sustainable Manufacturing: Intelligent Package Integrity Monitoring for Waste Reduction in Supply Chains

Department of Mechanical, Aerospace, and Industrial Engineering, The University of Texas at San Antonio, San Antonio, TX 78249, USA
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(14), 2824; https://doi.org/10.3390/electronics14142824
Submission received: 16 May 2025 / Revised: 16 June 2025 / Accepted: 4 July 2025 / Published: 14 July 2025
(This article belongs to the Special Issue Applications of Artificial Intelligence in Intelligent Manufacturing)

Abstract

Despite advances in automation, the global manufacturing sector continues to rely heavily on manual package inspection, creating bottlenecks in production and increasing labor demands. Although disruptive technologies such as big data analytics, smart sensors, and machine learning have revolutionized industrial connectivity and strategic decision-making, real-time quality control (QC) on conveyor lines remains predominantly analog. This study proposes an intelligent package integrity monitoring system that integrates waste reduction strategies with both Narrow and Generative AI approaches. Narrow AI models were deployed to detect package damage at full line speed, aiming to minimize manual intervention and reduce waste. Using a synthetically generated dataset of 200 paired top-and-side package images, we developed and evaluated 10 distinct detection pipelines combining various algorithms, image enhancements, model architectures, and data processing strategies. Several pipeline variants demonstrated high accuracy, precision, and recall, particularly those utilizing a YOLOv8 segmentation model. Notably, targeted preprocessing increased top-view MobileNetV2 accuracy from chance to 67.5%, advanced feature extractors with full enhancements achieved 77.5%, and a segmentation-based ensemble with feature extraction and binary classification reached 92.5% accuracy. These results underscore the feasibility of deploying AI-driven, real-time QC systems for sustainable and efficient manufacturing operations.

1. Introduction

The contemporary manufacturing sector stands at a critical juncture, compelled by mounting environmental pressures, evolving regulatory landscapes, and shifting consumer expectations to fundamentally rethink its operational paradigms. The pursuit of mere economic efficiency is no longer sufficient; sustainability, encompassing environmental stewardship, social responsibility, and long-term economic viability, has emerged as a non-negotiable strategic imperative. Sustainable manufacturing, formally defined by the U.S. Environmental Protection Agency as “the creation of manufactured products through economically sound processes that minimize negative environmental impacts while conserving energy and natural resources” [1], necessitates a holistic approach that addresses the entire product lifecycle, from raw-material extraction and processing to product use and end-of-life management. Sustainable production involves designing and operating manufacturing processes in an economically viable manner that conserves resources, lowers waste, and minimizes negative environmental impacts [2,3].
The environmental impact of traditional manufacturing practices is substantial. Industry is a major contributor to global greenhouse gas emissions, resource depletion, and waste generation [4]. Manufacturing processes consume vast amounts of energy, often derived from fossil fuels, and utilize significant quantities of raw materials, many of which are finite. The resulting waste streams, including material scrap, defective products, chemical effluents, packaging waste, and heat losses, contribute to landfill burdens, pollution, and the inefficient use of valuable resources. The concept of the “carbon footprint,” quantifying the total amount of greenhouse gases emitted directly or indirectly by a process, product, or organization, has become a key metric for assessing environmental impact, with manufacturing activities being a primary focus for reduction efforts [4]. Beyond the environmental consequences, this inefficiency represents a significant economic liability, often referred to as the “hidden factory,” in which substantial capacity (estimated at 20–40%) is consumed by non-value-added activities like rework, scrap handling, and the management of quality failures [5].
Within this context, waste reduction emerges as a cornerstone of sustainable manufacturing. It directly addresses both environmental concerns (reducing resource depletion and pollution) and economic objectives (lowering costs and improving efficiency). QC processes are intrinsically linked to waste generation. While the primary goal of QC is to ensure products meet specified standards and customer expectations, traditional, often reactive, approaches can inadvertently contribute to waste. Inspecting products only at the end of the production line means that defects are identified after significant resources such as materials, energy, labor, and machine time have already been invested. Such defective items represent embodied waste; they must either be scrapped, leading to material loss and disposal costs, or subjected to costly, labor- and energy-intensive rework processes, further consuming resources and potentially impacting the final product’s integrity [6,7].
Therefore, a paradigm shift in quality management is essential for achieving sustainable manufacturing goals. The focus must move from processing customer complaints regarding defects to proactively detecting them upstream and identifying the root cause behind them to be able to prevent their occurrence in the future. This aligns strongly with established methodologies like Lean Manufacturing, which fundamentally aims to eliminate waste (Muda) in all its manifestations [8]. The eight wastes commonly identified in Lean include Defects, Overproduction, Waiting, Non-utilized Talent, Transportation, Inventory, Motion, and Extra-processing (DOWNTIME) [9]. Defects are explicitly recognized as a primary form of waste.
Manual inspection remains the bottleneck in many automated manufacturing and logistics chains, leading to wasted labor and resources, in addition to customer dissatisfaction, when damaged parcels slip through. The following major contributions are described within this paper:
  • The paper introduces novel end-to-end pipelines that integrate illumination correction, denoising, super-resolution, and multi-view transformer fusion. This suite of novel computer-vision pipelines, ranging from contrast-enhanced MobileNetV2 to transformer-based ensembles and state-of-the-art YOLOv8 segmentation, is used to automatically detect and classify packages as damaged versus intact in high-volume production environments.
  • Extensive, reproducible benchmarks are provided, employing both lightweight and high-capacity architectures on the utilized dataset under realistic production constraints.
  • While previous studies have applied single-view deep classifiers or heuristic damage indicators on private datasets, ours is the first comprehensive comparison of multi-view fusion architectures, image-preprocessing pipelines, and lightweight versus heavyweight models on a purpose-built dataset of 200 paired top-and-side package images.
  • Waste-reduction potential in supply-chain operations through early damaged-package interception is demonstrated, a factor which can contribute to sustainability initiatives.
  • A comprehensive review of the most recent articles that relate to the subject is provided. Finally, Table 1 breaks down the structure of this paper.
The next section provides a comprehensive overview of how AI has evolved over time, beginning with its earliest theoretical foundations and progressing through the major breakthroughs in algorithms, hardware, and practical applications. It traces key milestones, from the conception of symbolic reasoning in the 1950s and the emergence of expert systems in the 1970s, to the advent of statistical machine learning and deep neural networks in the late 20th and early 21st centuries, highlighting the technological and methodological shifts that have driven AI forward. Following this historical narrative, the section also introduces the principal subfields of AI, types of AI, and components of AI; each of these domains is briefly described in terms of its core objectives and representative techniques and the ways in which it contributes to the broader landscape of intelligent systems. This allows the reader to gain more insights into both the historical context and the contemporary scope of AI research and development.

2. Development of AI

Artificial Intelligence (AI) denotes the ability of robots to demonstrate intelligence akin to human cognitive processes, including decision-making and problem-solving [10]. AI leverages Machine Learning (ML) and Deep Learning (DL) techniques, where models are trained on massive datasets to enable them to make autonomous and intelligent decisions. These AI-driven computers emulate human intellect by analyzing extensive amounts of data, identifying complex patterns, and understanding natural language. Cognitive computing is fundamental to AI capability, enabling knowledge processing and pattern recognition, and hence allowing AI to emulate human learning and perception without direct human involvement. Artificial intelligence empowers machines and computing systems to replicate human intellect in domains historically dependent on human cognition.
Whether through machine vision, speech recognition, or predictive analytics, AI has ushered in a new technological era, profoundly impacting industry and society [11,12]. Figure 1 shows the fundamental working principle of a simple AI system [13,14]. Figure 2 shows the traits of AI [15,16].
Generative AI comprises multiple interconnected components that dictate its functionality and adaptability across various applications. These fundamental components affect the Generative AI’s ability to process information, make decisions, and interact with human users. Figure 3 shows the general components of AI [17]. The domain of AI is evolving rapidly, particularly with the emergence of Generative AI, in contrast to Narrow AI, which is engineered to perform specific cognitive functions and is constrained by its lack of autonomous learning capabilities. Narrow AI can also be called Artificial Narrow Intelligence (ANI) or weak AI. ANI utilizes ML, Natural Language Processing (NLP), and DL via advanced forms of Neural Network (NN) algorithms to complete specified tasks. Some examples of Narrow AI include self-driving cars, which rely on Computer Vision (CV) algorithms [18], and AI virtual assistants. Artificial General Intelligence (AGI), also called general AI or strong AI, refers to a form of AI that can learn independently, think, and perform a wide range of tasks at the human level. The ultimate goal of AGI is to create machines capable of versatile, human-like intelligence, functioning as highly adaptable assistants (AI Agents) in everyday life. Generative AI is a form of AGI [19,20]. Figure 4 shows the subfield of Generative AI in relation to AI, DL, and ML.
Figure 5 shows the relationships between the different types of AI [21], and Table 2 shows their descriptions [22,23,24,25,26,27,28].
Table 3 shows a comparison between Generative AI and Narrow AI. Narrow AI and Generative AI are Capability-based AI, due to the ways in which they learn and the degree to which they can execute actions based on their knowledge [30]. Figure 6 shows the integrated subfields of Generative AI [31,32].

3. AI-Enabled Sustainability

The evolution towards Industry 4.0, characterized by the integration of Cyber–Physical Systems (CPS), the Internet of Things (IoT), cloud computing, and big data analytics, provides unprecedented opportunities to enhance these quality management efforts. The proliferation of sensors throughout the manufacturing process generates vast streams of real-time data related to machine performance, environmental conditions, material properties, and process parameters. AI emerges as a critical enabling technology within this ecosystem, offering the analytical power needed to extract meaningful insights from this data deluge [33]. Furthermore, advances in edge devices have allowed for real-time AI integration with IoT devices, creating what is currently known as Artificial Intelligence of Things (AIoT). AI algorithms can identify the complex, non-linear relationships and subtle patterns that often elude traditional statistical methods, enabling more accurate predictions, faster detection of anomalies, and more effective optimization of processes for both quality and sustainability.

3.1. Narrow AI: Enhancing Defect Management and Waste Reduction Through Data-Driven Insights

Narrow AI, often referred to as weak AI, represents the current state-of-the-art AI, characterized by systems designed and optimized for specific, well-defined tasks. Unlike the aspirational goal of AGI, which seeks to replicate broad human cognitive abilities, Narrow AI operates within circumscribed domains, demonstrating remarkable proficiency in areas such as pattern recognition, prediction, classification, and optimization based on the data it is trained on. Within the complex ecosystem of modern manufacturing, particularly in the pursuit of enhanced QC and waste reduction, Narrow AI offers a suite of powerful tools, primarily leveraging ML and DL algorithms [33,34]. These techniques are adept at processing and analyzing the high-velocity, high-volume, and high-variety data streams generated by sensors, cameras, Manufacturing Execution Systems (MES), and other I4.0 technologies, transforming raw data into actionable insights for improved defect management and resource efficiency.
Traditional QC methodologies, such as Statistical Process Control (SPC), have long served as the bedrock of manufacturing quality assurance. SPC utilizes statistical charts (e.g., X-bar and R charts, p-charts) to monitor process stability and identify deviations that may indicate quality issues. While fundamentally sound, SPC methods often require manual interpretation by trained personnel and can be challenged by the sheer scale and complexity of data in contemporary manufacturing environments. Furthermore, SPC typically focuses on detecting deviations after they occur, limiting its ability to prevent defects proactively. Narrow AI significantly enhances these traditional approaches by automating the analysis, identifying more subtle and complex patterns, and enabling a shift towards predictive quality management [33].
Predictive quality analytics: This represents a significant advancement over purely reactive detection, aiming to anticipate quality issues before they result in defective products. This involves analyzing real-time and historical data from various sources across the production line, including sensor readings (temperature, pressure, vibration, and flow rate), machine parameters (speed, torque, and energy consumption), environmental conditions, and even raw-material characteristics. ML algorithms, ranging from traditional methods like Support Vector Machines (SVM), Decision Trees (DT), K-Nearest Neighbors (KNN), and ensemble models (e.g., Random Forests—RF, Gradient Boosting—GB) to more advanced DL architectures like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks (particularly suited for time-series data), are employed to build predictive models [33]. These models learn the complex correlations between process variables and final product quality outcomes. They can then predict the likelihood of a defect occurring based on current process conditions, identify specific parameters drifting towards undesirable states, or even predict the Remaining Useful Life (RUL) of equipment, linking potential failures to quality degradation (Predictive Maintenance—PdM). For instance, an AI model might predict the probability of dimensional inaccuracies in a machined part based on tool wear sensor data and spindle vibrations, allowing for preemptive tool changes. Similarly, in chemical processing, AI can predict batch quality based on real-time sensor readings of temperature, pressure, and reactant concentrations, enabling adjustments to prevent off-spec products. This predictive capability is paramount for waste reduction, as it allows manufacturers to intervene proactively, adjusting parameters, performing maintenance, or modifying schedules, all to prevent defects from occurring in the first place, thereby minimizing scrap, rework, and the associated resource consumption [35].

3.2. Generative AI: A Paradigm Shift Towards Proactive Quality Design and Waste Minimization

While Narrow AI excels in analyzing existing data to identify patterns and predict outcomes within established parameters, Generative AI introduces a fundamentally different capability: the creation of novel data, designs, or solutions that mimic the characteristics of the data it was trained on [36,37]. This generative power, embodied in models like Generative Adversarial Networks (GANs) [38,39,40], Variational Autoencoders (VAEs) [41,42,43], and large-scale Transformer architectures, represents a paradigm shift with profound implications for manufacturing QC and sustainable practices [44]. Instead of solely reacting to or predicting defects based on past occurrences, Generative AI enables a more proactive approach, facilitating the design of products and processes that are inherently less prone to defects and more resource-efficient, thereby minimizing waste at its source.
Generative AI addresses several key limitations inherent in traditional and Narrow AI-based QC approaches, particularly concerning data availability and the exploration of novel solutions:
  • Synthetic data generation for robust defect detection: A significant bottleneck in training effective Narrow AI models (especially DL models for CV applications) is the requirement for large, diverse, and accurately labeled datasets. In manufacturing, acquiring sufficient examples of all possible defect types, particularly rare or novel defects, can be challenging, costly, and time-consuming. Generative AI, particularly GANs, offers a powerful solution by learning the underlying distribution of normal and defective-product data (e.g., images, sensor signals) and generating highly realistic synthetic examples [44]. These synthetic data points can augment real-world datasets, creating larger, more balanced, and more diverse training sets. This allows for the development of more robust and accurate defect detection models capable of identifying even infrequent defect types, leading to fewer faulty products escaping detection, and ultimately reducing scrap and rework waste. Research indicates that augmenting training data with GAN-generated synthetic images can significantly improve the performance of defect classification models, enhancing their generalization capabilities [44].
  • Enhanced anomaly detection: Generative AI models, particularly VAEs and GANs, can be trained primarily on data representing normal operating conditions or non-defective products. By learning the intricate patterns and characteristics of “normality,” these models become highly adept at identifying deviations or anomalies that may signify emerging quality issues or process drifts, even without prior exposure to specific defect types [44]. When presented with new data (e.g., a sensor reading, a product image), the model attempts to reconstruct or generate it based on its learned understanding of normality. Data points that are poorly reconstructed or identified as having low probability under the learned distribution are flagged as anomalies. This approach is particularly valuable for detecting the novel or unforeseen defects and subtle process deviations that might precede major quality failures, enabling early intervention and preventing the production of large batches of defective goods, thus conserving resources.
  • Generative design for quality and sustainability: Perhaps the most transformative potential of Generative AI lies in its application to the design phases of both the product and manufacturing processes. Generative design tools, often leveraging topology optimization and AI algorithms, can explore vast design spaces based on specified constraints (e.g., material properties, performance requirements, manufacturing limitations, cost targets, and sustainability metrics like minimized material usage or embodied energy). These tools can propose novel product geometries or process parameters that inherently enhance quality, reduce material consumption, minimize energy requirements, and improve manufacturability [45]. For example, Generative AI could design a lightweight yet strong component that requires significantly less raw material, reducing both cost and environmental impact. It could also simultaneously simulate and optimize manufacturing process parameters (e.g., injection molding settings, machining toolpaths) to minimize cycle times, energy consumption, and the likelihood of defect formation. By embedding quality and sustainability considerations directly into the design phase, Generative AI facilitates a proactive approach to waste minimization, preventing waste generation at the source rather than managing it downstream.
  • Process simulation and optimization: Generative AI can create sophisticated digital twins or simulations of manufacturing processes. These simulations can be used to test the impacts of different parameter settings, material variations, or operational strategies on product quality and resource efficiency without disrupting physical production or generating physical waste. Manufacturers can use these simulations to identify optimal operating windows, predict the effects of changes before implementation, and train operators in virtual environments. For instance, Generative AI could simulate different cooling profiles in a casting process to determine the optimal strategy for minimizing porosity defects and energy use [46]. This simulation capability allows for rapid experimentation and optimization, accelerating the development of manufacturing processes that are more robust and sustainable [47,48]. Figure 7 shows the impact of Generative AI on manufacturing performance metrics, based on data from [49].

3.3. Different Forms of Generative AI

The most prevalent application of Generative AI in defect detection is data augmentation, addressing the common challenge of limited defect samples in industrial datasets [50]. Manufacturing processes are typically designed to minimize defects, resulting in highly imbalanced datasets in which normal samples significantly outnumber defective ones. This imbalance can severely impact the performance of supervised learning models, leading to poor generalization and high false negative rates. Generative models, particularly Generative Adversarial Networks (GANs), enable the creation of synthetic defect images that expand and balance training datasets. Liu et al. demonstrated that augmenting a steel surface defect dataset with GAN-generated samples increased detection accuracy from 83% to 91%, with particularly significant improvements for rare defect types [51]. The synthetic samples enhanced the model’s ability to recognize defect variations that were underrepresented in the original dataset. Beyond addressing class imbalance, generative data augmentation also improves model robustness by introducing controlled variations in defect appearance, size, orientation, and context [52]. This diversity helps detection models generalize better with respect to new defects and varying inspection conditions, a critical requirement for industrial deployment.
Incorporating AI-powered image generation into sustainable manufacturing practices has the potential to transform conventional approaches, paving the way for more innovative, adaptable, and resource-efficient production systems. Yet, this integration must be executed with care to ensure that technology bolsters, rather than undermines, the foundational tenets of sustainable manufacturing. At its core, sustainable manufacturing emphasizes minimization of resource consumption and environmental impact, maximization of customer value, and a relentless drive toward continuous improvement. Consequently, the deployment of generative image AI must be explicitly aligned with these priorities, whether by curtailing material and energy waste in design, elevating product quality to reduce rework, or streamlining operations to lower the carbon footprint. Establishing well-defined targets for the role of AI-driven image synthesis within the sustainability framework is crucial to guarantee that its adoption directly supports the overarching goals of waste reduction, efficiency gains, and eco-responsibility.

Achieving 3D Synthesis via Blender

This study asserts that Blender, an open-source 3D production tool, functions as a form of controlled Generative AI owing to its procedural generation features and its Python Application Programming Interface (API) [53]. This viewpoint broadens the conventional definition of Generative AI to encompass not only neural methodologies but also rule-based procedural systems that demonstrate the essential generative characteristics: the capacity to generate original content from constrained input parameters, to systematically investigate design spaces, and to yield emergent complexity from fundamental rules.
Blender, sometimes undervalued, is a versatile open-source 3D production suite that, while not a Generative AI model itself, serves as a crucial platform for the incorporation of Generative AI techniques. Its robust node-based architecture and scripting capabilities enable procedural content generation, permitting users to algorithmically create complex models, textures, and animations.
Blender, in conjunction with Narrow AI, can be categorized as a form of Generative AI, yet it possesses distinct characteristics. Generative AI denotes artificial intelligence systems that produce novel content, like text, photos, videos, or 3D models, derived from acquired patterns and training data. Blender, a 3D graphics and animation software, might in fact use limited AI to automate the production of procedurally generated content, in alignment with Generative AI principles. The Python API of Blender enables algorithmic content creation, including the dynamic synthesis of models, textures, and environments based on established criteria.
This method replicates the mechanism via which Generative AI models generate synthetic data. Procedural generation is widely utilized in AI-based 3D modeling, synthetic datasets for machine learning, and automated design optimization. Generative AI often requires substantial datasets for model training. By employing Python scripting, Blender can generate synthetic images, videos, and depth maps that serve as training resources for computer vision models, autonomous systems, and AI-driven quality control applications.
Blender can generate precisely controlled datasets that exhibit fluctuations in illumination, angles, and object orientation, a task challenging to accomplish through real-world data collection. Python scripts in Blender can enable the development of dynamic, AI-generated scenarios for robotics, gaming, and industrial automation when employed alongside deep learning frameworks such as TensorFlow or PyTorch.
Blender 4.0 enables the training of AI models by producing diverse datasets of 3D objects with precise ground-truth annotations. This aligns with the methodologies utilized by GANs or diffusion models in image and video synthesis. Blender does not inherently “learn” from data the way Generative AI models do; instead, it generates outputs based on predefined rules and programming. Unlike models such as DALL·E, Stable Diffusion, or MidJourney, which generate images from text prompts, Blender creates images using parametric algorithms and procedural synthesis (see Table 7).
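For illustration, the short script below is a minimal sketch of procedural content generation with Blender’s standard bpy Python module; the object size, output path, and jitter values are hypothetical, and it is not the exact pipeline used for the dataset described later in this paper.

```python
# Minimal illustrative sketch of procedural generation with Blender's Python API (bpy).
# Parameter values and the output path are hypothetical. Run headless, e.g.:
#   blender --background --python this_script.py
import math
import random

import bpy

# Create a simple box standing in for a package (size and location are illustrative).
bpy.ops.mesh.primitive_cube_add(size=0.3, location=(0.0, 0.0, 0.15))
box = bpy.context.active_object

# Randomize the pose to mimic conveyor-belt placement variability.
box.location.x += random.uniform(-0.04, 0.04)                 # lateral jitter
box.location.y += random.uniform(-0.07, 0.07)                 # longitudinal jitter
box.rotation_euler.z = math.radians(random.randint(-20, 20))  # yaw jitter

# Render the active camera view to a label-encoded file.
label = "damaged" if random.random() < 0.4 else "intact"
scene = bpy.context.scene
scene.render.image_settings.file_format = "JPEG"
scene.render.filepath = f"/tmp/{label}_{random.randint(0, 99999):05d}.jpg"
bpy.ops.render.render(write_still=True)
```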

3.4. Hybrid AI Systems

Recent studies and industry surveys indicate a distinct transition in AI performance and application as firms progressively embrace hybrid systems that integrate classical machine learning with Generative AI.
AI, in its diverse manifestations, is increasingly serving as a vital facilitator of sustainable manufacturing methodologies. Narrow AI specializes in the optimization of particular tasks, resulting in significant environmental advantages. For example, machine learning algorithms can forecast when machinery requires maintenance, averting malfunctions and prolonging equipment longevity, and computer vision systems may detect faults early in the production process, reducing waste and preserving essential materials. These specific applications of AI enhance system efficiency and reduce waste [54,55].

4. Dataset

This Blender-generated dataset from Kaggle contains images of packages created within a virtual industrial production environment (see Figure 8 for a sample of the images) [56]. The images were developed using procedural generation techniques within Blender, where a coded algorithm dynamically created each image. This method introduces variability in lighting conditions, motion blur, and object orientation, ensuring that each image reflects a diverse set of real-world production scenarios. The dataset is structured and balanced, with images categorized into two distinct perspectives, namely, the top and side views of both the “damaged” and “intact” classes of packages. The dataset was divided into a 50/30/20 (training/validation/test) split to facilitate model training and evaluation. While this dataset offers a structured approach for industrial AI applications, it presents certain limitations, mainly due to the limited number of available images (200 pairs) and their relatively low resolution. These constraints could impact the generalizability of AI models trained on this data, necessitating additional preprocessing or augmentation techniques to improve performance in real-world applications [57].
Training a classifier model on an extensive, parametric synthetic dataset effectively produces a domain-specific pretrained model. Similar to ImageNet transfer learning, these weights can be refined on a relatively small real-world dataset, resulting in accelerated convergence, enhanced initial accuracy, and diminished annotation effort in practical quality-control systems. Moreover, a significant body of research has shown that pretraining models on extensive synthetic datasets produces feature representations that transfer efficiently to real-world applications. Peng et al. demonstrated that deep object detectors pretrained on synthetic CAD-rendered images, without prior real-image pretraining, substantially exceed baseline performance in few-shot identification on PASCAL VOC and, under domain shift, on the Office benchmark [58]. Tobin et al. proposed domain randomization, training vision networks exclusively on diverse sets of simulated images, resulting in precise zero-shot transfer for real-world object localization and robotic grasping [59]. Tremblay et al. further showed that deep networks pretrained on randomized synthetic data and subsequently fine-tuned on a limited real dataset outperform models trained only on actual data for real-world object identification [60].

4.1. Pipeline Design of the Dataset

The goal of this dataset was to simulate a factory plant that produces pharmaceutical products. The packaging machine occasionally produces damaged packages that are dented. The packages move on a conveyor belt along an inspection line where two cameras are mounted (top and side). An API script was written to automate the generation of labeled package renderings by overlaying serial- and batch-number text onto a base template and then invoking Blender in background mode to produce either a “damaged” or “intact” 3D package model.
The script streamlines package image creation from end to end. It first sets up separate “damaged” and “intact” folders and then generates a unique serial number and batch code for each package, imprinting them onto a wrapper template. A weighted coin flip (40% damaged, 60% intact) decides whether the package will receive surface deformations or remain pristine. Finally, Blender is launched behind the scenes to render both top and side views, saving each fully labeled image into the appropriate directory.
The script procedurally “sculpts” a 3D box model to simulate either a damaged or intact package, applies random location and rotation perturbations, moves the animation frame to make the conveyor-belt motion appear realistic, and then renders both top and side camera views to JPEG files, named by serial number, which are categorized into the “damaged” or “intact” folders.
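A minimal driver-script sketch of this workflow is given below. It assumes a hypothetical scene file (package_line.blend) and render script (render_package.py), and it only illustrates the folder setup, identifier generation, 40/60 weighted label assignment, and headless Blender invocation described above, not the authors’ exact implementation.

```python
# Illustrative driver sketch; file names, paths, and ID formats are hypothetical.
import random
import string
import subprocess
from pathlib import Path

OUT_DIRS = {label: Path("dataset") / label for label in ("damaged", "intact")}
for d in OUT_DIRS.values():
    d.mkdir(parents=True, exist_ok=True)        # separate "damaged" and "intact" folders

def make_ids():
    """Generate a unique serial number and batch code for one package."""
    serial = "".join(random.choices(string.digits, k=8))
    batch = "B" + "".join(random.choices(string.ascii_uppercase + string.digits, k=5))
    return serial, batch

for _ in range(200):                             # 200 paired top/side renderings
    serial, batch = make_ids()
    # Weighted coin flip: 40% damaged, 60% intact.
    label = "damaged" if random.random() < 0.40 else "intact"
    # Launch Blender headless; arguments after "--" are passed through to the render script.
    subprocess.run(
        ["blender", "--background", "package_line.blend",
         "--python", "render_package.py", "--",
         label, serial, batch, str(OUT_DIRS[label])],
        check=True,
    )
```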
This Blender pipeline, as specified, introduces variability along four independent axes to ensure that each rendered package, whether labeled “damaged” or “intact,” is unique. Surface deformation is achieved through three random cubic Bézier sculpt strokes for damaged packages, each drawn within ±5 cm in X and ±2.5 cm in the Y range, with brush strengths sampled from a normal distribution 𝒩(0.35, 0.05); intact packages receive two lighter strokes in the same spatial range, with strengths from 𝒩(0.20, 0.05), introducing subtle variations.
The Stroke strength parameter in the script controls the magnitude of each brush stroke’s deformation (see Figure 9) on the package’s surface. In other words, it determines how “deep” or “pronounced” each dent or crease will be. This parameter was modeled as a normal distribution primarily to reflect realistic variability and to maintain a controlled separation between “damaged” and “intact” classes. In real-world scenarios, cardboard box damage, whether from impacts, crushing, or seam stress, varies continuously in severity, and using a normal distribution centered on a moderate strength captures the prevalence of average-depth dents while limiting extreme deformations. The normal distribution’s central tendency guarantees consistency by clustering strokes around their mean strength; its tunable spread (σ = 0.05) limits unrealistic extremes, and its smooth continuous sampling yields subtle gradations in damage severity rather than discrete depth steps.
Visually, higher strength values (around 0.40) create pronounced, sharp deformations resembling creases or gouges, whereas lower values (around 0.15) generate subtle ripples or embossed textures, preserving the statistical distinction between the two package conditions and resulting in a realistic diversity of damage patterns. Table 4 summarizes all of these parameters.
Sources of randomness affecting the synthetic renderings include stroke-path geometry (see Figure 10) and the number of sculpt strokes applied. Stroke-path geometry varies because the Bézier control points, consisting of a start point, two control points, and an endpoint, are randomly sampled within defined X–Y bounds, resulting in a wide variety of crease shapes, from long and sweeping to short and sharp, thereby enhancing the diversity of dent geometries encountered by the AI model. Additionally, the number of sculpt strokes introduces variability, with “damaged” packages consistently receiving three strokes and “intact” packages receiving two, clearly distinguishing classes while also presenting shape-count variations (sometimes showing two dents, other times three), and thereby ultimately improving the model’s robustness to different patterns of damage clustering (see Figure 11).
The curves drawn in Figure 10 illustrate how colored curves with varying start, end, and control points generate diverse crease shapes on the package surface. Some paths form wide swoops, sweeping broadly across the box face, while others create tight bends, looping or arcing sharply within a confined area. The dashed rectangle indicates the allowable range for endpoints, though control points may extend up to 1.5× beyond this boundary, resulting in occasional overhangs that further enrich the variety of crease geometries.
The illustrations provided in Figure 11 show the overlaid stroke paths, with intact strokes in green (thin lines) and damaged strokes in red (thick lines), highlighting the geometric differences between the two classes within the same bounding region. Damaged strokes, represented in red with a linewidth of 3, appear denser and more pronounced, spanning the central area with bold and distinct curves. In contrast, intact strokes, shown in green with a linewidth of 1, are lighter, sparser, and characterized by trajectories which are more open.
Spatial variability includes random location jitter (±0.04 m in X, ±0.07 m in Y), simulating lateral and longitudinal conveyor misplacements, and yaw-rotation jitter randomly drawn from integers between −20° and +20° (see Figure 12). The lateral offset (X-offset) simulates the fact that, on a real belt, the box may sit a little to the left or right of the camera’s optical axis by up to 4 cm. The longitudinal offset (Y-offset) models variations in how far along the belt (toward or away from the camera) the box may appear, up to 7 cm closer or farther. The uniform choice ensures that every offset in that range is equally likely, so the model cannot “learn” a bias toward perfectly centered packages; this is because, in a physical inspection line, packages do not arrive in exactly the same spot every time. By jittering their positions, the classifier is trained to be robust to small framing differences, allowing it to focus on surface defects rather than absolute pixel location. Table 5 summarizes all of these parameters.
Background motion is randomized by selecting the conveyor-belt texture frame from a discrete uniform distribution over the integers 1 to 8 (see Figure 13). Each frame shifts the belt texture slightly, so that the stripes, seams, and lighting on the belt appear in a different position relative to the package, and small changes in reflections occur, without altering the package itself. This helps the ML classifier learn to focus on the box, not on a static, repeated backdrop.
This disperses any correlation between package damage and a fixed background pattern, improving model robustness to real-world conveyor-belt motion. Concurrently, optical variation is introduced through Depth-of-Field (DOF), with camera f-stops sampled from 𝒩(16, 3) for both the “Top” and “Side” views (see Figure 14), adjusting foreground sharpness and background blur (see Figure 15).
The f-stop (also called the f-number, denoted N) directly controls the aperture diameter of the camera lens and thus the DOF (see Figure 16). Increasing N increases the DOF approximately linearly (more of the scene is in focus), whereas decreasing N produces a shallower DOF with more background blur.
Collectively, each package rendering varies along eight stochastic variables (see Figure 17): stroke strength (damaged: 0.35 ± 0.05; intact: 0.20 ± 0.05); stroke-path geometry (random Bézier control-point placements); stroke count (three for damaged, two for intact); spatial pose jitter (X in [−4 cm, +4 cm], Y in [−7 cm, +7 cm]); yaw rotation ([−20°, +20°]); belt-frame selection (uniform integer from 1 to 8); and depth-of-field (f-stop drawn from 𝒩(16, 3) independently for the top and side cameras). Together, these factors combine to produce a richly diverse, highly realistic synthetic dataset ideal for training robust defect detectors.
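The following sketch illustrates how these eight stochastic variables could be sampled per rendering; the distribution parameters are taken from the text, while the function name, dictionary keys, and the simplified Bézier-point bounds are illustrative.

```python
# Sketch of per-rendering parameter sampling (distribution values from the text).
import random

def sample_render_params(damaged: bool) -> dict:
    n_strokes = 3 if damaged else 2
    mean_strength = 0.35 if damaged else 0.20
    return {
        # Sculpt strokes: strength ~ N(mean, 0.05); Bezier points within +/-5 cm (X), +/-2.5 cm (Y)
        # (simplified: all four Bezier points drawn from the same bounds).
        "stroke_strengths": [random.gauss(mean_strength, 0.05) for _ in range(n_strokes)],
        "stroke_paths": [
            [(random.uniform(-0.05, 0.05), random.uniform(-0.025, 0.025)) for _ in range(4)]
            for _ in range(n_strokes)
        ],
        # Spatial pose jitter and yaw rotation.
        "x_offset_m": random.uniform(-0.04, 0.04),
        "y_offset_m": random.uniform(-0.07, 0.07),
        "yaw_deg": random.randint(-20, 20),
        # Background motion: conveyor-belt texture frame 1-8 (discrete uniform).
        "belt_frame": random.randint(1, 8),
        # Optical variation: depth-of-field f-stops ~ N(16, 3), independent per camera.
        "fstop_top": random.gauss(16, 3),
        "fstop_side": random.gauss(16, 3),
    }
```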
Since five of the eight pipeline variables, namely, the X-offset, Y-offset, f-stop, stroke strengths, and Bézier control points, are drawn from continuous distributions, the theoretical number of unique renderings is uncountably infinite. Yet, even if those continuous parameters were held fixed, the three purely discrete axes, 41 integer yaw angles (−20° to +20°), eight belt-frame choices, and two stroke-count options (two for intact, three for damaged), combine to create 41 × 8 × 2 = 656 distinct discrete configurations. If the continuous variables were instead discretized by binning each continuous parameter into N levels (e.g., rounding the X-offset to 100 possible positions, and likewise for the Y-offset, f-stop, damaged strength, and intact strength), the total number of combinations becomes 656 × N⁵; for N = 100, this yields roughly 6.6 × 10¹² rendering possibilities, i.e., in the trillions. Regardless of the N value used for discretization, the pipeline produces astronomical diversity in the generated dataset (see Figure 18).
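For reference, the configuration counts quoted above can be reproduced with a few lines of arithmetic:

```python
# Reproducing the configuration counts quoted above.
yaw_angles = 41       # integer yaw angles from -20 to +20 degrees
belt_frames = 8       # conveyor-belt texture frames
stroke_counts = 2     # two strokes (intact) or three (damaged)
discrete_configs = yaw_angles * belt_frames * stroke_counts
print(discrete_configs)                  # 656

N = 100               # bins per continuous parameter (X/Y offsets, f-stop, two stroke strengths)
print(discrete_configs * N ** 5)         # 6,560,000,000,000 -> on the order of trillions
```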

4.2. General Constraints and Limitations of the Synthetic Images’ Datasets

Blender is advanced software with a significant learning curve, especially for those without prior experience in 3D modeling and animation. Mastering its interface and Python API requires significant time and effort, posing a possible obstacle for novice users in computer graphics or scripting. The intricacy of its capabilities and features may be daunting for novices, since considerable skill and acquaintance are necessary to maneuver, model, and render within the software effectively. Rendering realistic 3D scenes in Blender is computationally demanding, frequently requiring high-performance hardware for optimal outcomes. The creation of extensive datasets or high-resolution images imposes considerable demands on Central Processing Unit (CPU) and Graphics Processing Unit (GPU) resources, resulting in prolonged rendering delays if the hardware is insufficiently equipped to perform such tasks. This may be a substantial barrier, especially in industrial applications where efficiency and scalability are paramount. Moreover, although Blender provides comprehensive functionalities for producing photorealistic 3D models, the precision of the generated package models is significantly influenced by the creator’s proficiency and the quality of the reference materials utilized. Attaining the exact measurements, proportions, and surface features necessitates scrupulous attention to detail, guaranteeing that the produced images faithfully depict real-world manufacturing contexts. Any discrepancy in modeling precision can affect the validity and efficacy of the synthetic dataset in training AI-driven QC systems. In addition, incorporating Blender into the data-generating pipeline within dynamic production environments introduces other obstacles, especially when production plans are altered. Seamless integration necessitates meticulous coordination with current technologies, guaranteeing compatibility, data consistency, and process automation throughout diverse pipeline stages. Effectively managing these complications requires a systematic strategy to ensure efficiency, avoid data inconsistencies, and integrate Blender-generated datasets with the wider manufacturing and inspection framework [61]. Figure 19 illustrates the different variations in the quality of the renderings of the packages and their backgrounds.
Though this is not the objective behind the paper, it is worth mentioning that some less sophisticated postprocessing effects can also “close the reality gap” (see Figure 20). These effects, in sequence, are described as follows: first, a barrel distortion with coefficient of 0.3 bows straight lines outward toward the edges, mimicking real lens curvature; next, a gentle Gaussian vignette (σ = 0.75) darkens the corners to draw focus toward the center; a chromatic aberration can then be simulated by shifting the red channel 2 pixels to the left and the blue channel 2 pixels to the right, creating subtle color fringing; a layer of low-amplitude Gaussian sensor noise (σ ≈ 0.01) can break the overly clean appearance with a realistic digital grain; finally, color grading boosts reds by 5% in the R channel and applies a mild gamma correction (γ = 1/1.2) to impart a warmer, film-like tone.
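One possible implementation of this post-processing chain, using OpenCV and NumPy with the parameter values quoted above, is sketched below; the exact operators used to produce Figure 20 are not specified in the text, so the details here (e.g., interpreting the vignette σ in normalized radius units) are one reasonable interpretation.

```python
# Hedged sketch of the "reality gap" post-processing chain (OpenCV + NumPy, BGR channel order).
import cv2
import numpy as np

def close_reality_gap(img_bgr: np.ndarray) -> np.ndarray:
    h, w = img_bgr.shape[:2]
    img = img_bgr.astype(np.float32) / 255.0

    # 1. Barrel distortion (coefficient 0.3) via a radial remap.
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float32)
    xn, yn = (xx - w / 2) / (w / 2), (yy - h / 2) / (h / 2)   # normalized coordinates
    r2 = xn ** 2 + yn ** 2
    factor = 1 + 0.3 * r2
    map_x = (xn * factor) * (w / 2) + w / 2
    map_y = (yn * factor) * (h / 2) + h / 2
    img = cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR, borderMode=cv2.BORDER_REFLECT)

    # 2. Gentle Gaussian vignette (sigma = 0.75 in normalized radius): darken the corners.
    vignette = np.exp(-r2 / (2 * 0.75 ** 2))
    img *= vignette[..., None]

    # 3. Chromatic aberration: red 2 px left, blue 2 px right (channels 2 and 0 in BGR).
    img[..., 2] = np.roll(img[..., 2], -2, axis=1)
    img[..., 0] = np.roll(img[..., 0], 2, axis=1)

    # 4. Low-amplitude Gaussian sensor noise (sigma ~ 0.01).
    img += np.random.normal(0.0, 0.01, img.shape).astype(np.float32)

    # 5. Color grading: +5% red and mild gamma correction (gamma = 1/1.2) for a warmer tone.
    img[..., 2] *= 1.05
    img = np.clip(img, 0.0, 1.0) ** (1 / 1.2)

    return (np.clip(img, 0.0, 1.0) * 255).astype(np.uint8)
```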
Alternatively, Figure 21 shows generative renderings of the packages, each depicting the same rectangular box on a conveyor belt. They combine the features of 256–512 samples per pixel for low noise with an OptiX denoiser to eliminate residual speckles, Filmic Log tone mapping over an HDRI environment augmented by area lights for soft shadows, PBR materials with normal maps, and depth-of-field for natural focus falloff. Some images also employ full path-traced global illumination (over 1024 spp) for true-to-life lighting and soft color bleeding; ACES Filmic grading to preserve detail in both shadows and highlights; and a suite of photographic imperfections, such as motion blur, chromatic aberration, bloom, and fine film grain, while volumetric dust and detailed normal/displacement maps further enhance material realism. These combined settings deliver cinematic, photorealistic visuals ideally suited for production and presentation.
Higher rendering quality drives up the costs for computation, time, and storage (see Figure 22); each adjustment in resolution, samples per pixel, ray bounces, shader complexity, or post-processing multiplies resource needs per frame. Doubling the sample count roughly doubles the rendering time; adding volumetrics or displacement maps further raises per-ray costs. As a result, longer rendering times translate into more GPU/CPU hours (and higher power or cloud-rental fees), while high-resolution textures and deep G-buffer passes demand more memory, and multi-channel ultra-high-definition outputs inflate disk usage and network bandwidth [62,63]. Choosing the right quality–cost balance depends on the project’s goals. Real-time applications prioritize frame-rate consistency with lower resolutions, aggressive denoising, and baked lighting. Offline workflows tolerate hours-per-frame for photorealism with ultra-high sampling, deep indirect lighting, and full post-pipelines. Synthetic dataset generation finds a middle ground by using moderate samples and selective effects (e.g., depth-of-field, vignetting) to achieve realism within batch-rendering time constraints. The optimal “sweet spot” emerges by profiling test renderings, defining the minimum acceptable quality, and scaling hardware or cloud resources to meet deadlines efficiently [64,65,66]. Table 6 summarizes some of the applications found in the literature regarding the use of synthetic image datasets in different fields, including applications related to manufacturing. Table 7 shows the differences between datasets generated using a diffusion model (with limited user control via prompts) and an approach such as Blender, with granular user control.
Table 6. Examples of synthesized images in different fields.
Application | Description of the Utilized Synthetic Image Dataset | References
Medical applications related to microscopy, understanding the behavior of the human eyes, and robotic surgery | Synthetic images of gaze behavior in the human eyes, synthetic images of biological scenery as if it had been captured by a microscope, synthetic images of complex biological tissue, and synthetic endoscopic images for robotic surgery | [67,68,69,70]
Safety applications related to automated quality inspection and underwater monitoring systems | Synthetic images of scaffolding parts and synthetic images of underwater pipes | [71,72]
Applications related to enhancing the visual detection and recognition capabilities of robotics | Synthetic images representing random meshes of plants, synthetic images consisting of a small-scale task-specific grasp dataset, synthetic images for food segmentation, and synthetic images for aerial recognition systems (drones) | [73,74,75,76,77,78]
Applications related to operations | Synthetic images of packages used in industrial QC and synthetic images related to shipbuilding | [79,80,81]
Applications related to different areas of computer vision | Synthetic images rendering and representing photorealistic environments for computer vision and pattern recognition applications | [82,83,84,85,86,87,88,89,90,91]
Infrastructure applications related to the maintenance of railroads and power transmission lines (PTL) | Synthetic images of digital models of railway ballast comprising a range of geometry characteristics, and synthetic images for fitting recognition | [92,93]
Applications related to autonomous driving | Synthetic images of real-world urban scenic environments | [94,95,96,97,98]
Applications related to gaps in the literature | Addressing the gaps in the literature associated with synthetic images in different categories | [99,100,101,102,103,104]
Table 7. Diffusion models vs. Blender–Python.
Aspect | Diffusion Models | Blender–Python Approach
Primary Strength | Creative ideation and concept exploration | Precise control and reproducibility
Specification Needs | Minimal (high-level prompts suffice to generate diverse outputs) | Detailed scripting of physics, lighting, and geometry
Realism Level | Typically approximate; suitable when photorealism can be relaxed | Physics-based, accurate rendering; ideal when exact physical fidelity is required
Training Data Requirements | Perform best when trained on large, diverse datasets | Can generate synthetic training data from first principles, removing dependence on large preexisting datasets
Parameter Exploration | Limited ability to systematically vary individual parameters (black-box generation) | Fully systematic exploration of parameter spaces (e.g., geometry, lighting, material properties, camera angles, etc.)
Reproducibility and Repeatability | Outputs can vary unpredictably for the same prompt; exact reproducibility is challenging | Deterministic or controlled randomness; scene setup and output remain fully reproducible given identical script parameters
Physics-Based Simulation | Not inherently physics-based; approximates style and structure through learned latent representations | Native support for physics simulation (collision, lighting, and material interactions), enabling highly realistic scenarios
Use Case Fit | Brainstorming, rapid prototyping, artistic concept generation, and any domain where “good-enough” realism is enough | Engineering validation, scientific simulation, digital twins, and any domain demanding exact, measurable output for downstream tasks
Implementation Complexity | Lower initial complexity; requires a pretrained diffusion network and high-level prompts | Higher complexity; requires knowledge of 3D modeling, rendering scripts, and physics engines
Control Over Fine Details | Limited; indirect editing via prompt variations and latent-space manipulations | Granular control over every scene element (e.g., mesh topology, light falloff, texture maps, and camera calibration)
Speed and Resource Considerations | Often faster for one-off concept images but can require substantial GPU memory for the generative network | Slower per-frame rendering (especially at high quality), but computation can be distributed or parallelized in rendering farms
Ideal Domains | Creative industries (art, design, and entertainment), rapid scenario exploration, illustrations, concept art, and ideation | Manufacturing simulation, robotics training data, autonomous-vehicle sensor data generation, and any domain requiring precise data

4.3. Evaluating the Quality of the Dataset

The Fréchet Inception Distance (FID) and the Structural Similarity Index Measure (SSIM) were implemented to assess the quality of the dataset [105,106]. SSIM measures perceptual similarity based on luminance, contrast, and structure, while FID evaluates distributional differences using deep features extracted from an Inception-v3 network.
SSIM operates at the pixel level, comparing structural information between image pairs with values ranging from −1 to 1, where higher scores indicate greater structural similarity. It directly measures perceptual quality as humans perceive it through luminance patterns, contrast variations, and structural elements. Inception-based FID leverages deep learning features extracted from a pre-trained Inception-v3 network, comparing statistical distributions of 2048-dimensional feature vectors between image sets. Lower FID scores indicate feature distributions that are more similar, with the metric being particularly sensitive to semantic and textural differences that align with human visual perception. Table 8 summarizes the results obtained on this dataset for these metrics; this is followed by a detailed discussion of the observed values.
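The sketch below outlines one way these two metrics can be computed, assuming the scikit-image SSIM implementation and the torchmetrics FID implementation; the image shapes, pairing scheme, and choice of reference versus comparison sets are illustrative rather than a description of the exact evaluation script.

```python
# Hedged sketch of computing SSIM (scikit-image) and FID (torchmetrics) on image sets.
import numpy as np
import torch
from skimage.metrics import structural_similarity as ssim
from torchmetrics.image.fid import FrechetInceptionDistance

def mean_ssim(pairs):
    """Average SSIM over (image_a, image_b) pairs of equally sized uint8 RGB arrays."""
    scores = [ssim(a, b, channel_axis=-1) for a, b in pairs]
    return float(np.mean(scores))

def fid_score(real_images, generated_images):
    """FID between two sets of uint8 RGB images shaped (N, H, W, 3)."""
    fid = FrechetInceptionDistance(feature=2048)       # 2048-dim Inception-v3 features
    real = torch.from_numpy(real_images).permute(0, 3, 1, 2)   # -> (N, 3, H, W), uint8
    fake = torch.from_numpy(generated_images).permute(0, 3, 1, 2)
    fid.update(real, real=True)
    fid.update(fake, real=False)
    return float(fid.compute())
```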

4.4. Synopsis

Creating images with Blender involves a multi-step process in which you essentially build a virtual world from scratch. You start by crafting 3D models, either designing them within Blender or importing them from other sources. Then, you give these models realistic appearances by applying materials that define how they look and feel, from their color and texture to how shiny or rough they are. Next, you set up lighting to illuminate the scene, creating shadows and highlights that add depth and realism. You then position a virtual camera to capture the scene from the perfect angle, and finally, you render to calculate what the final image should look like. Blender offers different rendering engines, such as Cycles, which create incredibly realistic images by simulating how light behaves in the real world, but the process can be slow. Eevee is another option; it is much faster, but it sacrifices some of the realism.
Despite its power, Blender has some limitations when it comes to generating images. It requires a lot of time and skill to model, texture, and light a scene in a way that looks convincing. Generating large datasets can also be computationally expensive, as high-quality renderings can take a long time to produce.
The next section discusses in detail all of the models, algorithms, enhancements, preprocessing, and configurations that were implemented to produce the ten different approaches that were deployed on this dataset.

5. Methodology

Different approaches have been utilized and compared against each other in terms of their performance. Table 9 breaks down the processes that were applied to the dataset and the algorithms deployed to perform the task of detecting damaged packages. A package is considered damaged when either the top or the side view shows damage. All approaches have been tested with this rule. However, in some approaches (as seen in Table 9), an alternative in which only top-view images were used to detect damaged packages was also investigated. Different image enhancements were applied in some approaches prior to feeding the images into their algorithms. Table 10 shows these variants, or pipeline descriptions, for each approach.

5.1. Approach I

This approach utilizes transfer learning: rather than training an image-recognition model from scratch, which demands massive data and computation, it leverages a pre-trained MobileNetV2 as a foundation. MobileNetV2 is a type of CNN developed by Google [15]. It is specifically designed to be efficient and perform well on mobile and resource-constrained devices, making it a good choice for many CV tasks. It has been pre-trained on the large ImageNet dataset (millions of images across 1000 categories). This pre-training allows the model to learn general-purpose visual features (like edges, textures, and shapes) that are useful for various image recognition tasks. The approach begins by loading a pre-trained MobileNetV2 model and removing its original ImageNet classification head, preserving the convolutional base as a frozen feature extractor. New, trainable classification layers are then appended to this fixed backbone and trained specifically on the intact-versus-damaged package dataset. By leveraging the pre-learned filters of MobileNetV2 and updating only the added classification head, this transfer-learning strategy achieves good performance with significantly less data and computational effort than training a comparable model from scratch. Approach I was also executed with top-view images as the only input.
The network (see Figure 23) accepts two inputs, a 128 × 128 RGB “top” image and a 128 × 128 RGB “side” image, each fed into its own MobileNetV2 backbone (pretrained on ImageNet and frozen during initial training) to serve as separate feature extractors. Each backbone’s output feature maps are reduced via global average pooling into fixed-length vectors, which are then concatenated into a single combined feature representation. This joint vector is passed through a dense layer of 128 units with ReLU activation, followed by a dropout layer (rate = 0.5) to mitigate overfitting, and finally through a one-unit sigmoid output neuron that predicts the probability of the package being damaged versus intact.
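A minimal TensorFlow/Keras sketch of this two-branch architecture is shown below; the layer sizes follow the description above, while the backbone renaming, input preprocessing note, and compilation settings are assumptions rather than the authors’ exact training configuration.

```python
# Sketch of the dual-input MobileNetV2 classifier described above (TensorFlow/Keras).
# Inputs are assumed to be preprocessed with mobilenet_v2.preprocess_input (scaled to [-1, 1]).
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_backbone(name: str) -> tf.keras.Model:
    base = tf.keras.applications.MobileNetV2(
        include_top=False, weights="imagenet", input_shape=(128, 128, 3))
    base._name = name          # rename to avoid name clashes between the two branches
    base.trainable = False     # frozen feature extractor during initial training
    return base

top_in = layers.Input(shape=(128, 128, 3), name="top_view")
side_in = layers.Input(shape=(128, 128, 3), name="side_view")

top_feat = layers.GlobalAveragePooling2D()(build_backbone("top_backbone")(top_in))
side_feat = layers.GlobalAveragePooling2D()(build_backbone("side_backbone")(side_in))

x = layers.Concatenate()([top_feat, side_feat])        # joint feature representation
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)                             # mitigate overfitting
out = layers.Dense(1, activation="sigmoid")(x)         # probability of "damaged"

model = Model([top_in, side_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```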

5.2. Approach II

Approach II adds a few trials of three different image enhancements to Approach I before feeding the images into the MobileNetV2 classifier. Figure 24 shows a sample package that is used to demonstrate the differences between these three image enhancements.

5.2.1. Global Histogram Equalization

The “apply_hist_eq” function implements GHE, a contrast-enhancement technique that redistributes the intensity values of an image to span the full available dynamic range. First, it converts the input Red Green Blue (RGB) image to a single-channel grayscale representation; it then computes the Cumulative Distribution Function (CDF) of the pixel intensities and remaps each pixel so that densely populated intensity levels are spread over a wider output range while sparsely populated levels are compressed. This results in an output image for which the histogram is roughly uniform, making dark regions brighter and bright regions darker in a balanced way. Finally, the equalized grayscale image is converted back to three RGB channels so that downstream pipelines expecting color input remain compatible. By boosting overall contrast, this method can reveal hidden details in underexposed or overexposed areas, but because it operates globally, it may also amplify noise or wash out subtle gradients in regions in which the original histogram was already dense.
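A plausible OpenCV-based reconstruction of this function (not necessarily the exact code used in this work) is shown below.

```python
# Plausible reconstruction of "apply_hist_eq": global histogram equalization
# on the grayscale image, converted back to three channels.
import cv2

def apply_hist_eq(img_rgb):
    gray = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2GRAY)
    equalized = cv2.equalizeHist(gray)  # remaps intensities via the CDF
    return cv2.cvtColor(equalized, cv2.COLOR_GRAY2RGB)
```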

5.2.2. Contrast-Limited Adaptive Histogram Equalization

The “apply_clahe” function refines the global histogram equalization approach with CLAHE, which prevents the overamplification of noise, a common side effect of global equalization, by dividing the image into nonoverlapping tiles (by default, 8 × 8) and applying histogram equalization independently within each tile. A clip_limit parameter (default 2.0) caps the maximum contrast enhancement by clipping the histogram’s peaks before computing the CDF, and then redistributes the clipped pixels uniformly across all intensity levels. After processing each tile, the function blends adjacent regions via bilinear interpolation to eliminate artificial boundaries. Like the global method, it begins by converting the BGR image to grayscale and ends by reconverting it back to RGB. The result is an image with enhanced local contrast, which is useful for highlighting fine textures or features, without the excessive noise boost and artifacts that can occur in uniform dark or bright areas.
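A plausible reconstruction of this function, using OpenCV's CLAHE implementation with the defaults stated above, is as follows.

```python
# Plausible reconstruction of "apply_clahe": tile-based, contrast-limited
# histogram equalization with clip limit 2.0 and an 8x8 tile grid.
import cv2

def apply_clahe(img_bgr, clip_limit=2.0, tile_grid_size=(8, 8)):
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    enhanced = clahe.apply(gray)  # per-tile equalization + bilinear blending
    return cv2.cvtColor(enhanced, cv2.COLOR_GRAY2RGB)
```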

5.2.3. Sharpening

The “apply_sharpening” function accentuates edges and fine details by convolving the original RGB image with a 3 × 3 sharpening kernel. This kernel places a large positive weight (9) at the center and negative weights (–1) at each of the eight surrounding positions, effectively adding a scaled version of the image’s high-frequency (detail) components back onto the original. The operation enhances transitions between adjacent pixels—such as object boundaries or texture lines—making them appear crisper and more pronounced. Because the convolution is applied directly to the three RGB channels, it preserves color information while boosting sharpness. Although sharpening can improve the visual clarity of features like wrinkles, edges, or printed text, overly aggressive kernels may introduce ringing artifacts or amplify noise, so the chosen weights strike a balance between clarity and artifact suppression.
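A plausible reconstruction of this function, convolving each channel with the 3 × 3 kernel described above (center weight 9, neighbor weights −1), is as follows.

```python
# Plausible reconstruction of "apply_sharpening": convolve each RGB channel
# with a 3x3 kernel (center 9, eight neighbors -1).
import cv2
import numpy as np

def apply_sharpening(img_rgb):
    kernel = np.array([[-1, -1, -1],
                       [-1,  9, -1],
                       [-1, -1, -1]], dtype=np.float32)
    # filter2D applies the kernel to each channel independently;
    # ddepth=-1 keeps the input depth.
    return cv2.filter2D(img_rgb, -1, kernel)
```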

5.3. Approach III

The core objective was to apply a series of image enhancement techniques and then train a classifier to predict the overall damage status of each package, adhering to the rule that a package is considered damaged if either its side view or top view, or both views, show damage. A multi-stage image enhancement pipeline was applied to the original images before they were used for classification. The goal was to improve image quality and potentially aid the classifier. Standard data augmentation techniques (random horizontal flip, random rotation, and color jitter) were applied to the training set. Both training and validation/test sets had normalization applied using ImageNet statistics. These statistics are essential for preprocessing when using an ImageNet-pretrained network (e.g., ResNet, VGG, and Inception), to ensure consistency with the original training.
The enhancement pipeline (see Figure 25) begins by correcting illumination imbalances, using an MSR algorithm (apply_msr.py), implemented from the open-source “Retinex-Image-Enhancement” GitHub repository, with its default parameters. Local contrast is then boosted through CLAHE (apply_clahe.py), leveraging OpenCV’s cv2.createCLAHE(), with a clip limit of 2.0 and an 8 × 8 tile grid. To suppress sensor noise while preserving detail, fast non-local means (NLM) denoising for colored images (apply_denoise.py) is applied via OpenCV’s cv2.fastNlMeansDenoisingColored() using its standard settings. Resolution is enhanced using a learned FSRCNN super-resolution model (×2 magnification) through OpenCV’s dnn_superres module (apply_sr.py), made available by the opencv-contrib package. Finally, each color channel undergoes per-channel mean–variance normalization (apply_normalization.py), converting the pixel values to floating point and scaling by their dataset statistics to standardize image appearance prior to downstream analysis by ResNet-18.
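The following condensed sketch illustrates how the CLAHE, denoising, super-resolution, and normalization stages of this chain could be composed with OpenCV. The MSR step from the external repository is omitted, the FSRCNN_x2.pb model path is an assumption (the weights must be obtained separately for cv2.dnn_superres), and per-image statistics are used here in place of the dataset-level statistics described above.

```python
# Condensed sketch of the enhancement chain (CLAHE -> NLM denoising ->
# FSRCNN x2 super-resolution -> per-channel normalization).
import cv2
import numpy as np

def enhance(img_bgr, fsrcnn_model_path="FSRCNN_x2.pb"):
    # 1. CLAHE on the grayscale representation (clip limit 2.0, 8x8 tiles)
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    img = cv2.cvtColor(clahe.apply(gray), cv2.COLOR_GRAY2BGR)

    # 2. Fast non-local-means denoising for color images (OpenCV defaults)
    img = cv2.fastNlMeansDenoisingColored(img, None)

    # 3. FSRCNN x2 super-resolution via opencv-contrib's dnn_superres module
    sr = cv2.dnn_superres.DnnSuperResImpl_create()
    sr.readModel(fsrcnn_model_path)
    sr.setModel("fsrcnn", 2)
    img = sr.upsample(img)

    # 4. Per-channel mean-variance normalization
    #    (per-image statistics here; dataset statistics in the paper)
    img = img.astype(np.float32)
    mean = img.mean(axis=(0, 1), keepdims=True)
    std = img.std(axis=(0, 1), keepdims=True) + 1e-8
    return (img - mean) / std
```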
ResNet-18 is an 18-layer CNN introduced in 2015 as part of the residual network family. It begins with a 7 × 7 convolution and max-pooling layer, followed by four stages of “basic blocks,” each block containing two 3 × 3 convolutions with batch normalization and ReLU activations, plus an identity skip connection that adds the block’s input to its output. These residual connections mitigate the vanishing-gradient problem, enabling deeper architectures to train effectively. The feature channel count doubles at each stage transition, growing from 64 to 512; after the final stage, the network applies global average pooling and a fully connected layer to produce class scores. With approximately 11 million parameters, ResNet-18 balances representational capacity and computational efficiency, making it a popular backbone for real-time inference tasks.
The ResNet-18 was trained for 15 epochs with a batch size of 16 (see Figure 26). The Adam optimizer was used with a learning rate of 0.001. The loss function was CrossEntropyLoss. The fine-tuned classifier follows a two-branch ResNet-18 architecture: each branch ingests either the top-view or side-view 224 × 224 RGB image into a pretrained ResNet-18 backbone, the parameters of which are frozen except for the deeper residual blocks (layer3 and layer4) and the newly defined fully connected layers. Both backbones’ final classification heads are replaced with a linear layer projecting to 512 features (trainable). The two 512-dimensional feature vectors are concatenated into a 1024-dimensional joint representation, which is passed through a 256-unit Dense layer with ReLU activation, followed by a 0.5-dropout layer, and finally a 2-unit output layer producing logits for the intact-versus-damaged decision. By unfreezing only layer3, layer4, and the custom classification head, the model retains as much of the pretrained feature-extraction capacity as possible while adapting its deeper semantics and decision boundary to the package-damage task. Table 11 shows a parametric comparison between MobileNetV2 and ResNet-18.
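A minimal PyTorch sketch of this two-branch classifier, reflecting the frozen backbones (except layer3 and layer4), the 512-feature projections, and the fused 256-unit head, is given below. Class ordering, variable names, and training-loop details beyond those stated above are assumptions, and the weights argument requires torchvision ≥ 0.13 (older versions use pretrained=True).

```python
# Minimal sketch of the two-branch ResNet-18 package-damage classifier.
import torch
import torch.nn as nn
from torchvision import models

class DualViewResNet18(nn.Module):
    def __init__(self):
        super().__init__()
        self.top_backbone = self._make_branch()
        self.side_backbone = self._make_branch()
        # Fused head: 1024 -> 256 -> dropout -> 2 logits
        self.head = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(), nn.Dropout(0.5), nn.Linear(256, 2))

    @staticmethod
    def _make_branch():
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        # Freeze everything, then unfreeze the deeper residual stages
        for p in backbone.parameters():
            p.requires_grad = False
        for stage in (backbone.layer3, backbone.layer4):
            for p in stage.parameters():
                p.requires_grad = True
        # Replace the 1000-class head with a trainable 512-feature projection
        backbone.fc = nn.Linear(backbone.fc.in_features, 512)
        return backbone

    def forward(self, top_img, side_img):
        fused = torch.cat([self.top_backbone(top_img),
                           self.side_backbone(side_img)], dim=1)
        return self.head(fused)  # 2-class logits (class order is an assumption)

# Training setup stated in the text: Adam (lr=0.001), CrossEntropyLoss,
# 15 epochs, batch size 16.
model = DualViewResNet18()
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=0.001)
criterion = nn.CrossEntropyLoss()
```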

5.4. Approach IV

At this point, the dataset was annotated with polygon masks in COCO format, indicating the location of the package within each image (see Figure 27). In the COCO annotation schema, each package instance is localized not only by a rectangular bbox but also by one or more polygon masks that precisely trace its outline at the pixel level. These polygon masks appear in the segmentation field of each annotation entry as a list of [x1, y1, x2, y2, …] coordinate pairs, defining the package’s contour in image-coordinate space. Alongside the mask, the annotation record includes a bbox array ([x, y, width, height]) for quick region proposals, an area value computed from the polygon’s interior, a category_id linking to the “package” label, an image_id tying the annotation back to its source image, and an id for the annotation itself. An optional iscrowd flag distinguishes single-object polygons from merged or ambiguous regions. By leveraging polygon masks rather than simple boxes, this format enables instance-segmentation models to learn the exact shape and position of each package, improving the precision of downstream tasks such as defect localization or contour-based feature extraction.
The processing pipeline begins by extracting ROIs directly from the COCO annotations, using a Python script that parses each image’s BBox entries. Although it was not deployed in this paper, the Segment Anything Model (SAM), a prompt-based segmentation framework developed by Meta AI’s FAIR lab, offers an alternative route to the same end: it is designed to isolate any object in an image with a single click or other simple prompt, and it generalizes on a zero-shot basis to new image distributions and tasks without any additional fine-tuning, making it a versatile tool for interactive segmentation and real-time CV applications. In this experiment, ROI extraction was deployed to determine whether isolating the packages’ ROIs and learning their features could improve model performance.
Each ROI is converted to grayscale and enhanced via the CLAHE image enhancement technique, producing contrast-optimized ROI images. Finally, in the last stage, a model classifies each enhanced ROI as intact or damaged by loading image features from a pretrained ViT via its pooler output and training different classifiers on those features, namely, LR, SVM, and RF classifiers. Table 12 shows the tuning parameters for these three classifiers.
The ViT model adapts the transformer architecture (originally developed for NLP) to CV by treating fixed-size image patches as tokens. In the google/vit-base-patch16-224-in21k variant used in this approach, each 224 × 224 RGB image is divided into a 14 × 14 grid of non-overlapping 16 × 16-pixel patches, which are flattened and projected into a 768-dimensional embedding space. Learnable positional embeddings are added to preserve spatial information, and the resulting sequence, including a special classification token, is passed through 12 stacked transformer encoder layers, each comprising multi-head self-attention (12 heads) and feed-forward sublayers. After processing, the final hidden state of the classification token (“pooler_output”) serves as a global feature vector summarizing the image’s content. Pretrained on the expansive ImageNet21k dataset, this ViT backbone provides rich, high-level representations that, when frozen and paired with a lightweight classifier (e.g., LR or SVM), enable effective package damage detection with relatively little task-specific training. Since the ViT was pretrained on ImageNet21k, its weights remain frozen while the pooler_output serves as a fixed 768-dimensional feature extractor for downstream classifiers. By reusing these pretrained weights rather than training from scratch, the models converge faster, require far less task-specific data, and generalize more effectively to the intact-versus-damaged package task. Figure 28 shows the architecture of the pretrained ViT that was deployed in this approach.
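The frozen-ViT-plus-classifier stage can be sketched as follows using the Hugging Face transformers library and scikit-learn. The batch handling and variable names are illustrative assumptions, and older transformers versions expose the processor as ViTFeatureExtractor rather than ViTImageProcessor.

```python
# Sketch of frozen ViT feature extraction followed by a logistic regression
# classifier on the pooler_output vectors.
import numpy as np
import torch
from transformers import ViTImageProcessor, ViTModel
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "google/vit-base-patch16-224-in21k"
processor = ViTImageProcessor.from_pretrained(MODEL_NAME)
vit = ViTModel.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def extract_features(pil_images):
    """Return one 768-dimensional pooler_output vector per input image."""
    inputs = processor(images=pil_images, return_tensors="pt")
    outputs = vit(**inputs)
    return outputs.pooler_output.numpy()

# Hypothetical usage on the enhanced ROI crops:
# X_train = extract_features(train_rois)
# clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
# y_pred = clf.predict(extract_features(test_rois))
```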

5.5. Approach V

This approach utilized a Generative Pre-training Transformer (GPT) o4-mini-high model by OpenAI. Simply put, the dataset was uploaded to the model, which was prompted to inspect each individual package in the given images and identify the damaged ones from among the intact ones. The model performed the inspection process by using top views only and also by using both side and top views. The model was also prompted not to use the labeling of the dataset to perform the classification but rather to rely on its visual reasoning abilities.
The GPT o4-mini-high is an example of Large Vision Models (LVMs), a sophisticated class of Generative AI systems that process and understand visual data, such as images and videos. These models have significantly enhanced CV, enabling machines to interpret visual information with accuracy and detail that closely mirrors human perception. Characterized by their extensive parameter counts, which often reach into the millions or even billions, LVMs are, at their core, deep NNs with architectures tailored to handle the complexities of visual data. Initially, CNNs were the standard for image processing tasks due to their ability to process pixel data and recognize hierarchical features efficiently. However, the introduction of Transformer architectures, initially developed for NLP, has led to their adaptation to visual contexts, resulting in improved performance in tasks like image recognition and object detection. The primary advantage of LVMs lies in their ability to handle complex visual tasks.
The GPT o4-mini-high (see Figure 29) belongs to OpenAI’s “o4-mini” family of reasoning-optimized transformer models. It uses the same decoder-only architecture as the larger o4 variants, multi-head self-attention layers interleaved with feed-forward networks and layer normalization, but with substantially reduced depth and width (on the order of a few dozen layers and hundreds of millions of parameters rather than billions). This smaller footprint yields much lower values for inference latency and resource consumption, while still preserving strong performance on logic, classification, and summarization benchmarks. The GPT o4-mini-high typically supports context windows in the low-to-mid thousands of tokens and has been fine-tuned with chain-of-thought and instruction-following data to boost its reasoning and alignment. Compared to the full-scale o4 model, it trades some creative and generative flair for faster response times and efficiency, making it ideal for real-time applications and embedded use cases in which visual reasoning at high speed is needed.

5.6. Approach VI

An object detection problem consists of two stages, namely, identifying an object in an image and precisely estimating its location (localization) within the image. YOLO is a pretrained single-stage detector that performs both tasks in a single pass. The YOLOv7 model (see Figure 30) takes 640 × 640 images as its input. Instead of learning region proposals, YOLO looks at the complete image, splits it into an n × n grid, and uses a single CNN to predict the BBoxes and the class probabilities for these boxes. The package position is defined by an annotated dataset with a BBox around each package to locate it. The model outputs a corresponding BBox with coordinates, a height and width, and a confidence score. Consider the package represented by a ground-truth BBox and the detected area represented by a predicted BBox. A perfect match occurs when the area and location of the predicted and ground-truth boxes are identical. The Intersection over Union (IoU) is used to evaluate these two BBoxes: IoU equals the area of overlap (intersection) between the predicted and ground-truth BBoxes divided by the area of their union. Any prediction whose IoU exceeds the chosen threshold counts as valid; a threshold close to one is far more restrictive than one close to zero. The confidence score reflects how likely it is that the predicted BBox contains the package, and how confident the classifier is about it. An IoU threshold of 0.5 was selected, which is neither restrictive nor loose.
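As a short worked example, the IoU between two boxes given in [x, y, width, height] format can be computed as follows; the example boxes are hypothetical.

```python
# Worked IoU example for two boxes in [x, y, width, height] format.
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A prediction shifted 10 px in x against a 100x100 ground-truth box:
print(iou([0, 0, 100, 100], [10, 0, 100, 100]))  # ~0.82, above the 0.5 threshold
```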

5.7. Approach VII

This approach is similar to Approach IV. However, a deblurring process was included here to reduce the effect of motion blur in images and to investigate whether this factor would have an impact on the overall accuracy. Since LR outperformed the other classifiers in Approach IV, it was the only classifier deployed in this approach after feature extraction was carried out by the ViT model.
Additionally, the detection of damaged packages was executed using top-view images only, with two trials of different LR parameters; top- and side-view images were then processed with the default LR parameters. Table 13 shows the different tuning parameters for the LR classifiers.

5.8. Approach VIII

The YOLOv8-seg model (see Figure 31), a state-of-the-art object detection and instance segmentation model, was chosen for its high accuracy and efficiency. The models were trained on a custom dataset of package images, with annotations specifying the package outlines (polygonal masks).
Following the segmentation of packages using YOLOv8 (see Figure 32), the resulting cropped package images were subjected to an image enhancement step. The primary technique employed was CLAHE, which is aimed at improving the local contrast of the images. This can be particularly beneficial for highlighting subtle details related to package damage that might otherwise be obscured due to poor lighting or low contrast in the original images. CLAHE is an adaptive version of histogram equalization. Instead of applying equalization to the entire image, CLAHE operates on small regions of the image, called tiles. For each tile, a histogram equalization is performed. To avoid over-amplification of the noise that standard histogram equalization can produce, CLAHE limits the contrast enhancement, especially in homogenous regions. The resulting equalized tiles are then stitched back together using bilinear interpolation to remove artificially induced boundaries.
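A minimal sketch of this segmentation-then-enhancement stage, using the ultralytics YOLOv8 API, is shown below. The weights file name is a placeholder for the custom-trained package-segmentation model, and bounding-box crops are used here for simplicity, whereas the pipeline described above is based on polygonal masks.

```python
# Sketch: segment packages with YOLOv8-seg, crop each detection, and apply
# CLAHE to the cropped package before classification.
import cv2
from ultralytics import YOLO

seg_model = YOLO("yolov8n-seg.pt")  # placeholder; custom-trained weights in practice

def crop_and_enhance(image_path):
    result = seg_model(image_path)[0]  # one image -> one Results object
    img = cv2.imread(image_path)
    crops = []
    for box in result.boxes.xyxy.cpu().numpy().astype(int):
        x1, y1, x2, y2 = box[:4]
        roi = img[y1:y2, x1:x2]
        # CLAHE on the grayscale crop (clip limit 2.0, 8x8 tiles)
        gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        crops.append(cv2.cvtColor(clahe.apply(gray), cv2.COLOR_GRAY2BGR))
    return crops
```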
After image enhancement with CLAHE, the next stage in the pipeline involved classifying the segmented package images as either “intact” or “damaged.” We explored using MobileNetV2, a lightweight and efficient CNN architecture, for this binary classification task. Two approaches were implemented: one using only the top-view images and another using a combination of top-view and side-view images. MobileNetV2 is well suited for applications in which computational resources are constrained, functioning in such cases without significantly compromising accuracy. It utilizes depthwise separable convolutions to reduce the number of parameters and computations. For our task, we used a MobileNetV2 model pre-trained on the ImageNet dataset and fine-tuned it on our custom dataset of CLAHE-enhanced segmented package images. The fine-tuning process involved replacing the final classification layer of the pre-trained model with a new layer suited to our specific number of classes (two, “damage” and “intact”) and then unfreezing some of the later layers of the network to adapt their weights to our dataset.

5.9. Approach IX

This approach used the same results that were obtained by YOLOv8-seg in Approach VIII but used ViT for feature extraction and LR with default parameters for classification.

5.10. Approach X

This approach was a rerun of Approach IX, but with augmentation. In Approach III, augmentation was implemented on the training set with a 1:1 ratio. In Approach X, augmentation was implemented using the different variations detailed in Table 14. Only multi-view inspection was implemented, using both side and top views to detect damaged packages.
A lightweight but comprehensive augmentation pipeline can vastly improve robustness across real-world variation. First, geometric transforms such as random resized cropping (to simulate different scales and translations), horizontal flips (when damage is symmetric), and small rotations (±5–10° to mimic slight misalignments) ensure the network sees every face of the box in differing poses. Next, color augmentations, including brightness/contrast/saturation/hue jitter and occasional grayscale conversion, teach the model to ignore lighting shifts and color inconsistencies in the cardboard or printing. Adding noise and blur (low-amplitude Gaussian noise and slight random Gaussian blur) further inoculates against sensor grain and autofocus artifacts. Finally, random erasing (Cutout), which blanks out small rectangles with mean values or noise, encourages the classifier to reason about damage even when parts of the ROI are occluded or obscured. Together, these steps, when applied on the cropped, enhanced ROIs, yield a training set that better reflects the unpredictable conditions of a live packaging line.
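A lightweight torchvision version of this recipe might look as follows. The probabilities and magnitudes are assumptions within the ranges stated above, the low-amplitude Gaussian-noise step would require a custom transform, and synchronizing geometric transforms across paired top/side views would require a joint (paired) transform rather than this single-image pipeline.

```python
# Illustrative single-image augmentation pipeline covering geometric,
# color, blur, and Cutout-style (random erasing) steps.
from torchvision import transforms

train_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # scale/translation jitter
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),                  # small misalignments
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.05),
    transforms.RandomGrayscale(p=0.1),
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25, scale=(0.02, 0.1)),    # Cutout-style occlusion
])
```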
TTA applies the same transformations used during training to each image at inference time, but without updating the model’s weights. For every validation or test image, a set of variants, such as flipped, cropped, or color-jittered versions, is generated, and the trained model produces a prediction for each variant. Those predictions are then combined, commonly by averaging softmax probabilities or taking a majority vote on class labels, to yield the final output. By effectively ensembling over multiple views of the same image, TTA reduces sensitivity to specific poses, scales, or lighting conditions and often delivers modest improvements in accuracy and robustness, at a cost of increased inference computation proportional to the number of augmentations used. The TTA pipeline produced eight transformed variants for each paired top/side image in the test set. It applied the same eight augmentation techniques used during training, each with its designated probability, and with geometric transforms synchronized across both views. For each augmented version, the ViT extracted a feature embedding, and the trained LR classifier generated a probability for the “damage” class. These eight damage-class probabilities were then averaged to yield a single, final probability for the original image pair, which was subsequently thresholded to produce the intact-or-damaged prediction.
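A minimal sketch of this TTA step is given below. Here, augment_pair, extract_features, and clf are hypothetical helpers standing in for the paired augmentation, ViT feature-extraction, and trained LR stages described earlier, and concatenating the two views' embeddings is an assumption about how the pair-level feature was formed.

```python
# Sketch of TTA: average the damage-class probabilities of N augmented
# variants of a top/side pair, then threshold the mean.
import numpy as np

def predict_with_tta(top_img, side_img, clf, n_aug=8, threshold=0.5):
    probs = []
    for _ in range(n_aug):
        # Geometric transforms synchronized across both views (hypothetical helper)
        aug_top, aug_side = augment_pair(top_img, side_img)
        feats = np.concatenate([extract_features([aug_top])[0],
                                extract_features([aug_side])[0]])
        probs.append(clf.predict_proba(feats.reshape(1, -1))[0, 1])  # P(damaged)
    mean_prob = float(np.mean(probs))
    return ("damaged" if mean_prob >= threshold else "intact"), mean_prob
```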
The next section illustrates and details the results obtained while running the pipeline of each approach.

6. Results

6.1. Results of Approach I

Figure 33 shows the loss and accuracy curves of this approach during the training and validation phases. For images of both views as input (top row), the loss curves show rapid drops in both training loss (from ~1.5 to ~0.8 by epoch 3) and validation loss (from ~2.2 to ~0.75) during the first few epochs, after which training loss continues a gentle decline to ~0.55 by epoch 14 while validation loss plateaus around 0.65–0.70 and even creeps upward, signaling mild overfitting. In parallel, its accuracy curves climb steadily for the training set (from ~0.48 to ~0.72), but validation accuracy rises more erratically, peaking near 0.65 around epochs 6 and 9 before fluctuating again, indicating that improvements from the training data are not fully mirrored on unseen examples. When only top-view images were used as input (bottom row), the training loss initially increased slightly in the first two epochs (peaking around 0.83), then fell to ~0.66 by epoch 7, whereas validation loss bounced between ~0.70 and ~0.85 with no sustained downward trend, implying unstable learning. Its accuracy curves reflect a similar pattern: training accuracy dips to ~0.47 at epoch 1 and then steadily climbs to ~0.67, while validation accuracy peaks modestly at ~0.57 in the early epochs before dropping to ~0.47–0.48 mid-training and recovering only briefly, demonstrating poorer generalization when using only the top-view input. Figure 34 shows the precision–recall curves for this approach.
The precision–recall curve for images of both views as input begins at perfect precision (1.0) when no positives are predicted and then drops steeply below 0.3 as recall reaches approximately 0.05, indicating that the earliest True Positive (TP) recoveries incur a high False Positive (FP) rate. As the decision threshold relaxes further, precision rebounds in discrete increments and rises to around 0.55–0.60 at full recall, demonstrating that the model can sustain moderate precision once it captures a critical mass of TP. In contrast, the curve (in the case where top-view only images were used as input) also starts at precision = 1.0 and falls sharply, to below 0.3, by recall ≈ 0.05, but its recovery is slower and it peaks at only about 0.50–0.58 by recall ≈ 0.5 before plateauing near 0.48–0.50 toward full recall. This flatter recovery indicates that the single-view model struggles to maintain precision while attempting to identify additional positives. Overall, although both methods of input achieve high precision at very low recall, the dual-view configuration consistently sustains higher precision across increasing recall levels, reflecting a superior trade-off between sensitivity and specificity. Figure 35 shows the Receiver Operating Characteristic (ROC) curves for this approach.
The ROC curve for the use of both views as input rises modestly above the diagonal “no-skill” line, achieving an Area Under Curve (AUC) of 0.56. It begins at (0,0) and shows almost no TP until the FP rate exceeds roughly 0.2, after which the TP rate climbs in discrete steps, reaching 1.0 only when the model accepts a high proportion of false alarms. This behavior indicates that the dual-view model can only recover damaged packages by tolerating many FP, yielding limited discriminative power. By contrast, the “top-view only” input ROC curve hugs the diagonal more closely and attains an AUC of just 0.47, worse than random, rising above the baseline only at high FP rates (above approximately 0.3) and never achieving strong separation. In summary, while both models perform poorly, the dual-view architecture demonstrates a marginally better balance between sensitivity and specificity, compared to its single-view counterpart. Figure 36 shows the confusion matrices obtained in Approach I.

6.2. Results of Approach II

The horizontal bar chart below (see Figure 37) compares the classification accuracies of six MobileNetV2 variants, each employing different image-enhancement techniques and input configurations. The top-view-only model with CLAHE achieved the highest accuracy among the Approach II variants, at 67.5%, followed by the both-views model with CLAHE, at 62.5%. The both-views model with sharpening reached 60% accuracy, whereas its top-view-only counterpart achieved only 50%. Global histogram equalization produced 55% accuracy when applied to the top-view-only model and 50% for the both-views model. These results indicate that CLAHE delivers the greatest performance boost, particularly for single-view inputs, while sharpening and global equalization offer more limited gains.
The accuracy curves (see Figure 38) show that training accuracy climbs steadily from about 0.47 at epoch 0 to roughly 0.68 by epoch 10, while validation accuracy, after an initial rise to approximately 0.67 at epoch 2, fluctuates between 0.55 and 0.63 for the remainder of training, never matching the steady gains on the training set. In tandem, the loss curves reveal sharp drops in both training loss (from ≈1.09 to ≈0.70 by epoch 3) and validation loss (from ≈0.92 to ≈0.69), after which training loss continues a gradual decline to around 0.60, whereas validation loss plateaus near 0.68–0.71 and even edges upward slightly. Together, these patterns indicate that the CLAHE-enhanced top-view-only model learns effectively early on but begins to overfit after the first few epochs, as evidenced by the growing gap between its improving training performance and its stagnant validation metrics. Only the variant with the best results is discussed in more detail in the following.
The precision–recall curve (see Figure 39) starts at perfect precision (1.0) when no positives are predicted and then plunges to roughly 0.50 at a very low recall (~0.05), reflecting the initial trade-off of capturing TP at the cost of false alarms. As the decision threshold relaxes, precision recovers and peaks near 0.75 at recall ≈ 0.10 before settling into a jagged plateau between 0.60 and 0.70 for moderate recall values (0.20–0.80), indicating that the model can maintain reasonably high precision while identifying a substantial fraction of the damaged packages. At full recall, precision declines to approximately 0.49, demonstrating that attempts to capture every positive sample ultimately admit a larger proportion of false positives. The step-wise nature of the curve arises from the finite set of prediction scores, with each jump corresponding to a threshold that flips a small number of instances.
The ROC curve (see Figure 40) runs closely along, and even below, the diagonal “no-skill” line, yielding an AUC of just 0.43. It originates at (0,0) and remains near zero TP rate until the FP rate exceeds roughly 0.8, at which point the TP rate climbs steeply in a few discrete jumps to reach 1.0 only when nearly all negatives are misclassified. This step-wise progression reflects the limited number of threshold predictions, and the fact that high sensitivity is achieved only at the expense of extreme false alarms. Overall, this top-view model by itself demonstrates a weak discriminative ability. Figure 41 shows the confusion matrix for this model.

6.3. Results of Approach III

The curves (see Figure 42) show that the training accuracy climbed consistently from approximately 45% in epoch 1 to about 78% by epoch 15, indicating that the model continued to fit the training data more closely over time. In contrast, validation accuracy remained near 50% in the early epochs, surged to a peak of roughly 73% at epoch 8, and then gradually declined and subsequently oscillated around 60% thereafter, signaling the onset of overfitting once the model’s performance on unseen data failed to improve. Correspondingly, training loss fell steadily from about 1.0 to 0.5, whereas validation loss, after an initial decrease to around 0.7, spiked dramatically to over 6.5 at epoch 4 before stabilizing near 0.7–1.0, further illustrating that while the model learned to minimize loss on the training set, its generalization to validation data was disrupted by the aberrant loss spike and subsequent overfitting.
The precision–recall curve (see Figure 43) for the classifier remains at a perfect precision of 1.0 until the recall reaches approximately 0.60, after which precision declines in gradual steps—falling to about 0.90 at recall ≈ 0.70, 0.80 at ≈ 0.80, and settling near 0.55 by full recall. The area under this curve is 0.90, indicating that the model maintains a strong trade-off between sensitivity and specificity: it can recover a large proportion of damaged-package instances while keeping the rate of false positives relatively low.
The ROC curve for the ResNet-18 (see Figure 44) ascends sharply above the diagonal “no-skill” line, achieving an AUC of 0.88. It begins at (0,0), then captures about 60% of TP while incurring only a 5% FP rate, before climbing through successive thresholds, reaching roughly 75% TP rate at 20% FP rate and 90% TP rate by 60% FP rate, and finally attaining full sensitivity only when nearly all negatives are also flagged. This strong separation between TP and FP rates demonstrates that the model distinguishes damaged from intact packages with high reliability. Figure 45 shows the confusion matrix for this model.

6.4. Results of Approach IV

The LR classifier, with its default tuning parameters, achieved the highest accuracy performance compared to the other classifiers (see Figure 46 for comparison).
The precision–recall curve (see Figure 47) for the package-damage classifier begins at perfect precision (1.00) for the initial range of thresholds, holding until approximately 40% recall, which indicates that the first 40% of damaged packages are retrieved without any FP. As recall increases beyond this point, precision gently declines yet remains above 0.90 up to roughly 90% recall, demonstrating that the model sustains high confidence while capturing most positive instances. Only when recall approaches 100% does precision drop sharply to about 0.50, reflecting the necessity of accepting a substantial number of FP to recover every remaining TP. The average precision (AP) of 0.96 underscores the model’s excellent balance between sensitivity and specificity across all decision thresholds.
The ROC curve (see Figure 48) for the package-damage classifier rises sharply from the origin, achieving over 90% TP rate with less than 10% FP rate, and then flattens near the top edge before reaching (1,1). Its area under the curve value of 0.97 indicates that the model distinguishes damaged from intact packages with excellent accuracy across all thresholds: it can recover the vast majority of TP while keeping false alarms very low, and only approaches the trade-off of increasing false positives when pushing for complete sensitivity.
The classification stage used a pre-trained feature extractor (ViT) followed by a scikit-learn LR classifier. The training process for LR within scikit-learn does not readily expose per-epoch loss or accuracy metrics needed to plot these curves, unlike typical DL training loops in which models are trained end-to-end over multiple epochs. Figure 49 shows the confusion matrix for the default LR classifier in Approach IV.

6.5. Results of Approach V

The GPT o4-mini-high did not re-train or fine-tune any models on the 200 packages (400 images) itself. Instead, it “inspected” the dataset by parsing and reasoning over it. In contrast to a traditional DL pipeline, which would require splitting images into training and validation sets, iterating over epochs, back-propagating gradients, and tuning hyperparameters, the GPT o4-mini-high simply leveraged its pretrained language-and-reasoning capabilities to interpret the existing outputs and generate new analyses, visualizations, and explanations on demand. Because all model training and validation had already been performed offline by the authors’ scripts, there was no need for the model to execute additional training or validation steps; it functioned purely as an analysis and synthesis engine, not as a training framework. Figure 50 shows the confusion matrix for the GPT o4-mini-high in Approach V.

6.6. Results of Approach VI

The YOLOv7 loss curves (see Figure 51) for the damaged-package detector exhibit a very steep, almost exponential, decline in the first few epochs, dropping from roughly 1.0 to below 0.1 by epoch 5, and then continue to decrease more gradually, flattening out near zero by around epochs 20–30. This rapid convergence indicates that the model quickly masters both the BBox regression and classification objectives for the “damaged” class, essentially memorizing the training examples in very few epochs.
The PR curves (see Figure 52) for the top-view images show that the training precision starts near zero but rises steeply between epochs 5 and 15, surpassing 0.7 by epoch 15, and then continues climbing to perfect precision (1.0) by around epoch 40, where it remains for the rest of training. Training recall mirrors this behavior: it jumps above 0.8 by epoch 12, reaches 1.0 by about epoch 20, and stays at perfect recall thereafter. By contrast, the side-view images training achieves a much lower precision plateau; after initial volatility, precision climbs to roughly 0.5 by epoch 15 but never exceeds ~0.52, fluctuating mildly around that level through epoch 100. Its recall improves more substantially, rising to about 0.8 by epoch 15 and approaching 1.0 by epoch 25, but exhibits occasional dips before finally stabilizing near perfect recall.

6.7. Results of Approach VII

A comparison of the performance of the different LR classifiers is shown in Figure 53. Once again, an LR classifier with default parameters has outperformed the other alternatives. Figure 54 shows the PR curve for an LR classifier with default parameters.
The precision–recall curve shows that the model maintains perfect precision (1.0) up to a recall of roughly 0.85, meaning that it makes no FP errors until it has retrieved about 85% of the true “damaged” packages. Beyond that point, precision begins to decline, dropping to around 0.95 at a recall of 0.90, to about 0.73 at 0.97 recall, and finally falling to 0.50 when the model pushes for 100% recall. This behavior indicates a very high initial confidence in its positive predictions, followed by an increasingly steep trade-off; recovering the last few TP requires accepting a substantially higher FP rate. Overall, the model achieves strong early-retrieval performance but exhibits diminishing precision as it seeks to cover all positive instances. Figure 55 shows the ROC curve for the default LR classifier.
The ROC curve rises sharply toward the top-left corner, achieving an 85% TPR at only a 10% FPR, and then steadily climbs to perfect sensitivity at an FPR of around 60%. Its overall AUC of 0.95 indicates excellent discrimination between damaged and intact packages across all classification thresholds. In practical terms, this means the model can correctly flag most damaged packages while incurring very few false alarms, and even when pushed to identify every last positive, it still maintains reasonable specificity before ultimately trading off complete recall for higher FP. Figure 56 shows the confusion matrix for the default LR classifier.

6.8. Results of Approach VIII

The YOLOv8-seg was trained and validated on the side and top views individually, and this same trained model was used in Approaches IX and X. Figure 57 shows the side-view validation metric curves and Figure 58 shows the top-view validation metric curves.
The side-view YOLOv8-seg training curves (top row) show that all four loss terms—bounding-box (box), segmentation (seg), classification (cls), and distribution-focal-loss (dfl)—drop sharply in the first 5–10 epochs, before settling into a gentle, downward drift or mild plateau. In particular, box loss climbs briefly to ~0.65 around epoch 4 (as the model first learns to localize boxes) then falls steadily to ≈0.20 by epoch 30; seg loss plummets from ~0.65 to ~0.15 in the first ten epochs and then hovers near 0.10; cls loss declines from ~1.75 to about 0.50; and dfl loss, after a small initial rise to ~0.95, also cools down to ~0.75. The validation curves (bottom row) track the same patterns; each loss spikes in the opening epochs (reflecting rapid parameter adjustment) and then converges downward, with val/box and val/cls losses mirroring their training counterparts and val/seg and val/dfl losses reaching similarly low steady-state levels. Crucially, the detection and segmentation metrics all jump to near-perfect values very early and remain there. Both bounding-box precision and recall exceed 0.99 after epoch 2, as do mask precision and recall; mAP@50 for both boxes and masks surpasses 0.99 almost immediately, and even the stricter mAP@50–95 climbs from ~0.70 (for boxes) and ~0.92 (for masks) in the first few epochs to above ~0.98 by epoch 20, where it plateaus.
The top-view YOLOv8-seg training curves (first row) again exhibit the characteristic “fast converge then fine-tune” pattern seen in the side-view results, but with even smoother descent thanks to the richer input perspective. The train/box_loss begins near 0.90 in epoch 1 (as the model learns to localize package bounding boxes), rises briefly to ≈0.92 around epoch 4, and then falls in a near-monotonic fashion to ≈0.20 by epoch 50. The train/seg_loss starts at ≈1.05, plunges to ≈0.20 by epoch 10, and steadily tightens to ≈0.12 by the end. The train/cls_loss declines from ≈1.95 to ≈0.50, and train/dfl_loss drops from ≈0.98 to ≈0.75, all reflecting rapid improvements in classification confidence and bounding-box regression before fine-grained tuning. The corresponding validation losses (second row) mirror the training: val/box_loss spikes to ≈0.75 around epoch 4, then declines to ≈0.18; val/seg_loss falls from ≈0.55 to ≈0.15; val/cls_loss moves from ≈2.10 down to ≈0.47; and val/dfl_loss decreases from ≈0.90 to ≈0.75. Crucially, there is no divergence between train and val curves, indicating minimal overfitting on the top-view data. On the metric side (columns 5–8 for training, 9–12 for validation), all four measures, for both bounding boxes (B) and masks (M), climb to near-perfect performance within the first 10 epochs and remain there: precision(B/M) and recall(B/M) all exceed 0.99 by epoch 5; mAP50(B/M) reaches >0.99 almost immediately; and mAP50–95(B/M), which is the strictest indicator, grows from ≈0.75/0.93 (boxes/masks) around epoch 1–2 to above ≈0.98/0.93 by epoch 20, ultimately plateauing above 0.99.
These plots demonstrate that on the side-view data, YOLOv8-seg learns extremely quickly, achieving near-perfect detection and segmentation performance within a handful of epochs, and that both training and validation losses decline in lockstep without signs of overfitting, thanks to the model’s strong regularization and the relatively small, well-curated dataset. The top-view model learns even more efficiently than the side view, achieving both localization and segmentation with exceptional accuracy and stability, and showing training and validation performance in lockstep without overfitting across all 50 epochs. Finally, the YOLOv8-seg was able to detect the packages from the background in all images successfully. Figure 59 shows the confusion matrices for the training and validation process for the side-view images and the top-view images.
The MobileNetV2 loss and accuracy curves (see Figure 60) for top-view images over 30 epochs reveal a classic “learning vs. generalization” story. On the left, the blue “Train Loss” curve falls almost monotonically, from about 0.95 down to 0.08 by epoch 29, showing the network steadily driving down its error on the training set. In contrast, the orange “Val Loss” is much noisier and punctuated by large spikes around epochs 1, 8, 13, 19, and 29 (peaking above 1.8–2.6), before generally drifting downward from ~1.4 toward ~1.0 by the end. Those sharp upward jumps in validation loss strongly suggest the presence of occasional overfitting bursts whenever the model over-adjusts to training batches.
On the right, the “Train Acc” curve climbs steadily from ~0.55 at epoch 0 to ~0.97 at epoch 29, again reflecting a nearly perfect fit on the training data. The “Val Acc” (orange) starts at ~0.42, jumps to ~0.55 by epoch 3, and peaks around ~0.72 at epochs 9 and 25, but otherwise oscillates between ~0.53 and ~0.60, never closing the gap with training accuracy. The persistent ~0.25–0.30 accuracy gap—combined with the volatile validation loss—indicates that while the model memorizes the training examples, its ability to generalize to unseen top-view images is far more limited.
The MobileNetV2 loss and accuracy curves (see Figure 61) for the top- and side-view images model exhibit a markedly smoother learning dynamic compared to the top-view-only branch, but the data still reveal a persistent train/validation gap. On the left, the blue “Train Loss” curve plummets from 1.25 at epoch 0 down to about 0.15 by epoch 29, while the orange “Val Loss” falls from roughly 1.05 to about 0.35—albeit with four prominent spikes around epochs 6, 12, 16, and 26 (up to ~1.05). Those validation-loss blips suggest occasional overfitting or sensitivity to particular minibatches, but overall, the downward trend shows the model steadily improving on both sets.
On the right, “Train Acc” rises from ~0.55 to ~0.93, mirroring the loss decline, while “Val Acc” climbs more modestly from ~0.62 to ~0.88 (peaking around epochs 9, 20 and 26) and oscillates between 0.63 and 0.90. The narrower accuracy gap (≈0.05–0.10) compared to the top-view-only case indicates that incorporating both viewpoints improves generalization, yet the recurring dips in validation accuracy (notably at epoch 15) still flag over-adaptation. Figure 62 shows the ROC curves during the testing phase for MobileNetV2.
The ROC plot makes the performance gap between the two MobileNetV2 variants immediately apparent. On the left, the dual-view model’s ROC curve soars quickly toward the upper-left corner, achieving an AUC of 0.93 for both the “damaged” and “intact” classes, indicating excellent discrimination and very few FP at high TPR. By contrast, the right panel shows the top-view-only branch with a much shallower curve and an AUC of just 0.57 for both classes, barely rising above the diagonal no-skill line. This means that relying solely on the top-down image yields near-random classification, whereas fusing side and top perspectives enables the network to distinguish damaged from intact packages with high confidence across thresholds. Figure 63 shows the confusion matrices for MobileNetV2 in Approach VIII.

6.9. Results of Approach IX

In the figure (Figure 64) below, the left panel shows the precision–recall curves for the top-view-only pipeline, while the right panel presents the curves when both top and side views are used together. In the Top-View Only results, precision and recall for both “damage” (blue) and “intact” (orange) classes climb gradually: each curve peaks at around 0.80 precision and 0.65 recall, yielding an area under the PR curve (AUC) of 0.73 for both classes. This indicates that relying on a single viewpoint provides moderate discrimination but leaves a significant trade-off between correctly identifying damaged packages and avoiding false alarms. By contrast, the Dual-View (Top + Side) pipeline achieves dramatically better performance: both the damage and intact curves hug the top-right corner of the plot, with precision above 0.90 even at 0.90 recall, and both AUCs at 0.99. The near-flat, high precision at almost full recall demonstrates that combining two complementary camera angles allows the model to detect nearly all true cases of damage and intact packages while making almost no false predictions.
Figure 65 shows testing ROC curves for Approach IX. In the left panel, the ROC curves for the top-view–only pipeline demonstrate only moderate discrimination (AUC = 0.73 for both “damage” and “intact” classes): the TPR climbs gradually as the FPR increases, indicating that relying on a single perspective leaves many decision thresholds at which sensitivity and specificity are fairly balanced but far from ideal. In contrast, the right panel shows the dual-view (top  +  side) pipeline achieving near-perfect separation (AUC = 0.99 for both classes); both curves rise sharply to the top-left corner and remain flat at 1.0 for most thresholds, meaning the model correctly identifies almost all damaged and intact packages while incurring virtually no false alarms. Figure 66 shows the confusion matrices obtained for the two cases.

6.10. Results of Approach X

Figure 67 shows the accuracy values obtained with the different augmentation configurations that were used with ViT + LR in Approach X. Several factors can explain why applying TTA reduced rather than improved accuracy. First, averaging soft-damage probabilities across eight variants can amplify occasional high-confidence errors: a handful of strongly mistaken predictions on more heavily transformed images may drag the mean score away from the correct class. Second, when the baseline model is already operating near its ceiling (90% accuracy without TTA), additional augmentations often yield diminishing returns, since there is little room left to discover new decision boundaries. Finally, some datasets simply do not benefit much from TTA; if the training augmentations already capture the key variations, then repeating them at test time adds little new information and can even introduce inconsistencies. Collectively, these dynamics can make TTA a net negative when the original model is both strong and sensitive to certain input transformations.
Essentially, the set of augmentations and the aggregation method used for TTA did not align well enough with the learned characteristics of that particular ViT + LR model to enhance its predictions on the test set and instead introduced enough variance, or a sufficient number of confusing samples, to lower the overall accuracy. Finally, none of the augmentation configurations have led to any significant improvement in model performance compared to the highest-performing approaches that have been tested so far (92.5% accuracy). Figure 68 shows the test PR curve for this approach.
The precision–recall curve (AUC = 0.95) shows that the model maintains perfect precision (1.0) up through roughly 65% recall, indicating it makes no FP errors while recovering two-thirds of all damaged packages. Beyond that point, precision declines gradually: it falls to about 0.92 at 70% recall, to around 0.82 by 90% recall, and ultimately to roughly 0.55 at full (100%) recall. In other words, the classifier confidently identifies the majority of TP with virtually no false alarms at moderate recall levels but must trade off increasing FP in order to capture the rare remaining damaged cases. This behavior, strong early-retrieval performance followed by a precision–recall trade-off at the extremes, reflects a highly capable model that nevertheless faces the usual challenge of balancing completeness against specificity when pushed for total coverage. Figure 69 shows the test ROC curve obtained for this setup.
The ROC curve rises sharply toward the top-left corner, reflecting very strong discrimination between “damage” and “intact” classes. At a 0% FPR, the model already captures about 60% of TP; by allowing only a 10% FPR, it recovers roughly 90% of TP. As the threshold lowers further, TPR approaches 100% at the expense of more false alarms, but precision remains high until extreme recall. The overall AUC = 0.94 quantifies this excellent trade-off: across all thresholds the classifier maintains both high sensitivity and specificity, indicating that it can reliably distinguish damaged packages from intact ones in almost all images. Figure 70 shows the confusion matrix for this setup.
The next section discusses the results that were documented from these approaches and elaborates on their implications.

7. Discussion

7.1. Performance Measurements

Table 15 shows the performance measurements obtained for all classes and views in different variants.
With MobileNetV2 ingesting both top- and side-view images, overall classification accuracy remained low, implying that its architecture or the dual-view fusion may not suit this dataset optimally. The model detected only 42% of damaged packages, overlooking a majority of faults, while identifying 46% of intact ones correctly. Corresponding F1-scores were 0.31 for damaged and 0.54 for intact, underscoring the modest balance between precision and recall in each class. When constrained to top-view images alone, MobileNetV2 still failed to achieve strong accuracy. Damage recall improved to 50%, yet the model still missed half of all damaged packages; intact recall likewise sat at 50%. F1-scores diverged sharply, at 0.66 for damaged versus a mere 0.09 for intact, highlighting a severe imbalance in prediction quality. Interestingly, relying solely on the top view raised the damaged-package recall from 0.42 (dual view) to 0.50, suggesting that in this case the side-view input introduced conflicting or less informative cues for damage detection.
MobileNetV2 enhanced with CLAHE on top-view inputs delivered moderate overall accuracy, demonstrating the value of localized contrast boosting while leaving room for further optimization. Its ability to detect damage was exceptional, with a recall of 0.90, minimizing missed defects, whereas the intact-package recall was 0.45. These outcomes yielded F1-scores of 0.73 for the damaged class and 0.58 for the intact class, capturing the differing balance of precision and sensitivity across both categories.
Employing a ResNet-18 backbone combined with MSR, CLAHE, fast NLM denoising, FSRCNN super-resolution, and per-channel normalization on both top and side views yielded markedly robust discrimination between damaged and intact packages. The damaged-package recall reached 0.70, while intact recall climbed to 0.85, indicating strong sensitivity in both categories. These results translated into F1-scores of 0.76 for damaged and 0.79 for intact, demonstrating well-balanced precision and recall across the two classes.
The ROI–ViT–LR ensemble, when applied to both top- and side-view images, achieved near-perfect discrimination of package condition. It attained a damaged-package recall of 1.00, indicating zero missed faults, and an intact-package recall of 0.85. The corresponding F1-scores were 0.93 for damaged and 0.92 for intact packages, demonstrating a remarkably balanced trade-off between precision and recall in both categories.
In the dual-view configuration, the GPT o4-mini-high model attained a moderate overall accuracy, suggesting a reasonable baseline performance with room for enhancement. Its sensitivity to damaged packages was 0.69, while it correctly identified 78% of intact boxes. The resulting F1-scores, 0.72 for damaged and 0.75 for intact, demonstrate a comparable balance between precision and recall across both classes. As an off-the-shelf foundation model, the GPT o4-mini-high’s outcomes here reflect its general pre-training and architectural biases, given the lack of task-specific adaptations. When restricted to the top-view images alone, the GPT o4-mini-high’s accuracy remained in the moderate range, but its ability to detect damage dropped to a recall of 0.51, indicating that nearly half of the damaged packages went unflagged. Its intact recall was 0.70, and the F1-scores fell to 0.56 for damaged and 0.64 for intact packages. These metrics underscore the model’s reliance on multi-view context for robust damage recognition and highlight the limitations of its default weights when applied, without further fine-tuning, to this specific task.
In the dual-view YOLOv7 setup, the model achieved a respectable overall accuracy, highlighting both its strengths and areas for refinement. Damage detection recall reached 0.94, demonstrating the model’s proficiency at avoiding missed defects. In contrast, intact-package recall was 0.44, indicating a higher incidence of false positives for undamaged items. The corresponding F1-scores, 0.75 for damaged and 0.59 for intact, capture the trade-off between precision and sensitivity in each class.
The ROI + Deblur + CLAHE + ViT + LR pipeline fed with both top- and side-view images achieved very strong discrimination between intact and damaged packages. Its damaged-package recall reached 0.80, meaning it correctly flagged four out of five actual damage cases, while intact-package recall hit a perfect 1.00, indicating zero missed normal items. The corresponding F1-scores, 0.89 for damaged and 0.91 for intact, demonstrate a well-balanced trade-off between precision and sensitivity in each class. This variant’s incorporation of a deblurring stage may have contributed to sharper defect cues, although a direct, side-by-side comparison to a non-deblurred counterpart is not available in the current results. When restricted to top-view images only, the same ROI + Deblur + CLAHE + ViT + LR configuration yielded more modest outcomes. Damaged-package recall fell to 0.70, and intact-package recall likewise sat at 0.70, producing symmetrical F1-scores of 0.70 for both classes. This drop, specifically, the damage recall of 0.70 on the top view versus 0.80 with both views, highlights the importance of side-view information for capturing certain defects that may be obscured in a single perspective.
When applied to both top- and side-view inputs, the YOLOv8-seg + CLAHE + MobileNetV2 pipeline exhibited robust discrimination between damaged and intact packages. It achieved a damaged-package recall of 0.80 and an intact-package recall of 0.85, yielding F1-scores of 0.82 and 0.83, respectively. These figures attest to the model’s strong ability to balance precision and sensitivity across both classes under dual-view conditions. In contrast, restricting the same architecture to the top-view images alone led to markedly weaker outcomes, with both damaged and intact recall falling to 0.55 and corresponding F1-scores also at 0.55. The decline from 0.80 to 0.55 in damage detection underscores that critical defect cues are captured more effectively when side-view information supplements the top perspective.
The YOLOv8-seg + CLAHE + ViT + LR workflow, when fed both top- and side-view images, achieved exceptional discrimination between damaged and intact parcels. It correctly flagged 95% of all defective packages, while successfully identifying 90% of undamaged ones. The corresponding F1 scores—0.93 for the damaged class and 0.92 for the intact class—demonstrate a finely balanced blend of precision and sensitivity. Limiting the same architecture to the top-view alone resulted in a more moderate performance: only 70% of damaged parcels were detected, and 65% of intact packages were recognized. Its F1 scores dropped to 0.68 for the damaged class and 0.67 for the intact class. The decrease from 0.95 to 0.70 in damage detection highlights the critical role of the side-view perspective in revealing defects that are not readily visible from above.
The YOLOv8-segmentation pipeline enhanced with CLAHE, a 3:1 augmentation ratio applied to the training and validation data, and a ViT + LR classifier achieved uniformly strong detection performance when fed both top- and side-view images. It correctly flagged 90% of all damaged packages and 90% of intact ones, yielding F1-scores of 0.90 for each class. These identical recall and F1 values reflect a highly balanced precision–recall trade-off, demonstrating the model’s ability to minimize both missed defects and false alarms across both classes. Figure 71 presents a comparative summary of the accuracy values for detecting damaged packages across all variants employed in this experiment.
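As a rough illustration of the augmentation step, the sketch below generates three augmented copies per original training image, assuming that this is what the 3:1 ratio denotes; the specific transforms and directory names are illustrative only, not the exact recipe used in this experiment.

```python
from pathlib import Path
from PIL import Image
from torchvision import transforms

# Light geometric and photometric perturbations (illustrative choices).
augment = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomHorizontalFlip(p=0.5),
])

src, dst = Path("train/original"), Path("train/augmented")
dst.mkdir(parents=True, exist_ok=True)
for img_path in sorted(src.glob("*.png")):
    image = Image.open(img_path).convert("RGB")
    for k in range(3):  # three augmented variants per original image
        augment(image).save(dst / f"{img_path.stem}_aug{k}.png")
```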
Reflecting on the performance across these varied approaches, several key themes emerge from a comparative standpoint. For instance, the inclusion of sophisticated segmentation stages, such as those employing YOLO, often correlated with improved downstream classification performance, particularly in recall for damaged items, when compared to approaches lacking such explicit object localization. This suggests that accurately isolating the package from background clutter is a critical first step. The choice of image views also presented clear patterns. In many direct comparisons of the same base architecture, models utilizing both top and side perspectives frequently demonstrated superior recall and F1-scores for the “damaged” class compared to their top-view-only counterparts. This underscores the three-dimensional nature of package damage and the value of comprehensive visual input. However, occasional exceptions in which top-view models performed comparably or better highlight that feature interaction and model capacity can sometimes lead to counterintuitive outcomes if additional views introduce more noise than signal for a specific architecture. The impact of preprocessing steps like deblurring and CLAHE was highly context dependent. Direct comparisons revealed instances where deblurring enhanced performance, presumably by clarifying features, but also cases where it was detrimental, possibly by removing subtle textures. This reinforces the necessity of empirical validation for each preprocessing step within the specific pipeline, rather than assuming universal benefit.
Based on strong performance in critical metrics such as recall for damaged packages, the configuration involving ROI + ViT + LR using both views stands out. Its ability to correctly identify a high proportion of damaged items, potentially outperforming several other configurations in this respect, marks it as a promising candidate for further development. A detailed comparative error analysis, examining why this model succeeded where others faltered on specific challenging cases, would be highly instructive. However, ROI extraction requires significant preprocessing, such as masking, which can be difficult to perform in real time on edge devices. Without explicit segmentation (ROI), a classifier must learn to ignore irrelevant pixels such as conveyor-belt lines, printed labels, or barcode stickers; any variation in those elements (lighting glare, sticker orientation) can confuse the models. Figure 72 shows the relationship between performance, complexity, speed, training time, and computational cost.

Root-Cause for Misclassifications

It has been noted throughout this work that even state-of-the-art vision backbones can struggle to discriminate between “damaged” and “intact” packages under the simulated conditions of the dataset. Table 16 summarizes the main factors likely responsible for these misclassifications.

7.2. Future Work

To ensure continued gains, future development should employ systematic A/B experiments that isolate each pipeline component. Whenever a variant “Model X + A” outperforms its baseline “Model X,” that provides clear, empirical support for incorporating component A, whether it be a preprocessing step, an architectural module, or a particular fusion strategy for multi-view data.
Given the variability we observed, especially in recall, an adaptive, ensemble-based framework merits exploration. For example, specialized submodels tuned to particular damage types could be combined in a cascaded or voting ensemble, leveraging each model’s strengths while mitigating its weaknesses. The comparative results presented here will help identify the most complementary candidates for such an ensemble. Equally important is a deeper investigation into preprocessing effects. In cases where deblurring or other enhancements degraded recall, we must go beyond noting the drop and rigorously probe its source, whether that be suboptimal parameter settings, algorithmic mismatches, or interactions with downstream features. This diagnostic work will guide more informed choices about when and how to apply each transform.
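A minimal sketch of the voting-ensemble idea is shown below; the submodels (e.g., one tuned to creases and one to punctures) are hypothetical placeholders exposing a predict_proba-style interface, and the weighting scheme is only one possible design.

```python
import numpy as np

def soft_vote(probability_fns, features, weights=None):
    """Weighted average of class-probability vectors from several submodels."""
    probs = np.stack([fn(features) for fn in probability_fns])  # shape: (n_models, n_classes)
    w = np.ones(len(probability_fns)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    return (w[:, None] * probs).sum(axis=0)

# Hypothetical usage with already-trained specialists (names are placeholders):
# fused = soft_vote([crease_model_proba, puncture_model_proba], feature_vector, weights=[0.6, 0.4])
# label = "damaged" if fused[1] >= 0.5 else "intact"
```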
Additionally, all future benchmarks should report a balanced suite of metrics, beyond overall accuracy, to reflect the real cost of false negatives (FNs) in damaged-package inspection. Recall for the damaged class, precision for the intact class, F1-scores, and specificity must remain front and center. By maintaining and expanding the comparative-analysis framework established in this study, subsequent iterations can rapidly validate improvements and push toward a truly robust, deployable solution.
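The following short sketch illustrates one way such a balanced metric suite could be computed with scikit-learn, treating the damaged class as the positive label; the function name and the returned dictionary keys are illustrative.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

def balanced_report(y_true, y_pred):
    """Metric suite for binary package inspection (label 1 = damaged, 0 = intact)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=[0, 1], zero_division=0)
    return {
        "damaged_recall": recall[1],        # sensitivity to defects (cost of FNs)
        "intact_precision": precision[0],   # how trustworthy a "pass" decision is
        "f1_damaged": f1[1],
        "f1_intact": f1[0],
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }

# Example: balanced_report([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
```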
Finally, to drive further gains in package-damage classification, six complementary strategies are recommended, each grounded in the empirical insights from the current comparative analysis:
- First, rather than uniformly averaging predictions over a fixed set of test-time augmentations (TTA), a lightweight meta-model can be trained to weight only those transformations that preserve the salient damage cues, thereby avoiding dilution of high-confidence inferences. Alternatively, augmentations may be applied selectively to samples with base-model outputs near the decision boundary, focusing computational effort on ambiguous cases while maintaining performance on clear-cut examples.
- Second, feature-level fusion of multi-view data can be made more expressive by replacing simple concatenation with a learned cross-attention or gating mechanism that dynamically emphasizes the most informative perspective for each package (a minimal sketch of such a gating mechanism is provided at the end of this subsection). In a related vein, separate ViT branches may be trained on top and side views, with their final logits ensembled via learned weights, allowing the model to leverage complementary information in a late-fusion stage.
- Third, synthetic data generation and domain randomization can fill critical gaps in this dataset. By rendering more extreme dent geometries, different lighting conditions, and material variations alongside randomized camera intrinsics, conveyor backgrounds, and cardboard textures, the training set can be enriched with under-represented damage patterns, improving robustness to novel scenarios.
- Fourth, the existing COCO polygon annotations can guide feature learning by restricting ViT fine-tuning to damaged-region masks, compelling the model to focus its attention on creases and punctures. Introducing a small segmentation head and training in a multi-task regime (damage classification plus mask refinement) further sharpens the model’s ability to localize and interpret defect regions.
- Fifth, deeper fine-tuning of transformer layers, extending unfreezing to earlier blocks under a reduced learning rate, enables the adaptation of low-level feature extractors to cardboard textures. Regularization techniques such as MixUp or CutMix, applied across multi-view inputs, discourage over-reliance on any single viewpoint or local patch, promoting representations that are more distributed and generalizable.
- Sixth, an active-learning loop that identifies and re-labels low-confidence or misclassified samples, particularly those exhibiting subtle or borderline damage, can concentrate annotation effort where it yields the greatest performance return. By iteratively incorporating these “hard examples” into the training corpus, the decision boundary can be refined in precisely those regions where the model remains uncertain.
Collectively, these strategies furnish a rigorous roadmap for future experimentation, enabling systematic A/B validation of individual pipeline components and continual benchmarking on key metrics such as damaged-package recall, intact-package precision, F1-scores, and specificity.
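As a minimal sketch of the gating-based fusion mentioned in the second strategy above, the PyTorch module below learns per-package weights over the top- and side-view embeddings before classification; the embedding size, layer widths, and class count are illustrative assumptions rather than a prescribed design.

```python
import torch
import torch.nn as nn

class GatedViewFusion(nn.Module):
    """Fuse top- and side-view embeddings with learned, per-package gate weights."""

    def __init__(self, embed_dim=768, num_classes=2):
        super().__init__()
        # One gate value per view, derived from both embeddings jointly.
        self.gate = nn.Sequential(nn.Linear(2 * embed_dim, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, top_embed, side_embed):
        weights = self.gate(torch.cat([top_embed, side_embed], dim=-1))  # shape: (batch, 2)
        fused = weights[:, :1] * top_embed + weights[:, 1:] * side_embed
        return self.classifier(fused)

# Example: logits = GatedViewFusion()(torch.randn(4, 768), torch.randn(4, 768))
```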

7.3. Limitations

A major obstacle to the effective utilization of Generative AI in manufacturing is the procurement of high-quality data. These models require large, diverse datasets for training, and in a manufacturing setting, consistently acquiring such data, often gathered from heterogeneous sensors, production machinery, and other operational systems, is challenging. Inconsistent or substandard data can result in flawed generative models, diminishing their effectiveness in critical applications such as QC and generative design. This issue is particularly pronounced for small and medium-sized enterprises, which often face a data shortfall stemming from limited digitization and automation in their operations. Even large enterprises may struggle with insufficient or disorganized data, making these resources less suitable for training advanced AI systems. The problem escalates for infrequent events, such as specific equipment failures, which offer few examples for the model to learn from. Improving data collection, cleaning, and annotation techniques is essential to tackle these challenges. Furthermore, enterprises must invest in modern data infrastructure that enables the real-time collection, storage, and processing of diverse data types, thereby enhancing the robustness and overall efficacy of Generative AI models in manufacturing applications [107].
Generative AI also necessitates substantial computational resources, demanding high-performance hardware and robust cloud infrastructure. Training these models on extensive production datasets is resource-intensive and time-consuming, and the models typically run on specialized processors that are costly and energy-hungry. These significant computational expenses pose a major barrier to the widespread adoption of Generative AI in the industrial sector, particularly for smaller organizations with constrained AI infrastructure [108].

8. Conclusions

The initial performance metrics suggest that while some techniques, such as CLAHE, show promise, there is significant room for improvement to consistently reach the desired >90% accuracy. A multi-pronged approach focusing on better data utilization (especially for combined views), robust augmentation, and potentially more powerful model architectures or fine-tuning strategies, combined with careful error analysis, will be key. The choice of using top, side, or both views should be guided by empirical results, with combined views likely preferred when effective fusion techniques are employed. The very low precision for “Intact” in the MobileNetV2 (top-view) model, despite high precision for “Damaged,” highlights the need for balanced performance across classes, which metrics such as the F1-score and a careful look at per-class precision and recall can help evaluate. Future work should prioritize identifying the best overall approach from the full set of results reported here and then systematically applying the improvement strategies outlined above, starting with those most likely to yield significant gains based on the observed weaknesses.
The dramatic transformation underway in the manufacturing industry is largely driven by the arrival of new concepts such as the IoT, cloud-based technologies, the mobile internet, and AI. AI has paved the way for the transition from automated manufacturing to intelligent automated manufacturing. As companies contend with an ever-expanding mass of data, the pressure to handle, store, and process this information underscores the centrality of data-centric manufacturing within intelligent production. Incorporating AI vision systems into sustainable industrial environments promotes the detection of waste and inefficiencies through constant observation, and the real-time nature of continuous monitoring allows production supervisors to intervene immediately, minimizing further waste. Lowering scrap and rework rates is crucial in sustainable manufacturing because of their impact on production expenses and throughput. These AI technologies identify defects early in the production process and automate the separation of defective items from quality products, reducing scrap volumes and labor costs. AI is poised to reshape the landscape of sustainable manufacturing: it enables better data flow among individuals, systems, and devices, enhancing products and processes. As organizations optimize human–technology interactions through automation, AI stands ready to elevate operational efficiency.
Generative AI has emerged as a transformative force in numerous industrial applications, significantly enhancing inventory management, defect detection, quality assurance, and process optimization. It facilitates the training of models through the synthesis of extensive, realistic datasets, enabling ML systems to address varied and intricate industrial challenges. The ability to generate synthetic data provides a scalable and economical alternative to traditional data collection methods, thereby improving the dependability and precision of decision-making in manufacturing processes.

Author Contributions

Conceptualization, M.S.; methodology, M.S.; software, M.S.; validation, M.S. and A.H.; formal analysis, M.S.; investigation, M.S.; resources, F.F.C.; data curation, M.S.; writing—original draft preparation, M.S. and A.H.; writing—review and editing, M.S. and A.H.; visualization, M.S.; supervision, M.S.; project administration, M.S.; funding acquisition, F.F.C. All authors have read and agreed to the published version of the manuscript.

Funding

The reported research work is based upon work supported by the US Department of Defense under the Office of Local Defense Community Cooperation (OLDCC) Award Number MCS2106-23-01. The views expressed herein do not necessarily represent the views of the US Department of Defense or the United States Government. Additionally, this paper received partial financial support from the US Department of Energy/NNSA (Award Number: DE-NA0004003), as well as from the Lutcher Brown Distinguished Chair Professorship fund of the University of Texas at San Antonio.

Data Availability Statement

The Industrial Quality Control of Packages dataset is free to use and publicly available for download at https://www.kaggle.com/datasets/christianvorhemus/industrial-quality-control-of-packages (accessed on 15 November 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kazakova, E.; Lee, J. Sustainable Manufacturing for a Circular Economy. Sustainability 2022, 14, 17010. [Google Scholar] [CrossRef]
  2. Chen, S.-C.; Chen, H.-M.; Chen, H.-K.; Li, C.-L. Multi-Objective Optimization in Industry 5.0: Human-Centric AI Integration for Sustainable and Intelligent Manufacturing. Processes 2024, 12, 2723. [Google Scholar] [CrossRef]
  3. Mana, A.A.; Allouhi, A.; Hamrani, A.; Rehman, S.; el Jamaoui, I.; Jayachandran, K. Sustainable AI-Based Production Agriculture: Exploring AI Applications and Implications in Agricultural Practices. Smart Agric. Technol. 2024, 7, 100416. [Google Scholar] [CrossRef]
  4. Diaz, N.; Redelsheimer, E.; Dornfeld, D. A Critical Review on the Environmental Impact of Manufacturing: A Holistic Perspective. Int. J. Adv. Manuf. Technol. 2021, 117, 2793–2816. [Google Scholar] [CrossRef]
  5. Yarrow, D.; Mitchell, E.; Robson, A. The Hidden Factory: The Naked Truth about Business Excellence in the Real World. Total Qual. Manag. 2000, 11, 439–447. [Google Scholar] [CrossRef]
  6. Yang, J.; Zheng, Y.; Wu, J.; Wang, Y.; He, J.; Tang, L. Enhancing Manufacturing Excellence with Digital-Twin-Enabled Operational Monitoring and Intelligent Scheduling. Appl. Sci. 2024, 14, 6622. [Google Scholar] [CrossRef]
  7. Maldonado-Romo, J.; Ahmad, R.; Ponce, P.; Mendez, J.I.; Mata, O.; Rojas, M.; Montesinos, L.; Molina, A. Advancing Sustainable Manufacturing: A Case Study on Plastic Recycling. Prod. Manuf. Res. 2024, 12, 2425672. [Google Scholar] [CrossRef]
  8. Shahin, M.; Chen, F.F.; Bouzary, H.; Krishnaiyer, K. Integration of Lean Practices and Industry 4.0 Technologies: Smart Manufacturing for next-Generation Enterprises. Int. J. Adv. Manuf. Technol. 2020, 107, 2927–2936. [Google Scholar] [CrossRef]
  9. De la Vega-Rodríguez, M.; Baez-Lopez, Y.A.; Flores, D.-L.; Tlapa, D.A.; Alvarado-Iniesta, A. Lean Manufacturing: A Strategy for Waste Reduction. In New Perspectives on Applied Industrial Tools and Techniques; García-Alcaraz, J.L., Alor-Hernández, G., Maldonado-Macías, A.A., Sánchez-Ramírez, C., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 153–174. ISBN 978-3-319-56871-3. [Google Scholar]
  10. Shahin, M.; Frank Chen, F.; Bouzary, H.; Hosseinzadeh, A. Deploying Convolutional Neural Network to Reduce Waste in Production System. Manuf. Lett. 2023, 35, 1187–1195. [Google Scholar] [CrossRef]
  11. Amber Jesse, M. Cognitive Computing: Evaluating AI’s Human-Like Reasoning Skills. IJAETI 2024, 2, 427–447. [Google Scholar]
  12. Bini, S.A. Artificial Intelligence, Machine Learning, Deep Learning, and Cognitive Computing: What Do These Terms Mean and How Will They Impact Health Care? J. Arthroplast. 2018, 33, 2358–2361. [Google Scholar] [CrossRef] [PubMed]
  13. Shahin, M.; Hosseinzadeh, A.; Chen, F.F.; Davis, M.; Rashidifar, R.; Shahin, A. Deploying Optical Character Recognition to Improve Material Handling and Processing. In Proceedings of the Flexible Automation and Intelligent Manufacturing: Establishing Bridges for More Sustainable Manufacturing Systems, New Delhi, India, 28–29 June 2024; Silva, F.J.G., Ferreira, L.P., Sá, J.C., Pereira, M.T., Pinto, C.M.A., Eds.; Springer Nature: Cham, Switzerland, 2024; pp. 510–517. [Google Scholar]
  14. Ras, G.; Xie, N.; van Gerven, M.; Doran, D. Explainable Deep Learning: A Field Guide for the Uninitiated. J. Artif. Intell. Res. 2022, 73, 329–396. [Google Scholar] [CrossRef]
  15. Mehrzadi, H.; Maghanaki, M.; Shahin, M.; Chen, F.F.; Hosseinzadeh, A.; Tarhani, S. Enhanced Lung Cancer Detection in CT Scans Using Deep Learning Architectures. In Proceedings of the Third International Conference on Advances in Computing Research (ACR’25), Nice, France, 7–9 July 2025; Daimi, K., Al Sadoon, A., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 127–135. [Google Scholar]
  16. Rashid, A.B.; Kausik, M.A.K. AI Revolutionizing Industries Worldwide: A Comprehensive Overview of Its Diverse Applications. Hybrid. Adv. 2024, 7, 100277. [Google Scholar] [CrossRef]
  17. Xu, Y.; Liu, X.; Cao, X.; Huang, C.; Liu, E.; Qian, S.; Liu, X.; Wu, Y.; Dong, F.; Qiu, C.-W.; et al. Artificial Intelligence: A Powerful Paradigm for Scientific Research. Innovation 2021, 2, 100179. [Google Scholar] [CrossRef]
  18. Shahin, M.; Chen, F.F.; Maghanaki, M.; Hosseinzadeh, A.; Zand, N.; Khodadadi Koodiani, H. Improving the Concrete Crack Detection Process via a Hybrid Visual Transformer Algorithm. Sensors 2024, 24, 3247. [Google Scholar] [CrossRef]
  19. Shahin, M.; Chen, F.F.; Hosseinzadeh, A. Harnessing Customized AI to Create Voice of Customer via GPT3.5. Adv. Eng. Inform. 2024, 61, 102462. [Google Scholar] [CrossRef]
  20. Kalota, F. A Primer on Generative Artificial Intelligence. Educ. Sci. 2024, 14, 172. [Google Scholar] [CrossRef]
  21. Chaudhari, D.T. Artificial Intelligence, Its Types and Application in Various Fields. Int. J. Commer. Manag. Res. 2024, 10, 49–51. [Google Scholar]
  22. Lewis, P.R.; Sarkadi, Ş. Reflective Artificial Intelligence. Minds Mach. 2024, 34, 14. [Google Scholar] [CrossRef]
  23. Jiang, F.; Jiang, Y.; Zhi, H.; Dong, Y.; Li, H.; Ma, S.; Wang, Y.; Dong, Q.; Shen, H.; Wang, Y. Artificial Intelligence in Healthcare: Past, Present and Future. Stroke Vasc. Neurol. 2017, 2. [Google Scholar] [CrossRef]
  24. Gurney, N.; Pynadath, D.V.; Ustun, V. Spontaneous Theory of Mind for Artificial Intelligence. In Human-Computer Interaction. HCII 2024. Lecture Notes in Computer Science; Kurosu, M., Hashizume, A., Eds.; Springer: Cham, Switzerland, 2024; Volume 14684. [Google Scholar] [CrossRef]
  25. van der Meulen, R.; Verbrugge, R.; van Duijn, M. Towards Properly Implementing Theory of Mind in AI Systems: An Account of Four Misconceptions. arXiv 2025, arXiv:2503.16468. [Google Scholar]
  26. Zhang, S.; Wang, X.; Zhang, W.; Chen, Y.; Gao, L.; Wang, D.; Zhang, W.; Wang, X.; Wen, Y. Mutual Theory of Mind in Human-AI Collaboration: An Empirical Study with LLM-Driven AI Agents in a Real-Time Shared Workspace Task. arXiv 2024, arXiv:2409.08811. [Google Scholar]
  27. Winfield, A.F.T. Experiments in Artificial Theory of Mind: From Safety to Story-Telling. Front. Robot. AI 2018, 5, 75. [Google Scholar] [CrossRef] [PubMed]
  28. Barrett, A.M.; Baum, S.D. A Model of Pathways to Artificial Superintelligence Catastrophe for Risk and Decision Analysis. J. Exp. Theor. Artif. Intell. 2017, 29, 397–414. [Google Scholar] [CrossRef]
  29. Ao, S.-I.; Hurwitz, M.; Palade, V. Cognitive Computing and Business Intelligence Applications in Accounting, Finance and Management. Big Data Cogn. Comput. 2025, 9, 54. [Google Scholar] [CrossRef]
  30. Lee, J.; Davari, H.; Singh, J.; Pandhare, V. Industrial Artificial Intelligence for Industry 4.0-Based Manufacturing Systems. Manuf. Lett. 2018, 18, 20–23. [Google Scholar] [CrossRef]
  31. Sengar, S.S.; Hasan, A.B.; Kumar, S.; Carroll, F. Generative Artificial Intelligence: A Systematic Review and Applications. Multimed. Tools Appl. 2025, 84, 23661–23700. [Google Scholar] [CrossRef]
  32. Onatayo, D.; Onososen, A.; Oyediran, A.O.; Oyediran, H.; Arowoiya, V.; Onatayo, E. Generative AI Applications in Architecture, Engineering, and Construction: Trends, Implications for Practice, Education & Imperatives for Upskilling—A Review. Architecture 2024, 4, 877–902. [Google Scholar] [CrossRef]
  33. Andrianandrianina Johanesa, T.V.; Equeter, L.; Mahmoudi, S.A. Survey on AI Applications for Product Quality Control and Predictive Maintenance in Industry 4.0. Electronics 2024, 13, 976. [Google Scholar] [CrossRef]
  34. Hosseinzadeh, A.; Shahin, M.; Maghanaki, M.; Mehrzadi, H.; Chen, F.F. Minimizing Waste via Novel Fuzzy Hybrid Stacked Ensemble of Vision Transformers and CNNs to Detect Defects in Metal Surfaces. Int. J. Adv. Manuf. Technol. 2024, 135, 5115–5140. [Google Scholar] [CrossRef]
  35. Shin, Y.; Lee, M.; Lee, Y.; Kim, K.; Kim, T. Artificial Intelligence-Powered Quality Assurance: Transforming Diagnostics, Surgery, and Patient Care—Innovations, Limitations, and Future Directions. Life 2025, 15, 654. [Google Scholar] [CrossRef] [PubMed]
  36. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  37. Li, J.; Chen, Y.; Wang, Z. Generative Models Based Wafer Surface Defect Detection in Limited Sample Conditions. In Proceedings of the 2024 13th International Conference on Computing and Pattern Recognition, Tianjin, China, 25–27 October 2025; pp. 215–223. [Google Scholar]
  38. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Loy, C.C. ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  39. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar]
  40. Zhang, L.; Yang, F.; Zhang, Y.D.; Zhu, Y.J. An Effective Defect Detection Method Based on Improved Generative Adversarial Networks. J. Manuf. Syst. 2022, 60, 176–189. [Google Scholar] [CrossRef]
  41. Rais, K.; Amroune, M.; Benmachiche, A.; Haouam, M.Y. Exploring Variational Autoencoders for Medical Image Generation: A Comprehensive Study. arXiv 2024, arXiv:2411.07348. [Google Scholar]
  42. Vivekananthan, S. Comparative Analysis of Generative Models: Enhancing Image Synthesis with VAEs, GANs, and Stable Diffusion. arXiv 2024, arXiv:2408.08751. [Google Scholar]
  43. Saeed, A.Q.; Sheikh Abdullah, S.N.H.; Che-Hamzah, J.; Abdul Ghani, A.T.; Abu-Ain, W.A.K. Synthesizing Retinal Images Using End-To-End VAEs-GAN Pipeline-Based Sharpening and Varying Layer. Multimed. Tools Appl. 2024, 83, 1283–1307. [Google Scholar] [CrossRef]
  44. Shafiee, S. Generative AI in Manufacturing: A Literature Review of Recent Applications and Future Prospects. Procedia CIRP 2025, 132, 1–6. [Google Scholar] [CrossRef]
  45. Kehayov, M.; Holde, L.; Koch, V. Application of Artificial Intelligence Technology in the Manufacturing Process and Purchasing and Supply Management. Procedia Comput. Sci. 2022, 200, 1209–1217. [Google Scholar] [CrossRef]
  46. Filz, M.-A.; Thiede, S. Generative AI in Manufacturing Systems: Reference Framework and Use Cases. Procedia CIRP 2024, 130, 238–243. [Google Scholar] [CrossRef]
  47. Nagy, M.; Figura, M.; Valaskova, K.; Lăzăroiu, G. Predictive Maintenance Algorithms, Artificial Intelligence Digital Twin Technologies, and Internet of Robotic Things in Big Data-Driven Industry 4.0 Manufacturing Systems. Mathematics 2025, 13, 981. [Google Scholar] [CrossRef]
  48. Hosseinzadeh, A.; Frank Chen, F.; Shahin, M.; Bouzary, H. A Predictive Maintenance Approach in Manufacturing Systems via AI-Based Early Failure Detection. Manuf. Lett. 2023, 35, 1179–1186. [Google Scholar] [CrossRef]
  49. Haridasan, P.K.; Jawale, H. Generative AI in Manufacturing: A Review of Innovations, Challenges, and Future Prospects. JAIMLD 2024, 2, 1418–1424. [Google Scholar] [CrossRef] [PubMed]
  50. Zhou, X.; Wang, Y.; Xiao, C.; Zhu, Q.; Lu, X.; Zhang, H.; Ge, J.; Zhao, H. Automated Visual Inspection of Glass Bottle Bottom With Saliency Detection and Template Matching. IEEE Trans. Instrum. Meas. 2019, 68, 4253–4267. [Google Scholar] [CrossRef]
  51. Deng, F.; Luo, J.; Fu, L.; Huang, Y.; Chen, J.; Li, N.; Zhong, J.; Lam, T.L. DG2GAN: Improving defect recognition performance with generated defect image sample. Sci. Rep. 2024, 14, 14787. [Google Scholar] [CrossRef]
  52. Zhao, Y.; Gao, H.; Wu, S. A Survey on Surface Defect Inspection Based on Generative Models. Appl. Sci. 2024, 14, 6774. [Google Scholar] [CrossRef]
  53. Shahin, M.; Chen, F.F.; Maghanaki, M.; Mehrzadi, H.; Hosseinzadeh, A. Toward Sustainable Production: A Synthetic Dataset Framework to Accelerate Quality Control via Generative and Predictive AI. Int. J. Adv. Manuf. Technol. 2025, 138, 5979–6018. [Google Scholar] [CrossRef]
  54. Shahin, M.; Chen, F.F.; Maghanaki, M.; Hosseinzadeh, A.; Firouzranjbar, S.; Koodiani, H.K. Concrete Fault Detection Using Deep Learning: Towards Waste Reduction in Bridge Inspection. In Proceedings of the Flexible Automation and Intelligent Manufacturing: Manufacturing Innovation and Preparedness for the Changing World Order, Taichung, Taiwan, 23–26 June 2024; Wang, Y.-C., Chan, S.H., Wang, Z.-H., Eds.; Springer Nature: Cham, Switzerland, 2024; pp. 31–38. [Google Scholar]
  55. Shahin, M.; Chen, F.F.; Hosseinzadeh, A.; Koodiani, H.K.; Bouzary, H.; Rashidifar, R. Deploying Computer-Based Vision to Enhance Safety in Industrial Environment. In Proceedings of the Flexible Automation and Intelligent Manufacturing: Establishing Bridges for More Sustainable Manufacturing Systems, New Delhi, India, 28–29 June 2024; Silva, F.J.G., Ferreira, L.P., Sá, J.C., Pereira, M.T., Pinto, C.M.A., Eds.; Springer Nature: Cham, Switzerland, 2024; pp. 503–509. [Google Scholar]
  56. Industrial Quality Control of Packages. Available online: https://www.kaggle.com/datasets/christianvorhemus/industrial-quality-control-of-packages (accessed on 18 April 2025).
  57. Shahin, M.; Chen, F.F.; Hosseinzadeh, A. Machine-Based Identification System via Optical Character Recognition. Flex. Serv. Manuf. J. 2024, 36, 453–480. [Google Scholar] [CrossRef]
  58. Peng, X.; Sun, B.; Ali, K.; Saenko, K. Learning Deep Object Detectors from 3D Models. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015. [Google Scholar]
  59. Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 1–24 September 2017; pp. 23–30. [Google Scholar]
  60. Tremblay, J.; Prakash, A.; Acuna, D.; Brophy, M.; Jampani, V.; Anil, C.; To, T.; Cameracci, E.; Boochoon, S.; Birchfield, S. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 969–977. [Google Scholar]
  61. Wu, C.; Zhou, K.; Kaiser, J.-P.; Mitschke, N.; Klein, J.-F.; Pfrommer, J.; Beyerer, J.; Lanza, G.; Heizmann, M.; Furmans, K. MotorFactory: A Blender Add-on for Large Dataset Generation of Small Electric Motors. Procedia CIRP 2022, 106, 138–143. [Google Scholar] [CrossRef]
  62. Blender Foundation. Cycles Render Engine—Sampling and Performance; Blender Manual; Blender Foundation: Amsterdam, The Netherlands, 2025. [Google Scholar]
  63. Pharr, P.; Jakob, W.; Humphreys, G. Physically Based Rendering: From Theory to Implementation, 3rd ed.; Morgan Kaufmann: Burlington, MA, USA, 2016; ISBN 978-0-12-800645-0. [Google Scholar]
  64. Mohammed, S.S.; Clarke, H.G. Conditional image-to-image translation generative adversarial network (cGAN) for fabric defect data augmentation. Neural Comput. Applic. 2024, 36, 20231–20244. [Google Scholar] [CrossRef]
  65. Wang, J.; Fu, P.; Gao, R.X. Machine Vision Intelligence for Product Defect Inspection Based on Deep Learning and Hough Transform. J. Manuf. Syst. 2019, 51, 52–60. [Google Scholar] [CrossRef]
  66. Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014, arXiv:1411.1784. [Google Scholar] [CrossRef]
  67. Chaudhary, A.K.; Nair, N.; Bailey, R.J.; Pelz, J.B.; Talathi, S.S.; Diaz, G.J. From Real Infrared Eye-Images to Synthetic Sequences of Gaze Behavior. IEEE Trans. Vis. Comput. Graph. 2022, 28, 3948–3958. [Google Scholar] [CrossRef] [PubMed]
  68. Moreschini, S.; Gama, F.; Bregovic, R.; Gotchev, A. CIVIT Dataset: Integral Microscopy with Fourier Plane Recording. Data Brief. 2023, 46, 108819. [Google Scholar] [CrossRef] [PubMed]
  69. Ranjbaran, S.M.; Khan, A.; Manwar, R.; Avanaki, K. Tutorial on Development of 3D Vasculature Digital Phantoms for Evaluation of Photoacoustic Image Reconstruction Algorithms. Photonics 2022, 9, 538. [Google Scholar] [CrossRef]
  70. Cartucho, J.; Tukra, S.; Li, Y.; Elson, D.S.; Giannarou, S. VisionBlender: A Tool to Efficiently Generate Computer Vision Datasets for Robotic Surgery. Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 2021, 9, 331–338. [Google Scholar] [CrossRef]
  71. Kim, A.; Lee, K.; Lee, S.; Song, J.; Kwon, S.; Chung, S. Synthetic Data and Computer-Vision-Based Automated Quality Inspection System for Reused Scaffolding. Appl. Sci. 2022, 12, 10097. [Google Scholar] [CrossRef]
  72. Georg Olofsson Zwilgmeyer, P.; Yip, M.; Langeland Teigen, A.; Mester, R.; Stahl, A. The VAROS Synthetic Underwater Data Set: Towards Realistic Multi-Sensor Underwater Data with Ground Truth. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 3715–3723. [Google Scholar] [CrossRef]
  73. Hemming, J.; Barth, R.; IJsselmuiden, J.; Henten, E.J.V. Data Synthesis Methods for Semantic Segmentation in Agriculture: A Capsicum Annuum Dataset. Comput. Electron. Agric. 2018, 144, 284–296. [Google Scholar] [CrossRef]
  74. Károly, A.I.; Galambos, P. Task-Specific Grasp Planning for Robotic Assembly by Fine-Tuning GQCNNs on Automatically Generated Synthetic Data. Appl. Sci. 2022, 13, 525. [Google Scholar] [CrossRef]
  75. Greff, K.; Belletti, F.; Beyer, L.; Doersch, C.; Du, Y.; Duckworth, D.; Fleet, D.J.; Gnanapragasam, D.; Golemo, F.; Herrmann, C.; et al. Kubric: A Scalable Dataset Generator. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3739–3751. [Google Scholar] [CrossRef]
  76. Park, D.; Lee, J.; Lee, J.; Lee, K. Deep Learning Based Food Instance Segmentation Using Synthetic Data. In Proceedings of the 2021 18th International Conference on Ubiquitous Robots (UR), Gangneung, Republic of Korea, 12–14 July 2021; pp. 499–505. [Google Scholar] [CrossRef]
  77. Ninos, A.; Hasch, J.; Alvarez, M.E.P.; Zwick, T. Synthetic Radar Dataset Generator for Macro-Gesture Recognition. IEEE Access 2021, 9, 76576–76584. [Google Scholar] [CrossRef]
  78. Barisic, A.; Petric, F.; Bogdan, S. Sim2Air—Synthetic Aerial Dataset for UAV Monitoring. IEEE Robot. Autom. Lett. 2022, 7, 3757–3764. [Google Scholar] [CrossRef]
  79. Riesener, M.; Schukat, E.; Curiel-Ramirez, L.A.; Schäfer, N.; Valdez-Heras, A. Development of a VR-Based Digital Factory Planning Platform for Dynamic 3D Layout Editing. Procedia CIRP 2024, 130, 419–426. [Google Scholar] [CrossRef]
  80. Reutov, I.; Moskvin, D.; Voronova, A.; Venediktov, M. Generating Synthetic Data To Solve Industrial Control Problems By Modeling A Belt Conveyor. Procedia Comput. Sci. 2022, 212, 264–274. [Google Scholar] [CrossRef]
  81. Cordeiro, A.; Souza, J.P.; Costa, C.M.; Filipe, V.; Rocha, L.F.; Silva, M.F. Bin Picking for Ship-Building Logistics Using Perception and Grasping Systems. Robotics 2023, 12, 15. [Google Scholar] [CrossRef]
  82. Broekman, A.; Gräbe, P.J. PASMVS: A Perfectly Accurate, Synthetic, Path-Traced Dataset Featuring Specular Material Properties for Multi-View Stereopsis Training and Reconstruction Applications. Data Brief. 2020, 32, 106219. [Google Scholar] [CrossRef] [PubMed]
  83. Marelli, D.; Bianco, S.; Ciocca, G. IVL-SYNTHSFM-v2: A Synthetic Dataset with Exact Ground Truth for the Evaluation of 3D Reconstruction Pipelines. Data Brief. 2020, 29, 105041. [Google Scholar] [CrossRef]
  84. Karoly, A.I.; Galambos, P. Automated Dataset Generation with Blender for Deep Learning-Based Object Segmentation. In Proceedings of the 2022 IEEE 20th Jubilee World Symposium on Applied Machine Intelligence and Informatics (SAMI), Poprad, Slovakia, 2–5 March 2022; pp. 329–334. [Google Scholar] [CrossRef]
  85. Hao, X.; Funt, B. A Multi-illuminant Synthetic Image Test Set. Color. Res. Appl. 2020, 45, 1055–1066. [Google Scholar] [CrossRef]
  86. Perri, D.; Simonetti, M.; Gervasi, O. Synthetic Data Generation to Speed-Up the Object Recognition Pipeline. Electronics 2022, 11, 2. [Google Scholar] [CrossRef]
  87. Adams, J.; Sutor, J.; Dodd, A.; Murphy, E. Evaluating the Performance of Synthetic Visual Data for Real-Time Object Detection. In Proceedings of the 2021 6th International Conference on Communication, Image and Signal Processing (CCISP), Virtual, 19–21 November 2021; pp. 167–171. [Google Scholar] [CrossRef]
  88. Zhang, M.; Zhou, X.; Xiang, N.; He, Y.; Pan, Z. Expression Sequences Generator for Synthetic Emotion. J. Multimodal User Interfaces 2012, 5, 19–25. [Google Scholar] [CrossRef]
  89. Wrede, K.; Zarnack, S.; Lange, R.; Donath, O.; Wohlfahrt, T.; Feldmann, U. Curriculum Design and Sim2Real Transfer for Reinforcement Learning in Robotic Dual-Arm Assembly. Machines 2024, 12, 682. [Google Scholar] [CrossRef]
  90. Delussu, R.; Putzu, L.; Fumera, G. Synthetic Data for Video Surveillance Applications of Computer Vision: A Review. Int. J. Comput. Vis. 2024, 132, 4473–4509. [Google Scholar] [CrossRef]
  91. De Roovere, P.; Moonen, S.; Michiels, N.; Wyffels, F. Sim-to-Real Dataset of Industrial Metal Objects. Machines 2024, 12, 99. [Google Scholar] [CrossRef]
  92. Broekman, A.; Van Niekerk, J.O.; Gräbe, P.J. HRSBallast: A High-Resolution Dataset Featuring Scanned Angular, Semi-Angular and Rounded Railway Ballast. Data Brief. 2020, 33, 106471. [Google Scholar] [CrossRef] [PubMed]
  93. Zhang, J.; Qin, X.; Lei, J.; Jia, B.; Li, B.; Li, Z.; Li, H.; Zeng, Y.; Song, J. A Novel Auto-Synthesis Dataset Approach for Fitting Recognition Using Prior Series Data. Sensors 2022, 22, 4364. [Google Scholar] [CrossRef] [PubMed]
  94. Ros, G.; Sellart, L.; Materzynska, J.; Vazquez, D.; Lopez, A.M. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3234–3243. [Google Scholar]
  95. Cabon, Y.; Murray, N.; Humenberger, M. Virtual KITTI 2. Available online: https://arxiv.org/abs/2001.10773v1 (accessed on 4 June 2025).
  96. Richter, S.R.; Vineet, V.; Roth, S.; Koltun, V. Playing for Data: Ground Truth from Computer Games. In Proceedings of the ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 102–118. [Google Scholar]
  97. Johnson-Roberson, M.; Barto, C.; Mehta, R.; Sridhar, S.N.; Rosaen, K.; Vasudevan, R. Driving in the Matrix: Can Virtual Worlds Replace Human-Generated Annotations for Real World Tasks? In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 746–753. [Google Scholar]
  98. Gaidon, A.; Wang, Q.; Cabon, Y.; Vig, E. VirtualWorlds as Proxy for Multi-Object Tracking Analysis. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4340–4349. [Google Scholar]
  99. Peng, X.; Bai, Q.; Xia, X.; Huang, Z.; Saenko, K.; Wang, B. Moment Matching for Multi-Source Domain Adaptation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1406–1415. [Google Scholar]
  100. Mayer, N.; Ilg, E.; Häusser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4040–4048. [Google Scholar]
  101. Butler, D.J.; Wulff, J.; Stanley, G.B.; Black, M.J. A Naturalistic Open Source Movie for Optical Flow Evaluation. In Computer Vision—ECCV 2012; Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7577; pp. 611–625. ISBN 978-3-642-33782-6. [Google Scholar]
  102. Arras, L.; Osman, A.; Samek, W. CLEVR-XAI: A Benchmark Dataset for the Ground Truth Evaluation of Neural Network Explanations. Inf. Fusion. 2022, 81, 14–40. [Google Scholar] [CrossRef]
  103. Johnson, J.; Hariharan, B.; van der Maaten, L.; Fei-Fei, L.; Zitnick, C.L.; Girshick, R. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 1988–1997. [Google Scholar]
  104. Dosovitskiy, A.; Fischer, P.; Ilg, E.; Häusser, P.; Hazirbas, C.; Golkov, V.; van der Smagt, P.; Cremers, D.; Brox, T. FlowNet: Learning Optical Flow with Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2758–2766. [Google Scholar]
  105. Bakurov, I.; Buzzelli, M.; Schettini, R.; Castelli, M.; Vanneschi, L. Structural Similarity Index (SSIM) Revisited: A Data-Driven Approach. Expert. Syst. Appl. 2022, 189, 116087. [Google Scholar] [CrossRef]
  106. Nilsson, J.; Akenine-Möller, T. Understanding SSIM. arXiv 2020, arXiv:2006.13846. [Google Scholar]
  107. Gama, J. A Survey on Learning from Data Streams: Current and Future Trends. Prog. Artif. Intell. 2012, 1, 45–55. [Google Scholar] [CrossRef]
  108. Elbasheer, M.; D’Augusta, V.; Mirabelli, G.; Solina, V.; Talarico, S. Leveraging Auto-Generative Simulation for Decision Support in Engineer-to-Order Manufacturing. Procedia Comput. Sci. 2024, 232, 1319–1328. [Google Scholar] [CrossRef]
Figure 1. Fundamental working principle of AI.
Figure 2. Traits of AI.
Figure 3. Components of AI.
Figure 4. Generative AI as a subfield of AI.
Figure 5. Different types of AI.
Figure 6. The underlying systems of Generative AI.
Figure 7. Generative AI impact on manufacturing performance metrics.
Figure 8. Side view and top view of an intact package on a conveyor belt.
Figure 9. Normal distributions, centered at 0.35 (damaged) and 0.20 (intact), showing the brush-strength variability used during sculpting.
Figure 10. Visualizing sample stroke-path geometries with ten random cubic Bézier curves, within the defined stroke window. (The dashed box is ±5 cm horizontally and ±2.5 cm vertically.)
Figure 11. Stroke-path geometry with damaged vs. intact overlays within the defined stroke window. (The dashed box is ±5 cm horizontally and ±2.5 cm vertically.)
Figure 12. Uniform distributions of jitter for X-offsets (±0.04 m), Y-offsets (±0.07 m), and yaw angles (±20.00°), depicting how packages were randomly translated and rotated on the conveyor.
Figure 13. Discrete uniform distribution of background frame selection.
Figure 14. A normal distribution (μ = 16, σ = 3) for f-stop values, illustrating the camera DOF variation.
Figure 15. Foreground/Background sharpness and blur, where the package and conveyor grid both appear noticeably out of focus.
Figure 16. DOF vs. F-Stop.
Figure 17. Relative variation across pipeline variables (log scale).
Figure 18. Individual vs. cumulative configuration possibilities (log scale).
Figure 19. Differences in rendering quality.
Figure 20. More realistic rendering options.
Figure 21. Alternative rendering qualities.
Figure 22. Relative impacts of quality parameters on cost metrics.
Figure 23. Simplified two-view damage-detection architecture for MobileNetV2.
Figure 24. Different image enhancements applied in Approach II.
Figure 25. Applying image enhancement techniques in sequence.
Figure 26. Simplified two-view damage-detection architecture for ResNet-18.
Figure 27. Annotated package with side and top views.
Figure 28. Architecture of the pretrained ViT.
Figure 29. Simplified architecture of the GPT o4-mini-high used in package inspection.
Figure 30. The simplified architecture of the YOLOv7 used in package inspection.
Figure 31. Simplified architecture of the YOLOv8 seg used in the package inspection.
Figure 32. Samples of segmented packages, with side view and top view, after CLAHE.
Figure 33. Loss and accuracy curves associated with the training and validation process utilizing the MobileNetV2 in Approach I.
Figure 34. MobileNetV2 PR curves in Approach I.
Figure 35. ROC curve for MobileNetV2 in Approach I.
Figure 36. Confusion matrix for MobileNetV2 in Approach I.
Figure 37. Comparison of the accuracy of Approach II variants (with different enhancements).
Figure 38. Loss and accuracy curves during the training and validation process of the MobileNetV2 with CLAHE in Approach II.
Figure 39. MobileNetV2 with CLAHE PR curves in Approach II.
Figure 40. ROC curve for MobileNetV2 with CLAHE in Approach II.
Figure 41. Confusion matrix for MobileNetV2 with CLAHE in Approach II.
Figure 42. Loss and accuracy curves during the training and validation process of ResNet-18 with image enhancements in Approach III.
Figure 43. ResNet-18 with image enhancement PR curves in Approach III.
Figure 44. ROC curve for ResNet-18 with image enhancements in Approach III.
Figure 45. Confusion matrix for ResNet-18 with image enhancements in Approach III.
Figure 46. Comparison of Approach IV classifiers’ accuracies (both images as input).
Figure 47. PR curves for default LR in Approach IV.
Figure 48. ROC curve for default LR in Approach IV.
Figure 49. Confusion matrix for default LR classifier in Approach IV.
Figure 50. Confusion matrix for the GPT o4-mini-high in Approach V.
Figure 51. Loss curves for YOLO (top view on the left vs. side view on the right) in Approach VI.
Figure 52. PR curves for YOLOv7 in Approach VI.
Figure 53. Comparison of LR classifiers in Approach VII.
Figure 54. PR curve for the default LR classifier in Approach VII.
Figure 55. ROC curve for the default LR classifier in Approach VII.
Figure 56. Confusion matrix for the default LR classifier in Approach VII.
Figure 57. Side-view validation metric curves for YOLOv8-seg.
Figure 58. Top-view validation metric curves for YOLOv8-seg.
Figure 59. Confusion matrices for the training and validation process for the side-view images and the top-view images.
Figure 60. Training and validation loss and accuracy curves for top-view images.
Figure 61. Training and validation loss and accuracy curves for both side- and top-view images.
Figure 62. ROC curves during the testing of MobileNetV2 for both the top-view only and the side- and top-view images.
Figure 63. Confusion matrices for MobileNetV2 for the top-view images and both the side- and top-view images in Approach VIII.
Figure 64. PR curves for the LR classifier in Approach IX.
Figure 65. ROC curves during the testing of ViT + LR for the top-view images and the side- and top-view images.
Figure 66. Confusion matrices of ViT + LR for the top-view images and the side- and top-view images in Approach IX.
Figure 67. Accuracy for configurations in Approach X.
Figure 68. PR curve for 3:1 augmented training and validation with ViT + LR in Approach X.
Figure 69. ROC curve for 3:1 augmented training and validation with ViT + LR in Approach X.
Figure 70. Confusion matrix for 3:1 augmented training and validation with ViT + LR in Approach X.
Figure 71. Accuracy values for all approach variants.
Figure 72. Performance vs. complexity, speed, training time, and computational cost.
Table 1. The structure of the paper.
N | Section | Subsection | Subdivision | Overall Description
1 | Introduction | NA | | Manual inspection and AI-based inspection
2 | Development of AI | NA | | Mechanics of Narrow AI and Generative AI
3 | AI-Enabled Sustainability | Narrow AI: Enhancing Defect Management and Waste Reduction through Data-Driven Insights | NA | How AI enables waste reduction, QC, sustainability-driven production, and environment simulation
| | Generative AI: A Paradigm Shift Towards Proactive Quality Design and Waste Minimization | |
| | Different Forms of Generative AI | 3D Synthesis via Blender |
| | Hybrid Systems | NA |
4 | The Dataset | Pipeline Design of the Dataset | NA | Information, parameters, and specifications regarding the creation of the dataset and its quality
| | General Constraints and Limitations of the Synthetic Images’ Datasets | |
| | Evaluating the Quality of the Dataset | |
| | Synopsis | |
5 | Methodology | Approach I | NA | Detailed description of all the models, algorithms, enhancements, and ensembles being deployed
| | Approach II | Global Histogram Equalization; Contrast-Limited Adaptive Histogram Equalization; Sharpening |
| | Approach III–Approach X | NA |
6 | Results | Results of Approach I–Approach X | NA | Shows the results of all the approaches being deployed in Section 5
7 | Discussion | Performance Measurements | Root-Cause for Misclassifications | Quantitative and qualitative comparison between the different approaches in terms of effectiveness and fidelity
| | Future Work | NA |
| | Limitations | |
8 | Conclusion | NA | | Lessons learned and final thoughts
Table 2. Descriptions of different types of AI.
Narrow AI (ANI), also known as Weak AI or Predictive AI | This category encompasses AI systems designed to perform specific, well-defined tasks within limited domains. These systems operate under pre-programmed rules and lack the ability to generalize beyond their designated functions. Examples include recommendation algorithms, virtual assistants, and automated speech recognition.
AGI (Strong AI) | This level of AI aspires to mimic human cognitive abilities, enabling machines to comprehend, learn, and apply knowledge across multiple domains, without task-specific programming. Unlike Narrow AI, General AI possesses adaptability and can autonomously solve problems, reason, and make decisions in a manner similar to human intelligence. While still theoretical, the development of such systems remains a major goal in AI research [29].
ASI (Super AI) | This hypothetical stage represents the surpassing of human intelligence by AI in all cognitive aspects, including reasoning, problem-solving, creativity, and emotional intelligence. Machines with Super AI would not only execute complex tasks but also possess self-awareness, consciousness, and autonomous decision-making abilities. Although currently a subject of speculation and ethical debate, advancements in AI continue to push the boundaries toward this possibility.
Reactive Machines (RM) | The simplest form of AI, RM function without memory or the ability to learn from past experiences. They operate solely based on real-time inputs, responding to specific stimuli without retaining information for future decision-making. Deep Blue, the IBM chess-playing system, exemplifies this type of AI.
Limited Memory AI | Unlike reactive machines, these AI systems can retain and utilize past information for a limited period to make decisions. They rely on historical data and real-time inputs to improve performance. Autonomous vehicles, which use stored sensor data to navigate dynamic environments, fall into this category.
Theory of Mind AI (ToM AI) | Representing a more advanced stage of AI, this concept envisions machines capable of understanding human emotions, beliefs, and intentions. By recognizing psychological states and adapting their responses accordingly, these systems would enable more natural and meaningful human–AI interactions. Although still in its early stages, research in this area aims to enhance AI’s ability to engage in social reasoning and cooperative tasks.
Self-Aware AI | The most advanced and speculative form of AI, self-aware systems would possess a sense of consciousness and be capable of self-reflection and independent thought. Unlike other AI types, they would not only process information but also understand their own existence, emotions, and objectives. While theoretical, the pursuit of self-aware AI raises profound ethical, philosophical, and safety considerations regarding machine autonomy and decision-making.
Table 3. Comparing Generative AI and Narrow AI.
Technology | Key Features | Advantages | Disadvantages | Use Cases
Narrow AI | Rule-based systems, Limited adaptability | Established frameworks, Predictive accuracy in stable environments | Rigid, Cannot self-improve, Requires extensive data labeling | QC, Supply chain optimization, Demand forecasting
Generative AI | Self-learning, Adaptive algorithms, Enhanced creativity | Can generate novel solutions, Continuous learning, Reduced need for labeled data | Complexity in implementation, Potential for biased outputs, Requires significant computational resources | Product design, Process optimization, QC, Inspection, Predictive maintenance (PdM)
Hybrid AI Systems | Combines Narrow AI with Generative AI, Utilizes DT and NN | Leverages strengths of both technologies, More robust decision-making, Enhanced flexibility | Increased system complexity, Higher initial investment, Difficulties in integration | Smart manufacturing, Automated quality assurance, Adaptive supply chain management
Table 4. Summary of the numerical values of the sculpt-stroke (damage) parameters.
Parameter | Value/Distribution
Stroke Count (Damaged) | 3 strokes
Stroke Count (Intact) | 2 strokes
Stroke Position (X, Y) | X ∈ [−5 cm, +5 cm], Y ∈ [−2.5 cm, +2.5 cm]
Stroke Strength (Damaged) | 𝒩(mean = 0.35, σ = 0.05)
Stroke Strength (Intact) | 𝒩(mean = 0.20, σ = 0.05)
Strength Distribution (σ) | σ = 0.05 (shared)
Stroke Shape (Bézier Control) | Random cubic Bézier control points
Axes of Variability | (1) Stroke count, (2) Strength, (3) Position, (4) Shape
Table 5. Summary of the numerical values of the package placement (pose jitter) parameters.
Parameter | Value/Distribution
X-Offset (Lateral Jitter) | Uniform over ±0.04 m (±4 cm)
Y-Offset (Longitudinal Jitter) | Uniform over ±0.07 m (±7 cm)
Yaw Rotation Jitter | Uniform integer in [−20°, +20°]
Uniform Distribution | Equal probability across each continuous or integer value
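For readers reimplementing the generator, the following minimal sketch (plain NumPy, not the authors' Blender script) samples the stochastic parameters listed in Tables 4 and 5 before they would be handed to the 3D renderer; the random seed and the Bézier control-point range are illustrative assumptions.

```python
# Illustrative sketch: sampling the damage and pose parameters from Tables 4 and 5.
import numpy as np

rng = np.random.default_rng(seed=42)  # seed is an assumption

def sample_package(damaged: bool) -> dict:
    n_strokes = 3 if damaged else 2                       # stroke count (Table 4)
    mean_strength = 0.35 if damaged else 0.20             # N(mean, sigma = 0.05)
    strokes = []
    for _ in range(n_strokes):
        strokes.append({
            "x_cm": rng.uniform(-5.0, 5.0),               # stroke position X
            "y_cm": rng.uniform(-2.5, 2.5),               # stroke position Y
            "strength": rng.normal(mean_strength, 0.05),  # stroke depth
            # four random control points of a cubic Bezier defining the stroke shape
            "bezier_ctrl": rng.uniform(-1.0, 1.0, size=(4, 2)).tolist(),
        })
    pose = {
        "x_offset_m": rng.uniform(-0.04, 0.04),           # lateral jitter (Table 5)
        "y_offset_m": rng.uniform(-0.07, 0.07),           # longitudinal jitter
        "yaw_deg": int(rng.integers(-20, 21)),            # integer yaw jitter
    }
    return {"damaged": damaged, "strokes": strokes, "pose": pose}

# 200 package specifications, alternating damaged and intact
dataset_spec = [sample_package(damaged=i % 2 == 0) for i in range(200)]
```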
Table 8. FID scores and the descriptive statistical values for SSIM and Brightness.
Inter-Category Comparison | Interpretation Based on FID Scores | FID Score
Top (Intact vs. Damage) | Good (Noticeable Diff) * | 11.77
Side (Intact vs. Damage) | | 23.59
Intact (Top vs. Side) | Poor/Very Different * | 362.52
Top Intact vs. Side Damage | | 362.35
Top Damage vs. Side Intact | | 365.1
Damage (Top vs. Side) | | 364.52
Inter-Category Comparison | Interpretation Based on SSIM Scores | Mean | Std | Range
Top (Intact vs. Damage) | Moderate Similarity * | 0.268 | 0.215 | 0.068–0.881
Side (Intact vs. Damage) | | 0.323 | 0.153 | 0.162–0.807
Intact (Top vs. Side) | Very Low Similarity * | 0.002 * | 0.004 * | −0.019
Top Intact vs. Side Damage | | | | −0.022
Top Damage vs. Side Intact | | | | −0.021
Damage (Top vs. Side) | | | | −0.024
Category | FID vs. SSIM Agreement | NA *
Top Intact | Strong (Both Indicate Similarity) * |
Top Damage | |
Side Intact | |
Side Damage | |
Category | Internal Consistency Based on SSIM Scores | Mean | Std | Range
Top Intact | Very Good Consistency/High | 0.306 | 0.237 | 0.066–0.877
Top Damage | Good Consistency/Medium | 0.291 | 0.232 | 0.067–0.914
Side Intact | Excellent (Low Variation)/High * | 0.317 | 0.166 | 0.161–0.801
Side Damage | | 0.326 | 0.171 | 0.183–0.812
Category | Technical Quality Based on Brightness Stats | Mean | Std | Range
Top Intact | Good (Consistent) * | 137.1 * | 43.8 | 93.3–181.0
Top Damage | | | 43.8 | 93.4–180.9
Side Intact | Excellent (Very Consistent) * | 133.4 | 30.5 | 102.8–163.9
Side Damage | | 133.7 | 30.4 | 103.3–164.1
Category | Number of Instances | Features | NA *
Top Intact | 100 * | 2048 * |
Top Damage | | |
Side Intact | | |
Side Damage | | |
* Two or more cells merged because they have the same values.
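As a companion to Table 8, the snippet below is a hedged sketch of how the per-category SSIM and brightness statistics can be computed with scikit-image; the folder layout (e.g., renders/top_intact/) and the 256 × 256 working resolution are assumptions, and the FID scores would additionally require 2048-dimensional InceptionV3 features, which are omitted here.

```python
# Sketch of pairwise SSIM and brightness statistics for one image category.
import glob
from itertools import combinations

import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim

def load_gray(path, size=(256, 256)):
    # grayscale, fixed-size arrays so all pairs are directly comparable
    return np.asarray(Image.open(path).convert("L").resize(size))

def category_stats(pattern):
    imgs = [load_gray(p) for p in sorted(glob.glob(pattern))]
    scores = [ssim(a, b, data_range=255) for a, b in combinations(imgs, 2)]
    brightness = [im.mean() for im in imgs]
    return {
        "ssim_mean": float(np.mean(scores)),
        "ssim_std": float(np.std(scores)),
        "ssim_range": (float(np.min(scores)), float(np.max(scores))),
        "brightness_mean": float(np.mean(brightness)),
        "brightness_std": float(np.std(brightness)),
    }

print(category_stats("renders/top_intact/*.png"))  # hypothetical folder name
```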
Table 9. Detailed summary of all the approaches used in the package inspection process.
Variants | Approaches I–X
Process/Model/Algorithm | I | II | III | IV | V | VI | VII | VIII | IX | X
Polygon mask annotation in COCO format x X
Bounding Box (BBox) annotation in COCO formatxX
Polygon mask annotation in YOLO format xxx
Bounding Box (BBox) annotation in YOLO formatxxxx
Training set augmentation x x
Validation set augmentationx
Training set augmentation ratio (1:1) x
Training/Validation set augmentation ratio (3:1)x
Test-time augmentation (TTA) x
Stratified split (50/30/20) of dataxxxxXxxx
Stratified split (80/20) of data x
Deblurring of blurry imagesX
Transfer learningxxxx xXxxx
MobileNetV2xxxx
Both top- and side-view images were fed as inputxxxxxxXxxx
Only top-view images were fed as inputxxxXxx
Sharpening x
Global histogram equalization (GHE)x
Multiscale Retinex (MSR) x
Contrast-limited adaptive histogram equalization (CLAHE)xxxXxxx
Fast non-local means (NLM) denoising for colored images x
Fast super-resolution convolutional neural network (FSRCNN)x
Mean–variance normalization x
ResNet-18x
Region of interest (ROI) extraction x X
Visual Transformer (ViT)xXxx
Logistic regression (LR) x X xx
SVMx
RF x
GPT o4-mini-highx
You Only Look Once (YOLO)/ YOLOv7 x
YOLOv8 Segmentation/ YOLOv8-seg xxx
Table 10. Pipeline description for each approach.
# | Approach Variants | Views
I | MobileNetV2 | Both; Top
II | MobileNetV2 + CLAHE | Top
III | ResNet18 + MSR + CLAHE + fast NLM denoising + FSRCNN + Normalization | Both
IV | ROI + ViT + LR | Both
V | GPT o4-mini-high | Both; Top
VI | YOLOv7 | Both
VII | ROI + Deblur + CLAHE + ViT + LR | Both; Top
VIII | YOLOv8-seg + CLAHE + MobileNetV2 | Both; Top
IX | YOLOv8-seg + CLAHE + ViT + LR | Both; Top
X | YOLOv8-seg + CLAHE + Augmentation + ViT + LR | Both
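Several pipelines in Table 10 share the CLAHE enhancement step; the sketch below shows one common way to apply it to color frames with OpenCV by equalizing only the lightness channel. The clip limit, tile size, and file name are assumptions rather than the parameters used in the study.

```python
# Illustrative CLAHE preprocessing step (shared by several pipelines in Table 10).
import cv2

def apply_clahe_color(bgr_image, clip_limit=2.0, tile_grid=(8, 8)):
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)   # operate on lightness only
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    l_eq = clahe.apply(l)                              # local contrast enhancement
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)

enhanced = apply_clahe_color(cv2.imread("package_top.png"))  # hypothetical file
```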
Table 11. Parameter comparison between MobileNetV2 and ResNet-18.
Feature | ResNet-18 (PyTorch, Dual-View) | MobileNetV2 (TensorFlow, Dual-View)
Architecture | ResNet-18 (×2, one for each view) | MobileNetV2 (×2, one for each view)
Pretraining | ImageNet | ImageNet
Input Size | 224 × 224 × 3 | 128 × 128 × 3
Parameter Count | ~11.7 million per ResNet-18 | ~3.5 million per MobileNetV2
Fusion | Concatenate 512-dim features from each ResNet | Concatenate pooled features from each MobileNetV2
Classifier Head | Linear(1024→256) → ReLU → Dropout → Linear(256→2) | Dense(128, relu) → Dropout → Dense(1, sigmoid)
Trainable Layers | Only layer3, layer4, and classifier head | All layers frozen except dense head
Framework | PyTorch | TensorFlow/Keras
Loss Function | CrossEntropyLoss | Binary Crossentropy
Optimizer | Adam | Adam
Batch Size | 16 | 32
Epochs | 15 | 15
Augmentation | Resize, flip, rotation, color jitter | Resize, normalization (optionally, enhancement)
Use Case | Fine-tuned, robust, slightly heavier | Lightweight, fast, efficient
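The dual-view ResNet-18 column of Table 11 translates into roughly the following PyTorch wiring. This is a sketch under the table's stated settings (ImageNet weights, 512-dim features per view, Linear(1024→256) → ReLU → Dropout → Linear(256→2), only layer3/layer4 and the head trainable); the dropout rate is an assumption because it is not specified.

```python
# Sketch of the dual-view ResNet-18 configuration summarized in Table 11.
import torch
import torch.nn as nn
from torchvision import models

class DualViewResNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.top = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        self.side = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        self.top.fc = nn.Identity()    # expose the 512-dim pooled features
        self.side.fc = nn.Identity()
        # freeze everything except layer3 and layer4 (the head below stays trainable)
        for backbone in (self.top, self.side):
            for name, param in backbone.named_parameters():
                param.requires_grad = name.startswith(("layer3", "layer4"))
        self.head = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Dropout(0.5),               # dropout rate assumed
            nn.Linear(256, num_classes),
        )

    def forward(self, top_img, side_img):
        feats = torch.cat([self.top(top_img), self.side(side_img)], dim=1)  # 512 + 512
        return self.head(feats)

model = DualViewResNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()))
```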
Table 12. Tuning parameters for different classifiers, using GridSearchCV.
Model | Parameters
RF | A tuned RF with max_depth = 20 and n_estimators = 200: an ensemble of two hundred DTs, each restricted to a maximum depth of twenty splits. By limiting tree depth, the model prevents any single tree from growing overly complex and overfitting to noise, while still allowing sufficient hierarchy to capture nonlinear interactions among features. With two hundred trees voting on each prediction, the ensemble mitigates the high variance typical of individual trees, yielding classification boundaries that are more stable and robust for intact versus damaged packages. The combination of a moderately deep structure and a large number of trees strikes a balance between bias and variance, ensuring that the RF generalizes well without sacrificing its capacity to model subtle patterns in the feature set.
SVM | A tuned SVM with a Radial Basis Function (RBF) kernel and C = 10. The classifier uses the RBF (Gaussian) kernel to project input feature vectors into a higher-dimensional space where they can be separated by a nonlinear boundary. The kernel's Gaussian shape measures similarity between points based on their Euclidean distance, allowing the model to capture complex patterns in the data. The regularization parameter C = 10 sets the trade-off between maximizing the margin around the decision boundary and minimizing misclassification errors: a larger C places greater emphasis on correctly classifying training examples (at the risk of a smaller margin and potential overfitting), while a smaller C would allow more margin violations to achieve a smoother boundary. By tuning C to 10, the SVM is configured to penalize training errors relatively strongly, helping it to fit the subtle distinctions between intact and damaged package features extracted by the upstream feature extractor.
LR (tuned) | The tuned LR model employs L2-regularized maximum-likelihood estimation with a penalty strength inversely proportional to C = 10, meaning the algorithm imposes a relatively mild regularization that allows the model coefficients to adapt more freely to the training data. By selecting the liblinear solver, an efficient coordinate-descent implementation optimized for smaller datasets and L2 penalties, the training process iteratively updates each coefficient while holding the others fixed, ensuring rapid convergence even when feature dimensionality is moderate. This configuration strikes a balance between underfitting and overfitting: the higher C value prioritizes reducing classification errors by permitting larger weights on informative features, while the liblinear backend delivers stable solutions and straightforward interpretability of the decision boundary separating intact from damaged package instances.
LR (default) | Default parameters (including L2 regularization, C = 1.0, solver = ‘lbfgs’, max_iter = 100).
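A hedged sketch of the GridSearchCV tuning summarized in Table 12 is given below. Only the selected values (RF: max_depth = 20, n_estimators = 200; SVM: RBF kernel, C = 10; LR: L2, C = 10, liblinear) come from the table; the candidate grids, cross-validation settings, and the assumption that the inputs are extracted feature vectors are illustrative.

```python
# Sketch of classifier tuning with GridSearchCV (candidate grids are illustrative).
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

search_spaces = {
    "rf": (RandomForestClassifier(random_state=0),
           {"n_estimators": [100, 200, 400], "max_depth": [10, 20, None]}),
    "svm": (SVC(kernel="rbf"),
            {"C": [0.1, 1, 10, 100], "gamma": ["scale", "auto"]}),
    "lr": (LogisticRegression(max_iter=1000),
           {"C": [0.1, 1, 10], "solver": ["liblinear"], "penalty": ["l2"]}),
}

def tune(name, X_train, y_train):
    estimator, grid = search_spaces[name]
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy", n_jobs=-1)
    search.fit(X_train, y_train)   # X_train: extracted image feature vectors
    return search.best_estimator_, search.best_params_
```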
Table 13. The different parameters for LR in Approach VII.
Top view | GridSearchCV-suggested parameters: The logistic-regression model was tuned to use an L1 penalty (penalty = ‘l1’) with the saga solver and a high regularization constant (C = 100). By choosing L1 regularization, the model encourages sparsity in its weight vector, effectively performing built-in feature selection by driving many coefficients to zero, while the saga optimizer is one of the few solvers in scikit-learn that supports L1 penalties. Setting C = 100 (the inverse of the regularization strength) minimizes the shrinkage effect, allowing the most informative features to retain substantial weight while still benefiting from the robustness and interpretability that sparse solutions provide. This configuration strikes a balance between expressive power (through a large C) and parsimony (through L1).
Both views | Same as in the above cell describing the top-view-only inspection.
Top view | Default parameters (including L2 regularization, C = 1.0, solver = ‘lbfgs’, max_iter = 100).
Both views | Default parameters (including L2 regularization, C = 1.0, solver = ‘lbfgs’, max_iter = 100).
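For the tuned Approach VII configuration in Table 13, the corresponding scikit-learn call might look as follows; the feature-scaling step and the iteration budget are assumptions added to help the saga solver converge on ViT feature vectors.

```python
# Sketch of the tuned L1 logistic regression from Table 13 (saga solver, C = 100).
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

sparse_lr = make_pipeline(
    StandardScaler(),   # scaling assumed; saga converges faster on scaled features
    LogisticRegression(penalty="l1", solver="saga", C=100, max_iter=5000),
)
# sparse_lr.fit(train_features, train_labels)   # hypothetical ViT embeddings of ROI crops
```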
Table 14. Augmentation variations implemented in Approach X.
Augmentation Variation | Details Following the YOLOv8-seg + CLAHE
A | ViT + LR, using the segmented enhanced dataset with a 3:1 augmented training set, a 3:1 augmented validation set, and a non-augmented (1:1) test set.
B | ViT + LR, using the segmented enhanced dataset with a 3:1 augmented training set, a 3:1 augmented validation set, and an 8:1 TTA test set.
C | MobileNetV2, using the segmented enhanced dataset with a 3:1 augmented training set, a 3:1 augmented validation set, and a non-augmented (1:1) test set.
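Variant B in Table 14 replaces the plain test set with an 8:1 test-time-augmentation (TTA) set. A minimal sketch of that scoring step is shown below; the specific rotations and mirroring used to build the eight views, and the embed()/classifier callables, are assumptions rather than the exact recipe used in the study.

```python
# Illustrative 8:1 TTA scoring: average the predicted damage probability over
# eight augmented copies of each test image.
import numpy as np
from PIL import Image, ImageOps

def tta_variants(img: Image.Image):
    rotations = [img.rotate(angle) for angle in (0, 5, -5, 10)]   # assumed angles
    mirrored = [ImageOps.mirror(im) for im in rotations]
    return rotations + mirrored                                   # 8 views per image

def tta_predict(img, embed, classifier):
    # embed(): hypothetical ViT feature extractor; classifier: fitted LR
    feats = np.stack([embed(view) for view in tta_variants(img)])
    return classifier.predict_proba(feats)[:, 1].mean()           # mean damage probability
```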
Table 15. Performance measurements obtained for all classes and views in different variants.
Approach Variants | Views | Class | Accuracy | Precision | Sensitivity | F1 Score | Specificity
MobileNetV2 | Both | Intact | 45.00% | 65.00% | 46.43% | 54.17% | 41.67%
MobileNetV2 | Both | Damaged | 45.00% | 25.00% | 41.67% | 31.25% | 46.43%
MobileNetV2 | Top | Intact | 50.00% | 5.00% | 50.00% | 9.09% | 50.00%
MobileNetV2 | Top | Damaged | 50.00% | 95.00% | 50.00% | 65.52% | 50.00%
MobileNetV2 + CLAHE | Top | Intact | 67.50% | 81.82% | 45.00% | 58.06% | 90.00%
MobileNetV2 + CLAHE | Top | Damaged | 67.50% | 62.07% | 90.00% | 73.47% | 45.00%
ResNet18 + MSR + CLAHE + fast NLM denoising + FSRCNN + Normalization | Both | Intact | 77.50% | 73.91% | 85.00% | 79.07% | 70.00%
ResNet18 + MSR + CLAHE + fast NLM denoising + FSRCNN + Normalization | Both | Damaged | 77.50% | 82.35% | 70.00% | 75.68% | 85.00%
ROI + ViT + LR | Both | Intact | 92.50% | 100.00% | 85.00% | 91.89% | 100.00%
ROI + ViT + LR | Both | Damaged | 92.50% | 86.96% | 100.00% | 93.02% | 85.00%
GPT o4-mini-high | Both | Intact | 73.50% | 71.56% | 78.00% | 74.64% | 69.00%
GPT o4-mini-high | Both | Damaged | 73.50% | 75.82% | 69.00% | 72.25% | 78.00%
GPT o4-mini-high | Top | Intact | 60.50% | 58.82% | 70.00% | 63.93% | 51.00%
GPT o4-mini-high | Top | Damaged | 60.50% | 62.96% | 51.00% | 56.35% | 70.00%
YOLOv7 | Both | Intact | 69.00% | 88.00% | 44.00% | 58.67% | 94.00%
YOLOv7 | Both | Damaged | 69.00% | 62.67% | 94.00% | 75.20% | 44.00%
ROI + Deblur + CLAHE + ViT + LR | Both | Intact | 90.00% | 83.33% | 100.00% | 90.91% | 80.00%
ROI + Deblur + CLAHE + ViT + LR | Both | Damaged | 90.00% | 100.00% | 80.00% | 88.89% | 100.00%
ROI + Deblur + CLAHE + ViT + LR | Top | Intact | 70.00% | 70.00% | 70.00% | 70.00% | 70.00%
ROI + Deblur + CLAHE + ViT + LR | Top | Damaged | 70.00% | 70.00% | 70.00% | 70.00% | 70.00%
YOLOv8-seg + CLAHE + MobileNetV2 | Both | Intact | 82.50% | 80.95% | 85.00% | 82.93% | 80.00%
YOLOv8-seg + CLAHE + MobileNetV2 | Both | Damaged | 82.50% | 84.21% | 80.00% | 82.05% | 85.00%
YOLOv8-seg + CLAHE + MobileNetV2 | Top | Intact | 55.00% | 55.00% | 55.00% | 55.00% | 55.00%
YOLOv8-seg + CLAHE + MobileNetV2 | Top | Damaged | 55.00% | 55.00% | 55.00% | 55.00% | 55.00%
YOLOv8-seg + CLAHE + ViT + LR | Both | Intact | 92.50% | 94.74% | 90.00% | 92.31% | 95.00%
YOLOv8-seg + CLAHE + ViT + LR | Both | Damaged | 92.50% | 90.48% | 95.00% | 92.68% | 90.00%
YOLOv8-seg + CLAHE + ViT + LR | Top | Intact | 67.50% | 68.42% | 65.00% | 66.67% | 70.00%
YOLOv8-seg + CLAHE + ViT + LR | Top | Damaged | 67.50% | 66.67% | 70.00% | 68.29% | 65.00%
YOLOv8-seg + CLAHE + Augmentation + ViT + LR | Both | Intact | 90.00% | 90.00% | 90.00% | 90.00% | 90.00%
YOLOv8-seg + CLAHE + Augmentation + ViT + LR | Both | Damaged | 90.00% | 90.00% | 90.00% | 90.00% | 90.00%
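The per-class figures in Table 15 follow from the binary confusion matrix of each variant. The short sketch below shows that computation; the worked example reproduces the Damaged-class row of the YOLOv8-seg + CLAHE + ViT + LR (Both views) variant under the assumption of a 20-intact/20-damaged test split, which is consistent with the reported percentages but not stated explicitly here.

```python
# Deriving the Table 15 metrics from a binary confusion matrix,
# treating "damaged" as the positive class.
def class_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0      # recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "sensitivity": sensitivity, "f1": f1, "specificity": specificity}

# Example (assumed 20 + 20 test images): 19 damaged and 18 intact classified correctly
# gives 92.50% accuracy, 90.48% precision, 95.00% sensitivity, 92.68% F1, 90.00% specificity.
print(class_metrics(tp=19, fp=2, fn=1, tn=18))
```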
Table 16. Root causes for misclassifications.
Cause | Breakdown
Subtle or Localized Defects | Small dents, creases, or punctures can occupy only a few pixels in a 128 × 128 or 224 × 224 crop. If a model's receptive field or feature-extraction layers are tuned to more global textures (e.g., overall box color or printed logos), it may "miss" these tiny irregularities.
The "intact" packages were given two light sculpt strokes (𝒩(0.20, 0.05)) while "damaged" ones received three deeper strokes (𝒩(0.35, 0.05)). However, if a CNN's filters are not sensitive to such a difference in depth, or if the lighting in the rendering hides shallow dents, even a ResNet18 or MobileNetV2 may not pick up the nuance.
View-Specific Occlusions and Shadowing | A top-only view can hide side-facing punctures or crushed corners. Conversely, a side-only view can miss dents on the top face. In our earlier experiments, using both views simultaneously often boosted recall, whereas top-only sometimes dropped to ~50%.
Harsh shading or low-contrast regions (especially in rendering settings with limited ray bounces) can wash out small deformations. Without preprocessing (e.g., MSR, CLAHE), a classifier may confuse a shadow-induced dark patch for damage or simply ignore a low-contrast dent.
Insufficient Variation in Training Data | The synthetic dataset had 200 pairs of top + side images. Even with random spatial jitter (±0.04 m, ±0.07 m) and yaw rotation (±20°), the total coverage of possible real-world angles, box materials, and crease patterns remains limited.
Models like ViT or GPT o4-mini-high, pretrained on open-domain images, need enough in-domain examples to learn "what a dent looks like on a package." If the network has never seen a subtle crease under a particular type of lighting, it cannot reliably generalize in some scenarios.
Blur and Image Quality Degradation | Some rendered frames might be blurred (e.g., from DoF settings around f/16 ± 3). If the damage occupies just a few pixels, blur can obliterate it entirely.
Diffusion-based classifiers (GPT o4-mini-high) often expect crisp, high-resolution detail. It is possible that any motion blur, low-lighting noise, or compression artifact can derail their latent-space feature extraction, causing them to focus on color/layout instead of surface texture.
Over-Reliance on Background or Framing Cues | A YOLO detector trained on COCO-style polygon masks might inadvertently learn "where the box usually sits" rather than "what damage looks like." If the box center shifts unpredictably (±4 cm laterally, ±7 cm longitudinally), YOLO's bounding-box proposals may become unstable, and it might classify an undamaged box as damaged simply because it is slightly off-center.
Similarly, ViT's patch embeddings can latch onto consistent background patterns (e.g., conveyor-belt texture) instead of focusing on subtle surface defects, especially if the damage patch covers only 1–2 patches in a 14 × 14 grid.
Model Capacity vs. Overfitting | A very large model such as GPT o4-mini-high can overfit the small synthetic dataset unless heavily regularized. During fine-tuning, it may learn to memorize specific rendering artifacts, such as a consistent highlight on a dented edge, rather than generalizable damage features.
A lightweight CNN (MobileNetV2) may not have enough capacity to capture all variations of damage under all lighting/pose combinations, leading to underfitting for more subtle defects.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
