1. Introduction
Additive manufacturing is in transition from stand-alone prototype production to widespread industrial application for the manufacture of small series of sophisticated functional components. This applies in principle to all additive manufacturing processes, and in particular to the printing of thermoplastics by strand extrusion, commonly referred to as fused filament fabrication (FFF) or fused deposition modeling (FDM). The transformation towards widespread industrial utilization is accompanied by continuous technological improvements addressing inherent process challenges, such as the insufficient dimensional stability of components [1], poor layer adhesion [2], inadequate extrusion [3], material breakage as a result of moisture [4], and poor print quality due to insufficient melting of the material [5].
In this context, the ultimate objective of technical monitoring concepts is to identify emerging defects at an early stage so that the printing of defective parts can be stopped as soon as possible, thus keeping excess costs due to process failures to a minimum. Furthermore, in the long term, print monitoring should also enable the optimization of printing processes and, via process control, ideally also provide adaptive processes that can repair prints in the event of a defect.
Optical Coherence Tomography (OCT) is one of the few high-resolution imaging techniques that provide volumetric information about transparent or semi-transparent objects and are suitable for direct integration into production hardware for in-line monitoring. In contrast to well-established Computed Tomography (CT), OCT uses low-energy radiation, making it a very attractive sensor technology for monitoring tasks in running production processes. OCT was originally developed as a diagnostic tool for ophthalmology [6]. Since then, it has become a routine diagnostic tool in this field and there is growing interest in adopting it for dermatology and stomatology [7]. The application of OCT in material diagnostics is also increasing [8]. OCT utilizes a short-coherence light source. As the method is based on Michelson interferometry, the light is split into two beams: a reference arm and a sample arm. The sample arm is focused on the object under investigation. Depending on the optical properties of the sample, the focused light can propagate through the material while being partially scattered at the surface and at local inhomogeneities. These inhomogeneities include variations in material density and optical properties, contamination, pores, delamination, and other defects.
In this study, raw tomographic data from OCT measurements on additively manufactured samples, specifically FFF prints, are analyzed using deep learning methods of computer vision in order to assess the internal material quality of the section shown in each tomogram, in preparation for prospective automated in-line monitoring of FFF printing. Convolutional neural networks (CNN) are used for automatic feature extraction and the classification of tomographic cross-sections. The tomographic information is acquired using a commercially available OCT system, which is well suited in terms of its dimensions and mass for capturing the condition of the interior of the printed material volume directly at the material fusion location during a running printing operation. The concept envisages continuously scanning through the top layer, i.e., the currently printed layer, at least deep enough to record information from the interface between superimposed printed layers. The junction between printed layers is of significant interest for the structural strength of printed components, so in-line condition information from this junction area is key to the effective monitoring of component quality.
2. State of the Art
In additive manufacturing, common real-time inspection techniques focus on optical inspection, thermal imaging, and acoustic monitoring [9]. These technologies are widely adopted due to their efficiency, ease of integration, and applicability to various materials and processes.
Approaches to data-based monitoring in additive manufacturing typically leverage advanced processing and analysis methods, ranging from feature extraction from sources such as melt pool images or heat maps to statistical methods that detect deviations from reference conditions. Machine learning (ML) is playing an increasingly important role in this context. Supervised models classify defects or predict porosity [10,11,12], unsupervised models detect anomalies [13,14,15], and physics-based approaches combine simulations with sensor data [16], while digital twins provide virtual real-time replicas of the process [17]. Monitoring strategies are either ex situ, involving CT or X-ray scans [18], surface profilometry [19], and destructive testing for validation [20], or in situ, incorporating layer-by-layer inspection [21], melt pool tracking [22], and adaptive feedback [23]. Major challenges include managing the vast amounts of data generated during printing, distinguishing between genuine defects and noise, and a lack of standardized frameworks, which complicates integration into closed-loop control systems [24]. Despite these obstacles, applications can be found in aerospace, medicine, and the automotive industry, where reliable monitoring supports certification and quality assurance, as well as in research and development, where processes, structures, and properties need to be linked [25].
Computer vision is a discipline at the interface between computer science and engineering that is dedicated to processing and analyzing camera images in order to understand their content or extract information. Typical tasks of computer vision include object recognition and the determination of geometric structures and movements of objects, using image processing algorithms such as segmentation and pattern recognition methods such as object classification [26,27].
Optical monitoring is frequently used in additive manufacturing as it is non-contact, offers high temporal and spatial resolution, and provides direct insight into defects such as porosity, lack of fusion, or overheating. Different types of sensors can serve complementary roles. High-speed cameras capture detailed images of melt pools but generate enormous amounts of data [25,28], whereas photodiodes provide ultra-fast signals with low memory requirements based on melt pool emissions, but without spatial detail [29,30]. Infrared cameras map thermal fields to evaluate cooling rates and hot spots, but face challenges with emissivity [31,32]. Laser profilometers capture the surface topography of printed layers to enable direct conclusions about material quality and, building on this, predictions regarding final component quality [21,33]. These sensors generate large, complex datasets that require elaborate processing, from image analysis and deep learning for defect classification to signal processing and multi-sensor fusion for accurate and responsive monitoring. Major challenges remain in processing huge amounts of data, reliably assigning signals to fault types, and standardizing procedures. Future trends focus on closed-loop control systems that adjust process parameters (nozzle temperature, bed temperature, feed speed, extrusion speed) directly based on optical feedback [34,35,36].
The working principle of OCT is comparable to that of ultrasound imaging, but with light instead of sound: low-coherence interferometry enables resolution in the µm range and depth-resolved imaging up to a depth of several millimeters in real time [37]. As a high-resolution imaging technique, OCT is increasingly used in a wide range of applications beyond medical imaging, including materials science and, more recently, additive manufacturing [6,38]. OCT provides information about the internal structure of a sample by measuring the coherence of light waves, enabling the creation of three-dimensional images and thus the monitoring of internal alterations in material or tissue [39]. The technique is particularly suitable for examining transparent or translucent materials such as polymers, biological tissue, and thin films, while being non-invasive, delivering real-time images, and offering high resolution [40]. Unlike conventional optical sensors such as cameras, photodiodes, or IR systems, which are limited to surface observations, OCT can visualize features beneath the surface in recently solidified layers, making it extremely valuable for detecting defects, near-surface pores, delaminations, and even surface roughness [41,42,43]. Demonstrated applications include powder bed fusion, where OCT measures layer thickness and detects subsurface cracks [30,44], as well as directed energy deposition, where OCT tracks bead geometry and ensures consistency between layers [45]. Research has shown that OCT can measure melt track depth and correlate it with process parameters, allowing for the prevention of porosity [46]. Its strengths in high-resolution volumetric data acquisition and non-contact operation, with great potential for integrated control, are offset by challenges such as limited penetration depth, line-of-sight requirements, complexity of integration, and the need to manage vast 3D datasets [8]. Nevertheless, future developments in OCT are expected to complement traditional optical inspection by combining surface and subsurface information and feeding data into digital twins, which is particularly valuable in industries such as aerospace, electronics, optics, and medical devices, where internal defects are deemed unacceptable [8,47,48,49].
3. Materials and Methods
Tomographic cross-sectional OCT images pass through a processing pipeline consisting of labeling, preprocessing, model training, and evaluation (see Figure 1). Labeling follows a statistical approach employing sliding window thresholding of Z-scores, subsequent morphological operations, and thresholding of outlier ratios. The features derived in the labeling approach are not further processed in model training. Data preprocessing handles raw OCT cross-sectional images and involves cropping, normalization, histogram matching, data splitting, and augmentation. The spatial continuity and correlation of B-scan images are taken into account by using block-based partitioning to create the training, validation, and test sets. ResNet-V2 [50] forms the basis of the deep learning model for the classification task, with BottleneckV2 modules applied in the network. To optimize model performance, the width multiplication factor K and the number of bottleneck modules N in the network are tuned.
The model is trained on the raw OCT data to classify the OCT images as ‘good’ or ‘bad’ with regard to the internal material state of the printed volume. The criteria for assessing the model’s performance include accuracy and loss curves, the confusion matrix, precision (‘good’), recall (‘bad’), and the F1 score. Accuracy and loss values are recorded at each training epoch and plotted as curves upon the completion of training. As misclassifying the ‘bad’ category as ‘good’ has a significant impact on the subsequent printing process, special emphasis is placed on recall (‘bad’) and precision (‘good’). A high precision (‘good’) indicates a low rate of misclassifying ‘bad’ samples as ‘good’. A high recall (‘bad’) ensures that most ‘bad’ samples are detected, minimizing missed detections. The deep learning models in this study are constructed, trained, and validated using the PyTorch 2.3.0 library, with data preprocessing, including image transformation, normalization, and augmentation, performed with the NumPy and TorchVision libraries. Training and testing are carried out in a Python 3.11 environment on an NVIDIA RTX 3070 GPU.
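As an illustration of the two emphasized metrics, the following minimal sketch computes precision (‘good’), recall (‘bad’), and an F1 score from lists of true and predicted labels. The function names are hypothetical, and reporting the F1 score for the ‘bad’ class is an assumption, since the text does not state which class it refers to.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred):
    """Compute precision('good'), recall('bad') and F1('bad') from label lists."""
    pairs = list(zip(y_true, y_pred))
    good_as_good = sum(t == "good" and p == "good" for t, p in pairs)
    good_as_bad = sum(t == "good" and p == "bad" for t, p in pairs)
    bad_as_bad = sum(t == "bad" and p == "bad" for t, p in pairs)
    bad_as_good = sum(t == "bad" and p == "good" for t, p in pairs)  # missed defects

    # Share of predicted 'good' images that are truly 'good'.
    precision_good = _ratio(good_as_good, good_as_good + bad_as_good)
    # Share of truly 'bad' images that were detected as 'bad'.
    recall_bad = _ratio(bad_as_bad, bad_as_bad + bad_as_good)
    precision_bad = _ratio(bad_as_bad, bad_as_bad + good_as_bad)
    # Assumption: the F1 score is reported for the 'bad' class.
    f1_bad = _ratio(2 * precision_bad * recall_bad, precision_bad + recall_bad)
    return precision_good, recall_bad, f1_bad
```

Both emphasized metrics share the same error term, the number of ‘bad’ images predicted as ‘good’, which is why they are the natural pair to monitor when missed defects are the costly case.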
3.1. Data Acquisition
The experimental setup (see Table 1) for the ex situ capture of process-related tomographic image data is based on printing experiments with a Raise3D Pro3. With a layer height of 0.2 mm, on a square base area of 10 × 10 mm², lines with a total length of 10 mm were printed on top of each other with different numbers of layers using PA12 and PLA. A nozzle diameter of 0.4 mm, generally the most widely used in FFF, was employed. The average nozzle temperature was 205 °C for PLA and 265 °C for PA12. The average build bed temperature was 55 °C.
The data were acquired using a Spectral Domain OCT (SD-OCT) system [51]. A schematic representation of the system is shown in Figure 2, and a detailed description of the data processing chain for SD-OCT and other OCT variants can be found in [52,53]. The basic output unit in a scanning SD-OCT system is the depth profile recorded at a single location, known as an A-scan. As the light beam moves along one axis, multiple A-scans form a B-scan, which represents a virtual cross-sectional image of the sample. A series of B-scans acquired along the other axis creates a tomogram. This terminology is similar to that used in ultrasound imaging.
The test samples were imaged using a commercially available Thorlabs Telesto OCT system operating at a central wavelength of 1300 nm. The axial resolution in air was 6.95 µm, and the optical components provided a lateral resolution of 7 µm. Note that in SD-OCT systems, axial resolution also depends on the refractive index of the imaged material. The maximum field of view was 9 × 9 mm², and the system could acquire A-scans at a rate of 76 kHz.
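The dependence of axial resolution on the refractive index, and the time budget implied by the A-scan rate, can be illustrated with simple arithmetic. The sketch below assumes that the axial resolution in the material equals the value in air divided by the refractive index, and it neglects any scanner overhead between A-scans. The refractive index of 1.5 and the 1000 A-scans per B-scan in the usage lines are purely hypothetical placeholders, not values from this study.

```python
def axial_resolution_in_material(resolution_air_um, refractive_index):
    """Axial resolution inside a medium, assuming resolution_air / n."""
    return resolution_air_um / refractive_index


def bscan_time_ms(num_ascans, ascan_rate_hz=76_000):
    """Ideal acquisition time of one B-scan in milliseconds (no scan overhead)."""
    return 1000.0 * num_ascans / ascan_rate_hz


# Hypothetical example values: n = 1.5 and 1000 A-scans per B-scan.
resolution_um = axial_resolution_in_material(6.95, 1.5)
time_ms = bscan_time_ms(1000)
```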
Data analyzed in this study were acquired ex situ. Samples were placed under the OCT sensor and aligned, and tomograms were recorded at specific locations for structures longer than 10 mm. Raw data were exported as stacks of matrices in TIF format. These could be viewed in Fiji software as images, where gray values corresponded to the signal intensity measured by the OCT system in dB.
OCT images inevitably contain noise, so the recorded images exhibit artifacts and false defect signals caused by the imaging mechanism and/or the sample structure:
Point-like bright spots are artifacts caused by speckle noise or tiny scatterers [54];
Vertical shadows indicate signal loss caused by overlying highly reflective or opaque structures that prevent light from reaching underlying layers [55];
Vertical bright lines are caused by interaction between scan synchronization and periodic sample structures, resulting in local signal duplication or misalignment [55].
Samples consisting of single, double, and triple layers of material were imaged, and the corresponding data stacks were used for further analysis. In total, 8135 images were available for training and testing the model (Table 2). These images are stored in TIF format, with each image accompanied by a TXT file containing A-scan-specific intensity slope values for each B-scan. By processing the TIF files, each B-scan image can be extracted for subsequent analysis. Table 2 lists the image files obtained along with their characteristics: the number of B-scans in each TIF stack, the output format of the B-scans, and the material of the corresponding sample.
3.2. Data Exploration and Data Labeling
The TIF files, shown as ‘C-scan’ in Figure 2c, are the starting point of this study and contain the entire scanned volume. By processing a TIF file, each B-scan image can be derived for subsequent analysis.
Figure 2c and Figure 3a show a B-scan image of an OCT scan exhibiting significant vertical artifacts, as it is a composite of several consecutive A-scans. The slope map is calculated by moving a 10-pixel sliding window along each A-scan (vertical direction, i.e., the columns of the image array). For each B-scan image, a corresponding series of slope values is generated in TXT format. The sliding window is chosen so as to capture local variations around each pixel and to represent its structural features in the B-scan image, especially at object edges and internal defects, which appear as distinct bright areas in the OCT image. These bright areas result from the scattering and reflection of light at defects, and a slope analysis within the sliding window can effectively extract them while avoiding errors caused by noise along the vertical direction. This method identifies signs of structural changes in the images and thus facilitates the distinct localization of outliers.
Figure 3b is based on the slope value file corresponding to the B-scan image in Figure 3a. The slope analysis clearly highlights both the edge information and the internal outlier information. Although the column length is reduced by 9 pixels due to the sliding window operation, this has a negligible effect on the subsequent analysis, because the upper edge area of the B-scan consists mainly of air and not of 3D-printed material. Therefore, deleting this section has no impact on the results of the AM area analysis. The sliding window method remains effective for detecting feature changes in the AM area, ensuring that valid information about structural changes in the central area of the image is prioritized.
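The sliding-window slope computation can be sketched as follows. The text does not specify how the slope within a window is obtained, so this sketch assumes a least-squares linear fit over the 10 pixels of each window; the function name `slope_map` is hypothetical. With a window of 10 pixels, each column shrinks by 9 entries, in line with the reduction described above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def slope_map(bscan, window=10):
    """Least-squares slope of the intensity within a window sliding down each column."""
    img = np.asarray(bscan, dtype=float)
    # Shape (rows - window + 1, cols, window): one window per pixel position.
    windows = sliding_window_view(img, window, axis=0)
    # Centred pixel offsets, so the slope reduces to sum(x * y) / sum(x ** 2).
    x = np.arange(window) - (window - 1) / 2.0
    return windows @ x / np.sum(x ** 2)
```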
Before conducting a Z-score outlier analysis on a file of slope values, it must be verified that the data are approximately normally distributed. According to the Q-Q plot shown in Figure 3c, the central part of the data can reasonably be described by a normal distribution, but there are deviations in the tails, which show extreme values and indicate the presence of outliers. The Z-score can effectively identify these data points that deviate from the normal distribution. Since most of the data follow a normal distribution, using the Z-score to measure the distance from the mean is a sensible choice to reveal both the central tendency of the data and the outliers.
The applied Z-score-based evaluation method standardizes the data and uses ±2 standard deviations as the criterion for determining outliers. The results are shown in Figure 3d, with the red markings indicating identified outliers. These outlier regions essentially correspond to the bright areas at the edges and inside the image, so the method effectively extracts them. A binary outlier matrix is saved as a TXT file for subsequent mask creation.
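A minimal sketch of the Z-score step is given below. It assumes that the mean and standard deviation are computed over all slope values of one B-scan, which the text does not state explicitly; the function name `zscore_outliers` is hypothetical.

```python
import numpy as np


def zscore_outliers(slopes, threshold=2.0):
    """Binary matrix marking slope values more than `threshold` std devs from the mean."""
    values = np.asarray(slopes, dtype=float)
    std = values.std()
    if std == 0:
        # Constant input: no value deviates from the mean.
        return np.zeros(values.shape, dtype=np.uint8)
    z = (values - values.mean()) / std
    return (np.abs(z) > threshold).astype(np.uint8)
```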
Since this study addresses the classification of the internal state of printed volumes, a mask is used to exclude the surface edges and the surrounding air. Initially, the upper three-quarters of the outlier image are selected as the primary processing region, because this part of the image contains the most important information about the target. The lower quarter of the image, on the other hand, contains irregularly distributed outliers that are not suitable for image processing. Therefore, for each image file, the upper 75% is used as the basis for subsequent processing. Morphological dilation and erosion operations are performed to highlight the target area and eliminate noise, as shown in Table 3.
Opening performs erosion first, followed by dilation. Its primary function is to remove small, isolated noise points while preserving larger target areas. Closing performs dilation first, followed by erosion. Its primary function is to fill small voids in the target region and connect disconnected sections. The ‘Parameter’ column of Table 3 lists the parameters for erosion and dilation in the order in which the two operations are applied during Opening or Closing.
Figure 4a–d show the effects of the Opening and Closing operations on the outlier images. Figure 4a,c show the original images, and Figure 4b,d show the processed results. The Opening operation effectively removes the small outliers at both edges, while the Closing operation connects the disrupted edges.
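The two operations can be sketched with NumPy alone. The square structuring element and the kernel size of 3 are assumptions for illustration; the actual parameters are those listed in Table 3. Erosion treats pixels outside the image as foreground so that the image border itself is not eroded, which is also an assumption of this sketch.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def _windows(mask, size, pad_value):
    """All size x size neighbourhoods of a binary mask (size must be odd)."""
    if size % 2 == 0:
        raise ValueError("size must be odd")
    padded = np.pad(mask, size // 2, mode="constant", constant_values=pad_value)
    return sliding_window_view(padded, (size, size))


def dilate(mask, size=3):
    """A pixel becomes 1 if any pixel in its neighbourhood is 1."""
    return _windows(np.asarray(mask, dtype=np.uint8), size, 0).max(axis=(2, 3))


def erode(mask, size=3):
    """A pixel stays 1 only if every pixel in its neighbourhood is 1."""
    return _windows(np.asarray(mask, dtype=np.uint8), size, 1).min(axis=(2, 3))


def opening(mask, size=3):
    """Erosion followed by dilation: removes small isolated points."""
    return dilate(erode(mask, size), size)


def closing(mask, size=3):
    """Dilation followed by erosion: fills small gaps."""
    return erode(dilate(mask, size), size)
```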
In the morphologically processed image, interpolation is used to repair areas consisting entirely of blank columns, as shown in Figure 5a. The topmost 1-valued points on both sides of the blank columns are located, and spline interpolation between them creates smooth boundary lines that fill in these blank sections. Since the image has already undergone morphological operations, the interpolation does not introduce boundary fluctuations due to small outliers, ensuring the smoothness and continuity of the boundary regions.
This step results in a continuous boundary between the top of the print and the air. The complete image is created by combining the processed upper-three-quarters image with the original lower-quarter image.
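The boundary completion can be sketched as follows. For brevity, this sketch substitutes linear interpolation (`np.interp`) for the spline interpolation described above, so it illustrates the principle rather than the exact procedure; the function name `interpolate_top_boundary` is hypothetical.

```python
import numpy as np


def interpolate_top_boundary(mask):
    """Row index of the topmost 1-pixel per column, interpolated across blank columns."""
    mask = np.asarray(mask).astype(bool)
    cols = np.arange(mask.shape[1])
    has_material = mask.any(axis=0)
    if not has_material.any():
        raise ValueError("mask contains no foreground pixels")
    # argmax returns the first True row in each non-blank column.
    top = mask.argmax(axis=0).astype(float)
    # Linear stand-in for the spline: fill blank columns from their neighbours.
    top[~has_material] = np.interp(
        cols[~has_material], cols[has_material], top[has_material]
    )
    return np.rint(top).astype(int)
```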
Based on the recognizable material edges of each printed piece in the outlier image, an average thickness of 15 pixels can be determined for the boundary area. To remove the effect of the print edges from the subsequent analysis, the first 15 pixels with a value of 1 in each pixel column are set to 0, which removes the surface area on average. The final binary mask is shown in Figure 5b.
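Removing the surface band can be expressed compactly with a cumulative sum along each column. The sketch below is one possible implementation under the assumption that the mask is a 0/1 array with rows running from top to bottom; the function name `strip_surface` is hypothetical.

```python
import numpy as np


def strip_surface(material_mask, thickness=15):
    """Set the first `thickness` pixels with value 1 in every column to 0."""
    mask = np.asarray(material_mask).astype(np.uint8)
    # Running count of 1-pixels from the top of each column.
    count_from_top = np.cumsum(mask, axis=0, dtype=np.int64)
    surface = (mask == 1) & (count_from_top <= thickness)
    return np.where(surface, 0, mask).astype(np.uint8)
```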
Figure 5c shows the B-scan plot of the original OCT combined with the masks for the detected outliers (red dots) and the mask for distinguishing between the environment and material volume, with the green part marking the inner area of the FFF print.
By computing the percentage of the accumulation of the area of the detected outliers in the material volume, or the proportions of the red area in the green area, respectively, the percentage of outliers within the examined section of the printed part can be determined.
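This ratio can be written down directly. The sketch assumes two binary arrays of equal shape, one marking detected outliers and one marking the material volume; the function name `outlier_percentage` is hypothetical.

```python
import numpy as np


def outlier_percentage(outlier_mask, material_mask):
    """Percentage of material pixels that are flagged as outliers."""
    outliers = np.asarray(outlier_mask).astype(bool)
    material = np.asarray(material_mask).astype(bool)
    area = int(material.sum())
    if area == 0:
        return 0.0
    # Only outliers that lie inside the material volume are counted.
    return 100.0 * int((outliers & material).sum()) / area
```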
As evident from Figure 6, the outlier percentage and its variance for ‘A-09_1_layer’, ‘X5Y4_1_layer’, ‘X5Y4_2_layer’, and ‘X5Y4_3_layer’ are significantly lower than those of the other B-scan image sets, which demonstrates the consistency and stability of these datasets within the overall data.
In particular, the datasets ‘X5Y4_1_layer’, ‘X5Y4_2_layer’, and ‘X5Y4_3_layer’ show lower outlier rates, more stable ratios, and lower variances compared to the other datasets. This suggests that these images are more reliable for determining the boundary between the labels ‘good’ and ‘bad’. However, since ‘X5Y4’ and ‘A-09’ represent two different printing materials, the ‘A-09_1_layer’ dataset is also considered in the experiments to ensure the completeness and reliability of the threshold determination. Finally, to establish the criteria for image labeling, the labeling threshold is determined as the value covering 95% of the data from the four B-scan image sets: ‘X5Y4_1_layer’, ‘X5Y4_2_layer’, ‘X5Y4_3_layer’, and ‘A-09_1_layer’. The 95th percentile threshold method is a common choice for image classification and quality control in additive manufacturing. Because it is derived from the bulk of the data, it is largely insensitive to extreme outliers or noise, making the labeling threshold more representative and robust. The 95th percentile is used in additive manufacturing to evaluate the effects of powder quality and build orientation, for example, and this method is also frequently employed in image classification to establish classification criteria with high accuracy [56,57].
The analysis results in a value of 0.93 for the 95th percentile. Consequently, images with outlier scores greater than or equal to 0.93 are classified as ‘bad’, while images with scores less than 0.93 are classified as ‘good’. This threshold not only takes into account the differences in the assessment of different datasets, but also different material properties, thus providing the overall most reliable basis for the classification process.
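The thresholding rule can be sketched as follows, assuming that the outlier scores of the reference image sets are pooled into one array before the percentile is taken. The function name `label_images` and the pooling are assumptions; the score values in a real run would come from the outlier analysis described above.

```python
import numpy as np


def label_images(reference_scores, scores, percentile=95.0):
    """Derive the labeling threshold from reference scores and label new scores."""
    threshold = float(np.percentile(reference_scores, percentile))
    # Scores at or above the threshold are labeled 'bad', all others 'good'.
    labels = ["bad" if score >= threshold else "good" for score in scores]
    return threshold, labels
```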
3.3. Data Preprocessing
3.3.1. Image Preprocessing and Conversion
First, resizing is performed by uniformly reducing all images to 224 × 224 pixels given the required consistency of the size of images introduced into the deep learning model. This size is primarily chosen to accommodate common CNN architectures, such as ResNet, where this size retains sufficient detail without consuming significant memory and computing resources.
Data augmentation is then applied to the training set to improve the model’s generalization ability and reduce the risk of overfitting. The images in the training set are randomly flipped horizontally, which helps the model learn features that are independent of the viewing direction, especially for images without a preferred orientation.
Finally, normalization is performed. In deep learning models, the range of input values has an important influence on training efficiency and convergence speed. The image pixel values are normalized using a mean of 0.1605 and a standard deviation of 0.1056, i.e., the mean is subtracted from each pixel value and the result is divided by the standard deviation. Normalization helps improve training stability and avoid exploding or vanishing gradients due to pixel values that are either excessively large or small.
The results of the preprocessing are shown in Figure 7, from left to right: the resized image (Figure 7a), the randomly horizontally flipped image (Figure 7b), and the normalized image (Figure 7c).
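The flipping and normalization steps can be sketched in NumPy as shown below. The sketch assumes that the image has already been resized to 224 × 224 pixels and scaled to the range [0, 1], and that the stated mean and standard deviation are applied in the usual subtract-and-divide manner. The flip probability of 0.5 and the function name `preprocess` are assumptions; the study itself uses the TorchVision transforms.

```python
import numpy as np

MEAN, STD = 0.1605, 0.1056  # normalization constants stated in the text


def preprocess(image, rng=None, train=True, flip_prob=0.5):
    """Optionally flip a gray-scale image horizontally, then normalize it."""
    x = np.asarray(image, dtype=np.float32)
    # Augmentation is applied to training images only.
    if train and rng is not None and rng.random() < flip_prob:
        x = x[:, ::-1]
    return (x - MEAN) / STD
```

Passing `train=False` for validation and test images leaves their structure unchanged and applies normalization only.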
3.3.2. Data Transformation
Since two different materials, PA12 and PLA, are involved in this study, the surface structure and scanning properties of these materials can lead to significant differences in the distribution of gray values in the images. To avoid the influence of such material differences on the classification performance of the model, histogram matching (HM) is applied to make the images of different materials more consistent in terms of gray value distribution. Histogram transformation in OCT image processing has been shown to be effective for adapting image features from one material to another, especially when combined with deep learning techniques. With such image transformation techniques, OCT images of different materials can be normalized to similar image features for training in a unified model, improving the accuracy of image classification or other recognition tasks [58,59].
Since most images in the ‘X5Y4’ dataset manufactured from PLA are labeled as ‘good’, only the images labeled as ‘good’ from the ‘A-09_1_layer’ dataset manufactured from PA12 are used to calculate the average cumulative distribution, as shown in Figure 8a. The gray scale histograms of the ‘X5Y4’ image sets made of PLA are then adjusted to this average cumulative distribution using histogram matching.
Figure 8b shows an image before adjustment and Figure 8c shows the image after adjustment. It can be observed that the image processed by histogram matching has a gray value distribution similar to the average cumulative distribution of the reference images shown in Figure 8a, and can be directly utilized for model training and validation.
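Histogram matching against an average cumulative distribution can be sketched as follows. The sketch assumes 8-bit gray values (256 levels); the function names `average_cdf` and `match_to_cdf` are hypothetical, and the mapping rule (smallest target level whose cumulative value reaches that of the source level) is one common variant.

```python
import numpy as np


def _cdf(image, levels=256):
    """Normalized cumulative histogram of an 8-bit image."""
    hist = np.bincount(np.asarray(image, dtype=np.uint8).ravel(), minlength=levels)
    return np.cumsum(hist) / hist.sum()


def average_cdf(reference_images, levels=256):
    """Average cumulative distribution over a set of reference images."""
    return np.mean([_cdf(img, levels) for img in reference_images], axis=0)


def match_to_cdf(image, target_cdf, levels=256):
    """Remap gray levels so that the image CDF follows `target_cdf`."""
    img = np.asarray(image, dtype=np.uint8)
    source_cdf = _cdf(img, levels)
    # For each source level: smallest target level whose CDF reaches the source CDF.
    lookup = np.searchsorted(target_cdf, source_cdf, side="left")
    lookup = np.clip(lookup, 0, levels - 1).astype(np.uint8)
    return lookup[img]
```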
3.3.3. Dataset Loading and Splitting
Due to the spatial continuity of the B-scan images during the OCT scanning process, there is a high degree of similarity between adjacent B-scan images. To avoid data leakage, which is very likely to occur if the training and test sets contain excessively similar images, a block-based splitting strategy is applied. With this strategy, M consecutive B-scan images are treated as a block, and each block is randomly assigned as a whole to one of the datasets (training, validation, or test set). This means that images within the same block are assigned to the same dataset, which greatly reduces the number of adjacent or excessively similar image samples in different datasets.
Small block size M: Adjacent images are more often split into different datasets, which increases the risk of data leakage and reduces the reliability of the model evaluation [60];
Large block size M: Reduced probability of data leakage, but prone to excessive differences between training, validation, and test sets, compromising the model’s generalization ability [61].
To ensure the reproducibility and accuracy of the experiments, the following steps are used to divide the dataset:
Determination of the number of blocks: All B-scan images are divided into blocks of M B-scan images each, which serve as the basic unit of data division.
Division of training, validation, and test data: The division is 7:2:1, with 70% of the data used for training, 20% for validation, and 10% for testing. Augmentation and normalization are applied to the training set. The validation and test sets retain the original structure of the images and are only normalized, allowing a better simulation of performance in a real-world environment.
Use of random seeds: The random seed 42 is used, which ensures that the results are consistent each time the data is split.
Loading of training data: A random shuffling strategy is used so that the order of the images differs for each training run and the model does not depend on a specific image order.
Table 4 shows the number of samples within training, validation, and test sets for each classification method for each block size category.
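The splitting steps listed above can be sketched with the Python standard library. The sketch assumes that the 7:2:1 ratio is applied to the number of blocks and that image indices follow the acquisition order; the function name `block_split` is hypothetical.

```python
import random


def block_split(num_images, block_size, ratios=(0.7, 0.2, 0.1), seed=42):
    """Split image indices into train/val/test sets in whole blocks of consecutive images."""
    blocks = [
        list(range(start, min(start + block_size, num_images)))
        for start in range(0, num_images, block_size)
    ]
    # A fixed seed makes the block assignment reproducible.
    random.Random(seed).shuffle(blocks)
    n_train = round(ratios[0] * len(blocks))
    n_val = round(ratios[1] * len(blocks))
    parts = (
        blocks[:n_train],
        blocks[n_train:n_train + n_val],
        blocks[n_train + n_val:],
    )
    # Flatten each group of blocks into a sorted list of image indices.
    return tuple(sorted(i for block in part for i in block) for part in parts)
```

Because whole blocks are assigned, two images from the same block never end up in different sets; only images on either side of a block boundary can.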
3.4. Modeling
3.4.1. Model Architecture
An improved residual unit based on ResNet-V2 [50] is adopted to reduce the number of network layers and increase the width of the network, and the resulting deep residual network model is applied to the classification of B-scan images of 3D-printed parts. The goal is to achieve better image classification performance even with limited computing resources by using a shallower residual network, since model inference is intended to run close to the machine for real-time process monitoring, which requires a certain degree of efficiency.
This network, based on the ResNet-V2 architecture, uses the BottleneckV2 module, with the input size and depth of the network adjusted to the binary classification task (‘good’ and ‘bad’). According to Table 5, the ResNet-V2 network architecture consists of the following modules:
Convolutional Layer: A 7 × 7 convolutional kernel with 64 channels and a padding of 3 is used for initial feature extraction, followed by batch normalization and ReLU activation to normalize the features and introduce nonlinearity.
Maximum pooling layer: A 2 × 2 pooling kernel is used with a step size of 2 and no padding (padding = 0), reducing the feature map size from 224 × 224 to 112 × 112.
Residual module: Each residual module consists of several BottleneckV2 modules, where N is the number of BottleneckV2 modules contained in each residual module and K is the multiplication factor of the number of channels, i.e., the width of the network, as shown in Figure 9a. The ‘pre-activation’ method, specifically ‘BN-ReLU-Conv’, is used. The N BottleneckV2 modules in each residual module are linked by residual connections, enabling the input features to be passed directly to the next layer (see Figure 9b,c).
Adaptive Average Pooling: Adaptive Average Pooling is used to reduce the size of the feature maps and thus extract high-level global features.
Fully Connected Layer: A dropout layer is implemented before a fully connected layer, which maps the input features to classification outputs ‘good’ and ‘bad’, converting them into probability distributions by a softmax function.
In the evolved residual network model, the residual unit has been modified to reduce the depth of the network while increasing its width. In particular, the total number of layers in the network has been reduced and the width of each layer has been expanded by increasing the number of channels. The total number of layers in the network is 12N + 2, with N representing the number of BottleneckV2 modules per residual block: each of the four residual modules contributes N modules with three convolutional layers each, to which the initial convolutional layer and the fully connected layer are added.
The following adjustments to the model structure are considered:
Bottleneck design: The BottleneckV2 module is used with an expansion ratio of 4, reducing the computational load on the middle layer of the network while maintaining strong feature representation capability at deep layers [
62];
Multi-layer design: The ResNet-V2 network architecture consists of four residual modules corresponding to 64, 128, 256, and 512 channels. Multiple bottleneck blocks are embedded in each module, allowing the model to extract more abstract, higher-level features layer by layer;
Dropout: A dropout layer is included before the fully connected layer to randomly drop neurons to improve the robustness of the model and avoid the model being overly dependent on the training set [
63];
Tuning of K and N: Parameters K and N are not fixed in advance in this model, as they are determined through tuning experiments.
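To illustrate why a bottleneck reduces the computational load of the middle layer, the weight count of one bottleneck can be compared with that of a plain 3 × 3 convolution operating at the full channel width. The sketch below assumes the standard 1 × 1, 3 × 3, 1 × 1 bottleneck layout and assumes that K scales the middle channel count; both are illustrative assumptions, the function names are hypothetical, and the numbers are not the parameter counts of the model described in this study.

```python
def bottleneck_params(c_in: int, base: int, k: int = 1, expansion: int = 4) -> int:
    """Convolution weights (no biases, no batch-norm parameters) of one bottleneck."""
    mid = base * k             # widened middle channels (assumed role of K)
    c_out = expansion * mid    # expanded output channels (expansion ratio 4)
    reduce_1x1 = c_in * mid    # 1x1 convolution: c_in -> mid
    conv_3x3 = 9 * mid * mid   # 3x3 convolution: mid -> mid
    expand_1x1 = mid * c_out   # 1x1 convolution: mid -> c_out
    return reduce_1x1 + conv_3x3 + expand_1x1


def plain_3x3_params(channels: int) -> int:
    """Weights of a single 3x3 convolution that keeps `channels` channels."""
    return 9 * channels * channels


narrow = bottleneck_params(c_in=256, base=64, k=1)
wide = bottleneck_params(c_in=512, base=64, k=2)

print(narrow)                 # 69632
print(plain_3x3_params(256))  # 589824
print(wide)                   # 278528
```

Under these assumptions, doubling K quadruples the weight count of a bottleneck, which is one reason why width and depth have to be tuned jointly when computing resources are limited.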
3.4.2. Hyperparameter Configuration
During the training process, default hyperparameters are set to ensure that the ResNet-V2 network converges within a reasonable time frame while achieving optimal performance. In the comparative experiments outlined, only the hyperparameters under comparison are modified, while all other settings remain unchanged to ensure a valid comparison. The specific training configuration is as follows:
Learning Rate: An initial learning rate of 0.001 is used, along with a step decay schedule. The learning rate decay is applied every 5 epochs, with a decay factor of 0.1.
Optimizer: The Stochastic Gradient Descent (SGD) optimizer with momentum is employed, using a momentum value of 0.9 and weight decay of 0.0005.
Batch Size: A batch size of 32 is chosen based on the available memory of the 8 GB RTX 3070 GPU employed.
Dropout: A dropout rate of 0.5 is applied to reduce overfitting.
Early Stopping: Early stopping with a patience of 12 epochs is used to prevent overtraining and ensure the model does not continue training once performance plateaus.
Epochs: The maximum number of epochs is set to 40.
Loss Function: The model is optimized using cross-entropy loss, appropriate for the binary classification task.
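The interplay of the step decay schedule, the epoch limit, and early stopping listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the training code of this study: epochs are assumed to be counted from zero, validation loss is assumed to be the monitored quantity for early stopping, and the validation losses are placeholder values. The names `step_decay_lr` and `EarlyStopping` are hypothetical.

```python
def step_decay_lr(epoch: int, base_lr: float = 0.001,
                  step_size: int = 5, gamma: float = 0.1) -> float:
    """Learning rate for a 0-indexed epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


class EarlyStopping:
    """Signals a stop after `patience` epochs without validation-loss improvement."""

    def __init__(self, patience: int = 12) -> None:
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss: float) -> bool:
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


MAX_EPOCHS = 40
stopper = EarlyStopping(patience=12)
# Placeholder validation losses: improve for 10 epochs, then plateau.
fake_val_losses = [1.0 / (e + 1) if e < 10 else 0.2 for e in range(MAX_EPOCHS)]

stopped_at = None
for epoch in range(MAX_EPOCHS):
    lr = step_decay_lr(epoch)  # would be handed to the optimizer in real training
    if stopper.step(fake_val_losses[epoch]):
        stopped_at = epoch
        break

print(stopped_at)  # 21: twelve stale epochs after the best epoch 9
```

With these settings, a run that reaches the epoch limit has its learning rate reduced seven times.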
5. Discussion
Although the motivation is clearly focused on preparing grounds for in-line process monitoring, the experiments documented in this study are conducted ex situ. The successful integration of OCT sensor heads into printers has already been demonstrated [
49,
64]. This has shown that neither the spatial dimensions nor the mass of OCT sensor heads constitutes an obstacle to their integration into a printer. Likewise, neither the energy required nor the burden on the environment, for example due to radiation, represents an obstacle. If the measurement is to be carried out in-line without any loss of time, rather than with a time delay inside the printing compartment, the biggest hurdle will be the accessibility of the material deposition location to the beam. A measuring setup must therefore either be positioned directly behind the nozzle, which requires the setup to be able to rotate around the nozzle, or be positioned at an angle to the nozzle so that the location of material deposition directly below the nozzle is irradiated by an inclined beam.
The binary labels serving as ground truth in the proposed approach are derived from an OCT image-based heuristic using outlier area ratios. Label generation therefore relies on the same OCT-derived features that are later used for training the classification model, which introduces a degree of circularity. As a result, the performance of the trained classification model only reflects consistency with the adopted labeling strategy rather than absolute physical defect validation. Future work must consequently involve complementary validation methods, such as CT.
The evolution of accuracy and loss over the course of training is investigated for block sizes M of 5, 10, 20, and 40. For block size M = 5, the accuracy on the validation set consistently remains higher than that on the training set, and the accuracy on the test set is slightly lower than the accuracy on the validation set, but higher than the accuracy on the training set, which is counterintuitive for training processes.
In a well-behaved training process, the validation metrics should closely track the trend observed in the training data. Over the course of training, both training and validation loss should decrease, and both accuracies should increase in a similar manner, indicating that the model is learning informative and transferable patterns rather than memorizing training examples. Validation performance is typically expected to be slightly worse than training performance, a difference often referred to as the generalization gap, which remains modest and stable throughout training. The presence of a moderate and consistent generalization gap is to be expected and reflects the inherent difference between optimizing a model on the training set and evaluating it on unknown data [
65,
66]. One possible explanation for the missing generalization gap at block size M = 5 is that this block size is too small, causing similar neighboring images to be split across different subsets and resulting in data leakage.
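The leakage argument can be made concrete by counting how many pairs of directly neighboring images end up in different subsets for a given block size. The sketch below is a generic illustration of block-based partitioning and not necessarily the exact procedure used in this study: the split fractions of 70/15/15, the random seed, and the function names `block_split` and `cross_split_neighbors` are assumptions, while the dataset size of 8135 images is taken from the text.

```python
import random


def block_split(n_images: int, block_size: int,
                fractions=(0.7, 0.15, 0.15), seed: int = 0):
    """Assign whole blocks of consecutive image indices to train/val/test."""
    blocks = [list(range(start, min(start + block_size, n_images)))
              for start in range(0, n_images, block_size)]
    rng = random.Random(seed)
    rng.shuffle(blocks)
    n_train = round(fractions[0] * len(blocks))
    n_val = round(fractions[1] * len(blocks))
    parts = {
        "train": blocks[:n_train],
        "val": blocks[n_train:n_train + n_val],
        "test": blocks[n_train + n_val:],
    }
    # Flatten each list of blocks into a sorted list of image indices.
    return {name: sorted(i for block in chunk for i in block)
            for name, chunk in parts.items()}


def cross_split_neighbors(split) -> int:
    """Count adjacent image pairs (i, i + 1) that land in different subsets."""
    owner = {i: name for name, idxs in split.items() for i in idxs}
    return sum(owner[i] != owner[i + 1] for i in range(len(owner) - 1))


small = block_split(8135, block_size=5)
large = block_split(8135, block_size=40)
print(cross_split_neighbors(small), cross_split_neighbors(large))
```

Only pairs at block boundaries can be separated, so roughly one in M neighboring pairs is a candidate for leakage; a larger M reduces this share at the cost of fewer independent blocks per subset.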
The effect of five combinations of the multiplication factor K and the number of bottleneck modules N on model performance is investigated. For K = 1 and N = 1, K = 1 and N = 2, and K = 2 and N = 1, the accuracy on the validation and test sets is significantly above the accuracy on the training set. A possible reason for this phenomenon is a lack of model complexity, meaning that the model is underfitted. If N and K are low, as for K = 1 and N = 1 or K = 2 and N = 1, the model does not have enough parameters to capture the complex patterns in the training data, resulting in underfitting of the training set. The validation accuracy can nevertheless be relatively high if the examples in the validation set are comparatively simple, so that the oversimplification of the model has a weaker negative impact on them. Additionally, data augmentation methods such as random horizontal mirroring are applied to the training set, which can result in lower training accuracy compared to the validation set.
Four dropout values are evaluated (
Table 7). The model achieves the highest accuracy and F1-score at a dropout of 0.5, while recall (‘bad’) and precision (‘good’) improve with increasing dropout up to 0.8. In general, a higher dropout value is effective in improving the model’s generalization ability; however, a dropout rate of 0.8 is commonly considered extremely high [
63]. Such a high dropout rate can be justified in specific scenarios in which an aggressive regularization strategy is required [
67]. A relatively small training dataset is likely to lead to severe overfitting. If the model is unable to generalize, a high dropout rate can be applied to severely limit the effective capacity of the network and force it to learn more robust, distributed representations. Ideally, a high dropout is applied exclusively to fully connected classifier heads and not to convolutional layers. Fully connected layers often contain a disproportionate number of parameters and are particularly prone to overfitting. In these cases, using a higher dropout rate acts as structural regularization, preventing the co-adaptation of neurons and improving generalization, while leaving earlier convolutional feature extractors largely untouched [
68]. An edge case occurs when dropout is intentionally used to approximate ensemble-like behavior during training. High dropout effectively trains a large number of subnetworks implicitly, whose predictions are averaged at inference time. Thus, a higher dropout rate can be used to improve robustness and reduce variance [
63]. However, it is worth considering applying dropout values of up to 0.8 selectively, in the final layers of a network rather than throughout the entire architecture [
69].
The optimized model delivers results based on test data with an accuracy of 0.9446, a recall (‘bad’) of 0.9227, and a precision (‘good’) of 0.9175. An overall accuracy of 0.9446 for the test data indicates that the model effectively generalizes to unknown samples and correctly classifies the vast majority of images. The recall value of 0.9227 for the class ‘bad’ shows that more than 92% of the samples that are truly defective are correctly identified. This is particularly important in quality assurance or safety-critical applications, where missing a defective item can have significant consequences. The high recall value therefore shows that the model is effective in minimizing false negatives for the critical class. At the same time, a precision of 0.9175 for the class ‘good’ shows that the model is correct in the vast majority of cases when it predicts a sample as non-defective. This suggests that the classifier does not excessively penalize normal samples and maintains an appropriate balance between rejecting defective items and correctly accepting good items. Overall, these metrics are internally consistent and suggest that the classifier achieves a favorable compromise between sensitivity and reliability. The reported performance can therefore be considered strong and credible for an applied image classification task and would generally be suitable for further validation or deployment testing.
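For reference, the three reported metrics can be computed from true and predicted labels as sketched below. The function name `binary_metrics` is hypothetical, the label lists are toy values for illustration only and not data from this study, and the sketch assumes that both classes occur so that no denominator is zero.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, recall for 'bad', and precision for 'good' from label lists."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    bad_total = sum(t == "bad" for t, _ in pairs)
    bad_found = sum(t == "bad" and p == "bad" for t, p in pairs)
    good_pred = sum(p == "good" for _, p in pairs)
    good_right = sum(t == "good" and p == "good" for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "recall_bad": bad_found / bad_total,
        "precision_good": good_right / good_pred,
    }


# Toy labels for illustration only, not data from the study.
y_true = ["bad", "bad", "bad", "bad", "good", "good", "good", "good", "good", "good"]
y_pred = ["bad", "bad", "bad", "good", "good", "good", "good", "good", "good", "bad"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Both recall (‘bad’) and precision (‘good’) are lowered by the same type of error, namely a defective sample predicted as ‘good’, which is why the two metrics are reported together when missed defects are the main concern.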
Grad-CAM heatmaps shown in
Figure 13 illustrate causes of model problems and misclassifications.
Figure 13a shows that the sample has extra noise in the vertical direction, which may have affected the model’s assessment. The Grad-CAM heatmap shown in
Figure 13b shows that the red areas of the model focus are mainly concentrated on the areas of severe noise rather than on the internal structure of the 3D-printed part. This suggests that the model is not sufficiently robust against such signal noise and is prone to being influenced by external noise signals, leading to misclassifications.
In
Figure 13d, the Grad-CAM heatmap shows that the red areas of high attention are mainly concentrated on the edges of the 3D-printed part towards the air region at its upper periphery, thus deviating from the critical structural areas of the material interior and leading to a misclassification by the model. This suggests that during the processing of such samples, the model is not properly focused on the relevant features of the 3D-printed part, but is instead distracted by irrelevant areas.
In the case of misclassification illustrated in
Figure 13e,f, the image is first prepared for histogram matching. It can be observed that after histogram equalization, dense white areas appear on both sides of
Figure 13e. This phenomenon is also reflected in the Grad-CAM visualization in
Figure 13f, where the model’s attention is drawn to these white areas due to histogram equalization, leading to misclassification. Although most histogram-matched images are correctly classified, this example shows that this technique is subject to a degree of residual uncertainty, which in certain cases can lead to noise or interference and compromise the classification performance of the model.
Although the developed model achieves an overall accuracy of 94%, the misclassifications reveal areas where the model’s performance can be further improved. In particular, the model is sensitive to noise, edge effects, and preprocessing techniques such as histogram equalization, which in certain cases can shift the model’s focus to irrelevant areas. While these issues tend to have only a marginal impact on the overall classification performance, addressing them is expected to improve robustness and accuracy.
A comparison of the results shows that ResNet-V2 outperforms EfficientNet-B0 and VGG16 in every metric. Furthermore, the ResNet-V2 model developed in this study has approximately 3.5 million parameters, while EfficientNet-B0, which is used for binary classification tasks, has approximately 5.3 million parameters, and VGG16 has over 138 million parameters. ResNet-V2 thus achieves better performance than the EfficientNet-B0 and VGG16 models used for comparison with fewer parameters, and far fewer in the case of VGG16. This suggests that using ResNet-V2 improves computational efficiency and reduces memory requirements while maintaining model accuracy. This makes ResNet-V2 the better choice for binary classification tasks in practical applications, especially in environments with limited computing resources.
Since the experiments are performed using two materials and a single printer setup, the demonstrated performance is strongly dependent on the materials, equipment, and process parameters used. Provided that the printing equipment allows integration of an OCT sensor in terms of design and kinematics, transferring the approach to other equipment is not expected to be critical. Transferring it to other printing materials is likely to be the greatest challenge in disseminating the approach demonstrated. In order to detect defects below the surface, the light used by OCT must be able to penetrate the material to the relevant depths. Commercially available systems operate with central wavelengths in the range of 800 nm to 1500 nm. Depending on the transparency of the feedstock used and the thickness of the layer produced during the manufacturing process, the number of layers made visible by OCT can vary from a single layer to several layers [
49].
The study addresses a classification into ‘good’ and ‘bad’ as an expression of overall internal quality instead of explicitly distinguishing between different defect types as a result of specific defect mechanisms. Defect types in FFF are numerous and can be attributed to both material properties and unsuitable printing process parameters [
70]. Since this study’s approach involves using OCT scans as source data for defect detection, the main focus in the future will primarily remain on identifying internal defects, in particular delamination, cracks, gaps, and voids. The term ‘gaps’ in this regard refers not only to gaps in the material, but also to other types of defects such as under-extrusion and cavities, which can be grouped into a single category, as they are expected to have a similar appearance in OCT scans, since they involve local missing material. A more precise differentiation of the causes of delamination, cracks, gaps, and voids is not expected to be reliably possible due to the limited resolution of OCT. In any case, molecular defects are not expected to be detectable due to this resolution limit.
Model training involves stochastic processes that may affect performance. Experiments in this study were conducted using a single training run per configuration. Performance variance across multiple runs is therefore not explicitly quantified, which constitutes a limitation of this study. Nevertheless, the reported results are sufficient to support the comparative and methodological conclusions of this work.
6. Conclusions
CNNs based on ResNet-V2 are used for the classification of tomographic cross-sections. A dataset of 8135 OCT images passes through semi-automatic labeling, preprocessing, model training, and evaluation. A sliding window identifies outlier regions in the tomographic cross-sections, while masks suppress peripheral noise, enabling label generation based on outlier ratios. Data are split using block-based partitioning to limit leakage.
It is confirmed that a combination of width multiplication factor K = 2 and number of bottleneck modules N = 2 exhibits superior performance across various metrics, particularly in recall (‘bad’) and precision (‘good’), which is of particular importance in the detection of defective printed parts. In addition, other hyperparameters including dropout and learning rate are compared and tested, showing that a dropout of 0.5 or 0.8, depending on the evaluation metrics emphasized, and an initial learning rate of 0.001 are effective for achieving optimum generalization ability and classification accuracy of the model. The optimized model delivers results based on test data with an accuracy of 0.9446, a recall (‘bad’) of 0.9227, and a precision (‘good’) of 0.9175. The effectiveness of the ResNet-V2 model is verified by conducting comparative experiments with alternative state-of-the-art CNN models, including EfficientNet-B0 and VGG16, demonstrating that the custom ResNet-V2 model outperforms the other models in terms of classification accuracy.
In future research and applications, the types of materials and the total number of image sets are to be extended to grant the model better generalization ability and robustness, and different classes of defects shall be considered.