1. Introduction
Laser-driven manufacturing such as cutting, welding, marking, and surface modification is used for metals, polymers, ceramics, and composite systems. Process control now relies on data-driven and AI-assisted monitoring rather than offline parameter optimisation [
1,
2,
3,
4,
5,
6]. The automotive shift to battery electric vehicles (BEVs) has made laser welding of aluminium busbar-to-terminal connections safety-critical [
5,
6]. Each battery pack requires several thousand defect-free welds; a single faulty connection compromises the electrical, thermal, and mechanical integrity of the whole pack [
7,
8]. Remote laser welding (RLW) dominates because the galvanometer scanner decouples beam motion from robot motion, enabling sub-second cycle times, large stand-off distances, and high-frequency beam wobbling that stabilises the keyhole against aluminium’s high reflectivity and low viscosity [
9,
10,
11,
12]. RLW of 1xxx- and 6xxx-series aluminium nevertheless remains susceptible to lack of fusion, lack of connection, hot cracking, porosity, and through-piercing, with the defects depending nonlinearly on laser power, part-to-part gaps, beam shape, and clamping conditions [
10,
11,
12,
13].
In-process monitoring translates the physical emissions of welding (visible light, plasma intensity, infrared radiation, X-ray transmission, ultrasonic waves, and airborne acoustic emissions) into quality indicators [
1,
2,
3,
4]. Photodiode and high-speed-camera pipelines have matured into industrial closed-loop systems through machine learning (ML) feature extraction and classification [
14,
15,
16,
17,
18,
19]. Deep convolutional and recurrent architectures have been demonstrated on visual and thermal data [
20,
21,
22,
23,
24,
25], and one-dimensional (1-D) deep models are emerging for raw process-signal time series [
26,
27,
28,
29,
30].
A less explored modality is airborne acoustic emissions captured by Fabry–Pérot membrane-free optical microphones. Compared with capacitive or piezoelectric microphones, they offer a flat response from 10 Hz to tens of MHz, no mechanical resonances, sub-millimetre sensing heads that can be placed close to the process, and high immunity to electromagnetic interference [
31]. This enables monitoring of keyhole resonance, melt-pool oscillations, plume venting, and rapid solidification, phenomena previously accessible only through high-cost or high-latency sensors [
32,
33,
34].
Basile et al. [
31] released a benchmark RLW dataset of 183 weld trials of 1.0 mm AA1050 sheets, captured at
= 2.5 MHz with a Xarion Eta250 ultra optical microphone (Xarion Laser Acoustics GmbH, Vienna, Austria) over a full-factorial 12 × 5 design-of-experiments grid plus three 2000 W piercing tests. Their analysis used Welch power spectral density (PSD) features, total sound power
, a band-stop-filtered variant
, and pairwise PSD quotients to derive threshold rules separating four weld events. Threshold rules are effective for descriptive analysis but do not adapt to the laser-power/gap interaction effects that the same author quantified through ANOVA, do not generalise across replicates, and provide limited interpretability for marginal cases. An end-to-end deep learning approach that learns class-discriminative spectro-temporal features from the raw microphone signal is therefore suitable for improving on the threshold baseline while preserving real-time deployability.
This paper introduces Acoustic Signal Intelligence (ASI), a deep learning framework for laser beam welding state classification from optical-microphone signals. The framework is built around three design choices motivated by the literature reviewed in
Section 2:
A compact 1-D CNN encoder on a mel-scale short-time Fourier transform (STFT) spectrogram, based on the author’s prior 1-D-signal modelling experience [
35,
36,
37], which achieves the strongest deep learning performance on this dataset. A multi-representation fusion variant that further concatenates a wavelet-packet decomposition [
14,
22] and a 24-feature library targeting the 8, 63 and 110 kHz keyhole-resonance bands [
31,
32,
33], following the multi-modal autoencoder fusion of Weisbrod and Metternich [
29], is evaluated as an ablation arm and not retained in the final model.
A systematic ablation against larger 1-D CNN–Transformer hybrids inspired by recent architectures for 1-D welding signals [
25,
26,
27,
28,
29,
30] and against a class-conditional GAN-style augmentation [
19]. Both arms underperform compared to the compact CNN, indicating that Transformer-hybrid attention models do not transfer well to small welding datasets, consistent with the benchmark-data-scarcity observation of recent reviews [
1,
2,
3]. The Transformer underperformance is further confirmed by a systematic 30-configuration grid search over learning rate, weight decay, and dropout, under which the best tuned Transformer-hybrid macro-F1 (0.441 ± 0.057) remains 0.283 below the proposed CNN’s 0.724 ± 0.034.
Grad-CAM-based frequency-domain interpretability that recovers the dominant keyhole-resonance bands reported by Basile et al. and by previous physical models [
32,
33] without explicit prior knowledge of these bands.
The framework is benchmarked against (i) the threshold rule of Basile et al. [
31]; (ii) classical ML baselines including random forest [
34], principal-component analysis with support vector machines [
16], sequential forward floating selection with SVMs [
17,
18], XGBoost [
29,
38], ensemble learning [
39], and shallow ANN/MLP regression [
13,
40,
41]; and (iii) deep baselines including 1-D CNN, LSTM, and Transformer encoders. The present work extends the author’s prior physics-driven RLW penetration framework [
42], surrogate-driven RLW process-window modelling [
43], and unsupervised photodiode-based weld-defect classification [
37] by replacing optical with acoustic emissions and handcrafted features with end-to-end deep learning. The remainder of the paper is organised as follows.
Section 2 reviews the related literature.
Section 3 describes the dataset, signal-processing pipeline, proposed architecture, and training protocol.
Section 4 presents the results.
Section 5 discusses the implications and limitations.
Section 6 concludes.
5. Discussion
On the held-out replicate (61 trials, Fold 3), the proposed 1-D CNN achieved 0.721 accuracy and 0.707 macro-F1 with 100% recall on the rare
Pierc. class (one piercing trial per fold; three piercing trials in total, distributed two train/one test). This is an absolute improvement of
accuracy (
macro-F1) over the dual-threshold rule of Basile et al. [
31] extended to the five-class taxonomy (0.295/0.092, piercing recall 0). The strongest classical-ML baseline, SFFS-SVM on the 24-feature library, achieved 0.754 accuracy and 0.682 macro-F1; the proposed CNN had slightly lower accuracy (
) but higher macro-F1 (
) because it predicted the rare
Marg. and
Pierc. classes more uniformly, and the McNemar test indicates the two are statistically indistinguishable on the 61-trial test set (
,
Table 10). Photodiode-based four-class pipelines such as Caprio et al. [
7] (KNN, 100%) and Chianese et al. [
14] (DWT+NN, 97.5%) use simpler four-class taxonomies and different alloys (Cu/Ni-steel vs. AA1050) and are not directly comparable; the present results indicate that a single optical-microphone signal supports a five-class classifier of similar quality without the multi-sensor photodiode stacks of [
8,
15,
16,
19,
20]. The acoustic framework benefits from a much higher sensor bandwidth than the photodiode pipeline of [
37] (DC–MHz vs. 50 kHz), capturing the 63 kHz keyhole-resonance peak and broadband piercing attenuation; the Grad-CAM analysis confirmed these are the network’s primary discriminators (
Figure 8).
On a per-class accuracy basis, the proposed 1-D CNN performs below the vision-based DL state of the art: Knaak et al. [
21] report F1 = 0.952 on six classes from dual-IR sequences, Balakrishnan et al. [
25] reach 99.55% binary accuracy on radiographs with a Swin–DeiT transformer, and Wang et al. [
8] achieve a high F1 on Al-busbar visual data. However, the proposed model operates on a one-dimensional input of
samples per trial after decimation (
at full bandwidth), one to two orders of magnitude smaller than typical 2-D image inputs. CPU inference latency was 4.1 ms per trial, compared with 1.1 ms on a Jetson AGX Xavier for the CNN–GRU of Knaak et al. [
21] on IR images and 20.4 ms for the ST-TCN of Liu et al. [
27] on coaxial-vision penetration estimation. The framework is therefore suitable for embedded deployment when camera placement is impractical (closed enclosures, beam-wobble interference, shielding-gas blow-back), such as busbar laser welding [
9,
10].
A central finding is that under three-fold replicate-out CV on the 122-sample training set, every Transformer-hybrid arm stayed well below the proposed CNN. The original 4L/8H/
hybrid scored 0.514 ± 0.028/0.333 ± 0.036, and the four light variants clustered between 0.437 and 0.541 accuracy (small1 0.541 ± 0.013, small2 0.514 ± 0.039, small4 0.519 ± 0.020, sinpos2 0.437 ± 0.015); none closed the gap to the proposed CNN’s 0.732 ± 0.041/0.724 ± 0.034. The 27-cell hyperparameter grid on Tr-small2 plus the three-cell validation on Tr-original (
Table 6) confirms this conclusion is not a tuning artefact: the best-tuned configurations reach macro-F1 = 0.441 ± 0.057 on Tr-small2 and 0.304 ± 0.016 on Tr-original, both well below the proposed CNN’s 0.724 ± 0.034. The 90-fold-run search is consistent with the small-data attention findings of recent reviews [
1,
2,
3] and indicates that the gap reflects an inductive-bias mismatch rather than a missing hyperparameter. ACGAN-style oversampling, evaluated on Fold 3 only, degraded four of the five Transformer arms to accuracy ≤ 0.246 via single-class collapse; only the sinusoidal-position variant was unchanged (0.426/0.279). This contradicts the favourable Transformer-hybrid results of Solovev et al. [
28] (mAP 0.85, four classes, IR thermography), Liu et al. [
27] (98.96%, three classes, coaxial vision), and Shi et al. [
30] (99.39%, ten classes, multi-signal). Two factors explain the difference: those datasets are an order of magnitude larger, consistent with the benchmark-data-scarcity gap noted in [
1,
2,
3]; and the GAN-style augmentation introduces spurious patterns that the larger attention layers overfit, consistent with Eren’s [
3] caution that the benefits of synthetic data in laser welding DL are poorly understood. The local-pattern inductive bias of the compact CNN, by contrast, matches the spectro-temporal structure of keyhole-resonance acoustic signatures (
Figure 8) and avoids the data-hungry global-attention regime.
The Grad-CAM analysis (
Section 4.6) recovered the 8, 63 and 110 kHz keyhole-resonance bands without these being prescribed in the architecture (
Figure 8), providing post hoc evidence that the network learned a spectral structure consistent with the keyhole-resonance physics of [
31], the capillary-wave melt-pool dynamics of [
32], and the laser-ultrasonic attenuation regimes of [
33]. The Pearson correlations
and
on the 61-trial held-out test set match the sign and order of magnitude of the
(
vs.
) and
(
g vs.
) reported by Basile et al. [
31], confirming that the network captured physical trends rather than trial-specific artefacts. These observations address the interpretability and physics-grounding gap noted in laser welding DL reviews [
1,
2,
3,
4].
For battery-pack busbar manufacturing, the framework supports closed-loop laser-power control similar to [
13,
29,
55] and complements pre-process seam-tracking pipelines such as [
12]. The 8.1 ms end-to-end latency (4 ms STFT plus 4.1 ms encoder) gives a ∼120 Hz update rate, well below the 50–100 ms control-loop budget of the Industry 4.0/5.0 context reviewed in [
2,
67] but an order of magnitude below the 1 kHz beam-wobble cycle of [
10,
31], so the framework is suitable for outer-loop quality classification rather than inner-loop wobble control. The framework demonstrates the data-driven, AI-assisted approach to laser-driven manufacturing reviewed in [
1,
2,
3,
4,
5,
6]: a single high-bandwidth optical-microphone channel replaces multi-sensor stacks, an end-to-end neural classifier replaces hand-tuned thresholds, and a Grad-CAM layer replaces black-box predictions, all at millisecond latency. The pipeline (single-sensor acquisition → multi-representation front end → compact CNN encoder → Grad-CAM) is transferable in principle to laser cutting, marking, and surface-modification monitoring across metals, polymers, ceramics, and composites; further validation is future work.
Several limitations remain. First, the dataset is restricted to 1.0 mm AA1050 and a single beam-wobble configuration; transferability to AA6063, copper, and copper-to-steel busbars [
7,
10,
14] needs empirical testing through fine-tuning of the trained network on small target-domain sets (20–30 labelled trials per new alloy or thickness), following transfer-learning practice common in welding monitoring. Second, acquisition took place in a controlled laboratory without cross-jet flow or factory-floor noise; production deployment will require noise-robust training via three complementary tracks—(i) noise-injection augmentation using public industrial-acoustic recordings mixed into the training trials at target signal-to-noise ratios; (ii) SpecAugment-style time and frequency masking during training; and (iii) unsupervised domain adaptation (DANN, CORAL) once a small amount of unlabelled factory-floor data is available—along the lines reviewed in [
3,
4]. Third, a possible label uncertainty remains near weld-regime boundaries. Although the five class labels were adopted from Basile et al. [
31] and supported by metallographic inspection and piercing verification, process variability may still affect welds produced under the same nominal conditions. This is especially relevant for the transition between lack of connection, marginal connections, and sound connections, and may explain some confusion between the marginal and sound classes. Future studies should use trial-specific metallographic, radiographic, or CT-based labels [
5,
6,
10,
45] to reduce this uncertainty. Fourth, the log-mel front end is a fixed log-frequency compression rather than a physically derived ultrasonic representation, allocating roughly one-sixth of its bands to the unused sub 1 kHz range; a head-to-head three-fold comparison against linear-Hz and log-Hz filterbanks (
Table 9) placed all three within 0.022 macro-F1, with the mel scale at the lowest fold variance; a learnable filterbank or a constant-Q transform calibrated to the 8/63/110 kHz keyhole peaks may still extract more signal from the ultrasonic band and is a target for future work.
6. Conclusions
This paper applied an Acoustic Signal Intelligence framework to in-process classification of laser beam welding states from a single optical-microphone signal, developed on an open dataset (183 weld trials of 1.0 mm AA1050 overlap joints, Xarion Eta250 ultra optical microphone, MHz) under a five-class taxonomy: lack of fusion, lack of connection, sound, marginal, and piercing.
The contributions are as follows: (i) A compact 1-D CNN encoder on a 128-band log-mel STFT spectrogram (199 237 parameters; three convolutional blocks of 64/128/256 channels and a single Linear() head) trained with class-balanced focal cross-entropy, which under three-fold replicate-out cross-validation achieved 0.732 ± 0.041 accuracy and 0.724 ± 0.034 macro-F1 with a piercing recall of 1.00 in every fold (one piercing trial per fold), the highest macro-F1 across all seventeen models tested (six classical-ML and four deep learning baselines including a five-variant 1-D CNN–Transformer hybrid); a multi-representation fusion variant adding a 64-element Daubechies db-10 wavelet-packet decomposition and a 24-element handcrafted feature library targeting the 8, 63 and 110 kHz keyhole-resonance peaks was evaluated as an ablation arm and not retained in the final model. (ii) A demonstration that Transformer-hybrid models and ACGAN-style oversampling underperformed compared to the proposed network on the 122-trial training set: every Transformer variant fell at least 0.19 accuracy and 0.39 macro-F1 below the CNN, and ACGAN collapsed four of the five attention variants to single-class predictions (accuracy ≤ 0.246). This conclusion is further supported by a 30-configuration grid search over learning rate, weight decay, and dropout (90 fold-runs in total) under which the best-tuned Transformer-hybrid macro-F1 (0.441 ± 0.057) remains 0.283 below the proposed CNN. (iii) A Grad-CAM analysis that recovered the keyhole-resonance bands (60–110 kHz for Sound/Pierc., 8–26 kHz for LoC/Marg., low-frequency for LoF) without explicit prior knowledge.
Three further points follow. First, the dual-threshold rule extended to the five-class taxonomy achieved only 0.295 accuracy/0.092 macro-F1 with zero piercing recall; the proposed network improved on this by
accuracy and
macro-F1 on the held-out replicate (McNemar
,
,
). The strongest classical baseline, SFFS-SVM on the 24-feature library, achieved slightly higher accuracy (0.760 ± 0.008) but lower macro-F1 (0.690 ± 0.007); the McNemar test (
,
,
) indicates the two are statistically indistinguishable on the 61-trial test set. Second, the complementarity of the three signal representations did not survive cross-validation: the full mel + WPT + 24-feature fusion model lifted Fold 3 accuracy by
and macro-F1 by
over the mel-only variant, but under three-fold CV it dropped to 0.628 ± 0.112/0.531 ± 0.183, below the mel-only 0.732 ± 0.041/0.724 ± 0.034 (
Table 8); this shows it was a single-fold outlier rather than a real lift, justifying the mel-only branch as the proposed model. Third, the Pearson correlations
and
confirmed that the network captured physical trends rather than trial-specific artefacts. CPU inference latency was 4.1 ms per trial on a single x86_64 core; with the 4 ms STFT window, the ∼120 Hz update rate fits the 50–100 ms outer-loop budget but lies an order of magnitude below the 1 kHz beam-wobble cycle, so the framework is suitable for outer-loop quality monitoring rather than inner-loop wobble control.
Future work will extend the framework along four directions: (a) validation on AA6063 and copper-to-steel busbars for cross-alloy transferability; (b) integration of cross-section-level porosity and crack labels alongside the parameter-derived taxonomy; (c) edge-hardware deployment for closed-loop laser-power control with a 1 kHz wobble cycle; and (d) self-supervised pre-training and noise-robust domain adaptation for production environments.