Article

Machine Unlearning for Robust DNNs: Attribution-Guided Partitioning and Neuron Pruning in Noisy Environments

1 School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
2 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2025, 7(3), 95; https://doi.org/10.3390/make7030095
Submission received: 19 June 2025 / Revised: 27 August 2025 / Accepted: 2 September 2025 / Published: 5 September 2025

Abstract

Deep neural networks (DNNs) are highly effective across many domains but are sensitive to noisy or corrupted training data. Existing noise mitigation strategies often rely on strong assumptions about noise distributions or require costly retraining, limiting their scalability. Inspired by machine unlearning, we propose a novel framework that integrates attribution-guided data partitioning, neuron pruning, and targeted fine-tuning to enhance robustness. Our method uses gradient-based attribution to probabilistically identify clean samples without assuming specific noise characteristics. It then applies sensitivity-based neuron pruning to remove components most susceptible to noise, followed by fine-tuning on the retained high-quality subset. This approach jointly addresses data and model-level noise, offering a practical alternative to full retraining or explicit noise modeling. We evaluate our method on CIFAR-10 image classification and keyword spotting tasks under varying levels of label corruption. On CIFAR-10, our framework improves accuracy by up to 10% (F-FT vs. retrain) and reduces retraining time by 47% (L-FT vs. retrain), demonstrating gains in both accuracy and efficiency. These results highlight its effectiveness in noisy settings and make it a scalable solution for robust generalization.

1. Introduction

Deep neural networks (DNNs) have achieved unprecedented success across diverse domains, fundamentally transforming fields such as computer vision [1,2,3], natural language processing [4,5,6], speech recognition [7,8,9], medical diagnosis [10,11], and reinforcement learning [12,13,14]. These remarkable achievements stem largely from groundbreaking architectural innovations including convolutional neural networks (CNNs) [1], residual networks [2], transformer architectures [5], and generative adversarial networks (GANs) [15]. Each of these developments has contributed substantially to enhanced model performance, training stability, and the ability to capture complex data representations. However, despite these architectural advances, the effectiveness and reliability of DNNs remain fundamentally constrained by the quality and integrity of their training data.
Real-world datasets are inherently susceptible to various forms of noise that compromise both input features and target labels, presenting significant challenges for robust model development. Feature noise manifests through multiple sources including sensor measurement inaccuracies, data acquisition errors, missing or corrupted values, and deliberate adversarial perturbations [16] designed to mislead model predictions [17]. Concurrently, label noise emerges from human annotation inconsistencies, subjective interpretation differences among annotators [18], ambiguous category boundaries, automated labeling system errors, and temporal changes in labeling criteria [19]. The pervasiveness of these noise sources is exemplified in widely-adopted benchmark datasets: comprehensive analysis reveals that approximately 4% of ImageNet labels contain errors or ambiguities [20], while clinical datasets such as MIMIC-III [21] exhibit substantial missing physiological measurements and inconsistent diagnostic annotations that further complicate reliable model training.
The fundamental challenge lies in the demonstrated capacity of DNNs to memorize arbitrary noise patterns, a phenomenon that severely undermines their generalization capabilities [22]. High-capacity networks are capable of achieving near-zero training error even when trained on datasets with completely randomized labels [23,24]. This behavior reveals their tendency to memorize noise and overfit to spurious correlations rather than learning meaningful patterns. This memorization behavior becomes particularly problematic in practical applications where model reliability is critical, as networks may confidently make incorrect predictions based on learned noise patterns rather than genuine signals [25].
Existing approaches to address noise-related challenges in deep learning can be broadly categorized into several paradigms, each with distinct advantages and limitations. Data preprocessing and cleaning methods [26,27,28] attempt to identify and remove noisy samples before training, but they often require substantial domain expertise and may inadvertently eliminate valuable edge cases or minority class examples. Noise-robust loss functions [29,30] modify the training objective to reduce sensitivity to label errors [31,32], yet these approaches typically rely on strong assumptions about noise distribution characteristics that may not hold in practice. Sample re-weighting [33,34] and curriculum learning techniques [35,36] dynamically adjust the influence of training examples during optimization but struggle with complex, asymmetric, and instance-dependent noise patterns commonly encountered in real-world scenarios [37]. Meta-learning [38,39] approaches attempt to learn noise-robust representations, but they require additional computational overhead and may not scale effectively to large-scale problems.
Machine unlearning emerges as a particularly promising paradigm for addressing data quality issues; it was originally developed to meet privacy compliance requirements by efficiently removing the influence of specific training samples without complete model retraining [40,41,42]. Its core principle—selectively erasing specific learned information—aligns with robustness objectives. Traditional data cleaning approaches typically operate at the dataset level. In contrast, machine unlearning techniques provide fine-grained control over the influence of individual samples within trained models. Recent theoretical and empirical investigations [43] have demonstrated that strategic application of unlearning methods can significantly enhance model robustness against noisy labels by selectively eliminating the impact of detrimental examples while preserving valuable generalizable knowledge [44]. This selective forgetting capability offers computational efficiency advantages over complete retraining while maintaining model performance on high-quality data. Despite these advances, existing noise mitigation strategies face several critical limitations that constrain their practical applicability. Many approaches require explicit assumptions about noise characteristics, distributions, or generation processes that are difficult to verify in real-world settings. Others demand extensive manual intervention, domain-specific expertise, or substantial computational resources that may not be available in resource-constrained environments.
To address these limitations, we propose a novel robust learning framework that synergistically combines attribution-based data [45,46] quality assessments, discriminative neuron pruning [47,48], and targeted fine-tuning [49,50] to comprehensively address the challenges posed by both noisy input features and corrupted labels. Our approach leverages gradient-based attribution methods to quantify sample quality without requiring explicit noise distribution assumptions, employs neuron-level analysis to identify and remove noise-sensitive model components, and applies selective fine-tuning to restore and enhance model performance on high-quality data. The key contributions of this work include the following:
  • Attribution-guided Data Partitioning: We utilize gradient-based attribution scores to reliably distinguish high-quality samples from noisy samples. Leveraging Gaussian mixture models allows for probabilistic clustering without imposing restrictive assumptions about the noise distributions.
  • Discriminative Neuron Pruning via Sensitivity Analysis: We introduce a novel methodology for quantifying neuron sensitivity to noise as a linear regression problem based on neuron activations. This strategy enables precise identification and removal of neurons primarily influenced by noisy samples. This regression-based sensitivity analysis for targeted pruning of noise-sensitive neurons represents a key methodological novelty not explored in prior attribution or pruning studies.
  • Targeted Fine-Tuning: After pruning, the network is fine-tuned exclusively on high-quality data subsets, effectively recovering and enhancing its generalization capability without incurring substantial computational costs.
Our proposed framework addresses critical gaps in existing noise mitigation strategies by providing a unified approach that operates without explicit noise modeling requirements, scales effectively to large-scale models and datasets, and integrates seamlessly into standard deep learning workflows. The empirical results demonstrate significant improvements in model robustness and generalization across diverse noise scenarios, contributing valuable insights and practical tools for developing reliable deep learning systems in noisy real-world environments. Our goal is to analyze an already-trained model and its weights in order to remove errors learned from noisy samples during training; designing noise-resistant training algorithms is a complementary problem that lies outside our scope.
The remainder of this paper is systematically structured as follows: Section 2 establishes the theoretical foundation by presenting essential preliminaries from noise-robust learning and machine unlearning. Building upon this framework, Section 3 elaborates our proposed methodology with rigorous algorithmic formulations. Section 4 provides empirical validation through controlled experiments, including comparative analysis with baselines and ablation studies. Section 5 summarizes our main findings and their implications for robust learning under noise. It also highlights key limitations and future research paths such as adaptive pruning and deployment in high-noise domains.

2. Preliminaries

In this section, we introduce the formal notation and core concepts that serve as the foundation for our proposed methodology. We first formulate the supervised learning problem in the presence of noisy labels, then review gradient-based attribution techniques for measuring sample quality, describe neuron pruning strategies that target spurious representations, and finally summarize the fine-tuning paradigm used to adapt pruned networks.

2.1. Supervised Learning

Let $\mathcal{X}$ denote the input space and $\mathcal{Y}$ the output space. In supervised learning, we aim to learn a function $f: \mathcal{X} \to \mathcal{Y}$ that maps inputs to outputs. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$, the standard approach is to find the parameters $\theta$ of a model $f_\theta$ that minimize the empirical risk:

$$\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(f_\theta(x_i), y_i)$$

where $\mathcal{L}: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^{+}$ is a loss function measuring the discrepancy between predicted and actual outputs.
In practical scenarios, the observed dataset is often contaminated by measurement noise and data corruption that can affect both input features and target outputs. Specifically, we can represent the observed dataset as $D_{\text{obs}} = \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{n}$, where $\tilde{x}_i = x_i + \delta_i$ and $\tilde{y}_i = y_i + \epsilon_i$. Here, $\delta_i$ and $\epsilon_i$ represent input and output noise, respectively, which may follow unknown distributions, exhibit adversarial characteristics, or arise from systematic measurement errors. This dual contamination presents significant challenges as the corruption can manifest in various forms: additive noise, multiplicative distortions, missing values, or adversarially crafted perturbations.
Standard empirical risk minimization applied to such corrupted data often leads to suboptimal parameter estimation and poor generalization performance. The degradation is particularly pronounced when the noise magnitude is significant, affecting a substantial portion of the dataset, or when the noise characteristics differ between the training and deployment environments. Input noise can cause the model to learn spurious correlations and reduce robustness to distributional shifts, while output noise leads to inconsistent supervision signals that impede convergence to the true underlying function.

2.2. Attribution Methods

Attribution methods provide a principled framework for quantifying the contribution of individual input features to a model’s predictions, serving as fundamental tools for model interpretability and explainability. In the context of learning with noisy observations, we leverage attribution techniques to identify and characterize samples that may contain corrupted inputs or outputs, exploiting the hypothesis that such samples exhibit distinctive attribution patterns.
Formally, for an input space $\mathcal{X} \subseteq \mathbb{R}^{d}$, an attribution method $\phi: \mathcal{X} \times \mathcal{Y} \times \mathcal{F} \to \mathbb{R}^{d}$ maps an input–output pair $(x, y)$ and a model $f \in \mathcal{F}$ to a d-dimensional vector in the input space, where each element quantifies the contribution of the corresponding feature to the model’s prediction. This mapping enables us to decompose the model’s decision-making process and identify the most influential features, making it applicable across different tasks and model types.
The landscape of attribution methods encompasses diverse approaches, broadly categorized into gradient-based and perturbation-based techniques [51,52]. Gradient-based methods leverage the model’s differentiability to compute feature importance through backpropagation, while perturbation-based approaches systematically modify inputs to observe changes in model behavior. Among these methods, we focus on Integrated Gradients (IGs) [45], which satisfies several desirable axiomatic properties including sensitivity, implementation invariance, and completeness, making it particularly suitable for robust analysis.
For a model f, input x, and baseline $x'$ (typically chosen as a zero vector or neutral reference point), Integrated Gradients is mathematically defined as follows:

$$\phi^{\text{IG}}(x, y, f) = (x - x') \odot \int_{\alpha=0}^{1} \nabla_x f\big(x' + \alpha (x - x')\big)\, d\alpha$$

where $\nabla_x f(x)$ denotes the gradient of f with respect to x and $\odot$ represents element-wise multiplication. The integral captures the accumulated gradients along the straight-line path from the baseline to the input, providing a principled attribution that satisfies the completeness axiom.
In practice, the integral is approximated using a Riemann sum with m steps:

$$\phi^{\text{IG}}(x, y, f) \approx (x - x') \odot \frac{1}{m} \sum_{k=1}^{m} \nabla_x f\Big(x' + \tfrac{k}{m} (x - x')\Big)$$
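As a concrete illustration, the following minimal PyTorch sketch implements this Riemann-sum approximation for a model that returns a scalar score per input; the function name, the zero baseline, and the number of steps are illustrative choices rather than details taken from the paper.

```python
import torch

def integrated_gradients(f, x, baseline=None, steps=50):
    """Approximate IG attributions for a single input tensor x."""
    if baseline is None:
        baseline = torch.zeros_like(x)      # neutral reference point x'
    total_grads = torch.zeros_like(x)
    for k in range(1, steps + 1):
        # Point on the straight-line path from the baseline to the input.
        point = (baseline + (k / steps) * (x - baseline)).detach().requires_grad_(True)
        output = f(point)                   # assumed to be a scalar score
        total_grads += torch.autograd.grad(output, point)[0]
    # (x - x') ⊙ average gradient along the path.
    return (x - baseline) * total_grads / steps
```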
The resulting attribution scores provide valuable insights into sample quality and data integrity. Our key insight builds on a growing body of work demonstrating that noisy or corrupted samples tend to induce abnormal attribution patterns that deviate from those of clean, high-quality data [53,54]. Specifically, such samples often force the model to rely on spurious, irrelevant, or unstable features, as the learning process adapts to accommodate corrupted inputs or inconsistent labels. This misalignment leads to attribution vectors with unusual magnitude distributions, unexpected importance rankings, and reduced consistency across semantically similar inputs [55,56].
These phenomena have been observed under both noisy supervision and adversarial perturbation, where models maintain high confidence despite relying on implausible or non-robust signals. As such, attribution methods can serve as indirect detectors of data quality issues, providing a principled mechanism to flag samples whose explanation patterns diverge from the norm. We exploit this property to guide the identification and removal of low-quality training data.

2.3. Neural Network Pruning

Neural network pruning is a model compression technique that systematically removes redundant connections or neurons from trained networks to reduce computational complexity while preserving performance on the target task. We focus on pruning middle-layer neurons in feed-forward networks, building on prior findings that such neurons often serve as key-value memories [57] and exhibit high functional specialization, where only a sparse subset is activated by any given input. This activation sparsity yields a degree of linear separability in the hidden representations, which aligns well with the assumptions underlying our regression-based neuron sensitivity metric. Similar structural properties have been highlighted in studies employing linear probes [58] and representational similarity analysis [59], both suggesting that individual neurons can encode highly discriminative information.
Under our setting, neurons whose activations strongly differentiate retained vs. forgotten samples tend to emerge as dominant features in regression and thus become more vulnerable to noise transfer. This intuition is consistent with prior insights from attribution and interpretability studies [60,61], which observe that highly important neurons are disproportionately impacted by distributional shifts. Moreover, we find that pruning decisions based on diverse importance metrics—including mean activation, activation frequency, and SVD-based orthogonality—tend to select overlapping neuron subsets, echoing trends observed in sparsity-aware pruning methods [62,63]. These theoretical and empirical findings jointly motivate our attribution-guided pruning approach for isolating neurons susceptible to noise propagation.
Consider a feed-forward neural network (FNN) $f_\theta$ with L layers and parameter set $\theta$. Let $W^l \in \mathbb{R}^{d_{l-1} \times d_l}$ denote the weight matrix connecting layer $l-1$ to layer l, where $d_l$ denotes the number of neurons in layer l. The forward propagation through layer l for input x is given by

$$h^l = \sigma\big((W^l)^{T} h^{l-1} + b^l\big), \quad l = 1, \dots, L$$

where $\sigma$ denotes a non-linear activation function, $b^l \in \mathbb{R}^{d_l}$ is the bias vector, and $h^0 = x$ represents the network input.
Structured pruning, particularly neuron-level pruning, involves the targeted elimination of individual computational units by zeroing their associated parameters. Specifically, pruning neuron j in layer l entails setting all of its incoming connections from layer $l-1$ to zero:

$$W^l_{j,:} = 0, \quad b^l_j = 0$$
This operation ensures that neuron j in layer l receives no input from the preceding layer, effectively disabling its activation in the forward pass.
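For concreteness, a minimal PyTorch sketch of this neuron-level pruning operation is shown below; it assumes the layer is a torch.nn.Linear, whose weight rows correspond to the output neurons of that layer, and the chosen indices are purely illustrative.

```python
import torch

@torch.no_grad()
def prune_neurons(linear: torch.nn.Linear, neuron_indices):
    """Disable the given neurons by zeroing their incoming weights and biases."""
    for j in neuron_indices:
        linear.weight[j, :] = 0.0        # all incoming connections of neuron j
        if linear.bias is not None:
            linear.bias[j] = 0.0         # its bias term

# Example: silence neurons 3 and 7 of a hidden layer with 128 units.
layer = torch.nn.Linear(256, 128)
prune_neurons(layer, [3, 7])
```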
The selection of neurons for pruning has been approached through various criteria in the literature. Magnitude-based methods [47] identify less important parameters by examining weight magnitudes, operating under the assumption that smaller weights contribute less to network function. Sensitivity-based approaches [64] evaluate neuron importance by measuring the impact of their removal on network performance, typically through gradient analysis or direct performance assessments. Information-theoretic methods [65] leverage concepts such as Fisher information to quantify parameter importance based on the curvature of the loss landscape around the current parameter configuration.
Beyond computational efficiency, pruning has emerged as a powerful tool for understanding and modifying learned representations. The selective removal of neurons can reveal the internal organization of neural networks and provide insights into how different components contribute to various aspects of the learning task. This capability is particularly relevant in scenarios where networks may learn spurious correlations or overfit to noise, as targeted pruning can potentially mitigate these issues by removing the responsible computational pathways.

2.4. Fine-Tuning

Fine-tuning is a transfer learning paradigm that adapts a pre-trained neural network to a new task or dataset by continuing the training process with carefully controlled parameter updates. This approach leverages the learned representations from the original training while allowing the model to specialize for the target domain, typically employing reduced learning rates to prevent catastrophic forgetting of useful pre-trained features.
Formally, consider a pre-trained model $f_\theta$ with initial parameter configuration $\theta_0$, and let $D = \{(x_i, y_i)\}_{i=1}^{n}$ denote the target dataset for adaptation. The fine-tuning process optimizes the parameters $\theta$ by minimizing a regularized empirical risk objective:

$$\min_{\theta} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} \mathcal{L}(f_\theta(x_i), y_i) + \lambda\, \Omega(\theta, \theta_0)$$

where $\mathcal{L}$ represents the task-specific loss function (e.g., cross-entropy for classification or mean squared error for regression), $\Omega(\theta, \theta_0)$ is a regularization term that penalizes excessive deviation from the pre-trained weights, and $\lambda > 0$ controls the trade-off between task adaptation and parameter conservation.
The regularization component $\Omega(\theta, \theta_0)$ plays a crucial role in maintaining the stability of fine-tuning. A widely adopted choice is $\ell_2$-regularization:

$$\Omega(\theta, \theta_0) = \|\theta - \theta_0\|_2^2$$
which encourages the fine-tuned parameters to remain in the vicinity of their pre-trained values, thereby preserving beneficial representations while allowing controlled adaptation to the new task.
The fine-tuning process typically incorporates several practical considerations to ensure effective transfer learning. These include the use of reduced learning rates compared to training from scratch, layer-wise learning rate scheduling that applies different rates to different network depths, and early stopping mechanisms that monitor validation performance to prevent overfitting to the target dataset. The choice of the fine-tuning strategy—whether updating all parameters or specific subsets—should be primarily task-driven, balancing computational efficiency and model performance based on the target task’s requirements.
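To make the objective above concrete, the following minimal PyTorch sketch fine-tunes a model with an $\ell_2$ penalty toward its pre-trained weights; the model, data loader, learning rate, and the value of lambda_reg are placeholders rather than settings reported in the paper.

```python
import torch

def finetune(model, loader, epochs=5, lr=1e-4, lambda_reg=1e-3):
    # Frozen copy of the pre-trained parameters θ0.
    theta0 = {name: p.detach().clone() for name, p in model.named_parameters()}
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            loss = criterion(model(x), y)
            # Ω(θ, θ0) = ||θ - θ0||² keeps parameters near their pre-trained values.
            reg = sum(((p - theta0[name]) ** 2).sum()
                      for name, p in model.named_parameters())
            optimizer.zero_grad()
            (loss + lambda_reg * reg).backward()
            optimizer.step()
    return model
```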

2.5. Machine Unlearning for Robustness

Machine unlearning is an emerging paradigm originally motivated by privacy concerns, aiming to remove the influence of specific training samples from a trained model without requiring complete retraining [40,41]. From a robustness standpoint, the ability to selectively forget the impact of noisy or corrupted examples aligns closely with the goal of enhancing model generalization under data contamination.
Formally, let $D = \{(x_i, y_i)\}_{i=1}^{n}$ denote the training dataset, and let $D_{\text{target}} \subseteq D$ be the subset of samples to be unlearned. A model $f_\theta$ is said to have unlearned $D_{\text{target}}$ if its behavior (e.g., predictions and gradients) closely approximates that of a model trained from scratch on $D \setminus D_{\text{target}}$. The challenge lies in achieving this goal efficiently, often through strategies such as parameter modification, projection-based updates, or architectural pruning.
Recent studies have explored the dual use of unlearning techniques for improving robustness by removing noisy or harmful training samples [43,44]. The central idea is that selectively forgetting examples with anomalous attribution patterns or spurious feature reliance can mitigate overfitting and improve generalization. Notably, these noisy examples often do not exist in isolation: samples with similar feature profiles or attribution distributions may also be partially influenced during unlearning, either due to shared neuron pathways or overlapping representations.
In this work, we exploit gradient-based attribution to identify candidate samples for unlearning and employ structured pruning to remove neurons associated with spurious or noisy patterns. The unlearning operation, in our context, is thus not merely a privacy mechanism, but a robustness enhancement tool that builds upon the insights of explainability and representation analysis.

3. Method

3.1. Problem Setup

We investigate supervised learning with a neural network model $f_\theta: \mathcal{X} \to \mathcal{Y}$ parameterized by $\theta$, where $\mathcal{X}$ and $\mathcal{Y}$ denote the input and output spaces, respectively. The model is trained on a dataset $D = \{(\hat{x}_m, \hat{y}_m)\}_{m=1}^{M}$, where each $(\hat{x}_m, \hat{y}_m)$ represents an input–output pair. In practical scenarios, such datasets frequently contain noise-corrupted samples that degrade model performance. While conventional approaches primarily enhance data robustness through preprocessing techniques to build systems resilient to imperfect data under complex quality conditions, our approach distinctively leverages the intrinsic dynamic properties of neural networks during training to identify high-fidelity samples.
We formally decompose dataset D into two disjoint subsets: D r (high-quality data) and D n (noise-corrupted data). The presence of D n causes the learned model f θ to deviate from the optimal function that would be obtained using D r exclusively. Our method employs f θ to systematically partition D into D r and D n , subsequently evaluates f θ using these identified subsets to selectively prune noise-sensitive neurons, and finally fine-tunes f θ on D r to enhance generalization performance.
Our research objectives are threefold: (1) to develop an efficient algorithm for partitioning the dataset into D r and D n through analysis of f θ ’s learning dynamics; (2) to establish a principled methodology for identifying and selectively pruning neurons that disproportionately contribute to learning from noisy data, utilizing both D r and D n ; and (3) to optimize a fine-tuning procedure for the pruned model architecture on the high-quality subset D r to restore and potentially enhance model performance and generalization capabilities. The overall algorithm flow is shown in Figure 1.

3.2. Data Partition

To partition the dataset D into D r and D n , we leverage neural attribution scores to quantify the contribution of individual training samples to the model’s decision-making process. Specifically, we adopt the Integrated Gradients (IG) method as our gradient-based attribution approach due to its desirable properties such as sensitivity, implementation invariance, and completeness. For multi-class tasks, we first compute IG attribution scores for each sample with respect to each output class, and then combine them by performing max pooling across output dimensions to obtain a single attribution vector per sample. The clean/noisy separation threshold is not manually set; instead, we use a Gaussian mixture model (GMM) to automatically determine the split based on posterior assignment probabilities, assigning each sample to the cluster with the highest posterior value. Intuitively, clean samples tend to produce sharp and concentrated attributions aligned with semantic regions, while noisy samples often yield noisy or diffuse attribution patterns [60]. Within the model f θ , we employ an FNN architecture to define an attribution function that assigns a comprehensive quality score to each sample.

3.2.1. Attribution Computation

To quantify each sample’s influence on model decisions, we compute attribution scores through gradient-based sensitivity analysis. For the model $f_\theta$ with L layers, let $f_m^l \in \mathbb{R}^N$ denote the activation vector at layer l for the m-th sample, where N represents the number of neurons in that layer. Let $y_m = f_\theta(\hat{x}_m) \in \mathbb{R}^D$, and let $y_m^d$ denote the d-th dimension of the output. We calculate the gradient of the model’s output with respect to the activation $f_m^l$, denoted as $\nabla_{f_m^l} y_m^d$. The attribution score for the m-th sample at layer l for output dimension d is formally defined as

$$A_{m,d}^{l} = f_m^l \odot \nabla_{f_m^l} y_m^d.$$
To quantify the contribution of individual neurons in deep neural networks, we define $A_{m,d}^{l,n}$ as the attribution score of the n-th neuron in layer l for the d-th output dimension given the m-th input sample. To obtain a precise per-neuron importance measure for each sample, we denote $A_{m,d}^{l} = \{A_{m,d}^{l,n}\}_{n=1}^{N}$ and compress the attribution tensor through max pooling across output dimensions. This operation is mathematically defined as

$$\hat{A}_m^{l,n} = \max_{1 \le d \le D} A_{m,d}^{l,n}$$

Each element $\hat{A}_m^{l,n}$ represents the most significant influence of neuron n in layer l on any output dimension for sample m.
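The following minimal PyTorch sketch illustrates this per-sample, per-neuron attribution (element-wise activation times gradient at layer l, max-pooled over the D output dimensions); the hook-based access to the layer activations is an implementation choice of this sketch, not a detail specified in the paper.

```python
import torch

def sample_attribution(model, layer, x):
    """Return the max-pooled attribution vector (one score per neuron) for one sample x."""
    acts = {}
    handle = layer.register_forward_hook(lambda mod, inp, out: acts.update(a=out))
    y = model(x.unsqueeze(0))                        # output of shape (1, D)
    handle.remove()
    f_l = acts["a"]                                  # layer-l activations, shape (1, N)
    per_dim = []
    for d in range(y.shape[1]):
        grad = torch.autograd.grad(y[0, d], f_l, retain_graph=True)[0]
        per_dim.append((f_l * grad).squeeze(0))      # A_{m,d}^l = f_m^l ⊙ ∂y_m^d/∂f_m^l
    # Max pooling across output dimensions yields one score per neuron.
    return torch.stack(per_dim, dim=0).max(dim=0).values   # shape (N,)
```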

3.2.2. Attribution-Based Clustering via a Gaussian Mixture Model

After computing the sample-wise attribution scores, we partition the dataset using probabilistic clustering techniques. Among various clustering methodologies evaluated, the Gaussian mixture model (GMM) [66] demonstrated superior performance due to its capacity to capture non-spherical cluster shapes through covariance modeling, provide soft assignment probabilities, and accommodate varying cluster densities. The GMM models the attribution vectors $a_m = \hat{A}_m^{l,1:N} \in \mathbb{R}^N$, $m = 1, \dots, M$, as a weighted superposition of K Gaussian components:

$$p(a_m) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(a_m \mid \mu_k, \Sigma_k)$$

where $\pi_k$ denotes the mixing coefficient (subject to $\sum_{k=1}^{K} \pi_k = 1$) and $\mu_k \in \mathbb{R}^N$ and $\Sigma_k \in \mathbb{R}^{N \times N}$ represent the mean vector and covariance matrix of component k, respectively, characterizing the central tendency and inter-neuron correlation structure of attribution scores. For this binary partitioning scenario ($K = 2$), we standardize all attribution vectors to follow a standard normal distribution $\mathcal{N}(0, 1)$, initialize parameters via the k-means++ [67] algorithm, and optimize the model parameters using the expectation–maximization (EM) algorithm [68]. The posterior cluster assignment probability for sample m is given by

$$\gamma_k(a_m) = \frac{\pi_k\, \mathcal{N}(a_m \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(a_m \mid \mu_j, \Sigma_j)}$$
During our experiments, we compared the GMM-based partitioning with other classic clustering methods such as K-means [69]. Our framework improved performance regardless of which clustering method was used, but the GMM yielded an additional one to two points of accuracy in specific cases. Although attribution vectors may be high-dimensional, research indicates that the GMM remains effective for modeling data manifolds even in high-dimensional settings, especially when combined with preprocessing techniques such as normalization or dimensionality reduction [70,71]. In our case, we first standardized all attribution vectors to follow a standard normal distribution. This preprocessing reduces sparsity and scale imbalance, making the Gaussian assumption of the GMM more reasonable. Additionally, the GMM’s soft assignment property provides a principled probabilistic framework for capturing uncertainty in sample attribution, which is advantageous for robust learning tasks.
The final partitions are determined by assigning each sample to the cluster with the highest posterior probability, yielding the high-quality subset

$$D_r = \{ a_m \mid \arg\max_k \gamma_k(a_m) = 1 \}$$

and the noise-corrupted subset

$$D_n = \{ a_m \mid \arg\max_k \gamma_k(a_m) = 2 \}.$$
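The partitioning step can be sketched with scikit-learn as follows; here `attributions` is an (M, N) array of max-pooled per-sample attribution vectors, and the heuristic of treating the larger cluster as the clean subset is an assumption of this sketch, not a rule stated in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

def partition_dataset(attributions: np.ndarray):
    a = StandardScaler().fit_transform(attributions)      # standardize to N(0, 1)
    gmm = GaussianMixture(n_components=2, covariance_type="full",
                          init_params="k-means++", random_state=0).fit(a)
    labels = gmm.predict(a)                                # hard assignment via max posterior
    # Assumption: the larger cluster is treated as the high-quality subset D_r.
    clean = np.argmax(np.bincount(labels))
    idx_r = np.where(labels == clean)[0]                   # indices of D_r
    idx_n = np.where(labels != clean)[0]                   # indices of D_n
    return idx_r, idx_n
```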

3.3. Selective Pruning

To systematically identify neurons that are disproportionately influenced by noisy samples and responsible for encoding spurious noise patterns, we propose a novel regression-based sensitivity analysis. We implement this analysis using features extracted from the pre-trained model $f_\theta$ and the partitioned dataset $D = \{(\hat{x}_m, \hat{y}_m)\}_{m=1}^{M}$ with corresponding binary quality labels $z_m \in \{0, 1\}$, where $z_m = 1$ denotes high-quality samples in $D_r$ and $z_m = 0$ denotes noise-corrupted samples in $D_n$.
Let $f_m^l = [f_{m,1}^{l}, f_{m,2}^{l}, \dots, f_{m,N}^{l}] \in \mathbb{R}^N$ denote the activation vector at layer l for sample m, where N represents the number of neurons in layer l. We formulate and solve a least-squares linear regression problem to predict the sample’s quality label $z_m$ from these activations:

$$\min_{T^l, u^l} \sum_{m=1}^{M} \big(z_m - T^l \cdot f_m^l - u^l\big)^2,$$

where $T^l = [T_1^l, T_2^l, \dots, T_N^l] \in \mathbb{R}^N$ represents the vector of learned regression coefficients and $u^l \in \mathbb{R}$ is the intercept term.
For each neuron n in layer l, we define a comprehensive sensitivity score $s_n$ that quantifies its influence on distinguishing between high-quality and noise-corrupted samples:

$$s_n = |T_n^l| + \lambda\, |u_n^l|$$
where λ is a hyperparameter that modulates the relative importance of the weights versus the bias term in the sensitivity calculation.
This sensitivity score s n fundamentally differs from conventional neuron importance measures used in pruning (e.g., weight magnitude [72] and activation statistics [64]). While those criteria aim to identify neurons that are generally important for task performance or computationally redundant, s n specifically targets neurons whose activation patterns strongly correlate with the presence of noise in the training data. Pruning based on s n directly aims to excise the pathways within the network that have learned to rely on or are activated by noisy patterns, thereby enhancing model robustness to label and feature corruption.
To execute the pruning operation, we identify the top-$\alpha$ ($\alpha \in (0, 1)$) proportion of neurons, $N_{\text{prune}} = \{ n_j \mid s_j \ge s_{((1-\alpha)N)} \}$, ranked in descending order by their sensitivity scores. The pruning procedure systematically zeros out both the incoming weights and biases of the identified neurons in layer l:

$$\forall n_j \in N_{\text{prune}}: \quad W^l[j, :] \leftarrow 0, \quad b^l[j] \leftarrow 0$$

where $W^l \in \mathbb{R}^{N_l \times N_{l-1}}$ and $b^l \in \mathbb{R}^{N}$ represent the weight matrix and bias vector of layer l, respectively. This selective pruning process yields a refined model $f_{\theta_p}$ with enhanced robustness to noise-corrupted training samples.
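A minimal sketch of this regression-based sensitivity analysis and the subsequent pruning step is given below; `acts` is an (M, N) array of layer-l activations, `z` holds the binary quality labels from the partitioning stage, and the values of `alpha` and `lam` are placeholders.

```python
import numpy as np
import torch
from sklearn.linear_model import LinearRegression

@torch.no_grad()
def prune_noise_sensitive(linear: torch.nn.Linear, acts, z, alpha=0.15, lam=0.0):
    reg = LinearRegression().fit(acts, z)            # fit z ≈ T·f + u by least squares
    T, u = reg.coef_, reg.intercept_
    scores = np.abs(T) + lam * np.abs(u)             # s_n = |T_n| + λ|u|
    n_prune = int(alpha * len(scores))
    prune_idx = np.argsort(scores)[-n_prune:]        # top-α most noise-sensitive neurons
    for j in prune_idx:
        linear.weight[j, :] = 0.0                    # zero incoming weights of neuron j
        if linear.bias is not None:
            linear.bias[j] = 0.0
    return prune_idx
```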

3.4. Fine-Tuning

Following selective pruning, the model undergoes a fine-tuning phase exclusively on the high-quality subset D r . This step is crucial for restoring the performance deterioration caused by pruning and reinforcing the learning of clean patterns.
Let θ denote the parameters of the original trained model, θ p the parameters after pruning (from Section 3.3), and  θ f the final parameters after fine-tuning. We obtain θ f by optimizing
$$\min_{\theta_f} \frac{1}{|D_r|} \sum_{(\bar{x}_i, \bar{y}_i) \in D_r} \mathcal{L}(f_{\theta_f}(\bar{x}_i), \bar{y}_i) + \lambda_{\text{reg}}\, \|\theta_f - \theta_p\|_2^2$$
where L is the task-specific loss function (e.g., cross-entropy for classification) and  λ reg is a regularization coefficient that controls the deviation from the pruned model. This regularization term prevents catastrophic forgetting of useful information learned before pruning.
During the fine-tuning stage, we designed two strategies to compare their impact on the final performance:
Layer-wise Fine-Tuning: Only the parameters $\theta_l$ of layer l are updated (where $\theta_l \subseteq \theta_p$); all other parameters ($\theta_{\text{frozen}} = \theta_p \setminus \theta_l$) remain fixed. The optimization becomes

$$\theta_{l,f} = \arg\min_{\theta_l} \frac{1}{|D_r|} \sum_{(\bar{x}_i, \bar{y}_i) \in D_r} \mathcal{L}\big(f_{\{\theta_l, \theta_{\text{frozen}}\}}(\bar{x}_i), \bar{y}_i\big) + \lambda_{\text{reg}}\, \|\theta_l - \theta_{l,p}\|_2^2$$

The resulting model is denoted as $f_{\theta_f}^{\text{layer}}$, where $\theta_f = \{\theta_{l,f}, \theta_{\text{frozen}}\}$.
Full-Model Fine-Tuning: All parameters are updated jointly:

$$\theta_f = \arg\min_{\theta} \frac{1}{|D_r|} \sum_{(\bar{x}_i, \bar{y}_i) \in D_r} \mathcal{L}\big(f_{\theta}(\bar{x}_i), \bar{y}_i\big) + \lambda_{\text{reg}}\, \|\theta - \theta_p\|_2^2$$

The resulting model is denoted as $f_{\theta_f}^{\text{full}}$.
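A minimal PyTorch sketch of how the two strategies differ in which parameters they update is shown below (the regularized loss itself follows the objective in Section 2.4); `model` and `target_layer` are placeholders for the pruned network and its layer l.

```python
import torch

def configure_finetuning(model: torch.nn.Module, target_layer: torch.nn.Module,
                         strategy: str = "layer", lr: float = 1e-4):
    if strategy == "layer":                 # Layer-wise Fine-Tuning (L-FT)
        for p in model.parameters():
            p.requires_grad = False         # freeze θ_frozen
        for p in target_layer.parameters():
            p.requires_grad = True          # update only θ_l
    else:                                   # Full-Model Fine-Tuning (F-FT)
        for p in model.parameters():
            p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr)
```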
By comparing the performance of the original model f θ , pruned model f θ p , and fine-tuned models f θ f layer and f θ f full on the test set, we can effectively demonstrate the validity of our method. The complete methodology is presented in Algorithm 1.
Algorithm 1: Robust learning via attribution-based pruning (RLAP)
Require: Training dataset $D = \{(\hat{x}_m, \hat{y}_m)\}_{m=1}^{M}$; pretrained FNN model $f_\theta$ with L layers; target layer index $l \in \{1, \dots, L\}$; pruning ratio $\alpha$; regularization coefficient $\lambda_{\text{reg}}$
Ensure: Refined model $f_{\theta_{\text{final}}}$
Phase 1: Data Partitioning
 1: for each sample $(\hat{x}_m, \hat{y}_m) \in D$ do
 2:   Forward pass to compute the layer-l activations $f_m^l = f_\theta^l(\hat{x}_m) \in \mathbb{R}^N$ and the output $y_m = f_\theta(\hat{x}_m) \in \mathbb{R}^D$
 3:   Backpropagate to compute the gradient of the d-th output dimension: $\nabla_{f_m^l} y_m^d = \partial y_m^d / \partial f_m^l$
 4:   Compute the attribution matrix using Equation (8)
 5:   Max-pool across output dimensions: $\hat{A}_m^{l,n} = \max_{1 \le d \le D} A_{m,d}^{l,n}$, $n = 1, \dots, N$
 6: end for
 7: Fit a GMM with $K = 2$ components on $\{a_m\}_{m=1}^{M}$, where $a_m = \hat{A}_m^{l,1:N}$
 8: Partition D into $D_r$ and $D_n$ using Equation (11)
Phase 2: Layer-l Selective Pruning
 9: Extract the layer-l activations $\{f_m^l\}_{m=1}^{M}$ and quality labels $\{z_m\}_{m=1}^{M}$
10: Solve the linear regression $\min_{T^l, u^l} \sum_m (z_m - T^l \cdot f_m^l - u^l)^2$
11: for each neuron n in layer l do
12:   Compute the sensitivity score $s_n = |T_n^l| + \lambda |u_n^l|$
13: end for
14: Select the top-$\alpha$ neurons $N_{\text{prune}}$ ranked by $s_n$
15: Zero the weights and biases in layer l: $W^l[j, :] \leftarrow 0$, $b^l[j] \leftarrow 0$ for $j \in N_{\text{prune}}$
16: Obtain the pruned model $f_{\theta_p}$
Phase 3: Fine-Tuning
17: Option A: layer-l fine-tuning using Equation (16)
18: Option B: full-model fine-tuning using Equation (17)
19: Return the fine-tuned model $f_{\theta_f}^{\text{layer}}$ or $f_{\theta_f}^{\text{full}}$.
The comprehensive experimental validation of our method will be presented in Section 4, where we expect to observe (i) consistent performance improvement over baseline models, (ii) greater robustness to varying noise levels compared to retraining approaches, and (iii) computational efficiency gains from our selective pruning strategy.

4. Experiment

To validate our proposed robust learning via attribution-based pruning (RLAP) framework, we conducted comprehensive evaluations across two distinct application domains: computer vision (CIFAR-10 image classification) and speech recognition (speech command keyword spotting).
Our experiments consider only feature noise for these classification tasks, but the method extends readily to settings in which both features and labels are noisy. We selected these two datasets to validate the reliability of our theory on FNN layers: many simple network architectures already achieve good results on them. For larger-scale datasets, we acknowledge that the superior performance of state-of-the-art networks stems primarily from complex CNN architectures built on ResNet-style blocks, whose layers no longer operate in isolation like FNN layers but interact across many layers; extending our analysis to such deeply interconnected CNNs is the next step of our work. In these two experiments, we selected the standard ResNet18 model and a simple network consisting of three convolutional layers and two fully connected layers as the base models, demonstrating that our method can complement current models with a new training strategy. The pre-trained models were obtained by training these networks, with the number of training epochs chosen by inspecting the learning curves from an initial run and stopping before overfitting set in. During the experiments, we injected the type of noise most common for each task into the dataset. We did not use additional noisy datasets for verification because we wanted to control the variables tightly and keep the experiment a white box, which allowed us to observe the effect of each step of the procedure.
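As an illustration of the kind of controlled corruption described above, the sketch below injects additive Gaussian pixel noise into a randomly chosen fraction of images; the exact corruption and scaling used in the paper are not fully specified, so the noise type, the `attack_rate`/`noise_level` parameterization, and the 0.05 scaling factor are assumptions of this sketch.

```python
import torch

def inject_feature_noise(images: torch.Tensor, attack_rate: float, noise_level: int):
    """images: (M, C, H, W) tensor scaled to [0, 1]; returns a corrupted copy."""
    noisy = images.clone()
    m = images.shape[0]
    attacked = torch.randperm(m)[: int(attack_rate * m)]   # subset of samples to corrupt
    sigma = 0.05 * noise_level                              # assumed mapping of noise level to std
    noisy[attacked] += sigma * torch.randn_like(noisy[attacked])
    return noisy.clamp_(0.0, 1.0), attacked
```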
We also report the per-epoch training time in the experiment to reflect computational efficiency. This comparison shows that although our method involves two additional stages—data partitioning and neuron pruning—their overhead is modest. Specifically, both the partitioning and pruning procedures are executed only once before fine-tuning and are computationally light. In our implementation, these two stages take a total time equivalent to approximately 2–3 epochs of standard training, depending on the dataset size and model architecture. By contrast, the total training time of each method is primarily dominated by the number of training epochs and the average time per epoch.
Subsequently, we designated the trained model for each task as the initial model baseline. To demonstrate the effectiveness of our proposed approach, we evaluate both fine-tuning strategies, layer-wise fine-tuning ($f_{\theta_f}^{\text{layer}}$, L-FT) and full-model fine-tuning ($f_{\theta_f}^{\text{full}}$, F-FT), across all experimental tasks. Furthermore, to validate the efficacy of our proposed data partitioning methodology, we incorporated an additional baseline involving a retrained model. The retrained model differs from the initial model solely in the composition of the training set, where the full dataset is replaced with the refined subset $D_r$ obtained through the data differentiation process described in Section 3.2. This comparison enables us to isolate and assess the contribution of our data curation approach independently from the pruning strategy outlined in Section 3.3.
Following the methodology established in Section 3, all experiments were implemented using Algorithm 1 in PyTorch with a fixed pruning ratio of α = 0.15 . This pruning ratio was determined through preliminary ablation studies, which verified that α = 0.15 effectively removes the majority of data-sensitive neurons from the initial model while maintaining optimal performance.

4.1. Computer Vision

4.1.1. Dataset and Experimental Setup

We evaluate our method on CIFAR-10, a widely adopted benchmark in computer vision research. This dataset comprises 60,000 color images ( 32 × 32 pixels) equally distributed across 10 object classes, with a standard split of 50,000 training and 10,000 test samples. CIFAR-10 serves as an ideal testbed due to its manageable computational scale, balanced class distribution, and established role in evaluating CNN architectures, hyperparameter optimization, and novel training methodologies. Beyond classification tasks, this dataset facilitates comprehensive studies on transfer learning, semi-supervised learning, data augmentation, and robustness against noisy inputs—making it particularly relevant for our noise-robust learning framework.
We adopt ResNet-18 as the backbone architecture for all experiments. We conduct comprehensive performance comparisons across four key evaluation metrics:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where TP (true positives) is the number of correctly predicted positive samples, TN (true negatives) is the number of correctly predicted negative samples, FP (false positives) is the number of negative samples incorrectly predicted as positive, and FN (false negatives) is the number of positive samples incorrectly predicted as negative. In our experiments, recall is equal to accuracy, so it is not shown in the tables.
We evaluate these metrics at varying training scales and measure the per-epoch computational time. We compared the results of the initial model, L-FT, F-FT, and the retrained model across these metrics (Table 1).
To establish competitive baselines, we compare against prior work by Pochinkov et al. [73], who proposed four neuron-scoring methods for machine unlearning. The baseline importance functions are defined as
$$I_{\text{abs}}(D, n) = \frac{1}{|D|} \sum_{m=1}^{M} \big|f_m^l(n)\big|, \qquad I_{\text{rms}}(D, n) = \sqrt{\frac{1}{|D|} \sum_{m=1}^{M} \big(f_m^l(n)\big)^2},$$
$$I_{\text{freq}}(D, n) = \frac{1}{|D|} \cdot \Big|\big\{ m \in \{1, \dots, M\} \mid f_m^l(n) > 0 \big\}\Big|, \qquad I_{\text{std}}(D, n) = \sqrt{\frac{1}{|D|} \sum_{m=1}^{M} \big(f_m^l(n) - \bar{f}^l(n)\big)^2},$$

where n is a neuron in layer l of model $f_\theta$, $f_m^l(n)$ denotes the activation of neuron n in layer l for input sample $\hat{x}_m$, and $\bar{f}^l(n) = \frac{1}{M} \sum_{m=1}^{M} f_m^l(n)$ is the mean activation of neuron n.
The scoring function for comparing neuron behavior between the high-quality ($D_r$) and noisy ($D_n$) subsets is

$$\text{Score}(n, D_r, D_n) = \frac{I(D_n, n)}{I(D_r, n) + \epsilon}$$

where $\epsilon > 0$ is a small constant for numerical stability.
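The following minimal sketch implements these baseline importance functions and the comparison score; `acts` is an (M, N) array of layer-l activations for one data subset, and the epsilon default is a placeholder.

```python
import numpy as np

def importance(acts: np.ndarray, kind: str) -> np.ndarray:
    if kind == "abs":              # mean absolute activation
        return np.abs(acts).mean(axis=0)
    if kind == "rms":              # root-mean-square activation
        return np.sqrt((acts ** 2).mean(axis=0))
    if kind == "freq":             # fraction of samples that activate the neuron
        return (acts > 0).mean(axis=0)
    if kind == "std":              # standard deviation of the activation
        return acts.std(axis=0)
    raise ValueError(f"unknown importance criterion: {kind}")

def score(acts_noisy: np.ndarray, acts_clean: np.ndarray, kind: str, eps: float = 1e-8):
    # Neurons that respond far more strongly to D_n than to D_r receive high scores.
    return importance(acts_noisy, kind) / (importance(acts_clean, kind) + eps)
```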
Our novel pruning method (Section 3.3) outperforms these baselines in accuracy after pruning and fine-tuning (Table 2). We further validate robustness through noise-level experiments, demonstrating consistent performance across varying noise conditions (Table 3). Table 4 demonstrates that our method achieves superior accuracy across various attack rates while maintaining computational efficiency, and Table 5 confirms its robust generalization capability under different test-time attack scenarios.
All implementations use PyTorch 2.0 with NVIDIA A10 GPUs. For the training epochs, 60 epochs were used for the initial model and the Retrain Model, while 30 epochs were used for fine-tuning variants (L-FT/F-FT). The results are evaluated independently on the test set.

4.1.2. Results and Analysis

Figure 2 provides visual evidence of our method’s capability to rectify classification errors made by the baseline model. The displayed samples—misclassified by the initial model but correctly predicted by both fine-tuning variants—demonstrate the effectiveness of our noise-robust learning approach in recovering discriminative features that were corrupted during standard training procedures.
Superior Performance of Full-Model Fine-tuning: As demonstrated in Table 1, full-model fine-tuning achieves superior performance and enhanced capacity recovery across all data scales. At the complete 50 k training size, full-model fine-tuning attains 80.20% accuracy—representing a substantial 10.76% absolute improvement over the noise-corrupted initial model (69.44%) and a 7.21% absolute improvement over conventional retraining (72.99%). This dual improvement confirms that our pruning strategy effectively preserves critical discriminative features while eliminating noise-sensitive components.
Computational Efficiency: Our specialized fine-tuning achieves rapid performance restoration with minimal computational overhead. As evidenced in Table 1, full-model fine-tuning requires only 14.44 s per epoch, slightly less than retraining (15 s per epoch), while delivering superior accuracy (80.20% vs. 72.99%). Moreover, this performance gain is achieved with half the training epochs (30 vs. 60), so the total fine-tuning time is less than half that of retraining, demonstrating the accelerated convergence enabled by precise neuron pruning.
Table 1. Comparative analysis of fine-tuning strategies across different training scales.

Train Size | Metric    | Initial Model | L-FT   | F-FT   | Retrain Model
50 k       | Accuracy  | 0.6944        | 0.7508 | 0.8020 | 0.7299
           | Precision | 0.6962        | 0.7546 | 0.8075 | 0.7450
           | F1-score  | 0.6962        | 0.7500 | 0.8024 | 0.7298
           | Time (s)  | 20.48         | 7.98   | 14.44  | 15
25 k       | Accuracy  | 0.6196        | 0.6818 | 0.7253 | 0.6616
           | Precision | 0.6710        | 0.6835 | 0.7257 | 0.6879
           | F1-score  | 0.6129        | 0.6812 | 0.7228 | 0.6569
           | Time (s)  | 11.68         | 3.1    | 5.55   | 5.43
12.5 k     | Accuracy  | 0.5619        | 0.6067 | 0.6124 | 0.5828
           | Precision | 0.5857        | 0.6112 | 0.6226 | 0.6238
           | F1-score  | 0.5525        | 0.6070 | 0.6108 | 0.5884
           | Time (s)  | 6.87          | 2.02   | 3.63   | 4.77
Note: For all metrics except time, the best results appear in the F-FT column; the best time results are highlighted in bold.
Data-Efficient Generalization: As training data availability decreases (Table 1), our method maintains significant performance advantages. With only 12.5 k samples (25% of the full dataset), full-model fine-tuning achieves 61.24% accuracy—an absolute improvement of 2.96% over retraining (58.28%). This result highlights our method’s ability to effectively leverage limited high-quality data for robust model training.
Attribution-Guided Pruning Effectiveness: Table 2 demonstrates our attribution-guided pruning strategy’s unique capability to identify and eliminate noise-sensitive neurons, resulting in superior final model performance. The results reveal three key advantages: (1) Our method exhibits the most significant initial accuracy drop (24.23% at the 50 k scale), indicating that the regression-based sensitivity criterion successfully identifies and removes neurons that are critically involved in making predictions based on noisy patterns. (2) Following full-model fine-tuning, our approach achieves 80.20% accuracy—outperforming all alternatives, including the second-best standard deviation method’s full-model fine-tuning (79.88%), by 0.32%, despite requiring less training time. (3) The performance advantage becomes more pronounced at smaller training scales, with our method maintaining a 3.81% accuracy lead (72.53% vs. 68.72%) over RMS’s full-model fine-tuning at the 25 k scale, demonstrating robust scalability across different data regimes. This validates that the model captures a different and more targeted aspect of neuron importance—sensitivity to noise—compared to unsupervised criteria like activation magnitude or frequency.
Table 2. Comparative analysis of pruning methods across training scales.

Size   | Metric | Stage  | Abs    | Rms    | Freq   | Std    | Ours
50 k   | Acc.   | Pruned | 0.4707 | 0.5073 | 0.5480 | 0.4139 | 0.2423
       |        | L-FT   | 0.7353 | 0.7586 | 0.7742 | 0.7526 | 0.7508
       |        | F-FT   | 0.7966 | 0.7894 | 0.7853 | 0.7988 | 0.8020
       | Prec.  | Pruned | 0.4874 | 0.6498 | 0.7284 | 0.5801 | 0.3610
       |        | L-FT   | 0.7360 | 0.7579 | 0.7729 | 0.7547 | 0.7546
       |        | F-FT   | 0.7976 | 0.7984 | 0.7867 | 0.7896 | 0.8075
       | F1     | Pruned | 0.4140 | 0.4449 | 0.5425 | 0.4139 | 0.1388
       |        | L-FT   | 0.7443 | 0.7579 | 0.7731 | 0.7526 | 0.7500
       |        | F-FT   | 0.7966 | 0.7894 | 0.7827 | 0.7965 | 0.8024
25 k   | Acc.   | Pruned | 0.5492 | 0.5737 | 0.6169 | 0.5587 | 0.1771
       |        | L-FT   | 0.2731 | 0.6759 | 0.4683 | 0.6755 | 0.6818
       |        | F-FT   | 0.1285 | 0.6872 | 0.2738 | 0.7073 | 0.7253
       | Prec.  | Pruned | 0.6647 | 0.6908 | 0.6808 | 0.6151 | 0.0800
       |        | L-FT   | 0.3199 | 0.6742 | 0.5747 | 0.6772 | 0.6835
       |        | F-FT   | 0.4678 | 0.6878 | 0.4358 | 0.7216 | 0.7257
       | F1     | Pruned | 0.6294 | 0.5565 | 0.6203 | 0.5148 | 0.0938
       |        | L-FT   | 0.2029 | 0.6741 | 0.4382 | 0.6753 | 0.6812
       |        | F-FT   | 0.0757 | 0.6857 | 0.1942 | 0.7077 | 0.7228
12.5 k | Acc.   | Pruned | 0.5567 | 0.5104 | 0.2552 | 0.4735 | 0.1000
       |        | L-FT   | 0.5859 | 0.5939 | 0.5515 | 0.5704 | 0.6067
       |        | F-FT   | 0.6216 | 0.6040 | 0.6104 | 0.6296 | 0.6124
       | Prec.  | Pruned | 0.5929 | 0.5468 | 0.1738 | 0.5442 | 0.0100
       |        | L-FT   | 0.5911 | 0.5930 | 0.5574 | 0.5660 | 0.6112
       |        | F-FT   | 0.6223 | 0.6093 | 0.5531 | 0.6345 | 0.6266
       | F1     | Pruned | 0.5490 | 0.5133 | 0.1213 | 0.4416 | 0.0182
       |        | L-FT   | 0.5859 | 0.5924 | 0.5531 | 0.5666 | 0.6070
       |        | F-FT   | 0.6196 | 0.6035 | 0.6104 | 0.6309 | 0.6108
Note: The best results are highlighted in bold.
Noise-Adaptive Performance: Table 3 shows that our data partitioning algorithm yields increasingly large advantages as the noise level rises, primarily because stronger noise improves the separation of high-quality and noisy data. At noise level 2, full-model fine-tuning outperforms the Retrain Model by 8.07% (79.34% vs. 71.27%); this gap widens to 10.20% at level 9 (80.06% vs. 69.86%), confirming that higher noise levels sharpen the discriminative power of neurons in distinguishing D_r (high-quality data) from D_n (noisy data). L-FT is the most computationally efficient strategy, reducing training time by 40.0–67.8% compared to retraining (e.g., 6.67 s vs. 12.02 s at level 9), while full-model fine-tuning maintains high accuracy and is up to 47.2% faster than retraining, achieving a favorable performance–efficiency balance. Fine-tuning the pruned network on high-quality data allows the model to recover and strengthen the representations of genuine patterns without interference from the pruned noise-sensitive pathways. The fact that the same method performs increasingly well at higher noise levels also indicates that, as the two data subsets diverge more strongly, the response characteristics of individual neurons become more pronounced, and it confirms that our method can separate the data effectively. However, since the partitioning treats the model as a black box, some classification errors are inevitable, especially when the sample noise is minimal; nevertheless, substantially more noisy samples than clean samples are excluded, which significantly improves the network’s performance. Examples subjected to different levels of interference are shown in Figure 3.
Sensitivity Analysis: The noise sensitivity results in Table 3 show that the relatively long fine-tuning time at low noise levels indicates that the separation is not yet exhaustive, whereas the fine-tuning time shrinks steadily as the noise level increases, indicating that more of the noisy data is identified by our method. To clarify the respective contributions of data partitioning and neuron pruning, we revisit Table 1 and Table 2 for a structured sensitivity analysis. In Table 1, the "Retrain Model" reflects training only on the filtered data; in Table 2, the "Pruned" rows reflect the effect of pruning alone, and the "F-FT" rows reflect the combined effect of both mechanisms. Across all training scales (50 k, 25 k, and 12.5 k), the Retrain Model consistently improves over the initial model, confirming the effectiveness of attribution-based data filtering. Among the pruning criteria, ours produces the largest immediate accuracy drop, confirming that it indeed identifies the neurons most sensitive to the distinction between the two data subsets. Removing the influence of the noisy data on these neurons and then fine-tuning yields the expected experimental results.
Performance with Rising Noise Levels: From the experimental results in Table 4, it can be observed that as the attack ratio (proportion of noisy data) increases from 0.2 to 0.9, the performance of all methods shows a declining trend. However, the proposed F-FT method consistently maintains the best performance. Specifically, when the attack ratio is 0.2, the accuracy of F-FT reaches 0.8358, which is 2.23 percentage points higher than the second-best L-FT method. When the attack ratio rises to 0.9, the accuracy of F-FT (0.5924) still significantly outperforms L-FT (0.5705) and the retrained model (0.5184). Notably, as the noise ratio increases, the performance advantage of F-FT over other methods gradually expands. For example, at an attack ratio of 0.7, the F1-score of F-FT (0.7448) is 3.56 percentage points higher than that of L-FT (0.7092), indicating that our method exhibits stronger robustness in high-noise environments. In terms of time efficiency, although L-FT remains the fastest, the training time of F-FT under different attack ratios is consistently better than that of the retrained model, demonstrating good computational efficiency. These results validate the stability and effectiveness of the F-FT method when data quality deteriorates.
Table 3. Comparative analysis of fine-tuning strategies under different noise levels (train size = 50 k).

Noise Level | Metric    | Initial Model | L-FT   | F-FT   | Retrain Model
level = 2   | Accuracy  | 0.7128        | 0.7741 | 0.7934 | 0.7127
            | Precision | 0.7309        | 0.7749 | 0.7969 | 0.7499
            | F1-score  | 0.7126        | 0.7735 | 0.7945 | 0.7123
            | Time (s)  | 20.48         | 12.28  | 21.86  | 22.72
level = 3   | Accuracy  | 0.7220        | 0.7754 | 0.7966 | 0.6855
            | Precision | 0.7442        | 0.7749 | 0.7968 | 0.7257
            | F1-score  | 0.7237        | 0.7746 | 0.7951 | 0.6806
            | Time (s)  | 20.48         | 12.16  | 21.53  | 22.84
level = 4   | Accuracy  | 0.7079        | 0.7761 | 0.7904 | 0.7058
            | Precision | 0.7269        | 0.7796 | 0.7932 | 0.7468
            | F1-score  | 0.7062        | 0.7758 | 0.7902 | 0.7086
            | Time (s)  | 20.48         | 10.35  | 18.55  | 18.12
level = 5   | Accuracy  | 0.6818        | 0.7754 | 0.7954 | 0.7074
            | Precision | 0.7370        | 0.7742 | 0.7969 | 0.7448
            | F1-score  | 0.6857        | 0.7743 | 0.7952 | 0.7131
            | Time (s)  | 20.45         | 6.94   | 12.69  | 12.45
level = 6   | Accuracy  | 0.6992        | 0.7654 | 0.7923 | 0.7039
            | Precision | 0.7033        | 0.7672 | 0.7934 | 0.7301
            | F1-score  | 0.6923        | 0.7656 | 0.7915 | 0.7053
            | Time (s)  | 20.48         | 8.71   | 15.59  | 18.0
level = 7   | Accuracy  | 0.7292        | 0.7575 | 0.7925 | 0.6307
            | Precision | 0.7360        | 0.7585 | 0.7946 | 0.6770
            | F1-score  | 0.7295        | 0.7568 | 0.7922 | 0.6192
            | Time (s)  | 20.48         | 6.60   | 12.12  | 14.78
level = 8   | Accuracy  | 0.6944        | 0.7508 | 0.8020 | 0.7299
            | Precision | 0.6962        | 0.7546 | 0.8075 | 0.7450
            | F1-score  | 0.6962        | 0.7500 | 0.8024 | 0.7298
            | Time (s)  | 20.48         | 7.98   | 14.44  | 15
level = 9   | Accuracy  | 0.7150        | 0.7657 | 0.8006 | 0.6986
            | Precision | 0.7482        | 0.7648 | 0.8034 | 0.7562
            | F1-score  | 0.7183        | 0.7643 | 0.8012 | 0.6945
            | Time (s)  | 20.61         | 6.67   | 12.21  | 12.02
Note: For all metrics except time, the best results appear in the F-FT column; the best time results are highlighted in bold.
Model Adaptability Analysis: The experimental results in Table 5 demonstrate that as the noise ratio (attacked rate) in the test set increases, the performance of all methods declines. However, our approach (particularly F-FT) generally outperforms both the initial model and the fully retrained model (Retrain Model) under most conditions. On the clean, noise-free test set (attacked rate = 0), F-FT achieves the best performance (accuracy = 0.8292), indicating its ability to effectively leverage new data to enhance model performance. In noisy test environments (attacked rate = 0.2–0.8), our methods (L-FT and F-FT) maintain superior performance in terms of accuracy and the F1-score. Notably, under high noise levels (attacked rate = 0.8), F-FT still achieves significantly higher accuracy (0.3574) compared to the initial model (0.2861) and the Retrain model (0.2361). This suggests that our approach not only adapts to the noise in the training data but also preserves the robust features learned by the original model, resulting in more stable performance on noisy test data. In contrast, while the Retrain model exhibits higher precision under high noise, its overall performance (accuracy and F1-score) deteriorates more severely, indicating weaker generalization. In summary, our method demonstrates strong adaptability and robustness on both clean and noisy data, making it suitable for real-world scenarios where noisy inputs may be encountered.
Table 4. Comparative analysis of fine-tuning strategies under different attacked rates (train size = 50 k).

Attacked Rate | Metric | Initial Model | L-FT | F-FT | Retrain Model
0.2 | Accuracy | 0.7322 | 0.8135 | 0.8358 | 0.7744
0.2 | Precision | 0.7658 | 0.8126 | 0.8404 | 0.7936
0.2 | F1-score | 0.7338 | 0.8128 | 0.8362 | 0.7733
0.2 | Time (s) | 20.14 | 9.32 | 17.21 | 16.89
0.4 | Accuracy | 0.6991 | 0.7859 | 0.8154 | 0.7410
0.4 | Precision | 0.7440 | 0.7869 | 0.8177 | 0.7611
0.4 | F1-score | 0.7050 | 0.7857 | 0.8154 | 0.7435
0.4 | Time (s) | 20.33 | 12.06 | 21.56 | 21.05
0.5 | Accuracy | 0.7128 | 0.7741 | 0.7934 | 0.7127
0.5 | Precision | 0.7309 | 0.7749 | 0.7969 | 0.7499
0.5 | F1-score | 0.7126 | 0.7735 | 0.7945 | 0.7123
0.5 | Time (s) | 20.48 | 12.28 | 21.86 | 22.72
0.7 | Accuracy | 0.5923 | 0.7100 | 0.7475 | 0.6668
0.7 | Precision | 0.6582 | 0.7165 | 0.7461 | 0.7095
0.7 | F1-score | 0.5990 | 0.7092 | 0.7448 | 0.6672
0.7 | Time (s) | 21.06 | 9.23 | 16.19 | 15.92
0.9 | Accuracy | 0.4896 | 0.5705 | 0.5924 | 0.5184
0.9 | Precision | 0.5252 | 0.5688 | 0.5875 | 0.5388
0.9 | F1-score | 0.4827 | 0.5661 | 0.5838 | 0.5148
0.9 | Time (s) | 21.61 | 7.29 | 12.63 | 12.59
Note: For all metrics except time, the best results fall in the F-FT column; the best time results are highlighted in bold.
Table 5. Comparative analysis of fine-tuning strategies under different test dataset attacked rates (train size = 50 k).

Attacked Rate (test set) | Metric | Initial Model | L-FT | F-FT | Retrain Model
0 | Accuracy | 0.7401 | 0.7972 | 0.8292 | 0.7782
0 | Precision | 0.7622 | 0.7952 | 0.8259 | 0.8019
0 | F1-score | 0.7412 | 0.7955 | 0.8264 | 0.7822
0.2 | Accuracy | 0.6272 | 0.6842 | 0.6836 | 0.6417
0.2 | Precision | 0.6870 | 0.6893 | 0.6897 | 0.7362
0.2 | F1-score | 0.6340 | 0.6828 | 0.6827 | 0.6603
0.4 | Accuracy | 0.5159 | 0.5766 | 0.5771 | 0.5054
0.4 | Precision | 0.6442 | 0.5984 | 0.5999 | 0.7147
0.4 | F1-score | 0.5306 | 0.5751 | 0.5757 | 0.5465
0.6 | Accuracy | 0.4023 | 0.4650 | 0.4642 | 0.3692
0.6 | Precision | 0.6109 | 0.5148 | 0.5114 | 0.6863
0.6 | F1-score | 0.4152 | 0.4609 | 0.4599 | 0.4124
0.8 | Accuracy | 0.2861 | 0.3549 | 0.3574 | 0.2361
0.8 | Precision | 0.5767 | 0.4302 | 0.4326 | 0.6693
0.8 | F1-score | 0.2750 | 0.3422 | 0.3445 | 0.2487
Note: The best results are highlighted in bold.
The synergistic combination of precise attribution-based pruning followed by targeted fine-tuning achieves dual objectives: (1) efficient removal of noise-induced interference and (2) enhanced recovery of discriminative capacity beyond standard approaches. This advantage is particularly evident under extreme conditions (high noise levels or limited data availability), where conventional methods deteriorate while our framework maintains robust performance.
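To make this pruning-plus-fine-tuning step concrete, the following minimal sketch (in PyTorch, with illustrative layer sizes and optimizer settings that are our assumptions rather than the paper's exact configuration) zeroes out the neurons flagged as noise-sensitive in a fully connected layer and then fine-tunes the pruned model on the retained high-quality subset only.

```python
import torch
import torch.nn as nn

def prune_neurons(layer: nn.Linear, sensitive_idx: torch.Tensor) -> None:
    """Zero out the weights and bias of neurons flagged as noise-sensitive.

    `sensitive_idx` holds the indices of the layer's output neurons that the
    sensitivity analysis marked for removal (assumed to be given here).
    """
    with torch.no_grad():
        layer.weight[sensitive_idx, :] = 0.0
        if layer.bias is not None:
            layer.bias[sensitive_idx] = 0.0


def fine_tune(model: nn.Module, clean_loader, epochs: int = 5, lr: float = 1e-3) -> nn.Module:
    """Fine-tune the pruned model on the retained high-quality subset only."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in clean_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return model


if __name__ == "__main__":
    # Toy demonstration with random CIFAR-shaped data (all shapes illustrative).
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 10))
    prune_neurons(model[1], sensitive_idx=torch.tensor([3, 17, 42]))
    clean_loader = [(torch.rand(64, 3, 32, 32), torch.randint(0, 10, (64,)))]
    fine_tune(model, clean_loader, epochs=1)
```

One practical detail: during full-model fine-tuning the zeroed weights can drift away from zero unless the mask is re-applied after each update, whereas layer-wise fine-tuning keeps frozen layers pruned by construction.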

4.2. Speech Recognition

4.2.1. Dataset and Experimental Setup

To validate our method’s generalizability across domains, we evaluate it on speech recognition using the Speech Commands dataset (10-class subset), a standard benchmark for keyword spotting tasks. This dataset contains 35,000 one-second audio clips sampled at 16 kHz, covering 10 core spoken commands (“yes”, “no”, etc.) along with additional silence and unknown samples. The dataset’s constrained vocabulary and uniform clip duration make it particularly suitable for evaluating noise-robust learning approaches in speech applications.
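For reference, the 10-command subset can be assembled with `torchaudio.datasets.SPEECHCOMMANDS`; the keyword list and filtering below follow common practice for this benchmark and are not taken verbatim from the paper.

```python
import torchaudio

# The 10 core keywords commonly used for the Speech Commands 10-class task.
CORE_COMMANDS = {"yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"}

dataset = torchaudio.datasets.SPEECHCOMMANDS(root="./data", download=True)

# Each item is (waveform, sample_rate, label, speaker_id, utterance_number).
# Collecting indices this way loads every clip once and is for illustration only.
core_indices = [i for i in range(len(dataset)) if dataset[i][2] in CORE_COMMANDS]
print(f"{len(core_indices)} core-command clips out of {len(dataset)} total")
```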
The proposed experimental framework employs a baseline architecture comprising a single convolutional layer followed by two fully connected layers. For this evaluation, we adopt a custom 80–20 training–testing partition rather than the standard split in order to demonstrate our approach’s effectiveness across different data distributions. Consistent with Section 4.1.1, we assess performance using accuracy, precision, the F1-score, and the per-epoch computational time. In addition, we report Top-3 accuracy, which checks whether the ground-truth label is among the three highest-ranked predicted classes. Formally, for a dataset with M samples, it is computed as
$$\text{Top-3 Accuracy} = \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}\!\left[\, y_i \in \{\text{Top-3 predictions for } x_i\} \,\right]$$
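A direct implementation of this metric with `torch.topk` might look as follows (assuming `logits` of shape [M, num_classes] and integer class `labels`); this is a generic sketch, not code from the paper.

```python
import torch

def top3_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of samples whose true label is among the 3 highest-scoring classes."""
    top3 = logits.topk(k=3, dim=1).indices           # shape [M, 3]
    hits = (top3 == labels.unsqueeze(1)).any(dim=1)  # True if the label is in the top-3 set
    return hits.float().mean().item()

# Example with 4 samples and 10 classes.
logits = torch.randn(4, 10)
labels = torch.tensor([3, 7, 0, 9])
print(top3_accuracy(logits, labels))
```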

4.2.2. Results and Analysis

Figure 4 presents visualization results using Mel-spectrogram preprocessing (64 bins, 25 ms windows, and 10 ms hop length). We display representative samples from two categories that demonstrate the diversity of audio features. The visualization reveals that while the initial model incorrectly predicts certain audio samples, our proposed method achieves correct classification, indicating substantial room for improvement in the baseline approach.
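The stated preprocessing (64 Mel bins, 25 ms windows, 10 ms hop at 16 kHz) corresponds roughly to the `torchaudio.transforms.MelSpectrogram` configuration below; the FFT size and the log compression are our assumptions rather than details taken from the paper.

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=512,        # assumption: next power of two above the 400-sample window
    win_length=400,   # 25 ms window at 16 kHz
    hop_length=160,   # 10 ms hop at 16 kHz
    n_mels=64,        # 64 Mel bins
)

waveform = torch.randn(1, 16000)       # placeholder one-second clip
spec = mel(waveform)                   # shape: [1, 64, ~101 time frames]
log_spec = torch.log(spec + 1e-6)      # log compression (assumption, common practice)
print(log_spec.shape)
```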
Table 6 demonstrates consistent performance advantages across all evaluation metrics. Accuracy improvements: full-model fine-tuning (F-FT) achieves 71.21% accuracy, 7.01 percentage points above the initial model (64.20%) and 5.84 percentage points above retraining (65.37%). The 7.90 percentage-point F1-score improvement (71.92% vs. 64.02%) confirms enhanced noise immunity. Computational efficiency: L-FT delivers roughly a 2.4× speedup (1.32 s vs. 3.23 s per epoch) compared with retraining while maintaining superior precision (76.34% vs. 65.31%). Full-model fine-tuning strikes a balance between performance and speed at 2.85 s per epoch. Prediction reliability: our method’s 93.41% Top-3 accuracy indicates consistent prediction reliability even when the correct label is not the highest-ranked prediction, a crucial characteristic for practical speech interface applications.
This comprehensive validation on speech recognition data strengthens our method’s claim as a domain-agnostic solution for noise-robust learning across diverse application areas.

5. Conclusions

In this study, we have introduced a robust deep learning framework designed explicitly to address the pervasive issue of noisy training data through the integration of machine unlearning principles, attribution-guided data partitioning, discriminative neuron pruning, and targeted fine-tuning. Our approach first employs gradient-based attribution combined with Gaussian mixture models to distinguish high-quality samples from noise-corrupted ones without imposing restrictive assumptions on noise characteristics. We then apply a regression-based neuron sensitivity analysis to identify and prune the neurons most susceptible to noise interference. This strategic pruning is followed by selective fine-tuning, either layer-wise or full-model, on the clean, high-quality data only, with regularization to mitigate potential knowledge degradation. The core methodological contribution is the regression-based sensitivity analysis for neurons: by formulating the prediction of sample quality labels as a linear regression problem on neuron activations, we derive a neuron sensitivity metric specifically designed to identify the components most vulnerable to noise corruption. This targeted pruning criterion, distinct from traditional pruning objectives, coupled with attribution-guided data partitioning and focused fine-tuning, provides a new pathway toward building robust DNNs in noisy environments. We validated the framework on CIFAR-10 image classification and speech-command keyword spotting, and the results demonstrate its effectiveness and generalizability.
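To illustrate the regression-based sensitivity criterion described above, the following minimal sketch fits a linear model that predicts each sample's quality label from the activations of one layer and ranks neurons by coefficient magnitude; the use of scikit-learn's `LinearRegression` and the top-k cutoff are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def neuron_sensitivity(activations: np.ndarray, quality_labels: np.ndarray, k: int) -> np.ndarray:
    """Rank neurons by how strongly their activations predict sample quality.

    activations:    [num_samples, num_neurons] activations of one layer.
    quality_labels: [num_samples], 1 for high-quality samples, 0 for noise-corrupted ones.
    Returns the indices of the k neurons with the largest absolute regression
    coefficients, i.e., the candidates for pruning.
    """
    reg = LinearRegression().fit(activations, quality_labels)
    sensitivity = np.abs(reg.coef_)            # one score per neuron
    return np.argsort(sensitivity)[::-1][:k]

# Toy example: 1000 samples, 128 neurons, select the 16 most sensitive neurons.
acts = np.random.randn(1000, 128)
labels = np.random.randint(0, 2, size=1000)
print(neuron_sensitivity(acts, labels, k=16))
```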
Future research directions include investigating adaptive pruning mechanisms, such as online unlearning methodologies that improve computational efficiency, and extending the approach to semi-supervised and unsupervised learning scenarios. Future work will also focus on migrating our method to more complex architectures, such as deeper CNNs and transformer models. Additionally, further exploration of theoretical convergence and generalization guarantees under varying noise conditions, together with validation on larger-scale benchmarks and noise-prone domains such as medical imaging and natural language processing, will provide deeper insights and broader applicability. Ultimately, our work contributes a practical and theoretically grounded approach for achieving reliable and robust deep neural networks in noisy real-world environments.

Author Contributions

Conceptualization, D.J. and G.C.; methodology, D.J.; software, S.F.; validation, Y.L. and H.Z.; formal analysis, S.F.; investigation, S.F.; resources, Y.L.; data curation, H.Z.; writing—original draft preparation, D.J.; writing—review and editing, G.C.; visualization, D.J.; supervision, S.F.; project administration, D.J.; funding acquisition, G.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Overall algorithm flow. First, a network is pre-trained. In the data partition stage, this network is used to divide the training data into high-quality and noise-corrupted subsets. Next, during selective pruning, the partitioned data are used to rank and prune neurons in the selected layers of the network. Finally, the high-quality subset is used to obtain layer-wise and full-model fine-tuned versions of the pruned network.
Figure 2. Representative samples misclassified by the initial model but correctly predicted by both our fine-tuning strategies. Green labels indicate correct predictions by our method. Red labels indicate wrong predictions by the initial model.
Figure 3. Examples of data samples before and after noise injection. The leftmost column shows randomly selected original samples; moving from left to right, Gaussian noise of increasing intensity is added. The intensity parameter is labeled at the top of each column.
Figure 4. Representative samples misclassified by the initial model but correctly predicted by both our fine-tuning strategies. Green labels indicate correct predictions by our method. Red labels indicate wrong predictions by the initial model.
Table 6. Comparative analysis on the Speech Commands dataset.

Stage | Accuracy (%) | Precision (%) | F1 (%) | Top-3 Acc (%) | Time (s)
Initial Model | 64.20 | 64.03 | 64.02 | 89.92 | 6.90
L-FT | 66.94 | 76.34 | 77.01 | 90.45 | 1.32
F-FT | 71.21 | 72.90 | 71.92 | 93.41 | 2.85
Retrain Model | 65.37 | 65.31 | 65.27 | 90.38 | 3.23
Note: All metrics are reported in percent (%). Time is the average training time per epoch, measured on an NVIDIA Tesla T4 GPU. The best results are highlighted in bold.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
