Article

Navigating Data Corruption in Machine Learning: Balancing Quality, Quantity, and Imputation Strategies

by Qi Liu *,† and Wanjing Ma
Key Laboratory of Road and Traffic Engineering of the Ministry of Education, College of Transportation, Tongji University, Shanghai 200092, China
* Author to whom correspondence should be addressed.
† Current address: 4800 Caoan Rd., Jiading District, Shanghai 201804, China.
Future Internet 2025, 17(6), 241; https://doi.org/10.3390/fi17060241
Submission received: 16 April 2025 / Revised: 21 May 2025 / Accepted: 23 May 2025 / Published: 29 May 2025
(This article belongs to the Special Issue Smart Technology: Artificial Intelligence, Robotics and Algorithms)

Abstract

Data corruption, including missing and noisy entries, is a common challenge in real-world machine learning. This paper examines its impact and mitigation strategies through two experimental setups: supervised NLP tasks (NLP-SL) and deep reinforcement learning for traffic signal control (Signal-RL). This study analyzes how varying corruption levels affect model performance, evaluates imputation strategies, and assesses whether expanding datasets can counteract corruption effects. The results indicate that performance degradation follows a diminishing-return pattern, well modeled by an exponential function. Noisy data harm performance more than missing data, especially in sequential tasks like Signal-RL where errors may compound. Imputation helps recover missing data but can introduce noise, with its effectiveness depending on corruption severity and imputation accuracy. This study identifies clear boundaries between when imputation is beneficial versus harmful, and classifies tasks as either noise-sensitive or noise-insensitive. Larger datasets reduce corruption effects but offer diminishing gains at high corruption levels. These insights guide the design of robust systems, emphasizing smart data collection, imputation decisions, and preprocessing strategies in noisy environments.

1. Introduction

Machine learning models rely heavily on high-quality data, yet real-world datasets often suffer from corrupted data—such as missing entries or noisy measurements—which significantly degrade model performance. Data corruption arises from diverse sources, including sensor errors, transmission artifacts, or incomplete data collection. It manifests in two primary forms: missing data (e.g., masked tokens in NLP or undetected vehicles in traffic control) and noisy data (e.g., mislabeled text or perturbed sensor readings). While this challenge is well documented in classical machine learning (Little and Rubin, 2019 [1]; Emmanuel et al., 2021 [2]), its implications for modern paradigms like large language models (LLMs) and deep reinforcement learning (DRL) remain underexplored.
Different learning paradigms exhibit distinct vulnerabilities to data corruption. For example, LLMs may tolerate certain missing tokens but struggle with semantic noise, whereas DRL policies can compound observational noise over sequential decisions. Prior work has shown that noise in NLP tasks (e.g., mislabeled text) biases model outputs (Brown et al., 2020 [3]), whereas in DRL, observational noise disrupts policy stability (Pathak et al., 2017 [4]). Despite these insights, fundamental questions remain unanswered: How do corruption type and severity impact different learning paradigms? Can imputation effectively recover lost performance, or does it risk making things worse by introducing new noise? And when is collecting more data a viable alternative to cleaning existing data? While traditional imputation methods—including statistical interpolation and deep generative approaches—offer partial solutions, their efficacy varies significantly across tasks. This variation highlights persistent gaps in our understanding of the trade-offs between data quality, quantity, and imputation strategies in machine learning systems.
This study bridges these gaps through a systematic analysis of two distinct machine learning paradigms: supervised NLP tasks (NLP-SL) and deep reinforcement learning for traffic signal control (Signal-RL). It evaluates how varying corruption levels affect model performance, quantifies the trade-offs of imputation strategies, and assesses whether expanding datasets can counteract corruption effects. The experiments reveal universal patterns (such as diminishing returns from data quality improvements), paradigm-specific insights (such as DRL’s heightened sensitivity to noise), as well as task-specific observations (such as the critical 30% of data that determines performance in traffic signal control). This work provides actionable guidelines for designing robust systems in noisy environments, emphasizing smart data prioritization, imputation decisions, and noise-aware preprocessing. Key innovations include:
  • Modeling Performance Degradation: This study shows that performance degradation follows a diminishing-return pattern, well captured by an exponential function, and reveals task-specific sensitivities.
  • Imputation Trade-offs: This study demonstrates the boundary conditions where imputation is beneficial versus harmful, providing actionable guidelines for practitioners.
  • Data Quantity–Quality Trade-offs: This study empirically shows that larger datasets can only partially offset corruption effects, with diminishing utility at high corruption levels.
By addressing these questions, our study advances the understanding of data corruption’s impact on machine learning and provides a framework for future research in robust model development. The code for reproducing our experiments is available at https://github.com/qiliuchn/data-corruption-study (accessed on 20 May 2025).

2. Related Work

2.1. Types of Data Corruption

Data corruption in machine learning encompasses various forms, including missing data, noisy data, and adversarial perturbations. This study focuses on two major types of corruption: missing data and noisy data, which commonly occur in real-world scenarios. Missing data can result from sensor dropout, incomplete data collection, or non-response in surveys. Rubin’s classification categorizes missing data into three types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR) [5]. In NLP, missing data often manifests as masked or unknown tokens, while in reinforcement learning, it appears as incomplete state observations. Data noise can affect both labels and features, with sources ranging from environmental factors to measurement errors. Noise is often categorized by its statistical distribution (e.g., Gaussian [6]/adversarial [7]) or types (e.g., additive/multiplicative [8]; label/feature [9]). Noisy data significantly disrupt model learning, particularly when noise affects key features or labels critical to decision making.

2.2. Impacts of Data Corruption

Prior research has empirically demonstrated the substantial impact of data corruption on model performance across learning paradigms. In the realm of language models, Joshi et al. (2020) [10] found that missing rare tokens during pre-training led to incomplete token embeddings that limit the model’s ability to capture fine-grained semantics. Missing data reduce the available context, weakening learned representations and hindering downstream tasks such as summarization or question answering [11,12]. Noisy data, particularly in large-scale corpora, introduce biases and degrade model robustness. Brown et al. (2020) [3] highlighted that noisy training data increase the likelihood of generating biased or low-quality outputs, while filtering and robust training objectives mitigate such effects. Semantic noise, such as contradictory or irrelevant text, reduces the model’s ability to retain factual knowledge and generalize across tasks [13].
For reinforcement learning, data corruption affects state observations, which are critical for decision making. Missing features reduce state informativeness, leading to suboptimal policies, particularly in partially observable environments (POMDPs) [14]. Studies by Bai et al. (2019) [15] demonstrated that missing features disrupt state-transition dynamics, causing instability in model-free RL algorithms and inaccurate environment models in model-based RL. Pathak et al. (2017) showed that noisy features distort latent state representations, leading to poor decision making in high-dimensional environments [4]. Mnih et al. (2015) [16] found that Q-values fluctuate with noisy observations, resulting in unstable policies. Moreover, noise hampers transfer learning, as policies trained in corrupted environments fail to generalize to clean environments [17]. These studies collectively highlight how data corruption undermines performance in both paradigms, though through different mechanisms: contextual understanding and factual accuracy in language models versus compounding errors in sequential decision making for reinforcement learning.

2.3. Data Imputation Techniques

Data imputation aims to recover missing information, and various strategies have been proposed. Emmanuel et al. [2] provide a comprehensive review of missing data in machine learning, classifying imputation methods into several categories: simple imputation, regression-based imputation, hot-deck imputation, the Expectation–Maximization (EM) method, multiple imputation, and machine learning–inspired techniques such as k-nearest neighbors (KNN), support vector machines (SVM), decision trees, clustering-based imputation, and ensemble methods. In a more recent taxonomy, Zhou et al. [18] group imputation approaches into four main types: statistical, machine learning-based, neural network-based, and optimization-based methods.
Statistical Methods: These include simple approaches like mean, mode, or median imputation [19], as well as more sophisticated techniques like regression imputation and multiple imputation [20]. While straightforward, these methods may introduce bias or underestimate variability.
Machine Learning Methods: Techniques such as k-nearest neighbors (KNN) imputation [21], decision tree-based imputation [22], and Random Forest imputation [23] leverage patterns in data to predict missing values. These methods often outperform simple statistical approaches by capturing complex nonlinear relationships in the data.
Deep Learning-based Imputation: Autoencoders and Generative Adversarial Networks (GANs) have emerged as powerful tools for data imputation. Denoising autoencoders reconstruct inputs with noisy values [24], while GANs generate plausible synthetic data to fill gaps [25]. Yuan et al. (2021) [26] used masked language models (e.g., BERT) to impute missing tokens in text data.
For many applications, particularly LLM pre-training, masking missing data is often sufficient. Che et al. (2018) showed that masking missing time-series data combined with recurrent neural networks (e.g., GRU) can effectively handle missing data [27]. We summarize common imputation methods, their strengths and weaknesses, and use cases in Table 1.
Key Research Gaps
While many studies address specific aspects of corrupted data handling, key questions remain unanswered:
  • Impact of Data Corruption: What is the quantitative relationship between data corruption ratio and model performance? Can this relationship be consistently modeled across tasks?
  • Effectiveness of Imputation: How do different imputation methods compare in mitigating the effects of missing data? Is it possible to fully restore the utility of corrupted data through imputation?
  • Trade-Off Between Data Quality and Quantity: Can larger datasets compensate for data corruption? How much additional data is required to offset quality issues, and does the marginal utility of additional data diminish with increasing corruption levels?
This study aims to bridge the above gaps by evaluating the effects of data corruption on supervised and reinforcement learning tasks. It explores the utility of data imputation methods and analyzes the trade-off between data quality and quantity, providing insights to guide data collection and preprocessing strategies.

3. Learning with Corrupted Data

3.1. Experiment Design

Two experiments are designed—Natural Language Processing Supervised Learning (NLP-SL) and Traffic Signal Deep Reinforcement Learning (Signal-RL)—to investigate the impact of data corruption. These two vastly different experimental setups were chosen to demonstrate the generality and broad relevance of the research questions across a wide range of machine learning tasks. By selecting tasks with diverse characteristics, we aim to derive more general insights and uncover deeper connections between these seemingly unrelated domains.

3.1.1. NLP Supervised Learning (NLP-SL)

The first experiment focuses on the GLUE benchmark tasks. First, a BERT model is pre-trained on Wikitext and Bookcorpus to serve as the base model; the base model is then frozen. A classification head is added on top of the frozen base model and fine-tuned on eight GLUE tasks: CoLA, SST-2, MRPC, STSB, QQP, MNLI, QNLI, and RTE. The GLUE input sentences are corrupted by replacing certain words with the [UNK] token, while the labels y remain uncorrupted. This type of data corruption is commonly encountered in natural language processing tasks. For instance, when digitizing text corpora, some words may become indistinguishable and are thus masked as unknown, whereas the associated labels are typically unaffected. Evaluating the amount of knowledge learned by a model remains a subjective challenge. In this experiment, performance is measured using classification-related accuracy metrics. Specifically:
  • The Matthews Correlation Coefficient (MCC) is used for CoLA;
  • The average of Pearson and Spearman correlation coefficients is used for STSB;
  • Test accuracy is used for the remaining tasks.
Baseline scores (1/2 for binary classification and 1/3 for three-way classification) are subtracted from these metrics. The final model score is computed as the average of the scores across all tasks. To ensure consistency, hyperparameters such as the number of training epochs and learning rates are tuned using uncorrupted data and kept fixed for all subsequent experiments (see Table A1). Model convergence for these experiments is illustrated in Figure 1. For clarity, this experiment is referred to as the “Natural Language Processing—Supervised Learning (NLP-SL)” experiment in the following sections. Note that the base model is used only for sentence embedding and is frozen; the goal is to test the model’s ability to extract knowledge from corrupted data and store it in the feed-forward network.
For the NLP-SL task, two types of data corruption are introduced:
  • Data missing: Each word in the training samples has a probability p of being replaced with a [MASK] token.
  • Inserting noise: Each word in the training samples has a probability p of being replaced with a randomly selected word from the vocabulary.
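A minimal sketch of these two corruption modes, assuming whitespace tokenization and a small placeholder vocabulary (both hypothetical), is given below:

```python
import random

def corrupt_tokens(tokens, p, mode, vocab=None, rng=random.Random(0)):
    """Corrupt a list of tokens: each token is replaced with probability p.

    mode="missing": replace with a mask placeholder (data missing).
    mode="noise":   replace with a random word from the vocabulary (inserting noise).
    """
    corrupted = []
    for tok in tokens:
        if rng.random() < p:
            corrupted.append("[MASK]" if mode == "missing" else rng.choice(vocab))
        else:
            corrupted.append(tok)
    return corrupted

# Example usage with a hypothetical vocabulary
vocab = ["apple", "run", "blue", "table", "quickly"]
print(corrupt_tokens("the cat sat on the mat".split(), p=0.3, mode="noise", vocab=vocab))
```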

3.1.2. Traffic Signal Control Deep Reinforcement Learning (Signal-RL)

The second experiment is a deep reinforcement learning (DRL) task. An isolated intersection environment is built using SUMO 1.20.0. The environment is illustrated in Figure 2. The objective is to optimize the traffic signal at this intersection. The intersection consists of four approaches, each with three lanes. The traffic demand is generated using a binomial distribution, and the ratio of left-turn, through, and right-turn traffic demands is 1:3:2. The arrival rates are time varying: East–West traffic demands follow a sine curve, while North–South demands follow a cosine curve within the range [0, π/2]. The simulation step size is one second, and each episode has a horizon H of one hour.
A Deep Q-Network (DQN) model is used to learn the traffic signal control strategy (see Listing A1 for model and Table A1 for configuration). The state of the environment consists of road occupancy, the current signal phase, and the duration of the current signal phase, resulting in a state vector of dimension 965. A neural network with two hidden layers of sizes 256 and 48 is used to extract features. Layer normalization is applied, but no dropout layers are included. Standard techniques such as double networks and replay buffers are implemented. The action space is discrete, with four possible actions: East–West left turn, East–West through, North–South left turn, and North–South through. Each action in the simulation lasts for 6 s. The reward r t for each time step is defined as the number of queuing vehicles transformed according to Equation (1). The queue length is divided by 80 to normalize step rewards approximately within the range of 0 to 1. The performance indicator for the model is the episode cumulative reward, R. Although the reward is accumulated over one hour, the process itself is infinite.
$R = \sum_{t=1}^{H} r_t = \sum_{t=1}^{H} \frac{80 - q_t}{80}$ (1)
where $q_t$ is the number of queuing vehicles at the intersection at time t (a vehicle is counted as stopped when its speed is below 0.3 m/s), and H is the horizon of one simulation episode.
The model is trained using linearly decaying exploration (ϵ) and learning rate (lr) for 80% of the training time, after which the values were fixed. The initial and final ϵ values were 1.0 and 0.01, respectively, while the initial and final learning rates were 1 × 10⁻³ and 1 × 10⁻⁴, respectively. The DQN model convergence results are shown in Figure 1b. An overview of the model, dataset, and training configurations for both the NLP-SL and Signal-RL tasks is given in Table A1.
For the Signal-RL task, three types of data corruption are introduced: vehicle missing, inserting noise, and masking region:
  • Vehicle missing: Each vehicle is not detected with probability p. This scenario is relevant in Vehicle-to-Everything (V2X) environments, where roadside units detect vehicles’ presence through communication channels like DSRC [28]. However, only a proportion of vehicles are equipped with onboard devices. This type of corruption is analogous to the data-missing scenario in the NLP-SL experiment.
  • Inserting noise: Noise is added to the road occupancy state and rewards. Each road cell occupancy state has probability p of being replaced with a random binary value. This scenario is relevant in environments where road occupancies are detected using computer-vision systems, which can introduce errors.
  • Masking region: This special type of corruption is specific to traffic signal settings. The simulation environment assumes a lane length of 400 m. A masking-region ratio p means that the farthest 400p meters of each lane will be invisible to the model, simulating the range limitations of video cameras used for vehicle detection. A code sketch of these corruption types is given after this list.
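A minimal sketch of how these corruption types could be applied to a binary road-occupancy array is shown below; the lane-by-cell discretization, the assumption that cells are ordered from the stop line outward, and the function name are hypothetical placeholders rather than the paper's implementation:

```python
import numpy as np

def corrupt_occupancy(occupancy, p, mode, seed=None):
    """Corrupt a binary road-occupancy array (1 = cell occupied), shape (lanes, cells).

    mode="missing": each occupied cell is dropped (vehicle undetected) with prob. p.
    mode="noise":   each cell is replaced by a random binary value with prob. p.
    mode="mask":    the farthest fraction p of cells in each lane is zeroed out.
    """
    rng = np.random.default_rng(seed)
    occupancy = occupancy.copy()
    if mode == "missing":
        drop = (occupancy == 1) & (rng.random(occupancy.shape) < p)
        occupancy[drop] = 0
    elif mode == "noise":
        flip = rng.random(occupancy.shape) < p
        occupancy[flip] = rng.integers(0, 2, size=int(flip.sum()))
    elif mode == "mask":
        n_lanes, n_cells = occupancy.shape          # cells ordered from stop line outward
        n_hidden = int(round(p * n_cells))
        if n_hidden > 0:
            occupancy[:, n_cells - n_hidden:] = 0   # farthest cells become invisible
    return occupancy

# Example: 12 lanes with a hypothetical discretization of 80 cells per lane
state = np.zeros((12, 80), dtype=int)
state[0, :5] = 1
print(corrupt_occupancy(state, p=0.5, mode="missing", seed=0).sum())
```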
When investigating imputation methods, two common types of data-missing scenarios often encountered in practice are distinguished:
  • Exact imputation: This scenario arises when the precise locations of missing data are known. A common example occurs in natural language processing (NLP), where missing words are explicitly marked with placeholders such as “[UNK]”.
  • General imputation: In this case, the locations of missing data are unknown. For instance, in the Signal-RL experiment, vehicles may go undetected, leaving it unclear which elements of the state vector are corrupted. Imputing data under such conditions requires checking all possible locations, potentially introducing significantly more noise compared to exact imputation.

3.2. Observations

Figure 3 illustrates the relationship between the data-missing ratio and model performance. Both experiments (NLP-SL and Signal-RL) exhibit an initial gradual performance decline, followed by a steeper descent that culminates in a sharp performance drop as the data-missing ratio approaches 1.0. For reference, the figure includes the performance of an optimized fixed-timing signal as a benchmark. The RL-trained signal outperforms the fixed-timing signal when the data-missing ratio is below 0.8.
Table 2 and Table 3 provide the detailed data for these experiments. In Figure 4, the x-axis is changed to represent (1 − corruption ratio), and the data are fitted to the function in Equation (2). The fitted parameters for the NLP-SL experiment are a = 0.475, λ = 3.517, where λ represents the decay rate that controls the curve’s steepness. The goodness-of-fit analysis results in R² = 0.995, indicating that the exponential cumulative distribution function (CDF) is an excellent model for the observed performance. Similarly, for the Signal-RL experiment, the fitted-curve parameters are a = 395.8, λ = 7.493, with R² = 0.956.
This striking coincidence reveals an important and universal rule in machine learning: the diminishing return of data. The decay rate λ reflects the task’s nature, with the RL task demonstrating a much larger decay rate. This implies that RL tasks are more sensitive to data corruption compared to NLP tasks. An explanation for Equation (2) is provided at the end of this section.
$S = a\left(1 - e^{-\lambda (1 - p)}\right)$ (2)
where parameter $a = \frac{S_0}{1 - e^{-\lambda}}$, and $S_0$ is the model score when the corruption ratio p = 0 (i.e., no corruption).
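As an illustration, the fit could be reproduced with SciPy's curve_fit as sketched below; the data values here are synthetic placeholders, not the measurements reported in Tables 2 and 3:

```python
import numpy as np
from scipy.optimize import curve_fit

def score_model(x, a, lam):
    """Equation (2) written in terms of x = 1 - p (the fraction of clean data)."""
    return a * (1.0 - np.exp(-lam * x))

# Hypothetical (x, score) pairs standing in for the measurements in Tables 2 and 3
x = np.linspace(0.0, 1.0, 11)
rng = np.random.default_rng(0)
scores = 0.475 * (1.0 - np.exp(-3.5 * x)) + rng.normal(0.0, 0.005, x.size)

(a_hat, lam_hat), _ = curve_fit(score_model, x, scores, p0=(1.0, 1.0))
residuals = scores - score_model(x, a_hat, lam_hat)
r2 = 1.0 - np.sum(residuals**2) / np.sum((scores - scores.mean())**2)
print(f"a = {a_hat:.3f}, lambda = {lam_hat:.3f}, R^2 = {r2:.3f}")
```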
The Signal-RL model scores under inserting-noise and masking-region types of data corruption are shown in Figure 5. From Figure 5a, results show that inserting noise is significantly more detrimental than data missing. The model’s performance deteriorates much more rapidly as the noise level increases, falling below the fixed-timing signal performance as soon as the noise level exceeds 10%. Additionally, the model scores become unstable, as indicated by the oscillations in Figure 5a. The training process also becomes unstable, as shown in Figure 6.
Masking region is a unique type of data-missing corruption. Information about vehicles farther away from the intersection is less critical. By gradually discarding less important data, it is observed that the model score only experiences a sharp decline when the masking ratio p exceeds 0.7. This observation leads to an empirical rule: 30% of the data is critical and determines the model’s performance, while the remaining 70% can be missing without significantly affecting the model’s performance. The exact numbers may not hold for other tasks, and for many tasks (e.g., the NLP tasks above), this kind of screening is not feasible.

3.3. Explanation

An explanation for the observed pattern in Figure 4 and Equation (2) is provided here. For both NLP and DRL tasks, the model relies on recognizing patterns (e.g., semantic patterns, queuing patterns) in the data. When the data are completely corrupted, the critical patterns necessary for performance are entirely lost. As corruption decreases, the model rapidly recovers key patterns, resulting in a steep improvement in performance. However, the marginal utility of additional clean data diminishes, leading to a saturation of performance.
First, some basic probability theory is introduced. The probabilities of observing rare events in a large number of trials converge to the Poisson distribution [29]. When n (the number of trials) is large, p (the probability of success per trial) is small and the expected number of successes λ = n · p is finite, the binomial distribution approximates the Poisson distribution:
$P(K = k) \approx \frac{\lambda^{k} e^{-\lambda}}{k!},$
where λ = n · p is the rate parameter, representing the expected number of successes. The exponential term $e^{-\lambda}$ in the Poisson distribution describes the probability of observing 0 successes.
Next, the above theory is applied to our pattern-recognition problem. In both experiments, the dataset size n is large, making the discovery of a pattern from an individual sample a rare event. In our experiments, each pattern has an equal probability p of being corrupted. Let x = 1 − p. A pattern can be recovered multiple times from samples in the dataset with probability P(K = k). The probability of failing to recover such a pattern at corruption level p is $e^{-\lambda x}$, where λ is the pattern appearance rate given no corruption. As x increases, the probability of recovering a pattern, $1 - e^{-\lambda x}$, increases at the marginal rate $\lambda e^{-\lambda x}$. Suppose that the model score S is proportional to the number of patterns identified. Then, as x increases, the rate of change of S is proportional to $\lambda e^{-\lambda x}$. This leads to Equation (4), where a is the coefficient of this linear relation; its solution corresponds to Equation (2). It can be shown that $a \lambda e^{-\lambda x} = \lambda (a - S)$, so Equation (4) actually describes a dynamic system where the rate of change in performance depends on the difference between the system’s current performance and its limit.
$\frac{dS}{dx} = a \lambda e^{-\lambda x}$ (4)
where S is the model score; x = 1 − p, with p the corruption rate; λ is the pattern appearance rate; and a is the performance limit, approximately the model score when there is no corruption.
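For completeness, integrating Equation (4) with the boundary condition S = 0 at x = 0 (zero score under full corruption) recovers Equation (2):

```latex
\[
S(x) = \int_0^x a\lambda e^{-\lambda u}\,du
     = a\left(1 - e^{-\lambda x}\right)
     = a\left(1 - e^{-\lambda(1 - p)}\right).
\]
% Since a - S = a e^{-\lambda x}, Equation (4) can also be written as
\[
\frac{dS}{dx} = \lambda\,(a - S),
\]
% a first-order relaxation toward the performance limit a.
```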

4. Effectiveness of Data Imputation

This section is aimed at assessing whether and how different imputation strategies mitigate the impact of missing data. It is noted that the decision to impute missing data involves a trade-off between recovering missing information and potentially introducing noise.

4.1. Experiment Design

Imputation methods are evaluated under varying data-missing ratios. There are various types of imputation methods, as reviewed in the literature. However, evaluating the effectiveness of these algorithms is non-trivial, and their accuracies are not adjustable parameters. For this study, where the focus is on the trade-off between missing data and noise, it is crucial to control the accuracy of imputation. To achieve this, an artificial imputation method—“inserting noise”—is proposed. With this method, the correct words (for NLP-SL) or state elements (for Signal-RL) are restored, but each is replaced with a random value with probability q, allowing the noise level q of the imputation method to be controlled. Example code for NLP-SL is provided in Listing A2. The function perturb_sentence is responsible for adding missing-type corruption to clean data and carrying out the artificial imputation. Later in this section, traditional imputation methods are also evaluated.
Let S ( p ) represent the model score when the data-missing ratio is p and no imputation is applied. Let S ˜ ( p , q ) represent the model score when the data-missing ratio is p and an imputation method with noise level q is used to preprocess the data before training. The imputation advantage, A ( p , q ) , is defined as:
$A(p, q) = \tilde{S}(p, q) - S(p)$
A ( p , q ) quantifies the improvement (or harm if negative) caused by imputation relative to no imputation. Figure 7 shows the heatmap of imputation advantage for both experiments.
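A sketch of how such an advantage heatmap and its A(p, q) = 0 decision boundary can be computed from a grid of scores is shown below; the score surface used here is a toy surrogate, not the experimental data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical grids: p = data-missing ratio, q = imputation noise level
p_grid = np.linspace(0.0, 1.0, 11)
q_grid = np.linspace(0.0, 1.0, 11)

# S(p): score without imputation (toy surrogate standing in for measured scores)
score_no_imp = 0.475 * (1.0 - np.exp(-3.5 * (1.0 - p_grid)))

# S~(p, q): score with imputation; toy assumption that imputation leaves roughly
# a fraction p*q of entries noisy and that noise is about twice as harmful
score_imp = np.array([[0.475 * (1.0 - np.exp(-3.5 * (1.0 - 2.0 * p * q)))
                       for p in p_grid] for q in q_grid])

advantage = score_imp - score_no_imp[None, :]   # A(p, q) = S~(p, q) - S(p)

fig, ax = plt.subplots()
im = ax.imshow(advantage, origin="lower", extent=[0, 1, 0, 1],
               cmap="coolwarm", aspect="auto")
ax.contour(p_grid, q_grid, advantage, levels=[0.0],
           colors="black", linestyles="--")     # A(p, q) = 0 decision boundary
ax.set_xlabel("data-missing ratio p")
ax.set_ylabel("imputation noise level q")
fig.colorbar(im, label="imputation advantage A(p, q)")
plt.show()
```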

4.2. Observations

Several interesting observations emerge from the heatmaps. First, the Signal-RL task demonstrates significantly greater sensitivity to imputation noise. This can be explained by the fact that sequential decision making is inherently more noise sensitive, combined with this task’s use of “general imputation”. In the heatmaps, red regions indicate where imputation improves model performance, while blue regions show where imputation is detrimental. These visualizations clearly reveal the trade-off between data missingness and noise introduction, with two distinct regions being identifiable:
  • Imputation advantageous corner: This region is located in the lower-right corner of the heatmap, where the data-missing ratio is high, and the imputation noise level is low. Accurate imputation in this region restores critical information, leading to significant improvements in model performance.
  • Imputation disadvantageous edge: This region is near the edge where q = 1. When the imputation noise level approaches 1.0, the noise introduced during imputation overwhelms the model, leading to performance degradation. Interestingly, the greatest harm occurs when the data-missing ratio is around p = 0.6.
Additionally, the black dashed contour line corresponding to A ( p , q ) = 0 is overlaid on the heatmap, indicating the decision boundary between regions where imputation is advantageous versus disadvantageous. The black solid line shows the fitted decision boundary. The green and lime-green dashed lines denote the 68% and 95% confidence intervals, corresponding to ± 1 and ± 1.96 standard errors, respectively. A few notable differences between the Signal-RL and NLP-SL tasks can be observed:
  • Signal-RL Decision Boundary: The contour curve for the Signal-RL task lies much lower and is shifted to the right compared to the NLP-SL task. Moreover, the contour curve for Signal-RL is more ragged and fits to an exponential curve that starts at (0, 0) and intersects the line p = 1 .
  • NLP-SL Decision Boundary: The contour line for the NLP-SL task is smoother and fits well to a logistic function. When p is in the range [0, 0.4], the decision boundary is relatively stable and remains around q = 0.68. For p in the range [0.4, 1.0], the contour line transitions into a sigmoid curve, with its midpoint around p = 0.7. This gradual transition reflects the trade-off between recovering critical information through accurate imputation and introducing ambiguities (e.g., incorrect word predictions) through noisy imputation.
The sigmoid shape of the NLP-SL contour line reflects a smoother transition between advantageous and disadvantageous regions as imputation noise increases, whereas the Signal-RL task exhibits an exponential drop-off in performance due to the compounding effect of errors in sequential decision making. Tasks can be classified based on their sensitivity to noise:
  • Noise-sensitive tasks: Tasks with contour curves below the diagonal (e.g., Signal-RL) are highly sensitive to noise, showing sharp performance degradation as imputation noise increases.
  • Noise-insensitive tasks: Tasks with contour curves above the diagonal (e.g., NLP-SL) are more robust to imputation noise.
These observations are summarized in Figure 8.
Additional experiments were conducted using other common imputation methods. For NLP-SL, two imputation methods are introduced: “wordvec” and “BERT”. The wordvec imputation uses context word vector embeddings (GloVe) and cosine similarity to impute missing words. The BERT method leverages the pre-trained bert-base-uncased model for imputation. These two imputation methods are examples of “exact imputation”. The results of these experiments are shown in Figure 9a,b. To speed up preprocessing, a subset ratio of 0.1 was used for BERT imputation.
The model score for the NLP-SL task shows an increasing standard error, rising from 0.0015 to 0.0227 as the data corruption ratio increases from 0.1 to 1.0. Imputation using word vectors leads to a significant performance decline relative to the no-imputation baseline (Figure 9a), indicating that this approach introduces more noise than it mitigates. BERT-based imputation effectively reconstructs informative content for classification while introducing minimal additional noise, resulting in better performance compared to the no-imputation condition (Figure 9b). At a data-missing ratio of 0.3, the model score difference between BERT-based imputation and the no-imputation baseline is +0.046, with a standard error of 0.0043, yielding a z-score of 10.65. This difference is statistically significant. An important observation is that common imputation methods do not maintain a fixed accuracy level as the missing data ratio p varies. Instead, their performance corresponds to curves, rather than horizontal lines, on the advantage heatmap.
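For illustration, exact imputation with a pre-trained masked language model can be sketched with the Hugging Face fill-mask pipeline, as below. This shows the general approach rather than the exact invocation used in the experiments:

```python
from transformers import pipeline

# bert-base-uncased predicts the most likely token for each [MASK] position
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def impute_masked(sentence: str) -> str:
    """Replace every [MASK] token with BERT's top prediction for that position."""
    while "[MASK]" in sentence:
        preds = fill_mask(sentence)
        # With a single mask the pipeline returns a list of dicts;
        # with several masks it returns one such list per mask position.
        first = preds[0] if isinstance(preds[0], dict) else preds[0][0]
        # Fill only the first mask, then re-run so later predictions see the new context
        sentence = sentence.replace("[MASK]", first["token_str"], 1)
    return sentence

print(impute_masked("the cat [MASK] on the [MASK] ."))
```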
For Signal-RL tasks, an imputation method called “context-filling” is proposed. In this approach, road occupancy cells with a value of zero are imputed as one if sufficient surrounding vehicles are detected, as illustrated in Figure 10. This imputation method is an example of “general imputation”, where each element of the state vector is evaluated for potential correction. Model performance with context-filling imputation, compared to the no-imputation baseline, is presented in Figure 9c. The model score exhibits substantial variability, with standard errors ranging from 4.8 to 47.6 as the corruption level increases from 0.1 to 0.9. Context filling shows no clear advantage and in fact performs slightly worse in our results. At a data-missing ratio of 0.6, the model score difference between context filling and no imputation is −22, with a standard error of 15. The z-score is −1.48 and the p-value is 0.14. This difference is not statistically significant, given the high performance variance observed in the Signal-RL experiments.
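A minimal sketch of the context-filling idea is given below; the neighborhood window and threshold are hypothetical parameters, not the values used in the experiments:

```python
import numpy as np

def context_fill(occupancy, window=2, min_neighbors=3):
    """General imputation for a binary road-occupancy array, shape (lanes, cells).

    A cell currently read as empty (0) is imputed as occupied (1) when at least
    `min_neighbors` of the cells within +/- `window` positions are occupied,
    on the assumption that a gap inside a dense queue is a detection miss.
    """
    filled = occupancy.copy()
    n_lanes, n_cells = occupancy.shape
    for lane in range(n_lanes):
        for i in range(n_cells):
            if occupancy[lane, i] == 0:
                lo, hi = max(0, i - window), min(n_cells, i + window + 1)
                neighbors = occupancy[lane, lo:hi].sum() - occupancy[lane, i]
                if neighbors >= min_neighbors:
                    filled[lane, i] = 1
    return filled

# Example: a queue with one missed detection in the middle
lane = np.array([[1, 1, 0, 1, 1, 0, 0, 0]])
print(context_fill(lane))  # the isolated 0 inside the queue is filled in
```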

5. Effectiveness of Enlarging Dataset

5.1. Experiment Design

The aim of this section is to evaluate the effectiveness of enlarging the dataset and to quantify how much additional data is needed to offset the effects of data corruption. For the NLP-SL experiment, the model is tested on datasets of different sizes and varying data corruption ratios. The variable “subset ratio” represents the proportion of the GLUE dataset used for fine-tuning. The results are shown in Figure 11a. For the Signal-RL task, experiments are run with different numbers of training episodes and data-missing ratios, as shown in Figure 11b.

5.2. Observations

As shown in the figure, as the dataset size increases, model performance converges. However, the results show that data corruption leads to a decline in model performance that cannot be fully recovered by increasing the sample size. When the corruption ratio p is in the range [0, 0.4], the performance decline is nearly linear with respect to p (Figure 11a). For larger p, the model’s performance drops sharply to near zero (Figure 3).
This behavior is characteristic of an exponential function: since $e^{x} = 1 + x + \cdots$, the linear term dominates when x is small. Hence, it is concluded that the performance drop increases approximately exponentially as the data corruption ratio increases. In addition, data corruption also hampers learning efficiency. To achieve the same level of performance (if achievable at all for a corrupted model), the number of samples—and therefore the training time required—increases exponentially with the data corruption level. This is illustrated by the dashed benchmark line in Figure 11b. Quantitative curves showing the relationship between data quality and the required amount of data provide practical insights into data collection strategies.
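This statement can be made explicit from Equation (2): the performance drop relative to the uncorrupted score is

```latex
\[
S_0 - S(p) = a\left(e^{-\lambda(1 - p)} - e^{-\lambda}\right)
           = a\,e^{-\lambda}\left(e^{\lambda p} - 1\right)
           \approx a\,e^{-\lambda}\lambda\,p \quad \text{for small } p,
\]
% i.e., approximately linear in p for small corruption ratios and growing
% exponentially (as e^{\lambda p}) once p becomes large.
```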

6. Conclusions

This study explored the impact of data corruption, including missing and noisy data, on deep learning performance across two distinct domains: supervised learning with NLP tasks (NLP-SL) and deep reinforcement learning for traffic signal optimization (Signal-RL). Our experiments aimed to provide insights into the relationship between data quality and model performance, the trade-offs of imputation strategies, and the effectiveness of increasing data quantity as a remedy for data corruption.
Key Findings
  • Diminishing Returns in Data-Quality Improvement: Both NLP-SL and Signal-RL experiments revealed that model performance follows a diminishing return curve as data corruption decreases. The relationship between the model score S and data corruption level p is well modeled by the function:
    $S = a\left(1 - e^{-\lambda (1 - p)}\right)$
    where parameter $a = \frac{S_0}{1 - e^{-\lambda}}$, and $S_0$ is the model score when the corruption ratio p = 0. This universal trend emphasizes the importance of balancing data quality and preprocessing efforts.
  • Data Noise is More Detrimental than Missing Data: Our results demonstrate that noisy data are significantly more detrimental than missing data, leading to faster performance degradation and increased training instability. This was particularly evident in the reinforcement learning task, where inserting noise caused substantial fluctuations in both training and policy stability.
  • Trade-offs in Data Imputation: Imputation methods can restore critical information for missing data but introduce a trade-off by potentially adding noise. The decision to impute depends on the imputation accuracy, the corruption ratio, and the nature of the task. The imputation advantage heatmap highlights two key regions:
    • Imputation Advantageous Corner: A region where accurate imputation significantly boosts model performance.
    • Imputation Disadvantageous Edge: A region where imputation noise outweighs its benefits, harming model performance.
  • Two Types of Tasks Identified: Tasks are classified into two categories based on their sensitivity to noise:
    • Noise-insensitive tasks: These tasks exhibit gradual performance degradation, with decision boundaries on the heatmap that can be effectively modeled using a sigmoid curve.
    • Noise-sensitive tasks: These tasks exhibit sharp performance drops, with decision boundaries closely approximated by an exponential curve. This behavior is typical in deep reinforcement learning tasks. When only “general imputation” is available—as opposed to “exact imputation”—the sensitivity to noise tends to be further amplified.
  • Limits of Enlarging Datasets: Increasing the dataset size partially mitigates the effects of data corruption but cannot fully recover the lost performance, especially under high noise levels. Enlarging datasets does not entirely offset the detrimental effects of noisy data. The analysis showed that the number of samples required to achieve a certain performance level increases exponentially with the corruption ratio, confirming the exponential nature of the trade-off between data quality and quantity.
  • Impact of Data Corruption on Learning Efficiency: Missing data hampers learning efficiency. To achieve the same performance level (if at all possible), the number of required samples—and hence training time—increases exponentially with the data-missing level.
  • Empirical Rule on Data Importance: For traffic signal control tasks, approximately 30% of the data are critical for determining model performance, while the remaining 70% can be lost with minimal impact on performance. This observation provides practical guidance for prioritizing efforts in data collection and preprocessing. Note that the exact number may not apply to other tasks and, for many tasks, this kind of screening is not feasible.
Implications
The insights from this study provide direct implications for designing robust machine learning systems, benefiting both practitioners and researchers. Imputation strategies should be tailored to task-specific sensitivity, leveraging the identified boundary conditions between beneficial and harmful imputation. For noise-sensitive tasks, such as reinforcement learning, conservative imputation approaches and stricter data-quality controls are essential to maintain model stability and performance. Practitioners should prioritize accurate data collection and preprocessing for the critical subset identified as having the most significant impact on performance. Specifically, in scenarios with limited resources, efforts should focus on minimizing noise in key data elements rather than indiscriminately expanding the dataset.
Future Work
Although this study evaluates two distinct tasks, the generalizability of its findings to other domains remains uncertain. The current study employs straightforward corruption methods, whereas real-world data corruption can be more complex, involving correlated or structured patterns of corruption not accounted for in the current experiments. Additionally, the artificial method of controlling imputation accuracy by inserting random noise provides useful insights into theoretical trade-offs but may not fully capture the nuances of practical imputation techniques. Moreover, although standard imputation methods were investigated, more advanced and sophisticated approaches remain underexplored, leaving their potential effectiveness unclear.
This study opens up several avenues for future research. First, the observed patterns and empirical rules should be validated across broader datasets and additional machine learning tasks, such as computer vision and time-series forecasting. While the core principles (exponential performance decay, imputation decision boundaries) are hypothesized to generalize, their manifestations will likely depend on domain-specific factors. For instance, spatial correlations in CV or temporal dependencies in time-series data may influence how corruption impacts model performance. Notably, CV tasks may exhibit greater noise resilience compared to NLP, as pixel-level redundancies in images can mitigate localized corruption, whereas natural language is heavily compressed. Second, developing adaptive imputation strategies that dynamically balance missing data recovery and noise introduction could further enhance model robustness. Furthermore, numerous techniques have been developed for error detection, removal, and correction [30]; investigating the effectiveness of these methods presents an interesting direction for future study. Finally, theoretical work on the relationship between information entropy, marginal utility, and model learning dynamics under corrupted data could deepen our understanding of these phenomena. By addressing these challenges, the hope is to advance the field’s ability to build robust machine learning models that perform reliably even in the presence of real-world data corruption.

Author Contributions

Conceptualization, Q.L. and W.M.; methodology, Q.L.; software, Q.L.; validation, Q.L. and W.M.; formal analysis, Q.L.; investigation, Q.L. and W.M.; resources, W.M.; writing—original draft preparation, Q.L.; writing—review and editing, Q.L. and W.M.; visualization, Q.L.; supervision, W.M.; project administration, W.M.; funding acquisition, Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shanghai Baiyulan Talent Project Pujiang Program (grant number 24PJD115) and the Shanghai Yangpu District Postdoctoral Innovation & Practice Base Project.

Data Availability Statement

The data presented in this study are openly available in [data-corruption-study] at [https://github.com/qiliuchn/data-corruption-study (accessed on 20 May 2025)], reference number [000001].

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

This appendix contains the key details of the datasets, hyperparameter settings, and code implementation snippets referenced in the main text.
Table A1. Overview of model, dataset, and training configurations for NLP-SL and Signal-RL tasks.

NLP-SL
  Model type: BERT
  Architecture and model description: bert-base-uncased + classification_head; vocab size: 30,522 (WordPiece); num_layers: 12; num_heads: 12; hidden_att: 768; hidden_ffn: 3072
  Datasets: pretraining on Wikitext and Bookcorpus; finetuning on GLUE
  Training config (pretraining): batch_size: 256; max_seq_len: 512; LR scheduler: linear warmup and decay; lr_peak = 1e-4; weight_decay: 0.01
  Training config (finetuning; task sequence: CoLA, SST2, MRPC, QQP, MNLI, QNLI, RTE, WNLI): num_epochs: [5, 3, 5, 16, 3, 3, 3, 6, 2]; batch_size: [32, 64, 16, 16, 256, 256, 128, 8, 4]; lr: [3e-5, 3.5e-5, 3e-5, 3e-5, 5e-5, 5e-5, 5e-5, 2e-5, 1e-5]; weight_decay: 0.01

Signal-RL
  Model type: DQN
  Architecture and model description: hidden_dims: [256, 48]; state: road cell occupancy; action: next phase for the next 6 s; action dim: 4; step reward: r_t = (80 − q_t)/80; stop speed threshold: 0.3 m/s
  Dataset: 50 simulation episodes (1 episode = 1 h = 3600 steps)
  Training config: num_episodes: 50; batch_size: 256; LR scheduler: linear decay with plateau (lr_init = 1e-3, lr_final = 1e-4); epsilon scheduler: linear decay with plateau (ε_init = 1.0, ε_final = 1e-2); discounting γ: 0.98; target_update: 10; buffer_size: 10,000
Listing A1. Code for DQN model of Signal-RL.
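Below is a minimal PyTorch sketch consistent with the architecture described in Section 3.1.2 (a 965-dimensional state, hidden layers of 256 and 48 units with layer normalization, no dropout, and four discrete actions). It is an illustrative reconstruction; parameter names and defaults are assumptions rather than the exact original implementation.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Q-network matching the description in Section 3.1.2: 965-dimensional state,
    hidden layers [256, 48] with layer normalization, and one Q-value per action."""

    def __init__(self, state_dim: int = 965, hidden_dims=(256, 48), num_actions: int = 4):
        super().__init__()
        layers, in_dim = [], state_dim
        for h in hidden_dims:
            layers += [nn.Linear(in_dim, h), nn.LayerNorm(h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, num_actions))
        self.net = nn.Sequential(*layers)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # Q-values, shape (batch, num_actions)

# Greedy action selection for a single (zero) state
q_net = DQN()
state = torch.zeros(1, 965)
action = q_net(state).argmax(dim=1).item()
print(action)
```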
Listing A2. Code for data corruption and artificial imputation (NLP-SL).
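Below is a minimal sketch of the role played by perturb_sentence as described in Section 4.1: missing-type corruption with ratio p followed by artificial imputation with noise level q. The whitespace tokenization, placeholder vocabulary, and the impute flag are illustrative assumptions, not the exact original implementation.

```python
import random

def perturb_sentence(sentence, p, q, vocab, impute=True, rng=random.Random(0)):
    """Apply missing-type corruption with ratio p, then artificial imputation
    with noise level q.

    Each word is marked missing with probability p. If impute is False, the
    missing word stays as a mask placeholder. Otherwise it is "imputed":
    restored to the original word with probability 1 - q, or replaced with a
    random vocabulary word with probability q.
    """
    out = []
    for tok in sentence.split():             # whitespace tokenization (placeholder)
        if rng.random() < p:                 # corruption: the word is missing
            if not impute:
                out.append("[MASK]")         # leave the gap as a mask token
            elif rng.random() < q:           # imputation introduces noise
                out.append(rng.choice(vocab))
            else:                            # imputation recovers the word
                out.append(tok)
        else:
            out.append(tok)
    return " ".join(out)

vocab = ["apple", "run", "blue", "table", "quickly"]  # hypothetical vocabulary
print(perturb_sentence("the cat sat on the mat", p=0.4, q=0.5, vocab=vocab))
```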

References

  1. Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data, 3rd ed.; Wiley: Hoboken, NJ, USA, 2019. [Google Scholar]
  2. Emmanuel, T.; Maupong, T.; Mpoeleng, D.; Semong, T.; Mphago, B.; Tabona, O. A Survey on Missing Data in Machine Learning. J. Big Data 2021, 8, 1–37. [Google Scholar] [CrossRef] [PubMed]
  3. Brown, T.B. Language Models Are Few-shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  4. Pathak, D.; Agrawal, P.; Efros, A.A.; Darrell, T. Curiosity-driven Exploration by Self-supervised Prediction. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2778–2787. [Google Scholar]
  5. Rubin, D.B. Inference and Missing Data. Biometrika 1976, 63, 581–592. [Google Scholar] [CrossRef]
  6. Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006. [Google Scholar]
  7. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. arXiv 2014, arXiv:1412.6572. [Google Scholar]
  8. Moon, T.K.; Stirling, W.C. Mathematical Methods and Algorithms for Signal Processing; Prentice Hall: Upper Saddle River, NJ, USA, 2000; ISBN 0-201-36186-8. [Google Scholar]
  9. Song, H.; Kim, M.; Park, D.; Shin, Y.; Lee, J.-G. Learning from Noisy Labels with Deep Neural Networks: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 8135–8153. [Google Scholar] [CrossRef] [PubMed]
  10. Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77. [Google Scholar] [CrossRef]
  11. Devlin, J. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  12. Liu, Y. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  13. Petroni, F.; Piktus, A.; Fan, A.; Lewis, P.; Yazdani, M.; De Cao, N.; Thorne, J.; Jernite, Y.; Karpukhin, V.; Maillard, J.; et al. KILT: A Benchmark for Knowledge Intensive Language Tasks. arXiv 2020, arXiv:2009.02252. [Google Scholar]
  14. Hausknecht, M.; Stone, P. Deep Recurrent Q-learning for Partially Observable MDPs. AAAI Fall Symp. Ser. 2015, 45, 141. [Google Scholar]
  15. Bai, X.; Guan, J.; Wang, H. A Model-Based Reinforcement Learning with Adversarial Training for Online Recommendation. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  16. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level Control Through Deep Reinforcement Learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  17. Taylor, M.E.; Stone, P. Transfer Learning for Reinforcement Learning Domains: A Survey. J. Mach. Learn. Res. 2009, 10, 1633–1685. [Google Scholar]
  18. Zhou, Y.; Aryal, S.; Bouadjenek, M.R. Review for Handling Missing Data with Special Missing Mechanism. arXiv 2024, arXiv:2404.04905. [Google Scholar]
  19. Schafer, J.L.; Graham, J.W. Missing Data: Our View of the State of the Art. Psychol. Methods 2002, 7, 147–177. [Google Scholar] [CrossRef] [PubMed]
  20. Rubin, D.B. Multiple Imputation for Nonresponse in Surveys; Wiley: New York, NY, USA, 1987. [Google Scholar]
  21. Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R.B. Missing Value Estimation Methods for DNA Microarrays. Bioinformatics 2001, 17, 520–525. [Google Scholar] [CrossRef] [PubMed]
  22. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth International Group: Belmont, CA, USA, 1984. [Google Scholar]
  23. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar]
  24. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.-A. Extracting and Composing Robust Features with Denoising Autoencoders. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 1096–1103. [Google Scholar]
  25. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  26. Yuan, J.; Wang, R.; Zhang, Y. Missing Token Imputation Using Masked Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Virtual Event, 7–11 November 2021; pp. 1234–1240. [Google Scholar]
  27. Che, Z.; Purushotham, S.; Cho, K.; Sontag, D.; Liu, Y. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Sci. Rep. 2018, 8, 6085. [Google Scholar] [CrossRef] [PubMed]
  28. Tong, W.; Hussain, A.; Bo, W.X.; Maharjan, S. Artificial Intelligence for Vehicle-to-Everything: A Survey. IEEE Access 2019, 7, 10823–10843. [Google Scholar] [CrossRef]
  29. Feller, W. An Introduction to Probability Theory and Its Applications, 3rd ed.; Wiley: New York, NY, USA, 1991; Volume 1. [Google Scholar]
  30. Rakhmanov, A.; Wiseman, Y. Compression of GNSS Data with the Aim of Speeding Up Communication to Autonomous Vehicles. Remote Sens. 2023, 15, 2165. [Google Scholar] [CrossRef]
Figure 1. Model convergence in two experiments. (a) NLP-SL experiment: The x-axis represents training steps, and the y-axis shows validation accuracy. (b) Signal-RL experiment: The x-axis denotes training episodes, and the y-axis indicates episode reward. Training configurations are detailed in Table A1.
Figure 2. Simulation environment for Signal-RL experiment. The intersection comprises four approaches, each with three lanes. Traffic demand is generated based on a binomial distribution, with a left-turn—through—right-turn ratio of 1:3:2. The environment state is defined by road occupancy. Each simulation step corresponds to one second, and each episode spans one hour. The step-wise reward is defined in Equation (1).
Figure 3. Model performance across varying data-missing ratios. The x-axis indicates the data-missing ratio, while the y-axis shows the model score—normalized accuracy for NLP-SL and normalized negative queue length for Signal-RL. (a) NLP-SL experiment; (b) Signal-RL experiment. In both cases, model performance initially declines gradually, followed by a steeper drop and a sharp collapse as the missing ratio approaches 1.0. The RL-trained signal outperforms the fixed-timing signal when the data-missing ratio is less than 0.8.
Figure 4. Relationship between model performance and data corruption ratio, fitted by Equation (2). The x-axis represents (1 − corruption ratio), and the y-axis shows the model score—normalized accuracy for NLP-SL and normalized negative queue length for Signal-RL. (a) NLP-SL experiment: Curve-fitting parameters are a = 0.475, λ = 3.517, with R² = 0.995. (b) Signal-RL experiment: Curve-fitting parameters are a = 395.8, λ = 7.493, with R² = 0.956. Both experiments exhibit diminishing returns as data quality improves. The fitted curves align well with the function defined in Equation (2).
Figure 5. Signal-RL model performance under varying data corruption ratios. (a) Noise insertion: The x-axis represents the noise level. Inserting noise is more detrimental than missing data, often leading to unstable model performance. (b) Masked region: The x-axis denotes the masking-region ratio. A ratio of p indicates that the farthest 400 × p meters of each lane are hidden from the model, simulating the limited visual range of video-based vehicle detectors. Model performance remains stable even when up to 70% of less critical data are removed.
Figure 6. Training instability under noise-insertion corruption. The x-axis represents training episodes, and the y-axis shows episode return. From (ad), the observation noise ratio increases from 0.0 to 0.4, leading to progressively more unstable training dynamics.
Figure 7. Heatmap of imputation advantage. The x-axis denotes the data-missing ratio p, and the y-axis represents the imputation noise level q. The black dashed line indicates the decision boundary, separating regions where imputation is beneficial from those where it is detrimental. The black solid line is the fitted curve of decision boundary. The green and lime-green dashed lines denote the 68% and 95% confidence intervals. (a) NLP-SL experiment: NLP-SL is a noise-insensitive task. The decision boundary lies above the diagonal and is well approximated by a logistic function. (b) Signal-RL experiment: Signal-RL exemplifies a noise-sensitive task. The contour line appears below the diagonal, more irregular in shape, and fits an exponential curve starting from (0, 0) and intersecting the line p = 1 .
Figure 8. An illustration of the imputation advantage pattern. The x-axis denotes the data-missing ratio p, and the y-axis represents the imputation noise level q. The "imputation advantageous corner" and "imputation disadvantageous edge" are the regions where imputation advantage and disadvantage concentrate, respectively. A task whose decision boundary lies below the diagonal is called "noise-sensitive", and its boundary is fitted by an exponential function; a task whose decision boundary lies above the diagonal is called "noise-insensitive", and its boundary is fitted by a logistic function.
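The decision boundaries shown in Figures 7 and 8 can be extracted and fitted along the lines of the sketch below. It assumes an imputation-advantage grid advantage[i, j] (score with imputation minus score without) measured at missing ratio p_grid[i] and imputation noise q_grid[j], and uses a generic logistic parameterization for a noise-insensitive task; the exact fitting procedure and functional constants used in the paper may differ.

```python
# A minimal sketch of locating and fitting the decision boundary in the (p, q) plane.
# Assumes advantage[i, j] = (score with imputation - score without), measured at
# missing ratio p_grid[i] and imputation noise q_grid[j] (q_grid sorted ascending),
# and a generic logistic parameterization for a noise-insensitive task.
import numpy as np
from scipy.optimize import curve_fit

def logistic_boundary(p, q_max, k, p_mid):
    """Largest tolerable imputation noise q as a logistic function of the missing ratio p."""
    return q_max / (1.0 + np.exp(-k * (p - p_mid)))

def extract_boundary(p_grid, q_grid, advantage):
    """For each missing ratio, record the largest noise level at which imputation still helps.

    Simplification: assumes the advantage decreases monotonically with q."""
    boundary_p, boundary_q = [], []
    for i, p in enumerate(p_grid):
        helpful = np.where(advantage[i, :] > 0)[0]
        if helpful.size:                          # imputation helps for at least one q
            boundary_p.append(p)
            boundary_q.append(q_grid[helpful.max()])
    return np.array(boundary_p), np.array(boundary_q)

# Hypothetical usage, assuming the (p, q) experiment grid has already been evaluated:
# bp, bq = extract_boundary(p_grid, q_grid, advantage)
# (q_max, k, p_mid), _ = curve_fit(logistic_boundary, bp, bq, p0=(1.0, 10.0, 0.5))
```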
Figure 9. The effectiveness of alternative imputation methods. The x-axis represents the data-missing ratio, and the y-axis indicates the model score (normalized accuracy for NLP-SL; normalized negative queue length for Signal-RL). Solid lines denote the no-imputation baseline, while dashed lines represent model performance with imputation. (a) Imputation using word vectors (NLP-SL task): this method introduces substantial noise, resulting in a significant performance drop compared to the no-imputation baseline. (b) Imputation using BERT (NLP-SL task): BERT-based imputation effectively recovers informative content useful for classification while introducing considerably less noise, leading to improved performance over the no-imputation baseline. At p = 0.3, the model score difference is +0.046 (std. error = 0.0043, z-score = 10.65); the difference is significant. (c) Context-filling imputation (Signal-RL task): this method shows no clear advantage over no imputation. At p = 0.6, the model score difference is −22 (std. error = 15, z-score = −1.48, p-value = 0.14); the difference is not significant.
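As an illustration of the BERT-based imputation in Figure 9b, the sketch below uses Hugging Face's fill-mask pipeline to propose replacements for a masked token. The model name and example sentence are placeholders, and the paper's actual imputation pipeline may differ in detail.

```python
# A minimal sketch of BERT-based token imputation using the Hugging Face fill-mask
# pipeline. The model name and example sentence are illustrative placeholders; the
# paper's actual imputation procedure may differ in detail.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

corrupted = "the film was a complete [MASK] of time and money."
for candidate in unmasker(corrupted, top_k=3):
    print(f"{candidate['token_str']:>10s}  score={candidate['score']:.3f}")

# The top-ranked token would replace the missing one before classification,
# trading missing data for a (hopefully small) amount of imputation noise.
```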
Figure 10. Illustration of context-filling imputation in the Signal-RL experiment. This represents a type of “general imputation”, where each element of the state vector is evaluated for potential correction. Road occupancy cells with a value of zero are imputed as one if sufficient surrounding vehicles are detected.
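A minimal sketch of the context-filling rule illustrated in Figure 10 is given below; the neighborhood window and vehicle-count threshold are illustrative assumptions rather than the settings used in the experiments.

```python
# Sketch of the context-filling rule in Figure 10: an occupancy cell that reads 0 is
# flipped to 1 when enough of its neighbors are occupied. The window size and
# threshold are illustrative assumptions, not the paper's exact settings.
import numpy as np

def context_fill(occupancy, window=2, min_neighbors=3):
    """Impute zero cells of a binary lane-occupancy vector from their local context."""
    occupancy = np.asarray(occupancy, dtype=int)
    imputed = occupancy.copy()
    for i, cell in enumerate(occupancy):
        if cell == 0:
            lo, hi = max(0, i - window), min(len(occupancy), i + window + 1)
            neighbors = np.delete(occupancy[lo:hi], i - lo)  # exclude the cell itself
            if neighbors.sum() >= min_neighbors:             # "sufficient surrounding vehicles"
                imputed[i] = 1
    return imputed

# Example: a gap inside a dense platoon is filled, while isolated zeros are left alone.
print(context_fill([1, 1, 0, 1, 1, 0, 0, 0, 1, 0]))
```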
Figure 11. The effectiveness of enlarging the training dataset. (a) NLP-SL: the x-axis represents the subset ratio, and the y-axis indicates the model score. (b) Signal-RL: the x-axis represents the number of training episodes, and the y-axis indicates the model score. Data corruption leads to a decline in model performance that cannot be fully recovered by increasing the sample size. To achieve the same level of performance (if achievable by the corrupted model), the number of samples required, and hence the training time, increases exponentially with the data corruption level.
Table 1. Taxonomy of imputation techniques with strengths, weaknesses, and use cases.
Category | Method | Strengths/Weaknesses | Use Cases
Statistical-based [19,20] | Mean/Median/Mode | + Simple, fast; − Ignores correlations, distorts variance | Small datasets, MCAR data
 | Maximum Likelihood | + Handles MAR data well; − Computationally intensive | Surveys, clinical trials
 | Matrix Completion | + Captures global structure; − Requires low-rank assumption | Recommendation systems
 | Bayesian Approach | + Incorporates uncertainty; − Needs prior distributions | Small datasets with domain knowledge
Machine Learning [21,22,23] | Regression-based | + Models feature relationships; − Assumes linearity | Tabular data with correlations
 | KNN-based | + Non-parametric, local patterns; − Sensitive to k, scales poorly | Small/medium datasets
 | Tree-based | + Handles nonlinearity; − Overfitting risk | High-dimensional data
 | SVM-based | + Robust to outliers; − Kernel choice critical | Nonlinear feature spaces
 | Clustering-based | + Group-aware imputation; − Depends on cluster quality | Data with clear subgroups
Neural Network [24,25,26] | ANN-based | + Flexible architectures; − Requires large data | Complex feature interactions
 | Flow-based | + Exact density estimation; − High computational cost | Generative tasks
 | VAE-based | + Handles uncertainty; − Blurry imputations | Image/text incomplete data
 | GAN-based | + High-fidelity samples; − Training instability | Media generation
 | Diffusion-based | + State-of-the-art quality; − Slow sampling | High-stakes applications
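To make the taxonomy concrete, the sketch below applies two of the Table 1 entries (statistical mean imputation and KNN-based imputation) to a toy feature matrix using scikit-learn's standard imputers; it is an illustration rather than the pipeline used in the experiments.

```python
# A minimal sketch applying two Table 1 entries with scikit-learn's built-in imputers:
# statistical mean imputation versus KNN-based imputation on a toy feature matrix.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 6.0, 9.0],
              [7.0, 8.0, 12.0]])

mean_imputer = SimpleImputer(strategy="mean")   # fast, but ignores feature correlations
knn_imputer = KNNImputer(n_neighbors=2)         # uses local patterns across similar rows

print(mean_imputer.fit_transform(X))
print(knn_imputer.fit_transform(X))
```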
Table 2. NLP-SL model scores at different data-missing ratios; plotted in Figure 3a.
Data-missing ratio | 0.0 | 0.05 | 0.1 | 0.15 | 0.2
Model score | 0.4669 | 0.4663 | 0.4588 | 0.4572 | 0.4455
Data-missing ratio | 0.25 | 0.3 | 0.35 | 0.4 | 0.45
Model score | 0.4359 | 0.4283 | 0.4312 | 0.4106 | 0.4012
Data-missing ratio | 0.5 | 0.6 | 0.65 | 0.7 | 0.75
Model score | 0.3906 | 0.3501 | 0.3357 | 0.3186 | 0.2619
Data-missing ratio | 0.8 | 0.85 | 0.9 | 0.95 | 1.0
Model score | 0.2375 | 0.2008 | 0.1654 | 0.0897 | 0.0081
Table 3. Mean and standard deviation of Signal-RL model scores at different data-missing ratios; plotted in Figure 3b.
Data-missing ratio | 0.0 | 0.1 | 0.2 | 0.3 | 0.4
Model score mean | 409.86 | 403.30 | 396.40 | 398.40 | 381.74
Model score std | 3.83 | 5.04 | 7.92 | 8.97 | 8.44
Data-missing ratio | 0.5 | 0.6 | 0.7 | 0.8 | 0.9
Model score mean | 376.33 | 371.30 | 348.58 | 292.54 | 231.56
Model score std | 9.62 | 10.05 | 13.25 | 26.44 | 47.59