Computers
  • Article
  • Open Access

20 June 2025

A Data Quality Pipeline for Industrial Environments: Architecture and Implementation

1 CIICESI, ESTG, Polytechnic of Porto, rua do Curral, 4610-156 Porto, Portugal
2 JPM Industry, 3731-901 Vale de Cambra, Portugal
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
This article belongs to the Special Issue Adaptive Decision Making Across Industries with AI and Machine Learning: Frameworks, Challenges, and Innovations

Abstract

In modern industrial environments, data-driven decision-making plays a crucial role in ensuring operational efficiency, predictive maintenance, and process optimization. However, the effectiveness of these decisions is highly dependent on the quality of the data. Industrial data is typically generated in real time by sensors integrated into IoT devices and smart manufacturing systems, resulting in high-volume, heterogeneous, and rapidly changing data streams. This paper presents the design and implementation of a data quality pipeline specifically adapted to such industrial contexts. The proposed pipeline includes modular components responsible for data ingestion, profiling, validation, and continuous monitoring, and is guided by a comprehensive set of data quality dimensions, including accuracy, completeness, consistency, and timeliness. For each dimension, appropriate metrics are applied, including accuracy measures based on dynamic intervals and validations based on consistency rules. To evaluate its effectiveness, we conducted a case study in a real manufacturing environment. By continuously monitoring data quality, problems can be proactively identified before they impact downstream processes, resulting in more reliable and timely decisions.

1. Introduction

In recent years, the increasing adoption of digital technologies in industrial environments has transformed the way organizations manage and optimize their operations. The integration of Internet of Things (IoT) devices, smart sensors, and advanced manufacturing systems has led to the generation of vast amounts of real-time data [1]. These sensor-integrated devices continuously generate significant amounts of data structured as time series, with each record associated with a specific timestamp [2]. This continuous flow of structured data plays a central role in enabling data-driven decision making, supporting key industrial applications such as predictive maintenance, quality control, and the optimization of operational efficiency [3]. However, the value of this data depends on its quality. Decisions based on inaccurate, incomplete, inconsistent, or outdated data can have significant negative consequences, from minor process inefficiencies to severe production downtime or safety risks [4]. This makes data quality one of the foremost challenges in industrial data management. Unlike traditional data systems, industrial data is often generated at high volume and high frequency, with significant heterogeneity in format, structure, and semantics [5]. In addition, its real-time nature and dependence on physical devices introduce variability, noise, and a higher risk of data degradation over time [1].
Data quality is an inherently multifaceted concept, and its interpretation and assessment require careful analysis to align with each organization’s unique objectives and operational needs [6,7]. Rather than being defined at the point of data creation alone, data quality must be assessed in the specific context of its use. Identifying instances of poor data quality is critical, as it can directly impact business performance and prevent informed decision making [3]. Data quality is typically assessed using a set of dimensions, each of which represents a specific attribute or characteristic that contributes to the overall quality of the data [7,8]. These dimensions serve as a structured framework for evaluating, quantifying, and systematically managing data quality [9]. Identifying and understanding these dimensions is essential for organizations seeking to improve the reliability and usability of their data. They form the basis for effective quality assessments and support continuous improvement efforts across data-driven processes [10]. According to Liu et al. [4], who analyzed data-related challenges in smart manufacturing, the most critical issues are directly related to four key dimensions: accuracy, completeness, consistency, and timeliness. Moreover, the operationalization of these dimensions is often supported by quantitative metrics. As noted by Goknil et al. [1], such metrics can take various forms, including normalized scores within the [0, 1] interval, binary indicators, percentages, frequencies, or probabilities, depending on the nature of the dimension and the evaluation context. It is also important to note that a single data quality dimension can be assessed using multiple metrics, and these can be applied at different levels of analysis to produce more granular and accurate evaluations [9].
Without appropriate mechanisms to assess and ensure data quality, decisions based on this data can be unreliable or even potentially dangerous to industrial processes. As a result, there is a growing need for robust solutions that can systematically assess and monitor the quality of data in real-time industrial contexts. This paper addresses this challenge by proposing a data quality pipeline specifically designed for industrial environments, architected and implemented with the specific characteristics and constraints of manufacturing systems in mind. The architecture supports data ingestion, profiling, validation, and continuous monitoring, and is structured around key data quality dimensions such as accuracy, completeness, consistency, and timeliness. By implementing targeted metrics, the pipeline enables early detection of data issues, contributing to more trusted insights and improved responsiveness across industrial operations.
To validate the practical applicability of this pipeline, a case study was conducted in a real-world manufacturing environment. The case study demonstrates how the pipeline can be deployed in a live production environment, providing not only visibility into data quality trends over time, but also actionable feedback to support continuous improvement efforts. In this way, this work contributes to the expanding field of research and practice focused on improving data quality in industrial systems, filling the gap between theoretical frameworks and operational needs. Compared to existing data quality assessment approaches in industrial and big data contexts, the proposed pipeline provides a unified and lightweight architecture that integrates ingestion, profiling, validation, and visualization in real time. While many solutions [1] rely on rule-based frameworks or static configurations, our architecture introduces dynamic metric computation and adaptive profiling mechanisms, allowing the system to react autonomously to changes in data patterns. This reduces manual intervention and increases adaptability in volatile environments. Additionally, the introduction of quality scores (WQS, LWQS, and QSD) enables continuous, time-aware monitoring, a limitation in many prior works that only provide snapshot evaluations. By addressing issues such as dynamic thresholding, modularity, and real-time responsiveness, our pipeline directly overcomes key challenges identified in the recent literature, such as the rigidity of validation frameworks [1,11] and the lack of operational feedback in industrial systems [4].
The main contributions of this work are as follows:
  • The design of a lightweight, modular data quality pipeline adapted to industrial environments with real-time constraints.
  • The implementation of a metric-based profiling mechanism that supports the continuous evaluation of key data quality dimensions over streaming data.
  • The integration of data quality metrics and score results into dashboards, enabling operators to effectively monitor, analyze, and respond to data quality issues.
  • The validation of the approach through a real-world case study, demonstrating its applicability and practical impact.
The remainder of this article is organized as follows: Section 2 reviews previous studies that address challenges and strategies related to data quality and ingestion. It begins with data quality metrics (Section 2.1), followed by quality scores used to assess and summarize data quality (Section 2.2). After these two topics, the section discusses data ingestion architectures designed to handle industrial data streams (Section 2.3). The case study in which the data quality assessment metrics are applied is presented in Section 3. Section 4 outlines the architecture developed in this work, detailing its main components and the mechanisms employed. Section 5 analyzes the results obtained through a case study, highlighting the effectiveness and impact of the proposed approach. Finally, Section 6 summarizes the conclusions and suggests future directions for research in this area.

3. Case Study

The case study describes a real-time monitoring scenario in an industrial production environment, focusing on electric motors integrated into conveyor systems. A conveyor system is a common piece of mechanical handling equipment that transports materials from one location to another [24]. Conveyor systems are especially useful in applications involving the transportation of heavy or bulky materials. These motors are part of automated production lines and play a critical role in the continuous operation of conveyor machines. Conveyor technology is used in various applications, including moving sidewalks and escalators, as well as on many manufacturing assembly lines [24]. Additionally, conveyors are widely used in automated distribution and warehousing systems [24].
Accurate capture and evaluation of sensor data is essential in this context to ensure machine reliability, operational efficiency, and timely fault detection. Each conveyor system is driven by an individual motor designated as SE01, SE02, or SE03. Each motor is equipped with vibration and external temperature sensors. While vibration data is also available, this study focuses exclusively on the motors’ external temperature readings. These sensors are essential for detecting potential overheating issues, identifying abnormal thermal patterns, and supporting predictive maintenance strategies.
The temperature sensors, referred to as Temp1, Temp2, and Temp3, are configured to collect one data point per second, generating a continuous, high-frequency stream of readings. The data is transmitted in a standardized JSON format and ingested by a real-time processing architecture. Each message contains metadata including a sensor identifier, variable name, measured value, and exact timestamp of the reading, as illustrated in Table 1.
Table 1. Data structure.
An example of a real-time message received:
{
  "code": "Temp1",
  "name": "Temp1",
  "value": 42.5,
  "date": "2025-05-23T09:07:36.627000"
}
Three conveyors are instrumented with these sensors, allowing for consistent and scalable data acquisition. This setup enables the monitoring of equipment performance in real-time, capturing dynamic behaviors that could be missed in traditional periodic inspections.
As data volume increases, deeper analysis of data quality characteristics becomes possible, including the identification of missing values, temporal gaps, out-of-range measurements, and anomalies. These challenges are commonly encountered in industrial environments due to the complexity of sensor networks and harsh operating conditions.
Figure 1 provides a visual representation of the data collected from three temperature sensors (Temp1, Temp2, and Temp3). The left side of the figure contains a line graph showing the time evolution of the values recorded by each sensor. The right side contains a table with the raw data, including the timestamp, the recorded value, and the corresponding sensor. This combined view allows for a more comprehensive analysis, facilitating the visual inspection of trends and anomalies, as well as the detailed verification of individual data points. The graph reveals significant anomalous behavior, including peaks, abrupt drops in values, and temporal gaps. Such irregularities could indicate sensor failures, loss of communication, calibration problems, or other issues. These patterns highlight the importance of analyzing data quality in depth before using it for monitoring or decision-making processes.
Figure 1. Raw temperature sensor data from 24 May 2025, between 12:10 p.m. and 6:00 p.m.

4. Architecture

In the domain of industrial data processing, the definition of a robust and extensible architecture is fundamental to ensure the seamless acquisition, processing, and monitoring of high-frequency data streams. The architecture proposed in this work builds upon the foundational principles presented in [11,23], particularly regarding modular ingestion, streaming validation, and data profiling in smart manufacturing environments. Inspired by these contributions, the system was designed to ensure interoperability, traceability, and adaptability in the face of heterogeneous industrial data sources and volatile data streams.
Figure 2 presents the implemented architecture, which integrates three core services, namely ingestion, quality assessment, and visualization, operating in parallel over an event-driven infrastructure powered by Apache Kafka (https://kafka.apache.org/, accessed on 25 May 2025), InfluxDB (https://www.influxdata.com/, accessed on 25 May 2025), and Grafana (https://grafana.com/, accessed on 25 May 2025). This configuration is designed to facilitate low-latency data management and to process high-frequency sensor data in real time, while ensuring continuous monitoring of data quality.
Figure 2. Data streaming pipeline architecture.
At the entry point of the pipeline, industrial sensors publish real-time measurements directly to a centralized Kafka broker, eliminating the need for intermediate preprocessing layers. Each sensor functions as an autonomous source, contributing to a Kafka topic that aggregates data across the production environment. This source-to-broker model reduces latency, promotes scalability, and ensures traceability from the point of origin.
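To make this source-to-broker model concrete, the following Python sketch shows how a sensor-side client could publish a reading in the message format of Table 1 directly to the broker. The broker address, topic name, and use of the kafka-python library are illustrative assumptions; the paper does not prescribe a client implementation:

import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # kafka-python, assumed client library

BROKER = "localhost:9092"      # hypothetical broker address
TOPIC = "sensors.temperature"  # hypothetical topic name

producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_reading(code: str, value: float) -> None:
    # Build a message with the same fields as the JSON example in Section 3.
    message = {
        "code": code,
        "name": code,
        "value": value,
        "date": datetime.now(timezone.utc).isoformat(),
    }
    producer.send(TOPIC, message)

publish_reading("Temp1", 42.5)
producer.flush()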
The ingestion service is designed to consume messages from the Kafka topic continuously, parse them, and then transmit the data to InfluxDB, a time-series database optimized for high-throughput workloads and real-time analytics. Unlike previous approaches that used AVRO schema validation [23], the current pipeline operates without schema enforcement at this stage. It focuses on real-time acquisition and storage instead, prioritizing speed and the separation of validation into a parallel process. Although the ingestion process does not validate the schema, data integrity is indirectly assessed during the profiling stage through the detection of anomalies, missing values, and formatting inconsistencies.
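A minimal sketch of such an ingestion loop is shown below, assuming the kafka-python and influxdb-client packages; the connection settings and measurement name are placeholders rather than the actual deployment configuration:

import json
from datetime import datetime

from kafka import KafkaConsumer
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

consumer = KafkaConsumer(
    "sensors.temperature",                              # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
influx = InfluxDBClient(url="http://localhost:8086", token="TOKEN", org="org")
write_api = influx.write_api(write_options=SYNCHRONOUS)

for msg in consumer:
    reading = msg.value  # parsed JSON with code, name, value, and date fields
    point = (
        Point("temperature")
        .tag("sensor", reading["code"])
        .field("value", float(reading["value"]))
        .time(datetime.fromisoformat(reading["date"]))
    )
    # Store the raw reading; validation runs in the parallel quality service.
    write_api.write(bucket="data", record=point)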
Operating in parallel to the ingestion process, a dedicated data quality service performs continuous profiling of incoming data. This service processes the same Kafka stream and evaluates each batch of records over one-minute windows. For each temporal block, a comprehensive set of data quality metrics (shown in Section 2) is computed. These metrics are calculated incrementally and aggregated to produce a composite data quality score, which enables high-level monitoring of system performance. In contrast to conventional rule-based validation frameworks, such as Great Expectations, the validation logic in this case is integrated within a customized profiling mechanism. The system’s capacity for dynamic adaptation is supported by the incorporation of rolling historical windows, which facilitate the inference of expected behaviors. This strategy enables the system to function effectively in scenarios where predefined thresholds or schemas are unavailable.
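The windowing logic itself can be sketched as a simple tumbling buffer keyed by the minute of each record's timestamp; the metric computation and storage steps are only stubbed here, since they correspond to the equations discussed in Section 2:

from collections import defaultdict
from datetime import datetime

windows = defaultdict(list)  # one-minute tumbling windows, keyed by window start

def window_key(record: dict) -> datetime:
    # Truncate the record timestamp to the minute it falls in.
    return datetime.fromisoformat(record["date"]).replace(second=0, microsecond=0)

def profile_block(window_start: datetime, block: list) -> dict:
    # Placeholder for the metric computation of Equations (1)-(5).
    return {"records": len(block)}

def store_metrics(window_start: datetime, metrics: dict) -> None:
    # Placeholder for writing the computed metrics to the data_quality bucket.
    print(window_start.isoformat(), metrics)

def on_message(record: dict) -> None:
    key = window_key(record)
    windows[key].append(record)
    # Once a newer minute starts, earlier windows are closed and profiled.
    for closed in [k for k in windows if k < key]:
        store_metrics(closed, profile_block(closed, windows.pop(closed)))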
The raw data and the computed quality metrics are stored in InfluxDB and exhibited through Grafana dashboards, which function as the primary interface for monitoring the system. Grafana facilitates the real-time and historical visualization of data readings and the corresponding data quality indicators. This dual perspective empowers technical professionals to cross-reference raw values with their evaluated integrity, facilitating rapid issue diagnosis and operational decision-making.
This architectural choice reflects the key priorities of industrial environments: robustness, modularity, and responsiveness. Decoupling ingestion from validation and monitoring enables each component to scale independently, recover from failures, and avoid compromising the entire pipeline. Additionally, event-driven streaming with Kafka enables near real-time data propagation with minimal delay, and InfluxDB facilitates efficient querying of time-series sensor data. Unlike monolithic or rule-heavy architectures, this design supports the flexible integration of heterogeneous data sources and can adapt to evolving validation needs through profiling.

5. Results and Discussion

This study proposed and implemented a data quality pipeline architecture designed for industrial environments. The focus was on capturing and monitoring sensor data in real time. The system was tested in a real production scenario involving electric motors on conveyor belts, where temperature sensors transmitted data every second. The main objective was to validate the effectiveness of a pipeline capable of ensuring data quality from origin to evaluation and visualization.
The developed architecture consists of three main services: data ingestion, continuous data quality evaluation by calculating metrics in one-minute windows through data profiling, and storage and visualization of these metrics. The complete data flow begins with sensor readings. This data is sent to a specific Apache Kafka topic. In this case, only one topic is used because only temperature sensors were utilized (sensors.temperature_vX). A service consumes this data, stores the raw readings in InfluxDB (bucket: data), directs the data to one-minute time buffers, activates quality calculation routines based on these time blocks, stores the calculated metrics in InfluxDB (bucket: data_quality), and displays the results in real time via Grafana dashboards.
The quality assessment was based on four key dimensions: accuracy, completeness, consistency, and timeliness, each with specific metrics. The historical data used as a reference corresponds to the last day of records stored in InfluxDB. All reported results refer to data collected between 12:10 and 15:30 on 24 May 2025. This time window was selected because it exhibits higher data variability, which provides a more comprehensive basis for quality evaluation.
Accuracy (Equation (1)) was calculated by normalizing current values in relation to each sensor’s history. The minimum and maximum values used in the normalization (Equation (1)) were dynamically defined using the Hampel filter. Unlike the approach adopted by [16], which establishes normalization bounds based on fixed percentiles (specifically the 10th and 90th), the method proposed in the present work utilizes a distinct statistical filtering approach to dynamically define these limits. The Hampel filter identifies and mitigates the influence of outliers to ensure that extreme values do not distort the normalization process [25]. Recognized for its robustness and effectiveness in outlier removal [26], the Hampel filter relies on the median and the median absolute deviation (MAD), making it particularly suitable for IoT and sensor data, which are often affected by noise and sporadic anomalies [25]. In this approach, outliers are defined as values that lie outside the interval I = [median − 3·MAD, median + 3·MAD], where MAD is the median absolute deviation from the median of the historical dataset. The threshold of three MADs is commonly used due to its balance between sensitivity and robustness [25]. This approach allows for the real-time detection of outliers, since any normalized value outside the interval [0, 1] is considered anomalous.
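As an illustration of this normalization, the sketch below derives the bounds from the median and MAD of a sensor's recent history and maps each new value into [0, 1]; the helper names and example window contents are assumptions rather than the paper's implementation:

import numpy as np

def hampel_bounds(history: np.ndarray, k: float = 3.0) -> tuple:
    # Bounds I = [median - k*MAD, median + k*MAD] computed from the history.
    median = float(np.median(history))
    mad = float(np.median(np.abs(history - median)))
    return median - k * mad, median + k * mad

def accuracy(value: float, history: np.ndarray) -> float:
    # Normalize the reading; results outside [0, 1] are treated as outliers.
    low, high = hampel_bounds(history)
    if high == low:
        # Constant history: only an exact match is in range
        # (2.0 is an arbitrary out-of-range marker).
        return 0.5 if value == low else 2.0
    return (value - low) / (high - low)

history = np.array([41.8, 42.0, 42.1, 42.3, 42.5, 42.4, 42.2])
print(accuracy(42.4, history))  # falls inside [0, 1] -> considered accurate
print(accuracy(55.0, history))  # greater than 1 -> flagged as anomalous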
As illustrated in Figure 3, the Hampel-based normalization method results in a clearer distinction between valid and anomalous values compared to the percentile-based approach. The Hampel filter, in particular, demonstrates improved sensitivity to outliers while maintaining robustness in fluctuating environments. This visual comparison underscores the rationale behind adopting a dynamic, median-based strategy instead of static percentile thresholds for more accurate computation of the accuracy metric.
Figure 3. Comparison of accuracy metrics computed using the Hampel filter (top panels) and fixed percentiles (bottom panels) for temperature sensors.
Figure 4, Figure 5 and Figure 6 show the evolution of the accuracy metrics over time for the three temperature sensors during the selected analysis window. The variability in these metrics is due to the dynamic normalization process, which is based on each sensor’s most recent history and the limits defined by the Hampel filter. These visualizations use colored dots to indicate the evaluation result of each time block. Green dots correspond to values within the expected range, yellow dots indicate small deviations, and red dots highlight blocks with significant anomalies or outliers that may require investigation.
Figure 4. Accuracy metric over time for the sensor Temp1.
Figure 5. Accuracy metric over time for the sensor Temp2.
Figure 6. Accuracy metric over time for the sensor Temp3.
Sensor Temp1 shows significant initial fluctuations, with several blocks outside the expected range of [0, 1], represented in yellow, indicating outliers and atypical behavior compared to the sensor’s historical trend. This aligns with the raw data in Figure 1, which shows significant oscillations above and below typical values at the beginning of the period. As time progresses, the values stabilize within the expected range, indicating a more consistent temperature profile.
For Sensor Temp2 (Figure 5), the accuracy results start at lower levels, suggesting below-normal readings. On average, however, the affected blocks did not contain enough low-quality values to be classified as poor-quality blocks. Accuracy gradually increases and stabilizes after 1:00 p.m. The absence of yellow dots suggests that most readings remained within the expected range, indicating reliable and consistent sensor performance over time.
Sensor Temp3 (Figure 6) also shows initial instability, with small oscillations and anomalous values around 2:20 p.m. and 3:00 p.m. These values are visible in Figure 1, which shows two high peaks around the same time. However, accuracy improves and becomes more consistent after that point.
It should be noted that no red dots were identified throughout the analyzed window, indicating that, despite variations and some values outside the [0, 1] range, none of the blocks were classified as severely inaccurate or compromised.
Completeness is assessed with two metrics (Equations (2) and (3)). The first measures completeness at the content level, ensuring that all required fields (code, name, value, and date) are present and contain no null or empty values. The second measures temporal completeness by comparing the number of observed records with the expected number (60 records per minute per sensor). Figure 7 shows both completeness metrics. The use of color encoding facilitates visual inspection: green denotes high completeness, yellow indicates partial data loss or irregularity, and red highlights severe loss.
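A possible per-block implementation of these two metrics, under the assumption that each block holds the parsed JSON records of one one-minute window, is sketched below:

REQUIRED_FIELDS = ("code", "name", "value", "date")
EXPECTED_PER_MINUTE = 60  # one reading per second per sensor

def content_completeness(block: list) -> float:
    # Fraction of records with all required fields present and non-empty.
    if not block:
        return 0.0
    complete = sum(
        all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)
        for record in block
    )
    return complete / len(block)

def temporal_completeness(block: list, n_sensors: int) -> float:
    # Observed records divided by the expected count for the one-minute window.
    expected = EXPECTED_PER_MINUTE * n_sensors
    return min(len(block) / expected, 1.0) if expected else 0.0

block = [{"code": "Temp1", "name": "Temp1", "value": 42.5,
          "date": "2025-05-23T09:07:36.627000"}]
print(content_completeness(block), temporal_completeness(block, n_sensors=3))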
Figure 7. Content Completeness (top panel), and Temporal Completeness (bottom panel).
As can be seen in the top panel, content completeness remains close to 1 throughout the analyzed period, with rare small drops (e.g., around 12:20 and 12:40) that may indicate incomplete records. Overall, the quality of the data content is very high, with results consistently above 0.8 and therefore marked in green.
The temporal completeness metric compares observed records to the expected 60 per minute per sensor, and the bottom panel in Figure 7 shows significant fluctuations, with values below 0.2. This suggests recurrent data loss or sensors transmitting less frequently than expected, resulting in periods with no data. As can be seen in Figure 1, there are multiple data gaps and large blank spaces, indicating intervals during which no data was received. These gaps explain the consistently low values of the temporal completeness metric. Two yellow points stand out, but they appear to be isolated incidents. The variability of this metric suggests that, although data is present, it is not transmitted at the expected frequency.
To calculate consistency (Equation (4)), rules are automatically identified from strong correlations between sensors, defined as a Pearson correlation coefficient greater than 0.8 over historical data. These rules are stored and continuously compared with the correlations observed in each time block. If a new consistent rule emerges during the analysis, it is added to the stored rule set, allowing the rules to evolve dynamically over time. Figure 8 illustrates the evolution of the consistency metric. Each time block is color-coded: green for high consistency, yellow for low consistency, and red for an absence of expected relationships. This coding helps operators quickly identify changes in system behavior or data correlation.
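The rule discovery and per-block evaluation can be sketched as follows, assuming that the readings of each block are aligned into equal-length arrays per sensor; values such as 0.333 and 0.666 then correspond to one or two of three stored rules being satisfied:

import numpy as np

CORRELATION_THRESHOLD = 0.8  # Pearson coefficient above which a rule is created

def discover_rules(history: dict) -> set:
    # Sensor pairs whose historical correlation exceeds the threshold.
    rules, sensors = set(), sorted(history)
    for i, a in enumerate(sensors):
        for b in sensors[i + 1:]:
            if np.corrcoef(history[a], history[b])[0, 1] > CORRELATION_THRESHOLD:
                rules.add((a, b))
    return rules

def consistency(block: dict, rules: set) -> float:
    # Fraction of stored rules whose correlation still holds in the current block.
    if not rules:
        return 1.0
    satisfied = 0
    for a, b in rules:
        if a in block and b in block:
            if np.corrcoef(block[a], block[b])[0, 1] > CORRELATION_THRESHOLD:
                satisfied += 1
    return satisfied / len(rules)

history = {"Temp1": np.array([40.0, 41.0, 42.0, 43.0]),
           "Temp2": np.array([39.5, 40.6, 41.4, 42.6]),
           "Temp3": np.array([45.0, 44.0, 46.0, 43.0])}
rules = discover_rules(history)
print(rules, consistency(history, rules))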
Figure 8. Consistency over time.
As shown in Figure 8, most of the analyzed blocks have a value of 0, indicating an absence of the expected rules between the sensors. This may indicate atypical behavior, sensor faults, or changes in the monitored phenomenon affecting the historical relationships. A few blocks had values of 0.333 and 0.666, and one block had a value of 1, showing that the data only maintained complete consistency with the historical correlations at very specific times. The green points with values of 0.666 and 1 (e.g., around 12:15, 13:20, and 14:00) are positive exceptions. The low frequency of blocks with complete consistency reinforces the need to investigate possible causes of the loss of correlation, such as operational anomalies, environmental changes, or inadequate updating of the dynamic rules.
Timeliness (Equation (5)) is calculated based on the age of the record in relation to its volatility. In this study, volatility was defined as a two-minute window, meaning data is considered timely if it arrives within two minutes of generation. This parameter can be adjusted according to the application scenario’s specific requirements (e.g., real-time monitoring systems or critical control processes). A shorter threshold may be more appropriate for time-sensitive environments, while a longer threshold might be acceptable for less time-sensitive ones. Adjusting the volatility threshold ensures that the timeliness metric remains meaningful and aligned with operational expectations. Figure 9 shows the evolution of the timeliness metric. The timeliness values are also color-coded across the time window. Blocks above 0.5 are green, indicating timely data. Intermediate values are yellow. Blocks below 0.2 are red, which indicates the possibility of latency or delivery issues.
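One common formulation consistent with this description, assumed here for illustration, scores each record as one minus its age divided by the volatility window, clipped to [0, 1]:

from datetime import datetime, timezone

VOLATILITY_SECONDS = 120  # two-minute volatility window used in this study

def timeliness(record_date: str, received_at: datetime) -> float:
    # Age of the record relative to its volatility, clipped to [0, 1].
    generated = datetime.fromisoformat(record_date).replace(tzinfo=timezone.utc)
    age = (received_at - generated).total_seconds()
    return max(0.0, min(1.0, 1.0 - age / VOLATILITY_SECONDS))

# A reading that arrives 30 seconds after generation scores 0.75.
print(timeliness("2025-05-23T09:07:36.627000",
                 datetime(2025, 5, 23, 9, 8, 6, 627000, tzinfo=timezone.utc)))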
Figure 9. Data timeliness metric.
The graph shows oscillations between values close to 0.5 and 0.95. Most blocks are above 0.7, indicating that data is generally received within the acceptable period. However, frequent drops reflect occasional delays in receiving data. These variations may be due to network latency, intermediate processing or queues, or system overload. Despite these variations, there are no long periods of extremely low timeliness, showing that the system generally maintains adequate data arrival times, with room for occasional improvement.
For the Data Quality Indices, the weights assigned to the accuracy and completeness dimensions, as well as to each sensor, are based on their relative importance. The weights used are w_a = 0.7 for accuracy and w_c = 0.3 for completeness, emphasizing the greater impact of accuracy on data quality. These weights were manually defined based on the expected relevance of each metric and sensor in the evaluated scenario, reflecting domain knowledge rather than automated adjustments, and they align with the operational context and the goal of interpretable real-time monitoring. The LWQS is calculated using the previous 10 blocks (10 min, m = 10), and sensor weights for completeness are set to w_cs = 1/n_s for sensors s = 1, ..., n_s. This setup ensures balanced contributions from all sensors, while allowing some to be prioritized based on their criticality. Sensor weights for accuracy are likewise set to w_as = 1/n_s, but here n_s is not the total number of sensors in the system, only the number of sensors that sent data in block j.
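Under the assumption that the WQS of a block is a weighted average of the per-sensor accuracy and completeness scores (the exact formulation is given in Section 2.2 and is not reproduced here), the weighting scheme described above can be sketched as:

W_ACCURACY, W_COMPLETENESS = 0.7, 0.3  # manually defined dimension weights

def wqs(accuracy: dict, completeness: dict, n_total_sensors: int) -> float:
    # Accuracy: equal weights over the sensors that actually sent data in the block.
    acc = sum(accuracy.values()) / len(accuracy) if accuracy else 0.0
    # Completeness: equal weights over all configured sensors.
    comp = sum(completeness.values()) / n_total_sensors if n_total_sensors else 0.0
    return W_ACCURACY * acc + W_COMPLETENESS * comp

# Example block in which all three sensors reported data.
print(wqs({"Temp1": 0.90, "Temp2": 0.80, "Temp3": 0.85},
          {"Temp1": 1.00, "Temp2": 0.20, "Temp3": 0.15}, n_total_sensors=3))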
The choice of β = 5 was not the result of formal parameter tuning, but was based on the visual inspection of the exponential decay curves, as shown in Figure 10. Among the values considered, β = 5 offers a visually balanced trade-off between sensitivity to recent data and retention of historical influence. This value produces a weighting curve in which the most recent blocks retain significant influence while still incorporating meaningful historical context. This balance ensures that the LWQS responds to recent changes in data quality without overreacting to short-term fluctuations, and remains consistent with the intended behavior of the score in industrial time-series contexts.
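A plausible form of this weighting, assumed here for illustration, assigns each of the previous m blocks a weight proportional to exp(-β·k/m), with k = 0 for the most recent block, and computes the LWQS as the resulting weighted average:

import numpy as np

def decay_weights(m: int = 10, beta: float = 5.0) -> np.ndarray:
    # Normalized exponential decay weights; k = 0 is the most recent block.
    k = np.arange(m)
    w = np.exp(-beta * k / m)
    return w / w.sum()

def lwqs(recent_wqs: list, beta: float = 5.0) -> float:
    # Weighted average of the last m WQS values, ordered newest first.
    weights = decay_weights(len(recent_wqs), beta)
    return float(np.dot(weights, recent_wqs))

print(decay_weights())  # weights for m = 10 and beta = 5, as in Figure 10
print(lwqs([0.73, 0.78, 0.75, 0.81, 0.79, 0.82, 0.77, 0.80, 0.76, 0.79]))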
Figure 10. Exponential decay weights of historical data for different β values (m = 10).
The WQS (Figure 11) reveals short-term fluctuations in data quality, reflecting sensor performance in one-minute intervals. These variations are particularly evident in the early segments of the timeline, where drops in accuracy and completeness caused by anomalies and sporadic sensor delays or failures lead to lower WQS values. The WQS results use colored dots to show the evaluation of each time block: green dots correspond to high-quality data in the current block, yellow dots indicate medium-quality data, and red dots highlight blocks with low-quality data. The metric reaches its lowest values between 12:20 and 12:40. As illustrated in Figure 4, Figure 5 and Figure 6, this corresponds to the time interval with the highest occurrence of anomalies. The WQS never exceeds 0.8, mainly due to the consistently low values of the temporal completeness metric (Equation (3)).
Figure 11. Temporal evolution of WQS over the monitored period.
In contrast, the LWQS (Figure 12) has a smoother curve because of its historical aggregation process. As expected, it responds more gradually to sharp drops in quality because it incorporates the influence of past blocks with exponentially decreasing weights. For instance, while the WQS shows a significant decline around 12:25, the LWQS exhibits a less pronounced drop, reflecting the influence of higher-quality previous blocks while still demonstrating a notable decline due to the greater impact of more recent blocks. The LWQS results use colored dots to show how previous blocks were evaluated. Green dots indicate a stable high quality across these blocks. Yellow dots indicate moderate historical quality. Red dots highlight consistently poor quality in past blocks.
Figure 12. Temporal evolution of LWQS over the monitored period.
The QSD (Figure 13), which is defined as the difference between the WQS and the LWQS, explicitly captures these variations. Positive QSD values indicate an improvement in quality compared to recent history, and negative values indicate a deterioration. The yellow dots on the graph do not indicate warnings, only situations in which quality has varied minimally, remaining close to zero. Green dots indicate a substantial improvement in quality compared to previous blocks, and red dots indicate a drop in quality. For instance, when the WQS decreases significantly, the QSD decreases as well, taking on negative values. In contrast, when the WQS increases significantly, the QSD increases, and green dots appear. The predominance of yellow dots along the graph indicates general stability in data quality with no significant variations compared to previous data.
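Since the QSD is simply the difference between the two scores, its computation and the color classification described above can be sketched as follows; the ±0.05 band used to separate the yellow region is an illustrative assumption, not a value taken from the paper:

def qsd(wqs_value: float, lwqs_value: float) -> float:
    # Positive values indicate improvement over recent history, negative a drop.
    return wqs_value - lwqs_value

def qsd_color(delta: float, band: float = 0.05) -> str:
    # Illustrative thresholding for the green/yellow/red dots.
    if delta > band:
        return "green"   # substantial improvement compared to previous blocks
    if delta < -band:
        return "red"     # drop in quality compared to previous blocks
    return "yellow"      # quality roughly stable (close to zero)

delta = qsd(0.72, 0.78)
print(delta, qsd_color(delta))  # approximately -0.06 -> red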
Figure 13. Temporal evolution of QSD over the monitored period.
This validates the role of the QSD as a sensitive indicator for detecting real-time changes in data quality. The QSD works as an early indicator of degradation or improvement, making it suitable for proactive monitoring in industrial IoT-based systems. It should be noted that incorporating specific weights for each sensor and quality dimension allows the system to reflect the operational relevance and relative importance of each sensor, enabling more precise prioritization in scenarios with heterogeneous data sources and variable reliability.
Two interactive Grafana dashboards were developed for real-time analysis, as illustrated in Figure 14 and Figure 15.
Figure 14. Dashboard showing raw data overview.
Figure 15. Dashboard focused on data quality, featuring key indices, alerts, and metric trends over time.
Figure 14 displays the raw data received from the data bucket (bucket: data), including high-level information such as the total number of records received in the last day, the number of records sent by each sensor, a table with the raw entries, an overall records chart, and individual charts per sensor.
In contrast, Figure 15 focuses entirely on data quality. The top three panels display the main data quality indices, followed by a graph that consolidates all quality metrics over time. A table highlights warnings and instances where a metric falls below expected thresholds, and individual plots are provided for each metric. Separating raw data visualization from quality monitoring allows for clearer, more efficient system oversight and rapid issue identification.
These dashboards are designed to support operational decision-making by allowing users to interact with data in real time. Operators can apply time filters, isolate specific sensors, and monitor critical quality indicators over time. Alert thresholds and visual indicators facilitate the rapid detection of anomalies, incomplete data streams, and sensor malfunctions. This interactive setup ensures that data quality issues are promptly identified and addressed, reducing diagnostic time and supporting proactive system maintenance.
When tested, the architecture was able to handle continuous data ingestion with minimal latency. Metrics showed visible fluctuations, such as drops in temporal completeness when sensors stopped sending data, reductions in consistency when expected correlations between sensors were not maintained, and detection of outliers in specific sensors, which could signal possible failures or inaccurate readings. This level of granularity enables detailed diagnosis of data quality in an industrial environment. Separating ingestion and quality services and running them in parallel guarantees robustness and scalability. Additionally, the use of dynamically defined normalization limits increases sensitivity to deviations, and automatically identifying new consistency rules makes the system self-adjusting to new behavior patterns.
The architecture’s flexibility and lightweight design make it suitable for industrial environments with limited computing resources. The system does not depend on external declarative frameworks because it incorporates validation and analysis directly into the data quality service through profiling integration. This approach has the potential to be extended to other types of sensors and industrial variables, such as vibration or pressure, without requiring significant structural changes. Specific adaptations may be required depending on the industrial sector. For example, vibration and torque sensors might be prioritized in automotive environments, while completeness and conformity of batch parameters may be more relevant in food production. The pipeline’s modular structure facilitates these adjustments with minimal effort.
Despite its practical advantages, the proposed architecture has certain limitations. First, reliance on temporal processing windows may introduce slight delays in metric calculation, limiting its suitability for applications requiring ultra-low latency. While the system is designed to scale efficiently and has reliably handled continuous ingestion and processing during evaluations, it may require further adjustments to perform optimally under extreme load conditions, such as processing thousands of concurrent sensor streams. Optimizing Kafka partitioning, adjusting consumer group configurations, and refining InfluxDB indexing strategies may be necessary to avoid ingestion or storage bottlenecks. Additionally, the pipeline has not yet been tested in scenarios involving machine state transitions or multimodal operational contexts, where sensor behavior can vary significantly depending on the phase of the equipment or the operational mode. Finally, while the metric computation logic is solid, it is tailored to a specific set of quality dimensions and may need to be reconfigured for alternative use cases or data models.

6. Conclusions

This article presents a data quality pipeline architecture adapted for industrial environments, with a focus on monitoring temperature sensors on production lines. The proposal integrates the processes of ingesting, evaluating, and visualizing data quality in real time. This approach offers a practical and efficient way to continuously monitor critical parameters. The solution effectively identified quality issues by applying specific metrics associated with four dimensions: accuracy, completeness, consistency, and timeliness. Using composite indices facilitated interpreting results and promoted a holistic view of data quality, supporting technical teams’ decision-making. This approach also enables objective comparisons between different periods or data sources, contributing to process traceability and continuous improvement.
The architecture was implemented using consolidated open source tools, such as Apache Kafka, InfluxDB, and Grafana, which guarantee satisfactory performance, robustness, and clear, intuitive visualization. Additionally, the implementation’s modular and lightweight structure makes it adaptable to different industrial contexts, even in the presence of computational or operational constraints. This flexibility is essential for supporting the evolution of monitoring systems for Industry 4.0, where integration and scalability are fundamental requirements. The solution’s practical applicability is evident in scenarios where information quality directly impacts operational efficiency, process reliability, and the implementation of predictive maintenance strategies. The developed solution provides a solid foundation for more advanced, data-based analyses, facilitating a safer and more informed transition to the digitalization of industrial operations.
For future work, we suggest integrating additional sensors, such as vibration or pressure sensors, to extend the scope of the solution. We also propose conducting collaborative evaluations with industrial partners to assess the direct impact of the pipeline on operational efficiency. These evaluations will quantify the effects of improved data quality monitoring on decision-making, production stability, and maintenance planning. This analysis will validate the system’s practical benefits and reinforce its relevance in real-world scenarios. Furthermore, we propose validating the entire pipeline under various industrial scenarios to ensure its adaptability and applicability across different operational conditions. Finally, we suggest conducting a comparative evaluation between our approach and alternative data quality architectures in order to benchmark performance, scalability, and applicability in industrial contexts.
The developed solution is a practical contribution to improving the reliability of data-based industrial systems, bridging the gap between theory and practice in the context of data quality in Industry 4.0.

Author Contributions

Conceptualization, T.P., E.C.e.S., Ó.O. and B.O.; methodology, T.P., E.C.e.S., Ó.O. and B.O.; validation, F.R.; investigation, T.P., E.C.e.S., Ó.O. and B.O.; resources, F.R.; data curation, F.R.; writing—original draft preparation, T.P.; writing—review and editing, E.C.e.S., Ó.O. and B.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author due to confidentiality agreements with industrial partners.

Acknowledgments

This work was supported by the European Union under the Next Generation EU, through a grant from the Portuguese Republic’s Recovery and Resilience Plan (PRR) Partnership Agreement, within the scope of the project PRODUTECH R3—“Agenda Mobilizadora da Fileira das Tecnologias de Produção para a Reindustrialização”. Total project investment: EUR 166.988.013,71; total grant: EUR 97.111.730,27.

Conflicts of Interest

Author Fillipe Ribeiro was employed by the company JPM Industry. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Goknil, A.; Nguyen, P.; Sen, S.; Politaki, D.; Niavis, H.; Pedersen, K.J.; Suyuthi, A.; Anand, A.; Ziegenbein, A. A Systematic Review of Data Quality in CPS and IoT for Industry 4.0. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
  2. Hu, C.; Sun, Z.; Li, C.; Zhang, Y.; Xing, C. Survey of Time Series Data Generation in IoT. Sensors 2023, 23, 6976. [Google Scholar] [CrossRef] [PubMed]
  3. Peixoto, T.; Oliveira, B.; Oliveira, Ó.; Ribeiro, F. Data Quality Assessment in Smart Manufacturing: A Review. Systems 2025, 13, 243. [Google Scholar] [CrossRef]
  4. Liu, C.; Peng, G.; Kong, Y.; Li, S.; Chen, S. Data Quality Affecting Big Data Analytics in Smart Factories: Research Themes, Issues and Methods. Symmetry 2021, 13, 1440. [Google Scholar] [CrossRef]
  5. Kuemper, D.; Iggena, T.; Toenjes, R.; Pulvermueller, E. Valid.IoT: A framework for sensor data quality analysis and interpolation. In Proceedings of the 9th ACM Multimedia Systems Conference, Amsterdam, The Netherlands, 12–15 June 2018; pp. 294–303. [Google Scholar] [CrossRef]
  6. Batini, C.; Scannapieco, M. Data and Information Quality; Springer: Berlin/Heidelberg, Germany, 2016; Volume 63. [Google Scholar] [CrossRef]
  7. Mahanti, R. Data Quality: Dimensions, Measurement, Strategy, Management, and Governance; ASQ Quality Press: Milwaukee, WI, USA, 2019; Available online: https://asq.org/quality-press/display-item?item=H1552 (accessed on 14 April 2025).
  8. Wang, R.Y.; Strong, D.M. Beyond Accuracy: What Data Quality Means to Data Consumers. J. Manag. Inf. Syst. 1996, 12, 5–33. [Google Scholar] [CrossRef]
  9. Zhang, L.; Jeong, D.; Lee, S. Data Quality Management in the Internet of Things. Sensors 2021, 21, 5834. [Google Scholar] [CrossRef] [PubMed]
  10. Cichy, C.; Rass, S. An Overview of Data Quality Frameworks. IEEE Access 2019, 7, 24634–24648. [Google Scholar] [CrossRef]
  11. Oliveira, Ó.; Oliveira, B. An Extensible Framework for Data Reliability Assessment; SCITEPRESS—Science and Technology Publications: Setúbal, Portugal, 2022; pp. 77–84. [Google Scholar] [CrossRef]
  12. Abideen, Z.u.; Mazhar, T.; Razzaq, A.; Haq, I.; Ullah, I.; Alasmary, H.; Mohamed, H.G. Analysis of Enrollment Criteria in Secondary Schools Using Machine Learning and Data Mining Approach. Electronics 2023, 12, 694. [Google Scholar] [CrossRef]
  13. Ullah, I.; Noor, A.; Nazir, S.; Ali, F.; Ghadi, Y.; Aslam, N. Protecting IoT devices from security attacks using effective decision-making strategy of appropriate features. J. Supercomput. 2023, 80, 5870–5899. [Google Scholar] [CrossRef]
  14. Seghezzi, E.; Locatelli, M.; Pellegrini, L.; Pattini, G.; Giuda, G.M.D.; Tagliabue, L.C.; Boella, G. Towards an Occupancy-Oriented Digital Twin for Facility Management: Test Campaign and Sensors Assessment. Appl. Sci. 2021, 11, 3108. [Google Scholar] [CrossRef]
  15. Cerquitelli, T.; Nikolakis, N.; Bethaz, P.; Panicucci, S.; Ventura, F.; Macii, E.; Andolina, S.; Marguglio, A.; Alexopoulos, K.; Petrali, P.; et al. Enabling predictive analytics for smart manufacturing through an IIoT platform. IFAC-PapersOnLine 2020, 53, 179–184. [Google Scholar] [CrossRef]
  16. Peixoto, T.; Oliveira, B.; Oliveira, Ó.; Ribeiro, F. Real-Time Manufacturing Data Quality: Leveraging Data Profiling and Quality Metrics. In Proceedings of the 10th International Conference on Internet of Things, Big Data and Security—IoTBDS, Porto, Portugal, 6–8 April 2025; SciTePress: Setúbal, Portugal, 2025; pp. 56–68. [Google Scholar] [CrossRef]
  17. Loshin, D. The Practitioner’s Guide to Data Quality Improvement; Elsevier: Amsterdam, The Netherlands, 2011. [Google Scholar] [CrossRef]
  18. Abedjan, Z.; Golab, L.; Naumann, F.; Papenbrock, T. Data Profiling; Synthesis Lectures on Data Management (SLDM); Springer: Cham, Switzerland, 2018; Volume 10, pp. 1–154. [Google Scholar] [CrossRef]
  19. Costa e Silva, E.; Peixoto, T.; Oliveira, Ó.; Oliveira, B. Data Quality Assessment: A Practical Application. In Innovations in Industrial Engineering IV; Lecture Notes in Mechanical Engineering; Springer: Berlin/Heidelberg, Germany, 2025; Chapter 42. [Google Scholar]
  20. Ji, C.; Shao, Q.; Sun, J.; Liu, S.; Pan, L.; Wu, L.; Yang, C. Device Data Ingestion for Industrial Big Data Platforms with a Case Study. Sensors 2016, 16, 279. [Google Scholar] [CrossRef] [PubMed]
  21. Qiao, L.; Li, Y.; Takiar, S.; Liu, Z.; Veeramreddy, N.; Tu, M.; Dai, Y.; Buenrostro, I.; Surlaker, K.; Das, S.; et al. Gobblin; VLDB Endowment: London, UK, 2015; Volume 8, pp. 1764–1769. [Google Scholar] [CrossRef]
  22. Sawant, N.; Shah, H. Big Data Ingestion and Streaming Patterns; Apress: New York, NY, USA, 2013; pp. 29–42. [Google Scholar] [CrossRef]
  23. Oliveira, B.; Oliveira, Ó.; Peixoto, T.; Ribeiro, F.; Pereira, C. Extensible Data Ingestion System for Industry 4.0. In Proceedings of the EPIA Conference on Artificial Intelligence, Viana do Castelo, Portugal, 3–6 September 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 105–114. [Google Scholar]
  24. Prabhudesai, M. Conveyor Belt System with 3 Degrees of Freedom. Int. J. Res. Appl. Sci. Eng. Technol. 2018, 6, 407–411. [Google Scholar] [CrossRef]
  25. Shabou, S. Outlier Detection in Time Series. 2023. Available online: https://s-ai-f.github.io/Time-Series/outlier-detection-in-time-series.html (accessed on 22 May 2025).
  26. Bhowmik, S.; Jelfs, B.; Arjunan, S.P.; Kumar, D.K. Outlier removal in facial surface electromyography through Hampel filtering technique. In Proceedings of the 2017 IEEE Life Sciences Conference (LSC), Sydney, NSW, Australia, 13–15 December 2017; pp. 258–261. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
