A Holistic Quality Assurance Approach for Machine Learning Applications in Cyber-Physical Production Systems

Abstract: With the increasing implementation of sensors in production systems and comprehensive networking, essential preconditions are being established for the successful application of data-driven methods for equipment monitoring, process optimization, and other relevant automation tasks. Typically, these tasks are performed by engineers, who often have little experience with data mining or machine learning techniques and are frequently skeptical about the world of artificial intelligence (AI). Quality assurance of AI results and transparency throughout the IT chain are essential for the acceptance and low-risk dissemination of AI applications in production and automation technology. This article presents a conceptual method for the stepwise and level-wise control and improvement of data quality as one of the most important sources of AI failures. The appropriate process model (the V-model for quality assurance) forms the basis for this.


Introduction
This article considers the application of data-driven methods in engineering applications, especially in production engineering. Due to the current developments in the focus of Industry 4.0, the situation in production engineering is characterized by the fact that machine, plant, and tool technology are becoming increasingly capable for data acquisition and data provision for external applications. Combined with improved networking technologies and more powerful computer systems, this enables a wide application of data mining (DM) and machine learning (ML) algorithms to process control, equipment monitoring, process optimization, and other relevant automation tasks.
The know-how obtained from the data can be used in different ways [1,2]. In the simplest case, the data is used for condition monitoring [3] and, if necessary, to detect deviations from predefined process settings or a norm behavior (e.g., anomaly detection [4]). Complex DM and ML algorithms are also often used to investigate the cause-effect relationships in production processes [5]. Finally, complex algorithms are used for faster process commissioning [6,7], continuous process improvement and optimization [8,9], or predictive maintenance [10].
The benefit of ML essentially depends on the accuracy of the obtained models, on whose basis the appropriate decisions are made. The accuracy of the ML results, in turn, depends both on the suitability of the applied analysis algorithms for the given context [11] and on the quality of the data used [12]. Finally, the quality of the data depends on which measuring principles are used, which measuring ranges the sensors cover, and whether the data transmission functions without interference. The influences mentioned above are superimposed in the signals. Before the main purpose of the data analysis can be carried out, disturbance effects in the data must be recognized and eliminated. For this purpose, there are extensive methods with which simple effects [13] can be identified. There are also methods, e.g., based on fuzzy models, with which complex interactions between linked signals can be detected in order to evaluate the uncertainty of the recording of parameters [14]. Meanwhile, there is preliminary work using AI methods such as deep learning to assess the trustworthiness of complex data sets from machine systems [15,16]. For this, a holistic method is required that starts from the essence of the CPPS and places individual methods for signal processing in an overall context. The existing state of the art of data quality improvement solutions "lack general means to detect, validate and correct the gathered sensory data" [17]. Many known solutions deal with the quality assurance of enterprise-level data (e.g., master data [18,19]), where usually the customer, production planning, and facility data are analyzed and maintained. For the analysis of production processes, the primary sensory (or signal) data are essential [20], for which further methods for quality assurance are necessary [21]. There are already different metrics for master data quality assurance and procedures for its systematic improvement.
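As a minimal illustration of such disturbance detection, the following sketch flags interference spikes in a measurement signal using a rolling median/MAD outlier test. The thresholds, window size, and signal parameters are illustrative assumptions, not taken from the cited methods:

```python
import numpy as np

def flag_disturbances(signal, window=50, z_max=4.0):
    """Flag samples that deviate strongly from the local signal level.

    A rolling median/MAD z-score marks outliers such as spikes caused by
    electromagnetic interference. Window size and threshold are illustrative.
    """
    signal = np.asarray(signal, dtype=float)
    flags = np.zeros(len(signal), dtype=bool)
    for i in range(len(signal)):
        lo, hi = max(0, i - window), min(len(signal), i + window + 1)
        local = signal[lo:hi]
        med = np.median(local)
        mad = np.median(np.abs(local - med)) or 1e-12  # avoid division by zero
        flags[i] = abs(signal[i] - med) / (1.4826 * mad) > z_max
    return flags

# A clean sine signal with two injected interference spikes:
t = np.linspace(0, 1, 500)
x = np.sin(2 * np.pi * 5 * t)
x[100] += 8.0
x[300] -= 8.0
print(flag_disturbances(x).sum())  # → 2
```

In practice, such a simple test only covers spike-like effects; drifts and correlated disturbances require the model-based methods cited above.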
For the data-driven analysis algorithms in production engineering systems, there are particularly complex challenges regarding data quality assurance. These challenges result from the use of data from different sources with different data types and a large number of relevant parameters (often several hundred), which must be combined into a data set. Moreover, the structure of technical production systems, as well as the product life cycle and the analysis workflow, produce numerous possibilities for influencing data quality, which must also be taken into account.
The product life cycle includes the design, assembly, commissioning, operation, and withdrawal-from-service phases. In the design phase of a production system, data quality is already influenced by the choice of suitable measurement principles and the selection of appropriate sensors for data acquisition. During the assembly and commissioning phase, failure detection methods are implemented during machine or tool programming. In the operation phase, faults can occur either in the measuring system or in the communication system that have nothing to do with the desired system behavior, such as signal noise, cable breakage, etc. The fault detection methods are usually implemented during installation and commissioning of the equipment, or later during the operation phase. The strategy and algorithms of the data processing must realize the "correct synchronization" of data from different sources and the context-related extraction of effects from the total data with "mixed content". It becomes clear from these examples that all possible influencing factors must be taken into account in data quality assurance. Only then can the "machine" learn automatically and recognize and correct "data failures" itself if required. This is not possible with the known approaches at the required complexity.
This paper presents a holistic approach to data quality assurance in modern production systems. Holistic means that all relevant methods, most of which have already been introduced in the participating disciplines, are brought together to form a new comprehensive process model: the V-model for data quality assurance. The new approach aims to create transparency in data processing, to avoid incorrect results, and the decisions based on them, in good time, and finally to improve the often still lacking trust in AI.

State of the Art
This section outlines the status of all relevant system components and all methodological aspects that are brought together in the presented approach: the system structure of a production engineering system, the ICT infrastructure, the algorithms of data processing, the algorithms of data mining, and the methods of data quality assurance.

Production Engineering System
A production engineering system comprises all the elements required to perform the production task, such as machines, tools, and conveyor systems. In the third stage of the industrial revolution, production engineering systems were developed into automated systems through the integration of electronics and information technologies. One also speaks of mechatronic systems [22]. Automated systems are capable of acquiring and outputting data on system states, initially to perform automation tasks. As development progressed, additional software functions for data analysis were integrated into production engineering systems. These, combined with the ability to network, led to Cyber-Physical Systems (CPS) [23], specifically Cyber-Physical Production Systems (CPPS) [24]. Acatech defines CPS as follows: "Cyber-physical systems are systems with embedded software (as part of devices, buildings, means of transport, transport routes, production systems, medical processes, logistic processes, coordination processes, and management processes), which: directly record physical data using sensors and affect physical processes using actuators; evaluate and save recorded data, and actively or reactively interact both with the physical and digital world; are connected and in global networks via digital communication facilities (wireless and/or wired, local and/or global); use globally available data and services; have a series of dedicated, multimodal human-machine interfaces." Figure 1 shows the modules of a CPPS. All modules influence the data quality in their own way:

• The basic system, i.e., machine, tool, workpiece, and process (manufacturing process, handling, etc.), represents the behavior that is to be examined and influenced using data. Thus, the behavior of the basic system first defines the analysis tasks and finally the requirements for the data needed for this.
• All data must be generated. This is usually done by measurement systems that influence the quality of the raw data. For example, the measurement range and accuracy, as well as the suitability for the existing environmental conditions, must be matched to the system behavior to be determined.
• The raw data must be transported from the measurement systems into the database, the analysis environment, or the concrete application. The data transfer is done via the IT system, the transmission path for the data, which also influences the data quality in the database. For example, the number of signals and their clock rate determine the required data transmission rate in the transmission path, which is determined, among other things, by the technology of the network connection (e.g., cable, radio), the protocol, and the arrangement of computing power (local, edge, fog, cloud).
• For the transport, the establishment of the analysis capability, and the actual analysis, the data are preprocessed with a sequence of algorithms. The algorithmic steps are summarized in the data processing workflow and determine the quality of the data. Examples include the use of compression methods, window and filter functions, and the type of data synchronization.
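The required data transmission rate mentioned above can be estimated from the number of signals and their clock rate. A minimal sketch follows; the signal count, sampling rate, sample size, and protocol overhead factor are chosen purely for illustration:

```python
def required_bitrate(n_signals, sample_rate_hz, bytes_per_sample, overhead=0.25):
    """Estimate the net bitrate the transmission path must sustain.

    The overhead factor approximates protocol framing and is an assumption.
    """
    payload = n_signals * sample_rate_hz * bytes_per_sample * 8  # bits per second
    return payload * (1 + overhead)

# e.g., 300 signals sampled at 1 kHz as 4-byte floats:
print(f"{required_bitrate(300, 1_000, 4) / 1e6:.1f} Mbit/s")  # → 12.0 Mbit/s
```

Such an estimate makes it easy to check early whether a planned network connection (cable, radio) can carry the sensor load at all.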

Process Models for System Development
The development of mechatronic systems is usually based on the V-model according to VDI guideline 2206 [22]. Due to the increasing development complexity (from a system perspective), the need for collaborative model-based procedures is also increasing in order to efficiently bring together all the necessary specialist disciplines. For this purpose, the W-model [25] was developed for the development of so-called active systems. This W-model is also used for the development of CPPS [26]. The essential difference between the W-model and the V-model becomes apparent during the development of the discipline-specific components, since in the W-model it does not take place independently for each discipline. These process models partly include data-based approaches as well as a data management system, but do not consider the problem of "data quality".

Data-Driven Methods for System Design
From the point of view of mathematics and computer science, data-driven methods are indeed state of the art. For some years now, methods of data analysis have also been introduced in production engineering to gain different knowledge about the behavior of machines and tools and about the interactions in the manufacturing process. There are many applications, ranging from statistical methods to machine learning techniques, which will not be discussed further here. All applications face the same problem: the analysis results can only be as good, and as valuable, as the quality of the incoming data.
In the different areas of production engineering, different software systems are used for data analysis, such as R, Matlab/Simulink, Modelica, or SPSS. In the past, the focus was on mathematical algorithms. However, for the acceptance of data-based methods in production, automation of the methods in an end-to-end data management workflow is needed. Systems that support all phases from connection to sensor and control interfaces to data analysis and result feedback are therefore particularly important. Some examples are ProDaMi-Suite, SAS, SAP-Hana, Detact, or Time-View.
Procedures have been developed for data-driven knowledge generation to develop a powerful, targeted analysis strategy through a systematic way of working and to introduce it into the application in the company. Knowledge Discovery in Databases (KDD) was introduced as a general process for extracting knowledge from data using data analysis algorithms [27]. KDD comprises the following phases: (1) selection of target data to focus on, (2) preprocessing and cleaning of data to obtain consistent data, (3) transformation to reduce the dimensions of the data, (4) data mining to search for patterns of interest, (5) interpretation and evaluation of the mined patterns. Based on KDD, there are two widely used methodological approaches [28,29]. One is the SEMMA method, developed by the SAS Institute for the use of their statistical and business intelligence software. The other is CRISP-DM, developed by a consortium around SPSS and later revised by IBM as ASUM-DM [30] for use with the SPSS software. The application focus of both methods was originally in business analyses (business intelligence). Therefore, further developments were dedicated to the transfer to manufacturing tasks [31,32]. With DMME [33,34] (Data Mining Methodology for Engineering Applications), CRISP-DM was extended to integrate additional work steps for engineering-specific development tasks. The additional work steps support the development and testing of suitable measurement methods for the required condition parameters. DMME was justified by the fact that previous approaches assumed that the data are already available; in the engineering world, the data must usually first be generated by measurement procedures.
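The five KDD phases can be sketched as a minimal pipeline. The raw data, the dropout pattern, and the trivial "3 sigma" mining step below are illustrative assumptions, not part of the KDD definition itself:

```python
import numpy as np

# Hypothetical raw records: a force signal with simulated acquisition dropouts
rng = np.random.default_rng(0)
raw = {"t": np.arange(1000), "force": rng.normal(100.0, 5.0, 1000)}
raw["force"][::97] = np.nan  # every 97th sample is lost

# (1) Selection: focus on the target signal
data = raw["force"]

# (2) Preprocessing/cleaning: remove incomplete samples
clean = data[~np.isnan(data)]

# (3) Transformation: reduce the signal to a few features
features = {"mean": clean.mean(), "std": clean.std(), "p99": np.percentile(clean, 99)}

# (4) Data mining: a trivial "pattern" search - samples outside 3 sigma
outliers = clean[np.abs(clean - features["mean"]) > 3 * features["std"]]

# (5) Interpretation/evaluation: report the finding
print(f"{len(outliers)} of {len(clean)} samples exceed 3 sigma")
```

Real KDD applications replace step (4) with substantial DM/ML algorithms; the sketch only shows how the phases chain together and why poor cleaning in step (2) propagates into every later phase.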

Quality Assurance on Data
This section first concisely systematizes the types of data before summarizing the dimensions for quality assessment of data and associated methods.

Types of Data
All data circulating in a production company can first be divided into two basic types: master data (context data) and transaction data.
Master data includes information that usually does not change at all or changes only rarely. Transaction data, on the other hand, is data that changes continuously. In a production example, this is mainly:
1. machine and plant data (status data and signal data),
2. sensor values from external measurement technology (signal data),
3. control data (states and signal data from PLC and CNC),
4. data from quality control (metrology data).
Among transaction data, signal data have a special meaning. Signal data are mostly high-frequency measurement data or time series that characterize the technical process running in a CPPS, the corresponding technical system, or the control system, and map its properties and processes. Signal data are the data that are continuously generated in a CPPS and characterize it at any time. The signal data change constantly, usually at a high frequency, depending on the type of process and the type of plant. The signal data form the basis of all data-driven applications and strategies in modern CPPS.
In addition to the signal data, state data provide further information about current machine states such as utilization, operating modes, machine alarms, etc. The change frequency in this case is rather low. Additionally, there are quality data, which describe the final states (or intermediate states) of a product and are also relevant. The change frequency of the quality data is usually low.
As can be seen, these subsets of transaction data differ according to where the data originates, but also in the frequency of their possible changes.
All processes that take place in a company or a CPPS use both master and transaction data. In Industry 4.0, access is largely automated, making high quality of all master and transaction data increasingly important. The networking of different subsystems to form the CPPS also means that different types of data must be correctly linked to be able to represent the complex system behavior in the first place, and also to analyze it. For example, to be able to carry out failure cause analyses, the master data of the ERP system describing the processed raw materials are combined with the signal data from the sensors of machines and tools and the measurement data of the quality assurance.
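Such a linkage of master data and signal data can be illustrated with a small, hypothetical example; all table names, column names, and values are invented for illustration:

```python
import pandas as pd

# Hypothetical master data from the ERP system (raw-material batches)
master = pd.DataFrame({
    "batch_id": ["B01", "B02"],
    "supplier": ["A", "B"],
    "alloy": ["AlMg3", "AlMg3"],
})

# Hypothetical signal/quality data recorded per produced part
signals = pd.DataFrame({
    "part_id": [1, 2, 3, 4],
    "batch_id": ["B01", "B01", "B02", "B02"],
    "max_force_kN": [101.2, 99.8, 117.5, 118.1],
    "scrap": [False, False, True, True],
})

# Linking both data types enables a failure-cause analysis per batch:
joined = signals.merge(master, on="batch_id", how="left", validate="m:1")
print(joined.groupby("supplier")["scrap"].mean())
```

The `validate="m:1"` argument already acts as a simple quality check here: the merge fails if the master data contain duplicate batch IDs, i.e., if uniqueness is violated.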

"Quality Dimensions" for Data
In our work, we understand data as a reinterpretable representation of information in a formalized manner, suitable for communication, interpretation, or processing (see also ISO/IEC 2382-1). Quality is commonly defined as the degree according to which inherent characteristics of an object meet requirements (ISO 9001). In the case of data, this is the degree to which dimensions of data quality meet requirements.
The German Society for Information and Data Quality e.V. (DGIQ) defines 15 dimensions of data quality [35], which are shown in Figure 2. These are divided into four categories. The DGIQ data quality model is based on a study by the Massachusetts Institute of Technology (MIT) [36].
Other works propose further definitions of data quality criteria [37][38][39][40]. In 2021, the Data Quality Working Group (Data Quality of DAMA Netherlands) conducted a general survey on definitions of data quality dimensions. It collected definitions from different sources and compared them. The working group also tested the definitions against criteria derived from a standard for concepts and definitions (ISO 704). The result represents a list of 60 dimensions of data quality and their definitions [41]. The resulting dimensions were thereby assigned to the corresponding data concept (such as "Attributes", "Data", "Dataset", "Format", "Metadata", etc.). The most important dimensions of data quality correspond to those defined by DGIQ.
Experience shows that in the production engineering field, no more than "only" 11 relevant data quality criteria are actually pursued [42]. In rare cases, they are all considered simultaneously. These 11 criteria are usually selected according to their meaningfulness and purpose [43]. The following 6 criteria are the most important ones [42]:
• Completeness: a data set must contain all necessary attributes. Attributes must contain all necessary data.
• Uniqueness: each data set must be uniquely interpretable.
• Correctness: the data must correspond to reality.
• Timeliness: all data sets must correspond to the current state of the depicted reality.
• Accuracy: the data must be available with the required accuracy.
• Consistency: a data set must not show any contradictions within itself or with other data sets.
The other 5 criteria can be added over time, to secure and further improve the achieved (basic) data quality:
• Non-redundancy: there must be no duplicates within the data records.
• Relevancy: the information content of data records must meet the respective information requirements.
• Uniformity: the information in a data set must be structured uniformly. That is, a set of data is continuously presented uniformly.
• Reliability: the origin of the data must be traceable.
• Understandability: the terminology and structure of the data sets must be consistent with the ideas of the recipients of the information (e.g., departments).
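Several of these criteria lend themselves to automated checks on a tabular data set. A minimal sketch follows; correctness, accuracy, and consistency are omitted because they require domain-specific reference data, and all column names and thresholds are assumptions:

```python
import pandas as pd

def basic_quality_report(df, key, max_age_s, now_s):
    """Check a data set against some of the basic criteria (illustrative)."""
    return {
        "completeness": float(1 - df.isna().mean().mean()),        # share of filled cells
        "uniqueness": bool(df[key].is_unique),                     # records uniquely identifiable
        "timeliness": bool((now_s - df["t"]).max() <= max_age_s),  # data current enough
        "non_redundancy": int(df.duplicated().sum()) == 0,         # no duplicate records
    }

# Hypothetical records with one missing value in column "v":
df = pd.DataFrame({"id": [1, 2, 3], "t": [10, 20, 30], "v": [1.0, None, 3.0]})
print(basic_quality_report(df, key="id", max_age_s=100, now_s=50))
```

Such a report makes the achieved (basic) data quality visible per criterion, which is a precondition for the stepwise improvement discussed in this article.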
As already mentioned, all data must meet the set quality requirements. However, there are specific (adapted) quality assurance methods for each type of data. Up to now, data quality has been viewed predominantly from a database perspective; that is, the focus was on the master data (context data) and not on the production-relevant transaction data (especially signal data). In Table 1, the authors of this paper have attempted to summarize the status of quality assurance methods for the most frequently occurring data type, signal data.

Table 1. Data quality criteria and examples of corresponding methods for quality inspection as well as for quality restoration using the example of signal data.

For each data quality criterion, examples of methods for checking the criterion (inspection) and associated methods for restoring the data quality (restoration) are listed.

Completeness
• Inspection: check if all data sources provide data. Restoration: restore functionality and the connection to the data source.
• Inspection: verify the operation of sensors within the required specification, such as the measuring range. Restoration: adjust the setup or replace the measuring system.
• Inspection: check the signal curves for gaps, for example, by checking the time steps. Restoration: fill in the gaps with interpolated values; adjust the sampling rate.

Uniqueness
• Inspection: check the plausibility of patterns in the data as well as unrealistic outliers. Restoration: correct the signal allocation.
• Inspection: check the IDs of data records. Restoration: correct the ID assignment.
• Inspection: check the consistency of timestamps when integrating different data records. Restoration: readjust system clocks and reassign records.

Correctness
• Inspection: compare signal characteristics with analytical descriptions according to physical laws and with realistic system behavior. Restoration: map against data that is confirmed to be correct or against a defined, agreed-upon plausibility rule.
• Inspection: compare the signal curves with limit values. Restoration: correct the parameterization of the sensor.

Timeliness
• Inspection: check the timestamps. Restoration: select a correct sampling period; readjust system clocks.
• Inspection: check the assignment of path reference to time, e.g., for roll goods. Restoration: adapt the transformation of path-related signals to a time basis.

Accuracy
• Inspection: check the accuracy of the measurement procedure, for example with a reference measurement. Restoration: change the measuring method or calibrate the measuring system.

Consistency
• Inspection: check the use of consistent physical dimensions. Restoration: adjust the physical dimensions.

Uniformity
• Inspection: check the structure of the metadata schemas. Restoration: unify the metadata schemas.

Reliability
• Inspection: check the traceable documentation of the origin of the data along the measurement chain and the pipeline to data processing. Restoration: complete the documentation on the design and operation of the system and, in particular, on the measurement methods and data processing algorithms used.
• Inspection: check the traceable documentation of the application processes of the system or the experimental plan. Restoration: complete the documentation on the experimental design and associated metadata.
• Inspection: monitor the reliability of data sources. Restoration: replace unreliable measurement systems.

Understandability
• Inspection: check the data structure and the comprehensibility of the names of signals and features, e.g., for ordinally scaled data. Restoration: adapt the data structure, names, and attributes; use standardized taxonomies.
• Inspection: check the structure of the documentation. Restoration: adjust the structure of the documentation; use metadata schemas.
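The completeness checks and restorations from Table 1 can be sketched in a few lines, e.g., detecting gaps in the time steps of a signal and filling them with interpolated values. The nominal sampling period, tolerance, and sample values are illustrative:

```python
import numpy as np

def find_gaps(timestamps, nominal_dt, tol=0.5):
    """Return indices after which the time step exceeds the nominal period."""
    dt = np.diff(timestamps)
    return np.where(dt > nominal_dt * (1 + tol))[0]

def fill_gaps(timestamps, values, nominal_dt):
    """Re-sample onto the nominal time grid, interpolating missing samples."""
    grid = np.arange(timestamps[0], timestamps[-1] + nominal_dt / 2, nominal_dt)
    return grid, np.interp(grid, timestamps, values)

t = np.array([0.0, 0.01, 0.02, 0.05, 0.06])  # samples at 0.03 and 0.04 are missing
v = np.array([0.0, 1.0, 2.0, 5.0, 6.0])
print(find_gaps(t, 0.01))       # → [2]
grid, filled = fill_gaps(t, v, 0.01)
print(np.round(filled, 2))      # → [0. 1. 2. 3. 4. 5. 6.]
```

Linear interpolation is only a default restoration; whether it is admissible depends on the signal dynamics, which is exactly why such methods must be chosen in the context of the CPPS.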
ISO 8000 [44] also represents a standard concept for ensuring data quality, from the point of view of organizational processes in connection with quality management in companies. The focus here is again on master data, document designations, etc., and not on the signal data addressed here. It should be noted that methods for quality assurance on "master data" or "context data" have already been very well researched and described as part of database theory [45] and therefore are not the subject of the considerations in this article.

Derivation of the Problem
As previous work shows, numerous methods can be used to check and improve data quality. The methods originate from different disciplines (such as database theory, measurement and automation technology, machine tool development, or production engineering) and are usually tailored to delimited problems such as Statistical Process Control (SPC) [46] or Advanced Process Control (APC) [47]. In addition, as already shown, a very large chapter on the use of enterprise master data opens up under the heading of data quality. To differentiate, it should be noted once again that the considerations presented here focus on signal data, which usually represent time-varying data, e.g., measurement signals, while relevant context information can also be included. To ensure the quality of the analysis, some process models (like SEMMA and CRISP-DM [28], ASUM-DM [30], or DMME [33,34]), which treat the methodological process of machine learning holistically, are known.
There is a lack of a holistic approach focused on data quality assurance, with which the relevant measures and algorithms, mostly known to be applied separately in individual special areas, are brought together. In addition, there is the requirement that all areas of influence on the quality of data and the analysis results are mapped holistically with the specifics of the relevant engineering applications.

The New Methodology: Goals and Requirements
Data quality assurance is a complex challenge that must include all steps in the data processing method, from sensor selection to configuration of the measurement or control software to context-related analysis and interpretation of the data. Thus, the approach must describe a principal procedure for holistic data quality assurance for the application of ML to production engineering plants and processes. On the one hand, the approach is intended to improve the accuracy of results of ML procedures, thus increasing their usefulness and ultimately their acceptance. On the other hand, the methods integrated into the procedure model enable the introduction of quality assurance in ML projects. Last but not least, it enables the implementation of software for data management, which would offer the best conditions for a broad and simple subsequent use. Thus, three user groups are addressed: production engineers, data analysts, and developers of relevant software systems.
In the production engineering environment, the following engineering tasks offer particularly high potential for the application of ML projects and generate demand for a data quality assurance approach:
• development of measurement methods as a basis for process control and quality control,
• commissioning from the component to the plant,
• commissioning (also sampling or tryout) and ramp-up of processes,
• condition monitoring (CM) and predictive maintenance (PM),
• machine optimization based on field tests on many machines of the same type,
• optimization of manufacturing processes regarding quality, time, energy, and resource consumption as well as costs.
These engineering tasks must be the focus of the quality assurance procedure. Thus, the Quality Assurance Approach will make a sustainable contribution to increasing the productivity and quality of manufacturing processes and ultimately to conserving resources in manufacturing, based on the increased performance of data-driven methods in the sense of Industry 4.0. The above-mentioned engineering tasks must be analyzed from the perspective of influencing data quality and developed as task-specific workflows subdividing the process model. Although the engineering tasks each have their own target variables and methodical workflows, they are potentially based on the same data-handling principle. Based on the state of the art and the summarized problem statement, the following requirements are placed on the Holistic Quality Assurance Approach. The approach must:
• cover the whole chain of data processing (from the acquisition of raw data from the respective process or plant to the presentation of the analysis results of a data-driven application);
• cover the whole development chain of the (measurement) process or plant (from the goal-setting for an ML application to the implementation of the data acquisition);
• assign the individual data quality criteria to the corresponding areas of influence;
• support the involvement of all required competencies of the responsible personnel or system development experience in all phases of plant design and subsequent data processing;
• establish visibility and assessability of the data quality in all phases, so that the responsible person can assess and decide in time;
• have template capability and allow for automation, so that as many steps as possible to improve data quality can be automated;
• support the basic procedure as a guideline for action and assistance.

Concept
The solution approach consists of merging various methods, previously largely applied separately, with which the quality of the data can be planned, controlled, and ensured, into a holistic methodology. The approach to quality assurance for ML applications presented in the following, especially in cyber-physical production systems, is based on the V-model of software development [48] and the V-model for mechatronic systems [49]. The approach is holistic. This means that all areas of influence on data quality, and consequently on analysis quality, are included, and that the entire chain of data processing must be taken into account. The chain of data processing begins with data generation and the creation of the analysis-capable database, continues through the gain of knowledge using the applied analysis algorithms, and finally extends to the decision or the initiation of the actions that are justified by the analysis results. At first glance, this tends to address software issues. But in fact, when it comes to generating data that must be suitable for answering the analysis task and transferring it to the database in a robust and performant manner, it becomes clear that hardware development plays an equally important role. Under the focus of CPPS, hardware development includes the design of the system, from actuators and sensors to mechanical components, and from transmission elements and interfaces in IT to humans. One crucial thing for the quality of the data is the "data-fair design" of the CPPS. Data-fair design means that the later data workflow must already be considered during system development, which is usually possible only after the commissioning of the CPPS, with the data gained from it. For this reason, the viewpoint of the DMME [33,34] was included in the concept presented here and integrated into the contents of the individual workflows. Figure 3 shows the concept based on the V-model.
The approach is divided into the spheres of influence "Planning and installation of a CPPS" (left-hand path), "Operation of the production system" or CPPS (base of the V), and "Data processing" (right-hand path), equivalent to the phases of the classic V-models. Under "Planning and Installation", the left branch describes the relevant subtasks of the engineering activity, from conceptual design to the implementation of task-appropriate data acquisition in the executable CPPS. The executable CPPS (including the influencing periphery) generates the required data during its operation in the required production process. With the provision of sensor and actuator data, the workflow of data processing begins, extending up to the interpretable analysis result, as illustrated in the right branch of the V-model with "Data processing".

Both workflows pass through four action levels. The action levels are at the same time influence levels on the data quality, thus on the analysis quality, and finally on the effectiveness of the ML application. In the levels, there are different types of influences on the data quality, which are to be worked out and assigned here systematically. In this way, the influences are separated from each other so that they can be identified as quality characteristics and treated individually. Thus, specific measures to ensure data quality can be developed and implemented, and the complexity can be mastered.

Planning and Installation
The "Planning and Installation" phase begins with the definition of the development objectives for the ML applications. The ML applications are part of the CPPS. Therefore, the development objectives are derived from the overall objective for the CPPS. The overall objective is usually defined by the company and reflects global and strategic requirements such as market needs.

Development Objectives for the ML Applications
In the task "Development objectives for the ML applications", it is first determined which system components of the CPPS are to be monitored, analyzed, or controlled; in other words, what the data-driven methods are to be used for. For example, data-driven methods can be used to reduce downtime through predictive maintenance or to improve process quality, productivity, and ecology. Use cases are used to define the technical metrics that are to be investigated in a data-driven manner. In addition, usability requirements, deployment requirements, time and cost constraints, and other project constraints are also agreed upon.
How do these planning requirements affect data quality? First, the basic development task of the ML application is defined. This also specifies the form in which the results are to be presented or made available. Furthermore, with the task, the technical target values are available against which the development success can be evaluated. The technical objective also describes the required analysis accuracy, which is significantly influenced by the quality of the underlying data. The achievable data quality is in turn influenced by the project conditions. The budget, for example, determines the performance of the sensors and IT components to be procured. The available time frame influences the scope of experimentation and testing and the learning of the models. The scope of possible tests even determines the selection of suitable algorithms; artificial neural networks, for example, require a great deal of data.

Conception of the ML Application
The objective definition is followed by the task "Conception of the ML application". This step initially involves developing ideas about the physical effects that can be used to implement the ML applications. For example, bearing wear can be indicated by the power consumption of the drive motors, the bearing temperature, or vibrations at the bearing. For the measurement of the physical effects, technical concepts must be developed with which the required data can be generated. Ideas for the algorithmic procedure and the mathematical modeling of the relationships sought should already be developed in this early phase, since these also place demands on measurement methods and data acquisition. Here, it is essential to analyze the nature of the interactions (linear, nonlinear, number of influencing parameters, etc.). Depending on the number of influencing parameters to be considered, it must be clarified whether further data sources are necessary, for example, software systems for material or tool management or hall climate control. For the use of multiple data sources, a reliable synchronization approach, e.g., via timestamp or ID, must be developed. This determines the allocation of the data from the different sources and thus the data quality. Furthermore, the ML application has to be adapted to the constraints of the CPPS and its environment, such as the network infrastructure. In short, the algorithmic concept must fit the technical concept. The algorithmic concept includes the mathematical workflow from data preprocessing through the synchronization of data sources to the models describing the effects. Usually, there are different solution principles, from which the best one is selected in this task, considering effort and benefit, integrability, and stability.
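The timestamp-based synchronization of multiple data sources mentioned above can be sketched as follows. This is a minimal Python illustration, not part of the original concept; the tolerance value, the data layout, and the tool-management example are assumptions.

```python
from bisect import bisect_left

def sync_by_timestamp(primary, secondary, tolerance):
    """Attach to each primary record the nearest secondary record whose
    timestamp lies within `tolerance` seconds; otherwise None.
    Both inputs are lists of (timestamp, value), secondary sorted by time."""
    sec_ts = [t for t, _ in secondary]
    merged = []
    for t, value in primary:
        i = bisect_left(sec_ts, t)
        best = None
        # candidates: the neighbour on each side of the insertion point
        for j in (i - 1, i):
            if 0 <= j < len(sec_ts) and abs(sec_ts[j] - t) <= tolerance:
                if best is None or abs(sec_ts[j] - t) < abs(sec_ts[best] - t):
                    best = j
        merged.append((t, value, secondary[best][1] if best is not None else None))
    return merged

# Hypothetical example: a fast sensor stream and a slower context source
sensor = [(0.0, 1.2), (0.1, 1.3), (0.2, 1.1)]
context = [(0.05, "tool_A"), (0.21, "tool_B")]
print(sync_by_timestamp(sensor, context, tolerance=0.06))
# → [(0.0, 1.2, 'tool_A'), (0.1, 1.3, 'tool_A'), (0.2, 1.1, 'tool_B')]
```

Records with no context partner within the tolerance receive `None`, which makes allocation failures visible instead of silently mixing data from different sources.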
This task also plans the experimental studies required to represent the effects on the CPPS [50]. The experimental plan to be created determines which tests will be performed for commissioning the system and which series of tests, if any, will be performed for training the ML models.
From a data quality point of view, fundamental decisions are already made here. The selected physical effect and the selected measurement approach provide the physical context information that has to be investigated with the data analyses. With the measurement approach, the physical effects are determined directly or indirectly. Thus, the required data quality is determined by the analysis task. That is, whether regression analyses or Fourier analyses are targeted will affect the required data quality. In addition, the distance of the sensor from the effective point is determined in principle during the design phase. Both determine the achievable measurement accuracy and the risk of interfering disturbances. The IT concept for data transmission determines its performance and thus limits the amount of data that can be transmitted and thus the usable resolution of the measurements. The context information created in this task allows a context-related plausibility check or correction of the acquired data in the model phase "data processing" (Section 4.3).

Design of Measurement System and Its Integration
In the task "Design of measurement system and its integration", measurement systems and their integration into the IT environment are designed according to the concept. This means that the measurement and data transmission chain from the sensor to the data memory is planned in detail. For example, measuring ranges, sampling times, and protection classes are specified. The handling of the measurement technology is worked out, for example, in instructions for calibration or measured-variable calculation. With the concrete design of the IT structure, it is determined which operations in the data preprocessing take place at the control PC, edge PC, fog PC, or in the cloud.
For the data quality, this means that the selection of the specific sensors and IT components determines the measurement ranges, signal patterns, performance, and susceptibility to faults.

Realization of Data Acquisition
Finally, measurement systems, data acquisition, and IT integration are installed, commissioned, and tested at the task "Realization of data acquisition". For this purpose, the specifications of the test plan from "Conception of the ML application" and the instructions from "Design of measurement system and its integration" are used. During the tests, data is already generated that contains all relevant information and characteristics for the ML application to map the effects to be observed in the CPPS. During commissioning, the first digital fingerprint of the real CPPS is also generated, mapping the behavior at the beginning of its life. The operability of the IT infrastructure is also established so that the database can be generated without interference. During this task, any planning errors are detected and eliminated.
From a data quality perspective, the following results are achieved with the implemented data collection. Error-free data acquisition delivers data from the sensors and other data sources up to the data store, which proves the initial functionality and quality of the data acquisition. On the one hand, the initial fingerprint provides a data set with which the ML algorithms can be developed, at least in the Data Understanding phase. The algorithms for data preprocessing can also be developed with this data set. On the other hand, it is a basis of comparison for state changes in the CPPS to be monitored, especially regarding faults and aging or wear effects. In addition, the developed measurement and acquisition methods must be described transparently, for example in metadata, to be able to support the future interpretation of the data.
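The use of the initial fingerprint as a basis of comparison can be sketched as follows. This is a minimal Python illustration under simplifying assumptions; the chosen statistics and the three-sigma band are not prescribed by the approach.

```python
from statistics import mean, stdev

def fingerprint(signal):
    """Condense a commissioning signal into simple reference statistics."""
    return {"mean": mean(signal), "std": stdev(signal)}

def drift_from_baseline(baseline, current_signal, factor=3.0):
    """Flag a state change when the current mean leaves the
    baseline mean +/- factor * baseline std band."""
    return abs(mean(current_signal) - baseline["mean"]) > factor * baseline["std"]

# Fingerprint recorded at the beginning of the CPPS life (hypothetical values)
commissioning = [1.00, 1.02, 0.98, 1.01, 0.99]
baseline = fingerprint(commissioning)
print(drift_from_baseline(baseline, [1.00, 1.01, 0.99]))  # → False
print(drift_from_baseline(baseline, [1.40, 1.45, 1.38]))  # → True
```

In practice the fingerprint would cover many channels and richer features (spectra, load profiles), but the principle of comparing later operation against the commissioning state is the same.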

System Operation
In the "System operation" area of influence, the CPPS is used for its intended purpose in production and produces data.
The sensors initially provide raw data in the form of individual signals or signal bundles that have been calibrated for the intended use during "realization of data acquisition". However, data acquisition can usually only represent a slice of reality. That is, some effects such as unforeseen deviations from the intended use of the CPPS, defects in the CPPS, or interference from the environment may affect the data quality in such a way that error-free preprocessing and analysis of the data are not possible. Difficulties may also arise in correctly interpreting the analysis results. Therefore, additional potentially influential events must be logged. For example, maintenance work on the shop floor can influence system behavior, knowledge of which is helpful for data interpretation with regard to anomalies or the causes of errors. In short, the generated data result from the system behavior of the CPPS and potentially from superimposed influences, errors, and disturbances.

Data Processing
The "Data processing" phase includes the actual work with the data. The phase begins with the gathering of raw data from the data sources via the developed interfaces in the "Data acquisition" task. The delivered raw data are checked and prepared in "Data preprocessing" with respect to their data quality before they are analyzed in the task "Contextual data analysis" with respect to the physical questions in context. Finally, in "Use of the ML application" the results of the ML applications are fed back into the CPPS and fulfill the required task in the CPPS. The tasks described below serve to separate the physical effects of the CPPS from disturbances in the data flow.

Data Acquisition
During the operation of the CPPS, the connected data sources or devices (sensors, IIoT devices, PLCs, etc.) deliver raw data into the pipeline to the data store. The incoming raw data include the context-related effects as well as superimposed extraneous effects and errors; for example, performance limits or network failures may result in patchy data histories, or data sources may fail completely. Maintenance work on non-CPPS data sources such as ERP systems can also result in outages. In this task, the macroscopic availability of the data is checked. This means checking whether the files and data streams are available and whether the data sets are complete. The raw data include metadata, whose availability and completeness are also checked. Missing data or data gaps are to be documented in the dataset report, and it must be indicated if the trustworthiness and analysis accuracy of the ML application, and thus the functionality of the CPPS, are affected.
From a data quality perspective, missing and incomplete data pose a high risk of making the ML application impossible to use or highly error-prone, which can lead to functional limitations or even failure of the CPPS. Missing or incomplete metadata at least limits the interpretability of the data and thus the explainability/reproducibility of the analysis. During data acquisition, the completeness of the data is checked.
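A macroscopic availability check of this kind can be sketched as follows. This is a minimal Python illustration; the dataset layout, channel names, and report wording are assumptions, not part of the original approach.

```python
def check_macroscopic_availability(dataset, expected_channels,
                                   expected_rows, required_metadata):
    """Check macroscopic availability: are all channels (files/streams)
    present, are their histories complete, and is the required metadata
    available? Returns findings for the dataset report."""
    findings = []
    for channel in expected_channels:
        if channel not in dataset["channels"]:
            findings.append(f"channel missing: {channel}")
        elif len(dataset["channels"][channel]) < expected_rows:
            n = len(dataset["channels"][channel])
            findings.append(f"channel incomplete: {channel} ({n}/{expected_rows} records)")
    for key in required_metadata:
        if key not in dataset["metadata"]:
            findings.append(f"metadata missing: {key}")
    return findings

# Hypothetical delivery: the vibration stream failed, sampling rate undocumented
delivery = {"channels": {"spindle_current": [1.1, 1.2, 1.0]},
            "metadata": {"unit": "A"}}
report = check_macroscopic_availability(delivery,
                                        ["spindle_current", "vibration"],
                                        expected_rows=3,
                                        required_metadata=["unit", "sampling_rate"])
print(report)  # → ['channel missing: vibration', 'metadata missing: sampling_rate']
```

Every finding is documented rather than silently tolerated, so that the effect on the trustworthiness of the ML application can be assessed downstream.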

Data Preprocessing
Data preprocessing has the goal of producing a dataset that is ready for analysis. This includes tasks such as the integration of data from the different sources, the calculation of so-called aggregated data from the individual raw data, or the extraction and filtering of data.
Potentially, data quality can be affected along the transmission path from the sensors to storage. For example, performance limits can lead to jitter fluctuations, which can cause problems for analysis algorithms based on constant time step sizes. It is essential to check the microscopic availability of the data. This means, for example, checking whether the signal traces are free of gaps or whether they contain outliers. Data containing gaps or outliers have to be completed or corrected if this is necessary and possible. Another example of a subsequent influence on sensor data is interference in the measurement signals when the shields of cable connections are insufficient or defective.
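The microscopic availability check can be sketched as follows. This is a minimal Python illustration under assumed thresholds (1.5 times the nominal step for gap detection, a plausibility bound for outliers); real signal traces would warrant more robust statistics.

```python
def microscopic_check(timestamps, values, nominal_step, plausible_max):
    """Flag gaps (time deltas well above the nominal step) and outliers
    (values outside a plausible range); repair isolated outliers by
    linear interpolation between their neighbours."""
    gaps = [i for i in range(1, len(timestamps))
            if timestamps[i] - timestamps[i - 1] > 1.5 * nominal_step]
    outliers = [i for i, v in enumerate(values) if abs(v) > plausible_max]
    cleaned = list(values)
    for i in outliers:
        if 0 < i < len(values) - 1:  # only inner samples can be interpolated
            cleaned[i] = (values[i - 1] + values[i + 1]) / 2
    return gaps, outliers, cleaned

t = [0.0, 0.1, 0.2, 0.4, 0.5]   # one sample missing between 0.2 and 0.4
v = [1.0, 1.1, 50.0, 1.2, 1.3]  # 50.0 is a physically implausible spike
print(microscopic_check(t, v, nominal_step=0.1, plausible_max=10.0))
```

Gap and outlier indices are returned alongside the cleaned trace so that every correction can be documented in the data cleaning report instead of being applied invisibly.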
During data preprocessing, coding inconsistencies and value inconsistencies must also be checked and eliminated if necessary. Coding inconsistencies are deviating uses of units of measurement. Value inconsistencies are, for example, the simultaneous use of different terms in metadata. In addition, typographical errors, i.e., spelling errors during manual data entry, must be cleaned up. Figure 4 shows examples of some DQ problems.
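The cleanup of such inconsistencies can be sketched as follows. This is a minimal Python illustration; the unit table, the canonical term list, and the similarity cutoff are assumptions for the example, not part of the approach itself.

```python
from difflib import get_close_matches

UNIT_FACTORS = {"mm": 0.001, "cm": 0.01, "m": 1.0}   # canonical unit: metres
CANONICAL_TERMS = ["spindle", "bearing", "coolant"]  # assumed controlled vocabulary

def normalize_record(value, unit, term):
    """Resolve coding inconsistencies (deviating units), value
    inconsistencies (competing terms), and typographical errors
    in manually entered metadata."""
    if unit not in UNIT_FACTORS:
        raise ValueError(f"unknown unit: {unit}")
    value_si = value * UNIT_FACTORS[unit]
    # map a possibly misspelled manual entry to its closest canonical term
    match = get_close_matches(term.lower(), CANONICAL_TERMS, n=1, cutoff=0.6)
    return value_si, (match[0] if match else term)

print(normalize_record(25.0, "mm", "Bearng"))  # misspelled manual entry
```

Entries that cannot be resolved automatically (unknown unit, no close term match) are left for manual review rather than guessed, keeping the cleaning step traceable.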

All steps of data preprocessing must be documented in the data cleaning report to ensure the traceability and trustworthiness of the ML application.
All of the above content imperfections in the data affect the contextual analysis accuracy. During data preprocessing, the content control of data quality and the establishment of freedom from errors is performed.

Contextual Data Analysis
In this task, the actual contextual analysis of the data takes place to solve the ML task. Ideally, the data should be complete and free of errors as a result of the previous steps. During the data analysis, the data are first prepared for further analysis. For example, the data are formatted as required by the analysis software without changing their meaning. During data exploration, new data sets are also often transformed or constructed from the primary data. For example, derived quantities such as averages, derivatives, etc., are calculated to be able to visualize the physical behavior of the CPPS with appropriate models. Effects such as dead time or the compensation behavior of the underlying processes also must be taken into account or used to restore data quality. Thus, the dynamic process properties provide decisive information regarding the data quality. Figure 5 shows data quality problems that can only be identified and corrected in context.
In data analysis, different modeling variants are usually developed, tested, and finally evaluated in terms of meaningfulness. Model development is an iterative process in which the data are prepared in different ways and model parameters are adjusted to improve the informative value and also the performance of the model. All revisions to the models must be documented.
In this task, data quality is affected by the preparation of the data. When using image data or sound recordings, the labeling of the data is of crucial importance for the data quality. Here, the individual scenes are assigned to certain interpretations so that the model learns the meaning of contained features. Improper labeling affects the accuracy of the models. Otherwise, analysis accuracy reflects the choice and training state of the models. There are different assessment methods for this purpose. In classification tasks, for example, the confusion matrix is used to assess the quality of the classifier. In this task, the control of the statement accuracy of the algorithmic solution takes place regarding the system behavior which can be represented.
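The confusion matrix mentioned above can be computed as in the following minimal Python sketch; the "ok"/"worn" labels are an assumed tool-condition example, not taken from the paper.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows: true class, columns: predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Share of correctly classified samples (diagonal over total)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

labels = ["ok", "worn"]
y_true = ["ok", "ok", "worn", "worn", "worn"]
y_pred = ["ok", "worn", "worn", "worn", "ok"]
m = confusion_matrix(y_true, y_pred, labels)
print(m)            # → [[1, 1], [1, 2]]
print(accuracy(m))  # → 0.6
```

The off-diagonal entries show exactly which classes are confused, which is more informative for assessing the classifier than the accuracy figure alone.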

Use of the ML Application
At the end of the presented workflow, the actual use of the developed ML application for the addressed use cases takes place. The results of the analysis algorithms are visualized and output to influence the behavior of the CPPS with respect to the target variables. With the acceptance of the CPPS, the effectiveness of the ML application regarding the technical and economic target variables defined in the specifications is also checked and evaluated. This also evaluates the success of the development project. The explainability of the presented effects as well as the comprehensibility of the algorithmic workflow is important for the acceptance of the ML application.
During the operation of the CPPS, the data quality and the analysis quality are continuously monitored with regard to data completeness, freedom from errors, algorithm accuracy, and the effectiveness of the ML application. This makes it possible to assess the reliability and trustworthiness at any time and to warn against risks if necessary. Based on the data quality monitoring, a system of measures is set up to deal with critical conditions of the CPPS, the data pipeline, and the algorithm reliability.
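Such continuous monitoring can be sketched as a simple threshold comparison; this is a minimal Python illustration, and the metric names and limit values are assumptions for the example.

```python
def monitor_data_quality(metrics, thresholds):
    """Compare current quality metrics against configured limits and
    return a warning for every violated criterion."""
    warnings = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None:
            warnings.append(f"{name}: metric not reported")
        elif value < limit:
            warnings.append(f"{name}: {value:.2f} below limit {limit:.2f}")
    return warnings

# Hypothetical snapshot of the monitored quality criteria
current = {"completeness": 0.97, "error_freedom": 0.92, "model_accuracy": 0.88}
limits  = {"completeness": 0.99, "error_freedom": 0.90, "model_accuracy": 0.85}
print(monitor_data_quality(current, limits))
# → ['completeness: 0.97 below limit 0.99']
```

The returned warnings would feed the system of measures mentioned above, so that critical conditions in the CPPS, the data pipeline, or the algorithms trigger defined reactions.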

Summary of the Concept
In the previous chapters, the phases and steps of the holistic data quality approach were described. Each step influences the quality of the data in different ways. Table 2 summarizes which quality criteria are influenced by the methodological steps.

Evaluation of the Concept
The proposed concept for holistic data quality assurance has been applied and tested in several research projects, such as "Smart Data Services for Production Systems", "PREMAMISCH-Predictive-Maintenance-Systems for Mixing Plants", and "GIgAfLoPs-Holistic Machine Learning in Production". In the GIgAfLoPs project, the application of industrial ML algorithms was secured in this way. The flow chart shown in Figure 6 illustrates the application of the Holistic Quality Assurance Approach on the example of tool condition monitoring. For more details about this example, see [51].
Figure 6. Application of the holistic quality assurance approach on the example of tool condition monitoring.

Conclusions and Outlook
The paper presents a holistic approach to systematically assuring data quality. The motivation for this methodology results from the growing importance of data-driven methods for the realization of cyber-physical production systems (CPPS). CPPS are systems in which mechanical components interact with information technology and software, including the algorithms for processing and analyzing data. The state of the art shows that numerous methods for data quality assurance exist. However, these methods originate from the world of relational databases, for example, for processing master data. The operation of CPPS, in contrast, is based on transaction data such as signal data, to which the quality assurance methods for databases are only applicable to a limited extent.
In chapter 4, a holistic approach based on the V-model is proposed for the quality assurance of transaction data. The V-model is known from software development and is subdivided into the three phases "Planning and installation of a CPPS" (left-hand path), "Operation of the production system" or CPPS (base of the V), and "Data processing" (right-hand path). Characteristic work steps are defined in the phases, and their influences on data quality are assigned to them. Furthermore, the quality characteristics of the data quality that can be checked there are assigned to the work steps. This structuring of the workflow enables a systematic analysis and, if necessary, repair of the data quality from the data source (sensor) to the context-related analysis of the data (decision in the operation of the CPPS). By controlling the data quality step by step, it becomes possible to separate the "error effects" superimposed in the raw transaction data from the contextual effects relevant for the operation of the CPPS and to treat them separately.
Initially, the approach serves as a basis for the methodological workflows in the development and commissioning of CPPS. Furthermore, the algorithmic workflows for data preprocessing and data analysis can be structured systematically for quality assurance. Finally, the presented approach contributes to the transparency and trustworthiness of ML methods as well as to the increase in the security of the operation of CPPS.
Author Contributions: H.W. worked in the research projects "C3-Carbon Concrete Composite", "Smart Data Services for Production Systems" and "PREMAMISCH-Predictive-Maintenance-Systems for Mixing Plants". Based on these projects, H.W. designed the overall V-model for data quality management. A.D. worked in the research project "GIgAfLoPs-Holistic Machine Learning in Production" and participated in the concept development. H.W. and A.D. wrote the paper. S.I. leads the chair and supervises the research work. All authors have read and agreed to the published version of the manuscript.

Funding:
The following research projects of the chair have made a significant contribution to the methodology presented. The input from a machine perspective was developed in the research projects "Smart Data Services for Production Systems" (This research project was funded by the European Social Fund (ESF) and the Free State of Saxony under the funding code 100302264.) and "PREMAMISCH-Predictive-Maintenance-Systems for Mixing Plants" (This research project is funded by the German Federal Ministry of Economics and Technology through the AiF as part of the program "Central Innovation Program for SMEs" based on a resolution of the German Bundestag with the funding code KK5023201LT0). The following research project of the working group at the Fraunhofer Institute was also incorporated into the presented methodology: "GIgAfLoPs-Holistic Machine Learning in Production" (project sponsor: German Aerospace Center e. V. (DLR), funding agency: BMBF, funding code: 01|S17068B).