Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology

We propose a process model for the development of machine learning applications. It guides machine learning practitioners and project organizations from industry and academia with a checklist of tasks that spans the complete project life-cycle, ranging from the very first idea to the continuous maintenance of any machine learning application. With each task, we propose quality assurance methodology that is drawn from practical experience and scientific literature and that has proven to be general and stable enough to be included in best practices. We expand on CRISP-DM, a data mining process model that enjoys strong industry support but fails to address machine learning specific tasks.


Introduction
Many industries, such as manufacturing (Lee et al., 2015; Brettel et al., 2014), personal transportation (Dikmen and Burns, 2016) and healthcare (Kourou et al., 2015; Esteva et al., 2017), are currently undergoing a process of digital transformation, challenging established processes with machine learning-driven approaches. The expanding demand is highlighted by the Gartner report (Gartner, 2019), claiming that organizations expect to double the number of machine learning (ML) projects within a year.
However, 75-85 percent of practical ML projects currently don't match their sponsors' expectations, according to surveys of leading tech companies (Nimdzi Insights, 2019). One reason is the lack of guidance through standards and development process models specific to ML applications. Industrial organizations, in particular, rely heavily on standards to guarantee a consistent quality of their products or services.
Due to the lack of a process model for ML applications, many project organizations rely on alternative models that are closely related to ML, such as the Cross-Industry Standard Process for Data Mining (CRISP-DM) (Chapman et al., 2000; Wirth and Hipp, 2000; Shearer, 2000). It is grounded on industrial data mining experience (Shearer, 2000) and is considered most suitable for industrial projects amongst related process models (Kurgan and Musilek, 2006). In fact, CRISP-DM has become the de-facto industry standard (Mariscal et al., 2010) process model for data mining, with an expanding number of applications (Kriegel et al., 2007), e.g., in quality diagnostics (de Abajo et al., 2004), marketing (Gersten et al., 2000), and warranty (Hipp and Lindner, 1999).
However, we have identified two major shortcomings of CRISP-DM. First, CRISP-DM does not cover the application scenario where a ML model is maintained as an application. Second, and more worrying, CRISP-DM lacks guidance on quality assurance methodology. This oversight is particularly evident in comparison to standards in the area of information technology (IEEE, 1997) but also apparent in alternative process models for data mining (Marbán et al., 2009) and SEMMA (SAS, 2016). In our definition, quality is defined not only by the product's fitness for its purpose (Mariscal et al., 2010), but also by the quality of the task executions in any phase during the development of a ML application.

Related Work
CRISP-DM defines a reference framework for carrying out data mining projects and sets out activities to be performed to complete a product or service. The activities are organized in sequence and are henceforth called phases. CRISP-DM consists of six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment. The successful completion of a phase initiates the execution of the subsequent activity. However, the sequence is not strict. In fact, CRISP-DM includes iterations of revisiting previous steps until success or completion criteria are met. It can therefore be characterized as a waterfall life cycle with backtracking (Marbán et al., 2009). The standardized process model sets out tasks to be performed during the development of ML applications. Methodology instantiates these tasks, i.e. stipulates how to do a task (or how it should be done).
For each activity, CRISP-DM defines a set of (generic) tasks that are stable and general. Here, tasks are called stable when they are designed to keep the process model up to date with new modeling techniques to come and general when they are intended to cover many possible project scenarios. Given a set of specific application scenarios, specialized tasks instantiate generic ones, describing how a task should be carried out within these scenarios. We refer to Chapman et al. (2000) for an exhaustive listing and description of tasks involved in data mining. CRISP-DM has been specialized, e.g., to incorporate temporal data mining (CRISP-TDM; Catley et al., 2009), null-hypothesis driven confirmatory data mining (CRISP-DM0; Heath and McGregor, 2010), evidence mining (CRISP-EM; Venter et al., 2007), and data mining in the healthcare domain (CRISP-MED-DM; Niaksu, 2015).
Complementary to CRISP-DM, Amershi et al. (2019) and Breck et al. (2017) proposed process models for ML applications (see Table 1). Amershi et al. (2019) conducted an internal study at Microsoft on challenges of ML projects and listed 1) End-to-end pipeline support, 2) Data Availability, Collection, Cleaning and Management, 3) Education and Training, 4) Model Debugging and Interpretability, 5) Model Evolution, Evaluation and Deployment, 6) Compliance, 7) Varied Perceptions as the main challenges in the development of ML applications. Based on the study, Amershi et al. (2019) derived a process model with nine different phases. However, their process model lacks quality assurance methodology and does not cover the business needs. Breck et al. (2017) proposed 28 specific tests to quantify issues in the ML pipeline and to reduce the technical debt of ML applications. These tests estimate the production readiness of a ML application, i.e., the quality of the application in our context. However, their tests do not completely cover all project phases, e.g., excluding the business understanding activity. From our practical experience, business understanding is a necessary first step that defines the success criteria and the feasibility for the subsequent tasks. Without considering the business needs, the ML objectives might be defined orthogonally to the business objectives, causing a great deal of effort to be spent producing the right answers to the wrong questions.
To our knowledge, Marbán et al. (2009) were the first to consider quality in the context of process models for data mining. Borrowing ideas from software development, their work suggests creating traceability, test procedures, and test data for challenging the product's fitness for its purpose during the evaluation phase.
We address these issues by devising a process model for the development of practical ML applications. The process model follows the principles of CRISP-DM, but is adapted to the particular requirements of ML applications, and proposes quality assurance methodology that has become industry best practice. Our contributions focus primarily on the technical tasks needed to produce evidence that the development process of a given ML application is of sufficient quality to warrant the adoption into business processes. The scope of our work is to outline methods to determine the quality of the task execution for every step along the development process, rather than testing the completed product alone. The quality assurance methodology outlined in this paper is intended to be industry-, tool-, and application-neutral by keeping tasks generic within the application scenario. In addition, we provide a curated list of references for an in-depth analysis on the specific tasks.
Note that the processes and quality measures in this document are not designed for safety-critical systems. Safety-critical systems might require different or additional processes and quality measures.

Quality Assurance in Machine Learning Projects
We propose a process model that we call CRoss-Industry Standard Process for the development of Machine Learning applications with Quality assurance methodology (CRISP-ML(Q)) to highlight its compatibility to CRISP-DM. It is intended for the development of machine learning applications, i.e. application scenarios where a ML model is deployed and maintained as part of a product or service, see fig. 1. In addition, quality assurance methodology is introduced in each phase of the process model. In the same manner as CRISP-DM, CRISP-ML(Q) is designed to be industry and application neutral. CRISP-ML(Q) is organized in six phases and expands CRISP-DM with an additional maintenance phase, see Table 1. Moreover, business and data understanding are merged into one phase because industry practice has taught us that these two activities, which are separate in CRISP-DM, are strongly intertwined and are best addressed simultaneously, since business objectives can be derived or changed based on available data. A similar approach has been outlined in the W Model (Falcini et al., 2017).
In what follows, we describe selected tasks from CRISP-ML(Q) for developing ML applications and propose quality assurance methodology to determine whether these tasks were performed according to current standards from industry best practice and academic literature. We follow the principles from the development of CRISP-DM by keeping tasks generic within the application scenarios. We cannot claim that the selection is complete, but it reflects tasks and methods that we consider the most important.

Business and Data Understanding
The initial phase is concerned with tasks to define the business objectives and translate them to ML objectives, to assess the feasibility, to collect and verify the data quality and, finally, to decide whether the project should be continued.
Figure 1: A) In the data mining process, information is extracted directly from data to find patterns and gain knowledge. B) A machine learning application consists of two steps: a machine learning model is trained on data and then applied to perform inference on new data. Note that the model itself can be studied to gain insight within a knowledge discovery process.

Define the Scope of the Machine Learning Application
The first task in the Business Understanding phase is to define the scope of the ML application. CRISP-DM names the data scientist as responsible for defining the scope. However, in daily business, the separation of domain experts and data scientists carries the risk that the application will not satisfy the business needs. It is, therefore, best practice to reach a common understanding of the application by combining the know-how of domain experts and data scientists: the domain expert can formulate the business needs for the ML application and the constraints of the domain.

Success Criteria
We propose to measure the success criteria of a ML project on three different levels: the business success criteria, the ML success criteria and the economic success criteria. According to IEEE (1997), measurability is one of the essential principles of quality assurance methodology. A definition of success criteria that is deemed to be unmeasurable should be avoided. In addition, the success criteria have to be defined in alignment with each other to prevent contradictory objectives.
Business Success Criteria: The first step is to define the purpose and the success criteria of the ML application from a business point of view. The business success can be defined in many different ways and should be measured objectively, for example, increasing the user rate to a certain level or giving useful insight into a process.
Machine Learning Success Criteria: The next task is to 'translate' the business objective into ML success criteria, see table 2. It is advised to define a minimum acceptable level of performance which is good enough to support the business goals for a Proof of Concept (PoC) or Minimal Viable Product (MVP) and which can be improved later on.
Economic Success Criteria: At a higher level, companies follow economic success criteria in the form of key performance indicators (KPIs). Adding a KPI to the project contributes to the success of the project and is considered best practice. A KPI shows decision-makers how the project contributes to their business success and carries information that is usually not expressed in common ML goals. In this task, a measurable KPI is defined, such as time savings in manufacturing, decreases in costs, increases in sales or a quality increase of a product.
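To illustrate what measurable and mutually aligned criteria can look like, the following minimal sketch records one criterion per level and checks them against measured values. All metric names and thresholds are hypothetical and chosen for illustration only; they are not taken from this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriterion:
    level: str                     # "business", "ml" or "economic"
    metric: str                    # name of the measured quantity
    threshold: float               # acceptance bound for the metric
    higher_is_better: bool = True  # direction of the bound

    def is_met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.threshold
        return measured <= self.threshold


# Hypothetical criteria, one per level.
CRITERIA = [
    SuccessCriterion("business", "defect_detection_rate", 0.95),
    SuccessCriterion("ml", "validation_recall", 0.90),
    SuccessCriterion("economic", "inspection_minutes_per_part", 2.0,
                     higher_is_better=False),
]


def evaluate(criteria, measurements):
    """Return {metric: bool}. A missing measurement counts as not met,
    because an unmeasured criterion cannot be verified."""
    return {
        c.metric: (c.metric in measurements and c.is_met(measurements[c.metric]))
        for c in criteria
    }


report = evaluate(CRITERIA, {"defect_detection_rate": 0.97,
                             "validation_recall": 0.88})
print(report)
```

In this sketch the economic criterion fails simply because no measurement was supplied, which mirrors the point that an unmeasurable criterion gives no evidence of success.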

Feasibility
A feasibility test of the ML application could give a rough assessment of the situation and whether further development steps should be pursued. The assessment should cover data availability, data quality, legal constraints, the applicability of the ML technology and preliminary works. Checking the feasibility before setting up the PoC is considered best practice for the overall success of the ML approach (Watanabe et al., 2019). A feasibility study should minimize the risk of premature failures due to false expectations and spending resources on a project that does not deliver the expected results.
Applicability of ML technology: It is common to demonstrate the feasibility of a ML application with a MVP or PoC when the ML algorithm is used for the first time in a specific domain. However, if an ML application has been used successfully before, the development of a MVP or PoC could amount to a loss of time and can be skipped. In that case, it might be more efficient to set up a software project that focuses on the deployment directly. An example from the automotive industry is the price estimation of used cars using ML models (Pudaruth, 2014). ML models are state-of-the-art in any car vending platform and, therefore, do not require a PoC. Scanning preliminary works for either similar applications on a similar domain or similar methodological approaches on a different domain could assess the applicability of the ML technology.
Legal constraints: It is beyond the scope of this paper to consider legal issues but it is essential to include the legal department to check for legal constraints. Legal constraints could be, for example, defined by the licenses on the used software or data, the necessary data anonymization or safety requirements. Legal constraints have to be taken seriously as they could impede the feasibility of the project.
Requirements on the application: The minimal requirements of the applications should be defined as an input for the subsequent phases. Requirements could be, for example, the inference time of a prediction, the memory size of the model (considering it has to be deployed on hardware with limited memory), the performance and the robustness of a model or on the quality of the data (Kuwajima et al., 2018). The challenge during the development is to optimize the success metric while not violating the requirements and constraints.

Data collection
Before starting to collect data, estimate roughly which data and how much of it is necessary and what costs will occur. Data could be collected from many different sources and has to be merged into one data set. Different data sets could have different formats, features or labels, which has to be considered during the merge. Merging the data sets could either be done already in this phase or later in the data preparation phase. However, if there is no data or very little data available, it might be necessary to create an infrastructure to collect the data. The recording of additional data could be done using, for example, techniques like active learning (Cohn et al., 1996) or Bayesian optimization (Osborne et al., 2009). This will prolong the project until the data is collected or could act as an exit criterion if the collection of new data is not feasible.
Data version control: Collecting data is not a static task but rather an iterative task. Thus, modifications of the data set by adding and removing data, as well as modifications of the selected features or labels, should be documented. Version control on the data is one of the essential tools to assure reproducibility and quality as it allows errors introduced during the development, i.e. unfavorable modifications, to be tracked.
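The idea of documenting every modification of the data set can be sketched with a content hash per version. This is a minimal illustration under the assumption that records are JSON-serializable; the `fingerprint` and `commit` helpers and the example records are hypothetical, and dedicated data versioning tools would normally be used in practice.

```python
import hashlib
import json


def fingerprint(records):
    """Deterministic SHA-256 digest over a list of JSON-serializable records."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(json.dumps(record, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def commit(history, records, message):
    """Append a version entry describing the current state of the data set."""
    entry = {
        "version": len(history) + 1,
        "sha256": fingerprint(records),
        "n_records": len(records),
        "message": message,
    }
    history.append(entry)
    return entry


history = []
data = [
    {"id": 1, "mass_kg": 71.5, "label": "ok"},
    {"id": 2, "mass_kg": 80.1, "label": "defect"},
]
v1 = commit(history, data, "initial collection")

data.append({"id": 3, "mass_kg": 65.0, "label": "ok"})
v2 = commit(history, data, "added sample 3")

for entry in history:
    print(entry["version"], entry["sha256"][:12], entry["message"])
```

Any later experiment can then state the digest of the data it was trained on, so that an unfavorable modification can be traced to the version that introduced it.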

Data Quality Verification
ML models depend heavily on the training data and, as a consequence, poor data often leads to poor models. These tasks examine whether the business and ML objectives could be achieved with the given quality of the available data. The lack of a certain quality on the data will trigger the previous data collection task. The data quality verification includes three tasks: describe the data, define requirements on the data and verify the data.
Data description: A description and an exploration of the data is performed to gain insight about the underlying data generation process. The data should be described on a meta-level, e.g. a pedestrian should have two legs and two arms, and by their statistical properties, e.g. the distribution of the features and labels. Furthermore, a technically well-founded visualization (McQueen et al., 2016) of the data should help to understand the data generating process. Information about format, units and description of the input signals is expanded by domain knowledge. The data description forms the basis for the data quality verification.
Data requirements: The requirements on the data could be defined either on the meta-level or directly in the data and encode the expected conditions of the data, i.e. whether a certain sample is plausible. The requirements can be, for example, the expected feature values (a range for continuous features or a list for discrete features), the format of the data and the maximum number of missing values. The bounds of the requirements have to be defined carefully by the development team to include all possible real world data but discard implausible data. Data points that do not satisfy the expected conditions could be treated as anomalies and have to be evaluated manually or excluded automatically. Breck et al. (2017) advise reviewing the requirements with a domain expert to avoid anchoring bias in the definition phase. Polyzotis et al. (2017) and Schelter et al. (2019) propose to document the requirements on the data in the form of a schema.
Data verification: The initial data, the added data and also the production data (see section 3.6) have to be checked according to the requirements. In cases where the requirements are not met, the data will be discarded and stored for further manual analysis. This helps to reduce the risk of decreasing the performance of the ML application through adding low-quality data and helps to detect varying data distributions or unstable inputs, e.g. the units of one of the features changed from kilograms to grams during an update. Finally, check the coverage of the data by plotting histograms and computing the statistics of the data to assure a sufficient representation of extreme cases.
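A schema of this kind can be sketched as a plain mapping from feature names to expected conditions. The feature names, bounds and the missing-value limit below are hypothetical assumptions for illustration and do not reproduce the schema formats of the cited works.

```python
import math

# Hypothetical schema: feature name -> requirement.
SCHEMA = {
    "mass_kg": {"type": float, "min": 0.5, "max": 300.0},
    "color": {"type": str, "allowed": {"red", "green", "blue"}},
}
MAX_MISSING = 1  # maximum number of missing values tolerated per sample


def violations(sample, schema=SCHEMA, max_missing=MAX_MISSING):
    """Return a list of human-readable requirement violations for one sample."""
    problems = []
    missing = 0
    for name, rule in schema.items():
        value = sample.get(name)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            missing += 1
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and not (rule["min"] <= value <= rule["max"]):
            problems.append(f"{name}: {value} outside [{rule['min']}, {rule['max']}]")
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{name}: {value!r} not in allowed values")
    if missing > max_missing:
        problems.append(f"{missing} missing values exceed limit {max_missing}")
    return problems


print(violations({"mass_kg": 72.0, "color": "red"}))      # plausible sample
print(violations({"mass_kg": 72000.0, "color": "red"}))   # out-of-range mass
```

Samples with a non-empty violation list would be treated as anomalies and either reviewed manually or excluded, as described above.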
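The unit change from kilograms to grams mentioned above can be caught by comparing the statistics of an incoming batch against a reference sample. The following is a minimal sketch; the threshold of three reference standard deviations and the example values are assumptions, not recommendations from this paper.

```python
import statistics


def mean_shift(reference, batch):
    """Shift of the batch mean, in units of the reference standard deviation."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.fmean(batch) - ref_mean) / ref_std


def verify_batch(reference, batch, max_shift=3.0):
    """Accept the batch, or quarantine it for manual analysis."""
    return "accept" if mean_shift(reference, batch) <= max_shift else "quarantine"


reference_kg = [70.2, 68.9, 71.5, 69.8, 72.1, 70.6]
batch_kg = [69.5, 71.0, 70.3]
batch_g = [69500.0, 71000.0, 70300.0]  # same masses, silently reported in grams

print(verify_batch(reference_kg, batch_kg))
print(verify_batch(reference_kg, batch_g))
```

A check on the mean alone is deliberately simple; comparing histograms or further statistics would additionally reveal poor coverage of extreme cases.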

Review of Output Documents
The Business & Data Understanding phase delivers the scope for the development and success criteria of a ML application and a data quality verification report to define the feasibility. The output documents need to be reviewed to rank the risks and define the next tasks. If certain quality criteria are not met, re-iterations of previous phases are possible.

Data Preparation
Building on the experience from the preceding data understanding phase, data preparation serves the purpose of producing a data set for the subsequent modeling phase. However, data preparation is not a static phase and backtracking cycles from later phases are necessary if, for example, the modeling phase or the deployment phase reveals erroneous data.

Select Data
Select data is the task of selecting a relevant subset of representative data and features for the training, validation and test set. In addition, the test set should be selected by an independent process to ensure that it is unbiased, i.e. to prevent errors from propagating from the training set to the test set (see section 3.4), and to protect against optimization on the test set.
Feature selection: Selecting a good data representation based on the available measurements is one of the challenges to assure the quality of the ML application. It is best practice to discard underutilized features as they provide little to no modeling benefit but offer possible loopholes for errors, i.e. instability of the feature during the operation of the ML application (Sculley et al., 2015). In addition, the more features are selected, the more samples are necessary. Intuitively, an exponentially increasing number of samples is required for an increasing number of features to prevent the data from becoming sparse in the feature space. This is termed the curse of dimensionality. Thus, it is best practice to select only as many features as necessary. A checklist for the feature selection task is given in (Guyon and Elisseeff, 2003). Note that data often forms a manifold of lower dimensions in the feature space and models have to learn this manifold accordingly (Braun et al., 2008).
Feature selection methods can be separated into three categories: 1) filter methods select features from data without considering the model, 2) wrapper methods use a learning model to evaluate the significance of the features and 3) embedded methods combine the feature selection and the classifier construction steps. A detailed explanation and in-depth analysis of the feature selection problem are given in (Hira and Gillies, 2015; Saeys et al., 2007; Chandrashekar and Sahin, 2014; Guyon et al., 2006). We recommend performing a brief initial feature selection on easy-to-compute properties, like the number of missing values or the variance of a feature, and running a more comprehensive analysis as a final step in the data preparation. Ideally, feature selection should be performed within the cross-validation of the model hyper-parameters (Ambroise and McLachlan, 2002) to account for all possible combinations.
However, the selection of the features should not rely purely on the validation error and test error but should be analyzed by a domain expert, as potential biases might occur due to spurious correlations in the data. Lapuschkin et al. (2016, 2019) showed that classifiers could exploit spurious correlations, here the copyright tag on the horse class, to obtain a remarkable test performance and, thus, give a false sense of generalization. In that case, the copyright tag could be detected manually by reviewing the pictures, but spurious correlations could be imperceptible to humans, e.g. copyright watermarks in videos or images. In such cases, explanation methods could be used to highlight the significance of features (see section 3.4), which can then be analyzed from a human's perspective.
Data selection: After collecting the initial data, certain samples might not satisfy the necessary quality, i.e. they do not satisfy the requirements defined in section 3.1.5 and are not plausible, and, thus, should be removed from the data set. Another way to select the data is to compute Shapley values (Ghorbani and Zou, 2019) to determine whether a data point contributes positively or negatively to the predictive performance. However, discarding samples should be well documented and strictly based on objective quality criteria. ML models rest upon the assumption of an adequate number of samples and, therefore, the predictive performance of the model generally increases by adding more samples (Vapnik, 1995; Simard et al., 2017).
Unbalanced Classes: In cases of unbalanced classes, where the number of samples per class is skewed, different sampling strategies can improve the results. Over-sampling of the minority class and/or under-sampling of the majority class (Lawrence et al., 1998; Chawla et al., 2002; Batista et al., 2004; Lemaître et al., 2017) have been used. Over-sampling increases the importance of the minority class but could result in overfitting on the minority class. Under-sampling by removing data points from the majority class has to be done carefully to keep the characteristics of the data and reduce the chance of introducing biases. However, removing points close to the decision boundary or multiple data points from the same cluster should be avoided. Comparing the results of different sampling techniques reduces the risk of introducing bias to the model.
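The brief initial selection on easy-to-compute properties corresponds to a simple filter method. The sketch below drops features with too many missing values or negligible variance; the thresholds and feature names are hypothetical assumptions, and the more comprehensive wrapper or embedded analysis is not shown.

```python
import math
import statistics


def initial_filter(columns, max_missing_ratio=0.2, min_variance=1e-8):
    """Keep features with few missing values and non-negligible variance.

    columns maps a feature name to a non-empty list of numbers,
    using None or NaN to mark missing entries.
    """
    kept = []
    for name, values in columns.items():
        present = [v for v in values if v is not None and not math.isnan(v)]
        missing_ratio = 1.0 - len(present) / len(values)
        if missing_ratio > max_missing_ratio:
            continue  # too many missing values
        if len(present) < 2 or statistics.pvariance(present) < min_variance:
            continue  # (near-)constant feature
        kept.append(name)
    return kept


columns = {
    "temperature": [20.1, 21.3, 19.8, 22.0, 20.7],
    "sensor_id": [7.0, 7.0, 7.0, 7.0, 7.0],      # constant
    "humidity": [0.4, None, None, 0.5, None],    # 60% missing
}
selected = initial_filter(columns)
print(selected)
```

When such a filter is applied inside cross-validation, it should be computed on the training folds only, so that the held-out fold does not influence which features are kept.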
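The two basic sampling strategies can be sketched as follows. This shows only the simplest random variants, not the methods of the cited works; in particular, random under-sampling takes no account of the decision boundary or of clusters, so the caution above still applies. The example data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def group_by_label(samples, labels):
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    return groups


def random_oversample(samples, labels, seed=0):
    """Duplicate minority samples at random until all classes match the largest."""
    rng = random.Random(seed)
    groups = group_by_label(samples, labels)
    target = max(len(g) for g in groups.values())
    out_x, out_y = [], []
    for y, g in groups.items():
        extra = [rng.choice(g) for _ in range(target - len(g))]
        out_x.extend(g + extra)
        out_y.extend([y] * target)
    return out_x, out_y


def random_undersample(samples, labels, seed=0):
    """Drop majority samples at random until all classes match the smallest."""
    rng = random.Random(seed)
    groups = group_by_label(samples, labels)
    target = min(len(g) for g in groups.values())
    out_x, out_y = [], []
    for y, g in groups.items():
        out_x.extend(rng.sample(g, target))
        out_y.extend([y] * target)
    return out_x, out_y


X = list(range(10))
y = ["ok"] * 8 + ["defect"] * 2
x_over, y_over = random_oversample(X, y)
x_under, y_under = random_undersample(X, y)
print(Counter(y_over), Counter(y_under))
```

Resampling should be applied to the training set only, so that validation and test sets keep the class distribution expected in operation.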

Clean Data
Cleaning data addresses the noise in the data and the imputation of missing values. If a subset of features or samples cannot be sufficiently cleaned, it might be better to discard these data, i.e. to return to the data selection task described before.
Noise reduction: The gathered data often includes, besides the predictive signal, noise and unwanted signals from other sources. Signal processing filters could be used to remove the irrelevant signals from the data and improve the signal-to-noise ratio. We refer to the introductory books for signal processing methods (Walker, 2002; Lyons, 2004). For example, a band-pass filter is often applied in human speech recognition to cut out lower and higher frequencies outside of the human voice spectrum. However, filtering the data should be documented and evaluated because an erroneous filter could remove important parts of the signal in the data.
Data imputation: To get a complete data set, missing, NaN and special values could be imputed with a model-readable value. Depending on the data and ML task, the values are imputed by mean or median values, interpolated, replaced by a special value symbol (Che et al., 2018) (as the pattern of the values could be informative), substituted by model predictions (Biessmann et al., 2018), matrix factorization (Koren et al., 2009) or multiple imputations (Murray et al., 2018; White et al., 2011; Azur et al., 2011), or imputed based on a convex optimization problem (Bertsimas et al., 2018). To reduce the risk of introducing substitution artifacts, the performance of the model should be compared between different imputation techniques.
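A band-pass filter can be illustrated by zeroing all frequency bins outside the band of interest. The sketch below uses a naive discrete Fourier transform from the standard library so that it stays self-contained; the signal, sample rate and band limits are hypothetical, and an ideal "brick-wall" filter like this is a simplification compared to the designed filters described in the cited textbooks.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def band_pass(x, sample_rate, low_hz, high_hz):
    """Zero every frequency bin outside [low_hz, high_hz] and transform back."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        # Bins above n/2 mirror the negative frequencies.
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    return idft(spectrum)


fs, n = 200, 200  # one second of data sampled at 200 Hz
t = [i / fs for i in range(n)]
wanted = [math.sin(2 * math.pi * 5 * ti) for ti in t]            # 5 Hz signal
noisy = [w + 0.5 * math.sin(2 * math.pi * 50 * ti)               # 50 Hz interference
         for w, ti in zip(wanted, t)]
filtered = band_pass(noisy, fs, low_hz=3, high_hz=10)

max_error = max(abs(a - b) for a, b in zip(filtered, wanted))
print(f"max deviation from the wanted signal: {max_error:.2e}")
```

Comparing the filtered output against a known reference, as done here, is one way to evaluate and document that the filter does not remove relevant parts of the signal.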
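Two of the simplest strategies, mean and median imputation, can be compared with a few lines of code. The sketch below also returns a mask of the imputed positions, since the pattern of missing values could itself be informative; the helper name and example values are hypothetical.

```python
import statistics


def impute(values, strategy="mean"):
    """Replace None entries; also return a mask marking the imputed positions."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing


raw = [1.0, None, 3.0, 100.0, None]
mean_filled, mask = impute(raw, "mean")
median_filled, _ = impute(raw, "median")
print(mean_filled)
print(median_filled)
```

With the outlier 100.0 in the data, the two strategies fill in very different values, which is why the model performance should be compared between imputation techniques rather than assumed.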

Construct Data
Constructing data includes the tasks of deriving new features (feature engineering) and constructing new samples (data augmentation).
Feature engineering: New features could be derived from existing ones based on the domain knowledge of the data. This could be, for example, the transformation of the features from the time domain into the frequency domain, the discretization of continuous features into bins or augmenting the features with additional features based on the existing ones, e.g. squaring, taking the square root, the log, the inverse, etc. In addition, there are several generic feature construction methods, such as clustering (Coates and Ng, 2012) and dimensionality reduction methods such as Kernel-PCA (Schölkopf et al., 1997) or auto-encoders (Rumelhart et al., 1985). This could aid the learning process and improve the predictive performance of the model. Consider using models that construct the feature representation as part of the learning process, e.g. neural networks, to avoid the feature engineering steps altogether unless prior knowledge is available. Nominal features and labels should be transformed into a one-hot encoding while ordinal features and labels are transformed into numerical values. However, the engineered features should be compared against a baseline to assess the utility of the feature. Underutilized features should be removed if they do not improve the performance of the model.
Data augmentation: Data augmentation utilizes known invariances in the data to perform a label-preserving transformation to construct new data. The transformations could either be performed in the feature space (Chawla et al., 2002) or input space, such as applying rotation, elastic deformation or Gaussian noise to an image (Wong et al., 2016). Data could also be augmented on a meta-level, such as switching the scenery from a sunny day to a rainy day. This expands the data set with additional samples and allows the model to capture those invariances. It is recommended to perform data augmentation in the input space if invariant transformations are known (Wong et al., 2016).
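The difference between encoding nominal and ordinal features can be sketched as follows. The category names are hypothetical; the order of the ordinal levels has to be supplied from domain knowledge.

```python
def one_hot(values):
    """Encode nominal values as 0/1 vectors.

    Categories are sorted to obtain a stable column order.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


def ordinal(values, order):
    """Encode ordinal values as their rank in the given order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


categories, encoded = one_hot(["red", "blue", "red", "green"])
sizes = ordinal(["small", "large", "medium"], order=["small", "medium", "large"])
print(categories, encoded)
print(sizes)
```

The one-hot encoding avoids implying an order between colors, whereas the ordinal encoding preserves the order between sizes.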
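Input-space augmentation can be sketched with rotations and slight Gaussian noise on a tiny image stored as a nested list. The sketch assumes that the task is invariant to rotations by multiples of 90 degrees, which has to be confirmed for each application (it does not hold, for example, when distinguishing the digits 6 and 9); the noise level and label are hypothetical.

```python
import random


def rotate90(image):
    """Rotate a 2-D list of pixels by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def add_gaussian_noise(image, sigma, rng):
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in image]


def augment(image, label, sigma=0.01, seed=0):
    """Yield label-preserving variants: four rotations, each with slight noise."""
    rng = random.Random(seed)
    current = image
    for _ in range(4):
        yield add_gaussian_noise(current, sigma, rng), label
        current = rotate90(current)


image = [[0.0, 1.0],
         [2.0, 3.0]]
augmented = list(augment(image, label="part_ok"))
print(len(augmented), "augmented samples")
```

Each generated sample keeps the original label, so the data set grows without additional labeling effort.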

Standardize Data
The data and the format of the data should be standardized to get a consistent data set i.e. transforming into a common file format, normalization of the features and labels, the usage of common units and standards.
File format: Some ML tools require specific variable or input types (data syntax). In practice, the comma-separated values (CSV) file format is the most generic standard (RFC 4180) and has proven itself as a format for PoC studies or for obtaining an early MVP.
SI units and ISO standards: ISO 8000 recommends the use of SI units for the formatting of time, mass, distance etc. according to the International System of Quantities. Defining a fixed set of standards and units helps to avoid errors in the merging process and, furthermore, to detect erroneous data, i.e. data that does not satisfy the requirements made in section 3.1.5.
Normalization: It is best practice to normalize the features and labels (in regression and prediction tasks) to a mean of zero and a standard deviation of one (LeCun et al., 2012). Without proper normalization, the features could be defined on different scales, leading to a strong bias towards features on larger scales. In addition, normalized features lead to faster convergence rates in neural networks than unnormalized ones (LeCun et al., 2012; Ioffe and Szegedy, 2015). Note that the normalization applied to the training set also has to be applied to the test set using the same normalization parameters.
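Written out, the normalization of a feature value $x$ is $z = (x - \mu_{\text{train}}) / \sigma_{\text{train}}$, where both parameters are estimated on the training set. A minimal sketch with hypothetical example values:

```python
import statistics


def fit_standardizer(train_values):
    """Estimate mean and standard deviation on the training set only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, (std if std > 0 else 1.0)  # guard against constant features


def standardize(values, mean, std):
    return [(v - mean) / std for v in values]


train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_standardizer(train)
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)  # reuses the training parameters

print(train_z)
print(test_z)
```

Estimating separate parameters on the test set would shift the test data relative to what the model saw during training and would leak test-set statistics into the evaluation.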

Modeling
The choice of modeling techniques depends on the ML and the business objectives, the data and the boundary conditions of the project the ML application is contributing to. The requirements and constraints that have been defined in section 3.1 are used as inputs to guide the model selection to a subset of appropriate models. The goal of the modeling phase is to craft one or multiple models that satisfy the given constraints and requirements. An outline of the modeling phase is depicted in fig. 2.
Literature research on similar problems: Before starting the modeling activity, it is best practice to screen the literature, e.g. publications, patents and internal reports, for a comprehensive overview of similar ML problems. ML has become an established tool for a wide range of applications and related work might already have been done in other projects. The insights gained could be used as a starting point and the results of other models could serve as a baseline for the newly developed model.
Define quality measures of the model: The modeling strategy has to have multiple objectives in mind. Baylor et al. (2017) suggest evaluating the model by two properties: a model has to be safe to serve and has to have the desired prediction quality. We suggest evaluating the models on six complementary properties, see table 2. Besides a performance metric, soft measures such as robustness, explainability, scalability, hardware demand and model complexity have to be evaluated. The measures can be weighted differently depending on the application. In some cases, explainability or robustness could be valued more than accuracy. In a case study, Schmidt and Bießmann (2019) showed empirically that highlighting the three most important features of a ML model could help to improve the performance of a human in text classification problems.
Model Selection: In this task, ML models have to be selected for further development. There are plenty of ML models and it is out of the scope of this paper to compare and list their characteristics. However, there are introductory books on classical methods (Bishop, 2007; Schölkopf et al., 2002) and Deep Learning (Goodfellow et al., 2016). The model selection depends on the data and has to be tailored to the problem. There is no model that performs best on all problem classes. This has been formalized as the No Free Lunch Theorem for machine learning (Wolpert, 1996). It is best practice to start with models of lower capacity, say simple linear models, to get a good baseline and gradually increase the capacity. Validating each added capacity assures its benefit and avoids unnecessary complexity of the model.
Incorporate domain knowledge: Prior knowledge can be incorporated into the model to improve its quality. A model specialized for a specific task will generally be better than a general model for all possible tasks. Zien et al. (2000) showed that specialized kernels could improve the performance of the model in recognizing translation initiation sites from nucleotide sequences. Another example is the convolutional layer in neural networks, which is used because of the assumption that pixels in an image are locally correlated and that the features are translation invariant. The convolutional layer uses parameter sharing and reduces the solution space to a subset, which allows the model to learn more efficiently from data. A fully connected layer would be able to represent a convolutional layer but has to learn these properties from data. However, due to the highly non-linear optimization problem and overfitting issues, it will not normally do that. Adapting the model to a specific problem involves the danger of incorporating false assumptions and could reduce the solution space to a non-optimal subset. Therefore, it is best practice to validate the incorporated domain knowledge in isolation against a baseline. Adding domain knowledge should always increase the quality of the model. If it does not add anything to the quality of the model, remove it to avoid false bias.

Robustness
Resiliency of the model to inconsistent inputs (e.g., adversarial attacks, out-of-distribution samples, anomalies and distribution shifts) and to failures in the underlying execution environment (e.g., sensors, actuators and computational platform).

Scalability
The property of the model to scale to high data volume during training and re-training in the production system. Complexity analysis of the execution time and hardware demand as a function of the number of samples and the feature dimension.

Explainability
Models can be either directly explainable or explained by post-hoc methods. The decisions of explainable models can be inspected manually, which could increase user acceptance. In addition, uncertainty and confidence estimates provide guidance on indecisive decisions.

Model Complexity
Models with large capacities overfit easily on small data sets. Assure that the capacity of your model suits the complexity of your data and use proper regularization.

Resource Demand
The model has to be deployed on hardware and is restricted by its memory. In addition, the inference time has to be considered, depending on the application.
Model training: The trained model depends on the learning problem, and the two are therefore tightly coupled. The learning problem contains an objective, optimizer, regularization and cross-validation. An extensive and more formal description can be found in (Bishop, 2007;Goodfellow et al., 2016). The objective of the learning problem depends on the application. Different applications value different aspects, and the objective has to be tweaked in alignment with the business success criteria. The objective is a proxy to evaluate the performance of the model. The optimizer defines the learning strategy and how to adapt the parameters of the model to improve the objective. Regularization, which can be incorporated in the objective, the optimizer and the model itself, is needed to reduce the risk of overfitting and can help to find unique solutions. Cross-validation is performed to test the generalization property of the model on unseen data and to optimize the hyperparameters. The data set is split into a training, a validation and a test set. While the training set is used in the learning procedure, the validation set is used to test the generalization property of the model on unseen data and to tune the hyperparameters (Muller et al., 2001). The test set is used to estimate the generalization property of the model, see section 3.4. The hyperparameters of all models, including the baselines, should be optimized to validate the performance of the best possible model. Melis et al. (2017) showed that a baseline LSTM achieves similar performance to state-of-the-art models when all hyperparameters are optimized properly. Frameworks such as Auto-ML (Hutter et al., 2019;Feurer et al., 2015) or Neural Architecture Search (Zoph and Le, 2016) make it possible to partly automate the hyperparameter optimization and the architecture search but should be used with care.
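The split into training, validation and test sets can be sketched as follows. The function `split_dataset`, its default fractions and the seed are illustrative assumptions; the sketch shuffles uniformly at random and does not handle stratification or time-ordered data.

```python
import random


def split_dataset(samples, valid_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle with a fixed seed and split into disjoint train/valid/test lists."""
    items = list(samples)
    rng = random.Random(seed)  # local generator, so global state is untouched
    rng.shuffle(items)
    n_test = round(len(items) * test_frac)
    n_valid = round(len(items) * valid_frac)
    test = items[:n_test]
    valid = items[n_test:n_test + n_valid]
    train = items[n_test + n_valid:]
    return train, valid, test


train, valid, test = split_dataset(range(100), seed=42)
```

Fixing the seed makes the split itself reproducible, so that hyperparameter tuning on the validation set and the final estimate on the test set always refer to the same partition.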
Using unlabeled data and pre-trained models: In some cases, labeling data can be very expensive and limits the data set to a few labeled data points. However, if unlabeled data can be gathered much more cheaply, one should exploit it in the training process. The generalization ability of ML models could be improved using unsupervised pretraining (Erhan et al., 2010) and semi-supervised learning algorithms (Kingma et al., 2014;Chapelle et al., 2010). In addition, Transfer Learning could be used to cope with small data sets (Yosinski et al., 2014). The idea is to pre-train the network on a proxy data set that resembles the original data to extract common features. The proxy data can be obtained from simulations or closely related data sets. Gathering simulated data is much cheaper and enables the construction of rare data points.
For example, in industrial applications CAD models for all parts of a technical product are usually available and might be used for pre-training networks for object recognition and localization (Andulkar et al., 2018).
Model Compression: Compression or pruning methods could be used to obtain a more compact model. In kernel methods, low-rank approximations of the kernel matrix are an essential tool to tackle large-scale learning problems (Williams and Seeger, 2001;Drineas and Mahoney, 2005). Neural networks use a different approach by either pruning the network weights (Frankle and Carbin, 2018) or applying a compression scheme on the network weights (Wiedemann et al., 2019). Frankle and Carbin (2018) were able to prune up to 90% of the neural network weights, while Wiedemann et al. (2019) were able to compress the VGG16 ImageNet model by a factor of 63.6 with no loss in accuracy. A survey on neural network compression can be found in Cheng et al. (2017).
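To make the idea of weight pruning concrete, the following is a minimal sketch of one-shot magnitude pruning on a flat list of weights. It is a simplified illustration and not the procedure of Frankle and Carbin (2018) or the compression scheme of Wiedemann et al. (2019); the function name `magnitude_prune` and the example weights are assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of weights with the smallest-magnitude fraction set to zero."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Illustrative weights; in practice these would come from a trained model.
weights = [0.8, -0.05, 0.3, 0.01, -1.2, 0.02, 0.6, -0.1, 0.04, 0.9]
pruned = magnitude_prune(weights, sparsity=0.5)
```

After any such compression step, the quality measures of the model have to be re-validated, since the sketch itself gives no guarantee on the resulting accuracy.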
Ensemble methods: Ensemble methods train multiple models and base the final decision on the aggregated decisions of the individual models. The models could be of different types or multiple instantiations of one type. This results in a more fault-tolerant system, as the error of one model could be absorbed by the other models. Boosting, Bagging or Mixture of Experts are mature techniques to aggregate the decisions of multiple models (Rokach, 2010;Zhou et al., 2002;Opitz and Maclin, 1999). In addition, ensemble models are used to compute uncertainty estimates and can highlight areas of low confidence (Lakshminarayanan et al., 2017;Gal and Ghahramani, 2016).
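A minimal sketch of aggregating individual decisions by majority vote is given below. The three threshold classifiers are hypothetical stand-ins for trained models, and the agreement ratio is only a crude proxy for confidence, not one of the uncertainty estimators from the cited works.

```python
from collections import Counter


def ensemble_predict(models, x):
    """Majority vote over the models; also return the share of agreeing models."""
    votes = Counter(model(x) for model in models)
    label, count = votes.most_common(1)[0]
    return label, count / len(models)


# Hypothetical ensemble members: simple threshold rules on a scalar input.
models = [
    lambda x: "pos" if x > 0.0 else "neg",
    lambda x: "pos" if x > 0.5 else "neg",
    lambda x: "pos" if x > -0.5 else "neg",
]
label, agreement = ensemble_predict(models, 0.2)
```

For the input 0.2, two of the three members vote "pos", so the error of the dissenting member is absorbed, while the lower agreement flags the input as a region of reduced confidence.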

Assure reproducibility
A quality assurance method that is common to software engineering and science is to validate any result by peer-review. For instance, experiments can be validated by re-implementing the algorithms or running the given source code to reproduce the results. Ultimately, reproducibility is necessary to locate and debug errors.
However, ML algorithms are difficult to reproduce due to the mostly non-convex and stochastic training procedures and randomized data splits. The results could differ depending on the random seeds. This has been addressed at the Neural Information Processing Systems (NeurIPS) 2019 conference with the creation of a Reproducibility Chair and a reproducibility checklist (Pineau, 2019). This task aims at assuring the reproducibility of ML algorithms at two different levels: reproducibility of the method and of the results.
Method reproducibility: This task aims at reproducing the model from the given description of the code and algorithm. The algorithm should be described in detail, i.e., with pseudocode or on code level, and on the meta-level including the assumptions. The description should contain the version of the data sets used to train, validate and test the model (see section 3.1.5), a description of the modeling techniques, the chosen hyperparameters, the software and its version being used to apply these techniques, the hardware it has been executed on and the random seeds (Pineau, 2019). Additionally, Tatman et al. (2018) proposed to provide an environment to run the code to avoid the "it runs on my computer" problem. The environment could be provided by either using a hosting service, providing containers or providing a virtual machine.
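The items listed for method reproducibility can be captured in a machine-readable record, as in the following minimal sketch. The field names, the data set version string and the hyperparameters are illustrative assumptions; a real project would also record library versions and accelerator details.

```python
import json
import platform
import random
import sys


def experiment_record(dataset_version, hyperparameters, seed):
    """Collect the properties needed to re-run an experiment."""
    return {
        "dataset_version": dataset_version,
        "hyperparameters": hyperparameters,
        "seed": seed,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }


def run_with_seed(seed):
    """Hypothetical stand-in for a stochastic training run."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(3)]


record = experiment_record(
    "v1.2.0", {"learning_rate": 0.01, "epochs": 20}, seed=7
)
serialized = json.dumps(record, sort_keys=True, indent=2)
```

Storing such a record next to every result allows a run to be repeated with the same seed and configuration.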
Result reproducibility: It is a common but dubious practice to train multiple models with different random seeds and report only the top performance (Bouthillier et al., 2019;Henderson et al., 2018). This is deeply flawed, as the variance of the performance is completely ignored and the result could have been obtained by chance. Large variances across random seeds indicate the sensitivity of the algorithm, and it is questionable whether the model could retain its performance after multiple updates. It is, therefore, best practice to validate the mean performance and assess the variance of the model on different random seeds (Henderson et al., 2018;Sculley et al., 2018).
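Reporting the mean and spread over seeds instead of the best run can be sketched as follows. The function `noisy_training_run` is a hypothetical stand-in that simulates a seed-dependent score; in practice it would train and evaluate the actual model.

```python
import random
from statistics import mean, stdev


def noisy_training_run(seed):
    """Hypothetical stand-in: returns a simulated, seed-dependent score."""
    rng = random.Random(seed)
    return 0.90 + rng.gauss(0.0, 0.01)


def summarize_over_seeds(run, seeds):
    """Run the experiment once per seed and report mean, spread and best score."""
    scores = [run(seed) for seed in seeds]
    return {
        "mean": mean(scores),
        "std": stdev(scores),  # sample standard deviation, needs >= 2 seeds
        "best": max(scores),
        "n": len(scores),
    }


summary = summarize_over_seeds(noisy_training_run, seeds=range(10))
```

The gap between `summary["best"]` and `summary["mean"]` shows how optimistic a report of the top run alone would be.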
Experimental Documentation: As the modeling phase could cover many models and modifications of the data set, it is hard to keep track of all the changes, especially of which were beneficial and which were unfavorable. Keeping track of the experimental results and of the modifications that caused them allows some form of model comprehension, i.e., which modifications were beneficial and which ones were harmful. This can be used either to debug code or to improve the model quality. The documentation should contain the properties listed in the method reproducibility task. Plan a documentation strategy and list the properties that should be documented. For example, Vartak et al. (2016) showed a tool-based approach to version control and meta-data handling while experimenting on ML models and hyperparameters.

Evaluation
The evaluation phase consists of three tasks: evaluation of performance, robustness and explainability. When evaluating an ML solution to a business problem, it is important to assure the correctness of the results but also to study its behavior on false inputs. A major risk is caused by the fact that complete test coverage of all possible inputs is not tractable because of the large input dimensions. However, extensive testing reduces the risk of failures. When testing, one always has to keep in mind that the stochastic nature of the data, which results in label noise, bounds the test accuracy from above. That means that 100% test accuracy can rarely be achieved.
Validate performance: A risk arises during the validation of the performance when feedback signals from the test set are used to optimize the model. To avoid this, it is good practice to hold back an additional test set, which is disjoint from the training (and validation) set, stored only for a final evaluation and never shipped to any partner, so that the performance metrics can be measured in a blind test. To not bias the performance of a model, the test set should be assembled and curated with caution, ideally by a team of experts who are capable of analyzing its correctness and its ability to represent real cases. In general, the test set should cover the whole input distribution and consider all the invariances in the data. Invariances are transformations of the input that should not change the label of the data. Zhou and Sun (2019), Tian et al. (2018) and Pei et al. (2017) have shown that highly sophisticated models for autonomous driving could not capture those invariances and found extreme cases which led to false predictions by transforming a picture taken on a sunny day into a rainy-day picture or by darkening the picture. It is recommended to separate the teams and the procedures collecting the training and the test data to remove dependencies and avoid false methodology propagating from the training set to the test set. On that test set, the previously defined performance metrics should then be evaluated. Additionally, it is recommended to perform a sliced performance analysis to highlight weak performance on certain classes or time slices. A full test set evaluation may mask flaws on certain slices.
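A sliced performance analysis can be sketched in a few lines. The slice keys "day" and "night" and the toy records are illustrative assumptions; any attribute of the test samples (class, time window, sensor type) could serve as a slice key.

```python
from collections import defaultdict


def sliced_accuracy(records):
    """Compute overall and per-slice accuracy.

    records: iterable of (slice_key, true_label, predicted_label).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for key, y_true, y_pred in records:
        totals[key] += 1
        hits[key] += int(y_true == y_pred)
    per_slice = {key: hits[key] / totals[key] for key in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_slice


# Toy test results (illustrative only).
records = [
    ("day", 1, 1), ("day", 0, 0), ("day", 1, 1),
    ("day", 0, 0), ("day", 1, 1), ("day", 0, 0),
    ("night", 1, 0), ("night", 0, 0),
]
overall, per_slice = sliced_accuracy(records)
```

In this toy example the overall accuracy of 0.875 masks the fact that the "night" slice reaches only 0.5.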
Determine robustness: The robustness of the model, in terms of its ability to generalize to a perturbation of the data set, can be determined with K-fold cross-validation. Here, the algorithm is repeatedly validated by holding disjoint subsets of the data out of the training data as validation data. The mean performance and variance of the cross-validation can be analyzed to check the generalization ability of the model on different data sets. It might be beneficial to accept a model with lower training performance that generalizes well to unseen data rather than a model that exhibits the inverse behavior. Moreover, robustness should be checked when adding different kinds of noise to the data or varying the hyperparameters which characterize the model indirectly (e.g., the number of neurons in a deep neural network). In addition, it is recommended to assure robustness of a model when given wrong inputs, e.g., missing values, NaNs or out-of-distribution data, as well as signals which might occur in case of malfunctions of input devices such as sensors. A different challenge is given by adversarial examples (Goodfellow et al., 2014), which perturb the image by an imperceptible amount and fool classifiers into making wrong predictions. A survey of current testing methods can be found in (Zhang et al., 2019). The model's robustness should match the quality claims made in table 2.
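K-fold cross-validation with disjoint validation folds can be sketched as follows. The interleaved fold assignment without shuffling, the mean predictor `fit_mean` and the score `negative_mae` are simplifying assumptions chosen to keep the example self-contained.

```python
from statistics import mean, variance


def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs with disjoint validation folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, valid_idx


def cross_validate(fit, score, xs, ys, k=5):
    """Return mean and sample variance of the validation score over k folds."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(
            score(model, [xs[i] for i in valid_idx], [ys[i] for i in valid_idx])
        )
    return mean(scores), variance(scores)


def fit_mean(train_x, train_y):
    """Hypothetical model: predicts the mean of the training targets."""
    return mean(train_y)


def negative_mae(model, valid_x, valid_y):
    """Negative mean absolute error, so that higher is better."""
    return -mean(abs(model - y) for y in valid_y)


xs = list(range(10))
ys = [2.0 * x for x in xs]
cv_mean, cv_var = cross_validate(fit_mean, negative_mae, xs, ys, k=5)
```

A large `cv_var` relative to `cv_mean` would indicate that the model is sensitive to which subset of the data it was trained on.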
Increase explainability for machine learning practitioner and end user: Case studies have shown that explainability helps to increase trust and users' acceptance (Hois et al., 2019) and could guide humans in ML-assisted decisions. Moreover, explainability of a model helps to find bugs and allows for a deep discussion with the domain experts, leading to strategies on how to improve the overall performance, e.g., by enriching the data set. To achieve explainability, gain a deeper understanding of what a model has already learned and avoid spurious correlations (compare the Clever Hans phenomenon), it is best practice to carefully observe the features which impact the model's prediction the most and check whether they are plausible from a domain expert's point of view. For example, heat maps highlight the most significant pixels in image classification problems (Lapuschkin et al., 2016;Ribeiro et al., 2016;Lundberg and Lee, 2017;Lapuschkin et al., 2019) or the most significant words in NLP tasks (Arras et al., 2017). For root cause analysis of misclassifications caused by training data issues, the study of Chakarov et al. (2016) is recommended for further reading. The toolbox by Alber et al. (2019) provides a unified framework for a wide range of explanation methods.
Compare results with defined success criteria: Finally, domain and ML experts have to decide whether to enter the next phase of deploying the model. Therefore, it is best practice to document the results of the performance evaluation and compare them to the business and ML success criteria defined in section 3.1.2. If the success criteria were not met, one might backtrack to earlier activities (modeling or even data preparation) or stop the project. Limitations of robustness and explainability identified during evaluation might require an update of the risk assessment (e.g., Failure Mode and Effects Analysis (FMEA)) and might also lead to backtracking to modeling or stopping the project.

Deployment
After the model has successfully passed the evaluation stage, it is ready to be deployed. The deployment phase of an ML model is characterized by its practical use in the designated field of application.
Define inference hardware: Choose the prediction hardware based on the hardware, connectivity and business constraints. Models deployed on embedded systems are restricted in size and inference time. In contrast, while cloud services offer a tremendous amount of computation power, a steady, lag-free and reliable connection needs to be guaranteed. Devices at the edge of the cloud, in turn, have only limited access to large data centers, and while they can contact such data centers, the computations have to be done locally. Such devices can download the most up-to-date ML models at regular intervals and can be maintained by the ML deployment team. Offline devices face more constraints, as they have to be updated manually or not at all because a consistent connection to a data center cannot be ensured.
Model evaluation under production condition: As the training and test data are gathered in advance to train and evaluate the model, the risk persists that the production data does not resemble the training data or that the training data does not cover corner cases. Previous assumptions on the training data might not hold in production, and the hardware that gathered the data might be different. Therefore, it is best practice to evaluate the performance of the model under conditions that incrementally approach production by iteratively running the tasks in section 3.4. At each incremental step, the model has to be calibrated to the deployed hardware and the test environment. This allows identifying wrong assumptions on the deployed environment and the causes of model degradation. Domain adaptation techniques can be applied (Wang and Deng, 2018;Sugiyama et al., 2007) to enhance the generalization ability of the model. Face detection algorithms, for example, are trained on still images, which allow the ML algorithm to detect key features under controlled conditions. The final test should run the face detection algorithm in real-time on the production hardware, for example, an embedded system, to ensure consistent performance.
Assure user acceptance and usability: Even after passing all evaluation steps, there might be the risk that the user acceptance and the usability of the model are underwhelming. The model might be incomprehensible or might not cover corner cases. It is best practice to build a prototype and run an exhaustive field test with end users. Examine the acceptance, usage rate and the user experience. A user guide and disclaimer shall be provided to the end users to explain the system's functionality and limits.
Minimize risk of unforeseen errors: Unforeseen errors and outage times could cause system shutdowns and a temporary suspension of services. This could lead to user complaints and declining user numbers and could reduce the revenue, e.g., for paid services. A fall-back plan that is activated in case of, e.g., erroneous model updates or detected bugs can help to tackle the problem. Options are to roll back to a previous version or to a pre-defined baseline, e.g., an established model or rule-based algorithms. Otherwise, it might be necessary to remove the service temporarily and reactivate it later on.
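A fall-back to a pre-defined baseline can be sketched as a thin wrapper around the deployed model. The function names and the two example models are hypothetical; the validity check (exception, `None` or NaN) is one possible criterion among many.

```python
def predict_with_fallback(primary, fallback, x):
    """Return (prediction, source), using the fallback if the primary fails.

    The primary prediction is rejected if the model raises an exception
    or returns None or NaN.
    """
    try:
        y = primary(x)
        if y is None or y != y:  # y != y is True only for NaN
            raise ValueError("invalid prediction")
        return y, "primary"
    except Exception:
        return fallback(x), "fallback"


def broken_model(x):
    """Hypothetical erroneous model update."""
    raise RuntimeError("model artifact could not be loaded")


def rule_based_baseline(x):
    """Hypothetical established rule-based algorithm."""
    return 1 if x > 0 else 0


result = predict_with_fallback(broken_model, rule_based_baseline, 3)
```

Recording the returned source alongside the prediction makes it possible to count how often the fall-back was activated and to decide when a roll-back is warranted.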
Deployment strategy: Before rolling out a model to all existing applications, it is best practice to deploy it first to a small subset and evaluate its behavior in a real-world environment (also called canary deployment). Even though the model is evaluated rigorously during each previous step, errors might still slip through the process. The impact of such erroneous deployments and the cost of fixing errors should be minimized. If the model successfully passes the canary deployment, it can be deployed to all users.
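One way to select the small subset for a canary deployment is deterministic hashing of a user identifier, sketched below. The identifier format, the 5% fraction and the function names are illustrative assumptions; production systems typically rely on the routing features of their serving infrastructure.

```python
import hashlib


def in_canary(user_id, fraction):
    """Deterministically assign a user to the canary group.

    The same user always lands in the same group, so the experience stays
    consistent across requests.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < fraction * 10_000


def route(user_id, fraction, new_model, old_model):
    """Return the model that should serve this user."""
    return new_model if in_canary(user_id, fraction) else old_model


# Share of 10,000 hypothetical users that would receive the new model.
share = sum(in_canary(f"user-{i}", 0.05) for i in range(10_000)) / 10_000
```

Increasing `fraction` step by step rolls the model out gradually, and setting it to 0 withdraws the new model from all users.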

Monitoring and Maintenance
With the expansion of ML from knowledge discovery to data-driven applications that infer real-time decisions, ML models are used over a long period and have a life cycle which has to be managed. Maintaining the model assures its quality during its life cycle. The risk of not maintaining the model is the degradation of the performance over time, which leads to false predictions and could cause errors in subsequent systems. In addition, the model has to adapt to changes in the environment (Sugiyama et al., 2007). The main reason for a model to become impaired over time is rooted in the violation of the assumption that the test and training data come from the same distribution. The causes of the violations are:
• Non-stationary data distribution: Data distributions change over time and result in a stale training set, so that the characteristics of the data distribution are represented incorrectly by the training data. A shift in the features, in the labels, or in both is possible. This degrades the performance of the model over time. The frequency of the changes depends on the domain. Stock market data are very volatile, whereas the visual properties of elephants won't change much over the next years.
• Degradation of hardware: The hardware that the model is deployed on and the sensor hardware will age over time. Wear parts in a system will age, and the friction characteristics of the system might change. Sensors get noisier or fail over time, e.g., dead pixels in cameras. This will shift the domain of the system, and the model has to adapt to it.
• System updates: Updates on the software or hardware of the system can cause a shift in the environment. For example, the units of a signal might be changed from kilograms to grams during an update. Without notification, the model would use this rescaled input and infer false predictions.
Once the underlying problem is known, we can formulate the necessary methods to circumvent stale models and assure the quality. We propose two sequential tasks in the maintenance phase to assure or improve the quality of the model. In the monitor task, the staleness of the model is evaluated to decide whether the model has to be updated or not. Afterward, the model is updated and evaluated to gauge whether the update was successful.
Monitor: Baylor et al. (2017) propose to register all input signals and notify the model when an update has occurred. Updates on the input signals could then be handled automatically or manually. In addition, the schema defined in section 3.1.5 can be used to validate the correctness of the incoming data. Inputs that don't satisfy the schema can be treated as anomalies and denied by the model (Baylor et al., 2017). Furthermore, the statistics of the incoming data, such as quantiles, histograms, mean and standard deviation, top-K values of the most frequent features and the predicted labels, can be compared to the training data. If the labels of the incoming data are known, e.g., in forecasting tasks, the performance of the model can be compared to previous data streams. The results of these comparisons could be written in a report and reviewed automatically or manually. Based on this review, it can be decided whether the model should be updated, e.g., if the number of anomalies reaches a certain threshold or the performance has reached a lower bound. Thresholds are set to notify the system that the model has to be updated. They have to be tuned to minimize the update frequency, because of the additional overhead, while also minimizing erroneous predictions due to stale models. Libraries such as Deequ (Schelter et al., 2019) could help to implement an automatic data validation system.
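The monitoring logic described above (schema validation, comparison of statistics and thresholds) can be sketched as follows. This is a minimal illustration that does not use Deequ; the schema format, the feature name `weight_kg`, the threshold values and all function names are assumptions.

```python
from statistics import mean, stdev


def validate_schema(row, schema):
    """Check that every feature has the expected type and lies in its range.

    schema maps a feature name to (type, lower_bound, upper_bound).
    """
    for name, (kind, low, high) in schema.items():
        value = row.get(name)
        if not isinstance(value, kind) or not low <= value <= high:
            return False
    return True


def mean_shift(train_values, live_values):
    """Shift of the live mean in units of the training standard deviation."""
    return abs(mean(live_values) - mean(train_values)) / stdev(train_values)


def needs_update(rows, schema, feature, train_values,
                 max_anomaly_rate=0.05, max_shift=3.0):
    """Flag the model as stale if too many rows violate the schema
    or the feature mean has drifted too far from the training data."""
    valid = [row for row in rows if validate_schema(row, schema)]
    anomaly_rate = 1 - len(valid) / len(rows)
    if valid:
        shift = mean_shift(train_values, [row[feature] for row in valid])
    else:
        shift = float("inf")
    return anomaly_rate > max_anomaly_rate or shift > max_shift


schema = {"weight_kg": (float, 0.0, 500.0)}
train_weights = [70.0, 82.5, 64.0, 91.0, 77.5]
healthy = [{"weight_kg": w} for w in (71.0, 80.0, 66.0, 88.0)]
# Same signal after a hypothetical update that silently switched to grams.
in_grams = [{"weight_kg": w * 1000} for w in (71.0, 80.0, 66.0, 88.0)]

stale_after_unit_change = needs_update(in_grams, schema, "weight_kg", train_weights)
stale_when_healthy = needs_update(healthy, schema, "weight_kg", train_weights)
```

In this sketch the unit change is caught by the schema range, while slower drifts within the valid range would be caught by the mean-shift threshold; both thresholds have to be tuned per application.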
Update: In the updating step, new data is collected to re-train the model under the changed data distribution. Consider that new data has to be labeled, which could be very expensive. Instead of training a completely new model from scratch, it is advised to fine-tune the existing model on the new data. It might be necessary to perform some of the modeling steps in section 3.3 to cope with the changing data distribution, e.g., by adding additional layers and more weights. Every updated model has to undergo a new evaluation before it is pushed to the system. The evaluation tasks in section 3.4 are also applied here. The performance of the updated model should be compared against the previous versions, which could give insights into how quickly a model degrades over time. In addition, create a deployment strategy for the updated model (see section 3.5). It is best practice to deploy the updated model to a small fraction of the users alongside its previous model to minimize the damage of possible errors. The number of updated models is then increased gradually. Plan ahead how and when to update the model to minimize the downtime of the whole system.
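The comparison of an updated model against its previous version can be expressed as a simple promotion gate, sketched below. Accuracy as the metric, the `min_gain` parameter and the two threshold models are illustrative assumptions; in practice the full set of evaluation tasks would be applied.

```python
def accuracy(model, xs, ys):
    """Fraction of correct predictions on a labeled evaluation set."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)


def should_promote(candidate, current, xs, ys, min_gain=0.0):
    """Promote the candidate only if it is at least min_gain better
    than the currently deployed model on the same evaluation set."""
    cand = accuracy(candidate, xs, ys)
    curr = accuracy(current, xs, ys)
    return cand - curr >= min_gain, {"candidate": cand, "current": curr}


# Hypothetical models: the deployed threshold has become stale.
def current_model(x):
    return int(x > 0.3)


def updated_model(x):
    return int(x > 0.5)


# Recently labeled evaluation data (illustrative only).
xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 0, 1, 1]
promote, report = should_promote(updated_model, current_model, xs, ys)
```

Logging the report for every update also yields the history needed to estimate how quickly the model degrades between updates.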

Conclusion and Outlook
We have drafted CRISP-ML(Q), a process model for machine learning applications with quality assurance methodology, that helps organizations to increase efficiency and success rate in their machine learning projects. It guides machine learning practitioners through the entire machine learning development life-cycle, providing quality-oriented methods for every phase and task in the iterative process including maintenance and monitoring. The methods provided have proven to be best practices in automotive industry projects and academia and have the maturity to be implemented in current projects.
Our survey is indicative of the existence of specialist literature, but its contributions are not covered in machine learning textbooks and are not part of the academic curriculum. Hence, novices to industry practice often lack the profound state-of-the-art knowledge needed to ensure project success. Stressing quality assurance methodology is particularly important because many machine learning practitioners focus solely on improving the predictive performance. Note that the process and quality measures in this work are not designed for safety-relevant systems. Their study is left to future work.
We encourage industry from automotive and other domains to implement CRISP-ML(Q) in their machine learning applications and contribute their knowledge to establish a CRoss-Industry Standard Process for the development of machine learning applications with Quality assurance methodology. Defining the standard is left to future work.

Acknowledgements.
The authors would like to thank the German Federal Ministry of Education and Research (BMBF) for funding the project AIAx - Machine Learning-driven Engineering (Nr. 01IS18048). K.-R.M. acknowledges partial financial support by the BMBF under Grants 01IS14013A-E, 01IS18025A, 01IS18037A, 01GQ1115 and 01GQ0850; by the Deutsche Forschungsgemeinschaft (DFG) under Grant Math+, EXC 2046/1, Project ID 390685689; and by the Technology Promotion (IITP) grant funded by the Korea government (No. 2017-0-00451). Special thanks to the internal Daimler AI community for sharing their best practices on machine learning development and for inspiring us with their great ideas. We would like to thank Miriam Hägele, Lorenz Linhardt, Simon Letzgus, Danny Panknin and Andreas Ziehe for proofreading the manuscript and for the in-depth discussions.