Self-Service Data Science in Healthcare with Automated Machine Learning

Abstract: (1) Background: This work investigates whether and how researcher-physicians can be supported in their knowledge discovery process by employing Automated Machine Learning (AutoML). (2) Methods: We take a design science research approach and select the Tree-based Pipeline Optimization Tool (TPOT) as the AutoML method based on a benchmark test and requirements from researcher-physicians. We then integrate TPOT into two artefacts: a web application and a notebook. We evaluate these artefacts with researcher-physicians to examine which approach suits researcher-physicians best. Both artefacts have a similar workflow, but different user interfaces because of a conflict in requirements. (3) Results: Artefact A, a web application, was perceived as better for uploading a dataset and comparing results. Artefact B, a Jupyter notebook, was perceived as better regarding the workflow and being in control of model construction. (4) Conclusions: Thus, a hybrid artefact would be best for researcher-physicians. However, both artefacts missed model explainability and an explanation of variable importance for their created models. Hence, deployment of AutoML technologies in healthcare remains currently limited to the exploratory data analysis phase.


Introduction
After reviewing the literature on the state of data analysis in healthcare, it becomes evident that there is still significant progress to be made in the application of analytics in healthcare [1,2]. One way to progress is to automate the knowledge discovery process in healthcare [1]. Automation of the knowledge discovery process can increase the adoption of analytics by enabling domain experts to contribute to knowledge discovery in the field using state-of-the-art techniques in adaptive analytic systems, especially given the amount of data that is available in healthcare. The introduction of the Electronic Health Record (EHR) accelerated the digitalisation of this formerly unavailable data. Combining this available EHR data with analytics provides great potential to improve the healthcare industry by making the data available for automatic processing [3].
The concept of self-service data science was first described in [4] and has been defined as "the engineering discipline in which analytic systems are designed and evaluated to empower domain professionals to perform their own data analyses on their own data sources without coding in a reliable, usable and transparent manner within their own daily practices" [5]. Figure 1 visualises self-service data science research in the context of adjacent data science research disciplines. Bridging purely foundational and purely applied research processes, both applied data science [6] and self-service data science studies pursue a translational (i.e., application-oriented) research process, often including CRISP-DM as the knowledge discovery process of choice [7]. The Machine Learning (ML) community has noticed the need to enable access for non-expert users to ML techniques. The need to enable non-experts to use machine learning is one of the drivers that gave birth to the fast-paced research area of AutoML [8,9]. The AutoML community aims to automate all steps in the process of creating a machine learning pipeline. However, to the best of our knowledge, no AutoML applications have been tested in real-world situations with non-expert users in the healthcare domain. Hence our research question is: How can we support the knowledge discovery process of domain experts in healthcare using automated machine learning? To answer this question, we answer three sub-research questions in Section 4:
1. Which AutoML method performs best on a benchmark test, given medical datasets?
2. What are the requirements of healthcare professionals for starting to use AutoML in their daily practice?
3. How does the selected AutoML method suit healthcare professionals in their knowledge discovery process?
Each question is answered in a subsection of Section 4.

Overview of AutoML Methods
In this section, we summarize and categorize the AutoML methods that we found during the literature review. In our analysis we only consider the newest versions of the methods and only include non-commercial AutoML methods. First, we compare two AutoML methods that were developed for the healthcare domain. Second, we discuss methods with a fixed pipeline length. Third, we discuss the AutoML methods that build neural nets. Fourth, we discuss evolutionary methods. Fifth, we discuss distributed methods. Finally, we provide a detailed overview of the AutoML methods discussed in this section.

Healthcare
FLASH [10] and AutoPrognosis [11] have both been developed for healthcare, or with funding from healthcare, but with different incentives. FLASH was developed to improve the efficiency of creating and evaluating pipelines; it proposes a two-layer Bayesian network approach to improve search efficiency. AutoPrognosis has recently been developed with the practitioner in mind. It follows a principled Bayesian approach in all components and uses Bayesian model averaging for pipeline construction. Meta-learning is used to find similar groups of patients, and both clinical and statistical features of datasets are used for warm-starting. FLASH is a black-box tool, as most AutoML tools are. In contrast, AutoPrognosis is the only AutoML method that contains an explainer to justify its recommendations to a clinician.
We cannot compare the performances of both methods, as there is no benchmark test available featuring both methods at the time of writing. FLASH tested its performance on a medical dataset with the binary classification task of predicting drug non-responders. In this case, it outperformed other methods based on TPE and SMAC using error rate as the performance metric [10]. AutoPrognosis outperformed Auto-WEKA, Auto-Sklearn, and TPOT on multiple datasets in its own comparison with other methods [11].
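The Bayesian model averaging that AutoPrognosis uses for pipeline construction can be illustrated with a toy sketch. Everything below is invented for illustration and is not taken from the AutoPrognosis codebase: each candidate pipeline contributes to the final prediction in proportion to the exponential of its estimated log evidence, so the strongest candidate dominates without the weaker ones being discarded entirely.

```python
import math

def bma_predict(candidates, x):
    """Toy Bayesian model averaging: each candidate is a pair of
    (log_evidence, predict_fn); predictions are averaged with weights
    proportional to exp(log_evidence)."""
    m = max(log_ev for log_ev, _ in candidates)     # for numerical stability
    weights = [math.exp(log_ev - m) for log_ev, _ in candidates]
    total = sum(weights)
    return sum(w / total * predict(x)
               for w, (_, predict) in zip(weights, candidates))

# Three invented pipelines; the one with the highest evidence dominates.
pipelines = [
    (-10.0, lambda x: 0.2),
    (-2.0,  lambda x: 0.7),
    (-3.0,  lambda x: 0.6),
]
p = bma_predict(pipelines, x=None)   # close to the strong model's 0.7
```

In a real system the log evidence would come from an approximation to the marginal likelihood of each pipeline, and the predict functions would be fitted models.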

Figure 1. Self-service data science research in the context of related data science disciplines.


Fixed Pipelines
Auto-WEKA [8,12], Hyperopt-Sklearn [13], Auto-Sklearn [14], PoSH Auto-Sklearn [15], and ML-Plan [16] are all methods that have a fixed pipeline length. Auto-WEKA searches the search space with a tree-based BO method, SMAC [8,17]. Hyperopt-Sklearn searches the space using Hyperopt, which makes use of either random search or Tree Parzen Estimators (TPE) with BO [18]. To improve search efficiency, Hyperopt-Sklearn makes a distinction between conditional and non-conditional hyperparameters: non-conditional hyperparameters always need to be assigned, whereas conditional hyperparameters only apply to the algorithm chosen for the pipeline [13]. PoSH Auto-Sklearn is an extension of the Auto-Sklearn method and builds on the scikit-learn library. Successive halving was introduced into the search process to improve the efficiency of the method. In addition, the authors built a portfolio based on OpenML datasets [19], which contains meta-information to warm-start the search process. Finally, the ensemble technique of Auto-Sklearn was altered to exclude poorly performing models: in PoSH Auto-Sklearn, a model is not added to the ensemble if it performs over 3% worse than the best model. ML-Plan uses a hybrid approach to construct pipelines. It combines the ideas and concepts of two different approaches: evaluating candidates at runtime [8] and using hierarchical task networks (HTN) [20] for pipeline planning [16]. ML-Plan contributes to the existing knowledge base by proposing a two-phase search with HTN and a dedicated mechanism to prevent overfitting. The authors of ML-Plan claim to have invented the first AutoML technique that prevents overfitting [16].
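The successive halving that PoSH Auto-Sklearn introduced into its search can be sketched in a few lines. The candidates and evaluation function below are illustrative stand-ins, not PoSH Auto-Sklearn's implementation: all candidates receive a small budget, the weaker part of the pool is discarded, and the survivors' budget is multiplied.

```python
def successive_halving(candidates, evaluate, budget=1, eta=2):
    """Evaluate all candidates on the current budget, keep the top 1/eta,
    multiply the budget by eta, and repeat until one candidate remains.
    `evaluate(candidate, budget)` returns a score (higher is better)."""
    pool = list(candidates)
    while len(pool) > 1:
        pool.sort(key=lambda c: evaluate(c, budget), reverse=True)
        pool = pool[: max(1, len(pool) // eta)]   # discard the weaker part
        budget *= eta                             # survivors get more budget
    return pool[0]

# Toy candidates 0..7 whose score is simply their index; in a real system
# the score would be a model's validation performance, noisy at small
# budgets and sharper as the budget grows.
best = successive_halving(range(8), lambda c, budget: c)
```

With eta = 2, a pool of 8 shrinks to 4, then 2, then 1, so most of the total budget is spent on the most promising candidates.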
PoSH Auto-Sklearn outperforms all other methods in the sense that it won the latest AutoML competition at the time of writing [9]. Auto-WEKA, Hyperopt-Sklearn, and Auto-Sklearn were the first three methods developed to tackle the CASH problem. Noteworthy is that Auto-Sklearn has served as a basis for multiple other AutoML systems, whereas the other two methods have not. We assume that this is due to the warm-start procedure that is built into Auto-Sklearn.

Neural Networks
The first version of Auto-Net [21] was the first to automate the configuration of a neural network. It laid the groundwork for its successor and for the inception of AlphaD3M [22] and Auto-Keras [23]. Besides laying the groundwork for these applications, it incentivised the inception of many commercial applications: most commercial applications that automatically tune neural networks are inspired by the first version of Auto-Net [9]. This is because Auto-Net was the first AutoML program to beat human experts in configuring a pipeline [9,21].
AlphaD3M is the only AutoML method that makes use of reinforcement learning, and it is much faster than any other method in the field; in one case, it was 32 times faster than TPOT. However, AlphaD3M does not outperform other AutoML methods: in a comparison with three other methods, its average rank is third, based on mean scores. As AlphaD3M ranks first on some datasets, it is still competitive [22]. It remains to be seen whether reinforcement learning will be widely adopted as a search strategy for pipeline configuration.

Evolutionary Methods
Evolutionary methods can create pipelines of flexible length due to their search strategy; these methods are TPOT [24], LTPOT [25], RECIPE [26], and Autostacker [27]. The downside of evolutionary algorithms is that they can produce invalid pipelines and get stuck in local optima. RECIPE and LTPOT have independently overcome these downsides: RECIPE uses a grammar, whereas LTPOT uses a maximum evaluation time per pipeline.
TPOT was a reaction to the fixed-length pipeline methods discussed above. TPOT uses Genetic Programming (GP) to construct pipelines; the building blocks of a pipeline are GP primitives used to build a tree. These trees are a flexible representation of the ML pipeline, consisting of multiple datasets, pre-processors, and classification operators. Each node uses the output of its preceding node as input. Layered TPOT [25] is the successor of TPOT, with a focus on improving the efficiency of pipeline generation. It improves efficiency by implementing the idea of an Age Layered Population Structure [28] on top of the original TPOT algorithm. RECIPE [26] is an evolutionary method based on GP that makes use of a grammar to improve its efficiency compared to other GP methods.
RECIPE uses the grammar to prevent the creation of invalid pipelines and focuses on classification tasks. Autostacker is an AutoML method that focuses on providing a set of potentially useful pipelines for users, without any pre-processing steps, based on the scikit-learn and XGBoost libraries. Pipelines are built using a stacking mechanism based on an evolutionary algorithm and should generalize well to new data [27].
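The flexible-length search these methods share can be illustrated with a toy evolutionary loop in plain Python. Everything below is hypothetical: the operator names, the scoring function, and the mutation scheme are invented stand-ins, not TPOT's actual primitives. A pipeline is a list of pre-processor steps ending in a classifier, and mutation can grow, shrink, or rewire it.

```python
import random

# Hypothetical building blocks; a pipeline is a variable-length list of
# pre-processor names ending with exactly one classifier name.
PREPROCESSORS = ["scale", "select_k_best", "pca"]
CLASSIFIERS = ["logreg", "tree", "knn"]

def random_pipeline():
    steps = random.sample(PREPROCESSORS, k=random.randint(0, 2))
    return steps + [random.choice(CLASSIFIERS)]

def mutate(pipeline):
    """Grow, shrink, or rewire a pipeline; its length stays flexible."""
    child = list(pipeline)
    r = random.random()
    if r < 0.3 and len(child) > 1:
        child.pop(random.randrange(len(child) - 1))      # drop a pre-processor
    elif r < 0.6:
        child.insert(0, random.choice(PREPROCESSORS))    # add a pre-processor
    else:
        child[-1] = random.choice(CLASSIFIERS)           # swap the classifier
    return child

def toy_score(pipeline):
    # Stand-in for a cross-validated score; a real system would fit the
    # pipeline here (LTPOT would also enforce a per-pipeline time limit).
    return 1.0 if pipeline[-1] == "logreg" and "scale" in pipeline else 0.5

def evolve(generations=20, pop_size=10):
    pop = [random_pipeline() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=toy_score, reverse=True)
        survivors = pop[: pop_size // 2]                 # selection
        pop = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return max(pop, key=toy_score)

random.seed(1)
best = evolve()
```

Real systems replace toy_score with pipeline fitting under cross-validation; RECIPE would additionally reject children that violate its grammar.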

Distributed Methods
Two AutoML methods can process data in a distributed manner: Autostacker [27] and ATM [29]. It is remarkable that only two systems can run in a distributed manner, considering the computational cost of creating a pipeline. Autostacker can use parallel processing because it proposes the best pipelines to its user and hence only needs the performance scores of the pipelines. ATM is the only method that can run in parallel on different machines and is set up to be distributed and scalable. ATM was developed to support a multi-user machine learning platform in a cloud or cluster. The aim is to provide standardized abstractions in a library and to become as influential for the AutoML community as scikit-learn is for the machine learning community [29]. The development of ATM and its integration with the methods described above is one of the most promising developments in AutoML. An incentive to spur this development could be a separate performance challenge for distributed AutoML methods.
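The parallel evaluation that Autostacker and ATM exploit can be sketched on a single machine with a worker pool; the evaluate function and candidates below are made-up stand-ins for fitting a pipeline and returning its score.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(candidate):
    # Invented stand-in for fitting a candidate pipeline and returning
    # its validation score; a real worker would train a model here.
    return 1.0 - abs(candidate - 0.3)

candidates = [0.1, 0.2, 0.3, 0.4, 0.5]

# Farm the evaluations out to a pool of workers, as a distributed AutoML
# system would, then keep the best-scoring candidate.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, candidates))

best = max(zip(scores, candidates))[1]
```

ATM generalises this map-then-reduce shape across machines, coordinating workers through a shared store of results.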

Overview of Methods
Figure 2 visualizes the relations between the AutoML methods discussed in this section. It makes a distinction between methods that build NNs and methods that use traditional classifiers or regressors in their pipeline. Arrows between methods point out a relationship between methods. The colors of the methods indicate the search strategy that is applied in a method to create a pipeline. A tabular overview of the discussed AutoML methods is available in Appendix A, including the prediction tasks, a link to the code repository, and more information about the pipeline creation.

Research Method
We used the Design Science research framework [30] to answer the research questions. To do so, we first performed a benchmark test using the framework of Gijsbers and colleagues [31] on all available medical datasets from the OpenML100 [32] to find out whether one AutoML method performed best on medical tasks. We ran medical binary-classification tasks on four datasets: Breast cancer [33], Diabetes [34], Indian Liver Patient [34], and Sick [35]. All selected methods received a time budget of one hour in a 10-fold cross-validation set-up to create the best pipeline on the given datasets. The time limit was set at one hour, as longer runs do not provide significantly better results [31], a finding that was also confirmed during our experiments. To baseline the performance of the AutoML methods in the benchmark test, we added a decision tree and a constant predictor. Following Gijsbers et al. [31], we used the Area Under the Receiver Operating Characteristic curve (AUROC) for scoring. All tests were run on Amazon Web Services using m5.2xlarge machines (32 GB memory, 8 vCPUs; Intel Xeon Platinum 8000 series Skylake-SP processor with a sustained all-core Turbo clock speed of up to 3.1 GHz) running Amazon Linux (https://aws.amazon.com/ec2/instance-types/m5/), to ensure consistent conditions and enough computing power for the AutoML methods. In our research we only used the AutoML methods available in the benchmark framework to create an even playing field; hence, no industrial tools were considered in this research.
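AUROC, the scoring metric used in the benchmark, has a convenient rank-based reading: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal sketch (not the benchmark framework's implementation):

```python
def auroc(labels, scores):
    """AUROC via its rank interpretation: the probability that a random
    positive example (label 1) is scored above a random negative example
    (label 0), with ties counting half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranker reaches 1.0; a constant predictor stays at 0.5, which
# is why the constant baseline marks chance level in the benchmark.
perfect = auroc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
constant = auroc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

This rank formulation also explains why AUROC is insensitive to the class balance of a dataset, a useful property for medical tasks with rare positives.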
To understand the needs of domain experts, we elicited requirements using semi-structured interviews. We selected semi-structured interviews as the best method for requirements elicitation for three reasons. First, semi-structured interviews are considered to be the most effective way to elicit requirements [36]. Second, they are an accepted method for conducting qualitative research in healthcare [37]. Third, semi-structured interviews have the benefit of eliciting people's own views and uncovering issues or concerns that the researcher had not considered beforehand [38]. For our interviews, we constructed an interview protocol following the guidelines for interview research [39]. The sample consisted of five domain experts who were active in the scientific department of a regional hospital in the Netherlands. The interviewees participated voluntarily and hold different roles and medical expertise within non-academic hospitals in the Netherlands. They are active in the research fields of cancer, orthopaedics, and cardiology and participate in medical research, either full-time or part-time. The sample consists of three women and two men. We evaluated the capabilities of the AutoML methods against the requirements of the domain experts. Based on the requirements, we created two artefacts and evaluated these with the interviewees. The first artefact had a graphical user interface; the second had a code-based interface. We used artificial summative evaluation as part of the framework for evaluating design science research [40]. We evaluated the artefacts on the user story categories from the previous research question. To be able to evaluate the artefact properties, we used refined hypotheses [41].

Results
This section describes the result of the benchmark test, the requirements evaluation and artefact evaluation.

Benchmark Test
The benchmark test was run using a time budget of one hour, with a total of 160 hours of computational time. Figure 3 contains the visualization of the results; the datasets are on the X-axis, and the Y-axis shows the AUROC score. A colored dot marks the score of an AutoML method on each of the ten folds. A Kruskal-Wallis H test indicated that there was a statistically significant difference in the distributions between the AutoML methods for the Breast (H = 11.36, p < 0.001), Diabetes (H = 18.64, p < 0.001), Liver (H = 17.93, p < 0.001), and Sick datasets (H = 27.87, p < 0.001); see Table 1 for the statistics. What is noteworthy in Figure 3 is that, on the Liver dataset, the decision tree and Hyperopt-Sklearn do not always outperform the constant predictor. On the Diabetes dataset, Hyperopt-Sklearn lags behind the three other methods, but performs better than the constant predictor and similarly to the decision tree. On the Breast dataset, all AutoML methods reach the maximum score in at least one fold. All methods perform well on the Breast set, given their median scores and distributions; the performance of the decision tree indicates that it is not a hard prediction problem. On the Sick dataset, TPOT and Auto-Sklearn outperform the other two methods in both consistency and score of their predictions, despite the fact that the set has missing values; Hyperopt-Sklearn is again not better than the decision tree. Overall, TPOT registered the highest median score after running for one hour on all sets but the Breast dataset, on which Auto-Sklearn registered the best median performance. A Mann-Whitney U test indicated that Auto-Sklearn significantly outperforms the decision tree (U = 2.0, p < 0.001) and Hyperopt-Sklearn (U = 17.5, p < 0.01) on the Breast set.
TPOT significantly outperforms the decision tree on the Diabetes (U = 0.0, p < 0.001), Liver (U = 8.0, p < 0.001), and Sick (U = 0.0, p < 0.001) datasets. Furthermore, TPOT significantly outperforms Hyperopt-Sklearn on the Diabetes (U = 2.0, p < 0.001), Liver (U = 7.0, p < 0.001), and Sick (U = 0.0, p < 0.001) datasets. Finally, Auto-WEKA is significantly outperformed on the Sick dataset by TPOT (U = 0.0, p < 0.001). This is probably because Auto-WEKA does not impute missing values, whereas Auto-Sklearn and TPOT do and achieve significantly better results. The statistics for the Mann-Whitney U tests are available in Table 2, which shows the p-values and U statistics for each method compared to the best-performing method on each dataset. TPOT and Auto-Sklearn do not differ significantly in performance on any of the datasets.
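The U statistics reported above come from the Mann-Whitney test; in one common formulation, U counts the score pairs between two samples, with ties counting half. The fold scores below are made up for illustration, not the study's data, and the significance testing behind the p-values (a normal approximation or tables) is omitted here.

```python
def mann_whitney_u(a, b):
    """U statistic for sample `a` against sample `b`: the number of pairs
    (x, y), with x from a and y from b, where x < y; ties count half.
    U = 0 means every value in `a` exceeds every value in `b`."""
    return sum((x < y) + 0.5 * (x == y) for x in a for y in b)

# Made-up AUROC fold scores: if every TPOT fold beats every decision-tree
# fold, U = 0, mirroring the U = 0.0 values reported above.
dec_tree_folds = [0.70, 0.72, 0.71]
tpot_folds = [0.80, 0.82, 0.81]
u = mann_whitney_u(tpot_folds, dec_tree_folds)
```

A non-parametric test is appropriate here because the per-fold AUROC scores are few and not guaranteed to be normally distributed.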
From the benchmark test we can conclude that no method consistently outperforms all others. However, TPOT records the highest median scores on three of the four tasks in this test, although this performance is not significantly better than that of the other methods. Auto-Sklearn and Auto-WEKA obtain results similar to TPOT; only Auto-WEKA is outperformed on the Sick dataset.

AutoML Requirements Evaluation
We elicited twenty-one requirements from the domain experts in the form of user stories, shown in part in Appendix B. User stories are short texts that capture a desired functionality and the end-users' reason for desiring it. The user stories are categorized into four categories: user interaction, model construction, model usage, and model explanation. We only consider the AutoML methods that participated in the benchmark test for comparison with the user stories. Furthermore, we only include user stories that were mentioned by at least two participants, leaving us with fifteen user stories. For user stories, we use the template of Cohen [42]: "As a <type of user>, I want <goal>, [so that <some reason>]." Table A1 provides the collection of elicited user stories requested by two or more interviewees. There are three user stories in the user interaction (UI) category. These are in conflict: two out of five domain experts prefer to use code for their analyses, so that they have more control over what is happening, e.g., "I like SPSS for its syntax function, not so much for its clicking". In contrast, two other domain experts prefer to have a GUI, as small coding errors cost a lot of time to solve, e.g., "Just an interface? That would be great!".
For model construction (MC) there are five user stories. Automatic model configuration, as well as the possibility to set a time budget, is covered by the concept of AutoML, e.g., "Using R is a bit too far removed from my interests". The restriction to only use logistic regression can only be satisfied by TPOT and Auto-WEKA, e.g., "If you do not use it (logistic regression), then the reviewers do not know what is happening, so they reject your paper. Not because it is bad, just because they do not understand it". Auto-Sklearn and TPOT are the methods that satisfy the requirement of an explanation of how missing data is handled. There is only one user story about using the created model in practice (MU): using the created model on unseen data is supported by all assessed AutoML methods.
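TPOT can be restricted to a single estimator through its configuration dictionary. The sketch below shows what a logistic-regression-only search space might look like; the hyperparameter values are illustrative, and the commented-out call assumes the tpot package is installed.

```python
# Illustrative TPOT search space limited to logistic regression. The key
# is the scikit-learn import path; the value maps hyperparameter names to
# the candidate values the search may try (the ranges here are invented).
logreg_only = {
    "sklearn.linear_model.LogisticRegression": {
        "penalty": ["l2"],
        "C": [0.01, 0.1, 1.0, 10.0],
    }
}

# In a real run, this dictionary would be handed to TPOT (not executed here):
# from tpot import TPOTClassifier
# tpot = TPOTClassifier(config_dict=logreg_only, max_time_mins=60)
```

Constraining the search space this way trades away most of AutoML's power, but it directly addresses the reviewers-only-accept-logistic-regression concern quoted above.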
Over half of the user stories are about model explanation (MX). The domain experts are interested in which variables are important, what the statistical power of a model is, and which decisions are made during model creation, e.g., "Being accurate is one thing, but how explainable is it?". However, none of the assessed AutoML methods could satisfy any of these requirements. It is, however, possible to integrate a comparison of created models and an explanation of regression into the artefacts.
The word cloud in Figure 4 illustrates the need for an explanation of the variables and an understanding of the process of creating a model. The word cloud is made from the transcripts of the answers to the why-questions in the individual evaluations of the interviewees. The words variable, model, happening, explanation, control, and understand all stand out. Most interviewees do not consider the output of TPOT a model, e.g., "It does not show an answer to my question. I do not consider this a model." We noted that the interviewees only wanted logistic regression models, which limits the power of using AutoML methods. Furthermore, it was evident that the interviewees wanted to stay in control of creating the model: without the rationale of the AutoML method on the creation of the model, they considered it useless. There was no trust in the method, as it was so opaque. In addition, the subjects found the machine output of the model unreadable, because there was no description of the importance of variables. After comparing the four AutoML methods to the user stories, we can conclude that TPOT is the most appropriate AutoML method for this set of requirements.
TPOT satisfies five of the fifteen assessed requirements, compared to four out of fifteen for Auto-WEKA and Auto-Sklearn. Noteworthy is the inability of all AutoML methods to explain the created models. The need for explainability is evident: model explanation is the biggest category in the user story categorization. Besides that, explaining model decisions has been obligatory in Europe since the introduction of the General Data Protection Regulation [43] (pp. 40-42). Table 3 contains an overview of the AutoML methods in each category. We have not included the user interaction category in the table, as it contains conflicting user stories and applies to the artefacts rather than to the AutoML methods. As TPOT performs better than Auto-Sklearn on the requirements, we integrated TPOT into the artefacts. Table 3. AutoML method scores on user story categories.

Category/Method            Auto-Sklearn    Auto-WEKA    TPOT    Hyperopt-Sklearn
User interaction (UI)      n/a             n/a          n/a     n/a
Model construction (MC)

Artefact Evaluation
Based on the user stories, we created two artefacts that automate part of the data preparation phase and the complete modelling phase of CRISP-DM [44]. The data preparation activities involve the possibility to include or exclude variables, data imputation for missing values, and the recoding of categorical variables to numerical variables, as TPOT cannot handle non-numerical input. Due to the conflict in user-interaction requirements, we designed two artefacts with the same functionality but a different user interface. The artefacts can be used to create logistic regression models, and users who do not know how to code should be able to use them. The artefacts contain a description of missing data handling, as well as the possibility to compare the different models.
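These preparation steps can be sketched in a few lines of pandas. Here `prepare_for_tpot` is a hypothetical helper for illustration only; median imputation and integer category codes are one possible choice, not necessarily the exact strategy implemented in the artefacts:

```python
import pandas as pd

def prepare_for_tpot(df, include=None, exclude=None):
    """Illustrative pre-processing for TPOT: select variables, impute
    missing numeric values, and recode categorical variables to integer
    codes, since TPOT only accepts numerical input."""
    if include is not None:
        df = df[include]
    if exclude is not None:
        df = df.drop(columns=exclude)
    df = df.copy()
    for col in df.columns:
        if df[col].dtype == object or str(df[col].dtype) == "category":
            # Recode categories as integer codes; missing values become -1.
            df[col] = df[col].astype("category").cat.codes
        else:
            # Simple median imputation for missing numeric values.
            df[col] = df[col].fillna(df[col].median())
    return df

# Toy clinical dataset (hypothetical column names).
patients = pd.DataFrame({
    "age": [64, None, 71],
    "sex": ["F", "M", None],
    "readmitted": [1, 0, 1],
})
clean = prepare_for_tpot(patients, exclude=["readmitted"])
```

After this step the frame contains only numeric values and no missing data, which is the input format the artefacts hand to TPOT.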
The two artefacts are a Flask web application (GUI artefact) to satisfy the graphical user interface preference and a Jupyter notebook (Code artefact) to satisfy the coding preference, as shown in Figures A1-A3 in Appendix C. The source code for the web application and the Jupyter notebook is available at: https://github.com/richooms/healthcare_automl.
Using the risk and efficacy strategy from the framework for evaluation in design science [40], we created a set of refined hypotheses based on the user story categories to evaluate the artefacts with the domain experts. The hypotheses are available in Appendix D. For user interaction, we tested four elements: uploading a dataset, creating a subset, the workflow of the application and the workflow explanation. Only for uploading a dataset was the GUI artefact preferred. For all other actions the code artefact was preferred, as the interviewees felt more in control of the process. The same held for model construction and the progress reporting on model construction, as parts of the model construction category. The model explanation category consisted of comparing different results, the explanation of missing data handling, and readability. In this category, artefact A was preferred over artefact B for all interactions except the explanation of missing data handling. An overview of the preferences and categories is available in Table 4. The preference column shows which artefact was preferred for that part of the artefact interaction; the score column indicates how many subjects had that preference.
We found that the interviewees prefer a hybrid version of the two artefacts to interact with AutoML. To keep control over the process of creating a model, they prefer coding; for relatively simple tasks, such as uploading a dataset or comparing results, they prefer a graphical user interface. Furthermore, TPOT needs to explain variable importance to be usable in their research practice, as this is considered a must-have for the adoption of AutoML by domain experts.

Discussion
This section discusses the conducted research. First, we discuss the lessons learned. Second, we discuss the validity of this research.

Lessons Learned
In this section, we discuss two lessons learned during this research project. First, we discuss the suitability of AutoML for healthcare practitioners. Second, we discuss the bias in medical (analytics) publications.

Suitability of AutoML Methods for Researcher-Physicians
After the elicitation of requirements, we found that researcher-physicians prefer to create logistic regression models in their research. There are multiple reasons why we find that AutoML, in the way we assessed it, is not the best way to create models for researcher-physicians. First, AutoML originates from the CASH problem. If the algorithm type is pre-selected, AutoML can only contribute to HPO. As logistic regression has only two hyperparameters [8], we question the need for AutoML to tune them.
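To illustrate: once the model class is fixed to logistic regression, a plain grid search over its regularisation strength and penalty type already covers the remaining search space. A minimal scikit-learn sketch (the synthetic dataset and grid values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a clinical dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# With the model class fixed, "AutoML" reduces to searching two
# hyperparameters: regularisation strength C and the penalty type.
grid = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]},
    cv=5,
)
grid.fit(X, y)
best = grid.best_params_
```

An exhaustive 4 x 2 grid is trivially cheap here, which is why a full AutoML method adds little when the algorithm choice is already made.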
Second, even if we drop the constraint of logistic regression for model construction, previous research found that other ML models do not significantly outperform logistic regression models in medical studies [45]. Given the results of Christodoulou et al., we therefore still doubt the usability of AutoML, although other researchers do not find the same results on general datasets [46,47].
Third, there is no explanation of variable importance by the tested AutoML methods. As model explainability and variable importance are essential requirements for researcher-physicians, this makes AutoML unsuitable in its current form. If the given explanations were satisfactory, AutoML might be useful in research. More on variable importance is available in the future research section below. If AutoPrognosis can deliver on its promises, it could be a promising technology considering the user stories on model explanation.
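For illustration, a model-agnostic importance measure such as permutation importance could in principle be bolted onto any fitted pipeline after the fact. The scikit-learn sketch below (on synthetic data; not part of the evaluated artefacts) shows the kind of variable-importance output the interviewees asked for:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset with 3 informative variables out of 6.
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Model-agnostic variable importance: shuffle one variable at a time
# and measure how much the held-out score drops.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
```

Because the technique treats the model as a black box, it would work for any pipeline an AutoML method produces, not just logistic regression.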
Fourth, there is no structure in the pipelines created by TPOT; this can lead to very complicated pipelines with three or four logistic regression models, all using each other's results as input. These constructions are hard for domain experts to understand. If we were to use a grammar to represent the pipeline, as in RECIPE [26], the grammar could help to create more understandable pipelines for researcher-physicians. Another option could be using fixed-pipeline methods based on BO. Although these performed worse on the benchmark test, they could compensate if their results make more sense to researcher-physicians. Finally, we have learned that there is a gap between the knowledge level of ML assumed in the literature and that found in practice for researcher-physicians. The literature on AutoML states that AutoML aims to aid non-expert users of ML techniques [8]. However, we find that most non-expert users have no knowledge of or education in programming. Hence, the current offering of AutoML techniques is still too technical for non-expert users in healthcare.

Bias in Medical Analytics Publications
Most publications about healthcare analytics that we encountered during this research report a positive result. [48] noticed the same: there is pressure to publish every positive result, but there are no real tests of the value of ML solutions in healthcare. If we add the finding of Christodoulou et al. [45] that ML models do not perform significantly better than traditional methods in medical research, we find a discrepancy between what is published and what is improving healthcare. Hence, we think it would help the development of accurate methods for analytics in healthcare to also publish results that do not provide a direct benefit. In this way, research funding can be used either to improve methods that do not yet work in healthcare or to investigate different methods for solving the same problem. When negative results are not published, we fear a waste of research funding, with researchers continually reinventing the wheel and thus stalling research and innovation.

Validity
In this section, we address the subjective nature of the data collection and analysis of this research [49]. To assess the validity of this study, we look at three of the five aspects of validity for qualitative research proposed by Burke Johnson [50]. We do not discuss theoretical validity, as the goal of design science research is artefact creation rather than theory creation; we do not aim to explain a phenomenon. We also do not discuss internal validity, as we do not aim to answer a question about a causal phenomenon: the main research question of this research is a "how" question.

Descriptive Validity
Descriptive validity concerns the factual accuracy of the account of events as reported by the researcher [50]. As a single researcher conducted this research, there is a bias in data collection. A researcher is subjective by nature, and so are their data collection and analysis [49]. Besides that, the relationship between the researcher and the participants significantly influences what the participants reveal to the researcher [51] (pp. 11-30). To mitigate this validity threat, we used a framework to set up the semi-structured interviews and the framework for evaluation in design science to set up the artefact evaluation. We recorded all interactions with the participants and took part in peer-feedback sessions on our research to increase the descriptive validity.

Interpretive Validity
Interpretive validity is about accurately portraying the meaning that was attached by the participants to the objects that were studied [50]. To mitigate this threat, we have sent the elicited user-stories to the participants to obtain feedback on our findings [49]. Furthermore, we used data-triangulation by tapping into other sources to confirm our findings. To make sure we portrayed the meaning of the participants well, we used low inference descriptors by quoting participants in this research [50].

External Validity
External validity is crucial if we want to generalize our findings to a larger part of the population [50]. Although generalizability is not the primary purpose of this research, we will touch upon the subject. As this research is a case study, the best way to generalize its findings is to find similarity in subjects, objects and issues [52]. As the sample size and characteristics do not permit statistical generalization, the best method to generalize our findings is to identify similarity in other situations.
Characteristics that make the sample inapplicable for generalization to medical professionals are the limited set of medical domains in which the participants operate and the fact that all participants decided to participate voluntarily. Hence, the insights derived from this study are hard to generalize, but could be a stepping stone for future research.

Conclusions
The main research question of this study was "How can we support healthcare professionals in their knowledge discovery process by applying Automated Machine Learning?". We found that AutoML, using TPOT as the AutoML method, is currently only suitable for the data understanding phase of the CRISP-DM method. TPOT performed best on the benchmark test along with Auto-Sklearn, but satisfied more requirements than Auto-Sklearn. Although the assessed AutoML methods are capable of modelling and data pre-processing, they miss an explanation for the decisions made in the modelling process. Part of medical knowledge discovery is finding the cause of a medical event. Because modelling decisions are not shared and variable importance is absent from the result, AutoML does not support the discovery of new knowledge. However, the domain experts point out that they see added value in automatically finding out possible scores for their datasets. Furthermore, they mention that AutoML can help them to gain an understanding of their data in the data understanding phase of their knowledge discovery process.

Future Research
In this section, we discuss three possibilities for future research. First, we discuss model uncertainty of AutoML-created models. Second, we discuss possible new use cases for AutoML methods. Finally, we discuss interpretability as a direction for future research.

AutoML Model Uncertainty
One of the proposed benefits of AutoML is the reproducibility of created ML pipelines [8,9,12]. However, these authors state that the outcome of creating an ML pipeline with an AutoML method depends on the time budget allocated to the method. Besides that, EA-based AutoML methods start with a random population, so it is hard to reproduce the result of a single run without explicitly setting the seed. [53] investigated model uncertainty in a medical context and found that something as small as changing the seed can influence the prediction outcome for an individual patient. Hence, we argue that there should be more research on the stability of AutoML pipelines in the medical domain.
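The seed-dependence is easy to demonstrate with any stochastic search. The sketch below uses scikit-learn's RandomizedSearchCV as a stand-in; TPOT's `random_state` argument plays the analogous role for its evolutionary search (the dataset and search space here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=150, n_features=8, random_state=0)
space = {
    "n_estimators": list(range(10, 60, 10)),
    "max_depth": list(range(2, 12)),
}

def best_params(seed):
    # The seed fixes which hyperparameter candidates are drawn;
    # without it, each run may select a different pipeline.
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        space, n_iter=5, cv=3, random_state=seed,
    )
    search.fit(X, y)
    return search.best_params_

# Re-running with the same seed reproduces the chosen configuration;
# different seeds can yield different winning configurations.
reproducible = best_params(1) == best_params(1)
```

For patient-level predictions, this run-to-run variability is exactly the instability [53] warns about, which is why seeds and time budgets should be reported alongside AutoML results.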

AutoML Use Cases
In our synthesis in Section 2, we explain that AutoML is applicable for tuning neural networks and creating classification and regression models. In healthcare, the fields with the most significant potential for the application of ML are image recognition and natural language processing [48]. Because researcher-physicians accept black boxes in image recognition and natural language processing more than in traditional research, these ML tasks might be better suited to enable domain experts to work with AutoML in healthcare. Hence, we argue that the scope of AutoML use-cases could be widened to NLP and image recognition tasks if we want to accelerate the adoption of analytics in healthcare.

AutoML Interpretability
As pointed out in Section 5, the explainability of ML models is crucial to adoption by domain experts in healthcare [54]. Molnar [55] argues that ML interpretability is crucial to the adoption of black-box algorithms in every sector. In healthcare, this barrier to adoption is even higher, as being able to explain decisions is part of the medical culture and vital to patient-doctor interaction. Hence, AutoML methods must become more interpretable for non-expert users. The interpretability technique should be model-agnostic, so that it is suitable for all pipelines created by the AutoML methods [56]. To improve the interpretability of AutoML models, we propose three areas for further research: surrogate models, Local Interpretable Model-agnostic Explanations (LIME) [55] and Shapley values [57]. However, others argue that we should improve trust in artificial intelligence in healthcare rather than interpretability: if trust is high enough, researcher-physicians will start using the black boxes [58]. Most researcher-physicians also do not know precisely how a car works, yet they still use cars in their daily lives. We believe that the addition of interpretability will accelerate the adoption of AutoML methods in healthcare and other sectors and enable self-service data science.

As a researcher-physician, I want to get suggestions for variables to include by the computer, so that I can improve my models.

15. As a researcher-physician, I want to see the amount of variance that is explained by my model, so that I can assess the model quality. (MX)
Appendix C
Figure C1. Screenshot 1 of AutoML artefact A, the web application interface (preference: GUI).

Appendix D
Overview of the refined hypotheses. For user interaction, the hypotheses are:
1. Artefact A is preferred over Artefact B to upload a dataset.
2. Artefact A is preferred over Artefact B to create a subset.
3. Artefact B is preferred over Artefact A to navigate the different steps.