VERONICA: Visual Analytics for Identifying Feature Groups in Disease Classification

Abstract: The use of data analysis techniques in electronic health records (EHRs) offers great promise in improving predictive risk modeling. Although useful, these analysis techniques often suffer from a lack of interpretability and transparency, especially when the data is high-dimensional. The emergence of a type of computational system known as visual analytics has the potential to address these issues by integrating data analysis techniques with interactive visualizations. This paper introduces a visual analytics system called VERONICA that utilizes the natural classification of features in EHRs to identify the group of features with the strongest predictive power. VERONICA incorporates a representative set of supervised machine learning techniques (classification and regression tree, C5.0, random forest, support vector machines, and naive Bayes) to support users in developing predictive models using EHRs. It then makes the analytics results accessible through an interactive visual interface. By integrating different sampling strategies, analytics algorithms, visualization techniques, and human-data interaction, VERONICA assists users in comparing prediction models in a systematic way. To demonstrate the utility of the proposed system, we use the clinical dataset stored at ICES to identify the best representative feature groups in detecting patients who are at high risk of developing acute kidney injury.


Introduction
A key component of precision medicine is to determine a person's individualized estimates of different health outcomes, which then guides therapy to increase the chance of long-term good health. Identifying the group of features in electronic health records (EHRs) with the most substantial predictive power helps in the development of robust predictive models [1,2]. The data in EHRs has great promise for improving predictive risk modeling [3]. However, EHRs are often challenging to analyze due to their high dimensionality [4,5]. In recent years, several studies have incorporated various data mining and machine learning techniques to address this problem. Most of the existing studies use unsupervised learning techniques such as principal component analysis [6], K-means [7,8], and hierarchical clustering [9] to find the best representative group of features in high-dimensional EHRs [10][11][12][13][14][15][16][17][18]. Although these unsupervised techniques have shown promise in managing high-dimensional data, to the best of our knowledge, this problem has not been studied thoroughly using supervised techniques [19,20]. One of the main issues with both supervised and unsupervised techniques is that they suffer from a lack of interpretability and transparency [21][22][23]. In healthcare settings, it is essential to better understand how a given technique works. Therefore, increasing a technique's interpretability by involving humans in the analytics process can play a vital role in building trust with users [24][25][26][27][28]. The analytics results can be made accessible to users through visual analytics (VA) to address these issues.
Visual analytics (VA) integrates data analytics techniques with interactive visualizations to improve users' capabilities in performing data-driven tasks [21,29]. It enables users to achieve their goals through interactive exploration and manipulation of the data [30,31]. The design of a VA system is often not straightforward because it requires the designer to consider users' activities and tasks, the structure of the data, and human factors [20,32,33,34]. Thus, the designer needs to make several non-trivial decisions when developing such systems. For instance, one needs to consider which techniques to use, which features and samples to incorporate, and what level of granularity to look for when choosing a data analytics technique [29]. Similarly, it is important to determine how to map and classify data items and help users accomplish their tasks when developing interactive visualizations. Consequently, combining analytic techniques with interactive visualizations becomes a more complex challenge. Thus, it is important to involve stakeholders (e.g., clinical researchers and medical practitioners) in the design and development process of a VA system [35].
The purpose of this paper is to show how VA systems can be designed systematically to identify the best representative subset (i.e., a combination of groups) of high-dimensional EHRs. The proposed VA system, VERONICA (Visual analytics for idEntifying featuRe grOups iN dIsease ClAssification), takes advantage of the group structure of features stored in EHRs. EHRs are generally classified into different groups: comorbidities, medications, laboratory tests, hospital encounter codes, and demographics. It is possible to combine these groups to create multiple subsets of groups. For instance, one can create a subset by combining all features from both comorbidity and demographic groups. Depending on the predictive power of features within them, some groups or subsets (i.e., combinations of groups) are stronger predictors in identifying diseases in comparison to others. To identify the subset with the most substantial predictive power, VERONICA considers every possible subset of groups (i.e., groups of features) and applies several supervised learning techniques to each subset. It allows users to compare the results based on different performance measures through an interactive visual interface. VERONICA aims to assist healthcare providers at ICES-KDT, where ICES is a non-profit, world-leading research organization that utilizes population-based health data to produce knowledge on a broad range of healthcare issues, and KDT refers to the Kidney Dialysis and Transplantation Program located in London, Ontario, Canada. We utilize the clinical dataset housed at ICES to identify the best representative feature groups in detecting patients with a high risk of developing acute kidney injury to demonstrate VERONICA's utility and usefulness.
The rest of the paper is organized as follows. Section 2 gives an overview of the conceptual and terminological background to understand the design of VERONICA. Section 3 briefly describes existing EHR-based VA systems. Section 4 explains the methodology used for the design of VERONICA. Section 5 presents VERONICA by describing its structure and components. We address the limitations of the system in Section 6. Finally, Section 7 discusses the conclusions and future areas of application.

Background
In this section, we present the terminological and conceptual background to understand the design of VERONICA. We discuss different components of VA systems to provide a better understanding of the concept of VA. Finally, we provide a summary of the chosen machine learning techniques, namely classification and regression tree (CART) [36], C5.0 [37], random forest [38], naïve Bayes (NB) [39], and support vector machine (SVM) [40].

Visual Analytics
Visual Analytics (VA) systems combine the strengths of data analysis and interactive visualizations to enable users to apply filters and explore and manipulate the data interactively to accomplish their goals [41]. The processing load in VA is distributed between users and the key components of the system, namely the analytics module and the interactive visualization module [21,29,42,43,44,45].

Analytics Module
The analytics module is responsible for storing, pre-processing, transforming, and performing computerized analysis of the data. It involves three main stages: data preprocessing, data transformation, and data analysis [29]. The raw data retrieved from different sources gets processed in the pre-processing stage. This stage involves tasks such as fusion, integration, cleaning, and synthesis [46]. Then in the transformation stage, the pre-processed data is transformed into a form suitable for analysis. Common tasks in this stage include smoothing, aggregation, normalization, and feature generation [46]. Finally, the analysis stage involves the discovery of hidden patterns and relationships and allows for the extraction of useful and novel information from the data [47,48]. This can be carried out by applying various statistical and machine learning techniques (e.g., random forest, SVM, NB, and decision trees) to the transformed data. However, despite all the benefits, most of these computational techniques do not support proper exploration and manipulation of the computed results [21]. VA systems address this problem by allowing users to engage in a more involved discourse with the data through interactive visualizations.

Interactive Visualization Module
In VA systems, the interactive visualization module is composed of a mapping component that retrieves the analyzed data from the analytics module and generates interactive visual representations. It allows users to change the displayed information, modify the subset of the information displayed, and guide and control the intermediary steps of the analytical processes within the analytics module. This, in turn, triggers a chain of reactions that leads to the execution of additional analysis processes. The interactive visualization module provides users with flexibility and supports their cognitive and perceptual needs as they engage in various complex tasks. However, despite the advantages of interactive visualizations in supporting users' cognitive needs, they fall short when confronted with data-intensive tasks that require computational analysis [21]. Therefore, an approach that integrates analytical processes with interactive visualizations through VA is required to overcome these challenges [49][50][51].

Machine Learning Techniques
In this section, we give a brief overview of all machine learning techniques used in VERONICA.

Decision Tree
Decision trees are among the most popular and powerful classification techniques that can provide informative models [52]. They construct a set of predictive rules to solve classification problems using a recursive partitioning process. In their simplest form (e.g., C4.5 [37]), each feature is tested and ranked based on its ability to split the remaining data. The training data is propagated through the decision tree branches until enough features are chosen to classify it correctly. The classifier has a tree-like structure where each of its leaf nodes corresponds to a subset of the data that belongs to one class. Two widely known methods for generating decision trees are Classification and Regression Trees (CART) and C5.0. CART is based on a tree-growing algorithm that uses the Gini index as its splitting criterion. The strategy is to choose, at each split, the feature whose Gini index is minimal. On the other hand, C5.0 builds the tree by splitting on the feature that yields the largest information gain (i.e., the largest reduction in entropy). These classifiers are robust in handling missing values since the tree-growing process is not affected by missing data [53]. However, despite all the benefits, they tend to over-fit the training data [54]. Random forest addresses this problem by generating an ensemble of decision trees where each tree is built from a random subset of features [38,55]. A new object, represented as a vector of features, passes through every tree in the forest. Each tree gives a classification and thereby votes for a class. The final classification of the new object is based on the majority vote of all the trees in the forest.
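The two splitting criteria mentioned above can be illustrated with a minimal sketch. It is written in Python for illustration only (the system itself uses R), the function names and the toy labels are hypothetical, and it computes plain information gain as described in the text rather than reproducing the exact internals of CART or C5.0.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def weighted_impurity(groups, impurity):
    """Impurity of a split: size-weighted average impurity of the child groups."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * impurity(group) for group in groups)


def information_gain(parent, groups):
    """Reduction in entropy obtained by splitting `parent` into `groups`."""
    return entropy(parent) - weighted_impurity(groups, entropy)


# Hypothetical toy node: 4 AKI and 4 non-AKI patients, split by one binary feature.
parent = ["AKI"] * 4 + ["no AKI"] * 4
split = [["AKI"] * 3 + ["no AKI"], ["AKI"] + ["no AKI"] * 3]

print("Gini before split:", gini(parent))
print("Gini after split:", weighted_impurity(split, gini))
print("Information gain:", round(information_gain(parent, split), 4))
```

A Gini-based learner would prefer the candidate split with the lowest weighted Gini index, whereas an entropy-based learner would prefer the one with the highest information gain.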

Support Vector Machines
Support Vector Machines are among the most successful and robust classification techniques [40,56]. They aim to identify an optimal separating hyperplane that divides the instances of different classes in a multi-dimensional space by maximizing the minimum distance from the hyperplane to the closest instance. Although models produced by SVM are often hard to interpret and understand, they work well on classification tasks involving a large number of features [57]. SVM was first formulated for linearly separable classification problems, but a linear classifier is not always the most appropriate candidate for binary classification. SVM can support non-linear decision surfaces using kernel functions. Due to its good generalization ability and its low sensitivity to the curse of dimensionality, SVM is used in many classification problems.

Naive Bayes
Naive Bayes is a simple and powerful probabilistic classifier that often creates stable and accurate models [39]. The model is based on the prior probability of each class and the conditional probability of each feature value given each class. These probabilities, which are calculated directly from the data, can be used to classify new data based on Bayes' theorem. Naive Bayes makes the simplifying assumption that all the features are independent of one another given the class. Despite this assumption and its simple design, it can be very efficient, particularly when the data is high-dimensional.
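As a rough illustration of how these probabilities combine, the sketch below trains a categorical naive Bayes classifier on a handful of "Y"/"N" features. It is a simplified Python example under stated assumptions: the toy records, the function names, and the use of add-one (Laplace) smoothing are all hypothetical choices and are not taken from the system's implementation.

```python
from collections import Counter, defaultdict
from math import log


def train_nb(rows, labels, alpha=1.0):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # cond[c][j][v] = number of rows of class c whose feature j equals v
    cond = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values observed for each feature
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            cond[label][j][value] += 1
            values[j].add(value)
    return class_counts, cond, values, alpha


def predict_nb(model, row):
    """Return the class maximizing log P(class) + sum_j log P(feature_j | class)."""
    class_counts, cond, values, alpha = model
    n = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = log(n_label / n)  # log prior
        for j, value in enumerate(row):
            # Smoothed estimate of P(feature_j = value | class)
            numerator = cond[label][j][value] + alpha
            denominator = n_label + alpha * len(values[j])
            score += log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data: (diabetes, chronic kidney disease) flags per patient.
rows = [("Y", "Y"), ("Y", "N"), ("N", "Y"), ("N", "N"), ("N", "N"), ("N", "N")]
labels = ["AKI", "AKI", "AKI", "no AKI", "no AKI", "no AKI"]

model = train_nb(rows, labels)
print(predict_nb(model, ("Y", "Y")))
print(predict_nb(model, ("N", "N")))
```

Working in log space avoids numerical underflow when many feature probabilities are multiplied, which matters for high-dimensional data.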

Class Imbalance Problem
In EHRs, data are usually composed of "negative" samples with only a small percentage of "positive" ones, resulting in the imbalanced classification problem. The imbalance problem in the healthcare domain, where one class often has notably fewer samples than the other class, affects the performance of classification techniques. The former class is known as the minority class, and the latter is known as the majority class. Most standard classification techniques, such as support vector machines, assume that both classes are equally common and aim to maximize the overall classification accuracy without accounting for the uneven distribution of the minority and majority classes. Thus, the imbalance problem can have adverse consequences for the performance of classification techniques. It often results in a learning bias toward the majority class and poor sensitivity toward the minority class [58,59]. In EHRs with an imbalanced class distribution, accurately detecting samples from the minority class is of great importance as they often correspond to high-impact events. For instance, among patients with suspicious mole pigmentation, the prevalence of patients with cancer (i.e., the minority class) is significantly lower than that of patients who are likely not to have cancer (i.e., the majority class). In this example, the incorrect classification of a cancer patient as a patient without cancer will incur an unacceptably high cost, thus making class imbalance a problem of great importance in predictive learning, especially in the healthcare domain. A common strategy to address the imbalance problem is to rebalance the class distribution at the data level using sampling techniques [60][61][62][63]. In the next section, we discuss some of the widely used sampling techniques in more detail.
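The bias toward the majority class can be made concrete with a small worked example. The sketch below uses a hypothetical cohort of 1000 patients (not the study data) and a trivial classifier that always predicts the negative class; the function name is illustrative.

```python
def accuracy_and_sensitivity(y_true, y_pred):
    """Return (accuracy, sensitivity) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, sensitivity


# Hypothetical cohort: 990 negative and 10 positive patients.
y_true = [0] * 990 + [1] * 10
# A trivial classifier that always predicts the majority (negative) class.
y_pred = [0] * 1000

accuracy, sensitivity = accuracy_and_sensitivity(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}")
```

The classifier looks excellent on overall accuracy while detecting none of the positive patients, which is why accuracy alone is a misleading target on imbalanced data.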

Sampling Techniques
In their simplest forms, random oversampling duplicates random samples from the minority class, while random undersampling removes random samples from the majority class [64]. One of the main issues of undersampling is the loss of valuable information if a large number of samples are discarded from the majority class. A considerable deletion of samples from the majority class might also change the distribution of the majority class, resulting in a change in the distribution of the overall dataset. On the other hand, since oversampling increases the size of the training data, it results in an increased training time. It has also been shown that oversampling approaches might cause overfitting since classification techniques tend to focus on replicated minority samples [65]. Overfitting occurs when a prediction model fits too closely to the training set and is then incapable of generalizing to new data. To avoid overfitting, oversampling approaches that create artificial minority samples are preferred [66]. The synthetic minority oversampling technique (SMOTE) is an oversampling approach that randomly selects samples from the minority class and generates artificial minority samples by random interpolation between the selected samples and their nearest neighbors [67]. The generation of new minority class samples leads to a more balanced class distribution compared to the original minority to majority class ratio. One potential disadvantage of SMOTE is that it creates the same number of artificial samples for each original minority sample without taking the neighboring samples into consideration, which ultimately increases the occurrence of overlaps between classes [68]. Several modifications of SMOTE that improve its performance by adjusting minority sample selection procedures have been proposed in the literature. For instance, adaptive synthetic sampling adaptively alters the number of artificial samples from the minority class following the density of majority samples around the original samples from the minority class [69].
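The three basic strategies described above can be sketched as follows. This is a simplified Python illustration, not the DMwR implementation used in the system: the function names, the toy two-dimensional points, and the choice of Euclidean distance are assumptions made for the example.

```python
import random


def random_undersample(majority, n_keep, rng):
    """Keep a random subset of n_keep majority samples (without replacement)."""
    return rng.sample(majority, n_keep)


def random_oversample(minority, n_extra, rng):
    """Append n_extra randomly chosen duplicates of minority samples."""
    return minority + [rng.choice(minority) for _ in range(n_extra)]


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points, each interpolated between a random minority
    sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base_index = rng.randrange(len(minority))
        base = minority[base_index]
        others = [p for i, p in enumerate(minority) if i != base_index]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


rng = random.Random(0)
# Hypothetical toy data: 4 minority points and 20 majority points in 2-D.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(rng.uniform(3, 5), rng.uniform(3, 5)) for _ in range(20)]

undersampled = random_undersample(majority, len(minority), rng)
oversampled = random_oversample(minority, len(majority) - len(minority), rng)
synthetic = smote_like(minority, n_new=4, k=2, rng=rng)
print(len(undersampled), len(oversampled), len(synthetic))
```

Because each synthetic point lies on a segment between two existing minority samples, it is new data rather than an exact replica, which is the property that reduces the overfitting risk of plain duplication.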

Related Work
The most common application of VA in EHRs is identifying and exploring patient cohorts [70]. Several EHR-based VA systems have been developed to facilitate the process of generating and comparing multiple patient cohorts and identifying risk factors associated with a specific disease. For instance, VisualDecisionLinc [71] is a VA system that supports clinicians in identifying groups of patients with similar characteristics to understand the effectiveness of different treatments for those patients by providing summaries of patient outcomes and treatment options in a dashboard. Similarly, PHENOTREE [72] is a hierarchical and interactive phenotyping EHR-based VA system that allows users to interactively explore patient groups and hierarchical phenotypes by integrating principal component analysis and a user interface. VALENCIA [19] is another EHR-based VA system that facilitates the exploration of high-dimensional data stored in EHRs by combining various dimensionality reduction and clustering techniques with interactive visualizations. It allows clinical researchers to identify which features are more important within each cluster of patients. RadVis [73] is a VA system that enables clinicians to better understand the characteristics of patient clusters. It allows the user to apply different clustering techniques and displays the result using a 3-dimensional radial coordinate visualization. Likewise, Guo et al. [74] developed another EHR-based VA system to assist clinical researchers in clustering similar patients, comparing values of medical features of patients, and finding similar timestamps among similar patients. To support the user in performing these tasks, the system integrates a dimensionality reduction technique and a density-based clustering method with multiple interactive linked views.
SubVIS [75] is another VA system that assists clinical researchers in exploring and interpreting high-dimensional clinical data by integrating different subspace clustering techniques and an interactive visual interface. Similarly, Huang et al. [76] developed a VA system that supports the exploration of patient trajectories to help clinical researchers in identifying how a group of similar patients with a specific disease might develop other comorbidities over time. The system integrates frequency-based and hierarchical clustering techniques with a Sankey-like timeline to support clinical researchers in performing these tasks. Most of the existing EHR-based VA systems that have been developed to manage high-dimensional data use unsupervised learning techniques such as dimensionality reduction, principal component analysis, and clustering techniques. Although these techniques have shown great promise in addressing this issue, to the best of our knowledge, this problem has not been studied thoroughly using supervised techniques.

Materials and Methods
In this section, we describe the methods we used to design the proposed VA system, VERONICA.

Design Process and Participants
The design and development of VA systems is an integrated process that requires various sets of skills and expertise. In light of this, we adopted a participatory design approach to obtain the needs and requirements of the healthcare providers and to understand the real-world EHR-driven tasks that they perform. Participatory design is a co-operative approach that places users at the center of the design process. It is an iterative group effort that requires all the stakeholders to work together to ensure the system meets their expectations [35]. An epidemiologist, a clinician-scientist, a statistician, data scientists, and computer scientists were involved in the conceptualization, design, and evaluation of this VA system. It is important to optimize the communication between all stakeholders involved in the process because they might experience a language gap due to their different backgrounds. For instance, it is critical to ensure that the medical terms are comprehensible to the team members with a technical background and the motivations of the analysis process and design decisions are well-addressed across the team. In light of this, we asked healthcare experts to provide us with their formative feedback on different design decisions and a list of tasks they perform on EHRs. Multiple participatory design approaches are used to obtain the healthcare providers' needs and identify opportunities that can significantly improve VERONICA's performance through more effective visualizations and analysis techniques.

Data Sources
We formed a derivation cohort using large linked administrative healthcare databases held at ICES. We ascertained hospital and patient characteristics, outcome, and drug use from five administrative databases (see Table A1). These datasets were linked using unique encoded identifiers that were derived from health card numbers of patients and were analyzed at ICES. The Ontario Drug Benefit Program database is used to identify prescription drug use. This database contains highly accurate patient records of all outpatient prescriptions administered to patients aged 65 years or older, with an error rate of less than 1% [77]. We acquired vital statistics from the Ontario Registered Persons Database, which includes demographic data on all Ontarians who have ever been issued a health card. We identified baseline comorbidity data, ED visits, and hospital admission codes from the National Ambulatory Care Reporting System and the Canadian Institute for Health Information Discharge Abstract Database (hospitalizations). We used ICD-10 (i.e., International Classification of Diseases, post-2002) codes to identify hospital encounter codes and baseline comorbidities. In addition, baseline comorbidity data and health claims for physician services were acquired from the Ontario Health Insurance Plan database. All the coding definitions for the comorbidities are provided in Tables A2 and A3.

Cohort Entry Criteria
We created a cohort of patients aged 65 years or older who were admitted to a hospital or visited an emergency department (ED) between 2014 and 2016. The discharge date from the hospital or ED served as the index date, also referred to as the cohort entry date. If a patient had multiple ED visits and hospital admissions, we chose the first incident. Individuals with invalid data regarding age, sex, and the health-card number were excluded. In addition, we excluded individuals who: (1) previously received a kidney transplant or dialysis treatment as the assessment of acute kidney injury is usually no longer relevant in patients with end-stage kidney disease; (2) left the hospital or ED against medical advice or without being seen by a physician; and (3) had acute kidney injury recorded during their hospital admission or ED visit prior to hospital discharge, as acute kidney injury was already present prior to the follow-up period. The diagnosis codes for the exclusion criteria are shown in Table A4.

Response Variable
Acute Kidney Injury (AKI) is defined as a sudden deterioration of kidney function. It is associated with a lower chance of survival, prolonged hospital stays, subsequent morbidity after discharge, and incremental healthcare costs [78][79][80][81]. A system that detects early AKI or predicts its clinical manifestations with considerable lead-time allows healthcare experts to provide more effective treatments to prevent AKI. We build models to predict hospital admission with AKI within 90 days after being discharged from ED or hospital. The incidence of AKI is identified using the Canadian Institute for Health Information Discharge Abstract Database and National Ambulatory Care Reporting System based on ICD-10 diagnostic codes (i.e., "N17").

Input Features
The final cohort includes 162 unique features. These features can be classified into five groups: demographics, comorbidities, hospital encounter codes, general practitioner (GP) visits, and medications. The demographic group includes four features: age, sex, region, and income quintile. The comorbidity group contains ten known risk factors of AKI, including diabetes mellitus, chronic kidney disease, chronic liver disease, cerebrovascular disease, coronary artery disease, hypertension, major cancers, peripheral vascular disease, heart failure, and kidney stones. These comorbidity features are detected prior to index hospital admission or ED visit. We applied a 5-year look-back window to identify these features. The GP visit group contains twenty-three features that are identified based on the billing codes from the Ontario Health Insurance Plan database (Table 1).
The hospital encounter code group includes 1878 diagnostic codes that were detected during the index hospital admission and ED visit. The medication group consists of 595 medications prescribed to the patients within 120 days before the index date. We apply the chi-square test for feature selection on the hospital encounter code and medication groups and then filter the chosen features with a healthcare expert. Based on the result of the chi-square test, we select the seventy and fifty-five most significant features for the hospital encounter code and medication groups, respectively. The ten most important features in the hospital encounter code and medication groups are shown in Table 2.
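The chi-square ranking step can be sketched as follows for binary features and a binary outcome. This Python example is an illustration only: it uses the standard Pearson chi-square statistic for a 2x2 table without continuity correction (the exact variant used in the study is not stated), and the feature names and counts are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table.

    a: feature present, outcome present    b: feature present, outcome absent
    c: feature absent,  outcome present    d: feature absent,  outcome absent
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a constant feature or outcome carries no information
    return n * (a * d - b * c) ** 2 / denominator


def top_k_features(tables, k):
    """Rank features by chi-square statistic (descending) and keep the first k."""
    ranked = sorted(tables, key=lambda name: chi_square_2x2(*tables[name]), reverse=True)
    return ranked[:k]


# Hypothetical contingency counts (a, b, c, d) for three candidate features.
tables = {
    "code_A": (30, 70, 10, 90),
    "code_B": (20, 80, 20, 80),
    "code_C": (25, 75, 15, 85),
}

for name, counts in tables.items():
    print(name, chi_square_2x2(*counts))
print(top_k_features(tables, k=2))
```

A larger statistic indicates a stronger association between the feature and the outcome; the retained features would then still be reviewed by a healthcare expert, as described above.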

Implementation Details
VERONICA is implemented using HTML, the JavaScript library D3, and R packages. R is used to develop the Analytics module. HTML and D3 are used to build the interface and controls in the Interactive Visualization module. We implement the communication between these two modules using PHP and JavaScript.
We use R to develop different components of the Analytics module because it (1) offers support in performing various sampling and machine learning techniques, (2) is an open-source and platform-independent tool, (3) has several libraries, (4) is available in the ICES environment, and (5) has a large community and user forums.
D3 (Data-Driven Documents) is used to implement the interactive visualizations, and JavaScript is used to integrate data analytics with the visualizations. D3 (1) is an open-source JavaScript library that works with web standards, (2) provides users with the full capabilities of modern web browsers, (3) enables them to reuse JavaScript code and add different functionalities, and (4) is compatible with multiple platforms and other programming languages that are used in the implementation of VERONICA.

Workflow
As shown in Figure 1, VERONICA has two modules: Analytics and Interactive Visualization. The Analytics module utilizes the group structure of features stored in EHRs to identify the subset of feature groups that best represent the data in the prediction of AKI. The Interactive Visualization module maps the data items generated by the Analytics module into interactive visual representations to assist users in exploring the results. It supports six main interactions: (1) arranging, (2) drilling, (3) searching, (4) filtering, (5) transforming, and (6) selecting.
The basic workflow of VERONICA is as follows. First, we gather patient and hospital characteristics from five different databases stored at ICES. We then classify these features into five main groups, namely hospital encounter codes, comorbidities, GP visits, medications, and demographics. The features included in these groups are pre-processed and transformed into forms appropriate for the analysis. We then create all possible non-empty subsets of groups (i.e., thirty-one subsets), as shown in Figure 2. In the next step, we apply undersampling and SMOTE [67] to each subset to obtain two sampled datasets. Next, five machine learning techniques, namely CART, C5.0, random forest, naïve Bayes, and SVM, are applied to each sampled dataset, generating 310 prediction models. We use the area under the receiver operating characteristic curve (AUROC) to report the performance of these models.
To help users compare and explore the analytic results, we make them accessible to users through interactive visualizations. The Interactive Visualization module uses an interactive visual interface to show the results of the Analytics module. It allows users to explore the prediction models and compare their performance. The interface is supported by several controls, such as a search bar, selection buttons, and drop-down menus. Finally, several interactions are built into the system to allow users to manipulate the results.
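The model counts in this workflow follow directly from the combinatorics: five groups yield 2^5 - 1 = 31 non-empty subsets, and 31 subsets x 2 sampling techniques x 5 classifiers gives 310 models. The sketch below enumerates this grid in Python; it is an illustration of the bookkeeping only, and the variable and function names are hypothetical.

```python
from itertools import combinations, product

GROUPS = [
    "demographics",
    "comorbidities",
    "hospital encounter codes",
    "GP visits",
    "medications",
]
SAMPLERS = ["undersampling", "SMOTE"]
CLASSIFIERS = ["CART", "C5.0", "random forest", "naive Bayes", "SVM"]


def non_empty_subsets(groups):
    """Return every combination of one or more groups, smallest subsets first."""
    return [
        combo
        for size in range(1, len(groups) + 1)
        for combo in combinations(groups, size)
    ]


subsets = non_empty_subsets(GROUPS)
# One entry per (subset of groups, sampling technique, classifier) to be trained.
model_grid = list(product(subsets, SAMPLERS, CLASSIFIERS))

print(len(subsets), "subsets,", len(model_grid), "models")
```

Each entry of `model_grid` identifies one prediction model whose AUROC would be reported in the interface.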

The Design of VERONICA
We use VERONICA to identify the subset of groups that has the most substantial predictive power in the classification of AKI. VERONICA applies several machine learning techniques to each subset and allows exploration of the analysis results through interactive visualizations. In this section, we describe the two main components of the system. We explain how the data is processed and analyzed in the Analytics module. We then describe the Interactive Visualization module and how it assists users in the interpretation and exploration of the results.

Analytics Module
The Analytics module utilizes a representative set of machine learning and sampling techniques to identify the subset that best represents the data in identifying AKI. Three tree-based classifiers (CART, C5.0, and random forest), one kernel-based classifier (SVM), and one probabilistic classifier (naive Bayes) are used in this analysis. In this section, we explain how these techniques can be employed to analyze the data.

We classify the features stored in our clinical dataset into five main groups based on domain knowledge, namely demographics, comorbidities, medications, hospital encounter codes, and GP visits. For each feature included in these groups, the last recorded value before the index date is chosen. The features in the comorbidity, medication, hospital encounter code, and GP visit groups are set to either "Y" or "N". If an individual is prescribed a medication or has a comorbid condition, then the corresponding feature is set to "Y". If there is evidence of a particular hospital encounter code for a patient, we set its corresponding value to "Y". We create multiple dummy variables for the age feature, where each variable represents a specific age range. If a patient's age lies within a specified range, then the corresponding variable is set to "1". The region feature takes either "R" or "U", representing rural or urban, respectively. The sex feature takes either "M" or "F" for males and females, respectively. The income feature takes an integer value from 1 to 5 to represent the income quintile. All features included in the cohort are transformed into a scale and format suitable for further analysis by machine learning techniques.
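As a rough illustration of the encoding described above, the following Python sketch converts one patient record into model-ready features. It is not the authors' code (their pipeline is implemented in R). The field names, the age ranges, the "0" value for a non-matching age range, and the `encode_patient` helper are all hypothetical, since the paper does not list the actual column names or age cut-offs.

```python
# Hypothetical age ranges (lower bound inclusive, upper bound exclusive);
# the paper does not state the actual cut-offs.
AGE_RANGES = [(0, 40), (40, 65), (65, 80), (80, 200)]


def encode_patient(record):
    """Encode one raw patient record (a dict) following the described scheme."""
    encoded = {}

    # One dummy variable per age range: "1" if the age lies in the range.
    # Using "0" otherwise is an assumption of this sketch.
    for low, high in AGE_RANGES:
        in_range = low <= record["age"] < high
        encoded[f"age_{low}_{high}"] = "1" if in_range else "0"

    # Demographic features keep their categorical codes.
    encoded["sex"] = record["sex"]        # "M" or "F"
    encoded["region"] = record["region"]  # "R" (rural) or "U" (urban)
    encoded["income_quintile"] = int(record["income_quintile"])  # 1 to 5

    # Comorbidity, medication, encounter-code, and GP-visit features
    # become "Y" (evidence present) or "N" (absent).
    for name, present in record["flags"].items():
        encoded[name] = "Y" if present else "N"

    return encoded


# Hypothetical example record.
example = {
    "age": 72,
    "sex": "F",
    "region": "U",
    "income_quintile": 3,
    "flags": {"diabetes": True, "nsaid": False},
}
print(encode_patient(example))
```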
A total of 905,442 participants are included in the final cohort, of whom 5993 experienced AKI after being discharged from the index encounter. This dataset has an imbalanced class distribution: the negative class (i.e., non-AKI) contains 899,449 patients, whereas the positive class contains only 5993. The proposed system supports a number of sampling techniques, such as random oversampling, Borderline-SMOTE [82], and Adaptive Synthetic Sampling. In this paper, we use undersampling and SMOTE. We configure these techniques so that the number of positive cases equals the number of negative cases. We use the DMwR package in R to implement the SMOTE algorithm. The "k" (i.e., number of nearest neighbors) and "perc.over" parameters of the SMOTE algorithm are set to 5 and 100, respectively.
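The two sampling strategies can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the DMwR implementation used in the paper: it handles numeric features only, uses Euclidean distance, and generates one synthetic case per minority case (which roughly corresponds to 100% oversampling). The helper names `undersample` and `smote` and the toy data are hypothetical.

```python
import math
import random


def undersample(majority, minority, rng):
    """Randomly keep as many majority rows as there are minority rows."""
    return rng.sample(majority, len(minority)), list(minority)


def smote(minority, k, per_case, rng):
    """Create `per_case` synthetic rows for every minority row.

    Each synthetic row lies on the line segment between a minority row
    and one of its k nearest minority neighbours (Euclidean distance).
    """
    synthetic = []
    for i, row in enumerate(minority):
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: math.dist(row, m))[:k]
        for _ in range(per_case):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position along the segment, in [0, 1)
            synthetic.append(
                tuple(a + gap * (b - a) for a, b in zip(row, neighbour))
            )
    return synthetic


# Toy two-feature data: 200 majority rows and 10 minority rows.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(10)]

kept_majority, kept_minority = undersample(majority, minority, rng)
new_rows = smote(minority, k=5, per_case=1, rng=rng)

print(len(kept_majority), len(kept_minority))  # class sizes after undersampling
print(len(minority) + len(new_rows))           # minority size after SMOTE
```

Undersampling discards majority rows, whereas SMOTE adds interpolated minority rows; both move the class ratio toward balance from opposite directions.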
To develop the prediction models, we first split the dataset into training and test sets, which include 903,442 and 2000 cases, respectively. In the next step, we create every possible non-empty subset of groups, as shown in Figure 2. The total number of subsets is $2^5 - 1 = 31$, where 5 is the number of groups. We then apply both undersampling and SMOTE to each subset to obtain two sampled datasets. We develop ten prediction models for each subset by applying five machine learning techniques, namely CART, C5.0, random forest, naive Bayes, and SVM, to the two sampled datasets. We thus create a total of $31 \times 2 \times 5 = 310$ models, where 31, 2, and 5 are the number of subsets, sampling approaches, and machine learning techniques, respectively. In each model, AKI is the response variable and all features included in the subset are predictor variables. The CART and C5.0 classifiers are implemented using the "rpart" and "C50" packages in R, respectively. We use the "e1071" package in R to implement naive Bayes and SVM with a radial kernel (kernel = "radial"). Random forest is implemented using the "randomForest" package in R with fifty trees (i.e., ntree = 50).
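The subset and model counts above can be checked with a short enumeration. The Python sketch below only builds the grid of model configurations; it does not train anything. The single-letter codes follow the first-letter convention the paper uses for group names, and the letter order inside each combination code here is an assumption (the interface may order the letters differently).

```python
from itertools import combinations, product

GROUPS = ["C", "D", "G", "H", "M"]  # first letters of the five group names
SAMPLING = ["undersampling", "SMOTE"]
CLASSIFIERS = ["CART", "C5.0", "random forest", "naive Bayes", "SVM"]

# Every non-empty subset of the five groups: 2^5 - 1 = 31.
subsets = [
    "".join(combo)
    for size in range(1, len(GROUPS) + 1)
    for combo in combinations(GROUPS, size)
]

# One model per (subset, sampling approach, classifier) triple.
model_grid = list(product(subsets, SAMPLING, CLASSIFIERS))

print(len(subsets))     # 31
print(len(model_grid))  # 310
```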
We compare the performance of all the generated models using AUROC [83,84]. An ROC curve shows the trade-off between sensitivity and specificity across different decision thresholds. Sensitivity is the proportion of patients who develop the outcome that a test correctly classifies as "at-risk". Specificity, on the other hand, is the proportion of patients who do not develop the outcome that a test correctly classifies as "risk-free" [85]. Among the generated models, the AUROC for the classification of AKI ranges from 0.51 to 0.89.
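To make these quantities concrete, the following Python sketch computes sensitivity and specificity at one threshold, and the AUROC through its pairwise interpretation (the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case). The function names and the toy labels and scores are hypothetical; the paper does not describe how its AUROC values were computed in code.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when score >= threshold means 'at-risk'."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: 1 = developed the outcome, 0 = did not.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

print(auroc(labels, scores))
print(sensitivity_specificity(labels, scores, threshold=0.5))
```

The pairwise form is quadratic in the number of cases, so it suits small illustrations; rank-based formulas give the same value more efficiently on large test sets.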
In total, VERONICA generates 310 models that are built by applying five machine learning techniques mentioned above on two sampled datasets (i.e., undersampled and SMOTE) for each subset. As a result, a large number of models and subsets are generated, which makes it difficult for users to understand the results. To overcome this issue, the data items generated by the Analytics module are made available to users through an interactive visual interface.

Interactive Visualization Module
The Interactive Visualization module is composed of an interactive visual interface and several selection controls, such as a search bar, drop-down menus, and selection buttons. In this section, we explain how the data items produced by the Analytics module and the subsets of groups are mapped to visual representations that allow users to accomplish various tasks.
As shown in Figure 3, groups of features (i.e., comorbidities, demographics, medications, hospital encounter codes, and GP visits) and their subsets are represented by a two-layer graph structure. In the first layer, the group nodes are represented by color-coded rectangles, where each rectangle is labeled with a code corresponding to the first letter of its group's name (Table 3). For instance, the rectangle representing the comorbidity group is color-coded in pink and is labeled with "C". The second layer includes all the nodes representing subsets of groups, where each node consists of a grey circle and a combination code in text format. The combination code for each subset contains the first letters of all the groups that are included in the subset. For instance, as shown in Figure 3, the first grey circle from the top represents the subset of all groups, and it is labeled with "MHDGC". The connections between the nodes in the first and second layers are shown by color-coded links, where each link's color is identical to the color of its corresponding group node. Two nodes from the first and second layers are connected if the node in the first layer (i.e., group node) is one of the groups that make up the node in the second layer (i.e., subset node).
Table 3. Groups and their representing codes.

| Groups | Codes |
|---|---|
| Comorbidities | "C" |
| Demographics | "D" |
| GP visits | "G" |
| Hospital encounter codes | "H" |
| Medications | "M" |

VERONICA uses a sortable heatmap to show the results of the Analytics module, as shown in Figure 3. It enables users to compare the performance of the generated models by placing the analysis techniques in the columns and the subsets of groups in the rows. Each cell in the heatmap includes a color-coded numerical value representing the AUROC achieved by applying the analysis technique in the cell's column to the subset in its row. The color of the cells of the heatmap is light grey by default. However, through different interactions, users can observe a cell's color based on the value of the test AUROC corresponding to that cell. This color-coding is based on two gradient scales. The first gradient scale blends different shades of green and represents all the cells corresponding to models whose AUROC is greater than 0.8. Because most of the models are densely clustered between 0.8 and 0.9, this range is given its own scale. The second scale blends different shades of blue and represents all the cells corresponding to models whose AUROC is less than 0.8. We included a legend to assist users in interpreting the heatmap based on these gradient scales. There is also a help button ("?") located to the right of the legend that provides users with additional information on how to interact with the heatmap.
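The two-scale color-coding can be sketched as a small mapping function. The Python example below is illustrative only: the end points of each scale (0.5 to 0.8 for blue, 0.8 to 1.0 for green), the treatment of an AUROC of exactly 0.8 as blue, and the `cell_color` helper itself are assumptions, since the paper only states which side of 0.8 each scale covers.

```python
def cell_color(auroc, active):
    """Return (scale, intensity) for one heatmap cell.

    Inactive cells keep the default light grey. Active cells use the green
    scale when AUROC > 0.8 and the blue scale otherwise. Intensity is a
    number in [0, 1] locating the value within its scale.
    """
    if not active:
        return ("light grey", None)
    if auroc > 0.8:
        scale, low, high = "green", 0.8, 1.0  # assumed end points
    else:
        scale, low, high = "blue", 0.5, 0.8   # assumed end points
    intensity = (auroc - low) / (high - low)
    return (scale, min(max(intensity, 0.0), 1.0))


for value in (0.51, 0.79, 0.85, 0.89):
    print(value, cell_color(value, active=True))
print(cell_color(0.85, active=False))
```

Splitting the range at 0.8 lets the green scale spend its whole gradient on the narrow band where most models fall, which is one plausible reading of why two scales are used.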
Users can hover the mouse over any rectangle representing a group to highlight all the subset nodes that include the hovered group, the links connecting the hovered group and the highlighted subsets, and the cells corresponding to the highlighted subsets (Figure 4A). In addition, VERONICA allows users to select group nodes by clicking on their corresponding rectangles (Figure 4B). The system then highlights all the subset nodes that contain all the groups corresponding to the selected rectangles, the links connecting the selected groups and the highlighted subsets, and the rows of cells corresponding to the highlighted circles. To get additional information, users can move their mouse over the circles representing subsets to bring up tooltips. Furthermore, the system enables users to select any number of subsets by clicking on their corresponding circles. This interaction highlights all the cells corresponding to the selected subsets, the group nodes included in the selected subsets, and the links connecting the selected subset nodes and the highlighted group nodes (Figure 5).
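The highlighting rules above follow directly from set membership in the two-layer graph. The Python sketch below models that logic only, without any rendering; the function names are hypothetical, and the interface's actual implementation is not described in the paper.

```python
from itertools import combinations

GROUPS = "MHDGC"  # one letter per group node

# Second-layer nodes: every non-empty subset of the group letters.
subsets = [
    frozenset(combo)
    for size in range(1, len(GROUPS) + 1)
    for combo in combinations(GROUPS, size)
]

# A link joins a group node to each subset node that includes it.
links = [(group, subset) for subset in subsets for group in subset]


def subsets_for_groups(selected_groups):
    """Subset nodes highlighted when these group rectangles are selected:
    those containing ALL of the selected groups."""
    wanted = set(selected_groups)
    return [s for s in subsets if wanted <= s]


def groups_for_subset(subset):
    """Group nodes highlighted when one subset circle is selected."""
    return sorted(subset)


print(len(links))                      # links in the two-layer graph
print(len(subsets_for_groups("M")))    # hovering or selecting one group
print(len(subsets_for_groups("MC")))   # selecting two groups
print(groups_for_subset(frozenset("MH")))
```

Selecting more groups narrows the highlighted set, because a subset must contain every selected group to qualify.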
Users can observe the performance of different analysis techniques by clicking on the circles representing the combinations of sampling and machine learning techniques. This interaction highlights all the cells in the selected column of the heatmap. When a circle is selected, its color changes to dark blue. As shown in Figure 6, when several subset nodes (or group nodes) and circles representing analysis techniques are selected simultaneously, the color of every cell whose row and column are both selected changes based on the gradient scales mentioned above (i.e., shades of green or blue based on the value of the cell's AUROC).
Users can also hover the mouse over the cells in the heatmap to highlight the labels and circles corresponding to the hovered cell. Additionally, this interaction changes the cell's color based on its corresponding AUROC value. The system enables users to sort the cells by rows and columns based on their corresponding AUROC values by clicking on the pink sort icons. For instance, cells in the heatmap are sorted by the "MHGDC" subset and the "undersampling-SVM" technique in Figure 6.
The horizontal and vertical groups of "Select All" and "Deselect All" buttons in the top-left corner of the heatmap allow users to select or deselect all the subsets and techniques. These buttons help users get an overview of the performance of all models without selecting each circle individually. VERONICA also provides users with a search bar and four drop-down menus in the top-left corner of the screen. If users are interested in a specific subset, they can enter the combination code corresponding to that subset in the search bar to change its color from black to green in the interface. In addition, when users hover their mouse over the help button placed beside the search bar, a tooltip appears with information on how to use the search bar.
The drop-down menus allow users to interactively filter subsets and techniques based on different criteria. This gives users great flexibility to focus on the data points of interest. The drop-down menus provide filtering based on groups, sampling techniques, machine learning techniques, and subsets from top to bottom, respectively. Each drop-down menu provides users with several options to choose from using radio buttons. The "Groups" menu allows users to focus on a specific group of features. If users select a group, the system displays only the subsets that contain the chosen group. For instance, Figure 7 shows how the system updates the interface if the "Medications" option is chosen from the menu. The "Sampling Techniques" and "Machine Learning Techniques" menus allow users to filter the columns of the heatmap based on sampling and machine learning techniques, respectively. For instance, if users are interested in learning how a specific combination of sampling and machine learning techniques, such as SMOTE and random forest, performs, they can select them in the second and third drop-down menus, respectively, as shown in Figure 8. The "Subsets" menu provides users with an option to compare all models that include a specific number of groups. For instance, if users are interested in comparing the performance of all the techniques on subsets that include exactly two groups, they can choose "Subsets of Two" from the last menu (Figure 9). Users can filter data points based on several criteria at once by choosing an option from each menu (Figure 10). All these menus give users an option to reset the interface to show all groups, subsets, and techniques. Additionally, if users select any groups, subsets, or techniques, the system restores all the selections when it gets updated using any of the drop-down menus.
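The combined effect of the four menus can be thought of as a conjunction of simple predicates over the heatmap cells. The Python sketch below illustrates this under assumed names (`filter_cells` and its keyword arguments are hypothetical); it filters configurations only and says nothing about how the interface itself is implemented.

```python
from itertools import combinations, product

GROUPS = ["C", "D", "G", "H", "M"]
SAMPLING = ["undersampling", "SMOTE"]
CLASSIFIERS = ["CART", "C5.0", "random forest", "naive Bayes", "SVM"]

# One heatmap cell per (subset, sampling technique, classifier) triple.
cells = [
    (frozenset(combo), sampling, classifier)
    for size in range(1, len(GROUPS) + 1)
    for combo in combinations(GROUPS, size)
    for sampling, classifier in product(SAMPLING, CLASSIFIERS)
]


def filter_cells(cells, group=None, sampling=None, classifier=None, size=None):
    """Keep the cells that satisfy every chosen menu option.

    Passing None for a criterion corresponds to resetting that menu.
    """
    return [
        (subset, samp, clf)
        for subset, samp, clf in cells
        if (group is None or group in subset)
        and (sampling is None or samp == sampling)
        and (classifier is None or clf == classifier)
        and (size is None or len(subset) == size)
    ]


# Example: "Comorbidities", "Random Forest", and "Subsets of Three".
shown = filter_cells(cells, group="C", classifier="random forest", size=3)
print(len(cells), len(shown))
```

With these three options, six subsets of size three contain the comorbidity group, and each is paired with random forest under both sampling techniques.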

Figure 10. How the system gets updated when users select "Comorbidities", "Random Forest", and "Subsets of Three" from the "Groups", "Machine Learning Techniques", and "Subsets" drop-down menus.

Limitations
This tool should be evaluated with respect to six limitations. The first limitation relates to the use of undersampling. The main issue with this sampling approach is that it results in the loss of potentially useful data that could be essential for the induction process. The second limitation is that the system only supports a limited number of data mining and sampling techniques. Third, the system is designed for imbalanced datasets; the sampling techniques are unnecessary if the dataset is balanced. Fourth, most of the guidelines for AKI diagnosis rely on an increase in serum creatinine as a gold standard. However, these guidelines need a premorbid serum creatinine value to be used as a baseline, which was not available for all patients in this research. Therefore, the episode of AKI was identified using the ICD-10 code. The fifth limitation is that although the healthcare experts at ICES have found VERONICA helpful and usable through the participatory design process, we have not conducted a formal study to evaluate the system's performance or the efficiency of its user-information discourse mechanism. Finally, the system only accepts a complete dataset that is correctly labeled because it does not incorporate any active learning mechanisms.

Conclusions and Future Work
In this paper, we demonstrate how VA systems can be designed to address the challenges stemming from high-dimensional EHRs and to systematically identify the subset of feature groups with the most predictive power in the classification of AKI. To accomplish this, we have reported the development of VERONICA, a VA system designed to assist healthcare providers at ICES' KDT program. VERONICA incorporates two components: the Analytics and Interactive Visualization modules. The Analytics module identifies the best representative subset of data in detecting the patients at high risk of developing AKI using different sampling and machine learning techniques. It incorporates two sampling techniques: undersampling and SMOTE. It also uses a representative set of machine learning techniques, including CART, C5.0, random forest, SVM, and naive Bayes. Our clinical dataset includes comorbidities, demographics, hospital encounter codes, GP visits, and medications. The system generates a large number of prediction models by applying the sampling and machine learning techniques mentioned above to each subset. The performance of all the generated models is reported using AUROC. The system enables users to access, explore, and compare these models through interactive visualizations. The Interactive Visualization module is composed of an interactive visual interface and several selection controls, such as a search bar, drop-down menus, and selection buttons. The interactive visual interface assists users in the exploration of the analytic results by providing them with several interactions, such as arranging, drilling, searching, filtering, transforming, and selecting.
In terms of VERONICA's scalability and extensibility, we design it in a modular way so that it can accept new data sources and sampling and machine learning techniques. VERONICA can be used to analyze high-dimensional datasets in many other domains, such as insurance, bioinformatics, and finance, where the features included in the dataset have a group structure.
Future research directions include (but are not limited to) the following. Further research is needed to effectively evaluate the performance of the system by comparing it with other standard feature selection techniques. In addition, we plan to measure the effectiveness of the system for different datasets that support natural groupings. Furthermore, since the proposed system is developed in an access-restricted virtual machine [20,86], we could not evaluate the system's scalability. Thus, further efforts are needed to assess VERONICA more comprehensively by conducting formal studies.

Informed Consent Statement:
ICES is a prescribed entity under PHIPA. Section 45 of PHIPA authorises ICES to collect personal health information, without consent, for the purpose of analysis or compiling statistical information with respect to the management of, evaluation or monitoring of, the allocation of resources to or planning for all or part of the health system.

Data Availability Statement:
The study dataset is held securely in coded form at ICES. While legal data sharing agreements between ICES and data providers (e.g., healthcare organisations and government) prohibit ICES from making the dataset publicly available, access might be granted to those who meet prespecified criteria for confidential access, available at www.ices.on.ca/DAS (email das@ices.on.ca). The full dataset creation plan and underlying analytic code are available from the authors upon request, understanding that the computer programs might rely upon coding templates or macros that are unique to ICES and are therefore either inaccessible or require modification.

Table A1. List of databases held at ICES (an independent, non-profit, world-leading research organization that uses population-based health and social data to produce knowledge on a broad range of healthcare issues).

I62, I630, I631, I632, I633, I634, I635, I638, I639, I64, H341, I600, I601, I602, I603, I604, I605, I606, I607, I609, I61, G450, G451, G452