Analysis of Missingness Scenarios for Observational Health Data

Simple Summary This paper argues the importance of considering domain knowledge when dealing with missing data in healthcare. We identify fundamental missingness scenarios in healthcare facilities and show how they impact the missing data analysis methods. Abstract Despite the extensive literature on missing data theory and cautionary articles emphasizing the importance of realistic analysis for healthcare data, a critical gap persists in incorporating domain knowledge into the missing data methods. In this paper, we argue that the remedy is to identify the key scenarios that lead to data missingness and investigate their theoretical implications. Based on this proposal, we first introduce an analysis framework where we investigate how different observation agents, such as physicians, influence the data availability and then scrutinize each scenario with respect to the steps in the missing data analysis. We apply this framework to the case study of observational data in healthcare facilities. We identify ten fundamental missingness scenarios and show how they influence the identification step for missing data graphical models, inverse probability weighting estimation, and exponential tilting sensitivity analysis. To emphasize how domain-informed analysis can improve method reliability, we conduct simulation studies under the influence of various missingness scenarios. We compare the results of three common methods in medical data analysis: complete-case analysis, Missforest imputation, and inverse probability weighting estimation. The experiments are conducted for two objectives: variable mean estimation and classification accuracy. We advocate for our analysis approach as a reference for the observational health data analysis. Beyond that, we also posit that the proposed analysis framework is applicable to other medical domains.


Introduction
Healthcare data encompass a wide range of variables related to various diseases and health conditions collected from different facilities under the supervision of distinct and even contrasting guidelines.It is, therefore, naive to expect a medical dataset in which a sufficient number of informative variables are available for all patients.This is why data-driven research in healthcare almost always faces the challenges of missing data.
As two out of many existing approaches, data scientists may choose to simply discard the incomplete data samples and use only the complete ones for analysis (complete-case analysis) or utilize complex deep learning models to fill in the missing entries and create a complete semi-synthetic dataset (imputation).In any case, the reliability of the methods depends on the nature of the missingness problem, e.g., how a variable distribution changes when the variable is observed or missed, or why physicians decide to measure a variable for a patient and not the other.Inevitably, we must make assumptions about these questions, which are often not testable using the data alone [1,2].Hence, we can trust the analysis result only if the assumptions are made explicit, the method conforms to them, and the sensitivity of the results to departures are investigated [1].
Even though the scholarly discourse on missing data consistently highlights this concern [3][4][5][6][7], surprisingly, the applied machine learning (ML) research in healthcare often suffers from ambiguous missingness problem statements [6] and the lack of transparent reports of assumptions for methods which are detached from the reality of healthcare facilities [4].The complete randomness of missingness is assumed even though observations in healthcare facilities are clearly conducted by guidelines and protocols [6,7].Missing data methods are often chosen inconsistently when training and evaluating prediction models [8].The problem intensifies, as the survey and review articles on missing data in healthcare [5,[9][10][11][12] are often limited to merely imputing the medical datasets without the scrutiny of assumptions.
We believe that the remedy is a formalization of the events and scenarios within healthcare facilities that result in missing data.Having a curated catalog of common and influential scenarios along with their theoretical implications for the missing data analysis, medical data scientists can examine the data collection environment of their datasets.They can determine which scenarios apply to their situation, and tailor their analysis accordingly.
In this regard, this paper introduces a framework for identifying missingness scenarios in a data collection environment and translating them to the language of missing data theory.We apply this framework to the healthcare domain, introducing and analyzing ten fundamental missingness scenarios for observational data in healthcare facilities.

Contribution and Scope
Addressing a theory-application gap in healthcare, this paper falls under the translational research category.We mainly focus on missingness scenarios in healthcare facilities such as clinics and hospitals.We exclude planned clinical studies, as they comprise considerably different scenarios (e.g., case drop-out and planned missingness).We achieve the goal of the paper through the following steps:

•
We introduce an analysis framework for identifying and analyzing the missingness scenarios in different domains of application.

•
We introduce ten fundamental and prevalent scenarios in healthcare facilities that lead to the observation, recording, or missingness of data.Table 1 gives an overview of the scenarios.

•
To make the theory-application connection, we introduce theoretical inquiries to be made about each missingness scenario.Table 2 gives an overview of the inquiries.

•
For each scenario, we make the inquiries above and analyze the theoretical implications, along with various examples from the medical data analysis literature.

•
To demonstrate the effect of domain-informed assumption on the method reliability, we perform a simulation study, showing how domain-agnostic analysis may lead to different levels of bias depending on the active scenarios and different estimands.
We propose this paper as a reference point for correct data analysis and reporting using observational health data.The methodology of this paper is not limited to the selected missingness scenarios and can be applied to less common yet equally essential scenarios that the readers may encounter.For the scope of the paper, we mainly focus on problem formulation and missing data assumptions, minimally going into detailed discussions about the implementation of missing data algorithms and methods.Nevertheless, when required, we provide sufficient references to influential works in the missing data literature.

Structure
To present a clear image of the problem at hand and the existing research gap, we introduce the general ideal approach to missing data analysis and highlight the influential body of works regarding this approach in Section 2. The methodology of the paper, presented in Section 3, is divided into three separate parts: After introducing the analysis framework in Section 3.1, first, we identify the key missingness scenarios in healthcare facilities in Section 3.2 and then analyze the scenarios within the framework of missing data theory in Section 3.3.We conduct experiments to show the significance of domain-informed missing data analysis in Section 4. Discussions, limitations, and proposals for future works are presented in Section 5.

An Ideal Approach to Missing Data Analysis
The ideal approach to the analysis with missing data comprises a correct formulation of the analysis target, a correct estimation of it, and finally investigating the effect of possible errors in the formulation and estimation steps.This approach, established and adopted in the missing data literature [2,13] makes a common skeleton for the missing data methodologies.It is only in the details of each step of this approach where the methodologies differ.We elaborate on the approach, depicted in Figure 1, as follows: 1.
Analysis begins by formulating the objective (e.g., risk factors for a disease) as a mathematical expression, namely an estimand, which should be learned using incomplete data.

2.
Next, assumptions about the missingness status of the variables in the estimand are collected.Such information could include, for example, how other variables influence the missingness in another variable, or which model is the most suitable for describing the chance of observation or missingness.As depicted in Figure 1, the assumptions must be directly informed by the actual setting of the problem at hand.

3.
In light of the assumptions, it is determined whether and how the estimand can be learned from the incomplete data.This step is referred to as identification (or recovery), and the estimand is called identifiable if the answer to the identification step is positive.The identification property is the direct product of the objective and the problem setting.It may be the case that an objective cannot be recovered from the available data.A trivial example of unidentifiability is to estimate the mean value of a variable while it is completely unavailable.4.
The next step is to learn the estimand using the available data, e.g., by employing ML methods.This step is referred to as estimation.The right method and learning setting is also influenced by the problem assumptions.

5.
Finally, to increase the reliability and robustness of the results, the results' sensitivity to model perturbations and violation of assumptions is measured (sensitivity analysis).Similar to steps 3 and 4, this step is also influenced by the assumptions when choosing meaningful perturbation ranges according to the problem at hand.

Research Gap
Related to the introduced ideal approach in the previous section, Table 3 presents examples of related works supporting each analysis step: (1) correct formulation of different estimands, (3) identification theory for missing data problems, (4) efficient estimation for missing data, and (5) sensitivity analysis.
Regarding step 2 (assumption specification), which is the focus of this paper, cautionary articles exist for advocating domain-informed analysis [3,5,7], and providing high-level strategies for correct interpretation and reporting of missingness scenarios [4,14].Several papers have performed such domain-informed analyses (for example, Mirkes et al. [15] and Millard et al. [16]), yet only for specific diagnoses (trauma and COVID-19), limited to unique missingness complications (selection bias) and estimands (outcome prediction).To our knowledge, no comprehensive taxonomy and analysis of missingness scenarios exists for observational data in healthcare facilities.
Our paper is particularly inspired by Moreno-Betancur et al. [17] and Marino et al. [18].These works develop general guidelines for treating missing data in epidemiology and clinical (point-exposure) studies, mainly focusing on missingness scenarios and circumstances that suit these settings.To name a few, they consider the case dropout, where participants are removed from the remaining of the study if they miss a measurement session, and the assumption of no hidden confounders, which is the product of a curated study design.As outlined in Section 3.2, different and more diverse reasons for missingness exist in healthcare facilities, where patient admission, data observation, and data collection, do not always follow a designed plan as they would in clinical studies.
Despite this difference, their motivation and approach for identifying and analyzing missingness scenarios remain relevant for observational health data-a foundational principle upon which this paper is built.The increasing need for developing missing data methods is reflected in the findings of the survey and systematic review papers, including that of Ismail et al. [12], which reported a three-fold increase in the publication of imputation techniques for machine learning algorithms alone.This shows the need for a clear guideline for reliable missing data analysis in all healthcare domains, including that of observational data within healthcare facilities.Sensitivity analysis for unidentifiable missingness [33][34][35][36] * Research gap.

Analysis Framework
Developing the missingness scenario-to-theory mapping would first require knowing what lies at both ends of the mapping.Then, the mapping itself must be developed, determining the theoretical implications of every scenario.
To identify scenarios, first, we investigate the steps through which the data end up in the analysis dataset: 1.
First, a piece of evidence is obtained through observation or measurement; 2.
After observation, the data must be recorded in the database which is the source of the analysis dataset; 3.
Finally, the recorded data must be retrieved for creating the dataset and kept (not removed) for data analysis.
An interruption in any of these steps will lead to missing data (unavailability), i.e., when data are not observed, not recorded, or not included for the analysis.
Depending on the process, these three steps may occur naturally until a specific event interrupts them.For instance, occupation information is normally obtained and recorded unless patients refuse to provide it, while MRI images are generally not available for patients unless the attending physicians order them.Therefore, depending on the features of the data and the data collection environment, we must investigate either the scenarios that lead to the interruption of usual data availability, or actively making the data available despite the default conditions.
Another point of investigation concerns the agents that are responsible and directly influence data observation, recording, and inclusion.We identify four agents for the case study of healthcare facilities: (1) patients, (2) attending physicians and medical staff, (3) healthcare facility elements, including the medical devices, medical data software, etc., and (4) medical data scientists, analyzing the resulting datasets.
Cross-investigation of the data availability steps results in identifying the missingness scenarios.In particular, one shall study whether and how an agent influences data observation, recording, and inclusion in the dataset.This includes either the interruption of a usual availability process, or actively deciding to observe or collect the data.The crossinvestigation can be performed for any type of case study.In the scope of this paper, we apply it to the domain of observational data from healthcare facilities.Section 3.2 provides the result of the investigation.As a summary, Table 1 presents the identified scenarios.
Once the scenarios are extracted, we must scrutinize each to find out their implication for many assumptions that restrict the effectiveness of the analysis steps, namely, identification, estimation, and sensitivity analysis.We do so by collecting a set of inquiries made about each scenario, which influence the assumptions in the analysis step.Clearly, different methods may require different inquiries.Therefore, the analysis presented in this paper might slightly differ if the reader considers different identification, estimation, or sensitivity analysis methods.Within the scope of this paper, we choose widely utilized and well-established methodologies for the sake of wider applicability; the methodologies are identification theory for graphical models, inverse probability weighting estimation, and exponential tilting sensitivity analysis.They will be introduced in detail in Section 3.3.As a summary, Table 2 presents the inquiries analyzed in this paper.Figure 2 presents a schematic of the analysis framework.A schematic of the analysis framework in this paper: on the left, a cross-investigation of the data availability steps and agents guides us toward the existing scenarios (blue dots).Each scenario then will be subject to inquiries about various missing data assumptions on the right.

Missingness Scenarios in Healthcare Data
Observational data from healthcare facilities, such as clinics and hospitals, comprise information about outpatient and inpatient (hospitalization) visits.Variables in healthcare data include patient demographic information (e.g., age and gender), medical history, signs and symptoms (e.g., blood pressure value and pain symptom), lab tests (e.g., blood chemistry test), diagnoses, and prescribed medications.The variable list is extended for inpatient visits, including higher resolution observations and prescription information (e.g., oxygen saturation from bedside monitoring, and the input/output chart).In addition, new variables are collected during different hospitalization modes, such as ICU hospitalization.For more information about collected observations in healthcare facilities, see the documentation of publicly available datasets such as the MIMIC-IV electronic health record dataset [37].
Health variables are available through various data observation, recording, and inclusion scenarios that are influenced by the four agents introduced above.In this section, we explore ten fundamental and prevalent scenarios that drive the data-availability steps in healthcare facilities.For each scenario, we provide real-world examples within the text, as well as in Appendix A. The majority of examples are extracted from the clinical prediction model (CPM) literature, introduced by Tsvetanova et al. [8].CPMs make a crucial portion of AI applications in healthcare, and this work has performed an extensive investigation to highlight highly relevant and well-known CPM models.

Scenarios Related to Patients
By default, data are only collected during a patient's visit to a healthcare facility, resulting in a gap in data for the time between visits.At the same time, the sub-population that has not visited the healthcare facility is not observed at all.These are situations where we inevitably encounter the missing data problem unless additional data sources are used to complement the primary dataset.
It is naive to assume that visits are decided randomly and unrelated to the patient's health condition.This motivates the following key missingness scenario: Scenario 1 (Patient complete non-visit): Sub-populations with no healthcare facility visits during data collection are not included in the dataset.
In Scenario 1, a specific sub-population is completely missing, e.g., due to health status.For instance, healthy people with no serious health complications infrequently visit clinics with different intentions such as preventive check-ups.Likewise, the data of patients deceased before any visit (e.g., dead-on-arrival) are often absent from the facility database.Other factors, such as socioeconomic status, can also influence the non-visit.References to Scenario 1 are presented in Appendix A.1.
Scenario 2 describes another type of non-visit, namely missing follow-ups for patients with at least one recorded visit.
Scenario 2 (Missing follow-up visit due to health status): Patients may miss a follow-up visit due to death, facility transfer, or if they decide not to continue the treatment.
The difference between Scenarios 1 and 2 lies mainly in the reasons for missingness; patients have potentially different reasons not to visit a healthcare facility for the first time, or to drop the follow-up visits, possibly despite the physician's recommendations.References to Scenario 2 are presented in Appendix A.2.
Health status factors influence not only the visits but also measurements during the visits.As highlighted by Scenario 3, location transfer within the facility due to health conditions influences the observed variables.Scenario 3 (Missing measurements due to health-related events during hospitalization): Observations may be interrupted or limited by extreme health conditions or transfer to a different location.
In Scenario 3, observations may be interrupted due to events such as the occurrence of code blue and the resulting disconnection of devices for resuscitation [11,38], or patient transfer, e.g., to the operation room or ICU ward [14].These events may also lead to observing new health variables that had not been recorded prior to the event, e.g., monitoring during operation [37].References to Scenario 3 are presented in Appendix A. 3.
Another reason for missingness in a variable is the patient's refusal to take a test or consent to data sharing as stated by Scenario 4. Scenario 4 (Patient's refusal): Patients may actively refuse specific observations or decline consent to data sharing.
Overall, patients' personal decisions, whether for medical (e.g., pain intolerance) or non-medical (e.g., fear of examination) reasons, may induce missingness.References to Scenario 4 are presented in Appendix A.4.

Scenarios Related to Physicians
During a visit, the attending physicians decide which variables to observe.Bickley and Szilagyi [39] describe the examination and diagnosis practice as a step-by-step process in which physicians use basic observations such as vital signs and symptoms to form a first diagnostic belief, referred to as impression.To prove or rule out the possible diagnoses, physicians then order more specific, expensive, and sometimes invasive tests.Therefore, as Scenario 5 states, the primary reason for taking or skipping a measurement is the diagnostic information it provides, compared to the cost of observation (e.g., monetary, time, and/or harm to the patient.).Scenario 5 (Missing measurements due to diagnostic irrelevance): Variables that are less relevant to the physician's impressions are less likely to be observed.Scenario 5 concerns diagnostic flowcharts and score systems in a dataset (see Elovic and Pourmand [40]).These provide rules for selecting the following observation until the final diagnosis.Nevertheless, observation patterns may not entirely reflect one particular guideline, as many guidelines are used during the data collection phase within a cohort, and other scenarios also affect the data.
One should note the implications and subtle differences between these tools when conducting a missingness analysis.For example, in flowcharts, the value range for the parent variable(s) determines the next observation.In contrast, in a score system, the cumulative score of all related variables determines whether more observations are necessary for concluding the decision [41].References to Scenario 5 are presented in Appendix A.5.

Scenarios Related to Healthcare Facilities
Measurement decisions are not only determined by physicians but also by protocols and guidelines in healthcare facilities as stated by Scenario 6.

Scenario 6 (Missing measurements outside protocol requirements): Data collection protocols decide the measurements in different conditions during hospitalization.
For instance, hospital protocols may mandate specific data (e.g., demographic information and basic blood tests) to be collected upon admission.Similarly, there are measurements only performed in particular conditions, e.g., pre-and post-surgical measurements.It is, therefore, crucial to consider the role of protocols within healthcare facilities when investigating the causes of missingness or observation of variables.References to Scenario 6 are presented in Appendix A.6.
Scenarios 5 and 6 assume that measurements can always be taken if required.While this may generally be true, especially for routine tests, a measurement may sometimes be hindered by the unavailability or shortage of necessary resources.Diagnostic tests may be dropped or delayed for a patient due to prioritization in long waiting queues or temporary unavailability of equipment or staff.Scenario 7 describes the situation where measurement orders were not realized despite physicians' decisions.
Scenario 7 (Unavailability or shortage of resources): The physician's order for observation may not be realized due to unavailability or shortage of resources.
References to Scenario 7 are presented in Appendix A.7.
Another assumption for Scenarios 5 and 6 which does not always hold is that the measurements and physicians' observations are all recorded in the database.As stated by Scenario 8, variables might be observed and influence medical decisions, yet they are withheld from the dataset.Scenario 8 (Unrecorded observations): Some variables are not recorded in the database or used in the data analysis, even though they have been observed and relied upon in medical practice.
There might be aspects characterizing the overall patient's health, which are not explicitly recorded but implicitly considered in the decision-making process.In addition, certain modalities of data may not be efficiently recorded or integrated into the medical record.Further, some modalities, such as textual data, may be excluded from data analysis due to complexity.All these reasons are mainly determined by the quality of data collection software in healthcare facilities, the physician's style of practice, and the choice of data modalities for analysis.References to Scenario 8 are presented in Appendix A.8.

Scenarios Related to Data Pre-Processing
For the analysis of collected and recorded data, the first step is dataset selection and preprocessing.Depending on the analysis objective, data scientists apply inclusion/exclusion criteria based on demographic information, patient cohort, or variable availability.As stated by Scenario 9, this should be considered a missingness scenario induced in the data analysis step.

Scenario 9 (Data sample omission based on inclusion/exclusion criteria):
Samples are included or excluded depending on data and missingness characteristics, such as measurement availability, values within a specific range, or patient cohort.
References to Scenario 9 are presented in Appendix A.9. Another common reason for data omission during pre-processing is the presence of invalid, unextractable, or erroneous values as stated by Scenario 10.
Scenario 10 (Missingness of invalid data entries): Data rows with invalid or erroneous entries are removed from the data during data pre-processing.
Examples are omission due to poor handwriting or corrupted medical chart pages [14], negative age values, or entries specified by ERROR code.References to Scenario 10 are presented in Appendix A.10.

Analysis of Missingness Scenarios
This section presents the foundation of the missing data theory necessary for analyzing the introduced scenarios in Section 3.2.After briefly explaining the missing data problem formulation, we describe the steps required for solving the problem under missing data, according to Figure 1.Throughout the steps presentation, we identify theoretical questions to answer for each scenario to bridge the gap between the analysis and application domains.We call these questions inquiries about the scenarios.Extensive details for the inquiries are presented in Appendix C.

Setting and Notation
Let the random vector X ∈ R d comprise d study variables X i ∈ X, i ∈ {1, . . ., d}.For ease of reference, we denote a specific variable of interest beside X (e.g., the class labels in the ML classification problem) as Y. Furthermore, we denote independence between X i , X j as X i ⊥ ⊥ X j .Independence by conditioning on a variable X h is denoted as In reality, the variable X i is realized for all patients, though it may or may not always be available (i.e., observed, recorded, and present in the dataset).We therefore refer to X i as a counterfactual variable since this is what the data would have been if they had always been available, possibly contrary to reality.Corresponding to each X i , we define a binary variable R i ∈ {0, 1}, called the missingness indicator, to express X i 's availability: we set R i = 1 when X i is available, and R i = 0 otherwise.The version of X i which is masked by missingness is called proxy variable, denoted as X * i ∈ R ∪ {NaN} where NaN represents the missing entries.By this definition, a proxy variable is modeled as The distribution of R is determined by the subset of scenarios from Section 3.2, which describe the data observation and recording in a healthcare facility and dataset selection for analysis.A data availability policy π represents the union of scenarios such that the missingness distribution follows the policy, i.e., R ∼ π.Subsequently, the resulting distribution given a policy π is denoted as p π .We denote three special policies to reference in the paper: (1) the initial policy during data collection as π init , (2) the full-availability policy as π full , under which X is always available, and (3) any other policy as π new , which is neither π init nor π full .This notation yields p(X) = p π full (X * |R = 1).
Example 1 (missingness under availability policies).Suppose a variable X 1 is realized for four patients, giving X 1 = (2, 3, 4, 7) ⊤ .If the fourth patient has missing values under a policy π init , we have X * 1 = (2, 3, 4, NaN) ⊤ .In this case, the mean estimations for X and X * are defined as Under a new policy π new where only the X ≥ 4 observations are available, we have E π new (X * 1 |R = 1) = 5.5 An availability policy for X i is, in general, parameterized by all variables (including X i itself) as well as other missingness indicators, i.e., {X, R \i }.To encode these dependencies, we model the joint (X, R) distribution using m-graphs Mohan et al. [2].An m-graph under the availability policy π, denoted as G π (V), is a causal directed acyclic graph (DAG) with the node set V = {X ∪ X * ∪ R}.The edges in the structure X i → X * i ← R i are deterministic, representing Equation (1).While non-graphical approaches for missing data exist, we focus on m-graphs for their effectiveness and popularity in this paper.Section 3.3.3will provide more details about m-graphs.
Example illustrations of m-graphs are depicted in Figure 3, where three m-graphs model different policies for a similar (X 1 , X 2 ) distribution.
(c) Figure 3. Three example m-graphs, modeling the joint distribution (X, X * , R) for a bivariate dataset: represents the full-availability case (no missingness scenario).Dashed edges are deterministic, encoding the definition in Equation (1).

Defining the Estimand
In the first step of data analysis, an objective must be set by the domain expert and the data scientist and translated into an estimand, which will be fitted to the data.Examples include finding the weights of a prediction model for patient morbidity or the mean value of a biomarker for a population.Based on the form of the estimand, and whether and how it depends on the unavailable data distribution under missingness, we may face diverse challenges.
A basic question of interest is the mean of an outcome variable Y (e.g., the mean value of a test or the chance of recovery).If Y is partially available under the policy π init , one may formulate the question directly as E π init (Y * |R Y = 1), which reads as the "mean of Y when it is available".However, we are often interested in estimating the entire population regardless of the missingness status "had Y for all samples been available for analysis".This objective, referred to as the counterfactual mean estimation, is presented as As a more advanced objective, we may be interested in developing a prediction model for the outcome variable Y using the covariate vector X, i.e., E(Y|X = x), which reads as "conditional mean of Y given X".We often choose an ML model for estimation, such as a multi-layer perceptron neural network f (x; w), parameterized by w.The weights of the network are learned by minimizing a loss function, e.g., the mean squared error . Model performance at deployment can also be evaluated using the same formula.
Given a fully observed outcome and missing covariates, the estimand formulates the MSE loss for the available X.The estimand in Equation ( 3) suits the situation where the prediction model is to be deployed in an environment with the same observation policy, meaning that all missingness scenarios are the same during deployment as during the data collection stage.In Equation ( 3), we may use the information in R X , e.g., we train (maximum) 2 d separate sub-models g(x j ), R j = 1 for each unique value (pattern) of R [20].
Example 3 (Health status estimation at hospital discharge).We aim to develop a prediction model for the 6-month outcome based on the observed variables during hospitalization, queried at discharge.The model deployment will not influence the physicians' decisions.The fact that the hospitalization data are being analyzed retrospectively can justify the assumption that the observation and recording policy will not change at deployment.The MSE loss for this case is given by the estimand in Equation (3).
Alternatively, we may be interested in learning a prediction model that is deployed in healthcare facilities with different missingness scenarios, e.g., with varying guidelines of observation and protocols (Scenarios 5 and 6), for a different patient cohort (Scenarios 1 and 9), or in the same healthcare facility but with a change of observation policy because the physicians would measure different variables to "feed" it to the prediction model.In particular, suppose a training dataset generated given the m-graph in Figure 3a will be deployed in an environment modeled by the m-graph in Figure 3b.The estimand for such a case is which reads as "MSE loss under new missingness scenarios at deployment", where π new represents the new policy.
Example 4 (Change in hospital discharge protocols).Suppose the hospital in Example 3 adopts a new discharge protocol mandating performing a medical test for all patients before discharge.
The MSE loss under the newly adopted policy is given by the estimand in Equation (4).
A special case of Equation ( 4) is when the prediction model is expected to make predictions always using full covariates (Figure 3c).
The estimand for this case is , with only one missingness pattern, the full-availability R = ⃗ 1.This objective is employed for most clinical prediction models (see Tsvetanova et al. [8]).
For more examples, Appendix B presents the estimands for prediction using decision trees and feature importance.
Example 5 (Clinical prediction model).Suppose a clinical prediction model is developed using an incomplete dataset.As a result of successful development, physicians will use the model while they actively collect all study variables every time to feed to the model.The MSE loss at deployment is given by E π full [(y − f (x; w)) 2 ].

Identification
As shown in the previous step, estimands may query different missingness distributions, while the only available distribution is given by the data collection policy π init .If an estimand queries π init , such as Equation ( 3), it can be computed directly using the training dataset D. On the other hand, estimands such as ( 2) and (4) query different distributions and hence are subjected to the distribution shift problem.In the identification step, we find a procedure that computes a consistent estimate of an estimand under a target distribution using the available π init [2].
To elaborate further, we consider an estimation approach under distribution shift, namely, importance sampling: for a functional θ of the distribution at deployment q(X, R), θ is estimated using the data collection distribution p(X, R) as where the fraction λ(X, R) = p(X, R)/q(X, R) is called the importance ratio.By Equation ( 5), samples are drawn from p(.) but re-weighted by their "importance" in reflecting q(.).Equation ( 5) states that a θ estimation is possible given the p(X, R) samples when λ is known for all (x, r) over the support of p.
The importance ratio can be re-written using the selection model factorization [42] as The conditional terms p(R|X) and q(R|X) in the fraction are the data collection and deployment availability policies, respectively.Assuming no additional counterfactual data distribution shift, i.e., p(X) = q(X), Equation ( 6) is simplified as λ(X, R) = q(R|X)/p(R|X), i.e., the ratio of missingness models at the data collection and deployment stages.When the availability policy does not change at deployment, the ratio is re-written as and when a new policy is adopted at deployment, it is re-written as While the following arguments are valid for Equation (5) in general, we consider a special case where the full-availability policy π full is running at deployment (e.g., the estimand in Equation ( 2)).In this case, we trivially have λ(X, R) = 0 for all incomplete data since the numerator p π full (X, R) is zero when R ̸ = ⃗ 1.This means that only the complete cases (R = 1) are used for computation, for which λ = 1/p π init (R = 1|x).The resulting estimator according to Equation ( 5) is expressed (for estimation using D) for N samples, where 1(r = 1) selects only the complete cases.Equation ( 8) is known as the inverse-probability weighting (IPW) estimator.The denominator in Equation ( 8) is referred to as the propensity score, often denoted as PS(x).
The challenge of identification lies in the conditioning set of importance ratio terms, as they generally depend on the counterfactual distribution (X), which is only partially available.As a solution, we assume an m-graph for the problem and seek independence properties among (X, R) variables that allow us to express the importance ratio in terms of factors that can be estimated using the available distribution (X * , R).For the scope of this paper, we mainly focus on identification with respect to m-graphs [2,43].See Section 3 of Mohan et al. [2] for other identification approaches.
Example 6 (Identification with respect to an m-graph).Suppose a functional θ(X 1 , X 2 ) is to be estimated, given the data collection and deployment policies π init and π full , respectively.The propensity score for IPW estimator is p ), which cannot be estimated using D. Assuming the m-graph in Figure 3a, we proceed as follows (we drop the distribution index for brevity): By the missingness definition in Equation ( 1), we express the second term using the proxy variable and rewrite the propensity score as p(R . Both factors in the propensity score can be estimated using (X * , R) In conclusion, identification in this manner requires an m-graph model, and within it, the causal relations of the missingness indicators are specifically important.It is therefore necessary to discover what kind of causal structures the missingness scenarios induce for R. In particular, we specify the parents and ancestors (direct and indirect causes) for the R nodes as stated by Inquiry 1.The causes of R nodes are commonly referred to as the missingness mechanism.
Inquiry 1 (missingness mechanism): Causal relations that a scenario implies for R nodes.
To facilitate identifying the causes, we categorize all potential causes to search for in the following three categories: 1.
(X and R components) First, the candidates for causes of R are the study variables and their corresponding missingness indicators within the dataset.Examples can be found in Figure 3a,b, where X is a cause of indicator R.

2.
(latent/hidden confounders) Variables that have not been collected and available in the dataset may also causally influence R.More importantly, they may confound two or more study variables within the estimand, and may therefore hinder the identification process.

3.
(exogenous causes) Other variables that may lie outside the dataset and do not confound the study variables of interest are considered exogenous causes, having, in general, no identification implications.
Missingness in health-related variables such as lab test items is mainly caused by physicians under Scenario 5 (missing due to diagnostic irrelevance), where they make measurement decisions based on the observed history.Therefore, in this case, R indicators for health variables have incoming edges from the previous measurements (recorded or unrecorded).Other potential causes include the health status under Scenarios 1 (patient complete non-visit) and 2 (missing follow-up visit due to health status).Examples of the latent/hidden confounders include socioeconomic variables as well as variables in secondary datasets with information about the non-visit population under Scenario 1.As for the exogenous causes, many causes may be recognized, such as simply forgetting to enter the data for a patient, under Scenario 8 (unrecorded observations).However, one should be cautious about treating all medically unrelated variables as exogenous causes, as they may still confound the study variables and missingness indicators.A detailed analysis of missingness scenarios with respect to Inquiry 1 is presented in Table A1.
Inquiry 1 explores the structural distribution shift caused by a change in the m-graph between the data collection and deployment stages.Another possibility is that the m-graph stays invariant, but the causal relations are subjected to a parametric shift.For example, assume the m-graph in Figure 3a holds for both data collection and deployment, but the missingness probability in X 2 changes from σ(R 1 + 2X 1 ) to σ(0.2R 1 + 5X 1 ).As stated by Inquiry 2, it is crucial to explore the potential parametric shift at deployment due to a change in the observation and recording policies.
Inquiry 2 (Missingness distribution shift): Whether a scenario is subjected to missingness parametric distribution shift at deployment.Parametric shift may occur in Scenario 5 (missing due to diagnostic irrelevance), if the definition of normal/abnormal ranges for a health marker changes.In this case, the results of primary tests still influence the performing decision of later tests, however, via different rules.As another example, a parametric shift may occur in Scenario 7 (missing due to resource unavailability), if the monetary cost of a medical test decreases as a result of equipment upgrade or insurance plans, leading physicians to order the test more often.A detailed analysis of missingness scenarios with respect to Inquiry 2 is presented in Table A2.
Example 7 (parametric shift due to decreased test costs).Consider a primary test X 1 and a secondary and more expensive test X 2 .Patients with abnormal primary test values (X 1 > 5) are more likely to give the X 2 test.After a cost reduction for the X 2 test, the overall frequency of the test (R 2 = 1) increases such that now the relative number of tests for patients with abnormal X 1 values is ρ times larger than before, yet the association between X 1 and R 2 is retained.This statistics gives = ρ, which is the importance ratio for (X 1 = 1, R 2 = 1) samples in Equation ( 5).We leave it to the readers to calculate other importance ratios based on assumed statistics about this hypothetical problem.
So far, the described identification methodology has been based on the selection model factorization in Equation ( 6) and the no-distribution-shift assumption for the counterfactual variables.However, there might exist missingness scenarios under which this assumption is violated.A case of violation is when the observation and measurement decisions directly affect the counterfactual variables.In terms of m-graphs, this translates to an R → X edge.
The assumption that such a causal relation does not exist is referred to as no-direct-effect (NDE) [22], discussed in the m-graph identifiability literature [25].Since the violation of NDE influences the identification procedure, it is crucial to know whether the problem setting permits it as stated by Inquiry 3.

Inquiry 3 (no-direct-effect assumption): Whether a scenario implies outgoing edges from missingness indicators to counterfactual variables.
A crucial case of NDE violation occurs when invasive tests such as biopsy affect the health status of patients.The effect of observation may be exerted on the corresponding counterfactual variable itself or other variables.This effect may also be exerted indirectly, e.g., through temporarily stopping a certain medication before a medical test.For example, due to the contraindication of radiology contrast agents and metformin, it is recommended that for diabetic patients, medication is stopped before performing angiography [44].Note that under violation of the NDE assumption, the problem definition stated in Section 3.3.2becomes ill posed and requires further elaborations.An example of a problem definition under NDE violation is discussed in Example 8.A detailed analysis of missingness scenarios with respect to Inquiry 3 is presented in Table A3.

Example 8 (Problem definition under NDE violation).
Assume the following m-graph X → Y ← R X , describing an (X, Y) dataset with fully observed Y, where the measurement of X negatively influences Y.This problem cannot be analyzed unlike Example 1, as the counterfactual realizations cannot be described ignoring the missingness status.As a hypothetical data generation mechanism, suppose the X − Y relation follows Y = wX + ϵ, ϵ ∼ N (0, 1) in the absence of any measurement (R X = 0).When X is measured (R X = 1), the Y distribution changes to Y = wX + w 0 + ϵ.Therefore, E π init (Y) ̸ = E π full (Y).Possible questions to pose with regard to a target quantity θ are as follows:

•
If the observation policies remain unchanged; • If we begin to always observe X; • If we knew the value of X but without negative influences on Y, e.g., using a new testing technology.
Another common assumption for the missing data problem is the no-interference assumption, stating that the measurement decisions for one individual do not affect other individuals [22].This is similar to the independent and identically distributed assumption in general ML problems: having interfered measurements, the independent and identically distributed assumption cannot be made for the R distribution.It is therefore important to check whether the no-interference assumption is permitted for observation scenarios as stated by Inquiry 4.
Inquiry 4 (No-interference assumption): Whether a scenario causes interference among the availability status of data samples.
Similar to the NDE assumption, one may find realistic scenarios where the nointerference assumption is violated.In general, competing for limited resources under Scenario 7 (unavailability or shortage of resources) or for available hospitalization services under Scenario 1 and 2 (complete non-visit and missing follow-up) imply interference.A detailed analysis of missingness scenarios with respect to Inquiry 4 is presented in Table A4.
Finally, we discuss a unique case of missingness, where data samples are completely omitted from the dataset prior to any analysis.This case can be modeled in m-graphs via an R † node that influences all R i such that if R † = 0, then R i = 0, ∀i (Figure 4).The risk in this situation lies in the fact that we cannot infer the occurrence of such omissions from a dataset without additional information, which may thus lead to the wrong conclusion that the dataset is complete and free of missingness.This case is commonly referred to as selection bias in causal inference literature.Selection bias is argued in Inquiry 5.

Inquiry 5 (Selection bias): Whether a scenario causes the omission of an entire data sample in the form of selection bias.
Clearly, sample omission can be a result of non-visit under Scenario 1 and inclusion/exclusion criteria under Scenario 9. Whether or not this should be conceived as a bias depends on whether the target parameter (e.g.Y in Equation 2) is believed to vary between the observed and the unobserved sub-populations.A detailed analysis of missingness scenarios with respect to Inquiry 5 is presented in Table A5.

Estimation
There are several methods for estimation with missing data, including likelihoodbased methods such as the Expectation Maximization (EM) algorithm, multiple imputation (MI), IPW estimator, and outcome regression (OR) [27,42].In the scope of this paper, we continue with the importance sampling approach in Equation ( 5), in particular, the IPW estimator in Equation ( 8) and the estimation of the propensity score.
Even though a successful identification step guarantees that the propensity score can be estimated using the available data, we still face some challenges, e.g., when the missingness pattern is non-monotone.A missingness pattern is called monotone if there is at least one ordering of the variables such that observing the j-th variable ensures that all variables k > j in the ordering are all observed for all samples (Figure 5a).Estimation of the propensity score has a straightforward solution for monotone patterns.Example 9 showcases propensity score estimation for identifiable monotone missingness.
Example 9 (Propensity score estimation for identifiable monotone missingness).For the missingness in Figure 5a, we have ∑ 4 j=1 p(S j |X) = 1, while PS(X) ≡ p(S 1 |X).Assuming identifiability, each p(S j |X) can be estimated using only the variables available in S j .As a result, the propensity score is estimated as While methods have been developed for an effective estimation under non-monotone missingness [27,31], it is beneficial to adopt monotone solutions if applicable.In that regard, Inquiry 6 argues whether a missingness scenario individually induces monotone missingness patterns.

Inquiry 6 (Monotonicity): Whether a scenario induces missingness with monotone patterns.
If an individual missingness scenario is active, monotonicity can be directly inferred from the emerged patterns, revealed by a simple sorting of the variables with respect to their missingness ratio (Figure 5a).However, in practice, several scenarios influence a dataset.In such cases, the monotone pattern attributed to one scenario is broken by other scenarios.If we can attribute the emerged non-monotone pattern to a dominant monotoneinducing scenario along with less effective non-monotone scenarios (hypothetically in Figure 5b), then methods exist based on resolving the missing entries up to recovery of the monotone pattern, e.g., via imputation, and proceeding with IPW estimation for monotone missingness [45].A noteworthy scenario likely inducing monotonicity is the sequential observations of physicians under Scenario 5 (missing due to diagnostic irrelevance).Given a specific diagnostic flowchart, it is reasonable to assume that more specific secondary tests shall not be made unless primary tests are conducted.As said, this pattern may be broken for many reasons, including more than one diagnostic flowchart being used and other scenarios such as 4 (patient's refusal) or 7 (resource unavailability).A detailed analysis of missingness scenarios concerning Inquiry 6 is presented in Table A6.

Sensitivity Analysis
The assumptions made for handling missing data may not hold under all circumstances.They might be too strong for practical implementation, or we may expect the environment to undergo some perturbations that violate them.To ensure the robustness of the analysis, it is crucial to measure the sensitivity of results to departures from the assumptions and report the variation.Sensitivity analysis is usually performed by perturbing the m-graph model.
In addition, it is possible that due to the nature of the problem, assumptions do not lead to a successful identification.In this case, we may impose stronger assumptions that lead to identifiability, model the departures from the actual assumptions, and finally measure the sensitivity to different degrees of magnitude of those departures.
Example 10 (Sensitivity analysis for the unidentifiable self-masking missingness).Consider an outcome variable Y that is subjected to missingness under the following mechanism Y → R Y .
The estimand E(Y) is unidentifiable under this mechanism, referred to as self-censoring [25] or self-masking [20].
We can assume that the mean of the unobserved population is δ units away from the observed population, additively [6,33].We then measure the variation of E π full (Y * |R = 1) assuming a range of values for the sensitivity parameter δ.
For reliable meaningful sensitivity analysis results, it is crucial to interpret the sensitivity parameters based on meaningful real-world quantities.Inquiry 7 states that scenarios may carry valuable information for choosing meaningful parameters.
Inquiry 7 (Meaningful sensitivity parameters): Given a scenario, what are the meaningful units and ranges of parameters for sensitivity analysis.Specific to the importance sampling approach and Equation ( 8), the unidentifiable terms appear in the importance ratio.The importance ratio captures the differences in the levels of availability for different covariate strata.To make an informed guess about this quantity, we may refer to other research works or collaborations with health domain experts.For instance, Zamanian et al. [36] suggest that the sensitivity parameters for physicians' observations (Scenario 5) are related to the odds of making an observation for relatively healthy or sick patients, which can be inferred based on the guidelines, protocols, and referring to the attending physicians.The sensitivity parameter for this case is formulated for the model in [36], assuming a logistic model for missingness, by the following odds-ratio term: where O(R|X) = p(R = 0|X)/p(R = 1|X).Equation ( 10) follows the so-called exponential tilting model, where the multiplicative departure for R is modeled as an exponential term [34].Likewise, the parameters for hospital visits (Scenario 1) are related to the odds of visiting a healthcare facility for healthy and sick populations, formulated similarly as Equation ( 10), for which some information can be extracted from epidemiologic studies and public health reports.Overall, the form of the sensitivity model depends on the estimand and the estimator.Yet, a similar ratio as in Equation ( 10) often appears in the analysis, which must be specified.A detailed analysis of missingness scenarios concerning Inquiry 7 is presented in Table A7.

Setup
In this section, we perform a simulation experiment to demonstrate the impact of different scenarios and their corresponding assumptions on the results of the missing data analysis.Simulation allows us to induce different missingness scenarios in a controlled manner and evaluate the reliability of analysis compared to the ground truth.Details regarding the study design and implementation of simulations, medical use case, missingness scenarios, estimands, algorithm derivations, and missing data methods are presented in Appendix D.
Figure 6 presents three m-graphs with a shared counterfactual causal DAG for cardiovascular disease (CVD) influenced by baseline variables and measurements, namely age, body mass index (BMI) at admission, (systolic) blood pressure (BP), and CVD outcome.The causal DAG was inspired by the work of Bakhtiyari et al. [46].
For the CVD causal model, three combinations of missingness scenarios were induced: Case 1: (Figure 6a) We induced Scenario 10, where invalid entries were dropped from the dataset.Case 2: (Figure 6b) In addition to Scenario 10, we induced Scenario 1, where the healthy sub-population in terms of CVD visited the healthcare facility less frequently.Case 3: (Figure 6c) In addition to Scenario 10, we induced Scenario 5, assuming that age and BMI values were always measured and influenced the measurement of BP while they may have not been always recorded (Scenario 8).For each case, we estimated two objective estimands: Finally, the analysis was performed using three different missing data methods: Method 1: We performed complete-case analysis (CCA).For the mean BP estimand, we took the average BP value for the observed cases.For the model performance estimand, we tested f (.) using complete cases only.Method 2: We used the Missforest imputation method [29] for the incomplete datasets.The analyses were then carried out using the imputed dataset.Method 3: We performed IPW estimation using Equation (8).After the identification step, propensity scores were estimated via re-weighted complete cases.
We modeled the structure of the m-graphs for each case and simulated their distribution parameters in 20 iterations by sampling the uniform randomly from a search space.In this way, various simulation cases were generated, following the imposed structure [47].For each iteration, we reported the estimation bias regarding absolute error using the ground truths, namely, the counterfactual mean BP (on a log scale) and factual classification accuracy.

Results
Figure 7a presents the estimation bias for the estimands 1 and 2 on a log scale.For method 1 (CCA), only case 1 aligns with its assumption, giving the most negligible bias among the three cases.For method 2 (Missforest), cases 1 and 3 align with their assumption, yet interestingly, case 3 has the highest bias.Method 3 (IPW) is designed to be domain-informed.Higher biases in case 3 for methods 2 and 3 can be attributed to possible model misspecifications for the propensity score estimation.Overall, the domain-informed IPW estimation gives the lowest bias.
Figure 7b presents the results for the classification accuracy estimation at deployment.The average classification accuracy of the logistic regression model was 0.77 across all cases.Throughout the experiment, three methods suffered from approximately up to 0.1 bias (13% of the average accuracy), except for the CCA estimation in case 2 with up to 0.25 bias (33% of the average accuracy).
The experiments reflect an important message of this paper, that estimation bias may considerably vary for missing data methods under different circumstances, while it cannot be discovered during the training phase.Therefore, the only trustworthy approach to missing data is to rely on realistic assumptions and domain-informed methods.

Discussion
In order to overcome the challenges of a missing data problem, it is recommended to report the problem transparently [3,4], collect complementary datasets from various sources [7,14], and choose a suitable method accordingly.However, the question remains: what kind of data should we collect and incorporate into the analysis, and how should the missing data be reported?
This paper propounds a framework for analyzing missingness in observational health data.Via the recognition and categorization of fundamental scenarios, we provide a basis for understanding the data collection processes in healthcare facilities.According to this framework, focusing on human agents (physicians, patients) and environments (hospitals, clinics) helps discover and report the scenarios.As shown in this paper, the implications of scenarios must be formulated for identification, estimation, and sensitivity analysis steps.Once the corresponding assumptions are specified, a suitable method can be selected.During the analysis, complementary datasets can be utilized when they enable and facilitate the analysis steps, e.g., when new variables make the missingness mechanism identifiable (Inquiry 1) or when a dataset offers meaningful interpretations of the sensitivity parameters (Inquiry 7).
This paper is intended for not only the medical data scientists but also the developers of missing data methods in the ML community as a response to the recurring theme of devising sophisticated (deep learning) imputation models [48][49][50].While imputation (or IPW) methods with high learning capabilities may considerably enhance the estimation step, most assumptions and inquiries concern the identification and sensitivity analysis steps (Inquires 1-5 and 7).Therefore, to ensure the method's effectiveness, it is necessary to actively scrutinize the assumptions, especially if the application domain is safety-critical.
In fact, methodology papers often report the conditions using Rubin's at-randomness categorization [23], according to which a missingness mechanism can be completely-atrandom (MCAR), at-random (MAR), and not-at-random (MNAR).However, this categorization has been a blessing and a curse; even though it effectively formalizes the first step of identification addressed by Inquiry 1, it was introduced in the original work specifically for Estimand (2) and was later extended for the joint-distribution estimand p(X, Y) [25].The works we criticize often make at-randomness assumptions disconnected from the reality of the use case and denuded of their context, ignoring the requirements of the estimand.If these works were more sensitive to the nuances of the problem and adhered to the standard missing data analysis process, they would positively influence the medical data analysis works that adopt them.We suggest that future works discuss the assumptions in a form similar to the inquiries in this paper and present some real-world examples for validity and violation cases.

Limitations and Future Works
The current paper studied ten fundamental and prevalent scenarios considering healthcare facilities.Nevertheless, new and different scenarios may arise under new data observation, recording, and collection circumstances.Likewise, the inquiries in this paper were made regarding the general frame of the missing data theory, which is relevant for most analysis methods.It is conceivable that specific methods have unique assumptions and, therefore, further inquiries to explore.Nevertheless, the analysis process in this paper can be applied to new scenarios and inquiries.
The scope of this paper is limited to observational data from healthcare facilities.Other similar health domains, such as medical wearable sensors, can be similarly analyzed by employing the analysis framework presented in Section 3.1, e.g., one can study the scenarios induced by patients (e.g., missing measurement due to taking off the smartwatch because of irritation) or technical issues (operating system issues or low battery), and subject them to the same inquiries introduced in this paper.
In addition, exploring all the introduced scenarios scenarios in great detail would have indeed exceeded the publication's scope.While the answers to the inquiries presented in Section 3.3 and Appendix C can be considered a constructive first step, future research endeavors will benefit from deeper investigations into each individual missingness scenario with greater granularity.Furthermore, it is worthy of note that the missing data problem lies in the broader category of the data-coarsening processes,where the data veracity, representativeness, or completeness is impaired.Other data-coarsening processes in healthcare include the patient recall bias [51], variations in medical outcome definition over time or depending on the defining consortium [52], or coarsened representation of pertinent negative/positive values [14,53].Extending our analysis framework to other data-coarsening problems in future works could bridge further translational gaps in medical and healthcare data analysis.
Finally, obtaining a real-world healthcare facility dataset with reliable ground truth for missing data is exceedingly challenging.Collecting such a dataset would require the investigation of all missingness occurrences without disrupting the environment and the missingness distributions.This is why we relied on the simulation experiment to be able to investigate different aspects of various objectives, scenarios, and methods.Nonetheless, we would obtain more insightful results if real-world datasets exist for validation, however small scale or simple.The missingness indicators under this scenario are influenced by observed health variables X, mediated by the attending physician unless X itself is subjected to missingness under Scenario 8.In this case, the edge is received from the counterfactual counterpart (since it influences missingness regardless of its observation status).As an extreme case, a counterfactual variable that is completely missing under Scenario 8 is a latent cause for missingness (see the corresponding entry for Scenario 8).Causes that might confound missingness with other study variables include the attending physician's identifiers, which are proxies of medical practice styles.

6, 7
By definition, missingness under these scenarios can be predicted using the patient's location, transfer information, or type of resource required for making the observation.This information is considered crucial, being part of the management and billing data; therefore, variables for this scenario are likely to be fully observed and available, especially for datasets from large healthcare facilities. 8 Recording of the counterfactual study variables depends on the nature of the variable, as well as the style of the medical practice of the attending physician and the recording capabilities of the software tool.For instance, expensive and decisive tests such as medical imaging or lab tests are generally recorded, while qualitative examination results may escape from recording depending on the physician or if the software tool does not provide an entry for it.Overall, there is a possibility that the reasons for missingness under this scenario confound other study variables if they are also affected by the medical practice style, e.g., when a physician with a tendency to record the most variables also diagnoses and prescribes treatments more effectively.

Scenario
No.

Interpretation 10
The potential reasons for invalid entries mentioned in this paper are likely unrelated to the analysis of interest, as they are mostly related to human and software tool errors.However, one should be cautious about treating all medically unrelated variables as exogenous causes.Instead, whether these variables can realistically be confounders for other variables should be investigated.For instance, socioeconomic or occupational status may affect overall health and healthcare facility visits.

4
A no-shift assumption appears reasonable regarding patient refusal behavior for taking a test or answering questions.However, it is essential to consider the possibility of a shift due to data-sharing consent.In such cases, the data available for analysis could differ from the data available to physicians at deployment.

5, 6, 7, 8
Shift occurs when there are changes in the observation policy of physicians, healthcare facility protocols, available equipment, or data collection software.Examples include (i) alterations in the measurement decisions resulting from the deployment of a prediction model, (ii) modifications in the utilized diagnostic flowcharts and scores, (iii) fluctuations in the level of physicians' expertise, and (iv) enhancing data collection protocols following significant events such as an epidemic.9 Inclusion/exclusion criteria typically imply a shift in missingness unless the same criteria are applied for the admission of patients, which is highly unlikely in most cases.An example of no-shift occurs when the data scientist restricts the general population to the cohort of interest for deployment, using inclusion/exclusion criteria.
10 Shift occurs only when the reasons behind errors and invalid entries in the data change.
Table A3.Analysis of scenarios with respect to Inquiry 3-no-direct-effect assumption.

Scenario
No.

Interpretation
1 & 2 Non-visit possibly influences the health variables via a direct causal effect on the treatments: patients usually do not receive treatment until being admitted (except in the self-medication case).This means that the NDE assumption is mostly violated.Take the example of missing follow-up after the first visit, compared to the unrecorded observations in a realized follow-up visit.In the latter, missingness does not influence the health status at the end of the follow-up visit.In contrast, in the former, the health status, for example, may degrade due to discontinuation of diagnosis/treatment.Assuming most of the variables measured under these scenarios are related to the patient's health status, the validity of the NDE assumption under these scenarios depends on the nature of the measurement.If the measurement directly influences the patient's health (e.g., invasive tests) or indirectly (temporary pause of a medication), the NDE assumption is violated.Following the discussion on Scenarios 1 and 2, the NDE assumption is likely violated for the treatment and medication variables since treatment decisions usually depend on the observations.8 Unless disproved explicitly, missingness under this scenario admits the NDE assumption since recording the status of the variables cannot influence the variables by any conceivable means.

9, 10
Since missingness under these scenarios is related to the data analysis and occurs after the data collection step, the NDE assumption can be made.
Table A4.Analysis of scenarios with respect to Inquiry 4-no-interference assumption.

Scenario
No. Interpretation Observations and measurements under these scenarios permit the no-interference assumption, as the decisions are being generally made per individuals.

6
The healthcare facility protocols typically apply uniformly to individuals and remain consistent over a short period.Hence, it is reasonable to make the no-interference assumption in this scenario.

7
This scenario is the most critical and obvious example of violating the no-interference assumption.In this scenario, a prioritization scheme is usually adopted to allocate observation and measurement resources.Examples are (i) early discharge or no admission due to limited hospital capacity during the epidemic, and (ii) delayed or canceled measurements for healthier patients during staff overload.

9, 10
Unless for particular reasons the data scientists do not induce interference by the inclusion/exclusion criteria, the no-interference assumption holds.An example of a violation of the no-interference assumption (though not to be conceived as a meaningful scenario) is when performing sample selection based on the so-far selected samples from different cohorts, e.g., when we only choose up to 20 patients from an age stratum.
Table A5.Analysis of scenarios with respect to Inquiry 5-selection bias.

Scenario
No.

1, 9
By their definitions, these scenarios induce selection bias, discarding the entire sample (of a specific sub-population) from the dataset.

2
To miss a follow-up under this scenario implies that the patient still has recorded data in the database.However, if analysis is limited to a specific follow-up (e.g., analysis of health status in the second hospital visit), then patients with limited data are subjected to selection bias.
3, 5, 6, 7, 8, 10 These scenarios by default concern data entries and do not cause missingness of an entire data sample; hence, no selection bias occurs.

4
A situation where patients' refusal can lead to selection bias missingness is when they refuse to give data-sharing consent.
Table A6.Analysis of scenarios with respect to Inquiry 6-monotonicity.

Scenario
No.

1, 9
Complete non-visit and sample exclusion induce only two complete-case and allmissing patterns.
2 Missing follow-ups in clinical studies induce monotone missingness since by the study design rules, the patients who are absent, for any reason, from a visit are excluded from the remaining visits (case drop-out).However, such a rule does not apply in healthcare facilities; therefore, this scenario, in general, leads to non-monotone missingness.
3, 4, 6, 7, 8, 10 The scenarios are not determined to induce a monotone missingness pattern, unless there is a specific reason related to the problem at hand.

5
Observations according to the diagnostic flowcharts and score tables induce a monotone missingness pattern, where secondary measurements are not made unless the primary ones are.However, many diagnostic flowcharts are utilized across all patients in a healthcare facility dataset.The set of primary tests usually overlaps among different flowcharts; therefore, a monotone pattern may still emerge.The pattern graph framework [27] provides a powerful methodology for dealing with missing data in this situation.
Table A7.Analysis of scenarios with respect to Inquiry 7-sensitivity parameters.

Scenario
No.

Interpretation
1 Potential odds-ratio information for sensitivity analysis includes (i) difference in total hospital visits between the healthy and sick sub-populations, (ii) difference in medical care advantages received among different socioeconomic strata, and (iii) difference in the death rates reported in the healthcare facility, and in total, specific to a disease.
2 Potential odds-ratio information for sensitivity analysis includes differences in the number of visits for healthy vs. sick patients, between groups with different socioeconomic status due to insurance plans, or morbidity rate for a specific diagnosis, possibly obtained from epidemiological research.
3 Potential odds-ratio information for sensitivity analysis includes the difference in the interrupted measurement level due to specific events, such as the code-blue.

4
For those types of missingness due to patients' health-related refusal (such as intolerance to pain), meaningful differences may be found for available and unavailable samples, e.g., conceivable level of infection which may cause intolerable pain.
5 Potential odds-ratio information for sensitivity analysis directly includes the level of measured variables, e.g., in different branches of the diagnostic flowcharts.For instance, one may ask how later specific measurements may change if the results of the primary tests flip (reflecting the unavailable sub-population).
6 Potential odds-ratio information for sensitivity analysis can be found by analyzing healthcare facility protocols for specific measurements.Since these rules are justified based on extensive research, informed sensitivity analysis might be possible via public health works that analyze such protocols.
7 Potential odds-ratio information for sensitivity analysis includes differences in the level of measurement for situations when the availability of resources changes, e.g., comparing the waiting line for a medical test or number of admissions.

8, 10
Due to complete-randomness, no specific sensitivity parameter can be conceived in general for this scenario.9 The sensitivity parameters have interpretations similar to the complete non-visit scenario, except that the data scientists have induced omission under this scenario.Therefore, the sensitivity parameters and ranges may be obtained from the original dataset.
or normalized for PS 1 as The numerator of PS 0 in Equation (A8) can be estimated from the available data.However, the denominator queries the odds of CVD in the entire population, including the censored patients.This means that the estimand is unidentifiable due to the self-masking edge CVD → R CVD .
Regardless of the apparent dead-end, we can proceed with the analysis if populationwide summary statistics about CVD are available.Such information can possibly be found in the public health research literature.For the sake of simulation, we query the odds from the ground truth dataset and employ it in the IPW estimation, knowing that, in reality, such information must be sought outside the current dataset.
For case 3, we have R ⊥ ⊥ BP, CVD|Age, BMI.Hence, we write the propensity score estimand as p π init (R|Age, BMI).This model, however, is expressed in terms of counterfactual variables.According to the m-graph in Figure 6c, we factorize R as follows: p(R|Age, BMI) = p(R Age )p(R Age )p(R CVD )p(R BP |Age, BMI).

Figure 2 .
Figure 2. A schematic of the analysis framework in this paper: on the left, a cross-investigation of the data availability steps and agents guides us toward the existing scenarios (blue dots).Each scenario then will be subject to inquiries about various missing data assumptions on the right.

Figure 4 .
Figure 4.An example m-graph to model selection bias, determined by the Y values.Blue edges represent the deterministic masking relation between R † and R i s.

Figure 5 .
Two missingness patterns: (a) monotone; (b) non-monotone.The monotone pattern is described by four S 1 − S 4 patterns.One can infer, using prior knowledge, that the non-monotone pattern in (b) is a result of some non-monotone missingness scenarios, interrupting the monotone pattern in (a).

Figure 6 .
Figure 6.Example m-graphs for cardiovascular disease under the influence of different constellations of missingness scenarios: Scenario 1 (patient complete non-visit), 5 (missing measurements due to diagnostic irrelevance), 8 (unrecorded observations), and 10 (omission of invalid data entries).Given these scenarios, the three m-graphs induce the following: (a) Scenario 10; (b) Scenario 1 + 10; (c) Scenario 5 + 8 + 10.Top row nodes and edges are the shared counterfactual sub-graph.Blue edges encode the missingness mechanisms.

Estimand 1 :
(Counterfactual mean BP) We estimated the counterfactual mean blood pressure.Estimand 2: (Classification accuracy for CVD) We evaluated the MSE loss for a trained prediction model for CVD under the full-availability policy.The estimand was E π full [(y − f (x)) 2 ].The model f (.) was a logistic regression classifier trained on a mean-imputed dataset.The focus of this study was only to estimate the performance of the existing model at deployment.Therefore, perfect model training or fine-tuning was not necessary.

Figure 7 .
Figure 7. Estimation bias results for two objective estimands: (a) the counterfactual mean BP, (b) classification accuracy of the logistic regression model for CVD.Cases and methods are presented as the x-axis ticks and box plot colors, respectively.

Table 1 .
Overview of missingness scenarios in healthcare facilities, analyzed in this paper.

Table 2 .
Overview of theoretical inquiries for missingness scenarios analyzed in this paper.

Table 3 .
Examples of literature supporting the steps of the missing data analysis process, presented in Figure1.
Example 2 (counterfactual mean LDL cholesterol level).As part of public health research, we aim to estimate the nationwide average LDL cholesterol level, denoted as Y. Available datasets are collected from a hospital where LDL levels are not available for all the patients.• E π init (Y * |R y = 1) gives the average observed value in the hospital.• As a possible new policy, E π new (Y * |R y = 1) gives the average value if the LDL level had been observed for all patients in the hospital.• E(Y) = E π full (Y * |R y = 1) gives the target estimand, i.e., the nationwide average LDL level.

Table A2 .
Analysis of scenarios with respect to Inquiry 2-missingness distribution shift.Parametric shift may occur if the population distributions change.Examples include (i) conducting analysis using hospital data but deploying them for patients in local clinics or the general healthy population, such as in a preventive healthcare plan, (ii) conducting analysis using data from a specific cohort but deploying them for another cohort, and (iii) when the target population visits healthcare facilities more or less frequently than during the data collection stage.3Shiftoccurs if the transfer protocols or observation protocols during hospitalization change.Examples include (i) observing more, fewer, or different variables in different hospital wards, and (ii) encountering data availability or unavailability after the transfer despite being available during the data collection stage.