1. Introduction and Context of the Empirical Application
Asthma is recognized as one of the most important chronic diseases, affecting millions of people worldwide. It reduces quality of life and causes disability and premature death in people of all ages [
1]. In addition, it continues to be an important source of global economic burden in terms of costs and social impact [
2,
3]. Asthma is described as a heterogeneous disease by the Global Initiative for Asthma (GINA:
https://ginasthma.org) and is usually characterized by chronic airway inflammation. It is defined by a history of respiratory symptoms, such as chest tightness, cough, shortness of breath, and wheeze, that vary over time and in intensity, together with variable expiratory airflow limitation. Although it is not strictly a definition, this description captures the essential features for clinical purposes. The National Asthma Education and Prevention Program (
https://www.nhlbi.nih.gov/science/nationalasthmaeducationandpreventionprogramnaepp) has classified asthma as: intermittent, mild persistent, moderate persistent, and severe persistent. These classifications are based on severity, which is determined by symptoms and lung function tests. According to [
4], asthma prevalence has been increasing in many countries in recent decades, especially among children and adolescents. Therefore, strategies based on scientific evidence are crucial to generate better preventive measures as well as greater access and adherence to treatments that reduce the economic burden. Thus, organizations such as the Global Asthma Network (
http://www.globalasthmanetwork.org/index.php), the International Study of Asthma and Allergies in Children (
http://isaac.auckland.ac.nz), and the aforementioned GINA, have been created worldwide to generate scientific evidence and disseminate information on the best care of asthma in terms of its prevention and management.
Scientific evidence about asthma relies strongly on data analysis, which is already part of medical decision-making or medical decision science, a process increasingly associated with data science and big data [
5,
6,
7,
8,
9,
10,
11]. Data analysis tools such as predictive models thus provide valuable information for clinical practice, medical research and public health [
12,
13,
14,
15]. One of the most popular predictive models for fitting the presence or absence of a disease by means of categorical data, especially by considering data with a binary response, is the logistic regression [
16]. Modeling and prediction for correlated and uncorrelated binary data through the logistic regression model have been carried out in different areas of science and especially in medicine. The logistic regression model is one of the most useful statistical tools due to its good properties and easy interpretation; see [
16,
17,
18]. This model presents statistical challenges that strongly affect the results and can compromise the inference, the predictions and, consequently, the conclusions, as well as data-driven medical decision-making. In this regard, once the model has been fitted to the binary response data, it is essential to check that its fit is valid. There are several ways to perform this validation in models for binary data [
19]. Recent advances in model checking and diagnostics have been developed by several authors [
20,
21,
22,
23,
24,
25,
26,
27,
28,
29,
30]. For more details and references regarding statistical diagnostics, see
Section 3.
In a recent study [
4], children and adolescents diagnosed with persistent or intermittent asthma were in medical follow-up for at least one year in a public hospital in São Paulo, Brazil. The patients in the study were 362 children and adolescents aged 6 to 20 years, of both sexes (59% male and 41% female) and from numerous ethnicities. Clinical examinations detected whether or not the patients had a fixed airway obstruction (FAO hereafter). These results were reported based on gender, age, height, region and pulmonary function test data when there was no significant response to a bronchodilator. Patients were classified into four groups according to their current asthma severity: [Group 1] Intermittent asthma; [Group 2] Mild persistent asthma; [Group 3] Moderate persistent asthma; and [Group 4] Severe persistent asthma. The explanatory variables considered are duration of treatment in years (
treatment hereafter), blood test presence or absence of eosinophilia (increased number of circulating eosinophils in the blood,
eosinophilia hereafter) and sum of all levels of all factors that produce allergy (
allergy hereafter) following the radioallergosorbent test (RAST). The values (mean ± SD) of the variables
treatment and RAST are 5.946 ± 3.255 and 7.064 ± 4.051, respectively, with SD denoting the standard deviation. The observations, grouped by severity level and analyzed by using a mixed model [
17], allow us to include the correlation and variability due to factors that were not observed in the study. Because the interest is in analyzing or predicting the asthma status through the binary response variable FAO, a mixed-effects logistic regression model can be proposed [
31] to describe this response.
The primary objective of this work is to provide data-influence analytics using a mixed-effects logistic regression model applied to asthma. This analytics is based on global and local influence diagnostic techniques, which are often used separately but are used simultaneously in this study. Therefore, the main contribution of this research is to consider global and local influence diagnostic techniques simultaneously in a mixed-effects logistic regression model applied to real-world asthma data. Such joint usage allows us to identify situations that could not be detected if these techniques were used separately. In addition, predictive performance measures are considered for such data-influence analytics. The secondary objectives of this work related to the application are: (i) to provide an algorithm that summarizes the methodology proposed in this study as a mechanism for improved scientific evidence in asthma data; (ii) to determine which explanatory variables are associated with FAO and to model the probability that a patient presents FAO given the asthma severity group in which the patient was classified; (iii) to identify values that, after their elimination, cause disproportionate changes in the estimates of the model parameters and allow us to improve its predictive performance; and (iv) to detect patients who are medically too different in relation to FAO.
This article is organized as follows.
Section 2 describes the mixed-effects logistic regression model for the asthma status study. In
Section 3, we present the methodology for data-influence analytics of the described predictive model.
Section 4 and
Section 5 introduce the global and local influence techniques. The Monte Carlo and Metropolis–Hastings methods are presented in
Section 6 to calculate the respective influence measures. In
Section 7, we provide the computational aspects and algorithms used in this study. In
Section 8, the quality of the fitted mixed-effects logistic regression model for studying asthma status is analyzed. Finally, in
Section 9, the conclusions and proposals for future studies are discussed.
2. Mixed-Effects Logistic Regression Model for Asthma Status Study
To study the asthma status of children and adolescents at a public hospital in São Paulo, Brazil, we consider a data set, clustered by current asthma severity, of
$q=362$ patients, with
q being used in
Section 5. In the context of mixed models, the clustered data set has four asthma severity groups, defined in the introduction, labeled by
i, with the
ith group consisting of
${n}_{i}$ patients, for
$i=1,\dots ,k$, where
$k=4$ in this study. The asthma status is represented by the binary response variable
${Y}_{ij}$, with
${Y}_{ij}=1$ if the patient
j in the
ith group is classified with FAO; otherwise,
${Y}_{ij}=0$ for
$j=1,\dots ,{n}_{i}$, with
$i=1,\dots ,k$. The probability
${\pi}_{ij}=\mathrm{P}({Y}_{ij}=1)$ is modeled as a function of the explanatory variables, which include the duration of the treatment (in years),
${X}_{1}$ (
treatment); an indicator variable of eosinophilia,
${X}_{2}$ (
eosinophilia); and sum of all levels of all factors that produce allergy according to the RAST,
${X}_{3}$ (
allergy). The change between asthma severity groups is accommodated through random intercept
${u}_{i}$. Then, our mixedeffect logistic regression model is described by
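${Y}_{ij}\mid {u}_{i}\sim \mathrm{Bernoulli}({\pi}_{ij})$, with
${u}_{i}\sim \mathrm{N}(0,{\sigma}^{2})$ and
$$\mathrm{logit}({\pi}_{ij})={\beta}_{0}+{\beta}_{1}{x}_{1ij}+{\beta}_{2}{x}_{2ij}+{\beta}_{3}{x}_{3ij}+{u}_{i},$$
where
${x}_{1ij},{x}_{2ij},{x}_{3ij}$ represent the values of
${X}_{1},{X}_{2},{X}_{3}$, respectively, and
$\mathrm{logit}({\pi}_{ij})=\log ({\pi}_{ij}/(1-{\pi}_{ij}))$.
Let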
$\mathbf{\theta}={({\beta}_{0},{\beta}_{1},{\beta}_{2},{\beta}_{3},{\sigma}^{2})}^{\top}$ be the vector of unknown parameters of the proposed mixed-effects logistic regression model. The maximum likelihood (ML) estimate of
$\mathbf{\theta}$, standard error (SE),
p-values, and the sensitivity (Sens), specificity (Spec) and accuracy (Acc) performance measures are presented in
Table 1. Computational aspects related to parameter estimation and calculation of prediction performance measures are described in
Section 7. The procedure, from formulating the mixed-effects logistic regression model to obtaining the final prediction model, is summarized in Algorithm 1.
Algorithm 1 Formulation/estimation/fit/validation of the mixed-effects logistic regression model.
 1: Collect a sample of data $\mathit{y}$ according to a mixed-effects logistic regression model.
 2: Conduct an exploratory data analysis to show evidence of mixed effects in the logistic regression model.
 3: Estimate the parameters of the mixed-effects logistic regression model with the ML method.
 4: Use the asymptotic properties of the ML estimators to obtain the SE and p-values associated with each parameter estimated in step 3.
 5: Calculate Sens, Spec and Acc performance measures to validate the model.
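To make the model structure concrete, the following Python sketch evaluates the conditional probability of FAO for one patient given the random intercept of the patient's severity group, assuming the usual logit link with a group-level random intercept. The coefficient values, the function names and the example patient are hypothetical placeholders and are not the estimates reported in Table 1.

```python
import math


def inverse_logit(eta):
    """Map a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def fao_probability(beta, x, u_i):
    """Conditional probability of FAO given the group random intercept u_i.

    beta: (beta0, beta1, beta2, beta3)
    x:    (treatment, eosinophilia, allergy) for one patient
    """
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x)) + u_i
    return inverse_logit(eta)


# Hypothetical coefficients and patient, for illustration only.
beta_demo = (-3.0, 0.15, 0.8, 0.02)
patient = (6.0, 1, 7.0)  # 6 years of treatment, eosinophilia present, RAST sum 7

# The same patient in a group with a lower or a higher random intercept.
p_low = fao_probability(beta_demo, patient, u_i=-0.5)
p_high = fao_probability(beta_demo, patient, u_i=0.5)
```

In this sketch, two patients with identical covariates receive different probabilities only through the random intercept of their group, which is how the model accommodates differences between severity groups.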

The results related to the fixed effects of the model indicate that the explanatory variables
treatment and
eosinophilia are significant at 5% according to the
p-values of
Table 1, which are 0.12% and 2.20%, respectively. The 5% level is one of the most commonly used in the literature and is adopted here as a benchmark to make other inferences and obtain the necessary conclusions. However, readers can still draw their own conclusions by means of the
p-values reported in the tables of the present manuscript. Note that this significance level of 5% is also adopted as a benchmark for the post-deletion analysis of cases after applying the data-influence analytics detailed in the following sections. The positive estimates of the
treatment and
eosinophilia coefficients indicate that, for a given group, as the treatment time of a patient increases, the probability of presenting FAO increases as well. In addition, a patient with eosinophilia is more likely to present FAO. Note that the SD associated with the random intercept distribution is greater than zero. Hence, heterogeneity is detected among the four asthma severity groups. Regarding the predictive performance of the model,
Table 1 reports that the probability of correct classification of having FAO is 69.69%, the probability of correct classification of not having FAO is 75.98%, and probability of correct classification is 75.41%.
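The three performance measures can be computed from observed responses and fitted probabilities as in the following Python sketch. It assumes that a patient is classified as having FAO when the fitted probability is at least a chosen threshold and that both outcome classes are present; the function name and the toy data are hypothetical and do not reproduce the study data.

```python
def classification_measures(y_true, prob, threshold):
    """Return (Sens, Spec, Acc) when prob >= threshold is predicted as 1.

    Assumes y_true contains at least one 1 and at least one 0.
    """
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, prob):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 0:
            tn += 1
        else:
            fp += 1
    sens = tp / (tp + fn)            # correct classification among FAO cases
    spec = tn / (tn + fp)            # correct classification among non-FAO cases
    acc = (tp + tn) / len(y_true)    # overall correct classification
    return sens, spec, acc


# Toy data for illustration only.
y_demo = [1, 1, 1, 0, 0, 0, 0, 0]
p_demo = [0.40, 0.25, 0.05, 0.30, 0.08, 0.04, 0.02, 0.01]
sens, spec, acc = classification_measures(y_demo, p_demo, threshold=0.1)
```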
3. Data-Influence Analytics in Mixed-Effects Logistic Regression Model
Data-influence analytics is used to identify potentially influential cases that can affect the parameter estimates and the quality of the model prediction. This can allow us to detect implicit problems in the data set and cases that, after being removed, might modify the inferences/predictions and conclusions drawn from the analysis and possibly alter the decisions made from the study results.
In the statistical literature, there are two main techniques for detecting influential cases. The first one corresponds to global influence diagnostics, commonly performed by case deletion, which consists of eliminating cases from the full data set; see details in, for example, Refs. [
32,
33,
34,
35,
36]. The second one corresponds to local influence diagnostics that allows us to identify cases that, under small perturbations in the model or in the data, may cause disproportionate changes in the estimates of the model parameters; see details in, for example, Refs. [
22,
24,
25,
26,
27,
28,
30,
37,
38,
39].
The difference between the two techniques is that local influence diagnostics does not require the elimination of cases and allows us to evaluate simultaneously the joint influence of all potentially influential cases. Nevertheless, both techniques can be connected to generate a more complete diagnostic, which is the proposal of this work. On the one hand, global influence by case deletion [
36] is a technique which develops a diagnostic measure by evaluating the difference between the estimates of model parameters before and after deleting potentially influential cases from the data set. On the other hand, the local influence technique [
37,
39] derives diagnostic measures by using the curvature of the influence graph for an appropriate function.
For the mixed-effects logistic regression model, we combine the global influence diagnostics proposed in [
40] for the model with incomplete data and the local influence diagnostics presented in [
24] for binary response variables, both supported by Monte Carlo integration and by observations sampled with the Metropolis–Hastings algorithm.
Let the random effects of the mixed-effects logistic regression model be represented as a missing (unobserved) data set,
${\mathit{y}}_{\mathrm{u}}=\{{u}_{i}:i=1,\dots ,k\}$, and augmented with the observed data set
${\mathit{y}}_{\mathrm{o}}=\{{y}_{ij}:j=1,\dots ,{n}_{i};i=1,\dots ,k\}$. Then, the complete data set can be represented as
${\mathit{y}}_{\mathrm{c}}=({\mathit{y}}_{\mathrm{o}},{\mathit{y}}_{\mathrm{u}})$. Thus, the complete-data log-likelihood function for the model parameter
$\mathbf{\theta}$ is given by
$\ell (\mathbf{\theta};{\mathit{y}}_{\mathrm{c}})={\sum}_{i=1}^{k}{\sum}_{j=1}^{{n}_{i}}\log \left({p}_{{Y}_{ij}\mid {u}_{i}}({y}_{ij})\,{p}_{{u}_{i}}({u}_{i})\right)$, where
${p}_{{Y}_{ij}\mid {u}_{i}}({y}_{ij})={\pi}_{ij}^{{y}_{ij}}{(1-{\pi}_{ij})}^{1-{y}_{ij}}$
and
${p}_{{u}_{i}}({u}_{i})$ is the density function of the normal distribution of mean zero and variance
${\sigma}^{2}$ for
$j=1,\dots ,{n}_{i}$ and
$i=1,\dots ,k$. Subsequently, inspired by the expectation-maximization (EM) algorithm [
40,
41,
42], we develop global and local influence measures based on the conditional expectation of the complete-data log-likelihood function,
$Q(\widehat{\mathbf{\theta}})=Q(\mathbf{\theta}){|}_{\mathbf{\theta}=\widehat{\mathbf{\theta}}}=\mathrm{E}[\ell (\mathbf{\theta};{\mathit{Y}}_{\mathrm{c}})\mid {\mathit{Y}}_{\mathrm{o}}={\mathit{y}}_{\mathrm{o}}]{|}_{\mathbf{\theta}=\widehat{\mathbf{\theta}}}$, where the expectation is calculated with respect to the conditional density function
${p}_{{\mathit{Y}}_{\mathrm{u}}\mid {\mathit{Y}}_{\mathrm{o}}={\mathit{y}}_{\mathrm{o}}}$.
4. The Global Influence Diagnostics
The global influence technique allows us to study the effect of deleting cases or case-groups on the estimate of
$\mathbf{\theta}$. Thus, for the mixed-effects logistic regression model, there are two kinds of deletions of interest. One of them is the deletion of each case, in order to evaluate the influence of the deleted case on the ML estimate of
$\mathbf{\theta}$. The other one is the case-group deletion, in order to evaluate the influence of the deleted case-group on the ML estimate of
$\mathbf{\theta}$. In this context, consider the following notation. A quantity with a subscript “
$[\cdot ]$” means the relevant quantity with the
$ij$th case or
ith group deleted. Hence, we define
${\mathit{y}}_{\mathrm{o}[\cdot ]}$,
${\mathit{y}}_{\mathrm{u}[\cdot ]}$ and
${\mathit{y}}_{\mathrm{c}[\cdot ]}$ as the observed, unobserved and complete data sets, respectively, with the
$ij$th case or
ith group deleted. Additionally, we define
${\widehat{\mathbf{\theta}}}_{[\cdot ]}$ as the ML estimate of
$\mathbf{\theta}$ obtained with the
$ij$th case or
ith group deleted. Then, according to [
40], in order to assess the influence of the
$ij$th case or
ith group on the ML estimate
$\widehat{\mathbf{\theta}}$, the difference between
${\widehat{\mathbf{\theta}}}_{[\cdot ]}$ and
$\widehat{\mathbf{\theta}}$ is calculated through the global influence measure given by
where
However, the measure given in (
1) implies calculating
${\widehat{\mathbf{\theta}}}_{[\cdot ]}$ for every case. Hence, the procedure can be computationally intensive, depending on the size of the data set. Therefore, Ref. [
40] proposes a one-step approximation
${\widehat{\mathbf{\theta}}}_{[\cdot ]}^{1}$ of
${\widehat{\mathbf{\theta}}}_{[\cdot ]}$ given by
where
Note that
${\widehat{\mathbf{\theta}}}_{[\cdot ]}^{1}$ depends only on the ML estimate
$\widehat{\mathbf{\theta}}$, which saves computation time. Consequently, substituting (
2) into (
1), the global influence measure is given by
where the derivatives included in
$\dot{\mathit{Q}}{(\widehat{\mathbf{\theta}})}_{[\cdot ]}$ are
To study the influence of the
$ij$th case or
ith group, we propose to work with the benchmark
$\overline{D}+2\mathrm{SE}(D),$ where
$\overline{D}$ and
$\mathrm{SE}(D)$ correspond to the mean and SE of all values of
${D}_{[\cdot ]}^{1}$.
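For orientation, case-deletion quantities of the kind discussed in this section are commonly written, in the literature on models with incomplete data, in the form sketched below. This is only a sketch under the assumption that the usual $Q$-function-based generalized Cook distance is intended; the symbol $\ddot{\mathit{Q}}(\widehat{\mathbf{\theta}})$ for the matrix of second derivatives of $Q$ evaluated at $\widehat{\mathbf{\theta}}$, and the absence of any additional scaling constant, are assumptions and are not taken from the text above.

```latex
\begin{align*}
D_{[\cdot]} &= (\widehat{\mathbf{\theta}}_{[\cdot]}-\widehat{\mathbf{\theta}})^{\top}
  \{-\ddot{\mathit{Q}}(\widehat{\mathbf{\theta}})\}
  (\widehat{\mathbf{\theta}}_{[\cdot]}-\widehat{\mathbf{\theta}}),\\
\widehat{\mathbf{\theta}}^{1}_{[\cdot]} &= \widehat{\mathbf{\theta}}
  +\{-\ddot{\mathit{Q}}(\widehat{\mathbf{\theta}})\}^{-1}
  \dot{\mathit{Q}}(\widehat{\mathbf{\theta}})_{[\cdot]},\\
D^{1}_{[\cdot]} &= \dot{\mathit{Q}}(\widehat{\mathbf{\theta}})_{[\cdot]}^{\top}
  \{-\ddot{\mathit{Q}}(\widehat{\mathbf{\theta}})\}^{-1}
  \dot{\mathit{Q}}(\widehat{\mathbf{\theta}})_{[\cdot]}.
\end{align*}
```

Under this form, $D^{1}_{[\cdot]}$ requires only the full-data estimate together with first and second derivatives of $Q$, which agrees with the remark that the one-step approximation avoids refitting the model for every deleted case or group.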
5. The Local Influence Diagnostics
The local influence technique allows us to study the effect of minor modifications or perturbations in the model or the data on the estimate of $\mathbf{\theta}$ due to some source of uncertainty in the model. One source of uncertainty that is crucial in mixed-effects logistic regression models is the binary response variable. Note that, in this case, the response may assume only the values zero or one, so that the local influence technique cannot be applied with a direct perturbation of the response, but its probability of success can be perturbed as described below.
Let
$\mathbf{\omega}={({\omega}_{1},\dots ,{\omega}_{q})}^{\top}$ be a
$q\times 1$ perturbation vector in
$\Omega \subset {\mathbb{R}}^{q}$ and
$\mathcal{M}\equiv \{{p}_{{\mathit{Y}}_{\mathrm{c}}}({\mathit{y}}_{\mathrm{c}};\mathbf{\theta},\mathbf{\omega}):\mathbf{\omega}\in \Omega \subset {\mathbb{R}}^{q}\}$ be the perturbed mixedeffects logistic regression model, where
${p}_{{\mathit{Y}}_{\mathrm{c}}}({\mathit{y}}_{\mathrm{c}};\mathbf{\theta},\mathbf{\omega})$ is the density function of
${\mathit{Y}}_{\mathrm{c}}$ perturbed by
$\mathbf{\omega}$ and
$\ell (\mathbf{\theta},\mathbf{\omega};{\mathit{y}}_{\mathrm{c}})$ is its corresponding complete-data log-likelihood function. Assume that there is a nonperturbation vector
${\mathbf{\omega}}_{0}$ such that
${p}_{{\mathit{Y}}_{\mathrm{c}}}({\mathit{y}}_{\mathrm{c}};\mathbf{\theta},{\mathbf{\omega}}_{0})={p}_{{\mathit{Y}}_{\mathrm{c}}}({\mathit{y}}_{\mathrm{c}};\mathbf{\theta})$ and
$\ell (\mathbf{\theta},{\mathbf{\omega}}_{0};{\mathit{y}}_{\mathrm{c}})=\ell (\mathbf{\theta};{\mathit{y}}_{\mathrm{c}})$ for all
$\mathbf{\theta}$. To assess the local influence of
$\mathbf{\omega}$ on the ML estimate
$\widehat{\mathbf{\theta}}$, one can consider the Q-displacement function [
38] given by
${f}_{Q}(\mathbf{\omega})=2(Q(\widehat{\mathbf{\theta}})-Q(\widehat{\mathbf{\theta}}(\mathbf{\omega})))$, where
$\widehat{\mathbf{\theta}}(\mathbf{\omega})$ is the ML estimate of
$\mathbf{\theta}$ that maximizes
$Q(\mathbf{\theta},\mathbf{\omega}){|}_{\mathbf{\theta}=\widehat{\mathbf{\theta}}}=\mathrm{E}[\ell (\mathbf{\theta},\mathbf{\omega};{\mathit{Y}}_{\mathrm{c}})\mid {\mathit{Y}}_{\mathrm{o}}={\mathit{y}}_{\mathrm{o}}]{|}_{\mathbf{\theta}=\widehat{\mathbf{\theta}}}$ and the expectation is calculated with respect to the conditional density function
${p}_{{\mathit{Y}}_{\mathrm{u}}\mid {\mathit{Y}}_{\mathrm{o}}={\mathit{y}}_{\mathrm{o}}}$. Therefore,
$Q(\widehat{\mathbf{\theta}})=Q(\mathbf{\theta},\mathbf{\omega}){|}_{\mathbf{\theta}=\widehat{\mathbf{\theta}},\mathbf{\omega}={\mathbf{\omega}}_{0}}$. Following the arguments given in [
37] to characterize the behavior of
${f}_{Q}(\mathbf{\omega})$ at
${\mathbf{\omega}}_{0}$, it is shown in [
41] that the normal curvature
${C}_{{f}_{Q},\mathit{h}}$ of
$\alpha (\mathbf{\omega})$ at
${\mathbf{\omega}}_{0}$, in the direction of a unit vector
$\mathit{h}\in {\mathbb{R}}^{q}$, is given by
where
${\ddot{\mathit{Q}}}_{{\mathbf{\omega}}_{0}}={\partial}^{2}Q(\widehat{\mathbf{\theta}}(\mathbf{\omega}))/\partial \mathbf{\omega}\partial {\mathbf{\omega}}^{\top}$ is a
$q\times q$ matrix evaluated at
$\mathbf{\omega}={\mathbf{\omega}}_{0}$,
${\ddot{\mathit{Q}}}_{\theta}(\widehat{\mathbf{\theta}})={\partial}^{2}Q(\mathbf{\theta})/\partial \mathbf{\theta}\partial {\mathbf{\theta}}^{\top}$ is a
$p\times p$ symmetric and semipositive definite matrix evaluated at
$\mathbf{\theta}=\widehat{\mathbf{\theta}}$, and
${\mathbf{\Delta}}_{{\mathbf{\omega}}_{0}}={\partial}^{2}Q(\mathbf{\theta},\mathbf{\omega})/\partial \mathbf{\theta}\partial {\mathbf{\omega}}^{\top}$ is the
$p\times q$ perturbation matrix evaluated at
$\mathbf{\theta}=\widehat{\mathbf{\theta}}$ and
$\mathbf{\omega}={\mathbf{\omega}}_{0}$. Nevertheless, the measure given in (
4) is invariant under reparametrization of
$\mathbf{\theta}$. Ref. [
41] also proposes the conformal normal curvature
${B}_{{f}_{Q},\mathit{h}}$ at
${\mathbf{\omega}}_{0}$, in the direction of a unit vector
$\mathit{h}\in {\mathbb{R}}^{q}$, as
Let
${\lambda}_{1}\ge \cdots \ge {\lambda}_{r}>0$ be the
r nonzero eigenvalues of
$\mathit{Q}=2{\ddot{\mathit{Q}}}_{{\mathbf{\omega}}_{0}}/\mathrm{trace}(2{\ddot{\mathit{Q}}}_{{\mathbf{\omega}}_{0}})$ and
${\mathit{e}}_{1},\dots ,{\mathit{e}}_{r}$ be their corresponding orthogonal eigenvectors. Based on
Q, the aggregate contribution vector defined as
$\mathit{M}(0)={\sum}_{m=1}^{r}{\lambda}_{m}{\mathit{e}}_{m}^{2}$ is used to assess the local influence of
$\mathbf{\omega}$, where
${\mathit{e}}_{m}^{2}={({e}_{m1}^{2},\dots ,{e}_{mq}^{2})}^{\top}$ [
41,
43]. To study the influence of
$\mathbf{\omega}$, we work with the following benchmark. The
$ij$th case or
ith group is potentially influential if
$M(0)>\overline{M}+2\mathrm{SE}(M)$, where
$\overline{M}$ and
$\mathrm{SE}(M)$ are the mean and SE of
$M(0)$ values.
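The benchmark rule of the form "mean plus two SE" can be sketched in Python as follows. The sketch assumes that the SE of the influence values is computed as their sample standard deviation, which is an interpretation and is not stated in the text; the function name and the toy values are hypothetical.

```python
import math


def benchmark_flags(values, multiplier=2.0):
    """Return (cutoff, indices) for values exceeding mean + multiplier * SD.

    Assumption: the SE in the benchmark is the sample standard deviation
    (denominator n - 1) of the influence values.
    """
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    cutoff = mean + multiplier * sd
    flagged = [i for i, v in enumerate(values) if v > cutoff]
    return cutoff, flagged


# Toy influence values for illustration only; index 7 is clearly larger.
m_demo = [0.002, 0.003, 0.001, 0.002, 0.004, 0.003, 0.002, 0.050, 0.001, 0.002]
cutoff, flagged = benchmark_flags(m_demo)
```

The same function applies to global influence values, since the benchmark for those has the same form.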
Arbitrarily perturbing the model or data may lead to unreliable results regarding the influence diagnostics. Ref. [
44] proposes a way of selecting an appropriate perturbation vector
$\mathbf{\omega}$, for the model
$\mathcal{M}$, based on the expected Fisher information matrix with respect to
$\mathbf{\omega}$. This matrix is given by
$\mathit{G}(\mathbf{\omega})=({g}_{l\,{l}^{\prime}}(\mathbf{\omega}))$, with
where the expectation is calculated with respect to
${p}_{{\mathit{Y}}_{\mathrm{c}}}({\mathit{y}}_{\mathrm{c}};\mathbf{\theta},\mathbf{\omega})$; see more information about the properties of this matrix in [
44]. Then, a perturbation vector
$\mathbf{\omega}$ is appropriate if
$\mathit{G}(\mathbf{\omega})$ evaluated at
${\mathbf{\omega}}_{0}$ equals
$a{\mathit{I}}_{q}$, that is,
$\mathit{G}({\mathbf{\omega}}_{0})=a{\mathit{I}}_{q}$, with
$a>0$, and
${\mathit{I}}_{q}$ being the
$q\times q$ identity matrix. Now, if
$\mathit{G}({\mathbf{\omega}}_{0})\ne a{\mathit{I}}_{q}$, we can always reparametrize the perturbed model
$\mathcal{M}$ by considering the one-to-one transformation
$\mathbf{\omega}({\mathbf{\omega}}^{*})={\mathbf{\omega}}_{0}+\mathit{G}{({\mathbf{\omega}}_{0})}^{1/2}({\mathbf{\omega}}^{*}-{\mathbf{\omega}}_{0})$, such that
$\mathit{G}({\mathbf{\omega}}^{*})$ evaluated at
${\mathbf{\omega}}_{0}$ is equal to
$a{\mathit{I}}_{q}$.
In this context, because the perturbation of the probability of success given by
${\omega}_{l\,{l}^{\prime}}{\pi}_{l\,{l}^{\prime}}$, with
${\omega}_{l\,{l}^{\prime}}\in (0,1]$, is not appropriate, the perturbation
$({\omega}_{{0}_{l\,{l}^{\prime}}}+{g}_{l\,{l}^{\prime}}{({\mathbf{\omega}}_{0})}^{1/2}({\omega}_{l\,{l}^{\prime}}^{*}-{\omega}_{{0}_{l\,{l}^{\prime}}})){\pi}_{l\,{l}^{\prime}}$ can be considered, where the elements
${g}_{l\,{l}^{\prime}}({\mathbf{\omega}}_{0})$ are stated as
and the nonperturbation vector is
${\mathbf{\omega}}_{0}^{*}=\mathbf{1}$. Thus, the nonzero derivative involved in
${\mathbf{\Delta}}_{{\mathbf{\omega}}_{0}}$ is
Note that, when
${y}_{l\,{l}^{\prime}}=1$, the derivative given in (
5) is equal to zero. In practice, we first carry out the local influence diagnostics for cases with
${y}_{l\,{l}^{\prime}}=0$, and then we alternate the values of
${y}_{l\,{l}^{\prime}}$ to perform the diagnostics with
${y}_{l\,{l}^{\prime}}=1$.
7. Computational Framework
To carry out the procedure of data-influence analytics, we summarize the methodology that has been introduced in
Section 2,
Section 3,
Section 4,
Section 5 and
Section 6 by means of Algorithms 3–6. Specifically, Algorithm 6 corresponds to the full procedure of data-influence analytics, which implements the other three algorithms sequentially through what we call phases. In Phase I, Algorithm 3 is called to execute the procedure of sampling observations from the Metropolis–Hastings algorithm. In Phases II and III, Algorithms 4 and 5 are designed to execute global and local influence diagnostics, respectively. Note that when we refer to global and local influence diagnostics, these include the post-deletion analysis, which consists of evaluating the impact of removing the groups or cases detected as potentially influential on the estimates, SE,
p-values, relative change (RC), and predictive performance measures (Sens, Spec and Acc using the selection criterion Sens = Spec). Based on the results obtained in Phases II and III, Phase IV decides which cases need a new post-deletion analysis. Thus, with the results obtained in Phase IV, Phase V performs the final post-deletion analysis.
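Two ingredients of the post-deletion analysis can be sketched in Python: choosing the classification threshold by the criterion Sens = Spec, and computing the relative change of an estimate. The sketch assumes that the threshold is searched over a finite grid and taken where the absolute difference between Sens and Spec is smallest, and that RC is the absolute relative difference, in percent, between the estimate from all data and the estimate after deletion. Both conventions, the function names and the toy data are assumptions for illustration.

```python
def sens_spec(y_true, prob, threshold):
    """Sens and Spec when prob >= threshold is predicted as 1."""
    pos = [p for y, p in zip(y_true, prob) if y == 1]
    neg = [p for y, p in zip(y_true, prob) if y == 0]
    sens = sum(p >= threshold for p in pos) / len(pos)
    spec = sum(p < threshold for p in neg) / len(neg)
    return sens, spec


def threshold_sens_equals_spec(y_true, prob, grid):
    """Grid threshold with the smallest |Sens - Spec| (first one on ties)."""
    def gap(t):
        sens, spec = sens_spec(y_true, prob, t)
        return abs(sens - spec)
    return min(grid, key=gap)


def relative_change(full_estimate, deleted_estimate):
    """Assumed RC: absolute relative difference, in percent."""
    return abs((full_estimate - deleted_estimate) / full_estimate) * 100.0


# Toy data for illustration only.
y_demo = [1, 1, 1, 1, 0, 0, 0, 0]
p_demo = [0.60, 0.35, 0.22, 0.08, 0.30, 0.12, 0.05, 0.02]
grid = [t / 100 for t in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
best = threshold_sens_equals_spec(y_demo, p_demo, grid)
rc = relative_change(0.50, 0.40)
```

A grid search is used here only because it is simple to follow; a finer grid or the set of distinct fitted probabilities could be used instead.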
The proposed methodology is implemented in the
R and
RStudio software [
45].
R is a noncommercial open source software for statistical computing and graphics and
RStudio is an integrated development environment (IDE) for
R. They can be downloaded from
www.r-project.org and
www.rstudio.com, respectively. For an application of
R and
RStudio in medical sciences, see [
46]. Some
R packages for fitting non-normal data with mixed effects are available in
CRAN.R-project.org [
47]. Specifically, we use the
base package for descriptive statistics and the
lme4 package for fitting the mixedeffects logistic regression model. We use the command
glmer of the
lme4 package for the ML estimation of
$\mathbf{\theta}$ based on the AGHQ procedure with 25 quadrature points. We employ the
matrixcalc package for calculations associated with global and local influence measures, whereas the
PresenceAbsence package is considered for calculating the Sens, Spec and Acc measures.
R codes with the implementation of the proposed methodology are available from the authors upon request.
Algorithm 3 Procedure of sampling observations from the Metropolis–Hastings algorithm.
 1: Collect clustered binary data ${y}_{ij}$ and a ${p}_{1}\times 1$ vector with the values of the covariates denoted by ${\mathit{x}}_{ij}$ for the fixed effects, with $j=1,\dots ,{n}_{i}$ and $i=1,\dots ,k$.
 2: Formulate a mixed-effects logistic regression model and determine the ML estimates of its parameters by using the AGHQ procedure with 25 quadrature points.
 3: Generate a random sample $\{{u}_{i}^{({s}_{2})};{s}_{2}=1,\dots ,{S}_{2}=2000\}$ from the normal distribution with zero mean and variance ${\widehat{\sigma}}^{2}$ and calculate the elements of the matrix $\mathit{G}({\mathbf{\omega}}_{0})$ given in (9).
 4: Generate data $\{{\mathit{y}}_{\mathrm{u}}^{({s}_{1})}:{s}_{1}=1,\dots ,{S}_{1}=10000\}$ from the conditional density function given in (6) by using the Metropolis–Hastings method defined in Algorithm 2.
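The sampling step for one severity group can be sketched in Python with a random-walk Metropolis–Hastings sampler. This is a generic sketch and not Algorithm 2 itself: it assumes that the target is the conditional density of the random intercept given the group's responses, proportional to the Bernoulli likelihood times the normal density of the intercept, and it uses a symmetric normal proposal. The step size, burn-in length, function names and toy inputs are hypothetical choices.

```python
import math
import random


def log_target(u, y, eta_fixed, sigma2):
    """Unnormalized log density of u given the group's responses y."""
    value = -0.5 * u * u / sigma2  # normal kernel of the random intercept
    for y_ij, eta in zip(y, eta_fixed):
        lin = eta + u
        # Bernoulli log-likelihood with logit link: y*lin - log(1 + exp(lin))
        value += y_ij * lin - math.log1p(math.exp(lin))
    return value


def sample_random_intercept(y, eta_fixed, sigma2, n_draws,
                            step=0.5, burn_in=500, seed=1):
    """Random-walk Metropolis-Hastings draws of one group's intercept."""
    rng = random.Random(seed)
    u = 0.0
    current = log_target(u, y, eta_fixed, sigma2)
    draws = []
    for it in range(burn_in + n_draws):
        proposal = u + rng.gauss(0.0, step)
        candidate = log_target(proposal, y, eta_fixed, sigma2)
        # Symmetric proposal: accept with probability min(1, target ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            u, current = proposal, candidate
        if it >= burn_in:
            draws.append(u)
    return draws


# Toy group: fixed-effect linear predictors and responses are illustrative.
y_demo = [1, 1, 1, 0, 1, 1]
eta_demo = [-1.0] * 6
draws = sample_random_intercept(y_demo, eta_demo, sigma2=1.0, n_draws=2000)
posterior_mean = sum(draws) / len(draws)
```

The draws can then be averaged to approximate conditional expectations by Monte Carlo integration. In practice the step size would be tuned and convergence checked, which this sketch omits.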

Algorithm 4 Procedure for global influence diagnostics.
 1: Based on the data $\{{\mathit{y}}_{\mathrm{u}}^{({s}_{1})}:{s}_{1}=1,\dots ,{S}_{1}\}$ generated in Algorithm 3, approximate the vector $\dot{\mathit{Q}}{(\widehat{\mathbf{\theta}})}_{[\cdot ]}$ given in (7) for the $ij$th case and ith case-group, with $j=1,\dots ,{n}_{i}$ and $i=1,\dots ,k$.
 2: Calculate the global influence measures ${D}_{[\cdot ]}^{1}$, given in (3), for the $ij$th case and ith case-group, with $j=1,\dots ,{n}_{i}$ and $i=1,\dots ,k$.
 3: Compute the benchmark $\overline{D}+2\mathrm{SE}(D)$ for the cases and case-groups, identifying potentially influential points.
 4: Perform post-deletion analysis with the cases or case-groups detected as potentially influential.

Algorithm 5 Procedure for local influence diagnostics.
 1: Based on the data $\{{\mathit{y}}_{\mathrm{u}}^{({s}_{1})}:{s}_{1}=1,\dots ,{S}_{1}\}$ generated in Algorithm 3, approximate the Fisher information matrices $\ddot{\mathit{Q}}(\widehat{\mathbf{\theta}})$ and ${\mathbf{\Delta}}_{{\mathbf{\omega}}_{0}}$ given in (8).
 2: Calculate the local influence measures $M(0)$, with $j=1,\dots ,{n}_{i}$ and $i=1,\dots ,k$.
 3: Compute the benchmark $\overline{M}+2\mathrm{SE}(M)$ and identify potentially influential points.
 4: Alternate the values of the grouped binary data ${y}_{ij}$, with $j=1,\dots ,{n}_{i}$ and $i=1,\dots ,k$; carry out steps 2 to 4 of Algorithm 3; and then repeat steps 1 to 3.
 5: Perform post-deletion analysis with the cases detected as potentially influential.

Algorithm 6 Procedure for data-influence analytics.
 1: Produce the formulation, estimation, fit and validation of the model with Algorithm 1.
 2: Consider the Metropolis–Hastings method to obtain observations as in Algorithm 2.
 3: Execute Phase I (sampling observations using Metropolis–Hastings) with Algorithm 3.
 4: Perform Phase II (global influence diagnostics) with Algorithm 4.
 5: Carry out Phase III (local influence diagnostics) with Algorithm 5.
 6: Establish Phase IV (Phase II and Phase III for post-deletion analysis).
 7: Conduct Phase V based on the results of Phase IV to perform the final post-deletion analysis.

8. Model Quality
To evaluate the quality of the mixed-effects logistic regression model used in the study with asthma data, we carry out Phases II and III of the data-influence analytics described in Algorithm 6. The results are as follows.
Figure 1 shows the index plots of global influence measures for (a) the cases, with benchmark equal to 0.0141, and (b) the case-groups, with benchmark equal to 0.5250. All potentially influential cases from the four groups identified in
Figure 1a have been displayed in
Table 2. In addition,
Figure 1b indicates Group 4 (severe persistent asthma) as potentially influential.
Figure 2a,b show index plots of local influence measures for (a)
${y}_{ij}=0$ with benchmark equal to 0.0031 and (b)
${y}_{ij}=1$ with benchmark equal to 0.0040. All locally influential cases from the four groups identified in
Figure 2 are reported in
Table 3.
Table 4 and Table 5 display the estimates, SE, p-values, RC and predictive performance measures from the post-deletion analysis of the cases and case-groups detected as influential under global influence diagnostics (Phase II). Regarding the parameter estimates of the fixed effects ($\widehat{\mathbf{\beta}}$), note that after removing the cases detected as potentially influential for each group, the estimates present moderate changes, but the estimate related to the random intercept ($\widehat{\sigma}$) presents a large change, in accordance with the RC values. In addition, the eosinophilia covariate passes from significant to not significant at 5% when cases from Groups 3 and 4 are removed. With respect to the Sens, Spec and Acc measures, with the criterion Sens = Spec determining an optimal threshold equal to 0.1, once the potentially influential cases of Groups 1, 2 and 3 are removed, the values of Sens, Spec and Acc increase considerably. Observe that the maximum values of Sens, Spec and Acc, namely 0.8333, 0.8328 and 0.8328, respectively, are obtained by removing the cases detected as potentially influential in Group 3. Under global influence analysis for the case-groups, the estimates related to the fixed effects ($\mathbf{\beta}$) present moderate changes, with inferential changes at 5% in the eosinophilia covariate for Group 4. The estimate of the variance parameter associated with the distribution of the random intercept ($\sigma $) is almost zero, that is, the model does not capture the heterogeneity between asthma severity groups, which suggests a standard logistic regression model. In addition, after Group 4 is removed, the values of Sens, Spec and Acc decrease.
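The RC values compare an estimate obtained from all data with the estimate obtained after deleting cases. A minimal Python sketch is given below, assuming the usual percentage form RC = |(full - reduced) / full| × 100; this formula is an assumption here, since the definition used in the tables is stated elsewhere in the paper, and the numbers in the example are made up.

```python
def relative_change(full, reduced):
    """Percentage relative change of an estimate after case deletion.

    Assumed form: |(full - reduced) / full| * 100.
    """
    if full == 0:
        raise ValueError("relative change is undefined when the full-data estimate is 0")
    return abs((full - reduced) / full) * 100.0

# Hypothetical estimates: 2.0 with all data, 1.5 after deleting flagged cases.
print(relative_change(2.0, 1.5))  # prints 25.0
```

Under this form, an estimate that shrinks towards zero after deletion yields an RC close to 100%.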
The results of the post-deletion analysis of the cases detected as influential by group under local influence diagnostics (Phase III) are presented in Table 6 and Table 7. We note that, after removing the cases detected as potentially influential for each group, the estimates related to the fixed effects ($\mathbf{\beta}$) present moderate changes in all groups, with inferential changes at 5% in the eosinophilia covariate for Groups 3 and 4. The estimate of the variance parameter associated with the distribution of the random intercept ($\sigma $) is almost zero, that is, the model does not capture the heterogeneity between asthma severity groups, which suggests a standard logistic regression model. In relation to the Sens, Spec and Acc measures, with the selection criterion Sens = Spec determining an optimal threshold equal to 0.1, the values of Sens, Spec and Acc are equal to 0.6969, 0.7598 and 0.7541, respectively, for all data. After removing the cases detected as potentially influential in Groups 1, 2 and 3, the values increase considerably. The maximum values, namely 0.8333, 0.8371 and 0.8368, respectively, are obtained by removing the cases detected as influential in Group 3. However, for Group 4, these values decrease, mainly for Spec and Acc.
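The Sens = Spec selection criterion can be illustrated with a short Python sketch that computes Sens, Spec and Acc at a given threshold and then searches a grid for the threshold at which Sens and Spec are closest. The function names, the grid and the toy data are assumptions made for illustration; the predicted probabilities would come from the fitted model.

```python
import numpy as np

def performance(y_true, prob, threshold):
    """Sensitivity, specificity and accuracy when predicting 1 if prob >= threshold.

    Assumes y_true contains at least one 0 and one 1.
    """
    y_true = np.asarray(y_true)
    pred = (np.asarray(prob) >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / y_true.size
    return float(sens), float(spec), float(acc)

def threshold_sens_eq_spec(y_true, prob, grid=None):
    """Grid threshold minimizing |Sens - Spec| (first one in case of ties)."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    gaps = []
    for t in grid:
        sens, spec, _ = performance(y_true, prob, t)
        gaps.append(abs(sens - spec))
    return float(grid[int(np.argmin(gaps))])

# Toy data: two patients without the event, two with it (made-up probabilities).
y = [0, 0, 1, 1]
p = [0.1, 0.2, 0.8, 0.9]
t = threshold_sens_eq_spec(y, p)
print(t, performance(y, p, t))
```

With rare events, as when the selected threshold is as low as 0.1, balancing Sens and Spec in this way avoids a classifier that predicts the majority class for every patient.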
According to the results obtained in Phases II and III, Groups 3 and 4 need further study (Phase IV). For that reason, we now perform post-deletion analysis considering each type of response. Table 8 and Table 9 display the estimates, SE, p-values, RC and predictive performance measures, with ${y}_{ij}=0$ and ${y}_{ij}=1$, for Groups 3 and 4. For the global influence diagnostics, the cases with responses ${y}_{ij}=0$ for Group 4 lead to a significant allergy covariate at 5%. For the cases with responses ${y}_{ij}=1$, in both groups we conclude that the eosinophilia covariate is not significant at 5%. Regarding the performance measures, after removing the cases with ${y}_{ij}=0$ of Group 4, the values of these measures increase partially, whereas after removing the cases with ${y}_{ij}=1$ of Group 3, the values increase substantially. For the local influence diagnostics, we observe that the cases with responses ${y}_{ij}=0$ and ${y}_{ij}=1$ of Group 4, and the cases with responses ${y}_{ij}=1$ of Group 3, are related to the eosinophilia covariate, which is not significant at 5%. In addition, the cases with responses ${y}_{ij}=1$ of Group 4 are related to the estimate of the variance ($\sigma $) of the normal distribution of the random intercept, which is almost zero. Regarding the performance measures, after removing the cases with ${y}_{ij}=0$ of Group 4 and ${y}_{ij}=1$ of Group 3, the values of these measures increase. Thus, we decide to remove the cases with ${y}_{ij}=0$ of Group 4 and the cases with ${y}_{ij}=1$ of Group 3 (Phase V). This makes sense because these are patients with severe persistent asthma (Group 4) but without FAO, or patients with moderate persistent asthma (Group 3) who present FAO.
Table 10 and Table 11 report the results of the post-deletion analysis. Observe that the eosinophilia covariate is not significant at 5%, and allergy is not significant at 10%. Nevertheless, the Sens, Spec and Acc measures present a large increase, reaching 0.8750, 0.8860 and 0.8851, respectively. Table 12 reports the fit of the reduced model (without these covariates), and Table 13 reports the performance measures of its predictions. Note that the variance increases and that the prediction measures decrease. Hence, these covariates must remain in the model. Therefore, the method of combining the global and local influence diagnostics at the group and case levels allows us to obtain a model with higher predictive capacity and some inferential changes.
9. Conclusions, Discussion, and Future Research
When patients belong to a specific group, such as patients classified according to their asthma severity, the data present dependence and have a hierarchical structure that can be modeled through mixed models [17]. If the interest is to analyze or predict binary response variables of individuals based on certain fixed or random variables measured on those individuals, a mixed-effects logistic regression model can be used [17,18]. This model is a typical predictive model widely used in practice.
This research reported the following findings:
 (i)
We have provided a data-influence analytics using a mixed-effects logistic regression applied to asthma, based on global and local influence diagnostic techniques, which are used jointly in this study but are often used separately. Such joint usage allowed us to identify situations that could not be identified if these techniques were used separately. In our application, this data-influence analytics is provided in Tables 2 to 13 and Figures 1 and 2.
 (ii)
We have considered predictive performance measures for these analytics. In our application, results for these predictive performance measures are provided in Tables 1, 5, 7, 9, 11, and 13.
 (iii)
We have given an algorithm that summarizes the methodology proposed in this study; see Algorithm 6.
 (iv)
We have proposed and implemented a methodology for the data-influence analytics of this type of predictive model, which provides improved scientific evidence from asthma data by evaluating whether the data contain particular observations that may affect the conclusions drawn from the analysis and, therefore, medical decision-making.
 (v)
We have illustrated the proposed methodology with a case study of real-world asthma data collected at a public hospital in São Paulo, Brazil.
The case study has shown that the new methodology allowed us to obtain a model with high predictive capacity and to identify patients who differ markedly, in medical terms, with respect to fixed airway obstruction, especially in the severe persistent and moderate persistent asthma groups. In addition, we explained which characteristics or explanatory variables are associated with fixed airway obstruction, in order to model the probability of fixed airway obstruction given the asthma severity group in which the patient was classified. The results of this work can be taken as a contribution to data-influence analytics in predictive models applied to asthma. Note that improving data quality with analytics has gained attention in recent years, especially in medicine. It allows us to identify anomalies, increasing the efficiency of medical experiments while maintaining a high level of data quality. Thus, it is possible to avoid inaccurate conclusions from the results of a study. Therefore, good statistical practices must be followed with sophisticated techniques, such as those presented in this work for detecting influential data and outliers, as well as other possible inconsistencies in the data; see the studies presented in [48,49], which support our discussion in terms of data quality and analytics in medicine. Thus, our study can add to the toolkit of diverse practitioners, including medical doctors, applied statisticians, and data scientists.
Some themes for future research, which arose from the present investigation, are the following:
 (i)
The procedure of data-influence analytics is very useful for identifying a set of particular observations termed influential. However, this set may include another type of particular observation, the so-called outliers. Outliers are observations that are not well fitted by the model, and their detection is commonly based on residual analysis. Therefore, developing a methodology that identifies outliers in a data set using different types of residuals for mixed-effects logistic regression models is of interest for future study of the quality of fit and predictive capability of the model [50].
 (ii)
An important aspect to be considered when medical data are analyzed is censoring. Estimating model parameters while accounting for censored data is more efficient than ignoring the censoring. Indeed, if censored cases are present and censoring is not considered, it is not possible to estimate the variance of the censored part. Nevertheless, if censoring is considered, such a variance may be estimated from the data. In addition, the asymptotic behavior and performance of maximum likelihood estimators in more complex statistical models can be studied as in [51,52]. Estimation methods for the regression parameters under heavy censoring may be studied through a mixture structure [53,54,55].
 (iii)
An extension of the present study to the multivariate case is also of practical relevance [52,56,57].
 (iv)
Incorporation of temporal, spatial, functional, and quantile regression structures in the modeling, as well as errors-in-variables and PLS regression, is also of interest [26,29,30,58,59,60,61,62,63].
Therefore, the methodology proposed in this investigation poses new challenges and opens the door to exploring other theoretical and numerical issues. Research on these and other issues is in progress, and the findings will be reported in future articles.