Patients who do not keep their appointments, commonly referred to as no-shows, are currently one of the main problems faced by health centers. The absence of patients from their appointments causes the under-utilization of the center’s resources, which extends the waiting time of other patients. No-shows also have an economic impact on health facilities, limiting future staff recruitment and the improvement of the center’s infrastructure. As an example, considering only the primary care centers in the United Kingdom, the number of missed appointments exceeds 12 million [1]. Moore et al. [2] reported that no-shows and cancellations represented 32.2% of the scheduled time in a family practice residency clinic. In terms of economic losses, they reported that the resulting decrease in the health center’s annual income ranges from 3% to 14%. In the United Kingdom, the annual economic cost caused by no-shows is 600 million pounds [3]. Besides the problems caused to the health center and the rest of the patients, missing an appointment can cause serious health problems for the no-show patients themselves [4].
In order to reduce these negative effects, health centers have implemented various strategies, including sanctions and reminders. However, besides the fact that several articles question their effectiveness [5], these strategies have a significant cost associated with them. On the one hand, sanctions may limit access to medical centers for patients with limited incomes [7]. On the other hand, reminders have an economic impact that has been estimated at 0.41 euros per reminder [8].
During the last decades, a significant number of scheduling systems have been developed to provide an alternative to these strategies. These systems aim to achieve better appointment allocation based on patient no-show predictions. They are described in several review articles, such as Cayirli and Veral [9] and Gupta and Denton [10], and more recently Ahmadi-Javid et al. [11]. An important aspect to point out is that the efficiency of these systems depends mainly on two elements: the discriminatory capacity of the predictors and the classification technique used to estimate the probabilities.
Regarding the first of these two elements, several research works have been carried out to discover the best predictors for discriminating patients who attend their appointments from those who do not. These investigations have led to the identification of a significant number of predictors, such as the percentage of previous no-shows, lead time, diagnosis, or age. A recent review of the literature by Dantas et al. [12] identified more than 40 potential predictors.
In contrast to the identification of good predictors and the construction of accurate scheduling systems, which have been studied since the 1960s, research on the development of predictive models has mainly been carried out in the last decade. This is primarily because, until the recent availability of Electronic Health Records (EHR), there were no databases of sufficient size to build these models accurately. It is important to note that building accurate classifiers is essential for the scheduling system to work effectively. However, obtaining these predictions remains an unsolved problem on which a significant number of publications continue to appear.
In this work, a systematic review is carried out to establish the state of the art in no-show prediction. The review aims to identify the models that have been proposed, along with their strengths and weaknesses. To accomplish this, besides identifying each of the different techniques, various elements such as the characteristics of the database, the protocol employed to evaluate the model, and the performance obtained are analyzed. The review also identifies the most widely used predictors in the literature.
The rest of the article is structured as follows. In Section 2, the bibliographic search protocol is presented. This includes the bibliographic databases, the search criteria, the exclusion criteria, and the variables that will be extracted from each of the selected articles. Next, in Section 3, the selected articles are described, grouping them according to the proposed technique and exposing their most relevant contributions. The article ends in Section 4, where the findings are discussed and the conclusions are presented.
4. Discussion and Conclusions
In this work, a systematic review has been carried out on the prediction of patient no-shows. The relevance of the problem can be observed in the fact that 41 of the articles on no-show prediction (82% of the total) have been published in the last 10 years (and 32, that is, 64% of the total, in the last five years). The review has identified several factors that influence the results reported in each of the studies analyzed. These factors include the choice of predictive model, the features used by these models, the variable selection, the performance assessment framework, the class imbalance together with the performance measure, the intra-patient temporal dependence, and whether or not the experiments take first visits into account. The main findings for each of these factors are described in detail below.
The review found that the most widely used algorithm was LR, which appears in 30 articles, that is, more than 50% of the total. This can be explained by the fact that the early works focused on identifying the most influential factors in patient no-shows, a task in which LR plays a central role. The second most frequent predictive model is DTs, used as the primary technique in 10 articles (20% of the total). Among the different models, the LR with L2 regularization proposed by Kurasawa et al. [30] stands out, achieving an AUC of 0.958. Another work that deserves special attention is Snowden et al. [52], which used NN, reaching an accuracy of 91.11% on a database with an attendance rate of 80%. With the current explosion of deep networks and the growth of available databases, this methodology is a promising line of research.
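As an illustration of this family of models, the following sketch fits an L2-penalized logistic regression by plain batch gradient descent on synthetic data. It is not the implementation used in the cited studies, and the two predictors (previous no-show rate and lead time, both standardized) are invented for the example.

```python
import numpy as np

def fit_l2_logreg(X, y, lam=1.0, lr=0.1, n_iter=2000):
    """Logistic regression with an L2 penalty on the weights
    (the intercept is left unpenalized), fit by gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # predicted no-show probability
        grad_w = X.T @ (p - y) / n + lam * w / n  # log-loss gradient + L2 term
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Synthetic example with two hypothetical predictors:
# column 0 = previous no-show rate, column 1 = lead time (both standardized).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (rng.random(500) < 1 / (1 + np.exp(-(2 * X[:, 0] + X[:, 1])))).astype(float)

w, b = fit_l2_logreg(X, y, lam=1.0)
acc = np.mean((predict_proba(X, w, b) > 0.5) == y)
```

Increasing `lam` shrinks the weights towards zero, which is what makes penalized LR robust when many weakly informative predictors are available.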
As both Deyo and Inui [62] and Dantas et al. [12] indicated, there are no universal variables in the no-show patient databases. The most appropriate variables depend, for example, on the population under study or the specialty. However, as shown in Table 3, some variables show a discriminatory capacity in the majority of the studies. Variables such as age, gender, insurance, distance, weekday, visit time, lead time, and no-show history appeared in at least half of the studies. Among these variables, previous no-shows (along with the number of previous appointments) have been reported as the most significant. This shows the importance of including the patient’s history in the study and reaffirms the intra-patient dependence of the observations. Although on a smaller scale, variables such as race, marital status, and visit type (first/follow-up) have also been frequently used. A limitation of existing studies is that, in many cases, whether a variable can be used depends on its availability in the EHR.
Another aspect worth mentioning is whether the studies perform feature selection, as this can significantly influence the performance of predictive models. As a general rule, the addition of variables with low predictive capacity reduces the generalization of the results. According to Guyon and Elisseeff [63], feature selection techniques can be divided into three main groups: filter, wrapper, and embedded. Filter methods, which are the most used in the analyzed articles, select the variables before passing them to the predictive model. The majority of the works that employed a filtering technique used univariate models to select significant variables [15]. On the other hand, wrapper methods evaluate multiple models created with different combinations of variables. The most used technique within the wrapper methods was stepwise feature selection [7]. Other techniques are metaheuristics such as genetic algorithms [61] or Opposition-Based Self-Adaptive Cohort Intelligence [51]. Finally, embedded methods incorporate the selection of variables within the model itself. In this category, the most used techniques were decision trees [43] and penalized regression [30]. In fact, the studies that applied penalized LR, such as Kurasawa et al. [30] and Lin et al. [37], present the best goodness-of-fit measures.
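The contrast between filter and wrapper methods can be made concrete with a minimal sketch; the nearest-centroid scoring rule and the synthetic data below are assumptions made for illustration, not techniques drawn from the reviewed articles.

```python
import numpy as np

def filter_select(X, y, k):
    """Filter method: rank features by absolute correlation with the
    no-show label and keep the top k, independently of any model."""
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

def wrapper_select(X, y, fit_score, k):
    """Wrapper method: greedy forward selection, adding at each step the
    feature whose inclusion most improves the model's score."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        best = max(remaining, key=lambda j: fit_score(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

def centroid_score(Xs, y):
    """Hypothetical wrapper criterion: training accuracy of a
    nearest-centroid classification rule on the chosen columns."""
    mu0, mu1 = Xs[y == 0].mean(0), Xs[y == 1].mean(0)
    pred = np.linalg.norm(Xs - mu1, axis=1) < np.linalg.norm(Xs - mu0, axis=1)
    return np.mean(pred == y)

# Synthetic data: only columns 0 and 3 carry signal about the label.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=300) > 0).astype(int)

filter_top2 = filter_select(X, y, 2)
wrapper_top2 = wrapper_select(X, y, centroid_score, 2)
```

Embedded methods would instead perform the selection inside the fitting procedure, e.g., by letting an L1 penalty drive uninformative weights to zero.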
Performance evaluation framework. A very important aspect is the experimental design, since it conditions the generalization of the results. In 13 studies (26% of the total), the performance of the model was evaluated on the same data that were used to train it. It is a well-known fact that this approach is prone to overfitting, which results in a drastic decrease in accuracy when the developed classifier is applied to future data. Thirty-one of the articles (62% of the total) conducted a single validation in which the data were divided into training and test sets. The disadvantage of this approach is that the easiest-to-classify observations might, by chance, end up in the test set, which leads to overconfident results. Only six studies (12% of the total) performed a repeated validation or a k-fold cross-validation. These numbers indicate that the reported results may not carry over to new datasets.
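A minimal sketch of k-fold cross-validation clarifies why it avoids both pitfalls: every observation is predicted exactly once, by a model that never saw it during training. The majority-class baseline (the trivial “predict every patient shows” rule) is used here only as an illustrative model.

```python
import numpy as np

def kfold_cv(X, y, fit, predict, k=5, seed=0):
    """k-fold cross-validation: shuffle once, split into k disjoint folds,
    and let each fold serve once as the test set while the model is
    refit on the remaining k-1 folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    accs = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train], y[train])
        accs.append(np.mean(predict(model, X[test]) == y[test]))
    return float(np.mean(accs))

# Majority-class baseline on imbalanced labels (80% attendance).
fit_majority = lambda X, y: int(np.mean(y) >= 0.5)
predict_majority = lambda m, X: np.full(len(X), m)

X = np.zeros((500, 1))
y = np.array([0] * 400 + [1] * 100)   # 0 = show, 1 = no-show
cv_acc = kfold_cv(X, y, fit_majority, predict_majority)
```

For this baseline the cross-validated accuracy equals the attendance rate (0.8), which is exactly the trivial benchmark that any useful no-show classifier must beat.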
An important aspect to point out from our analysis is that no-show prediction performance is evaluated very differently across studies. In particular, 44 out of the 50 articles report at least one measure of performance. Among these, the most commonly used metric was AUC, included in 29 of the works. Of these 29 works, a single study obtained an AUC value larger than 0.9, and only six articles (nearly 20%) reported an AUC higher than 0.85. The second most used performance measure was accuracy, reported in 21 articles. In addition, 17 studies reported specificity and sensitivity, six reported PPV and NPV, and four used recall and precision as performance metrics. Finally, four studies reported an error measure (MAE, MSE, or RMSE), only one proposed to use the F-measure, and another one the G-measure (see the next paragraph). This heterogeneity in the use of performance measures makes it difficult to compare results across studies. Detailed information on the performance measures used can be found in Table 4.
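All of these measures can be computed directly from the confusion matrix and the predicted scores. The sketch below assumes no-show (1) is the positive class and takes the G-measure as the geometric mean of sensitivity and specificity, one of several definitions in use; the AUC is computed through its rank (Mann–Whitney) interpretation.

```python
import numpy as np

def classification_metrics(y, pred):
    """Confusion-matrix-based measures, with no-show (1) as positive class."""
    tp = np.sum((pred == 1) & (y == 1))
    tn = np.sum((pred == 0) & (y == 0))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    sens = tp / (tp + fn)              # sensitivity, also called recall
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)              # precision, also called PPV
    return {
        "accuracy": (tp + tn) / len(y),
        "sensitivity": sens,
        "specificity": spec,
        "F-measure": 2 * prec * sens / (prec + sens),  # harmonic mean
        "G-measure": np.sqrt(sens * spec),             # geometric mean
    }

def auc(y, score):
    """AUC as the probability that a random no-show is scored above a
    random show (ties count one half)."""
    diffs = score[y == 1][:, None] - score[y == 0][None, :]
    return float(np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0))

# Toy example with four appointments.
y_true = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 0, 1])
score = np.array([0.1, 0.4, 0.35, 0.8])
m = classification_metrics(y_true, pred)
roc = auc(y_true, score)
```

Note that, unlike accuracy, the AUC is computed from the continuous scores rather than the thresholded predictions, which is one reason it is preferred under class imbalance.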
As already mentioned, class imbalance is a common characteristic of all the studies analyzed. A fact observed in several studies is that the accuracy obtained was lower than the attendance rate [14]. In particular, accuracy exceeds the attendance rate in only 5 of the 15 studies reporting these two values. This low performance could be partly due to the class imbalance, which biases the different algorithms towards predicting each observation as a show. Among the analyzed studies, 26 of the works report a no-show rate lower than 20%, which represents 68% of the 38 articles that presented this index. Several approaches have been proposed in the literature to deal with class imbalance in binary classification (see [64] for an overview). They can be categorized into three groups: (1) those based on training set data transformations aimed at reducing the imbalance between the classes (by undersampling the majority class or oversampling the minority class), (2) those based on the use of specific algorithms that take into account the prior imbalanced class distribution, and (3) hybrid approaches combining (1) and (2). Among the analyzed articles, only the cost-sensitive method proposed by [29] and the ensemble/stacking methods can tackle class imbalance. They fit into the second of the three above-mentioned groups, that is, the algorithm-level approaches. Among these works, only Elvira et al. [59] relate the classifier choice to the imbalance problem. The authors of this study also pointed out that accuracy was not an adequate performance measure and proposed to use the AUC instead. Alternatively, Kurasawa et al. [30] proposed to use the F-score and Topuz et al. [57] the G-measure.
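The first two groups can be sketched as follows; the 10% no-show rate and the inverse-frequency weighting scheme are illustrative assumptions, not the exact methods of the cited works.

```python
import numpy as np

def undersample(X, y, seed=0):
    """Data-level remedy (group 1): randomly discard majority-class
    ('show') observations until both classes have the same size."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)   # no-shows assumed the rare class
    majority = np.flatnonzero(y == 0)
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = rng.permutation(np.concatenate([minority, keep]))
    return X[idx], y[idx]

def balanced_weights(y):
    """Algorithm-level remedy (group 2): inverse-frequency observation
    weights, so that misclassifying a rare no-show costs more during
    training; cost-sensitive classifiers accept such weights directly."""
    w1 = len(y) / (2.0 * np.sum(y == 1))
    w0 = len(y) / (2.0 * np.sum(y == 0))
    return np.where(y == 1, w1, w0)

# Synthetic appointment data with a 10% no-show rate.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)
Xb, yb = undersample(X, y)
w = balanced_weights(y)
```

With inverse-frequency weights, the total weight assigned to each class is equal, so a weighted loss no longer rewards the trivial “everyone shows” prediction.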
An important aspect is the intra-patient temporal dependence of the observations. Several authors avoid this problem by using only the last (most recent) appointment to train the model [7]. However, this approach results in a loss of information. Only 7 of the 50 analyzed articles include the intra-patient temporal dependence in the model. This dependence was incorporated using different approaches, including Markov chains [5], weighting observations by their temporal closeness [18], using an exponential sum for regression [28], building various LRs based on the number of previous visits [29], or using a MELR [38]. The last approach provides a promising way forward, since it allows unifying the behavior of the patient, the socio-demographic variables, and the environmental variables.
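As a rough sketch of how intra-patient history can be encoded without discarding earlier appointments, the following derives, for every visit, the number of previous appointments, the previous no-show rate, and a first-visit flag, using only information available before the outcome is known. The input format is an assumption made for illustration, not a structure taken from any reviewed study.

```python
from collections import defaultdict

def history_features(appointments):
    """Given (patient_id, no_show) tuples in chronological order, return
    one row per appointment: (patient_id, n_previous_visits,
    previous_no_show_rate, is_first_visit). The rate is 0.0 for first
    visits, where no history exists yet."""
    n_prev = defaultdict(int)     # visits seen so far, per patient
    n_noshow = defaultdict(int)   # no-shows seen so far, per patient
    rows = []
    for pid, no_show in appointments:
        n = n_prev[pid]
        rate = n_noshow[pid] / n if n else 0.0
        rows.append((pid, n, rate, n == 0))
        n_prev[pid] += 1          # update counters after emitting the row,
        n_noshow[pid] += no_show  # so features never leak the outcome
    return rows

# Tiny example: patient p1 has three visits, p2 has one.
appointments = [("p1", 1), ("p2", 0), ("p1", 0), ("p1", 1)]
feats = history_features(appointments)
```

Such rolling features are the simplest way to inject the patient’s attendance record into any classifier; the mixed-effects approach mentioned above instead models the patient-level dependence explicitly through random effects.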
One element that significantly affects the results is the inclusion of new patients in the analysis. At the time of a patient’s first visit, the available information is very limited: only some environmental variables (e.g., month, day, and time of the appointment) and perhaps some socio-demographic variables (age and sex of the patient) are available. This limitation makes it very difficult to predict non-attendance at the first visit. Different authors have addressed this problem by means of different techniques, such as excluding from the analysis patients who do not have a certain number of previous visits [5], not including the first appointment in the study [7], or including a variable that indicates whether the appointment corresponds to the first visit [14].
To conclude, the above discussion has shown that the identification of patients who do not attend their appointments is a challenging and unsolved problem. As shown above, this can be observed in the fact that only five articles attained an accuracy higher than the attendance rate. This is a consequence of several pitfalls. Firstly, researchers only had access to a limited number of predictors with low discrimination capacity and, in addition, these were not the same across studies. Moreover, many studies were conducted with databases consisting of a small number of patients, which limited the information provided to the classifiers. However, the recent availability of more informative databases obtained from EHR opens up new research opportunities. These current databases, containing records of hundreds of thousands of appointments, allow the use of modern predictive techniques such as deep neural networks or novel binary classification algorithms for high-dimensional settings, such as [65
]. A second research line consists of developing and incorporating strategies that reduce the negative effects of class imbalance. For instance, the use of sampling techniques, cost-sensitive approaches, or the previously discussed ensemble models might improve the performance of the selected classifier. A third possibility is the incorporation of the intra-patient temporal dependence, which would allow a better characterization of the patients’ behavior by unifying their previous attendance records, their socio-demographic characteristics, and the environmental variables. These strategies could lead to more accurate predictions that, when incorporated into scheduling systems, will reduce the economic losses suffered by health centers and the waiting time for access to medical services.