Predicting the Intention to Donate Blood among Blood Donors Using a Decision Tree Algorithm

Abstract: The blood donation process is essential for health systems. Therefore, the ability to predict donor flow has become relevant for hospitals. Although this behavioural intention can be predicted from donor questionnaires, the need to reduce social contact in pandemic settings makes it necessary to shorten these surveys with minimal loss of predictive power. In this context, this study aims to predict the intention to give blood again, among donors, based on a limited number of attributes. This research uses data science and learning concepts based on symmetry in a particular classification to predict blood donation intent. We carried out a face-to-face survey of Chilean donors based on the Theory of Planned Behaviour. These data, including control variables, were analysed using the decision tree technique. The results indicate that it is possible to predict the intention to donate blood again with an accuracy of 84.17% using a minimal set of variables. The added scientific value of this article is to propose a simpler way of measuring a multi-determined social phenomenon, such as the intention to donate blood again, and to apply the decision tree technique to achieve this simplification, thereby contributing to the field of data science.


Introduction
Since the COVID-19 pandemic outbreak, tens of thousands of scientific papers have examined the disease and its effects, many of them using artificial intelligence and data science. For some authors, data science will be a crucial element in the global response to the pandemic [1]. According to Van der Aalst [2], data science is a cross-disciplinary field concerned with transforming data into real value. Data can be structured or unstructured, large or small, static or dynamic. The value is provided in the form of predictions, automated decisions, data models, or any data visualisation that gives information. This field can be considered a mix of classic disciplines such as statistics, data mining, databases, and distributed systems.
Data science techniques have been widely applied in past epidemics to help health professionals and authorities take better measures against the disease [3]. Today, data science applications tackle COVID-19 in three main phases: screening, tracking and forecasting, and medical aid [4]. In particular, the different use cases can be arranged in eight lines of application: assessing risk and prioritising patients; testing and diagnostics; simulating and modelling; contact tracing; comprehending social interventions; logistic planning and economic action; automated patient services; and supporting the development of vaccines and new therapies [1]. As part of the applications of data science that support a pandemic response, prediction systems to improve the healthcare supply chain appear necessary. Furthermore, the changing and uncertain COVID-19 environment, and the need to maintain social distancing, are barriers to effective supply chain systems. In that context, data science-based cutting-edge technology plays a critical role in supply chain operations [5].
The theory of planned behaviour (TPB) was proposed in [26]; it adds one factor, perceived behavioural control, to the theory of reasoned action [27]. As shown in Figure 1, the TPB model has three factors that lead to intention and, in turn, shape behaviour: attitude (ATT, positive or negative assessments of a behaviour), subjective norm (SN, general perception of social pressure to adopt or not adopt a behaviour), and perceived behavioural control (PBC, perceived control over or a capacity to perform a behaviour). TPB has been widely used in several research topics. Recently, TPB has been used as a basis for studies on social phenomena in Latin America, for example to explain the intention to adopt electronic commerce by SMEs [28], to predict the purchase of products [29], and to model behaviours associated with public health, such as condom use [30] and telemedicine [31,32]. In these TPB applications, the model explains between 35% and 85% of behavioural intention, depending on the phenomenon studied. Regarding the three determinants of behavioural intention, all studies supported the effect of ATT. However, in some studies, the effects of SN or PBC are not supported.
In general, TPB has been used to model blood donation intent and behaviour [33], for example to determine the return behaviour of blood donors [34,35] and to understand the different underlying motives that influence the intention to donate blood voluntarily [36–40]. Regarding re-donation behaviour, Wevers et al. [34] reported that the act of donating blood stimulates re-donation behaviour. However, the pressure exerted by the blood bank can affect this behaviour, so interventions that promote donor retention should be designed to avoid forcing the behaviour of individuals. In this vein, M'Sallem [35] indicates that internal motivation to re-donate prevails over external reasons and considers that blood donation centres can have confidence in their retention programmes, showing that both ATT and PBC affect the intention to donate blood again.
Giles and Cairns [41] first applied TPB to blood donation to examine the perceived behavioural control factor. They demonstrated that the perception of control has an important impact on behavioural motivation. Subsequent research by Armitage and Conner [42] supported the results of [41] and provided evidence for including self-efficacy, moral norms and self-identity as further influential predictors. Self-efficacy comes from social cognitive theory; although it is conceptually similar to PBC [43], the main difference is operational. PBC is often evaluated by the ease or difficulty of the behaviour, whereas self-efficacy is evaluated by the individual's confidence in performing it. Armitage and Conner [42] showed in an empirical study that self-efficacy is a significant predictor and has the highest impact on blood donation. Moreover, in another study [44], self-efficacy was the most significant predictor of blood donation. Moral norms consider "personal feelings of … responsibility to perform, or refuse to perform a certain behaviour" [42]. Furthermore, self-identity was proposed from identity theory by [42] as an extension of social norms, although it has a different interpretation. Social norms are what we believe others want us to do, whereas self-identity reflects the individual's perception of a particular social role. The more strongly an individual perceives a role, the greater the impact of self-identity on intention [42]. Moral norms should also be distinguished from religious beliefs. Another study [45] showed that religious beliefs among young donors significantly affect their intention to donate blood. An empirical study of blood donation in young people [44] found that self-efficacy, attitude and moral norms were the most influential correlates of the intention to become a blood donor.
Unlike previous studies, Robinson et al. [46] studied blood donation among nondonors. They proposed descriptive norms, donation anxiety and anticipated regret in addition to the previous predictors. Moral norms explain perceived moral duty and subjective norms describe perceived pressure from others, whereas descriptive norms reflect what others are perceived to do [46]. Donation anxiety is defined as concern about needles, exposure to blood or pain [46]. Meanwhile, in another study [47], donation anxiety focused more on fear, which plays a crucial role in anticipating donation intentions. Additionally, anticipated regret is the regret a person expects to feel in the future as a result of an anticipated action. The results of [46] showed that all predictors except the subjective norm were directly correlated with blood donation for nondonors. Masser et al. [48] proposed a new framework that divides predictors for blood donors into direct and indirect factors. They introduced donation anxiety, moral norms and self-identity as indirect predictors of intention, each directly related to attitude, and presented attitude, subjective norms, self-efficacy and anticipated regret as direct predictors. Bednall et al. [28] reviewed 61 studies associated with blood donation from a broader perspective. They concluded that PBC, attitude, self-efficacy, role identity and anticipated regret are the strongest positive predictors, while moral norms, satisfaction and service quality have a medium impact on donors' intentions. In 2011, Masser et al. [49] studied donor behaviour during an influenza outbreak in Australia. According to their research, the impact of the outbreak on blood donation was small. However, in both the low-risk and the high-risk scenarios, attitude and subjective norm were influential. In the low-risk scenario, gender was an additional significant predictor, whereas in the high-risk scenario the additional predictor was PBC.
With the advent of COVID-19, another study [50] conducted an empirical investigation to find the most critical predictor for blood donors based on TPB and its extensions. It found that trust in blood collection agencies predicted a more favourable evaluation and, therefore, a stronger subjective norm. Thus, self-efficacy and subjective norms play a crucial role in predicting donors' intention in the COVID-19 era. Regarding trust, feelings of trust have also been treated as attitudes when assessing blood donation through digital platforms [37].
In this context, this study aims to predict the intention of donors to give blood again based on a limited number of attributes. Given a set of variables used to predict behaviour based on a social science theory, the problem to be solved is to determine a smaller number of variables that can predict this behaviour. In general, this research process is an example of using data science and learning concepts based on symmetry for a particular classification and subsequent forecasting. The added scientific value of this article is to propose a more simplified way of measuring a multi-determined social phenomenon, such as the intention to donate blood again; the application of the decision tree technique to achieve this is a significant contribution to the field of data science.
We want to emphasise this study's contributions. From the practical point of view, the contributions are related to improving information capture for predicting future blood donations, an event of fundamental importance given the context where this activity is carried out, and the current global health emergency. From the academic point of view, the main contribution of this work is associated with the application of a known data science technique in a novel way in social sciences, specifically, to determine a smaller number of attributes that predict behaviour based on a social theory that initially requires a larger number of attributes.
This paper is organised as follows. In Section 2, we describe the data collection procedure and techniques used to examine the data. We present the results of this data analysis in Section 3. Section 4 offers a discussion of these findings. Lastly, the final section gives a brief summary of the outcome of this paper.

Data
For the empirical study, a convenience sampling technique was used to gather data from Chilean blood donors. The data were obtained through an in-person questionnaire administered to adult users in two health centres in Valdivia (Chile). In particular, a cross-sectional survey was conducted between March and April 2020. All surveys were conducted as the final step of the blood donation process. The respondents remained anonymous throughout the data collection process. According to standard socio-economic studies, there are no ethical concerns other than preserving the participants' anonymity. The scales were adapted from Jen and Hung. A 7-point Likert scale was used. Table 1 shows the items used to measure the study's variables. The questionnaire was developed from the TPB model to obtain primary microdata, and the research model was tested using these data. This study meets the ethical standards of social research established by Universidad Austral de Chile (UACh, Valdivia, Chile) for its researchers. The health centres where the data were collected are associated with that institution. Moreover, as in other eHealth studies carried out by the leader of this research team [37], the study followed the Checklist for Reporting Results of Internet E-Surveys (CHERRIES) guidelines [51].
A total of 197 surveys were completed for this study. Most respondents were female (52%), and the average age was 32.1 years. See Table 2 for more details of the distribution of the variables of interest.

Decision Tree Algorithm
We used a decision tree algorithm to predict intended blood donation. A decision tree algorithm is a nonparametric technique that identifies the pattern that best matches the relationship between the attribute set and the class label of the input data. In this context, a decision tree is an inverted tree structure whose nodes test attribute values in order to decide the class. According to [52], there are six reasons for using this method: (1) a decision tree does not require prior assumptions about the type of probability distributions followed by the class and other attributes; (2) regarding computational time, a decision tree is inexpensive and fast, even when the training set is large; (3) interpreting a decision tree, in particular a smaller tree, requires little effort; (4) decision tree algorithms are quite robust to the presence of noise, particularly when methods to prevent overfitting are used; (5) the accuracy of a decision tree is not affected by highly correlated and irrelevant attributes during preprocessing; and (6) the technique is useful for predictive modelling. Algorithm C4.5 is a particular case of this technique [23]. In accordance with [24,53], we detail C4.5 below. Given a dataset D, C4.5 initially grows a tree using the divide-and-conquer strategy as follows. If all cases in D belong to the same class, or D is small, the tree is a leaf labelled with the most common class in D. If not, a test is selected based on one attribute having two or more outcomes. This test is made the root of the tree with a branch for each outcome of the test, then D is divided into corresponding subsets, depending on the outcome for each case, and the same procedure is applied recursively to each subset. C4.5 uses two heuristic criteria for ranking potential tests: information gain and gain ratio.
The former criterion minimises the total entropy of the subsets, and the latter criterion divides the information gain by the information provided by the test outcomes. Both criteria are based on an impurity function that is defined as symmetric with respect to the discrete probability vectors associated with each subset [54]. The attributes may be numeric or nominal, and this determines the format of the test outcomes. For a numerical attribute a, they are {a ≤ l, a > l}. The level l is found by sorting D on the values of a and choosing the split between successive values that maximises the above criterion. An attribute with discrete values has one outcome for each value, or the values can be grouped into two or more subsets with one outcome for each subset. To avoid overfitting, the original tree is then pruned. This procedure is based on a pessimistic estimate of the error rate associated with a set of M cases, of which E do not belong to the most common category. C4.5 determines the upper limit of the binomial probability when E events have been observed in M trials, using a specified confidence. Figure 2 shows the pseudocode of the C4.5 algorithm.
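The threshold search for a numerical attribute described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function names, the use of midpoints between successive sorted values as candidate levels, and the toy Likert scores are all assumptions made for illustration, and the sketch ranks candidates by information gain only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, subsets):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    n = len(labels)
    remainder = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(labels) - remainder


def best_threshold(values, labels):
    """Return (level, gain) for the best binary test {a <= l, a > l}."""
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # no cut point exists between equal values
        level = (v1 + v2) / 2  # assumption: midpoint between successive values
        left = [y for v, y in pairs if v <= level]
        right = [y for v, y in pairs if v > level]
        gain = information_gain(labels, [left, right])
        if gain > best[1]:
            best = (level, gain)
    return best


# Hypothetical toy data: one 7-point Likert item and the intention class.
scores = [1, 2, 3, 5, 6, 7]
intent = ["no", "no", "no", "yes", "yes", "yes"]
threshold, gain = best_threshold(scores, intent)
```

In this toy example the classes separate cleanly, so the selected level falls between the scores 3 and 5 and the split removes all class entropy.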

Results
The intention to donate blood in the next six months was calculated as follows. First, we determined the average of the items associated with the donation intention variable. Then this average was classified into three levels, the "no" level being a value between one and two, the "maybe" level being a value greater than two but less than five, and the "yes" level being a value greater than five. The attributes used to generate the prediction model were the items associated with the latent variables of the TPB model that explain the behavioural intention. Additionally, the control variables that were considered are: age, education, reason for donation and number of previous donations.
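The three-level classification described above can be sketched in Python as follows. The function name and the example item scores are hypothetical. The text does not say how an average of exactly five is classified, so the sketch assumes it falls in the "maybe" level; that boundary choice is an assumption, not part of the study.

```python
from statistics import mean


def intention_level(item_scores):
    """Map the mean of the intention items (7-point Likert) to a class label."""
    avg = mean(item_scores)
    if avg <= 2:
        return "no"      # average between one and two
    if avg > 5:
        return "yes"     # average greater than five
    # Greater than two and up to five. Treating exactly five as "maybe"
    # is an assumption; the source leaves this boundary unspecified.
    return "maybe"


# Hypothetical respondents, each with three intention items.
levels = [intention_level(s) for s in ([1, 2, 2], [3, 4, 5], [7, 6, 7])]
```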
To implement the decision tree algorithm, we used the C4.5 algorithm, which builds decision trees from a collection of training data using information entropy [55]. A grid optimisation strategy was used to tune the parameters related to the splitting and stopping criteria. The split criteria assessed were information gain, gain ratio, the Gini index and accuracy. The optimisation selected the gain ratio as the split criterion and four as the maximum depth. The analyses were performed using 10-fold cross-validation to prevent overfitting. The cross-validation process involves two stages. The first stage produces a model, and the second stage applies that model and measures its performance. For 10-fold cross-validation, the procedure splits the data sample into ten subsets of equal size. Out of the ten subsets, the method holds out a single subset as test data, and the other nine subsets are used as training data. This process is repeated ten times, so that each of the ten subsets is used exactly once as test data. Finally, the process averages the results of the ten iterations to produce an estimate. Table 3 describes the procedure parameters. Figure 3 shows the results for blood donation intention. The attributes required for the prediction model are PBC1, PBC3, ATT1, ATT2, ATT3, SN2, and previous donations.
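The 10-fold cross-validation procedure described above can be sketched in plain Python. This is an illustration under stated assumptions, not the tool used in the study: the function names are hypothetical, a majority-class predictor stands in for the decision tree so that the example stays self-contained, and when the sample size is not divisible by ten (as with 197 surveys) the fold sizes differ by at most one case.

```python
import random
import statistics
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_majority(X_train, y_train):
    """Stand-in learner (assumption): always predicts the most common class."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return lambda _x: most_common


def cross_validate(X, y, fit, k=10, seed=0):
    """Return (mean accuracy, standard deviation) over the k test folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        # Stage 1: build the model on the nine training folds.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        # Stage 2: apply the model to the held-out fold and score it.
        correct = sum(model(X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Hypothetical data: 50 identical cases that all belong to one class.
mean_acc, sd_acc = cross_validate([[0]] * 50, ["yes"] * 50, fit_majority)
```

Reporting the mean together with the standard deviation across folds gives a figure in the same form as the accuracy quoted in the text (a mean plus or minus a spread).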
The prediction outcomes in Table 4 show that the method classifies cases well, with an accuracy of 84.17% ± 8.21%. Figure 4 summarises the application of the decision tree technique and its results.

Table 3. Description of the procedure parameters.

| Parameter | Value | Description |
|---|---|---|
| Algorithm | C4.5 | C4.5 sets up decision tree models based on a training dataset using the concept of information entropy. |
| Split criteria | Gain ratio | Gain ratio normalises the information gain of an attribute against the amount of entropy that attribute has. First, the information gain of all features is determined, and then the average information gain is calculated. Second, the gain ratio is calculated for all attributes whose information gain is greater than or equal to the average information gain. Finally, the feature with the highest gain ratio is chosen to divide the data. |
| Maximum depth | 4 | Maximum depth refers to the maximum distance between the root of the tree and any leaf. |
| Optimisation strategy | Grid | This strategy runs the process for all combinations of selected parameter values and then determines the optimal values. |
| Cross-validation | 10-fold | Of the ten subsamples, only one is held out as validation data for model testing, and the remaining nine are used as training data. The process is repeated ten times, with each of the ten subsamples used exactly once as validation data. Finally, the ten results are averaged to generate one estimate. |

Figure 4. Explanation of the application of the decision tree technique.
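The gain ratio selection rule summarised in Table 3 can be sketched in Python for nominal attributes. The function names and the four-row toy dataset are hypothetical and serve only to illustrate the rule: compute the information gain of every attribute, keep those at or above the average gain, and pick the one with the highest gain ratio. A small tolerance is added to the comparison to guard against floating-point rounding, which is an implementation assumption.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_and_ratio(rows, labels, attr):
    """Information gain and gain ratio of a nominal attribute."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


def select_attribute(rows, labels, attrs):
    """Highest gain ratio among attributes with at least average gain."""
    scores = {a: gain_and_ratio(rows, labels, a) for a in attrs}
    mean_gain = sum(g for g, _ in scores.values()) / len(scores)
    candidates = [a for a in attrs if scores[a][0] >= mean_gain - 1e-12]
    return max(candidates, key=lambda a: scores[a][1])


# Hypothetical toy data: a two-valued item and a unique identifier.
rows = [
    {"ATT1": "high", "donor_id": "a"},
    {"ATT1": "high", "donor_id": "b"},
    {"ATT1": "low", "donor_id": "c"},
    {"ATT1": "low", "donor_id": "d"},
]
labels = ["yes", "yes", "no", "no"]
chosen = select_attribute(rows, labels, ["ATT1", "donor_id"])
```

Both toy attributes separate the classes perfectly and therefore have the same information gain, but the identifier splits the data into four singleton subsets, so its split information is larger and its gain ratio smaller. This is the bias toward many-valued attributes that the normalisation is meant to counter.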

Discussion
As far as we know, there are no studies in the literature that predict the intention to re-donate among blood donors. Nevertheless, a recent study uses data science to explore the intention to donate or not to donate among nondonors in India [56]. The present study's accuracy of 84.17% compares favourably with the 70.37% reported in that study. We believe that using a theoretical model as a basis contributes to this improvement in prediction.
The attributes that the prediction model uses are associated with the three antecedents of the TPB. According to these findings, items of the PBC and ATT variables emerge as stronger predictors, consistent with other studies among blood donors [48,57]. These results suggest that for experienced donors, blood donation remains a behaviour that is, at least in part, a rational decision. On the other hand, the item associated with SN shows that this variable has some importance in determining the intention to donate in experienced donors, which is in line with previous studies [58].
Of the control variables used as predictors, only previous donations were helpful. Education level, age, gender and the primary reason for donating do not predict experienced donors' intention to donate again. However, previous donations emerge in the model as an essential attribute in the prediction. Donors who have previously donated blood are more likely to donate blood in the future. In line with Guglielmetti Mugion et al. [57], this variable may reflect a lower presence of inhibitors such as fear or lack of information on the transparency of the process.
Although the literature indicates that artificial intelligence-based systems have achieved significant success in healthcare since 2016 [59], prediction associated with blood donation has not been an area of significant development. We believe that this study provides a significant advance. In particular, this study has three practical implications. First, the results imply a reduction in the number of questions and, therefore, in the response time for people who donate. This should lead to a higher percentage of completed surveys and make it possible to improve blood management planning. Second, the registry of donation intentions will allow customisation, targeting and the development of attractive and appealing practices. As a result, the number of people who intend to donate blood again, voluntarily and altruistically, is expected to increase, thereby raising the availability of blood. Finally, a reduction in the costs of contacting people who wish to donate blood, and of the blood management system in general, is expected as a consequence of the two previous implications.


Conclusions
Prediction systems to improve the healthcare supply chain are necessary in the changing and uncertain environment of COVID-19. In that sense, blood supply forecasting is critical to making supply chain decisions and can help personalise and optimise the process for potential donors. In the past, the TPB has been used to predict blood donations, but data collection time is currently critical. Therefore, reducing this time makes data capture feasible and enables donation prediction. Thus, the research objective of this study was to predict the intention to repeat blood donations amongst donors based on a limited number of attributes. These data were analysed using the decision tree technique.
The experiment results indicate that it is possible to predict the intention to repeat blood donations with an accuracy of 84.17%, using only seven variables. Furthermore, the findings reveal that the attributes used by the prediction model are associated with the three antecedents of the TPB. According to these findings, items of the PBC and ATT variables emerge as the strongest predictors.
Some limitations of the present study are related to the sample size and the donors' culture. The sample size does not allow this result to be generalised to the whole population. Furthermore, because the sample reflects a single culture, these findings may not apply to a country with a different culture. In future studies, it will be helpful to replicate the procedure in a larger sample and to study different cultures (for example, with other religious beliefs). Additionally, future research should apply the decision tree technique in other areas of social sciences to determine a smaller number of attributes that predict behaviour based on an established theory, such as predicting learning styles based on the Felder-Silverman learning style model [60] or predicting the purchase of products or services based on the reasoned action theory [27].
Institutional Review Board Statement: Ethical review and approval were waived for this study because all the data used involving research on human subjects have been published before.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.
Conflicts of Interest: The authors declare no conflict of interest.