Identifying Vulnerable Households Using Machine Learning

Abstract: Many households in Afghanistan face food insecurity (FI), and this threatens sustainable development. Policymakers and international donors are trying to alleviate FI using food aid, development assistance, and outreach. This study identified household characteristics that discriminate between food-insecure and food-secure households, facilitating accurate assistance targeting in Afghanistan. We applied machine learning classification models (a classification decision tree and a random forest) to a household survey, using equal priors and 1.5:1 misclassification penalties. The resulting model correctly identifies 80% of food-insecure households. Characteristics in six major categories were found to be important. Unsurprisingly, traditional key variables, such as (1) income and expenditure items, (2) household size, (3) farm-related measures, (4) access to particular resources, and (5) short-term shocks, are important determinants of food security level. We also found that long-term household characteristics, such as dwelling wall composition, are relevant; these are not generally addressed in the existing literature. We argue that these reflect accumulated household wealth, which supports the idea that some factors determining food security are persistent. We also found that commonly used demographic variables were not important.


Introduction
Afghanistan has suffered from severe weather conditions and conflict. The majority of the country's population lives below the poverty line, and many households are food-insecure (FI) [1][2][3]. This threatens sustainable development. One way of alleviating such issues is the provision of development assistance. However, the capacity to provide financial and food aid is limited and would be insufficient if it were distributed to all households. Thus, it is essential to target aid toward only FI households, and this requires a means of identifying the food-insecure households [4]. A number of existing studies have addressed FI household identification via causal models linking shocks to FI status [5][6][7][8][9]. However, such shocks are generally known only after the fact, and it takes time to initiate aid efforts. Thus, there is a lack of real-time and anticipatory identification of FI households.
In addition, some scholars believe that persistent FI alters household characteristics, so there may be bidirectional influences at work that bias the results of causal models [10][11][12]. Given this background, a less causal approach may be desirable. The machine learning framework, especially decision tree approaches, offers powerful tools for this. In this study, we use machine learning models to identify household characteristics that strongly discriminate between FI and non-FI households.
To the best of our knowledge, only a few studies have used machine learning methods to address FI household targeting, and these studies used limited datasets. In this study, we use machine learning methods to identify food insecurity indicators in Afghanistan using a comprehensive countrywide dataset from the Afghanistan National Risk and Vulnerability Assessment (NRVA) survey. This survey provides data on over 500 household characteristics.
The important variables we find that are associated with FI are robust across the models we used. Unsurprisingly and as found in other studies, household size, income, access to resources and farm production, and assets-related measures are relevant. However, we found persistent forces at work and not just short-term shocks that affect FI, such as the incidence of household stresses and negative household conditions from previous years, and long-term household characteristics such as quality of housing. These persistent characteristics are commonly omitted in prior studies. We also found that many commonly used explanatory demographic variables were less important. Overall, the use of a broader set of characteristics improved our ability to discriminate between FI and food-secure households, and this may be relevant in broader FI research.

Literature Review
FI threatens people's lives and the sustainability of development efforts, not only in Afghanistan but also in many Asian and African countries. Common approaches to overcome FI issues involve the provision of development assistance. However, the capacity to provide such assistance is limited, and targeting assistance to those most in need is critical. There is no "gold standard" means of targeting FI households [13], and the use of commonly identified indicators such as Hoddinott's [14] list of (a) food intake, (b) household energy acquisition, (c) dietary diversity, and (d) coping strategies requires costly household-level information. Therefore, many studies focus on the estimation of FI probability models using econometrics.
Such studies have commonly used several major classes of variables. These include (a) demographic measures, such as age of the household head, the gender of household head, the education level of the household head, and the household size; (b) farming-related measures, such as farm size, livestock holdings, fertilizer application; and (c) economic status-related measures, such as occupation, off-farm income, credit access, and the region of the country [15][16][17][18][19]. However, there might be bi-directional relationships between the above classes of variables and household food security. For example, an increase in off-farm income can decrease FI. However, an increase in FI can lead to a decrease in the off-farm income, because the increase in FI can lead to health issues, thus reducing off-farm income. Meanwhile, there might be omitted variable problems due to a lack of household-level information. Bi-directional causal relationships and omitted variables cause endogeneity problems [20]. The use of instrumental variables (IV) is then in order, but the regression result is sensitive to IV selection [21].
In terms of econometric model selection, censored data (mainly logistic) regression models are most commonly used, as the food security or food insecurity variable is binary [15][16][17][18][19]. Limited by degrees of freedom, researchers usually choose several variables that are potentially explanatory, which raises the problem of explanatory and instrumental variable selection when endogeneity occurs.
Machine learning approaches, especially decision trees and random forests, relax causal assumptions and have been widely used to identify indicators from data. Studies have addressed such things as indicators of cancer (e.g., [22,23]) and loan defaults (e.g., [24,25]). Machine learning models commonly start with a large number of possible household characteristics that could be used (also called features) then select the best set, which avoids the a priori variable selection issue. In machine learning, the GUIDE (GUIDE Classification and Regression Trees and Forests) procedure is designed to yield an unbiased estimation [26,27].
Few machine learning studies have addressed FI or poverty. Barbosa and Nelson addressed FI in Brazil and found a model able to identify FI households with 75% accuracy [28]. They employed a Support Vector Machine (SVM) method using 75 household characteristic features. However, they did not report the nature of the characteristic features that enabled them to classify FI households. Mwebaze et al. addressed FI identification in Uganda using a household crop survey coupled with satellite data [11]. They included only 13 features in their model. Hossain et al. predicted FI households in Bangladesh [4]. They found equal performance between machine learning and non-machine learning methods. Their models included fewer than 30 features.
As stated above, none of these studies worked on relatively large datasets with hundreds of household characteristic features that could be used in model estimation. Thus, our study extends the FI analysis using machine learning over a large dataset with hundreds of possible household characteristic features and, in turn, using less restrictive assumptions to efficiently identify FI households.

Methodology
Among machine learning methods, decision tree formation is a common approach for classification. Generally, the approach recursively partitions the explanatory characteristics into relevant and irrelevant groups, and in doing this the misclassification cost should be reduced. The best partition for the variables is determined as the one with the least misclassification cost in classification discrimination. In turn, the results are displayed in the form of a "decision tree", where the status of a particular household characteristic is investigated, and based on the nature of that household status the households are classified. At the top of the tree model is the so-called "Root" node. This is the status of the most important household characteristic that is selected as the one with the highest classification potential among all of the investigated household characteristics. This process is then repeated to find the next most discriminating household feature and the FI classification or the need to investigate the additional characteristics it implies. In doing this, the algorithm runs an exhaustive search across all household features to determine the most important one and tests how household status within that characteristic is associated with FI status to determine the nature of the decision tree split. After addressing a number of household characteristic features, the procedure ends at a leaf node where the household is given an FI classification status. Both the order of importance of the independent variables and the way the tree splits toward ultimate FI classification are calculated iteratively using methods such as the χ² split method, the Gini index split method, or the information gain method. For this study, we used the GUIDE classification, regression trees and regression forests method (version 34.0). Loh indicates that, relative to other decision tree models, GUIDE has the advantage of producing unbiased predictors and is better at treating missing values [26,27].
In this study, the χ² split method is used. Household characteristic feature selection and split criterion calculation continue to grow the decision tree until the available features are exhausted or a stopping criterion is met [27,29-31]. To avoid overfitting, the tree is pruned using 10-fold cross validation [26]. A leaf node is removed if the overall cross-validation cost can be reduced without it. The resulting final tree is presented as a series of household status questions. For each household, it evaluates the status relative to the feature addressed in the root and then, if needed, in subsequent nodes. If the observation positively exhibits the characteristic that is the subject of that node (e.g., the response to the root node about the source of financing (q_5_26) is that one would mortgage the house or land, as portrayed by the answer being in S1 in Figure 1), it goes to the next node on the left (predicted as an FI household). Otherwise, it goes to the next node on the right (which checks whether the effective household size is smaller than 5.17). Then, the next most important characteristic is investigated and the inquiries move according to the tree. The process is repeated until a leaf node is reached. At that point, the household at hand is assigned the relevant FI classification.
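To make the idea of a χ² split criterion concrete, the following is a minimal sketch of ranking candidate binary features by the Pearson χ² statistic of their cross-tabulation with FI status. It is only an illustration: GUIDE's actual variable selection involves considerably more machinery (continuous variables, interactions, missing values), and the feature names and counts below are hypothetical.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table [[a, b], [c, d]].

    Rows are the two values of a candidate feature; columns are FI and
    non-FI counts. Assumes every row and column total is nonzero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def best_feature(tables):
    """Return the feature whose table has the largest chi-square statistic."""
    return max(tables, key=lambda name: chi_square_2x2(tables[name]))


# Hypothetical counts: [[FI, non-FI] when feature is true, [FI, non-FI] otherwise].
tables = {
    "feature_a": [[30, 10], [20, 40]],  # strongly associated with FI status
    "feature_b": [[25, 25], [25, 25]],  # no association
}
selected = best_feature(tables)
```

Here `feature_a` would be chosen for the split, because its counts depart most from what independence between the feature and FI status would predict.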
As the importance of some factors may be biased by sampling noise, we also use the random forest approach to build multiple trees using feature and sample subsets [32]. Then, we build a final tree that reflects the majority of the relevant features that appear within the trees in the "forest". Each tree in the forest is built using a subset of features and observations. Therefore, the random forest method minimizes the chance of overfitting and the associated bias caused by particular variables and observations. The random forest also produces a probability for each household that it is FI. In our study, 500 trees were formed using the random forest approach.
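The two random ingredients of a forest described above can be sketched as follows. This is a generic illustration and not GUIDE's implementation: drawing rows with replacement (a bootstrap sample) and reporting the FI probability as the share of trees voting FI are assumptions, and the feature names and vote counts are hypothetical.

```python
import random


def draw_tree_inputs(n_rows, feature_names, n_features, rng):
    """Pick the observations and features used to grow one tree.

    Rows are drawn with replacement (assumed bootstrap sampling);
    features are drawn without replacement.
    """
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    features = rng.sample(feature_names, n_features)
    return rows, features


def forest_probability(tree_votes):
    """Share of trees that vote FI (1 = FI, 0 = non-FI)."""
    return sum(tree_votes) / len(tree_votes)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows, features = draw_tree_inputs(10, ["f1", "f2", "f3", "f4"], 2, rng)

# Hypothetical votes for one household from a 500-tree forest.
votes = [1] * 320 + [0] * 180
prob_fi = forest_probability(votes)
```

Because each tree sees a different subset of rows and features, no single noisy variable or observation can dominate every tree in the forest.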
Finally, we needed to deal with the issue of an unbalanced sample, as we had unequal numbers of FI and non-FI households. After constructing the FI indicator across our dataset (as discussed below), we found that about 27% of households were classified as FI and 73% were not. Such an unbalanced dataset can cause a null tree or poor splits [33,34]. To overcome this, we adopted the assumption of equal prior FI and non-FI household probabilities (instead of empirical priors) and uneven misclassification costs. We used a cost ratio of 1.5:1, with a penalty cost of 1.5 for misclassifying an FI household as non-FI and a penalty cost of 1 for misclassifying a non-FI household as FI. The use of equal priors causes GUIDE to emphasize the search for key variables that can separate FI from non-FI households. The use of uneven misclassification costs makes it more "costly" to classify FI households as non-FI ones relative to the opposite case, and reduces the likelihood of FI misclassification.
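The effect of priors and misclassification costs on the label assigned to a node can be illustrated with the standard minimum-expected-cost rule. The sketch below assumes that the penalty of 1.5 applies to labelling an FI household as non-FI (as the closing sentence of the paragraph implies) and that GUIDE applies priors in the usual way; the node counts are hypothetical.

```python
def node_label(n_fi, n_s, total_fi, total_s,
               prior_fi=0.5, prior_s=0.5,
               cost_fi_as_s=1.5, cost_s_as_fi=1.0):
    """Label a node by minimum expected misclassification cost.

    n_fi, n_s: FI and non-FI households falling in the node.
    total_fi, total_s: FI and non-FI households in the whole sample.
    """
    # Prior-weighted chance of being in this node, per class.
    p_fi = prior_fi * n_fi / total_fi
    p_s = prior_s * n_s / total_s
    cost_if_predict_s = cost_fi_as_s * p_fi   # FI households would be missed
    cost_if_predict_fi = cost_s_as_fi * p_s   # non-FI households flagged
    return "FI" if cost_if_predict_fi < cost_if_predict_s else "non-FI"


# Hypothetical node with 40 FI and 60 non-FI households, in a sample of
# 270 FI and 730 non-FI households (roughly the 27%/73% imbalance).
with_equal_priors = node_label(40, 60, 270, 730)
with_empirical_priors = node_label(40, 60, 270, 730,
                                   prior_fi=0.27, prior_s=0.73,
                                   cost_fi_as_s=1.0, cost_s_as_fi=1.0)
```

With empirical priors and equal costs the node is labelled non-FI, since non-FI households are the majority in it. With equal priors and the 1.5:1 costs the same node is labelled FI, which is how the adjustment raises recall for the minority class.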

Data Description
The data used were drawn from the 2008 Afghanistan National Risk and Vulnerability Assessment (NRVA) Survey. That survey covers 20,511 randomly selected households across the whole country. The results contain demographic information, indicators of living condition, food security, labor market participation, education, health, and many other items. The URL to the full questionnaire is provided in Appendix A. Numerical yes or no answers (coded as 0/1) and not applicable responses are included in the data. A separate answer was created for missing or not applicable responses.

Handling Sample Weights
The sampling was randomized, and the dataset contains weights for each observation that are designed to represent all households in the country [35]. These weights differ by household, and since the estimation procedures we used did not directly handle differential weighting, we replicated each observation according to its weight so that each resultant observation was equally likely. This resulted in 3,426,445 observations. The response to each survey question is treated as an individual feature in the dataset. We used all of the household responses to questions as possible indicators of food security, except those for food consumption. Food consumption questions were used to construct the FI indicator. This resulted in the dataset having 581 features that could be selected by the model. Each feature represented the answer to one of 581 survey questions. The household observations in the dataset were randomly divided into a set for model training (80% of the cases) and a set for out-of-sample testing (20%).
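The replication step can be sketched as below. Rounding each weight to the nearest integer and splitting after replication are assumptions made for illustration, since the text does not state either detail; the household identifiers and weights are hypothetical.

```python
import random


def replicate_by_weight(rows, weights):
    """Repeat each row round(weight) times so every copy counts equally."""
    expanded = []
    for row, weight in zip(rows, weights):
        expanded.extend([row] * round(weight))
    return expanded


def train_test_split(rows, train_share=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical households and survey weights.
households = ["hh1", "hh2", "hh3"]
weights = [2.0, 5.0, 3.0]

expanded = replicate_by_weight(households, weights)
train, test = train_test_split(expanded)
```

After replication, a household with weight 5 contributes five identical rows, so an unweighted learner treats it as five times as influential as a household with weight 1.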

Construction of FI indicator
The per capita sufficiency of calorie intake was used as the FI indicator. This is calculated based on the responses to the survey questions on food consumption quantities and the needed per capita calorie amount. Consumption amounts were given for 90 different foods in 10 different categories. We also added calories from meals eaten outside the home and subtracted calories from meals consumed by guests.
The broad categories are: (1) bread and cereals, (2) meat and fish, (3) dairy and eggs, (4) oil, (5) vegetables, (6) fruits, (7) nuts, (8) sugar and sweets, (9) beverages, and (10) spices. We then calculated the calorie intake by household members based on the calories in food consumed during the past week. We used caloric information from the United States Department of Agriculture Food Composition Databases [36]. We then calculated the sufficiency relative to needs, which vary by the age and gender of household members, based on the recommended dietary requirement provided by the National Research Council (US) [37] (see Appendix B Table A1 for details). As a result, we computed a calorie requirement of 2550 calories per household member and classified a household as FI if its per capita calorie intake was below that. Otherwise, the household was classified as food-secure. As a result, 26.7% of the households were classified as FI.
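The classification rule can be sketched as a simple threshold check. The sketch assumes the 2550-calorie requirement is a daily figure and divides by a plain household size; the paper's handling of effective household size, guest meals, and meals eaten outside the home is not reproduced, and the example numbers are hypothetical.

```python
CALORIE_THRESHOLD = 2550  # per household member; assumed to be per day


def per_capita_daily_calories(weekly_household_calories, household_size):
    """Convert a household's weekly calorie total to calories per person per day."""
    return weekly_household_calories / 7 / household_size


def is_food_insecure(weekly_household_calories, household_size,
                     threshold=CALORIE_THRESHOLD):
    """A household is FI if its per capita intake falls below the threshold."""
    intake = per_capita_daily_calories(weekly_household_calories, household_size)
    return intake < threshold


# Hypothetical households of six members.
low_intake = is_food_insecure(100_000, 6)   # about 2381 calories per person per day
high_intake = is_food_insecure(120_000, 6)  # about 2857 calories per person per day
```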

Constructing a Household Income Measure
A per capita household income measure was also needed, but the survey did not ask about this directly. However, the survey did ask for the amount of income from the main household source and for the proportion of the household's total annual income that this source represents. We then calculated the total household income (HHincome) as:
\[ \mathrm{HHincome} = \frac{MI}{PMI} \]
where MI is the annual income from the main income generating activities (q.8.4) and PMI is the proportion of the household's total income from the main income source (max (q.8.2.1 to 8.2.6)).
We calculated the income per capita (IpC) as:
\[ \mathrm{IpC} = \frac{\mathrm{HHincome}}{\mathrm{HHsize}} \]
where HHincome is the total household income and HHsize is the effective household size.
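As a worked example of the income construction, the sketch below assumes that total household income is the main-source income divided by the share of total income that source represents, and that income per capita divides this total by the effective household size. The numeric values are hypothetical.

```python
def household_income(main_income, main_income_share):
    """Scale main-source income (MI) up to total income using its share (PMI)."""
    if not 0 < main_income_share <= 1:
        raise ValueError("main_income_share must be in (0, 1]")
    return main_income / main_income_share


def income_per_capita(hh_income, hh_size):
    """Total household income divided by effective household size."""
    return hh_income / hh_size


# Hypothetical household: the main source yields 60,000 Afghanis a year
# and accounts for 75% of total income; effective household size is 5.
hh_income = household_income(60_000, 0.75)
ipc = income_per_capita(hh_income, 5)
```

In this example the total household income is 80,000 Afghanis and the income per capita is 16,000 Afghanis.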

Resultant Statistics
A table of summary statistics for the selected items, including those calculated by the authors, is reported in Table 1. This table only shows the 15 most important FI determining household characteristics identified and ranked by GUIDE. The ranking will be discussed in detail in the next section. The summary statistics are grouped by the household FI type and show the average difference between the FI and food-secure household groups.

Model Validation
To validate the models we built, we randomly split the dataset into training (80% of the households) and testing (20% of the households) subsets. All the models were built using the training set (in sample). We then evaluated the model performance using the testing set. In particular, we used our estimated model to predict the food security status of each household and evaluated how well the prediction matched the household's FI status. The recall rate, which is the number of FI households that were successfully predicted as FI divided by the total number of FI households in the dataset, is used as the model validation indicator. If the recall rates in the training and testing datasets are similar, this indicates that our models are well calibrated. The resultant calculated recall rates are shown in Table 2. For the decision tree model, the recall rate was 80% in the training set and 79% in the testing set. When we used only the 100 most important variables in model training, the training recall rate increased to 81%, while the testing recall rate remained at 79%.
For the random forest model, the recall rate was about 82% in the training set and 80% in the testing set. The closeness of these recall rates across the model versions and datasets leads us to conclude that our models are well calibrated.
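The recall rate defined above can be computed directly from actual and predicted labels. The following is a minimal sketch with hypothetical labels, where "FI" marks food-insecure households and "S" marks food-secure ones.

```python
def recall(actual, predicted, positive="FI"):
    """Share of truly positive cases that the model also predicts as positive."""
    true_positives = sum(
        1 for a, p in zip(actual, predicted) if a == positive and p == positive
    )
    actual_positives = sum(1 for a in actual if a == positive)
    return true_positives / actual_positives


# Hypothetical test set: five FI households, four of them identified.
actual = ["FI", "FI", "FI", "FI", "FI", "S", "S", "S"]
predicted = ["FI", "FI", "FI", "FI", "S", "S", "FI", "S"]
recall_rate = recall(actual, predicted)
```

Note that recall ignores food-secure households wrongly flagged as FI (the seventh household here), so it measures coverage of FI households rather than overall accuracy.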

Results and Discussion
The estimated decision tree pruned by 10-fold cross validation is shown in Figures 1 and 2. The first split variable involves the response to the question "If your household had to borrow money in the future, who is the first source you would borrow from?" (q_5_26). If the answer to this question is "Mortgaging land/house" (set S1), those households are classified as FI. The numbers beside this node show that in the testing set, 78% of the households that gave this answer were truly FI.
If the household responds that it used other sources for borrowing, the next question is whether the household size (V1) is less than 5.17 people. If the household is smaller than that, the applicable node is to the left, where the tree addresses whether expenditures on "Annual celebrations and charitable donations" (q_12_38) are less than or equal to 1003 Afghanis. Otherwise, the applicable node is to the right, where the next characteristic involves what is the most important household income-generating activity (q_8_1_1, the right branches on Figure 2). The questioning then proceeds until a leaf node is reached (where the questions run out). The class denoted at that leaf node is the predicted class for the household, where "I" (or a red node) indicates an FI household and "S" (or the green node) stands for a non-FI household. The questions for each node and the responses used to split the nodes are listed in Appendix C.
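The first two levels of the tree described above can be written as a short routing function. This is only a sketch of the nodes named in the text: the dictionary keys and string encodings are hypothetical, and the deeper nodes shown in Figures 1 and 2 are not reproduced, so the function reports which question comes next instead of a final class.

```python
def route_first_levels(household):
    """Follow the root split and the household-size split described in the text."""
    # Root node: first source the household would borrow from (q_5_26).
    if household["q_5_26"] == "Mortgaging land/house":
        return "FI"
    # Second node: effective household size (V1) below 5.17 goes left.
    if household["V1"] < 5.17:
        return "next: q_12_38 <= 1003"
    return "next: q_8_1_1"


# Hypothetical households.
mortgager = {"q_5_26": "Mortgaging land/house", "V1": 7.0}
small_household = {"q_5_26": "Other", "V1": 4.2}
large_household = {"q_5_26": "Other", "V1": 6.8}

routes = [route_first_levels(h)
          for h in (mortgager, small_household, large_household)]
```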
For a better demonstration of classification, Figure 3 shows the association of FI with the "Annual Celebrations and Charitable Donations" and income per capita in the training set, with separate panels for each different response on money borrowing sources. The red dots represent FI households, while the green dots represent food-secure households, with the size of the dots showing the number of households each dot represents. Recall that the tree classifies households having to mortgage land or house for money as FI. It is clear in the figure that most households in the "Mortgaging land/house" category (third plot in the second row) are FI.
While it is not surprising that household income or expenditure-related variables are important, the decision tree also contains several housing-related variables such as "What is the major construction material of the exterior walls of dwelling" (referencing question q_2_2 as listed in the appendix), as well as some variables describing potential short run stresses faced by the households, such as "How would you compare the overall economic situation of the household with 1 year ago?" (q_16_7). We can divide the selected variables into six major classes:

• Income and expenditure items, such as income per capita and expenditures on "Annual celebrations and charitable donations" (q_12_38).
• Household size (V1).
• Farm-related measures, such as the three most important crops harvested in the last cultivation season (e.g., wheat, maize, barley) and the area of the land that was rented out in the last summer cultivation season (q_4_10, q_4_15_1, q_4_15_2, and q_4_15_3).
• Realized stresses, such as "How would you compare the overall economic situation of the household with 1 year ago?" (q_16_7) and "How often in the last year did you have problems satisfying the food needs of the household?" (q_16_8).
• Long-term household characteristics representative of accumulated wealth, such as the construction material used in the exterior walls of dwellings (q_2_2), the type of kitchen/cooking facility in the dwelling (q_2_15), and the ownership of the dwelling (q_2_10).
• Access to resources, such as the connection to sewage (q_2_29), the distance to the main source of water (q_2_34), the ability to access the main water source at any time (q_2_36), the main source of irrigation water (q_4_16), the knowledge of iodized salt (q_16_3_1), and the purchase of iodized salt (q_16_5 and q_16_6).
In the previous literature, income and expenditure and farm-related measures are commonly used [4,15-19,38]. Access to resources and the incidence of short-term shocks are used in some studies, but not often. Abdula [38] used distance from a main road and market access. We could not find the use of long-term household characteristics other than an attempt in Hossain et al. [4], who tried building materials but did not find them important. We do note that Carter and Barrett [39] and Mammen, Bauer, and Richards [40] argue that purely treating the food security problem for households as a current problem with current causes is not always appropriate, arguing for the importance of persistent forces. Our result supports their conclusion.
Another thing shown by our key results is that the association between the food security problem and the indicators can be bidirectional. Not only does a low income cause FI, but it also tends to cause lower quality housing through low wealth accumulation. In addition, FI causes a limited ability to work, which likely lowers the income and the ability to own higher quality housing. Thus, while the selected independent variables are associated with FI, we do not assert that they are the cause of it.
As there is no coverage in the survey that addresses wealth or assets directly, the model is likely to select variables such as the material of the dwellings as a proxy. We feel it would be good to include long-term wealth-related questions in future surveys. We believe that our results indicate that the long-term household financial status influences FI and that it would be good to have direct information on this for use in FI classification.
We also found limited demographic FI associations, with effective household size being the only selected demographic feature. This implies that when there is enough information on other characteristics of the household, the demographic information may well be superseded in importance. Future research may not need to dwell so much on demographics to develop FI classifications.
To better show the importance of each feature in the model, we calculated their importance scores following Loh [27] and Loh, He, and Man [41]. Table 3 shows the 15 most important variables selected. As the importance score calculations consider the number of observations, the order shown in Table 3 is slightly different from that in the model. Again, these most important variables are a combination of expected and unexpected forces.
Figure 3. The association of food security with the income per capita and annual celebrations and charitable donations (q_12_38) separated by money borrowing sources (q_5_26). Note: This figure shows the intuition of the first split from the model. The majority of the households having to mortgage their house or land for money borrowing have a food security problem. For other categories of q_5_26, the two classes of food security are messy. This figure is produced with uninflated data, so that each dot represents a household without compounding by its weight.
Most of the 581 household characteristic features were not found to contribute to FI identification relative to the ones we list above, but created noise in the estimation. The model was re-estimated with the first 100 variables in the importance score ranking. With the smaller set of variables as the input, the mean misclassification cost improves to 0.3606 (a 1.5% reduction) on the testing set.

Robustness Tests
As the importance of some factors may be biased by noise in the sample, we also built a random forest model with 500 trees. We find that 35 features were selected across all 500 trees. The ones selected are similar to those presented above for the decision tree model. Another random forest was built with only the top 100 most important features, and the AUC increased from 0.845 to 0.851. (The AUC is a commonly used measure of classifier performance: a purely random classifier has an AUC of 0.5, while a perfect classifier has an AUC of 1.) The FI features selected by the models are all similar. We believe the random forest does not add much, and thus we do not cover the selected features or a composite tree.
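For readers unfamiliar with the AUC, the following minimal sketch computes it as the share of (FI, non-FI) pairs in which the FI household receives the higher predicted probability, counting ties as half. This pairwise form is a standard equivalent definition and not necessarily how the study computed it; the scores below are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (1 = FI, 0 = non-FI)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # FI household ranked above non-FI household
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted FI probabilities.
perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
uninformative = auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
partial = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

The pairwise loop is quadratic in the number of households, so it suits small illustrations only; a rank-based computation is preferable on a dataset of the size used here.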

Conclusions and Policy Implications
Food aid and other household assistance resources are limited but are vital to sustainable development in settings like Afghanistan. Thus, it is vitally important to accurately target households that need assistance. This study identified household characteristics that discriminate between food-insecure and food-secure households, facilitating accurate assistance targeting.
Our procedure involves using machine learning classification models to identify key household characteristic features that are associated with the likelihood that a household is food-insecure. We employ a classification decision tree and a random forest model to do this, using equal priors and 1.5:1 misclassification penalties. The resulting model is able to correctly identify 80% of the food-insecure households.
Across the study, 35 key household characteristic features were selected to identify FI households among 581 possible characteristics. Six major groupings of characteristics were identified: a) income and expenditure items, such as income per capita and expenditures for some goods; b) household size; c) farm-related measures, such as the major crops harvested; d) access to particular resources, such as the distance to water sources and the usage of iodized salt; e) short-term shocks (e.g., "How would you compare the overall economic situation of the household with 1 year ago?"); and f) household dwelling characteristics, such as the material of the exterior walls of the dwelling, the type of kitchen, and the ownership. The first five types of variables are more commonly used in previous studies. However, we could not find studies that successfully used dwelling attributes as explanatory variables to estimate or predict food insecurity. We believe the dwelling attributes reflect household wealth, which implies that food security is a long-term problem affected by the overall household economic condition and is not just an issue caused by short-term shocks [39,40].
In addition, our findings support the idea that the food security problem persists and has a bidirectional relationship with other variables. Thus, long-term household wealth-related variables are recommended for inclusion in future surveys and in exercises identifying food-insecure households. Moreover, long-term household characteristics are also more easily observed and can be more objectively identified, which may reduce the cost of identifying FI households.
We also found that the demographic variables, such as the gender and education level of the household head, which were commonly used in previous research, were not selected as key indicators, and this may imply that other information is key in predicting food insecurity.
In summary, this study advances a key set of household characteristics that could be used to quickly identify FI households and increase the financial assistance-targeting accuracy. We recommend the use of income, family size, farm-related measures, incidence of short-term stresses, dwelling characteristics, and access to resources. This is a mix of short and long-term indicators.
Finally, we should mention the limitations of our work. The results are based on only one dataset for one somewhat dated year from Afghanistan. There is a need to explore the indicators we identify in other settings over time and space to confirm the importance of the identified characteristics on household food insecurity.

Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflicts of interest.

Appendix A. Survey Questions
The complete NRVA questionnaires are available at https://sites.google.com/tamu.edu/chengchengfei/research?authuser=1.
q_2_36: Are you able to access this main water source whenever you want?
q_3_5_f: Which household member mainly manage goats?
q_4_10: How many jeribs of irrigated land did you or your household rent out during the most recent summer cultivation season?
q_4_15_1: What were the first important crops you harvested in the last summer cultivation season?
q_4_15_2: What were the second important crops you harvested in the last summer cultivation season?
q_4_15_3: What were the third important crops you harvested in the last summer cultivation season?
q_4_16: What was the main source of irrigation for the majority of the irrigated land you cultivated during the summer cultivation season?
q_5_1_5: How many Radio machine does your household own?
q_5_8_3: According to the current prices, how much do you think you could get if you sold all of gilim, satrangi, namad, fash (other carpet production)?

Appendix B. Reference of Food Security Indicator Construction
q_5_17: What was the main use of the largest loan taken in the last year?
q_5_26: If your household had to borrow money in the future, who is the first source you would borrow from?
q_8_1_1: What are your household's income generating activities in order of importance (first order)?
q_10_4: Why did none of your household members participate in any cash-for-work programme or income generating programme or projects during the past 12 months? Mark one main reason
q_10_11: Why did none of your household members participate in any food aid programmes during the past 12 months? Mark one main reason
q_12_4: What has the household spent in the last 30