Health Care, Medical Insurance, and Economic Destitution: A Dataset of 1042 Stories

The dataset contains 1042 records obtained from inpatients at hospitals in the northern region of Vietnam. The survey process lasted 20 months from August 2014 to March 2016, and yielded a comprehensive set of records of inpatients’ financial situations, healthcare, and health insurance information, as well as their perspectives on treatment service in the hospitals. Five articles were published based on the smaller subsets. This data article introduces the full dataset for the first time and suggests a new Bayesian statistics approach for data analysis. The full dataset is expected to contribute new data for health economic researchers and new grounded scientific results for policymakers.


Summary
This paper presents a comprehensive dataset of inpatients' financial conditions, demographic information, opinions about treatment, and hospital fees. The survey, conducted from August 2014 to March 2016, strictly conformed to the ethical standards of the International Committee of Medical Journal Editors (ICMJE) Recommendations, the World Medical Association (WMA) Declaration of Helsinki, and Decision 460/QD-BYT of the Vietnamese Ministry of Health. The survey process was long because of the sensitive nature of the research. The survey team approached patients and/or their families and gradually asked about sensitive matters related to their financial situation and to their attitudes and behaviors regarding the hospital and the treatment process, such as bribery or length of stay. In some instances, the process took three to four weeks because of emotional instability on the part of the patient or their family. Eventually, 1042 records were collected. Smaller subsets have been derived from the dataset and analyzed to explore health insurance issues [1], health care payments, financial destitution [2][3][4], and satisfaction with healthcare services [5].
The submitted dataset provides the full 1042 observations and the entire set of coded variables. A demonstration analysis using a Bayesian statistics approach is also introduced in the article. The comprehensive information in the dataset and the new method are expected to provide resources for health economic researchers investigating healthcare and health insurance services in transitional economies such as Vietnam.
In the Data Description section, we explain the coded variables in detail and propose some potential research questions that might be explored using the dataset. The employed methods and examples of analysis are then shown in the Methods section. Finally, the article concludes with the limitations and implications of the dataset.

Data Description
The dataset includes 1042 records of patients' demographic information, financial status, opinions about treatment, and hospital fees. Previously, smaller datasets of 330 and 900 records extracted from this dataset were used to explore health insurance and healthcare services [1,2,5] as well as the financial burden of patients [2][3][4] in Vietnam. The current dataset, which has not been published before, presents all of the records with all measured variables. There are 15 categorical (discrete) variables and 15 numerical (continuous) variables. Some of these variables can be used indirectly. For instance, the numerical variable "Income" was used to construct "IncRank". Details of the categorical variables can be found in Table 1.
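As a rough illustration of how a categorical variable such as "IncRank" can be derived from a numerical one such as "Income", the sketch below bins a made-up income vector at its terciles. The tercile rule, the three labels, and the income values are all assumptions for illustration; the actual coding of "IncRank" is given in the dataset itself.

```r
# Hypothetical incomes (arbitrary units); NOT taken from the dataset.
income <- c(1.5, 3.0, 4.2, 6.8, 9.5, 12.0, 20.0, 35.0)

# Assumed rule: split income into three ranks at its terciles.
breaks <- quantile(income, probs = c(0, 1/3, 2/3, 1))
inc_rank <- cut(income, breaks = breaks,
                labels = c("Low", "Mid", "High"),  # assumed labels
                include.lowest = TRUE)

# Each observation receives exactly one rank.
print(table(inc_rank))
```

Any other binning rule (fixed thresholds, quintiles) would follow the same pattern by changing `breaks` and `labels`.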

The variable "Illness" records the seriousness of the patient's illness or injury. In the dataset, the variable "Ill2" combines the two values "ill" and "light" into the single value "light" for analysis.
Figures 1 and 2 visualize the variables "Burden" and "IfHigher". Figure 1 confirms the intuitive observation that lower-income patients tended to have a higher financial burden, while total medical expenditures and daily costs rose with the degree of financial burden. This result indicates a finance-health dilemma for low-income patients in Vietnam. Figure 2 shows that the income of male patients was relatively higher than that of female patients, while total medical expenditures and average daily costs were relatively similar for males and females. The implication is that female patients faced a greater financial risk than their male counterparts.
Figure 2. The level of "Income," "Spent," and "Dcost" according to the types of "IfHigher" of the patients.
Figure 3 shows the distribution of patients' ages as a histogram, created using the numerical variable "Age". Most patients ranged from their late teens to their early 60s, with people in their 50s representing the highest percentage.
Since its economic reforms, Vietnam's health care system has experienced major changes, which have greatly affected the delivery and financing of health services [6,7]. Several issues related to efficiency and equity have been raised. The costs of visiting a doctor and of drugs are relatively high for many households [8]. In addition, travel costs and the time required for treatment may add to the financial burden and interrupt income during the treatment period.
Low-income households usually spend a higher percentage of their monthly income on health services than wealthier households. As a result, the risk of being destitute seems to be higher among poor households [9]. This dataset can, therefore, provide evidence and trends regarding the financing methods of Vietnamese patients in health services.
Table 3 shows some potential research questions and hypotheses that can be examined by employing this dataset. Several research questions and hypotheses have already been explored using smaller datasets [1][2][3][4][5].
• What are the effects of socio-demographic factors on the probability of being destitute?
• To what extent are socio-demographic factors the determinants of the degree of illness?
• What is the impact of hospitalization length on patients' financial burden?
• How do the treatment costs and illness explain the end outcome of treatment?
• How does the amount of out-of-pocket "extra thank-you money" determine the end outcome of treatment?

Data Collection
To collect the data, 1042 patients from a number of hospitals in the northern region of Vietnam were surveyed by questionnaire. The surveyed hospitals were major hospitals in the region, such as Viet Duc Hospital and Bach Mai Hospital in Hanoi, Viet Tiep Hospital and Kien An Hospital in Haiphong, and Uong Bi Hospital in Quang Ninh, to name a few. Further details can be seen in the dataset. The survey strictly conformed to the ethical standards of the ICMJE Recommendations, the WMA Declaration of Helsinki, and Decision 460/QD-BYT of the Vietnamese Ministry of Health. A total of 330 records were collected during the first phase, from 10 August 2014 to February 2015. More records were obtained from February to May 2015, raising the total number of observations to 900. The third and final phase ended in March 2016, with the final set of 1042 patient records.
The survey took 20 months to finish because of the sensitive nature of the research. For instance, there were cases in which the survey team had to approach the patients or families four to five times over the course of four weeks in order to collect one questionnaire. Some patients or their family members became too emotional to finish the survey as they thought of the severity of their illnesses.
Raw data from the collected questionnaires were entered into an Excel file, 1042data.xlsx (see the dataset). The data were then edited and saved in CSV format for analysis in the R statistical software (v3.5.3). Both frequentist and Bayesian statistics approaches were explored in the data analysis.

Frequentist Analysis
The analysis used the baseline-category logits (BCL) model [10]. Because the current dataset combines discrete and continuous variables, logistic regression was a suitable method for demonstrating the independence or association among variables. Using the estimated coefficients, the logistic model can estimate the probability of each value of the response variable conditional on the explanatory variables.
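The general equation of the logistic model is as follows:

$$\log \frac{\pi_j(x)}{\pi_J(x)} = \alpha_j + \beta_j' x, \quad j = 1, \ldots, J - 1,$$

where $\pi_j(x) = P(Y = j \mid x)$, with $Y$ as the response variable, indicates the probability of category $j$ corresponding to the explanatory variable $x$, and category $J$ is the baseline.
The probability of each response category was calculated as follows:

$$\pi_j(x) = \frac{\exp(\alpha_j + \beta_j' x)}{1 + \sum_{h=1}^{J-1} \exp(\alpha_h + \beta_h' x)}, \quad j = 1, \ldots, J - 1.$$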
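To make the step from logits to probabilities concrete, the following minimal sketch converts baseline-category logits into category probabilities by exponentiating and normalizing, with the baseline category contributing exp(0) = 1. The helper `bcl_probs` and all coefficient values are hypothetical and chosen only for illustration; they are not estimates from the dataset.

```r
# Convert baseline-category logits into category probabilities.
# eta: linear predictors for the J - 1 non-baseline categories;
# the baseline category has linear predictor 0.
bcl_probs <- function(eta) {
  expo <- exp(c(eta, 0))   # append the baseline term exp(0) = 1
  expo / sum(expo)         # normalize so the probabilities sum to 1
}

# Hypothetical coefficients for J = 3 categories and one binary predictor x.
alpha <- c(0.5, -0.2)      # assumed intercepts
beta  <- c(1.0, 0.3)       # assumed slopes
x <- 1

p <- bcl_probs(alpha + beta * x)
names(p) <- c("cat1", "cat2", "baseline")
print(round(p, 3))
```

Taking the log of the ratio of any category's probability to the baseline probability recovers that category's linear predictor, which is a quick way to check the conversion.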
The current article employs the analysis used in [2], which estimated the probability of the type of Burden using the 330-observation dataset. This time, the model was re-run using the full 1042-observation dataset. Table 4 reports the results obtained from the estimations.
The analysis was executed using the following R commands:

    library(nnet)       # multinom() fits the baseline-category logit model
    library(stargazer)  # formatted regression tables
    # data1 is the data frame read from the CSV file; Res and Insured are factors
    data1$Res <- relevel(data1$Res, ref = "Yes")
    data1$Insured <- relevel(data1$Insured, ref = "Yes")
    logit_burden <- multinom(Burden ~ Res + Insured, data = data1)
    stargazer(logit_burden, type = "text", out = "logit_burden.htm")

The probabilities of each burden outcome were also calculated for each condition of residency and insurance. The results are shown in Figure 4.
This dataset indicated a similar decreasing trend in the probabilities of destitution for both long-time and short-time hospitalization (see Figure 5). It also confirmed that a longer hospital stay increased the risk of falling into destitution [5].
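The calculation of outcome probabilities for each condition of residency and insurance can be sketched as follows. The data frame `sim` is randomly simulated and stands in for the survey data only so that the example runs on its own; the printed probabilities therefore carry no empirical meaning. The variable names mirror those in the commands above, and the use of `predict(..., type = "probs")` is one possible way to obtain the probabilities, not necessarily the one used by the authors.

```r
library(nnet)

set.seed(1)
n <- 400
# Simulated stand-in for the survey data; values are NOT from the dataset.
sim <- data.frame(
  Res     = factor(sample(c("Yes", "No"), n, replace = TRUE)),
  Insured = factor(sample(c("Yes", "No"), n, replace = TRUE)),
  Burden  = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
)
sim$Res     <- relevel(sim$Res, ref = "Yes")
sim$Insured <- relevel(sim$Insured, ref = "Yes")

# Baseline-category logit model; trace = FALSE silences the fitting log.
fit <- multinom(Burden ~ Res + Insured, data = sim, trace = FALSE)

# One row per combination of residency and insurance status.
grid <- expand.grid(Res = c("Yes", "No"), Insured = c("Yes", "No"))
probs <- predict(fit, newdata = grid, type = "probs")
print(cbind(grid, round(probs, 3)))
```

Each row of `probs` gives the estimated probabilities of the four burden categories for one residency and insurance combination, and each row sums to one.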

Bayesian Analysis
In this section, we use a Bayesian statistics approach to examine the dataset, in the hope that it brings a fresh perspective to the data. A strength of the Bayesian approach is its capacity to visualize the results and the distributions of the coefficients. Moreover, the Bayesian approach allows a robustness check of the model through the analysis of prior sensitivity. If the model is not sensitive to adjustments of the prior, this provides robust evidence for its credibility [11][12][13][14].
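The idea of a prior sensitivity check can be illustrated with a deliberately simple conjugate model rather than the regression model used in this article. The sketch below, under the assumption of a normal outcome with known standard deviation and simulated data, computes the posterior of the mean under three hypothetical priors of different widths and compares the results. The function `posterior` and all prior settings are illustrative choices.

```r
set.seed(42)
sigma <- 1                               # assumed known residual sd
y <- rnorm(200, mean = 3, sd = sigma)    # simulated outcomes, NOT survey data

# Posterior of a normal mean under a normal(m0, s0) prior (conjugate update).
posterior <- function(y, m0, s0, sigma) {
  prec <- 1 / s0^2 + length(y) / sigma^2
  c(mean = (m0 / s0^2 + sum(y) / sigma^2) / prec,
    sd   = sqrt(1 / prec))
}

# Hypothetical priors, each given as c(prior mean, prior sd).
priors <- list(vague = c(0, 10), moderate = c(0, 2), tight = c(0, 0.5))
post <- sapply(priors, function(p) posterior(y, p[1], p[2], sigma))
print(round(post, 3))
```

If the posterior means and standard deviations barely move across the three columns, the conclusions are driven by the data rather than by the prior, which is the sense of robustness described above.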
The R statistical software and the BayesVL package (v0.6) were used to construct a regression model of the financial situation of patients and their families after paying for treatment ("burden") on where the patients reside ("res") and whether they were insured ("insured") [13][14][15][16]. Similar applications of Bayesian statistics can be found in [11,12]. The BayesVL package is available in [17].
The mathematical formulation of the model is as follows:

$$\text{burden} \sim \alpha + \beta_{\text{res}} \times \text{res} + \beta_{\text{insured}} \times \text{insured}$$

The BayesVL package (v0.6) was used to design the model, generate the Stan code for the model, and test it. The first steps of the R code used to construct the model are as follows:

    library(bayesvl)
    # Design the model
    model <- bayesvl()
    model <- bvl_addNode(model, "burden", "norm")

With the estimated coefficients, the model in mathematical form is: burden ~ 4.08 - 1.03 * res - 0.33 * insured.
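To read the estimated equation, it can help to evaluate it for each combination of the two predictors. The worked values below assume, purely for illustration, that "res" and "insured" are each coded 0 or 1; the actual coding should be checked against the dataset.

```latex
\begin{aligned}
\text{res} = 0,\ \text{insured} = 0 &: \quad 4.08 \\
\text{res} = 1,\ \text{insured} = 0 &: \quad 4.08 - 1.03 = 3.05 \\
\text{res} = 0,\ \text{insured} = 1 &: \quad 4.08 - 0.33 = 3.75 \\
\text{res} = 1,\ \text{insured} = 1 &: \quad 4.08 - 1.03 - 0.33 = 2.72
\end{aligned}
```

Under this assumed coding, a one-unit change in "res" shifts the expected burden by about three times as much as a one-unit change in "insured".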
As shown above, both slope coefficients were negative. The larger magnitude of the coefficient for "res" (-1.03) compared with that for "insured" (-0.33) suggested that where patients reside would affect their financial burden after paying for treatment, while having insurance showed less effect on the financial burden. The posterior distribution of all coefficients is presented in Figure 6.
In the model, the correlation coefficients' posterior distributions are shown in Figure 8.
Finally, the simulated parameter pairs of "insured" and "res" are shown in Figure 9.


Conclusions
This data descriptor article presents a comprehensive dataset on the situations and opinions of inpatients regarding the cost of treatment at the hospital, together with the application of both frequentist and Bayesian statistics approaches in data analysis. Smaller subsets extracted from this dataset were the backbone of five different health economic publications, which contributed significantly to the literature on healthcare, health insurance, patients' satisfaction with the hospital, and their financial destitution. The public availability of the full dataset and the introduction of the Bayesian method will enable health economic researchers to explore more issues and infer significant insights. Furthermore, the previous and upcoming findings based on this dataset have supported and will continue to inform the decisions of healthcare policy-makers in making grounded policies that will help inpatients [18].
We acknowledge that the dataset only reflects the situation in the northern region of Vietnam and the mindset of people in this region. In different areas with different economic contexts, specific findings may not hold. However, the value of this dataset lies not only in its records but also in its design logic, its use of coded variables, and its potential for replication and expansion. Therefore, we hope scholars from Vietnam and worldwide will breathe new life into the dataset. We believe researchers from different backgrounds will be able to exploit every aspect of this dataset, for example from comparative perspectives.

Figure 1. The level of "Income," "Spent," and "Dcost" according to the types of "Burden" of the patient.

Figure 3. A histogram for the distribution of patients' age.

Data 2019, 4 ,
x FOR PEER REVIEW 9 of 15

Figure 4. The probabilities of the burden outcomes computed for each condition of residency and insurance. Recreated from the idea in [4]. Note: minimally affected (A), adversely affected (B), destitute (C), adversely destitute (D).


Figure 5. The probabilities of destitution corresponding to both long-time and short-time hospitalization based on the conditions of residency and insurance. Recreated from the idea in [4]. Note: destitution with long-time hospitalization (DestLong) and destitution with short-time hospitalization (DestShort).


Figure 6. The regression model's posterior distribution of all coefficients. Note: HPDI: Highest Posterior Density Interval.
The Hamiltonian Markov chain Monte Carlo (MCMC) technical validations for the model using the Stan code are shown in Figure 7. The MCMC simulation in Stan contained 4 Markov chains with 5000 iterations.


Figure 7. The Hamiltonian Markov chain Monte Carlo (MCMC) technical validations for the simulation model.


Table 2 shows the explanation and simple statistical description for the numerical variables.

Table 3. Research questions and hypotheses.

Table 4. Rechecking the probability of the type of "Burden".