Association of Preterm Birth with Depression and Particulate Matter: Machine Learning Analysis Using National Health Insurance Data

This study uses machine learning and population data to analyze major determinants of preterm birth, including depression and particulate matter. Retrospective cohort data came from Korea National Health Insurance Service claims data for 405,586 women who were aged 25–40 years and gave birth for the first time after a singleton pregnancy during 2015–2017. The dependent variable was preterm birth during 2015–2017, and 90 independent variables were included (demographic/socioeconomic information, particulate matter, disease information, medication history, obstetric information). Random forest variable importance was used to identify major determinants of preterm birth, including depression and particulate matter. Based on random forest variable importance, the top 40 determinants of preterm birth during 2015–2017 included socioeconomic status, age, proton pump inhibitor, benzodiazepine, tricyclic antidepressant, sleeping pills, progesterone, gastroesophageal reflux disease (GERD) for the years 2002–2014, particulate matter for the months January–December 2014, region, myoma uteri, diabetes for the years 2013–2014 and depression for the years 2011–2014. In conclusion, preterm birth has strong associations with depression and particulate matter. Effective prenatal care requires strong intervention for particulate matter together with active counseling and medication for common depressive symptoms, which pregnant women often neglect.


Introduction
Preterm birth accounts for a major part of the disease burden for newborns and children worldwide [1][2][3][4]. Every year 15 million babies are born preterm, and preterm birth is a main contributor to global neonatal and childhood mortality, i.e., 1 million deaths among those aged 0-4 years [1,2]. For example, more than one out of every 10 babies was preterm in the United States during 2003-2012, that is, 5,042,982 (12.2%) of 41,206,315 newborns [3]. Cost-effective interventions are expected to prevent three quarters of mortality from preterm birth [4]. A recent review reports that the following maternal variables are important predictors of preterm birth: demographic/socioeconomic determinants (age, below high school graduation, urban region, insurance, marriage, religion), disease information (delivery/pregestational body mass index, predelivery systolic/diastolic blood pressure, upper gastrointestinal tract symptom, gastroesophageal reflux disease, Helicobacter pylori, gestational diabetes mellitus, systemic lupus erythematosus, increased cerebrospinal fluid and reduced cortical folding due to impaired brain growth), medication history (progesterone, calcium channel blocker, hydroxychloroquine sulfate) and obstetric information (parity, twins, infant sex, prior preterm birth, prior cone biopsy, cervical length, myomas and adenomyosis) [5].

Participants
Retrospective cohort data for this study came from Korea National Health Insurance Service claims data for 405,586 women aged 25-40 years who gave birth for the first time after a singleton pregnancy during 2015-2017. South Korea runs a compulsory, universal health insurance program, and Korea National Health Insurance Service claims data cover most health events of all citizens residing in Korea (for more details, visit https://www.nhis.or.kr/static/html/wbd/g/a/wbdga0401.html, accessed on 15 March 2021). This retrospective study was approved by the Institutional Review Board (IRB) of Korea University Anam Hospital on 5 November 2018 (2018AN0365). Informed consent was waived by the IRB given that the data were deidentified.

Variables
The dependent variable was preterm labor and birth during 2015-2017 (birth between 20 weeks 0 days and 36 weeks 6 days of gestation). Four categories of preterm labor and birth were defined based on ICD-10 codes: (1) PTB 1: preterm birth with premature rupture of membranes (PROM) only; (2) PTB 2: preterm labor and birth without PROM; (3) PTB 3: PTB 1, PTB 2 or both; (4) PTB 4: PTB 3 or other indicated preterm birth (Supplementary Table S1). Each category was coded as "no" vs. "yes". The following 90 independent variables were included: (1) demographic/socioeconomic determinants in 2014 such as age (years), socioeconomic status measured by an insurance fee with the range of 1 (the highest group) to 20 (the lowest group), and region (city) (no vs. yes); (2) particulate matter (PM10) for each of the months January-December 2014; (3) disease information (no vs. yes) for each of the years 2002-2014, i.e., depression, diabetes, gastroesophageal reflux disease (GERD), hypertension and periodontitis; (4) medication history (no vs. yes) in 2014, i.e., benzodiazepine, calcium channel blocker, nitrate, progesterone, proton pump inhibitor, sleeping pills and tricyclic antidepressant; (5) obstetric information (no vs. yes) in 2014 such as in vitro fertilization, myoma uteri and prior cone biopsy. The 65 disease variables were denoted as Depression_2002, ..., Depression_2014, Diabetes_2002, ..., Diabetes_2014, GERD_2002, ..., GERD_2014, Hypertension_2002, ..., Hypertension_2014, and Periodontitis_2002, ..., Periodontitis_2014. The disease information and the medication history were screened from ICD-10 and ATC codes, respectively (Supplementary Tables S1 and S2). Specifically, diabetes was defined as fasting glucose equal to or higher than 126 mg/dL or antidiabetic medication. Likewise, hypertension was defined as systolic/diastolic blood pressure equal to or higher than 140/90 mmHg or antihypertensive medication [16].
Finally, particulate matter was denoted as PM_2014_01 (January 2014), ..., PM_2014_12 (December 2014), and its monthly average at a district level was obtained from [17]. Introducing the disease and particulate matter variables in this way (so-called "distributed lag variables") is an efficient way to analyze the effects of important independent variables in past periods on the dependent variable in the current period.
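The naming scheme for the distributed lag variables can be made concrete with a short sketch. The Python below is only an illustration, not the authors' code: the disease and particulate matter names follow the text, while the remaining column names are hypothetical, and treating the "such as" lists as exactly three demographic and three obstetric variables is an assumption made so that the groups add up to the stated 90 predictors.

```python
# Sketch of the 90-predictor layout described in the Variables section.
# Disease and PM names follow the text; all other names are hypothetical.

DISEASES = ["Depression", "Diabetes", "GERD", "Hypertension", "Periodontitis"]
DISEASE_YEARS = range(2002, 2015)  # 2002..2014 inclusive (13 yearly lags)
PM_MONTHS = range(1, 13)           # January..December 2014 (12 monthly lags)

# Assumption: the "such as" lists in the text are taken as complete.
DEMOGRAPHIC = ["Age", "Socioeconomic_Status", "Region_City"]
MEDICATION = [
    "Benzodiazepine", "Calcium_Channel_Blocker", "Nitrate", "Progesterone",
    "Proton_Pump_Inhibitor", "Sleeping_Pills", "Tricyclic_Antidepressant",
]
OBSTETRIC = ["In_Vitro_Fertilization", "Myoma_Uteri", "Prior_Cone_Biopsy"]


def lagged_disease_names():
    """One binary indicator per disease and year, e.g. Depression_2002."""
    return [f"{disease}_{year}" for disease in DISEASES for year in DISEASE_YEARS]


def lagged_pm_names():
    """One monthly PM10 average per month of 2014, e.g. PM_2014_01."""
    return [f"PM_2014_{month:02d}" for month in PM_MONTHS]


def all_predictor_names():
    """Concatenate the five variable groups in the order given in the text."""
    return (DEMOGRAPHIC + lagged_pm_names() + lagged_disease_names()
            + MEDICATION + OBSTETRIC)


if __name__ == "__main__":
    print(len(lagged_disease_names()))  # 65 disease variables
    print(len(lagged_pm_names()))       # 12 particulate matter variables
    print(len(all_predictor_names()))   # 90 predictors in total
```

Under these assumptions the groups contribute 3 + 12 + 65 + 7 + 3 = 90 columns, which matches the number of independent variables reported above.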

Analysis
Logistic regression, the random forest and the artificial neural network were applied and compared for the prediction of preterm birth [18]. Data on 402,092 observations with full information were divided into training and validation sets with a 70:30 ratio (281,464 vs. 120,628 observations). Accuracy, the proportion of correct predictions among the 120,628 validation observations, was used as the criterion for validating the trained models. Random forest variable importance, which measures the contribution of a variable to the performance of the model, was used for identifying major determinants of preterm birth and testing its associations with depression, particulate matter and other predictors. R-Studio 1.3.959 (R-Studio Inc., Boston, MA, USA) was employed for the analysis during 1 August 2020-31 December 2020.
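The split sizes and the accuracy criterion described above can be reproduced with a few lines of code. This is a minimal sketch in Python rather than the R workflow used in the study; the random seed, the shuffling procedure and the rounding rule (flooring the training size) are assumptions, since the text reports only the 70:30 ratio and the resulting counts.

```python
import random


def train_validation_split(indices, train_percent=70, seed=0):
    """Shuffle the indices and split them train_percent : (100 - train_percent).

    Assumption: the training size is floored; the seed is arbitrary.
    """
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * train_percent // 100
    return shuffled[:n_train], shuffled[n_train:]


def accuracy(y_true, y_pred):
    """Proportion of predictions that equal the observed labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = sum(1 for observed, predicted in zip(y_true, y_pred)
                  if observed == predicted)
    return correct / len(y_true)


if __name__ == "__main__":
    train_idx, valid_idx = train_validation_split(range(402_092))
    print(len(train_idx), len(valid_idx))        # 281464 120628
    print(accuracy([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.75
```

With 402,092 complete observations, flooring 70% gives 281,464 training and 120,628 validation observations, the same counts as reported.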

Results
Descriptive statistics for participants' preterm birth and its determinants are shown in Table 1. Among 405,586 participants, 21,732 (5.40%), 8927 (2.22%), 27,752 (6.90%) and 28,845 (7.17%) belonged to PTB 1, 2, 3 and 4, respectively. The median age and socioeconomic status of the participants were 29 and 12, respectively. Among the participants, 126,008 [...] and 42 (December) in terms of 10⁻⁶ g/m³, respectively. In terms of accuracy, the random forest was similar to logistic regression and the artificial neural network (94.50%, 97.66%, 93.08% and 92.83% for PTB 1, PTB 2, PTB 3 and PTB 4 in Table 2, respectively). The results of undersampling are shown in Table 3. Undersampling is an approach that matches the sizes of two groups (participants with and without preterm birth) so that the training of machine learning can be balanced between the two groups. Undersampling led to a slight improvement in the performance (the area under the receiver-operating-characteristic curve) of the random forest, e.g., from 0.5585 to 0.5803 in the case of PTB 2. [...] Figure S1 for each of PTB 1, PTB 2, PTB 3 and PTB 4. The importance rankings of particulate matter were particularly high for PTB 2: PM_2014_04 (18th). These findings were similar to those of undersampling in Supplementary Figure S2. The results of logistic regression (Tables 4 and 5) provide useful information about the sign and magnitude of the effect of a major determinant on preterm birth. For example, in Table 4 the odds of PTB 4 increase by 12.6% if socioeconomic status decreases by 10 groups, e.g., from 2 to 12 (median). The odds of PTB 4 increase by 24.1% if particulate matter in August 2014 (PM_2014_08) increases by 1 × 10⁻⁶ g/m³. In a similar vein, the odds of PTB 4 are greater by 12.2% for those with depression in 2010 than for those without it.
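The percentage changes in odds quoted above can be turned into odds ratios and, for a chosen baseline, into probabilities. The sketch below illustrates that arithmetic only, under the assumption of a standard logistic model in which an effect over several units is the per-unit odds ratio raised to that power. The baseline probability of 7.17% (the overall PTB 4 share) is a hypothetical reference value for the example, not a figure taken from Tables 4 and 5.

```python
import math


def percent_change_to_odds_ratio(percent_change):
    """A 12.6% increase in odds corresponds to an odds ratio of 1.126."""
    return 1.0 + percent_change / 100.0


def per_unit_odds_ratio(odds_ratio_total, n_units):
    """Odds ratio per one-unit change, assuming a log-linear (logit) effect."""
    return math.exp(math.log(odds_ratio_total) / n_units)


def probability_to_odds(probability):
    return probability / (1.0 - probability)


def odds_to_probability(odds):
    return odds / (1.0 + odds)


def shifted_probability(baseline_probability, odds_ratio):
    """Probability after multiplying the baseline odds by an odds ratio."""
    return odds_to_probability(probability_to_odds(baseline_probability) * odds_ratio)


if __name__ == "__main__":
    # Socioeconomic status: +12.6% odds over a 10-group change.
    or_ses_10 = percent_change_to_odds_ratio(12.6)
    print(round(per_unit_odds_ratio(or_ses_10, 10), 4))  # about 1.0119 per group

    # Depression in 2010: +12.2% odds, applied to a hypothetical 7.17% baseline.
    or_depression = percent_change_to_odds_ratio(12.2)
    print(round(shifted_probability(0.0717, or_depression), 4))  # about 0.0798
```

Read this way, a 12.2% increase in odds moves a 7.17% baseline risk to roughly 8.0%, which shows that a percentage change in odds is not the same as a change in percentage points of risk.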

Findings of This Study
Based on random forest variable importance, the top 40 determinants of preterm birth during 2015-2017 included socioeconomic status, age, proton pump inhibitor, benzodiazepine, tricyclic antidepressant, sleeping pills, progesterone, GERD for the years 2002-2014, particulate matter for the months January-December 2014, region, myoma uteri, diabetes for the years 2013-2014 and depression for the years 2011-2014.

Summary of Existing Literature
A recent systematic review reported a positive association between gestational depression and spontaneous preterm labor and birth [6]. This review selected 39 cohort studies with 134,488 participants in total, published in English during 1980-2003. The majority of these studies came from high-income countries such as the United States (27), Denmark (2), France (2), Sweden (2), Canada (1), Norway (1) and the United Kingdom (1). A subsequent systematic review reported that prenatal depression is an important risk factor for preterm birth [7]. This review selected 64 observational studies published in English between 2007 and 2017. Here, 49 (77%) and 15 (23%) of these studies were done in middle-income and low-income countries, respectively. Likewise, two systematic reviews [8,9] [...] [11], Utah (the United States, 2002-2010) [12], Ontario (Canada, 2005-2012) [13], Wuhan (China, 2011-2013) [14] and Korea (2010-2013) [15]. Their odds-ratio ranges were 1.01-1.57 and 1.04-1.19 regarding PM10 and PM2.5, respectively. However, the number of predictors in the existing literature above has been limited to 14. Moreover, no effort has been made based on machine learning in this line of research.

Contributions of This Study
This study presents the most comprehensive analysis of the determinants of preterm birth, using a population-based cohort of 405,586 participants and the richest collection of 90 predictors, covering demographic/socioeconomic determinants, particulate matter, disease information, medication history and obstetric information. Firstly, this study confirms that depression and particulate matter are major predictors of preterm birth (both were among the top 40 determinants of preterm birth in this study). Several researchers focus on behavioral, infectious, neuroendocrine and neuroinflammatory mechanisms between depression and preterm birth [19]. Other researchers hypothesize that air pollution causes systemic inflammation, which in turn leads to preterm birth [20]. Little research has been undertaken, and more investigation is needed to explore and evaluate the various pathways among depression, particulate matter and preterm birth. The findings of this study suggest that effective prenatal care requires strong intervention for particulate matter together with active counseling and medication for common depressive symptoms, which pregnant women often neglect. Secondly, the results of this study agree with those of a previous study with 731 participants on gastroesophageal reflux disease, medication history and preterm birth [18]. That study highlighted the significance of age, socioeconomic status (below high school graduation), progesterone medication history, gastroesophageal reflux disease, region (city) and gestational diabetes mellitus. Above all, to the best of our knowledge, this study is the first attempt to use machine learning and population data to find the main predictors of preterm birth and evaluate its association with depression and particulate matter. This study will be a good starting point for finding the main predictors of preterm birth and drawing effective implications for its prevention and management.

Limitations of This Study
Firstly, this study did not examine possible mediating effects among variables. Secondly, this study adopted a binary category of preterm birth, no vs. yes (birth between 20 weeks 0 days and 36 weeks 6 days of gestation). However, preterm birth can have multiple categories, and it will be a good topic for future study to compare different predictors for various categories of preterm birth, e.g., extremely preterm (less than 28 (or 24) weeks), very preterm (28-32 (or 24-32) weeks) and moderate to late preterm (32-37 weeks) [2]. Thirdly, the four categories of preterm birth were defined based on ICD-10 codes, and this could be a source of potential bias. Fourthly, it was beyond the scope of this study to explore and evaluate the various pathways among depression, particulate matter and preterm birth. Little research has been undertaken, and more investigations are needed on this topic. Fifthly, combining various deep learning approaches for various kinds of preterm birth data would bring new innovations and deeper insights in this line of research. Finally, further investigations of single vs. multiple gestation would deliver more insights and more detailed clinical implications.
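The multi-category outcome suggested for future work could be coded along the following lines. This Python sketch is hypothetical: it uses the 28-week cutoff (the text also mentions 24 weeks as an alternative), treats each interval as half-open so that, for example, exactly 32 weeks 0 days counts as moderate to late preterm, and reuses the 20-week lower bound from this study's definition. None of these boundary conventions is specified in the text.

```python
def classify_gestational_age(weeks, days=0):
    """Map gestational age at birth to a preterm category.

    Assumptions: half-open intervals, a 28-week cutoff for extremely
    preterm, and the study's lower bound of 20 weeks 0 days.
    """
    if weeks < 0 or not 0 <= days <= 6:
        raise ValueError("weeks must be >= 0 and days must be in 0..6")
    total_days = weeks * 7 + days
    if total_days < 20 * 7:
        return "below study window"
    if total_days < 28 * 7:
        return "extremely preterm"
    if total_days < 32 * 7:
        return "very preterm"
    if total_days < 37 * 7:
        return "moderate to late preterm"
    return "term or later"


if __name__ == "__main__":
    for weeks, days in [(19, 6), (27, 6), (28, 0), (36, 6), (37, 0)]:
        print(weeks, days, classify_gestational_age(weeks, days))
```

Working in days avoids ambiguity at the boundaries, since 36 weeks 6 days is the last preterm day under the study's definition and 37 weeks 0 days is the first term day.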

Conclusions
Preterm birth has strong associations with depression and particulate matter. Effective prenatal care requires strong intervention for particulate matter together with active counseling and medication for common depressive symptoms, which pregnant women often neglect.

Data Availability Statement:
The data presented in this study are not publicly available. However, the data are available from the corresponding author upon reasonable request and with the permission of Korea National Health Insurance Service.

Conflicts of Interest:
The authors declare no conflict of interest.