Infertility is a global health issue that afflicts around 15% of couples worldwide [1
]. However, less than 55% of these affected couples ask for medical assistance [2
]. Overall, 4% to 17% undergo In Vitro Fertilization (IVF)/Intracytoplasmic Sperm Injection (ICSI) treatment in developed countries, while this number is even lower in the least developed ones [3
]. The causes of infertility are multiple and both men and women may be affected. A male factor can be identified in 30% to 40% of cases [4
] and a similar proportion can be found in women. In about 30% of cases, both contribute to the problem while in around 10% it is not possible to establish a definitive cause [5
]. IVF/ICSI treatments are the last resort for many couples, but they come with several drawbacks: they are expensive and emotionally burdensome, have side effects and offer a variable likelihood of success. Predicting that likelihood is important not only to manage the couples’ expectations, but also to help institutions plan, particularly when public funding is applied. Costs and effectiveness continue to stimulate discussion in scientific and public forums [6
In recent decades, we have witnessed unique revolutions in access to information, both in pharmacological and clinical research, and in the means of diagnosis and therapy. All these advances resulted in the entry of technology, engineering, rational sciences and mathematics in the world of medicine [7
], which has proved decisive in medical areas including the prediction of the survival of patients with breast cancer [8
], liver cancer [9
] or pancreatic cancer [10
]. Since then, Artificial Intelligence (AI) algorithms have been shown to be capable of handling large amounts of data and using the extracted information to select possible diagnoses of different health problems [11
]. Therefore, clinical decision support systems can assist both professionals and patients in this decision-making process [15
The growing amount of data available in reproductive medicine suggests that basing decision-making for patients with infertility on the analysis of their characteristics, combining Assisted Reproductive Technology (ART) techniques with AI methods, is an ideal clinical approach [17
]. Several predictive models have been proposed, with the main aim of identifying clinical and demographic variables that correlate with the success of treatment. Success has been defined in several different ways: pregnancy, ongoing pregnancy or delivery of a healthy infant [6
]. As far as we know, Templeton et al. [6
] contributed the earliest scientific study to identify the factors that affect the outcome of an IVF treatment and to measure the magnitude of their effect. Using logistic regression, they found that the highest success rates for live birth occurred in the group of women aged between 25 and 30 years, followed by a considerable decline in older women. Additional factors were also taken into account, such as the number of successful previous pregnancies, the number of previous unsuccessful IVF treatments, female causes of infertility and duration of infertility. Nelson and Lawlor [20
] corroborated Templeton’s study [6
] with a new model using univariate and multivariate logistic regression. This approach significantly improved the overall prediction of live birth and was able to allocate couples to different prognostic classes. The age-related decline in female fertility has also been studied by other authors, who showed that the expression of oocyte genes is influenced by the woman’s age. Therefore, dysfunctions in this area are believed to be one of the main causes of this phenomenon [21
]. La Marca et al. [22
] incorporated the Anti-Müllerian Hormone (AMH) concentration into the model and demonstrated that it is strongly associated with success in obtaining a live birth, independently of age. Other authors highlighted the importance of other variables such as semen-analysis parameters, Body Mass Index (BMI) [23
], previous abortion or miscarriage [24
], ethnicity [25
] or Antral Follicle Count (AFC) [26
The aim of the present study is to construct an Artificial Neural Network (ANN) complemented by a decision tree to predict the chance of live birth during an infertility treatment, using the couple’s demographic and clinical data. This will help the physician adjust the couple’s expectations, which might have changed after the treatment started.
2. Materials and Methods
This is a retrospective study of 1193 IVF/ICSI treatment cycles performed at the Centro de Infertilidade e Reprodução Medicamente Assistida (CIRMA) at Hospital Garcia de Orta (HGO), E.P.E., in Almada, Portugal. All considered cycles were performed between 2012 and 2019. Only cycles with a live birth delivery after 24 weeks, or cycles with no surplus embryos left, were taken into consideration. Cycles of women without clinical information, including AMH or AFC, were excluded. The study was approved on 23 January 2020 by the Hospital’s Ethics Committee for Health. The selected output variable was dichotomous (i.e., 1: with live birth, 0: without live birth) and the 26 input variables were duration of infertility (months), woman’s and man’s age (years), woman’s and man’s weight (kg), woman’s and man’s height (cm), woman’s and man’s BMI (kg/m²), woman’s and man’s smoking status (never, previous and present), woman’s and man’s ethnicity (African, Asian, Caucasian, Gipsy, Indian, and Mixture), woman’s and man’s previous children (yes or no), cause of infertility (endometriosis, male factor, both male and female factor, unexplained infertility, multiple female factors, other, ovulatory factor, tubal factor and uterine factor), total dose of gonadotropin (IU), days of stimulation, average daily dose of gonadotropin (IU/day), nature of treatment (IVF, ICSI and mixed IVF/ICSI), number of attempts in the center, number of eggs, number of mature eggs, number of embryos, AMH level (ng/mL) and AFC.
2.2. IVF/ICSI Procedures
According to the local protocol, ovarian stimulation was performed with 100 to 450 IU of recombinant Follicle Stimulating Hormone (r-FSH) (Gonal-f®, Merck Serono (Darmstadt, Germany); Puregon®, MSD (Kenilworth, NJ, USA); Bemfola®, Gedeon Richter (Budapest, Hungary)) or human Menopausal Gonadotropin (hMG) (Menopur®, Ferring (Saint-Prex, Switzerland)), based on ovarian reserve assessment, starting on cycle day 2 or 3, mostly within a flexible Gonadotropin-Releasing Hormone (GnRH) antagonist protocol (Cetrotide®, Merck Serono; or Orgalutran®, MSD) started on stimulation day 6. Final oocyte maturation was induced with hCG (mostly 6500 IU Ovitrelle®, Merck Serono) or GnRH agonist (0.2 mg Decapeptyl®, Ferring) when at least two follicles of 17 mm in diameter were visualized by ultrasound. Oocyte retrieval was performed 35–37 h after final maturation. ICSI was performed in cases of altered semen parameters, according to the World Health Organization (WHO) criteria, or in cases of previous conventional IVF fertilization failure or low fertilization rate. One or two embryos were transferred 2, 3 or 5 days after oocyte retrieval. A fresh transfer was canceled whenever the progesterone level was over 1.5 ng/mL or a risk of Ovarian Hyperstimulation Syndrome (OHSS) or intracavitary uterine pathology was identified during stimulation. The luteal phase was supplemented with vaginal micronized natural progesterone (200 mg Progeffik®, Effik International (Brussels, Belgium), three times a day). Supernumerary embryos of sufficient quality were cryopreserved on days 2, 3 or at the blastocyst stage. Patients who did not become pregnant after fresh transfer could undergo frozen-thawed cycles under artificial endometrial preparation. Live birth was defined as at least one infant born alive after 24 weeks of gestation, consistent with previous prediction models and publications.
Anti-Müllerian hormone was measured in blood serum samples using the Electrochemiluminescence (ECLIA) methodology, with the Modular EVO (E170) Roche Diagnostics® equipment (Basel, Switzerland).
2.3. Statistical Analysis
A univariate analysis was first performed. Categorical data were analyzed with the Chi-Square test, while continuous variables were compared using a standard t-test. Relationships between continuous input variables and the output variable were also assessed by calculating the Pearson correlation coefficient. In all statistical tests, a p-value below 0.05 was considered statistically significant.
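As an illustration of two of the tests named above, the following minimal Python sketch computes a Pearson correlation coefficient between a continuous input and the dichotomous outcome, and a Chi-Square statistic for a 2 × 2 contingency table. It is not the code used in the study: the function names and all numbers are hypothetical, and the p-value formula is valid only for one degree of freedom (a 2 × 2 table without continuity correction).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def chi_square_2x2(table):
    """Chi-Square statistic and p-value for a 2 x 2 table (1 degree of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical toy data: woman's age (years) and outcome (1 = live birth).
ages = [28, 31, 34, 36, 39, 41]
live_birth = [1, 1, 0, 1, 0, 0]
r = pearson_r(ages, live_birth)

# Hypothetical counts: rows = category of a binary input, columns = outcome.
table = [[30, 70], [50, 50]]
stat, p_value = chi_square_2x2(table)
significant = p_value < 0.05
```

For categorical variables with more than two levels, the statistic is built in the same way but the p-value requires the Chi-Square distribution with the appropriate degrees of freedom.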
2.4. Predictive Model Analysis
This study can be divided into three phases: pre-processing, classification and comparison, as outlined in Figure 1
. The pre-processing phase includes data reduction and resampling in order to balance the dataset. Focusing on the cycles performed between 2012 and 2018, there were 375 couples with a live birth and 746 couples who failed to conceive, resulting in a ratio of approximately 1:2. For that reason, and to avoid overfitting the model to the data (losing the ability to generalize knowledge due to excessive adjustment to the input data [27
]), SMOTE & Tomek links, a hybrid sampling method, was applied [28
]. As a result of this pre-processing, 1374 cases were used, with a 1:1 ratio (687 cases for each class). Subsequently, 72 cycles recorded in 2019 were added, of which 60 represent failure and 12 represent success, making a total of 1446 cases. The most recent data were not included earlier because they only became available at a late stage of the process; they were added to provide more material for the validation of the model.
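The two ingredients of this hybrid resampling can be sketched in a few lines of Python. The sketch below is a simplified illustration under stated assumptions, not the implementation used in the study: `smote_point` interpolates between a minority sample and one given minority neighbour (full SMOTE picks the neighbour among the k nearest), and `tomek_links` flags pairs of mutual nearest neighbours from opposite classes, which are candidates for removal. All function names and toy points are hypothetical; only the case counts come from the text.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def nearest_index(points, i):
    """Index of the nearest neighbour of points[i], excluding itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: distance(points[i], points[j]))


def smote_point(sample, neighbour, rng):
    """Synthetic minority sample on the segment between two minority samples."""
    gap = rng.random()
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


def tomek_links(points, labels):
    """Pairs of mutual nearest neighbours that belong to different classes."""
    links = []
    for i in range(len(points)):
        j = nearest_index(points, i)
        if i < j and labels[i] != labels[j] and nearest_index(points, j) == i:
            links.append((i, j))
    return links


rng = random.Random(0)
synthetic = smote_point([1.0, 1.0], [2.0, 3.0], rng)

# Hypothetical toy points: the first two are close but from different classes.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0]]
labels = [1, 0, 0, 0]
links = tomek_links(points, labels)

# Case counts reported in the text.
before_resampling = 375 + 746      # cycles from 2012 to 2018
after_resampling = 687 * 2         # balanced classes
final_total = after_resampling + 72  # plus the 2019 cycles
```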
2.4.1. Artificial Neural Network.
A Multilayer Perceptron (MLP) was developed, composed of an input layer, one hidden layer and an output layer. An MLP is used for classification and prediction and can achieve a high predictive accuracy [29
]. All the input variables were previously normalized to the range [−1, 1] according to the min-max criterion. The Bayesian regularization backpropagation algorithm was implemented, with the hyperbolic tangent as the activation function of the input and hidden layers and the logistic function as the activation function of the output layer. During the training of a neural network, the main goal is to generate a network that produces a low learning error and, above all, that is capable of responding appropriately to new data, achieving good generalization. Thus, the learning process was stopped by validation stopping, which means that the neural network stops training as soon as the error on the validation set is higher than it was the last time it was checked, to avoid overfitting. The holdout validation method was also applied: 70% of the cases of the entire dataset (n = 1446) were randomly assigned to the training set and used during the learning process, 15% were randomly chosen for validation and the remaining 15% were used to test the model. The mean squared error was used as the performance function, and all the outcome metrics of this learning process are the arithmetic mean of 100 iterations, owing to the randomness involved.
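The pre-processing and network structure described above can be illustrated with a minimal Python sketch. It is only a sketch under assumptions: the study used MATLAB, the function names are hypothetical, the tanh activation is applied to the hidden layer only, rounding is one possible way to turn the 70/15/15 proportions into set sizes, and training (Bayesian regularization, validation stopping) is not reproduced.

```python
import math
import random


def minmax_scale(values, low=-1.0, high=1.0):
    """Rescale values linearly to [low, high]; assumes they are not all equal."""
    v_min, v_max = min(values), max(values)
    return [low + (high - low) * (v - v_min) / (v_max - v_min) for v in values]


def holdout_split(n_cases, rng, train=0.70, validation=0.15):
    """Randomly split case indices into training, validation and test sets."""
    indices = list(range(n_cases))
    rng.shuffle(indices)
    n_train = round(train * n_cases)
    n_val = round(validation * n_cases)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


def mlp_forward(x, hidden_weights, hidden_bias, out_weights, out_bias):
    """One forward pass: tanh hidden layer followed by a logistic output unit."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(hidden_weights, hidden_bias)]
    z = sum(w * h for w, h in zip(out_weights, hidden)) + out_bias
    return 1.0 / (1.0 + math.exp(-z))


scaled = minmax_scale([20.0, 30.0, 40.0])

rng = random.Random(42)
train_idx, val_idx, test_idx = holdout_split(1446, rng)

# With all weights at zero the logistic output is exactly 0.5.
probability = mlp_forward([0.5, -0.5],
                          hidden_weights=[[0.0, 0.0], [0.0, 0.0]],
                          hidden_bias=[0.0, 0.0],
                          out_weights=[0.0, 0.0],
                          out_bias=0.0)
```

Repeating the split and the random weight initialization with different seeds, and averaging the resulting metrics, corresponds to the 100 iterations mentioned in the text.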
The study of this model was divided into three steps. First, 20 neurons were considered in the input and hidden layers, and the variables that most correlated with the target variable (p-value < 0.05), without a Pearson’s correlation coefficient above 0.9 between themselves [30], were used as inputs, to avoid increasing the complexity of the system and the noise in the classification. Afterwards, the next variable that most correlated with the dependent variable was included while the performance of the network was assessed, applying a technique similar to the wrapper method of forward variable selection, but starting with the two most correlated variables [31
]. This process was repeated until there were no more variables correlated with the target. The final input set was obtained when the performance stopped increasing. After that, the number of input and hidden units was analyzed. At first, the same number of neurons was tested for both layers. Then, a different number of units was also tested for each layer, fixing the value for one layer and varying the number of neurons in the other. As the last step, the number of hidden layers was examined; once again, the final network architecture was fixed when the performance stopped increasing. In short, the combination of parameter values that yielded the best solution was found by adjusting each parameter in turn, re-training the network and comparing results. The neural network was constructed using the Neural Pattern Recognition app, running on MATLAB® R2019b (MathWorks, Natick, MA, USA).
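The variable-selection step can be sketched as a greedy forward procedure. The Python sketch below is one possible reading of the description, not the authors' procedure: it assumes that candidates are already ranked by their correlation with the target and that a candidate is kept only if it improves the score. The `evaluate` callback stands in for training and scoring the network; `toy_evaluate`, the variable names and the gains are entirely hypothetical.

```python
def forward_selection(ranked_variables, evaluate, n_start=2):
    """Greedy forward selection over variables ranked by correlation.

    Starts with the n_start most correlated variables and then tries each
    remaining variable in order, keeping it only if the score improves.
    """
    selected = list(ranked_variables[:n_start])
    best_score = evaluate(selected)
    for variable in ranked_variables[n_start:]:
        score = evaluate(selected + [variable])
        if score > best_score:
            selected.append(variable)
            best_score = score
    return selected, best_score


# Hypothetical contribution of each variable to a toy performance score.
toy_gain = {"var_a": 0.10, "var_b": 0.08, "var_c": -0.01, "var_d": 0.03}


def toy_evaluate(variables):
    """Stand-in for re-training the network and measuring its performance."""
    return 0.5 + sum(toy_gain[v] for v in variables)


selected, best_score = forward_selection(
    ["var_a", "var_b", "var_c", "var_d"], toy_evaluate)
```

In practice `evaluate` would average the performance over repeated training runs, since a single run is affected by the random initialization and data split.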
2.4.2. Decision Tree
The logical rules followed by a decision tree are much easier to interpret than the numeric weights of the connections between nodes in a neural network [32
]. For that reason, a decision tree was developed to complement the neural network model. The Classification and Regression Trees (CART) algorithm was used for the predictive modelling [33
]. CART is a binary tree [34
] (i.e., each node has only two branches) with yes/no questions [35
]. The same input variables used in the final artificial neural network model were assessed, as well as the most common metrics to evaluate the degree of inhomogeneity or impurity (i.e., Gini’s index, entropy [36
] and twoing criterion). Regarding the partition of each node, an exhaustive search was applied (i.e., the deepest tree was built). The evaluation criterion adopted to select the best attribute for each node was the twoing criterion [37
]. To mitigate overfitting [38
], the size of the tree was controlled by limiting the number of nodes to between 10 and 25 and by applying cost-complexity pruning [33
], either individually or in combination. In addition, 10-fold cross-validation was performed. The decision tree was constructed using the Classification Learner app, running on MATLAB®.
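For reference, the three split metrics mentioned in this section can be written compactly. The following Python sketch is illustrative only and uses hypothetical function names; the twoing value follows the common CART form (P_L · P_R / 4) · (Σ |p(k|L) − p(k|R)|)², which is an assumption about the exact scaling, since some implementations omit the factor 1/4. Gini's index and entropy measure the impurity of a single node (lower is purer), whereas twoing scores a candidate split (higher is better).

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion of each class among the labels of a node."""
    counts = Counter(labels)
    return {c: k / len(labels) for c, k in counts.items()}


def gini(labels):
    """Gini's index of a node: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution of a node."""
    return -sum(p * math.log2(p) for p in class_proportions(labels).values())


def twoing(left, right):
    """Twoing value of a binary split into left and right child nodes."""
    n = len(left) + len(right)
    p_left, p_right = len(left) / n, len(right) / n
    prop_l, prop_r = class_proportions(left), class_proportions(right)
    classes = set(left) | set(right)
    diff = sum(abs(prop_l.get(c, 0.0) - prop_r.get(c, 0.0)) for c in classes)
    return p_left * p_right / 4.0 * diff ** 2


# Toy node with two live births (1) and two failures (0).
node = [0, 0, 1, 1]
node_gini = gini(node)
node_entropy = entropy(node)
perfect_split = twoing([0, 0], [1, 1])
useless_split = twoing([0, 1], [0, 1])
```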
2.4.3. Model Evaluation
The predictive performance of the models was calculated on the basis of the results of the classification process [34
]: (i) Area Under the Receiver Operating Characteristic curve (AUROC) and (ii) accuracy.
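Both metrics can be computed directly from the true outcomes and the model outputs. The Python sketch below is a minimal illustration with hypothetical data and function names: accuracy is the fraction of correctly classified cases, and the AUROC is computed through its rank interpretation, that is, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one (ties count as one half). The 0.5 classification threshold is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = live birth) and predicted probabilities.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

model_accuracy = accuracy(y_true, y_pred)
model_auroc = auroc(y_true, scores)
```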
A total of 1193 cycles were evaluated. Overall, 387 cycles ended with at least one live birth (a success rate of 32.4%), while 7.8% were cancelled before egg pick-up (n = 93), 0.92% did not yield oocytes (n = 11), 1.7% did not have mature eggs (n = 20) and 4.2% had no embryos (n = 50). Baseline characteristics of couples are presented in Table 1
and Table 2
. In a univariate comparison of the continuous variables, there were statistically significant differences in the mean of woman’s and man’s age, woman’s height, total dose of gonadotropin, average daily dose of gonadotropin, number of eggs, number of mature eggs, number of embryos, AMH level and AFC. Table 1
shows that younger women and men were more likely to achieve a live birth, as were women with a higher number of eggs, mature eggs and embryos, and a higher AMH level and AFC. Moreover, the total dose of gonadotropins and the average daily dose were inversely related to success. Regarding demographic factors in Table 2
, the majority of women and men never smoked, were Caucasian and had no previous children. The male factor was the main cause of infertility, while IVF was the most commonly adopted infertility treatment. Only woman’s previous children and nature of treatment differed significantly between the groups. Table 3
shows the Pearson correlation between continuous attributes and the output variable.
3.1. Artificial Neural Network
The implemented classification model was created based on the trial and error method (Figure 2
). The following variables were used: woman’s age (years), total dose of gonadotropin (IU), number of eggs, number of embryos and AFC, all of which were statistically significant (p-value < 0.05). According to the predictions of the ANN model, the success of IVF/ICSI drops as woman’s age and total dose of gonadotropin increase and as the number of eggs, number of embryos and AFC decrease. This model had an accuracy of 75.0%, while the ROC curve test for the discriminatory ability of the final prediction model gave an AUROC of 75.2% (95% CI 72.5–77.5%) (Figure 3).
3.2. Decision Tree
Following the first approach to estimate the success rate to have a live birth, and since there are no easy explanations for the links between the weights and the input and output variables, a decision tree was built (Figure 4
), which allows a simpler link between the input and output variables. Although the same five input variables of the ANN model were used, only three (i.e., woman’s age (years), total dose of gonadotropin (IU) and number of embryos) proved to be important for the determination of the output variable. Table 4
lists the predictor importance of each variable. Overall, the accuracy was 75.0% while the AUROC was 74.9% (95% CI 72.3–77.5%) (Figure 5).
Neural networks offer a different approach to pattern recognition and have been used in a wide range of fields [18
], proving to be an effective diagnostic tool for many diseases or as an adjuvant to predict treatment outcomes [29
]. The usefulness of artificial intelligence methods in analyzing reproductive medicine data has also been advocated, giving greater visibility to ANNs, which have produced excellent results in predicting negative outcomes of infertility treatment, accurately confirming the lack of pregnancy in 86.5% of cases [39
In 1997, Kaufmann et al. [18
] constructed a neural network, achieving an accuracy of 59.0% while using only four inputs (three of them shared by this study, i.e., woman’s age, number of eggs and number of embryos; the fourth variable accounts for the nature of the embryos, that is, whether they had been previously frozen). Sixteen years later, Durairaj and Thamilselvan [40
] reported an accuracy of 73% using eight input variables including sperm concentration, woman’s BMI, infertility factor due to endometriosis and tubal factor, number of eggs, number of embryos and the nature of treatment (in particular, IVF treatment). In both cases, the variables number of eggs and number of embryos proved to be important in predicting a positive outcome. Hafiz et al. [41
] and Leijdekkers et al. [42
] also agreed on the importance of these two clinical variables.
Decision trees are also one of the most effective methods for data mining and they have been used in several disciplines, such as medical research. They can support diagnostic processes in cardiology [43
], gastroenterology [44
], general medicine [45
], gynecology [46
], neurology [47
] or psychiatry [48
In 2016, three decision trees were applied to select the patients who were most likely to get pregnant on the basis of their clinical information. The first one included information about woman’s age, number of eggs retrieved, number of embryos and their transfer, while the second included woman’s age and number of eggs retrieved and the third included woman’s age and number of embryos transferred [49
]. According to Ghaeini et al. [50
], a decision tree model composed of seven input variables (including the most common predictor variable, i.e., woman’s age) was implemented, reaching an accuracy of 70.3%. The significance of the variable AFC in the ANN and the decision tree models is in agreement with the publication by Dillon et al. [26
]. In contrast, increasing woman’s age was associated with a decreased chance of at least one live birth. This is in agreement with other authors [6
]. Before the present study, a decision tree had already been constructed using only pre-treatment variables, namely AFC, woman’s age, dysovulation as a female cause of infertility, AMH level and woman’s BMI. An accuracy of 59.3% was reached, as well as a discriminatory ability of the final prediction model of 59.6% [52
To the best of our knowledge, the ANN and the decision tree built here are the first models to include the total dose of gonadotropin.
The IVF/ICSI treatment predictors included here can be grouped into pre- and post-treatment variables. In the ANN model, of the five input variables considered, two were pre-treatment variables (i.e., woman’s age and AFC) and three were post-treatment specific attributes (i.e., total dose of gonadotropin, number of eggs and number of embryos). Compared to other studies (Table 5
), the predictive ability of the two developed models seems to be better than what has previously been published. One of the main reasons is probably the larger dataset: 1193 cycles in the present study, compared with the data of 455, 250, 251 and 737 patients analyzed in the studies of Kaufmann et al. [18
], Durairaj and Thamilselvan [40
], Ghaeini et al. [50
] and Veiga et al. [52
It was also shown that increasing the number of statistically significant pre- and post-treatment variables does not improve the success prediction. In fact, the present study achieved better results than previous ones even though it used fewer input variables. For the 26 variables in the dataset, the Pearson correlation coefficient was calculated between each input variable and the output value (Table 3
) and 10 statistically significant variables were found (p
-value < 0.05). Yet, variables with coefficient r close to 0 were discarded because they correspond to an almost nonexistent correlation (i.e., woman’s height and man’s age) [53
The Pearson correlation coefficient was also calculated for each pair of input variables, removing the variable number of mature eggs for having a strong correlation with the variables number of eggs and number of embryos (i.e., r = 0.945 and r = 0.917, respectively).
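The removal of strongly inter-correlated inputs can be sketched as a simple filter. The Python sketch below is illustrative, with hypothetical function names: it repeatedly drops the variable involved in the largest number of pairs whose absolute correlation exceeds 0.9. Only the two coefficients reported above are used; any pair not listed is assumed to fall below the threshold.

```python
def drop_collinear(variables, pair_correlations, threshold=0.9):
    """Drop variables until no remaining pair exceeds the threshold.

    At each step the variable that takes part in the most over-threshold
    pairs is removed (ties are broken by the order in `variables`).
    """
    remaining = list(variables)
    while remaining:
        counts = {v: 0 for v in remaining}
        for (a, b), r in pair_correlations.items():
            if a in counts and b in counts and abs(r) > threshold:
                counts[a] += 1
                counts[b] += 1
        worst = max(remaining, key=lambda v: counts[v])
        if counts[worst] == 0:
            break
        remaining.remove(worst)
    return remaining


variables = ["number of eggs", "number of mature eggs", "number of embryos"]

# Coefficients reported in the text; unlisted pairs are assumed below 0.9.
pair_correlations = {
    ("number of eggs", "number of mature eggs"): 0.945,
    ("number of mature eggs", "number of embryos"): 0.917,
}

kept = drop_collinear(variables, pair_correlations)
```

With these inputs the number of mature eggs is the only variable removed, which matches the choice described above.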
Therefore, seven variables (i.e., woman’s age, total dose of gonadotropin, average daily dose of gonadotropin, number of eggs, number of embryos, AMH level and AFC) were gathered as possible input variables for the ANN model. However, the methodology applied during the study of the first model showed that not all of these seven variables were needed in the input set, since adding them did not increase the model’s performance. Thus, the ANN model achieved the desired results using the five variables already mentioned above.
Beyond these studies [18
], McLernon [51
] developed two models: one before treatment and one after the first embryo transfer attempt. While our study, with an AUROC of around 75.0% in both prediction models, helps clinicians shape couples’ expectations during a single IVF/ICSI treatment and before the first embryo transfer, the cumulative aspect of McLernon’s models, with an AUROC of 72.0%, will help physicians communicate to couples their personalized chances of having a live birth over an entire package of IVF/ICSI treatment.
Since this study was performed with data from only one medical center, one limitation of these models is that only an internal validation was made. In the future, we intend to perform an external validation, collecting new data from other fertility centers, to check whether there is a geographical influence on the chances of live birth for couples who need medical assistance. Nonetheless, there were no limitations on socioeconomic status, since the data were provided by a public hospital with universal access, and so we believe that these results can be generalized to any socioeconomic background. Moreover, the large temporal spectrum of the data (2012 to 2019) could mean that treatments performed with older technology may have different results than those performed with more recent technology. We believe that, although our models achieved an accuracy of 75.0%, they should not be used to accept or refuse couples for treatment, but rather to support couples’ counseling and to help allocate the resources of a center. In addition, this study can also be useful to compare results between infertility centers. That is, patients with similar characteristics should register similar outcomes. Otherwise, the models developed would serve to assess the quality of the laboratory in each center.
In conclusion, the aim of the current work was to model the success rate of IVF/ICSI treatment, supporting physicians in patient counselling on a daily basis and helping couples to understand their chances of having a live birth.
This paper showed that an artificial neural network can be an effective tool for estimating the probability of success, while decision trees can clarify the previous model because they are easy to understand. As previously mentioned, the results obtained from the ANN learning process are the arithmetic mean of 100 iterations, for two reasons. First, the initial values of the synaptic weights and continuous components differ between iterations. Second, the data are divided into three different subsets (i.e., each time the algorithm is trained, the examples chosen for training, validation and testing are different).
The ANN model performed similarly to the decision tree, and both models are characterized by high performance values (75.0%), suggesting that the ANN and decision tree methods are capable of developing predictive queries in this knowledge area. Moreover, the variables that had an impact in both models were woman’s age, total dose of gonadotropin and number of embryos, though the ANN model considered two other variables (i.e., number of eggs and AFC), most likely because it is a more complex method.
The current investigation can be considered a valid instrument in medical consultation; nevertheless, it is emphasized that neither the ANN nor the decision tree model should ever replace a standardized diagnostic examination. In addition, an external validation with data from other infertility centers will be necessary, since this study was performed with data from a single source, resulting in constraints such as the date of sample collection and the technology applied in the IVF/ICSI treatments.