Generalized Linear Models to Forecast Malaria Incidence in Three Endemic Regions of Senegal

Affecting millions of individuals yearly, malaria is one of the most dangerous and deadly tropical diseases. It is a major global public health problem, with an alarming spread of parasite transmitted by mosquito (Anophele). Various studies have emerged that construct a mathematical and statistical model for malaria incidence forecasting. In this study, we formulate a generalized linear model based on Poisson and negative binomial regression models for forecasting malaria incidence, taking into account climatic variables (such as the monthly rainfall, average temperature, relative humidity), other predictor variables (the insecticide-treated bed-nets (ITNs) distribution and Artemisinin-based combination therapy (ACT)) and the history of malaria incidence in Dakar, Fatick and Kedougou, three different endemic regions of Senegal. A forecasting algorithm is developed by taking the meteorological explanatory variable Xj at time t−𝓁j, where t is the observation time and 𝓁j is the lag in Xj that maximizes its correlation with the malaria incidence. We saturated the rainfall in order to reduce over-forecasting. The results of this study show that the Poisson regression model is more adequate than the negative binomial regression model to forecast accurately the malaria incidence taking into account some explanatory variables. The application of the saturation where the over-forecasting was observed noticeably increases the quality of the forecasts.


Introduction
Malaria is a disease caused by a parasitic infection transmitted by a mosquito (female Anophele). It can also be passed to human by blood transfusion, sharing needles or congenitally [1]. According to the 2020 World Health Organization (WHO) report, malaria caused 627,000 deaths, 95% of which were registered in African Region. The number of malaria cases decreased from 2000 (238 million reported cases) to 2019 (229 million reported cases). In spite of this diminution, malaria is still endemic in many countries in the world, particularly in Africa. In Senegal, it constitutes a major public health problem, according to the "Programme national de lutte contre le paludisme (PNLP)".
Several mathematical or statistical models were developed for predicting malaria case incidence. Generalized Linear Models (GLM) [2][3][4] were used in the literature. Examples of GLM include the Poisson regression developed first by Nelder and Wedderburn [5], the negative binomial (NB) regression [3], the quasi-Poisson regression [5] and the zero-inflated regression [6]. In general, the Poisson regression is very popular for data fitting but its mean-equal-variance property can limit its application on over-dispersed data [5][6][7]. Also in [8], a multivariate generalized Poisson regression model was defined and studied.
In [6], a model that adapts to malaria incidence using the zero-negative binomial was developed based on climate variables and mosquito density in Limpopo province, South Africa. The results in [6] show how rain and average temperature affect the incidence of malaria. In [9], authors introduced a model that takes into account the incidence of malaria morbidity and mortality in Akure, Nigeria. In that work, the negative binomial regression model, with log as link function, was used to express the malaria morbidity and mortality incidence as functions of climatic variables. Then, the autoregressive integrated moving average (ARIMA (p, d, q)) model was used to fit the residuals. The findings in [9] revealed that an increase in minimum temperature and relative humidity at a 1-month lag significantly increases the chance of malaria transmission and thereby leads to an increase in the number of inpatient and outpatient individuals, as well as the total number of malaria cases. In another study [10], a Bayesian spatiotemporal analysis has been made to describe year-to-year variation of malaria incidence data from Zimbabwe, and in relation to variation in climate risk factors to enhance our ability of developing an operational malaria early warning system (MEWS) and determine areas prone to climate-driven epidemics. As methods in [10], the authors used the annual proportion of monthly malaria cases and Markham's seasonality index to display between-year variation in the data. Then, the data were fitted with the Bayesian negative binomial models such as the non-spatial model, the spatial model and the spatiotemporal model. In addition, a Markov Chain Monte Carlo (MCMC) simulation was applied to estimate the model parameters. As a result in [10], it was found that a high annual malaria incidence coincides with high rainfall and relatively warm conditions while low incidence years coincide only with low rainfall. In conclusion, all models indicated that the mean annual temperature, rainfall, vapour pressure and normalized difference vegetation index (NDVI) were strong positive predictors of increased annual incidence rate. In [11], the authors applied and compared a Bayesian and classical methods of parameter estimation on the effect of climatic factors in the context of modelling malaria incidence in Limpopo Province, South Africa. In that work, the authors estimated the parameters from a negative binomial model by a Bayesian estimation and maximum likelihood estimation. As result, in [11] the Bayesian method appears more robust than the classical method in analyzing malaria incidence. In [12], the authors include the link between CD4 cell count and influencing covariates of biometric and demographic factors from negative binomial mixed models. A GLM is applied in [13] where authors provide spatially explicit burden estimates of malaria in Senegal using the Senegal Malaria Indicator Survey (SMIS) data and Bayesian geostatistical Zero-Inflated Binomial models based on variable selection methods for spatial data.
A comparative study of existing models was carried out across six countries of Sub-Saharan Africa-Burkina Faso, Nigeria, DRC, Mali, Cameroon, and Niger-over a period of 28 years on malaria incidence in [14]. It is reported in [14] that the SARIMA model was found to work best with time series data that exhibited periodic or seasonal characteristics and was able to predict the seasonal trend of malaria. That model type is only suitable for a stationary or seasonal process. The negative binomial model correctly identified the association between climate variables (taken as explanatory variables) and the rate of malaria transmission. That last model type can make good short-term forecasts, but is not ideal for prediction in subsequent years.
In this paper, a GLM is used in the context of forecasting falciparum malaria incidence count per month based on climate variables and history of falciparum malaria incidence count per month in three endemic regions of Senegal: Dakar (hypoendemic zone), Fatick (endemic zone), and Kedougou (hyperendemic zone). The choice of these three regions is motivated by data availability (notably the presence of villages where longitudinal studies have been conducted), but also by geographical differences: influence of the ocean in the Dakar peninsula, tropical climate in Kedougou, and savanna landscape in Fatik. These fundamental geographical differences allow us to test the applicability of GLMs under drastically different evolutions of the climate variables. A machine learning approach is developed, based on a separation of the data into a train-set to estimate the parameters by maximum likelihood and a test-set to assess the forecast accuracy. Addition and ablation studies are developed to show the influence of each explanatory variable in the forecasts.
A Vuong test reveals that the Poisson distribution is preferred to the negative binomial distribution to model the malaria incidence given the explanatory variables. The forecast accuracy of GLMs with various distributions (Poisson, negative binomial, and Gaussian) and link functions (identity, log, and sqrt) is compared in terms of several model performance metrics. Whereas the best distribution-link combination varies according to the endemic region of interest and the performance metric, the experiments lead to the conclusion that the Poisson distribution with the identity link is overall the most suitable combination. In addition, a saturation method is introduced on the rainfall variable to remedy some overestimations observed during the forecasts. This method has reduced by 4% in the sense of MARE, the over-estimation occurring at the end of 2015 in Dakar.
The paper is organized as follows. The available data are presented in Section 2.1. The models are described in Section 2.2. The estimation and forecasting method containing the train-test machine learning method, the principles of forecasting, the algorithmic protocols, and the saturation concept are presented in Section 2.3. Experimental results and discussion are reported in Section 3 and conclusions are drawn at the end.

Data and Notation
There are two principal malaria transmission zones in Senegal: • "Faciès tropical": corresponding the regions of Ziguinchor, Kolda, Tambacounda, and Kedougou. In that zone, the raining season is the longest and most intensive in the country and covers 5 to 6 months. Malaria cases are observed between 4 to 6 months and the transmission is high (20 to 100 infected bites/human/year). • "Faciès sahélien": corresponding the regions such as Kaolack, Fatick, Diourbel, Dakar, Thies, Louga, Saint-Louis, and Matam with a less intensive rainy season and covers 2 to 3 months. The transmission is very low in general (0 to 20 infected bites/human/year).
We are interested in three regions of Senegal: Dakar (hypoendemic zone), Fatick (endemic zone), and Kedougou (hyperendemic zone) located in the map presented in Figure 1.
The historical data, such as the monthly falciparum malaria incidence count, the distributed insecticide-treated bed-nets (ITNs), and the distributed Artemisinin-based combination therapy (ACT), between 2008 and 2016, come from the "Programme national de lutte contre le paludisme (PNLP)" of Senegal (https://www.dropbox.com/s/0p4uc2di hfhr9cb/Dakar.csv?dl=0 (accessed on 3 July 2023)). In this study, we consider as malaria cases, the cumulative number of confirmed tests by Rapid diagnostic tests (RDTs) during the month, in all individual groups [15,16]. Malaria cases are confirmed by the methods validated by the "Programme national de lutte contre le paludisme (PNLP)" of Senegal in accordance to the WHO guidelines. The main method is the rapid diagnostic test (RDT) even if it has been recently discovered that this test could miss up to 20% of malaria cases [17]. Due to unavailability of the RDT in some deep localities and the lack of materials to keep it, some districts use the "goutte épaisse", which is a very old method and less sensitive than the RDT. There also is a more sensitive but very expensive test: polymerase chain reaction (PCR), used only in some high level medical research institute in Senegal. Then, we grouped the data from all the different big sanitary districts in each region such as Dakar, Fatick, and Kedougou. The hourly meteorological data, such as the temperature, the relative humidity, and the rainfall, between 2008 and 2016, come from meteoblue (https://www.meteoblue.com/historyplus (accessed on 3 July 2023)). To obtain adequate meteorological data for our study (monthly time unit) we add up all the measured values of the month for rainfall, and we calculate the mean of all the measured values of the month for temperature and humidity.

Response Variable
In this study, the response variable (or explained variable) is the falciparum malaria incidence count per month noted by Y o (t).

Independent Variables
The available explanatory variables of this study are the history of falciparum malaria incidence count per month (Y o (t − δ)), the rainfall (R(t − δ), mm per month), the average temperature (T(t − δ),°C per month), the relative humidity (H(t − δ), % per month), the number of insecticide treated bed-nets distributed per month (B(t − δ)), the number of anti-malarial drugs distributed per month (A(t − δ)) where we consider δ = h, h + 1, . . . , 6 (h represents a forecast horizon), and an artificial vector that we call intercept vector I equal to 1 all t.
We are interested in the meteorological explanatory variables such as the rainfall (R), the average temperature (T) and the relative humidity (H) because they are known to influence the mosquitoes (Anopheles) ecology by affecting its distribution, seasonality, and transmission intensity [18,19]. For example, the temperature influences the sporogonic development duration of the parasite and many parameters related to the mosquitoes such as: the biting rate, the egg deposition rate, and the death rate of immature and adult mosquitoes [19]. The rainfall influences the availability and the quality of the larval breeding grounds [20] and the maturation of immature mosquitoes [19]. As for the bed-net (B) and the drugs (A) distributed, we took them because we suppose that they constitute the main factors fighting against malaria [16], reducing the morbidity and the mortality of the disease. Needless to say, the number of bed-nets actually used would be a more suitable explanatory variable, but these data are not available. Note that, in contrast with the physical explanatory variables (R, T, and H), the human explanatory variables (B and A) may depend on the incidence in the past, and they may also be influenced by models used by the health authorities. For this reason, in most of our experiments, we only consider the physical explanatory variables.
All simulations and data analysis are carried out with Jupyter Notebook (Anaconda 3) and Spyder (Python 3.7).   Our experiments use the epidemiological data from PNLP and the meteorological data from meteoblue from 31 January 2008 to 31 December 2016, in Dakar, Fatick and Kedougou.

Models
We have n data points (X t1 , X t2 , ..., X tk , Y t ) ∈ R k+1 for t = 1, ..., n where k is the number of explanatory variables (including the intercept vector) and n is the number of months. We want to build a generalized linear model (GLM) of the response vector Y using the k explanatory variables X 1 , ..., X k , according to the diagram Equation (1), where we denote by R-v: Random variable, L-f: Link function and D-c: Deterministic component. According to [3], the link function permits the mean (µ) of the t th observation and its linear predictor (η) to be related. We let X t1 = 1, t = 1, . . . , n making β 1 the intercept. In the D-c block, the regression coefficients β 1 , . . . , β k are to be estimated on the train-set.
According to the studies in [3,12,21], candidate distributions for viable modeling include the Poisson and negative binomial (NB) distributions. Now, we are going to present these two regression models in the following Sections 2.2.1 and 2.2.2.

Poisson Regression Model
The Poisson distribution is probably the most used discrete distribution because of its simplicity, according to [11]. Its conditional probability mass function is defined as in [2,11] by According to Equation (1), we have

NB Regression Model
The Poisson-Gamma mixture distribution is the negative binomial distribution, according to [6]. Its probability mass function is given as in [5][6][7] where Γ is the gamma function. Its mean is µ t and its variance is µ t + αµ 2 t , where α is termed the distribution parameter. Note that, if α → 0, the negative binomial converges to the Poisson distribution.
The Section Appendix A describes how the regression coefficients are computed.

Train and Test Sets
We have t s < t i < t c < t e , where t s is the initial time of the data, t i is the initial time of the observed malaria incidence, t c is the end time of the train set, t e is the end time of the test set.

Parameter Estimation and Principles of Forecasting
We train the model by taking the observed malaria incidence (Y o (t)) in [t i , t c ] and the explanatory variables (X tj , . We did this in order to have the regression coefficients βs and the dispersion parameter α (only with NB regression model).
In [2], the authors suggest to calculate α using a technique that they call auxiliary Ordinary Least Squares (OLS) regression without a constant. In the negative binomial case, a first estimation of the βs is obtained by the procedure of the Poisson case. Then, α is computed by OLS method. Finally, the βs are re-estimated by maximizing the log likelihood (Equation (A3)) wrt β.
Then, we make the forecasts, according to Algorithm 1, with the coefficients found in the train period. We assess the model accuracy in the test period [t c + 1, t e ] by comparing the theoretical (mean µ t ) and the observed (Y o (t)) incidences.
Algorithm 1 describes the train-test procedure. For the link function g, the choices identity, log and sqrt are available in the Python library statsmodels.genmod.families.links. The forecasts are obtained with the formula (4) Fit the GLM with distribution f and link g in the train period using the Python library statsmodels.genmod.families; 4 Get the regression coefficients β j , j = 1, . . . , k;

Saturation Method
We would like to test a saturation in the explanatory variables, in particular rainfall. The motivation is that additional rainfall should have less impact on the malaria incidence in a wet period than in a dry period. The saturation can be simply a hyperbolic tangent function. We posit that, instead of being linear, the contribution of rainfall to η(X) is affected by a saturation, which we model by We estimate the parameter γ by making a research in many initial values of γ in order to find the more adapted value which gives the low RMSE_train after fitting the GLM. This procedure is entirely described in Algorithm 2.

Algorithm 2: Forecasting Algorithm with Saturation
Input: t s , t i , t c , t e ← times of Section 2.3.1; ← vector (Section 3.1); Fit the Poisson distribution and link g with Y train and X train−sat in order to obtain the predicted mean (µ); Fit the GLM with distribution f and link g with Y train and X train−opt using the Python library statsmodels.genmod.families; 12 Get the regression coefficients β j , j = 1, . . . , k;

Results and Discussion
In this section, we first present and discuss the results of the correlation between the explained variable and each explanatory variable. We define metrics. Then, we present the forecast results from the Algorithm 1. Finally, we present the results from other methods such as addition study, ablation study, and saturation.

Determination of Lags
For Dakar data in Figure 2, we observe, every year, an increase in malaria cases in the rainy season (from May or June to October or November), reaching a peak around the month of October or November, and a decrease to become stationary along the dry season. This situation is also observed in Fatick ( Figure 3) and Kedougou (Figure 4), and proves the seasonality of the malaria cases from these three regions.
The plots of the malaria cases and the explanatory variables (e.g., Figures 2-4) reveal that there is a delay between the maximum of malaria cases and the maximum of the explanatory variable) in the three data sets. This delay is called lag and represented by in the formulae. We set where r denotes the sample Pearson correlation coefficient. The statistical results are presented in Table 1 (reproduce with Determination_of_lag_2022_07_23.py) and the evo-lution of correlations as function of lags is illustrated in Figures 8-10 (reproduce with Correlation_Plots_2022_07_23.py). Since these p-values are less than 0.05, we could conclude that there is a statistically significant correlation between the malaria incidence and the explanatory variable at the delay considered. The only exception is the bed-nets distributed in Dakar where the p-value is 0.129 > 0.05.
The rainfall is highly and positively correlated, meaning that if the rainfall increases, the malaria incidence will increase. These results are also found in [11], revealing that rainfall was a strong positive predictor of increasing the annual incidence rate. The bed-net distribution is weakly correlated in Dakar. Then, it is positively correlated in Fatick (even if the value is low) and in Kedougou with a lag of three months. Indeed, the situation in Fatick and Kedougou can be explained by the fact that the authorities anticipate the bed-net distribution three months before the beginning of the rainy season. On the other hand, in Figure 7, we observe that the bed-net distribution is not regular in Kedougou because its behavior is very different in the last two years of the dataset. For all these reasons, we do not use the bed-net distribution (B) as an explanatory variable. The optimal lag between the malaria cases and the drugs (A) is equal to 0 in all the regions. This result means that the drugs is not a real predictor because it is usually taken after appearing some malaria symptoms from a patient. That reason leads us to do not consider the drug as an explanatory variable. But it can help to cure some sick people and to reduce future malaria cases. In Figures 8-10, the correlation between malaria cases at time t and the malaria in the past (t − δ) decreases for all values of δ meaning that the malaria in the past, as an explanatory variable, becomes less and less important when δ becomes larger. The

Model Performance Metrics
The output of the GLM is a probability distribution at each time instant. Accuracy measures such as the root mean square error (RMSE) and the mean absolute scaled error (MASE) defined in [22], and the mean absolute relative error (MARE) defined in [23], quantify the discrepancy between the mean µ t of the distribution and the observed incidence Y o (t). We also introduce other statistical measures such as the scatter index (SI) and the reliability analysis (RA) defined in [24]. We conduct experiments for 108 months with a train and a test period.

•
Root mean square error (RMSE): • Mean absolute error (MAE): • Mean absolute scaled error (MASE): It consists of the ratio between the MAE and the mean monthly variation of the observed values. A MASE value around 1 or below indicates an excellent accuracy. • Mean absolute relative error (MARE): • R-squared [2]: It is the proportion of variation in the outcome that is explained by the predictor variables. The higher the R-squared, the better the model, in contrast to all the above metrics.
The Scatter index (SI) (also called the normalized root mean squared error (NRMSE)) and the reliability analysis (RA) are defined in [24].

•
The SI presents the percentage of RMSE difference with respect to mean observation or it gives the percentage of expected error for the parameter. Lower values of the SI are an indication of better model performance.
• The reliability analysis (RA) is a statistical method for measuring the overall consistency of a model by determining if this suggested model achieves a permissible level of performance.
where the ks are determined through two steps. First, the relative average error (RAE) is defined as a vector whose tth component is Next, if RAE(t) ≤ ∆, then k(t) = 1, otherwise k(t) = 0, where ∆ is a threshold value that is 0.2 (20%) based on Chinese standards.

Model Selection by Using the Vuong Test
In order to assess the adequacy of the distribution, we apply the Vuong statistical test as in [25,26]. The Vuong test is defined as follows: is the first probability mass function and f 2 (Y t |X t ) is the second probability mass function. If V > 1.96, then the first model is preferred. If V < −1.96, then the second one is preferred. If −1.96 < V < 1.96, none of the models are preferred. In our case, we let f 1 be the Poisson distribution and f 2 the NB distribution. This statistical test permits to choose the most adequate between the two regression models in order to fit the data. Figures 11-13 (reproduce with Vuong_test_2022_07_23.py) present the dependence of the Vuong test value (V) with respect to α (dispersion parameter). The figures show that the α computed by OLS is usually in the window where the Poisson model is preferable to the NB model. An exception is observed in Kedougou where none is preferred with log and sqrt. We can conclude that the use of the GLM with Poisson distribution is justified.

Results Comparison by Using Metrics
In this part, we made a comparative analysis in Table 2 (reproduce with Compara-tive_results_with_all_models_2022_07_23.py) between the three models that are Gaussian (identity, log), Poisson (identity, log, sqrt), and negative binomial (identity, log, sqrt). Experiments are carried out with Algorithm 1 without saturation. We include the minimum values of the predicted mean obtained after fitting the model. A negative sign indicates that the model is not adequate for our count data. All these metric values permit us to validate and to compare the performance of the models.
, and 1 for the intercept. We have considered t s = 0, t i = 5, t c = 84 and t e = 108 and h = 1. Refer to Table 1 Table 2 show that the most of the MASEs and MAREs in the train and test windows are around 1 in the three regions. Then, we also have the R 2 COR s indicator whose values are high (>50 %) indicating a good contribution of the explanatory variables in the forecasts. Additionally, predictions provided by Poisson (identity) are globally more reliable in in the three regions when compared to other models based on RA values.
We can conclude that the Poisson (identity) model can be nicely used to fit our data for parameter estimation and to make the forecasts with these found parameters in the three regions. All the additional studies will base on this model.

Forecasts Results by Various Sets of Explanatory Variables
Using Algorithm 1 with Poisson for f , identity for g, t s = 0, t i = 5, t c = 84, t e = 108, and h = 1, we validate the model with datasets from Dakar, Fatick, and Kedougou.

Forecasts Results Using History of Malaria Incidence Only
For transmissible diseases, the incidence at a given time t − δ is very important to predict the expected incidence at a given time t. That reason leads us to first consider a set of explanatory variables composed of two variables that are the history of malaria incidence (Y o (t − δ)) and the intercept vector (I(t)). We defined this set as follows: Thus, our linear predictor becomes like a simple Markov model with the Gaussian distribution that is called first order autoregressive AR(1) [3] and defined by The estimates returned by the model in Figures 14-16 are very accurate because of the low standard errors and the tight 95% confidence interval and statistically significant due to the p-values that are less than 0.05. That is why we decide not to show the confidence intervals of the predictions as they do not clearly appear. In Figures 14-16, (reproduce with Addition_study_2022_07_23.py) we present the forecast results (noted by A) and the curves of β j X j (noted by B). These results are obtained from Equation (17) with Poisson (identity) model when we take h = 1. The figures noted by B permit to show which variable is highly (β j X j (t) 0) or weakly (β j X j (t) ∼ 0) used during the forecasting process. In all three regions, we observe that the sole use of the history of malaria incidence at time t − 1 gives good results. That can be explained by the fact that it is highly used based on figures (B) compared to the intercept whose values are close to 0. This situation is biologically true because with the transmissible diseases like malaria the history of cases is very important to estimate the future cases.
In addition, in Table 3, (reproduce with Various_forecasting_horizon_2022_07_23.py) we remark that the model gives less accurate predictions when the values of h increases corresponding a weak use of the history of malaria incidence (Y o (t − h), h > 1). We also have collected, in Table 3, the result of SI in the training and testing periods with various value of h. These values of SI reveal that the model is the most accurate when h = 1 because they are the lowest in all the three regions.
The conclusion of these results is that malaria in the past was a very good explanatory variable when h = 1.     In this part, we use the whole explanatory variables and the results from Algorithm 1 and Equation (4) are presented in Figures 17-19 (reproduce with Saturated_and_non_ saturated_rainfall_2022_07_23.py). The confidence intervals of the predictions are not shown based on the same observations made in the previous Section 3.4.1. Figure 17B of Dakar data shows with some peaks in the rainfall distribution. While, in Figure 18B about Fatick data, the rainfall variable is weakly used and this variable is negatively used in Figure 19 of Kedougou data. The peaks observed in Dakar and the situation in Kedougou leads us to develop the method of saturation in Section 2.3.3. Then, in Figure 17B of Dakar data, it is also shown that the humidity is weakly used to mean that this variable has a little contribution in the forecasts. Contrary to Figure 18B about Fatick data, the humidity is highly used, so it has a big participation in the forecasts. As for Kedougou, in Figure 19B, this variable (humidity) is moderately used. As for temperature, its contribution changes depending on the region. For example, in Dakar and Fatick, this variable is negatively used as shown in Figures 17 and 18. However, in Kedougou, it is positive and highly used in Figure 19.
We can summarize here that each explanatory variable does is not used the same in the three regions. Thus, these analyses and observations lead to the development of the methods in Sections 3.5 and 3.6. Figure 17. Forecasting results in Dakar, no saturation applied. Malaria incidence means the falciparum malaria incidence count per month. We present the forecast results (noted by A) and the curves of β j X j (noted by B). Figure 18. Forecasting results in Fatick: no saturation applied. Malaria incidence means the falciparum malaria incidence count per month. We present the forecast results (noted by A) and the curves of β j X j (noted by B). Figure 19. Forecasting results in Kedougou: no saturation applied. Malaria incidence means the falciparum malaria incidence count per month. We present the forecast results (noted by A) and the curves of β j X j (noted by B).

Addition Study
In this section, we experiment with different combinations of sets from Equation (16) and we define them as follows {start_set ∪ X j , j = 3, . . . , k}.
The purpose is to investigate the influence of each additional variable compared to the observation in Section 3.4.1.
The Table 4 (reproduce with Addition_study_2022_07_23.py) reveals that, in Kedougou, forecast results are improved by adding to start_set explanatory variables such as temperature or humidity. In contrast, the addition of the rainfall (R) gives some less good results in the sense of the RMSE, MASE and MARE, in Dakar and Fatick, even if the higher values of R 2 COR are observed there.

Ablation Study
We now present ablation studies where we start with all the available explanatory variables and investigate the effect of removing any one of them. The sets of explanatory variables after ablation are thus {X 1 , X 2 , . . . , X k } \ X j , j = 2, . . . , k. Table 4. Results of the addition study in the three regions: Algorithm 1 with Poisson for f and identity for g, t s = 0, t i = 5, t c = 84, t e = 108 and h = 1. We denote by w: situation in the test period with the added variable and wo: situation in the test period without the indicated variable. Ratios of w/wo are reported. Note that, for the metrics such as RMSE, MASE and MARE, a ratio lower than 1 implies an improvement of the model when the variable is added as an explanatory variable. A ratio of R 2 COR higher than 1 indicates that adding the variable improves the forecasts. By doing that, we can know which variable is more responsible of the over-forecasting (or under-forecasting) observed.

Regions
The results in Table 5 (reproduce with Ablation_study_2022_07_23.py) show a low accuracy in terms of RMSE, MASE, and MARE in the whole three regions data sets when the malaria incidence in the past was deleted. This situation was expected when we refer to the forecast results in Section 3.4.1. It is also shown there that the deletion of one of the others variables such as rainfall, temperature and humidity generally reduces a bit the performance of the models in the sense of RMSE, MASE, and MARE. Table 5. Results of the ablation study in the three regions: Algorithm 1 with Poisson for f and identity for g, t s = 0, t i = 5, t c = 84, t e = 108 and h = 1. We denote by wo: situation in the test period without the indicated variable and w: situation in the test period with all the explanatory variables. Ratios of wo/w are reported. Note that, for the metrics such as RMSE, MASE and MARE, a ratio lower than 1 implies that the result is good without the indicated variable. A ratio of R 2 COR higher than 1 indicates adding the variable improves the forecasts. We denote by Mcp: Malaria cases in the past.

Forecasts Results Using Saturation
In this section, we made a main modification to the rainfall variable. We call this method by saturation and the procedure is entirely detailed in Section 2.3.3 and the al-gorithm therein (Algorithm 2). Statistical results of this method are presented in Table 6. They permit us to distinguish the performance given by this novelty compared to Table 2   As a result, an improvement of the forecasts is observed based on Table 6. For Dakar, we observed an improvement of the results after applying the saturation according to all the metrics. It is interesting to see that, in Dakar, the introduction of the saturated rainfall has reduced the over-estimation occurring at the end of 2015; compare Figures 17 and 20. Then, in Fatick, all the values are 1, meaning that the saturation method does not improve the results. That can be explained by the fact that the rainfall is less used in this region, according to Figure 18B. As for Kedougou, we have a ratio of MARE smaller than its value before applying the method (0.87) and a ratio of R 2 COR higher than the value before applying the method (1.04). There is no improvement in terms of ration of the RMSE and MASE. That may be explained by the fact that there was not a peak as that is illustrated in Figure 19 or the forecasts have been already good. Another favorable effect of the saturation is that, in Kedougou, the non-saturated rainfall is negatively used ( Figure 19B) while the saturated rainfall is positively used ( Figure 21B). Table 6. Comparison results after and before the saturation: Algorithm 2 with Poisson for f and identity for g, t s = 0, t i = 5, t c = 84, t e = 108 and h = 1. We denote by w: the metric with saturation and wo: the metric without saturation, in the test period. Ratios of w/wo are reported. Note that, for the metrics such as RMSE, MASE, and MARE, a ratio lower than 1 implies an improvement of the model to make good forecasts with saturation. A ratio of R 2 COR higher than 1 indicates that applying the saturation improves the forecasts.

Conclusions
For three endemic regions of Senegal, we have investigated the accuracy, in the sense of the metrics such as RMSE, MASE and MARE, of the falciparum malaria incidence count per month forecasts obtained with GLM's by using meteorological data and history of falciparum malaria incidence count per month as explanatory variables. Using the Vuong test, we have compared the adequacy of Poisson-based and NB-based GLM's. And the Poisson with identity as a GLM link function is in practice the more adequate regression model to make forecasts of malaria incidence based on meteorological factors. We have observed that the choice of the GLM's link function and the use of adequate lags in the explanatory variables may have a considerable impact on the forecast accuracy. We also have observed that the application of saturation in the rainfall increases the quality of the forecasts in Dakar and Kedougou.
Ablation study shows that removing the history of malaria cases from the explanatory variables has a strong adverse effect on the forecast accuracy.
This study is led with a monthly malaria incidence count due to the unavailability of the daily malaria incidence reports that could have helped us to understand better the influence of the climatic data. This study is a step towards providing the authorities with decision-making tools for the optimal dispatch of resources.
This proposed GLM gives some overestimations in the forecasts (end of 2016 in Dakar in Figure 17 and Fatick in Figure 18). These peaks can be caused by an inadequacy of this model due to its linear character even if a non-linearity is then applied for rainfall. This may also be due to a variable ratio of infected people being sufficiently ill to go to the hospital and be confirmed, thus causing a bias in the collected data. Another and important explanation to the over-forecasting (or under-forecasting) observed can be the unavailability of some explanatory variables: the distance to water bodies, the normalized differenced vegetation index (NDVI), the night and day LST (land surface temperature), the ownership and use of insecticide treated nets (ITNs) and the intermittent preventive treatment distributed for pregnant women (IPTp) that are considered to fit malaria incidence [13]. Providing the enhanced vegetation index (EVI), and actual evapotranspiration (ETa) could help to improve the malaria incidence fitting model [27]. In our available data, there was the ITNs distributed because it constitutes the main factor fighting against malaria according to [16] but its distribution was not very regular (see Figures 5-7). Needless to say, the ownership and use of ITNs would be a more suitable explanatory variable. Funding: This work was supported by ARES-CCD and UCLouvain's "Conseil de l'action internationale" through PhD scholarships.
Acknowledgments: This work benefited from discussions with colleagues: Loïc Van Hoorebeeck and Hazan Daglayan Sevim. We are grateful to the PNLP for providing us with historical data, such as the monthly number of malaria cases, bed-nets, and therapy distributed from 2008 to 2016 and to meteoblue for providing us with meteorological data, such as temperature, humidity and rainfall from 2008 to 2016. This work has received funding from a fellowship awarded by UCLouvain's Conseil de l'action internationale.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Method for Parameter Estimation
Appendix A.1. Poisson Log-Likelihood

1.
The likelihood function is defined as follows where the right-hand side depends on β through Equation (2).

2.
The Log-likelihood is defined as follows where µ t depends on X and β and is defined as follows Equation (2). Observe that the term log(Y t !) does not depend on the parameters to be estimated. 3.
The first derivative of the log-likelihood is named the gradient. If g = log, we have log(µ t ) = η t = X T t β. So this function is defined as follows Then, the gradient at state j is defined by Then, if g=identity, we have µ t = η t = X T t β. So Then, the gradient at state j is defined by Then, the gradient at state j is defined as follows Appendix A.2. NB Log-Likelihood As in [11,26,28], we obtain the likelihood function Then, Abusing notation, we omit the term log(Γ(y i + 1)) since it does not depend on the parameters to be estimated. This yields According to [2] p. 81 and [28], and if Y t is an integer, we have We mention it in order to avoid using the gamma function that gives infinity when Y t is high. Finally, the formula Equation (A2) yields where µ t depends on X and β according to Equation (2).