1. Introduction
Detailed population projections are essential for political and economic decision-making. This includes, for example, choices about education policies, the pension system or the labour market [1,2]. Since demographic developments are usually non-homogeneous and decisions often have to be made at the regional or local level, accurate spatially disaggregated projection results are essential. However, if complex, multivariate outcomes are of interest or many explanatory variables should be included in the modelling, the projections become infeasible with traditional macro approaches due to the exponential growth of the state space when additional variables are included [3,4]. In such instances, dynamic microsimulation is the method of choice. Here, a microdata set, often denoted as the base population, is stochastically projected into the future. In this way, population projections for multiple characteristics are simultaneously produced to answer what-if questions that cannot be answered otherwise [5]. Dynamic microsimulations are commonly used to project demographics, diseases, and health- or labour-market-related variables over time (e.g., [6,7,8]). Microsimulations are also commonly applied in social sciences [4], medicine [9], transportation, agriculture and land usage [10].
Once subnational analyses are of interest, the model must accurately reflect regional differences. This includes the consideration of heterogeneity in the geographic allocation of the micro units into real places, which is also called spatial microsimulation [11,12]. Population synthesis methods generate and allocate microdata that fulfil known regional values. However, differences must also be considered in the projection of the units over time since heterogeneity is usually not only due to compositional effects in the populations. If one simply applies models fit on microdata without regional information, this may lead to considerably biased results in subregions [13,14]. Unfortunately, subregional models based on survey data are often not feasible due to small sample sizes or a lack of geographic indicators in the data on which the models are fit. At the same time, marginal distributions (often classified) or totals may be available at the level of interest. For instance, German regional birth distributions by five-year age groups and citizenship are available to researchers on request [15], which would, in principle, allow simplistic but fully regionalised projections of future births.
Thus, researchers often face a trade-off between the desired model complexity, which should capture the complex dynamics within a population, and the demand for a model that is adequate for the region of interest. Preserving model complexity matters to researchers and policy-makers because the "who" is often of as much interest as the "how many", which is usually the motivation for microsimulation in the first place.
One way to reduce regional biases is to use alignment methods that adjust the transition probabilities to known values for the particular region. Such alignment methods incorporate external information into the model to replicate observed developments [16]. Generally, these methods are used to correct the model in the sense of calibration (mean correction), which is the focus of this application, to reduce the variance introduced by the Monte Carlo (MC) procedure (variance elimination), or both [17].
Previous research focussed on alignment to fulfil regionally observed totals. In this paper, we introduce a flexible and easy-to-implement model recalibration method that aligns predictions with observed marginal distributions rather than totals only. This allows reproducing locally observed transition rates while maintaining the desired model complexity. We demonstrate its applicability in an MC simulation that predicts birth events in the German city of Trier for 2014–2022 and evaluate the predictions against observed distributions from registers. We show that the proposed recalibration to marginal distributions leads to improved predictions compared to models that were not calibrated or were calibrated to totals only.
The paper is structured as follows. The microsimulation methodology and the ideas of alignment are introduced in Section 2. Section 3 introduces the concepts of model calibration techniques and their applicability for regional bias reduction of statistical models. Section 4 describes the disaggregation and smoothing method for abridged demographic rates that is applied in this paper. The simulation study to test the applicability of the model calibration for demographic microsimulations is described in Section 5. The overall findings of this paper are summarised and discussed in Section 6.
  2. Microsimulation Methods for Population Projections
Microsimulations are computer-intensive tools to study complex systems by simulating the units at the lowest level via repeated stochastic experiments. In contrast to macro models, individuals rather than population aggregates are studied using unit-level data and unit-level models. Thus, a list of persons rather than a set of cross-tables of characteristics is updated yearly. Since the number of cells in the cross-tables grows exponentially with the number of variables included, macro models quickly become infeasible when many variables are included due to the size of the state space. In fact, even for a few variables beyond age and sex, the number of cells may be larger than the number of persons in the population [3,4]. Additionally, microsimulations are less restrictive on the modelling procedure for behaviour, variable types (continuous or categorical) or process types (Markovian or non-Markovian). Thus, population heterogeneity beyond age and sex can be included more easily in the micro approach.
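To make the state-space argument concrete, the following minimal sketch counts the cells of a full cross-table as variables are added one by one. The variable names, the numbers of categories and the population size are hypothetical values chosen for illustration; they are not taken from MikroSim or from any data set discussed here.

```python
from math import prod

# Hypothetical numbers of categories per variable (illustrative assumptions).
categories = {
    "age": 100,
    "sex": 2,
    "marital_status": 4,
    "education": 5,
    "employment": 6,
    "citizenship": 2,
    "household_size": 6,
}
population_size = 100_000  # hypothetical population of a mid-sized city


def n_cells(counts):
    """Number of cells in the full cross-table of the given variables."""
    return prod(counts)


sizes = list(categories.values())
# Cell count after adding the first k variables, for k = 1, ..., 7.
cells_by_step = [n_cells(sizes[:k]) for k in range(1, len(sizes) + 1)]

for name, cells in zip(categories, cells_by_step):
    note = "more cells than persons" if cells > population_size else ""
    print(f"+ {name:<15} {cells:>8} cells  {note}")
```

Under these assumed category counts, the cross-table already has more cells than the hypothetical population has persons once the seventh variable is added, whereas a micro approach would still only store one record per person.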
However, this comes at increased computational costs, increased data requirements, usually lower comprehensibility for non-experts, and significantly higher development costs in terms of both hardware and working time [4]. For purely demographic projections by age, sex and region, simulation at the micro-level yields no benefit over macro models. However, when population heterogeneity matters and there are too many variables to be manageable in a macro approach, when the behaviour is too difficult to model at the aggregate level, or when individual trajectories are important, microsimulations are the method of choice. Microsimulations can be used to (1) generate high-quality synthetic microdata sets, (2) simulate counterfactual scenarios to answer what-if questions via virtual experiments and (3) project populations [4,5,18].
Since, in practice, no single data set containing all variables of interest is available, and because of statistical disclosure control concerns, the base population is usually at least partly synthetic. This is especially true if a high spatial resolution is required [19]. Even if detailed geographic information is provided in the survey data, sample sizes are usually too small to obtain regionalised estimates. This motivates a partial or complete synthetisation of the base population. For this, census information and reliable surveys are typically combined in a simulated estimation of detailed (small area) micropopulations [19,20]. This process is also called population synthesis or, more generally, spatial microsimulation [11,12].
Two branches of microsimulations can be differentiated: static and dynamic models. While static models primarily address immediate impacts at one or more time points without considering the change processes, dynamic microsimulations explicitly model changes in the units’ characteristics over time [4,19]. Accordingly, static models are primarily applied in rule-based systems where life events are not required [12], predominantly for analysing the immediate impacts of tax-and-transfer policies. Conversely, dynamic models are usually of interest when life trajectories are essential. This is why dynamic models can also be understood as a multivariate process with state dependency [21].
Within dynamic simulation, two approaches may be distinguished depending on how the time dimension is handled: continuous-time and discrete-time models. In continuous-time models, exact time points of state changes are modelled using survival models and hazard rates [5]. Since events can occur at any point in time, no fixed order of updating the characteristics is required. Instead, the events that are determined to occur first are executed first. Whenever a unit’s characteristics change, dependent events are re-simulated conditional on the new state. This process is repeated until the unit’s death or the end of the simulation horizon. This way, individual life histories are produced sequentially [22]. In principle, competing events are handled more easily, and the models are easier to expand under the assumption of conditional independence, which may, however, pose substantial restrictions on the models in use [22]. Further, continuous-time models are usually more computationally intensive, harder to align with external values and more demanding regarding data, as they require detailed information on the life history of units [5,22].
In contrast to continuous-time models, discrete-time models are simulated in fixed, often yearly, intervals, updating characteristics only once per simulated period without giving an exact timing of an event. Thus, discrete-time models can be understood as aggregating over the time interval [22]. The fixed-interval (net) transitions between states are usually obtained by statistical models or conditional distributions.
In practice, updating all characteristics simultaneously is computationally infeasible. Additionally, not all necessary information to update all states is jointly available in the same data set. Thus, the transitions are usually modelled separately by recursive factorisation of the joint probability, allowing a drastic simplification of the estimation and simulation structure [21]. For instance, the same joint distribution is obtained by conditioning marriage on births or births on marriages. However, one must note that the coefficients of the independent variables lose any causal interpretation in the process of the recursive factorisation and depend on the adopted order of the modules [21]. For instance, if a marriage module is executed after the birth module, the marriage probabilities are conditional on the occurrence or non-occurrence of a birth event for the entire year. One must also note that the recursive conditioning imposes a strict and often complex structure on the modelling and estimation process.
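The recursive structure of a discrete-time step can be sketched as follows: a birth module runs first, and the marriage module is then conditioned on the simulated birth outcome of the same year. This is a toy illustration, not the MikroSim module chain; the function `simulate_year`, all probabilities and the age limits are hypothetical assumptions.

```python
import numpy as np


def simulate_year(age, married, rng):
    """Advance all units by one year: birth module first, then marriage."""
    fertile = (age >= 15) & (age <= 49)
    # Module 1: P(birth | age, marital status at the start of the year).
    # The probabilities are made-up values for illustration.
    p_birth = np.where(fertile, 0.03 + 0.05 * married, 0.0)
    birth = rng.random(age.size) < p_birth
    # Module 2: P(marriage | birth outcome of this year, previous status).
    eligible = (married == 0) & (age >= 18)
    p_marry = np.where(eligible, 0.02 + 0.06 * birth, 0.0)
    marries = rng.random(age.size) < p_marry
    return age + 1, np.where(marries, 1, married), birth


rng = np.random.default_rng(seed=1)
age0 = rng.integers(0, 90, size=10_000)
married0 = ((rng.random(10_000) < 0.4) & (age0 >= 18)).astype(int)

age1, married1, birth1 = simulate_year(age0, married0, rng)
print("births:", int(birth1.sum()))
print("new marriages:", int((married1 - married0).sum()))
```

Swapping the two modules would require re-estimating both conditional models, which is the order dependence described above.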
In the MikroSim model, demographic events are simulated first, while non-demographic modules are found later in the module chain (see Figure 1). A description of the modules and a justification of the module order are given in Münnich et al. [23]. The entire population passes through the module order each year and is updated to the following year. Modules are recursively conditioned on the simulated events in previous modules to reduce the impact of ordering.
To produce reasonable projection results, the transition probabilities between the states must closely reflect the dynamics within the population. This is often done using transition matrices based on statistical models fitted on survey data or conditional distributions [5,24]. However, in practice, estimated transitions may still differ from observed events due to insufficient information or sampling errors in survey data. Methods to harmonise simulated values with observed outcomes are used to remedy such biases regarding state transitions. These so-called alignment methods may also be used to reproduce outcomes from macro models or to implement what-if scenarios [14,24,25]. For models with a recursive structure, such an adjustment can simply be made sequentially in each submodule. For instance, births or deaths might be aligned independently of each other. For non-recursive models, a system-wide alignment must be applied [16]. Constraining the output of simulation models is also used to reduce MC uncertainty in simulation runs. In practice, various alignment methods are applied [13,16,17,25,26]. A list of criteria to judge alignment methods has been established by Stephensen [17].
If the base population of the simulation is rooted in the past, known values, such as those from official statistics or other administrative data sources, can be used to align the model. This is useful for two reasons: to create a current-year population that is as realistic as possible, on which the projection is based, and to ensure more plausible (regionalised) transition probabilities between the states by adjusting the model parameters to replicate observed transitions. For instance, the base population of the MikroSim model is rooted in 2011 as it was created based on census information from 2011 [23]. Therefore, before projecting the microdata into the future, it must first be updated to the present year. Thus, it is useful to integrate known additional information into the model. This includes observed migrations, births, deaths, employment, acquisition of citizenship and much more.
Usually, such alignment procedures are only applied to the totals but not their distributions. However, accounting for marginal distributions rather than totals alone may be crucial in certain circumstances. This is particularly true if patterns are known to strongly differ regionally. In some instances, adjusting the parameters may still be unnecessary. For example, when marginal (probability) distributions are specified for each future year based on a macro model or assumptions, units within specific subgroups may simply be selected according to their relative probabilities obtained from the statistical models, using an appropriate selection method. For example, women with higher education may be selected less often for birth events than women with a lower educational degree. If events are supposed to be driven only by the unit characteristics rather than by exogenous assumptions, it is useful to first adjust the model parameters to replicate observed benchmarks and then apply the adjusted model. Since the MikroSim model is designed to be self-running and no future distributions for demographic rates are available at the geographic level of interest, parameter adjustments are the default in the model.
  3. Regionalisation of Transition Models by Model Updating
The previously described alignment procedures can be understood as probability calibration methods. Outside the microsimulation field, ex post model adjustments to reflect observed totals or proportions are more commonly known as model updating techniques. Such methods ensure that model probabilities are well calibrated to observed outcomes [27,28]. For instance, in an ideal setting, predicted and observed proportions are identical for all predicted probabilities in each subgroup. This would be denoted as strong calibration. If this only holds overall but not in each subgroup, this is referred to as moderate calibration. If outcomes are neither systematically over- nor underestimated, the model is called weakly calibrated, or logistically calibrated in the case of probabilities. Finally, if the model predicts correctly on average, this is denoted as mean calibration. Visual and statistical tests to detect miscalibration are described in Cox [29], Miller et al. [30] and Steyerberg [27]. Straightforwardly, a logistic regression can be fit on logit-transformed predicted probabilities and observed proportions. Deviations in the coefficients from a slope of 1 and an intercept of 0 indicate systematic under- or overestimation of proportions and/or inadequate variation in the probabilities. If model outcomes are not well calibrated, model updating techniques can be applied to align the predicted outcomes with observed realisations.
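The regression-based calibration check can be sketched as follows. The example regresses observed proportions on the logit of the predicted probabilities and inspects the intercept and slope. The helper `fit_logistic` is a hypothetical, minimal Newton-Raphson implementation written for this illustration, and all input values are made up; in practice one would typically use an established GLM routine.

```python
import numpy as np


def logit(p):
    return np.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def fit_logistic(X, y, weights, n_iter=50):
    """Weighted logistic regression via Newton-Raphson (y may be a proportion)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = expit(X @ beta)
        score = X.T @ (weights * (y - mu))
        hessian = X.T @ (X * (weights * mu * (1.0 - mu))[:, None])
        step = np.linalg.solve(hessian, score)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-12:
            break
    return beta


# Hypothetical example: 35 subgroups with model predictions and risk-population sizes.
p_model = np.linspace(0.01, 0.15, 35)
n_risk = np.full(35, 400.0)
# Observed proportions constructed to be miscalibrated (intercept 0.3, slope 0.8).
p_observed = expit(0.3 + 0.8 * logit(p_model))

X = np.column_stack([np.ones_like(p_model), logit(p_model)])
intercept, slope = fit_logistic(X, p_observed, n_risk)
print(f"calibration intercept = {intercept:.3f}, slope = {slope:.3f}")
```

An intercept different from 0 points to systematic over- or underestimation, and a slope below 1 indicates that the predicted probabilities vary too strongly relative to the observed proportions.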
In principle, such recalibration methods can be applied to align the state transition models to subregions. For instance, Burgard et al. [13] show the applicability of intercept adjustment to regionalise simulation outputs by aligning to published totals. The regionalisation problem can also be understood as a model updating problem. For example, outcomes for some regions may be systematically over- or underestimated due to regional heterogeneity that could not be modelled, leading to regionally implausible values. To adequately project units in the subregions, we may thus first update the transition model to reproduce observed outcomes and use the updated model for the projections afterwards.
Usually, as a minimum requirement, we want the model to be correct on average for each subregion. This is also referred to as calibration-in-the-large or mean calibration and is already applicable if only totals or proportions are known for the whole (regional) population [28]. To achieve this mean calibration, the model expectation is adjusted to reproduce the observed outcomes by modifying the intercept of the statistical model while other coefficients remain unchanged. For instance, in a logit model, this implies that the odds ratios associated with all other coefficients are unaltered. Since only the intercept is updated, this process is also known as intercept adjustment. This approach is easy to implement and has been chosen as the default alignment in MikroSim when only total values are known for the regions [14].
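A minimal sketch of intercept adjustment for a logit model is given below. It searches for a single shift on the logit scale so that the expected number of events matches a known regional total. The function name `intercept_adjustment`, the bisection search and the numbers are assumptions made for illustration and do not reproduce the MikroSim implementation.

```python
import numpy as np


def logit(p):
    return np.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def intercept_adjustment(p, target_total, tol=1e-12, max_iter=200):
    """Find delta such that sum(expit(logit(p) + delta)) equals target_total."""
    eta = logit(p)
    lo, hi = -30.0, 30.0  # search interval for the shift on the logit scale
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        # The expected total is increasing in the shift, so bisection applies.
        if expit(eta + mid).sum() < target_total:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Hypothetical model probabilities for six units and a known regional total.
p_model = np.array([0.02, 0.05, 0.10, 0.08, 0.03, 0.12])
target_total = 0.6  # expected number of events to be reproduced

delta = intercept_adjustment(p_model, target_total)
p_adjusted = expit(logit(p_model) + delta)
print("delta:", round(float(delta), 4), "expected events:", round(float(p_adjusted.sum()), 4))
```

Because every unit receives the same shift on the logit scale, the odds ratios between units stay unchanged, which is the property described in the preceding paragraph.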
Such calibration methods are limited neither to the binary/logistic setting nor to statistical methods. For instance, De Cock Campo [31] demonstrates the applicability of the calibration framework to the exponential family and machine learning methods. Applications for survival models are given by Steyerberg [27]. Regardless of the setting, the predicted outcome must first be transformed according to the model link to produce adequate results [32]. For instance, in a logistic regression setting, the predicted probabilities should be on the logit scale within the model. Since state transitions in discrete-time microsimulations are usually modelled using logit models, this paper’s notation focuses on logit models.
Let  be the possible states of a characteristic  in time t and h the state of the corresponding individual in the previous period t − 1. We denote  as the known number of individuals of the risk population of size  that transition from state h to k within some region r between times t − 1 and t. Although, in general, this is applicable across several regions and time periods, for notational simplicity, we drop the indices for the region r and the time index t, as we operate within the same parameter values for these indices throughout the paper.
The alignment process of each individual’s transition probability 
 is conducted in an iterative form via Iterative Proportional Fitting. One straightforward implementation, 
Bi-Proportional Scaling, is an iterative process that adjusts 
 across the rows 
i (individuals) in Equation (
1) and the columns 
k (states) in Equation (
2) until the number of expected transitions is close to 
 [
14,
17].
      
We denote the intermediate scaled term 
; it is produced by Equation (
1), which due to the scaling process does not necessarily sum up to 1 for each individual across all states 
k. We denote the adjusted probabilities from Equation (
2) as 
. While this term is again a probability, summing up to 1 across all categories 
k, it may not yet fulfil the alignment target. Within the iterative process, it is again inserted into 
 in Equation (
1), restarting the process all over again until convergence. The process is repeated until the difference 
 between the expected number of transitions, the sum over all individuals’ post-scaling probabilities and the target value is 
small, i.e., we consider the error term 
 to be close to zero, such that the difference between the number of expected transitions and the target value 
 is less than 1.
      
Scaling probabilities towards historically observed benchmark values harmonises the simulation results with reality. The multiplicative scaling of probabilities for the entire risk population of a region has the same effect as manipulating the statistical model’s intercept coefficient for that region (or the region coefficient, if available) in the case of a logit model.
Logit scaling minimises relative entropy, meaning the probability distribution is changed as little as possible [17].
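The iterative scaling described above can be sketched for a small risk population as follows. The column step scales each state so that the expected number of transitions hits the target, and the row step renormalises each individual's probabilities to sum to 1. The function `biproportional_scaling`, the tolerance and the input values are hypothetical; the sketch assumes that the targets sum to the size of the risk population and uses a numerical tolerance rather than the threshold of 1 mentioned in the text.

```python
import numpy as np


def biproportional_scaling(P, targets, tol=1e-9, max_iter=10_000):
    """Scale an n x K probability matrix so its column sums match the targets."""
    P = P.astype(float).copy()
    for _ in range(max_iter):
        if np.max(np.abs(P.sum(axis=0) - targets)) < tol:
            break
        # Column step: expected transitions per state are scaled to the targets.
        scaled = P * (targets / P.sum(axis=0))
        # Row step: each individual's probabilities are renormalised to sum to 1.
        P = scaled / scaled.sum(axis=1, keepdims=True)
    return P


def logit(p):
    return np.log(p / (1.0 - p))


# Hypothetical binary example: column 0 = no event, column 1 = event.
p_event = np.array([0.02, 0.05, 0.10, 0.08, 0.03, 0.12])
P0 = np.column_stack([1.0 - p_event, p_event])
targets = np.array([len(p_event) - 0.6, 0.6])  # 0.6 expected events in total

P1 = biproportional_scaling(P0, targets)
shift = logit(P1[:, 1]) - logit(P0[:, 1])
print("expected events:", round(float(P1[:, 1].sum()), 6))
print("logit shift per unit:", np.round(shift, 6))
```

In this binary case the printed logit shift is the same for every unit, which illustrates why the scaling acts like a change of the model intercept.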
The procedure can—and arguably should—also be used to calibrate the statistical models for the specific region and against potential biases within the projection phase of the microsimulation. For this, the relation between 
 and 
 may be used to compute adjustment terms for future transitions when no historical values are known any more. Let 
 describe the difference of the model intercept between 
 and 
, and let 
 be any single unit 
i of the risk population where 
 [
17].
      
Due to the equal scaling fractions for all units 
i, the adjustment term 
 is equal for the entire risk population within a particular region. Assuming that the required adjustment term for projected transitions, when no historical aggregate data is available, is close to the adjustment within the period when benchmark data was available, the term 
 is a reasonable approximation to adjust the model for the projection phase (ceteris paribus assumption). More generally, the adjustment term 
 for the last simulation year in which historic benchmark values are still present is added to the log odds of the probabilities estimated by the model for all projection periods. An equivalent implementation resulting in the same adjustment is portrayed in Equation (
5) [
14,
17].
      
Adjusting the model intercept in the projection period with a fixed adjustment value leads to trajectories of the affected transition numbers that are mostly parallel to those of the unadjusted model. The adjustment can be interpreted as a linear upward or downward shift of the probabilities on the log-odds scale [17].
The intercept-adjustment procedure may also be denoted as an adjustment model of its own with the logit transformed (unadjusted) model probabilities as offset, as laid out in Equation (
6). The target probabilities 
 used as the endogenous outcome here may be the 
aligned probabilities from the Bi-Proportional-Scaling procedure, in which case this approach is equivalent to that denoted in Equation (
5), but could also be some other arbitrary target.
      
If totals or proportions are known for population subgroups within a region, for example, nationality groups, this adjustment can be made for each subgroup separately. In that case, 
 is augmented by the group-specific proportions as denoted in Equation (
7).
      
If totals and distributions (for example, across age years) are available, models can be updated to achieve weak, moderate or strong calibration, depending on the information available. In cases where only  (or , respectively) is observed, but the size of the risk population  is unknown, it may be estimated by the size of the risk population within the simulation . This is typically the case in practice, which may insert additional uncertainty or slight biases into the adjustment, depending on the simulation method.
One easily implementable way to achieve weak calibration for binary outcomes, which already dates back to Cox [
29], is to regress the target probabilities on the predicted probabilities in a logistic regression model. Note that, since we only calculate the overall slope, the ordering of probabilities remains unchanged, and all probabilities shift in the same direction. This may be problematic for applications where shapes are known to be different within the subregions, which is the case, for example, regarding fertility rates across ages. In contrast to the intercept adjustment, the model probabilities are not offset in the model so that a slope coefficient 
 is estimated. This can be estimated by a standard logistic regression with the logit transformed unadjusted model probabilities as a covariate and the observed proportion as the dependent variable.
      
Rather than fitting an overall slope, the coefficients may be partly or fully re-estimated in a model revision or updating if information on covariates is available on the regional level, giving the coefficient vector 
 for the covariate matrix 
. Thus, we estimate
      
Unlike in Equation (
8), the predicted probabilities are used as an offset again, so a shift to the target proportion is modelled, taking the model prediction as given. This gives a moderately calibrated model with updated probabilities for the specified covariates. If all covariates from the original model are included in the recalibration, this may also lead to a strongly calibrated model for each possible subgroup. In practice, this is usually not possible. If the slope coefficients 
 become zero, the intercept term 
 is equal to 
, i.e., in this case Equation (
9) is equivalent to Equation (
6). The latter is, therefore, a special case of the former.
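A recalibration of this kind can be sketched as a logistic regression in which the logit of the unadjusted model probabilities enters as an offset and only an intercept and a few covariate coefficients are re-estimated. Since the exact form of Equation (9) is not reproduced here, the parametrisation below (intercept plus a quadratic in rescaled age) is an assumption for illustration; `fit_offset_logistic` is a hypothetical helper and all numbers are made up.

```python
import numpy as np


def logit(p):
    return np.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def fit_offset_logistic(Z, y, offset, weights, n_iter=50):
    """Newton-Raphson for logit(E[y]) = offset + Z @ gamma (y may be a proportion)."""
    gamma = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        mu = expit(offset + Z @ gamma)
        score = Z.T @ (weights * (y - mu))
        hessian = Z.T @ (Z * (weights * mu * (1.0 - mu))[:, None])
        step = np.linalg.solve(hessian, score)
        gamma = gamma + step
        if np.max(np.abs(step)) < 1e-12:
            break
    return gamma


# Hypothetical single-year age schedule predicted by a national model.
age = np.arange(15, 50)
p_model = 0.12 * np.exp(-0.5 * ((age - 30) / 6.0) ** 2) + 0.001
n_risk = np.full(age.size, 800.0)  # assumed risk population per single age

# Hypothetical regional target: later and slightly higher fertility.
z = (age - 30) / 10.0
Z = np.column_stack([np.ones_like(z), z, z**2])
true_gamma = np.array([0.2, 0.5, -0.1])
p_target = expit(logit(p_model) + Z @ true_gamma)

gamma = fit_offset_logistic(Z, p_target, logit(p_model), n_risk)
p_updated = expit(logit(p_model) + Z @ gamma)
print("estimated coefficients:", np.round(gamma, 4))
```

If the two age coefficients were fixed at zero, only the intercept would be estimated and the sketch would reduce to a pure intercept adjustment.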
Using the derived model coefficients from the generalised form of Equation (
9), one may estimate adjusted transition probabilities 
. The expected number of transitions based on them are similar to the group-specific known values 
 with error term 
. In contrast, the error term is nearly zero over all groups 
g. Thus, the expected number of transitions is almost equal to the known global number of status changes 
, as is the case with 
calibration in the large via logit scaling across the entire risk population, as defined in Equations (
1) and (
2) [
17].
      
The adopted adjustment model in this paper builds on the outlined recalibration in Equation (
9) and is fully described in 
Section 5.
  4. Disaggregation Methods for Grouped Demographic Rates
For the regionalisation by model recalibration, known totals or proportions for population subgroups are necessary at the geographical level of interest. For instance, for demographic rates, we are usually interested in age × sex groups, ideally partitioned by other covariates like nationality. However, due to disclosure reasons and lack of details within registers, such marginal distributions are often only available in aggregates. For instance, marginal distributions for births from German Official Statistics sources [15] are only available for the scientific community in age groups of 5 years by German or non-German nationality with an open-ended group starting at age 40 (see Table 1). For deaths, age groups, especially for younger ages, are even broader. Thus, probabilities may only be calculated for the age group as a whole, even though we want single-year ages in our transition model. Furthermore, class widths and the level of protection may differ regionally, depending on the type of data, the size of the risk population and the occurrence of the corresponding events within each subgroup.
Probabilities for demographic events obtained by a direct estimate tend to be unreliable for small areas, especially when the events are rare, like births or deaths. For instance, one would obtain death probabilities of zero for a particular age group if no death occurs within a (single-year) age group for a small area, which is an implausible assumption for the calibration and projection. In such instances, disaggregation and smoothing methods may be of interest to obtain more stable probabilities for single-year age groups. This may improve model recalibration results by stabilising the target distribution. Various disaggregation and smoothing methods have been established for demographic rates and probabilities, specifically for small areas (see, for example, Wilson [33] for mortality rates). Depending on the events of interest, different specialised disaggregation methods are applied.
One group of proposals comprises relational methods that relate local transition fractions to a reference age schedule (for example, from the national level or a forerunner region). Well over a decade ago, de Beer [34] introduced this idea as the tool for projecting age patterns using linear splines (TOPALS). By borrowing and modifying the shape of a reference age schedule, TOPALS models regionalised smooth single-age transition fractions based on grouped target values [34]. TOPALS is a flexible method, initially developed for projecting fertility age schedules, which can be adapted for a wide variety of demographic distribution modelling such as mortality or migration [35,36,37,38].
For abridged target benchmark data, Grigoriev et al. [39] propose a disaggregation method for fertility rates based on quadratic optimisation, which provides smooth rates across single-year ages while ensuring that the expected totals are maintained. Their QO-Splitter algorithm satisfies several important criteria for smoothing abridged age schedules, namely a smoothed shape of the estimated curve across the grouped input information, non-negativity of the curve, even for steep drops between groups, and equivalent density distributions between the abridged input and smoothed output curves. Furthermore, useful assumptions can be included in the method, such as tail constraints of the curve.
Grigoriev et al. [
39] describe the single-year age distribution within each group through Equation (
11). As before, 
 denotes the known number of transitions from state 
h to 
k within group 
g, 
 the lower and 
 the upper age bound of the group. For example, looking at 
Table 1, for group 
, the lower age bound would be 
 and the upper age bound would be 
. 
 denotes the size of the risk population for a single age year within group 
g, and 
 and 
 the (estimated) age-specific transition rate at age index 
j.
      
The algorithm proposed by Grigoriev et al. [
39] and described in detail by Michalski et al. [
40] derives a solution for 
 by minimising Equation (
11) under constraints that transition rates must be positive and the number of estimated transitions across all single age years of a group must be equal to the known number of transitions within that group. In our application for fertility, as suggested by Grigoriev et al. [
39], we fix rates for 14- and 50-year-olds to zero, so the estimated curve ends smoothly at the upper and lower tails for the risk population in fertile age 15–49.
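The general idea of splitting grouped counts into smooth single-year rates can be illustrated with a simplified stand-in. The sketch below is not the QO-Splitter: it assumes a second-difference roughness penalty, which may differ from the objective in Equation (11), it solves the equality-constrained quadratic programme directly through its KKT system, and it omits the non-negativity constraint, so negative rates are possible for other inputs. The age groups follow the five-year grouping described above with the open-ended group truncated at age 49; the birth counts and exposures are hypothetical.

```python
import numpy as np

ages = np.arange(14, 51)  # ages 14..50; the two tail ages are fixed at zero
groups = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 49)]
births = np.array([5.0, 30.0, 60.0, 70.0, 35.0, 10.0])  # hypothetical grouped counts
exposure = np.full(ages.size, 100.0)  # hypothetical women per single age


def split_grouped_rates(ages, groups, births, exposure):
    """Smooth single-year rates that reproduce the grouped birth counts."""
    n = ages.size
    D = np.diff(np.eye(n), n=2, axis=0)  # second-difference operator
    Q = 2.0 * D.T @ D                    # Hessian of the roughness penalty
    rows, rhs = [], []
    for (lo, hi), b in zip(groups, births):
        # Expected births in the group must equal the known group count.
        rows.append(np.where((ages >= lo) & (ages <= hi), exposure, 0.0))
        rhs.append(b)
    for tail_age in (ages[0], ages[-1]):
        # Tail constraint: rate fixed to zero at ages 14 and 50.
        rows.append((ages == tail_age).astype(float))
        rhs.append(0.0)
    A, b = np.array(rows), np.array(rhs)
    m = A.shape[0]
    kkt = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    solution = np.linalg.solve(kkt, np.concatenate([np.zeros(n), b]))
    return solution[:n]


rates = split_grouped_rates(ages, groups, births, exposure)
group_totals = np.array(
    [(exposure * rates)[(ages >= lo) & (ages <= hi)].sum() for lo, hi in groups]
)
print("reproduced group totals:", np.round(group_totals, 6))
```

A full implementation would add the inequality constraints on the rates, for which a dedicated quadratic programming solver is needed.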
Figure 2 shows the disaggregation of abridged fertility rates via the QO-Splitter for Trier between 2011 and 2013 for both German and non-German women as well as the observed single-year fertility rates for the same time span, taken from register information. As can be clearly observed, the method smooths the observed aggregated values closely towards the (typically unknown) true single-year values. Particularly for rather small risk populations like the non-German females on the right side of Figure 2, disaggregated abridged age schedules tend to be considerably more stable than observed single-age-year schedules, even when the latter are pooled across multiple years.
In the context of this article, we estimate birth rates for females in our simulation population based on a stochastic model estimated on national survey data. These probabilities should then be regionalised according to abridged regional observed values. While demographers have developed many schedule projection methods during the past few decades, many of them, including TOPALS, pursue slightly different goals and use substantially less input information than the stochastic models used within our microsimulation framework. The disaggregation method by Grigoriev et al. [39] is a fitting candidate for our application, as their algorithm delivers satisfactory smoothed age schedules, which may be used as target probabilities as shown in Equation (9), and is both relatively fast and efficient, such that it may run within the microsimulation without increasing the computational burden substantially.
  5. Simulation Study
We conduct an MC simulation of birth events to evaluate the performance of possible model recalibration strategies and the potential use of disaggregation methods before recalibration. For this, we estimate a logistic regression model to predict births within the birth module of MikroSim based on national survey data. The birth model is then calibrated to marginal birth distributions for Germans and non-Germans, available for all German districts in five-year age groups. All simulations are conducted for the city of Trier, for which a subset of anonymised register microdata covering 2011 to 2022 has been provided to the authors to validate the results. The data include personal information such as age, sex, citizenship and marital status as well as estimated household constellations for each dwelling [
41]. Single ages and citizenship of the mothers are available for all births. Additionally, the registers provide information about the size of the risk population by single age year and citizenship used to derive the birth event probabilities. Using the registers rather than the published grouped data from official statistics allows a more exact evaluation due to the availability of single-year data and the possibility of using the true rather than estimated structure and size of the risk population.
The simulation set-up is thus as follows: First, the Trier population is projected from 2011 to 2022 using the MikroSim model [
23]. For each simulated year, mothers are selected based on observed proportions from the registers with relative risks from the unadjusted birth model via an MC experiment within each single-year age–citizenship group. The risk population consists of around 25,000–30,000 women of fertile age, of whom around 10% possess non-German citizenship. For the analysis, we export a cross-section of all women of fertile age for all simulated years.
In the second step, adjustment models are estimated for the exported microdata for 2011–2013 based on the unadjusted model prediction and the register distribution.
Lastly, the adjustment models are applied from 2014 onwards to update the predictions obtained from the unadjusted model. This allows us to analyse how each method would have performed had it been applied in practice.
All three steps are repeated for each simulation run. To reduce the impact of MC uncertainty on the evaluation, introduced both by the random selection of mothers and the other modules, 100 runs with different seeds were conducted. The register population of 2011 served as the base population to which the required covariates like education level and employment status not contained in the registers were synthetically added based on the microcensus 2011.
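The selection of mothers within one age–citizenship group can be illustrated as follows. This is only one possible reading of the step, assumed here for illustration: the unadjusted model probabilities are rescaled so that their group mean equals the proportion observed in the registers (keeping the relative risks), and births are then drawn as Bernoulli events with a seed that differs between runs. The function names and the capping at 1 are assumptions, not part of MikroSim.

```python
import random


def scale_to_group_rate(model_probs, observed_rate):
    """Rescale model probabilities so that their mean matches the observed rate.

    The ratio between any two individuals' probabilities (their relative risk)
    is preserved unless a value has to be capped at 1.
    """
    mean_p = sum(model_probs) / len(model_probs)
    factor = observed_rate / mean_p
    return [min(1.0, p * factor) for p in model_probs]


def draw_births(probs, seed):
    """Draw one Bernoulli birth event per woman using a run-specific seed."""
    rng = random.Random(seed)
    return [rng.random() < p for p in probs]


# Toy group of three women with unadjusted probabilities and a register rate of 3%.
scaled = scale_to_group_rate([0.02, 0.04, 0.06], observed_rate=0.03)
run_a = draw_births(scaled, seed=1)
run_b = draw_births(scaled, seed=1)  # identical seed, identical draws
```

Repeating the draw with different seeds and averaging the evaluation measures over the runs reduces the influence of MC noise on the comparison.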
  5.1. Simulation Scenarios
Six scenarios are simulated (see 
Table 2). In the first scenario (S1), no adjustments are made to the model. This is assumed to perform worst since no adjustment for local differences from the national model is considered. In the second scenario (S2), only the model’s expectation is adjusted. This only manipulates the intercept of the estimated model, thus changing the level but not the shape of the distribution. In the third scenario (S3), this adjustment is made for each citizenship group separately.
The other scenarios adapt the model prediction by changing the level and shape by model recalibration to marginal distributions. In the fourth scenario (S4), the model is calibrated to the available five-year and ten-year age groups by German or non-German citizenship (see 
Table 1), which is published for all German districts by the German Federal Statistical Office. In the fifth scenario (S5), aggregate information is first disaggregated using the QO-Splitter described in 
Section 3. Thus, the model is calibrated by citizenship to estimated regional single-year ages. Lastly, the sixth scenario (S6) adjusts the model based on single-year values obtained from registers. This last scenario is the ideal case; in practice, such access to registers is usually impossible due to a lack of regional data or legal constraints. Note that the register rates have not been additionally smoothed, while the estimated disaggregation provides smooth estimated rates across single-year ages.
  5.2. Estimation of the National Birth Model
The initial, unadjusted birth model used within each scenario for the initial birth probability is estimated nationally based on the scientific use file of the German microcensus from 2012 to 2015, a yearly 0.7% sample of the German population with legally required participation. Starting in 2012, these waves are linkable to a panel by creating a person and household identifier based on the federal state, the number of the selection district in the microcensus, the household number within the selection district and the person number within the selection district [
42]. Thus, transitions from one year to another can be estimated. To emulate the most realistic setting, transitions would ideally be estimated on years before the projection period. However, only the waves from 2012–2015, 2016–2019 and those starting in 2020 are linkable.
Since the birth module is at the beginning of the module chain (see 
Figure 1), most other characteristics have not yet been updated to the following year. Therefore, for the estimation, all variables except the occurrence of a birth event and age variables must be taken from the previous year. For individuals who were surveyed in the previous year, this is straightforward. Because the microcensus is a dwelling panel without a follow-up for moving individuals, an (im)mobility bias may occur if only individuals who participated in two consecutive surveys are modelled. This is especially relevant if moving is correlated with the outcome of interest, which is highly plausible since expecting women or women who recently gave birth may seek a change in their living location. Indeed, in the waves analysed, fertility rates were slightly higher for women who entered the panel by moving into a dwelling. To reduce this moving bias, lagged values that cannot be derived from the cross-sectional survey, namely the partnership status (single, partnered or married), the employment status and, to some extent, the education level and student status, were multiply imputed for in-movers. Out-movers and regularly observed units served as donors, and information on whether a person moved between two waves was used in the imputation process. Each year, around 11% of the respondents entered the survey through a migration movement. Units that entered the survey by a planned rotation were not included in the model estimation for their first interview, as it can be plausibly assumed that their participation is uncorrelated with the target outcome. Including them would also raise the share of units with imputed lagged characteristics by, on average, 25 percentage points, as about a quarter of districts are rotated in and out each year.
Five imputed data sets were created using the 
mice (Version 3.16.0) [
43] package in R (Version 4.1.2) [
44] using classification and regression trees. Apart from all variables used in the model, information on immigration (immigrated and years since immigration), care, work (never worked, branch), pensioner status and having moved into the dwelling in the last 12 months are used for the imputation. After 15 iterations, no convergence problems could be detected.
Limiting the survey to women between the fertile ages of 15 and 49 and to the years 2013 to 2015 (since 2012 was only used to produce the lagged variables) resulted in over 215,000 observations, among which around 9,000 birth events occurred. Women are flagged as having given birth if there is at least one newborn in the household whose recorded mother identifier matches the woman's person identifier. Note that this will underestimate the actual birth probabilities since (1) we cannot account for newborns who died before the survey was conducted, and (2) we cannot consider mothers who did not survive between the birth event and the survey. Further, this method cannot account for adoptions since the identifier of the mother is not necessarily linked to biological mothers. However, previous research found this bias to be negligible since the microcensus estimates match the birth registers well, in general [
45].
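The flagging rule for birth events can be sketched on a single household record. The field names (`person_id`, `age`, `mother_id`) and the definition of a newborn as a household member aged 0 are assumptions made for illustration; the actual microcensus variables differ.

```python
def flag_births(household):
    """Return a dict person_id -> True/False indicating a flagged birth event.

    household: list of dicts with the hypothetical keys
               'person_id', 'age' and 'mother_id' (None if no mother is recorded).
    A person is flagged if at least one newborn (assumed: age 0) in the same
    household records that person as mother.
    """
    mothers_of_newborns = {
        p["mother_id"]
        for p in household
        if p["age"] == 0 and p["mother_id"] is not None
    }
    return {p["person_id"]: p["person_id"] in mothers_of_newborns for p in household}


# Toy household: a mother (id 1), her partner (id 2), an older child and a newborn.
toy_household = [
    {"person_id": 1, "age": 31, "mother_id": None},
    {"person_id": 2, "age": 33, "mother_id": None},
    {"person_id": 3, "age": 4, "mother_id": 1},
    {"person_id": 4, "age": 0, "mother_id": 1},
]
flags = flag_births(toy_household)
```

As noted above, a newborn who no longer lives in the household at the time of the survey produces no flag, which is the source of the slight underestimation.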
A Generalised Additive Model with a logit specification is estimated using the 
mgcv (Version 1.9-1) [
46] package in R to account for non-linearities in the covariates. The model uses past employment status (working, unemployed and inactive) and the lagged partnership status (single, partnered and married) as parametric predictors. Age and its interactions with lagged citizenship (German, EU, Non-EU), lagged education level (low, middle, high) and student status, as well as the years since the last birth (conditional on previous birth) and the lagged number of children are used as smooth terms with cubic regression splines. All factor interactions with age are fit as a smooth deviation from the main effect with a difference smooth to the reference level of the factor. Pooling of the models based on the five imputed data sets is carried out according to Rubin’s rule [
47] to account for additional uncertainty introduced by the imputation of missing values. Pooled coefficients are displayed in 
Table 3. Pooled smooth terms are shown in 
Figure 3. Note that the smooths denote the partial effect on the logit scale with the other variables fixed at 0 or the respective reference level (employed, single and non-student). Overall, the model explains about 15% of the total deviance for birth events.
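For a single scalar coefficient, pooling over the imputed data sets according to Rubin's rules can be written in a few lines. The sketch below covers only this scalar case under the standard formulas (pooled estimate as the mean, total variance as within-imputation variance plus an inflated between-imputation variance); the function name is an assumption, and pooling of smooth terms requires more than what is shown here.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient over m imputed data sets with Rubin's rules.

    estimates: the m point estimates of the coefficient
    variances: the m squared standard errors of the coefficient
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Toy example with three imputations.
pooled_estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what carries the additional uncertainty due to the imputed lagged characteristics into the pooled standard errors.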
The probability of a birth event being a multiple birth is estimated based on the German microcensus 2012–2015 for all mothers based solely on age. The estimated probabilities are shown in 
Figure A1 in the 
Appendix A. Within the birth module, multiple births are executed after the modelled birth event, conditional on the birth event occurring. Due to their rarity, only twin births are considered rather than triplets or higher-order births.
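The two-stage logic of the birth module for one woman can be sketched as follows: a birth event is drawn first and, only if it occurs, a second draw decides whether it is a twin birth. The function names are illustrative assumptions. Under this twin-only simplification, the expected number of children per woman is the birth probability times one plus the conditional twin probability.

```python
import random


def simulate_children(p_birth, p_twin, rng):
    """Return the number of children born to one woman in one period (0, 1 or 2)."""
    if rng.random() >= p_birth:
        return 0  # no birth event, so no multiple-birth draw is made
    return 2 if rng.random() < p_twin else 1


def expected_children(p_birth, p_twin):
    """Expected number of children when only single and twin births are possible."""
    return p_birth * (1 + p_twin)


rng = random.Random(2011)
draws = [simulate_children(0.10, 0.02, rng) for _ in range(1000)]
```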
  5.3. Estimation of the Adjustment Model
We formalise our adjustment model in Equation (
12), fitting the population characteristics matrix 
 according to the groups 
 against the individual level expected probability 
, augmented from the group-specific transition fraction 
 denoted in Equation (
13). Note that we drop the indicators 
h and 
k from our notation from here on, as we operate in a binary case (birth and no birth), which also does not describe a state transition in the proper sense of the term. The probability 
 denotes each individual’s probability of giving birth and 
 the conditional probability for giving birth to more than one child between the two periods 
t and 
, derived from the (still unadjusted) stochastic models explained in detail in 
Section 5.2.
        
        where 
 is a vector of model coefficients and 
 the according 
 covariate matrix. The logit transformation of the unadjusted transition probabilities 
 and the conditional probability for multiple birth events 
 of each individual are inserted into the model as offset, as described by Vergouwe et al. [
48], so that no slope coefficient is estimated for the initial probabilities. Thus, the model aims to model the difference between initial and target probabilities by a set of specified covariates.
It is important to note that the adjusted value 
, derived through the estimated parameter 
 from Equation (
12), is not a probability, but instead an estimated birth rate for any given individual 
i. It reflects the expected number of births within one period for the corresponding risk population and combines the probability of a birth event and the conditional probability of a birth event being a multiple birth. In practice, this has few consequences; however, since single and multiple births are modelled separately within our fertility module, 
 has to be separated into single and multiple birth event probabilities again after the adjustment step, as illustrated in Equation (
14).
        
This yields Equation (
15); the birth rate 
 augments the adjusted single birth probability, which in combination with the conditional probability for multiple births 
, approximately results in the expected number of births, observed from the benchmark aggregate data.
        
In the simplest adjustment scenario case (S2), 
 is defined as an 
 matrix of ones. Thus, the augmented probability 
 is regressed on an intercept only and offset of the unadjusted model probabilities. Similarly, the adjustment model distinguishing German and non-German mothers (S3) can be estimated by a binomial model with intercept and additional covariate accounting for the citizenship groups. In this case, 
 is again an 
 matrix, with the intercept column (for the reference citizenship category) and the dummy indicator for the non-German citizenship of the individual, while 
 is a vector of length two. The shape-adjusted models (S4–S6) are estimated with a cubic regression spline for the age variable interacted for each citizenship with a dropped intercept, or more generally with all covariates that discriminate between the groups 
g. This results in a smooth adjustment across the age and covariate variables even when the totals or proportions are only provided in 5-year age groups. In principle, the age groups may also be used as a factor in the model instead of using a cubic regression spline across the true age. However, this leads to cut-offs between age groups after the adjustment, which is usually undesirable and may be less stable, especially for small age groups, since no information is shared. As seen in Equation (
16), the cubic regression spline 
s is constructed from base functions 
 with maximum complexity 
Z [
49]. For an in-depth description of splines and Generalised Additive Models, we refer to Wood [
46].
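The simplest case, the intercept-only adjustment (S2), can be illustrated without any modelling library. For a binomial model with logit link, an intercept and the logit of the unadjusted probabilities as offset, the maximum likelihood estimate of the intercept is the shift at which the sum of the adjusted probabilities equals the number of observed events. The sketch below finds this shift by bisection. It is an illustration under these assumptions, not the authors' implementation, and all function names are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


def intercept_shift(probs, target_births, lo=-20.0, hi=20.0, iters=100):
    """Find delta such that sum(expit(logit(p_i) + delta)) equals target_births.

    Assumes 0 < p_i < 1 and 0 < target_births < len(probs).
    """
    offsets = [logit(p) for p in probs]
    for _ in range(iters):
        mid = (lo + hi) / 2
        expected = sum(expit(o + mid) for o in offsets)
        if expected < target_births:
            lo = mid  # shift is too small, expected births still below target
        else:
            hi = mid
    return (lo + hi) / 2


def adjust(probs, delta):
    """Apply the shift on the logit scale to every individual probability."""
    return [expit(logit(p) + delta) for p in probs]


# Toy example: the model expects 0.35 births, the regional total implies 0.28.
unadjusted = [0.05, 0.10, 0.20]
delta = intercept_shift(unadjusted, target_births=0.28)
adjusted = adjust(unadjusted, delta)
```

Because every individual receives the same shift, the ranking of the individual probabilities is unchanged; only the level moves. The shape-adjusting scenarios replace the single shift by a smooth function of age per citizenship group.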
        
We may also denote this as a linear combination, where
 denotes the model coefficients for the spline’s base functions and 
 the according spline covariates as laid out in Equation (
17).
        
Adjustment models are fit on the pooled regionally observed proportions for the time span between 2011 and 2013 using the predicted probabilities for the same year. The resulting model is used within the projection period from 2014 until 2022, which we use for evaluation.
The model recalibration procedure is visualised in 
Figure 4 for the disaggregated marginal distribution (S5) and the recalibration phase (2011–2013). Expected births for Germans and non-Germans are generally too high if the model is applied to Trier. For instance, from 2011 until 2013, the model predicted around 20% more births for Trier than observed, clearly indicating the necessity of model regionalisation. As can be seen, the nationally estimated model on average (blue line) predicts higher birth probabilities for women under 30 than what is known from the available regional marginal distribution (black line). For non-Germans, births are estimated slightly too low from age 30 to 49, while, for Germans, birth probabilities are slightly too high again after age 40. Thus, probabilities are adjusted downwards for women under 30 and Germans after 40, and upwards for Germans between 30 and 40 and non-Germans after 30. After the adjustment, the average probabilities (red lines) align with the known proportions from the birth registers for the age–citizenship groups. For Germans, the already-low probabilities for women over 40 are strongly adjusted downwards, making births even less likely. Additionally, the overall expectation is met in the adjusted model. The variability in the predicted probability, due to variations in the covariates in the population, is maintained, as indicated by the pre- and post-adjustment probabilities in the plot (blue dots and red crosses). Thus, the model still captures population heterogeneity after marginal recalibration for the calibration years.
Adjustments for all simulated methods are shown in 
Figure A2, 
Figure A3, 
Figure A4, 
Figure A5 and 
Figure A6 in the 
Appendix A. For the intercept adjustment methods (
Figure A2 and 
Figure A3), probabilities are simply shifted on the logit scale as described in 
Section 3. Thus, the shape of the predicted birth distribution remains unchanged while the expected level of birth is matched to the known totals. As can be seen, S5 and S6 produce a very similar predicted distribution despite S5 only calibrating to an estimated rather than the true distribution (
Figure A5 and 
Figure A6). However, adjustments differ for S5 and S6 for the last age group 40–50. S5 adjusts increasingly strongly downwards for Germans and slightly upwards for non-Germans, while S6 only slightly adjusts probabilities for Germans upwards and increasingly strongly downwards for non-Germans. This is related, on the one hand, to the forced downward trend induced by the disaggregation method, which enforces a zero probability just outside the maximum age, and, on the other hand, to the width of the last age group combined with very few births, which makes the adjustment unstable for higher ages. This tail problem also occurs for S4 (
Figure A4). Here, for both Germans and non-Germans, there is an upward trend in the adjustment effect. While the adjustment is still in line for ages before 40, notable differences from S5 and S6 occur afterwards. Namely, there is an increasingly strong adjustment upwards, resulting in implausibly increasing adjusted probabilities for the last age years in the group.
  5.4. Simulation Results
Two measures are applied to evaluate the quality of the model adjustments. We use the Hellinger distance $H$ to compare the discrete distribution $Q_{g'}$ of the expected fraction of births across each (sub-)group $g'$ with the according observed distribution $P_{g'}$ from the registers, as denoted in Equation (18). We compute this measure for each scenario to evaluate the fit of the probability distribution’s shape compared to the observed fertility rates. The Hellinger distance is defined in the interval $[0, 1]$, where values closer to zero indicate more similar distributions.

$$H(P_{g'}, Q_{g'}) = \frac{1}{\sqrt{2}} \sqrt{\sum_{g \in g'} \left( \sqrt{p_{g}} - \sqrt{q_{g}} \right)^{2}} \tag{18}$$

We denote by $g'$ the subgroup which only differentiates between citizenship groups, not between ages. Thus, $Q_{g'}$ is a vector of size 35 (between lower age 15 and upper age 49) across the citizenship group $g'$. Each value $q_{g}$ indicates the share of births expected to occur within the according age groups of $g$ with $g \in g'$. Likewise, $P_{g'}$ is the share of births $p_{g}$ within subgroup $g'$ that were observed within the according age groups.
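The Hellinger distance between two discrete distributions of births over single ages can be computed directly from its standard definition. The following sketch assumes both inputs are share vectors of equal length that each sum to one; the helper that converts counts into shares and all names are illustrative.

```python
import math


def to_shares(counts):
    """Convert birth counts per age into shares that sum to one."""
    total = sum(counts)
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


# Toy example with three ages instead of the 35 single ages from 15 to 49.
observed_shares = to_shares([10, 30, 20])
expected_shares = to_shares([15, 25, 20])
distance = hellinger(observed_shares, expected_shares)
```

Because only shares enter the computation, the measure is insensitive to the overall level of births and isolates differences in the shape of the age distribution.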
Secondly, the absolute difference $D$ of the expected number of births in each simulation scenario compared to the empirically observed number of births is used as a global measure of difference, describing the deviation in level between the simulation scenarios and the observed number of births from the registers. It is essentially the sum across all age groups $g$ of a subpopulation $g'$ of absolute deviations between the expected number of births $\hat{B}_{g}$ and the observed values $B_{g}$, as defined in Equation (19).

$$D_{g'} = \sum_{g \in g'} \left| \hat{B}_{g} - B_{g} \right| \tag{19}$$
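The level measure is a plain sum of absolute deviations over the ages of one subpopulation, as the following minimal sketch shows. The function name and the toy numbers are assumptions for illustration.

```python
def absolute_difference(expected_births, observed_births):
    """Sum of absolute deviations between expected and observed births per age."""
    if len(expected_births) != len(observed_births):
        raise ValueError("inputs must cover the same ages")
    return sum(abs(e - o) for e, o in zip(expected_births, observed_births))


# Toy example with three ages: deviations of 2, 2 and 0 births.
d_value = absolute_difference([10.0, 20.0, 5.0], [12.0, 18.0, 5.0])
```

Unlike the Hellinger distance, this measure reacts to errors in the overall number of births, so the two measures complement each other.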
        
The adjustment model fit on 2011–2013 for each scenario is applied for the simulated projection years 2014–2022. The resulting adjusted distributions for S5 based on the same simulation run as in 
Figure 4 are shown in 
Figure 5. As can be seen, the recalibrated model more closely matches the observed distribution of births. Especially for the German population until 2018, the resemblance is very close. For 2019–2022, the adjusted model strongly overestimates births for the age groups 30 to 35. This is especially due to the effect of the COVID-19 pandemic, which drastically lowered the birth proportions for Trier starting in 2020. Nevertheless, the disaggregated adjusted (S5) and the single-year adjusted (S6) distributions consistently outperform the other adjustment methods as measured by the Hellinger distance between the predicted and observed distribution (see 
Figure 6). The adjustment model, which directly calibrates to the age classes (S4), also consistently provides a better fit in shape than the unadjusted (S1) or expectation-adjusted probabilities (S2 and S3). Remarkably, the performance of S5 and S6 is almost identical, even though S6 uses the true single-year proportions and S5 only single-year estimates. Thus, a smooth disaggregation of the target values before calibration can improve the fit for a particular region, as seen by the difference between S4 and S5.
As further evident from 
Figure 6, for non-Germans, the 2014–2018 improvement is less strong than for Germans. Interestingly, for 2019–2022, the original model estimated the birth distribution of the non-German population more closely than the adjusted model. Especially for the years characterised by exceptional migration (2015 until 2017), the adjusted models are outperformed by the unadjusted and intercept-adjusted models. This may be because Trier is a university city, so the marginal birth distribution for non-Germans was strongly influenced by higher-educated university students, who experience birth events less often. Due to Trier being an initial admission centre for asylum seekers, the composition of the non-German population changed drastically during the projection after 2015. Furthermore, even after averaging over 3 years, rates for non-Germans were still relatively unstable due to small risk population sizes. Consequently, the models that were recalibrated more strongly are outperformed by the unadjusted or weakly recalibrated models for 2015–2017 as measured by the Hellinger distance. No clear trend is visible for the non-German population after 2017. However, calibration to 5-year age groups (S4) performs worst among the adjustment methods regarding the shape of the distribution.
Regarding the mean absolute difference between the predicted and observed births in the single years, shown in the lower part of 
Figure 6, S5 and S6, followed by S4, perform best for German women until the drop in births associated with the COVID-19 pandemic. The unadjusted model clearly performs worst for German citizens, leading to the largest absolute discrepancy between predicted and observed outcomes. Separately adjusting for Germans and non-Germans hardly improved the model predictions and only led to a negligible reduction in the absolute difference. Contrary to expectation, the unadjusted model is not the worst performer for the non-German population. Rather, the global intercept adjustment (S2) leads to the largest absolute differences on average. For non-Germans, the mean absolute differences do not differ greatly between the scenarios throughout the years.
Looking at the performance in the individual simulation runs, visualised in 
Figure A7 in the 
Appendix A, it is visible that both metrics are generally more variable for the non-German population. This is primarily due to the smaller risk population. Notably, the proposed adjustment models do not exhibit larger variability than the unadjusted or traditional mean-adjustment approaches. For the German population, S4–S6 outperformed S1–S3 in every simulation run as measured by the Hellinger distance. Up until the start of the COVID-19 pandemic in 2020, this is also true for the mean absolute difference. For non-Germans, performances vary strongly between the runs and years, with no method consistently outperforming others when considering the variance in the measures. Overall, the suggested model recalibration approach substantially improved simulation results for the majority of the population while not deteriorating outcomes for very small population subgroups.
  6. Discussion
Regionalised transition models are crucial to creating plausible trajectories in spatial dynamic microsimulation. However, due to various problems associated with estimating such a model for each region separately, namely insufficient sample sizes or the lack of any regional identifiers in surveys, transition models are usually fitted to national survey data. Since regional heterogeneity beyond the model covariates is ignored, applying the nationally estimated model directly may introduce regional biases, which directly influence simulated regional outcomes. For fertility, such regional variations are well known, with large differences between urban and rural areas. University cities, in particular, tend to differ very strongly from national schedules. In this paper, we demonstrated the applicability of model recalibration techniques for harmonising nationally estimated models with regionally observed outcomes while maintaining the desired model complexity by recalibration to marginal distributions and totals.
If only totals are available, adjusting the model expectation by updating the intercept coefficient can already reduce the model’s prediction error. Failing to regionalise the prediction model for the particular region analysed would lead to a strong over- or underestimation of fertility. This impacts simulated regional development directly through the population structure and via indirect effects. For example, if births are overestimated, the number of individuals in the labour force, particularly women, may be underestimated in the short run due to family and care work at home. In the long term, overestimating births may strongly inflate population size estimates when the newborns enter the fertile ages. Including information on the age distribution rather than totals alone drastically improved the model fit to the region of interest, leading to smaller differences in both the totals and the shape for the projection, even when the information is only available in broad age classes. Disaggregating age classes before the recalibration with a suitable method is even more beneficial. In this application, for instance, the disaggregation performs very much in line with true single-year information. Thus, we recommend smooth disaggregation before model recalibration when true single-year information is unavailable, which is usually the case in practice.
This paper also highlighted problems associated with adjusting to very small populations. In our application, fertility rates for non-German women were rather unstable due to the small size of the risk population, which reduced the effectiveness of the proposed model recalibration methods. This was aggravated by a change in population composition for this particular subgroup during the projection horizon. However, miscalibration was not as severe in the first place for non-Germans, leading to very few adjustments in all methods and higher variability of the measures. Thus, while the proposed model adjustments improved predictions for the German population, all methods performed about equally for the non-Germans.
Still, ways to stabilise recalibration and smoothing for very small domains must be explored. One way to further stabilise rates may be to pool rates for longer periods. For the sake of this simulation, only rates across three years were averaged to leave enough years for evaluation. However, in practice, more years can easily be included. Another option would be the usage of partial pooling within a joint rate smoothing or disaggregation for multiple regions before the recalibration. This approach would stabilise the smoothing and allow the inclusion of region-level information in the process. For instance, more information may be shared within rural or urban areas by including additional covariates on a regional level. This might include information on neighbouring regions where levels and distributions are likely spatially correlated. This may considerably improve the stability and accuracy for smaller regions and non-Germans, particularly. Another possibility that should be explored in further research is the inclusion of serial trends in the adjustment, which may capture convergence or divergence from a national model. Future research should also investigate the applicability of the proposed adjustments for different rates, such as mortality or migration. Finally, the applicability for multinomial applications, for example, employment, should be explored.