# Simulating histories within dynamic microsimulation models

1. Maastricht University / UNU-MERIT, The Netherlands
2. Rural Economy and Development Programme, Teagasc, Ireland
Research article
Cite this article as: J. Li, C. O’Donoghue; 2012; Simulating histories within dynamic microsimulation models; International Journal of Microsimulation; 5(1); 52-76. doi: 10.34196/ijm.00067

## Abstract

Constructing a base dataset is one of the most important elements in the dynamic microsimulation modelling. However, the access to a long historical panel is usually restricted for many reasons. This paper aims to develop a back simulation method that has the potential to generate a consistent synthetic history panel based on a typical household survey dataset with some complementary statistics. The model uses Living in Ireland (LU) household survey as an example to reconstruct the individual labour market trajectory since 1939. The overall results of the simulated panel have been proven sensible and consistent based on several validation tests. This method opens the possibility to further investigate into several fields of application such as life-cycle income analysis and pension reform evaluation, which typically requires the historical profile of individuals and has traditionally been difficult to perform.

## 1. Background

Many countries today are facing the prospect of rapid demographic change in the decades ahead and numerous research papers have been devoted to the analysis of the implications of an ageing population. A number of microsimulation models have been developed to study this issue linked with social security, retirement incomes, and pension reforms. In order to generate an accurate and stable projection for elderly earnings and pensions, it is essential to include some historical information in the dataset, a luxury that many modellers do not have.

Microsimulation models are usually categorised as either “static” or “dynamic”. Static models, e.g. EUROMOD (Mantovani et al., 2007), are mostly arithmetic models that evaluate the immediate distributional impact upon individuals/households of possible policy changes, whilst Dynamic models, e.g. DESTINIE, PENSIM, SESIM (Bardaji et al., 2003; Curry, 1996; Flood, 2007), extend the static model by allowing individuals to change their characteristics as a result of endogenous factors within a model (O’Donoghue, 2001). Dynamic microsimulation models in theory, could offer more insights than static models as they usually integrate long-term projections and behaviour simulations; however, they are costly to develop and require a baseline dataset that is both rich in variables and historical information. These demanding requirements unfortunately are rarely matched by the existing data availability (Harding, 2007) and therefore compromises are often made in order to make the simulation possible.

This paper investigates longitudinal data availability issues in microsimulation and proposes a viable alternative by simulating a plausible and consistent history using a typical household survey panel from Ireland. The following section discusses the base dataset issues in dynamic microsimulation models and the potential alternatives. Section 3 describes the dataset used in the back simulation and Section 4 explains the modelling procedure of the back simulation, and followed by a description of the alignment technique used in Section 5. The three sections following report the results of the back simulation together with the validations, for employment status, pension memberships and earnings, respectively. A conclusion and discussion are provided in the final section of this paper.

## 2. Base dataset issues in dynamic microsimulation

### 2.1 Base dataset selection

Base dataset selection is important for a microsimulation model as the quality of the input data determines the quality of the output, yet this is not an easy task, as hardly any micro datasets contain all the information required by a dynamic microsimulation model that can be used to project the whole population. The difficulties of picking a base dataset have been discussed in several papers (Cassells et al., 2006; Zaidi & Scott, 2001). Typically, a dynamic microsimulation model starts with one or several of the following types of dataset:

• Cross-sectional Household Survey Data

• Cross-sectional Administrative data

• Census Data

• Panel Household Survey Data

• Panel Administrative data

Panel data is generally preferred over crosssectional data as it records changes over time, a useful component in statistical modelling. Household survey data, e.g. EU-SILC data in EUROMOD (Figari & Sutherland, 2007), is frequently used as the basis of the base dataset, because it is rich in the number of variables of interests and offers information on the dynamics of behaviours. However, the time period in these survey datasets may be insufficient to provide certain historical information required for life-cycle modelling and analysis. Administrative data, although typically consisting of a more limited range of contextual variables, often provides a longer history and contains relatively high quality information for certain variables, e.g. tax, and in most cases, a higher number of observations. Some Scandinavian country models, e.g. the MOSART model in Norway (Fredriksen, 2003), are based on extensive and detailed register information. However, access to administrative records is typically fairly limited even from within a government.

For microsimulation models analysing the dynamics of elderly earnings or pensions, it is essential to have historical social economics variables that could be used to reconstruct the career trajectories of today’s elderly workers. This implies that an ideal dynamic microsimulation base dataset should contain the following information for each individual from birth:

• Demographic information, which contains age, education, marriage, birth of children, household (or tax unit) formation and dissolution.

• Employment trajectory information, which contains labour force participation records, historical earnings, types of job etc.

• Pension membership and entitlement information, which contains the record of various pension schemes participation (including state, occupational, and private pension).

To meet these requirements a long panel dataset containing rich demographic, employment, and pension data is required; unfortunately, a dataset that matches the above description is not readily available to most researchers. Certain models (e.g. DYNASIM, CORSIM) as a result, have experimented with alternative methods such as statistical matching and simulation.

The first method, statistical matching, involves filling in the missing information required from a different but comparable dataset compiled within a similar time frame. For example, a microsimulation model based on a household survey dataset may need some historical earning records that are only available from another income study dataset. Under such a circumstance, a matching method may be used to fill in the information gaps based on a statistical model using available social economic characteristics shared by both datasets. The DYNASIM model was one of the pioneers in this area and uses Current Population Survey (CPS) as the base dataset matched with social security earnings records from administrative data. In DYNASIM3 (Favreault, 2004), the statistical matching was undertaken between two survey datasets, namely, Survey of Income and Program Participation (SIPP) and Panel Study of Income Dynamics (PSID), while PENSIM2 matches Family Resource Survey (FRS) and British Household Panel Survey (BHPS) to the Lifelong Labour Market Database (LLMDB) to incorporate household contextual data (Emmerson et al., 2004). In addition, statistical matching also occurs between survey and census datasets, e.g. SAGE matches survey data to their base census sample data to obtain the additional information required (Evandrou, 2004).

The second method used is to generate a synthetic historical panel using information from the base dataset itself. Unfortunately, the methodology of simulating histories is not as widely discussed as simulating future characteristics in the microsimulation field. There are many challenges in attempting to “back-cast” or “back simulate” historical earnings (and other characteristics) earlier in life (Harding, 2007). The DYNANCAN model uses a limited back simulation technique by imputing the historical earning profiles between 1966 and 1969 with limited retrospective consistency (Morrison, 1997;1998), whilst the CORSIM model simulates part of the historical profile based on a historical crosssectional dataset, matching the model output to historical aggregate information such as fertility and mortality rates (Caldwell, 1997). Nevertheless, the existing methods only impute the history for a limited number of years and they usually suffer from the inconsistency issues. While certain variables match the historical data at cross-sectional level, the longitudinal consistencies are typically ignored and variables covered are not extensive enough to support life cycle analyses and pension simulations.

There are advantages and disadvantages to both methods described above. Statistical matching can be used when there are sufficient matching variables in a comparable dataset and this method has the desirable feature of having a “real-world” value, although the quality of matching may vary substantially depending on the quality and quantity of matching variables. In some cases, certain variables, e.g. historical earning records, may not exist in any dataset or access is restricted due to legal restrictions. If these variables are needed within the model, then the only option available is simulation. Synthetic simulation has the advantage of flexibility but longitudinal consistency may be an issue due to the limited information available.

### 2.2 LIAM and its base dataset

LIAM is a dynamic microsimulation model designed to evaluate potential reforms of the Irish pensions system and other policies in terms of changes to life-cycle incomes, particularly on old age income replacement rates, poverty and inequality measures (O’Donoghue et al., 2009). Given the nature of the model, it requires a pension module which is able to simulate:

• Life-cycle income distribution under a given pension system

• Public and/or private pension fund accumulation and dissipation over individuals’ life-cycles, under a given pension system and alternatives

• Effects of reforms of a given system on life-cycle income distribution, costs, and other redistributive measures

Since the model aims to evaluate the impact of policy change for those “at risk”, i.e. potential pensioners, an ideal historical panel would start in the year in which the oldest potential retiree in the dataset was born. However, one practical issue of alignment prevents generating meaningful values in the very early years when few individual exists in the dataset. As alignment is necessary to ensure the cross-sectional consistency with the historical values, it is necessary to keep a minimum number of the observations for each year that the alignment is applied. Therefore, the study sets the starting year of the historical panel in 1939, the year when the youngest retirees in 1994 (age 55) were born and the elder retirees in 1994 (age 75+) just entered the labour market for no more than a few years.

Furthermore, in order to simulate the potential pension income, it is necessary to include a set of important social economic variables, covering demography, employment, and pension information. These include:

• Demographic data includes gender, age, marital status, number of children and education attainment

• Employment data covers employment status (working or not), employment type (employee or self-employed), employment sector (public or private) and job income

• Pension data includes the pension contribution to occupational pension and private pension, the fund size of the defined-contribution (DC) pension, and the type of pension claimed after retirement

LIAM ideally, should select a base dataset which has a long social economic history, i.e. a panel that looks like an extended version of the US PSID or certain administrative datasets available in Sweden/UK etc. Unfortunately, long historical panels do not exist for Ireland, neither in the form of survey datasets nor as administrative records, yet this missing information is crucial to life cycle modelling and analysis, as some forward simulation components, such as pension eligibilities, are built on the individual labour trajectory. Given these constraints, simulation is therefore the best tool for obtaining histories for LIAM. Compared with earlier works on the historical recreation in microsimulation model (e.g. CORSIM and DYNACAN), the model could reconstruct a much longer time period and focuses on the consistencies at both cross-sectional and longitudinal levels.

This paper, as part of the work developing LIAM, proposes a microsimulation algorithm which could generate a plausible, consistent and comprehensive historical panel based on a household survey dataset by extrapolating the retrospective variables concerning past employment history. Variables such as years of working and pension eligibilities are registered in the household survey and could be used to model a plausible working history for all individuals when taken together with information generated from some external statistics.

## 3. Data

This back simulation module is primarily modelled based on the 1994-2001 Living in Ireland Survey (LII) dataset along with some external statistics extracted from the pension questionnaire section of the 2002 Quarterly Household National Survey (QHNS), and the Irish census reports since 1930s.

The LII survey constitutes the Irish component of the European Community Household Panel (ECHP) and is a representative household panel survey that was conducted yearly on the Irish population between 1994 and 2001 (eight waves). The data contains panel information on demographics, employment, and other social economic characteristics for around 3,500 households in each wave. In 2000, an additional 1,500 households were brought into the dataset to compensate for the attrition since 1994. Table 1 lists some descriptive information of key demographic and employment variables in the LII dataset.

Table 1

Besides the LII survey, the back simulation module also uses the information gathered from the QHNS survey, which is a nationwide household survey designed to produce quarterly labour force estimates in Ireland since 1997. In the first quarter of 2002, a special module dedicated to pension savings was added to the survey. QHNS is used in the back simulation mostly for alignments when estimating occupational and private pensions. Census reports prior to 1994 were used in the back simulation module in order to align certain important history aggregates.

### 3.1 Demographic base data

Demographic data is the foundation of a population-based simulation. The back simulation module extracts demographic information from the LII survey and surmises some data, for example the time of birth of each individual is calculated from age information in the survey and marriage status is derived from the reported age of marriage. In the case where a missing value is spotted, average data is used, e.g. it is assumed that an individual would get married at 25, the average age of marriage in Ireland in the1990s. Divorce and remarriage is not simulated for in the current history panel for complexity reasons, and education level is assumed constant once an individual has left the reported schooling period.

In the current version of the back simulation, cohorts that died before 1994 were not simulated, as the primary goal for the back simulation is to complete the career trajectory for the potential living pensioners. This simplification helps to reduce uncertainties within the history panel and lowers the modelling difficulties by avoiding potential complex interactions and consistency concerns of the synthetic population. One drawback of this simplification is that it raises the bar for alignment, as the simulated data will not be able to compare with the historical aggregate indicators. Fortunately, the Irish census data contains detailed information for each age gender subgroup. As a result, all alignments described in this paper can be performed at cohort level to ensure the consistency between simulated values and historical census data.

### 3.2 Exploiting the retrospective variable

The LIIretrospective variables provide information for survey contains certain retrospective questions similar to many household survey datasets. These questions are helpful in history recreation as they can pinpoint the time when certain events happened in an individual’s history, e.g. birth, marriage etc. In the LII survey, retrospective variables provide information for

• the year when an individual started their current job and their job’s duration

• the year when an individual first entered the labour market

• the number of years spent in full-time education, employment (including selfemployment and farming), unemployment (seeking a job), illness or disability,

• home caring or retirement duration since the age of 10

• the duration of unemployment, if currently unemployed

While the information collected is highly relevant for back simulation, these retrospective variables may not contain high quality data. Missing values and inconsistencies are sometimes spotted across differing years of the LII survey, for example the declared number of years in school varying without engaging in education. This type of error could hamper the quality of the simulated histories severely, as there would be no reliable reference to constrain the shape of the career trajectory.

In order to mitigate the impact of lower quality data, it is useful to correct obvious mistakes in the data collection and impute the missing values to expand the base on the dataset. Adjustments are applied to enforce the consistencies with key variables (e.g. age) and avoid basic mistakes such as assigning college degrees to children. Since most retrospective values (more than 90% of the individuals) are not updated once collected in the base year, it is necessary to recalibrate the values from second wave onwards to ensure the consistencies with the recorded labour market activity in the previous year. For example, if the individual worked full time in 1994, the accumulated years of employment should increase by one in 1995. Same principle applies to other variables like years of education etc.

Table 2 presents an overall summary on the outcome of the retrospective variable adjustments and highlights the difference between the original data and the adjusted data in the base year (1994) when the retrospective data was first collected. This adjustment increases the usability of these variables by improving its internal consistency. The table does not include missing values, which are imputed in the next step.

Table 2

Missing values are imputed via a set of ordinary least squared (OLS) equations, where the number of years spent in certain employment conditions are estimated using a vector for personal characteristics. Imputed values are checked for consistencies with other variables like age, education and adjusted in case of conflicts.

Table 3 describes what variables are imputed using this specification and what personal characteristics are included in the vector. Models are separately estimated for males and females. Appendix 2 and 3 report the estimates obtained in the imputation equations.

Table 3

## 4. Methodology I: modelling the histories

This section describes the methodology used in the back simulation. While a back simulation model may use some historical information to refine the outcome, it also has a higher requirement on the output quality. Since the main purpose of this back simulation model is to recreate each individual career trajectory, it demands a high accuracy of the model prediction as the pension eligibility is highly sensitive to the past employment status. In a forward microsimulation model, one would typically use some kind of aggregated values (e.g. mean, standard deviation of the distribution) to evaluate the quality of the simulation. The predicted value does not need to be correct at the individual level as long as the distributions are reasonable. Nonetheless, in the case of backward simulation, one needs to ensure that the values are sensible at the individual level due to the longitudinal consistency requirement while maintaining the reasonable distribution shape at each crosssectional level.

Compared with a forward dynamic microsimulation model, a back simulation model could be more complex as it is designed to exploit more information both from historical values and retrospective information. In order to create a panel dataset that is as close as possible to the real history, three methodologies are used in back simulations:

• Deterministic simulation

• Semi-stochastic simulation

• Stochastic simulation

The deterministic simulation generates the part of the history that is directly determined by retrospective variables and ensures that the generated history is perfectly consistent with the reported values. For instance, if an individual reports starting work at age 20, it is safe to assume that this individual was in work that particular year. The variables used in deterministic simulation include the age when an individual begins to work, the year in which an individual quit their previous job, the number of years spent in their current employment status (e.g. length of current job, unemployment etc.) and the number of years spent in each employment status.

Semi-stochastic simulation recreates certain historical events from retrospective variables in conjunction with some reasonable assumptions. While the deterministic simulation pinpoints the timing of some major events in history, it gives only incomplete information regarding employment trajectory. Semi-stochastic simulation is still largely based on the reported retrospective variables but might involve some assumptions. For instance, since most women take maternity leave when giving birth, it is reasonable to simulate a career profile interrupted in the year of childbirth with a high probability. Another example would be “back to work” social welfare benefit, which usually implies a period of unemployment for the years preceding claiming the benefit.

Stochastic simulation is designed to fill the parts of the history that cannot be inferred from deterministic and semi-stochastic processes and it recreates the history through the predictions of estimated econometric models with random components. This method is similar to a regular dynamic microsimulation model, such as DYNAMOD2 and SAGE (King et al., 1999; Zaidi, 2004), with the difference being that the back simulation model ages the population in a reversed direction and is subject to a much more restricted alignment procedure for consistency reasons. An outline of the simulation steps are provided in the Appendix 1.

### 4.1 Discrete variable simulation

Discrete variables include both binary variables, such as pension membership and categorical variables, such as job position. Since the pension, membership calculation only requires basic employment status and the technical issues in alignment, the current version of the back simulation only simulate binary variables. Binary variables such as employment status, pension membership etc. are modelled using logit models that produce a probability of an event occurring as the output. These binary variables could be modelled in the following generic form

[1] $\text{logistic}\left({p}_{i}\right)\text{\hspace{0.17em}}=\alpha +\beta {X}_{i}+{\epsilon }_{i}$

Xi is the vector of personal characteristics. The error term can be decomposed into specific individual effects ui and an i.i.d. stochastic term v. While there are a few methods that can be adopted for controlling individual heterogeneity, this current version of the back simulation uses a simple logistic implementation for the program compatibility and the speed reason. Assuming that the stochastic term (vit) is i.i.d. with mean zero, the average of the error terms for each individual is an unbiased estimate for individual effects (ui). It then becomes possible to re-estimate the model using calculated individual effects and apply them to the simulation. Therefore, the final model applies can be described as

[2] $\text{logistic}\left({p}_{it}\right)\text{\hspace{0.17em}}=\text{\hspace{0.17em}}\text{\hspace{0.17em}}\alpha +\beta {X}_{it}+{u}_{i}+{v}_{it}$

The method is essentially an adjusted “fixed-effects” logit model, which is a logit variation of the specification suggested by Mundlak (1978). It yields a higher predictive power as parts of the unobserved heterogeneities are modelled. Table 4 lists the variables included for each logit model. Models are estimated separately for male and females.

Table 4

### 4.2 Continuous variable simulation

Income yit is modelled as an extended Mincer type earning equation (Mincer & Jovanovic, 1981). It consists of a deterministic component (representing the dependence of income on current state variables), a random effect ui (to account for unobserved individual heterogeneity), and a stochastic component εit (which represents random variation over time, in addition to variation from state changes). The statistical form looks like:

[3] ${y}_{it}\text{\hspace{0.17em}}=\text{\hspace{0.17em}}\text{\hspace{0.17em}}\alpha +\beta {X}_{i}+{u}_{i}+{\epsilon }_{it}$

Where Xi includes education attainment, labour market experience and unemployment experience in the equations. Earnings are estimated using a random effect specification and coefficients are reports in Appendix 4.

## 5. Methodology II: alignment and adjustment in the simulation

Alignment is a commonly used method for calibrating microsimulation models so that aggregate outputs from the model match the external projections or values. This is partially due to insufficient knowledge regarding microbehaviour to specify a fully dynamic model. Simulation models, if unbounded, may over or under predict the occurrence of a certain event, even in a well-specified discrete model (Duncan & Weeks, 2000). Therefore, although in theory alignment might be controversial (as a perfectly specified model with perfect data should not need any alignment), it is a de facto common practice for microsimulation models (see for example DYNACAN, CORSIM, and LIAM).

In the back simulation module, alignment shapes the earnings and the distribution of employment status in a way that is both consistent with historical census information and retrospective information. There are two types of alignments applied in this model; one is a cross-sectional alignment and the other a longitudinal alignment.

### 5.1 Cross-sectional alignment

Depending on whether the variable is continuous or discrete, the alignment technique applied is different. Continuous variables such as earnings were aligned to the same level for the first year of the survey in 1994 for each age, gender, and education group and wage growth was assumed to be equivalent to consumer price index (CPI) growth rate. The alignment is a proportional adjustment, which has been used in several models, e.g. Chénard (2000a, 2000b).

For binary variables such as employment status variables, the alignment matches the proportion of the population which has a certain employment status (e.g. working or not), to the external census values, or estimated historical values, for each age, sex and education group in a given year. The alignment is based on the probability predicted by the logit model, i.e. individuals with the highest predicted probabilities (with the stochastic term) would be selected. The method is described in details by O’Donoghue (2010).

One issue with the alignment usage in this particular study is the incomplete population. While the dataset is population representative between year 1994-2001, the lack of the deceased population makes the dataset biased in early years. To address this problem, cross sectional alignments are only applied at the sub population level, i.e. alignment by age, gender, education status etc. rather than at the whole population level.

### 5.2 Longitudinal alignment

Besides the cross-sectional alignment, the back simulation also requires consistency in the retrospective variables, which is crucial to the quality of the generated historical dataset, as one of the main purposes of a long panel dynamic microsimulation is life-cycle analysis (e.g. pension). Currently there are hardly any simulation models applying a longitudinal alignment, as most have been developed for forward simulation, where there is no benchmark with which to align the results. In the backsimulation module, longitudinal alignments are implemented for the following reasons:

• The simulated life path should be consistent with the reported retrospective variables such as date of marriage, year when started working, education, childbirth etc.

• The simulated number of years spent in a certain employment category (e.g. total years of work/unemployment) should be consistent with reported values in the LII dataset

• The state, occupational and private pension eligibilities should be consistent with the simulated labour force trajectory

As discussed earlier, major life events are determined before the stochastic simulation to ensure consistency. The reported number of years spent within a certain employment category, as presented in the LII dataset, is taken into account during the cross-sectional alignment design process. Together with the deterministic simulation, it is possible to identify how many working years are left undetermined. Individuals without any “working years” left would be assigned as out of a job during the cross-sectional alignment procedure. Some small adjustments were applied after the panel was created to eliminate small gaps experienced during public sector employment.

The alignment for the state contributory pension eligibility is a bit more complicated. Ireland introduced a state pension system in 1953 and in order to be eligible for an Irish state pension, an individual has to contribute for at least 19.52% (10/52) of the time between their first eligible job and age 65. Self-employed jobs were not deemed eligible until 1988. Given this information, it is possible to influence the pension entitlement by modifying two factors within a career profile: the years of contribution and the density of contribution. For example, if an individual is not eligible for a state pension but the simulation gives an opposite result (a type I error), it is most likely that the model over-simulated the number of years worked with a pension eligible job, or that the starting year is too late. In this case, it is possible to improve the consistency by removing the years of working experience in eligible jobs and replace them with non-eligible activities. Meanwhile, a reverse correction procedure can be applied to type II error individuals (the ones with pension eligibility but without simulated eligibility).

A simplified case below demonstrates this algorithm (Table 5 and Table 6). Assuming there are three individuals in the dataset: A, B and C. They reported that they have worked 3, 4, 3 years in total in their lifetime respectively, and in 1994, B and C received state pensions. Based on the reported total years of work and the yearly aggregate alignment, the back simulation module may generate an output as below, where a black dot [•] means that an individual is working in that year and a white dot [°] indicates that the individual is out of work. As seen below, the simulation is able to reconstruct a history with a controlled yearly aggregate and consistent total years worked.

Table 5
Table 6

As shown above in Table 5, individual A has a type I error and individual C has a type II error. In order to correct the eligibility while preserving the total years of work and yearly aggregate, one solution is to swap jobs. In this case, individual A’s job in 1951 has been swapped with the 1973 job, while individual C’s job in 1973 has been swapped with the 1951 job. Given the symmetric nature of the process, there is no change in the yearly aggregate and total years of work.

This method is essentially a swapping algorithm that works when the longitudinal inconsistency is found. This procedure, however, might break the cross-sectional alignment introduced earlier in some cases when there is no individual consistency that could be improved without breaking another one. In order to avoid this undesired consequence or alleviate the impact, employment statuses are added or removed only in carefully selected year(s) based on the difference between simulated labour market statistics and the census values. For instance, if the labour force participation is lower than the census value for a specific simulated year, then the model will not be allowed to remove employment statuses from that particular year, instead, it will put a higher priority on employment creation for that particular year. Since the LII is a panel dataset, the simulated history should be consistent with not only the first year, but with all the waves. Therefore, this alignment procedure takes several iterations; the model first aligns the history with the last wave in the survey, and then gradually moves to the first wave in a loop. It should be noted that this method does not guarantee perfect consistency but it does offer a reasonable result with some important insights into the life paths of individuals.

### 5.3 Data quality and the alignment

Apart from computational algorithm defects in alignments, there is another important factor affecting the consistencies of simulated history: the quality of the data. The retrospective information collected in the survey dataset can suffer from various recall or measurement biases and it is rare not to spot inconsistencies. An individual aged 67 in 1994 might report he/she received his/her state pension in that year, while claiming that he/she had only worked for 2 years during his/her lifetime. Sometimes, the inconsistency may not be as obvious as in this previous example, especially when dealing with cross-variable consistencies. Certain data quality issues might be difficult to spot beforehand and make longitudinal alignment difficult. Therefore, it may sometimes be necessary to allow for a small deviation from the reported values.

## 6. Results I: back simulating employment status

This section evaluates the back-simulation of discrete (mainly employment status) variables. It recreates labour force participation and job nature (public or private sector, employed or selfemployed), crucial information for future pension projection. Figure 1 illustrates their simulation order within the module. Among the discrete variables, the module simulates:

• In-work, Employment Sector (Public/Private)

• Self-employment

• Various pension memberships

Figure 1

The simulations are based on the complete sample of the 1994 population in the LII survey. Since the model does not generate historical profiles for people who have died before 1994, the simulated sample is not representative of the population prior to 1994. This implies that the average age in the earlier years of the simulation is significantly lower than the average age in the sample. Therefore, outputs are presented mostly by age group or cohort group in order to avoid misunderstandings.

As discussed in the earlier section, the main goal of the back simulation model is to recreate a sensible history for all individuals in the dataset. Therefore, it is essential that the model output can replicate what actually happened. In an ideal world, the model output should be compared with the actual data to see how well the model performs. However, the lack of real historical data is the very reason that the back simulation is required. In addition, it is not feasible to compare the model with other back simulation approaches, e.g. statistical matching method, in this particular case, as synthetic history recreation is the only option due to the data limitation. Therefore, the results of the model are mainly compared with the known benchmarks: the census reports and the pension eligibilities. A success replication of this information would suggest the simulated historical profiles are sensible at both aggregate and individual level.

### 6.1 In-work ratio

The evolution of the in-work ratio is reproduced in Figure 2 and Figure 3. As the graphs show, the male employment rate alters little over time except for a gradual fall of late career employment since the introduction of the retirement pension system in the 1970s. The female employment rate is 40%~50% lower than the male employment rate throughout the simulated history. The simulation indicates that the female labour participation rate is much higher before the age of 35 than for other age groups.

Figure 2
Figure 3

Figure 4 and Figure 5 further examine the simulated historical in-work proportion and offer a comparison with the census values to validate the simulation. As discussed earlier, the crosssectionals in the simulated period are not population representative due to the missing values of deceased individuals. Therefore, census values were adjusted based on the sample’s demographic features. In order to isolate the change of demographic composition, graphs are presented by cohorts.

Figure 4
Figure 5

The figures confirm that the simulated values roughly match the estimations from the census data with a deviation of around 5 percentage points on average. The simulation seems to work better for later cohorts than the earlier ones, however this difference may partially be caused by the longitudinal alignment procedure where the adjustments may not be perfectly balanced, and the inconsistencies in retrospective variables, where the data quality declines as age increases. Nonetheless, it is safe to say that the simulated ratios move in the same direction as the census values.

The back simulation module not only aims at providing a reasonable profile at the crosssectional level, but also a historical trajectory that is consistent with retrospective variables. Figure 6 illustrates the difference between the simulated number of years in work and the reported values. As shown, the longitudinal inconsistency in the simulation is well controlled; in total, over 83% of the individuals were simulated with an error no greater than 1 year, and over 89% of the observations have a difference of less than 2 years. A few extreme cases are also spotted with an error greater than 10 years. These observations typically report no working history but with old age pension entitlement. In this case, the model overrides the report of zero working year in favour of the observed pension eligibilities. On average, the absolute simulation errors for males are slightly larger than those for females. A possible reason is that males generally have longer career trajectories, which increase the computation complexity in balancing the crosssectional and longitudinal alignments. The longer career trajectory also implies that the measurement error of the retrospective values like reported years of work might be bigger for males. As a result, the quality of the simulation could be negatively affected due to the inconsistent reported values.

Figure 6

### 6.2 Employment sectors and self-employment

Figure 7 and Figure 8 depict the historical evolution of public sector employment by age. As shown, there is a gradual growth of public sector workers for all age groups, a trend that is consistent for both males and females. However, the increase of females working in the public sector is faster compared with male workers, especially for workers in their early and midcareer stages.

Figure 7
Figure 8

Regarding the self-employed, Figure 9 and Figure 10 reveal the different preferences amongst different age groups. From the graphs it can be seen that the self-employment rates increase dramatically for males above the age of 35 over time and this stabilises around 30%, while the self-employment rate for males under 35 is just above 12% most of the time. The raise slope of the curve in the earlier part of the history may be contributed to by the limited number of observations for the age group. For females, the self-employment ratio is more or less stable within each age group except for a moderate decline for workers over 50 years old. Despite the drop, this late-career age group has a self-employment rate of around 30%, which is more than 20 percentage points higher compared with other groups. The differences are mainly due to the dropped female labour participation at later ages. Since the average age of retirement for female employee is lower than the male counterparts, the proportion of the self-employed workers increases after age 50. The graphs demonstrate that workers under the age of 35 have the least preference towards self-employment, a preference which is observed in both the simulated histories and the original dataset.

Figure 9
Figure 10

## 7. Results II: pension memberships and eligibilities

According to Irish pension regulations, individuals retiring at the age of 65 or later are entitled to a retirement or old age social welfare pension (RP and OAP, respectively), which can either be contributory or means-tested. The state pension system consists of a social insurance pension introduced for those aged 70+ in 1961 and a retirement pension for those aged 65 introduced in 1970 (O’Donoghue, 2002). Individuals may be entitled to additional pensions depending on their jobs and personal choices. The back simulation module replicates the pension system by simulating each individual’s participation in pension schemes and calculating their eligibility for a state pension once an individual retires.

### 7.1 State pension eligibilities

State pension eligibility is calculated from the simulated contribution history, which means that an output consistent with observed pension eligibility requires the complete working trajectory (employment status/type/sector) to be within a plausible and restricted range. Therefore, the consistency of pension eligibilities could be seen as an important indicator of overall simulation quality.

Table 7 lists the accuracies of different simulated pension eligibilities. Overall, the back simulation module developed in this paper is able to simulate the eligibility with an error of less than 5% for any pension type. This result is obtained after the longitudinal alignment procedure, which improved the accuracy of simulated eligibility by more than 25 percentage points in the exercise1.

Table 7

### 7.2 Occupational pension membership

Occupational and private pensions are simulated through well-specified logit models and the results are aligned with the information extracted from the QNHS and LII survey2. Figure 11 describes the development of occupational pension membership by cohorts. It seems that later cohorts are more willing to participate in occupational pension schemes, which might imply an increase in the occupational pension coverage in Ireland over time, which in turn is consistent with the history of Irish pension reforms.

Figure 11

Analysing the trend for the cohort prospectively shows the dynamics of pension participation across time for the different cohort groups. Nonetheless, it does not reveal the dynamics of age preference, which is tightly linked with career development. Figure 12 discloses the shift of preference of same age groups across time. As suggested, mid-career individuals appear to be participating more actively then those in other groups. It seems that the age group closest to retirement has the lowest participation rate, which could due to the differences in career trajectories, availability of the pension options for this cohort and other reasons.

Figure 12

## 8. Results III: income variables

In addition to employment statuses, the back simulation module also provides monetary income histories for all individuals. Labour earnings are simulated using the models mentioned in Section 4, while other non-labour income is assumed stable over time as they do not affect pension memberships. Figure 13 and Figure 14 demonstrate the dynamics of the average labour income over lifetime in this back simulation by analysing the earnings of three different cohorts. The average earning curve, as shown by the earlier cohorts, has the distinctive shape of a quadratic function, which matches the expectation of classic human capital theory. Female earnings seem to have a more flat mid-career profile, which may be due to the career interruption caused by maternity leave.

Figure 13
Figure 14

Figure 15 and Figure 16 give an overview of the evolution of earnings across the different age groups. As can be seen, average earnings increase steadily due to the increasing level of education. This is also confirmed by the cohort graphs shown earlier, where the younger cohorts demonstrate higher peak values. Male earnings are substantially different depending on the stage of their career. The average earnings by age 50 or above can be more than 50% higher compared to a male aged 20–35, but the difference is a lot smaller for female workers.

Figure 15
Figure 16

## 9. Conclusions

As discussed above, the preparation of the base dataset is one of the most important elements in every dynamic microsimulation model. However, finding a long historical panel dataset with all the necessary variables is often infeasible. This paper develops an algorithm to simulate a historical panel for the LII dataset to fill in the missing history.

The back simulation module extracts retrospective information from the LII survey and applies a dynamic microsimulation in a reversed direction to simulate population histories. Due to the nature of historical simulation, the longitudinal consistency requirement is high and difficult to achieve when compared with a forward dynamic microsimulation. The method proposed in this paper solves this problem by introducing extra calibrations and alignments at both the crosssectional and longitudinal levels. The overall result of the simulated panel, as described in the earlier sections, matches the labour market history to a fairly high degree based on the macro statistics calculated. The simulated values follow the observed trend quite closely, and are able to pick up the dynamics of different types of pension eligibility across time and cohorts. Nevertheless, further validation of the long-term trajectories of the employment and earnings produced by the model might be necessary, although there is little reliable data to compare the results with for the earlier half of 20th century.

The back simulation method could potentially offer many benefits for microsimulation modellers. By expanding the longitudinal information while maintaining consistencies with a panel dataset with limited waves, it offers a viable alternative to the dilemma of base dataset choice as described by Cassells et al. (2006) and Zaidi and Scott (2001). The generated historical panel could consequently enhance the accuracy and stability of the forward dynamic simulation by feeding in information on career trajectory. Given the extensive retrospective information modelled, the back simulated panel has a much higher data quality than a simple synthetic panel. Since only standardized survey questions and macro statistics are used in the simulation, this proposed back simulation method could be potentially replicated in other datasets that include retrospective questions, e.g. ECHP, BHPS, German Socio-Economic Panel (GSOEP), FFS. The method could help modellers to run a life-cycle based simulation without using a scarce long panel. In addition, the historical panel provides many important insights, e.g. the dynamics of career trajectories, which would otherwise be easily overlooked in the original dataset.

The exploration of a back simulation method as described in this paper is still in its early stages and the current algorithm could be further improved. Future work is planned to improve the following aspects of the back simulation model:

• Simulation of the deceased cohort of the population in order to align historical statistics more accurately, as the current version does not generate a representative historical population.

• Develop an improved algorithm for the cross-sectional/longitudinal alignment to alleviate the impact of inaccurate responses for retrospective questions.

• Understand the robustness of the model and the standard errors of the results. Further analysing the variations in the model would enhance the credibility of the model, especially when stochastic components are used.

In spite of the shortcomings inherent in the current version of this model, this paper shows that reconstructing historical information is feasible based on standard household survey dataset and census information, providing there is some retrospective information in the dataset that can be used for modelling. From a practical point of view, the simulated panel is the only available historical panel dataset for Ireland, which offers the possibilities of investigating life-cycle income profiles together with a dynamic microsimulation model.

## Appendix 1 Steps of Back Simulation

1. Estimate models

1. Employment (in-work/out-of-work) model

2. Self-employed model

2. Impute social economic status variables for each year in the original dataset (1994-2001)

1. Impute the number of years in education, employment, sickness, home care duties and retirement using regression in 1994 by sex group

2. Update these variables to 1995–2001

3. Consistency check for major social economic variables during the years 1994–2001

3. Generate historical data using known information

1. Demographic information (e.g. year of birth)

2. Maternity information (based on child age)

3. Job information (e.g. year of starting first job, last job, total number of years worked etc.)

4. Pension information (e.g. total pension contribution, pension entitlements)

5. Others (e.g. school, geographic information)

4. Deterministic back simulation (in/out work)

1. Create age information

2. b. Infer working status from retrospective variables

1. Must be out of work before year of starting work

2. Must be in work in the year of starting work

3. Must be in work since the current job started

4. Must be in work in the year of quitting previous job

5. Must be unemployed since current unemployment started

5. Semi-stochastic back simulation

1. Assuming a 75% chance that a woman will be “out of work” in the year of giving birth to a child

2. State pensioners must start work before the age of 56

3. “Back to work” beneficiaries must have been unemployed for 2 years

6. Fill the gap of unallocated working status for history using stochastic simulation

1. Replace the unknown working status to “in work” if the number of known working years is lower than the total work years

1. Error term is stochastic. The ranking is based on the sum of the error terms and the personal effect in the model

2. The status is aligned with census data by interacting with the ranking

2. Impute the working status if there is still a gap between the reported number of years in work and the simulated years of in work

3. Set the rest of the employment statuses to “not in work” if the simulated number of years in work is consistent with the reported values

4. Adjust any differences

7. Simulate/Generate the variables that LIAM will use

1. Demographic information (age, alive, maternity)

2. Education information (based on the age of leaving full time education)

3. Marriage information

1. Calculate the year in which an individual married

2. Assume that an individual married at age 25 if the current marital status is separated, divorced or widowed.

4. d. Prepare employment information

1. Self-employment

1. Estimate the probability of being self-employed (1994–2001)

2. Read the historical values of self-employment and smooth the curve using moving average method

3. Use the stochastic term and personal effect to predict the probability of selfemployment, align data with estimates.

2. Public sector

1. Assume that people stick with the same sector as reported in 1994

2. Do not allow for small gaps in public sector employment

3. Store alignment data by sex education and cohort

4. Use stochastic term and personal effect to predict the probability of staying within the public sector, and align the data using saved values

3. Final adjustments to smooth labour participation history

8. Calculation for pay related social insurance (PRSI) pension entitlement

1. Create a variable for the contributory class in the PRSI system for state pensions

1. Public servants are treated as employees if they joined the job after 1995

2. Public servant pensions are eligible for those who joined the work before 1995

2. Calculate the eligibility for a PRSI pension for the rest

1. Self-employed are assumed to only be insured after 1988

2. Public sector are assumed to only be insured after 1995

3. Calculate the pension credit

9. Longitudinal Alignment

1. Compare simulated PRSI pension entitlement with the actual entitlement

2. Modify the history while keeping the cross-sectional alignment of the labour market participation, self-employed and pension coverage ratio stable

3. Repeat the above procedure several times until the consistency is satisfactory

10. Simulate earnings (for employees and self-employed)

1. Estimate the wage model (random effect model)

2. Record the standard deviation of random effects and the error term

3. Simulate the wage for each group with a stochastic term that shares the standard deviation of estimated random effects

4. Align the average earnings by sex, education level and age group to the year 2000 values

11. Simulate occupational pension membership using the equation estimated using the 1994-2001 data and calibrated using the QNHS samples.

1. Calculate the probability of receiving an occupational pension, employee personal pension and self-employed personal pension for each age and sex group

2. Estimate the occupational/self-employed/personal pension model using existing data

3. Align it with the probability breakdown calculated from the QHNS sample for both the existing dataset (LII) values and the simulated historical values

4. Correction for the unlikely event of an inconsistency between the occupational pension status and the employment status

5. Final adjustment and alignment

12. Calculate the value of the DC Occupational pension fund

1. Model contribution in a similar manner as wages and align with observed contribution rate (split into two rates, pre-1995 and post-1995)

2. Calculate the pension fund value from the 1930s using assumed interest rates

3. Calculate the contributions for each pension type

Table 1
Table 2
Table 3

Table 4
Table 5
Table 6

Table 7

## References

1. 1
Un outil de prospective des retraites: le modèle de microsimulation Destinie
(2003)
Économie et prévision pp. 193–214.
2. 2
CORSIM 3.0 User and Technical Documentation
(1997)
Ithaca, NY: Cornell University.
3. 3
Problems and Prospects for Dynamic Microsimulation: A Review and lessons for APPSIM. NATSEM Discussion Paper
(2006)
Problems and Prospects for Dynamic Microsimulation: A Review and lessons for APPSIM. NATSEM Discussion Paper.
4. 4
Earnings in DYNACAN: distribution alignment methodology
(2000)
DYNACAN Project Document.
5. 5
Individual alignment and group processing: an application to migration processes in DYNACAN, occasional papers-university of Cambridge department of applied economics
(2000)
238–250, Individual alignment and group processing: an application to migration processes in DYNACAN, occasional papers-university of Cambridge department of applied economics.
6. 6
PENSIM: A Dynamic Simulation Model of Pensioners’ Income ” in Government Economic Service Working Paper No. 129
(1996)
London: Analytical Services Division, Department of Social Security.
7. 7
Microsimulation Modelling for Policy Analysis: Challenges and Innovations
(2000)
Simulating Transitions Using Discrete Choice Models, Microsimulation Modelling for Policy Analysis: Challenges and Innovations, Cambridge, Cambridge University Press.
8. 8
An assessment of PenSim2
(2004)
An assessment of PenSim2.
9. 9
Imputing pension and caring histories to the base data in the SAGE dynamic microsimulation model
(2004)
SAGE Technical Note No 8.
10. 10
A Primer on the Dynamic Simulation of Income Model (DYNASIM3)
(2004)
A Primer on the Dynamic Simulation of Income Model (DYNASIM3).
11. 11
Using The Eu Silc For Policy Simulation: Prospects. Some Limitations And Suggestions, EUROMOD Working Papers
(2007)
Using The Eu Silc For Policy Simulation: Prospects. Some Limitations And Suggestions, EUROMOD Working Papers.
12. 12
Can we Afford the Future? An evaluation of the new Swedish pension system, Modelling our future: population ageing
(2007)
social security and taxation 33.
13. 13
The MOSART model-a short technical documentation
(2003)
The MOSART model-a short technical documentation.
14. 14
Challenges and Opportunities of Dynamic Microsimulation Modelling
(2007)
Challenges and Opportunities of Dynamic Microsimulation Modelling.
15. 15
DYNAMOD-2: An overview. Technical Paper
(1999)
DYNAMOD-2: An overview. Technical Paper, 19.
16. 16
Pension incomes in the European Union: policy reform strategies in comparative perspective
(2007)
Micro-simulation in action: policy analysis in Europe using EUROMOD 27.
17. 17
Labor mobility and wages, Studies in Labor Markets
(1981)
2164, Labor mobility and wages, Studies in Labor Markets.
18. 18
DYNACAN, the Canadian pension plan policy model: demographic and earnings components
(1997)
Proceedings of the Microsimulation Section at the International Conference on Information Theory, Combinatorics, and Statistics.
19. 19
Overview of DYNACAN, a full-fledged Canadian actuarial stochastic model designed for the fiscal and policy analysis of social security schemes
(1998)
Overview of DYNACAN, a full-fledged Canadian actuarial stochastic model designed for the fiscal and policy analysis of social security schemes.
20. 20
On the pooling of time series and cross section data
(1978)
Econometrica 46:69–85.
21. 21
Dynamic microsimulation: a methodological survey
(2001)
Brazilian Electronic Journal of Economics 4:77.
22. 22
Redistribution in the Irish tax benefit system
(2002)
University of London.
23. 23
The Life-Cycle Income Analysis Model (LIAM): A Study of a Flexible Dynamic Microsimulation Modelling Computing Framework
(2009)
International Journal of Microsimulation, 2, 1.
24. 24
Life-Cycle Income Analysis Modelling
(2010)
AG & CO.KG: 2010 Lambert Academic Publishing.
25. 25
Estimation of earnings in the SAGE dynamic microsimulation model: London. ESRC-Sage Technical Note
(2004)
Estimation of earnings in the SAGE dynamic microsimulation model: London. ESRC-Sage Technical Note.
26. 26
Base dataset for the SAGE model. Sage Technical Note
(2001)
Base dataset for the SAGE model. Sage Technical Note.

## Article and author information

### Author details

1. #### Jinjing Li

Maastricht University / UNU-MERIT, The Netherlands
##### For correspondence
Jinjing.Li@natsem.canberra.edu.au
2. #### Cathal O’Donoghue

Rural Economy and Development Programme, Teagasc, Ireland
##### For correspondence
cathal.odonoghue@nuigalway.ie

### Acknowledgements

The authors are grateful to the National Research Fund, Luxembourg for the financial support. For their helpful discussions and comments, the authors wish to thank Elisa Baroni, Gijs Dekkers, two anonymous referees, and the participants of the 2010 European workshop on dynamic microsimulation modelling in Brussels.Text of acknowledgement

### Publication history

1. Version of Record published: June 30, 2012 (version 1)

© 2012, Li & O’Donoghue