Comparative analysis of different techniques to impute expenditures into an income data set
Abstract
Income and budget data seldom are measured in the same dataset. In order to make simulations that need both, one requires a reliable procedure to merge an income and a budget survey into one combined dataset. This paper contains the comparison and evaluation of five different techniques to impute expenditures into income datasets: parametric estimation of Engel curves, nonparametric estimation, both constrained and unconstrained matching using a distance function and grade correspondence. After a detailed description of the methods as well as a comparison of the main pros and cons, their effectiveness is tested upon an artificially split data file. In general, the parametric and nonparametric estimation seem to yield the best results, generating imputed values that are closest to the observed values for the budget shares.
1. Introduction
In order to simulate concurrent changes in direct and indirect taxes a dataset which combines income and expenditure data is needed. However, it is unusual to have one data source that contains high quality information on both income and expenditures. A possible solution lies in the creation of a ‘new’ dataset which merges information of an income and a budget survey by using imputation or matching techniques using the overlapping variables – variables that are held in common by both datasets.
There is a large literature on statistical matching across different areas of the microsimulation field (Cohen, 1991). Sutherland et al. (2002) used statistical matching in the UK to combine income and expenditure datasets for indirect tax modelling. Decoster et al. (2010) also used statistical matching to combine income and expenditure files for indirect tax analysis in different EU countries, as do Savage (2017) for Ireland and Donatiello et al. (2014) for Italy. Peichl and Schaefer (2009) utilise statistical matching to combine survey and administrative datasets for use in a macro-microsimulation model. Abello et al. (2008) and Von Randow et al. (2012) use statistical methods to link surveys in health microsimulation models. Cullinan (2010) links a spatial microsimulation model with locational data using statistical matching. In the wider inequality literature, Borra et al. (2013) link time use and income data, while Rasner et al. (2013) and Kum and Masterson (2010) look at wealth analysis.
There is a substantial literature that focuses on combining official statistical sources (D’Orazio et al., 2002; D’Orazio et al., 2006a; D’Orazio et al., 2006b; D’Orazio et al., 2012; Leulescu and Agafitei, 2013; Serafino and Tonkin, 2017). Much of the statistical literature has focused on techniques for specific methods (Moriarity and Scheuren, 2001a; Moriarity and Scheuren, 2001b; Moriarity and Scheuren, 2003; Rässler, 2003). In general, however, the literature undertakes statistical matching without evaluating the relative performance of different methods, a research gap that this paper aims to fill.
Specific methods have been evaluated in Rodgers (1984) and Barry (1988). However, given the range of methods used in microsimulation modelling and their different strengths, the literature comparing the performance of different statistical matching techniques is relatively sparse. Webber and Tonkin (2013) do undertake a comparison of the statistical matching of the SILC and the Household Budget Survey, evaluating the match under a number of different scenarios. They compare the impact on matching variables and on mean expenditure by decile using different statistical matching methods, and perform an interesting test of conditional independence. Rässler (2002) compares different imputation and statistical matching methods, but there is no paper that compares the different methods utilised in the microsimulation literature. This paper attempts to fill that gap, and tests the distributional assumptions at a more disaggregated level than these studies.
In this paper we evaluate statistical matching algorithms used to link an income dataset and an expenditure dataset, with the aim of generating a dataset for simulating indirect taxation within the EUROMOD model (Sutherland and Figari, 2013), using the 2001 and 2002 Belgian Budget Surveys. In the EUROMOD context, the income dataset, on which the direct tax and benefit changes are modelled, cannot be altered. Therefore we designate this income dataset to be the target dataset into which expenditure data are to be imputed. The budget dataset then plays the role of the source dataset. The purpose of this paper, though, is to evaluate an appropriate methodology for creating a statistically matched dataset rather than to utilise the resulting dataset for a simulation. Therefore, to avoid problems arising from differences in data definitions, survey design and weights between the source and target datasets, we use donor and recipient data from the same dataset, i.e. from the budget survey.
Imputing household expenditure data into income surveys, although not the only application, is one of the main uses of statistical matching in microsimulation models (Sutherland et al., 2002). Some tax-benefit microsimulation models use data that contain both income and expenditure, as in the case of earlier models using the former Family Expenditure Survey in the UK or the Household Budget Survey in France (Bourguignon et al., 1997). In general, however, the income variables in household budget surveys are of lower quality and contain less detail than is required to model income taxation and social transfers (O’Donoghue et al., 2004). Similarly, the unit of analysis is often the household rather than a more disaggregated individual or tax unit. On the other hand, in most OECD countries, as in the case of the Eurostat European Community Household Panel (ECHP) or the Survey of Income and Living Conditions (SILC), the expenditure data necessary to simulate indirect taxes are missing from the income survey. The direction of the match is irrelevant in some methods, such as minimum distance matching, as they link both datasets, whilst in explicit methods such as regression-based approaches the direction is relevant. In the latter, the income survey is typically taken as the base, both because it contains an appropriate unit of analysis and because of the relative importance of income variables and income-based policies relative to expenditure data and expenditure-based policies in OECD countries. As a result, it is often necessary, as in the case of EUROMOD, to use statistical matching to link income and expenditure data.
Five different matching techniques are examined, representing the techniques used in the microsimulation literature. They can be divided into two categories. A first category contains the so-called explicit methods, which use estimations of Engel curves to impute expenditure information into the income dataset.^{1} The two techniques that we study in this category can be labelled as parametric (or standard) regression and nonparametric regression. The second category consists of the so-called implicit methods, which match to each record in the income survey a record with expenditure information coming from the budget survey. In order to choose the most adequate record in the budget survey, two different techniques are used: the distance function (both constrained and unconstrained methods) and the grade correspondence. For both techniques many variations are possible in the practical application, but a limited selection was made based on the studies of Decoster and Van Camp (2002) and Taylor et al. (2001), respectively.
When it comes to evaluating and comparing the five methods, two criteria will be essential. The first (microscopic) one is the quality of the match, in that one wants to create for each record of the income survey values for a number of new (budget) variables that correspond as well as possible, given the information available, to the true but missing values of that observation. The reason for this is obvious: the primary goal of the matching process is to obtain a dataset with observations that are realistic, in that they represent households that exist in society. A microsimulation of behavioural change based upon types of behaviour that do not exist in society may not yield very trustworthy results. The second (macroscopic) criterion refers to the fact that the replication of distributions of missing variables is also desirable from the simulation point of view: the observed distribution in the budget survey is considered to be representative and deviations from it may lead to under- or overestimating certain indirect tax change effects (such as distributional effects). Of course, if the distribution of overlapping variables is the same in both datasets, the second criterion is a consequence of the first one: a good individual match will also generate the right marginal distributions for the budget shares. But if this condition is not met, a trade-off between the two criteria will be inevitable, which can best be illustrated by the difference between constrained and unconstrained distance matching (cf. infra).
Section 2 describes the methodology used and the data utilised. The explicit methods are discussed first, followed by the two implicit methods; for each, both the general strategy and the concrete implementation are discussed. The section also makes a theoretical comparison of the different methods. In the next section some evaluation criteria are suggested and the practical performance of the five methods is investigated. Section 4 concludes.
2. Data and methodology
2.1. Explicit methods: imputation by means of Engel curves
As mentioned above the explicit methods use Engel curves to impute for every record in the income survey expenditure information. Theoretically, this expenditure information could be at the most detailed level but in practice this is impossible since this would result in very imprecise estimations of the Engel curves. Consider for instance the influence of the zero expenditures (see e.g. Pudney, 1989). This zero expenditure problem illustrates that the reliability of these imputations relies upon an explicit statistical model which can be (slightly) misspecified. It has been assumed that based on the explanatory variables (including disposable income and some demographic characteristics like household size and age of the household head), the behaviour of the dependent variable can be fully captured and, moreover, that (standard) regression issues such as heteroskedasticity and multicollinearity can be adequately dealt with.
Therefore, the application of the explicit techniques in practice boils down to aggregating the expenditure items (in order to avoid zeroes) and then estimating the Engel curves of these aggregates. The quality of these imputation techniques is then completely determined by the quality of the estimation of the Engel curves. Although there exists a large literature on this topic (see for instance Blundell, 1988; Banks et al., 1997; Blundell et al., 1998 and references therein), unfortunately in this specific setting the developed machinery cannot be applied fully. For instance, a functional specification has to be determined a priori in the parametric case and the explanatory variables are restricted to the set of overlapping variables. Beside these restrictions, which of course decrease the quality of the estimates, different definitions of the overlapping variables (e.g. income variables) possibly have to be dealt with.^{2} Again this could influence the quality of the imputation.
In the rest of this section, let $y_{h}$ denote the disposable income of household $h$, $E_{jh}$ the expenditures of the household on the aggregate $j$, and $\mathbf{O}_{h}$ the vector of overlapping variables between the datasets (excluding $y_{h}$). For the first (standard) method the imputation is carried out by estimating the Engel curves of the budget shares:
$$w_{jh}=\frac{E_{jh}}{y_{h}}=f_{j}\left(y_{h},\mathbf{O}_{h}\right)$$
using ordinary least squares regression on the budget dataset. Note that savings are treated in the same way as the budget categories in that it is also modeled by a regression equation. This points out why disposable income appears in the denominator rather than total expenditure: the budget and saving shares sum up to one. The explanatory variables are, as stated above, chosen out of the set of overlapping variables. In this way, the obtained model can be used to predict budget shares for the observations in the income survey.
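In practice, the construction of the model is performed using the QUAIDS specification. The independent variables thus span the logarithm of the disposable income up to the second degree as well as the other overlapping variables:
$$w_{jh}=\alpha_{j}+\beta_{j}\ln y_{h}+\lambda_{j}\left(\ln y_{h}\right)^{2}+\delta_{j}^{\prime}g\left(\mathbf{O}_{h}\right)+\epsilon_{jh}$$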
where ${\alpha}_{j}$, ${\beta}_{j}$, ${\lambda}_{j}$ and ${\delta}_{j}$ are the parameters to be estimated and ${\epsilon}_{jh}$ is the error term. The function $g$ is included so as to allow squared values and cross effects of demographic variables to be taken into account (e.g. age as in O’Donoghue et al., 2004). Note also that the condition that the predicted budget shares have to sum up to one for each household needs no explicit restriction: because the observed budget and saving shares sum to one and every equation uses the same regressors, OLS satisfies it automatically (see e.g. Deaton and Muellbauer, 1980, p. 19):
$$\sum_{j}\widehat{\alpha}_{j}=1,\qquad \sum_{j}\widehat{\beta}_{j}=\sum_{j}\widehat{\lambda}_{j}=0,\qquad \sum_{j}\widehat{\delta}_{jk}=0\quad (k=1,\dots ,m),$$
m being the dimension of the image of g. The regression equations derived by this procedure can then be applied to the observations of the income dataset, generating new variables w_{j} (possibly with an error term to randomize the results to some extent). An important remark in this respect is that the marginal distributions of the variables w_{j} will not necessarily be the same in the source and target dataset, except when the multivariate distribution of the overlapping variables is identical. The differences between the distributions of the overlapping variables in the source and target dataset are described in Decoster et al. (2007).
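A minimal sketch of this parametric imputation may help fix ideas. It assumes synthetic data and hypothetical names (`log_y`, `hh_size`, three share categories of which the last is savings), and it takes $g$ to be a single demographic variable. It fits the share equations by OLS with NumPy and applies them to a synthetic target dataset; it is an illustration under these assumptions, not the estimation code used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def design_matrix(log_y, hh_size):
    """Constant, ln y, (ln y)^2 and one demographic variable (hypothetical choice of g)."""
    return np.column_stack([np.ones_like(log_y), log_y, log_y**2, hh_size])

# Synthetic "budget survey": food, other goods and savings shares sum to one.
H = 500
log_y_src = rng.normal(10.0, 0.5, H)
size_src = rng.integers(1, 6, H).astype(float)
food = np.clip(0.9 - 0.06 * log_y_src + 0.02 * size_src + rng.normal(0, 0.03, H), 0.05, 0.6)
other = np.clip(0.35 + rng.normal(0, 0.05, H), 0.1, 0.39)
shares_src = np.column_stack([food, other, 1.0 - food - other])

# One OLS fit per share equation; lstsq handles all columns of shares_src at once.
coef, *_ = np.linalg.lstsq(design_matrix(log_y_src, size_src), shares_src, rcond=None)

# Synthetic "income survey": apply the estimated equations to impute shares.
log_y_tgt = rng.normal(10.1, 0.6, 300)
size_tgt = rng.integers(1, 6, 300).astype(float)
imputed = design_matrix(log_y_tgt, size_tgt) @ coef

# Adding-up: intercepts sum to about one, all other coefficients to about zero,
# so the imputed shares of every target household sum to about one.
print(coef.sum(axis=1))
print(imputed.sum(axis=1)[:5])
```

No restriction is imposed in the fit; the adding-up property follows from including savings as a share and using the same regressors in every equation.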
The nonparametric method starts from the same idea as the parametric method: to find a function that relates the budget shares to the overlapping variables in the household survey and in the next step apply this function to the observations in the income dataset. The difference lies in the fact that the parametric method starts from a functional specification while the nonparametric does not. In this way a misspecification of the Engel curves is avoided and much more flexibility is obtained for estimating the relation between the explanatory variables and the dependent variable. The nonparametric procedure consists of estimating density functions directly. In the univariate case, this can be visualised intuitively by a histogram, being the empirical density function:
$$\widehat{f}\left(t\right)=\frac{1}{H}\sum_{i=1}^{N}\frac{{1}_{[a_{i},b_{i})}\left(t\right)}{b_{i}-a_{i}}\sum_{k=1}^{H}{1}_{[a_{i},b_{i})}\left(t_{k}\right)$$
where $\{[a_{i},b_{i}),\ i=1,\dots ,N\}$ is a partition of the domain of $f(t)$, $H$ is the number of observations, $t_{k}$ is the $k$-th observation and ${1}_{A}$ is the indicator function of a set $A$. Note that so far, no regression has yet been performed. In most cases, the result will be a highly discontinuous function (which can be thought of as caused by the fact that the sample was finite). For continuous random variables, the question also arises which partition should be chosen to represent the data. Both problems are tackled at the same time by the use of a kernel density estimator with kernel $K$:
$$\widehat{f}\left(t\right)=\frac{1}{Hb}\sum_{k=1}^{H}K\left(\frac{t-t_{k}}{b}\right)$$
$K$ represents a continuous function that integrates to one and acts as a smoothing device: $\widehat{f}\left(t\right)$ will indeed be continuous as a finite sum of continuous functions, and will integrate to one as one expects from a density function. Here the standard normal density function has been chosen to play the part of $K$, but the choice of $K$ has been reported not to be of major importance (Decoster et al., 2004). The parameter $b$ (the bandwidth), on the other hand, is a measure for the width of the intervals and is much more influential. If $b$ is small, only those $t_{k}$ close to $t$ will have a significant impact on $\widehat{f}\left(t\right)$ (in the case of the standard normal density). A larger bandwidth has a stronger smoothing effect, while a smaller bandwidth stays closer to the observed data; the choice of $b$ is therefore a trade-off between variance and bias. A proposed optimal value for $b$ that has been adopted here (see Deaton, 1997) is:
using H for the number of households, $\sigma $ for the sample standard deviation and IQR for the sample interquartile range.
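The kernel estimator can be sketched in a few lines of standard-library Python. The sample is synthetic, and the bandwidth function is an assumption: it uses a common Silverman-type rule of thumb built from the same ingredients ($H$, $\sigma$ and IQR), with textbook constants that are not taken from this paper.

```python
import math
import random
import statistics

def gaussian_kernel(u):
    """Standard normal density, playing the role of K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def rule_of_thumb_bandwidth(sample):
    """ASSUMPTION: Silverman-type rule of thumb; the constants 1.06 and 1.34
    are common textbook values, not values confirmed by the paper."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    spread = min(statistics.stdev(sample), (q3 - q1) / 1.34)
    return 1.06 * spread * len(sample) ** (-1 / 5)

def kernel_density(t, sample, b):
    """Kernel density estimate at point t with bandwidth b."""
    return sum(gaussian_kernel((t - tk) / b) for tk in sample) / (len(sample) * b)

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(200)]   # synthetic, standardised data
b = rule_of_thumb_bandwidth(sample)

# Riemann sum over a wide grid: the estimate should integrate to almost one.
step = 0.01
area = sum(kernel_density(-10 + i * step, sample, b) for i in range(2001)) * step
print(round(b, 3), round(area, 3))
```

Trying a much smaller or larger `b` on the same sample shows the variance and bias trade-off described above.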
The method to estimate density functions can be used in this context since the Engel curve can be formulated as follows:
Discretisation of the last expression (see Decoster et al., 2004) yields the following nonparametric estimator:
In this expression, $K$ is a function on a multidimensional space, which can be easily implemented by using e.g. the multivariate standard normal density function. There is, however, a problem when the number of dimensions becomes too large: in order to estimate a functional relationship adequately, one typically needs a lot of observations, and the required number of data points increases with the number of dimensions (“the curse of dimensionality”). A possible solution consists in limiting the set of independent variables that enter nonparametrically, and using a standard (multiple) regression method for the other explanatory variables. This is exactly what is done in semiparametric models. In this application, the variables $y$ and $age$ are taken up in the nonparametric part, as in Decoster et al. (2004), while the effect of the other independent variables $\tilde{\mathbf{O}}$ is estimated by least squares. Taking expectations conditional on $y$ and $age$, the resulting equation takes the form:
$$E\left(w_{j}\mid y,age\right)=\boldsymbol{\beta}_{j}E\left(\tilde{\mathbf{O}}\mid y,age\right)+F_{j}\left(y,age\right)$$
and subtracting this from the model equation $w_{j}=\boldsymbol{\beta}_{j}\tilde{\mathbf{O}}+F_{j}\left(y,age\right)+\epsilon_{j}$ yields:
$$w_{j}-E\left(w_{j}\mid y,age\right)=\boldsymbol{\beta}_{j}\left(\tilde{\mathbf{O}}-E\left(\tilde{\mathbf{O}}\mid y,age\right)\right)+\epsilon_{j}$$
The expectation values on the right and on the left can be estimated nonparametrically as before, and what remains of the equation is a model that can be estimated using least squares regression. Note that the estimated $w_{j}$'s again sum up to one, as in the parametric case. See Blundell et al. (1998), and Decoster et al. (2004) for more details and an application of these semiparametric techniques.
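The two-step logic of the semiparametric estimator can be sketched as follows. This is a simplified illustration under stated assumptions: the data are synthetic, only income $y$ enters nonparametrically (age is omitted for brevity), there is a single linear regressor, and the bandwidth `b=0.1` is chosen by hand. The function names are hypothetical.

```python
import math
import random

def nw_mean(y0, ys, values, b):
    """Kernel-weighted (Nadaraya-Watson) estimate of E(value | y = y0), Gaussian kernel."""
    k = [math.exp(-0.5 * ((y0 - yk) / b) ** 2) for yk in ys]
    return sum(ki * v for ki, v in zip(k, values)) / sum(k)

def partially_linear_beta(w, o, ys, b):
    """Least squares slope of (w - E[w|y]) on (o - E[o|y]) for one linear regressor."""
    w_res = [wi - nw_mean(yi, ys, w, b) for wi, yi in zip(w, ys)]
    o_res = [oi - nw_mean(yi, ys, o, b) for oi, yi in zip(o, ys)]
    return sum(a * c for a, c in zip(o_res, w_res)) / sum(a * a for a in o_res)

random.seed(1)
n, true_beta = 300, 0.05
ys = [random.random() for _ in range(n)]                  # rescaled income (synthetic)
o = [0.5 * yi + random.gauss(0, 1) for yi in ys]          # linear regressor, correlated with y
w = [0.4 - 0.2 * yi**2 + true_beta * oi + random.gauss(0, 0.01)
     for yi, oi in zip(ys, o)]                            # share with a nonlinear income effect

beta_hat = partially_linear_beta(w, o, ys, b=0.1)
print(beta_hat)   # should lie close to true_beta
```

The nonlinear income effect never has to be specified: it is removed from both sides by the kernel step before the least squares step is run.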
We briefly compare both regression techniques. It is obvious that theoretically the semiparametric models are at least as good as the parametric method. Indeed, if the functional specification in a parametric method is the correct one, then the semiparametric method will result in similar estimates. But clearly the opposite does not hold. See for instance Härdle and Mammen (1993), for a comparison of both methods. In practice however estimating semiparametric methods can be very time consuming while estimation of parametric models can be done by using well known standard procedures. Finally, the parametric method has the advantage that a regression model estimated upon the budget data can be obtained in countries where the data themselves are inaccessible due to legal restrictions (see e.g. O’Donoghue et al., 2004).
2.2. Implicit methods: imputing complete records
The implicit methods avoid the (theoretical) assumptions and their implications by using as little theory as possible, meaning that they do not rely on an explicit statistical model to impute the expenditure information. These methods try to concatenate expenditure information to observations in the income dataset by using the values of an observation in the expenditure survey that is as similar as possible to the target observation. Similarity is expressed mathematically as a distance function which has to be minimized and which can take the form of a numerical value or of belonging to the same categories and having the same rank within these categories (cf. infra). To find a similar record, the overlapping variables in both surveys are used. Although this is a very simple idea (without theoretical assumptions), the performance crucially depends upon the available overlapping variables and the method used to find the matching records. To give a hypothetical example, suppose that one of the overlapping variables is a (unique) identification number and that in both surveys the same households are present. Then one can of course match to every record of the income survey a unique record of the budget survey based on this number. In reality, however, no such precise overlapping variables are available. What is more, the observations in both surveys are not the same. Finding an exact match is therefore impossible. Before describing the two implicit methods used in this application to find the best possible match, two remarks are given that apply to both.
Overall, two strategies are possible, which have both been implemented in this application: unconstrained matching allows replacement of already chosen records in the source dataset, while constrained matching forbids replacement. By construction the unconstrained technique will yield the lowest total distance, but with constrained matching it is possible to replicate the marginal distribution of the variables w_{j} in the target dataset. A necessary condition for this to happen is that the number of observations in the source and the target dataset is the same. Since in most datasets the “number” of observations is represented by means of a weight variable (which gives the weight of the observation in the entire population), this prerequisite of an equal number of observations is realized by some procedure of “reweighting” the data via a duplication mechanism in the source set.^{3} We sum up the weights of all the observations in the source dataset, the result being the number of households in the country. Dividing each weight by this sum, multiplying by the number of observations in the target dataset, and rounding the result (reweighting) gives the number of times an observation has to be duplicated (or deweighted) to get a source dataset with the same number of observations as the target dataset.^{4}
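The duplication step can be sketched as below, assuming hypothetical survey weights and record labels. Note that Python's `round` rounds halves to the nearest even integer, and that rounding can make the total differ slightly from the target size in general.

```python
def duplication_counts(source_weights, n_target):
    """Number of copies of each source record so that the duplicated source
    file has (approximately) n_target records."""
    total = sum(source_weights)
    return [round(w / total * n_target) for w in source_weights]

def duplicate(records, counts):
    """Repeat each record according to its count (a count of 0 drops the record)."""
    return [rec for rec, c in zip(records, counts) for _ in range(c)]

weights = [100.0, 250.0, 150.0]          # hypothetical survey weights
counts = duplication_counts(weights, n_target=10)
expanded = duplicate(["A", "B", "C"], counts)
print(counts, len(expanded))
```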
A second choice concerns the weights that will be assigned to the different overlapping variables in the distance function. Indeed, not every overlapping variable has to be equally important in defining the distance. In this paper we consider two different applications of this weighting procedure: one with finite weights and one in which some variables get weight infinity.
The most basic implementation of implicit methods uses distance functions with finite weights for the overlapping variables. To be precise, for a given record in the income survey, the distance in the (selected) overlapping variables to every record in the budget survey is calculated. This could for instance be the difference in the number of children, the difference in disposable income, the difference in household size, etc. Then the weighted sum of these differences is calculated and finally the record of the budget survey which has the smallest weighted sum is picked out. If there are several records which result in the same minimum distance, one of these records is chosen at random.
In this case, there are no variables that are deemed so important that matching is forced within their categories. Of course, this does not mean that all overlapping variables are of equal importance: assigning a finite weight to each variable can make it relatively more or less influential in determining the distance (with the special case of putting the weight equal to zero for variables that will not be considered). The strategy adopted here consists of calculating the Mahalanobis distance. Let $t_{i}$ be the realisation of overlapping variables of observation $i$ in the target dataset and $t_{j}$ that of observation $j$ in the source set, then the Mahalanobis distance is defined as:
$$d\left(t_{i},t_{j}\right)=\sqrt{\left(t_{i}-t_{j}\right)^{\prime}{\Sigma}^{-1}\left(t_{i}-t_{j}\right)}$$
where $\Sigma$ stands for the covariance matrix of the overlapping variables in the source dataset. Intuitively, one can keep in mind what this means for the uni- and bivariate case. If there is only one variable, the Mahalanobis distance equals the usual Euclidean distance divided by the standard deviation. This introduces a correction which considers the same absolute distance as less important when the variable under consideration has a high variance than when it is more concentrated. With several overlapping variables, the correlations between the variables also enter the scene (through the off-diagonal elements of the matrix ${\Sigma}^{-1}$). Compared to the Euclidean distance ($\Sigma =I$, the identity matrix), the expression under the square root will be lower if there are two variables that are highly correlated (which means they have a high covariance). This is in line with intuition. Since highly correlated overlapping variables essentially capture the same information, we do want to decrease the weight of these variables in the distance function.
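A small NumPy sketch of unconstrained matching with this distance is given below. The four source and two target households (income, household size) are invented for illustration, and the function name is hypothetical. Ties are resolved here by taking the first minimum, which is a simplification.

```python
import numpy as np

def mahalanobis_match(target, source):
    """Unconstrained match: for every target row, the index of (and distance to)
    the source row with the smallest Mahalanobis distance."""
    sigma_inv = np.linalg.inv(np.cov(source, rowvar=False))   # covariance of the source data
    diff = target[:, None, :] - source[None, :, :]            # shape (n_target, n_source, p)
    d2 = np.einsum("tsp,pq,tsq->ts", diff, sigma_inv, diff)   # squared distances
    idx = d2.argmin(axis=1)                                   # first minimum on ties
    best = d2[np.arange(len(target)), idx]
    return idx, np.sqrt(np.maximum(best, 0.0))

# Hypothetical overlapping variables: disposable income and household size.
source = np.array([[20000.0, 1.0], [35000.0, 2.0], [50000.0, 4.0], [80000.0, 3.0]])
target = np.array([[36000.0, 2.0], [78000.0, 3.0]])

idx, dist = mahalanobis_match(target, source)
print(idx, dist)
```

Because the same source record may be selected for several target records, this corresponds to matching with replacement.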
The Mahalanobis distance thus accounts for differences in variation of and correlation between the overlapping variables. Yet it does not allow for making qualitative distinctions between those variables (e.g. it is more important to put together households with the same income level than with the same education level) other than putting the weight of one variable equal to 0 (which means leaving one variable out of consideration). Also from empirical studies it seems that ‘subjective weights’ perform better (see e.g. Moriarity and Scheuren, 2001b, and Decoster and Van Camp, 2002). These subjective weights are mainly based on the quality of the overlapping variable (for instance the definition of the overlapping variable is the same in both surveys) and on the explanatory power. A possible way to tackle this problem is to determine weights by using a stepwise linear regression. This concept points to a collection of algorithms that try to find the most efficient regression equation given a set of explanatory variables. In a number of consecutive steps, a model is tested leaving out a variable or adding one. If the explanatory effect of this variable is significant, the variable is retained, otherwise it is dropped. Consider the model that comes out of an algorithm like this. The magnitude of the regression coefficients is a measure for the influence of the respective regressors on the dependent variable. Therefore, these magnitudes can be used as weights for a distance function, setting the weights for variables that were left out equal to zero. The distance between observations in one variable can be taken to be the absolute value of the difference. This method clearly accounts for differences in explanatory power of the common variables in that the more influential a variable is, the more weight it will get. Variance and correlation effects of the independent variables are also taken into account via the regression model.^{5}
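The weighted distance with random tie-breaking can be sketched as follows. The weights stand in for the magnitudes of stepwise regression coefficients; the numbers, variable names and records are hypothetical.

```python
import random

def weighted_distance(t, s, weights):
    """Weighted sum of absolute differences over the selected overlapping variables."""
    return sum(w * abs(t[k] - s[k]) for k, w in weights.items())

def match_unconstrained(target, source, weights, rng):
    """For each target record, pick the closest source record (with replacement);
    ties are broken at random."""
    matches = []
    for t in target:
        dists = [weighted_distance(t, s, weights) for s in source]
        best = min(dists)
        ties = [i for i, d in enumerate(dists) if d == best]
        matches.append(rng.choice(ties))
    return matches

# Hypothetical weights; a variable dropped by the stepwise procedure gets weight 0.
weights = {"income": 0.001, "hh_size": 2.0, "children": 1.0}
source = [
    {"income": 29000, "hh_size": 2, "children": 0},
    {"income": 30000, "hh_size": 4, "children": 2},
    {"income": 45000, "hh_size": 2, "children": 0},
]
target = [{"income": 30000, "hh_size": 2, "children": 0}]

matches = match_unconstrained(target, source, weights, random.Random(0))
print(matches)
```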
The grade correspondence technique consists in first clustering the observations in both datasets according to some overlapping variables (see Taylor, 2000 and Taylor et al., 2001). In a way, one sets the weights of these variables equal to infinity, because no matter how large the difference in the other variables within a cluster, the model does not allow matching across clusters. The division into clusters can be done based on experience (which variables are the most important?) or formal clustering procedures can be used (see the above references for a discussion of such procedures). Then, in a second step, a distance function is applied within each cluster, in the same way as before. So, one can again choose between a constrained and an unconstrained matching, and for the former method a reweighting/deweighting procedure can be put in place to obtain the same number of records in the source dataset as in the target dataset.
In this paper, the grade correspondence method is implemented using 18 a priori defined clusters: the observations are assigned to a cluster according to the age of the household head (below 40, between 40 and 60, and above 60), the profession of the household head (not working, blue or white collar worker) and whether children are present or not. These broadly correspond to the categories chosen in O’Donoghue et al. (2004).^{6} Then the clusters in both surveys are made equally large by reweighting/deweighting. The distance between observations is determined by the rank of disposable income within each cluster. So the record with the smallest disposable income in cluster A in the budget survey will be matched to the record with the smallest disposable income in the corresponding cluster in the income dataset.
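Rank matching within clusters can be sketched as below, assuming the clusters have already been made equally large in both files. The record layout (`id`, `cluster`, `income`) and the cluster labels are hypothetical.

```python
from collections import defaultdict

def grade_correspondence(target, source):
    """Match records by income rank within each cluster.
    Returns a dict mapping target id to source id."""
    def sorted_by_cluster(records):
        groups = defaultdict(list)
        for rec in records:
            groups[rec["cluster"]].append(rec)
        for recs in groups.values():
            recs.sort(key=lambda r: r["income"])
        return groups

    t_groups, s_groups = sorted_by_cluster(target), sorted_by_cluster(source)
    matches = {}
    for cluster, t_recs in t_groups.items():
        s_recs = s_groups.get(cluster, [])
        if len(s_recs) != len(t_recs):
            raise ValueError(f"cluster {cluster!r} differs in size; reweight first")
        for t, s in zip(t_recs, s_recs):   # k-th lowest income meets k-th lowest income
            matches[t["id"]] = s["id"]
    return matches

target = [
    {"id": "t1", "cluster": "under40_no_children", "income": 1500},
    {"id": "t2", "cluster": "under40_no_children", "income": 900},
    {"id": "t3", "cluster": "over60_children", "income": 2000},
]
source = [
    {"id": "s1", "cluster": "under40_no_children", "income": 1000},
    {"id": "s2", "cluster": "under40_no_children", "income": 1800},
    {"id": "s3", "cluster": "over60_children", "income": 2500},
]
print(grade_correspondence(target, source))
```

Only the ordering of incomes within a cluster matters, so a uniform shift in the income level of one file leaves the matches unchanged.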
An important remark is that one has to avoid small clusters since this could lead to bad matching results. For instance, instead of using the exact number of children as a variable for the clustering, one can use the fact that there are children or not to avoid small clusters. On the other hand, Taylor et al. (2001) show that clustering significantly improves their results and that their results are similar for different sets of clusters. These statements are mainly based on elementary statistics concerning the deciles and on the performance when dealing with different tax simulations.
We end this section by briefly comparing the two implicit methods. Theoretically, grade correspondence can more or less be considered as a special case of the method based on distance functions. Note moreover that the clustering idea can also be used to improve the results of the matching by distance functions (which actually implies that some weights are infinite or very high and so these variables can no longer be used in the distance function). On the other hand there is also a subtle difference. Since in grade correspondence we use the ranking information of the income variable, this method is less sensitive to difference in the income distribution in both surveys. This robustness can be a real advantage in situations where the measurement of the disposable income is not entirely reliable. Finally, in practice, applying the grade correspondence technique is straightforward while choosing the optimal weights for the distance functions can be quite cumbersome.
2.3. Prior comparison of explicit versus implicit methods
In the next section we will evaluate the empirical performance of the different matching techniques. Yet it is also worthwhile to consider the prior theoretical arguments for choosing the ‘best’ matching technique, as well as the arguments and intuitions stemming from the literature considered. Note, however, that in the literature there are hardly any comparisons of the different techniques.
Purely theoretically, one is tempted to favour the implicit methods, since they do not rely on theoretical assumptions and they avoid many of the problems of the explicit methods. There are three types of problems associated with the latter.
A first problem concerns the influence of zero expenditures on the estimation of the Engel curves. From empirical studies it is clear that this highly influences the results.
Secondly, it is infeasible to estimate Engel curves for hundreds of commodities. If one uses Engel curves, one first has to construct expenditure aggregates. This evidently also constrains the imputation of expenditure information to these aggregates. Since the aggregates are fixed before the matching procedure takes place, this deprives EUROMOD users of the possibility to define other expenditure aggregates at a later stage. Implicit methods allow for more flexibility in that the records matched will be the same regardless of the number and the magnitude of the aggregates (since the overlapping variables of the records stay the same). So one can anticipate user manipulation by using many small aggregates. For the explicit methods, this would decrease the quality of the match, e.g. because of the zero expenditure problem.
The third problem is the variability in the imputed expenditure information. If estimated Engel curves are used to impute expenditures, one actually imputes ‘averages’. To increase the variability in the matched dataset, one could draw random errors from a normal distribution with mean zero and variance equal to the mean square error of the model, or draw errors randomly from the error terms in the budget dataset. But this again induces problems, such as negative expenditures. Again this is not an issue when using the implicit methods.
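The residual-drawing option, and the way it can produce negative values, can be illustrated with a few lines of Python. The predicted shares and residuals are hypothetical numbers chosen for the example.

```python
import random

def impute_with_residual(predicted_share, residuals, rng):
    """Add one residual, drawn at random from the budget-survey residuals,
    to a predicted budget share."""
    return predicted_share + rng.choice(residuals)

rng = random.Random(0)
residuals = [-0.08, -0.02, 0.01, 0.03, 0.06]   # hypothetical regression residuals
predicted = [0.05, 0.20, 0.35]                 # hypothetical predicted shares

imputed = [impute_with_residual(p, residuals, rng) for p in predicted]
negatives = [s for s in imputed if s < 0]
print(imputed, negatives)

# A small predicted share combined with a large negative residual goes below zero.
worst_case = impute_with_residual(0.05, [-0.08], rng)
print(worst_case)
```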
In this theoretical scenario, we assume that the implicit methods can be applied at full strength. But this is not the case in this application. Recall that in EUROMOD the direction of the matching is fixed (because it is useless to impute income information into the budget survey), and secondly that the income survey cannot be modified (for instance to duplicate observations). The latter restriction means that we have to use either unconstrained matching methods, which implies that we possibly do not use all the information of the budget survey (since unconstrained matching might use only part of the source dataset records), or forms of constrained matching which do not duplicate observations in the target dataset.
A final note pertains to the way possible tax and benefit changes will be evaluated and simulated. In simulating the effects, a change in the behaviour of the households could be incorporated. With explicit methods the estimated Engel curves can be used to simulate these behavioural reactions as far as real income changes are concerned. The implicit methods on the other hand have not modelled this.
2.4. Conditional independence
It is appropriate to underline that all matching techniques rely on the conditional independence assumption. In order to believe that the simulations with the ‘new’ dataset are reliable, one has to be convinced that this conditional independence assumption holds. To recall the assumption, let us label the variables in the income survey by (X, Y) and the ones in the budget survey by (X, Z), meaning that we call the overlapping variables X and the non-overlapping variables Y and Z. The conditional independence assumption then states that given X, Y and Z should be independent, or equivalently, that all the correlation between Y and Z has to be explained by X. Note that this can be a heavy assumption in the case of budget and income data. Consider, for instance, two people with the same disposable income and the same sociodemographic profile (and so with the same values of X). Suppose they both have a car, but one of them has bought an energy-economical car so as to get an income tax reduction. In that way there can be a positive correlation between the level of the income tax, which belongs to Y, and the level of the private transportation costs, which belongs to Z.^{7}
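In symbols, the assumption can be written as below. The density notation f is introduced here purely for illustration and is not part of the original text.

```latex
% Conditional independence of Y and Z given X
f(y, z \mid x) = f(y \mid x)\, f(z \mid x)
% so that the joint density is identified from the two surveys separately
f(x, y, z) = f(y \mid x)\, f(z \mid x)\, f(x)
% and, in particular, no correlation between Y and Z remains once X is given
\operatorname{Cov}(Y, Z \mid X) = 0
```

The factor f(y | x) can be estimated from the income survey and f(z | x) from the budget survey, which is why no joint observation of Y and Z is needed if the assumption holds.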
2.5. Data
In the project that funded this research, we undertook this evaluation using data from different countries. In this paper we have chosen to focus on one particular country, Belgium, a choice grounded in our familiarity with the data rather than in any country-specific reason. Also, in order to avoid issues associated with data definitions, differential survey design, differential weights, etc., we take the donor and receiving data from the same dataset in this study. The data we use are the Belgian Budget Survey from 2000 and 2001, collected by the Belgian National Institute for Statistics (NIS), containing 3,550 households.
Since the budget surveys only contain net or disposable household income (after taxes) and not gross income, we first used the microsimulation model described in the next paragraph to reconstruct gross incomes from net earnings. This backward calculation was based on the fiscal and parafiscal regulations of the year of the survey itself.
The unit of analysis for incomes is mostly the individual (excluding housing allowances, social assistance, rental income and inheritance/lottery winnings), whilst the period of collection is mostly monthly income together with the number of months received during the reference year. Household-level cross-sectional weights (shared weights) and individual-level longitudinal weights are created that take account of adjustments for sample attrition and external checks on population structure (demographic/socioeconomic/social welfare).
2.6. Summary
In summary, the different methods take the same overlapping variables and try to generate variables (expenditures and budget shares) from the source dataset to introduce into the target dataset. Parametric and nonparametric methods generate an estimate of each variable, conditional on the match variables. For budget shares, we do not utilise error terms, and assign the same shares to the simulated expenditure (which incorporates an error term). This is because of the computational challenge of sampling from a multidimensional error distribution. Of course, it is possible to generate univariate distributions for the error terms. However, we believe that this would make the conditional distributions independent of each other, which is a more serious issue, and would result in budget shares not summing to one. The matching methods retain the inter-variable correlations, as an observation in one dataset is linked to an observation in the other, avoiding this problem. The grade correspondence method matches on the rank of a single variable, while the other two match on the Mahalanobis distance, which contains more information. Nevertheless, as the datasets can have different means and structures, the matching methods may generate different conditional means from the regression-based methods. They also come at an increased computational cost, although the grade correspondence method is much quicker.
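To make the distance-based matching concrete, the following minimal sketch performs an unconstrained nearest-neighbour match on the Mahalanobis distance with two overlapping variables. It rests on several assumptions that are not taken from this paper: the overlapping variables are (income, age), the covariance matrix is estimated on the pooled records, the data are invented, and the function names are hypothetical.

```python
def covariance_2d(rows):
    """Sample covariance entries (sxx, sxy, syy) of two-column rows."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_sq(a, b, cov):
    """Squared Mahalanobis distance between two 2-D points."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = a[0] - b[0], a[1] - b[1]
    # d' S^{-1} d, using the explicit inverse of a 2x2 matrix
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def unconstrained_match(target_x, source_x, source_z):
    """Give each target record the Z value of its nearest source record."""
    cov = covariance_2d(source_x + target_x)  # assumption: pooled covariance
    imputed = []
    for t in target_x:
        best = min(range(len(source_x)),
                   key=lambda j: mahalanobis_sq(t, source_x[j], cov))
        imputed.append(source_z[best])
    return imputed


# Hypothetical overlapping variables (income, age of household head)
source_x = [(1000, 30), (2000, 45), (3000, 60), (4000, 50)]
source_z = [0.30, 0.22, 0.15, 0.12]  # hypothetical food budget shares
target_x = [(1100, 32), (3900, 52), (2100, 44)]

print(unconstrained_match(target_x, source_x, source_z))
```

Because the match is unconstrained, a source record may be used several times or not at all; a constrained variant would additionally restrict how often each source record can act as a donor.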
3. Results: empirical evaluation of the different techniques
A number of tests can be used to determine the comparative matching strength of the different methods described above. However, the main assumption that underlies all methods, the conditional independence assumption (CIA), cannot be tested in an exact sense. There are a number of papers where this assumption is further investigated (e.g. Ingram et al., 2000; Black and Smith, 2004). For this application one has to keep in mind that the CIA may not hold, and this can have negative effects on the matching methods’ efficiency. But since all methods will be affected by it, it seems reasonable to compare the methods relative to each other.
Further, the differences in distributions of the overlapping variables of the two datasets can distort the outcome of the matching process: the marginal distributions of the budget variables for instance will not be reproduced in this case unless constrained matching is applied.
Finally, some care has to be given to the fact that the overlapping variables have to be defined in the same way in both datasets, to avoid misidentification. Therefore, in what follows, the methods will be tested upon a dataset which was split artificially and randomly, so that firstly the problems of different distributions and differing variable definitions of the overlapping variables do not occur and secondly the imputed values can be compared to the observed values on an individual level. For the testing, the Belgian Budget Surveys of the years 2001 and 2002 were used. The datasets were concatenated so as to have more observations and the resulting dataset was split randomly in two equally large datasets, which acted as source and target dataset. Next, the budget shares were imputed from the source into the target dataset using the five different methods.
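The artificial split can be sketched as below. The household identifiers, the seed and the function name `split_source_target` are hypothetical; the sketch only illustrates a random split into two equally large halves, not the exact routine used here.

```python
import random


def split_source_target(records, seed=0):
    """Randomly split the records into two halves: (source, target)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


households = list(range(10))  # hypothetical household identifiers
source, target = split_source_target(households, seed=42)
print(len(source), len(target))
```

In the target half, the observed budget shares are set aside during imputation and kept only for the comparison with the imputed values.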
The relative matching quality was then evaluated by means of two criteria: a goodness of fit measure, and tests of the equality of the distributions of the imputed and the observed budget shares.
For the goodness of fit measure we calculated the differences between the observed and the imputed values for each budget share and took the root of the mean sum of squares of these differences, in short the root mean squared error (rmse). The rmse can be interpreted as a measure for the performance of the methods at the household level, in that it gives the expected deviation between the imputed and observed budget shares per household.
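As a minimal sketch of this measure, with invented budget shares and a hypothetical function name `rmse`:

```python
import math


def rmse(observed, imputed):
    """Root of the mean squared difference between observed and imputed values."""
    if len(observed) != len(imputed):
        raise ValueError("observed and imputed must have the same length")
    squared = [(o - i) ** 2 for o, i in zip(observed, imputed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical observed and imputed food budget shares for four households
observed_food = [0.20, 0.15, 0.30, 0.25]
imputed_food = [0.18, 0.17, 0.26, 0.25]
print(round(rmse(observed_food, imputed_food), 4))
```

The same function would be applied to each budget share separately before averaging across expenditure groups.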
To test the equality of the distribution of observed and imputed budget shares, we want to take into account differences in the distribution of overlapping variables between the source and target dataset.^{8} We therefore perform tests on the conditional distributions. Ideally, this conditionality should be implemented for all overlapping variables simultaneously. Yet, due to lack of data, we only performed the tests conditional upon some important overlapping variables: for each income decile, for different household types (single with or without children, cohabiting with or without children), for different age groups of the household head (younger than 30, between 30 and 50, between 50 and 65 and older than 65) and for different professional statuses (not employed, (self) employed, retired or other). Three tests are carried out.
To compare the equality of distributions, the Kolmogorov-Smirnov test was used. This is a nonparametric test: since the distribution of the imputed values is not known or assumed a priori for the implicit methods, parametric tests are not adequate here. The Kolmogorov-Smirnov test compares the distribution functions by using the maximal distance between them as a test statistic. Note that this may disadvantage the explicit methods since they will create degenerate distributions conditional upon the overlapping variables by construction.
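The test statistic itself is easy to compute. The sketch below returns only the maximal distance between the two empirical distribution functions, not a p-value, and uses invented budget shares; the function name `ks_statistic` is hypothetical. The last sample mimics a degenerate conditional distribution in which every household receives the same predicted share.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximal distance between two empirical distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    distance = 0.0
    for x in sorted(set(a) | set(b)):
        fa = bisect_right(a, x) / len(a)  # share of sample_a at or below x
        fb = bisect_right(b, x) / len(b)  # share of sample_b at or below x
        distance = max(distance, abs(fa - fb))
    return distance


# Hypothetical budget shares within one cell of the overlapping variables
observed = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35]
matched = [0.12, 0.14, 0.22, 0.24, 0.31, 0.36]   # shares taken from donors
degenerate = [0.225] * 6                          # one predicted share for all

print(ks_statistic(observed, matched), ks_statistic(observed, degenerate))
```

In this example the degenerate sample lies much further from the observed distribution than the matched one, even though its single value sits at the centre of the observed shares.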
Two other nonparametric tests take a somewhat intermediate position: they test the equality of the conditional distributions of imputed and observed budget shares, but at the same time recognize that the budget shares are paired: every observation has a value for the imputed and for the observed share. Both tests calculate the differences between imputed and observed values and test whether the median of the resulting distribution is equal to zero.
The sign test takes the number of positive values (which should be around half of the total number of observations) as a test statistic.
The signed rank test also takes the magnitude of the differences into account: all observations get a rank number according to the magnitude of the difference between observed and imputed value and afterwards the ranks of the positive differences are summed. This sum should be around one half of the total sum of ranks.
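Both paired tests can be sketched in a few lines. The example assumes that zero differences are dropped and that tied absolute differences receive average ranks, which are common conventions but are not stated in the text; the data (in percentage points) and function names are hypothetical. Only the exact p-value of the sign test is computed, while for the signed rank test the sketch stops at the test statistic.

```python
import math


def sign_test(observed, imputed):
    """Return (positive differences, n, exact two-sided p-value)."""
    diffs = [i - o for o, i in zip(observed, imputed) if i != o]
    n = len(diffs)
    positives = sum(1 for d in diffs if d > 0)
    k = min(positives, n - positives)
    # Tail probability under Binomial(n, 1/2), doubled for a two-sided test
    tail = sum(math.comb(n, j) for j in range(k + 1)) / 2 ** n
    return positives, n, min(1.0, 2.0 * tail)


def signed_rank_statistic(observed, imputed):
    """Return (sum of ranks of positive differences, total sum of ranks)."""
    diffs = [i - o for o, i in zip(observed, imputed) if i != o]
    ordered = sorted(diffs, key=abs)
    w_plus, pos = 0.0, 0
    while pos < len(ordered):
        end = pos
        while end + 1 < len(ordered) and abs(ordered[end + 1]) == abs(ordered[pos]):
            end += 1
        avg_rank = (pos + end + 2) / 2.0  # average of ranks pos+1 .. end+1
        w_plus += avg_rank * sum(1 for d in ordered[pos:end + 1] if d > 0)
        pos = end + 1
    n = len(ordered)
    return w_plus, n * (n + 1) / 2.0


# Hypothetical budget shares in percentage points for six households
observed = [20, 15, 30, 25, 10, 40]
imputed = [22, 14, 33, 25, 15, 36]

print(sign_test(observed, imputed))
print(signed_rank_statistic(observed, imputed))
```

Under the null hypothesis, the number of positive differences should be close to half of n, and the sum of positive ranks close to half of the total sum of ranks.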
3.1. Goodness of fit of the five different methods
The results for the rmse are summarized in Table 1 for all expenditure groups separately, and by means of an unweighted and a weighted average of the rmse’s, in which we use the shares in disposable income as weights. The conclusion is that overall, the explicit methods have a lower rmse, and so the quality of these imputations is better than that of the ones created by the implicit methods. Among the explicit methods, the parametric and the nonparametric case yield almost the same rmse. At first sight, it is surprising that the nonparametric kernel regression does not have a lower rmse than the parametric Engel curve. Note, however, that the nonparametric Engel curve used here is in fact not fully nonparametric, but only semiparametric: only income and age are treated nonparametrically, while the other factors are treated parametrically. Our results suggest that the QUAIDS specification, with sufficient cross effects built in, is flexible enough to capture all curvatures captured by the semiparametric specification.
Within the group of implicit methods there is a lot of variation in performance. In general the constrained and unconstrained distance function seem to give the same result when it comes to expected deviation from the observed values. The grade correspondence technique performs worse for most of the budget shares, except for “communication” and “recreation and culture”. For some categories, especially for “saving”, all the methods perform very badly.
Note that this goodness of fit measure obviously omits a possible important criterion for selection of the best method. It only looks at the best fit for each expenditure category separately, but does not assess how well the methods replicate or preserve the covariance between the different expenditure categories. We will try to integrate this criterion of assessment in our future research.
3.2. Are the distributions of imputed and observed budget shares different?
A second important issue to be assessed is whether the distributions of the observed and the imputed budget shares, conditional upon the overlapping variables, are the same. As already mentioned, conditioning upon all the overlapping variables is not an option, since this would require far more data than are available to obtain significant results. Therefore, in what follows, the three tests are carried out conditionally upon four important variables (cf. supra). Since this results in four tables with three subtables per budget share (one subtable for each test), a selection of budget shares has been made. Table 2 presents the p-values of the tests for the “food and non-alcoholic beverages” category, Table 3 for “clothing and footwear”, Table 4 for “private transport” and Table 5 for “saving”, each categorised by income category. We perform a similar analysis for age group, employment status and family status in the appendix. For each budget share, the first table gives the p-values for the three tests per income decile, the second per age group of the household head, the third per professional status of the household head and the fourth per household type. In each table, the first subtable gives the results for the Kolmogorov-Smirnov test, the second one for the sign test and the third one for the signed rank test.
Take for instance Table 2. For the fifth income decile, both the parametric and nonparametric methods yield a p-value of zero for the Kolmogorov-Smirnov test, which means that the null hypothesis of equality of distributions is rejected at a significance level of, for instance, 0.05. For the unconstrained and constrained distance function (p-values of 0.640 and 0.405) the null hypothesis can clearly not be rejected in this decile, whereas for the grade correspondence (p-value of 0.022) the null is rejected at a significance level of 0.05 but not at a level of 0.01.
Overall, the implicit methods seem to replicate the conditional distributions, whereas this is not the case for the explicit methods. The bad performance of the explicit methods for the Kolmogorov-Smirnov test can perhaps be explained by the fact that the conditional distributions of their imputed values are degenerate: if the overlapping values are the same, they predict only one share, without variation in the results. The Kolmogorov-Smirnov test is by construction very sensitive when it comes to comparing a degenerate to a non-degenerate distribution. If this is indeed the explanation, augmenting the imputed values of the explicit methods with random error terms as described above may improve the test results. This is planned for future research. However, for the sign and signed rank tests too, the results of the explicit methods are worse than those of the implicit methods. The fact that the conditionality implemented here is only partial because of too few observations may explain the bad performance of the explicit methods, although this shortcoming is also present for the other methods.
4. Conclusion
This paper tried to formulate a solution to the fact that there often exists no single dataset in which both income and budget variables are present. The solution consists of a matching procedure, in which two datasets are merged using variables that are common to both sets. Many different methods are utilised within the literature, but there is no strong consensus as to the appropriate method to use.
Five different matching procedures were investigated in this paper: the parametric and nonparametric estimation of Engel curves, the use of an unconstrained or constrained distance function and grade correspondence. The first two generate a model estimated on the budget set that predicts expenses based on the overlapping variables, and then apply this model to the income set. The other methods attach to each observation of the income dataset the values for the budget variables of an observation in the budget set that is most similar to the original record. A difference in the (mathematical) definition of similarity leads to the three methods discussed above.
We applied the five procedures to the 2001 and 2002 Belgian Budget Surveys in order to test their quality. Overall, the parametric and nonparametric methods seem to generate the best fit of the imputed values with respect to the observed values, which was demonstrated by lower root mean square errors. Concerning the distribution of budget shares conditional upon disposable income, age and professional status of the household head and household type, the distance functions seem to yield the best result, whereas the parametric and nonparametric methods do not reproduce the same distribution. This result can be biased, however, by the fact that estimation procedures yield degenerate conditional distributions by construction.
Future research will examine whether the above conclusions are robust to the introduction of more variation after the imputation by means of the explicit methods (adding error terms), and/or to an additional criterion of assessment, namely how well the covariances between the budget shares are preserved under the different methods. While this study uses the same donor and receiver dataset, it would be of interest to test the robustness of the conclusions when the donor and receiver datasets differ.
We hope that this study provides some guidance to microsimulation model builders who wish to utilise statistical matching. As in the case of Webber and Tonkin (2013), there are pros and cons to the different methods. Ultimately, minimum distance measures produce better distributions, both within and between variables, than the parametric or nonparametric methods, but they produce weaker means and come at a significant computational cost.
As to what to trust: as with all data preparation for microsimulation models, detailed validation is required to check that matched or imputed variables broadly follow the distributions and means from the matched dataset, and additional corrections are needed where there are discrepancies. It should however be noted that one is generating a model for microsimulation purposes. Models are by definition wrong, but hopefully useful. As microsimulation analyses are typically based upon differences between baseline and simulated distributions, the bar is not quite as high as when one is generating merely a base distribution, since some of the differences cancel out. Nevertheless, discipline norms and high standards of validation and verification remain essential.
Given the importance of inter-variable relationships in distributional analysis, our preference is to use minimum distance methods where possible, perhaps correcting means if necessary. However, in the case of the original EUROMOD analysis, where micro datasets were not available (O’Donoghue et al., 2004), or in the case of very large datasets such as the base data for dynamic microsimulation models or spatial microsimulation models, we are willing to sacrifice the improved distributional precision for a lower computational cost and use parametric methods.
Appendix A
References

1. Enhancing the Australian national health survey data for use in a microsimulation model of pharmaceutical drug usage and cost. Journal of Artificial Societies and Social Simulation 11:2.
2. Quadratic Engel curves and consumer demand. Review of Economics and Statistics 79:527–539. https://doi.org/10.1162/003465397557015
3. An investigation of statistical matching. Journal of Applied Statistics 15:275–283. https://doi.org/10.1080/02664768800000038
4. How robust is the evidence on the effects of college quality? Evidence from matching. Journal of Econometrics 121:99–124.
5. Consumer Behaviour: Theory and Empirical Evidence - A Survey. The Economic Journal 98:16–65. https://doi.org/10.2307/2233510
6. Semiparametric estimation and consumer demand. Journal of Applied Econometrics 13:435–461. https://doi.org/10.1002/(SICI)10991255(1998090)13:5<435::AIDJAE506>3.0.CO;2K
7. Calibrating time-use estimates for the British household panel survey. Social Indicators Research 114:1211–1224. https://doi.org/10.1007/s1120501201982
9. Statistical matching and microsimulation models. In: Improving Information for Social Policy Decisions: The Uses of Microsimulation Modelling, Volume II, Technical Papers. Washington, DC: National Academy Press.
10. Developing a continuous space representation of a simulated population. Spatial Economic Analysis 5:317–338. https://doi.org/10.1080/17421772.2010.493954
12. Statistical matching for categorical data: Displaying uncertainty and using logical constraints. Journal of Official Statistics (Stockholm) 22:137.
14. Statistical matching of data from complex sample surveys. In: Proceedings of the European Conference on Quality in Official Statistics, Q2012, 29.
15. The analysis of household surveys: A microeconometric approach to development policy. Baltimore, MD: Johns Hopkins University Press.
17. Comparative analysis of different techniques to impute expenditures into an income data set, work package 3.4 of accurate income measurement for the assessment of public policies (AIMAP contract no 028412), Leuven.
18. Matching of income and expenditure data by means of nonparametric estimation of Engel curves, report of the D.W.T.C. project AG/01/079.
19. How regressive are indirect taxes? A microsimulation analysis for five European countries. Journal of Policy Analysis and Management 29:326–350. https://doi.org/10.1002/pam.20494
20. De constructie van één samengesteld bestand op basis van twee bestanden: koppeling van de budgetenquete 1997-98 en het fiscaal bestand 1999 (inkomstens 1998) [i.e. Match the expenditure survey of 1997-98 to the income survey of 1999].
21. Statistical matching of income and consumption expenditures. International Journal of Economic Sciences 3:50.
22. Comparing nonparametric versus parametric regression fits. The Annals of Statistics 21:1926–1947. https://doi.org/10.1214/aos/1176349403
23. Statistical matching: A new validation case study. Proceedings, American Statistical Association.
24. Statistical matching using propensity scores: Theory and application to the analysis of the distribution of income and wealth. Journal of Economic and Social Measurement 35:177–196. https://doi.org/10.3233/JEM20100332
25. Statistical matching: A model based approach for data integration. Eurostat Methodologies and Working Papers.
26. Statistical matching: Pitfalls of current procedures. Proceedings of the Annual Meeting of the American Statistical Association, August 5-9.
27. Statistical matching: A paradigm for assessing the uncertainty in the procedure. Journal of Official Statistics 17:407.
28. A note on Rubin’s statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business & Economic Statistics 21:65–73. https://doi.org/10.1198/073500102288618766
29. Modelling the redistributive impact of indirect taxes in Europe: An application of EUROMOD. EUROMOD Working Paper No. EM7/01.
30. FiFoSiM - An Integrated Tax Benefit Microsimulation and CGE Model for Germany. International Journal of Microsimulation 2:1–15. https://doi.org/10.34196/ijm.00008
31. Modelling individual choice: The econometrics of corners, kinks and holes. Oxford: Blackwell. ISBN 9780631145899.
32. Statistical matching of administrative and survey data: An application to wealth inequality analysis. Sociological Methods & Research 42:192–224.
33. Statistical matching: A frequentist theory, practical applications, and alternative Bayesian approaches, 168. Springer Science & Business Media.
34. A Non-Iterative Bayesian approach to statistical matching. Statistica Neerlandica 57:58–74. https://doi.org/10.1111/14679574.00221
35. An evaluation of statistical matching. Journal of Business and Economic Statistics 2:91–102.
36. Integrated modelling of the impact of direct and indirect taxes using complementary datasets. The Economic and Social Review 48:171–205.
37. Statistical matching of European Union statistics on income and living conditions (EU-SILC) and the household budget survey. Eurostat Statistical Working Papers. Simar, L., 2004, An Invitation to the Bootstrap: Panacea for statistical inference?, course handout.
38. EUROMOD: the European Union tax-benefit microsimulation model. International Journal of Microsimulation 6:4–26. https://doi.org/10.34196/ijm.00075
39. Combining household income and expenditure data in policy simulations. Review of Income and Wealth 48:517–536. https://doi.org/10.1111/14754991.00066
40. Guidelines for identifying clusters using grade correspondence analysis: Practical and technical issues. Microsimulation unit research note MU/RN/39.
41. Using POLIMOD to evaluate alternative methods of expenditure imputation. Microsimulation unit research note MU/RN38.
42. Data matching to allocate doctors to patients in a microsimulation model of the primary care process in New Zealand. Social Science Computer Review 30:358–368. https://doi.org/10.1177/0894439311417153
Article and author information
Funding
This research was carried out under European Commission FP6 contract no. 028412, “Accurate Income Measurement for the Assessment of Public Policies” (AIMAP).
Acknowledgements
We are grateful for helpful comments made by three anonymous referees, and are indebted to the many people who have contributed to the development of EUROMOD and to participants at several EUROMOD workshops where this paper has been presented. EUROMOD was originally developed and maintained by the Institute for Social and Economic Research (ISER) in Essex in collaboration with national teams from the EU countries. Since 2021 EUROMOD has been maintained, developed and managed by the Joint Research Centre (JRC) of the European Commission, in collaboration with EUROSTAT and the national teams. The results and their interpretation are the authors’ responsibility.
Publication history
 Version of Record published: December 31, 2020 (version 1)
Copyright
© 2020, Decoster, Rock, Swerdt, Loughrey, O’Donoghue, Verwerft
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.