Pooling Incomplete Data Sets
Abstract
Data needed in micro studies are not always available from just a single source. In such cases it might be possible to combine two or more independent samples. The problem studied in this paper is to estimate a linear function between y and x from one sample of y-observations and another of x-observations. This is feasible if there are common variables z which can be used to predict x. A two-stage least squares estimator is propounded, which, for the model considered, is also an ML estimator. Simulation experiments show that it has a good relative efficiency and virtually no small sample bias.
1. Introduction
Microdata, i.e. data on individuals, households or firms, have long been used in social science research, and their importance is increasing. Since human behavior sometimes calls for very complex explanations and since social experimentation is rarely a feasible approach, social scientists frequently work with complex models involving many variables in efforts to control for confounding effects. To estimate these models we do not only need large samples of microdata, but ideally we would also need to observe many aspects of behavior for each individual, household or firm.
Surveys are the primary source of microdata, but survey research is very expensive, and few researchers can afford new surveys. Contributing to the high costs is the increasing difficulty, in many countries, of gaining the cooperation of respondents. In particular, when many questions are asked and the respondent burden is heavy, respondents tend to economize on their time. They have also become aware of privacy issues. The public debate about the use of computers and the risk of invasion of privacy, which in some countries has resulted in data legislation, has made it even more difficult to obtain the respondents' cooperation.
As pointed out in Dalenius (1982), matrix sampling is a class of sampling schemes which may cope with these problems. Since with such a design only a sample of all variables is observed for each selected unit, the respondent burden and the risk for invasion of privacy are reduced. It is, however, obvious that multivariate analysis from such a sample might meet with difficulties.
High costs and nonresponse problems also make us look for alternative data sources, i.e. already existing data files and administrative records. Frequently, however, we are unable to find all the variables needed in those files and records. If feasible, matching of two or more data sets might give us what we need. Matching is also much less expensive than a new survey. However, in most cases exact matching is not possible, either because there is no overlap, i.e. no individual, household or firm can be found in more than one data set, or there is no unique identification term, or the use of this term is prohibited in order to protect personal privacy. In Sweden, for instance, matching data on individuals falls under the Data Act and requires a permit from the National Data Inspection Board (see footnote 1).
If neither a new complete data collection nor an exact matching of existing files is a feasible solution, to what extent is it then possible to use non-overlapping data sets? The answer to this question will in the general case depend on the intended analysis and on the sense in which the data are incomplete. This paper treats the problem of estimating a linear relation from two independent data sets, neither of which includes all the relevant variables. Suppose, for instance, that we want to estimate a model according to which y depends on x. y and x are, however, not observed for the same units, but y- and x-observations are found in two independent samples.
This is a hopeless situation unless something more is known about y and x, and there is some common information in both samples. Suppose a third variable, z, which could be used as a predictor of x, is observed in both samples. It is then possible to estimate the predictive relationship between x and z from one sample and use the estimates in the other sample to predict x. The relation between y and x is finally estimated on the basis of these predictions and observed y-values. This procedure requires model assumptions which make it possible to pool two (or more) samples. The properties of the estimates, of course, depend on the assumptions made, and the whole approach is useful only if these are not too constraining. As we shall see, it is possible to work within quite a general class of linear models.
When invasion of privacy is a real issue or nonresponse a severe problem, a complete data collection might not be permissible or feasible. However, if good predictors can be found, each individual, household or firm would only need to contribute partial information, which would be less sensitive, and an analysis would still be feasible with the methods suggested in this paper.
The problem of missing variables is not new. In most textbooks of regression analysis and econometrics it is discussed as a specification error. In the applied literature various ad hoc approaches can be found. For instance, in the economic literature on compensating wage differentials, aggregate data on occupational characteristics such as accident rates and aggregate work environment data are matched with individuals on the basis of their occupation and industry. One example is given in Brown (1980). As we shall find, this is a special application of the prediction approach discussed below.
An alternative approach is statistical matching (U.S. Department of Commerce, 1980). The theoretical basis for statistical matching is still undeveloped, and it is also a very expensive approach (Barr et al., 1982; Paass, 1982; Rodgers and DeVol, 1982). The approach discussed in this paper has the advantages of being based on conventional statistical theory and of requiring no expensive matching of microdata.
In Section 2 of the paper we review some results for a model with stochastically dependent equations. Since interdependent systems have mostly been used in connection with aggregate time-series, while microdata applications are less frequent, it might erroneously be believed that the approach suggested in this paper is only of minor interest in microstudies. However, the importance of this type of model is likely to increase in microstudies as well, when more longitudinal data become available and when ample data make it feasible to analyze joint decisions of microunits, e.g. joint household decisions about work, leisure, consumption and savings. Also, and more important, the estimation approach suggested is not only applicable to interdependent systems but to a much wider class of models. In Section 3 this approach will be used to estimate a linear regression model. One reason to discuss its application to an interdependent system first is that the "predictors" of the unobserved variables are in a natural way given by theory, which is not obviously the case with a regression model.
Section 4 presents some findings from a simulation study of the small sample properties of the estimator, and Section 5 gives some concluding remarks.
2. Estimation of an interdependent linear model
The estimation of one equation in an interdependent system of equations from two independent samples with missing variables was discussed in Klevmarken (1982). This equation was specified as
$$y = Y_1\beta + X_1\gamma + u, \qquad (1)$$
and it was part of the interdependent system,
$$YB' + X\Gamma' = U, \qquad (2a)$$
$$E(U) = 0, \qquad (2b)$$
$$E(u_s'u_t) = \delta_{st}\Sigma, \qquad (2c)$$
where $u_t$ denotes row $t$ of $U$, $\delta_{st}$ is the Kronecker delta, and
$Y$ ($n \times G$) is a matrix of n observations on G endogenous variables,
$y$ is a vector of the n observations on the endogenous variable explained by (1),
$Y_1$ is a matrix of the n observations on the g explanatory endogenous variables in (1),
$X$ ($n \times K$) is a matrix which includes all K exogenous variables,
$X_1$ is a submatrix of X which includes the k exogenous variables in (1),
$U$ ($n \times G$) is a matrix of stochastic disturbances,
$u$ is the vector of stochastic disturbances of (1), one of the columns of U,
$B$ ($G \times G$) and $\Gamma$ ($G \times K$) are parameter matrices,
$\beta$ and $\gamma$ are vectors of the nonzero parameters in (1),
$\Sigma$ is an unknown positive definite moment matrix.
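It is assumed that (1) is identified. The reduced form of the complete system is,
$$Y = X\Pi' + V, \qquad \Pi = -B^{-1}\Gamma, \quad V = U(B')^{-1}. \qquad (3)$$
The part of the reduced form corresponding to the endogenous variables to the right in equation (1) is,
$$Y_1 = X\Pi_1' + V_1, \qquad (4)$$
where $\Pi_1$ and $V_1$ are the corresponding $g \times K$ and $n \times g$ submatrices of $\Pi$ and $V$, respectively.
For later use it is also convenient to introduce an $n \times (K-k)$ matrix $X_2$ defined by,
$$X = (X_1 \;\; X_2). \qquad (5)$$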
Suppose now that data are not available in the form of one complete sample, but that there are two samples, A and B, neither of which contains all the variables. Assume that the data come in the following form,
Sample A: $y^A$ and $X^A$ ($n_A$ observations),
Sample B: $Y_1^B$ and $X^B$ ($n_B$ observations).
$n_A$ and $n_B$ are the two sample sizes. They are not necessarily equal. Since (2c) implies that there is no residual correlation between observational units, the two samples can be treated as independent random samples.
An example to which this problem specification might be applicable is the joint estimation of demand functions for consumer goods and household time-use functions, both derived from a household production type of model. Consumer expenditure data could be obtained from a household expenditure study, while time-use data would have to be taken from a separate time-use survey. There are at present practically no surveys which include both kinds of data. Both kinds of surveys would, however, give income data and other characteristics of the household.
Eq. (1) cannot be estimated from sample A alone, since the Y1-variables are missing, but, if $g \le K-k$, the two samples can be combined in the following two-stage procedure:
I. Estimate the reduced form equations (4) from sample B by OLS, which gives the estimates $\hat\Pi_1$. Use these estimates to predict $Y_1$ in sample A, i.e.
$$\hat Y_1^A = X^A\hat\Pi_1'.$$
II. Estimate by OLS from sample A
$$y^A = \hat Y_1^A\beta + X_1^A\gamma + \varepsilon, \qquad (7)$$
where $\varepsilon = u^A + (Y_1^A - \hat Y_1^A)\beta$.
Note that $Y_1^A - \hat Y_1^A$ is not the least squares prediction error from sample A and thus not necessarily orthogonal to $X^A$.
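With the following notation
$$Z = (\hat Y_1^A \;\; X_1^A), \qquad \delta = (\beta' \;\; \gamma')',$$
(7) becomes
$$y^A = Z\delta + \varepsilon,$$
and the estimator of $\delta$ is,
$$\hat\delta = (Z'Z)^{-1}Z'y^A.$$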
If the two samples coincided, $\hat\delta$ would be the usual TSLS estimator. In Klevmarken (1982) it was shown that the estimator is biased but consistent. The following asymptotic properties can also be proved (see footnote 2).
If $n_B = cn_A$, where $c>0$ is an arbitrary finite constant, and if $(1/n_A)(X^{A\prime}X^A)$ and $(1/n_B)(X^{B\prime}X^B)$ both tend to finite non-singular limits when $n_A$ and $n_B$ tend to infinity, and if the rows of the error matrix U are stochastically independent, then $\sqrt{n_A}(\hat\delta - \delta)$ asymptotically follows a normal distribution with zero mean vector and covariance matrix.
$\sigma_1$ is the first column of the matrix $\Sigma$. The first element of this vector is $\sigma_{11}$ and the other elements are the covariances between the error term in the first equation and those in the other equations. $B^*$ is a matrix of g columns from $(B')^{-1}$ such that $V_1 = UB^*$.
Q, a finite non-singular matrix, is the limit to which $(1/n_A)(Z'Z)$ tends in probability when $n_A$ tends to infinity.
The asymptotic moment matrix of the ordinary TSLS estimator based on a complete A-sample is $\sigma_{11}Q^{-1}$. As shown in Klevmarken (1982), it is possible to find cases for which the asymptotic variance of $\hat\delta$ is smaller than the asymptotic variance of the ordinary TSLS estimator.
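The two-stage procedure of this section can be sketched numerically. The following is a minimal illustration, not the implementation used in the paper: the function name `two_sample_tsls`, the two-equation data-generating process and all parameter values are assumptions made for the example. Stage I regresses the right-hand endogenous variables on all exogenous variables in sample B; stage II regresses y on the predictions and the included exogenous variables in sample A.

```python
import numpy as np

def two_sample_tsls(y_a, x_a, y1_b, x_b, k):
    """Two-sample TSLS sketch.

    y_a  : (n_a,)   dependent variable, observed in sample A only
    x_a  : (n_a, K) all exogenous variables, sample A
    y1_b : (n_b, g) right-hand endogenous variables, observed in sample B only
    x_b  : (n_b, K) all exogenous variables, sample B
    k    : the first k columns of x_a are the exogenous variables of the equation
    """
    # Stage I: OLS of the reduced form on sample B, then predict Y1 in sample A.
    pi_hat, *_ = np.linalg.lstsq(x_b, y1_b, rcond=None)    # shape (K, g)
    y1_hat_a = x_a @ pi_hat
    # Stage II: OLS of y on (predicted Y1, included exogenous variables) in sample A.
    z = np.column_stack([y1_hat_a, x_a[:, :k]])
    delta_hat, *_ = np.linalg.lstsq(z, y_a, rcond=None)
    return delta_hat

rng = np.random.default_rng(0)
beta, gamma = 0.5, 1.0            # hypothetical structural parameters
pi = np.array([[0.8], [1.5]])     # hypothetical reduced-form coefficients (K = 2, g = 1)

def draw_sample(n):
    x = rng.normal(size=(n, 2))
    # Correlated disturbances make y1 endogenous in the structural equation.
    errs = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)
    u, v = errs[:, 0], errs[:, 1]
    y1 = x @ pi + v[:, None]
    y = beta * y1[:, 0] + gamma * x[:, 0] + u
    return y, y1, x

y_a, _, x_a = draw_sample(20000)     # y1 is discarded: unobserved in sample A
_, y1_b, x_b = draw_sample(20000)    # y is discarded: unobserved in sample B
delta_hat = two_sample_tsls(y_a, x_a, y1_b, x_b, k=1)
print(delta_hat)                     # estimates of (beta, gamma)
```

With g = 1 and K - k = 1 the order condition holds with equality in this example, so the equation is just identified.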
As a preliminary to the next section, suppose that sample A does not only lack the Y1-observations but also all observations on one or more of the X1-variables. Could we then use the information in sample B to predict X1 in A? Since X1 by definition is exogenous, there is no theoretical justification for predicting X1 within the present model. Additional assumptions about X1 are needed. This problem and the nature of the new assumptions are, however, not particular to a model of interdependent equations, and could more conveniently be discussed within the framework of an ordinary regression model.
3. Estimation of a linear regression model
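As above, we assume that there are two independently drawn samples, A and B. These now include the following variables and observations,
Sample A: $y^A$ and $X_2^A$ ($n_A$ observations),
Sample B: $X_1^B$ and $X_2^B$ ($n_B$ observations).
The two samples will be used to estimate the model,
$$y = X_1\beta + u, \qquad (3.1)$$
where $\beta$ is an unknown parameter vector, and $X_1$ is assumed to be a matrix of $k_1$ stochastic variables, which depends linearly on $X_2$,
$$X_1 = X_2R + E. \qquad (3.2)$$
R is a parameter matrix. $X_2$ is treated as a matrix of non-stochastic variables. It is also assumed that there are at least as many predictors as predictands, i.e. $k_2 \ge k_1$, where $k_2$ is the number of columns of $X_2$. For the two samples we thus obtain the following three relations,
$$y^A = X_1^A\beta + u^A, \qquad (3.3a)$$
$$X_1^A = X_2^AR + E^A, \qquad (3.3b)$$
$$X_1^B = X_2^BR + E^B. \qquad (3.3c)$$
Since $X_1^A$ is not observed, (3.3 b) is inserted into (3.3 a) to give the following reduced system of observed variables
$$y^A = X_2^AR\beta + u^A + E^A\beta, \qquad (3.4a)$$
$$X_1^B = X_2^BR + E^B. \qquad (3.4b)$$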
This model shows great similarities with an errors-in-variables model analysed in Goldberger (1972). Goldberger’s starting point was a regression model containing a single explanatory variable observed with a random error. This model also assumed that the true unobserved variable was a non-stochastic linear function of a number of independent variables. Zellner (1970) has previously considered this model and developed a generalized least-squares estimator and also presented a Bayesian analysis of the model. Goldberger developed the corresponding maximum likelihood theory and extended it to a model in which the true unobserved variable is a stochastic function of the independent variables rather than an exact function. The observable equivalent of Goldberger’s model has the same form as eq:s (3.4 a, b). The only difference is that the two equations refer to two different samples, while Goldberger’s observations were assumed to come from a single sample.
It is convenient to rewrite eq:s (3.4 a, b) in the following way,
where is the j:th column of , rj is the j:th column of R, and
It is assumed that u and are uncorrelated for all j. It follows that,
where
Since the two samples are independent, the joint distribution of the observable variables is,
The likelihood function becomes (disregarding irrelevant constants),
Differentiating with respect to the unknown parameters and putting the derivatives equal to zero gives,
If we define it thus follows that,
From eq:s (3.9) and (3.11) it follows that,
The concentrated likelihood function then becomes,
If eq. (3.15) is premultiplied by one easily sees that,
Eq. (3.16) inserted into eq. (3.13) gives a new concentrated likelihood function,
If the derivative of this function is put equal to zero, we obtain the ordinary least-squares solution,
or in matrix form,
From (3.15) and (3.18b) we thus find that the maximum-likelihood solution is equivalent to the TSLS procedure analogous to the estimation method suggested for the interdependent model in the previous section.
This result no longer holds if the X1-variables are correlated, i.e. if the moment matrix of the errors in the predictive relations is not diagonal, or if the errors are heteroscedastic. With such changes in the model specification we would have to find the maximum likelihood estimates by numerical maximization of the likelihood function.
If we allow both for contemporaneously correlated errors and for heteroscedasticity, the model can be written in the following way,
where ,
Now let
We define the contemporaneous moment matrices for observation t,
and the individual specific moment matrices,
These two types of matrices are thus functions of the same parameters. We also allow uA to be heteroscedastic, i.e.
With these assumptions it follows that,
and,
The matrix (3.25) is denoted V and the matrix (3.26) Disregarding irrelevant constants the likelihood function then becomes,
The model will not be identified unless additional assumptions are made about the nature of the heteroscedasticity.
The TSLS estimate of $\beta$ is consistent. This result follows because it is a function of consistent estimates of the parameters R of the auxiliary relations and it is in itself consistent conditional on R. This is true also for non-normal errors.
The TSLS estimate, however, is not unbiased. This is easily seen from the simple model with $k_1 = 1$ and only one predictor parameter:
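One way to write out such a simple case is sketched below. It assumes a single unobserved regressor, a single predictor and no intercepts; the scalar symbols $\beta$, $r$, $x_1$, $x_2$, $e$ and $u$ are notation assumed here for illustration and need not coincide with the original display.

```latex
\begin{align*}
y_i^A &= \beta x_{1i}^A + u_i^A, \qquad x_{1i} = r\,x_{2i} + e_i,\\
\hat r &= \frac{\sum_i x_{2i}^B x_{1i}^B}{\sum_i (x_{2i}^B)^2}, \qquad
\hat\beta = \frac{\sum_i \hat r\,x_{2i}^A\,y_i^A}{\sum_i (\hat r\,x_{2i}^A)^2}
          = \frac{1}{\hat r}\,\frac{\sum_i x_{2i}^A\,y_i^A}{\sum_i (x_{2i}^A)^2},\\
E(\hat\beta) &= \beta\,r\,E\!\left(\frac{1}{\hat r}\right).
\end{align*}
```

The last line uses the independence of the two samples and holds whenever the expectations exist. Although $E(\hat r) = r$, the expectation of the reciprocal generally differs from $1/r$, so the estimate is biased in small samples even though it is consistent.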
The critical assumptions of the present model and of the whole approach are those about the auxiliary relations (3.3b) and (3.3c). It might at first seem very constraining to assume that there exist linear relations which explain the unobserved variables and that these relations are the same for both samples. However, we do not necessarily have to give them a causal interpretation but could rather look upon them as predictive relations. In practical applications of this approach, we will then face the problem of finding good predictors. Linearity is not a binding restriction. At the minor cost of discontinuities, non-linear relations can be transformed into linear relations with dummy variables.
If the samples are sufficiently large, we could use dummy variables only and still obtain a good precision of the estimates. That would be equivalent to grouping the two samples by the same groups, merging them at the group level and then estimating the relation between y and X1 from group means.
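The equivalence with grouped data can be checked numerically. The sketch below assumes one unobserved regressor, six groups and hypothetical parameter values; all names are illustrative. With a full set of group dummies as predictors, the first-stage prediction is the group mean of x1 in sample B, and the second stage coincides with a least-squares fit to group means in which each group is weighted by its number of observations in sample A (the weighting is what the algebra implies, not something stated in the text).

```python
import numpy as np

rng = np.random.default_rng(1)
n_groups = 6
beta0, beta1 = 2.0, 0.7                          # hypothetical parameters
group_means = np.linspace(1.0, 6.0, n_groups)    # hypothetical E(x1 | group)

def draw(n):
    g = rng.integers(0, n_groups, size=n)        # group membership (the only X2 information)
    x1 = group_means[g] + rng.normal(size=n)
    y = beta0 + beta1 * x1 + rng.normal(size=n)
    return g, x1, y

g_a, _, y_a = draw(3000)     # sample A: y and group, x1 unobserved
g_b, x1_b, _ = draw(2000)    # sample B: x1 and group, y unobserved

# Stage I with dummies only: OLS on group dummies gives the group means of x1 in B.
x1_bar_b = np.array([x1_b[g_b == j].mean() for j in range(n_groups)])
x1_hat_a = x1_bar_b[g_a]

# Stage II: OLS of y on the predicted x1 in sample A.
z = np.column_stack([np.ones(len(y_a)), x1_hat_a])
tsls, *_ = np.linalg.lstsq(z, y_a, rcond=None)

# Group-level alternative: weighted LS of group means of y (from A)
# on group means of x1 (from B), with weights equal to the group sizes in A.
n_a = np.bincount(g_a, minlength=n_groups)
y_bar_a = np.array([y_a[g_a == j].mean() for j in range(n_groups)])
w = np.sqrt(n_a)
zg = np.column_stack([np.ones(n_groups), x1_bar_b]) * w[:, None]
grouped, *_ = np.linalg.lstsq(zg, y_bar_a * w, rcond=None)

print(tsls, grouped)         # the two coefficient vectors agree
```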
The auxiliary relations must be formulated in such a way that no parameters are needed to predict X1 in sample A which cannot be estimated from sample B. This means that the two samples cannot differ too much with respect to observational units, variable definitions etc. However, the two samples do not have to be of the same size, and the number of observations with a particular configuration of X2-values can differ between the samples. Other differences might also be acceptable. For instance, if the model explains household behavior and sample A includes household data, it might be possible to use data on individual persons in sample B, viz. if household behavior (data) can be predicted from individual behavior (data).
4. A simulation experiment
A simulation experiment was made to illustrate the estimation procedure numerically and to get an idea about its small sample properties. The experiment was based on a demand equation, which relates expenditures on food, beverages and other everyday commodities to household disposable income, age of household head and household size. The parameters of this function were estimated from two samples of household data. The first sample included all variables except disposable income. There was, however, in this sample information about tax assessed income for each household member. The second sample included both household disposable income and assessed income, as well as additional variables which could be used to predict disposable income, the most important one being the ownership of owner-occupied houses (see footnote 3). The model was specified as follows,
where
LEXP = The logarithm of expenditures on food, beverages and other everyday commodities,
LDISP = The logarithm of disposable income,
AGE = The age of the household head,
HS = Household size,
LTAXINC = The logarithm of the total of the assessed incomes of all household members,
DH = A dummy variable for owners of owner-occupied houses, and
DHTAXINC = Assessed income for owners of owner-occupied houses; zero for non-owners.
With minor adjustments the parameter values were obtained as the least squares estimates of the model from a small sample of 144 households, which included survey and register information on all variables. The observations on the exogenous variables of this sample were then used in the simulations. First, the sample was randomly divided into two samples, A and B. Then for each observation, LDISP was simulated either four times or ten times. In sample A the simulated LDISP values were used to simulate LEXP values. In this way we obtained a simulated sample four or ten times larger than the original sample but with the same covariance structure of the exogenous variables. Finally, the parameters of the demand function were estimated by the TSLS procedure using the two subsamples. The whole simulation and estimation procedure was then replicated.
There are altogether seven simulations. In all but the last two of these, the two subsamples were of the same size, 288 in simulations 1 and 2 and 720 in simulations 3 to 5. In simulations 6 and 7 sample A included 360 observations, while sample B was three times as large.
The results are shown in Table 1. They indicate that, when the true predictive relation is used, there is virtually no bias. The bias estimates given in the table are so small that they are dominated by the random fluctuations of the simulation experiment. In simulation 5 the predictive relation used in the estimation was misspecified. The house-owner variable and the interaction variable were both deleted, and the logarithm of assessed income was thus the only predictor (see footnote 4). The result is a small bias. The income elasticity is underestimated by 5 per cent, and there is also an 8-9 per cent bias in the estimates of the household size parameters.
Since the LDISP variable is predicted with an error, it might be expected that the efficiency of the TSLS estimator would be less than for the LS estimator based on a complete sample. In Table 1 we can compare the relative root mean-square errors for these two cases (see footnote 5). In both cases this measure includes the variability caused by the drawing of a new LDISP vector for each sample. We find that the estimated relative efficiency, defined as the ratio of the two root mean-square errors, is approximately 80 per cent for the income elasticity, i.e. for the coefficient of LDISP. It is between 80 and 90 per cent for the intercept and above 90 per cent for the other parameters. The number of replications is not large enough to justify any conclusions about the dependence of the efficiency on the sample size. When the predictor relation is incorrectly specified there is an additional, but in this particular case modest, loss in efficiency. This is at least partly caused by the bias component.
Finally, Table 1 also shows that the variance formula for the ordinary least-squares estimates applied to the second step of the TSLS procedure, i.e. the diagonal elements of $S^2(Z'Z)^{-1}$, gives good estimates of the true variance of the estimator. $S^2$ is the residual variance in the second estimation step. The last row of each panel of Table 1 shows the square root of the mean over the replications of each diagonal element, relative to the true parameter value. They differ very little from their corresponding relative root mean-square errors.
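The design of such an experiment can be sketched in a few lines. The code below is a simplified stand-in, not a replication of Table 1: the base sample is generated rather than taken from survey data, there is one unobserved regressor and one observed regressor, and all sizes, parameter values and variable names are assumptions. It computes the relative bias, the relative root mean-square errors of the TSLS estimator and of least squares on a complete sample A, their ratio, and the square root of the mean second-stage OLS variance formula relative to the true parameter values.

```python
import numpy as np

rng = np.random.default_rng(42)
n_base, copies, n_rep = 72, 4, 300     # hypothetical: base units per sample, copies, replications
beta = np.array([1.0, 0.6, 0.1])       # hypothetical: intercept, x1 (unobserved in A), w
r = np.array([0.5, 0.9, 0.4])          # hypothetical predictive relation: intercept, z, w
sd_u, sd_e = 0.3, 0.2                  # hypothetical error standard deviations

def design(n):
    """Exogenous variables (constant, z, w); the whole block is repeated `copies` times."""
    z = rng.normal(size=n)
    w = rng.integers(1, 6, size=n).astype(float)
    return np.tile(np.column_stack([np.ones(n), z, w]), (copies, 1))

def ols(x, y):
    return np.linalg.lstsq(x, y, rcond=None)[0]

x2_a, x2_b = design(n_base), design(n_base)   # held fixed over the replications
n_a = len(x2_a)
tsls, full, var_hat = [], [], []
for _ in range(n_rep):
    x1_a = x2_a @ r + sd_e * rng.normal(size=n_a)        # used only to generate y and the benchmark
    x1_b = x2_b @ r + sd_e * rng.normal(size=len(x2_b))
    y_a = beta[0] + beta[1] * x1_a + beta[2] * x2_a[:, 2] + sd_u * rng.normal(size=n_a)

    # Stage I on sample B, stage II on sample A.
    x1_hat_a = x2_a @ ols(x2_b, x1_b)
    z_hat = np.column_stack([np.ones(n_a), x1_hat_a, x2_a[:, 2]])
    b_tsls = ols(z_hat, y_a)
    resid = y_a - z_hat @ b_tsls
    s2 = resid @ resid / (n_a - z_hat.shape[1])
    var_hat.append(s2 * np.diag(np.linalg.inv(z_hat.T @ z_hat)))
    tsls.append(b_tsls)

    # Benchmark: least squares on a complete sample A.
    z_full = np.column_stack([np.ones(n_a), x1_a, x2_a[:, 2]])
    full.append(ols(z_full, y_a))

tsls, full, var_hat = map(np.array, (tsls, full, var_hat))
rel_bias = (tsls.mean(axis=0) - beta) / beta
rrmse_tsls = np.sqrt(((tsls - beta) ** 2).mean(axis=0)) / np.abs(beta)
rrmse_full = np.sqrt(((full - beta) ** 2).mean(axis=0)) / np.abs(beta)
rel_efficiency = rrmse_full / rrmse_tsls
rel_se_formula = np.sqrt(var_hat.mean(axis=0)) / np.abs(beta)
print(rel_bias, rel_efficiency, rel_se_formula / rrmse_tsls)
```

How close the OLS variance formula comes to the simulated mean-square error depends on the size of the prediction errors relative to the regression errors, so the ratio printed last is specific to the assumed values.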
5. Concluding remarks
It is not unusual in nonexperimental studies that variables are missing or replaced by proxies. Even when a new survey is to be made it might not be feasible to collect all items of information from every respondent, either because of the risk for invasion of privacy or because the respondent burden might become so high that the response rate would drop below an acceptable level. One approach to solve this problem has been suggested in this paper. When the statistical problem is to estimate a linear relation between y and x or an equation in an interdependent linear system, one sample including all variables is not necessarily needed, but two or more samples, each with missing variables, can be used instead. A sampling scheme like matrix sampling could thus be used in combination with the estimation method suggested. A condition is that it is possible to predict the missing variables.
If "true" predictors are used, the two-stage least-squares estimator suggested has good properties. It is consistent and asymptotically normal. For a proof see Klevmarken (1983b). What has been shown here is that the estimator is a maximum-likelihood estimator, if the moment matrices of the regression model and the predictive relations are scalar. When they are not scalar matrices, the maximum-likelihood estimates can be obtained by numerical optimization of the likelihood function.
A sampling experiment has indicated that the two-stage least-squares estimator also has favorable small sample properties. We found virtually no bias, and the decrease in efficiency, relative to the case with one complete sample, was small.
In practice it might be difficult to know the "true" predictors. In the case of the interdependent model, the predictors are given by the model. If all of them cannot be used, e.g. because of a shortage of data, the estimates will no longer be consistent (see Klevmarken (1982)). For the regression model the same conclusion does not necessarily follow. In this case the predictive relations do not necessarily have the same status of theory. To some extent we can choose these relations at our convenience. If x1 is a stochastic variable, it is always possible to define distributions conditional on some x2. If these distributions do not have the same mean, the conditional mean of x1 given x2 could, in principle, be used as a predictive relation. The efficiency of the resulting estimate would, however, depend on how well this relation predicts x1. If the residual variance is high, the estimates are likely to have a high variance as well. In practical applications it would thus be desirable to use at least part of sample B to find good predictors. This search process is stochastic, and the variability introduced by the search should, in principle, be taken into account when the properties of the estimator are evaluated.
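The dependence of the precision on the quality of the predictor can be illustrated with a small experiment. This is a sketch under assumed values: one predictor, no intercepts, unit regression error variance, and a hypothetical helper `sd_of_slope` that returns the standard deviation of the two-sample estimate over replications for a given residual standard deviation of the predictive relation.

```python
import numpy as np

def sd_of_slope(sd_e, n=300, n_rep=400, beta=1.0, r=1.0, seed=0):
    """Monte Carlo standard deviation of the two-sample slope estimate (illustrative)."""
    rng = np.random.default_rng(seed)
    x2_a = rng.normal(size=n)            # predictor values, sample A (held fixed)
    x2_b = rng.normal(size=n)            # predictor values, sample B (held fixed)
    draws = np.empty(n_rep)
    for i in range(n_rep):
        x1_a = r * x2_a + sd_e * rng.normal(size=n)   # generates y only; not used in estimation
        x1_b = r * x2_b + sd_e * rng.normal(size=n)
        y_a = beta * x1_a + rng.normal(size=n)
        r_hat = (x2_b @ x1_b) / (x2_b @ x2_b)         # stage I on sample B
        x1_hat_a = r_hat * x2_a
        draws[i] = (x1_hat_a @ y_a) / (x1_hat_a @ x1_hat_a)   # stage II on sample A
    return draws.std()

sd_good = sd_of_slope(sd_e=0.2)   # predictive relation with small residual variance
sd_poor = sd_of_slope(sd_e=2.0)   # predictive relation with large residual variance
print(sd_good, sd_poor)
```

Under these assumptions the weaker predictive relation gives a clearly larger spread of the estimates, in line with the argument above.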
With X1 unobserved and with no supplementary information, the regression model is unidentified. The identification is achieved by the predictive relation, which adds the necessary a priori information and bridges the two samples. A priori information might also come in other forms. For instance, if the unobserved X1 variables explain not only one y-variable but two or more, we could look upon the y-variables as indicators of X1 and arrive at a model which is similar to a factor analysis model. All these indicators do not necessarily have to come from the same sample. Although the details have to be worked out, it should be possible to combine two or more independent samples in this case as well.
There are common features of the approach suggested in this paper and the general principles of statistical matching. In statistical matching similar observations in the two samples are matched. Similarity is defined either by a grouping principle or by a distance measure defined on a set of variables. These variables basically serve the same purpose as the predictor variables in the TSLS approach. There are, however, also important differences. In most applications of statistical matching there is the implicit assumption of independence between y and X1 conditional on X2. The models discussed in this paper do not involve this assumption. The conditional covariance between y and X1 in the regression model is a function of the structural parameters and the moment matrix of the errors in the predictor relations.
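The last statement can be made explicit in a short derivation. It assumes the regression model of Section 3 for a single observation, with the regression error $u$ uncorrelated with the errors $e$ of the predictive relations; the symbols $\beta$, $R$ and $\Sigma_e$ (the moment matrix of $e$) are notation assumed here.

```latex
\begin{align*}
y &= x_1'\beta + u, \qquad x_1 = R'x_2 + e, \qquad E(ee') = \Sigma_e, \quad E(eu) = 0,\\
\operatorname{Cov}(x_1, y \mid x_2) &= E\big[e\,(e'\beta + u)\big] = \Sigma_e\,\beta .
\end{align*}
```

Conditional independence between y and X1 given X2 would thus require $\Sigma_e\beta = 0$, a restriction which the model does not impose.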
Finally, a remark on functional form. Linear models have been assumed throughout this paper. Extending the same approach to non-linear models would be quite conceivable, and there is no reason why this would not be feasible. The properties of the estimates would, however, be more difficult to derive, and this remains to be done.
Footnotes
1. For a discussion of the need for microdata in economics, see Klevmarken (1983a).
2. This corrects results given in Klevmarken (1982). A proof is parallel to that given in Klevmarken (1983b).
3. Deduction of interest payments reduces the assessed income for owners of owner-occupied houses, as compared with non-owners.
4. The full model was, of course, used to simulate data.
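5. The relative root mean-square error for the estimator is defined as the root mean-square error of the estimates over the replications, divided by the true parameter value.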
References
1. Barr et al. (1982). An Empirical Evaluation of Statistical Matching Methodologies. Report prepared for the Office of Assistant Secretary for Planning and Evaluation, U.S. Department of Health and Human Resources.
2. Brown (1980). Equalizing Differences in the Labor Market. The Quarterly Journal of Economics 94:113. https://doi.org/10.2307/1884607
3. Dalenius (1982). A Sample of Ideas for Research and Development in the Theory and Methods of Sample Surveys. Utilitas Mathematica 21 A:59–74.
4. Goldberger (1972). Maximum-Likelihood Estimation of Regressions Containing Unobservable Independent Variables. International Economic Review 13:1. https://doi.org/10.2307/2525901
5. Klevmarken (1982). Missing Variables and Two-Stage Least-Squares. Estimation from More than One Data Set. American Statistical Association, Proceedings of the Business and Economic Statistics Section, pp. 156–161.
6. Klevmarken (1983a). Micro Econometrics, the IUI Yearbook 1982/83. The Industrial Institute for Economics and Social Research, Stockholm, Sweden.
7. Klevmarken (1983b). Asymptotic Properties of a Least-Squares Estimator Using Incomplete Data. Research Report 1983:3. Sweden: Department of Statistics, University of Gothenburg.
8. Paass (1982). Statistical Match with Additional Information. Report IPES.82.0204. Bonn: Gesellschaft für Mathematik und Datenverarbeitung mbH.
9. Rodgers and DeVol (1982).
10. U.S. Department of Commerce (1980). Report on Exact and Statistical Matching Techniques. Statistical Policy Working Paper 5.
11. Zellner (1970). Estimation of Regression Relationships Containing Unobservable Variables. International Economic Review 11:441. https://doi.org/10.2307/2525323
Article and author information
Funding
A grant from The Bank of Sweden Tercentenary Foundation is gratefully acknowledged.
Acknowledgements
This article was originally published in Statistical Review 1983:5, pp 67-79, Essays in Honour of Tore Dalenius, SCB (Statistics Sweden), Stockholm.
Paul Olovsson very efficiently helped with the programming, and Claes Cassel contributed useful comments on a previous draft.
Publication history
- Version of Record published: April 30, 2022 (version 1)
Copyright
© 2022, Anders Klevmarken
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.