Logit scaling: A general method for alignment in microsimulation models
Abstract
A general and easily implemented method for alignment in microsimulation models is proposed. Most existing alignment methods address the binary case and they typically put special emphasis on one of the alternatives rather than treating all alternatives in a symmetric way. In this paper we propose a general mathematical foundation of multinominal alignment, which minimizes the relative entropy in the process of aligning probabilities to given targets. The method is called Logit Scaling. The analytical solution to the alignment problem is characterized and applied in deriving various properties for the method. It is demonstrated that there exits an algorithm called BiProportional Scaling that converges to the solution of the problem. This is tested against two versions of the NewtonRaphsonalgorithm, and it is demonstrated that it is at least twice as fast as these methods. Finally, the method is not just computational efficient but also easy to implement.
1. Introduction
Alignment is the core facility for controlling microsimulation models. Microsimulation models are large, elab orate models that need such a tool at the macro level. Alignment works by manipulating the probabilities of the model, aiming to do model calibration, comparative analysis (static or dynamic) and/or to eliminate Monte Carlo noise.
In a more technical sense, alignment can be said to do one of two things: meancorrection or varianceelimination^{1}. The focus of this paper is meancorrection, meaning the process by which you manipulate the probabilities of the individuals such that i) you hit a predefined mean/average propensity for the entire group, and ii) the ma nipulation is as gentle as possible (discussed below). Meancorrection alignment makes it possible to calibrate the model and do comparative analysis. The other kind of alignment, here called varianceelimination, is equiva lent to the above mentioned elimination of MonteCarlo noise. In microsimulation models, events are typically modeled by drawing a uniformly distributed number and comparing it to a probability associated with the event occurring. This MonteCarlo technique introduces noise into the model. Such noise can be eliminated from the model, using alignment techniques (Bækgaard, 2002).
Most existing alignment methods address the binary case and they typically put special emphasis on one of the alternatives rather than treating all alternatives in a symmetric way (for a review, see Li & O’Donoghue, 2014). In this paper we propose a more general mathematical foundation of multinominal alignment. Our mean correction alignment method is called Logit Scaling, and is defined as a method that minimizes the information loss in the adjustment process. The analytical solution to the alignment problem is characterized and further more applied in deriving various properties for the method. It is demonstrated that there exits an algorithm called BiProportional Scaling that converge to the solution of the problem. This is tested against two versions of the NewtonRaphsonalgoritm, and it is demonstrated that it is at least twice as fast as these methods. Finally, it is argued, that the method is very easy to implement.
The practical implementation of Logit Scaling is very simple. Assume N individuals can be in one of A alternative states, and that you have probabilities ${p}_{ia}^{0}$ for all individuals i ∈ {1,…,N} and alternatives a ∈ {1,…,A}. We necessarily have that ${\sum}_{a}{p}_{ia}^{0}=1$ for all i and ${p}_{ia}^{0}\ge 1$ for all i and a. In the alignment process you would like to find new probabilities p_{ia} such that the aggregated number of individuals in the various states satisfies certain targets:
The algorithm we use is called BiProportionate Scaling and is demonstrated in Algorithm 1. Initially let ${p}_{ia}={p}_{ia}^{0}$. Think about p_{ia} as a matrix with N rows and A columns. Scale all columns such that the alignment targets are satisfied $\left({\displaystyle {\sum}_{a}{p}_{ia}={\overline{X}}_{a}}\right)$. Then scale all rows such that ${\sum}_{a}{p}_{ia}^{0}=1$ for all i. Repeat this until the system converges. In realworld problems the algorithm converges very quickly even for many individuals and states.
Require: Initially ${p}_{ia}={p}_{ia}^{0}$
Ensure: ${\sum}_{i}{p}_{ia}}={\overline{X}}_{a$
1: while not converged do
2: Scale columns with γ_{a}’s: p_{ia}:= γ_{a}p_{ia} such that ${\sum}_{i}{p}_{ia}}={\overline{X}}_{a$
3: Scale rows with α_{i}’s: p_{ia}:= α_{i}p_{ia} such that ${\sum}_{i}{p}_{ia}}=1$
4: end while
2. What is a good alignment method?
We would like to start by defining what a good meancorrection alignment method is. The points in the follow ing list ensures that the alignment method is theoretically sound and useable at the same time:
Adjust correctly to the target means
Retain the original shape of the probability distribution
Retain zero probabilities
Symmetric formulation
Multinominal alignment
Ability to work well in a logit environment
Computer efficiency
Easy implementation
Points 1) and 2) are similar to the first two points in Li & O’Donoghue (2014). The idea of meancorrection alignment is to be able to change the mean of the probability distribution. Therefore the method should be precise when it comes to adjusting to a new target (point 1). Let’s try to make this more precise. Consider a population of N individuals, and a number A of alternatives the individuals can be in. Let p_{ia } be individual i^{!}s probability of entering alternative a(i= 1,…,N and a= 1,…,A). In a binary model A= 2. Because we are working with probabilities we should necessarily have that
Let X_{a} be the number of persons in alternative a(such that ${\sum}_{a}{X}_{a}}=N$. The expected number of persons in alternative a is given by:
We can now define meancorrection alignment:
Let ${p}_{ia}^{0}$ (i = 1,…,N and a = 1,…,A) be the initial probabilities in a population. A meancorrection alignment to a target $({\overline{X}}_{1},\dots ,{\overline{X}}_{A})$ is a set of aligned probabilities p_{ia}, (i = 1,…,N and a = 1,…,A) such that:
Observe that this definition only involves the aligned probabilities. Nothing is said about the link between the initial and the aligned probabilities. This is the subject of point 2: a very basic demand for an alignment method is that it makes as little harm as possible to the original probabilities, in the sense that we strive to retain the shape of the original distribution. A way to handle this is to define a distance measure that measures the difference between the initial probabilities and the aligned probabilities. Let P be a N×A matrix containing the p_{ia}− elements. Let D (P, P^{0}) be our measure. A precise version of point 2 would be to minimize D (P,P^{0}) given the restrictions in Definition 1. We will return to this optimization problem later.
The third point in our list deals with zero probabilities. Saying that a probability is zero is a very strong state ment. It means that something is impossible. An alignment method should therefore retain zeroes such that what is impossible before the alignment is also impossible after the alignment.
The fourth point on the list is symmetric formulation. A good example of an asymmetric alignment method is the often used multiplicative scaling (Li & O’Donoghue, 2014). According to this kind of alignment you choose one of the alternatives (let’s say alternative 1) and scale the probabilities to the desired level:
where the scaling factor λ is defined by
If we have two alternatives, the probability of the other alternative is given residually
This is clearly asymmetric. If we had chosen to scale the probabilities of the second alternative instead, we would have had another alignment. This is not satisfying as the alignment method introduces arbitrariness into the analysis. This asymmetry seems not to be a wellknown property in the microsimulation community. As an example it is not mentioned in Li & O’Donoghue (2014). We give an graphical example in Section 3.2.
The only way to make sure that the alignment formulation is symmetric, is to make a definition based on all alternatives. A good alignment method should therefore depend on all the initial probabilities and some pa rameter vector θ_{a}
This is typically not the case in the most used alignment methods (Li & O’Donoghue, 2014).
A good alignment technique should generalize easily to the multinominal case (A > 2). As mentioned above, most alignment methods are binary. Multinominal problems are often solved by a sequence of binary align ments. This is also a source of arbitrariness. If we change the order of the binary alignments we potentially get a different total alignment. It is therefore preferable to have a method that is designed to the general multinominal case.
The final 3 points on the list have to do with the usability of the method. The logit specification is one of the most used methods to estimate probabilities. It is therefore a nice feature if an alignment method works well with this specification^{2}. Finally it is important for the usability of a method to have both computer efficiency and ease of implementation. These two goals can be conflicting goals. As will be demonstrated later, this is not the case here.
3. A multinominal alignment method
You can think about alignment as a method for manipulating probability distributions. Define
where N is the number of individuals. Q = {q_{ia}i = 1,.., N; a = 1,.., A} is a joined discrete probability distribution simultaneously decribing preferences (the A alternatives) and heterogenity (the N individuals)^{3}.
Let Q^{0} be the initial probability distribution. According to definition 1, a meancorrection alignment to a target $({\overline{X}}_{1},..,{\overline{X}}_{A})$ is a new probability distribution Q satisfying:
We would like to satisfy these linear restrictions with as little change in the distribution as possible.
Our way of solving this problem is inspired by methods developed in finance during the mid 90s (Stutzer, 1996; Kitamura & Stutzer, 1997; Buchen & Kelly, 1996). The basic idea is to minimize the relative entropy between the two distributions, subject to the additional restrictions imposed on the new distribution. In Stutzer (1996) the goal is to modify a nonparametric predictive distribution for the price of an asset to satify the martingale condition associated with riskneutral pricing. In Kitamura & Stutzer (1997) the goal is to provide an alternative to generalized method of moments estimation in which the moment conditions holds exactly. In Buchen & Kelly (1996) it is demonstrated how to estimate a distribution of an underlying asset from a set of option prices. The method has been used by Robertson et al. (2005) for VAR model forecasting. For an overview of the use of entropy in finance, see Zhou et al. (2013).
The entropy measure used in the above papers is the KullbeckLeibler Information Criterion (KLIC) distance from Q_{0} to Q(White, 1982):
This is also called the Relative Entropy. It is wellknown that (D Q, Q^{0}) ≥ 0, with equality if and only if Q = Q^{0}. The KLIC measures the information gained when one revises one’s beliefs from the prior probability distribution Q^{0} to the posterior probability distribution Q. In our use of the concept we therefore wish to minimize the information loss from moving away from the original distribution. The KLIC is a special case of a broader class of divergences called fdivergences as well as the class of Bregman divergences. It is the only such divergence over probabilities that is a member of both classes.
The KLIC is not a true metric as it is not symmetric (D (Q,Q^{0}) ≠ D (Q^{0},Q). Even so, it is a premetric and generates a topology on the space of probability distributions. To see that it actually works as it should, let us look at the binary case. Define:
In the binary case we have
For which value of q_{i}_{1} is this function minimized? For D_{i} to be a distance measure we would prefere ${q}_{i1}={q}_{i1}^{0}$ to minimize the function, and this is surely the case. Differentiation yields:
and $\frac{\partial {D}_{i}}{\partial {q}_{i1}}=0$ if and only if ${q}_{i1}={q}_{i1}^{0}$. In Figure 3.1 an example is shown. It is assumed that ${q}_{i1}^{0}=0.2$. For comparison the dashed curve shows $a({q}_{i1}{q}_{i1}^{0})$ where a is chosen such that the the curves are close to the right of ${q}_{i1}^{0}$. It can be seen that KLIC acts much like quadratic distance to the right of the minimum, but has more curverture to the left.
We would like to minimize the information loss D given the restrictions (4), (5) and (6) from definition 1. The restriction (4) makes sure that we satisfy 1) in our list (ability to adjust correctly to the target means). The fact that we minimize the information loss satisfies 2) in the list (ability to retain the original shape of the probability distribution). ’Symmetric formulation’ and ’multinominal alignment’ follows automatically (point 4 and 5).
We can now derive our new alignment method:
Let ${p}_{i\mathrm{a}}^{0}$ (i = 1,…,N and a = 1,…,A) be the initial probabilities in a population. A RelativeEntropyMinimizing MeanCorrection Alignment to a target $({\overline{X}}_{1},\dots ,{\overline{X}}_{A})$ is a set of aligned probabilities p_{ia} that minimize (14), given the restrictions (4), (5) and (6).
And we can prove the theorem:
Let p^{0} (i= 1,…,N and a= 1,…,A) be the initial probabilities in a population. In a RelativeEntropyMinimizing MeanCorrection Alignment the aligned probabilities p_{ia} are given by:
where the parameters φ_{a} satisfy:
Proof. (White, 1982) Define the Lagrangefunction L:
The first order derivatives are given by
such that:
Sum over the alternatives a(using (5)):
such that
Inserting this into (22) yields:
Observe that the Lagrangeparameters φ_{a} are not unique, as
We can therefore assume that
Observe that p_{ia} ∈[0,1] and that 3) is satisfied (retaining zero).
3.1 Logit scaling
We will actually prefere to give our method the catchier name Logit Scaling. The method fits perfectly a situation where the initial probabilities are supplied from a logitmodel (point 6 in the list). Assume the probabilities are given by
where V^{0} are some logitutilities. Then according to theorem 3:
The alignment is done by adding alternativespecific constants to the logitutilities. As a corollary to this, if the preferences of the agents change in an additive way, the probabilities can be corrected with alignment. This is similar to the result in Li & O’Donoghue (2014) where it is shown that a “biased intercept” in a logit model can be corrected with some of the alignment methods (“Sidewalk hybrid with non linear adjustment” and “Sort by the difference between logistic adjusted predicted probability and random number”). Here we give a mathe matical explanation.
After an alignment, the scaling parameters φ_{a} can be calculated and used to solve the following problem: assume we have a logit model describing some events in our microsimulation model. In the base run of the model we align these logit probabilities to given macro levels. In an alternative analysis we would like to calculate the behavioral responses using the logit model. But how can we do that when we have manipulated our logit probabilities in the alignment process?
Assume the logit model is given by
where x^{0} is a vector characterizing individual i and β_{a} is a vector of logitparameters for each alternative a. Let the aligned probabilities be:
Now assume we would like to analyze the effect of changing x^{0} to x_{i} for all individuals i. Having calculated the φ^{!}_{a}s we can calculate the new probabilities p^_{ia}:
These probabilities will give the best possible representation of the marginal effects of the changed characteristics of the individuals.
If the aligned probabilities p_{ia} are known, it is possible to calculate the φ^{!}_{a}s. It can be derived that, for any i:
This is socalled centered logratios known from compositional data analysis (Aitchison, 1986). Even if we are not doing the alignment via a direct calculation of the φ^{!}_{a}s (an idea we will introduce in the next section) we can calculate them from (3.4).
In the binary case (A= 2) it can be proven that
such that the scaling factors are calculated from changes in logoddratios.
3.2 An example
Let us consider a simple example. We have two alternatives (A = 2; death or survive) and two individuals (N = 2). We assume that the death probability of the first person is p_{1} = 0.2 (such that the probability of survival is 0.8) and the death probability of the second person is p_{2} = 0.4. The expected number of deaths in this tiny microsimulation model is 0.2 + 0.4 = 0.6. Let us assume we want to increase the expected number of deaths to 0.85. We will do this by alignment.
The first method we will use is the popular Multiplicative Scaling (Li & O’Donoghue, 2014). The initial point is A in the figure. The alignment process should move us to some point on the downward sloping curve defined by p_{1} + p_{2} = 0.85. With multiplicative scaling we have to possibilities^{4}: scaling up the probabilities of death or scaling down the probabilities of survival. If we scale up the probabilities of death we move from A to B. If we scale down the probabilities of survival we move from A to C. As we can see there is a significant difference in the resulting alignments.
Another problem associated with multiplicative scaling is that the aligned probabilities can exceed 1. If the death probabilitis is scales up to a certain level we reach the point Z. Here the probability of death for person 2 is 1. Any further scaling will imply a death probability larger than 1. Likewise, if the probabilities of survival is scaled enough, we reach the point Y.
Now let us consider Logit Scaling. In the binary case we have from theorem 3 that:
The microsimulation model DYNACAN uses an algorithm called ’Sidewalk with NonLinear Transformation’ (Li & O’Donoghue, 2014). This nonlinear transformation is given by:
We can rewrite (22) to:
such that the transformations are identical for
In the figure the solid nonlinear curve shows all possible alignments for varying values of φ. The current align ment is the point D. Observe that comparet to multiplicative scaling, the method is a compromise between point A and B. Also observe that the system can be aligned to any number of deaths between 0 and N = 2.
4. Biproportional scaling: An algorithmic solution
So far the alignment method seems to have nice theoretical properties, but how about the practical side of the equation? If we substitute the results of theorem 3 into equation (4), we get the following problem: choose parameters φ_{a}, a= 1,…,A, such that:
This is a system of Anonlinear equations with A unknowns. As A is typically small (2 to 10), this is not a big equation system. But from a computational point of view it is a problem that we sum over the number of individuals i. This can be a very big number (up to millions). We therefore need a fast algorithm.
The standard computational way to solve a set of nonlinear equations is to use the NewtonRaphson algorithm. This algoritm works by making a local linear approximation of the system (the Jacobian mitrix). This linear system is solved (the Jacobian matrix is inverted), and by repeating the process it is possible to converge to the solution of the nonlinear set of equations.
Using the NewtonRaphson algoritm implies the calculation of the A×A Jacobian matrix in each iteration, where each element in this matrix is a sum over all individuals. A considerably faster approach is Biproportional scaling. Look at the original problem: minimize (14) given (4) and (5). This is actually a Matrix Balancing Problem (Schneider & Zenios, 1990). Instead of the initial matrix P^{0} we should choose another matrix P, such that the difference between the matrices is as small as possible, and such that the row sums are given (see 2.2) and the column sums are given (see 2.1). It is well known (McDougall, 1999) that if the measure of difference between the matrices is Relative Entropy, then there exists a simple algorithm that converges to the solution of the matrix balancing problem. This algorithm is called biproportional scaling and is very simple and fast. You start out with the original matric P^{0}. In this matrix the row sums are 1 because we work with probabilities. The column sums are equal to the expected numbers of outcomes distributed on the A alternatives. First you scale the columns such that the colums sums are correct according to the restrictions, i.e. the expected numbers are equal to the alignment targets. Then you scale the rows such that the column sums are correct (sums to 1 again). But then the column sums are not correct any more. If you repeat this process again and again, you will realize that the errors in the column sums will decrease dramatically, and the system will converge. In real world problems the convergence is very fast (often less than 10 iterations). We therefore have a very surprising result: our very nonlinear problem can be solved by simpel iterative scaling.
The method is tested against two versions of the NewtonRaphson algorithm, a standard version and an op timized version (see Table 1). The reason for this comparison is twofold: first, it is proven that Biproportional scaling is faster then the standard method (even when it is optimized), and second, it is demonstrated that the methods yields the same results.
In the standard version of the NewtonRaphson algoritm, the Jacobian matrix is calculated based on abstract differentiation of the nonlinear equation system. In each iteration the elements in the matrix is calculated by an average over all individuals. The matrix is inverted using the C#library Math.Net. In the optimized version, the procedure is the same except for the calculation of the Jacobian matrix. The elements in the matrix is averaged over a sample of 10 pct. of the individuals. This improves the speed, only slightly compromising precision.
The artificial population is 1.000.000 individuals with 4 alternatives. The artificial set of probabilities are cal culated by drawing 1 mill. numbers from each of 4 normal distributions^{5}. If (x_{i}_{1},x_{i}_{2},x_{i}_{3},x_{i}_{4}) are these 4 number for individual i, the probabilitis are calculated as:
The result of the test is given in Table 1. Biproportional scaling ran twice as fast as the optimized Newton Raphson, and more than 6 times faster than the standard NewtonRaphson. If we look at the centered logratios we see that these are practically identical for the standard NewtonRaphson and BiPropotional Scaling. As mentioned the speed of the optimized NewtonRaphson resulted in a somewhat lower precision. This precision cost is not observed for BiProportional Scaling.
The final point on the list is ease of implementation. It took more than a week to develop and test the Newton Raphson algorithms. When the idea came to the author, it took half an hour to code biproportional scaling, showing the power of the approach.
5. Conclusion
A general mathematical foundation of multinominal alignment in microsimulation models is derived. The alignment problem is characterized as the solution to an optimization problem. The analytical solution to this problem is characterized and applied in deriving various properties for the method. It is demonstrated that there exits an algorithm called BiProportional Scaling that converges to the solution of the optimization problem, that the new method is more general than existing methods, and that it is fast and easy to implement.
Footnotes
1.
Or a more general term for both: moment correction, as it corrects the first two moments of the probability distribution.
2.
Admittedly this point is added because out method works well with logits.
3.
Q is actually a probability distribution as 0 ≤qia ≤1 and ^{), ),} qia = 1.
4.
This asymmetry seems not to be a wellknown in the microsimulation community. As an example it is not mentioned in Li & O’Donoghue (2014)
5.
The distibutions are: N(−3.0,0.8), N(−1.0,0.5), N(0,0.5) and N(−0.2,0.8) where the first number i mean and the second is variance.
References
 1

2
Micromacro linkage and the alignment of transition processes. Technical paper(25).Micromacro linkage and the alignment of transition processes. Technical paper(25)..

3
The maximum entropy distribution of an asset inferred from option pricesJournal of Financial and Quantitative Analysis 31:143–159.

4
An informationtheoretic alternative to generalized method of moments estimationEconometrica: Journal of the Econometric Society pp. 861–874.

5
Evaluating binary alignment methods in microsimulation modelsJournal of Artiftcial Societies and Social Simulation 17:15.

6
Entropy theory and RAS are friends. GTAP Working PapersEntropy theory and RAS are friends. GTAP Working Papers, 6.
 7

8
A comparative study of algorithms for matrix balancingOperations research 38:439–455.

9
A simple nonparametric approach to derivative security valuationThe Journal of Finance 51:1633–1652.

10
Maximum likelihood estimation of misspecified modelsEconometrica: Journal of the Econometric Society pp. 1–25.
 11
Article and author information
Author details
Acknowledgements
The author would like to thank Joao Miguel Ejarque, Michael Andersen, Marianne Frank Hansen and two anonymous referees for many thoughtful comments and edits. We would also like to thank the The Knowledge Centre for Housing Economics under Realdania for generous financial support.
Publication history
 Version of Record published: December 31, 2016 (version 1)
Copyright
© 2016, Stephensen et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.