========================================================================
Approximate Inference for Multinomial Logit Models with Random Effects
========================================================================
.. bibsource:: ../../mclogit.bib
The problem
===========
A crucial problem for inference about nonlinear models with random effects is
that the likelihood function for such models involves integrals for which no
analytical solution exists.
For given values $\bm{b}$ of the random effects, the likelihood function of a
conditional logit model (and therefore also of a baseline-logit model) can be
written in the form

.. math::

   \mathcal{L}_{\text{cpl}}(\bm{y},\bm{b})
   =
   \exp\left(\ell_{\text{cpl}}(\bm{y},\bm{b})\right)
   =\exp
   \left(
   \ell(\bm{y}|\bm{b};\bm{\alpha})
   -\frac12\ln\det(\bm{\Sigma})
   -\frac12\bm{b}'\bm{\Sigma}^{-1}\bm{b}
   \right)
However, this "complete-data" likelihood function cannot be used for inference,
because it depends on the unobserved random effects. To arrive at a likelihood
function that depends only on observed data, one needs to use the following
integrated likelihood function:

.. math::

   \mathcal{L}_{\text{obs}}(\bm{y})
   =
   \int
   \exp\left(\ell_{\text{cpl}}(\bm{y},\bm{b})\right)
   \,d\bm{b}
   =
   \int
   \exp
   \left(
   \ell(\bm{y}|\bm{b};\bm{\alpha})
   -\frac12\ln\det(\bm{\Sigma})
   -\frac12\bm{b}'\bm{\Sigma}^{-1}\bm{b}
   \right)
   \,d\bm{b}
In general, this integral cannot be "solved", i.e. eliminated from the formula
by analytic means (it is "analytically intractable"). Instead, one will either
compute it using numerical techniques (e.g. numerical quadrature) or
approximate it using analytical techniques. Unless there is only a single level
of random effects, numerical quadrature can become computationally demanding,
that is, the computation of the (log-)likelihood function and its derivatives
can take a lot of time even on modern, state-of-the-art computer hardware.
Approximations based on analytical techniques, on the other hand, may lead to
biased estimates, in particular in samples where the number of observations
relative to the number of random effects is small, but they are much easier to
compute and sometimes make inference feasible in the first place.
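
To make this concrete, the following self-contained sketch (in Python rather
than *R*, with fabricated toy data; all function names and values are
illustrative assumptions, not part of "mclogit") evaluates the integrated
log-likelihood of a one-group random-intercept logit model by simple
trapezoidal quadrature:

```python
from math import exp, log, pi, sqrt

def loglik_cond(y, alpha, b):
    """Conditional log-likelihood l(y|b; alpha) of a logit model
    with a single random intercept b shared by all observations."""
    eta = alpha + b
    return sum(yi * eta - log(1.0 + exp(eta)) for yi in y)

def loglik_obs(y, alpha, sigma2, n_grid=2001, half_width=8.0):
    """Observed-data log-likelihood: integrate exp(l_cpl) over b
    with the trapezoidal rule on a grid of +/- 8 prior sd around 0."""
    s = sqrt(sigma2)
    lo, hi = -half_width * s, half_width * s
    h = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        b = lo + i * h
        # normal prior density of the random intercept
        dens = exp(-0.5 * b * b / sigma2) / sqrt(2.0 * pi * sigma2)
        weight = 0.5 if i in (0, n_grid - 1) else 1.0
        total += weight * exp(loglik_cond(y, alpha, b)) * dens
    return log(total * h)

y = [1, 0, 1, 1, 0]                     # fabricated toy responses
print(loglik_obs(y, alpha=0.2, sigma2=1.0))
```

With one random effect a one-dimensional grid suffices; with several nested or
crossed random effects the grid would have to span several dimensions at once,
which is why quadrature quickly becomes expensive.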
The package "mclogit" supports two kinds of analytical approximations, the
Laplace approximation and what one may call the Solomon-Cox approximation. Both
approximations are based on a quadratic expansion of the integrand, so that the
thus modified integral has a closed-form solution, i.e. is analytically
tractable.

The Laplace approximation and PQL
=================================
Laplace approximation
---------------------

The (first-order) Laplace approximation is based on a quadratic expansion of
the logarithm of the integrand, the complete-data log-likelihood:

.. math::

   \ell_{\text{cpl}}(\bm{y},\bm{b})\approx
   \ell(\bm{y}|\tilde{\bm{b}};\bm{\alpha})
   -
   \frac12
   (\bm{b}-\tilde{\bm{b}})'
   \tilde{\bm{H}}
   (\bm{b}-\tilde{\bm{b}})
   -\frac12\ln\det(\bm{\Sigma})
   -\frac12\tilde{\bm{b}}'\bm{\Sigma}^{-1}\tilde{\bm{b}}
   -\frac12(\bm{b}-\tilde{\bm{b}})'\bm{\Sigma}^{-1}(\bm{b}-\tilde{\bm{b}})
where $\tilde{\bm{b}}$ is the solution to

.. math::

   \frac{\partial\ell_{\text{cpl}}(\bm{y},\bm{b})}{\partial\bm{b}} = 0

and $\tilde{\bm{H}}=\bm{H}(\tilde{\bm{b}})$ is the value at
$\bm{b}=\tilde{\bm{b}}$ of the negative Hessian

.. math::

   \bm{H}(\bm{b})=-\frac{\partial^2\ell(\bm{y}|\bm{b};\bm{\alpha})}{\partial\bm{b}\,\partial\bm{b}'}.
Since this quadratic expansion, let us call it
$\ell_{\text{Lapl}}(\bm{y},\bm{b})$, is a (multivariate) quadratic function of
$\bm{b}$, the integral of its exponential has a closed-form solution (the
relevant formula can be found in :citenop:`harville:matrix.algebra`).
For purposes of estimation, the resulting approximate log-likelihood is more
useful:

.. math::

   \ell^*_{\text{Lapl}}
   =
   \ln\int \exp(\ell_{\text{Lapl}}(\bm{y},\bm{b})) \,d\bm{b}
   =
   \ell(\bm{y}|\tilde{\bm{b}};\bm{\alpha})
   -
   \frac12\tilde{\bm{b}}'\bm{\Sigma}^{-1}\tilde{\bm{b}}
   -
   \frac12\ln\det(\bm{\Sigma})
   -
   \frac12\ln\det\left(\tilde{\bm{H}}+\bm{\Sigma}^{-1}\right).
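
As an illustration of this formula, the following self-contained Python sketch
(a toy one-group random-intercept logit with fabricated data; all names are
illustrative assumptions, not mclogit's internals) finds the scalar $\tilde{b}$
by Newton's method and compares the Laplace value against brute-force
quadrature; the $2\pi$ constants omitted in the formula above cancel in this
comparison:

```python
from math import exp, log, pi, sqrt

Y, ALPHA, SIGMA2 = [1, 0, 1, 1, 0], 0.2, 1.0   # fabricated toy setup

def ell(b):
    """Conditional log-likelihood l(y|b; alpha), one shared intercept."""
    eta = ALPHA + b
    return sum(y * eta - log(1.0 + exp(eta)) for y in Y)

def laplace_loglik():
    """First-order Laplace approximation: find b-tilde by Newton's
    method, then apply the closed-form expression for l*_Lapl."""
    b = 0.0
    for _ in range(50):
        p = 1.0 / (1.0 + exp(-(ALPHA + b)))
        grad = sum(y - p for y in Y) - b / SIGMA2      # d l_cpl / d b
        hess = -len(Y) * p * (1.0 - p) - 1.0 / SIGMA2  # d2 l_cpl / d b2
        step = grad / hess
        b -= step
        if abs(step) < 1e-12:
            break
    p = 1.0 / (1.0 + exp(-(ALPHA + b)))
    H = len(Y) * p * (1.0 - p)      # negative Hessian of l(y|b) at b-tilde
    return (ell(b) - 0.5 * b * b / SIGMA2
            - 0.5 * log(SIGMA2) - 0.5 * log(H + 1.0 / SIGMA2))

def quadrature_loglik(n=4001):
    """Reference value by brute-force trapezoidal quadrature."""
    s, total = sqrt(SIGMA2), 0.0
    lo, hi = -8.0 * s, 8.0 * s
    h = (hi - lo) / (n - 1)
    for i in range(n):
        b = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * exp(ell(b)) * exp(-0.5 * b * b / SIGMA2)
    return log(total * h / sqrt(2.0 * pi * SIGMA2))

print(laplace_loglik(), quadrature_loglik())
```

Because the integrand is log-concave here, the two values agree closely; the
approximation error grows when the number of observations per random effect is
small.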
Penalized quasi-likelihood (PQL)
--------------------------------

If one disregards the dependence of $\tilde{\bm{H}}$ on $\bm{\alpha}$ and
$\bm{b}$, then $\tilde{\bm{b}}$ maximizes not only $\ell_{\text{cpl}}(\bm{y},\bm{b})$ but also $\ell^*_{\text{Lapl}}$.
This motivates the following IWLS/Fisher scoring equations for
$\hat{\bm{\alpha}}$ and $\tilde{\bm{b}}$ (see
:citenop:`breslow.clayton:approximate.inference.glmm` and `this page <../fittingmclogit>`__):

.. math::

   \begin{bmatrix}
   \bm{X}'\bm{W}\bm{X} & \bm{X}'\bm{W}\bm{Z} \\
   \bm{Z}'\bm{W}\bm{X} & \bm{Z}'\bm{W}\bm{Z} + \bm{\Sigma}^{-1}\\
   \end{bmatrix}
   \begin{bmatrix}
   \hat{\bm{\alpha}}\\
   \tilde{\bm{b}}\\
   \end{bmatrix}
   =
   \begin{bmatrix}
   \bm{X}'\bm{W}\bm{y}^*\\
   \bm{Z}'\bm{W}\bm{y}^*
   \end{bmatrix}
where

.. math::

   \bm{y}^* = \bm{X}\bm{\alpha} + \bm{Z}\bm{b} + \bm{W}^{-}(\bm{y}-\bm{\pi})

is the IWLS "working dependent variable", with $\bm{\alpha}$, $\bm{b}$,
$\bm{W}$, and $\bm{\pi}$ computed in an earlier iteration.
Substitutions lead to the equations

.. math::

   (\bm{X}'\bm{V}^{-}\bm{X})\hat{\bm{\alpha}} = \bm{X}'\bm{V}^{-}\bm{y}^*

and

.. math::

   (\bm{Z}'\bm{W}\bm{Z} + \bm{\Sigma}^{-1})\tilde{\bm{b}} = \bm{Z}'\bm{W}(\bm{y}^*-\bm{X}\hat{\bm{\alpha}}),

which can be solved to compute $\hat{\bm{\alpha}}$ and $\tilde{\bm{b}}$ (for
given $\bm{\Sigma}$).
Here,

.. math::

   \bm{V} = \bm{W}^{-}+\bm{Z}\bm{\Sigma}\bm{Z}'

and

.. math::

   \bm{V}^{-} = \bm{W} - \bm{W}\bm{Z}\left(\bm{Z}'\bm{W}\bm{Z}+\bm{\Sigma}^{-1}\right)^{-1}\bm{Z}'\bm{W}.
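
A quick sanity check of this Woodbury-type identity in the scalar special case
(one observation, one random effect, arbitrary made-up numbers), where
$\bm{W}$, $\bm{Z}$, and $\bm{\Sigma}$ reduce to numbers $w$, $z$, $\sigma^2$
and $\bm{W}^{-}$ is an ordinary reciprocal:

```python
# Scalar check that V * V^- = 1 for V = W^- + Z.Sigma.Z' and the
# stated V^-.  The numbers w, z, sigma2 are arbitrary illustrative values.
w, z, sigma2 = 2.5, 0.7, 1.3

v = 1.0 / w + z * sigma2 * z                               # V = W^- + Z Sigma Z'
v_inv = w - w * z * (z * w * z + 1.0 / sigma2) ** -1 * z * w

print(v * v_inv)   # ≈ 1.0
```

The same cancellation goes through with matrices, which is what makes the
substitution equations above usable without ever forming $\bm{V}$ explicitly.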
Following :citet:`breslow.clayton:approximate.inference.glmm`, the variance
parameters in $\bm{\Sigma}$ are estimated by minimizing

.. math::

   q_1 = \ln\det(\bm{V})+(\bm{y}^*-\bm{X}\bm{\alpha})'\bm{V}^{-}(\bm{y}^*-\bm{X}\bm{\alpha})

or the "REML" variant:

.. math::

   q_2 = \ln\det(\bm{V})+(\bm{y}^*-\bm{X}\bm{\alpha})'\bm{V}^{-}(\bm{y}^*-\bm{X}\bm{\alpha})+\ln\det(\bm{X}'\bm{V}^{-}\bm{X})
This motivates the following algorithm, which is strongly inspired by the
``glmmPQL()`` function in Brian Ripley's *R* package *MASS*:

1. Create some suitable starting values for $\bm{\pi}$, $\bm{W}$, and $\bm{y}^*$
2. Construct the "working dependent variable" $\bm{y}^*$
3. Minimize $q_1$ (quasi-ML) or $q_2$ (quasi-REML) iteratively (inner loop), to
   obtain an estimate of $\bm{\Sigma}$
4. Obtain $\hat{\bm{\alpha}}$ and $\tilde{\bm{b}}$ based on the current
   estimate of $\bm{\Sigma}$
5. Compute updated $\bm{\eta}=\bm{X}\bm{\alpha} + \bm{Z}\bm{b}$, $\bm{\pi}$,
   and $\bm{W}$.
6. If the change in $\bm{\eta}$ is smaller than a given tolerance criterion,
   stop the algorithm and declare it converged. Otherwise go back to step 2
   with the updated values of $\hat{\bm{\alpha}}$ and $\tilde{\bm{b}}$.
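
The outer loop of this algorithm can be sketched as follows. This is a
deliberately stripped-down Python illustration with fabricated toy data, not
mclogit's implementation: it fits a two-group random-intercept logit, holds
the random-effects variance ``sigma2`` fixed (so the inner minimization of
$q_1$ or $q_2$ in step 3 is omitted), and solves the blocked IWLS system
directly:

```python
from math import exp

def solve3(a, r):
    """Solve a 3x3 linear system by Gaussian elimination with
    partial pivoting (kept dependency-free for illustration)."""
    m = [row[:] + [ri] for row, ri in zip(a, r)]
    for c in range(3):
        p = max(range(c, 3), key=lambda i: abs(m[i][c]))
        m[c], m[p] = m[p], m[c]
        for i in range(c + 1, 3):
            f = m[i][c] / m[c][c]
            for j in range(c, 4):
                m[i][j] -= f * m[c][j]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (m[i][3] - sum(m[i][j] * x[j] for j in range(i + 1, 3))) / m[i][i]
    return x

# Fabricated toy data: binary responses in two groups, a common
# intercept alpha and one random intercept b[g] per group.
y = [1, 1, 0, 0, 0, 1]
grp = [0, 0, 0, 1, 1, 1]
sigma2 = 1.0   # random-intercept variance, held fixed here

alpha, b = 0.0, [0.0, 0.0]
eta = [0.0] * len(y)
for _ in range(100):
    # steps 1/5: pi and the (here diagonal) weights W from current eta
    pi = [1.0 / (1.0 + exp(-e)) for e in eta]
    w = [p * (1.0 - p) for p in pi]
    # step 2: working dependent variable y* = eta + W^-(y - pi)
    ystar = [e + (yi - p) / wi for e, yi, p, wi in zip(eta, y, pi, w)]
    # step 4: assemble and solve the blocked system for (alpha, b1, b2)
    A = [[0.0] * 3 for _ in range(3)]
    r = [0.0] * 3
    for i, g in enumerate(grp):
        A[0][0] += w[i]
        A[0][1 + g] += w[i]
        A[1 + g][0] += w[i]
        A[1 + g][1 + g] += w[i]
        r[0] += w[i] * ystar[i]
        r[1 + g] += w[i] * ystar[i]
    A[1][1] += 1.0 / sigma2            # Z'WZ + Sigma^-1 blocks
    A[2][2] += 1.0 / sigma2
    alpha, b[0], b[1] = solve3(A, r)
    new_eta = [alpha + b[g] for g in grp]
    # step 6: stop once eta changes less than the tolerance
    if max(abs(n - o) for n, o in zip(new_eta, eta)) < 1e-10:
        eta = new_eta
        break
    eta = new_eta

print(alpha, b)
```

With these symmetric toy data the fitted $\hat{\alpha}$ is essentially zero and
the two group intercepts are shrunken mirror images of each other, which shows
the penalty term $\bm{\Sigma}^{-1}$ at work.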
This algorithm is a modification of the `IWLS <../fittingmclogit>`__ algorithm
used to fit conditional logit models without random effects. Instead of just
solving a linear equation in step 3, it estimates a weighted linear
mixed-effects model. In contrast to ``glmmPQL()``, it does not use the ``lme()``
function from the package *nlme* for this, because the weighting matrix
$\bm{W}$ is non-diagonal. Instead, $q_1$ or $q_2$ is minimized using the
function ``nlminb`` from the standard *R* package "stats".

The Solomon-Cox approximation and MQL
=====================================
The Solomon-Cox approximation
-----------------------------

The (first-order) Solomon-Cox approximation is based on a quadratic expansion
of the integrand:

.. math::

   \ell_{\text{cpl}}(\bm{y},\bm{b})\approx
   \ell(\bm{y}|\bm{0};\bm{\alpha})
   +
   \bm{g}_0'
   \bm{b}
   -
   \frac12
   \bm{b}'
   \bm{H}_0
   \bm{b}
   -\frac12\ln\det(\bm{\Sigma})
   -\frac12\bm{b}'\bm{\Sigma}^{-1}\bm{b}
where $\bm{g}_0=\bm{g}(\bm{0})$ is the value at $\bm{b}=\bm{0}$ of the gradient
of $\ell(\bm{y}|\bm{b};\bm{\alpha})$,

.. math::

   \bm{g}(\bm{b})=\frac{\partial\ell(\bm{y}|\bm{b};\bm{\alpha})}{\partial\bm{b}},

while $\bm{H}_0=\bm{H}(\bm{0})$ is the negative Hessian at $\bm{b}=\bm{0}$.
As before, the integral of the exponential of this quadratic expansion (which
we refer to as $\ell_{\text{SC}}(\bm{y},\bm{b})$) has a closed-form solution,
as does its logarithm, which is:

.. math::

   \ln\int \exp(\ell_{\text{SC}}(\bm{y},\bm{b})) \,d\bm{b}
   =
   \ell(\bm{y}|\bm{0};\bm{\alpha})
   +
   \frac12\bm{g}_0'\left(\bm{H}_0+\bm{\Sigma}^{-1}\right)^{-1}\bm{g}_0
   -
   \frac12\ln\det(\bm{\Sigma})
   -
   \frac12\ln\det\left(\bm{H}_0+\bm{\Sigma}^{-1}\right).
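
As a self-contained illustration (a toy one-group random-intercept logit with
fabricated data, in Python for exposition only), the Solomon-Cox approximate
log-likelihood can be computed from the gradient and negative Hessian at
$b=0$ and checked against brute-force quadrature:

```python
from math import exp, log, pi, sqrt

Y, ALPHA, SIGMA2 = [1, 0, 1, 1, 0], 0.2, 1.0   # fabricated toy setup

def ell(b):
    """Conditional log-likelihood l(y|b; alpha), one shared intercept."""
    eta = ALPHA + b
    return sum(y * eta - log(1.0 + exp(eta)) for y in Y)

def solomon_cox_loglik():
    """Solomon-Cox approximation: expand the integrand around b = 0."""
    p0 = 1.0 / (1.0 + exp(-ALPHA))
    g0 = sum(y - p0 for y in Y)            # gradient of l(y|b) at b = 0
    H0 = len(Y) * p0 * (1.0 - p0)          # negative Hessian at b = 0
    A = H0 + 1.0 / SIGMA2
    return ell(0.0) + 0.5 * g0 * g0 / A - 0.5 * log(SIGMA2) - 0.5 * log(A)

def quadrature_loglik(n=4001):
    """Reference value by brute-force trapezoidal quadrature."""
    s, total = sqrt(SIGMA2), 0.0
    lo, hi = -8.0 * s, 8.0 * s
    h = (hi - lo) / (n - 1)
    for i in range(n):
        b = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * exp(ell(b)) * exp(-0.5 * b * b / SIGMA2)
    return log(total * h / sqrt(2.0 * pi * SIGMA2))

print(solomon_cox_loglik(), quadrature_loglik())
```

In this toy example the expansion point $b=0$ lies close to the optimum, so
the approximation and the quadrature value nearly agree; when the random
effects are large the gap widens, which is one source of the bias discussed
above.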
Marginal quasi-likelihood (MQL)
-------------------------------

The resulting estimation technique is very similar to PQL (again, see
:citenop:`breslow.clayton:approximate.inference.glmm` for a discussion). The
only difference is the construction of the "working dependent variable"
$\bm{y}^*$. With PQL it is constructed as $\bm{y}^* = \bm{X}\bm{\alpha} +
\bm{Z}\bm{b} + \bm{W}^{-}(\bm{y}-\bm{\pi})$, while the MQL working dependent
variable is just

.. math::

   \bm{y}^* = \bm{X}\bm{\alpha} + \bm{W}^{-}(\bm{y}-\bm{\pi}),

so that the algorithm has the following steps:

1. Create some suitable starting values for $\bm{\pi}$, $\bm{W}$, and $\bm{y}^*$
2. Construct the "working dependent variable" $\bm{y}^*$
3. Minimize $q_1$ (quasi-ML) or $q_2$ (quasi-REML) iteratively (inner loop), to
   obtain an estimate of $\bm{\Sigma}$
4. Obtain $\hat{\bm{\alpha}}$ based on the current estimate of $\bm{\Sigma}$
5. Compute updated $\bm{\eta}=\bm{X}\bm{\alpha}$, $\bm{\pi}$, and $\bm{W}$.
6. If the change in $\bm{\eta}$ is smaller than a given tolerance criterion,
   stop the algorithm and declare it converged. Otherwise go back to step 2
   with the updated values of $\hat{\bm{\alpha}}$.

References
==========
.. bibliography::
:citations: