[통계계산] 8. Numerical Integration

Computational Statistics

Table of Contents

  1. Computer Arithmetic
  2. LU decomposition
  3. Cholesky Decomposition
  4. QR Decomposition
  5. Iterative Methods
  6. Non-linear Equations
  7. Optimization
  8. Numerical Integration
  9. Random Number Generation
  10. Monte Carlo Integration


sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] fansi_0.5.0       digest_0.6.27     utf8_1.2.1        crayon_1.4.1     
 [5] IRdisplay_1.0     repr_1.1.3        lifecycle_1.0.0   jsonlite_1.7.2   
 [9] evaluate_0.14     pillar_1.6.1      rlang_0.4.11      uuid_0.1-4       
[13] vctrs_0.3.8       ellipsis_0.3.2    IRkernel_1.1      tools_3.6.3      
[17] compiler_3.6.3    base64enc_0.1-3   pbdZMQ_0.3-5      htmltools_0.5.1.1
install.packages("ggplot2")
library(ggplot2)
library(statmod)

Numerical integration

Goal

To evaluate $\int_a^b f(x) dx$, $f:[a, b] \to \mathbb{R}$.

Why do we need integrals?

  1. Expectation
    • $X$ has some density on $[0, 1]$; $Y = \sin(X)$
    • $\mathbf{E}[Y] = \int_0^1 \sin(x)f_X(x)dx$
  2. Bayesian inference
    • Parameter $\theta$ has a prior density $f_{\Theta}(\theta)$ on $\mathbb{R}$
    • Likelihood of data $x$ is $f_{X|\Theta}(x|\theta)$
    • Want to evaluate the posterior density \(f_{\Theta|X}(\theta|x) = \frac{f_{X|\Theta}(x|\theta)f_{\Theta}(\theta)}{\int_{-\infty}^{\infty} f_{X|\Theta}(x|\theta')f_{\Theta}(\theta')d\theta'}\)
    • The integral in the denominator may not be analytically solved.
  3. MLE of structured data
    • Data: $y_{ij}$ = the number of words recalled by the $i$th Alzheimer's disease patient in the $j$th month, $i=1,\dotsc, n$, $j=1, \dotsc, T$
    • Model: \begin{align} Y_{ij} \mid \lambda_{ij} &\stackrel{\text{indep}}{\sim} \text{Poi}(\lambda_{ij}), \quad j=1,\dotsc, T \\ \lambda_{ij} &= \exp(\gamma_i + \beta_0 + j\beta_1) \\ \gamma_i &\stackrel{\text{i.i.d.}}{\sim} N(0, \sigma^2), \quad i=1, \dots, n \end{align}
    • This is called a generalized linear mixed model (here, a Poisson GLM with a random intercept)
    • Likelihood: \begin{align} L(\beta_0, \beta_1, \sigma^2) &= \prod_{i=1}^n p(y_{i1}, \dotsc, y_{iT}) \\ &= \prod_{i=1}^n \int_{-\infty}^{\infty} \prod_{j=1}^T p(y_{ij} \mid \lambda_{ij}(\gamma_i)) f(\gamma_i)\, d\gamma_i \\ &= \prod_{i=1}^n \int_{-\infty}^{\infty} \prod_{j=1}^T e^{-\lambda_{ij}(\gamma_i)}\frac{[\lambda_{ij}(\gamma_i)]^{y_{ij}}}{y_{ij}!}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{\gamma_i^2}{2\sigma^2}}\, d\gamma_i, \\ \text{where}~~ \lambda_{ij}(\gamma_i) &= \exp(\gamma_i + \beta_0 + j\beta_1). \end{align}
    • The likelihood function (and hence the score function) consists of integrals, which in most cases do not have a closed form (a small numerical sketch follows this list).
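
To make the motivating example concrete, here is a minimal sketch (with made-up values for $y_i$, $\beta_0$, $\beta_1$, and $\sigma$; not part of the original notes) of the likelihood contribution of a single subject, with the integral over the random intercept $\gamma_i$ evaluated by R's built-in adaptive quadrature:

y_i   <- c(9, 7, 6, 4, 3)                  # hypothetical recall counts over T = 5 months
beta0 <- 2; beta1 <- -0.2; sigma <- 0.5    # hypothetical parameter values
# integrand: prod_j Poisson(y_ij | lambda_ij(gamma)) * N(gamma | 0, sigma^2)
integrand <- function(gamma) {
    sapply(gamma, function(g) {
        lambda <- exp(g + beta0 + beta1 * seq_along(y_i))
        prod(dpois(y_i, lambda)) * dnorm(g, 0, sigma)
    })
}
integrate(integrand, -Inf, Inf)$value      # L_i(beta0, beta1, sigma^2) for this subject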

What should we do?

Newton-Cotes Quadrature

(Taken from Givens and Hoeting)

Riemann rule

riemann <- function(f, a, b, n) {
    h <- (b - a) / n   # width of each subinterval
    
    xi <- seq.int(a, b, length.out = n + 1)   # n + 1 equally spaced points
    xi <- xi[-1]
    xi <- xi[-length(xi)]   # keep the n - 1 interior points

    intgrl <- h * (f(a) + sum(f(xi)))   # left Riemann sum over the n left endpoints
    
    return(intgrl)
}
f <- function(x) sin(x)  # integrand: example above with uniform X
(truth <- 1 - cos(1)) # true integral

0.45969769413186

riemann(f, 0, 1, 100)

0.455486508387318

numsub <- c(10, 100, 1000, 10000, 100000)
err_R <- numeric(length(numsub))
for (i in seq_along(numsub)) {
    n <- numsub[i]
    R <- riemann(f, 0, 1, n)
    err_R[i] <- R - truth
}
ggplot() + geom_point(aes(x=numsub, y=abs(err_R))) + geom_line(aes(x=numsub, y=abs(err_R))) + 
scale_x_log10() + scale_y_log10() + theme_bw(base_size = 15)

(Figure: absolute error of the Riemann rule versus the number of subintervals, on log-log axes.)

Trapezoidal rule

trapezoidal <- function(f, a, b, n) {
    h <- (b - a) / n   
    
    xi <- seq.int(a, b, length.out = n + 1)
    xi <- xi[-1]
    xi <- xi[-length(xi)]

    intgrl <- h * (0.5 * f(a) + sum(f(xi)) + 0.5 * f(b))
    
    return(intgrl)
}
# f <- function(x) sqrt(1-x^2)   # integrand
trapezoidal(f, 0, 1, 100) 

0.459693863311358

truth

0.45969769413186

numsub <- c(10, 100, 1000, 10000, 100000)
err_T <- numeric(length(numsub))
for (i in seq_along(numsub)) {
    n <- numsub[i]
    Tr <- trapezoidal(f, 0, 1, n)   # avoid `T`, which is an alias for TRUE
    err_T[i] <- Tr - truth
}
ggplot() + geom_point(aes(x=numsub, y=abs(err_T))) + geom_line(aes(x=numsub, y=abs(err_T))) + 
scale_x_log10() + scale_y_log10() + theme_bw(base_size = 15)

(Figure: absolute error of the trapezoidal rule versus the number of subintervals, on log-log axes.)

Simpson’s rule

Note the jump in accuracy order from $O(n^{-2})$ (trapezoidal) to $O(n^{-4})$ (Simpson)!

Also note that Simpson's method uses about twice as many evaluation points as the trapezoidal method (endpoints plus midpoints). Hence, for a fair comparison, compare the methods with the same total number of evaluation points.

simpson <- function(f, a, b, n) {
    h <- (b - a) / n   
    i <- seq_len(n - 1)
    xi <- a + i * h                    # interior endpoints of the n subintervals
    xmid <- c(xi - h / 2, b - h / 2)   # midpoints of the n subintervals
    intgrl <- h * (f(a) + 2 * sum(f(xi)) + 4 * sum(f(xmid)) + f(b)) / 6
   
    return(intgrl)
}
# f <- function(x) sqrt(1-x^2)   # integrand
simpson(f, 0, 1, 50)    # in fact about 100 pts

0.459697694157399

truth

0.45969769413186

Let’s equate the total number of points.

simpson <- function(f, a, b, n) {
  # composite Simpson's rule on n subintervals; n must be even
  h <- (b - a) / n
   
  xi <- seq.int(a, b, length.out = n + 1)
  xi <- xi[-1]
  xi <- xi[-length(xi)]   # interior points x_1, ..., x_{n-1}
 
  intgrl <- (h / 3) * (f(a) + 2 * sum(f(xi[seq.int(2, length(xi), 2)])) + 
                       4 * sum(f(xi[seq.int(1, length(xi), 2)])) + f(b))
   
  return(intgrl)
   
}
simpson(f, 0, 1, 100)

0.459697694157399

truth

0.45969769413186

numsub <- c(10, 100, 1000, 10000, 100000)
err_S <- numeric(length(numsub))
for (i in seq_along(numsub)) {
    n <- numsub[i]
    S <- simpson(f, 0, 1, n)
    err_S[i] <- S - truth
}
ggplot() + geom_point(aes(x=numsub, y=abs(err_S))) + geom_line(aes(x=numsub, y=abs(err_S))) + 
scale_x_log10() + scale_y_log10() + theme_bw(base_size = 15)

(Figure: absolute error of Simpson's rule versus the number of subintervals, on log-log axes.)

comp <- data.frame(method = factor(c(rep("riemann", length(numsub)), rep("trapezoidal", length(numsub)), 
                                rep("simpson", length(numsub)) )),
                   numsub = rep(numsub, 3), 
                   error = c(err_R, err_T, err_S) )
comp
A data.frame: 15 × 3

   method        numsub   error
   riemann       1e+01    -4.245669e-02
   riemann       1e+02    -4.211186e-03
   riemann       1e+03    -4.207738e-04
   riemann       1e+04    -4.207393e-05
   riemann       1e+05    -4.207359e-06
   trapezoidal   1e+01    -3.831453e-04
   trapezoidal   1e+02    -3.830821e-06
   trapezoidal   1e+03    -3.830814e-08
   trapezoidal   1e+04    -3.830813e-10
   trapezoidal   1e+05    -3.830658e-12
   simpson       1e+01     2.556920e-07
   simpson       1e+02     2.553913e-11
   simpson       1e+03     2.664535e-15
   simpson       1e+04     5.551115e-17
   simpson       1e+05     5.551115e-17
ggplot(data = comp, aes(numsub, abs(error))) + geom_point(aes(colour=method)) + geom_line(aes(colour=method)) + 
scale_x_log10() + scale_y_log10() + theme_bw(base_size = 15)

(Figure: absolute errors of the three rules versus the number of subintervals, on log-log axes.)

General $m$

Use the Lagrange polynomial \(L_{ij}^m(x) = \prod_{\substack{k=0 \\ k\neq j}}^{m} \frac{x - x_{ik}^*}{x_{ij}^* - x_{ik}^*}\) for equally spaced $x_{ij}^*$ in $[x_i, x_{i+1}]$ such that \(p_i(x) = \sum_{j=0}^m f(x_{ij}^*)L_{ij}^m(x) .\)

It is easy to check \(L_{ij}^m(x_{ik}^*) = \begin{cases} 1, & k = j , \\ 0, & k \neq j. \end{cases}\) Thus $p_i(x)$ is a polynomial of degree at most $m$ that interpolates $(x_{ij}^*, f(x_{ij}^*))$, $j = 0, \dotsc, m$.

The Lagrange polynomials yield a proof of the interpolation theorem:

Theorem 1. Given $m+1$ distinct points $x_0, \dotsc, x_m \in \mathbb{R}$ and the corresponding values $y_0, \dotsc, y_m \in \mathbb{R}$, there exists a unique polynomial of degree at most $m$ that interpolates $\{(x_0, y_0), \dotsc, (x_m, y_m)\} \subset \mathbb{R}^2$.

Proof. The existence of such a polynomial is shown above. To show uniqueness, let \(p(x) = \sum_{j=0}^m y_j L_j^m(x)\) be the interpolating polynomial constructed from the Lagrange polynomials, and let $q(x)$ be another interpolating polynomial of degree at most $m$. Then \begin{align} p(x_0) &= q(x_0) \\ &\vdots \\ p(x_m) &= q(x_m), \end{align} so that $f(x) = p(x) - q(x)$ has $m+1$ distinct roots. Since $f(x)$ is a polynomial of degree at most $m$, by the fundamental theorem of algebra it can have at most $m$ roots unless it is identically zero. Therefore $f(x) = 0$ for all $x$, i.e., $p(x) = q(x)$ for all $x$.

Corollary 1. Let $f(x)$ be a polynomial of degree at most $m$. Suppose $x_0, \dotsc, x_m$ subdivide $[a, b]$ in an equally spaced fashion. Then \(\int_a^b f(x)dx = \sum_{j=0}^m w_j f(x_j),\) where $w_j = \int_a^b L_j^m(x)dx$.

This holds because such an $f$ coincides with its own interpolant, $f(x) = \sum_{j=0}^m f(x_j)L_j^m(x)$ (Theorem 1), and integrating both sides term by term gives the stated weights.
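
As a quick illustration of Corollary 1 (a sketch added here, not part of the original notes; the helper names lagrange_basis and nc_weights are made up), the weights $w_j = \int_a^b L_j^m(x)dx$ can be computed numerically and checked on a degree-$m$ polynomial:

lagrange_basis <- function(j, nodes) {
    # returns L_j^m(x) as an R function for the given nodes (j is 1-based)
    function(x) {
        out <- rep(1, length(x))
        for (k in seq_along(nodes)) {
            if (k != j) out <- out * (x - nodes[k]) / (nodes[j] - nodes[k])
        }
        out
    }
}
nc_weights <- function(a, b, m) {
    nodes <- seq(a, b, length.out = m + 1)   # equally spaced nodes
    w <- sapply(seq_along(nodes), function(j) integrate(lagrange_basis(j, nodes), a, b)$value)
    list(nodes = nodes, weights = w)
}
f3 <- function(x) 2 * x^3 - x + 1            # degree-3 polynomial
q3 <- nc_weights(0, 1, 3)                    # m = 3 rule on [0, 1]
sum(q3$weights * f3(q3$nodes))               # quadrature value
integrate(f3, 0, 1)$value                    # reference: the two should agree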

Gaussian quadrature

In Newton-Cotes quadrature, the evaluation points $x_{ij}^*$ are equally spaced in $[x_i, x_{i+1}]$ and the integral $I_i$ is approximated as \(I_i = \int_{x_i}^{x_{i+1}}f(x)dx \approx \sum_{j=0}^m w_{ij}f(x_{ij}^*)\) by choosing $w_{ij}$ optimally, in the sense that the approximation is exact if $f$ is a polynomial of degree at most $m$ (Corollary 1).

We may remove the constraint of equal spacing and optimally choose both $w_{ij}$ and $x_{ij}^*$.

Theorem 2. If $f$ is a polynomial of degree at most $2m+1$, then there exist $\{x_{i0}^*, \dotsc, x_{im}^*\} \subset [x_i, x_{i+1}]$ and $\{w_{i0}, \dotsc, w_{im}\} \subset \mathbb{R}$ such that \(I_i = \int_{x_i}^{x_{i+1}}f(x)dx = \sum_{j=0}^m w_{ij}f(x_{ij}^*) .\) Furthermore, $w_{ij} > 0$ for all $j$.

Thus a quadrature rule using $m+1$ points can be exact for polynomials of degree up to $2m+1$, a big improvement over Newton-Cotes.
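
The following quick check (a sketch, not from the original notes) uses statmod::gauss.quad on $(-1, 1)$: a Gauss-Legendre rule with $m+1 = 3$ nodes integrates a polynomial of degree $2m+1 = 5$ exactly.

gq3 <- statmod::gauss.quad(3, kind = "legendre")   # 3 nodes and weights on (-1, 1)
f5 <- function(x) x^5 - 3 * x^4 + x^2 + 2          # degree-5 polynomial
sum(gq3$weights * f5(gq3$nodes))                   # quadrature value
integrate(f5, -1, 1)$value                         # reference: the two should agree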

Review: Linear Algebra

Let $\mathcal{P}^m$ be the set of all polynomials of degree at most $m$. We all know that $\mathcal{P}^m$ is a vector space (over the real numbers) of dimension $m+1$, and that $\{1, x, x^2, \dotsc, x^m\}$ forms a basis.

It is easy to check that for any given distinct points $x_0, \dotsc, x_m$, the set $\{L_0^m(x), \dotsc, L_m^m(x)\}$ also forms a basis of $\mathcal{P}^m$.

Now consider an interval $(a, b) \subset \mathbb{R}$, and let $\mathcal{P}^m(a,b)$ be $\mathcal{P}^m$ restricted to $(a, b)$. If we define \(\langle \mathbf{p}, \mathbf{q} \rangle = \int_a^b p(x)q(x)dx: \quad \mathcal{P}^m(a,b) \times \mathcal{P}^m(a,b) \to \mathbb{R}\) for $\mathbf{p}=p(x)$ and $\mathbf{q}=q(x)$ in $\mathcal{P}^m(a,b)$, then $\langle \mathbf{p}, \mathbf{q} \rangle$ is an inner product of $\mathbf{p}$ and $\mathbf{q}$, and $\mathcal{P}^m(a,b)$ is an inner product space:

  1. $\langle \mathbf{p}, \mathbf{p} \rangle \ge 0$;
  2. $\langle \mathbf{p}, \mathbf{p} \rangle = 0 \iff \mathbf{p} = 0$, i.e., $p(x)=0$ for all $x\in[a, b]$;
  3. $\langle \alpha \mathbf{p}, \mathbf{q} \rangle = \alpha \langle \mathbf{p}, \mathbf{q} \rangle$, $\alpha \in \mathbb{R}$;
  4. $\langle \mathbf{p}, \mathbf{q} \rangle = \langle \mathbf{q}, \mathbf{p} \rangle$;
  5. $\langle \mathbf{p} + \mathbf{q}, \mathbf{r} \rangle = \langle \mathbf{p}, \mathbf{r} \rangle + \langle \mathbf{q}, \mathbf{r} \rangle$.

This inner product allows us to define a norm $\|\mathbf{p}\|$ on $\mathcal{P}^m(a,b)$: \(\|\mathbf{p}\| = \langle \mathbf{p}, \mathbf{p} \rangle^{1/2} = \left(\int_a^b p^2(x)dx\right)^{1/2}.\)

An orthonormal basis of $\mathcal{P}^m(a, b)$ can be constructed by the Gram-Schmidt procedure: let $\mathbf{z}_j = z_j(x) = x^{j}$, $j=0, \dotsc, m$. Then set $\mathbf{q}_0 = \mathbf{z}_0 / \|\mathbf{z}_0\|$ and \begin{align*} \tilde{\mathbf{q}}_j &= \mathbf{z}_j - \sum_{k=0}^{j-1}\langle \mathbf{z}_j, \mathbf{q}_k \rangle \mathbf{q}_k \\ \mathbf{q}_j &= \tilde{\mathbf{q}}_j / \|\tilde{\mathbf{q}}_j\| \end{align*} for $j=1, \dotsc, m$. This amounts to \begin{align*} \tilde{q}_j(x) &= x^j - \sum_{k=0}^{j-1}\left(\int_a^b t^j q_k(t) \, dt\right) q_k(x) \\ q_j(x) &= \frac{\tilde{q}_j(x)}{\left(\int_a^b \tilde{q}_j^2(t)\,dt\right)^{1/2}} . \end{align*}

The orthogonal polynomials $q_j(x)$ are called the Legendre polynomials. Note that the Legendre polynomials are nested: \(\text{span}\{\mathbf{q}_0, \dotsc, \mathbf{q}_m\} = \mathcal{P}^{m}(a,b) \subset \mathcal{P}^{m+1}(a, b) = \text{span}\{\mathbf{q}_0, \dotsc, \mathbf{q}_m, \mathbf{q}_{m+1} \}.\)
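
A minimal numerical sketch of the Gram-Schmidt construction above (an illustration added here, not part of the original notes): each polynomial is represented by its values on a fine grid on $(a, b) = (-1, 1)$, and the inner product is approximated by the trapezoidal rule.

a <- -1; b <- 1
xx <- seq(a, b, length.out = 10001)
dx <- xx[2] - xx[1]
inner <- function(p, q) {                        # trapezoidal approximation of <p, q>
    v <- p * q
    dx * (sum(v) - (v[1] + v[length(v)]) / 2)
}
m <- 3
Z <- sapply(0:m, function(j) xx^j)               # monomial basis 1, x, ..., x^m
Q <- matrix(0, length(xx), m + 1)
for (j in seq_len(m + 1)) {
    qt <- Z[, j]
    if (j > 1) for (k in seq_len(j - 1)) qt <- qt - inner(Z[, j], Q[, k]) * Q[, k]
    Q[, j] <- qt / sqrt(inner(qt, qt))           # normalize
}
gram <- matrix(0, m + 1, m + 1)
for (i in 1:(m + 1)) for (j in 1:(m + 1)) gram[i, j] <- inner(Q[, i], Q[, j])
round(gram, 6)                                   # close to the identity matrix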

Proof of Theorem 2

Proof of Theorem 2. WLOG, rename the interval $[x_i, x_{i+1}]$ as $[a, b]$ and write the nodes and weights as $\{x_0, \dotsc, x_m\} \subset [a, b]$ and $\{w_0, \dotsc, w_m\}$.

Since $\mathbf{f} = f(x) \in \mathcal{P}^{2m+1}(a, b)$, polynomial division yields \(f(x) = g(x)q_{m+1}(x) + r(x),\) where $\mathbf{q}_{m+1} = q_{m+1}(x)$ is the $(m+1)$st Legendre polynomial in $\{\mathbf{q}_0, \dotsc, \mathbf{q}_{2m+1}\}$, which forms an orthonormal basis of $\mathcal{P}^{2m+1}(a, b)$; $\mathbf{g} = g(x)$ and $\mathbf{r} = r(x)$ are polynomials of degree at most $m$. By the construction of the Legendre polynomials, we see that \(\langle \mathbf{g}, \mathbf{q}_{m+1} \rangle = \int_a^b g(x)q_{m+1}(x)dx = 0 .\) Thus $\int_a^b f(x)dx = \int_a^b r(x)dx$.

Now choose $x_0, \dotsc, x_m$ as the distinct real roots of the $(m+1)$st degree polynomial $q_{m+1}(x)$. (Such roots indeed exist; see Lemma 1 below.) Then \(f(x_j) = g(x_j)q_{m+1}(x_j) + r(x_j) = r(x_j), \quad j=0, \dotsc, m.\) From Corollary 1 (whose proof only requires the nodes to be distinct, not equally spaced), \(\int_a^b r(x) dx = \sum_{j=0}^m w_j r(x_j), \quad w_j = \int_a^b L_j^m(x) dx,\) where $L_j^m(x)$ is the $j$th Lagrange polynomial given $\{x_0, \dotsc, x_m\}$ and $\{r(x_0), \dotsc, r(x_m)\}$.

Thus \(\int_a^b f(x)dx = \int_a^b r(x) dx = \sum_{j=0}^m w_j r(x_j) = \sum_{j=0}^m w_j f(x_j),\) as desired.

It remains to show that $w_k > 0$ for any $k\in \{0, 1, \dotsc, m\}$. Set $f(x) = [L_k^m(x)]^2$, a polynomial of degree $2m$. By the definition of the Lagrange polynomial, \(f(x_j) = \begin{cases} 1, & j = k, \\ 0, & j \neq k. \end{cases}\) Thus \(0 < \int_a^b f(x) dx = \sum_{j=0}^m w_j f(x_j) = \sum_{j=0}^m w_j\delta_{jk} = w_k .\)

Lemma 1. The roots of $k$th Legendre polynomial $q_k(x)$ are all real and distinct.

Proof. Suppose the contrary. Then $q_k(x)$ changes its sign fewer than $k$ times on $(a, b)$. Let the points of sign change be $r_1 < \dotsb < r_l$, $l < k$. Then $q_k(x)\prod_{i=1}^l (x - r_i)$ does not change sign on $(a, b)$ and vanishes only at $r_1, \dotsc, r_l$. Thus \(\int_a^b q_k(x)\prod_{i=1}^l (x - r_i) dx \neq 0,\) which is a contradiction since the polynomial $\prod_{i=1}^l (x - r_i)$ has degree less than $k$ and hence is orthogonal to $q_k(x)$.

Degree $2m+1$ is maximal

Consider $f(x) = \prod_{j=0}^m (x-x_j)^2$, where the $x_j$ are the roots of $q_{m+1}$. Obviously $f$ is a polynomial of degree $2m+2$. However, \(0 < \int_a^b f(x) dx \neq \sum_{j=0}^m w_j f(x_j) = 0\) since $f(x_j) = 0$ for all $j = 0, \dotsc, m$.
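
A quick numerical confirmation (again a sketch added here, not from the original notes): take the 3-point Gauss-Legendre rule ($m = 2$) on $(-1, 1)$ and the degree-$6$ polynomial built from its own nodes.

gq <- statmod::gauss.quad(3, kind = "legendre")
f6 <- function(x) sapply(x, function(t) prod((t - gq$nodes)^2))   # degree-6 polynomial
sum(gq$weights * f6(gq$nodes))   # quadrature value: exactly 0
integrate(f6, -1, 1)$value       # true integral: strictly positive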

Orthogonal polynomials

If we change the inner product of $\mathcal{P}^m(a, b)$ to \(\langle \mathbf{p}, \mathbf{q} \rangle = \int_a^b p(x)q(x)w(x) dx\) for a function $w(x)$ proportional to a density function on $(a, b)$, then the Gram-Schmidt procedure on $\{1, x, \dotsc, x^m\}$ generates a different sequence of orthogonal polynomials, where the orthogonality is with respect to $w$ (called $w$-orthogonal). The Legendre polynomials correspond to the uniform distribution on $(a, b)$.

Typical choices of $w$ and the associated $w$-orthogonal polynomials are tabulated below:

orthogonal polynomial | weight function $w(x)$ | domain | distribution
--- | --- | --- | ---
Legendre | $1$ | $(-1, 1)$ | uniform
Laguerre | $x^{\alpha}e^{-x}$ | $(0, \infty)$ | gamma
Hermite | $e^{-x^2}$ | $(-\infty, \infty)$ | normal
Jacobi | $(1-x)^{\alpha}(1+x)^{\beta}$ | $(-1, 1)$ | beta*

(*) After change of variable.
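
For instance (a sketch with assumed parameter values, not from the original notes, and assuming statmod's parameterization of the Laguerre weight as $x^{\alpha}e^{-x}$ on $(0, \infty)$), a Gauss-Laguerre rule reproduces gamma-distribution moments:

alpha <- 2.5
gl <- statmod::gauss.quad(10, kind = "laguerre", alpha = alpha)
# E[X] for X ~ Gamma(alpha + 1, 1) is alpha + 1; the integrand x is a polynomial,
# so the 10-point rule is exact up to floating-point error
sum(gl$weights * gl$nodes) / gamma(alpha + 1)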

Legendre polynomial

The $m$th degree Legendre polynomial (on $(-1, 1)$) is a solution to Legendre's ordinary differential equation \((1 - x^2)u'' - 2x u' + m(m + 1)u = 0\) given by \(q_m(x) = \frac{(-1)^m}{2^m m!}\frac{d^m}{dx^m}[(1 - x^2)^m].\)

Laguerre polynomial

The $m$th degree Laguerre polynomial is a solution to Laguerre's ordinary differential equation \(xu'' + (\alpha + 1 - x)u' + m u = 0\) given by \(q_m(x) = \frac{\Gamma(m + \alpha + 1)}{m!}\sum_{k=0}^m\binom{m}{k}\frac{(-1)^k x^k}{\Gamma(k + \alpha + 1)}.\)

Hermite polynomial

The $m$th degree Hermite polynomial is a solution to Hermite's ordinary differential equation \(u'' - 2xu' + 2m u = 0\) given by \(q_m(x) = (-1)^m e^{x^2}\frac{d^m}{dx^m}e^{-x^2}.\)

Jacobi polynomial

The $m$th degree Jacobi polynomial is a solution to Jacobi's ordinary differential equation \((1 - x^2)u'' + (\beta - \alpha - (\alpha + \beta + 2)x)u' + m(m + \alpha + \beta + 1) u = 0\) given by \(q_m(x) = \frac{\Gamma(\alpha + m + 1)}{m!\,\Gamma(\alpha + \beta + m + 1)}\sum_{k=0}^m \binom{m}{k}\frac{\Gamma(\alpha + \beta + m + k + 1)}{\Gamma(\alpha + k + 1)} \left(\frac{x - 1}{2}\right)^k .\)

Evaluation of orthogonal polynomials

What we need in Gaussian quadrature is the roots of the $m$th degree orthogonal polynomial (with respect to $w$). These roots are usually found by using Newton’s method. The following properties of orthogonal polynomials are useful:

3-term recursion

\[q_m(x) = (\alpha_m + x\beta_m)q_{m-1}(x) - \gamma_m q_{m-2}(x)\]

for some sequence $\alpha_m$, $\beta_m$, and $\gamma_m$.

Interlacing property

If $\mu_1 < \dotsb < \mu_m$ are the roots of $q_m(x)$, then $q_{m+1}(x)$ has exactly one root in each of the intervals \((-\infty, \mu_1), ~ (\mu_1, \mu_2), \dotsc, (\mu_m, \infty).\)
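
The following sketch (illustrative code added here, not from the original notes) uses both properties for the physicists' Hermite polynomials, which satisfy $H_m(x) = 2x H_{m-1}(x) - 2(m-1)H_{m-2}(x)$: each root of $H_m$ is bracketed by consecutive roots of $H_{m-1}$ (plus crude outer brackets), so each root can be hunted down by a bracketed solver. For simplicity, uniroot (Brent's method) is used here in place of Newton's method.

hermite_val <- function(x, m) {
    # evaluate the physicists' Hermite polynomial H_m(x) via the 3-term recursion
    h0 <- rep(1, length(x)); if (m == 0) return(h0)
    h1 <- 2 * x;             if (m == 1) return(h1)
    for (k in 2:m) { h2 <- 2 * x * h1 - 2 * (k - 1) * h0; h0 <- h1; h1 <- h2 }
    h1
}
roots <- 0                          # the single root of H_1
for (m in 2:5) {
    brackets <- c(-10, roots, 10)   # interlacing: one root of H_m in each interval
    roots <- sapply(seq_len(m), function(i)
        uniroot(hermite_val, c(brackets[i], brackets[i + 1]), m = m)$root)
}
roots                                       # roots of H_5
statmod::gauss.quad(5, "hermite")$nodes     # should agree with the roots above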

Gauss-Hermite quadrature

If we want to integrate $f(x)$ from $-\infty$ to $\infty$, then the Hermite polynomials are the natural choice: \(\int_{-\infty}^{\infty} f(x)dx = \int_{-\infty}^{\infty} \left(\frac{f(x)}{e^{-x^2}}\right) e^{-x^2}dx \approx \sum_{j=0}^m w_j f(x_j)e^{x_j^2} ,\) where $\{w_j\}$ and $\{x_j\}$ are determined by the (scaled) Hermite polynomial.

# statmod::gauss.quad() calculates nodes and weights for Gaussian quadrature
statmod::gauss.quad(1, "hermite")   # this computes w_0
$nodes
0
$weights
1.77245385090552
sqrt(pi)   # int_{-\infty}^{\infty}exp(-x^2)dx

1.77245385090552

Example: Binomial-logit mixture

Suppose $x$ is a random sample from a Binomial distribution with $n$ trials and success probability $p=\frac{e^{\theta}}{1 + e^{\theta}}$. Suppose the prior on $\theta$ is $N(0, \sigma^2)$.

The posterior density of $\theta$ is then proportional to $\frac{e^{\theta x}}{[1 + e^{\theta}]^n}e^{-\theta^2/(2\sigma^2)}$, so (up to constants not involving $\theta$) the normalizing constant is $\int_{-\infty}^{\infty} \frac{e^{\theta x}}{[1 + e^{\theta}]^n}e^{-\theta^2/(2\sigma^2)}d\theta$. Thus if we define $g(\theta) = \frac{e^{\theta x}}{[1 + e^{\theta}]^n}$, then substituting $\theta = \sigma\sqrt{2}\,y$ shows that the normalizing constant is \(\sigma\sqrt{2}\int_{-\infty}^{\infty} g(\sigma\sqrt{2}y)w(y)dy\) where $w(y) = e^{-y^2}$, the weight function for the Hermite polynomials. However, this turns out to be a bad idea:

g <- function(theta, x, n) exp(theta * x - n * log(1 + exp(theta)))

x <- 18  # data
n <- 20  # sample size
sigma <- 10 # standard deviation of the prior

m <- 40   # degree of Hermite polynomial
ghq <- statmod::gauss.quad(m + 1, "hermite")
denom <- sigma * sqrt(2) * sum(g(sigma * sqrt(2) * ghq$nodes, x, n) * ghq$weights)
denom
# R built-in adaptive quadrature
integrate(function(y) sigma * sqrt(2) * g(sigma * sqrt(2) * y, x, n) * exp(-y^2) , -Inf, Inf) 

0.000220517105801289

0.002829073 with absolute error < 1.5e-09

The reason is that the integrand $g(\theta)$ attains its maximum where $p(\theta) = x/n = 0.9$, i.e., at $\theta = \log\frac{p}{1-p} = 2.197 \gg 0$.

A better approach is to set $\bar{g}(\theta) = \frac{e^{\theta x}}{[1 + e^{\theta}]^n}e^{-\theta^2/(2\sigma^2)}=g(\theta)e^{-\theta^2/(2\sigma^2)}$, find the maximizer $\theta^*$ of $\bar{g}$, and change the variable to $z = \theta - \theta^*$. Then, the normalizing constant is \(\int_{-\infty}^{\infty}\bar{g}(\theta)d\theta = \int_{-\infty}^{\infty} \bar{g}(z + \theta^*)dz = \int_{-\infty}^{\infty} \bar{g}(z + \theta^*)e^{z^2}e^{-z^2}dz = \int_{-\infty}^{\infty}\tilde{g}(z)w(z)dz,\) where \(\tilde{g}(z) = \bar{g}(z + \theta^*)e^{z^2} .\)

theta_hat <- optimize(function(y) -g(y, x, n) * exp(-y^2 / sigma^2), c(-5, 5))$minimum  # univariate minimizer
theta_hat

2.1733258692788

g2 <- function(z, x, n, sigma) g(z + theta_hat, x, n) * exp(-(z + theta_hat)^2 / 2 / sigma^2 + z^2) # tilde g(z) = bar g(z + theta_hat) * exp(z^2)
denom2 <- sum(g2(ghq$nodes, x, n, sigma) * ghq$weights)
denom2

0.00282907322799488

Laplace approximation

Laplace approximation concerns replacing the integral \(\int_{-\infty}^{\infty} f(x)e^{-a g(x)} dx\) with a quantity obtained by optimization. WLOG assume that $g$ has its minimum at 0 (so that $g'(0) = 0$). If $g$ is three times continuously differentiable, then Taylor expansion around 0 yields \(g(x) = g(0) + g'(0)x + \frac{1}{2}g''(0)x^2 + O(x^3) = g(0) + \frac{1}{2}g''(0)x^2 + o(x^2).\) Thus \(\int_{-\infty}^{\infty} f(x)e^{-a g(x)} dx \approx f(0)e^{-a g(0)}\int_{-\infty}^{\infty}\exp\left(-\frac{a g''(0)x^2}{2}\right)dx = f(0)e^{-ag(0)}\sqrt{\frac{2\pi}{ag''(0)}}\) as $a \to \infty$, since in this case the contributions to the integral are dominated by those around 0. This amounts to reducing the integral to integration against a $N(0, 1/[ag''(0)])$ density. A formal statement of the Laplace approximation is given below.

Theorem 3. If $g$ satisfies the following conditions

  1. for every $\delta > 0$ there exists a $\rho > 0$ such that $g(x) - g(0) \ge \rho$ for all $x$ with $|x| \ge \delta$;
  2. $g$ is twice continuously differentiable in a neighborhood of 0 and $g''(0) > 0$;
  3. $f$ is continuous in a neighborhood of 0 and $f(0) > 0$;
  4. the integral $\int_{-\infty}^{\infty}f(x)e^{-ag(x)}dx$ is absolutely convergent for $a \ge a_0$;

then, \(\int_{-\infty}^{\infty} f(x)e^{-a g(x)} dx \asymp f(0)e^{-ag(0)}\sqrt{\frac{2\pi}{ag''(0)}}\) as $a\to\infty$. That is, \(\lim_{a\to\infty} \left(\int_{-\infty}^{\infty} f(x)e^{-a g(x)} dx\right)/\left(f(0)e^{-ag(0)}\sqrt{\frac{2\pi}{ag''(0)}}\right) = 1.\)
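
Here is a small numerical check of Theorem 3 under assumed choices of $f$ and $g$ (not from the original notes): $f(x) = 1/(1+x^2)$ and $g(x) = \cosh(x) - 1$, so that $g(0) = 0$ and $g''(0) = 1$.

f_lap <- function(x) 1 / (1 + x^2)
g_lap <- function(x) cosh(x) - 1
for (a in c(1, 10, 100, 1000)) {
    exact   <- integrate(function(x) f_lap(x) * exp(-a * g_lap(x)), -10, 10)$value
    laplace <- f_lap(0) * exp(-a * g_lap(0)) * sqrt(2 * pi / a)   # g''(0) = 1
    cat(sprintf("a = %6g   Laplace / exact = %.6f\n", a, laplace / exact))
}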

Applications

Stirling’s formula

The gamma function \(\Gamma(t) = \int_0^{\infty} x^{t - 1}e^{-x} dx\) generalizes the factorial operation: $\Gamma(n+1) = n!$. If we define $z = x / t$, then \(\Gamma(t+1) = t^{t+1}\int_0^{\infty}e^{-t g(z)}dz\) for $g(z) = z - \log z$. Since $g$ has its minimum at $z=1$, applying the Laplace approximation yields \(\Gamma(t+1) \asymp \sqrt{2\pi}t^{t+1/2}e^{-t}\) as $t\to \infty$, which is the famous Stirling’s formula for $t!$.

n <- 70
factorial(n)

1.19785716699696e+100

sqrt(2*pi) * n^(n + 0.5) * exp(-n)

1.19643200473376e+100

Posterior expectation

Recall Bayesian inference:

We may want to evaluate the posterior expectation $\mathbf{E}[h(\Theta) | X_1, \dotsc, X_n]$ given $n$ independent observations. The required expectation takes the form \(\frac{\int e^{\tilde{h}(\theta)}e^{\ell_n(\theta) + \pi(\theta)}d\theta}{\int e^{\ell_n(\theta) + \pi(\theta)}d\theta}\) where $\tilde{h}(\theta) = \log h(\theta)$, $\ell_n(\theta) = \sum_{i=1}^n \log f_{X|\Theta}(x_i|\theta)$ is the log-likelihood of the data, and $\pi(\theta) = \log f_{\Theta}(\theta)$.

If $n$ is large, usually the log posterior $\ell_n(\theta) + \pi(\theta)$ is sharply peaked around $\hat{\theta}$, the posterior mode.

Using the Laplace approximation, the denominator is approximated by \(\int e^{\ell_n(\theta) + \pi(\theta)}d\theta \approx e^{\ell_n(\hat{\theta})+\pi(\hat{\theta})} \sqrt{\frac{2\pi}{-[\ell_n''(\hat{\theta}) + \pi''(\hat{\theta})]}} .\)

If $\tilde{h}(\theta) + \ell_n(\theta) + \pi(\theta)$ takes its maximum at $\tilde{\theta}$, then the numerator is approximated by \(\int e^{\tilde{h}(\theta) + \ell_n(\theta) + \pi(\theta)}d\theta \approx e^{\tilde{h}(\tilde{\theta}) + \ell_n(\tilde{\theta})+\pi(\tilde{\theta})} \sqrt{\frac{2\pi}{-[\tilde{h}''(\tilde{\theta}) + \ell_n''(\tilde{\theta}) + \pi''(\tilde{\theta})]}} .\)

Thus the expectation is approximated by \(\exp\left(\tilde{h}(\tilde{\theta}) + \ell_n(\tilde{\theta}) + \pi(\tilde{\theta}) - \ell_n(\hat{\theta}) - \pi(\hat{\theta})\right) \sqrt{\frac{\ell_n''(\hat{\theta}) + \pi''(\hat{\theta})}{\tilde{h}''(\tilde{\theta}) + \ell_n''(\tilde{\theta}) + \pi''(\tilde{\theta})}} .\)
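
As a sanity check (a sketch with an assumed toy model, not from the original notes), take $X_i \sim N(\theta, 1)$ with prior $\theta \sim N(0, \tau^2)$ and $h(\theta) = e^{\theta}$, so that $\tilde h(\theta) = \theta$ and $\tilde h'' = 0$; the modes and curvatures are then available in closed form, and the Laplace ratio above can be compared with direct numerical integration.

set.seed(1)
n_obs <- 25; tau <- 10
xs <- rnorm(n_obs, mean = 1.3, sd = 1)    # simulated data
logpost <- function(th) sum(dnorm(xs, th, 1, log = TRUE)) + dnorm(th, 0, tau, log = TRUE)
prec      <- n_obs + 1 / tau^2            # -(l_n'' + pi''), constant in theta here
theta_hat <- sum(xs) / prec               # posterior mode
theta_til <- (sum(xs) + 1) / prec         # maximizer of h~(theta) + l_n(theta) + pi(theta)
# the square-root factor equals 1 since h~'' = 0 and the curvature is constant
lap <- exp(theta_til + logpost(theta_til) - logpost(theta_hat))
num <- integrate(function(t) sapply(t, function(s) exp(s + logpost(s) - logpost(theta_hat))), -10, 10)$value /
       integrate(function(t) sapply(t, function(s) exp(logpost(s) - logpost(theta_hat))), -10, 10)$value
c(laplace = lap, numerical = num)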

Example: binomial-logit mixture

Recall from the binomial-logit mixture example above, set \(f(\theta) = e^{-\theta^2/(2\sigma^2)}, \quad g(\theta) = -\theta x + n\log(1 + e^{\theta}) .\)

Then the Laplace approximation of the normalization constant of the posterior density is \(f(\theta^*)e^{-g(\theta^*)}\sqrt{\frac{2\pi}{g''(\theta^*)}}\) where $\theta^* = \log\frac{x/n}{1 - x/n}$. It is easy to see that \(g''(\theta^*) = n \frac{e^{\theta^*}}{(1 + e^{\theta^*})^2} = n p (1 - p), \quad p = x / n.\) For the numerical example of $x=18$ and $n=20$, we have the following:

x <- 18; n <- 20
p <- x / n
theta_star <- log(p / (1 - p))
lap <- exp(-theta_star^2 / 2/ sigma^2) * exp(theta_star * x) / 
        (1 + exp(theta_star))^n * sqrt(2 * pi / n / p / (1 - p))
lap   # Laplace approximation
denom2  # Gauss-Hermite quadrature

0.00273738211907555

0.00282907322799488

Example: moments of beta distribution

Recall that the beta distribution has density \(f_{\Theta}(\theta) = \frac{\Gamma(\alpha+\beta)\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\Gamma(\alpha)\Gamma(\beta)} \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\) on $(0, 1)$. If the likelihood of the data $X$ is $\text{Binom}(n, \theta)$, then the posterior density of the success probability $\theta$ is also beta: $f_{\Theta|X}(\theta|x) \propto \theta^{\alpha + x -1}(1-\theta)^{\beta + n - x -1}$. Thus, finding the moment of a beta distribution amounts to computing the posterior moment of the success probability.

The exact value of $\mathbf{E}[\Theta^k]$ is \(\mathbf{E}[\Theta^k] = \frac{\int_0^1 \theta^{k+\alpha-1}(1-\theta)^{\beta-1}d\theta}{\int_0^1\theta^{\alpha-1}(1-\theta)^{\beta-1}d\theta} = \frac{\Gamma(k + \alpha)\Gamma(\beta)}{\Gamma(k + \alpha + \beta)} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} = \frac{\Gamma(k + \alpha)/\Gamma(\alpha)}{\Gamma(k + \alpha + \beta)/\Gamma(\alpha + \beta)} = \frac{(k+\alpha -1)\dotsb \alpha}{(k+\alpha+\beta-1)\dotsb(\alpha+\beta)}\)

In order to apply the Laplace approximation, observe that $-\log(\theta^{\alpha-1}(1-\theta)^{\beta-1}) = -[(\alpha-1)\log\theta + (\beta-1)\log(1-\theta)] =: g(\theta)$, which has the unique minimum at $\hat{\theta} = \frac{\alpha-1}{\alpha+\beta-2}\in(0, 1)$. Therefore, \(\int_0^1 \theta^{\alpha-1}(1-\theta)^{\beta-1} d\theta \approx e^{-g(\hat{\theta})}\sqrt{\frac{2\pi}{g''(\hat{\theta})}} = \left(\frac{\alpha-1}{\alpha+\beta-2}\right)^{\alpha-1}\left(\frac{\beta-1}{\alpha+\beta-2}\right)^{\beta-1} \sqrt{2\pi}\sqrt{\frac{(\alpha-1)(\beta-1)}{(\alpha+\beta-2)^3}}\) since $g(\hat{\theta}) = -\log\left[\left(\frac{\alpha-1}{\alpha+\beta-2}\right)^{\alpha-1}\left(\frac{\beta-1}{\alpha+\beta-2}\right)^{\beta-1}\right]$ and $g''(\hat{\theta}) = (\alpha-1)\left(\frac{\alpha+\beta-2}{\alpha-1}\right)^2 + (\beta-1)\left(\frac{\alpha+\beta-2}{\beta-1}\right)^2$.

Likewise, \(\int_0^1 \theta^{k+\alpha-1}(1-\theta)^{\beta-1} d\theta \approx \left(\frac{k+\alpha-1}{k+\alpha+\beta-2}\right)^{k+\alpha-1}\left(\frac{\beta-1}{k+\alpha+\beta-2}\right)^{\beta-1} \sqrt{2\pi}\sqrt{\frac{(k+\alpha-1)(\beta-1)}{(k+\alpha+\beta-2)^3}}\)

Therefore, \(\mathbf{E}[\Theta^k] \approx \left(\frac{(k+\alpha-1)(\alpha+\beta-2)^3}{(\alpha-1)(k+\alpha+\beta-2)^3}\right)^{1/2} \cdot \frac{\left(\frac{k+\alpha-1}{k+\alpha+\beta-2}\right)^{k+\alpha-1}\left(\frac{\beta-1}{k+\alpha+\beta-2}\right)^{\beta-1}}{\left(\frac{\alpha-1}{\alpha+\beta-2}\right)^{\alpha-1}\left(\frac{\beta-1}{\alpha+\beta-2}\right)^{\beta-1}}\)

It can be shown that if $\alpha, \beta \to \infty$ in such a way that $\frac{\alpha}{\alpha+\beta} \to p \in (0,1)$, then the approximation of $\mathbf{E}[\Theta^k]$ tends to $p^k$. Also note that \(\frac{(k+\alpha -1)\dotsb \alpha}{(k+\alpha+\beta-1)\dotsb(\alpha+\beta)} \to p^k\) in this setting. Thus the Laplace approximation of the beta moment is correct in the limit.
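
The following sketch (with assumed values of $k$, $\alpha$, and $\beta$; not from the original notes) compares the exact beta moment with the Laplace approximation derived above, working on the log scale to avoid underflow for large $\alpha$ and $\beta$.

beta_moment_exact <- function(k, alpha, beta) {
    exp(lgamma(k + alpha) - lgamma(alpha) - lgamma(k + alpha + beta) + lgamma(alpha + beta))
}
beta_moment_laplace <- function(k, alpha, beta) {
    lognum <- (k + alpha - 1) * log((k + alpha - 1) / (k + alpha + beta - 2)) +
              (beta - 1)      * log((beta - 1)      / (k + alpha + beta - 2)) +
              0.5 * log((k + alpha - 1) * (beta - 1) / (k + alpha + beta - 2)^3)
    logden <- (alpha - 1) * log((alpha - 1) / (alpha + beta - 2)) +
              (beta - 1)  * log((beta - 1)  / (alpha + beta - 2)) +
              0.5 * log((alpha - 1) * (beta - 1) / (alpha + beta - 2)^3)
    exp(lognum - logden)
}
c(exact = beta_moment_exact(2, 20, 30), laplace = beta_moment_laplace(2, 20, 30))
# with alpha / (alpha + beta) = 0.4 held fixed, both tend to 0.4^2 = 0.16 as alpha, beta grow
c(exact = beta_moment_exact(2, 2000, 3000), laplace = beta_moment_laplace(2, 2000, 3000))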

Multidimensional integration

The Laplace approximation extends naturally to multidimensional integration. In contrast, quadrature methods need a number of evaluation points that grows exponentially with the dimension.

We will see later that Monte Carlo methods can also alleviate this difficulty.

Proof of Theorem 3

Assume WLOG $g(0) = 0$ (multiply both sides by $e^{ag(0)}$ and set $g(x) := g(x) - g(0)$). Now since \(g'(x) = g'(0) + g''(0)x + o(x) = g''(0)x + o(x)\) as $x \to 0$, applying l'Hôpital's rule yields $g(x) - \frac{1}{2}g''(0)x^2 = o(x^2)$ as $x\to 0$. Combining this with condition 3, for a given small $\epsilon \in (0, \frac{1}{2}g''(0))$ there exists $\delta > 0$ such that \begin{align} (1-\epsilon) f(0) &\le f(x) \le (1+\epsilon) f(0) \\ -\epsilon x^2 &\le g(x) - \frac{1}{2}g''(0)x^2 \le \epsilon x^2 \end{align} for all $x$ with $|x| < \delta$.

Now examine the contributions to the integral from the region $|x| \ge \delta$. From condition 1, $g(x) \ge \rho$ there. Then for $a \ge a_0$, \begin{align} \left|\int_{\delta}^{\infty}f(x)e^{-ag(x)}dx\right| &\le \int_{\delta}^{\infty}|f(x)|e^{-(a - a_0)g(x)}e^{-a_0 g(x)}dx \\ &\le e^{-(a - a_0)\rho}\int_{\delta}^{\infty}|f(x)|e^{-a_0 g(x)}dx \\ &\le e^{-(a - a_0)\rho}\int_{-\infty}^{\infty}|f(x)|e^{-a_0 g(x)}dx = O(e^{-\rho a}) . \end{align} The last equality is due to condition 4. By the same argument, we also have \(\int_{-\infty}^{-\delta}f(x)e^{-ag(x)}dx = O(e^{-\rho a}) .\)

The central portion of the integral can be bounded above: \begin{align} \int_{-\delta}^{\delta} f(x)e^{-ag(x)}dx & \le (1+\epsilon)f(0)\int_{-\delta}^{\delta} e^{-\frac{a}{2}[g''(0)-2\epsilon]x^2} dx \\ &\le (1+\epsilon)f(0)\int_{-\infty}^{\infty} e^{-\frac{a}{2}[g''(0) - 2\epsilon]x^2} dx \\ &= (1+\epsilon)f(0)\sqrt{\frac{2\pi}{a[g''(0)- 2\epsilon]}} . \end{align} Therefore, \(\int_{-\infty}^{\infty} f(x)e^{-ag(x)}dx \le (1+\epsilon)f(0)\sqrt{\frac{2\pi}{a[g''(0)- 2\epsilon]}} + O(e^{-\rho a}),\) yielding \(\limsup_{a\to\infty} \sqrt{a}\int_{-\infty}^{\infty} f(x)e^{-ag(x)}dx \le (1+\epsilon) f(0) \sqrt{\frac{2\pi}{g''(0)- 2\epsilon}} .\) Sending $\epsilon$ to zero, we have \(\limsup_{a\to\infty} \sqrt{a}\int_{-\infty}^{\infty} f(x)e^{-ag(x)}dx \le f(0) \sqrt{\frac{2\pi}{g''(0)}} .\)

We can also find a lower bound for the central portion of the integral: \begin{align} \int_{-\delta}^{\delta} f(x)e^{-ag(x)}dx & \ge (1-\epsilon)f(0)\int_{-\delta}^{\delta} e^{-\frac{a}{2}[g''(0)+2\epsilon]x^2} dx . \end{align}

For $|x| \ge \delta$, $(\frac{1}{2}g''(0) + \epsilon)x^2 \ge (\frac{1}{2}g''(0) + \epsilon)\delta^2 =: \lambda$. Furthermore, $\int_{-\infty}^{\infty} e^{-\frac{a}{2}[g''(0) + 2\epsilon]x^2} dx < \infty$ for any $a > 0$. Hence, by repeating the same argument leading to $\int_{\delta}^{\infty}f(x)e^{-ag(x)}dx = O(e^{-\rho a})$, we have \(\int_{-\infty}^{-\delta}e^{-\frac{a}{2}[g''(0) + 2\epsilon]x^2}dx + \int_{\delta}^{\infty}e^{-\frac{a}{2}[g''(0) + 2\epsilon]x^2}dx = O(e^{-\lambda a}).\) Therefore, \begin{align} (1-\epsilon)f(0)\int_{-\delta}^{\delta} e^{-\frac{a}{2}[g''(0) + 2\epsilon]x^2} dx &= (1-\epsilon)f(0)\int_{-\infty}^{\infty} e^{-\frac{a}{2}[g''(0) + 2\epsilon]x^2} dx \\ & \quad - (1-\epsilon)f(0)\left[\int_{-\infty}^{-\delta}e^{-\frac{a}{2}[g''(0) + 2\epsilon]x^2} dx + \int_{\delta}^{\infty}e^{-\frac{a}{2}[g''(0) + 2\epsilon]x^2} dx \right] \\ &\ge (1-\epsilon)f(0)\int_{-\infty}^{\infty} e^{-\frac{a}{2}[g''(0) + 2\epsilon]x^2} dx - O(e^{-\lambda a}) \\ &= (1-\epsilon)f(0)\sqrt{\frac{2\pi}{a[g''(0)+ 2\epsilon]}} - O(e^{-\lambda a}) , \end{align} yielding \(\liminf_{a\to\infty} \sqrt{a}\int_{-\infty}^{\infty} f(x)e^{-ag(x)}dx \ge (1-\epsilon) f(0) \sqrt{\frac{2\pi}{g''(0)+ 2\epsilon}} .\) Sending $\epsilon$ to zero, we have \(\liminf_{a\to\infty} \sqrt{a}\int_{-\infty}^{\infty} f(x)e^{-ag(x)}dx \ge f(0) \sqrt{\frac{2\pi}{g''(0)}} .\)

Thus \(\lim_{a\to\infty} \sqrt{a}\int_{-\infty}^{\infty} f(x)e^{-ag(x)}dx = f(0) \sqrt{\frac{2\pi}{g''(0)}},\) as desired.