Monte Carlo Application and the Bootstrap


Lecture 11

March 6, 2024

Last Class

Monte Carlo

  • Stochastic Simulation
  • Estimate \[\mu = \mathbb{E}_p[f(x)] \approx \frac{1}{n} \sum_{i=1}^n f(x_i) = \hat{\mu}_n\]

Monte Carlo Error Analysis

  • Standard error \[\sigma_{\hat{\mu}_n} = \frac{\sigma_{f(x)}}{\sqrt{n}} \approx \frac{\hat{\sigma}_{f(x_{1:n})}}{\sqrt{n}} = \hat{\sigma}_n\]

  • Confidence intervals follow from the CLT using the standard error, e.g. \(\hat{\mu}_n \pm 1.96\,\hat{\sigma}_n\) for a 95% interval (a minimal sketch follows).
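
As a refresher, here is a minimal sketch (not from the lecture) of a Monte Carlo estimate with its CLT-based confidence interval; the integrand f and the distribution p are illustrative choices.

Code
using Distributions, Statistics, Random

Random.seed!(1)
f(x) = x^2                      # illustrative integrand
p = Normal(0, 1)                # illustrative input distribution
n = 10_000
fx = f.(rand(p, n))             # evaluate f on n Monte Carlo samples
μ̂ = mean(fx)                    # Monte Carlo estimate of E_p[f(x)]
σ̂ = std(fx) / sqrt(n)           # estimated standard error
ci = (μ̂ - 1.96σ̂, μ̂ + 1.96σ̂)     # CLT-based 95% confidence interval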

Monte Carlo Example

Airshed Model

Figure 1: Illustration of the airshed, including notation.

Goal: Find the probability of exceeding the 1-hour SO2 average exposure concentration standard, which is 0.14 ppm.

Uncertainties: Hourly wind speed \(u\), the inflow concentration \(C_\text{in}\), net rate of SO2 emission within the airshed \(R=S-D\).

Airshed Model

Figure 2: Illustration of the airshed, including notation.

\[\frac{dC}{dt} = \frac{u}{L} C_\text{in} + \frac{S-D}{WHL} - \left(\frac{u}{L} + k\right)C\]

Forward Euler Discretization

\[ \frac{dC}{dt} = \frac{u}{L} C_\text{in}(t) + \frac{S-D}{WHL} - \left(\frac{u}{L} + k\right)C\]

\[\Rightarrow \frac{C(t+1) - C(t)}{\Delta t} = \frac{u}{L} C_\text{in}(t) + \frac{R}{WHL} - \left(\frac{u}{L} + k\right)C(t)\]

\[C(t+1) = \left(1 - \Delta t\left(\frac{u}{L} + k\right)\right)C(t) + \Delta t \left(\frac{u}{L} C_\text{in}(t) + \frac{R}{WHL}\right)\]
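
As a minimal sketch (not part of the lecture code), the update rule above can be written as a single forward Euler step; the function euler_step and its argument list are hypothetical, with names matching the equation.

Code
# One forward Euler step of the airshed model:
# C(t+1) = (1 - Δt (u/L + k)) C(t) + Δt (u/L * Cin(t) + R / (W H L))
function euler_step(C, Δt, u, L, Cin, R, W, H, k)
    return (1 - Δt * (u / L + k)) * C + Δt * (u / L * Cin + R / (W * H * L))
end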

Monte Carlo Samples

Code
# assumes Distributions, Plots, and LaTeXStrings are loaded
nsamp = 1000
u = rand(LogNormal(log(2), 1), nsamp)          # hourly wind speed
Cin = rand(LogNormal(log(0.16), 0.12), nsamp)  # inflow concentration
R = rand(Normal(0.5, 0.5), nsamp)              # net SO2 emission rate (R = S - D)

p1 = histogram(u, ylabel="count", xlabel=L"$u$ (m/s)", label=false, tickfontsize=16, guidefontsize=18, size=(400, 450))
p2 = histogram(Cin, ylabel="count", xlabel=L"$C_{in}$ (ppm)", label=false, tickfontsize=16, guidefontsize=18, size=(400, 450))
p3 = histogram(R, ylabel="count", xlabel=L"$R$ (ppm/hr)", label=false, tickfontsize=16, guidefontsize=18, size=(400, 450))
display(p1)
display(p2)
display(p3)
Figure 3: Monte Carlo samples for the airshed model: histograms of (a) wind speed u, (b) inflow concentration C_in, and (c) net emission rate R.

Simulation Results

Code
# other parameters; assumes Plots and Statistics are also loaded
C₀ = 0.07   # initial concentration (ppm)
T = 60      # simulation length
k = 0.3     # decay rate
W = 4       # airshed width
H = 5       # airshed height
L = 4       # airshed length
# conduct simulation (forward Euler with Δt = 0.01)
P = u / L .* Cin    # source term (u/L) Cin for each sample
l = u / L .+ k      # loss rate u/L + k for each sample
C2 = zeros(T*100 + 1, nsamp)
S = 0:0.01:T
for (i, t) in pairs(S)
    if i == 1
        C2[i, :] .= C₀   # initial condition
    else
        # forward Euler update from the boxed discretization
        C2[i, :] = (1 .- 0.01*l) .* C2[i-1, :] .+ 0.01 * P .+ 0.01 * R / (H * W * L)
    end
end
mean_SO2 = map(mean, eachcol(C2)) # time-averaged exposure for each sample (column)
# plot histogram
p1 = histogram(mean_SO2, xlabel="1-Hour Average Exposure (ppm)", ylabel="Count", legend=false, tickfontsize=16, guidefontsize=18)
vline!(p1, [0.14], color=:red, linestyle=:dash, linewidth=3)
xticks!(p1, 0:0.04:0.3)
xaxis!(p1, xminorticks=2)
plot!(p1, size=(600, 450))
# plot cdf
p2 = plot(sort(mean_SO2), (1:nsamp) ./ nsamp, xlabel="1-Hour Average Exposure (ppm)", ylabel="Cumulative Probability", legend=false, tickfontsize=17, guidefontsize=18, linewidth=3)
vline!(p2, [0.14], linestyle=:dash, color=:red, linewidth=3, minorgrid=true)
xticks!(p2, 0:0.04:0.3)
xaxis!(p2, xminorticks=2)
yaxis!(p2, yminorticks=5)
plot!(p2, size=(600, 450))

display(p1)
display(p2)
Figure 4: Simulated 1-hour average SO2 exposures across Monte Carlo samples: (a) histogram, (b) empirical CDF. The dashed red line marks the 0.14 ppm standard.

Monte Carlo Estimation

\[\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n \mathbb{I}[x_i > 0.14]\]

\[\hat{\sigma}_n = \sqrt{\frac{\text{Var}(\mathbb{I}[x_{1:n} > 0.14])}{n}}\]
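
The code below tracks how this estimate stabilizes as samples accumulate. As a minimal check (not from the lecture), the final estimate and standard error can also be computed directly from the exceedance indicators, using mean_SO2 and nsamp from the earlier code and the Statistics package:

Code
exceed = mean_SO2 .> 0.14        # indicator of exceeding the 0.14 ppm standard
μ̂ = mean(exceed)                 # estimated exceedance probability
σ̂ = std(exceed) / sqrt(nsamp)    # estimated standard error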

Code
# show Monte Carlo estimate stabilization as samples accumulate
avg_mc_out = zeros(nsamp)                 # running estimate μ̂_i
avg_mc_out[1] = mean_SO2[1] > 0.14
std_mc_out = zeros(nsamp)                 # running standard error σ̂_i
std_mc_out[1] = 0
for i = 2:nsamp
    # update the running mean of the exceedance indicator
    avg_mc_out[i] = (avg_mc_out[i-1] * (i-1) + (mean_SO2[i] > 0.14)) / i
    std_mc_out[i] = 1/sqrt(i) * std(mean_SO2[1:i] .> 0.14)
end
p = plot(avg_mc_out, xlabel="Monte Carlo Iteration", ylabel="Probability", left_margin=3mm, legend=:false, ribbon=1.96*std_mc_out, fillalpha=0.3, linewidth=2, tickfontsize=16, guidefontsize=18, fillcolor=:red, right_margin=5mm, minorgrid=true)
ylims!(p, (0, 0.3))
yaxis!(p, yminorticks=5)
plot!(p, size=(600, 450))
display(p)
Figure 5: Monte Carlo estimation for the airshed model.

Upshot of Monte Carlo

  • Estimate summary statistics through simulation
  • Need input distributions. How do we get them?
  • Uncertainty Quantification: Inferring probabilistic representations of uncertainties (such as input distributions).
  • Some UQ methods:
    • The Bootstrap
    • Markov chain Monte Carlo
    • Expert elicitation

The Bootstrap

Sampling Distributions

The sampling distribution of a statistic captures the uncertainty in that statistic arising from the randomness of the sample; a small simulation sketch follows.
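
For example, here is a minimal sketch (not from the lecture) that approximates the sampling distribution of a sample mean by repeated simulation; the LogNormal population and sample size are illustrative.

Code
using Distributions, Plots, Statistics, Random

Random.seed!(1)
pop = LogNormal(log(2), 1)                       # illustrative "true" population
n = 20                                           # sample size
means = [mean(rand(pop, n)) for _ in 1:1000]     # recompute the statistic on 1,000 fresh samples
histogram(means, xlabel="Sample Mean", ylabel="Count", label=false)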

Sampling Distribution

The Bootstrap Principle

Efron (1979) suggested combining estimation with simulation: the bootstrap.

Key idea: use the data to simulate a data-generating mechanism.

Baron von Munchhausen Pulling Himself By His Hair

Source: Wikipedia

Monte Carlo vs Bootstrapping

Monte Carlo: If we have a generative probability model (including input distributions), simulate new samples from the model and estimate the sampling distribution.

Bootstrap: If we only have data, assume it is representative of the “true” population and simulate new samples based on properties of the data itself.

Why Does The Bootstrap Work?

Efron’s key insight: due to the Central Limit Theorem, the differences between estimates drawn from the sampling distribution and the true value converge to a normal distribution.

  • Use the bootstrap to approximate the sampling distribution through re-sampling and re-estimation.
  • Can draw asymptotic quantities (bias estimates, confidence intervals, etc) from the differences between the sample estimate and the bootstrap estimates.

What Can We Do With The Bootstrap?

Let \(t_0\) be the “true” value of a statistic, \(\hat{t}\) the estimate of the statistic from the sample, and \((\tilde{t}_i)\) the bootstrap estimates.

  • Estimate Variance: \(\text{Var}[\hat{t}] \approx \text{Var}[\tilde{t}]\)
  • Bias Correction: \(\mathbb{E}[\hat{t}] - t_0 \approx \mathbb{E}[\tilde{t}] - \hat{t}\)
  • Compute basic \((1-\alpha)\) confidence intervals (a sketch follows this list): \[\left(\hat{t} - (Q_{\tilde{t}}(1-\alpha/2) - \hat{t}),\; \hat{t} - (Q_{\tilde{t}}(\alpha/2) - \hat{t})\right)\]
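
A minimal sketch (not from the lecture) of these three quantities, given a sample estimate and a vector of bootstrap replicates; the helper boot_summaries is hypothetical.

Code
using Statistics

function boot_summaries(t_hat, t_boot; α=0.05)
    boot_var = var(t_boot)                      # Var[t̂] ≈ Var[t̃]
    boot_bias = mean(t_boot) - t_hat            # bias: E[t̂] - t₀ ≈ E[t̃] - t̂
    q_lo, q_hi = quantile(t_boot, [α/2, 1 - α/2])
    basic_ci = (2t_hat - q_hi, 2t_hat - q_lo)   # basic (1-α) confidence interval
    return (var = boot_var, bias = boot_bias, ci = basic_ci)
end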

The Non-Parametric Bootstrap

The Non-Parametric Bootstrap

The non-parametric bootstrap is the most “naive” approach to the bootstrap: resample-then-estimate.
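
A minimal generic sketch (not the lecture's code) of resample-then-estimate; np_bootstrap is a hypothetical helper, and sample comes from StatsBase.

Code
using StatsBase, Statistics

function np_bootstrap(statistic, dat, B)
    # resample the data with replacement B times and re-estimate the statistic
    return [statistic(sample(dat, length(dat); replace=true)) for _ in 1:B]
end

# e.g., 1,000 bootstrap replicates of the sample mean:
# boot_means = np_bootstrap(mean, dat, 1000)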

Non-Parametric Bootstrap

Simple Example: Is A Coin Fair?

Suppose we have observed twenty flips of a coin and want to know whether it is weighted.

Code
# define coin-flip model; assumes Distributions is loaded
p_true = 0.6                       # true heads probability
n_flips = 20
coin_dist = Bernoulli(p_true)
# generate data set
dat = rand(coin_dist, n_flips)
freq_dat = sum(dat) / length(dat)  # observed heads frequency
dat'
1×20 adjoint(::Vector{Bool}) with eltype Bool:
 0  1  0  1  0  0  1  0  1  0  1  0  1  1  1  0  0  1  1  0

The frequency of heads is 0.5.

Is The Coin Fair?

Code
# bootstrap: draw new samples; assumes StatsBase (sample) and Statistics (quantile) are loaded
function coin_boot_sample(dat)
    # resample the observed flips with replacement
    boot_sample = sample(dat, length(dat); replace=true)
    return boot_sample
end

function coin_boot_freq(dat, nsamp)
    # heads frequency for each of nsamp bootstrap resamples
    boot_freq = [sum(coin_boot_sample(dat)) for _ in 1:nsamp]
    return boot_freq / length(dat)
end

boot_out = coin_boot_freq(dat, 1000)
q_boot = 2 * freq_dat .- quantile(boot_out, [0.975, 0.025])  # basic 95% CI endpoints

p = histogram(boot_out, xlabel="Heads Frequency", ylabel="Count", title="1000 Bootstrap Samples", titlefontsize=20, guidefontsize=18, tickfontsize=16, legendfontsize=16, label=false, bottom_margin=7mm, left_margin=5mm, right_margin=5mm)
vline!(p, [p_true], linewidth=3, color=:orange, linestyle=:dash, label="True Probability")
vline!(p, [mean(boot_out) ], linewidth=3, color=:red, linestyle=:dash, label="Bootstrap Mean")
vline!(p, [freq_dat], linewidth=3, color=:purple, linestyle=:dash, label="Observed Frequency")
vspan!(p, q_boot, linecolor=:grey, fillcolor=:grey, alpha=0.3, fillalpha=0.3, label="95% CI")
plot!(p, size=(1000, 450))
Figure 6: Bootstrap heads frequencies from 1,000 resamples of the 20 observed flips.

Larger Sample Example

Code
n_flips = 50
dat = rand(coin_dist, n_flips)
freq_dat = sum(dat) / length(dat)

boot_out = coin_boot_freq(dat, 1000)
q_boot = 2 * freq_dat .- quantile(boot_out, [0.975, 0.025])

p = histogram(boot_out, xlabel="Heads Frequency", ylabel="Count", title="1000 Bootstrap Samples", titlefontsize=20, guidefontsize=18, tickfontsize=16, legendfontsize=16, label=false, bottom_margin=7mm, left_margin=5mm, right_margin=5mm)
vline!(p, [p_true], linewidth=3, color=:orange, linestyle=:dash, label="True Probability")
vline!(p, [mean(boot_out) ], linewidth=3, color=:red, linestyle=:dash, label="Bootstrap Mean")
vline!(p, [freq_dat], linewidth=3, color=:purple, linestyle=:dash, label="Observed Frequency")
vspan!(p, q_boot, linecolor=:grey, fillcolor=:grey, alpha=0.3, fillalpha=0.3, label="95% CI")
plot!(p, size=(1000, 450))
Figure 7: Bootstrap heads frequencies from 1,000 resamples of the 50 observed flips.

Bootstrapping with Structured Data

The naive non-parametric bootstrap that we just saw doesn’t work if the data have structure, e.g. spatial or temporal dependence.

Bootstrapping with Structured Data

Code
# assumes CSV, DataFrames, Plots, and StatsBase are loaded
sl_dat = CSV.read(joinpath(@__DIR__, "data", "sealevel", "CSIRO_Recons_gmsl_yr_2015.csv"), DataFrame)
p1 = plot(sl_dat[:, 1] .- 0.5, sl_dat[:, 2], xlabel="Year", ylabel="GMSL (mm)", title="Sea-Level Rise Observations", label=:false, linewidth=2, tickfontsize=16, guidefontsize=18, titlefontsize=20)
plot!(p1, size=(600, 450))

# naive resample: redraw GMSL values with replacement, destroying the temporal structure
resample_index = sample(1:nrow(sl_dat), nrow(sl_dat); replace=true)
p2 = plot(sl_dat[:, 1] .- 0.5, sl_dat[resample_index, 2], xlabel="Year", ylabel="GMSL (mm)", title="Sea-Level Rise Resample", label=:false, linewidth=2, tickfontsize=16, guidefontsize=18, titlefontsize=20)
plot!(p2, size=(600, 450))

display(p1)
display(p2)
Figure 8: Simple bootstrap with time series data: (a) sea-level rise observations, (b) resample.

Why Use The Bootstrap?

  • Do not need to rely on variance asymptotics;
  • Can obtain non-symmetric CIs.

Approaches to Bootstrapping Structured Data

  • Correlations: Transform to uncorrelated data (principal components, etc.), sample, transform back.
  • Time Series: Block bootstrap

Block Bootstrap

Divide the data \(x_t\) into blocks of length \(b\), \(X_1, \ldots, X_n\).

For example: \[X_1 = x_{1:b}, X_2 = x_{b+1:2b}, \ldots, X_n = x_{(n-1)b+1:nb}\]

Then resample the blocks with replacement and concatenate them: \(X_{\sigma(1)}, \ldots, X_{\sigma(n)}\), where each \(\sigma(i)\) is a block index drawn uniformly at random (a generic sketch follows).
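
A minimal generic sketch (not the lecture's code) of the non-overlapping block bootstrap; block_bootstrap is a hypothetical helper, and the next slide applies the same idea to the sea-level data.

Code
using StatsBase

function block_bootstrap(x, b)
    n_blocks = div(length(x), b)                       # number of complete blocks (drop any remainder)
    blocks = [x[(j-1)*b+1:j*b] for j in 1:n_blocks]    # X_1, ..., X_n
    σ = sample(1:n_blocks, n_blocks; replace=true)     # resample block indices with replacement
    return reduce(vcat, blocks[σ])                     # glue the resampled blocks back together
end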

Block Bootstrap Example

Code
# block boundaries every 10 years (13 blocks); assumes StatsBase and Plots are loaded
block_idx = collect(1:10:131)
blocks_resample_idx = sample(1:length(block_idx)-1, length(block_idx)-1)  # resample which blocks to use
p1 = plot(sl_dat[1:130, 1] .- 0.5, sl_dat[1:130, 2], xlabel="Year", ylabel="GMSL (mm)", title="Sea-Level Rise Observations", label=:false, linewidth=2, tickfontsize=16, guidefontsize=18, titlefontsize=20)
vspan!(p1, sl_dat[block_idx, 1] .- 0.5, fillcolor=:gray, fillalpha=0.3, label=false)  # shade alternating blocks
plot!(p1, size=(600, 450))

# glue the resampled 10-year blocks back together
block_resample = zeros(130)
for i = 1:length(block_idx)-1
    block_resample[(i-1)*10+1:i*10] = sl_dat[block_idx[blocks_resample_idx[i]]:block_idx[blocks_resample_idx[i]]+9, 2]
end
p2 = plot(sl_dat[1:130, 1] .- 0.5, block_resample, xlabel="Year", ylabel="GMSL (mm)", title="Sea-Level Rise Resample", label=:false, linewidth=2, tickfontsize=16, guidefontsize=18, titlefontsize=20)
plot!(p2, size=(600, 450))

display(p1)
display(p2)
Figure 9: Block bootstrap with time series data: (a) sea-level rise observations with blocks shaded, (b) block resample.

Generalizing the Block Bootstrap

The abrupt transitions between resampled blocks can degrade estimator quality.

  • Improve transitions between blocks
  • Moving blocks (allow blocks to overlap); a minimal sketch follows
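
A minimal sketch (not from the lecture) of the moving-block idea: blocks of length b may start at any index, so they overlap; moving_block_bootstrap is a hypothetical helper.

Code
using StatsBase

function moving_block_bootstrap(x, b)
    n = length(x)
    n_blocks = ceil(Int, n / b)                              # enough blocks to cover the series
    starts = sample(1:(n - b + 1), n_blocks; replace=true)   # overlapping blocks can start anywhere
    return reduce(vcat, [x[s:s+b-1] for s in starts])[1:n]   # concatenate and trim to length n
end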

Key Points and Upcoming Schedule

Key Points

  • Bootstrap: Approximate sampling distribution by re-simulating data
  • Non-Parametric Bootstrap: Treat the data as representative of the population and re-sample.
  • More complicated for structured data.

Sources of Non-Parametric Bootstrap Error

  1. Sampling error: error from using finitely many replications
  2. Statistical error: error in the bootstrap sampling distribution approximation

When To Use The Non-Parametric Bootstrap

  • Sample is representative of the underlying population
  • Doesn’t work well for extreme values!

Next Classes

Monday: Parametric Bootstrap and Examples

Wednesday: What is a Markov Chain?

Assessments

Exercise 7: Due Friday

References

References

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. Stat., 7, 1–26. https://doi.org/10.1214/aos/1176344552