Bootstrapping: The Basics

Nov 26, 2020 · 14 min read

What is bootstrapping? Why is it useful? How do we do it?

An overview of the bootstrapping process. (Diagram by author.)

Why bootstrap?

At the broadest level, the goal of statistics is to make inferences about a population based on observations drawn from that population. Often, we want to estimate key features of a population or distribution (such as the mean, median, or variance), as well as how certain these estimates are. However, what if we can’t go and collect more data to see how our estimates vary? Or, what if there is no formula for calculating the uncertainty of a particular statistic? In such cases, bootstrapping can be used to quickly estimate the uncertainty around any chosen estimator, using basic resampling and simulation.

What we’ll cover:

In this blog post, we will cover the basics of what the bootstrapping method is, how it works, and why it is useful. To do so, we will explore an example scenario of bootstrapping: estimating voters’ preference in an election. Then, we will discuss key limitations of the bootstrapping method, so we know where it works best and what it can realistically tell us. Finally, we will briefly examine different types of bootstrapping and where they are useful. I hope to show how bootstrapping can be a simple and useful tool for estimating statistics of unknown populations and distributions (from the mean or median to more complex functions such as ratios and correlations), and how those estimates vary, using available data.

What is bootstrapping?

Bootstrapping refers to randomly resampling from the original observations with replacement to create many simulated samples, from which a chosen statistic is calculated. Then, the variability of the statistic over the bootstrap samples can be easily measured. Sampling with replacement means that we return each (random) pick to the set of observations that we pick from before randomly picking the next observation. By sampling in this way, we ensure that we get different simulated samples (typically with the same number of observations as the original data). For example, if we sample 5 numbers from an initial set of 5 numbers: [1,2,3,4,5] with replacement, some observations may be picked multiple times and some may not be chosen at all: e.g. [2,3,3,5,1].
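As a small illustration of sampling with replacement, here is a minimal sketch. It assumes Julia's built-in `rand` (which draws elements independently, i.e. with replacement) rather than the `sample` function used later in this post, and the seed is arbitrary:

```julia
using Random

rng = MersenneTwister(1)          # arbitrary seed, only for reproducibility
original = [1, 2, 3, 4, 5]

# rand(rng, collection, n) draws n elements independently and uniformly,
# so the same element can be picked more than once (sampling with replacement)
resample = rand(rng, original, length(original))

println("original: ", original)
println("resample: ", resample)   # some values may repeat, others may be missing
```

The resample always has the same length as the original and only contains values that were observed, which is exactly the property the bootstrap relies on.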

The idea of bootstrapping is that, instead of taking multiple samples from the underlying distribution or population (which may be too expensive or not possible), we can use our available observations to simulate taking multiple samples from our empirical distribution — which is the probability measure obtained by assigning equal probability mass at each observed data point. Since our calculated statistic will differ between simulated samples, we can estimate how our statistic varies across the sampling.

To be precise, bootstrapping approximates the value of a plug-in estimator of a statistical functional of the underlying distribution. A statistical functional is any function that maps a distribution to a real number, such as the minimum, median, or variance. A plug-in estimator of a statistical functional is just the statistical functional of the empirical distribution, obtained by ‘plugging-in’ the empirical distribution (Watson, 2020). For example, the plug-in estimator of the mean of a distribution is the mean of the empirical distribution, i.e. the sample mean.
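To make the plug-in idea concrete, the sketch below builds the empirical distribution explicitly (mass 1/n on each point) and applies two functionals to it. The observations are hypothetical, chosen only for illustration:

```julia
using Statistics

x = [2.0, 3.0, 3.0, 5.0, 7.0]     # hypothetical observations
n = length(x)
weights = fill(1 / n, n)          # empirical distribution: mass 1/n on each point

# plug-in mean: the expectation under the empirical distribution
plugin_mean = sum(weights .* x)

# plug-in variance: the variance under the empirical distribution (divides by n)
plugin_var = sum(weights .* (x .- plugin_mean) .^ 2)

println(plugin_mean ≈ mean(x))                    # true
println(plugin_var ≈ var(x; corrected = false))   # true
```

Note that the plug-in variance divides by n rather than n - 1, so it matches `var` only with `corrected = false`.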

In other words, the bootstrap estimate approximates the statistical functional (e.g. the mean or median) of the empirical distribution. How closely this value approximates the statistic for the true population or distribution depends on how well the observations represent the overall population.

Why is it called bootstrapping?

The name “bootstrapping” comes from the phrase “to pull oneself up by one’s bootstraps”, based on the 18th century story of Baron Munchausen who, impossibly, saved himself from drowning by pulling himself up by his own bootstraps (Efron & Tibshirani, 1994). Indeed, creating more samples from a single sample may seem like a similarly implausible trick. However, we will see that bootstrapping is truly a useful technique to measure uncertainty.

Baron Munchausen and his famous bootstraps. (R.E. Raspe, 1902)

Why is it useful?

The main benefit of bootstrapping is that we can easily estimate the sampling distribution of any estimator, without assuming any knowledge of the underlying distribution. With this information, we can measure standard errors, build confidence intervals, and test hypotheses around any statistical functional we’d like. In the words of Bradley Efron, who introduced the bootstrap in the late 1970s, and his co-author Robert Tibshirani, “it is easy to write a bootstrap program that works for any computable statistic” (p.15, Efron & Tibshirani, 1994).

So let’s see how it works in practice.

Bootstrap example: estimating voter preferences

Imagine there’s an election coming up, and we want to predict who will win. For now, we want to estimate who voters in our small town prefer. To do so, we conduct a simple poll of likely voters. However, since we can’t afford to survey all 700 likely voters in our town, we decide to sample 100 at random.

In the poll, respondents indicate which candidate they prefer: Red or Blue. (In this scenario, there are only 2 candidates, and the candidate with the majority of votes will win.)

This binary outcome for each respondent can be modelled using a Bernoulli trial, where a voter’s preference is represented as either a 0 (e.g. preference for candidate Red) or 1 (preference for candidate Blue).

At first, we want to know two things:

Which candidate is preferred overall?
How certain are we of the first result?

We know that all polls have random sampling error, so we want to estimate how likely our observed result is. In particular, we may want to know whether an observed lead is significant or due to chance.

Let’s create some voters:

# programming in Julia
# create a class of voters
struct Voter
    X::Vector # position on graph
    color::String
    preference::Int # 0 for red, 1 for blue
end

# function to generate voters
function voter()
    # ?% chance of red, ?% chance of blue
    if rand() < [removed] #removed p for this display, since unknown
        Voter([rand(), rand()], "red", 0)
    else
        Voter([rand(), rand()], "blue", 1)
    end
end

While we don’t know the actual distribution of voters, if we could see our overall population it may look like:

Here, each point represents a likely voter. The blue points represent voters who prefer the Blue candidate, while red points represent those who support the Red candidate. Each point is uniformly distributed in the space (in reality, voters tend to be more geographically polarized). Since we haven’t actually observed any of these points or preferences yet, the points are semi-transparent. (Figure by author.)

By randomly surveying 100 people from the population of likely voters, we get 100 observations:

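# randomly sample 100 observations without replacement
# (all_voters is the full collection of likely voters; its construction is not shown)
using StatsBase # provides sample
observations = sample(all_voters, 100, replace=false)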

100 observations drawn randomly from the unknown population. For the 100 randomly surveyed voters, we can now see their preferences! (Figure by author.)

Since we only know the political preferences for the 100 people, our observations actually look like:

Our observations. All we know for now. (Figure by author.)

using Statistics # provides mean, std, and quantile
# collect all the observed voter preferences
obs_pref = [voter.preference for voter in observations]

# calculate mean voter preference for the observations
mean(obs_pref)

Out of the 100 people we polled, 46 prefer candidate Blue. Since ‘1’ represents support for candidate Blue and ‘0’ represents support for candidate Red, the sample mean is 0.46. From this, it seems that the majority of likely voters prefer candidate Red. But how do we know if it is accurate? How might it vary across different random samples?

We don’t know the true population, and unfortunately we can’t survey more people. Instead, we can apply the bootstrap method to estimate the uncertainty around the mean of observed voter preferences.

We have our initial observations, and our statistical functional: the mean of the distribution. Since we care about how the polling result based on 100 people may vary, we also have a second statistical functional: the standard deviation of the mean of 100 independent observations drawn from the distribution. The second statistical functional is important because it provides a standard error for computing a confidence interval for our estimate of the distribution mean.

Now, we perform basic bootstrapping:

1. Create bootstrap samples by randomly drawing observations from the original observations with replacement. Since we have 100 original observations, we collect 100 randomly selected observations for each simulated sample.
2. For each bootstrap sample of likely voters (n=100), calculate the sample’s mean. Save this estimate to our list of bootstrap estimates.
3. Repeat steps 1 & 2 over many simulations. Programming makes these simulations easy: in this example I chose to simulate 100’000 times (taking seconds).
4. Plot a histogram of our bootstrap estimates, to see how they vary. We can then compute a bootstrap confidence interval by taking the quantiles of our bootstrap estimates.

In Julia, these steps look like:

# create a typed array to store bootstrap estimates
sample_means = Float64[]

# simulate 10^5 bootstrap samples and store their means
for B in 1:10^5
    # create a new bootstrap sample by sampling with replacement from the observations
    bootstrap_sample = sample(observations, length(observations), replace=true)
    # calculate the bootstrap sample mean preference
    bootstrap_mean_pref = mean([voter.preference for voter in bootstrap_sample])
    # add bootstrap sample mean to list of sample means
    push!(sample_means, bootstrap_mean_pref)
end

For example, each of the four samples below was obtained by resampling with replacement from the original set of observations. The more times a point has been resampled in a given bootstrap sample, the darker it appears:

Bootstrap samples obtained by sampling with replacement from the original observations, each of size n=100. (Figure by author.)

Visualizing multiple bootstrap samples at once, we get:

Here, the points are shifted up slightly so that multiple resamplings of the same points are more visible. Overall, we can see that bootstrap samples depend on the initial observations, but differ in terms of proportion of reds and blues. (Figure by author.)

By simulating 10’000 bootstrap samples and calculating each of their mean preferences, we arrive at the following sampling distribution:

Histogram of the 10'000 bootstrap estimates of mean voter preference. (Figure by author.)

What if we bootstrap 10x more, i.e. 100’000 times?

Histogram of 100'000 bootstrapped mean voter preference bootstrap estimates. The benefits of a computer. (Figure by author.)

By the central limit theorem, the distribution of sample means is approximately normal. As we can see above, the histogram becomes smoother, and its roughly normal shape clearer, as we simulate more bootstrap samples. While a normal distribution allows us to calculate the uncertainty analytically in this case, the bootstrap sampling distribution does not have to look normal for the bootstrap to work. Instead, we can just calculate the standard deviation of our bootstrap estimates. Further, to form our bootstrap confidence interval, we can directly compute the quantiles of our bootstrap estimates to see which range of values typically traps our estimate.

# calculate the standard deviation of the bootstrap estimates
std(sample_means)

# compute the bootstrap 90% confidence interval by taking quantiles
boot_CI = quantile(sample_means, [0.05, 0.95])

In this example, the standard deviation of the 100’000 bootstrap estimates is approximately 0.04977.

Taking the 0.05 and 0.95 quantiles, we directly estimate the bootstrap 90% confidence interval, which is the middle 90% of the histogram:

Histogram of the bootstrap mean preferences, with the 90% confidence interval between the blue lines. The true mean preference in the unknown population is shown by the orange line (only revealed now). (Figure by author.)

The bootstrap 90% confidence interval around our estimate of the mean voter preference is [0.38, 0.54]. What this means is that confidence intervals constructed in this way will capture the population mean preference approximately 90% of the time (e.g. if we collected 100 different sets of initial observations and repeated these steps to form a similar confidence interval each time, we would expect about 90 of these confidence intervals to trap the true mean value).
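When the population is known, this “90% of the time” reading can be checked by simulation. The sketch below is only an illustration under stated assumptions: a hypothetical true Blue share `p_true = 0.52`, independent Bernoulli draws instead of sampling 100 out of 700 voters without replacement, and fewer repetitions than in the main example to keep the run short. The helper names `boot_ci` and `covers` are made up for this sketch.

```julia
using Random, Statistics

rng = MersenneTwister(2020)       # arbitrary seed
p_true = 0.52                     # hypothetical true share of Blue supporters
n, n_boot, n_polls = 100, 1_000, 500

# percentile bootstrap 90% interval for the mean of one poll
function boot_ci(rng, obs, n_boot)
    means = [mean(rand(rng, obs, length(obs))) for _ in 1:n_boot]
    return quantile(means, [0.05, 0.95])
end

# run one simulated poll and report whether its interval contains p_true
function covers(rng, p_true, n, n_boot)
    poll = Int.(rand(rng, n) .< p_true)   # 1 = Blue, 0 = Red
    lo, hi = boot_ci(rng, poll, n_boot)
    return lo <= p_true <= hi
end

# fraction of simulated polls whose interval traps the true mean
coverage = mean(covers(rng, p_true, n, n_boot) for _ in 1:n_polls)
println("empirical coverage: ", coverage)   # expected to land near 0.90
```

Because the data are discrete and the number of repetitions is finite, the empirical coverage will usually be close to, but not exactly, 0.90.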

As it turns out, the true mean voter preference across the entire population of 700 likely voters was 0.5257, i.e. 52.57% support for candidate Blue (which we wouldn’t know in practice, at least until the actual election). Our bootstrap confidence interval traps this true value. Moreover, since the neutral 0.50 mean preference is within our 90% confidence interval (i.e. not in the critical region outside the interval), we cannot reject a null hypothesis that our observed lead is due to chance.

In this particular example, we can actually estimate the standard error of our poll sample mean analytically. Since the poll mean can be viewed as the mean of many Bernoulli trials (with a ‘success’ = 1 and ‘failure’ = 0), the standard error of our sample estimate can be approximated by:

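Estimate of the standard error of a Bernoulli proportion: $\widehat{\mathrm{SE}}(\bar{X}) = \sqrt{\bar{X}(1-\bar{X})/n}$, where $\bar{X}$ is the sample mean and $n$ is the number of observations.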

X̅ = mean(obs_pref)

# estimate of standard error
sqrt(X̅*(1-X̅)/length(obs_pref))

The analytical estimate gives a standard error of approximately 0.04984, close to our bootstrap standard error of 0.04977.

Similarly, we can also compute a confidence interval based on the normal approximation, to verify our bootstrap confidence interval. Using a z-score of 1.645, the 90% confidence interval assuming that our estimate is normally distributed is:

# binomial (normal approximation) 90% confidence interval
se = sqrt(X̅*(1-X̅)/length(observations))
binom_CI = [X̅ - 1.645*se, X̅ + 1.645*se]

Which is approximately [0.378, 0.542]. Plotting both confidence intervals:

Histogram of bootstrap estimates with 90% bootstrap confidence interval (in blue) and the normal approximated confidence interval (in green). (Figure by author.)

Thus, our bootstrap 90% confidence interval of [0.38, 0.54] is very close to the one obtained by normal approximation, without making any initial assumptions about the underlying distribution. Since our estimator is approximately normally distributed in this case, either confidence interval works here. However, when the sampling distribution of our estimator is far from normal, the bootstrap confidence interval can be more accurate, as it does not rely on assuming normality.

As we can see, the bootstrap method closely approximated the standard error of the poll sample mean, without assuming any knowledge of the distribution. However, our misleading estimate of a majority of support for candidate Red highlights how bootstrapping still depends on how well our observations and the resulting empirical distribution approximate the underlying population. While bootstrapping can give us a measure of the uncertainty of our mean voter preference, an unrepresentative initial polling sample can still lead to prediction error.

So, what can we do to achieve accurate predictions, other than collecting more data? In real-life polling estimates, bootstrapping can be combined with methods such as quota sampling to ensure that bootstrap samples are drawn from observations that reflect the broader population along relevant characteristics (such as education, age, and geography) (Sturgis et al., 2018).

Beyond this simple polling example, bootstrapping can approximate standard errors and confidence intervals for any calculable estimator when the distribution is actually unknown, and when there is no formula for variance (e.g. for a distribution’s median, or more complex ratios, expectations, or correlations).
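As a rough sketch of what this looks like for estimators without a simple variance formula, the example below bootstraps a median and a correlation. The data are hypothetical (generated just for this illustration), and Julia's built-in `rand` is used for resampling instead of `sample`:

```julia
using Random, Statistics

rng = MersenneTwister(7)               # arbitrary seed
n = 200
# hypothetical data: a skewed variable x and a noisy variable y related to it
x = randexp(rng, n)
y = 0.5 .* x .+ randn(rng, n)

n_boot = 5_000
boot_medians = Float64[]
boot_cors = Float64[]
for _ in 1:n_boot
    idx = rand(rng, 1:n, n)                 # resample row indices with replacement
    push!(boot_medians, median(x[idx]))
    push!(boot_cors, cor(x[idx], y[idx]))   # same indices keep each (x, y) pair together
end

println("median: SE ≈ ", std(boot_medians),
        ", 90% CI ≈ ", quantile(boot_medians, [0.05, 0.95]))
println("correlation: SE ≈ ", std(boot_cors),
        ", 90% CI ≈ ", quantile(boot_cors, [0.05, 0.95]))
```

For paired data, resampling indices rather than each variable separately matters: it preserves the relationship between x and y within every bootstrap sample.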

Step-by-step bootstrapping

So, what are the steps of bootstrapping in general?

To bootstrap an estimator for any statistical functional using available data, you just need to:

1. Get an initial set of independent observations from an unknown population or distribution.
2. Choose an estimator of a statistical functional of the population that you want to approximate (i.e. what statistical functional do you want to know? For example, the median and how it varies).
3. Choose the total number of bootstrap samples (i.e. how many times will you simulate a new sample?). This is usually as many as is reasonable to compute.
4. Choose the number of observations in each bootstrap sample (typically the same as the number of original observations, to mimic a new sample with comparable uncertainty. Thus, if you have 10 initial observations, each bootstrap sample would have 10 randomly sampled observations).
5. Simulate the plug-in estimator many times. For each bootstrap sample/simulation:
   a) Randomly draw observations from the original observations with replacement, until you have your chosen number of observations.
   b) Calculate the estimate of your chosen statistical functional on your sample, e.g. the median of your bootstrap sample, and save it to a list.
6. Calculate the mean and standard deviation of the bootstrap estimates across the simulations.
7. Plot a histogram of the bootstrap estimates. What range of values captures the estimate most of the time? Directly compute quantiles of the estimates for a confidence interval.
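The steps above can be sketched as one small, reusable function. This is a minimal sketch rather than a reference implementation: `bootstrap` is a hypothetical helper (not from any package), the data are made up, and plotting is left out.

```julia
using Random, Statistics

# Hypothetical helper: nonparametric bootstrap of an arbitrary statistic.
# `statistic` maps a vector of observations to a number (e.g. median).
function bootstrap(statistic, data; n_boot = 10_000, n_obs = length(data),
                   rng = Random.default_rng())
    # each iteration resamples with replacement and records the statistic
    return [statistic(rand(rng, data, n_obs)) for _ in 1:n_boot]
end

data = [3.1, 4.7, 2.2, 5.9, 4.0, 3.8, 6.4, 2.9, 4.4, 5.1]   # hypothetical observations
estimates = bootstrap(median, data; n_boot = 2_000, rng = MersenneTwister(3))

boot_mean = mean(estimates)                     # mean of the bootstrap estimates
boot_se = std(estimates)                        # bootstrap standard error
boot_ci = quantile(estimates, [0.05, 0.95])     # 90% percentile interval
println("mean = ", boot_mean, ", SE = ", boot_se, ", 90% CI = ", boot_ci)
```

Passing the statistic as a function keeps the resampling loop identical for the mean, the median, or any other computable statistic; only the first argument changes.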

When is bootstrapping useful?

In general, bootstrapping is useful when:

We only have one set of (independent) observations, and cannot collect more data
There is no formula for the uncertainty of an estimator, or it is difficult to calculate analytically
The underlying distribution is unknown — regular bootstrapping does not assume the shape of the underlying distribution
We want to verify the variability of our results

What can’t bootstrapping tell us?

Importantly, bootstrapping cannot generate new information about the original data, and cannot tell us directly about the statistical functional of the underlying distribution or population.

The key limitations of bootstrapping are that:

Bootstrapping estimates the statistical functional of the empirical distribution (from the available observations sampled with replacement), while we really want to know the statistical functional of the true distribution or population. How well the former approximates the latter depends on having enough, and sufficiently representative, observations.
Relatedly, if there are not enough initial observations to sample from, running more simulations will not bridge the gap between the bootstrap estimate and the statistical functional of the true distribution.
The data must be (approximately) independent and identically distributed. The main assumption for bootstrapping is that observations are drawn from the same data-generating process, and are independent from each other.
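The first two limitations can be illustrated with a short sketch. It assumes a hypothetical population mean of 0.5 and a deliberately small sample of 10 observations; `boot_centre` is a made-up helper name. However many bootstrap simulations we run, the bootstrap estimates centre on the sample mean, not on the population mean:

```julia
using Random, Statistics

rng = MersenneTwister(11)                  # arbitrary seed
p_true = 0.5                               # hypothetical population mean
obs = Int.(rand(rng, 10) .< p_true)        # a small sample of only 10 observations

# average of the bootstrap means for a given number of simulations
boot_centre(n_boot) = mean(mean(rand(rng, obs, length(obs))) for _ in 1:n_boot)

centres = [boot_centre(b) for b in (100, 10_000, 100_000)]
println("population mean: ", p_true)
println("sample mean: ", mean(obs))
println("centre of bootstrap estimates as n_boot grows: ", centres)
```

More simulations only sharpen our picture of the empirical distribution; any gap between the sample mean and the population mean stays exactly where it was.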

What are other types of bootstrapping, and where are they useful?

There are three main categories of bootstrapping, which differ based on the assumptions made about the underlying population. So far, the type of bootstrapping we have explored is called nonparametric (or resampling) bootstrapping, whose defining characteristic is that no assumptions are made about the underlying distribution. As we’ve seen, this technique involves randomly resampling the available observations to create bootstrap samples, from which we estimate a bootstrap sampling distribution for our given estimator.

However, as we’ve seen, the nonparametric bootstrap only resamples existing observations, and is thus highly dependent on the available data. In reality, we may expect that the true population contains observations that are different from, yet still similar to, the ones we have.

The second type of bootstrap, semiparametric bootstrapping, builds in this assumption that similar unobserved data exist in our population of interest. To do so, semiparametric bootstrapping adds (typically normally distributed) random noise to nonparametric bootstrapped observations. By including such noise, semiparametric bootstrap samples do not exactly replicate the original observations, and introduce smoothness to the estimated underlying distribution (important for optimization techniques) (Penn State University, 2018). In fact, this method is equivalent to sampling from a kernel density estimate of the data.
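A minimal sketch of this noise-adding idea follows. The data are hypothetical, and the noise scale `h` is chosen with Silverman's rule of thumb, which is an assumption made here for illustration; the post itself does not prescribe a bandwidth.

```julia
using Random, Statistics

rng = MersenneTwister(5)                   # arbitrary seed
data = randn(rng, 50) .* 2 .+ 10           # hypothetical continuous observations

# noise scale via Silverman's rule of thumb (an assumption; other choices are possible)
h = 1.06 * std(data) * length(data)^(-1 / 5)

# one smoothed bootstrap sample: resample with replacement, then add Gaussian noise
smoothed_sample() = rand(rng, data, length(data)) .+ h .* randn(rng, length(data))

boot_medians = [median(smoothed_sample()) for _ in 1:5_000]
println("smoothed bootstrap SE of the median ≈ ", std(boot_medians))
```

Resampling a point and then adding Gaussian noise of scale `h` is the same as drawing from a Gaussian kernel density estimate with bandwidth `h`, which is why the samples are no longer restricted to the observed values.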

The third type of bootstrapping is parametric bootstrapping. In this type, the underlying distribution is known or assumed (e.g. if we know the data comes from a normal or Poisson distribution), yet its parameters are unknown. Here, the missing parameters are first estimated from the data in order to approximate the distribution, which is then used to generate new samples (Penn State University, 2018).
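As a sketch of the parametric version, assume (purely for illustration) that the data come from a normal distribution. We estimate its mean and standard deviation from hypothetical observations, then simulate new samples from the fitted distribution instead of resampling the data:

```julia
using Random, Statistics

rng = MersenneTwister(9)                   # arbitrary seed
data = randn(rng, 50) .* 2 .+ 10           # hypothetical data, assumed to be normal

# step 1: estimate the unknown parameters from the data
mu_hat, sigma_hat = mean(data), std(data)

# step 2: generate new samples from the fitted normal distribution
param_sample() = mu_hat .+ sigma_hat .* randn(rng, length(data))

# step 3: compute the statistic on each simulated sample
boot_medians = [median(param_sample()) for _ in 1:5_000]
println("parametric bootstrap SE of the median ≈ ", std(boot_medians))
```

If the normality assumption is wrong, the simulated samples inherit that error, which is the price paid for the extra accuracy when the assumption holds.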

Broadly, nonparametric bootstrapping is the simplest technique and works well when the underlying distribution is unknown. Semiparametric bootstrapping is useful for machine learning tasks that require smoother probability density functions. Parametric bootstrapping works when the underlying distribution is known, often producing more accurate estimates than nonparametric bootstrapping if the assumptions about the distribution hold.

Conclusion

We’ve seen that bootstrapping is a fast and flexible way to estimate the uncertainty around any statistical functional, without having to collect more data or make assumptions about the underlying distribution (for nonparametric and semiparametric bootstrapping, at least). Bootstrapping is especially useful when our estimators of interest are complex and don’t have formulas for measures of their uncertainty. By making use of sampling with replacement and simulation, we can estimate sampling distributions of our estimates and see how they vary.

We’ve also seen that bootstrapping is not magic, and that the accuracy of the bootstrap method depends on how well our available observations approximate the true population. Further, bootstrap can be done in tandem with more traditional approaches. As bootstrapping experts Davison and Hinkley (1997) note, bootstrapping helps “avoid tedious calculations based on questionable assumptions”, but “cannot replace clear critical thought about the problem, appropriate design of the investigation and data analysis and incisive presentation of conclusions” (p.4). Nevertheless, I hope I’ve helped show how bootstrapping is a flexible tool for measuring uncertainty around estimators with available data and few assumptions.

Thanks for reading!

References

Brownlee, J. (2019, August 08). A Gentle Introduction to the Bootstrap Method. Retrieved from https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge University Press.

Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC Press.

Joseph, T. (2020, June 22). Bootstrapping Statistics. What it is and why it’s used. Retrieved from https://towardsdatascience.com/bootstrapping-statistics-what-it-is-and-why-its-used-e2fa29577307

Pennsylvania State University. (2018). Bootstrapping. Retrieved from https://online.stat.psu.edu/stat555/node/119/

Raspe, R. E. (1902). The surprising adventures of Baron Munchausen. Thomas Y. Crowell & Company.

Sturgis, P., Kuha, J., Baker, N., Callegaro, M., Fisher, S., Green, J., … & Smith, P. (2018). An assessment of the causes of the errors in the 2015 UK general election opinion polls. Journal of the Royal Statistical Society. Series A: Statistics in Society, 181(3), 757–781.

Watson, S. (2020). Bootstrapping — Statistics. Retrieved from https://mathigon.org/course/intro-statistics/bootstrapping