Bootstrap statistics — how to work around limitations of simple statistical tests (2024)

Getting Started

Published in

Towards Data Science

7 min read

Jan 14, 2021

You have a data sample. From it, you want to calculate a confidence interval for the population mean value. What’s the first thing you think about? It’s usually a t-test.

But the t-test has several requirements, one of which is that the sampling distribution of the mean is nearly normal (either the population is normal, or the sample is reasonably large). In practice, that’s not always true, and so the t-test may not always deliver optimal results.

To work around that kind of limitation, use the bootstrap method. It has only one important requirement: that the sample approximates the population well enough. Normality is not required.

Let’s give the t-test a try, and then compare it with the bootstrap method.

Let’s load first all libraries we need:

from sklearn.utils import resample
import pandas as pd
from matplotlib import pyplot as plt
import pingouin as pg
import numpy as np
import seaborn as snsnp.random.seed(42)
plt.rcParams["figure.figsize"] = (16, 9)

Note: we’re using the Pingouin library for several statistical tests here. It’s great because many important techniques are kept under the same roof, and have uniform interfaces and syntax. Give it a try:

https://pingouin-stats.org/

Now back to the main topic — this is the sample:

spl = [117, 203, 243, 197, 217, 224, 279, 301, 317, 307, 324, 357, 364, 382, 413, 427, 490, 742, 873, 1361]
plt.hist(spl, bins=20);

Bootstrap statistics — how to work around limitations of simple statistical tests (3)

You can sort-of pretend that looks normal, but it has several large outliers, which the t-test does not tolerate very well. Still, let’s do the t-test for the 95% confidence interval of the population mean:

pgtt = pg.ttest(spl, 0)
print(pgtt['CI95%'][0])[272.34 541.46]

That’s a very wide interval, reflecting the spread of values in the sample. Can we do better? The answer is yes, using the bootstrap method.

If the sample was larger, or if you could extract multiple samples from the same population, the results would be better. But that’s not possible here — this sample is all we have.

However, if we assume the sample approximates the population fairly well, we can pretend to generate multiple pseudo-samples from the population: what we really do is take this sample and resample it — generate multiple samples, of the same size, using the same values. And then we work with the set of generated samples.

The number of pseudo-samples you make needs to be large, in the thousands usually.

Since the samples we generate from the original sample are the same size as the original, some values will be repeated in each pseudo-sample. That’s fine.

Let’s generate ten thousand chunks of resampled data from the original sample:

B = 10000
rs = np.zeros((B, len(spl)), dtype=np.int)
for i in range(B):
 rs[i,] = resample(spl, replace = True)
print(rs)[[ 279 1361 413 ... 357 357 490]
 [ 307 427 413 ... 317 203 1361]
 [ 413 279 357 ... 357 203 307]
 ...
 [ 217 243 224 ... 279 490 1361]
 [ 301 324 224 ... 224 317 317]
 [ 324 301 364 ... 873 364 203]]

Each line in the array is a resampled chunk and is the same size as the original sample. There are 10k lines in total.

Now let’s build the bootstrap distribution: for each line, calculate the mean value:

bd = np.mean(rs, axis=1)
print(bd)[376.35 515.15 342.75 ... 507.8 426.15 377.05]

The bootstrap distribution contains the means for each resampled chunk of data. It’s easy to get the 95% confidence interval for the mean from here: simply sort the bootstrap distribution array (the means), cut off the top and bottom 2.5%, and read the remaining extreme values:

bootci = np.percentile(bd, (2.5, 97.5))
print(bootci)[300.0975 541.705]

And that’s the bootstrap 95% confidence interval for the mean. How does it compare to the t-test confidence interval?

y, x, _ = plt.hist(bd, bins=100)
ymax = y.max()
plt.vlines(pgtt['CI95%'][0], 0, ymax, colors = 'red')
plt.vlines(bootci, 0, ymax, colors = 'chartreuse')
plt.vlines(np.mean(spl), 0, ymax, colors = 'black')
plt.show();

Bootstrap statistics — how to work around limitations of simple statistical tests (4)

The bootstrap CI (in green) is somewhat more narrow than the t-test CI (in red).

You can use bootstrap to generate a CI for the median value as well: simply build the bootstrap distribution using np.median() instead of np.mean():

bd = np.median(rs, axis = 1)
bootci = np.percentile(bd, (5, 95))
print(bootci)[279. 382.]

Let’s take a sample from a normal population of known variance (var=100):

sp = np.random.normal(loc=20, scale=10, size=50)
sns.violinplot(x=sp);

Bootstrap statistics — how to work around limitations of simple statistical tests (5)

The variance of our sample is indeed close to 100:

print(np.var(sp))112.88909322502681

Let’s use bootstrap to extract a confidence interval for the population variance from our existing sample. Same idea: resample the data many times, calculate the variance for each resampled chunk, get a percentile interval for the list of variances.

rs = np.zeros((B, len(sp)), dtype=np.float)
for i in range(B):
 rs[i,] = resample(sp, replace=True)
bd = np.var(rs, axis=1)
bootci = np.percentile(bd, (2.5, 97.5))
print(bootci)[74.91889462 151.31601521]

Not bad; the interval is narrow enough, and it includes the true value for the population variance (100). Here’s the histogram of variances from the bootstrap distribution:

y, x, _ = plt.hist(bd, bins=100)
ymax = y.max()
plt.vlines(bootci, 0, ymax, colors = 'chartreuse')
plt.vlines(np.var(sp), 0, ymax, colors = 'black')
plt.show();

Bootstrap statistics — how to work around limitations of simple statistical tests (6)

Here are the weights of two samples of plums, grouped by color:

https://raw.githubusercontent.com/FlorinAndrei/misc/master/plumdata.csv

plumdata = pd.read_csv('plumdata.csv', index_col=0)
sns.violinplot(x='color', y='weight', data=plumdata);

Bootstrap statistics — how to work around limitations of simple statistical tests (7)

Are the two colors, on average, different in terms of weight? In other words, what’s the confidence interval for the difference of mean weights? Let’s try the classic t-test first:

plum_red = plumdata['weight'][plumdata['color'] == 'red'].values
plum_yel = plumdata['weight'][plumdata['color'] == 'yellow'].values
plumtt = pg.ttest(plum_red, plum_yel)
print(plumtt['CI95%'][0])[8.67 27.2]

The t-test says — yes, with 95% confidence, the mean weight of red plums is different from the mean weight of yellow plums (the CI does not include 0).

For bootstrap, we will literally resample both samples 10k times, calculate the mean for each resampled chunk, do the difference of means 10k times, and look at the 2.5 / 97.5 percentiles for the difference of means distribution.

This is the resampled data:

plzip = np.array(list(zip(plum_red, plum_yel)), dtype=np.int)
rs = np.zeros((B, plzip.shape[0], plzip.shape[1]), dtype=np.int)
for i in range(B):
 rs[i,] = resample(plzip, replace = True)
print(rs.shape)(10000, 15, 2)

10000 is for the number of resamples, 15 is the length of each resample, and there are 2 samples we compare in each case.

Let’s calculate the mean for each resampled chunk:

bd_init = np.mean(rs, axis=1)
print(bd_init.shape)(10000, 2)

And the bootstrap distribution is the difference of means (red mean minus yellow mean) for each one of the 10000 resampled cases:

bd = bd_init[:, 0] - bd_init[:, 1]
print(bd.shape)(10000,)print(bd)[12.33333333 21. 10.33333333 ... 21.46666667 11.66666667
 17.06666667]

And now we can get the CI for the difference of means from the bootstrap distribution:

bootci = np.percentile(bd, (2.5, 97.5))
print(bootci)[9.33333333 26.06666667]

Again bootstrap provides a tighter interval than the t-test:

y, x, _ = plt.hist(bd, bins=100)
ymax = y.max()
plt.vlines(plumtt['CI95%'][0], 0, ymax, colors = 'red')
plt.vlines(bootci, 0, ymax, colors = 'chartreuse')
plt.show();

Bootstrap statistics — how to work around limitations of simple statistical tests (8)

Here’s paired data — the strength of several brands of ropes in either wet or dry conditions:

https://raw.githubusercontent.com/FlorinAndrei/misc/master/strength.csv

strength = pd.read_csv('strength.csv', index_col=0)
sns.violinplot(data=strength);

Bootstrap statistics — how to work around limitations of simple statistical tests (9)

Is this enough data to conclude that the wet/dry conditions influence the strength of the ropes? Let’s calculate a 95% confidence interval for the mean difference of strength, wet vs dry. First, the t-test:

s_wet = strength['wet']
s_dry = strength['dry']
s_diff = s_wet.values - s_dry.values
strengthtt = pg.ttest(s_wet, s_dry, paired=True)
print(strengthtt['CI95%'][0])[0.4 8.77]

The t-test indicates, with 95% confidence, that indeed the wet/dry conditions make a statistically significant difference — but just barely. One end of the CI is close to 0.

Now let’s do bootstrap for the same CI. We can simply take s_diff (the differences in strength for each rope), and go back to the one-sample bootstrap procedure:

rs = np.zeros((B, len(s_diff)), dtype=np.int)
for i in range(B):
 rs[i,] = resample(s_diff, replace = True)
bd = np.mean(rs, axis=1)
bootci = np.percentile(bd, (2.5, 97.5))
print(bootci)[1.33333333 8.5]

Both the t-test and bootstrap agree that there’s a statistically significant difference (95% confidence) that the strengths are different when conditions are different. Bootstrap is a little more confident (interval is more narrow and is further away from 0).

y, x, _ = plt.hist(bd, bins=100)
ymax = y.max()
plt.vlines(strengthtt['CI95%'][0], 0, ymax, colors = 'red')
plt.vlines(bootci, 0, ymax, colors = 'chartreuse')
plt.show();

Bootstrap statistics — how to work around limitations of simple statistical tests (10)

Notebook with code:

https://github.com/FlorinAndrei/misc/blob/master/bootstrap_plain.ipynb

This is just the basic bootstrap technique. It does not perform bias corrections, etc.

There is no cure for small sample sizes. Bootstrap is powerful, but it’s not magic — it can only work with the information available in the original sample.

If the samples are not representative of the whole population, then bootstrap will not be very accurate.

Bootstrap statistics — how to work around limitations of simple statistical tests (2024)

FAQs

What are the limitations of bootstrap statistics? ›

Limitations of Bootstrapping Statistics

Time-consuming: Thousands of simulated samples are needed for bootstrapping to be accurate.
Computationally taxing: Because bootstrapping requires thousands of samples and takes longer to complete, it also demands higher levels of computational power.

More items...

Read On ›

How do you do bootstrapping statistics? ›

The simplest bootstrap method involves taking the original data set of heights, and, using a computer, sampling from it to form a new sample (called a 'resample' or bootstrap sample) that is also of size N. The bootstrap sample is taken from the original by using sampling with replacement (e.g. we might 'resample' 5 ...

Discover More Details ›

When should you not use bootstrapping? ›

Bootstrap is powerful, but it's not magic — it can only work with the information available in the original sample. If the samples are not representative of the whole population, then bootstrap will not be very accurate.

What is the rule of thumb for bootstrap? ›

A rule of thumb for the number of resamples needed for a reasonable bootstrap distribution is 10,000, however for the use of this class, use 500. Too few bootstrap samples can create problems for getting a good bootstrap sampling distribution.

See Details ›

Why is bootstrap not used anymore? ›

Bootstrap now faces a reasonable competition from similar UI kit based CSS frameworks like Foundation and Bulma. Apart from these, It also faces a severe blow from TailwindCSS. Bootstrap is always criticized for its inflexibility.

Find Out More ›

What is the problem with bootstrapping? ›

According to one strategy, bootstrapping is flawed because it can only deliver the result that the target source is reliable, regardless of whether it is reliable or not.

Tell Me More ›

How many samples do you need for bootstrapping? ›

Typically, if you are going to use percentile intervals, you would need thousands of bootstrap samples to get good tail probabilities. For some methods, like the student-t method, 200 might be enough for a well characterized statistic.

Show Me More ›

What is bootstrapping in simple terms? ›

Bootstrapping is the process of founding and running a company using only personal finances or operating revenue. It is a form of financing that allows the entrepreneur to maintain more control even though it can increase financial strain.

Explore More ›

What is bootstrapping sampling example? ›

To generate one bootstrap dataset imagine grabbing a ball out of the jar at random, noting it's diameter and then putting it back. This process of selecting balls is repeated until you have noted down a set of balls that is the same size as the original sample.

Why shouldn't you use bootstrap? ›

Bootstrap's framework has excess, non-semantic classes and it completely litters markup. If you want to create layouts that best fit the business, you don't want to go with something like Bootstrap that is strictly going to limit the capabilities on the front end.

Show Me More ›

What are the pros and cons of bootstrapping? ›

The Pros and Cons of Bootstrapping

PRO: Greater Focus. Bootstrapping can also take out another pressure point of many startups which is having to impress investors to raise funding. ...
CON: Time. ...
PRO: Easier Pivoting. ...
CON: Lack of Investor support. ...
PRO: You don't dilute your ownership. ...
CON: Personal risk.

Read The Full Story ›

What are the advantages of bootstrapping statistics? ›

Bootstrapping is a useful data resampling technique, especially when the sample size is small, the population distribution is unknown, or the statistic of interest is complex or non-standard. It has several advantages, such as being easy to implement and understand without requiring complex formulas or calculations.

See Details ›

How does bootstrap work in statistics? ›

Bootstrapping is a resampling procedure that uses data from one sample to generate a sampling distribution by repeatedly taking random samples from the known sample, with replacement.

Get More Info Here ›

What is the limitation of bootstrap? ›

One notable drawback is the framework's constraints on design flexibility. For instance, designers may find it challenging to create custom layouts or adjust element proportions beyond the predefined options offered by Bootstrap's grid system.

What is the difference between bootstrapping and sampling? ›

In general, bootstrap takes sample with replacement from the data of size the same as the size of the data. One obtains the usual sample by sampling from the population. A bootstrapping sample is different because one samples with replacement from the sample itself.

What are the limitations of bootstrap protocol? ›

While Bootstrap offers numerous advantages, it's important to also consider its limitations. Many websites built primarily with Bootstrap may encounter issues that can affect user experience and hinder development efficiency. One notable drawback is the framework's constraints on design flexibility.

View Details ›

Which of the following is a disadvantage of bootstrapping? ›

Disadvantages of bootstrapping

Increased chance of business failure: For early-stage companies, bootstrapping may not provide sufficient resources to build traction and survive beyond the startup phase. Increased risks assumed by owners: Initial funding usually comes from owners' personal savings.

What are the disadvantages of bootstrap circuit? ›

Drawback of Bootstrap Circuitry

The biggest difficulty with this circuit is that the negative voltage present at the source of the switching device during turn−off causes load current to suddenly flow in the low−side freewheeling diode, as shown in Figure 3.

Learn More ›