
Bootstrapping Macro: Part 1 - Use Cases

The name bootstrapping refers to the idea of picking oneself up by one's own bootstraps - fitting, because the technique uses the sample data to gain more knowledge about the sample data. This macro implements the simplest form of bootstrapping, a Monte Carlo procedure for sampling with replacement:

Suppose you want to know the average price of a tent.
1) Collect some small sample data.
- For example, visit the websites of 16 retailers:
170, 300, 240, 202, 24, 230, 49, 109, 128, 239, 199, 370, 280, 154, 109, 259

2) Re-sample, with replacement, from the sample data.
- Sample 1: 24, 259, 300, 280, 24, 199, 154, 259, 259, 109, 109, 128, 370, 109, 280, 154
- Sample 2, Sample 3, ...

3) Calculate a test statistic of interest.
- For example, the sample mean:
Original Data: 191.4, Sample 1: 188.6, Sample 2: 226.9, Sample 3: 187.2, Sample 4: 243.2, Sample 5: 210.1, Sample 6: 196.8

4) Compare the test statistic of the original data to the test statistics of the re-sampled data.

Figure 1: Example distribution of sample means. Red line represents the original sample mean.

The major benefit of the bootstrap technique is that in Step 4 we do not make any assumptions about the distribution of the data. Another benefit is the flexibility in the choice of the test statistic (we can compute p-values for quantities whose distribution would be hard to derive).
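
To make the four steps concrete, here is a minimal sketch of the procedure in plain R (an illustration, not the macro itself), using the 16 tent prices above as the original sample and an arbitrary m = 1000 bootstrap samples:

# Monte Carlo bootstrap of the sample mean, using the tent prices above
prices <- c(170, 300, 240, 202, 24, 230, 49, 109,
            128, 239, 199, 370, 280, 154, 109, 259)

set.seed(1)                                   # for reproducibility
m <- 1000                                     # number of bootstrap samples

# Steps 2 and 3: re-sample with replacement, take the mean of each re-sample
boot_means <- replicate(m, mean(sample(prices, replace = TRUE)))

# Step 4: compare the original sample mean to the bootstrapped means
mean(prices)                                  # 191.4 - the red line in Figure 1
hist(boot_means)                              # distribution of bootstrapped means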

Additional Resources: https://en.wikipedia.org/wiki/Bootstrapping

Bootstrapping Macro: Part 2 - Implementation Details


The bootstrapping macro has two options: re-sampling and calculation can be performed either with Alteryx's built-in tools or with the R tool. Why take the time to make two implementations? I did this as part of a challenge - when creating a new predictive tool I am quick to jump into R. However, the overhead of passing data between the Alteryx engine and the R engine can be large. I think it is worthwhile, when possible, to try and implement a technique natively.

As a further challenge, alternatives for bootstrapping big data sets exist. Check out the work done by researchers at Berkeley - it would be interesting to supplement the bootstrapping macro with such an implementation.

Let's take a deeper look at some parts of the macro.


1) Sampling with replacement in Alteryx:

Let n be the sample size of the original data, and let m be the number of bootstrap samples. We start by creating n*m random numbers which take on integer values 1, 2, ..., n. These values represent our index. To create a random sample we perform a join* based on the record id of the original sample and our random index. For example, with n = 4 and m = 2, each Random Index value is matched to the Record Id of the Original Sample, pulling the corresponding value into a Sample Value column (see the R sketch at the end of this section).

The last step is to group our newly created data using the tile tool
(Sample 1 - Blue; Sample 2 - Green) and make it look pretty with
the cross tab tool.
*Alteryx automatically sorts the data by sample value during the
join, so to get back our random samples we un-sort by sorting
based on the original record id.
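
For readers who think in code rather than in tools, here is a rough R sketch of the same index-join idea (an illustration only; the macro itself does this with the Alteryx tools described above). The four example values are the first four tent prices from Part 1.

# Re-sampling with replacement via a random-index join, with n = 4 and m = 2
original <- data.frame(RecordId = 1:4,
                       OriginalSample = c(170, 300, 240, 202))
n <- nrow(original)
m <- 2

# n*m random integers in 1..n: the random index
index <- data.frame(SampleNo    = rep(1:m, each = n),
                    RandomIndex = sample.int(n, size = n * m, replace = TRUE))

# Join the random index back to the original data on the record id
joined <- merge(index, original, by.x = "RandomIndex", by.y = "RecordId")

# merge() returns rows sorted by the join key, so re-order ("un-sort") by
# sample number, then split the rows into the individual bootstrap samples
joined  <- joined[order(joined$SampleNo), ]
samples <- split(joined$OriginalSample, joined$SampleNo)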
2) Calculating complicated test statistics with the R Tool:
An unintuitive feature of the macro is the option to specify a test statistic using the R tool. When this option is selected, the data will be passed into R and run through a short block of R code.

At its heart, this code resamples the original data, evaluates a command to generate an object based on the re-sampled data, and then takes part of that object as the test statistic. As a simple example, if we wanted to use the R Tool to calculate sample means we would specify mean as the R command to run over samples and [1] as the attribute to take from the result.
The flexibility to bootstrap a model object was included so that the tool could be extended to bootstrapping the distributions of linear model coefficients, time series auto-correlation functions, etc. For example, the default configuration (as pictured) will treat each sample as a time series and will find a distribution for the autocorrelation of the first lag (the correlation between x(t) and x(t-1) across the time series).
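
As a sketch of that resample / evaluate / extract pattern (an assumption about the shape of the logic, not the macro's actual R code), both the sample-mean example and the default lag-1 autocorrelation configuration could be written as:

# Resample, build an object from each re-sample, keep one piece of it
prices <- c(170, 300, 240, 202, 24, 230, 49, 109,
            128, 239, 199, 370, 280, 154, 109, 259)   # tent prices from Part 1

bootstrap_stat <- function(x, m, command, extract) {
  replicate(m, extract(command(sample(x, replace = TRUE))))
}

# Sample means: the command is mean(), the attribute is simply element [1]
boot_means <- bootstrap_stat(prices, 1000, mean, function(obj) obj[1])

# Default configuration: treat each re-sample as a series and keep the lag-1
# autocorrelation (acf element [2]; element [1] is lag 0)
boot_acf1 <- bootstrap_stat(prices, 1000,
                            function(s) acf(s, plot = FALSE),
                            function(obj) obj$acf[2])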

3) P-value: Why the flip-flop?

The bootstrap macro outputs a p-value. This p-value is for the null hypothesis that the observed test statistic is similar to the mean of the bootstrapped test statistics. I say similar because the macro actually tests one of two hypotheses. If the observed test statistic is larger than the mean, it reports a p-value for the hypothesis that the test statistic is less-than-or-equal-to the bootstrapped data. If the observed test statistic is smaller than the mean, it reports a p-value for the hypothesis that the test statistic is greater-than-or-equal-to the bootstrapped data.

This is implemented in a formula tool.
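
As a rough sketch of that flip-flop (an illustration of the description above, not the macro's actual formula tool expression):

# The direction of the comparison depends on which side of the bootstrapped
# mean the observed statistic falls
flip_flop_p <- function(observed, boot_stats) {
  if (observed > mean(boot_stats)) {
    mean(boot_stats >= observed)   # p-value for "statistic <= bootstrapped data"
  } else {
    mean(boot_stats <= observed)   # p-value for "statistic >= bootstrapped data"
  }
}

# e.g. flip_flop_p(mean(prices), boot_means) with the objects from the sketches above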

Bootstrapping Macro: Part 3 - Learning the CLT


The central limit theorem is foundational to statistics. The basic idea is that the distribution of sample means converges to a normal distribution, regardless of the distribution of the underlying data (as long as it has finite variance).

The bootstrapping macro makes it easy to quickly calculate a whole bunch of sample means, so I thought it would be fun to try and create an app to demonstrate the CLT.
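
One quick way to see the idea without the app is to bootstrap the tent prices from Part 1 and look at the histogram of the resulting means (a plain-R sketch, not the app itself):

# The bootstrapped sample means look roughly normal even though the 16
# underlying prices do not
prices <- c(170, 300, 240, 202, 24, 230, 49, 109,
            128, 239, 199, 370, 280, 154, 109, 259)
set.seed(7)
boot_means <- replicate(5000, mean(sample(prices, replace = TRUE)))

par(mfrow = c(1, 2))
hist(prices, main = "Original data")          # small, irregular sample
hist(boot_means, main = "Bootstrapped means") # approximately bell-shaped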
Have fun!
