First, we import some modules:
```python
%matplotlib inline
import numpy as np
import pandas as pd
from scipy.stats import genpareto, binom
```
Next, we write a function to simulate getting the maximum value out of n samples from a given distribution:
```python
def max_out_of_n_from_dist(dist, out_of_n=int(1e9)):
    manageable_n = 100000
    if out_of_n <= manageable_n:
        return dist.rvs(int(out_of_n)).max()
    else:
        # approximate the max by sampling uniformly from
        # the top (manageable_n / out_of_n) percentiles
        top_percentiles = \
            np.random.rand(manageable_n) * manageable_n / out_of_n
        return min(
            dist.isf(top_percentiles.min()),
            np.finfo(float).max / 100)  # prevent inf
```
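As a quick sanity check of the percentile shortcut (a side sketch, not part of the main pipeline): for the exponential distribution, the expected maximum of n samples is roughly ln(n) + 0.577 (the Euler-Mascheroni constant), so repeated runs of the shortcut should average out near that value.

```python
import numpy as np
from scipy.stats import expon

np.random.seed(0)
n = int(1e6)   # pretend-large sample size
m = 100000     # manageable_n, as in the function above
maxima = []
for _ in range(200):
    # same logic as the else-branch of max_out_of_n_from_dist
    top_percentiles = np.random.rand(m) * m / n
    maxima.append(expon().isf(top_percentiles.min()))
# theory: E[max of n Exp(1) draws] is about ln(n) + 0.5772
print(np.mean(maxima), np.log(n) + 0.5772)
```

The two printed numbers should agree to within a few percent, which is as good as the direct brute-force simulation would get.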
Next, we consider generalized Pareto distributions with shape parameters from 0.01 to 2.0, in increments of 0.01. That is, we consider shape parameters of 0.01, 0.02, 0.03, ..., 2.0.
We then simulate getting the maximum value out of 1e9 samples drawn from each of these distributions, and rescale the distribution so that this maximum value sits at x = 1. Then we calculate the probability density at x = 6. This is the likelihood of drawing a sample at 6 times the maximum value - that is to say, the likelihood of "naturally" generating a Jesus-level resurrection report from this distribution. Lastly, we calculate how many "outliers" would exist in this distribution.
We repeat this 10,000 times for each of the 200 shape parameters between 0.01 and 2.0, and put it all in a table. The result is a table with 2,000,000 rows, whose columns are the shape parameter, the likelihood of drawing a sample 6 times greater than the maximum, and the number of "outliers".
The following code gives us this results table.
```python
sample_size = int(1e9)
genpareto_shapes = np.linspace(0.01, 2.0, 200)

shape_params = []
pdf_6x_max = []
n_outliers = []
for shape_param in genpareto_shapes:
    dist = genpareto(shape_param, scale=1, loc=0)
    for i in range(10000):
        shape_params.append(shape_param)
        max_val = max_out_of_n_from_dist(dist, sample_size)
        scaled_dist = genpareto(
            shape_param, scale=1 / max_val, loc=0)
        pdf_6x_max.append(scaled_dist.pdf(6))
        # use binom instead of actually drawing for outliers
        p_outlier = (
            scaled_dist.sf(0.5) - scaled_dist.sf(1)
        ) / scaled_dist.cdf(1)
        n_outlier = \
            binom(sample_size - 1, p_outlier).rvs(1)[0] + 1
        n_outliers.append(n_outlier)

genpareto_results_df = pd.DataFrame({
    "shape_params": shape_params,
    "pdf_6x_max": pdf_6x_max,
    "n_outliers": n_outliers,
})

# save to .csv, as generating this takes a while
genpareto_results_df.to_csv(
    "genpareto_results_df.csv", encoding="utf-8")
```
Let's load up the results and see the first few rows:
```python
genpareto_results_df = pd.read_csv(
    "genpareto_results_df.csv", encoding="utf-8"
).drop(columns="Unnamed: 0")
print(genpareto_results_df.shape)
genpareto_results_df.head()
```
(2000000, 3)
| | shape_params | pdf_6x_max | n_outliers |
|---|---|---|---|
| 0 | 0.01 | 1.400046e-36 | 26287 |
| 1 | 0.01 | 1.650500e-38 | 11942 |
| 2 | 0.01 | 4.315781e-36 | 31962 |
| 3 | 0.01 | 2.588130e-37 | 19289 |
| 4 | 0.01 | 3.339785e-39 | 8680 |
So, let's say that in reality, there are only 15 "outliers". Now, this does not narrow down the possibilities to a single shape parameter: just due to chance, you can get 15 "outliers" from a shape parameter of 0.3, and also from a shape parameter of 0.6. However, the 15 "outliers" do narrow things down enough to give us a distribution over shape parameters. This is an improvement over our prior knowledge, which was that we had no idea what the shape parameter might be.
How could we get this posterior distribution of the shape parameters? All we need to do is take the subset of the results table where the number of outliers is exactly 15, and look at the shape parameters. Because every shape parameter was simulated equally often - that is, we started from a uniform prior - this conditioning is exactly what Bayes' theorem prescribes, and the distribution of shape parameters in that subset IS the posterior distribution for the shape parameter.
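This "simulate, then condition on the observed statistic" trick is a simple form of approximate Bayesian computation. Here is a toy sketch (with made-up numbers, unrelated to the resurrection data) where the answer is known exactly: inferring a coin's bias from 14 heads in 20 flips, where the posterior should be Beta(15, 7).

```python
import numpy as np

np.random.seed(0)
n_sims, n_flips, observed_heads = 2_000_000, 20, 14
p = np.random.rand(n_sims)                # uniform prior over the bias
heads = np.random.binomial(n_flips, p)    # one simulated dataset per draw
posterior_p = p[heads == observed_heads]  # keep rows matching the data
# exact posterior is Beta(15, 7), whose mean is 15/22, about 0.682
print(posterior_p.mean())
```

The mean of the conditioned draws lands right on the known Beta(15, 7) posterior mean, which is the same guarantee we are relying on when we condition the results table on the number of outliers.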
Let's write a function to do this and see what it looks like:
```python
def shape_hist(df, n_outliers):
    srs = df[df["n_outliers"] == n_outliers]["shape_params"]
    # one bin per 0.01 step in the shape parameter grid
    n_bins = (srs.max() - srs.min()) / 0.01 + 1
    srs.hist(bins=int(n_bins)).set_xlabel("shape parameters")

shape_hist(genpareto_results_df, 15)
```
So, our "skeptic's distribution" in this case is a linear combination of generalized Pareto distributions, with the above distribution of shape parameters, each one scaled individually depending on where its maximum value was found.
Continuing on with the same reasoning, the "skeptic's likelihood" of generating a Jesus-level resurrection report would simply be the density of that distribution at x = 6. Since we tracked this value in the above table, again all we have to do is take the subset of the table where the number of outliers is exactly 15, and average over this likelihood value.
Furthermore, the Bayes factor is simply the ratio of the likelihood from the "Christian's distribution" - which is just 0.5/12, for values up to x = 12 - to this "skeptic's likelihood".
As before, let's write a function to calculate these values, to see what they are with 15 outliers:
```python
def likelihood_and_bf(df, n_outliers):
    skeptical_likelihood = df[
        df["n_outliers"] == n_outliers]["pdf_6x_max"].mean()
    christian_likelihood = 0.5 / 12
    bf = christian_likelihood / skeptical_likelihood
    print("skeptic's likelihood: {:.2e}"
          .format(skeptical_likelihood))
    print("Bayes factor: {:.2e}".format(bf))

likelihood_and_bf(genpareto_results_df, 15)
```
skeptic's likelihood: 8.51e-12
Bayes factor: 4.90e+09
Note that the above distribution of shape parameters decays to nearly nothing by the time we reach values around 1.0. So we don't have to worry about our artificial upper limit of 2.0. There is essentially no chance for the shape parameter to be that high, given 15 outliers.
Of course, there's almost certainly more than 15 outliers in world history. This further forces the shape parameters to smaller values. In addition, this distribution over shape parameters turns out to decay quite rapidly as it goes off to the right (demonstrating this is left as an exercise for the reader). All this combines to show that the upper limit of 2.0 will have no bearing on our conclusions.
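For the curious, that check amounts to asking what fraction of the conditioned shape parameters exceed a cutoff. Here's a sketch using a tiny synthetic stand-in frame (made-up values, just to show the shape of the computation - the real call would use genpareto_results_df):

```python
import pandas as pd

# synthetic stand-in for genpareto_results_df; values are made up
df = pd.DataFrame({
    "shape_params": [0.1, 0.3, 0.3, 0.6, 1.2, 1.5],
    "n_outliers":   [15,  15,  15,  15,  400, 700],
})
posterior = df[df["n_outliers"] == 15]["shape_params"]
# posterior mass above the cutoff of 1.0, given 15 outliers
print((posterior > 1.0).mean())
```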
So, what if there are more outliers, like 60? That would make for the following posterior distribution over shape parameters:
```python
shape_hist(genpareto_results_df, 60)
```
And the Bayes factor for a Jesus-level resurrection report would be:
```python
likelihood_and_bf(genpareto_results_df, 60)
```
skeptic's likelihood: 3.86e-13
Bayes factor: 1.08e+11
What if there are 250 outliers? Then the distribution over shape parameters looks like this:
```python
shape_hist(genpareto_results_df, 250)
```
And the Bayes factor for a Jesus-level resurrection report would be:
```python
likelihood_and_bf(genpareto_results_df, 250)
```
skeptic's likelihood: 1.44e-14
Bayes factor: 2.90e+12
We can clearly see that the number of "outliers" controls the "skeptic's likelihood" of generating a Jesus-level resurrection report. Here is how the two quantities are related:
```python
outliers_pdf_6x = genpareto_results_df[
    genpareto_results_df["n_outliers"] < 100
].groupby("n_outliers")["pdf_6x_max"].mean()
outliers_pdf_6x.name = "pdf_6x_max"
outliers_pdf_6x.reset_index().plot(
    kind="scatter", x="n_outliers", y="pdf_6x_max",
    xlim=(0, 100), ylim=(0, 6e-11),
    title="skeptic's likelihood vs outliers"
)
```
And this is how that translates into Bayes factors for Jesus's resurrection:
```python
outliers_log10_bf = np.log10((0.5 / 12) / outliers_pdf_6x)
outliers_log10_bf.name = "log10_bf"
outliers_log10_bf.reset_index().plot(
    kind="scatter", x="n_outliers", y="log10_bf",
    xlim=(0, 100), ylim=(8, 14),
    title="Bayes factors vs outliers"
)
```
The abnormal values at n_outliers < 5 are due to the "shape parameter exceeding 2.0" problem mentioned earlier: with so few outliers, shape parameters above our artificial cap would still be plausible. It quickly becomes a non-issue as the number of outliers increases.
Otherwise, looking at the graphs above, we see that the skeptic's likelihood of generating a Jesus-level resurrection report drops as the number of "outliers" increases, thereby increasing the Bayes factor. Having MORE non-Christian resurrection reports (that is, more "outliers") makes the skeptic LESS able to explain Jesus's resurrection report, and therefore makes the resurrection MORE likely - exactly as we said before.