December 8, 2014

In my last post, I introduced Bayes' theorem:

P(hypothesis|observation) = P(observation|hypothesis)/P(observation) * P(hypothesis)

Now, this is a powerful equation that tells us how to use observed evidence to update our beliefs about a hypothesis. But as I mentioned, it is difficult to use for two reasons. First, the probability prior to the observation - P(hypothesis) - is famously difficult to compute in a clear, objective manner, and it changes based on the background information that each person has. For these reasons it's often said to be a personal, subjective probability, reflecting a particular person's degree of belief based on his or her unique set of background information.

And second, things get even worse for P(observation): this is the probability of making the observation, averaged over the complete set of competing hypotheses. Because this is an average over the complete set, we have to know the P(hypothesis) value for every competing hypothesis. But as we just said, computing even one of these values is difficult. As if that weren't hard enough, in real-life situations we may not even be able to enumerate the complete set of competing hypotheses. And even if we somehow got through all these difficulties, we would still have to calculate P(observation|hypothesis) for each of these hypotheses, which is itself no trivial task, and then average the results across all the hypotheses. This step often requires more computation than the rest of Bayes' theorem put together, even for well-defined problems with fixed values for all other probabilities.

For these reasons I often like to use Bayes' theorem in odds form: simply write down the equations for two different hypotheses and divide one by the other, and you get:

P(hypothesis A|observation)/P(hypothesis B|observation) =

P(hypothesis A)/P(hypothesis B) * P(observation|hypothesis A)/P(observation|hypothesis B)

This can be summarized as "posterior odds = prior odds * likelihood ratio (of the observation being made from each hypothesis)", where:

P(hypothesis A|observation)/P(hypothesis B|observation) = posterior odds,

P(hypothesis A)/P(hypothesis B) = prior odds,

P(observation|hypothesis A)/P(observation|hypothesis B) = likelihood ratio.
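
Written out, the division looks like this. The symbols H_A, H_B, and O are shorthand introduced here for hypothesis A, hypothesis B, and the observation; they are not notation from the previous post. The first two lines are Bayes' theorem for each hypothesis, and the third is their quotient, in which P(O) appears in both numerator and denominator and drops out:

```latex
\begin{align*}
P(H_A \mid O) &= \frac{P(O \mid H_A)}{P(O)} \, P(H_A) \\
P(H_B \mid O) &= \frac{P(O \mid H_B)}{P(O)} \, P(H_B) \\
\frac{P(H_A \mid O)}{P(H_B \mid O)} &= \frac{P(H_A)}{P(H_B)} \cdot \frac{P(O \mid H_A)}{P(O \mid H_B)}
\end{align*}
```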

Let's go through an example: say you're investigating a murder. You think that Alice is twice as likely to be guilty compared to Bob - this is your prior odds. You then observe fingerprints on the murder weapon that are 3000 times more likely to have come from Alice than from Bob - this is the likelihood ratio. You multiply these ratios to calculate your new opinion, the posterior odds: Alice is now 6000 times more likely to be guilty than Bob. Posterior odds is prior odds times likelihood ratio.
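
For readers who like to see the arithmetic as code, here is a minimal Python sketch of that update. The function name `posterior_odds` is just an illustrative choice, and exact fractions are used only to keep the numbers tidy.

```python
from fractions import Fraction

def posterior_odds(prior_odds, likelihood_ratio):
    """Odds form of Bayes' theorem: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio

# Alice vs. Bob, using the numbers from the example above.
prior = Fraction(2, 1)            # Alice is twice as likely to be guilty as Bob
fingerprints = Fraction(3000, 1)  # fingerprints are 3000 times more likely to be Alice's
posterior = posterior_odds(prior, fingerprints)
print(f"Alice:Bob = {posterior}:1")  # Alice:Bob = 6000:1
```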

This is still Bayes' theorem, just in a different algebraic form. The intuition captured by this equation is the same: an observation counts as evidence towards the hypothesis that better predicts, anticipates, explains, or agrees with that observation. But notice that in this form, P(observation) - which was difficult or impossible to calculate - has been cancelled out. Also, P(hypothesis) - another troublesome number - only appears in a ratio of two competing hypotheses, which I think is a more reasonable way to think of it: it's easier to say how much more likely one hypothesis is than another, instead of assigning absolute probabilities to both of them. In short, this form makes the math easier, and allows you to think of just two hypotheses at a time, rather than having to account for the complete set of competing hypotheses all at once. You don't have to worry about Carol and her fingerprints for the time being in the above murder investigation example.

Let's go through a couple more examples:

Say that your friend claims that he has a trick coin: he says it lands "heads" all the time, rather than the 50% of the time that you'd normally expect. You're somewhat skeptical, and based on his general trustworthiness and the previous similar claims he's made, you think the odds are only 1:4 that this is a 100% "heads" coin, versus it being a normal coin. This is your P(always heads)/P(normal), the prior odds.

When you express your skepticism, your friend says, "well then, let me just show you!" and flips the coin. It lands "heads". "See!" says your friend. "I told you it always lands heads!" Now, obviously a single flip doesn't prove anything. But it certainly is evidence - not very strong evidence, but some evidence. Since the coin will land "heads" 100% of the time if your friend is right, but only 50% of the time if it's a normal coin, their ratio - the likelihood ratio - is 100%:50%, or 2:1.

Now, according to the odds form of Bayes' theorem, posterior odds is prior odds times likelihood ratio. 1:4 * 2:1 = 1:2, so you should now believe that the odds are 1:2 that this is a trick coin like your friend claimed, versus it being a normal coin. You're still skeptical of the claim, but you're now less skeptical.

Noting your remaining skepticism, your friend then flips the coin again. "Ha, another heads!" he says as he calls out the result. Now, to calculate your new opinion, simply repeat the calculation above, with the previous answer - the old posterior odds of 1:2 - serving as the new prior odds. The likelihood ratio remains 2:1. Posterior odds is prior odds times likelihood ratio, so our new posterior odds is 1:2*2:1 = 1:1. You should now be completely uncertain as to whether this coin in fact is a trick coin. You say to your friend, "well, you may have something there".

"Okay, fine then." says your friend. "Let's flip this thing ten more times." And behold, it comes up "heads" all ten times. Your posterior odds get multiplied by 2:1 for each of the ten flips, and it's now 1:1 * (2:1)^10 = 1024:1. You should now believe that the chance of this being an "always heads" coin is 1024 times greater than it being a normal coin. If you're willing to consider "normal" and "always heads" as the complete set of competing hypotheses, this would give you over 99.9% certainty that your friend is right that this coin will always land heads.

"Wow, amazing." you tell your friend, as you're now pretty much convinced. "I've never actually seen one of these before", you say, as you idly grab the coin and flip it again, fully expecting it to land "heads" once more. But this time, it lands "tails".

What now? The likelihood ratio for the coin to land "tails" - P(tails|always heads)/P(tails|normal) - is 0%:50%, or 0:1. Our new posterior odds is 1024:1 * 0:1 = 0:1. There is now absolutely no chance that this coin is one that will land heads 100% of the time. But at the same time, it also seems unlikely that it's just a normal coin, given that it landed "heads" 12 times in a row just before this. A new possibility suggests itself: that this coin has something like a 90% chance of landing heads.

This illustrates one of the major advantages of the odds form of Bayes' theorem. Before this, you hadn't even considered that the chance for this coin to land "heads" was anything other than 50% or 100%. All of the other hypotheses - such as the coin landing "heads" 90% or 80% or 20% of the time - you had ignored. And yet, even without considering the complete set of competing hypotheses, you were still able to carry out valid calculations and make statistical inferences, reaching sound conclusions.
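
The whole sequence of updates so far can be replayed in a few lines of Python. This is only a sketch: the helper name `update` is made up for illustration, and the conversion from odds to a probability relies on the assumption, flagged in the story above, that "always heads" and "normal" are the only two hypotheses.

```python
from fractions import Fraction

def update(odds, likelihood_ratios):
    """Multiply in each likelihood ratio; every posterior becomes the next prior."""
    history = [odds]
    for ratio in likelihood_ratios:
        odds = odds * ratio
        history.append(odds)
    return history

prior = Fraction(1, 4)                # P(always heads) / P(normal)
heads = Fraction(1) / Fraction(1, 2)  # 100%:50% = 2:1
tails = Fraction(0) / Fraction(1, 2)  # 0%:50% = 0:1

# Twelve heads in a row, followed by the single tails.
history = update(prior, [heads] * 12 + [tails])
print(history[1], history[2], history[12], history[13])  # 1/2 1 1024 0

# Turning odds into a probability is only valid if "always heads" and "normal"
# are taken to be the complete set of hypotheses.
odds_after_12_heads = history[12]
p_always_heads = odds_after_12_heads / (1 + odds_after_12_heads)
print(float(p_always_heads))  # a little over 0.999
```

Note that nothing in the loop needs P(observation) or any hypothesis other than the two being compared.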

You both stare at the coin that landed "tails". You ask your friend, "What just happened?" He replies, "well, the magician I bought it from said that it would always land heads. And it seemed to be working fine up 'til now. Maybe he just meant that it'll land heads most of the time?" Being naturally suspicious, you respond, "Looks like he lied to you then. He probably just sold you a normal coin". But your friend comes back with, "C'mon, you know that's not fair. Human language doesn't work like that. It's imprecise by its very nature. When someone says 'always' in casual conversation, they don't necessarily mean '100.000000...% of the time' with an infinite number of significant figures. Even 'normal' coins don't land heads exactly 50.000000...% of the time". Struck by your friend's rare moment of lucid articulation, you become temporarily speechless. "Besides", your friend continues, "the magician might have said that the coin 'nearly always lands heads'. I don't remember exactly".

With this new insight, you realize that you had set your priors on the wrong hypotheses at the beginning of the problem. Instead of the hypotheses that the coin lands "heads" exactly 100% of the time or exactly 50% of the time, you should have used 'close to 100% of the time' and 'close to 50% of the time'. Taking the prior odds to be P(close to 100%)/P(close to 50%) = 1:4 as before, and interpreting "close to" as a flat distribution within 2% of the given value, we get that the likelihood ratio for the coin landing "heads" is P(heads|close to 100%)/P(heads|close to 50%) = 99%:50% = 1.98:1, and for the coin landing "tails" is P(tails|close to 100%)/P(tails|close to 50%) = 1%:50% = 0.02:1. Then the posterior odds after 12 heads and 1 tails are given by prior odds times likelihood ratio, and are roughly:

1:4 * (1.98:1)^12 * 0.02:1 = 18.15:1

(This is an approximation, made by treating each probability distribution as entirely concentrated at the center of its interval. The actual value, 16.97:1, can be obtained by a straightforward integration over the probability distributions, but that calculation lies beyond the scope of this introductory post.)
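
For readers who want to check both numbers, here is a hedged Python sketch. It assumes that "within 2%" means a flat distribution on 98% to 100% for the trick coin and on 48% to 52% for the normal coin, and that the data are 12 heads and 1 tails, so the likelihood for a coin with heads probability p is p^12 * (1 - p). The function names are illustrative only.

```python
from fractions import Fraction

def antiderivative(p, n):
    """Integral of p**n * (1 - p) with respect to p."""
    return p ** (n + 1) / (n + 1) - p ** (n + 2) / (n + 2)

def mean_likelihood(lo, hi, n=12):
    """Average of p**n * (1 - p) over a flat distribution on [lo, hi]."""
    return (antiderivative(hi, n) - antiderivative(lo, n)) / (hi - lo)

prior = Fraction(1, 4)  # P(close to 100%) / P(close to 50%)

# Point approximation: all the probability sits at 99% and at 50%.
approx = prior * (Fraction(99, 100) / Fraction(1, 2)) ** 12 \
               * (Fraction(1, 100) / Fraction(1, 2))

# Exact version: average the likelihood over each flat interval.
exact = prior * mean_likelihood(Fraction(98, 100), Fraction(1)) \
              / mean_likelihood(Fraction(48, 100), Fraction(52, 100))

print(round(float(approx), 2), round(float(exact), 2))  # 18.15 16.97
```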

So you don't have to abandon the "close to 100%" hypothesis along with the "exactly 100%" hypothesis. The odds are still about 18:1 in favor of the coin landing "heads" more than 98% of the time, against it being a "normal" coin - enough for you to be reasonably confident in believing as your friend does.

This illustrates again the advantages of using the odds form. Firstly, we again didn't have to consider other probability values for the coin landing "heads", such as 75%. We were still able to come to a reasonable conclusion without having to specify the complete set of competing hypotheses and their probability distribution. Secondly, we were able to completely switch the class of hypotheses under consideration, without losing consistency. If we had stuck to the original form of Bayes' theorem, then we would have had to specify our prior probabilities for P(heads exactly 100% of the time) and P(heads exactly 50% of the time). To maintain our 1:4 ratio, we would assign them as 20% and 80%, taking up all 100% of our probability, because we were not thinking about other possibilities. But then, upon realizing our mistake, we would have no choice but to contradict our previous priors, and assign P(heads close to 100% of the time) and P(heads close to 50% of the time) some values, while going back and admitting that the chances of the coin giving exactly 50% or 100% "heads" are nearly zero. This is a problem created entirely by being unaware of the complete set of competing hypotheses.
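
To make the trouble concrete, here is a small Python sketch of what the original form does with those 20% and 80% priors. The function name `bayes_update` and the dictionary labels are illustrative, and the code deliberately makes the mistaken assumption that the two "exactly" hypotheses are the complete set.

```python
from fractions import Fraction

def bayes_update(priors, likelihoods):
    """Original form of Bayes' theorem over a set of hypotheses assumed complete."""
    # P(observation): the likelihoods averaged over every hypothesis, weighted by prior.
    p_observation = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / p_observation for h in priors}

# 1:4 odds forced into absolute probabilities that use up all 100%.
beliefs = {"exactly 100% heads": Fraction(1, 5), "exactly 50% heads": Fraction(4, 5)}
p_heads = {"exactly 100% heads": Fraction(1), "exactly 50% heads": Fraction(1, 2)}
p_tails = {h: 1 - p for h, p in p_heads.items()}

# Twelve heads, then one tails.
for flip in "H" * 12 + "T":
    beliefs = bayes_update(beliefs, p_heads if flip == "H" else p_tails)

for hypothesis, probability in beliefs.items():
    print(hypothesis, float(probability))
```

Because the priors were normalized over only two hypotheses, the single "tails" pushes all of the probability onto the normal coin, despite the twelve heads that came before it.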

But with the odds form, we don't have to have complete awareness. All the conclusions that we came to are still perfectly consistent with the data: there is zero chance for the coin to land "heads" exactly 100% of the time, yet it is much more likely that the "heads" probability is close to 100% than it being a normal coin. Our two sets of priors do not contradict each other either: it's quite reasonable for our prior odds to be 1:4 in both cases, because we have not specified how much of the total probability they take up. In general, I feel that it's easier to say how likely two hypotheses are relative to one another, rather than specifying the absolute probability value for a hypothesis.

I hope this convinces you of the virtues of the odds form of Bayes' theorem. This is how I use Bayes' theorem in everyday situations to sharpen my thinking: I didn't know if this one movie was going to be any good (prior odds), but upon its recommendation from a friend (likelihood ratio), I revise my opinion and am now more likely to see it (posterior odds). I didn't know whether Argentina or Germany was more likely to win the World Cup (prior odds), but upon watching Germany slaughter Brazil (likelihood ratio), I now consider Germany more likely than Argentina to win the World Cup (posterior odds). And so on and so forth. Posterior odds is prior odds times likelihood ratio.

Let's consider a couple of last examples:

I don't know if Bill Gates owns Fort Knox (prior odds). But I know that he's rich, and he's more likely to be rich if he owns Fort Knox than if he does not (likelihood ratio). Therefore, given that Bill Gates is rich, he's more likely to own Fort Knox (posterior odds).

Does that reasoning sound suspicious? It should. I took it straight from the Wikipedia page on "affirming the consequent", which is a logical fallacy. But the structure of the above argument is correct according to Bayes' theorem. It follows the same structure as all of my other examples. So, has Bayesian reasoning led to a logical fallacy? Oh no! What shall we do?

Hold that thought, while we consider our last example:

I don't know whether Einstein's theory of general relativity or Newton's theory of gravity is correct (prior odds). But upon considering the experimental evidence of the bending of starlight observed during the 1919 solar eclipse (likelihood ratio), I now consider general relativity much more likely to be correct than Newtonian gravity (posterior odds).

You should recognize that as the event that actually "proved" general relativity to the public, and the epitome of the scientific method at work: hypotheses are judged according to their agreement with experimental observations. But this is nothing more than just straightforward Bayesian reasoning, following the same structure as all of my other examples. So, it turns out that Bayesian reasoning underlies the scientific method, by providing the logical framework for it.

What are we to make of these two last examples? Does Bayesian reasoning allow for affirming the consequent? But isn't that a logical fallacy? But doesn't Bayesian reasoning also underlie the scientific method? Does that mean that science follows a logically flawed system? What are we to make of this?

I will address these issues in my next post.
