Author Topic: Thinking, Fast and Slow - Statistics question  (Read 3986 times)

frugalnacho

  • Walrus Stache
  • *******
  • Posts: 5060
  • Age: 42
  • Location: Metro Detroit
Thinking, Fast and Slow - Statistics question
« on: January 11, 2017, 08:42:30 PM »
I am reading "Thinking, Fast and Slow" by Daniel Kahneman and I am confused by an example he used.  I read it and thought it was wrong, and the more I think about it the more it seems wrong, but I don't know why.  This is the excerpt from Chapter 16:

Quote
Consider the following scenario and note your intuitive answer to the question.

A cab was involved in a hit-and-run accident at night.
Two cab companies, the Green and Blue, operate in the city.
You are given the following data:

*85% of the cabs in the city are Green and 15% are Blue.
*A witness identified the cab as Blue.  The court tested the reliability of the witness under the circumstances that existed on the night of the accident and concluded that the witness correctly identified each one of the two colors 80% of the time and failed 20% of the time.


This is a standard problem of Bayesian inference.  There are two items of information: a base rate and the imperfectly reliable testimony of a witness.  In the absence of a witness, the probability of the guilty cab being Blue is 15%, which is the base rate of that outcome.  If the two cab companies had been equally large, the base rate would be uninformative and you would consider only the reliability of the witness, concluding that the probability is 80%.  The two sources of information can be combined by Bayes's rule.  The correct answer is 41%.  However, you can probably guess what people do when faced with this problem: they ignore the base rate and go with the witness.  The most common answer is 80%.

I can't figure out why the base rate matters at all.  If the witness is accurate 80% of the time, isn't there an 80% chance the car was blue (as the witness claims) and not 41%, regardless of the base rate?  If there were 100 parallel universes with this exact scenario, in 80% of those universes the car would be blue (the witness was correct) and in 20% it would be green (the witness was wrong), by the very definition of what 80% means.
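
One way to check the parallel-universes intuition is to actually run the universes. The sketch below is not from the book; it is a minimal Python simulation, and the function name `simulate` and its parameters are made up for illustration. It assumes each universe independently draws a cab color from the 85/15 base rate and then lets the witness be right 80% of the time. It reports two numbers: how often the witness is right across all universes, and how often the cab is Blue in only those universes where the witness said Blue.

```python
import random

def simulate(trials=200_000, p_blue=0.15, accuracy=0.80, seed=0):
    """Return (share of universes where the witness is right,
               share of 'witness said Blue' universes where the cab was Blue)."""
    rng = random.Random(seed)
    correct = 0
    said_blue = 0
    said_blue_was_blue = 0
    for _ in range(trials):
        cab_is_blue = rng.random() < p_blue
        witness_correct = rng.random() < accuracy
        # A correct witness reports the true color; a mistaken one reports the other.
        witness_says_blue = cab_is_blue if witness_correct else not cab_is_blue
        correct += witness_correct
        if witness_says_blue:
            said_blue += 1
            said_blue_was_blue += cab_is_blue
    return correct / trials, said_blue_was_blue / said_blue

overall, blue_given_said_blue = simulate()
print(f"witness right overall: {overall:.2f}")
print(f"cab Blue when witness said Blue: {blue_given_said_blue:.2f}")
```

With a large number of trials the first number should come out close to 0.80 and the second close to 0.41. The witness really is right in about 80% of all universes, but many of those are universes where the witness correctly said Green, and those are excluded once we know the witness said Blue.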

respond2u

  • Stubble
  • **
  • Posts: 119
Re: Thinking, Fast and Slow - Statistics question
« Reply #1 on: January 11, 2017, 09:18:53 PM »
Here's my best guess (hopefully someone that's sure will look at this and be motivated enough to correct me :)

I think that you must have left off the question, but here's one way to get to 41%.

85% of the cabs are green
15% of the cabs are blue
The witness is right 80% of the time and wrong 20% of the time.

The share of cases where a witness identifies the cab as green is:
(85% green * 80% right) + (15% blue * 20% wrong) = .71

As a blue cab:
(15% blue * 80% right) + (85% green * 20% wrong) = .29

Of that .29, only the first part (15% blue * 80% right = .12) comes from cabs that really are blue, so the ratio is .12/.29 = .41 (41%).
So when the witness says it's a blue cab, it really is blue 41% of the time (not 80%).

As to what that actually means: https://en.wikipedia.org/wiki/Bayes%27_rule
 

shelfins

  • 5 O'Clock Shadow
  • *
  • Posts: 47
Re: Thinking, Fast and Slow - Statistics question
« Reply #2 on: January 11, 2017, 09:25:16 PM »
It can help to break it out this way: Of every 100 cabs in the city, 85 are green and 15 are blue.

Now, let's imagine every one of these 100 cabs got in an accident and was observed by a witness.

What would the witnesses report in each of those 100 cases?

Since witnesses are correct in their observation 80% of the time, of the 85 witnesses who saw green cabs, you would have 68 (85*0.8) who would correctly say they saw a green cab, and 17 (85*0.2) who would incorrectly say they saw a blue cab.

Of the witnesses who saw the blue cabs, 12 (15*0.8) would correctly say they saw a blue cab, and 3 (15*0.2) would incorrectly say they saw a green cab.

So out of these 100 accidents, it would break out like this:

Actual color: Green Witness report: Green Count: 68
Actual color: Blue Witness report: Green Count: 3
Actual color: Green Witness report: Blue Count: 17
Actual color: Blue Witness report: Blue Count: 12

As you can see, out of these 100 accidents, 29 involve a witness that says they saw a blue car, but of those 29, 17 actually had a green car, and only 12 of the 29 cases (or 41%) actually had a blue car!

Basically, because green cars are so common and blue cars are so rare, there are going to be a lot of cases where a person sees a green car and mistakes it for a blue car but relatively few cases where a person actually sees a blue car.
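
The same tally can be reproduced in a few lines of Python. This is only a sketch of the 100-accident thought experiment above; the variable names (`cabs`, `counts`, `share_really_blue`) are invented for illustration, and exact fractions are used so the counts come out as whole numbers.

```python
from fractions import Fraction

ACCURACY = Fraction(80, 100)        # witness is right 80% of the time
cabs = {"Green": 85, "Blue": 15}    # 100 cabs, each in one accident

def other(color):
    """Return the color a mistaken witness would report."""
    return "Blue" if color == "Green" else "Green"

# counts[(actual, reported)] = number of accidents with that combination
counts = {}
for actual, n in cabs.items():
    counts[(actual, actual)] = n * ACCURACY
    counts[(actual, other(actual))] = n * (1 - ACCURACY)

reported_blue = counts[("Green", "Blue")] + counts[("Blue", "Blue")]
share_really_blue = counts[("Blue", "Blue")] / reported_blue

for (actual, reported), n in counts.items():
    print(f"Actual color: {actual:5}  Witness report: {reported:5}  Count: {n}")
print(f"Reported Blue: {reported_blue}, really Blue: {share_really_blue} "
      f"(about {float(share_really_blue):.0%})")
```

The last line divides the 12 correct Blue reports by all 29 Blue reports, which is where the 41% comes from.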

It can be more obvious with something that's really, really rare, like say a disease that only 1 in 1,000 people have. Suppose you have a test for the disease that gives the correct diagnosis 80% of the time and the incorrect diagnosis 20% of the time, and you test 10,000 people. 10 will have the disease and 9,990 won't. You'll end up diagnosing 20% of those 9,990 (1,998 people who don't have the disease) as having the disease, and only 8 of the 10 people who actually do have it. So of the 2,006 people who got diagnosed with the disease, only 8 actually have it. There are so many more healthy people than sick people that there are a lot of opportunities to misdiagnose a healthy person and very few opportunities to correctly diagnose a sick person.
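
For anyone who wants to play with the numbers, here is a small sketch of the disease example. The function name `diagnosis_counts` and its arguments are hypothetical, and it assumes, as in the example, that the test is equally accurate for sick and healthy people.

```python
from fractions import Fraction

def diagnosis_counts(population, prevalence, accuracy):
    """Return (true positives, false positives) for a test that is
    right with probability `accuracy` on sick and healthy people alike."""
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * accuracy            # sick and diagnosed sick
    false_positives = healthy * (1 - accuracy)  # healthy but diagnosed sick
    return true_positives, false_positives

tp, fp = diagnosis_counts(10_000, Fraction(1, 1000), Fraction(80, 100))
share_actually_sick = tp / (tp + fp)
print(f"Diagnosed sick: {tp + fp}, actually sick: {tp} "
      f"(about {float(share_actually_sick):.1%})")
```

Raising `prevalence` while keeping `accuracy` fixed shows how quickly the share of correct positive diagnoses grows once the condition stops being rare.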

Does that help at all?

samsonator54321

  • 5 O'Clock Shadow
  • *
  • Posts: 63
Re: Thinking, Fast and Slow - Statistics question
« Reply #3 on: January 11, 2017, 09:46:32 PM »
I think it will help if you think of the probability BEFORE the observer states what they saw.

These are the four possible outcomes:

1. The car was green, the observer saw green. .85 x .80 = .68
2. The car was green, the observer saw blue. .85 x .20 = .17
3. The car was blue, the observer saw blue. .15 x .80 = .12
4. The car was blue, the observer saw green. .15 x .20 = .03

If you look up the Bayesian formula you'd find: Probability actually blue = (probability correct x probability of blue)/ (probability of seeing blue).

((Percent correct = .8) x (Probability of blue = .15)) / (chance of seeing blue = .17 + .12 = .29), which is (.8 x .15)/.29 = .41
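
Written as a function, the formula above might look like the sketch below. The name `posterior_blue` is made up for illustration; the denominator is the "chance of seeing blue", built from outcomes 2 and 3 in the list above.

```python
def posterior_blue(prior_blue, accuracy):
    """Probability the cab was blue, given that the observer said blue."""
    says_blue_and_is_blue = accuracy * prior_blue                # outcome 3
    says_blue_and_is_green = (1 - accuracy) * (1 - prior_blue)   # outcome 2
    chance_of_seeing_blue = says_blue_and_is_blue + says_blue_and_is_green
    return says_blue_and_is_blue / chance_of_seeing_blue

print(round(posterior_blue(0.15, 0.80), 2))  # 15% of cabs are blue
print(round(posterior_blue(0.50, 0.80), 2))  # equally large companies
```

The second call covers the case mentioned in the book excerpt: when the two companies are equally large, the base rate drops out and the answer is just the observer's 80% reliability.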

I like the way the person above me explained it better but figured I'd post the formula since I had already typed it.

frugalnacho

  • Walrus Stache
  • *******
  • Posts: 5060
  • Age: 42
  • Location: Metro Detroit
Re: Thinking, Fast and Slow - Statistics question
« Reply #4 on: January 11, 2017, 09:50:57 PM »
Thanks, that makes perfect sense.  It seems counter-intuitive until it's broken down like that.