Conditional Probability

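  The conditional probability P(B|A) is the probability of B, given that A has happened. To compute it:

  1. Take the probability of A and B both happening, P(A,B), as the numerator
  2. Take the probability of A, P(A), as the denominator
  3. Divide the numerator by the denominator
  • P(B|A) = P(A,B) / P(A)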

An example:

  • People go food shopping, and we are curious about (B) vegetables and (A) frozen dinners
  • 60% of people buy both
  • 80% of people buy frozen dinners

The Question: Given that someone buys frozen dinners, what is the probability that they also buy vegetables?

  • numerator = .6 (60%)
  • denominator = .8 (80%)
  • .6 / .8 = .75

The Answer: 75% of those who buy frozen dinners buy vegetables
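
The same arithmetic can be sketched in Python. This is a minimal illustration that uses only the two figures from the example; the variable names are made up for this sketch, and `Fraction` is used so the result is exact rather than a rounded float.

```python
from fractions import Fraction

# Figures from the shopping example above.
p_both = Fraction(60, 100)    # P(A,B): buys frozen dinners and vegetables
p_frozen = Fraction(80, 100)  # P(A): buys frozen dinners

# P(B|A) = P(A,B) / P(A)
p_veg_given_frozen = p_both / p_frozen

print(p_veg_given_frozen)         # 3/4
print(float(p_veg_given_frozen))  # 0.75
```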

Conditional Probability with Python

Figure out the likelihood that people in each age range buy something.

peoplePerAgeGroup

contains the total number of people in each age group.

In [1]:
peoplePerAgeGroup = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
print(peoplePerAgeGroup)
{20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0}

purchases

contains the total number of things purchased by people in each age group.

In [2]:
purchases = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
print(purchases)
{20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0}

totalPurchases

The grand total of purchases is in totalPurchases, and we know the total number of people is 100,000.

In [3]:
totalPurchases = 0
print(totalPurchases)
0

FakeData Generation

Below is some code to create some fake data on how much stuff people purchase given their age range.

It generates 100,000 random "people" and randomly assigns them as being in their 20's, 30's, 40's, 50's, 60's, or 70's.

It then assigns a lower probability for young people to buy stuff.

In [4]:
from numpy import random
random.seed(0)
totals = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
for _ in range(100000):
    ageDecade = random.choice([20, 30, 40, 50, 60, 70])

    # HERE, there's a correlation between ageDecade and purchase probability:
    # older people are MORE LIKELY to buy
    purchaseProbability = float(ageDecade) / 100.0
    totals[ageDecade] += 1
    if (random.random() < purchaseProbability):
        totalPurchases += 1
        purchases[ageDecade] += 1
In [5]:
print(totals)
{20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}
In [6]:
print(purchases)
{20: 3392, 30: 4974, 40: 6670, 50: 8319, 60: 9944, 70: 11713}
In [7]:
print(totalPurchases)
45012

Conditional Probability

Let's play with conditional probability. E will represent making a Purchase.
F will represent a specific age-range.

Prob. Purchasing in their 30s

First let's compute P(E|F), where E is "purchase" and F is "you're in your 30's". The probability of someone in their 30's buying something is just the fraction of people in their 30's who bought something:

In [8]:
numberOfPeopleWhoPurchasedThirties = purchases[30]
numberOfPeopleThirties = totals[30]
print("numberOfPeopleWhoPurchasedThirties:",numberOfPeopleWhoPurchasedThirties)
print("numberOfPeopleThirties:",numberOfPeopleThirties)

probEGivenF = float(numberOfPeopleWhoPurchasedThirties) / float(numberOfPeopleThirties)
print('P(purchase | 30s): ' + str(probEGivenF))
numberOfPeopleWhoPurchasedThirties: 4974
numberOfPeopleThirties: 16619
P(purchase | 30s): 0.29929598652145134
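
The same calculation can be repeated for every age group. The sketch below restates the counts printed by cells In [5] and In [6] so that it runs on its own; the name `prob_purchase_given_decade` is introduced here for illustration only. Since the generator used `ageDecade / 100` as the purchase probability, each conditional probability should land close to that value.

```python
# Counts copied from the printed output of cells In [5] and In [6] above.
totals = {20: 16576, 30: 16619, 40: 16632, 50: 16805, 60: 16664, 70: 16704}
purchases = {20: 3392, 30: 4974, 40: 6670, 50: 8319, 60: 9944, 70: 11713}

# P(purchase | decade) = purchases in that decade / people in that decade
prob_purchase_given_decade = {
    decade: purchases[decade] / totals[decade] for decade in totals
}

for decade, prob in prob_purchase_given_decade.items():
    # The data generator used decade / 100 as the purchase probability.
    print(f"P(purchase | {decade}s) = {prob:.3f} (generator used {decade / 100:.2f})")
```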

Prob. of being in 30s

P(F) is just the probability of being in your 30's in this data set:

In [9]:
PF = float(totals[30]) / 100000.0
print("P(30's): " +  str(PF))
P(30's): 0.16619

Prob of Buying something

And P(E) is the overall probability of buying something, regardless of your age:

In [10]:
PE = float(totalPurchases) / 100000.0
print("P(Purchase):" + str(PE))
P(Purchase):0.45012

If E and F were independent, then we would expect P(E|F) to be about the same as P(E). But they're not: P(E) is about 0.45, and P(E|F) is about 0.30. That tells us that E and F are dependent (which we know they are in this example, because the purchase probability was generated from the age decade).

P(E,F) is different from P(E|F). P(E,F) would be the probability of both being in your 30's and buying something, out of the total population - not just the population of people in their 30's:

In [11]:
print("P(30's, Purchase)" + str(float(purchases[30]) / 100000.0))
P(30's, Purchase)0.04974

Let's also compute the product of P(E) and P(F), P(E)P(F):

In [12]:
print("P(30's)P(Purchase)" + str(PE * PF))
P(30's)P(Purchase)0.07480544280000001

Something you may learn in stats is that P(E,F) = P(E)P(F), but this assumes E and F are independent. We've found here that P(E,F) is about 0.05, while P(E)P(F) is about 0.075. So when E and F are dependent - and we have a conditional probability going on - we can't just say that P(E,F) = P(E)P(F).
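
What does hold for dependent events is the product rule P(E,F) = P(E|F)P(F). The sketch below puts the three quantities side by side for the 30's group, using the counts printed earlier; the variable names are illustrative and not part of the notebook. The first and third printed values should agree, while P(E)P(F) comes out noticeably larger.

```python
# Counts for the 30's group, copied from the printed output above.
total_people = 100000
people_in_30s = 16619
purchases_in_30s = 4974
total_purchases = 45012

p_e = total_purchases / total_people            # P(E): purchase
p_f = people_in_30s / total_people              # P(F): in your 30's
p_e_and_f = purchases_in_30s / total_people     # P(E,F): both at once
p_e_given_f = purchases_in_30s / people_in_30s  # P(E|F)

print(f"P(E,F)     = {p_e_and_f:.4f}")
print(f"P(E)P(F)   = {p_e * p_f:.4f}")
print(f"P(E|F)P(F) = {p_e_given_f * p_f:.4f}")
```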

We can also check that P(E|F) = P(E,F)/P(F), which is the relationship we showed in the slides - and sure enough, it is:

In [13]:
print((purchases[30] / 100000.0) / PF)
0.29929598652145134
In [17]:
# Re-run the simulation with a constant purchase probability,
# so that purchasing no longer depends on age, and store the data
localTotalPurchases = 0
localTotals = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
localPurchases = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
totalPeople = 100000

for _ in range(totalPeople):
    ageDecade = random.choice([20, 30, 40, 50, 60, 70])
    purchaseProbability = .75
    localTotals[ageDecade] += 1
    if (random.random() < purchaseProbability):
        localTotalPurchases += 1
        localPurchases[ageDecade] += 1

# show results
print("localTotals:",localTotals)
print("localTotalPurchases:",localTotalPurchases)
print("localPurchases:",localPurchases)
localTotals: {20: 16911, 30: 16396, 40: 16780, 50: 16511, 60: 16875, 70: 16527}
localTotalPurchases: 75030
localPurchases: {20: 12675, 30: 12282, 40: 12634, 50: 12376, 60: 12615, 70: 12448}

Calculate Probability

Confirm that P(E|F) is about the same as P(E), showing that the conditional probability of purchase for a given age is not any different than the a-priori probability of purchase regardless of age.

In [19]:
thisPE = float(localTotalPurchases) / totalPeople
print(thisPE)
0.7503
In [20]:
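# Note: this prints the joint probability P(decade, purchase) for each listed decade,
# i.e. purchases in that decade divided by the whole population.
ageArr = [20,30,40,50,60]
for decade in ageArr:
    print("P("+str(decade)+"'s, Purchase)" + str(float(localPurchases[decade]) / 100000.0))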
P(20's, Purchase)0.12675
P(30's, Purchase)0.12282
P(40's, Purchase)0.12634
P(50's, Purchase)0.12376
P(60's, Purchase)0.12615
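
The values above are joint probabilities, taken over the whole population. To compare against P(E) directly, each group's purchases need to be divided by that group's size, which gives P(E|F). The sketch below does that with the counts printed by cell In [17]; the snake_case names are introduced here for illustration and restate the notebook's data so the snippet runs on its own.

```python
# Counts copied from the printed output of cell In [17] above.
local_totals = {20: 16911, 30: 16396, 40: 16780, 50: 16511, 60: 16875, 70: 16527}
local_purchases = {20: 12675, 30: 12282, 40: 12634, 50: 12376, 60: 12615, 70: 12448}
local_total_purchases = 75030
total_people = 100000

p_e = local_total_purchases / total_people  # P(E), the overall purchase rate

for decade in local_totals:
    # P(E|F): purchases within the decade divided by people in the decade
    p_e_given_f = local_purchases[decade] / local_totals[decade]
    print(f"P(purchase | {decade}s) = {p_e_given_f:.4f}  vs  P(purchase) = {p_e:.4f}")
```

With a constant purchase probability, every group's conditional probability should sit close to the overall rate of about 0.75, which is what independence looks like in the data.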