
Basic Concepts of Probability and Independence of Events
By Amitava Bandyopadhyay

Learning Objectives
Understand the concepts of events and probability of events
Understand the notion of conditional probabilities and independence of different kinds
Understand the concept of inverse probabilities and Bayes theorem
Understand the specific concepts of lift, support, sensitivity and specificity
Develop the ability to use these concepts to formulate business problems and provide solutions to them

Events
In business analytics we often study the occurrence of events. A
customer coming into a store may or may not buy some items. In
this case buying is an event. We may be interested in knowing
the amount spent by the customer. In this case the amount spent
(or the range of spend) is an event. A particular machine may or
may not fail in a given time interval. In this case failure of the
machine in the given time interval is an event.
Loosely speaking, an event is something that happens. We attach a numeric value to the outcome being observed.
An interesting point is that, while dealing with events, we are dealing with a special kind of variable. The range of all possible values of this variable is known in advance, but the exact value that will occur in the next instance is not. Such a variable is called a random variable.
Note: An event is a subset of values that the random variable can
assume

Events (Continued)
Consider the case of a customer walking into a retail store. During her stay in the store, she may or may not buy. Accordingly, we may define a random variable X that takes the value 0 if nothing was bought and 1 if something was bought. The events of buying and not buying may then be defined.
Usually events are denoted by capital letters.
Note: Events are subsets of values of a random variable. An important assumption is that the values of the random variable are generated under similar conditions. This assumption is often ignored, which is a hazardous practice.

Examples
You are working for an automobile company. The company wants to know how many times an automobile might fail during the warranty period (say, before travelling 10000 miles). The number of failures is a random variable. Suppose we denote this random variable by X. An event may be described as X ≤ 5, i.e. the event that an automobile fails at most 5 times.
Suppose an automobile has been serviced and you want to know how many miles it will travel before encountering the next failure. Let X denote the number of miles. Here X is a random variable. An event may be X ≥ 5000.

Note
Many real life situations are actually events that describe some
values assumed by a random variable
The number of vehicles sold during a month, the number of accidents in a day, the number of customers coming to a retail store in a month and the number of telephone calls made by customers in a day to a call center are all examples of random variables. An event specifies some values of the random variable.
In business analytics you must always try to identify the underlying random variable.

Probability of an Event
We will normally use the symbol P(A) to denote the
probability that the event A happens
Let A be the event that a personal loan given to a
particular customer turns out to be bad (cannot be
recovered)
Let P(A) = 0.03
This implies that the past record of the bank shows that 3% of personal loans turn out to be bad. (Note that we are assuming that all personal loans are sanctioned under more or less similar conditions. In case the past records contain a period of severe recession leading to the loss of many jobs, the analyst should be careful in estimating the probability. Even dropping that period may be meaningful.)

Example-cum-Exercise
Suppose a telecom service provider has carried out a survey to find the level of importance customers attach to various aspects of their experience of using the service. Suppose the importance is given on a seven-point scale (1 to 7), where 1 means the least importance and 7 stands for the highest importance. One of the aspects of customer experience is accuracy of bills, and suppose that the survey has yielded the following result:

Value:      1   2   3   4    5    6    7
Frequency:  1   3   6   13   72   135  130

Let A be the event that a randomly selected customer will consider the importance of accurate billing to be 6 or more on a 7-point scale. What is P(A)? How did you arrive at the value? What assumptions did you make?
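A minimal sketch of how P(A) could be estimated from the survey table, assuming each respondent is equally likely to be the randomly selected customer (i.e. the relative frequency is used as the probability):

```python
# Survey frequencies for importance ratings 1..7 (from the table above)
freq = {1: 1, 2: 3, 3: 6, 4: 13, 5: 72, 6: 135, 7: 130}

n = sum(freq.values())            # 360 respondents in total
p_a = (freq[6] + freq[7]) / n     # relative frequency of a rating of 6 or more
print(round(p_a, 4))              # -> 0.7361
```

The key assumption is the one the slides keep emphasizing: the ratings were generated under similar conditions, so past relative frequency is a sensible estimate of the probability for the next customer.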

Some Elementary Properties
One can derive many properties of the function P from the axioms:
0 ≤ P(A) ≤ 1 for all events A
Let Ω be the universal set (i.e. the event that contains all possible values of the random variable under consideration). Ω is called the sample space as well.
If Ω is the sample space then P(Ω) = 1
Let ∅ be the empty set (i.e. the event that none of the possible values of the random variable occurs). Then P(∅) = 0
P(Ac) = 1 − P(A), where the set Ac consists of all points x that do not belong to the set A. Ac is called the complementary set of A (read as A complement)
For any event A we have A ∪ ∅ = A; A ∩ ∅ = ∅; A ∩ Ω = A and A ∪ Ω = Ω
A and B are said to be mutually exclusive in case A ∩ B = ∅
If A and B are two mutually exclusive events, then P(A ∪ B) = P(A) + P(B)
If A and B are not mutually exclusive, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
A set of events B1, B2, …, Bp are mutually exclusive and collectively exhaustive when Bi ∩ Bj = ∅ for all i ≠ j and B1 ∪ B2 ∪ … ∪ Bp = Ω
Note: In the previous case ∪j (A ∩ Bj) = A, where the union is taken over all j = 1, 2, …, p. Thus Σj P(A ∩ Bj) = P(A) when the Bj's are mutually exclusive and collectively exhaustive. (Why?)

Axioms of Probability
A function P that assigns a real number P(A) to each event A is a probability distribution or a probability measure if it satisfies the following three axioms:
a. P(A) ≥ 0
b. P(Ω) = 1
c. If A1, A2, … are disjoint events, i.e. Ai ∩ Aj = ∅ where ∅ is the empty set, then P(∪j Aj) = Σj P(Aj)

The axioms of probability provide the theoretical basis, and the elementary properties mentioned in the previous slide can be derived from them.

Concept of Joint Probability
Let A and B be two events with probabilities P(A) and P(B). Suppose we are interested in finding out the probability of the event A ∩ B (read A intersection B).
The event A ∩ B denotes the joint occurrence of A and B.
Example: Suppose, in the context of a retail store, A denotes the event that a customer buys bread and B denotes the event that a customer buys butter. Then A ∩ B denotes the event that the customer buys both bread and butter.
Another example: Suppose a travel company places online ads for hotel booking, air ticketing and car hire. Let A, B and C be the events that a prospective customer visiting the site books a hotel room, buys an air ticket through the travel company and hires a car, respectively. Then A ∩ B indicates that the customer books a hotel room as well as an air ticket through the travel company. What will A ∩ C, B ∩ C and A ∩ B ∩ C indicate?
In the previous case, suppose N people have visited the site. Let N_A, N_B and N_C denote the numbers of customers who booked a hotel room, bought an air ticket and hired a car. Let N_AB, N_BC and N_AC be the numbers of cases in which customers booked the corresponding pairs of services, and let N_ABC be the number of cases in which the customer booked all three. Then, for example, P(A ∩ B) may be estimated by N_AB / N.

Conditional Probability
The conditional probability of event A given event B, written as P(A|B), is the relative frequency of A given that B has happened.
Conditional probability: P(A|B) = P(A ∩ B) / P(B). In terms of counts, P(A|B) = N_AB / N_B.
P(A|B) is defined only if P(B) > 0.
In the table given in the example-cum-exercise slide, what is the conditional probability that a customer will rate his billing experience as 7 given that his experience score is > 5?
Suppose a family has three siblings. What is the conditional probability that the family has three daughters given that at least two out of the 3 siblings are girls?
Note that P(A|B) and P(B|A) are not the same.
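Both questions on this slide can be checked numerically. A sketch, reusing the survey frequencies from the example-cum-exercise slide and enumerating the 8 equally likely girl/boy patterns for the siblings:

```python
from itertools import product

# P(rating = 7 | rating > 5), using the survey frequencies from the earlier slide
freq = {1: 1, 2: 3, 3: 6, 4: 13, 5: 72, 6: 135, 7: 130}
n = sum(freq.values())
p_b = (freq[6] + freq[7]) / n      # P(rating > 5)
p_ab = freq[7] / n                 # P(rating = 7 and rating > 5) = P(rating = 7)
print(round(p_ab / p_b, 4))        # -> 0.4906

# P(three daughters | at least two girls): enumerate all 2^3 sibling patterns,
# assumed equally likely
patterns = list(product("GB", repeat=3))             # 8 outcomes
at_least_two = [p for p in patterns if p.count("G") >= 2]
all_girls = [p for p in at_least_two if p.count("G") == 3]
print(len(all_girls) / len(at_least_two))            # -> 0.25
```

Note how the second computation is just the definition P(A|B) = P(A ∩ B) / P(B) applied to counts of equally likely outcomes.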

An Important Point
Note that P(A|B) and P(B|A) are not the same. Consider the following example.
An epidemiologist wants to assess the impact of smoking on the incidence of lung cancer. From hospital records she collected data on 100 patients with lung cancer, and she also collected data on 300 persons not suffering from lung cancer. She classified the 400 samples into smokers and non-smokers, and the observations are summarized below.

            Lung Cancer
Smoker      Yes    No     Total
Yes         69     137    206
No          31     163    194
Total       100    300    400

Let A be the event that a person has lung cancer and let B be the event that the person is a smoker. Can you estimate P(A|B) from the table given above?
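From the table we can estimate P(B|A) = P(smoker | cancer) directly, but not P(A|B): the 100:300 split between cases and controls was fixed by the study design, so the column totals do not reflect the population prevalence of lung cancer. A sketch:

```python
# Counts from the case-control table above
smokers_with_cancer = 69
cancer_cases = 100        # sampled lung-cancer patients (fixed by design)

p_b_given_a = smokers_with_cancer / cancer_cases
print(p_b_given_a)        # -> 0.69, i.e. P(smoker | cancer) is estimable

# By contrast, 69/206 would NOT estimate P(cancer | smoker): the 100:300
# case-control ratio was chosen by the epidemiologist, not observed in the
# population, so the row proportions mix two differently sampled groups.
```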

An Interesting Aspect of Conditional Probability
P(A|B) may be very different from P(A), and we can often use this to our advantage.
Note that A ∩ B is a subset of all occurrences of A. For example, A may denote the event that a machine fails. On a random day the chance of failure may be 0.0001, or 1 in 10000.
However, given certain conditions described through the event B, P(A|B) may increase to 0.01.
Thus the presence of condition B leads to a 100-fold increase in the probability of failure on any given day. If the condition persists for 10 days, the chance increases tremendously. (Can you calculate it, assuming failures across days are independent?)
Another example: we know that the probability of a heart attack during a period of, say, one year for a randomly selected Indian male may be fairly low. However, this risk may increase significantly for a given combination of age, genetic disposition, smoking habit, BMI, level of blood sugar and LDL.
In many analytics problems our job is to find the event B that significantly increases or decreases the chance of occurrence of an event of interest.
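The 10-day calculation asked for in the parenthesis, assuming failures on different days are independent:

```python
p_day = 0.01    # daily failure probability given condition B
days = 10

# P(at least one failure in 10 days) = 1 - P(no failure on any of the 10 days)
p_fail = 1 - (1 - p_day) ** days
print(round(p_fail, 4))    # -> 0.0956
```

So a condition that raises the daily risk to 1% turns into nearly a 10% chance of failure over 10 days.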

Concept of Independence
We say that events A and B are independent in case P(A|B) = P(A), i.e. the probability of A is not impacted by the occurrence of event B.
This definition implies that when A and B are independent, P(A ∩ B) = P(A) · P(B).
Example: Suppose a machine may fail for three different reasons, and suppose these three reasons occur independently. Let A, B and C denote the events that reason 1, reason 2 and reason 3 are present. Then P(A ∩ B ∩ C) = P(A) · P(B) · P(C).
Note: If A and B are independent, then Ac and Bc are independent. In fact, it can easily be shown that Ac and B, and A and Bc, are also independent.

Examples of Independence
Suppose you are tossing a fair coin. Thus the
probability that a toss results in a head is 0.5.
Assuming that tosses are independent of each
other, what is the chance that 3 tosses will result
in 3 heads?
Suppose a machine has 20 different parts. Suppose the parts fail independently of each other, and on any given day each part fails with only a 1% chance. Suppose the machine continues to operate if all parts are operational and fails if one or more parts fail. What is the chance that the machine will fail on a randomly selected day?
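Both questions can be answered with the multiplication rule for independent events; a sketch:

```python
# Three heads in three independent tosses of a fair coin
p_three_heads = 0.5 ** 3
print(p_three_heads)                 # -> 0.125

# Machine with 20 parts failing independently, each with a 1% daily chance;
# the machine fails if at least one part fails
p_machine_fail = 1 - (1 - 0.01) ** 20
print(round(p_machine_fail, 4))      # -> 0.1821
```

Note the same complement trick as before: "at least one part fails" is easiest to compute as 1 minus "no part fails".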

Mutually Independent Events
A set of events A1, A2, …, Ap are said to be mutually independent in case, for all combinations 1 ≤ i < j < k < … ≤ p, the following multiplication rules hold:
P(Ai ∩ Aj) = P(Ai) P(Aj) … (1)
P(Ai ∩ Aj ∩ Ak) = P(Ai) P(Aj) P(Ak) … (2)
…
P(A1 ∩ A2 ∩ … ∩ Ap) = P(A1) P(A2) … P(Ap) … (p − 1)

Notes on Mutual Independence
Mutual independence is a strong condition.
The condition consists of 2^p − p − 1 equations in all (one for each subset of two or more events), and in general each of them must be checked.
Note that the last equation alone does not imply the others: the multiplication rule may hold for all p events taken together and still fail for some smaller subset of the events.

Pairwise Independence
When the first equation, involving two events, holds good for all possible choices of two events, the events are said to be pairwise independent.
Pairwise independence does not imply mutual independence. Suppose two fair dice are thrown and the following three events are defined:
A means an odd face on the first die
B means an odd face on the second die
C means an odd sum
Note that A, B and C are pairwise independent but not mutually independent.
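The dice example can be verified by enumerating the 36 equally likely outcomes; a sketch:

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))      # 36 equally likely outcomes
A = {o for o in outcomes if o[0] % 2 == 1}           # first die odd
B = {o for o in outcomes if o[1] % 2 == 1}           # second die odd
C = {o for o in outcomes if (o[0] + o[1]) % 2 == 1}  # sum odd

def p(event):
    return len(event) / 36

# Pairwise independence holds for all three pairs:
print(p(A & B) == p(A) * p(B))            # True
print(p(A & C) == p(A) * p(C))            # True
print(p(B & C) == p(B) * p(C))            # True
# ...but mutual independence fails: two odd faces force an even sum
print(p(A & B & C), p(A) * p(B) * p(C))   # 0.0 vs 0.125
```

The failure is intuitive: once A and B both occur, the parity of the sum is determined, so C cannot be independent of the pair.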

Concept of Total Probability
Let B1, B2, …, Bp be a set of mutually exclusive and collectively exhaustive events such that Bj ≠ ∅ for j = 1, 2, …, p.
Let A be any other event.
Then P(A) = Σj P(A|Bj) P(Bj). (Why?)

Exercise
In a certain county 60% of registered
voters support party A, 30% support
party B and 10% are independents.
When those voters were asked about
increasing military spending 40% of
supporters of A opposed it, 65% of
supporters of B opposed it and 55%
of the independents opposed it. What
is the probability that a voter
selected randomly in this county
opposes increased military spending?
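Applying the total probability formula to the exercise above:

```python
# Supporter shares and opposition rates from the exercise
share = {"A": 0.60, "B": 0.30, "Independent": 0.10}
oppose = {"A": 0.40, "B": 0.65, "Independent": 0.55}

# P(opposes) = sum over groups of P(opposes | group) * P(group)
p_oppose = sum(oppose[g] * share[g] for g in share)
print(round(p_oppose, 2))    # -> 0.49
```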

Bayes Theorem
Bayes theorem allows us to look at probability from an inverse perspective.
Bayes theorem states that
P(B|A) = P(A|B) P(B) / P(A)
Let B1, B2, …, Bp be a set of mutually exclusive and collectively exhaustive events such that Bj ≠ ∅ for j = 1, 2, …, p. In this set-up Bayes theorem may be stated as
P(Bj|A) = P(A|Bj) P(Bj) / Σk P(A|Bk) P(Bk), for j = 1, 2, …, p
This simple yet intelligent way of looking at probability is often very effective. We may not be able to find P(Bj|A) directly, but it may be far easier to estimate P(A|Bj).
Construct examples of the previous statement. Recall the example of smoking and lung cancer. Can you use Bayes theorem to estimate the probability of lung cancer given smoking habit?

Application of Bayes Theorem
Suppose I divide my email into three categories: A1 = spam, A2 = administrative and A3 = technical. From previous experience I find that P(A1) = 0.3, P(A2) = 0.5 and P(A3) = 0.2. Let B be the event that the email contains the word "free" and has at least one occurrence of the character "!". From previous experience I have noted that P(B|A1) = 0.95, P(B|A2) = 0.005 and P(B|A3) = 0.001. I receive an email with the word "free" and the character "!". What is the probability that it is spam?
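The posterior probability asked for, computed via total probability and Bayes theorem:

```python
prior = {"spam": 0.3, "admin": 0.5, "tech": 0.2}            # P(Aj)
likelihood = {"spam": 0.95, "admin": 0.005, "tech": 0.001}  # P(B | Aj)

p_b = sum(likelihood[c] * prior[c] for c in prior)  # P(B) by total probability
p_spam = likelihood["spam"] * prior["spam"] / p_b   # P(A1 | B) by Bayes theorem
print(round(p_spam, 4))    # -> 0.9906
```

Despite the 0.3 prior, an email containing both "free" and "!" is almost certainly spam, because the likelihood of B is so much higher under A1 than under the other categories.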

Notice that we have used Bayes theorem to construct a spam filter. In the subsequent slides we will see how this simple concept may be extended to construct powerful classification mechanisms.

Sensitivity and Specificity
While Bayes theorem may be used to construct classification mechanisms, it may also be used to evaluate their performance.
Let B denote the event of interest, say the failure of a machine or the event that a person has a particular disease.
Let A be the event that the classifier gives a positive response.
Ac is the event that the classifier gives a negative response.
Note the difference between the actual occurrence and a positive response from the classification technique.
Questions:
a. What is the difference between the two conditional events A|B and B|A?
b. Which probability are we interested in?
c. Is it possible to estimate the probability of interest directly? If yes, how? If not, why not?

Sensitivity and Specificity (Continued)
P(A|B) is the conditional probability of a positive response given that the event has actually occurred. This probability is called the sensitivity. The higher the probability of a positive response from the classifier when the underlying condition is truly positive, the higher the sensitivity.
P(A|Bc) is the probability of a positive response when the underlying condition is actually negative. 1 − P(A|Bc) is called the specificity. The lower the value of P(A|Bc), the higher the specificity. Thus specificity is also given by P(Ac|Bc).

False Positive and False Negative
P(A|Bc) is the probability of getting a false positive. This gives the probability of a positive response when the event of interest actually did not happen.
P(Ac|B) is the probability of getting a false negative. This gives the probability of a negative response when the event of interest actually happened.
Note
A sensitive instrument rarely gives false negative results, and a specific instrument rarely gives false positive results.

Events of Interest
Note that sensitivity and specificity do not give the probabilities of the events of interest.
We are actually interested in the positive and negative predictive values (abbreviated as PPV and NPV respectively), defined as
PPV = P(B|A) = P(A|B) P(B) / P(A) by Bayes theorem
NPV = P(Bc|Ac) = P(Ac|Bc) P(Bc) / P(Ac) by Bayes theorem
Notice that PPV and NPV cannot be found directly, whereas sensitivity and specificity can be.
Also P(A) = P(A ∩ B) + P(A ∩ Bc)
= P(A|B) P(B) + P(A|Bc) P(Bc)
= Sensitivity · P(B) + (1 − Specificity)(1 − P(B))
Thus we can find PPV and NPV provided we know the sensitivity, the specificity and the prevalence of the particular event of interest in the population (i.e. if we know P(B)).
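A sketch of the PPV/NPV computation using the formulas above, with hypothetical values for sensitivity, specificity and prevalence (these numbers are illustrative, not from the slides):

```python
# Hypothetical classifier characteristics -- illustrative values only
sensitivity = 0.99    # P(A | B)
specificity = 0.95    # P(Ac | Bc)
prevalence = 0.01     # P(B)

p_a = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)  # P(A)
ppv = sensitivity * prevalence / p_a               # P(B | A)
npv = specificity * (1 - prevalence) / (1 - p_a)   # P(Bc | Ac)
print(round(ppv, 3))    # -> 0.167
print(round(npv, 5))    # -> 0.99989
```

Even with an excellent classifier, a low-prevalence event yields a low PPV: most positive responses are false positives. This is why prevalence must enter the assessment of any sorting or screening device.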

Why is this Important?
Suppose we are trying to develop a classification
model to understand what leads to failure of
vehicles. It is not possible to conduct experiments
where we observe impact of different conditions on
the event of failure of vehicles in a given period of
time. However, whenever vehicles fail, the failures
will be reported. Suppose the conditions are
captured by sensors. Thus we will have data on
conditions given that failure has happened. We can,
therefore, estimate the probability of different
conditions given that failure has happened. From
warranty report data we can also estimate the
unconditional probability of failure. We can,
therefore, use the methodology given above to
classify whether vehicles will fail under given
conditions

Further Insights
Note that the previous discussions show how we can estimate P(B|A), where B is the failure event, when P(A|B) and P(B) are known.
Generally we would like to estimate the conditional probability of failure given many events rather than only one. Thus we may like to estimate P(B|A1 ∩ A2 ∩ … ∩ Ak). Note that, using Bayes theorem,

P(B|A1 ∩ A2 ∩ … ∩ Ak) = P(A1 ∩ A2 ∩ … ∩ Ak|B) P(B) / P(A1 ∩ A2 ∩ … ∩ Ak)

In the next section we will see how the previous concepts, including the Bayes optimal classification rule and the concept of conditional independence to be introduced next, may be used to solve classification problems.

Concept of Conditional Independence
Let A, B and C be three events.
A and B are said to be conditionally independent given C in case P(A|B ∩ C) = P(A|C).
Conditional independence is often a reasonable assumption, as we show in the subsequent examples.
Consider the following events:
A = the event that the lecture is delivered by Amitava (there are two teachers, Amitava and Boby)
B = the event that the lecturer arrives late
C = the event that the lecture concerns stat theory (theory and practical are taught)
Suppose Amitava has a higher chance of delivering the lecture on stat theory.
Suppose Amitava is likelier to be late.
Notice that the conditional probability of the lecture being on stat theory, given that the lecturer is Amitava, does not depend on the event that the lecturer arrives late.
Thus P(C|A ∩ B) = P(C|A)

Implication of Conditional Independence
Let A and B be conditionally independent given C. Note that
P(A ∩ B|C) = P(A ∩ B ∩ C) / P(C)
= P(A|B ∩ C) P(B ∩ C) / P(C)
= P(A|C) P(B|C) P(C) / P(C) (Why?)
Thus we get P(A ∩ B|C) = P(A|C) P(B|C).
In general we may say that when A1, A2, …, Ak are conditionally independent given B,
P(A1 ∩ A2 ∩ … ∩ Ak|B) = P(A1|B) P(A2|B) … P(Ak|B)

Naïve Bayes Classification
The concepts of Bayes theorem, the Bayes optimality criterion for classification and conditional independence may be combined to develop a classification methodology.
Suppose a response variable R takes k different values. Let us assume that these values are 1, 2, …, k without loss of generality.
Let A1, A2, …, An be n different events defined in terms of explanatory variables. We want to estimate the probability P(R = j|A1 ∩ A2 ∩ … ∩ An) for different combinations of A1, A2, …, An.
Once these probabilities are estimated for all j for a given combination of A1, A2, …, An, we try to find the j that maximizes this probability. From the Bayes optimality criterion, for a given combination of A1, A2, …, An, the response is allocated to the class j that maximizes the probability.
We have already shown that P(B|A1 ∩ A2 ∩ … ∩ An) = P(A1 ∩ A2 ∩ … ∩ An|B) P(B) / P(A1 ∩ A2 ∩ … ∩ An).
Since the denominator is constant across classes, we allocate to the class for which the numerator is maximum.
Under the assumption of conditional independence of A1, A2, …, An given B, we get P(A1 ∩ A2 ∩ … ∩ An|B) = P(A1|B) P(A2|B) … P(An|B). Generally these probabilities can be estimated, and consequently a classification mechanism may be developed.

Example
Consider the problem where data were collected for customers of computers. We need to develop a classification mechanism so that customers may be classified as buyers or non-buyers given the profile. We will use the Naïve Bayes classification methodology to accomplish this objective.
Data Table

Age      Income   Student   Credit Rating   Buys Computer
≤ 30     High     No        Fair            No
≤ 30     High     No        Excellent       No
31-40    High     No        Fair            Yes
> 40     Medium   No        Fair            Yes
> 40     Low      Yes       Fair            Yes
> 40     Low      Yes       Excellent       No
31-40    Low      Yes       Excellent       Yes
≤ 30     Medium   No        Fair            No
≤ 30     Low      Yes       Fair            Yes
> 40     Medium   Yes       Fair            Yes
≤ 30     Medium   Yes       Excellent       Yes
31-40    Medium   No        Excellent       Yes
31-40    High     Yes       Fair            Yes
> 40     Medium   No        Excellent       No

Classification Mechanism
The classifier aims at developing a method such that optimal allocation to one of the classes (buys computer / does not buy computer) is made for any customer with a given combination of age, income, status (student or not) and credit rating.
Let B be the response variable that takes two values: B = 0 means the customer does not buy a computer and B = 1 means s/he buys a computer.
Now P(B = 0|Age, Income, Status, Credit Rating) and P(B = 1|Age, Income, Status, Credit Rating) need to be found using the Naïve Bayes theory.
We know that, rather than estimating these probabilities, some values proportional to them will be found.
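A minimal Naïve Bayes sketch over the data table above: for each class we compute the prior times the product of conditional probabilities, which is proportional to the posterior, and allocate a hypothetical new customer to the class with the larger score.

```python
# Rows from the slide's data table: (age, income, student, credit, buys)
data = [
    ("<=30",  "High",   "No",  "Fair",      "No"),
    ("<=30",  "High",   "No",  "Excellent", "No"),
    ("31-40", "High",   "No",  "Fair",      "Yes"),
    (">40",   "Medium", "No",  "Fair",      "Yes"),
    (">40",   "Low",    "Yes", "Fair",      "Yes"),
    (">40",   "Low",    "Yes", "Excellent", "No"),
    ("31-40", "Low",    "Yes", "Excellent", "Yes"),
    ("<=30",  "Medium", "No",  "Fair",      "No"),
    ("<=30",  "Low",    "Yes", "Fair",      "Yes"),
    (">40",   "Medium", "Yes", "Fair",      "Yes"),
    ("<=30",  "Medium", "Yes", "Excellent", "Yes"),
    ("31-40", "Medium", "No",  "Excellent", "Yes"),
    ("31-40", "High",   "Yes", "Fair",      "Yes"),
    (">40",   "Medium", "No",  "Excellent", "No"),
]

def score(profile, label):
    """P(B = label) times the product of P(Ai | label): proportional to the posterior."""
    rows = [r for r in data if r[4] == label]
    s = len(rows) / len(data)                                   # prior P(B = label)
    for i, value in enumerate(profile):
        s *= sum(1 for r in rows if r[i] == value) / len(rows)  # P(Ai | label)
    return s

# A hypothetical new customer: age <= 30, medium income, student, fair credit
profile = ("<=30", "Medium", "Yes", "Fair")
scores = {label: score(profile, label) for label in ("Yes", "No")}
print(max(scores, key=scores.get))    # -> Yes
```

Here "Yes" scores (9/14)(2/9)(4/9)(6/9)(6/9) ≈ 0.0282 against ≈ 0.0069 for "No", so the customer is allocated to the buyer class. With larger tables, zero counts are usually smoothed (e.g. Laplace smoothing) before taking the product.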

Exercise
Develop a classification mechanism for the IRIS database.
Hint: Note that the response variable has three classes. Also observe that there are four explanatory variables.
Divide the explanatory variables into certain classes and find the conditional probabilities of the explanatory variables given the different values of the response variable.

Examples of Usage of the Concepts
A machine has many different sensors that capture data on a number of variables, say temperature, speed, vibration, and so on. Suppose these data are continuous, captured on a ratio scale. The machine may fail in many different modes, including degradation of function or development of fault codes that may not lead to stoppage of function. The failure is a categorical variable. Our problem is to allocate the machine to one of these classes given the sensor data.
Inspection of products may be expensive. Thus we may need to develop a filtering mechanism on the basis of automatically collected data to classify products as good or bad. This application is very similar to the spam filtering we discussed earlier.
Suppose a manufacturer has installed an automatic sorting device at great cost. The concepts of sensitivity and specificity, in particular the positive predictive value and the negative predictive value, may be used to assess the justification of the investment.
A manufacturer may like to estimate the probability of failure of a certain mission given certain conditions, say for example an R&D project under certain conditions. It is usually easier to look at failed and successful projects, in so-called case-control studies, and assess the probabilities. Bayes theorem may then be used to find the probability of failure. Accordingly, the company may be guided about making prudent investments.

Review Questions
What is a random experiment?
What is an event? What is the meaning of probability?
Let A and B be two events. What is meant by "A and B are independent"?
Define conditional probability. Are P(A|B) and P(B|A) the same? If not, why not?
Explain the concept of conditional independence. How is it used for classification?
