
Introduction to Probability and Statistics

Nagarajan Krishnamurthy
Introduction to Business Statistics for EPGP 2015-16 batch
Indian Institute of Management Indore

Thanks to Prof. Arun Kumar and Prof. Ravindra Gokhale, co-instructors of QT1, AY 2012-13

Part 1: Summarizing and Visualizing a Data Set

Types of Data
Quantitative Data: Data for which arithmetic operations make sense. E.g.: Age, Salary, Length.

Categorical Data: Data obtained by putting individuals in different categories. E.g.: Gender, States of a country.

Visualization

Quantitative Data: Histogram, Stem-Leaf plot, Box plot

Categorical Data: Pie Chart, Bar chart

*Discuss Cafe data (using Excel)

Interpreting a Histogram

Shape: symmetric, skewed; unimodal, bimodal, ...; leptokurtic, platykurtic, mesokurtic
Center: mean, median
Spread: range, standard deviation, inter-quartile range

Measure of the central tendency of a data set

Mean: If we have a data set $x_1, \dots, x_n$ then the mean of the data set is $\frac{x_1 + \dots + x_n}{n}$.

Notation: $\bar{x}$

Mean: Example

The mean of 0,5,1,1,3 is 2.

Measure of the Central Tendency of a Data Set

Median: Middle number in a sorted data set. When the number of observations (sample size) is an even number then there are two middle numbers. In that case, we take the average of the two middle numbers to obtain the median.

Notation: $\tilde{x}$

Median: Example 1

For example, the median of 0,5,1,1,3 is 1 because 1 is the middle number of the sorted data, i.e. 0,1,1,3,5.

Median: Example 2

The median of 3,2,5,6,4,5,3,5 is 4.5 because 4.5 is the average of the two middle numbers of the sorted data, i.e. 2,3,3,4,5,5,5,6.

Measure of the Central Tendency of a Data Set

Mode: Observation in the data set with the largest frequency. Note that we can have more than one mode for a data set.

Mode: Example

For example the mode of 0,5,1,1,3 is 1.
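
The three measures of central tendency can be checked with a short Python sketch. It assumes the standard-library `statistics` module and the example data set 0,5,1,1,3; `multimode` is used here because a data set can have more than one mode.

```python
import statistics

data = [0, 5, 1, 1, 3]

mean = statistics.mean(data)        # (0 + 5 + 1 + 1 + 3) / 5
median = statistics.median(data)    # middle value of the sorted data 0,1,1,3,5
modes = statistics.multimode(data)  # every value with the largest frequency

print(mean, median, modes)
```

For an even number of observations, `statistics.median` averages the two middle numbers, which matches the rule used in Median: Example 2.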

Effect of an Outlier

Calculate mean, median, and mode of 0,5,1,1,3,100.

mean=18.33, median=2, mode=1.

An outlier pulls the mean towards it but may not affect the median and mode.
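
A minimal sketch of this effect, assuming the standard-library `statistics` module: it computes the three measures for 0,5,1,1,3 with and without the extra observation 100.

```python
import statistics

without_outlier = [0, 5, 1, 1, 3]
with_outlier = without_outlier + [100]

for label, data in [("without outlier", without_outlier),
                    ("with outlier", with_outlier)]:
    print(label,
          "mean =", round(statistics.mean(data), 2),
          "median =", statistics.median(data),
          "mode =", statistics.mode(data))
```

The mean moves from 2 to 18.33, while the median only moves from 1 to 2 and the mode stays at 1.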

Identifying Relation Between Mean and Median from Histogram

Symmetric: mean $\approx$ median

Left skewed: Mean < Median < Mode (in general)

Right skewed: Mean > Median > Mode (in general)

Kurtosis

Leptokurtic
Platykurtic
Mesokurtic

Measure of the Spread of a Data Set

Range: max-min

Ex: 0,5,1,1,3; what is the range?

Range = 5 - 0 = 5.

Measure of the Spread of a Data Set

Variance: $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

Standard deviation: $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

Variance and Standard Deviation: Example

What are the variance and the standard deviation of 3,3,3,3,3?

Ans. variance = 0, standard deviation = 0

What are the variance and the standard deviation of 1,2,3,4,5?

Ans. variance = 2.5, standard deviation = 1.58
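
These answers can be reproduced with a small sketch, assuming the standard-library `statistics` module. Its `variance` and `stdev` functions divide by n - 1, which is the same convention as the formulas on the slides.

```python
import statistics

for data in ([3, 3, 3, 3, 3], [1, 2, 3, 4, 5]):
    variance = statistics.variance(data)  # sample variance: divides by n - 1
    std_dev = statistics.stdev(data)      # square root of the sample variance
    print(data, "variance =", variance,
          "standard deviation =", round(std_dev, 2))
```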

Standard Deviation

Standard deviation is always greater than or equal to zero.

Does Standard Deviation Get Affected by Outliers?

What is the standard deviation for the data 3,3,3,3,100?

Ans. 43.38

Is Standard Deviation Always a Good Measure of the Spread of a Data Set?

Not a good measure when data is skewed or has outliers.

Quartiles

First quartile: 25th percentile

Notation: Q1

Third quartile: 75th percentile

Notation: Q3

Exercise
Find the first and third quartile of 8,7,1,4,6,6,4,5,7,6,3,0.

Ans. The sorted data is 0,1,3,4,4,5,6,6,6,7,7,8. The median of the lower half of the data (0,1,3,4,4,5) is 3.5 (Q1) and the median of the upper half of the data (6,6,6,7,7,8) is 6.5 (Q3).
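
The median-of-halves method used in this answer can be sketched in Python as follows. The function name `quartiles_by_halves` is a hypothetical helper, and for an odd sample size this sketch assumes the middle observation is left out of both halves. Other quartile conventions (for example those in spreadsheet software or in `statistics.quantiles`) can give slightly different values.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two observations.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                  # smallest `half` observations
    upper = ordered[len(ordered) - half:]   # largest `half` observations
    return statistics.median(lower), statistics.median(upper)

q1, q3 = quartiles_by_halves([8, 7, 1, 4, 6, 6, 4, 5, 7, 6, 3, 0])
print(q1, q3)
```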

Quartiles

Median is the second quartile (Q2).

Measure of the Spread of a Data Set

Inter Quartile Range (IQR): Q3 - Q1

*IQR is a robust measure of spread. IQR does not get affected much by skewness or outliers.

Exercise

Find IQR of 8,7,1,4,6,6,4,5,7,6,3,0.

Q3-Q1=6.5-3.5=3.

Five Number Summary

Minimum
First quartile
Median
Third quartile
Maximum

Boxplot

*We will create a box plot for the Cafe data set.

Interpreting a Box Plot

Shape:

Outliers: Any observation not in the range [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] is considered an outlier (Informal Rule).
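
As a rough illustration of the informal rule, the sketch below computes the two fences for the quartile exercise data and lists any observation outside them. The helper `outlier_fences` is hypothetical, and it assumes quartiles are taken as the medians of the lower and upper halves of the sorted data.

```python
import statistics

def outlier_fences(values):
    """Return (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR) using median-of-halves quartiles."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[len(ordered) - half:])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [8, 7, 1, 4, 6, 6, 4, 5, 7, 6, 3, 0]
low, high = outlier_fences(data)
outliers = [x for x in data if x < low or x > high]
print(low, high, outliers)
```

For this data set Q1 = 3.5 and Q3 = 6.5, so the fences are -1 and 11 and no observation is flagged.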

Why Do We Need Box Plot?

To compare two or more data sets.


Visualization of summary statistics.

Categorical Data Visualization

*Bar Chart

*Pie Chart

Show billionaires data.

Part 2: Introduction to Probability

Describing Shape of a Bar Graph

Proportion of observations in a particular category.

Describing Shape of a Histogram

Proportion of observations in a particular class interval.

Probability

Proportion: computed from a sample

Probability: the corresponding proportion in the population

Example
Workforce distribution in the United States.
Industry                            Probability
Agriculture                         0.130
Construction                        0.147
Finance, Insurance, Real Estate     0.059
Manufacturing                       0.042
Mining                              0.002
Services                            0.419
Trade                               0.159
Transportation, Public Utilities    0.042

Sample Space

Def: Set of all possible outcomes.

E.g.: $\Omega$ = {Agriculture, Construction, ..., Services, Trade, Transportation and Public Utilities}

Simple Events

Simple event: An event in the finest partition of the sample space.

Example: $\omega_1$ = Agriculture, $\omega_2$ = Construction.

Event

Def: Any subset of the sample space

E.g.: {Agriculture, Construction}

Exercise

A bowl contains three red and two yellow balls. Two balls are
randomly selected and their colors recorded. Use a tree
diagram to list the 20 simple events in the experiment, keeping
in mind the order in which the balls are drawn.
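
In place of a tree diagram, the simple events can also be enumerated programmatically. This is a sketch under two assumptions: the balls are drawn without replacement (which is what gives 20 ordered outcomes), and balls of the same color are told apart by hypothetical labels such as R1 and Y2.

```python
from itertools import permutations

# Label the balls so that balls of the same color are distinguishable.
balls = ["R1", "R2", "R3", "Y1", "Y2"]

# Ordered draws of two different balls (no replacement): 5 * 4 outcomes.
simple_events = list(permutations(balls, 2))
print(len(simple_events))

# Simple events in which both balls are red.
both_red = [e for e in simple_events
            if e[0].startswith("R") and e[1].startswith("R")]
print(len(both_red), "of", len(simple_events))
```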

Other Approaches for Calculating Probabilities

Classical Approach: Assuming all outcomes to be equally likely, the probability of an event is the number of favourable outcomes divided by the total number of outcomes. E.g.: Rolling a die.

Subjective Approach: Assigning probability to an event based on one's experience.


Probability

P(Agriculture) = 0.13
P(Either Agriculture or Construction or both) = $P(\text{Agriculture} \cup \text{Construction})$ = 0.13 + 0.147 = 0.277.
P(Agriculture and Construction) = $P(\text{Agriculture} \cap \text{Construction})$ = 0.
P(Not in Agriculture) = $P(\text{Agriculture}^c)$ = 1 - 0.13 = 0.87.
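
The same calculations can be written as set operations on the sample space. This is only a sketch: the dictionary restates the workforce table, and `prob` is a hypothetical helper that sums the probabilities of the simple events in an event.

```python
workforce = {
    "Agriculture": 0.130,
    "Construction": 0.147,
    "Finance, Insurance, Real Estate": 0.059,
    "Manufacturing": 0.042,
    "Mining": 0.002,
    "Services": 0.419,
    "Trade": 0.159,
    "Transportation, Public Utilities": 0.042,
}

def prob(event):
    """Probability of an event given as a set of industries (simple events)."""
    return sum(workforce[industry] for industry in event)

agriculture = {"Agriculture"}
construction = {"Construction"}

p_union = prob(agriculture | construction)              # union of the two events
p_intersection = prob(agriculture & construction)       # empty set, so 0
p_not_agriculture = prob(set(workforce) - agriculture)  # complement
print(round(p_union, 3), p_intersection, round(p_not_agriculture, 3))
```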

Compound Events
If A and B are two events then

Union event is $A \cup B$

Intersection event is $A \cap B$

Complement event is $A^c$

Venn Diagram Representation

[Figure: Venn diagrams. Panels: disjoint events A and B in sample space S; union $A \cup B$; mutually exclusive and exhaustive events A, B, C, and D]

Probability Rules

1. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
2. $P(A^c) = 1 - P(A)$
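
Both rules can be checked on a small example using the classical approach. The sketch below assumes a fair six-sided die; the events A and B are illustrative choices, not taken from the slides.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # faces of a fair die (assumed example)

def prob(event):
    """Classical probability: favourable outcomes / total outcomes."""
    return Fraction(len(event), len(sample_space))

A = {2, 4, 6}   # hypothetical event: an even number
B = {4, 5, 6}   # hypothetical event: a number greater than 3

# Rule 1: P(A or B) = P(A) + P(B) - P(A and B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)
# Rule 2: P(not A) = 1 - P(A)
assert prob(sample_space - A) == 1 - prob(A)
print(prob(A | B), prob(sample_space - A))
```

Subtracting $P(A \cap B)$ in rule 1 avoids counting the outcomes 4 and 6 twice, since they belong to both events.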

Mutually Exclusive

Def: Two events are mutually exclusive if they do not have any common outcome.

E.g.: Agriculture and Construction are mutually exclusive events.

Mutually Exclusive

A and B are mutually exclusive if $P(A \cap B) = 0$.

This implies that for mutually exclusive events A and B, $P(A \cup B) = P(A) + P(B)$.

Pizza Venn Diagram

What is the sample space?

Sample space = {Tomato only, Fish only, Mushroom-Tomato, Mushroom-Tomato-Fish, Mushroom-Fish, No toppings}.

Probability of the events in the sample space

P(Tomato only) = 2/8; P(Fish only) = 1/8.
P(Mushroom-Tomato) = 2/8 = 1/4; P(Mushroom-Tomato-Fish) = 1/8.
P(Mushroom-Fish) = 1/8; P(No toppings) = 1/8.

Union Rule

What is the probability that your slice will have tomato or mushroom?

Ans. 6/8=3/4

Intersection Rule

What is the probability that your slice will have tomato and
mushroom?

Ans. 3/8

Complement Rule

What is the probability that your slice will not have tomato?

Ans. 3/8

Conditional Probability

You have pulled out a slice of pizza that has tomato on it.
What is the probability that your slice will have mushrooms?

Ans. 3/5.
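
The pizza answers can be reproduced by counting slices. The sketch below assumes a pizza of 8 equally likely slices, with slice counts read off the probabilities listed for the sample space; `prob` is a hypothetical helper.

```python
from fractions import Fraction

# Number of slices (out of 8) for each outcome in the sample space.
slices = {
    frozenset({"tomato"}): 2,
    frozenset({"fish"}): 1,
    frozenset({"mushroom", "tomato"}): 2,
    frozenset({"mushroom", "tomato", "fish"}): 1,
    frozenset({"mushroom", "fish"}): 1,
    frozenset(): 1,  # no toppings
}
total = sum(slices.values())

def prob(condition):
    """Probability that a randomly chosen slice satisfies `condition`."""
    favourable = sum(n for toppings, n in slices.items() if condition(toppings))
    return Fraction(favourable, total)

p_union = prob(lambda t: "tomato" in t or "mushroom" in t)
p_both = prob(lambda t: "tomato" in t and "mushroom" in t)
p_no_tomato = prob(lambda t: "tomato" not in t)
p_tomato = prob(lambda t: "tomato" in t)

# Conditional probability: restrict attention to the slices with tomato.
p_mushroom_given_tomato = p_both / p_tomato
print(p_union, p_both, p_no_tomato, p_mushroom_given_tomato)
```

Conditioning on tomato shrinks the sample space from 8 slices to the 5 slices that have tomato, 3 of which also have mushroom.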

Conditional Probability

Def: Probability of event A in event B. That is, the probability that event A occurs given that B occurs.

Notation: $P(A|B)$

Multiplication rule

$P(A \cap B) = P(A)P(B|A)$
$P(A \cap B) = P(B)P(A|B)$

Statistical Independence

Two events are said to be independent if the occurrence of one has no effect on the chance of occurrence of the other.

Two events A and B are considered independent when P(A|B) = P(A).
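
As an illustration of this definition, the sketch below tests whether "mushroom" and "tomato" are independent in the pizza example. The three probabilities are obtained by adding the slice probabilities from that example (8 slices assumed).

```python
from fractions import Fraction

# Tomato: tomato only (2) + mushroom-tomato (2) + mushroom-tomato-fish (1).
p_tomato = Fraction(5, 8)
# Mushroom: mushroom-tomato (2) + mushroom-tomato-fish (1) + mushroom-fish (1).
p_mushroom = Fraction(4, 8)
# Both: mushroom-tomato (2) + mushroom-tomato-fish (1).
p_mushroom_and_tomato = Fraction(3, 8)

p_mushroom_given_tomato = p_mushroom_and_tomato / p_tomato
independent = p_mushroom_given_tomato == p_mushroom
print(p_mushroom_given_tomato, p_mushroom, independent)
```

Since P(Mushroom|Tomato) = 3/5 differs from P(Mushroom) = 1/2, the two events are not independent on this pizza.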

Exercise 1

Is gender related to whether someone voted in the last mayoral election? Answer the question using the joint probabilities given in the table below.

Voted in the last mayoral election


Yes
No

Gender
Female Male
0.25
0.18
0.33
0.24

Statistical Independence

If two events A and B are independent then $P(A \cap B) = P(A)P(B)$.

Law of Total Probability

Given a set of events $S_1, S_2, \dots, S_k$ that are mutually exclusive and exhaustive, and an event A, the probability of the event A can be expressed as
$P(A) = P(S_1)P(A|S_1) + P(S_2)P(A|S_2) + P(S_3)P(A|S_3) + \dots + P(S_k)P(A|S_k)$

Exercise 2
A business group owns three five-star hotels (say, A, B, and C) in India. By studying the past behavior of the revenue obtained from the three hotels month by month, it has been observed that the probability of an increase in the revenue of either B or C or both of them is 0.5. If A's revenue increases in a given month, the probability of an increase in B's revenue is 0.7, the probability of an increase in C's revenue is 0.6, and the probability of an increase in both B's and C's revenue is 0.5. However, if A's revenue does not increase in a given month, the probability of an increase in B's revenue is 0.2, the probability of an increase in C's revenue is 0.3, and the probability of an increase in both B's and C's revenue is 0.1. What is the probability that the revenues of all three hotels, A, B, and C, increase in a given month?
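
One possible way to attack this exercise is sketched below. It is not presented as the instructor's solution: it assumes the intended route is the union rule applied conditionally on A, then the law of total probability to recover P(A), then the multiplication rule. All variable names are illustrative.

```python
# Conditional probabilities given in the exercise.
p_b_given_a, p_c_given_a, p_bc_given_a = 0.7, 0.6, 0.5
p_b_given_not_a, p_c_given_not_a, p_bc_given_not_a = 0.2, 0.3, 0.1
p_b_or_c = 0.5

# Union rule, applied conditionally on A and on not-A.
p_b_or_c_given_a = p_b_given_a + p_c_given_a - p_bc_given_a
p_b_or_c_given_not_a = p_b_given_not_a + p_c_given_not_a - p_bc_given_not_a

# Law of total probability:
#   P(B or C) = P(A) * P(B or C | A) + (1 - P(A)) * P(B or C | not A),
# solved here for P(A).
p_a = (p_b_or_c - p_b_or_c_given_not_a) / (p_b_or_c_given_a - p_b_or_c_given_not_a)

# Multiplication rule: P(A and B and C) = P(A) * P(B and C | A).
p_all_three = p_a * p_bc_given_a
print(round(p_a, 3), round(p_all_three, 3))
```

Under these steps P(A) comes out as 0.25 and the probability that all three revenues increase as 0.125.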

Exercise 3
You are a physician. You think it is quite likely that one of your patients has strep throat, but you are not sure. You take some swabs from the throat and send them to a lab for testing. The test is (like nearly all lab tests) not perfect. If the patient has strep throat, then 70% of the time the lab says YES but 30% of the time it says NO. If the patient does not have strep throat, then 90% of the time the lab says NO but 10% of the time it says YES. You send five successive swabs to the lab, from the same patient. You get back these results, in order: YNYNY. What do you conclude?
These results are worthless.
It is likely that the patient does not have strep throat.
It is slightly more likely than not that the patient does have strep throat.
It is very much more likely than not that the patient does have strep throat.

Bayes' Rule

Let $S_1, S_2, \dots, S_k$ represent k mutually exclusive and exhaustive sub-populations with prior probabilities $P(S_1), P(S_2), \dots, P(S_k)$. If an event A occurs, the posterior probability of $S_i$ given A is the conditional probability
$P(S_i|A) = \frac{P(S_i)P(A|S_i)}{\sum_{j=1}^{k} P(S_j)P(A|S_j)}$

Exercise

Strep Throat Exercise
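
A hedged sketch of how Bayes' rule could be applied to the strep throat exercise follows. The exercise only says strep is "quite likely", so the prior values 0.5 and 0.9 used below are illustrative assumptions, not given data. The function `posterior_strep` is a hypothetical helper, and it further assumes that the five swab results are independent given the patient's true state.

```python
def posterior_strep(prior, results, p_yes_if_strep=0.7, p_yes_if_healthy=0.1):
    """Posterior probability of strep throat after a sequence of 'Y'/'N' results.

    Assumes the results are independent given the patient's true state.
    """
    like_strep = like_healthy = 1.0
    for r in results:
        like_strep *= p_yes_if_strep if r == "Y" else 1 - p_yes_if_strep
        like_healthy *= p_yes_if_healthy if r == "Y" else 1 - p_yes_if_healthy
    # Bayes' rule with two sub-populations: strep and no strep.
    numerator = prior * like_strep
    return numerator / (numerator + (1 - prior) * like_healthy)

# The prior is NOT given in the exercise; 0.5 and 0.9 are assumptions.
for prior in (0.5, 0.9):
    print(prior, round(posterior_strep(prior, "YNYNY"), 3))
```

The sequence YNYNY is about 38 times more probable if the patient has strep throat (0.7^3 × 0.3^2) than if not (0.1^3 × 0.9^2), so under either assumed prior the posterior probability of strep throat is above 0.97.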

Bibliography

An Introduction to Probability and Inductive Logic, by Ian Hacking
Introduction to Probability and Statistics, by William Mendenhall, Robert J. Beaver, and Barbara M. Beaver
Practice of Business Statistics, by David S. Moore, George P. McCabe, William M. Duckworth, and Stanley L. Sclove
That Was Venn, This Is Now, by Bradley A. Warner, David Pendergraft, and Timothy Webb, Journal of Statistics Education, Volume 6, Number 1, 1998
