
Statistics and Probability

What is Data?
Data is a collection of facts, such as values or measurements. It can be numbers, words, measurements, observations or even just descriptions of things.

Qualitative vs Quantitative
Data can be qualitative or quantitative:

Qualitative data is descriptive information (it describes something).
Quantitative data is numerical information (numbers).

Quantitative data can also be Discrete or Continuous:

Discrete data can only take certain values (like whole numbers).
Continuous data can take any value (within a range).

Put simply: Discrete data is counted, Continuous data is measured.

To help you remember, think "Quantitative is about Quantity".

Collecting
Data can be collected in many ways. The simplest way is direct observation.

Example: you want to find how many cars pass by a certain point on a road in a 10-minute interval. So: simply stand at that point on the road, and count the cars that pass by in that interval.

You collect data by doing a Survey.

Census or Sample
A Census is when you collect data for every member of the group (the whole "population").
A Sample is when you collect data just for selected members of the group.

Example: there are 120 people in your local football club. You can ask everyone (all 120) what their age is. That is a census. Or you could just choose the people that are there this afternoon. That is a sample.

A census is accurate, but hard to do. A sample is not as accurate, but may be good enough, and is a lot easier.

Language
Data or Datum?
The singular form is "datum", so we would say "that datum is very high". "Data" is the plural so we can say "the data are available", but it is also a collection of facts, so "the data is available" is fine too.

Discrete and Continuous Data


Data can be Descriptive (like "high" or "fast") or Numerical (numbers). And Numerical Data can be Discrete or Continuous:

Discrete data is counted, Continuous data is measured

Discrete Data
Discrete Data can only take certain values. Example: the number of students in a class (you can't have half a student).

Continuous Data
Continuous Data is data that can take any value (within a range).

Examples:

A person's height: could be any value (within the range of human heights), not just certain fixed heights
Time in a race: you could even measure it to fractions of a second
A dog's weight
The length of a leaf
Lots more!

Finding a Central Value


When you have two or more numbers it is nice to find a value for the "center".

2 Numbers
With just 2 numbers the answer is easy: go half-way in-between.

Example: what is the central value for 3 and 7?


Answer: Half-way in-between, which is 5.

You can calculate it by adding 3 and 7 and then dividing the result by 2:

(3+7) / 2 = 10/2 = 5

3 or More Numbers
You can use the same idea when you have 3 or more numbers:

Example: what is the central value of 3, 7 and 8?


Answer: You calculate it by adding 3, 7 and 8 and then dividing the result by 3 (because there are 3 numbers):

(3+7+8) / 3 = 18/3 = 6

Notice that we divided by 3 because we had 3 numbers ... very important!

The Mean
So far we have been calculating the Mean (or the Average):

Mean: Add up the numbers and divide by how many numbers.


But sometimes the Mean can let you down:

Example: Birthday Activities


Uncle Bob wants to know the average age at the party, to choose an activity. There will be 6 kids aged 13, and also 5 babies aged 1. Add up all the ages, and divide by 11 (because there are 11 numbers):

(13+13+13+13+13+13+1+1+1+1+1) / 11 = 7.5...

The mean age is about 7½, so he gets a Jumping Castle!

The 13 year olds are embarrassed, and the 1-year olds can't jump!

The Mean was accurate, but in this case it was not useful.

The Median
But you could also use the Median: simply list all numbers in order and choose the middle one:

Example: Birthday Activities (continued)


List the ages in order:

1, 1, 1, 1, 1, 13, 13, 13, 13, 13, 13


Choose the middle number:

1, 1, 1, 1, 1, [13], 13, 13, 13, 13, 13


The Median age is 13 ... so let's have a Disco!

Sometimes there are two middle numbers. Just average them:

Example: What is the Median of 3, 4, 7, 9, 12, 15?


There are two numbers in the middle:

3, 4, [7, 9], 12, 15
So we average them:

(7+9) / 2 = 16/2 = 8
The Median is 8

The Mode
The Mode is the value that occurs most often:

Example: Birthday Activities (continued)


Group the numbers so we can count them:

1, 1, 1, 1, 1, 13, 13, 13, 13, 13, 13


"13" occurs 6 times, "1" occurs only 5 times, so the mode is 13.

How to remember? Think "mode is most".

But Mode can be tricky: there can sometimes be more than one Mode.

Example: What is the Mode of 3, 4, 4, 5, 6, 6, 7?


Well ... 4 occurs twice but 6 also occurs twice. So both 4 and 6 are modes. When there are two modes it is called "bimodal", when there are three or more modes we call it "multimodal".

Conclusion
There are other ways of measuring central values, but Mean, Median and Mode are the most common. Use the one that best suits your data. Or better still, use all three!
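As a quick check of the three measures, here is a minimal sketch using Python's standard `statistics` module on the birthday ages. The variable names are only illustrative.

```python
from statistics import mean, median, multimode

# Ages from the birthday example: six kids aged 13 and five babies aged 1.
ages = [13] * 6 + [1] * 5

mean_age = mean(ages)        # sum divided by count: 83 / 11
median_age = median(ages)    # middle value of the sorted list
modes = multimode(ages)      # every value tied for "most often"

print(f"mean    = {mean_age:.2f}")
print(f"median  = {median_age}")
print(f"mode(s) = {modes}")

# The two-mode example: both 4 and 6 occur twice.
two_modes = multimode([3, 4, 4, 5, 6, 6, 7])
print(two_modes)
```

`multimode` returns a list, so it copes with the bimodal case without any special handling.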

How to Find the Mean


The mean is just the average of the numbers. It is easy to calculate: add up all the numbers, then divide by how many numbers there are.

In other words it is the sum divided by the count.

Example 1: What is the Mean of these numbers?

6, 11, 7
Add the numbers: 6 + 11 + 7 = 24
Divide by how many numbers (there are 3 numbers): 24 / 3 = 8

The Mean is 8

Why Does This Work?


It is because 6, 11 and 7 added together is the same as 3 lots of 8:

6 + 11 + 7 = 24 = 8 + 8 + 8

It is like you are "flattening out" the numbers.

Example 2: Look at these numbers:

3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29
The sum of these numbers is 330
There are fifteen numbers.
The mean is equal to 330 / 15 = 22

The mean of the above numbers is 22

Negative Numbers
How do you handle negative numbers? Adding a negative number is the same as subtracting the number (without the negative). For example 3 + (-2) = 3-2 = 1. Knowing this, let us try an example:

Example 3: Find the mean of these numbers:

3, -7, 5, 13, -2
The sum of these numbers is 3 - 7 + 5 + 13 - 2 = 12
There are 5 numbers.
The mean is equal to 12 / 5 = 2.4

The mean of the above numbers is 2.4
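The "sum divided by count" rule can be written as a very small function. This is only a sketch; `mean_of` is a made-up name, and negative numbers need no special treatment because adding a negative number already subtracts it.

```python
def mean_of(numbers):
    """Return the sum of the numbers divided by how many there are."""
    if not numbers:
        raise ValueError("need at least one number")
    return sum(numbers) / len(numbers)


example_1 = mean_of([6, 11, 7])
example_2 = mean_of([3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29])
example_3 = mean_of([3, -7, 5, 13, -2])   # includes negative numbers

print(example_1, example_2, example_3)
```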


Now have a look at The Mean Machine.

How to Find the Median Value


It's the middle number in a sorted list.

Median Value
The Median is the "middle number" (in a sorted list of numbers).

How to Find the Median Value


To find the Median, place the numbers you are given in value order and find the middle number.

Example: find the Median of {12, 3 and 5}


Put them in order:

3, 5, 12
The middle number is 5, so the median is 5.

Example 2
Look at these numbers:

3, 13, 7, 5, 21, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29
If we put those numbers in order we have:

3, 5, 7, 12, 13, 14, 21, 23, 23, 23, 23, 29, 39, 40, 56
There are fifteen numbers. Our middle number will be the eighth number:

3, 5, 7, 12, 13, 14, 21, [23], 23, 23, 23, 29, 39, 40, 56
The median value of this set of numbers is 23.
(Note that it didn't matter if we had some numbers the same in the list)

Two Numbers in the Middle


BUT, if there is an even number of values, things are slightly different. In that case we need to find the middle pair of numbers, and then find the value that would be half way between them. This is easily done by adding them together and dividing by two. An example will help:

3, 13, 7, 5, 21, 23, 23, 40, 23, 14, 12, 56, 23, 29
If we put those numbers in order we have:

3, 5, 7, 12, 13, 14, 21, 23, 23, 23, 23, 29, 40, 56
There are now fourteen numbers and so we don't have just one middle number, we have a pair of middle numbers:

3, 5, 7, 12, 13, 14, [21, 23], 23, 23, 23, 29, 40, 56
In this example the middle numbers are 21 and 23. To find the value half-way between them, add them together and divide by 2:

21 + 23 = 44
44 / 2 = 22

And, so, the Median in this example is 22.


(Note that 22 was not in the list of numbers ... but that is OK, because half the numbers in the list are less, and half the numbers are greater.)


Which is the Middle Number?


A quick way to know which is the middle number: count how many numbers, add 1 then divide by 2

Example: There are 45 numbers


45 plus 1 is 46, then divide by 2 and you get 23.
So the median is the 23rd number in the sorted list.

Example: There are 66 numbers in the sorted list


66 plus 1 is 67, then divide by 2 and you get 33.5

33 and a half? That means that the 33rd and 34th numbers in the sorted list are the two middle numbers.

So to find the median: add the 33rd and 34th numbers together and divide by 2.
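The "count, add 1, divide by 2" rule translates directly into code. The sketch below uses a hypothetical helper, `median_by_position`. It sorts the list, works out the 1-based position, and averages the two neighbours when the position ends in a half.

```python
def median_by_position(numbers):
    """Find the median using the (count + 1) / 2 position rule."""
    ordered = sorted(numbers)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one number")
    position = (n + 1) / 2                 # 1-based position, may end in .5
    if position.is_integer():
        return ordered[int(position) - 1]  # a single middle number
    lower = ordered[int(position) - 1]     # e.g. the 33rd value when position is 33.5
    upper = ordered[int(position)]         # e.g. the 34th value
    return (lower + upper) / 2


odd_list = [3, 13, 7, 5, 21, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29]
even_list = [3, 13, 7, 5, 21, 23, 23, 40, 23, 14, 12, 56, 23, 29]
print(median_by_position(odd_list))    # fifteen numbers: the 8th one
print(median_by_position(even_list))   # fourteen numbers: average of 7th and 8th
```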

How to Find the Mode or Modal Value


The mode is simply the number which appears most often.

Finding the Mode


To find the mode, or modal value, first put the numbers in order, then count how many of each number.

Example: 3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29
In order these numbers are:

3, 5, 7, 12, 13, 14, 20, 23, 23, 23, 23, 29, 39, 40, 56
This makes it easy to see which numbers appear most often.

In this case the mode is 23.

Another Example: {19, 8, 29, 35, 19, 28, 15}
Arrange them in order: {8, 15, 19, 19, 28, 29, 35}
19 appears twice, all the rest appear only once, so 19 is the mode.

How to remember? Think "mode is most".

More Than One Mode


You can have more than one mode.

Example: {1, 3, 3, 3, 4, 4, 6, 6, 6, 9}
3 appears three times, as does 6. So there are two modes: at 3 and 6.

Having two modes is called "bimodal". Having more than two modes is called "multimodal".

Grouping
When all values appear the same number of times the idea of a mode is not useful. But you could group them to see if one group has more than the others.

Example: {4, 7, 11, 16, 20, 22, 25, 26, 33}


Each value occurs once, so let us try to group them. We can try groups of 10:

0-9: 2 values (4 and 7)
10-19: 2 values (11 and 16)
20-29: 4 values (20, 22, 25 and 26)
30-39: 1 value (33)

In groups of 10, the "20s" appear most often, so we could choose 25 as the mode.

You could use different groupings and get a different answer!
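Counting and grouping are easy to try out in code. This sketch uses `multimode` from Python's standard library for the plain mode. The helper `group_counts` is a made-up name for illustration, with the group width of 10 taken from the example.

```python
from collections import Counter
from statistics import multimode

one_mode = multimode([3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29])
two_modes = multimode([1, 3, 3, 3, 4, 4, 6, 6, 6, 9])
print(one_mode, two_modes)


def group_counts(values, width=10):
    """Count how many values fall in each group: 0-9, 10-19, 20-29, ..."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


# Keys are the lower edge of each group, so 20 stands for "20-29".
groups = group_counts([4, 7, 11, 16, 20, 22, 25, 26, 33])
busiest = max(groups, key=groups.get)
print(groups, busiest)
```

Changing `width` changes the groups, which is exactly why a different grouping can give a different answer.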

The Mean from a Frequency Table


It is easy to calculate the Mean: Add up all the numbers, then divide by how many numbers there are.

Example 1: What is the Mean of these numbers?

6, 11, 7
Add the numbers: 6 + 11 + 7 = 24
Divide by how many numbers (there are 3 numbers): 24 / 3 = 8

The Mean is 8
But sometimes you won't have a simple list of numbers, you might have a frequency table like this (the "frequency" says how often they occur):

Score      1  2  3  4  5
Frequency  2  5  4  2  1

(it says that score 1 occurred 2 times, score 2 occurred 5 times, etc)

You could list all the numbers like this:

Mean = (1+1 + 2+2+2+2+2 + 3+3+3+3 + 4+4 + 5) / (how many numbers)

But rather than do lots of adds (like 3+3+3+3) it is often easier to use multiplication:

Mean = (2×1 + 5×2 + 4×3 + 2×4 + 1×5) / (how many numbers)

And rather than count how many numbers there are, we can add up the frequencies:

Mean = (2×1 + 5×2 + 4×3 + 2×4 + 1×5) / (2+5+4+2+1)

So let's calculate:

Mean = (2 + 10 + 12 + 8 + 5) / 14 = 37/14 = 2.64...

And that is how to calculate the mean from a frequency table! Here is another example:

Example: Parking Spaces per House in Hampton Street


Isabella went up and down the street to find out how many parking spaces each house had. Here are her results:

Parking Spaces  Frequency
1               15
2               27
3               8
4               5

What is the mean number of Parking Spaces?

Answer:

Mean = (15×1 + 27×2 + 8×3 + 5×4) / (15+27+8+5) = (15+54+24+20) / 55 = 113/55 = 2.05...

The Mean is 2.05 (to 2 decimal places)


(much easier than adding all numbers separately!)

Notation
Now you know how to do it, let's do that last example again, but using formulas.

This symbol (called Sigma) means "sum up": Σ

(read more at Sigma Notation)

So we can say "add up all frequencies" this way:

Σf (where f is frequency)

And we would use it like this:

Σf = 15 + 27 + 8 + 5 = 55

Likewise we can add up "frequency times score" this way:

Σfx (where f is frequency and x is the matching score)

And the formula for calculating the mean from a frequency table is:

x̄ = Σfx / Σf

The x with the bar on top says "the mean of x".

So now we are ready to do our example above, but with correct notation.

Example: Calculate the Mean of this Frequency Table

x  f
1  15
2  27
3  8
4  5

And here it is:

x̄ = Σfx / Σf = (15×1 + 27×2 + 8×3 + 5×4) / (15+27+8+5) = 113/55 = 2.05...

There you go! You can use sigma notation.

Calculate in the Table


It is often better to do the calculations in the table.

Example: (continued)
From the previous example, calculate f × x in the right-hand column and then do totals:

x        f   fx
1        15  15
2        27  54
3        8   24
4        5   20
TOTALS:  55  113

And the Mean is then easy:

Mean = 113 / 55 = 2.05...
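The same "sum of f times x, divided by sum of f" calculation can be sketched in a few lines. Here the frequency table is assumed to be stored as a dictionary mapping each score x to its frequency f. The function name is hypothetical.

```python
def mean_from_frequency_table(table):
    """table maps each score x to its frequency f; returns sum(f*x) / sum(f)."""
    total_fx = sum(f * x for x, f in table.items())   # the "fx" column total
    total_f = sum(table.values())                     # the "f" column total
    if total_f == 0:
        raise ValueError("the frequencies must not add up to zero")
    return total_fx / total_f


scores = {1: 2, 2: 5, 3: 4, 4: 2, 5: 1}
parking_spaces = {1: 15, 2: 27, 3: 8, 4: 5}

scores_mean = mean_from_frequency_table(scores)
parking_mean = mean_from_frequency_table(parking_spaces)
print(f"{scores_mean:.2f}  {parking_mean:.2f}")
```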

The Range (Statistics)


The Range is the difference between the lowest and highest values.

Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9.

So the range is 9-3 = 6.

It is that simple! But perhaps too simple ...

The Range Can Be Misleading


The range can sometimes be misleading when there are extremely high or low values.

Example: In {8, 11, 5, 9, 7, 6, 3616}: the lowest value is 5, and the highest is 3616.

So the range is 3616-5 = 3611.


The single value of 3616 makes the range large, but most values are around 10. So you may be better off using Interquartile Range or Standard Deviation.

Range of a Function
Range can also mean all the output values of a function, see Domain, Range and Codomain.

Quartiles
Quartiles are the values that divide a list of numbers into quarters.

First put the list of numbers in order
Then cut the list into four equal parts
The Quartiles are at the "cuts"

Like this:

Example: 5, 8, 4, 4, 6, 3, 8
Put them in order: 3, 4, 4, 5, 6, 8, 8

Cut the list into quarters (here each cut falls on a number, shown in brackets):

3, [4], 4, [5], 6, [8], 8

And the result is:

Quartile 1 (Q1) = 4
Quartile 2 (Q2), which is also the Median, = 5
Quartile 3 (Q3) = 8

Sometimes a "cut" is between two numbers ... the Quartile is the average of the two numbers.

Example: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8
The numbers are already in order

Cut the list into quarters:

1, 3, [3], 4, 5 | 6, 6, [7], 8, 8

In this case Quartile 2 is half way between 5 and 6:

Q2 = (5+6)/2 = 5.5
And the result is:

Quartile 1 (Q1) = 3
Quartile 2 (Q2) = 5.5
Quartile 3 (Q3) = 7

Interquartile Range
The "Interquartile Range" is from Q1 to Q3:

To calculate it just subtract Quartile 1 from Quartile 3, like this:

Example: for 3, 4, 4, 5, 6, 8, 8 (where Q1 = 4 and Q3 = 8)

The Interquartile Range is:

Q3 - Q1 = 8 - 4 = 4

Box and Whisker Plot


You can show all the important values in a "Box and Whisker Plot", like this:

A final example covering everything:

Example: Box and Whisker Plot and Interquartile Range for 4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11
Put them in order:

3, 4, 4, 4, 7, 10, 11, 12, 14, 16, 17, 18

Cut it into quarters:

3, 4, 4 | 4, 7, 10 | 11, 12, 14 | 16, 17, 18


In this case all the quartiles are between numbers:

Quartile 1 (Q1) = (4+4)/2 = 4
Quartile 2 (Q2) = (10+11)/2 = 10.5
Quartile 3 (Q3) = (14+16)/2 = 15

Also:

The Lowest Value is 3,
The Highest Value is 18

So now we have enough data for the Box and Whisker Plot:

And the Interquartile Range is:

Q3 - Q1 = 15 - 4 = 11
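One way to reproduce the quartiles in these examples is the "median of each half" approach: Q2 is the median of the whole list, Q1 is the median of the lower half and Q3 is the median of the upper half, leaving the middle value out of both halves when the count is odd. That rule is an assumption chosen because it matches the worked examples here. Spreadsheets and statistics libraries often use other conventions and can give slightly different quartiles. The function name is made up.

```python
from statistics import median


def quartiles(numbers):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(numbers)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two numbers")
    half = n // 2
    lower = ordered[:half]
    # Skip the middle value itself when the count is odd.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(min(data), q1, q2, q3, max(data), iqr)   # the five box-plot values, then the IQR
```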

Standard Deviation and Variance


Deviation just means how far from the normal

Standard Deviation
The Standard Deviation is a measure of how spread out numbers are.

Its symbol is σ (the Greek letter sigma)

The formula is easy: it is the square root of the Variance. So now you ask, "What is the Variance?"

Variance
The Variance is defined as: The average of the squared differences from the Mean.

To calculate the variance follow these steps:

Work out the Mean (the simple average of the numbers)
Then for each number: subtract the Mean and square the result (the squared difference).
Then work out the average of those squared differences. (Why Square?)

Example
You and your friends have just measured the heights of your dogs (in millimeters):

The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.

Find out the Mean, the Variance, and the Standard Deviation.

Your first step is to find the Mean:

Answer:
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394

so the mean (average) height is 394 mm. Let's plot this on the chart:

Now, we calculate each dog's difference from the Mean:

206, 76, -224, 36, -94

To calculate the Variance, take each difference, square it, and then average the result:

Variance = (206² + 76² + (-224)² + 36² + (-94)²) / 5
= (42,436 + 5,776 + 50,176 + 1,296 + 8,836) / 5
= 108,520 / 5
= 21,704

So, the Variance is 21,704.

And the Standard Deviation is just the square root of Variance, so:

Standard Deviation: σ = √21,704 = 147.32... = 147 (to the nearest mm)

And the good thing about the Standard Deviation is that it is useful. Now we can show which heights are within one Standard Deviation (147mm) of the Mean:

So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small.

Rottweilers are tall dogs. And Dachshunds are a bit short ... but don't tell them! Now try the Standard Deviation Calculator.

But ... there is a small change with Sample Data


Our example was for a Population (the 5 dogs were the only dogs we were interested in). But if the data is a Sample (a selection taken from a bigger Population), then the calculation changes!

When you have "N" data values that are:

The Population: divide by N when calculating Variance (like we did)
A Sample: divide by N-1 when calculating Variance

All other calculations stay the same, including how we calculated the mean.

Example: if our 5 dogs were just a sample of a bigger population of dogs, we would divide by 4 instead of 5 like this:

Sample Variance = 108,520 / 4 = 27,130
Sample Standard Deviation = √27,130 = 164.71... = 165 (to the nearest mm)
Think of it as a "correction" when your data is only a sample.

Formulas
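Here are the two formulas, explained at Standard Deviation Formulas if you want to know more:

The "Population Standard Deviation":

σ = √( (1/N) × Σ (xᵢ - μ)² )

The "Sample Standard Deviation":

s = √( (1/(N-1)) × Σ (xᵢ - x̄)² )

(where μ is the population mean, x̄ is the sample mean, and the sum runs over all N values)

Looks complicated, but the important change is to divide by N-1 (instead of N) when calculating a Sample Variance.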
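Python's standard `statistics` module has both versions built in, which makes the difference easy to see. In this sketch `pvariance` and `pstdev` divide by N (population), while `variance` and `stdev` divide by N-1 (sample). The data are the five dog heights from the example.

```python
from statistics import pstdev, pvariance, stdev, variance

heights_mm = [600, 470, 170, 430, 300]

population_variance = pvariance(heights_mm)   # divides by N = 5
population_sd = pstdev(heights_mm)
sample_variance = variance(heights_mm)        # divides by N - 1 = 4
sample_sd = stdev(heights_mm)

print(f"population: variance = {population_variance}, sd = {population_sd:.2f}")
print(f"sample:     variance = {sample_variance}, sd = {sample_sd:.2f}")
```

The sample figures are always a little larger, since the same sum of squared differences is divided by a smaller number.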

*Footnote: Why square the differences?


If we just added up the differences from the mean ... the negatives would cancel the positives:

(4 + 4 - 4 - 4) / 4 = 0

So that won't work. How about we use absolute values?

(|4| + |4| + |-4| + |-4|) / 4 = (4 + 4 + 4 + 4) / 4 = 4

That looks good, but what about this case:

(|7| + |1| + |-6| + |-2|) / 4 = (7 + 1 + 6 + 2) / 4 = 4

Oh No! It also gives a value of 4, even though the differences are more spread out!

So let us try squaring each difference (and taking the square root at the end):

√( (4² + 4² + 4² + 4²) / 4 ) = √(64 / 4) = 4

√( (7² + 1² + 6² + 2²) / 4 ) = √(90 / 4) = 4.74...

That is nice! The Standard Deviation is bigger when the differences are more spread out ... just what we want!

In fact this method is a similar idea to distance between points, just applied in a different way.

And it is easier to use algebra on squares and square roots than absolute values, which makes the standard deviation easy to use in other areas of mathematics.
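The comparison in this footnote can be replayed in code. The sketch below defines two illustrative helpers (the names are made up): one averages the absolute differences, the other squares, averages and then takes the square root.

```python
from math import sqrt


def mean_absolute(diffs):
    """Average of the absolute differences."""
    return sum(abs(d) for d in diffs) / len(diffs)


def root_mean_square(diffs):
    """Square each difference, average them, then take the square root."""
    return sqrt(sum(d * d for d in diffs) / len(diffs))


even_diffs = [4, 4, -4, -4]
spread_diffs = [7, 1, -6, -2]

for diffs in (even_diffs, spread_diffs):
    print(diffs, mean_absolute(diffs), round(root_mean_square(diffs), 2))
```

Both lists have the same average absolute difference, but only the squared version reports the second list as more spread out.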

Normal Distribution
Data can be "distributed" (spread out) in different ways.

It can be spread out more on the left

... or more on the right

Or it can be all jumbled up

But there are many cases where the data tends to be around a central value with no bias left or right, and it gets close to a "Normal Distribution" like this:

A Normal Distribution
The "Bell Curve" is a Normal Distribution. And the yellow histogram shows some data that follows it closely, but not perfectly (which is usual).

It is often called a "Bell Curve" because it looks like a bell.

Many things closely follow a Normal Distribution:

heights of people
size of things produced by machines
errors in measurements
blood pressure
marks on a test

We say the data is "normally distributed".

The Normal Distribution has:

mean = median = mode
symmetry about the center
50% of values less than the mean and 50% greater than the mean

Quincunx

You can see a normal distribution being created by random chance! It is called the Quincunx and it is an amazing machine. Have a play with it!

Standard Deviations
The Standard Deviation is a measure of how spread out numbers are (read that page for details on how to calculate it). When you calculate the standard deviation of your data, you will find that (generally):

68% of values are within 1 standard deviation of the mean

95% are within 2 standard deviations

99.7% are within 3 standard deviations

Example: 95% of students at school are between 1.1m and 1.7m tall.
Assuming this data is normally distributed can you calculate the mean and standard deviation? The mean is halfway between 1.1m and 1.7m:

Mean = (1.1m + 1.7m) / 2 = 1.4m

95% is 2 standard deviations either side of the mean (a total of 4 standard deviations) so:

1 standard deviation = (1.7m-1.1m) / 4 = 0.6m / 4 = 0.15m


And this is the result:

It is good to know the standard deviation, because we can say that any value is:

likely to be within 1 standard deviation (68 out of 100 will be)
very likely to be within 2 standard deviations (95 out of 100 will be)
almost certainly within 3 standard deviations (997 out of 1000 will be)
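The 68%, 95% and 99.7% figures can be checked against the theoretical curve with `NormalDist` from Python's standard library (available from Python 3.8). This is a sketch of the idea rather than anything exact about real data. The second half repeats the school heights calculation, where treating the 95% band as exactly 4 standard deviations wide is the same rounding used in the example.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)

# Share of values within k standard deviations of the mean.
within = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}
for k, share in within.items():
    print(f"within {k} standard deviation(s): {share:.1%}")

# School heights example: 95% of students are between 1.1 m and 1.7 m.
low, high = 1.1, 1.7
school_mean = (low + high) / 2   # halfway between the two limits
school_sd = (high - low) / 4     # the 95% band spans about 4 standard deviations
print(f"mean = {school_mean:.2f} m, standard deviation = {school_sd:.2f} m")
```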

Standard Scores
The number of standard deviations from the mean is also called the "Standard Score", "sigma" or "z-score". Get used to those words!

Example: In that same school one of your friends is 1.85m tall

You can see on the bell curve that 1.85m is 3 standard deviations from the mean of 1.4, so:

Your friend's height has a "z-score" of 3.0

It is also possible to calculate how many standard deviations 1.85 is from the mean.

How far is 1.85 from the mean? It is 1.85 - 1.4 = 0.45m from the mean

How many standard deviations is that? The standard deviation is 0.15m, so:

0.45m / 0.15m = 3 standard deviations

So to convert a value to a Standard Score ("z-score"): first subtract the mean, then divide by the Standard Deviation

And doing that is called "Standardizing":

You can take any Normal Distribution and convert it to The Standard Normal Distribution.

Example: Travel Time


A survey of daily travel time had these results (in minutes):

26, 33, 65, 28, 34, 55, 25, 44, 50, 36, 26, 37, 43, 62, 35, 38, 45, 32, 28, 34

The Mean is 38.8 minutes, and the Standard Deviation is 11.4 minutes (you can copy and paste the values into the Standard Deviation Calculator if you want).

Convert the values to z-scores ("standard scores").

To convert 26:

first subtract the mean: 26 - 38.8 = -12.8,
then divide by the Standard Deviation: -12.8/11.4 = -1.12

So 26 is -1.12 Standard Deviations from the Mean

Here are the first three conversions:

Original Value  Calculation          Standard Score (z-score)
26              (26-38.8) / 11.4 =   -1.12
33              (33-38.8) / 11.4 =   -0.51
65              (65-38.8) / 11.4 =   +2.30
...             ...                  ...

And here they are graphically:

You can calculate the rest of the z-scores yourself!

Here is the formula for z-score that we have been using:

z = (x - μ) / σ

z is the "z-score" (Standard Score)
x is the value to be standardized
μ is the mean
σ is the standard deviation
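Converting the whole travel-time list takes only a few lines. This sketch assumes the population standard deviation (`pstdev`), since that is the version that reproduces the 11.4 minutes quoted in the example.

```python
from statistics import mean, pstdev

travel_times = [26, 33, 65, 28, 34, 55, 25, 44, 50, 36,
                26, 37, 43, 62, 35, 38, 45, 32, 28, 34]

mu = mean(travel_times)        # the mean
sigma = pstdev(travel_times)   # population standard deviation

# Standardize: subtract the mean, then divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in travel_times]

for x, z in zip(travel_times[:3], z_scores[:3]):
    print(f"{x:>3} -> {z:+.2f}")
```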

Why Standardize ... ?


It can help you make decisions about your data.

Example: Professor Willoughby is marking a test.


Here are the students' results (out of 60 points):

20, 15, 26, 32, 18, 28, 35, 14, 26, 22, 17

Most students didn't even get 30 out of 60, and most will fail.

The test must have been really hard, so the Prof decides to Standardize all the scores and only fail people more than 1 standard deviation below the mean.

The Mean is 23, and the Standard Deviation is 6.6, and these are the Standard Scores:

-0.45, -1.21, 0.45, 1.36, -0.76, 0.76, 1.82, -1.36, 0.45, -0.15, -0.91

Only 2 students will fail (the ones who scored 15 and 14 on the test)

It also makes life easier because we only need one table (the Standard Normal Distribution Table), rather than doing calculations individually for each value of mean and standard deviation.

In More Detail
Here is the Standard Normal Distribution with percentages for every half of a standard deviation, and cumulative percentages:

Example: Your score in a recent test was 0.5 standard deviations above the average, how many people scored lower than you did?

Between 0 and 0.5 is 19.1%
Less than 0 is 50% (left half of the curve)

So the total less than you is:

50% + 19.1% = 69.1%


In theory 69.1% scored less than you did (but with real data the percentage may be different)
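Instead of reading the percentages off a chart or table, the same figure can be looked up from the cumulative distribution function of the standard normal curve. A minimal sketch with Python's `NormalDist`:

```python
from statistics import NormalDist

z = 0.5
below = NormalDist().cdf(z)         # share of the curve to the left of z
between_mean_and_z = below - 0.5    # the left half of the curve is 50%

print(f"between 0 and {z}: {between_mean_and_z:.1%}")
print(f"less than {z}:     {below:.1%}")
```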

A Practical Example: Your company packages sugar in 1 kg bags.


When you weigh a sample of bags you get these results:

1007g, 1032g, 1002g, 983g, 1004g, ... (a hundred measurements)
Mean = 1010g
Standard Deviation = 20g

Some values are less than 1000g ... can you fix that?

The normal distribution of your measurements looks like this:

31% of the bags are less than 1000g, which is cheating the customer!

Because it is a random thing we can't stop bags having less than 1000g, but we can reduce it a lot ...

if 1000g was at -3 standard deviations there would be only 0.1% (very small)
at -2.5 standard deviations we can calculate: below -3 is 0.1% and between -3 and -2.5 standard deviations is 0.5%, together that is 0.1% + 0.5% = 0.6% (a good choice I think)

So let us adjust the machine to have 1000g at -2.5 standard deviations from the mean.

We could adjust it to:

increase the amount of sugar in each bag (this would change the mean), or
make it more accurate (this would reduce the standard deviation)

Let us try both:

Adjust the mean amount in each bag


The standard deviation is 20g, and we need 2.5 of them:

2.5 × 20g = 50g


So the machine should average 1050g, like this:

Adjust the accuracy of the machine


Or we can keep the same mean (of 1010g), but then we need 2.5 standard deviations to be equal to 10g:

10g / 2.5 = 4g
So the standard deviation should be 4g, like this:

(We hope the machine is that accurate!)

Or perhaps we could have some combination of better accuracy and slightly larger average size, I will leave that up to you!
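Both adjustments can be checked numerically. The sketch below assumes bag weights really are normally distributed, which is the same assumption the example makes. The variable names are only illustrative.

```python
from statistics import NormalDist

target_g = 1000
current = NormalDist(mu=1010, sigma=20)
underweight_now = current.cdf(target_g)      # share of bags below 1000g today

z_wanted = 2.5   # put 1000g at 2.5 standard deviations below the mean

# Option 1: keep the spread of 20g and raise the mean.
new_mean = target_g + z_wanted * 20
# Option 2: keep the mean of 1010g and shrink the spread.
new_sigma = (1010 - target_g) / z_wanted

underweight_option_1 = NormalDist(mu=new_mean, sigma=20).cdf(target_g)
underweight_option_2 = NormalDist(mu=1010, sigma=new_sigma).cdf(target_g)

print(f"now: {underweight_now:.1%}")
print(f"mean {new_mean:.0f}g, sd 20g: {underweight_option_1:.1%}")
print(f"mean 1010g, sd {new_sigma:.0f}g: {underweight_option_2:.1%}")
```

Either option leaves the same share underweight, because both put 1000g at the same z-score.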

Correlation
When two sets of data are strongly linked together we say they have a High Correlation.

The word Correlation is made of Co- (meaning "together"), and Relation.

Correlation is Positive when the values increase together, and
Correlation is Negative when one value decreases as the other increases

Like this:

Correlation can have a value:

1 is a perfect positive correlation
0 is no correlation (the values don't seem linked at all)
-1 is a perfect negative correlation

The value shows how good the correlation is (not how steep the line is), and if it is positive or negative.

Example: Ice Cream Sales


The local ice cream shop keeps track of how much ice cream they sell versus the temperature on that day. Here are their figures for the last 12 days:

Ice Cream Sales vs Temperature

Temperature °C  Ice Cream Sales
14.2            $215
16.4            $325
11.9            $185
15.2            $332
18.5            $406
22.1            $522
19.4            $412
25.1            $614
23.4            $544
18.1            $421
22.6            $445
17.2            $408

And here is the same data as a Scatter Plot:

You can easily see that warmer weather leads to more sales, the relationship is good but not perfect.

In fact the correlation is 0.9575 ... see at the end how I calculated it.

Correlation Is Not Good at Curves


The correlation calculation only works well for relationships that follow a straight line.

Our Ice Cream Example: there has been a heat wave!


It gets so hot that people aren't going near the shop, and sales start dropping. Here is the latest graph:

The correlation is now 0: "No Correlation" ... !


The calculated value of correlation is 0 (trust me, I worked it out), which says there is "no correlation". But we can see the data follows a nice curve that reaches a peak around 25 °C. The correlation calculation is just not "smart" enough to see this.

Moral of the story: make a Scatter Plot, and look at it! You may see more than the correlation value says.

Correlation Is Not Causation


"Correlation Is Not Causation" ... by that I mean: when there is a correlation it does not mean that one thing causes the other

Example: Sunglasses vs Ice Cream


Our Ice Cream shop finds how many sunglasses were sold by a big store for each day and compares them to their ice cream sales:

The correlation between Sunglasses and Ice Cream sales is high


Does this mean that sunglasses make people want ice cream?

How To Calculate
How did I calculate the value 0.9575 at the top? I used "Pearson's Correlation". There is software that can calculate it for you, such as the CORREL() function in Excel or OpenOffice Calc ...

... but here is how to calculate it yourself:


Let us call the two sets of data "x" and "y" (in our case Temperature is x and Ice Cream Sales is y):

Step 1: Find the mean of x, and the mean of y
Step 2: Subtract the mean of x from every x value (call them "a"), do the same for y (call them "b")
Step 3: Calculate: a × b, a² and b² for every value
Step 4: Sum up a × b, sum up a² and sum up b²
Step 5: Divide the sum of a × b by the square root of [(sum of a²) × (sum of b²)]

Here is how I calculated the first Ice Cream example (values rounded to 1 or 0 decimal places):

As a formula it is:

r = Σ(x - x̄)(y - ȳ) / √( Σ(x - x̄)² × Σ(y - ȳ)² )

Where:

Σ is Sigma, the symbol for "sum up"
(x - x̄) is each x-value minus the mean of x (called "a" above)
(y - ȳ) is each y-value minus the mean of y (called "b" above)

You probably won't have to calculate it like that, but at least you know it is not "magic", but simply a routine set of calculations.
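For anyone who would rather let a program do the routine work, the five steps map onto code almost line for line. This is a sketch with a made-up function name, applied to the twelve temperature and sales figures from the ice cream example.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson's correlation, following the five steps literally."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    # Step 1: the two means.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Step 2: differences from the means ("a" and "b").
    a = [x - mean_x for x in xs]
    b = [y - mean_y for y in ys]
    # Steps 3 and 4: products and squares, summed up.
    sum_ab = sum(p * q for p, q in zip(a, b))
    sum_a2 = sum(p * p for p in a)
    sum_b2 = sum(q * q for q in b)
    # Step 5: divide (this fails if either list has no variation at all).
    return sum_ab / sqrt(sum_a2 * sum_b2)


temperature_c = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales_dollars = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

r = pearson_correlation(temperature_c, sales_dollars)
print(f"correlation = {r:.4f}")
```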

Other Methods
There are also other ways to calculate a correlation coefficient, such as "Spearman's rank correlation coefficient" (which works on the ranks of the values rather than the values themselves), but I prefer using a spreadsheet like above.

Skewed Data
Data can be "skewed", meaning it tends to have a long tail on one side or the other:

Negative Skew

No Skew

Positive Skew

Negative Skew?
Why is it called negative skew? Because the long "tail" is on the negative side of the peak.

People sometimes say it is "skewed to the left" (the long tail is on the left hand side).

The mean is also on the left of the peak.

The Normal Distribution has No Skew


A Normal Distribution is not skewed. It is perfectly symmetrical. And the Mean is exactly at the peak.

Positive Skew
And positive skew is when the long tail is on the positive side of the peak, and some people say it is "skewed to the right". The mean is on the right of the peak value.

Example: Income Distribution


Here is some data I extracted from a recent Census. As you can see it is positively skewed ... in fact the tail continues way past $100,000

Calculating Skewness
"Skewness" (the amount of skew) can be calculated, for example you could use the SKEW() function in Excel or OpenOffice Calc.

The Range
The range is the most obvious measure of dispersion and is the difference between the lowest and highest values in a dataset. In figure 1, the size of the largest semester 1 tutorial group is 6 students and the size of the smallest group is 4 students, resulting in a range of 2 (6-4). In semester 2, the largest tutorial group size is 7 students and the smallest tutorial group contains 3 students, therefore the range is 4 (7-3).

The range is simple to compute and is useful when you wish to evaluate the whole of a dataset. The range is useful for showing the spread within a dataset and for comparing the spread between similar datasets. An example of the use of the range to compare spread within datasets is provided in table 1. The scores of individual students in the examination and coursework component of a module are shown.

To find the range in marks the highest and lowest values need to be found from the table. The highest coursework mark was 48 and the lowest was 27 giving a range of 21. In the examination, the highest mark was 45 and the lowest 12 producing a range of 33. This indicates that there was wider variation in the students' performance in the examination than in the coursework for this module.

Since the range is based solely on the two most extreme values within the dataset, if one of these is either exceptionally high or low (sometimes referred to as an outlier) it will result in a range that is not typical of the variability within the dataset. For example, imagine in the above example that one student failed to hand in any coursework and was awarded a mark of zero, but sat the exam and scored 40. The range for the coursework marks would now become 48 (48-0), rather than 21. However, the new range is not typical of the dataset as a whole and is distorted by the outlier in the coursework marks. In order to reduce the problems caused by outliers in a dataset, the inter-quartile range is often calculated instead of the range.

The Inter-quartile Range


The inter-quartile range is a measure that indicates the extent to which the central 50% of values within the dataset are dispersed. It is based upon, and related to, the median. In the same way that the median divides a dataset into two halves, it can be further divided into quarters by identifying the upper and lower quartiles. The lower quartile is found one quarter of the way along a dataset when the values have been arranged in order of magnitude; the upper quartile is found three quarters along the dataset. Therefore, in terms of position, the upper quartile lies halfway between the median and the highest value in the dataset, whilst the lower quartile lies halfway between the median and the lowest value in the dataset. The inter-quartile range is found by subtracting the lower quartile from the upper quartile. For example, the examination marks for 20 students following a particular module are arranged in order of magnitude.

The median lies at the mid-point between the two central values (10th and 11th) = halfway between 60 and 62 = 61
The lower quartile lies at the mid-point between the 5th and 6th values = halfway between 52 and 53 = 52.5
The upper quartile lies at the mid-point between the 15th and 16th values = halfway between 70 and 71 = 70.5

The inter-quartile range for this dataset is therefore 70.5 - 52.5 = 18, whereas the range is 80 - 43 = 37. The inter-quartile range provides a clearer picture of the overall dataset by ignoring the outlying values. Like the range, however, the inter-quartile range is a measure of dispersion that is based upon only two values from the dataset. Statistically, the standard deviation is a more powerful measure of dispersion because it takes into account every value in the dataset. The standard deviation is explored in the next section of this guide.
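The by-hand method above (the median of each half of the ordered data) can be sketched in Python. The full list of 20 marks is not reproduced in this guide, so the list below is hypothetical. It was chosen only to agree with the values quoted above (lowest 43, highest 80, and the 5th, 6th, 10th, 11th, 15th and 16th values).

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (lower quartile, median, upper quartile).

    Each quartile is taken as the median of one half of the ordered data;
    with an odd number of values the middle value is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower_half), median(ordered), median(upper_half)

# Hypothetical marks, consistent with the positions quoted in the text.
marks = [43, 48, 50, 51, 52, 53, 55, 57, 58, 60,
         62, 63, 65, 66, 70, 71, 73, 75, 78, 80]

q1, q2, q3 = quartiles_by_halves(marks)
print(q1, q2, q3)                  # 52.5 61.0 70.5
print(q3 - q1)                     # inter-quartile range: 18.0
print(max(marks) - min(marks))     # range: 37
```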

Calculating the Inter-quartile range using Excel

The method Excel uses to calculate quartiles is not commonly used and tends to produce unusual results, particularly when the dataset contains only a few values. For this reason it may be best to calculate the inter-quartile range by hand.
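To see how a spreadsheet-style calculation can differ from the hand method, the sketch below uses Python's statistics.quantiles with method="inclusive". It is an assumption here that this interpolation corresponds to what Excel's QUARTILE function does. The marks are hypothetical, chosen only so that the 5th, 6th, 15th and 16th ordered values are 52, 53, 70 and 71, as in the example in the previous section.

```python
from statistics import quantiles

# Hypothetical marks for 20 students (already in order of magnitude).
marks = [43, 48, 50, 51, 52, 53, 55, 57, 58, 60,
         62, 63, 65, 66, 70, 71, 73, 75, 78, 80]

# "inclusive" interpolates between neighbouring values at position (n - 1) * p.
q1, q2, q3 = quantiles(marks, n=4, method="inclusive")
print(q1, q2, q3)   # 52.75 61.0 70.25
print(q3 - q1)      # 17.5, compared with 18 from the mid-point method
```

The median agrees, but the quartiles are interpolated to different points, so the inter-quartile range differs slightly from the value found by hand.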

The Standard Deviation


The standard deviation is a measure that summarises the amount by which every value within a dataset varies from the mean. Effectively it indicates how tightly the values in the dataset are bunched around the mean value. It is the most robust and widely used measure of dispersion since, unlike the range and inter-quartile range, it takes into account every value in the dataset. When the values in a dataset are tightly bunched together the standard deviation is small. When the values are spread apart the standard deviation will be relatively large. The standard deviation is usually presented in conjunction with the mean and is measured in the same units.

In many datasets the values deviate from the mean value due to chance and such datasets are said to display a normal distribution. In a dataset with a normal distribution most of the values are clustered around the mean while relatively few values tend to be extremely high or extremely low. Many natural phenomena display a normal distribution.

For datasets that have a normal distribution the standard deviation can be used to determine the proportion of values that lie within a particular range of the mean value. For such distributions approximately 68% of values are less than one standard deviation (1SD) away from the mean value, approximately 95% of values are less than two standard deviations (2SD) away from the mean and more than 99% of values (about 99.7%) are less than three standard deviations (3SD) away from the mean. Figure 3 shows this concept in diagrammatical form.

If the mean of a dataset is 25 and its standard deviation is 1.6, then about 68% of the values in the dataset will lie between MEAN-1SD (25-1.6=23.4) and MEAN+1SD (25+1.6=26.6), and more than 99% of the values will lie between MEAN-3SD (25-4.8=20.2) and MEAN+3SD (25+4.8=29.8). If the dataset had the same mean of 25 but a larger standard deviation (for example, 2.3) it would indicate that the values were more dispersed. The frequency distribution for a dispersed dataset would still show a normal distribution but when plotted on a graph the shape of the curve will be flatter, as in figure 4.
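The proportions and intervals above can be checked with Python's standard library. This is a sketch that assumes the data follow an exact normal distribution; the helper names share_within and interval are illustrative only.

```python
from statistics import NormalDist

def share_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    standard = NormalDist()  # mean 0, standard deviation 1
    return standard.cdf(k) - standard.cdf(-k)

def interval(mean, sd, k):
    """Return (MEAN - k*SD, MEAN + k*SD)."""
    return mean - k * sd, mean + k * sd

for k in (1, 2, 3):
    low, high = interval(25, 1.6, k)
    print(f"{k}SD: {share_within(k):.1%} of values between {low:.1f} and {high:.1f}")
```

For k = 1, 2 and 3 the proportions come out at roughly 68.3%, 95.4% and 99.7%, which is why the figures of 68%, 95% and 99% are quoted as rules of thumb.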

Population and sample standard deviations


There are two different calculations for the Standard Deviation. Which formula you use depends upon whether the values in your dataset represent an entire population or whether they form a sample of a larger population. For example, if all student users of the library were asked how many books they had borrowed in the past month then the entire population has been studied since all the students have been asked. In such cases the population standard deviation should be used. Sometimes it is not possible to find information about an entire population and it might be more realistic to ask a sample of 150 students about their library borrowing and use these results to estimate library borrowing habits for the entire population of students. In such cases the sample standard deviation should be used.

Formulae for the standard deviation


Whilst it is not necessary to learn the formula for calculating the standard deviation, there may be times when you wish to include it in a report or dissertation. The standard deviation of an entire population is known as σ (sigma) and is calculated using:

\[ \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \]

Where x represents each value in the population, μ is the mean value of the population, Σ is the summation (or total), and N is the number of values in the population. The standard deviation of a sample is known as S and is calculated using:

\[ S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]

Where x represents each value in the sample, x̄ is the mean value of the sample, Σ is the summation (or total), and n-1 is the number of values in the sample minus 1.
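The two formulae differ only in the divisor (N for a population, n-1 for a sample), which can be seen by writing them out in Python. This is a minimal sketch and the eight values are made up for illustration, chosen so that the population standard deviation is a whole number.

```python
import math

def population_sd(values):
    """Population standard deviation: divide the squared deviations by N."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

def sample_sd(values):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("a sample needs at least two values")
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

# Illustrative data: mean 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_sd(data))          # 2.0
print(round(sample_sd(data), 3))    # 2.138
```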

Calculating the standard deviation using Excel


Excel has functions to calculate the population and sample standard deviations. The appropriate commands are entered into the formula bar towards the top of the spreadsheet and the corresponding cells in the spreadsheet are updated to show the result.

For an example of calculating the population standard deviation, imagine you wish to know how fuel-efficient a new car that you have just purchased is. You calculate how many kilometres you have done per litre on your first five trips. This information is presented as column A of the spreadsheet (figure 5). As you have only made 5 trips you do not have any further information and you are therefore measuring the whole population at this point in time. The command to find the population standard deviation in Excel is =STDEVP(VALUES) and in this case the command is =STDEVP(A2:A6), which gives an answer of 0.49. Basing your results on the population standard deviation and assuming that your first 5 trips in your new car have been typical of your usual journeys, you can be 99% confident that your new car will do between 14.75 (MEAN-3SD) and 17.69 (MEAN+3SD) kilometres per litre.

The same data can be used to demonstrate how to calculate the sample standard deviation in Excel. In this case, imagine that the data in column A represent the kilometres per litre found for a sample of 5 new cars tested by the manufacturer. The sample standard deviation is calculated using =STDEV(VALUES) and in this case the command is =STDEV(A2:A6), which produces an answer of 0.55. The sample standard deviation will always be greater than the population standard deviation when they are calculated for the same dataset (unless all the values are identical, in which case both are zero). This is because the formula for the sample standard deviation has to take into account the possibility of there being more variation in the true population than has been measured in the sample. Based on their sample of 5 cars, and therefore using the sample standard deviation, the manufacturers could state with 99% confidence that similar cars will do between 14.57 (MEAN-3SD) and 17.87 (MEAN+3SD) kilometres per litre.

These examples show the quick method of calculating standard deviations using a cell range. Each of the commands can also be written out in a longer format with the individual kilometres/litre entered. For example, entering =STDEV(16.13,16.40,15.81,17.07,15.69) produces an identical result to =STDEV(A2:A6). However, if one of the values in column A was found to be incorrect and adjusted, the cell range method would automatically update the calculation of the standard deviation, whereas the longer format would require manual adjustment of the command.
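For readers without Excel, the same figures can be reproduced in Python. This sketch uses statistics.pstdev and statistics.stdev as stand-ins for =STDEVP and =STDEV, on the five kilometres-per-litre values listed above. Rounding the standard deviation to two decimal places before forming the MEAN-3SD and MEAN+3SD limits is an assumption made here to match the figures quoted in the text.

```python
from statistics import mean, pstdev, stdev

km_per_litre = [16.13, 16.40, 15.81, 17.07, 15.69]

avg = mean(km_per_litre)
population_sd = round(pstdev(km_per_litre), 2)   # counterpart of =STDEVP(A2:A6)
sample_sd = round(stdev(km_per_litre), 2)        # counterpart of =STDEV(A2:A6)

def three_sd_limits(centre, sd):
    """Return (MEAN - 3SD, MEAN + 3SD), rounded to two decimal places."""
    return round(centre - 3 * sd, 2), round(centre + 3 * sd, 2)

print(round(avg, 2))                             # 16.22
print(population_sd, three_sd_limits(avg, population_sd))
print(sample_sd, three_sd_limits(avg, sample_sd))
```

The sample standard deviation is the larger of the two, so the limits based on it are slightly wider.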

Summary
The range, inter-quartile range and standard deviation are all measures that indicate the amount of variability within a dataset. The range is the simplest measure of variability to calculate but can be misleading if the dataset contains extreme values. The inter-quartile range reduces this problem by considering the variability within the middle 50% of the dataset. The standard deviation is the most robust measure of variability since it takes into account a measure of how every value in the dataset varies from the mean. However, care must be taken when calculating the standard deviation to consider whether the entire population or a sample is being examined and to use the appropriate formula.

Further help and information


Further study guides covering a range of numeracy topics are available from the Student Learning Centre in College House. The Maths Help service can also provide advice and support for students who have questions about any aspect of maths or numeracy including using the range, inter-quartile range and standard deviation.
