2-1) Graphical Summaries for Qualitative Data

A.) Vocabulary:
i) data set: the bare values of the data in lists, arrays, or tables.
ii) frequency: the number of times a category or class occurs in a data set.
iii) frequency distribution: organization of data into a table that bins data into classes and
counts how many are in each class.
iv) relative frequency: the number of members in a class divided by the total number of subjects.
(This gives a number between 0 and 1, and is related to percent which is a number between 0%
and 100%).
v) relative frequency distribution: organization of relative frequencies of each class into a table.
(Note: frequently, the frequency distribution and relative frequency distribution are shown in
the same table.)
vi) bar graph: represents the (relative) frequency distribution by using vertical or horizontal bars
whose heights or lengths represent frequencies of the data.
vii) side-by-side graphs: are bar graphs where two groups are shown for each class. This means
we can visually compare the two groups in each class.
viii) pie graph: a circle that is divided into sections/wedges according to the relative frequencies
of each category of the distribution.
(Note: the whole pie must be 100% of the data, and the categories cannot overlap.)
B.) Skills: Be able to...
i) make a frequency distribution table from data sets.
ii) calculate relative frequencies and create a relative frequency distribution table
iii) create bar charts from distribution tables and raw data
iv) create a pie chart from tables and data sets
v) use tables, charts, and logic to answer questions about the data sets
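
As a rough illustration of skills i) and ii), the sketch below builds a frequency distribution and a relative frequency distribution in one table. The survey responses are made-up data, and the variable names are assumptions chosen for this example only.

```python
from collections import Counter

# Hypothetical survey responses (made-up data for illustration only).
responses = ["bus", "car", "car", "bike", "car", "bus", "walk", "car", "bus", "bike"]

# Frequency: how many times each category occurs.
frequency = Counter(responses)

# Relative frequency: class count divided by the total number of subjects.
total = len(responses)
relative_frequency = {category: count / total for category, count in frequency.items()}

# Print both distributions in one table, most frequent category first.
print(f"{'Category':<10}{'Freq':>6}{'Rel. freq':>12}")
for category, count in frequency.most_common():
    print(f"{category:<10}{count:>6}{relative_frequency[category]:>12.2f}")
```

The relative frequencies add up to 1, which matches the requirement that a pie graph cover 100% of the data.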

2-2) Graphical Summaries for Quantitative Data

A.) Vocabulary:
i) class limits: the smallest (lower class limit) and largest (upper class limit) data values
that can be included in the class.
ii) class width: the difference between consecutive lower class limits.
iii) Requirements for choosing classes:
Every observation must fall into one of the classes
The classes must not overlap
The classes must have equal width
There must be no gaps between classes
iv) Procedure for Constructing a Frequency Distribution:
Step 1. Choose a Class width.
Step 2. Choose a lower class limit for the first class.
Step 3. Compute the lower limit for the next class by adding the class width to the first lower
limit.
Step 4. Compute the lower limits for the remaining classes by adding the class width to the
previous lower class limit. Stop when the largest data value is included in a class.
Step 5. Count the number of observations in each class and construct the frequency distribution.

v) Class widths are chosen based on a brief review of the data and a decision on how many classes to
use. Roughly:
\[ \text{Class width} \approx \frac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}}, \]
and then this number is rounded up to the nearest convenient value.
vi) histogram: a graph that displays the frequency distribution by using vertical bars of various
heights to represent the frequencies of the classes. The bars have lower end points at the lower
limit of the classes and upper end points at the lower limit of the next class.
vii) Note: if a histogram is of discrete data (frequently integers), the bars are centered on the integer
for each class.
viii) distribution shapes: over time some shapes of histograms repeat over and over again. These
shapes are results of certain patterns. Some common types are: Bell shaped, uniform, skewed to
the right or skewed to the left, etc.
ix) mode: The modal class is the class that has the highest frequency. A histogram that has only
one distinct peak is called unimodal while one that has two distinct peaks is called bimodal. (Note:
a graph can be bimodal even if one peak is higher than the other, the important factor is that
two truly distinct peaks appear.)
B.) Skills: Be able to...
i) make a histogram from a frequency table, and from raw data.
ii) make a relative frequency graph comparing data from raw data sets.
iii) determine classes with class limits from raw numerical data.
iv) construct a grouped frequency distribution with a specified number of classes from raw data.
v) determine class width and class midpoints from a given frequency distribution.
vi) add cumulative frequency distributions to prior tables of frequency distributions.
vii) use technology (SPSS, etc.) to construct frequency distributions and histograms from raw data.
viii) recognize common distribution shapes based on graphs (and graphs you make from raw data).
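
The five-step procedure for constructing a frequency distribution can be sketched as follows. This is only one possible reading of the steps: the data values, the choice of 5 classes, and the first lower class limit of 10 are all assumptions made for illustration.

```python
import math

# Hypothetical raw data (made-up integer values for illustration only).
data = [12, 15, 21, 22, 27, 29, 31, 33, 34, 38, 41, 45, 47, 52, 58]
num_classes = 5  # assumed choice after a brief review of the data

# Step 1: class width is roughly (largest - smallest) / number of classes, rounded up.
rough_width = (max(data) - min(data)) / num_classes
class_width = math.ceil(rough_width)

# Step 2: choose a convenient lower class limit for the first class.
first_lower = 10  # assumed; any convenient value at or below min(data) works

# Steps 3-4: keep adding the class width until the largest value is covered.
lower_limits = [first_lower]
while lower_limits[-1] + class_width <= max(data):
    lower_limits.append(lower_limits[-1] + class_width)

# Step 5: count the observations in each class [lower, lower + width).
frequencies = [sum(1 for x in data if low <= x < low + class_width) for low in lower_limits]

# The printed upper limit (lower + width - 1) assumes integer data.
for low, freq in zip(lower_limits, frequencies):
    print(f"{low}-{low + class_width - 1}: {freq}")
```

Because each class is the half-open interval from one lower limit up to the next, the classes have equal width, do not overlap, and leave no gaps.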

2-3) Other Types of Graphs

A.) Vocabulary:
i) Stem-and-Leaf Plot: A compact way to organize and display the entire data set. The stem
is a vertical listing of the first digits and the leaves are the last place for each data point.
ii) A back-to-back stem and leaf plot can show two different data sets on the same plot, one on
each side of the stem
iii) time series graph: a dot and line plot that represents data over a specific period of time.
iv) dot plot: a graph in which each data point is plotted as a single point above the horizontal
axis.
B.) Skills: Be able to...
i) create a Stem-and-Leaf plot from raw data.
ii) create a time series plot from raw data.
iii) create a dot plot from raw data.
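
A stem-and-leaf plot can be produced with a few lines of code. The sketch below assumes non-negative two-digit integers, so the stem is the tens digit and the leaf is the ones digit; the data are made up.

```python
from collections import defaultdict

# Hypothetical two-digit data (made-up values for illustration only).
data = [23, 25, 31, 38, 38, 42, 47, 47, 49, 55]

# Stem = tens digit, leaf = ones digit (assumes non-negative integers).
plot = defaultdict(list)
for value in sorted(data):
    plot[value // 10].append(value % 10)

# List every stem from smallest to largest, even one with no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```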

3.1) Measures of Center

A. Vocabulary:
i) statistic: is a characteristic or measure obtained by using data values from a sample.
ii) parameter: is a characteristic or measure obtained by using all data values from a specific
population.

iii) General Rounding Rule: when calculations are done, rounding should not be done until the final
answer is calculated.
iv) Mean (Excel calls this the average): found by adding the data values of a set and dividing by
the number of values in the set.
\[ \bar{x} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{\sum X}{n} \]
(Note: if it is the mean for a population, we use the Greek letter $\mu = \frac{\sum X}{N}$ for the mean, and use
the capital N to mean the number of data points.)
v) median: the middle value of a data set. This is obtained by listing the data in ascending order
and:
a) if there are an odd number of data values, choose the center value [center point = $\frac{n+1}{2}$];
b) if there are an even number of values, average the two data values around the center [average the $\frac{n}{2}$
and $\frac{n}{2} + 1$ points].
vi) mode: is the most common value (or class) in the data. If there are two values that are both
most common we say the set is bimodal. If more than two are most common the data set is
called multimodal.
vii) Definition: Resistant: We call a statistic resistant if it is not affected much by extreme values.
[Note 1: since the mean is affected by very large (or small) data values, it is not resistant.]
[Note 2: since the median is not affected much by large (or small) data values, it is resistant.]
viii) The shape of the histogram reflects the relationship between the mean and the median:
Shape                Relationship
Skewed right         Mean is greater than median
Approx. symmetric    Mean and median are similar
Skewed left          Mean is less than median
ix) Approximate mean: found by using the midpoint of each class (average of the bottom of the
class and the bottom of the next class) as the data values $X_k$, multiplying each value by its
corresponding frequency $F_k$, adding, and then dividing by the number of
data points $N$. (If relative frequencies are used in place of frequencies, the sum is already the
approximate mean and no division is needed.)
\[ \bar{X} = \frac{F_1 X_1 + F_2 X_2 + F_3 X_3 + \cdots + F_m X_m}{N} = \frac{\sum F X}{N} \]
B. Skills: Be able to...
i) calculate mean, median, mode from raw data.
ii) determine whether the median is higher or lower than the mean, and what that implies for the
distribution.
iii) calculate approximate means and find median classes from tables of data
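
The measures of center above can be checked with Python's standard `statistics` module. This is a minimal sketch on made-up data; the grouped table (classes 0-9, 10-19, 20-29) is hypothetical, and its midpoints follow the definition in these notes (average of a class's lower limit and the next class's lower limit).

```python
import statistics

# Hypothetical sample (made-up values for illustration only).
data = [2, 3, 3, 5, 7, 8, 42]

mean = statistics.mean(data)      # sum of the values divided by n
median = statistics.median(data)  # middle value of the sorted list
mode = statistics.mode(data)      # most common value

# The extreme value 42 pulls the mean above the median, which suggests right skew.
print(mean, median, mode)

# Approximate mean from a grouped table of (class midpoint, frequency) pairs.
# Hypothetical classes 0-9, 10-19, 20-29 have midpoints 5, 15, 25.
grouped = [(5, 4), (15, 10), (25, 6)]
n = sum(freq for _, freq in grouped)
approx_mean = sum(mid * freq for mid, freq in grouped) / n
print(approx_mean)
```

Replacing 42 with 9 would change the mean noticeably but leave the median at 5, which is what it means for the median to be resistant.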

3.2) Measures of Spread

A. Vocabulary:
i) range: is the highest value minus the lowest value and represents the breadth of values obtained.

\[ R = \text{highest value} - \text{lowest value} \]

ii) population variance ($\sigma^2$): the average of the squares of the distance each value is from the
mean
\[ \sigma^2 = \frac{(X_1 - \mu)^2 + (X_2 - \mu)^2 + (X_3 - \mu)^2 + \cdots + (X_N - \mu)^2}{N} = \frac{\sum (X - \mu)^2}{N} \]

iii) population standard deviation ($\sigma$): the square root of the variance
\[ \sigma = \sqrt{\frac{(X_1 - \mu)^2 + (X_2 - \mu)^2 + (X_3 - \mu)^2 + \cdots + (X_N - \mu)^2}{N}} = \sqrt{\frac{\sum (X - \mu)^2}{N}} \]
iv) sample variance ($s^2$) and sample standard deviation ($s$): is a calculation of variance and
deviation of a sample. ($n$ is the number of data points in the sample; we divide by $n - 1$ to correct
for the tendency of a sample to have a lower spread than the population.)
\[ s^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + (X_3 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1} = \frac{\sum (X - \bar{X})^2}{n - 1} \]
\[ s = \sqrt{\frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + (X_3 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \]
v) coefficient of variation: is written as a percentage and is the standard deviation divided by
the mean (this can be used to compare standard deviations of data sets with very different means
so that we can compare relative spread of the sets)
\[ \text{CVar} = \frac{s}{\bar{X}} \cdot 100\% \quad \text{or} \quad \text{CVar} = \frac{\sigma}{\mu} \cdot 100\% \]

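vi) The Empirical Rule: If the data is approximately bell shaped, we can approximate how much
of the data is within n standard deviations of the mean as below:
approximately 68% of the data is within 1 standard deviation of the mean,
approximately 95% of the data is within 2 standard deviations of the mean,
approximately 99.7% of the data is within 3 standard deviations of the mean.
[Figure: bell curve of frequency against standard deviations from the mean (-4 to 4). The bands
between consecutive standard deviations from -3 to 3 hold 2.1%, 13.6%, 34.1%, 34.1%, 13.6%, and 2.1%
of the data.]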
vii) Chebyshev's Theorem: The proportion of values from a data set that fall within $k$ standard
deviations of the mean will be at least $1 - 1/k^2$, where $k$ is a number greater than 1 ($k$ is not
necessarily an integer).
B. Skills: Be able to...
i) calculate variance and standard deviation from raw data.
ii) use the Empirical Rule to approximate how much data is between given data values.
iii) test Chebyshev's Theorem with raw data.
iv) use the coefficient of variation to compare the variance of disparate data sets.
v) use the histogram of data to determine whether the Empirical Rule should be applied to the data
set
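
A short sketch of these spread calculations, using Python's standard `statistics` module on made-up data. The choice of k = 1.5 for the Chebyshev check is an arbitrary assumption for illustration.

```python
import statistics

# Hypothetical sample (made-up values for illustration only).
data = [4, 8, 6, 5, 3, 7, 9, 6]

mean = statistics.mean(data)
pop_var = statistics.pvariance(data)     # population variance: divides by N
sample_var = statistics.variance(data)   # sample variance: divides by n - 1
sample_sd = statistics.stdev(data)       # square root of the sample variance

# Coefficient of variation, written as a percentage.
cv = sample_sd / mean * 100

# Chebyshev check: proportion within k standard deviations vs. the bound 1 - 1/k^2.
k = 1.5
within = sum(1 for x in data if abs(x - mean) <= k * sample_sd) / len(data)
bound = 1 - 1 / k**2

print(pop_var, sample_var, sample_sd, round(cv, 1))
print(within, ">=", round(bound, 3))
```

The sample variance comes out larger than the population variance for the same data because it divides by n - 1 instead of N.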

3-3) Measures of Position

A. Vocabulary:

i) z score (or standard score): is the number of standard deviations that a data value falls away
from the mean and is calculated:
\[ z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}} \]
Thus, in terms of symbols, this is written as:
\[ z = \frac{x - \bar{x}}{s}; \quad \text{or} \quad z = \frac{x - \mu}{\sigma} \]
depending on if the data is from a sample (left) or a population (right), but the calculation is the
same regardless of which way it is written.
ii) divisions of the data set:
a) percentile: divide the data set into 100 even groups with dividing lines labelled $P_1$, $P_2$, ..., $P_{98}$,
and $P_{99}$, so that, for example, if a data point is above $P_{73}$, it is above 73% of the data (and
thus in the top 27% of the data). Note: $P_{50}$ corresponds to the median.
The position of the data point associated with the $P$th percentile is $L = \frac{P}{100} \cdot n$, taken UP to the
next integer. (NOT ROUNDED. If L is an integer, use L + 1. If L is a decimal, round UP to
the next integer.)
b) quartiles: divide the data set into four equal groups, the bottom quartile, below $Q_1$, second
quartile, between $Q_1$ and $Q_2$, third quartile, between $Q_2$ and $Q_3$, and the top quartile, above
$Q_3$. Note: $Q_2$ corresponds to the median. (Note: $Q_1 = P_{25}$ and $Q_3 = P_{75}$.)
iii) Five Number Summary: The five number summary is: Minimum; $Q_1$; Median; $Q_3$; Maximum.
(Note 1: the five number summary, with outliers marked, is graphed by the Box Plot.)
(Note 2: if the median is closer to $Q_1$, or the upper whisker is longer, the data is skewed right;
if the median is closer to $Q_3$, or the lower whisker is longer, the data is skewed left.)
iv) outliers: are data points that are extremely low or extremely high compared to the other data
values in the set. Computationally, these values can be found by:
Step 1.) Find $Q_1$ and $Q_3$ for the data set.
Step 2.) Calculate the Interquartile Range (IQR), $\text{IQR} = Q_3 - Q_1$, the width of the center half
of the data.
Step 3.) Set the minimum value as $Q_1 - 1.5(\text{IQR})$ and the maximum as $Q_3 + 1.5(\text{IQR})$.
Step 4.) Any data points below the minimum or above the maximum values found above are called
outliers (and are sometimes dropped from the data set).
B. Skills: Be able to...
i) calculate z-values for data points
ii) calculate percentiles of data points, and find data points corresponding to percentiles
iii) calculate the Five number summary
iv) determine outliers in a data set
v) draw box-plots
vi) estimate skewness from box-plots
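
The outlier steps and the percentile position rule can be sketched as below. The function names `percentile_position` and `z_score` are hypothetical helpers written for this example, the data are made up, and the position rule is the one stated in these notes (other textbooks and software use slightly different quartile conventions).

```python
import math

def percentile_position(p, n):
    """1-based position L of the p-th percentile among n sorted values, using the
    rule from these notes: L = (p/100)*n; if L is whole use L + 1, else round up.
    Intended for p from 1 to 99."""
    L = p * n / 100
    return int(L) + 1 if L == int(L) else math.ceil(L)

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Hypothetical data (made-up values for illustration only), sorted ascending.
data = sorted([3, 7, 8, 9, 10, 11, 12, 13, 14, 30])
n = len(data)

# Step 1: Q1 = P25 and Q3 = P75 (list indices are 0-based, positions are 1-based).
q1 = data[percentile_position(25, n) - 1]
q3 = data[percentile_position(75, n) - 1]

# Step 2: interquartile range.
iqr = q3 - q1

# Step 3: lower and upper cutoffs.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Step 4: anything below the lower cutoff or above the upper cutoff is an outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```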

4.1) Probability: Basic Ideas

A. Vocabulary:
i) Probability: We call the probability of an event the proportion of times the event occurs in the
long run.
ii) sample space: the set of all possible outcomes of a probability experiment.
iii) event: a set of outcomes.

iv) For a sample space whose outcomes are equally likely:
\[ P(A) = \frac{\text{number of outcomes in the sample space in which } A \text{ occurs}}{\text{total number of outcomes in the sample space}} \]
v) Probability:
a) The probability of an event A is a fraction or a decimal between 0 and 1: $0 \le P(A) \le 1$
b) The sum of all probabilities of all outcomes in a sample space is 1.
c) If an event A cannot occur, then P (A) = 0
d) If an event A is certain, then P (A) = 1
vi) unusual event: We call an event an unusual event if the probability of that event occurring is
small.
Sometimes people will call any event with a probability of less than 0.05 an unusual event.
vii) Law of Large Numbers: As the number of times an experiment is repeated grows, the relative
frequency with which an event occurs will get arbitrarily close to the probability of the event.
B. Skills: Be able to...
i) calculate probabilities for events from a sample space.
ii) calculate the probability of randomly choosing a member of a certain class from a frequency
distribution.
iii) Calculate the probability of an event from the probability of its complement (and vice versa).
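
As an illustration of computing probabilities from a sample space, the sketch below enumerates the outcomes of rolling two fair dice. The dice example and the helper name `probability` are assumptions for this example, and the counting formula applies only because every outcome is equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

def probability(event):
    """P(A) = (outcomes in which A occurs) / (total outcomes), equally likely case."""
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favorable), len(sample_space))

p_sum_is_7 = probability(lambda roll: sum(roll) == 7)
p_not_7 = 1 - p_sum_is_7                                  # complement of "sum is 7"
p_double_six = probability(lambda roll: roll == (6, 6))

# An event with probability below 0.05 is sometimes called unusual.
is_unusual = p_double_six < Fraction(5, 100)
print(p_sum_is_7, p_not_7, p_double_six, is_unusual)
```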
4.2) A. Vocabulary:
i) compound events: We call an event that is formed by combining two or more events a
compound event.
ii) P (A or B) = P (A occurs or B occurs or both occur) and is calculated:

\[ P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \]


iii) mutually exclusive events or disjoint events: events that cannot occur at the same time.
A Venn diagram of mutually exclusive events, A and B, has no overlap between circles for A and
B.
iv) If events are not mutually exclusive, an outcome can be in circles for A and B at the same time,
so the circles overlap.
v) complement: If A is a set of outcomes in the sample space, the complement of A, called $A^c$
(said A-complement), is the set of all outcomes not in A. (So, A and $A^c$ make up the whole sample
space with no overlap.)
Thus, $P(A) + P(A^c) = 1$ and
$P(A) = 1 - P(A^c)$, and
$P(A^c) = 1 - P(A)$.
B. Skills: Be able to...
i) Calculate the probability of "or" combinations of events from the probabilities of each event and
any overlap of the events.
ii) Determine what the complement of a given event is.
iii) Use the probability of an event to find the probability of the complement of that event
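
The addition rule and the complement rule can be checked with sets. This sketch assumes a single roll of a fair six-sided die; the events A and B and the helper `prob` are made up for illustration.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (equally likely outcomes).
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # event: the roll is even
B = {4, 5, 6}   # event: the roll is greater than 3

def prob(event):
    """Probability of an event as (outcomes in the event) / (outcomes in the sample space)."""
    return Fraction(len(event), len(sample_space))

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
p_or = prob(A) + prob(B) - prob(A & B)
assert p_or == prob(A | B)   # agrees with counting the union directly

# Complement rule: P(A^c) = 1 - P(A).
A_complement = sample_space - A
assert prob(A_complement) == 1 - prob(A)

print(p_or, prob(A_complement))
```

A and B here are not mutually exclusive (both contain 4 and 6), so the overlap has to be subtracted once to avoid counting it twice.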
6.1) The Standard Normal Curve
A. Vocabulary:
i) A Probability Density Curve is a probability distribution of a continuous variable.
The area under the curve is equal to 1.
The area under the curve between two values, a and b, is the probability that a randomly selected
individual will have a variable value between a and b.

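ii) The Standard Normal Curve: is a probability distribution with the following properties:
1) there is only one mode,
2) it is symmetric around the mode,
3) the mean and median are equal to the mode,
4) approximately 68% of the population is between $\mu - \sigma$ and $\mu + \sigma$,
approximately 95% of the population is between $\mu - 2\sigma$ and $\mu + 2\sigma$,
approximately 99.7% of the population is between $\mu - 3\sigma$ and $\mu + 3\sigma$
(for the standard normal curve, $\mu = 0$ and $\sigma = 1$).
[Figure: bell curve of frequency against standard deviations from the mean (-4 to 4), marking the
68%, 95%, and 99.7% regions. The bands between consecutive standard deviations from -3 to 3 hold
2.1%, 13.6%, 34.1%, 34.1%, 13.6%, and 2.1% of the population.]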
iii) z-score: The z-score for a data point, $x$, is the number of standard deviations, $\sigma$, away from the
mean, $\mu$, the data point is. It is calculated by:
\[ z = \frac{x - \mu}{\sigma} \]
If z < 0, the data point is less than the mean, and if z > 0, the data point is greater than the
mean.
iv) Note: the table in the back of the book (Appendix: A-6 and A-7) calculates the probability that
a randomly selected data point is less than the given z-value.
v) The probability that a randomly selected value has a z-value less than, greater than, or between
specific z-values can be obtained as follows:
a) Probability less than: $P(z < z^*)$ = value from the table
b) Probability greater than: $P(z > z^*) = 1 - P(z < z^*)$
c) Probability between $z_1$ and $z_2$: $P(z_1 < z < z_2) = P(z < z_2) - P(z < z_1)$
Note: Instructions for using a TI-84 calculator are included in the text.
vi) Finding a z-score for the amount $\alpha$ of the distribution to the right of a particular value: $z_\alpha$ is the
z-score having $(1 - \alpha)$ to the left of that z-score.
To find such a $z_\alpha$ from the chart, we need the "to the left" value, so we calculate $(1 - \alpha)$, and
then use the chart in reverse. Thus, find the value closest to $(1 - \alpha)$ in the body of the table,
and read off the appropriate z-score.
vii) Finding the middle R% of the data (let r = R%/100, the decimal form of the percent): find
the z-scores $z_{1-(1-r)/2}$ and $z_{(1-r)/2}$.
B. Skills: Be able to...
i) Find probabilities left, right and between given z-values
ii) Find the z-score for given areas (probabilities) left, right and for the middle areas.
iii) Find $z_\alpha$ values
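
In place of a printed table, the same lookups can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The specific z-values, the tail area 0.05, and the middle 95% are arbitrary choices for illustration.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)   # standard normal curve

# a) Probability less than z = 1.25 (what a left-tail table reports).
p_less = Z.cdf(1.25)

# b) Probability greater than z = 1.25.
p_greater = 1 - Z.cdf(1.25)

# c) Probability between z1 = -1 and z2 = 2.
p_between = Z.cdf(2) - Z.cdf(-1)

# z-score with area alpha to its right: look up (1 - alpha) "in reverse".
alpha = 0.05
z_alpha = Z.inv_cdf(1 - alpha)

# Middle r of the data: each tail holds (1 - r) / 2.
r = 0.95
z_low = Z.inv_cdf((1 - r) / 2)
z_high = Z.inv_cdf(1 - (1 - r) / 2)

print(round(p_less, 4), round(p_greater, 4), round(p_between, 4))
print(round(z_alpha, 3), round(z_low, 2), round(z_high, 2))
```

`cdf` plays the role of reading the table forward, and `inv_cdf` plays the role of using the table in reverse.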

5.1) Random Variables (after 6.1)

A. Vocabulary:

i) Random Variable: A random variable is a numerical outcome of a probability experiment.
ii) Discrete random variable: a random variable whose possible values can be listed. (Frequently
integers)
Continuous random variables: A continuous random variable is a random variable that
can take any value in an interval.
iii) Expected Value: The expected value of a random variable is the mean of the random variable.
\[ E(X) = \mu_X = \sum [x \cdot P(x)] \]
iv) The variance of a random variable is calculated by:
\[ \sigma_X^2 = \sum [(x - \mu_X)^2 \cdot P(x)] \]
which can also be calculated via:
\[ \sigma_X^2 = \sum [x^2 \cdot P(x)] - \mu_X^2 \]
The standard deviation is calculated by the usual square root of the variance:
\[ \sigma_X = \sqrt{\sigma_X^2} \]

B. Skills: Be able to...


i) Recognize what type of random variable an experiment will produce
ii) calculate the expected value, and standard deviation of a given probability distribution
iii) use the expected value and standard deviation to answer questions about real world scenarios.
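
The expected value, variance, and standard deviation of a discrete random variable can be computed directly from its probability distribution. The distribution below is made up for illustration, and the sketch checks that the definition and the shortcut formula for the variance agree.

```python
import math

# Hypothetical probability distribution (made-up values): x -> P(x).
distribution = {0: 0.2, 1: 0.5, 2: 0.3}
assert math.isclose(sum(distribution.values()), 1.0)  # probabilities must sum to 1

# Expected value: sum of x * P(x).
mu = sum(x * p for x, p in distribution.items())

# Variance two ways: the definition and the shortcut formula.
var_definition = sum((x - mu) ** 2 * p for x, p in distribution.items())
var_shortcut = sum(x ** 2 * p for x, p in distribution.items()) - mu ** 2

# Standard deviation: square root of the variance.
sigma = math.sqrt(var_definition)
print(round(mu, 4), round(var_definition, 4), round(var_shortcut, 4), round(sigma, 4))
```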
