
Confidence interval

A range of values, derived from sample statistics, that is likely to contain the value of an
unknown population parameter. Because of their random nature, it is unlikely that two
samples from a given population will yield identical confidence intervals. But if you
repeated your sample many times, a certain percentage of the resulting confidence
intervals would contain the unknown population parameter. The percentage of these
confidence intervals that contain the parameter is the confidence level of the interval.
For example, suppose you want to know the average amount of time it takes for an
automobile assembly line to complete a vehicle. You take a sample of completed cars,
record the time they spent on the assembly line, and use the 1-sample t procedure to
obtain a 95% confidence interval for the mean amount of time all cars spend on the
assembly line. Because 95% of the confidence intervals constructed from all possible
samples will contain the population parameter, you can be 95% confident that the mean
amount of time all cars spend on the assembly line falls between your interval's endpoints,
which are called confidence limits.
Creating confidence intervals is analogous to throwing nets over a target with an
unknown, yet fixed, location. Consider the graphic below, which depicts confidence
intervals generated from 20 samples from the same population. The black line represents
the fixed value of the unknown population parameter; the blue confidence intervals
contain the value of the population parameter; the red confidence interval does not.

A 95% confidence interval indicates that 19 out of 20 samples (95%) from the same
population will produce confidence intervals that contain the population parameter.
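As an illustration of how such an interval could be computed outside Minitab, here is a minimal Python sketch of a 1-sample t interval. The data values and the use of SciPy are assumptions for the example, not part of the original scenario.

import math
from scipy import stats

# Hypothetical assembly-line completion times in hours (illustrative values only)
times = [17.2, 18.1, 16.9, 17.8, 18.4, 17.5, 17.0, 18.2]

n = len(times)
mean = sum(times) / n
# Sample standard deviation (n - 1 in the denominator)
sd = math.sqrt(sum((x - mean) ** 2 for x in times) / (n - 1))
se = sd / math.sqrt(n)

# 95% confidence interval based on the t distribution with n - 1 degrees of freedom
lower, upper = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)
print(f"95% CI for the mean completion time: ({lower:.2f}, {upper:.2f})")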

Mean
Describes an entire set of observations with a single value representing the center of the
data. The mean (arithmetic average) is the sum of all the observations divided by the
number of observations. For example, the waiting time (in minutes) of five customers in a
bank are: 3, 2, 4, 1, and 2. The mean waiting time is:
(3 + 2 + 4 + 1 + 2) / 5 = 12 / 5 = 2.4 minutes

On average, a customer waits 2.4 minutes for service at the bank.
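A quick check of this arithmetic in Python, using only the standard library (the numbers come straight from the example above):

import statistics

# Waiting times (in minutes) of the five bank customers
waits = [3, 2, 4, 1, 2]

# Arithmetic mean: sum of the observations divided by the number of observations
print(sum(waits) / len(waits))      # 2.4
print(statistics.mean(waits))       # 2.4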

Because the mean depends equally on all of the data including extreme values, it may not
be representative of the center for skewed data.

Symmetric data: the mean (pink line) lies near the center of the distribution, making it a
good representation of the center.

Skewed data: the mean (pink line) is pulled in the direction of the heavier tail, making it
misleading as a representation of the center.

Many statistical analyses use the mean as a standard reference point.


The symbol μ (mu) represents the population mean; x̄ (x-bar) represents the sample mean.

Median
The middle of the range of data: half the observations are less than or equal to it and half
the observations are greater than or equal to it.
If the data set contains an odd number of values, the median is simply the middle value
of the ordered data set. In the first set of numbers, the median is 3: two values are higher
and two are lower.

If the data set contains an even number of values, take the average of the two middle
values to arrive at the median. The second set of numbers contains an even number of
values; averaging the two middle values (3 and 21) gives a median of 12.


Compared to the mean, the median is not sensitive to extreme data values, and is, thus,
often a more informative measure of the center of skewed data.
For example, the mean may not be a good statistic for describing salaries within a
company. The relatively high salaries of a few top earners inflate the overall average,
giving a false impression of salaries at the company. In this case, the median is more
informative. The median is equivalent to the second quartile or the 50th percentile.
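A short Python sketch of both cases, using illustrative ordered data sets chosen to match the medians described above (the exact values are assumptions for the example):

import statistics

# Odd number of values: the median is the single middle value
odd_set = [1, 2, 3, 21, 35]          # illustrative values; the middle of five values is 3
print(statistics.median(odd_set))    # 3

# Even number of values: the median is the average of the two middle values
even_set = [1, 2, 3, 21, 35, 42]     # illustrative values; the middle values are 3 and 21
print(statistics.median(even_set))   # 12.0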

Standard deviation
The most common measure of dispersion, or how spread out the data are from the mean.
While the range estimates the spread of the data by subtracting the minimum value from
the maximum value, the standard deviation roughly estimates the "average" distance of
the individual observations from the mean. The greater the standard deviation, the greater
the spread in the data.
Standard deviation can be used as a preliminary benchmark for estimating the overall
variation of a process. For example, administrators track the discharge time for patients
treated in the emergency departments of two hospitals. Although the average discharge
times are about the same (35 minutes), the standard deviations are significantly different.

Hospital 1: The standard deviation is about 6. On average, a patient's discharge time
deviates from the mean (blue line) by about 6 minutes.

Hospital 2: The standard deviation is about 20. On average, a patient's discharge time
deviates from the mean (blue line) by about 20 minutes.

The standard deviation is calculated by taking the positive square root of the variance,
another measure of data dispersion. Standard deviation is often more convenient and
intuitive to work with, however, because it uses the same units as the data. For example,
if a machine part is weighed in grams, the standard deviation of its weight is also
calculated in grams, while its variance is calculated in grams².

In a normal (bell-shaped) distribution, successive standard deviations from the mean
provide useful benchmarks for estimating the percentage of data observations.

About 95% of the observations fall within 2 standard deviations of the mean, shown by
the blue shaded area.

About 68% of the observations fall within 1 standard deviation of the mean (-1 to +1),
and about 99.7% of the observations fall within 3 standard deviations of the mean
(-3 to +3).

The symbol σ (sigma) is often used to represent the standard deviation of a population,
while s is used to represent the standard deviation of a sample.
Variation that is random or natural to a process is often referred to as noise.
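A minimal sketch of these quantities in Python. The discharge-time values are hypothetical, and the use of the statistics module and SciPy is an assumption for illustration:

from statistics import mean, stdev, pstdev
from scipy.stats import norm

# Hypothetical discharge times in minutes (illustrative values only)
times = [27, 43, 35, 26, 41, 31, 44, 33]

print(mean(times))    # sample mean
print(stdev(times))   # sample standard deviation, s (divides by n - 1)
print(pstdev(times))  # population standard deviation, sigma (divides by n)

# Empirical-rule benchmarks for a normal distribution: the proportion of
# observations within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))   # about 0.68, 0.95, 0.997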

Variance
A measure of dispersion, which is the extent to which a data set or distribution is
scattered around its mean.
Monitoring variance is essential to the manufacturing and quality industries because a
reduction of process variance increases precision and reduces the number of defects. For
example, a factory produces carpentry nails that are 50mm in length, and a nail meets
specifications if its length is within 2mm of the target value of 50mm. The factory uses
two types of machines to manufacture nails. Both machines produce nails with normally
distributed lengths and a mean length of 50mm. However, nails from each machine have
different variances: Machine A, with the dotted-line distribution below, produces nails
with a variance of 9mm², and Machine B, with the solid-line distribution below, produces
nails with a variance of 1mm². The distributions of nail length for each machine are
superimposed, along with the vertical upper and lower specification bounds:

Figure: Distributions of Nail Length (millimeters)
Nail length from Machine A has a larger variance than nail length from Machine B.
Therefore, any given nail from Machine A has a greater chance of being outside the
specification limits than a nail from Machine B.
Because variance (σ²) is a squared quantity, its units are also squared and may be
confusing to discuss in practice. For example, a sample of waiting times at a bus stop
may have a mean of 15 minutes and a variance of 9 minutes². To resolve this confusion,
variance is often displayed with its square root, the standard deviation (σ), which is a
more intuitive measurement. A variance of 9 minutes² is equivalent to a standard
deviation of 3 minutes.
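The claim that Machine A's nails are more likely to fall outside the specification limits can be checked directly. A minimal sketch, assuming the normal distributions described above (mean 50mm, variances 9mm² and 1mm², specification limits 48mm and 52mm) and using SciPy for the normal CDF:

from math import sqrt
from scipy.stats import norm

mean = 50.0                 # target nail length in mm
lower, upper = 48.0, 52.0   # specification limits (50 +/- 2 mm)

for machine, variance in [("A", 9.0), ("B", 1.0)]:
    sd = sqrt(variance)     # standard deviation is the square root of the variance
    # Probability that a nail's length falls outside [lower, upper]
    p_out = norm.cdf(lower, loc=mean, scale=sd) + (1 - norm.cdf(upper, loc=mean, scale=sd))
    print(f"Machine {machine}: sd = {sd:.0f} mm, P(outside spec) = {p_out:.3f}")

# Machine A (variance 9, sd 3): roughly 50% of nails fall outside the limits.
# Machine B (variance 1, sd 1): roughly 4.6% of nails fall outside the limits.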

Hypothesis test
A procedure that evaluates two mutually exclusive statements about a population. A
hypothesis test uses sample data to determine which statement is best supported by the
data. These two statements are called the null hypothesis and the alternative hypothesis.
They are always statements about population attributes, such as the value of a
parameter, the difference between corresponding parameters of multiple populations, or
the type of distribution that best describes the population. Examples of questions you can
answer with a hypothesis test include:
Is the mean height of undergraduate women equal to 66 inches?
Is the standard deviation of their height equal to 5 inches?
Are male and female undergraduates equal in height?
Does the height of female undergraduates follow a normal distribution?
To illustrate the process, the manager of a pipe manufacturing facility must ensure that
the inside diameters of its pipes equal 5cm. She takes a sample of pipes, measures their

inside diameters, and conducts a hypothesis test on the mean inside pipe diameter. First,
she must formulate her hypotheses.

Null Hypothesis: H₀

States that a population parameter is equal to a desired value. The null hypothesis for the
pipe example is:
H₀: μ = 5

Alternative Hypothesis: H₁ or Hₐ

States that the population parameter is different than the value of the population
parameter in the null hypothesis. In the example, the manager chooses from the
following alternative hypotheses:
If the manager thinks the true population mean...       She will formulate H₁ to be...
is less than the target                                  one-sided: μ < 5
is greater than the target                               one-sided: μ > 5
differs from the target, but she does not know           two-sided: μ ≠ 5
in which direction it differs

After formulating her null and alternative hypotheses, the manager performs her
hypothesis test. The test calculates the probability of obtaining the observed sample data
under the assumption that the null hypothesis is true. If this probability (the p-value) is
below a user-defined cut-off point (the α-level), then this assumption is probably wrong.
Therefore, she would reject the null hypothesis and conclude in favor of the alternative
hypothesis. So, if the manager performs a hypothesis test with a two-sided H₁ and obtains
a p-value of 0.005, she rejects the null hypothesis and concludes that the mean inside pipe
diameter of all pipes is not equal to 5cm.
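A minimal sketch of this 1-sample test in Python, assuming a small hypothetical sample of measured inside diameters and SciPy's one-sample t-test (the data values are illustrative, not from the original example):

from scipy import stats

# Hypothetical inside diameters (cm) measured from the sample of pipes
diameters = [5.02, 4.97, 5.05, 5.01, 4.99, 5.06, 5.03, 4.98, 5.04, 5.02]

# Two-sided test of H0: mu = 5 against H1: mu != 5
t_stat, p_value = stats.ttest_1samp(diameters, popmean=5.0)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0; the mean diameter differs from 5 cm")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0")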

Accelerated life test

Models product performance (usually failure times) at elevated stress levels so that you
can extrapolate the results back to normal conditions. The goal of an accelerated life test
is to speed up the failure process to obtain timely information on products with a long
life. For example, under normal conditions it may take years for a microchip to fail.
However, the same microchip will fail within hours when subjected to high temperatures.
With an accelerated life test you can use the information about when microchips fail
under high temperatures to predict when failures are likely to occur under normal
operating conditions.
Because electronic components often take a long time to fail, accelerated life tests are
common in the electronics industry. Accelerated life tests are also used to predict the
performance of materials such as metals, plastics, motors, insulations, ceramics,
adhesives, and protective coatings. Common performance (response) variables include
fatigue life, cycle time, crack initiation, wear, and corrosion. Common stress variables
include mechanical stress, temperature, vibration, humidity, and voltage.

Fitted distribution line


Use to determine how well sample data follow a specific distribution. Minitab generates
a fitted distribution line using parameter estimates derived from a sample or from
user-entered historical values. These distribution lines are generally overlaid on the
actual data so you can directly compare the empirical data to the hypothesized
distribution. Fitted distribution lines can appear in Histograms, Probability Plots, and
Empirical CDF plots.
For example, you are investigating the strength of your company's product. As an initial
step, you would like to determine if your response data follow a normal distribution. To
do this, you generate the following histogram with the fitted normal distribution.

A visual inspection reveals that the fitted normal distribution is not a perfect fit. There are
more data than expected to the left of the peak and in the right tail. The table displays the

parameter estimates used to generate the curve. If you try fitting another distribution to
the data, the table displays the parameter estimates specific to that distribution. You can
also use the Anderson-Darling statistic on Probability Plots to quantitatively test how well
the data follow a particular distribution.
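A minimal sketch of the same idea outside Minitab, assuming a column of strength measurements and the use of SciPy and Matplotlib (the data values and variable names are illustrative; Minitab produces this display directly):

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Hypothetical strength measurements (illustrative values only)
strength = np.array([23.1, 24.8, 22.5, 25.3, 24.1, 23.7, 26.0, 22.9,
                     24.4, 23.3, 25.1, 24.0, 22.7, 25.6, 23.9, 24.6])

# Estimate the normal parameters from the sample (maximum likelihood)
mu, sigma = norm.fit(strength)

# Histogram of the data with the fitted normal density overlaid
plt.hist(strength, bins=8, density=True, alpha=0.5, label="sample data")
x = np.linspace(strength.min(), strength.max(), 200)
plt.plot(x, norm.pdf(x, mu, sigma), label=f"fitted normal (mu={mu:.2f}, sigma={sigma:.2f})")
plt.legend()
plt.show()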

Confidence intervals probability plot


You can display confidence intervals and set the confidence level for the distribution fit.

Figure: probability plots showing the distribution fit with confidence intervals, and the
distribution fit only.
If you hover the mouse pointer over the fitted line or confidence intervals, Minitab
displays a table of estimated percentiles and their associated pointwise confidence
intervals. The table below shows the 95% confidence bounds. For example, you can be
95% confident that the 20th percentile for the population is between 2.85231 and
3.13318.

Note

Minitab calculates pointwise confidence intervals, so the confidence level applies only to
individual intervals. For a group of intervals, the actual confidence level for all estimates
taken together will be less than the chosen confidence level.

Regression analysis
Generates an equation to describe the statistical relationship between one or more
predictors and the response variable and to predict new observations. Regression
generally uses the ordinary least squares method which derives the equation by
minimizing the sum of the squared residuals.
Regression results indicate the direction, size, and statistical significance of the
relationship between a predictor and response.
The sign of each coefficient indicates the direction of the relationship.
Coefficients represent the mean change in the response for one unit of change in the
predictor, while holding other predictors in the model constant.
The p-value for each coefficient tests the null hypothesis that the coefficient is equal to
zero (no effect). Therefore, low p-values suggest the predictor is a meaningful addition to
your model.

The equation predicts new observations given specified predictor values.


For example, you work for a potato chip company that is analyzing factors that affect the
percentage of crumbled potato chips per container before shipping (response variable).
You are conducting the regression analysis and include the percentage of potato relative
to other ingredients and the cooking temperature (Celsius) as your two predictors. Below
is a simplified table of results.
Regression equation: %Broken Chips = 4.231 - 0.044(%Potato) + 0.023(Cooking
temperature C)
Predictor     Coefficient    P
Constant       4.231         0.322
%Potato       -0.044         0.001
Cook temp.     0.023         0.020

R-Sq = 67.2%
The regression results tell you that both predictors are significant because of their low p-values. Together, the two predictors account for 67.2% of the variance of broken potato
chips. Specifically:
For each 1% increase in the amount of potato, the percentage of broken chips is
expected to decrease by 0.044%.
For each 1 degree Celsius increase in cooking temperature, the percentage of broken
chips is expected to increase by 0.023%.
To predict the percentage of broken chips for settings of 50% potato and a cooking
temperature of 175 C, substitute into the equation: 4.231 - 0.044(50) + 0.023(175) =
6.056% broken potato chips.
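The prediction is just the fitted equation evaluated at the chosen settings. A quick check in Python using the coefficients from the table above (the helper function is hypothetical, for illustration only):

# Coefficients from the fitted regression equation
intercept = 4.231
b_potato = -0.044   # change in %broken chips per 1% increase in potato
b_temp = 0.023      # change in %broken chips per 1 degree C increase in cooking temperature

def predict_broken(pct_potato, cook_temp_c):
    """Predicted percentage of broken chips for the given settings."""
    return intercept + b_potato * pct_potato + b_temp * cook_temp_c

print(predict_broken(50, 175))   # approximately 6.056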
Note

Models with one predictor are referred to as simple linear regression; models with more
than one predictor are referred to as multiple linear regression.

Percentiles
Divide the data set into parts. In general, the nth percentile has n% of the observations
below it, and (100-n)% of observations above it.

Of the total data values, 25% lie below the 25th percentile (red region), while 75% lie
above the 25th percentile (white region).
The 50th percentile represents the median of the data (half the observations fall above it
and half fall below it). In normal capability analysis, the distance between the 99.87th
and 0.13th percentiles is equivalent to 6 standard deviations.
In Minitab, you can use the Calculator to find a given percentile for a column of numbers,
or use Reliability/Survival to estimate percentiles based on a distribution and your sample
data. For example, you can estimate the time at which a certain percent of the population
has failed. In the table below, the 1st percentile, or the time at which 1% of the windings
have failed, is about 10 months.
Table of Percentiles

                                            95.0% Normal CI
Percent   Percentile   Standard Error    Lower       Upper
1         10.0765      2.7845            5.8626     17.3193
2         13.6193      3.2316            8.5543     21.6834
3         16.2590      3.4890           10.6767     24.7601
4         18.4489      3.6635           12.5009     27.2270
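For a simple column of numbers (as opposed to the distribution-based reliability estimates above), percentiles can be computed directly. A minimal Python sketch with hypothetical data:

import numpy as np

# Hypothetical data values (illustrative only)
data = [12, 7, 3, 15, 9, 21, 5, 18, 10, 14, 6, 11]

# 25th percentile: about 25% of the values lie below this point
print(np.percentile(data, 25))

# The 50th percentile is the median
print(np.percentile(data, 50), np.median(data))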
