Confidence interval
A range of values, derived from sample statistics, that is likely to contain the value of an
unknown population parameter. Because of their random nature, it is unlikely that two
samples from a given population will yield identical confidence intervals. But if you
repeated your sample many times, a certain percentage of the resulting confidence
intervals would contain the unknown population parameter. The percentage of these
confidence intervals that contain the parameter is the confidence level of the interval.
For example, suppose you want to know the average amount of time it takes for an
automobile assembly line to complete a vehicle. You take a sample of completed cars,
record the time they spent on the assembly line, and use the 1-sample t procedure to
obtain a 95% confidence interval for the mean amount of time all cars spend on the
assembly line. Because 95% of the confidence intervals constructed from all possible
samples will contain the population parameter, you conclude that the mean amount of
time all cars spend on the assembly line falls between your interval's endpoints, which are
called confidence limits.
Creating confidence intervals is analogous to throwing nets over a target with an
unknown, yet fixed, location. Consider the graphic below, which depicts confidence
intervals generated from 20 samples from the same population. The black line represents
the fixed value of the unknown population parameter; the blue confidence intervals
contain the value of the population parameter; the red confidence interval does not.
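The coverage idea above can be sketched in a short simulation: draw many samples from a population with a known mean, build a 95% interval from each, and count how often the intervals capture the true value. The population parameters and sample size below are illustrative, and the interval uses the normal approximation (z = 1.96) rather than the t procedure for brevity.

```python
import random
import statistics

def confidence_interval(sample, z=1.96):
    """Approximate 95% CI for the mean, using the normal critical value."""
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return mean - z * se, mean + z * se

random.seed(1)
true_mean = 50.0          # the fixed, "unknown" population parameter
trials = 1000
covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, 5.0) for _ in range(30)]
    lo, hi = confidence_interval(sample)
    if lo <= true_mean <= hi:
        covered += 1

print(covered / trials)   # should land close to 0.95
```

Each interval differs from the others, yet roughly 95% of them contain the fixed parameter, which is exactly the "nets over a fixed target" picture.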
Mean
Describes an entire set of observations with a single value representing the center of the
data. The mean (arithmetic average) is the sum of all the observations divided by the
number of observations. For example, the waiting time (in minutes) of five customers in a
bank are: 3, 2, 4, 1, and 2. The mean waiting time is:
(3 + 2 + 4 + 1 + 2) / 5 = 12 / 5 = 2.4 min
Because the mean depends equally on all of the data including extreme values, it may not
be representative of the center for skewed data.
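The arithmetic above, and the mean's sensitivity to extreme values, can be checked directly (the added value of 60 minutes is a hypothetical outlier):

```python
waiting_times = [3, 2, 4, 1, 2]                 # minutes, from the bank example
mean = sum(waiting_times) / len(waiting_times)
print(mean)                                      # 2.4

# A single extreme value pulls the mean far from the center of the data:
with_outlier = waiting_times + [60]              # hypothetical extreme wait
print(sum(with_outlier) / len(with_outlier))     # 12.0
```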
Median
The middle value of the ordered data: half the observations are less than or equal to it
and half the observations are greater than or equal to it.
If the data set contains an odd number of values, the median is simply the middle value
of the ordered data set. For example, in the ordered data set 21, 21, 35, 35, 42, the
median is the third value, 35: two values are lower and two values are higher. If the data
set contains an even number of values, the median is the average of the two middle
values.
Compared to the mean, the median is not sensitive to extreme data values, and is, thus,
often a more informative measure of the center of skewed data.
For example, the mean may not be a good statistic for describing salaries within a
company. The relatively high salaries of a few top earners inflate the overall average,
giving a false impression of salaries at the company. In this case the median is more
informative. The median is equivalent to the second quartile or the 50th percentile.
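The salary example can be made concrete with a small hypothetical data set (figures in thousands of dollars, with one top earner):

```python
import statistics

salaries = [40, 45, 50, 52, 55, 300]   # hypothetical salaries, $1000s

print(statistics.mean(salaries))       # inflated by the one top earner
print(statistics.median(salaries))     # 51.0, closer to a "typical" salary

# The median of an odd-sized ordered set is simply its middle value:
print(statistics.median([21, 21, 35, 35, 42]))   # 35
```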
Standard deviation
The most common measure of dispersion, or how spread out the data are from the mean.
While the range estimates the spread of the data by subtracting the minimum value from
the maximum value, the standard deviation roughly estimates the "average" distance of
the individual observations from the mean. The greater the standard deviation, the greater
the spread in the data.
Standard deviation can be used as a preliminary benchmark for estimating the overall
variation of a process. For example, administrators track the discharge time for patients
treated in the emergency departments of two hospitals. Although the average discharge
times are about the same (35 minutes), the standard deviations are significantly different.
The standard deviation is calculated by taking the positive square root of the variance,
another measure of data dispersion. Standard deviation is often more convenient and
intuitive to work with, however, because it uses the same units as the data. For example,
if a machine part is weighed in grams, the standard deviation of its weight is also
calculated in grams, while its variance is calculated in grams².
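The hospital scenario above can be sketched with hypothetical discharge times: both samples have the same mean, but very different standard deviations.

```python
import statistics

# Hypothetical discharge times (minutes); both hospitals average 35 minutes.
hospital_a = [33, 34, 35, 36, 37]   # tightly clustered around the mean
hospital_b = [15, 25, 35, 45, 55]   # widely spread around the same mean

print(statistics.mean(hospital_a), statistics.stdev(hospital_a))
print(statistics.mean(hospital_b), statistics.stdev(hospital_b))
```

Equal means say nothing about consistency; the much larger standard deviation at hospital B signals far more variable discharge times.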
Variance
A measure of dispersion, which is the extent to which a data set or distribution is
scattered around its mean.
Monitoring variance is essential to the manufacturing and quality industries because a
reduction of process variance increases precision and reduces the number of defects. For
example, a factory produces carpentry nails that are 50mm in length, and a nail meets
specifications if its length is within 2mm of the target value of 50mm. The factory uses
two types of machines to manufacture nails. Both machines produce nails with normally
distributed lengths and a mean length of 50mm. However, nails from each machine have
different variances: Machine A, with the dotted-line distribution below, produces nails
with a variance of 9mm², and Machine B, with the solid-line distribution below, produces
nails with a variance of 1mm². The distributions of nail length for each machine are
superimposed, along with the vertical upper and lower specification bounds:
[Figure: superimposed distributions of nail length (millimeters) with specification bounds]
Nail length from Machine A has a larger variance than nail length from Machine B.
Therefore, any given nail from Machine A has a greater chance of being outside the
specification limits than a nail from Machine B.
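Under the stated normality assumption, the out-of-spec probability for each machine can be computed directly from the two distributions (means, variances, and limits are taken from the example above):

```python
from statistics import NormalDist

machine_a = NormalDist(mu=50, sigma=3)   # variance 9 mm^2
machine_b = NormalDist(mu=50, sigma=1)   # variance 1 mm^2

def p_out_of_spec(dist, lo=48.0, hi=52.0):
    """Probability that a nail length falls outside the 50 +/- 2 mm limits."""
    return dist.cdf(lo) + (1 - dist.cdf(hi))

print(p_out_of_spec(machine_a))   # roughly half of Machine A's nails fail spec
print(p_out_of_spec(machine_b))   # under 5% of Machine B's nails fail spec
```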
Because variance (σ²) is a squared quantity, its units are also squared and may be
confusing to discuss in practice. For example, a sample of waiting times at a bus stop
may have a mean of 15 minutes and a variance of 9 minutes². To resolve this confusion,
variance is often displayed with its square root, the standard deviation (σ), which is a
more intuitive measurement. A variance of 9 minutes² is equivalent to a standard
deviation of 3 minutes.
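The square-root relationship is easy to verify on a small hypothetical sample whose variance works out to exactly 9:

```python
import statistics

waits = [12, 15, 18]                 # hypothetical waiting times (minutes)
var = statistics.variance(waits)     # sample variance: 9 minutes^2
sd = statistics.stdev(waits)         # standard deviation: 3 minutes

print(var, sd)
print(sd == var ** 0.5)              # stdev is the square root of variance
```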
Hypothesis test
A procedure that evaluates two mutually exclusive statements about a population. A
hypothesis test uses sample data to determine which statement is best supported by the
data. These two statements are called the null hypothesis and the alternative hypothesis.
They are always statements about population attributes, such as the value of a
parameter, the difference between corresponding parameters of multiple populations, or
the type of distribution that best describes the population. Examples of questions you can
answer with a hypothesis test include:
Is the mean height of undergraduate women equal to 66 inches?
Is the standard deviation of their height equal to 5 inches?
Are male and female undergraduates equal in height?
Does the height of female undergraduates follow a normal distribution?
To illustrate the process, the manager of a pipe manufacturing facility must ensure that
the inside diameters of its pipes equal 5cm. She takes a sample of pipes, measures their
inside diameters, and conducts a hypothesis test on the mean inside pipe diameter. First,
she must formulate her hypotheses.
Null Hypothesis: H₀
States that a population parameter is equal to a desired value. The null hypothesis for the
pipe example is:
H₀: μ = 5
Alternative Hypothesis: H₁ or Hₐ
States that the population parameter is different from the value of the population
parameter in the null hypothesis. In the example, the manager chooses from the
following alternative hypotheses:
one-sided: μ < 5, if the manager suspects the mean diameter is less than 5cm
one-sided: μ > 5, if she suspects the mean diameter is greater than 5cm
two-sided: μ ≠ 5, if she wants to detect a difference in either direction
After formulating her null and alternative hypotheses, the manager performs her
hypothesis test. The test calculates the probability of obtaining the observed sample data
under the assumption that the null hypothesis is true. If this probability (the p-value) is
below a user-defined cut-off point (the α-level), then this assumption is probably wrong.
Therefore, she would reject the null hypothesis and conclude in favor of the alternative
hypothesis. So, if the manager performs a hypothesis test with a two-sided H₁ and obtains
a p-value of 0.005, she rejects the null hypothesis and concludes that the mean inside pipe
diameter of all pipes is not equal to 5cm.
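The decision rule can be sketched with the standard library by computing the one-sample t statistic and comparing it to a tabled critical value. The diameters below are hypothetical, and 2.262 is the standard two-sided critical value for α = 0.05 with 9 degrees of freedom; a full test would report a p-value instead.

```python
import statistics

# Hypothetical sample of inside pipe diameters (cm)
diameters = [5.02, 4.97, 5.01, 4.99, 5.03, 4.98, 5.00, 5.02, 4.96, 5.04]
mu0 = 5.0                                  # value stated by H0

n = len(diameters)
xbar = statistics.mean(diameters)
s = statistics.stdev(diameters)
t = (xbar - mu0) / (s / n ** 0.5)          # one-sample t statistic

t_crit = 2.262                             # t(0.975, df=9), from a t table
reject_h0 = abs(t) > t_crit

print(t, reject_h0)
```

For this sample the statistic is small, so the manager would fail to reject H₀; there is no evidence that the mean diameter differs from 5cm.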
Accelerated life test
Models product performance (usually failure times) at elevated stress levels so that you
can extrapolate the results back to normal conditions. The goal of an accelerated life test
is to speed up the failure process to obtain timely information on products with a long
life. For example, under normal conditions it may take years for a microchip to fail.
However, the same microchip will fail within hours when subjected to high temperatures.
With an accelerated life test you can use the information about when microchips fail
under high temperatures to predict when failures are likely to occur under normal
operating conditions.
Because electronic components often take a long time to fail, accelerated life tests are
common in the electronics industry. Accelerated life tests are also used to predict the
performance of materials such as metals, plastics, motors, insulations, ceramics,
adhesives, and protective coatings. Common performance (response) variables include
fatigue life, cycle time, crack initiation, wear, and corrosion. Common stress variables
include mechanical stress, temperature, vibration, humidity, and voltage.
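One common way to extrapolate, sketched below under an assumed Arrhenius-type model: life is exponential in 1/T, so ln(life) is fit linearly against reciprocal temperature and projected back to the operating temperature. The failure data and temperatures are entirely hypothetical.

```python
import math

# Hypothetical accelerated test results: temperature (Kelvin) -> hours to failure.
data = {423.0: 200.0, 448.0: 80.0, 473.0: 35.0}

# Assumed Arrhenius model: life = A * exp(B / T), so ln(life) is linear in 1/T.
xs = [1.0 / t for t in data]
ys = [math.log(life) for life in data.values()]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
intercept = ybar - slope * xbar

def predicted_life(temp_k):
    """Extrapolated mean life (hours) at the given temperature."""
    return math.exp(intercept + slope / temp_k)

# Extrapolate from the elevated-stress data down to ~50 C normal operation:
print(predicted_life(323.0))
```

The predicted life at normal temperature is far longer than any observed failure time, which is the point of the method: hours of testing at high stress stand in for years at normal conditions.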
A visual inspection reveals that the fitted normal distribution is not a perfect fit. There are
more data than expected to the left of the peak and in the right tail. The table displays the
parameter estimates used to generate the curve. If you try fitting another distribution to
the data, the table displays the parameter estimates specific to that distribution. You can
also use the Anderson-Darling statistic on probability plots to quantitatively test how well
the data follow a particular distribution.
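The Anderson-Darling statistic itself can be sketched with the standard library. This is only the A² statistic for a normal fit with estimated parameters; the sample data are hypothetical, and critical values / p-values are not computed here.

```python
import math
import statistics
from statistics import NormalDist

def anderson_darling_normal(data):
    """A-squared statistic for normality, with mean/stdev estimated from data."""
    x = sorted(data)
    n = len(x)
    dist = NormalDist(statistics.mean(x), statistics.stdev(x))
    f = [dist.cdf(v) for v in x]
    # A^2 = -n - (1/n) * sum (2i - 1) * [ln F(x_i) + ln(1 - F(x_{n+1-i}))]
    s = sum((2 * i + 1) * (math.log(f[i]) + math.log(1 - f[n - 1 - i]))
            for i in range(n))
    return -n - s / n

sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3, 5.1, 4.9]
print(anderson_darling_normal(sample))   # smaller values suggest a better fit
```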
Regression analysis
Generates an equation to describe the statistical relationship between one or more
predictors and the response variable and to predict new observations. Regression
generally uses the ordinary least squares method, which derives the equation by
minimizing the sum of the squared residuals.
Regression results indicate the direction, size, and statistical significance of the
relationship between a predictor and response.
The sign of each coefficient indicates the direction of the relationship.
Each coefficient represents the mean change in the response for one unit of change in
the predictor, while holding the other predictors in the model constant.
The p-value for each coefficient tests the null hypothesis that the coefficient is equal to
zero (no effect). Therefore, low p-values suggest the predictor is a meaningful addition
to your model.
Predictor     Coef      P
Constant                0.322
%Potato      -0.044     0.001
Cook temp.    0.023     0.020

R-Sq = 67.2%
The regression results tell you that both predictors are significant because of their low
p-values. Together, the two predictors account for 67.2% of the variance of broken
potato chips. Specifically:
For each 1% increase in the amount of potato, the percentage of broken chips is
expected to decrease by 0.044%.
For each 1 degree Celsius increase in cooking temperature, the percentage of broken
chips is expected to increase by 0.023%.
To predict the percentage of broken chips for settings of 50% potato and a cooking
temperature of 175C, you calculate an expected value of 4.831% broken potato chips.
Note
A model with one predictor is referred to as simple linear regression; a model with
more than one predictor is known as multiple linear regression.
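For the simple (one-predictor) case, the least squares fit has a closed form and can be sketched without any libraries. The temperature and broken-chip figures below are hypothetical, not the data behind the table above.

```python
# Hypothetical data: cooking temperature (C) vs. percent broken chips.
temps = [160, 165, 170, 175, 180]
broken = [3.9, 4.1, 4.2, 4.4, 4.5]

n = len(temps)
xbar = sum(temps) / n
ybar = sum(broken) / n

# Ordinary least squares: slope minimizes the sum of squared residuals.
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(temps, broken))
         / sum((x - xbar) ** 2 for x in temps))
intercept = ybar - slope * xbar

print(slope, intercept)
print(intercept + slope * 172)   # predicted % broken at 172 C
```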
Percentiles
Divide the data set into parts. In general, the nth percentile has n% of the observations
below it, and (100-n)% of observations above it.
The 50th percentile represents the median of the data (half the observations fall above it
and half fall below it). In normal capability analysis, the distance between the 99.87th
and 0.13th percentiles is equivalent to 6 standard deviations.
In Minitab, you can use the Calculator to find a given percentile for a column of numbers,
or use Reliability/Survival to estimate percentiles based on a distribution and your sample
data. For example, you can estimate the time at which a certain percent of the population
has failed. In the table below, the 1st percentile, or the time at which 1% of the windings
have failed, is 10 months.
Table of Percentiles

Percent   Percentile   Standard Error   95.0% Normal CI (Lower, Upper)
1         10.0765      2.7845           (5.8626, 17.3193)
          13.6193      3.2316           (8.5543, 21.6834)
          16.2590      3.4890           (10.6767, 24.7601)
          18.4489      3.6635           (12.5009, 27.2270)
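For a plain column of numbers, percentiles can also be computed directly with the standard library; `statistics.quantiles` interpolates between observations. The failure times below are hypothetical.

```python
import statistics

times = [10, 14, 16, 18, 21, 23, 27, 30, 34, 41]   # hypothetical failure times

# n=100 returns the 99 cut points, i.e. the 1st through 99th percentiles.
pct = statistics.quantiles(times, n=100)

print(pct[49])                    # the 50th percentile...
print(statistics.median(times))   # ...equals the median
```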