
Summary: Lesson 1: Introduction to Statistics


This summary contains topic summaries, syntax, and sample programs.

Topic Summaries
Basic Statistical Concepts

Descriptive statistics organizes, describes, and summarizes data using numbers and
graphical techniques. Inferential statistics is concerned with drawing conclusions about
a population from the analysis of a random sample drawn from that population.
Inferential statistics is also concerned with the precision and reliability of those
inferences.
A population is the complete set of observations or the entire group of objects that you
are researching. A sample is a subset of the population. The sample should be
representative of the population. You can obtain a representative sample by collecting
a simple random sample.
Parameters are numerical values that summarize characteristics of a population.
Parameter values are typically unknown and are represented by Greek letters.
Statistics summarize characteristics of a sample. You use letters from the English
alphabet to represent sample statistics. You can measure characteristics of your
sample and provide numerical values that summarize those characteristics. You use
statistics to estimate parameters.
Variables are characteristics or properties of data that take on different values or
amounts. A variable can be independent or dependent. In some contexts, you select
the value of an independent variable in order to determine its relationship to the
dependent variable. In other contexts, the independent variable's values are simply
taken as given.
Variables are also classified according to their characteristics. They can
be quantitative or categorical. Data that consists of counts or measurements is
quantitative. Quantitative data can be further distinguished by two types: discrete and
continuous. Discrete data takes on only a finite, or countable, number of values.
Continuous data has an infinite number of values and no breaks or jumps.
Categorical or attribute data consists of variables that denote groupings or labels.
There are two main types: nominal and ordinal. A nominal categorical variable exhibits
no ordering within its groups or categories. With ordinal categorical variables, the
observed levels of the variable can be ordered in a meaningful way that implies
differences due to magnitude.
A variable's classification is its scale of measurement. There are two scales of
measurement for categorical variables: nominal and ordinal. There are two scales of
measurement for continuous variables: interval and ratio. Data from an interval scale
can be rank-ordered and has a sensible spacing of observations such that differences
between measurements are meaningful. However, interval scales lack the ability to
calculate ratios between numbers on the scale because there is no true zero point.
Data on a ratio scale includes a true zero point and can therefore accurately indicate
the ratio between two values on the measurement scale.
The appropriate statistical method for your data also depends on the number of
variables involved. Univariate analysis provides techniques for analyzing and
describing a single variable at a time. Bivariate analysis describes and explains the
relationship between two variables and how they change, or covary, together.
Multivariate analysis examines two or more variables at the same time, in order to
understand the relationships among them.

Descriptive Statistics

Your data's distribution tells you what values your data takes and how often it takes
those values.
You can calculate descriptive statistics that measure locations in your data. Statistics
that locate the center of the data are measures of central tendency. These include
mean, median, and mode.
Percentiles are descriptive statistics that give you reference points in your data. A
percentile is the value of a variable below which a certain percentage of observations
fall. The most commonly reported percentiles are quartiles, which break the data into
quarters.
There are several descriptive statistics that measure the variability of your data: range,
interquartile range (IQR), variance, standard deviation, and coefficient of variation
(C.V.).
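For intuition, these measures of variability are straightforward to compute by hand. Here is a minimal sketch in plain Python (not SAS) with made-up values, using only the standard library:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, not course data

data_range = max(data) - min(data)             # range
q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles (Q2 is the median)
iqr = q3 - q1                                  # interquartile range
var = statistics.variance(data)                # sample variance (n-1 denominator)
sd = statistics.stdev(data)                    # sample standard deviation
cv = sd / statistics.mean(data) * 100          # coefficient of variation (%)
```

Note that quantile definitions vary between software packages, so the quartiles reported here can differ slightly from those reported by SAS procedures.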
To summarize and generate descriptive statistics, you use the MEANS procedure.
PROC MEANS calculates a standard set of statistics, including the minimum,
maximum, and mean data values, as well as standard deviation and n. The
PRINTALLTYPES option displays statistics for all requested combinations of class
variables.

Picturing Your Data

A histogram is a visual representation of the frequency distribution of your data. The
frequencies are represented by bars.
The normal distribution is a common theoretical distribution in statistics. It is
bell-shaped, with values concentrated near the mean, and it is symmetric around the
mean. The standard deviation (σ) determines how variable the distribution is.
Underlying the normal distribution is a mathematical function named the probability
density function.
To check the assumption that your random sample has a normal distribution, you can
plot a histogram. You can also look at statistical summaries of your data. The closer
skewness and kurtosis are to 0, the closer your data is shaped like the normal
distribution.
Skewness measures the tendency of your data to be more spread out on one side of
the mean than on the other. It measures the asymmetry of the distribution. The
direction of skewness is the direction to which the data is trailing off. The closer the
skewness is to 0, the more normal or symmetric the data.
Kurtosis measures the tendency of data to be concentrated toward the center or
toward the tails of the distribution. The closer kurtosis is to 0, the closer the tails of the
data resemble the tail thickness of the normal distribution. Kurtosis can be difficult to
assess visually.
A negative kurtosis statistic means that the data has lighter tails than in a normal
distribution and is less heavily concentrated about the mean. This is a platykurtic
distribution.
A positive kurtosis statistic means that the data has heavier tails and is more
concentrated about the mean than a normal distribution. This is a leptokurtic
distribution, which is often referred to as heavy-tailed and also as an outlier-prone
distribution.
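The moment definitions behind these two statistics can be sketched in a few lines of plain Python (not SAS, illustrative data only; note that PROC UNIVARIATE applies small-sample bias corrections, so its reported values differ slightly for small n):

```python
def skewness_kurtosis(xs):
    """Moment-based skewness (g1) and excess kurtosis (g2).
    g1 near 0 suggests symmetry; g2 < 0 means lighter tails (platykurtic),
    g2 > 0 means heavier tails (leptokurtic) than the normal distribution."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

# symmetric data: skewness is exactly 0; a flat shape gives negative kurtosis
g1, g2 = skewness_kurtosis([1, 2, 3, 4, 5])
```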
A normal probability plot is another way to visualize and assess the distribution of your
data. The vertical axis represents the actual data values. The horizontal axis displays
the expected percentiles from a standard normal distribution. The normal reference
line along the diagonal indicates where the data would fall if it were perfectly normal.
A box plot makes it easy to see how spread out the data is and if there are any
outliers.
You can use PROC UNIVARIATE to generate descriptive statistics, histograms, and
normal probability plots.
In the ID statement, you list the variable or variables that SAS should label in the table
of extreme observations and identify as outliers in the graphs.
You can add additional options to the HISTOGRAM and PROBPLOT statements. The
NORMAL option uses estimates of the population mean and standard deviation to add
a normal curve overlay to the histogram and a diagonal reference line to the normal
probability plot.
You can use the INSET statement to create a box of summary statistics directly on the
graphs.
In addition to the statistical graphics available to you with PROC UNIVARIATE, you
might want to use PROC SGSCATTER, PROC SGPLOT, PROC SGPANEL, and
PROC SGRENDER to produce a wide variety of additional plot types.
You can use PROC SGPLOT to generate dot plots, horizontal and vertical bar charts,
histograms, box plots, density curves, scatter plots, series plots, band plots, needle
plots, and vector plots. The REG statement generates a fitted regression line or curve.
You use a REFLINE statement to create a horizontal or vertical reference line on the
plot.
ODS Graphics is an extension of the SAS Output Delivery System. With ODS
Graphics, statistical procedures produce graphs as automatically as they produce
tables, and graphs are integrated with tables in the ODS output. You can find a list of

the graphs available for each SAS procedure in the SAS documentation.

Confidence Intervals for the Mean

A point estimator is a sample statistic used to estimate a population parameter. A
statistic that measures the variability of your estimator is the standard error.
The standard error of the mean measures the variability of your sample mean. It's an
estimate of how much you can expect the sample mean to vary from sample to
sample.
The distribution of sample means is the distribution of all possible sample means from
the population. The distribution of the mean is always less variable than the data.
An interval estimator is another way to estimate a population parameter. It
incorporates the uncertainty that arises from random variability.
Confidence intervals are a type of interval estimator used to estimate the population
mean, while taking into account the variability of the sample mean.
The central limit theorem states that the distribution of sample means is approximately
normal, regardless of the population distribution's shape, if the sample size is large
enough.
You can use the MEANS procedure to generate a 95% confidence interval for the
mean.
You can use the CLM option in the PROC MEANS statement to calculate the
confidence limits for the mean.
You can add the ALPHA= option to the PROC MEANS statement in order to construct
confidence intervals with a different confidence level.
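Conceptually, the interval is the sample mean plus or minus a critical value times the standard error. A minimal sketch in plain Python (not SAS, made-up scores; it uses the normal z critical value for simplicity, whereas PROC MEANS uses the t distribution, which is slightly wider for small samples):

```python
import math
import statistics

def mean_ci(xs, level=0.95):
    """Approximate CI for the mean: xbar +/- z * (s / sqrt(n))."""
    n = len(xs)
    xbar = statistics.mean(xs)
    se = statistics.stdev(xs) / math.sqrt(n)        # standard error of the mean
    z = statistics.NormalDist().inv_cdf((1 + level) / 2)
    return xbar - z * se, xbar + z * se

lo, hi = mean_ci([1150, 1230, 1180, 1300, 1210, 1260, 1190, 1280])
```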

Hypothesis Testing
A hypothesis test uses sample data to evaluate a question about a population. It
provides a way to make inferences about a population, based on sample data.
There are four steps in conducting a hypothesis test. The first step is to identify the
population of interest and determine the null and alternative hypotheses. The null
hypothesis, H0, is what you assume to be true, unless proven otherwise. It is usually a
hypothesis of equality. The alternative hypothesis, Ha or H1, is typically what you
suspect, or are attempting to demonstrate. It is usually a hypothesis of inequality.
The second step in hypothesis testing is to select the significance level. This is the
amount of evidence needed to reject the null hypothesis. A common significance level
is 0.05 (1 chance in 20).
The third step is to collect the data. The fourth step is to use a decision rule to
evaluate the data. You decide whether or not there is enough evidence to reject the
null hypothesis.

If you reject the null hypothesis when it's actually true, you've made a Type I error. The
probability of committing a Type I error is α, the significance level of a test. If you
fail to reject the null hypothesis and it's actually false, you've made a Type II error. The
probability of committing a Type II error is β. Type I and II errors are inversely
related. The power of a statistical test is equal to 1 minus beta (1 − β).
The difference between the observed statistic and the hypothesized value is the effect
size. A p-value measures the probability of observing a value as extreme as the one
observed or more extreme. A p-value is not only affected by the effect size, but also by
the sample size.
The t statistic measures how far X-bar, the sample mean, is from the hypothesized
mean, μ0. If the t statistic is much higher or lower than 0 and has a small
corresponding p-value, this indicates that the sample mean is quite different from the
hypothesized mean, and you would reject the null hypothesis.
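The t statistic itself is simple to compute by hand. A small sketch in plain Python (not SAS, illustrative data; getting the p-value additionally requires the t distribution's CDF, which PROC UNIVARIATE handles for you):

```python
import math
import statistics

def one_sample_t(xs, mu0):
    """t = (xbar - mu0) / (s / sqrt(n)): the number of standard
    errors the sample mean lies from the hypothesized mean."""
    se = statistics.stdev(xs) / math.sqrt(len(xs))
    return (statistics.mean(xs) - mu0) / se

t = one_sample_t([1150, 1230, 1180, 1300, 1210, 1260, 1190, 1280], mu0=1200)
```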
You can use PROC UNIVARIATE to perform a statistical hypothesis test. You use the
MU0= option to specify the value of the hypothesized mean, μ0. You can use the
ALPHA= option to change the significance level.

Syntax
PROC MEANS DATA=SAS-data-set <options>;
CLASS variables;
VAR variables;
RUN;
PROC UNIVARIATE DATA=SAS-data-set <options>;
VAR variables;
ID variables;
HISTOGRAM variables </options>;
PROBPLOT variables </options>;
INSET keywords </options>;
RUN;
PROC SGPLOT DATA=SAS-data-set<options>;
DOT category-variable </option(s)>;
HBAR category-variable </option(s)>;
VBAR category-variable </option(s)>;
HBOX response-variable </option(s)>;
VBOX response-variable </option(s)>;
HISTOGRAM response-variable </option(s)>;
SCATTER X=variable Y=variable </option(s)>;
NEEDLE X=variable Y=numeric-variable </option(s)>;
REG X=numeric-variable Y=numeric-variable </option(s)>;
REFLINE variable | value-1 <... value-n> </option(s)>;
RUN;
ODS GRAPHICS ON <options>;
statistical procedure code
ODS GRAPHICS OFF;

Sample Programs

Using PROC MEANS to Generate Descriptive Statistics


proc means data=statdata.testscores maxdec=2 fw=10
printalltypes
n mean median std var q1 q3;
class Gender;
var SATScore;
title 'Selected Descriptive Statistics for SAT Scores';
run;
title;

Using SAS to Picture Your Data


proc univariate data=statdata.testscores;
var SATScore;
id idnumber;
histogram SATScore / normal(mu=est sigma=est);
inset skewness kurtosis / position=ne;
probplot SATScore / normal(mu=est sigma=est);
inset skewness kurtosis;
title 'Descriptive Statistics Using PROC UNIVARIATE';
run;
title;
proc sgplot data=statdata.testscores;
refline 1200 / axis=y lineattrs=(color=blue);
vbox SATScore / datalabel=IDNumber;
format IDNumber 8.;
title "Box Plots of SAT Scores";
run;
title;

Calculating a 95% Confidence Interval


proc means data=statdata.testscores maxdec=4
n mean stderr clm;
var SATScore;
title '95% Confidence Interval for SAT';
run;
title;

Using PROC UNIVARIATE to Perform a Hypothesis Test


ods select testsforlocation;
proc univariate data=statdata.testscores mu0=1200;
var SATScore;
title 'Testing Whether the Mean of SAT Scores = 1200';
run;
title;

Statistics I: Introduction to ANOVA, Regression, and Logistic Regression


Copyright 2014 SAS Institute Inc., Cary, NC, USA. All rights reserved.


Summary: Lesson 2: Analysis of Variance (ANOVA)


This summary contains topic summaries, syntax, and sample programs.

Topic Summaries
Two-Sample t-Tests

The two-sample t-test is a hypothesis test for answering questions about the means of
two populations. You can examine the differences between populations for one or
more continuous variables and assess whether the means of the two populations are
statistically different from each other.
The null hypothesis for the two-sample t-test is that the means for the two groups are
equal. The alternative hypothesis is the logical opposite of the null and is typically what
you suspect or are trying to show. It is usually a hypothesis of inequality. The
alternative hypothesis for the two-sample t-test is that the means for the two groups
are not equal.
The three assumptions for the two-sample t-test are independence, normality, and
equal variances.
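When the equal-variances assumption holds, the test statistic pools the two sample variances. A minimal sketch of the pooled form in plain Python (not SAS, made-up groups; PROC TTEST also reports an unequal-variance Satterthwaite version):

```python
import math
import statistics

def pooled_t(xs, ys):
    """Equal-variance two-sample t statistic.
    The pooled variance weights each group's variance by its df."""
    n1, n2 = len(xs), len(ys)
    sp2 = ((n1 - 1) * statistics.variance(xs) +
           (n2 - 1) * statistics.variance(ys)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (statistics.mean(xs) - statistics.mean(ys)) / se

t = pooled_t([5, 7, 9, 11], [4, 6, 8, 10])
```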
You use the F-test for equality of variances to evaluate the assumption of equal
variances in the two populations. You calculate the F statistic, which is the ratio of the
maximum sample variance of the two groups to the minimum sample variance of the
two groups. If the p-value of the F-test is greater than your alpha, you fail to reject the
null hypothesis and can proceed as if the variances are equal between the groups. If
the p-value of the F-test is less than your alpha, you reject the null hypothesis and can
proceed as if the variances are not equal.
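The folded F statistic is just a variance ratio. A quick sketch in plain Python (not SAS, illustrative data; the p-value comes from the F distribution, which PROC TTEST computes for you):

```python
import statistics

def folded_f(xs, ys):
    """Folded F: larger sample variance over smaller, so F >= 1.
    Values near 1 are consistent with equal variances."""
    v1, v2 = statistics.variance(xs), statistics.variance(ys)
    return max(v1, v2) / min(v1, v2)

f = folded_f([10, 12, 14, 16], [9, 13, 17, 21])
```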
With one-sided tests, you look for a difference in one direction. For instance, you can
test to determine whether the mean of one population is greater than or less than the
mean of another population. An advantage of one-sided tests is that they can increase
the power of a statistical test.
To perform the two-sample t-test and the one-sided test, you can use PROC TTEST.
You add the PLOTS option to the PROC TTEST statement to control the plots that
ODS produces. You add the SIDES=U or SIDES=L option to specify an upper or lower
one-sided test.

One-Way ANOVA

You can use ANOVA to determine whether there are significant differences between
the means of two or more populations. In this model, you have a continuous
dependent, or response, variable and a categorical independent, or predictor, variable.
With ANOVA, the null hypothesis is that all of the population means are equal.
The alternative hypothesis is that not all of the population means are equal. In other
words, at least one mean is different from the rest.
One way to represent the relationship between the response and predictor variables in
ANOVA is with a mathematical ANOVA model.
ANOVA analyzes the variances of the data to determine whether there is a difference
between the group means. You can determine whether the variation of the means is
large enough relative to the variation of observations within the group. To do this,
you calculate three types of sums of squares: between group variation (SSM), within
group variation (SSE), and total variation (SST). The SSM and SSE represent pieces
of the total variability. If the SSM is large relative to the SSE, the F statistic is large
and you reject the null hypothesis that all of the group means are equal.
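The three sums of squares can be computed directly from the group data. A minimal sketch in plain Python (not SAS, toy groups) showing the decomposition SST = SSM + SSE:

```python
import statistics

def anova_ss(groups):
    """One-way ANOVA sums of squares: between (SSM), within (SSE),
    and total (SST). SSM + SSE always equals SST."""
    allx = [x for g in groups for x in g]
    grand = statistics.mean(allx)
    ssm = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    sse = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    sst = sum((x - grand) ** 2 for x in allx)
    return ssm, sse, sst

ssm, sse, sst = anova_ss([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```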
Before you perform the hypothesis test, you need to verify the three ANOVA
assumptions: the observations are independent observations, the error terms are
normally distributed, and the error terms have equal variances across groups.
The residuals that come from your data are estimates of the error term in the model.
You calculate the residuals from ANOVA by taking each observation and subtracting its
group mean. Then you verify the two assumptions regarding normality and equal
variances of the errors.
To verify the ANOVA assumptions and perform the ANOVA test, you use PROC GLM.
In the MODEL statement, you specify the dependent and independent variables for the
analysis. The MEANS statement computes unadjusted means of the dependent
variable for each value of the specified effect. You can add the HOVTEST option to the
MEANS statement to perform Levene's test for homogeneity of variances. If the
resulting p-value of Levene's test is greater than 0.05 (typically), then you fail to reject
the null hypothesis of equal variances.

ANOVA with Data from a Randomized Block Design

In a controlled experiment, you can design the analysis prospectively and control for
other factors, nuisance factors, that affect the outcome you're measuring. Nuisance
factors can affect the outcome of your experiment, but are not of interest in the
experiment. In a randomized block design, you can use a blocking variable to control
for the nuisance factors and reduce or eliminate their contribution to the experimental
error.
One way to represent the relationship between the response and predictor variables in
ANOVA is with a mathematical ANOVA model. You can also include a blocking
variable in the model.

Along with the three original ANOVA assumptions of independent observations,
normally distributed errors, and equal variances across treatments, you make two
more assumptions when you include a blocking factor in the model. You assume that
the treatments are randomly assigned within each block, and you assume that the
effects of the treatment factor are constant across levels of the blocking factor.
You use PROC GLM to perform ANOVA with a blocking variable. You list the blocking
variable in the CLASS statement and in the MODEL statement.

ANOVA Post Hoc Tests

A pairwise comparison examines the difference between two treatment means. If your
ANOVA results suggest that you reject the null hypothesis that the means are equal
across groups, you can conduct multiple pairwise comparisons in a post hoc analysis
to learn which means differ.
The chance that you make a Type I error increases each time you conduct a statistical
test. The comparisonwise error rate, or CER, is the probability of a Type I error on a
single pairwise test. The experimentwise error rate, or EER, is the probability of making
at least one Type I error when performing all of the pairwise comparisons. The EER
increases as the number of pairwise comparisons increases.
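Treating the comparisons as if they were independent gives a simple upper-bound illustration of how fast the EER grows (plain Python; actual pairwise tests on the same data are correlated, so this is only approximate):

```python
# EER = 1 - (1 - alpha)^k for k independent tests at level alpha
alpha = 0.05
eer = {k: 1 - (1 - alpha) ** k for k in (1, 3, 6, 10)}
# with 10 comparisons, the chance of at least one Type I error
# is already about 40%
```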
You can use the Tukey method to control the EER. This test compares all possible
pairs of means, so it can only be used when you make pairwise
comparisons. Dunnett's method is a specialized multiple comparison test that enables
you to compare a single control group to all other groups.
You request all of the multiple comparison methods with options in the LSMEANS
statement in PROC GLM. You use the PDIFF=ALL option to request p-values for the
differences between all of the means. With this option, SAS produces a diffogram. You
use the ADJUST= option to specify the adjustment method for multiple comparisons.
When you specify the ADJUST=Dunnett option, SAS produces multiple comparisons
using Dunnett's method and a control plot.

Two-Way ANOVA with Interactions

When you have two categorical predictor variables and a continuous response
variable, you can analyze your data using two-way ANOVA. With two-way ANOVA, you
can examine the effects of the two predictor variables concurrently. You can also
determine whether they interact with respect to their effect on the response variable.
An interaction means that the effect of one variable depends on the value of another
variable. If there is no interaction, you can interpret the test for the individual factor
effects to determine their significance. If an interaction exists between any factors, the
test for the individual factor effects might be misleading due to the masking of these
effects by the interaction.
You can include more than one predictor variable and interactions in the ANOVA
model.
You can graphically explore the relationship between the response variable and the
effect of the interaction between the two predictor variables using PROC SGPLOT.

You can use PROC GLM to determine whether the effects of the predictor variables
and the interaction between them are statistically significant.

Syntax
PROC TTEST DATA=SAS-data-set<options>;
CLASS variable;
VAR variable(s);
RUN;

Selected Options in PROC TTEST

Statement      Option
PROC TTEST     PLOTS(SHOWNULL)=INTERVAL
               SIDES=U
               SIDES=L

PROC GLM DATA=SAS-data-set<options>;


CLASS variable(s);
MODEL dependents=independents </options>;
MEANS effects </options>;
LSMEANS effects </options>;
RUN;
QUIT;

Selected Options in PROC GLM

Statement      Option
PROC GLM       PLOTS(ONLY)=DIAGNOSTICS(UNPACK)
MEANS          HOVTEST
LSMEANS        PDIFF=ALL
               ADJUST=

Sample Programs

Running PROC TTEST in SAS


proc ttest data=statdata.testscores plots(shownull)=interval;
class Gender;
var SATScore;
title 'Two-Sample t-Test Comparing Girls to Boys';
run;

title;

Performing a One-Sided t-Test


proc ttest data=statdata.testscores plots(shownull)=interval
h0=0 sides=U;
class Gender;
var SATScore;
title 'One-Sided t-Test Comparing Girls to Boys';
run;
title;

Examining Descriptive Statistics across Groups


proc means data=statdata.mggarlic printalltypes maxdec=3;
var BulbWt;
class Fertilizer;
title 'Descriptive Statistics of Garlic Weight';
run;
proc sgplot data=statdata.mggarlic;
vbox BulbWt / category=Fertilizer datalabel=BedID;
format BedID 5.;
title 'Box Plots of Garlic Weight';
run;
title;

Using the GLM Procedure


proc glm data=statdata.mggarlic
plots(only)=diagnostics(unpack);
class Fertilizer;
model BulbWt=Fertilizer;
means Fertilizer / hovtest;
title 'Testing for Equality of Means with PROC GLM';
run;
quit;
title;

Performing ANOVA with Blocking


proc glm data=statdata.mggarlic_block
plots(only)=diagnostics(unpack);
class Fertilizer Sector;
model BulbWt=Fertilizer Sector;
title 'ANOVA for Randomized Block Design';
run;
quit;
title;

Performing a Post Hoc Pairwise Comparison


ods select lsmeans diff meanplot diffplot controlplot;
proc glm data=statdata.mggarlic_block;
class Fertilizer Sector;
model BulbWt=Fertilizer Sector;
lsmeans Fertilizer / pdiff=all adjust=tukey;
lsmeans Fertilizer / pdiff=controlu('4') adjust=dunnett;
lsmeans Fertilizer / pdiff=all adjust=t;
title 'Garlic Data: Multiple Comparisons';
run;
quit;
title;

Examining Your Data with PROC MEANS


proc format;
value dosef
1="Placebo"
2="100mg"
3="200mg"
4="500mg";
run;
proc means data=statdata.drug mean var std printalltypes;
class Disease DrugDose;
var BloodP;
output out=means mean=BloodP_Mean;
format DrugDose dosef.;
title 'Selected Descriptive Statistics for Drug Data Set';
run;
title;

Examining Your Data with PROC SGPLOT


proc sgplot data=means;
where _TYPE_=3;
scatter x=DrugDose y=BloodP_Mean
/ group=Disease markerattrs=(size=10);
series x=DrugDose y=BloodP_Mean
/ group=Disease lineattrs=(thickness=2);
xaxis integer;
format DrugDose dosef.;
title 'Plot of Stratified Means in Drug Data Set';
run;
title;

Performing Two-Way ANOVA with Interactions


proc glm data=statdata.drug;
class DrugDose Disease;
model Bloodp=DrugDose Disease DrugDose*Disease;
format DrugDose dosef.;
title1 'Analyze the Effects of DrugDose and Disease';
title2 'including Interactions';
run;
quit;
title;

Performing a Post Hoc Pairwise Comparison


proc format;
value dosef
1="Placebo"
2="100mg"
3="200mg"
4="500mg";
run;
ods select meanplot lsmeans slicedanova;
proc glm data=statdata.drug;
class DrugDose Disease;
model Bloodp=DrugDose Disease DrugDose*Disease;
lsmeans DrugDose*Disease / slice=Disease;
format DrugDose dosef.;
title 'Analyze the Effects of DrugDose at Each Level of Disease';
run;
quit;
title;


Summary: Lesson 3: Regression


This summary contains topic summaries, syntax, and sample programs.

Topic Summaries
To analyze continuous variables, you can use linear regression. To investigate your
data before performing linear regression, you can use techniques for exploratory data
analysis, including scatter plots and correlation analysis. In exploratory data analysis,
you're simply trying to explore the relationships between variables and to screen for
outliers.
Scatter plots are an important tool for describing the relationship between continuous
variables. Plot your data! You can use scatter plots to examine the relationship
between two continuous variables, to detect outliers, to identify trends in your data, to
identify the range of X and Y values, and to communicate the results of a data

analysis.
You can also use correlation analysis to quantify the relationship between two
variables. Correlation statistics measure the strength of the linear relationship between
two continuous variables. Two variables are correlated if there is a linear association
between them. A common correlation statistic used for continuous variables is
the Pearson correlation coefficient, which ranges from −1 to +1.
The population parameter that represents a correlation is ρ (rho). The null hypothesis
for a test of a correlation coefficient is that ρ equals 0, and the alternative hypothesis is
that ρ is not 0. Rejecting the null hypothesis means only that you can be confident that the
true population correlation is not exactly 0. You need to avoid common mistakes when
interpreting the correlation between variables.
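The Pearson coefficient has a simple closed form. A minimal sketch in plain Python (not SAS, toy data; PROC CORR also reports the p-value for the test that the correlation is 0):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled to lie in [-1, +1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # perfect positive linear relationship
```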
To produce correlation statistics and scatter plots for your data, you use PROC CORR.
To rank-order the absolute value of the correlations from highest to lowest, you add
the RANK option to the PROC CORR statement. To produce scatter plots, you add
the PLOTS= option in the PROC CORR statement. You can also add context-specific
options in parentheses following the main option keyword, such as PLOTS or
SCATTER.
To examine the correlations between the potential predictor variables, you produce
a correlation matrix and scatter plot matrix by using the NOSIMPLE, PLOTS=MATRIX,
and HISTOGRAM options. To specify tooltips for hovering over data points and seeing
detailed information about the observations, you use the IMAGEMAP=ON option in the
ODS GRAPHICS statement and an ID statement in the PROC CORR step.

In correlation analysis, you determine the strength of the linear relationships between
continuous response variables. In simple linear regression, you use the simple linear
regression model to determine the equation for the straight line that defines the linear
relationship between the response variable and the predictor variable.
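The least-squares line has closed-form estimates: the slope is the sum of x-y cross-products divided by the x sum of squares, and the intercept makes the line pass through the point of means. A minimal sketch in plain Python (not SAS, toy data; PROC REG fits this same line and adds tests and diagnostics):

```python
def least_squares(xs, ys):
    """Least-squares estimates for the line y = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx                 # slope
    b0 = my - b1 * mx              # intercept: line passes through (mx, my)
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
```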
To determine how much better the model that takes the predictor variable into account
is than a model that ignores the predictor variable, you can compare the simple linear
regression model to a baseline model. For your comparison, you calculate the
explained, unexplained, and total variability in the simple linear regression model.
The null hypothesis for linear regression is that the regression model does not fit the
data better than the baseline model.The alternative hypothesis is that the regression
model does fit the data better than the baseline model. In other words, the slope of the
regression line is not equal to 0, or the parameter estimate of the predictor variable is
not equal to 0.
Before performing simple linear regression, you need to verify the four assumptions for
linear regression: that the mean of the response variable is linearly related to the value
of the predictor variable, that the error terms are normally distributed, that the error
terms have equal variances, and that the error terms are independent at each value of
the predictor variable.
To fit regression models to your data, you use PROC REG. The MODEL
statement specifies the response variable and the predictor variable. To evaluate your
model, you typically examine the p-value for the overall model, the R-square value,
and the parameter estimates.
To assess the level of precision around the mean estimates of the response variable,
you can produce confidence intervals around the means and construct prediction

intervals for a single observation. To display confidence and prediction intervals, you
can specify the CLM and CLI options in the MODEL statement.
To produce predicted values for small data sets using PROC REG, you create a new
data set containing the values of the independent variable for which you want to make
predictions, concatenate the new data set with the original data set, and fit a simple
linear regression model to the new data set.
To produce predicted values for large data sets, using PROC REG and PROC
SCORE is more efficient. You can use the NOPRINT and OUTEST= options in a
PROC REG statement to write the parameter estimates from PROC REG to an output
data set. Then you score the new observations using PROC SCORE, with
the SCORE= option specifying the data set containing the parameter estimates,
the OUT= option specifying the data set that PROC SCORE creates, and
the TYPE= option specifying what type of data the SCORE= data set contains.

In multiple regression, you can model the relationship between the response variable
and more than one predictor variable. In a model with two predictor variables, you can
model the relationship of the three variables (three dimensions) with a two-dimensional
plane.
Multiple linear regression has advantages and disadvantages. Its biggest advantage is
that it's more powerful than simple linear regression, that is, you can determine
whether a relationship exists between the response variable and several predictor
variables at the same time. The disadvantages of multiple linear regression are that
you have to decide which model to use, and that when you have more predictors,
interpreting the model becomes more complicated.
You can use multiple regression in two ways: for analytical or explanatory analysis and
for prediction. If you specify many terms, the model for multiple regression can
become very complex.
The hypotheses for multiple regression are similar to those for simple linear
regression. The null hypothesis is that the multiple regression model does not fit the
data better than the baseline model. (All the slopes or parameter estimates are equal
to 0.) The alternative hypothesis is that the regression model does fit the data better
than the baseline model.
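In symbols, for a model with k predictors, these hypotheses can be written as follows (a standard formulation, not notation specific to this course):

```latex
H_0:\ \beta_1 = \beta_2 = \cdots = \beta_k = 0
\qquad \text{versus} \qquad
H_a:\ \text{at least one } \beta_j \neq 0,\ j = 1, \ldots, k
```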
For multiple linear regression, the same four assumptions as for simple linear
regression apply: that the mean of the response variable is linearly related to the
values of the predictor variables, that the error terms are normally distributed, that the
error terms have equal variances, and that the error terms are independent at each
combination of values of the predictor variables.
To compare multiple linear regression models, you typically examine the p-value for
the overall models, the adjusted R-square values, and the parameter estimates. The
adjusted R-square value takes into account the number of terms in the model and
increases only if new terms significantly improve the model.
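For reference, the adjusted R-square reported by PROC REG follows the standard formula, where n is the number of observations and p is the number of parameters including the intercept:

```latex
R^2_{adj} = 1 - \frac{(n - 1)\,(1 - R^2)}{n - p}
```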

Your first decision in model selection is whether to use a manual or automated
approach. Automated model selection techniques in SAS fall into two general
categories: the all-possible regressions method and stepwise selection methods. For a
large number of potential predictor variables, the stepwise regression methods might
be a better option. The all-possible regressions method produces more candidate
models, which requires you to use your expertise to select a model.


In the all-possible regressions method, SAS calculates all possible regression models.
To describe your model, you can add an optional label to the MODEL statement. You
can reduce the number of models in the output by specifying the BEST= option in the
MODEL statement. To help evaluate the models you produce, you can
request Mallows' Cp statistic in the PLOTS= option in the PROC REG statement and in
the SELECTION= option in the MODEL statement. To request statistics for each
model, you can specify them in the SELECTION= option. To select the best model for
prediction, you should use Mallows' criterion for Cp. To select the best model for
parameter estimation, you should use Hocking's criterion for Cp.
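These two criteria are commonly stated as follows, where p is the number of parameters in the candidate model (including the intercept) and p_full is the number of parameters in the full model; treat this as a standard formulation rather than the course's exact notation:

```latex
\text{Mallows (prediction):}\quad C_p \le p
\qquad\qquad
\text{Hocking (parameter estimation):}\quad C_p \le 2p - p_{full} + 1
```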
Stepwise selection methods are another, less computer-intensive way to find good
candidate models without having to generate all possible models. You can specify
forward, backward, and stepwise methods in the SELECTION= option in the MODEL
statement. Each method selects variables based on their p-values. To change the
default p-values that PROC REG uses to select variables, you can use the SLENTRY=
and SLSTAY= options in the MODEL statement. It's a good idea to always run all three
stepwise selection methods and look for commonalities among the final models for all
three methods.

Syntax
To go to the movie where you learned a statement or option, select a link.
PROC CORR DATA=SAS-data-set <options>;
VAR variable(s);
WITH variable(s);
RUN;

Selected Options in PROC CORR

Statement: PROC CORR
Options: RANK, PLOTS=<(context-specific options)>, NOSIMPLE

Selected ODS Option

Statement: ODS GRAPHICS
Option: IMAGEMAP=ON

PROC REG DATA=SAS-data-set <options>;
MODEL dependent=regressor(s) </ options>;
ID variable(s);
RUN;

Selected Options in PROC REG

Statement: PROC REG
Options: NOPRINT, OUTEST=

Statement: MODEL
Options: CLM, CLI, P

PROC SCORE DATA=SAS-data-set SCORE=SAS-data-set OUT=SAS-data-set TYPE=name <options>;


VAR variable(s);
RUN;

Sample Programs
Producing Correlation Statistics and Scatter Plots
proc corr data=statdata.fitness rank
plots(only)=scatter(nvar=all ellipse=none);
var RunTime Age Weight Run_Pulse
Rest_Pulse Maximum_Pulse Performance;
with Oxygen_Consumption;
title "Correlations and Scatter Plots with Oxygen_Consumption";
run;
title;
Examining Correlations between Predictor Variables
ods graphics on / imagemap=on;
proc corr data=statdata.fitness nosimple
plots=matrix(nvar=all histogram);
var RunTime Age Weight Run_Pulse
Rest_Pulse Maximum_Pulse Performance;
id name;
title "Correlations with Oxygen_Consumption";
run;
title;
Performing Simple Linear Regression
proc reg data=statdata.fitness;
model Oxygen_Consumption=RunTime;
title 'Predicting Oxygen_Consumption from RunTime';
run;
quit;
title;
Viewing and Printing Confidence Intervals and Prediction Intervals
proc reg data=statdata.fitness;
model Oxygen_Consumption=RunTime / clm cli;
id name runtime;
title 'Predicting Oxygen_Consumption from RunTime';
run;
quit;
title;
Producing Predicted Values of the Response Variable
data need_predictions;
input RunTime @@;
datalines;
9 10 11 12 13
;
run;
data predoxy;
set need_predictions
statdata.fitness;
run;

proc reg data=predoxy;


model Oxygen_Consumption=RunTime / p;
id RunTime;
title 'Oxygen_Consumption=RunTime with Predicted Values';
run;
quit;
title;
Storing Parameter Estimates and Scoring
proc reg data=statdata.fitness noprint outest=estimates;
model Oxygen_Consumption=RunTime;
run;
quit;
proc print data=estimates;
title "OUTEST= Data Set from PROC REG";
run;
title;
proc score data=need_predictions score=estimates
out=scored type=parms;
var RunTime;
run;
proc print data=Scored;
title "Scored New Observations";
run;
title;
Performing Multiple Linear Regression
proc reg data=statdata.fitness;
model Oxygen_Consumption=Performance RunTime;
title 'Multiple Linear Regression for Fitness Data';
run;
quit;
title;
Using Automatic Model Selection
ods graphics / imagemap=on;
proc reg data=statdata.fitness plots(only)=(cp);
ALL_REG: model Oxygen_Consumption=
Performance RunTime Age Weight
Run_Pulse Rest_Pulse Maximum_Pulse
/ selection=cp rsquare adjrsq best=20;
title 'Best Models Using All-Regression Option';
run;
quit;
title;
Estimating and Testing Coefficients for Selected Models
proc reg data=statdata.fitness;
PREDICT: model Oxygen_Consumption=
RunTime Age Run_Pulse Maximum_Pulse;
EXPLAIN: model Oxygen_Consumption=
RunTime Age Weight Run_Pulse Maximum_Pulse;
title 'Check "Best" Two Candidate Models';
run;
quit;
title;
Performing Stepwise Regression
proc reg data=statdata.fitness plots(only)=adjrsq;
FORWARD: model Oxygen_Consumption=
Performance RunTime Age Weight
Run_Pulse Rest_Pulse Maximum_Pulse

/ selection=forward;
BACKWARD: model Oxygen_Consumption=
Performance RunTime Age Weight
Run_Pulse Rest_Pulse Maximum_Pulse
/ selection=backward;
STEPWISE: model Oxygen_Consumption=
Performance RunTime Age Weight
Run_Pulse Rest_Pulse Maximum_Pulse
/ selection=stepwise;
title 'Best Models Using Stepwise Selection';
run;
quit;
title;

Statistics I: Introduction to ANOVA, Regression, and Logistic Regression


Copyright 2014 SAS Institute Inc., Cary, NC, USA. All rights reserved.


Summary: Lesson 4: Regression Diagnostics


This summary contains topic summaries, syntax, and sample programs.

Topic Summaries
To go to the movie where you learned a task or concept, select a link.
Verifying the first assumption of linear regression, that the linear model fits the data
adequately, is critical. You should always plot your data before producing a model.
The remaining three assumptions of linear regression relate to the error terms, so you
check these assumptions in terms of errors, not in terms of the values of the response
variable. You can use several different residual plots to check these assumptions: plot
the residuals versus the predicted values, plot the residuals versus the values of the
independent variables, and produce a histogram or a normal probability plot of the
residuals. To verify that the model assumptions are valid, check that the residuals
display a random scatter above and below the reference line at 0. If you see patterns
or trends in the residual values, the assumptions might not be valid and the model
might have problems. You can also use residual plots to detect outliers.
To create residual plots and other diagnostic plots, you use PROC REG, which creates
a number of default plots. Specifying an identifier variable in the ID statement shows
you that information when you hover your cursor over the data points in the graph. You
can also request specific plots with the PLOTS= option in the PROC REG statement.

You should also identify any influential observations that strongly affect the linear
model's fit to the data. To identify outliers and influential observations in your data, you
can use several diagnostic statistics in PROC REG. To detect outliers, you can
use STUDENT residuals. To detect influential observations, you can
use Cook's D statistics, RSTUDENT residuals, and DFFITS statistics.
Cook's D statistic is most useful for explanatory or analytic models, and DFFITS is
most useful for predictive models. If you detect an influential observation, you can
identify which parameter the observation is influencing most by using DFBETAS.
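Common rule-of-thumb cutoffs for these statistics (the same ones used in the sample program later in this summary), with n observations and p parameters in the model:

```latex
|\text{RSTUDENT}| > 3,
\qquad
|\text{DFFITS}| > 2\sqrt{p/n},
\qquad
\text{Cook's } D > 4/n
```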
To detect influential observations in your model using PROC REG, you can produce
diagnostic statistics as well as diagnostic plots. To control which plots are produced,
you can use the PLOTS= option in the PROC REG statement. To request
the diagnostic statistics used in creating the plots without producing the plots
themselves, you can use the R and INFLUENCE options in the MODEL statement.
When you use these options, PROC REG creates an ODS output object
called OutputStatistics, which contains the residuals and influential statistics from the
R and INFLUENCE model options. To add variables in the model to the
OutputStatistics data object, you specify them in the ID statement. To save the
statistics in an output data set, you use the ODS OUTPUT statement.
For very large data sets, viewing or printing all residuals and influence statistics quickly
becomes unwieldy. To reduce the amount of output, you can use the cutoff values for
each of the diagnostic criteria to detect influential observations. To do so, you can use
macro variables and the DATA step to create a program that you can reuse.
You can handle influential observations in several ways. You can recheck for data
entry errors, determine whether you have an adequate model, and determine whether
the observation is valid but unusual. In your analysis, you should report the results of
your model with and without the influential observation.

Collinearity, also called multicollinearity, occurs in multiple regression when two or
more predictor variables are highly correlated with each other. Collinearity doesn't
violate the assumptions of multiple regression, but it leads to instability in the
regression model.
To detect collinearity, you can check your PROC REG output. To measure the
magnitude of collinearity in a model, you can use the VIF option in the MODEL
statement. If you detect collinearity, you can determine how to proceed and which
model to select.
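For reference, the variance inflation factor for the i-th predictor is computed from the R-square of regressing that predictor on all the others; values above 10 are a common rule-of-thumb signal of serious collinearity:

```latex
VIF_i = \frac{1}{1 - R_i^2}
```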
To review, effective modeling includes performing preliminary analyses, selecting
candidate models, validating assumptions, detecting influential observations and
collinearity, revising your model, and performing prediction testing.

Syntax
To go to the movie where you learned a statement or option, select a link.
LIBNAME libref 'SAS-library';
ODS OUTPUT output-object-specification=data-set;
PROC REG DATA=SAS-data-set <options>;
MODEL dependent=regressor(s) </ options>;
ID variable(s);
RUN;

Selected Options in PROC REG

Statement: PROC REG
Option: PLOTS=

Statement: MODEL
Options: R, INFLUENCE, VIF

%LET variable=value;
DATA SAS-data-set;
SET SAS-data-set;
variable=value;
IF expression;
RUN;
PROC PRINT DATA=SAS-data-set;
VAR variable(s);
RUN;
Sample Programs
Producing Default Diagnostic Plots
ods graphics / imagemap=on;
proc reg data=statdata.fitness;
PREDICT: model Oxygen_Consumption=
RunTime Age Run_Pulse Maximum_Pulse;
id Name;
title 'PREDICT Model - Plots of Diagnostic Statistics';
run;
quit;
title;
Requesting Specific Diagnostic Plots
ods graphics / imagemap=on;
proc reg data=statdata.fitness
plots(only)=(QQ RESIDUALBYPREDICTED RESIDUALS);
PREDICT: model Oxygen_Consumption=
RunTime Age Run_Pulse Maximum_Pulse;
id Name;
title 'PREDICT Model - Plots of Diagnostic Statistics';
run;
quit;
title;
Using Diagnostic Plots to Identify Influential Observations
ods graphics / imagemap=on;
proc reg data=statdata.fitness plots(only)=
(RSTUDENTBYPREDICTED(LABEL)
COOKSD(LABEL)
DFFITS(LABEL)
DFBETAS(LABEL));
PREDICT: model Oxygen_Consumption =
RunTime Age Run_Pulse Maximum_Pulse;
id Name;

title 'PREDICT Model - Plots of Diagnostic Statistics';


run;
quit;
title;
Writing Diagnostic Statistics to an Output Data Set
ods output outputstatistics=Check4Outliers;
proc reg data=statdata.fitness;
PREDICT: model Oxygen_Consumption=
RunTime Age Run_Pulse Maximum_pulse
/ r influence;
id Name Oxygen_Consumption RunTime Age Run_Pulse
Maximum_pulse;
title 'PREDICT Model - Plots of Diagnostic Statistics';
run;
quit;
title;
Detecting Influential Observations Programmatically
%let dsname=check4outliers;  /* data set name */
%let numparms=5;             /* # of predictor variables + 1 */
%let numobs=31;              /* # of observations */
%let idvars=Name Oxygen_Consumption RunTime DFB_RunTime
    Age DFB_Age Run_Pulse DFB_Run_Pulse
    Maximum_pulse DFB_Maximum_Pulse;  /* relevant variable(s) */
data influential;
set &dsname;
CutDFFits=2*(sqrt(&numparms/&numobs));
CutCooksD=4/&numobs;
RStud_i=(abs(RStudent)>3);
DFits_i=(abs(DFFits)>CutDFFits);
CookD_i=(CooksD>CutCooksD);
Summary_i=compress(RStud_i||DFits_i||CookD_i);
if Summary_i ne '000';
run;
proc print data=influential;
var Summary_i &IDVars PredictedValue RStudent
DFFits CutDFFits CooksD CutCooksD;
title 'Observations Exceeding Suggested Cutoffs';
run;
title;
Detecting Collinearity
proc reg data=statdata.fitness;
PREDICT: model Oxygen_Consumption=
RunTime Age Run_Pulse Maximum_Pulse;
FULL: model Oxygen_Consumption =
Performance RunTime Age Weight
Run_Pulse Rest_Pulse Maximum_Pulse;
title 'Collinearity: Full Model';
run;
quit;
title;
Calculating Collinearity Diagnostics
proc reg data=statdata.fitness;
FULL: model Oxygen_Consumption=

Performance RunTime Age Weight


Run_Pulse Rest_Pulse Maximum_Pulse
/ vif;
title 'Collinearity: Full Model with VIF';
run;
quit;
title;
Dealing with Collinearity
proc reg data=statdata.fitness;
NOPERF: model Oxygen_Consumption=
RunTime Age Weight
Run_Pulse Rest_Pulse Maximum_Pulse
/ vif;
title 'Dealing with Collinearity';
run;
quit;
title;


Summary: Lesson 5: Categorical Data Analysis


This summary contains topic summaries, syntax, and sample programs.

Topic Summaries
To go to the movie where you learned a task or concept, select a link.
A one-way frequency table displays frequency statistics for a categorical variable.
An association exists between two variables if the distribution of one variable changes
when the value of the other variable changes. If there's no association, the distribution
of the first variable is the same regardless of the level of the other variable.
To look for a possible association between two or more categorical variables, you can
create a crosstabulation table. A crosstabulation table shows frequency statistics for
each combination of values (or levels) of two or more variables.
To create frequency and crosstabulation tables in SAS, and request associated
statistics and plots, you use the TABLES statement in the FREQ procedure.
You can use the PLOTS= option in the TABLES statement to request specific plots for
frequency and crosstabulation tables.


When the values of an ordinal variable are ordered logically, you can use more powerful
statistical tests that can detect linear (ordinal) associations instead of only general associations. To
logically order the values of a variable for calculations and output, you can create a
new variable or you can apply a temporary format to an existing variable. The
ORDER=FORMATTED option in the PROC FREQ statement tells PROC FREQ to
perform calculations and display output by using the formatted values instead of the
stored values.

To perform a formal test of association between two categorical variables, you use the
chi-square test. The Pearson chi-square test is the most commonly used of several
chi-square tests. The chi-square statistic indicates the difference between observed
frequencies and expected frequencies. Neither the chi-square statistic nor its p-value
indicates the magnitude of an association.
Cramer's V statistic is one measure of the strength of an association between two
categorical variables. Cramer's V statistic is derived from the Pearson chi-square
statistic.
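In standard notation, with observed cell frequencies O and expected frequencies E, and with n observations in an r-by-c table, these statistics are:

```latex
\chi^2 = \sum \frac{(O - E)^2}{E},
\qquad
V = \sqrt{\frac{\chi^2 / n}{\min(r - 1,\ c - 1)}}
```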
To measure the strength of the association between a binary predictor variable and a
binary outcome variable, you can use an odds ratio. An odds ratio indicates how much
more likely it is, with respect to odds, that a certain event, or outcome, occurs in one
group relative to its occurrence in another group.
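As a small worked example (with made-up counts, not from the course data): if 60 of 100 customers in group A purchase (odds 60/40 = 1.5) and 40 of 100 in group B purchase (odds 40/60, about 0.67), the odds of purchase in group A are 2.25 times the odds in group B:

```latex
OR = \frac{60/40}{40/60} = \frac{1.5}{0.\overline{6}} = 2.25
```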
To perform a Pearson chi-square test of association and generate related measures of
association, you specify the CHISQ option and other options in the TABLES statement
in PROC FREQ.
For ordinal associations, the Mantel-Haenszel chi-square test is a more powerful test
than the Pearson chi-square test. The Mantel-Haenszel chi-square statistic and its
p-value indicate whether an association exists but not the magnitude of the association.
To measure the strength of the linear association between two ordinal variables, you
can use the Spearman correlation statistic. The Spearman correlation is considered
a rank correlation because it measures the degree of association between the ranks of
the ordinal variables.
To perform a Mantel-Haenszel chi-square test of association and generate related
measures of association, you specify the CHISQ option and other options in the
TABLES statement in PROC FREQ.

Logistic regression is a type of statistical model that you can use to predict a
categorical response, or outcome, on the basis of one or more continuous or
categorical predictor variables. You select one of three types of logistic regression
(binary, nominal, or ordinal) based on your response variable.
Although linear and logistic regression models have the same structure, you can't use
linear regression with a binary response variable. Binary logistic regression uses a
predictor variable to estimate the probability of a specific outcome. To directly model
the relationship between a continuous predictor and the probability of an event or
outcome, you must use a nonlinear function: the inverse logit function.
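For a single continuous predictor x, the inverse logit (logistic) function maps the linear predictor to a probability between 0 and 1:

```latex
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
```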
To model categorical data, you use the LOGISTIC procedure. The two required

statements are the PROC LOGISTIC statement and the MODEL statement.
Depending on the complexity of your analysis, you can use additional statements in
PROC LOGISTIC. If your model has one or more categorical predictor variables, you
must specify them in the CLASS statement. The MODEL statement specifies the
response variable and can specify other information as well, such as the predictor
variables and model options. In the MODEL statement, the EVENT= option specifies the event category for
a binary response model. To specify the type of confidence intervals you want to use,
you add the CLODDS= option to the MODEL statement. PROC LOGISTIC computes
Wald confidence intervals by default. You can use the PLOTS= option in the PROC
LOGISTIC statement to request specific plots.
Instead of working directly with the categorical predictor variables in the CLASS
statement, PROC LOGISTIC first parameterizes each predictor variable. The CLASS
statement creates a set of one or more design variables that represent the information
in each specified classification variable. PROC LOGISTIC uses the design variables,
and not the original variables, in model calculations. Two common parameterization
methods are effect coding (the method that PROC LOGISTIC uses by default)
and reference cell coding. To specify a parameterization method other than the default,
you use the PARAM= option in the CLASS statement. If you want to specify a
reference level other than the default for a classification variable, you use the REF=
variable option in the CLASS statement.
Akaike's information criterion (AIC) and the Schwarz criterion (SC) are goodness-of-fit
measures that you can use to compare models. -2 Log L is a goodness-of-fit measure
that is not commonly used to compare models. Comparing pairs is another
goodness-of-fit measure that you can use to compare models.
PROC LOGISTIC uses a 0.05 significance level and a 95% confidence interval by
default. If you want to specify a different significance level for the confidence interval,
you can use the ALPHA= option in the MODEL statement.
For a continuous predictor variable, the odds ratio measures the increase or decrease
in odds associated with a one-unit difference of the predictor variable by default.

A multiple logistic regression model characterizes the relationship between a
categorical response variable and multiple predictor variables.
One method of selecting a subset of predictor variables for a multiple logistic
regression model is the backward elimination method. To specify the variable selection
method in PROC LOGISTIC, you add the SELECTION= option to the MODEL
statement. By default, for the backward elimination method, PROC LOGISTIC uses a
0.05 significance level to determine which variables remain in the model. If you want to
change the significance level, you can use the SLSTAY= (or SLS=) option in the
MODEL statement.
Multiple logistic regression uses adjusted odds ratios, which measure the effect of a
single predictor variable on a response variable while holding all the other predictor
variables constant.
In PROC LOGISTIC, the UNITS statement enables you to obtain customized odds
ratio estimates for a specified unit of change in one or more continuous predictor
variables.
In the CLASS statement, when you use the REF= option with a variable that has either
a temporary or a permanent format assigned to it, you must specify the formatted
value of the level instead of the stored value.

When you fit a multiple logistic regression model, the simplest approach is to consider
only the main effects (the effect of each predictor individually) on the response. If you
suspect that there are interactions between predictor variables, you can fit a more
complex logistic regression model that includes interactions. When you use
the backward elimination method with interactions in the model, PROC LOGISTIC must
preserve the model hierarchy when eliminating main effects. You specify interactions
in the MODEL statement.
By default, PROC LOGISTIC produces the odds ratio only for variables that are not
involved in an interaction. To tell PROC LOGISTIC to produce the odds ratios for each
value of a variable that is involved in an interaction, you can use the ODDSRATIO
statement. To specify whether PROC LOGISTIC computes the odds ratios for a
categorical variable against the reference level or against all of its levels, you can use
the DIFF= option. The AT option specifies fixed levels of one or more interacting
variables (also called covariates). PROC LOGISTIC computes odds ratios at each of
the specified levels.
To visualize the interaction between two categorical variables, you can produce
an interaction plot.

Syntax
To go to the movie where you learned a statement or option, select a link.

PROC FREQ DATA=SAS-data-set <option(s)>;
TABLES table-request(s) </ option(s)>;
additional statements;
RUN;
Selected Options in PROC FREQ

Statement: PROC FREQ
Option: ORDER=

Statement: TABLES
Options: CELLCHI2, CHISQ (Pearson and Mantel-Haenszel), CL, EXPECTED, MEASURES, NOCOL, NOPERCENT, PLOTS=, RELRISK

PROC LOGISTIC DATA=SAS-data-set <options>;
CLASS variable <(variable-option(s))> ... </ options>;
MODEL response <(variable-option(s))> = predictor(s) </ options>;
UNITS independent1=list ... </ options>;
ODDSRATIO <'label'> variable </ options>;
RUN;
Selected Options in PROC LOGISTIC

Statement: PROC LOGISTIC
Option: PLOTS=

Statement: CLASS
Options: PARAM=, REF= (general usage and usage with a formatted variable)

Statement: MODEL
Options: ALPHA=, CLODDS=, EVENT=, SELECTION=, SLSTAY= | SLS=

Statement: ODDSRATIO
Options: AT, CL=, DIFF=

Sample Programs
Examining the Distribution of Variables
proc freq data=statdata.sales;
tables Purchase Gender Income
Gender*Purchase
Income*Purchase /
plots=(freqplot);
format Purchase purfmt.;
title1 'Frequency Tables for Sales Data';
run;
ods select histogram probplot;
proc univariate data=statdata.sales;
var Age;
histogram Age / normal (mu=est
sigma=est);
probplot Age / normal (mu=est
sigma=est);
title1 'Distribution of Age';
run;
title;

Ordering the Values of a Variable by Creating a New Variable


data statdata.sales_inc;
set statdata.sales;
if Income='Low' then IncLevel=1;
else If Income='Medium' then IncLevel=2;
else If Income='High' then IncLevel=3;
run;
proc freq data=statdata.sales_inc;
tables IncLevel*Purchase / plots=freqplot;
format IncLevel incfmt. Purchase purfmt.;
title1 'Create variable IncLevel to correct Income';
run;
title;

Performing a Pearson Chi-Square Test of Association


proc freq data=statdata.sales_inc;
tables Gender*Purchase /
chisq expected cellchi2 nocol nopercent
relrisk;
format Purchase purfmt.;
title1 'Association between Gender and Purchase';
run;
title;

Performing a Mantel-Haenszel Chi-Square Test


proc freq data=statdata.sales_inc;
tables IncLevel*Purchase / chisq measures cl;
format IncLevel incfmt. Purchase purfmt.;
title1 'Ordinal Association between IncLevel and Purchase?';
run;

title;

Fitting a Binary Logistic Regression Model


proc logistic data=statdata.sales_inc
plots(only)=(effect);
class Gender (param=ref ref='Male');
model Purchase(event='1')=Gender;
title1 'LOGISTIC MODEL (1):Purchase=Gender';
run;
title;

Fitting a Multiple Logistic Regression Model


proc logistic data=statdata.sales_inc
plots(only)=(effect oddsratio);
class Gender (param=ref ref='Male')
IncLevel (param=ref ref='1');
units Age=10;
model Purchase(event='1')=Gender Age IncLevel /
selection=backward clodds=pl;
title1 'LOGISTIC MODEL (2):Purchase=Gender Age IncLevel';
run;
title;

Fitting a Multiple Logistic Regression Model with Interactions


proc logistic data=statdata.sales_inc
plots(only)=(effect oddsratio);
class Gender (param=ref ref='Male')
IncLevel (param=ref ref='1');
units Age=10;
model Purchase(event='1')=Gender | Age | IncLevel @2 /
selection=backward clodds=pl;
title1 'LOGISTIC MODEL (3): Main Effects and 2-Way Interactions';
title2 '/ sel=backward';
run;
title;

Fitting a Multiple Logistic Regression Model with All Odds Ratios


ods select OddsRatiosPL ORPlot;
proc logistic data=statdata.sales_inc
plots(only)=(oddsratio);
class Gender (param=ref ref='Male')
IncLevel (param=ref ref='1');
units Age=10;
model Purchase(event='1')=Gender | IncLevel Age;
oddsratio Age / cl=pl;
oddsratio Gender / diff=ref at (IncLevel=all) cl=pl;
oddsratio IncLevel / diff=ref at (Gender=all) cl=pl;
title1 'LOGISTIC MODEL (3a): Significant Terms and All Odds Ratios';
title2 '/ sel=backward';
run;
title;
