
USING SPSS FOR DATA ANALYSIS

Michael Shalev

June 2008 (amended March 2009)

GETTING TO KNOW YOUR DATA


Before doing anything else you need to become familiar with all of the variables that you
intend to analyze. For this purpose, use the procedures available from Descriptive
Statistics (which is accessed from the Analyze menu).
Start by requesting Frequencies for the variables of interest and study the results carefully
(e.g. you may find that a variable has one or more categories that are nearly empty; decide
if you want to recode them as Missing, combine them with other categories, or leave them
as they are).
PAY ATTENTION TO THE SCALE OF MEASUREMENT!
Statistical procedures like correlation and regression require continuous variables: either on
an interval scale or a ratio scale. In practice, researchers may treat ordinal data as if it were
interval, especially questions that ask people to express their attitudes, e.g. on a scale from 1
(Strongly favor) to 5 (Strongly oppose).
These types of data must be distinguished from categorical variables like sex or
ethnic origin. Categorical variables can only be included in correlations or regressions if they
are converted to dichotomous variables, which have only two possible values: 1 or 0. (These
are often called dummy variables.) For the variable SEX the solution is easy: recode females
1 and males 0 (or vice versa).
For variables with more than two categories, each of the categories except for one reference
category must be converted to a dummy variable. It is not possible to
convert a categorical variable to multiple dichotomous variables if it is going to be the
dependent variable in a regression. (Why? Because there can only be one dependent variable
in a regression, whereas we can have as many independent variables as we like.)
The regression coefficient of a dummy variable is interpreted as follows: it shows the average
value of the dependent variable for the category coded 1 in comparison with the average
for the reference category which is coded 0.
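As a concrete illustration of this interpretation, the following Python sketch (using invented data, not the survey) shows that with a single dummy predictor, the regression slope is exactly the difference between the two group means, and the constant is the reference-category mean:

```python
# Invented mini-sample: ethnic category and Likud vote (1 = Likud, 0 = Maarach).
rows = [("mizrachi", 1), ("mizrachi", 1), ("mizrachi", 0),
        ("ashkenazi", 1), ("ashkenazi", 0), ("ashkenazi", 0)]

# Dummy coding with Ashkenazim as the reference category (coded 0).
miz = [1 if eth == "mizrachi" else 0 for eth, _ in rows]
likud = [vote for _, vote in rows]

def mean(xs):
    return sum(xs) / len(xs)

mean_miz = mean([y for d, y in zip(miz, likud) if d == 1])
mean_ash = mean([y for d, y in zip(miz, likud) if d == 0])

# In a regression of likud on the dummy alone, the slope equals the
# difference in group means, and the constant equals the reference mean.
slope = mean_miz - mean_ash
print(slope)
```

Here the slope is one third: the proportion voting Likud is 33 percentage points higher among the category coded 1 than in the reference category.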

DATA ANALYSIS AN EXAMPLE


Our example is based on Arian and Shamir's article about the 1981 elections. One of their
main arguments was that ethnic origin does not cause differences in voting. The most
important causal variable is hawkishness: voters who oppose territorial compromise prefer
the Likud. The correlation between ethnicity and voting is spurious, and is due to
the fact that Mizrachim tend to be hawks and Ashkenazim doves.
The false model is:

    ETHNICITY --> VOTE

The true model (with no direct ethnic effect on the vote) is:

    ETHNICITY --> HAWKISHNESS --> VOTE

Testing this model requires the following four steps:

1. Decide exactly how to measure each variable in the model.
2. Use tables and charts to explore the relationships between the variables in more detail.
3. Decide whether to change the model in light of the results so far.
4. Use multiple regression to summarize the relationships between the variables and see
   if they fit the model(s).

1. DECIDE HOW TO MEASURE THE VARIABLES


The dependent variable is vote: Like Arian and Shamir, we will use "Vote if Knesset
elections were held today" (v122) to create a new dummy variable called LIKUD, which is
coded 1 if a person said they would vote for the Likud and 0 if s/he would vote for the Maarach.
(Of course this is not the only solution. For example, we could have used the Left-Right
scale, V116, which gives meaningful values for respondents who supported parties other than
Likud and Maarach.)
The first independent variable (the original cause) is ethnicity: there are a variety of ways it
could be measured. (We will not do it here, but in this kind of situation it is best to repeat
the analysis for each different measure. You can then see if the results depend on how the
variable is measured.)
We will try to improve on the way Arian and Shamir measured ethnicity. They did not distinguish
between the first generation and the second generation. Also, they defined all Sabras (Israeli-born
whose father was also born in Israel) as Ashkenazim. We will define immigrant
Ashkenazim as the reference category, and we will create 4 dummy
variables for the other 4 categories of the variable Country of origin (V137).

V137                                 ASH_ISR   MIZ_OLEH   MIZ_ISR   SABRA
1 Israel - Israel                       0         0          0        1
2 Israel - Asia-Africa                  0         0          1        0
3 Israel - Europe-America               1         0          0        0
4 Asia-Africa - Asia-Africa             0         1          0        0
5 Europe-America - Europe-America*      0         0          0        0
6 other combination                  Missing   Missing    Missing  Missing
System Missing                       Missing   Missing    Missing  Missing

* No dummy variable is created for Ashkenazi immigrants because they
serve as the reference category.

The second independent variable (the real cause) is hawkishness: The questionnaire
included three different questions concerning attitude to annexation: V7, V8 and V105. Are
these three different ways of measuring the same thing? Or does hawkishness have more than
one dimension? To find out, we could perform a Factor Analysis (a procedure which is
explained at the end of this Guide). For now we will use only one question, V8, which ranges
from a value of 1 (no territorial concessions) to 5 (concede all of the territories). This
measures dovishness.
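We keep V8 in its original direction below. Purely as an aside (not part of the original analysis), if you preferred a hawkishness scale, a 1-to-5 item is reversed by subtracting it from 6:

```python
# Reversing a 1-5 attitude item: high values become low and vice versa.
def reverse_1_to_5(x):
    return 6 - x  # 1 <-> 5, 2 <-> 4, 3 stays 3

print([reverse_1_to_5(v) for v in (1, 2, 3, 4, 5)])  # [5, 4, 3, 2, 1]
```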


2. GENERATE TABLES
The Custom Tables procedure of SPSS will help us to find out how different types of
respondents differ in their voting. The dependent variable will be LIKUD, and the cells of the
table will show the mean Likud vote. The types will be combinations of ethnicity (V137) and
dovishness (V8). We want to know what happens to the effect of ethnicity when dovishness
is controlled. So we need to know the mean Likud vote of each ethnic group, for each
category of dovishness.
To make the table, from the Analyze menu choose Tables and then Custom Tables.
1. You will be asked if you want to define how your variables are measured. SPSS requires
that the variables which form the rows and columns of the table be either nominal or
ordinal. Since both V8 and V137 are defined at the moment as "scale" variables, you need
to change the type of measurement. There are 3 different ways to do this. Changes made
in the Measure column of the Data Editor will be permanent. Otherwise, you can either
use the wizard, or else make changes by right-clicking a variable in the variable list.
2. A template is provided for laying out the table. It is recommended that the control
variable (in this case V8) be placed in the rows. Drag V8 to "rows" and V137 to
"columns". You now have to choose what you want to appear in each cell: in this case,
the mean of the variable LIKUD. The default for cell contents is the number of cases
("count"). Drag the variable LIKUD to the Count labels (make sure all the labels are
selected). The calculation will automatically change to Mean, which is what we want. To
change the measure, or the number of decimal places, right-click LIKUD in the table and
choose Summary Statistics.
3. The default table doesn't include totals. To add them, right-click V8, select Categories
and Totals, then check the box at the bottom marked Show Total and click the Apply
button. Repeat for the variable V137. At this stage, the dialog box should look like this:

[screenshot: the completed Custom Tables dialog box]

4. Finally, choose the Titles tab, and add a title to the table ("Mean Likud Vote in 1981").
Press OK and the table will appear. It should look like this:
Mean Likud Vote in 1981
(cells are the mean of LIKUD; v8 Yoniyut = dovishness; vitur = concession; "." = no cases)

v8 Yoniyut            v137 country of origin
                    1 Israel-  2 Israel-    3 Israel-   4 Asia-Africa-  5 Europe-America-  6 other   Total
                    Israel     Asia-Africa  Eur.-Am.    Asia-Africa     Europe-America     comb.
1 against vitur       .62        .66          .48          .60             .37               .        .55
2 small vitur         .69        .35          .22          .48             .34               .        .38
3 some vitur          .36        .35          .24          .40             .17                        .28
4 nearly all vitur    .33        .67          .00          .00             .00                        .18
5 complete vitur      .00        .00          .00          .33             .00                        .05
Total                 .55        .53          .32          .52             .28                        .43

Study the table carefully and you will see several interesting things. First, look at the totals in
the bottom row, which show the overall effect of ethnicity: this effect is large. Both
Mizrachim and Sabras were much more likely than Ashkenazim to vote Likud (more than
50%, compared with about 30%). We can also see that the difference between the foreign-born
and Israeli-born generations is small.
Second, look at the totals in the last column, which show the effect of dovishness on
voting. This effect is also very strong, and it appears to be linear. We can construct a line
chart to make sure. Double-click the table, select the first 5 numbers in the last column, then
click the right mouse button and choose Create Graph and Line. The result (a line chart,
not reproduced here) shows that the influence of dovishness on Likud voting is almost
perfectly linear.


What about the inner cells of the table? If the ethnic effect on voting really is spurious, then
within categories of dovishness there will be no ethnic vote. A quick look shows that inside
each row (level of dovishness) there are still differences in the Likud vote between ethnic
groups. However, before going any further we need to answer two questions.
First, are there enough cases for us to have confidence in the means? If there are very few
people in a cell, it is not worth much! We therefore generate the table again, but this time
requesting Count instead of Mean as the desired Statistic. The results (not shown) reveal that
there were very few people in the two most extreme categories of the dovishness variable.
Second, are there any categories that should be dropped or combined? The Sabras (Israeli-born whose fathers were also born here) should be dropped, because we don't know where their
grandparents came from. In addition, we already saw from the previous table that there is little
difference in the vote of foreign-born and Israel-born respondents.
We conclude that it would be a good idea to simplify both of our variables. First we recode V8
into a new variable, V8new, and give it the label Dovishness. The new variable combines the
three most dovish categories (3, 4 and 5). Then we recode V137 into V137new, which leaves
out the Sabras and compares all Ashkenazim (coded 1) with all Mizrachim (coded 0). (It's
important to add Value Labels so your results will show that 1 is Ashkenazim and 0 is
Mizrachim.)
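In SPSS these recodes are done with Recode into Different Variables; a hypothetical Python version simply mirrors the same logic (None stands for a missing value):

```python
def recode_v8new(v8):
    """Combine the three most dovish categories (3, 4, 5) of V8 into one."""
    if v8 is None:
        return None
    return v8 if v8 <= 2 else 3

def recode_v137new(v137):
    """1 = Ashkenazim (categories 3, 5), 0 = Mizrachim (categories 2, 4).
    Sabras (1) and 'other combination' (6) become missing."""
    return {3: 1, 5: 1, 2: 0, 4: 0}.get(v137)

print([recode_v8new(v) for v in (1, 2, 3, 4, 5)])       # [1, 2, 3, 3, 3]
print([recode_v137new(v) for v in (1, 2, 3, 4, 5, 6)])  # [None, 0, 1, 0, 1, None]
```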
We will use the two new variables to make a chart showing the difference in Likud vote for
Mizrachim and Ashkenazim at different levels of dovishness. Each level of dovishness will be
represented by a different line. After creating the table double-click it, then select all cells except
Totals and click with the right mouse button to request a Line chart.
The result is the chart on the left. [The two line charts are not reproduced here: the left one
plots the Likud vote of Mizrachim and Ashkenazim with one line per level of dovishness; the
right one, discussed below, uses levels of religiousness instead.]


The question we asked is: do ethnic differences in voting disappear within categories of
dovishness? The answer is NO! Whether they were hawks or doves, many more Mizrachim
than Ashkenazim planned to vote Likud. Now let's ask a different question: can we see any
conditional relationships? In other words, is there any difference between the
slopes of the three lines? If so, it would mean that the effect of ethnicity depends on the level
of dovishness. But there is actually not much difference.
For a good example of a conditional relationship, look at the chart on the right. Here the
effect of ethnicity is examined for different categories of religiousness instead
of for differences in dovishness. We see that among people who are very religious (the top
line), the slope is unusual. In this group, Ashkenazim were actually more likely than
Mizrachim to vote Likud.

3. SHOULD THE MODEL BE CHANGED?


We try to learn from the tables how best to set up our regression model. In the present
example one result (already mentioned) was the discovery that generation does not matter:
the Likud vote is very similar for immigrant and Israeli Ashkenazim, and for immigrant
and Israeli Mizrachim. Therefore, instead of using 4 dummy variables for the 5 categories of
ethnicity in our regressions, we could use only two: Mizrachim and Sabras, with all
Ashkenazim serving as the reference category. (We do not actually make this change in the
example below.)
More important would be additions or changes to the causal relationships that our regression
is designed to test. We have used tables to check two issues: (1) whether the effect of
dovishness on voting is linear; and (2) whether dovishness conditions the effect of ethnicity.
In reality, neither was a problem, but what if they had been?
(1) Suppose we had found that ethnicity made no difference to voting except for the most
dovish group. Then, rather than continuing to measure dovishness as a continuous variable, it
would have been better to turn it into a dummy variable (1=very dovish, 0=everyone else).
(2) What if the effect of ethnicity had been conditional on dovishness? (e.g. all hawks support
Likud, regardless of ethnicity; but among doves, there is a difference between Ashkenazim
and Mizrachim.) That would have required testing for interaction, which is explained in the
section Testing for Interaction below.

4. REGRESSIONS
The tables showed that both ethnicity and attitude to annexation affect whether people vote
for the Likud. Multiple regression will provide a test of whether the effect of
ethnicity on voting is spurious. The purpose of the regression is to see what happens to the
effect of ethnicity after we control for dovishness.

If the coefficients of the ethnic dummy variables get a lot smaller, this
would support the Arian-Shamir hypothesis of spuriousness.
But smaller coefficients may also be consistent with the hypothesis that the effect of
ethnicity is mediated by dovishness, or that both ethnicity and dovishness
affect voting (complementary effects).
If on the other hand the coefficients remain the same, then the two effects are
independent as well as complementary.

SPSS makes it easy to run "before and after" regressions. The original model is defined as the
first block. The next model (the second block) adds additional variables that were not in
the first model.


Use the Linear Regression procedure (from the Analyze menu, choose
Regression and then Linear). Select LIKUD as your Dependent Variable and the four ethnic dummy
variables as your Independent Variables. Press Next and add V8 to your second Block.
Click on Statistics, select Confidence intervals and R squared change, then click
Continue. Now click OK to run the regression.

Understanding Regression Output


The first table of output is the Model Summary. It shows the percentage of variance
explained by each model. (Remember, Model 1 is what we defined in Block 1; Model 2
includes Block 2 as well.) We see that ethnicity alone explains 4.7% (.047) of the variance in
LIKUD. Adding V8 to the regression more than doubles the Adjusted R-squared, which rises
to 10.6% (.106). As discussed in class, we are more interested in the effects (regression
coefficients, or slopes) than in the ability of the model to explain variance. However, we
may be interested in the F-test of whether the change in R-squared between models is
significant. In this case it definitely is statistically significant (the significance of the F change
is .000). This F-test can be very useful when a new block adds more than one independent
variable. It tests whether these variables as a group add anything significant to the previous
model.

Next we need to look at the regression coefficients (the table labeled Coefficients). First
let's examine the unstandardized coefficients (B) that are highlighted in yellow.


Do Mizrachim vote differently from Ashkenazim, and what difference does it make if we
control for dovishness? The B coefficient for MIZ in Model 1 is .232. An unstandardized
regression coefficient represents the expected effect on Y of a 1-unit increase in X. In the
special case of dummy variables, the coefficient represents the average difference between
the category coded 1 and the reference category. The coefficient for MIZ shows that, on
average, the proportion voting Likud among foreign-born Mizrachim is 23.2% higher than
among foreign-born Ashkenazim.
What happens when V8 is added to the regression (Model 2)? The coefficients of all the
ethnic dummy variables decline slightly, but remain substantial. Comparing the coefficients
for MIZ we see that the gap between foreign-born Mizrachim and Ashkenazim falls from
23% to 19%. Thus, the effect of ethnicity is not spurious. Only a small part of ethnicity's
effect is actually due to dovishness, or is mediated by dovishness. Our results therefore do
not support Arian and Shamir's main claim.
Several other features of the regression output are worth noting:
1. The constant: it is interpreted as the expected value of Y when all Xs (independent
variables) are zero. Usually this is not very informative. But in Model 1 the constant is .283,
which means that 28.3% of the reference category (Ashkenazim) voted Likud. (Take a look at
the Total for immigrant Ashkenazim in the table of mean Likud vote shown earlier, and you
will see the identical result!)
2. The column headed Sig. shows the significance level of each coefficient. The previous
column shows the value of the t-statistic, from which significance is computed. If t is at least
2 then the coefficient will usually be significant at the 5% level or better. If the value of
Sig. is greater than .05 (5%), this means that a coefficient this large would appear more
than 5% of the time even if the true coefficient were zero (i.e. we cannot be confident that the
independent variable has any effect).
Significance levels provide a very rough guide to whether a regression coefficient means
anything. More helpful are the Confidence Intervals shown in the last two
columns of the regression table. Recall that statistical significance is based on the idea that
our data come from a sample which is only one of many possible samples that might have
been drawn. The question is, how typical are the coefficients in the sample we used
compared with what would have been found if we could have run the regression for all
possible samples? What SPSS computes for us is the range of coefficients that would have
been expected in 95% of these samples. If the values in this range are all meaningful to us,
this is a good indication that a coefficient is solid. In Model 2 the effect of MIZ in 95% of
samples is expected to be somewhere between 10% and 29%. This is encouraging, because
even the lowest expected effect is quite large.
3. An effect can be statistically significant without being important. Importance is something
only you can judge. It often helps to use the results of the regression to estimate what
difference it makes. Consider the following simple comparison showing the effect of
dovishness, after controlling for ethnicity (Model 2). Among people who were opposed to
any territorial compromise (V8=1), the expected vote for Likud is 40.7% (.527 + [-.120*1]).
Among people who were ready to concede some territory, only 16.7% are expected to vote
Likud (.527 + [-.120*3]). That's a difference of 24 percentage points!
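Using the Model 2 numbers quoted above (constant .527 and dovishness slope -.120, for the reference ethnic category), this comparison can be reproduced in a few lines:

```python
# Model 2 coefficients as reported in the text (reference ethnic category).
constant = 0.527
b_dovishness = -0.120

def expected_likud(v8):
    """Expected proportion voting Likud at a given level of V8 (dovishness)."""
    return constant + b_dovishness * v8

hawks = expected_likud(1)  # opposed to any territorial compromise
doves = expected_likud(3)  # ready to concede some territory
print(round(hawks, 3), round(doves, 3))  # 0.407 0.167
print(round((hawks - doves) * 100))      # 24 percentage points
```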
4. The Beta coefficients in the third column of numbers indicate the relative importance
of different independent variables, measured in units of standard deviations. (This usually
cannot be judged by comparing ordinary B coefficients, because different independent
variables are not measured on the same scale.) Knowing the relative importance of the
effects may or may not be of interest; it depends on the question you are asking. In the
present case, Arian and Shamir might have been interested in knowing whether dovishness

-9-

affects voting more than ethnicity. However, because ethnicity is measured by 4 different
dummy variables we cannot directly compare the effects of ethnicity and dovishness.


Testing for Interaction


The analysis so far has illustrated how to test for all types of causal relationship except a
conditional relationship. For this purpose we shall go back to the example that
was shown earlier, in the chart on the right. This showed that while Mizrachim are
normally more favorable to the Likud, among very religious voters more Ashkenazim than
Mizrachim favored the Likud. In the model below, the arrow from RELIGIOUSNESS
represents the conditional relationship that we would like to test.

    ETHNICITY ----------> VOTE
                  ^
                  |
           RELIGIOUSNESS
This is often referred to as interaction between the effects of two independent variables.
There are two ways of testing for interaction. In the present example, the simplest way would
be to estimate the effect of ethnicity twice: once for very religious people and once for
everyone else. If the coefficients of the variables in these two regressions are different,
that would support the idea of a conditional relationship. SPSS has a procedure called Split
File which makes it easy to use this method. In the present example, we recode V143 to
create a new variable called DATI with two categories (very religious = 1, everyone else = 0).
From the Data menu click on Split File, select Compare groups, use the arrow to
select the DATI variable, and press OK. From now on, unless you choose the option Analyze
all cases, any analysis you do will show separate results for very religious people and for
everyone else. Here's what happens when we run a regression to test the effect of ethnicity
and religious observance on LIKUD with Split File turned on.

The area highlighted in blue shows that the regression results are repeated twice: once for
respondents coded 0 on DATI, and again for those coded 1. (Why doesn't the second block,
Model 2, appear for DATI=1? Because everyone in this group of respondents is very religious,
there is no variation in V143, and therefore SPSS cannot test its effect on voting.) What interests
us is whether the coefficients for the ethnic dummy variables are different for DATI=0 and
DATI=1. They certainly are. We highlight in yellow one example: Mizrachim born abroad in
comparison with Ashkenazim born abroad. The difference in expected support for the Likud
is much higher among the very religious (40.4%) than among non-religious respondents
(15.7%, after controlling for variations in religiosity).
This method of testing conditional relationships is limited. If the conditioning variable has
more than two categories, the analysis will have to be split into many parts. The solution is
to add one more variable, called an interaction term, to the regression. This variable is the
product of the two variables of interest. It tests whether combining the two variables
changes the effects of each one on its own. We will illustrate how it's done with a simplified
version of the conditional model we have been using so far. Instead of using 4 dummy
variables to test the effect of ethnicity, we will contrast all Mizrachim and all Ashkenazim by
using a single dummy variable called MIZ_ASH. (Sabras are treated here as Missing.) We
already have a dummy variable DATI contrasting very religious people to everyone else.
Now we create an interaction variable called INTERAC, which is equal to MIZ_ASH *
DATI (this is easily done using Compute, available from the Transform menu). We now
run the regression in 3 blocks.

Model 1 shows that the Mizrachi vote for Likud is 23.2% higher than for Ashkenazim, and
we see from Model 2 that this gap falls very slightly to 22.4% after the effect of strong
religious observance is taken into account. Model 3 includes the variable INTERAC to test
whether the effect of ethnicity (MIZ_ASH) depends on whether people are very religious or
not (DATI).
We can easily calculate the expected effect of MIZ_ASH for different values of DATI.
When DATI=0 (not "very religious"), the expected effect of MIZ_ASH does not change
(.233).
When DATI=1 ("very religious"), the expected effect of MIZ_ASH is the sum of the
coefficients for MIZ_ASH and INTERAC (.233 - .284 = -.051). If the conditioning variable
(DATI in this case) had values other than 0 or 1, we could calculate the expected value of the
dependent variable for each value of the conditioning variable, according to this formula:

    predicted LIKUD = constant + b(MIZ_ASH)*MIZ_ASH + b(DATI)*DATI + b(INTERAC)*(MIZ_ASH*DATI)


It is recommended that you perform this calculation yourself for the two values of DATI,
0 and 1; this is easy to do in Excel or any spreadsheet.
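For readers who prefer code to a spreadsheet, here is the same conditional-effect calculation in Python, using the Model 3 coefficients quoted above:

```python
# Model 3 coefficients as reported in the text.
b_miz_ash = 0.233
b_interac = -0.284

def effect_of_miz_ash(dati):
    """Effect of MIZ_ASH conditional on DATI: b(MIZ_ASH) + b(INTERAC) * DATI."""
    return b_miz_ash + b_interac * dati

print(round(effect_of_miz_ash(0), 3))  # 0.233  (not "very religious")
print(round(effect_of_miz_ash(1), 3))  # -0.051 ("very religious")
```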
Testing the statistical significance of conditional relationships is complicated. In this case the
interaction term is not significant, but that is misleading. For our purposes, the important
thing is whether the results are substantively meaningful.

FACTOR ANALYSIS
Factor Analysis is a useful way of finding out whether a group of variables share a single
common denominator, or whether they are best summarized using several
different dimensions. Even if we already know the answer to that question, Factor
Analysis can simplify the construction of a new variable that summarizes the values of
existing variables. In creating scores for the factors that it uncovers, the procedure can
combine the values of variables measured on different scales. The resulting factor scores are
easy to interpret, since they always have a mean of zero and a standard deviation of 1.
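The point about factor scores can be illustrated by standardizing any composite score: subtracting the mean and dividing by the standard deviation yields mean 0 and SD 1 by construction (the numbers below are invented, just to show the mechanics):

```python
from statistics import mean, pstdev

# Invented composite scores (e.g. a sum of several survey items).
raw = [2, 4, 4, 4, 5, 5, 7, 9]

# Standardize: z-scores have mean 0 and standard deviation 1, which is
# also how SPSS scales the factor scores it saves.
m, s = mean(raw), pstdev(raw)
z = [(x - m) / s for x in raw]
print(round(mean(z), 10), round(pstdev(z), 10))  # 0.0 1.0
```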
Suppose we were interested in measuring what people like or dislike about different political
parties. Variables V51-V60 cover a variety of reasons why voters might be attracted to the
Labour Party. Four of them are: experience in government (V51), good leaders (V52), peace
policy (V54) and social policy (V55). Do people who favor Labour rank the party high on all
of these aspects, while people who dislike Labour dislike it in all respects? Or does
like/dislike of Labour divide among several different dimensions? To find out we use these
menus: Analyze, Data reduction, Factor.

The 4 variables which interest us have been selected. Now we should change some of the
details. Click on Rotation and select Varimax, then click Continue. Click on Options.
Select Sorted by size and Suppress absolute values less than .10 and then click OK.


Total Variance Explained

               Initial Eigenvalues                  Extraction Sums of Squared Loadings
Component   Total   % of Variance   Cumulative %    Total   % of Variance   Cumulative %
1           2.821      70.519          70.519       2.821      70.519          70.519
2            .501      12.517          83.037
3            .372       9.305          92.342
4            .306       7.658         100.000

The table above shows the variance explained by the factors. In
this example only the first factor (called Component 1) is important enough to be
extracted. We see that it alone explains 70% of the collective variance in the four variables
that were analyzed.
We turn now to a different example. The Arian-Shamir survey included a variety of questions
concerning involvement in election-related activities that could influence a person's vote. Is
there one underlying factor here? For example, do people who participate in party rallies also
tend to be influenced by the media campaign (or have they already made up their minds)? Are
people who admit to being influenced by the television campaign also influenced by the radio
and newspaper campaigns? A factor analysis of 7 relevant variables, V17 through V23, shows
that there are two main factors.
Total Variance Explained

               Initial Eigenvalues          Extraction Sums of          Rotation Sums of
                                            Squared Loadings            Squared Loadings
Component   Total  % of Var.  Cum. %    Total  % of Var.  Cum. %    Total  % of Var.  Cum. %
1           3.219   45.992    45.992    3.219   45.992    45.992    3.041   43.450    43.450
2           1.758   25.121    71.113    1.758   25.121    71.113    1.936   27.663    71.113
3            .846   12.091    83.203
4            .483    6.906    90.109
5            .271    3.870    93.979
6            .246    3.516    97.495
7            .175    2.505   100.000
Extraction Method: Principal Component Analysis.
The first factor is more important than the second one. Together they explain the majority
(71%) of total variance. Note that the final factors have been rotated to make them as
dissimilar from one another as possible.
Rotated Component Matrix

                                                 Component
                                                  1      2
V23  extent TV broadcasts help decide           .929
V21  extent radio broadcasts help decide        .907
V22  extent newspapers help decide              .897
V20  extent the election campaign help decide   .732
V17  participate parties' rallies               .273   .892
V18  participate parties' home-circles                 .888
V19  extent talk about political issues                .523

There is a very clear division between the two factors (Components). The table above
shows the extent to which variables contribute to each factor (the numbers represent
correlations between each variable and a factor). By studying these loadings, as they are
called, we can learn how to interpret the factors. The first factor represents passive
participation in the election campaign, indicated by the extent voters are influenced by the
media and the parties' campaign efforts. The second factor covers actions initiated by voters
themselves.
In order to create summary scores for each factor, we simply need to run the Factor Analysis
again but with one change: click on Scores and press Save as variables. In the present
instance two new variables would be added to the dataset, one for each factor. These variables
could be used in regressions or any other type of analysis.
Comments or suggestions? Please send them to michael.shalev@gmail.com
