Introduction
In the last practical, you saw how to handle geographical data in R, and how to carry out
some basic, and more advanced statistical analysis on the data. However, even the
more advanced Poisson modelling carried out did not take into consideration any spatial
dependencies in the data. The breach of peace counts in each of the census blocks were
modelled as independent Poisson counts, and the number of counts in each block was
considered only in terms of other properties of that block, ignoring anything happening in
surrounding blocks. However, there is a large area of statistical analysis devoted to
processes in which events in nearby areas are related. In this practical you will learn how
to use R libraries devoted to this kind of analysis - in particular spdep.
Loading required package: tripack
Loading required package: sp
Loading required package: maptools
Loading required package: foreign
Loading required package: boot
Loading required package: spam
The topology of a spatial data set is the term usually used to describe the spatial arrangement of
geographical items within it - in particular, for a polygon data set the topology is a list of
polygons that touch one another. Here, touching can mean the sharing of a common
edge, and in some cases it can also mean the sharing of a common single point (for
example when two census block areas are joined only at their corners).
spdep has a function to extract the topology information from a polygon object - called
poly2nb. The nb here stands for neighbours - since it is basically a list of which polygon
neighbours which other ones. Enter
blocks.nb = poly2nb(blocks)
Spatial Statistics with R: Page 1 of 12
to store this information in a variable called blocks.nb. It is possible to plot this information
as a kind of network. The nodes on the network are the so-called label points for the
polygon file. Each polygon has a label point - a point somewhere inside the polygon
where any text used to label the polygon may be placed. They are useful as node
points on a network representing polygon neighbours. To extract the label points, as a
point object, enter
blocks.labs = poly.labels(blocks)
and then it is possible to plot the neighbour information. Here, this is done on a backdrop
of the census block polygons:
plot(blocks,col='grey')
plot(blocks.nb,coordinates(blocks.labs),col='red',add=TRUE)
The default for poly2nb is to define neighbours as having points, as well as edges, in
common. This is sometimes called queen's case topology because connection at edges
and corners corresponds to the legal moves of the queen in chess. It is also possible to
extract rook's case topology - where only common edges define neighbours. This
corresponds to legal moves of the rook in chess. To extract rook's case moves, add the
argument queen=FALSE to the poly2nb function:
blocks.nb = poly2nb(blocks,queen=FALSE)
plot(blocks,col='grey')
plot(blocks.nb,coordinates(blocks.labs),col='red',add=TRUE)
This repeats the network map from before, but now only polygons with common edges
are connected. In this case, as few polygon pairs are connected only at the corners, the
result is fairly similar.
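The difference between the two cases can be seen on a small regular grid, where corner-only contacts are common. The following is a minimal base R sketch, not the poly2nb algorithm: the function grid_neighbours is hypothetical and treats each grid cell as a square polygon.

```r
# Neighbours of each cell in an nr x nc grid of square polygons.
# queen = TRUE: cells sharing an edge or a corner are neighbours;
# queen = FALSE: only cells sharing an edge (rook's case).
grid_neighbours <- function(nr, nc, queen = TRUE) {
  cells <- expand.grid(row = seq_len(nr), col = seq_len(nc))
  lapply(seq_len(nrow(cells)), function(i) {
    dr <- abs(cells$row - cells$row[i])
    dc <- abs(cells$col - cells$col[i])
    if (queen) {
      touching <- pmax(dr, dc) == 1      # edge or corner contact
    } else {
      touching <- (dr + dc) == 1         # edge contact only
    }
    which(touching)
  })
}

queen_nb <- grid_neighbours(3, 3, queen = TRUE)
rook_nb  <- grid_neighbours(3, 3, queen = FALSE)

# Cell 5 is the centre of the 3 x 3 grid
length(queen_nb[[5]])   # 8 neighbours
length(rook_nb[[5]])    # 4 neighbours
```

On irregular polygons such as census blocks, corner-only contacts are much rarer than on a grid, which is why the two network maps above look so alike.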
An alternative definition of topology (based on nearness of polygons rather than contiguity)
is to define two polygons as neighbours if their label points are within some distance d of
one another. R can define these kinds of neighbours using the dnearneigh function. For
example, to define census blocks as being neighbours if they are within 1.2 miles of one another,
enter the following:
blocks.nb2 = dnearneigh(poly.labels(blocks),0,miles2ft(1.2))
It is then possible to plot the neighbour network under this definition:
plot(blocks,col='grey')
plot(blocks.nb2,coordinates(blocks.labs),col='red',add=TRUE)
Note that this demonstrates that under different definitions of neighbour, quite different
patterns of network can occur.
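The distance-based definition can be sketched in base R without spdep. This is only an illustration of the idea, not the dnearneigh implementation: the coordinates are hypothetical, and the helper distance_neighbours is an assumed name.

```r
# Hypothetical label-point coordinates (in feet) for five polygons
pts <- cbind(x = c(0, 3000, 6000, 20000, 21000),
             y = c(0, 4000,    0,     0,  1000))

# Neighbours are pairs of points no more than d apart (excluding self-pairs)
distance_neighbours <- function(coords, d) {
  dmat <- as.matrix(dist(coords))          # Euclidean distances
  lapply(seq_len(nrow(dmat)), function(i) {
    unname(which(dmat[i, ] <= d & seq_len(ncol(dmat)) != i))
  })
}

d_feet <- 1.2 * 5280                        # 1.2 miles in feet
nb <- distance_neighbours(pts, d_feet)
sapply(nb, length)                          # neighbour counts: 2 2 2 1 1
```

Increasing d adds links to every point, so the choice of distance threshold directly controls how dense the neighbour network is.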
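$$ I = \frac{N}{\sum_i \sum_j w_{ij}} \, \frac{\sum_i \sum_j w_{ij} (X_i - \bar{X})(X_j - \bar{X})}{\sum_i (X_i - \bar{X})^2} $$
Where:
$X_i$ is the value of the attribute X in polygon i, with mean $\bar{X}$
$N$ is the number of polygons
$w_{ij}$ is the weight linking polygons i and j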
The formula may seem complex, but essentially it measures the degree to which similar-valued
attributes occur near to each other. If above average valued attributes tend to be
near other above-average attributes, this gives a positive value of Moran's I. If, on the
other hand, above average values tend to occur near to below average values - in a
checker-board pattern - this gives a negative Moran's I. Moran's I is typically between -1
and 1, and in some ways is similar to a correlation coefficient. A value of zero suggests
no spatial dependency. It is sometimes referred to as a measure of autocorrelation as it
measures the variable X's correlation to itself, in a geographical sense. To illustrate this,
choropleth maps corresponding to four values of Moran's I are given below:
[Figure: four choropleth maps, labelled I = 0.904, I = 0.126, I = 0.074 and I = 0.435]
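The behaviour described above can be checked with a direct base R transcription of the Moran's I formula. This is a minimal sketch on an assumed 4 x 4 grid with binary rook's case weights; the function names morans_i and rook_weights are hypothetical and are not part of spdep.

```r
# Moran's I for attribute values x and an N x N weights matrix w
morans_i <- function(x, w) {
  dev <- x - mean(x)
  n <- length(x)
  (n / sum(w)) * sum(w * outer(dev, dev)) / sum(dev^2)
}

# Rook's case binary weights for an nr x nc grid of cells
rook_weights <- function(nr, nc) {
  cells <- expand.grid(row = seq_len(nr), col = seq_len(nc))
  dr <- abs(outer(cells$row, cells$row, "-"))
  dc <- abs(outer(cells$col, cells$col, "-"))
  (dr + dc == 1) * 1
}

w <- rook_weights(4, 4)
cells <- expand.grid(row = 1:4, col = 1:4)

checker  <- (cells$row + cells$col) %% 2     # alternating 0/1 pattern
gradient <- cells$row + cells$col            # smooth trend across the grid

morans_i(checker, w)    # -1 (up to rounding): every neighbour has the opposite value
morans_i(gradient, w)   # positive: similar values are adjacent
```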
data: P_VACANT
weights: blocks.lw
Moran I statistic standard deviate = 2.7789, p-value = 0.002727
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance
      0.143721934      -0.007812500       0.002973471
This needs some explanation. The first number of the last line printed gives the Moran's I
statistic itself - about 0.144. The other information relates to a statistical test as to
whether the Moran's I is equal to zero. If this is the case, then the theoretical values for
the expected value of Moran's I and its sample variance are estimated using the following
formulae:
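$$ E(I) = -\frac{1}{N - 1} $$
$$ \mathrm{Var}(I) = \frac{N D - E \left[ (N^2 - N) A - 2 N B + 6 C^2 \right]}{(N - 1)(N - 2)(N - 3) C^2} - \{E(I)\}^2 $$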
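where:
$$ A = \frac{1}{2} \sum_i \sum_j (w_{ij} + w_{ji})^2, \quad i \neq j $$
$$ B = \sum_k \left( \sum_j w_{kj} + \sum_i w_{ik} \right)^2 $$
$$ C = \sum_i \sum_j w_{ij}, \quad i \neq j $$
$$ D = (N^2 - 3N + 3) A - N B + 3 C^2 $$
$$ E = \frac{\sum_i (X_i - \bar{X})^4 / N}{\left[ \sum_i (X_i - \bar{X})^2 / N \right]^2} $$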
The test statistic is then
$$ z = \frac{I - E(I)}{\{\mathrm{Var}(I)\}^{1/2}} $$
which is approximately Normally distributed, and can therefore be converted to a p-value. The
last line of the printout from moran.test tells you that the value of E(I) for P_VACANT is
about -0.008 (labelled Expectation) and that for Var(I) is about 0.0030 (labelled
Variance). These can be used to compute z in the formula above, which is then used to
test the hypothesis that I=0. In the printout from moran.test this is labelled as Moran I
statistic standard deviate and takes the value of around 2.78. Finally the p-value for the
statistic is computed, and shown in the printout to be about 0.0027. Recall that the p-value
is the probability of obtaining a value at least as extreme as the one observed from
the data, given that the null hypothesis is true. Thus, the lower the value, the more
evidence against the null hypothesis. Here the smallness of the p-value suggests strong
evidence against the null hypothesis - i.e. we should reject the hypothesis that I=0, implying
that some degree of spatial dependency is present.
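As a check on the arithmetic, the standard deviate and p-value can be recomputed by hand from the three sample estimates in the moran.test printout. This is a small base R sketch; the variable names are chosen here for illustration only.

```r
# Values copied from the moran.test printout for P_VACANT
I_obs <- 0.143721934
E_I   <- -0.007812500
Var_I <- 0.002973471

z <- (I_obs - E_I) / sqrt(Var_I)          # standard deviate
p <- pnorm(z, lower.tail = FALSE)         # one-sided: alternative "greater"

round(z, 4)                               # about 2.7789
signif(p, 4)                              # about 0.002727

# E(I) = -1/(N - 1), so the expectation implies the number of polygons
N <- 1 - 1 / E_I                          # 129
```

The one-sided tail is used because the printout reports the alternative hypothesis as "greater".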
We can now do the same test in terms of density of breach of peace events - firstly
compute the density values in events per square mile:
density = poly.counts(breach,blocks)/
ft2miles(ft2miles(poly.areas(blocks)))
and then carry out the Moran's I test:
moran.test(density,blocks.lw)
This gives a print-out similar to that before. In this case the Moran's I statistic is 0.235. As
a self test you should be able to find the p-value for this and decide whether Moran's I
differs significantly from zero.
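The density command above applies ft2miles twice because an area has two linear dimensions. The sketch below shows the same conversion in base R, assuming that ft2miles simply divides by 5280; the helper sqft_to_sqmiles and the block values are hypothetical.

```r
FEET_PER_MILE <- 5280

# Convert an area in square feet to square miles by dividing by 5280 twice,
# once for each linear dimension
sqft_to_sqmiles <- function(area_sqft) area_sqft / FEET_PER_MILE^2

# Hypothetical block: 10 events in an area of 6,969,600 square feet
events    <- 10
area_sqft <- 6969600                   # 0.25 square miles
events / sqft_to_sqmiles(area_sqft)    # 40 events per square mile
```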
Simulation-Based Tests
The basis for the significance tests in the last section was to compute the expected value
and variance of the Moran's I statistic under the assumption that there is no spatial
dependency in the attribute X. Here, this is done by assuming that if there were no spatial
dependency, then any of the observed X-values could have occurred with equal chance at
any of the polygons. In other words, any permutation of the attribute values among the
polygons is equally likely. The formulae for E(I) and Var(I) were theoretically derived given
this assumption. However, the assumption that Moran's I is normally distributed in this
case is only approximate.
In times when computers were a lot slower than they are now, this approach was probably
the most appropriate, but now there is an alternative approach. This is simply to permute
the attributes randomly amongst the polygons a large number of times, and note the
values of Moran's I each time. By comparing the actual Moran's I against these, we can
see how extreme the true value is compared to those generated under the assumption
that any permutation is equally likely. If there are n simulations, and m of these have a
value at least as large as the true Moran's I, then the experimental p-value is (m+1)/(n+1),
since the observed value is itself counted as one of the n+1 outcomes. The
theoretical approach of the previous section is relatively easy to compute (although seven
formulae may seem complex to a human, they can be calculated in a fraction of a second
by a computer) but it is only approximate. The simulation approach - also called the
Monte-Carlo approach - outlined here requires more computer time (usually n should be
around 10,000) but the simulations are of the true model. R can carry out simulation-based
tests with the moran.mc function:
moran.mc(P_VACANT,blocks.lw,nsim=10000)
The extra argument nsim tells the function how many simulations to carry out - that is, the
number n mentioned above. The result will be something like:
data: P_VACANT
weights: blocks.lw
number of simulations + 1: 10001
statistic = 0.1256, observed rank = 9909, p-value = 0.0092
alternative hypothesis: greater
Note that the p-value here - although slightly different from that obtained from moran.test -
still suggests that the hypothesis of zero Moran's I should be rejected. Also note that
when you run moran.mc you may well obtain slightly different results, as this approach is
based on random simulation, and so two runs of the function will not generally have identical
outcomes. As another self-test, try running moran.mc on the density variable.
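The permutation idea behind moran.mc can be sketched in a few lines of base R. This is an illustration under assumed data (a 5 x 5 grid with binary rook's case weights and a smooth trend), not the spdep implementation. The p-value counts the observed value as one of the nsim + 1 outcomes, in line with the "number of simulations + 1" reported in the printout.

```r
set.seed(1)   # fix the random permutations so the run is repeatable

morans_i <- function(x, w) {
  dev <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(dev, dev)) / sum(dev^2)
}

# Rook's case binary weights on a 5 x 5 grid, with a smooth trend as the attribute
cells <- expand.grid(row = 1:5, col = 1:5)
w <- (abs(outer(cells$row, cells$row, "-")) +
      abs(outer(cells$col, cells$col, "-")) == 1) * 1
x <- cells$row + cells$col

nsim  <- 999
i_obs <- morans_i(x, w)                            # 0.75 for this pattern
i_sim <- replicate(nsim, morans_i(sample(x), w))   # permute values over cells

m <- sum(i_sim >= i_obs)            # simulations at least as extreme
p_value <- (m + 1) / (nsim + 1)     # observed value counts as one outcome
p_value
```

Removing the set.seed line gives slightly different p-values from run to run, which is the behaviour noted above for moran.mc.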
$$ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i $$
where the Y variable is to be predicted by the X variable. The beta values $\{\beta_0, \beta_1\}$ are the
regression coefficients (intercept and slope respectively) and the final epsilon term $\epsilon_i$ is
an error term. In a standard model it is assumed that these are normally distributed, with
a mean of zero. It is also assumed that all errors have the same standard deviation, and
that they are independent. However, in many geographical situations, the last assumption
is dubious. The error term in a model is essentially related to factors influencing the Y
variable that are not reflected in the predictor variable X. If such factors relate to a
geographical phenomenon, it is possible that their effects might spill over, so that error
terms in adjacent regions will depend on one another. In this case, the model above will
be inappropriate, and models allowing for dependency in the epsilons should be
considered instead.
To consider this kind of model, we will look at two new New Haven crime variables related
to residential burglaries. These are both point objects, called burgres.f and burgres.n.
burgres.f is a list of burglaries occurring between 1st August 2007 and 31st January 2008
where entry was forced into the property, and burgres.n is a list of burglaries from the
same time period where no entry was forced. In the case of non-forced entry, this
suggests that the property was left insecure, perhaps by leaving a door or window open.
One interesting question is whether both kinds of
residential burglary occur in the same places - that is, if a place is a high risk area for
non-forced entry, does it imply that it is also a high risk for forced entry? To investigate this,
we will use a bivariate regression model that attempts to predict the density of forced
burglaries from the density of non-forced ones.
The indicators needed for this are the rates of burglary given the number of properties at
risk. Here we use the variable OCCUPIED from the data frame in the census blocks
object to estimate the number of properties at risk. If we were to compute rates per 1,000
households, this would be
1000*(number of burglaries in block)/OCCUPIED
and since this is over a six-month period, doubling this quantity gives the number of
burglaries per 1,000 households per year. However, typing in OCCUPIED shows that
some blocks have no occupied housing, so the above quantity is not defined. To
overcome this problem we select a subset of the blocks object consisting only of blocks
with greater than zero occupied dwellings. For polygon spatial objects, each individual
polygon can be treated like a row in a data frame for purposes of subset selection. Thus
to select only the blocks where the variable OCCUPIED is greater than zero, enter
blocks2 = blocks[OCCUPIED > 0,]
to store the subset census block data in the object blocks2. We can now compute the
burglary rates for forced and non-forced entries by first counting the burglaries in each
block in blocks2 (with the poly.counts function), dividing these numbers by the OCCUPIED
counts and then multiplying by 2,000 (to get yearly rates per 1,000 households). However,
before we do this, remember that we need the OCCUPIED column from blocks2 and not
blocks - but at the moment the one from blocks is attached. To sort this out, firstly detach
the data frame associated with blocks and then attach the one associated with blocks2:
detach(data.frame(blocks))
attach(data.frame(blocks2))
Now the two rate variables can be calculated:
forced.rate = 2000*poly.counts(burgres.f,blocks2)/OCCUPIED
notforced.rate = 2000*poly.counts(burgres.n,blocks2)/OCCUPIED
So we now have the two rates stored in forced.rate and notforced.rate. A first attempt at
modelling the relationship between the two rates could be via simple bivariate regression -
ignoring any spatial dependencies in the error term. This is done using the lm function,
which creates a simple regression model object.
model1 = lm(forced.rate~notforced.rate)
This stores the basic model in model1 - to see the regression coefficients, enter
summary(model1)
which produces the following output:
Call:
lm(formula = forced.rate ~ notforced.rate)

Residuals:
    Min      1Q  Median      3Q     Max
-11.209  -5.467  -1.434   3.002  30.926

Coefficients:
(Intercept)
notforced.rate
---
Signif. codes:
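The rate calculation and the simple regression described above can be tried on a toy data frame without the New Haven data. All the numbers below are hypothetical, and the sketch passes the data frame to lm through its data argument rather than using attach.

```r
# Hypothetical six-month burglary counts and occupied-dwelling counts for six blocks
toy <- data.frame(
  forced    = c(3, 0, 5, 2, 1, 4),
  notforced = c(1, 0, 2, 2, 0, 3),
  OCCUPIED  = c(400, 0, 650, 500, 250, 800)
)

# Drop blocks with no occupied dwellings, where the rate is undefined
toy2 <- toy[toy$OCCUPIED > 0, ]

# Yearly rate per 1,000 households: 1000 * (six-month count * 2) / households
toy2$forced.rate    <- 2000 * toy2$forced / toy2$OCCUPIED
toy2$notforced.rate <- 2000 * toy2$notforced / toy2$OCCUPIED

# Ordinary least squares fit, ignoring any spatial dependence in the errors
toy_model <- lm(forced.rate ~ notforced.rate, data = toy2)
coef(toy_model)
```

Working from a data argument avoids the need to detach one data frame and attach another when the subset changes.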
$$ y_i = \lambda \sum_j w_{ij} y_j + \beta_0 + \beta_1 x_i + \epsilon_i $$
The difference between this and the standard model is the first term on the right-hand side.
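The weighted sum of neighbouring y values in that extra term is often called a spatial lag, and it is a single matrix product. The sketch below is a base R illustration with a hypothetical four-area layout; it assumes row-standardised weights, so that each lag value is the mean of the neighbouring values.

```r
# Binary neighbour matrix for four areas arranged in a line: 1 - 2 - 3 - 4
adj <- matrix(c(0, 1, 0, 0,
                1, 0, 1, 0,
                0, 1, 0, 1,
                0, 0, 1, 0), nrow = 4, byrow = TRUE)

# Row-standardise so that each row of weights sums to one
w <- adj / rowSums(adj)

y <- c(10, 20, 40, 30)          # hypothetical values of the dependent variable

# Spatial lag: element i is sum_j w[i, j] * y[j], the mean of i's neighbours
lag_y <- as.vector(w %*% y)
lag_y                           # 20 25 25 40
```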
[Model summary output, only partly recoverable. Call: forced.rate ~ notforced.rate, with a listw argument. Residuals: Median -1.4167, 3Q 3.0909, Max 31.0006. Coefficients listed: (Intercept), notforced.rate.]
a p-value of around 0.319 - so we fail to reject the null hypothesis. This suggests that, in
this case, one does not need to allow for spatial dependency of the error term.
For the more mathematically-minded, if f(x,y) is the density function, then the probability
that an event occurs in an area A is:
$$ \iint_{(x,y) \in A} f(x, y) \, dy \, dx $$
$$ \hat{f}(x, y) = \frac{1}{h_1 h_2} \sum_i k\left( \frac{x - x_i}{h_1}, \frac{y - y_i}{h_2} \right) $$
in mathematical terms. The function k is the kernel function - that is, the 'bump'
described earlier. The h parameters control the smoothness of the estimate. Very small
values give rise to very 'spiky' surfaces, and large values to very flat ones. Typically,
they are chosen automatically, from the distribution of the points. Here, the function to
compute a kernel density estimation is kde.points. This estimates the value of the density
over a grid of points, and returns the result as a grid object. It can take two arguments -
the set of points to use, and another geographical object, whose bounding box will be the
extent of the grid object to be created. The points object breach will be used to produce a
kernel surface:
breach.dens = kde.points(breach,lims=tracts)
This stores the kernel density estimate of breach of peace in a grid object called
breach.dens. A quick way of drawing the density is to use the level.plot function:
level.plot(breach.dens)
This draws a shaded contour plot of the density function. One thing to notice is that this
covers a rectangular area - but to give context it would be helpful to add a map of New
Haven. For example, to add the Census tracts, type
plot(tracts,add=TRUE)
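To see what a kernel estimate of this kind computes, the kernel formula given earlier can be evaluated directly in base R. This is a sketch under stated assumptions, not the kde.points implementation: it uses hypothetical points, fixed bandwidths of 1, and a product Gaussian kernel. It follows the formula as written in the text, with no division by the number of points, so the surface integrates to the number of points rather than to one.

```r
# Hypothetical event locations
px <- c(2, 3, 5, 7, 8)
py <- c(1, 4, 5, 6, 9)
h1 <- 1; h2 <- 1                        # bandwidths in the x and y directions

# Evaluation grid, extended well beyond the points
gx <- seq(-5, 15, by = 0.25)
gy <- seq(-5, 15, by = 0.25)

# Product Gaussian kernel: k(u, v) = dnorm(u) * dnorm(v)
# Each point contributes a bump; the estimate is the sum of bumps / (h1 * h2)
kx <- outer(gx, px, function(g, p) dnorm((g - p) / h1))   # grid-x by point
ky <- outer(gy, py, function(g, p) dnorm((g - p) / h2))   # grid-y by point
dens <- (kx %*% t(ky)) / (h1 * h2)                        # grid-x by grid-y

# Summed over the grid, the surface integrates to the number of points
sum(dens) * 0.25 * 0.25
```

Dividing dens by the number of points would turn the surface into a probability density. Re-running with smaller or larger values of h1 and h2 shows the spiky and flat extremes described above.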
Another approach might be to mask out the information outside of the study area. The
kde.points function always computes values on a rectangular grid, but part of the grid lies
outside of the New Haven area. To overcome this, it is possible to create a mask polygon
object. This is simply a normal polygon object, shaped like the rectangle that kde.points
produces, but with a hole in it the shape of the study area. In this case the hole is shaped
like New Haven. If the mask polygon is plotted over the level plot of the grid data, with
both its edges and fill colour being white, the effect is to erase the parts of the density
surface lying outside of the study area. This can be achieved using the poly.outer function:
masker = poly.outer(breach.dens,tracts,extend=100)
The first two parameters give the outer rectangle and the hole shape, respectively. The
third parameter causes the outer rectangle to extend by a small amount in each
direction - sometimes this is useful, since occasionally there is a very slight mismatch
between the coordinates of the outer edge of the grid, and the outer edge of the mask
polygon. The erasing technique set out above might then fail to erase a small amount of
information on the edge of the grid. The extend parameter avoids this by making the mask
polygon's outer edges slightly exceed those of the grid. Here, we extend the edges by
100 feet. Now that we have a masking polygon, called masker, we can plot this on the map.
The quickest way to do this is to use the add.masking command - this is more or less the
same as the plot command, but defaults to drawing white filled polygons with white
boundaries. Enter
add.masking(masker)
This erases the part of the density map outside of New Haven. However it has also partly
erased the external boundaries of the census tracts. It would probably have been more
sensible to draw the tracts after the mask polygon was drawn.
A better map can be
achieved by entering the commands in this order:
level.plot(breach.dens)
add.masking(masker)
plot(tracts,add=TRUE)
Finally, it is also possible to use shading schemes (as seen in practical 2) to draw level
plots with different intervals or colours. To do this, the auto.shading function is used as
before. The variable to define the shading scheme is the kernel density estimate of the
breach.dens object - accessed by breach.dens$kde. The following gives a level plot with 7
levels, drawn as shades of green:
breach.dens.shades = auto.shading(breach.dens$kde,
n=7,cols=brewer.pal(7,"Greens"),cutter=range.cuts)
level.plot(breach.dens,shades=breach.dens.shades)
add.masking(masker)
plot(tracts,add=TRUE)
Note the first command is split over two lines.
End of Practical
At this stage, the practical has finished. To exit R, enter
save.image(file='rpract.RData')
detach(data.frame(blocks2))
q()
This will save your current variables into a file in your working folder, undo the attach
command entered earlier, and quit R.