
Neural coding: linear models

9.29 Lecture 1

1 What is computational neuroscience?


The term “computational neuroscience” has two different definitions:
1. using a computer to study the brain
2. studying the brain as a computer
In the first, the field is defined by a technique. In the second, it is defined by an idea.
Let’s discuss these two definitions in more depth.
Why use a computer to study the brain? The most compelling reason is the torrential flow of data generated by neurophysiology experiments. Today it is common to
simultaneously record the signals generated by tens of neurons in an awake behaving
animal. Once the measurement is done, the neuroscientist must analyze the data to
figure out what it means, and computers are necessary for this task. Computers are also
used to simulate neural systems. This is important when the models are complex, so
that their behaviors are not obvious from mere verbal reasoning.
On to the second definition. What does it mean to say that the brain is a computer?
To grasp this idea we must think beyond our desktop computers with their glowing
screens. The abacus is a computer, and so is a slide rule. What do these examples
have in common? They are all dynamical systems, but they are of a special class.
What’s special is that the state of a computer represents something else. The states of
transistors in your computer’s display memory represent the words and pictures that
are displayed on its screen. The locations of the beads on an abacus represent the money
passing through a shopkeeper’s hands. And the activities of neurons in our brains
represent the things that we sense and think about. In short,
computation = coding + dynamics
The two terms on the right hand side of this equation are the two great questions
for computational neuroscience. How are computational variables encoded in neural activity? How do the dynamical behaviors of neural networks emerge from the
properties of neurons?
The first half of this course will address the problem of encoding, or representation. The second half of the course will address the issue of brain dynamics, but only
incompletely. The biophysics of single neurons will be discussed, but the collective
behaviors of networks are left for 9.641 Introduction to Neural Networks.

2 Neural coding
As an introduction to the problem of neural coding, let me show you a video of a
neurophysiology experiment. This video comes from the laboratory of David Hubel,
who won the Nobel prize with his colleague Torsten Wiesel for their discoveries in the
mammalian visual system.
In the video, you will see a visual stimulus, a flashed or moving bar of light projected onto a screen. This is the stimulus that is being presented to the cat. You will
also hear the activity of a neuron recorded from the cat’s brain. I should also describe
what you will not see and hear. A cat has been anesthetized and placed in front of the
screen, with its eyelids held open. The tip of a tungsten wire has been placed inside the
skull, and lodged next to a neuron in a visual area of the brain. Although the cat is not
conscious, neurons in this area are still responsive to visual stimuli. The tungsten wire
is connected to an amplifier, so that the weak electrical signals from the neuron can be
recorded. The amplified signal is also used to drive a loudspeaker, and that is the sound
that you will hear.
As played on the loudspeaker, the response of the neuron consists of brief clicking
sounds. These clicks are due to spikes in the waveform of the electrical signal from
the neuron. The more technical term for spike is action potential. Almost without
exception, such spikes are characteristic of neural activity in the vertebrate brain.
As you can see and hear, the frequency of spiking is dependent on the properties of
the stimulus. The neuron is activated only when the bar is placed at a particular location
in the visual field. Furthermore, it is most strongly activated when the bar is presented
at a particular orientation. Arriving at such a verbal model of neural coding is more
difficult than it may seem from the video. David Hubel has recounted his feelings of
frustration during his initial studies of the visual cortex. For a long time, he used spots
of light as visual stimuli, because that had worked well in his previous studies of other
visual areas of the brain. But spots of light evoked only feeble responses from cortical
neurons. The spots of light were produced by a kind of slide projector. One day Hubel
was wrapping up yet another unsuccessful experiment. As he pulled the slide out of
the projector, he heard an eruption of spikes from the neuron. It was that observation
that led to the discovery that cortical neurons were most sensitive to oriented stimuli
like edges or bars.
The study of neural coding is not restricted to sensory processing. One can also
investigate the neural coding of motor variables. In this video, you will see the movements of a goldfish eye, and hear the activity of a neuron involved in control of these
movements. The oculomotor behavior consists of periods of static fixation, punctuated
by rapid saccadic movements. The rate of action potential firing during the fixation
periods is correlated with the horizontal position of the eye.
Finally, some neuroscientists study the encoding of computational variables that can't be classified as either sensory or motor. This video shows a recording of a
neuron in a rat as it moves about a circular arena. Neurons like this are sensitive to the
direction of the rat’s head relative to the arena, and are thought to be important for the
rat’s ability to navigate.
Verbal models are the first step towards understanding neural coding. But computational neuroscientists do not stop there. They strive for a deeper understanding by

constructing mathematically precise, quantitative models of neural coding. In the next
few lectures, you will learn how to construct such models. But first you have to become
familiar with the format of data from neurophysiological experiments.

3 Neurophysiological data
For your first homework assignment, you will be given data from an experiment on
the weakly electric fish Eigenmannia. The fish has a special organ that generates an
oscillating electric field with a frequency of several hundred Hz. It also has an electrosensory organ, with which it is able to sense its electric field and the fields of other
fish. The electric field is used for electrolocation and communication.
In the experiment, the fish was stimulated with an artificial electric field, and the
activity of a neuron in the electrosensory organ was recorded. The artificial electric
field was an amplitude-modulated sine wave, much like the natural electric field of the
fish. The stimulus vector si in the dataset contains the modulation signal sampled every
0.5 ms. The response vector ρi contains the spike train of the neuron. Its components
are either zero or one, indicating whether or not a spike occurred during each 0.5 ms
time bin.
As you will see in the homework, the probability of spiking during a time bin
depends linearly on the modulation signal. To visualize this dependence, one must first
transform the binary vector ρi into an analog firing probability pi . This is done by some
method of smoothing, as will be explained in a later lecture and in the assignment. If
the pairs (si , pi ) are plotted as points on a graph, a linear relationship can be seen.
The slope and intercept of the line can be found by optimizing the approximation pi ≈
a + bsi with respect to the parameters a and b.
So in this case, the neural coding problem can be addressed by simply fitting a
straight line to data points. This is probably the most common way to fit experimental
data in all of the sciences. Before we describe the technique below, let’s pause to note
that this is a very simple dataset. The stimulus is a scalar signal that varies with time.
More generally, a vector might be required to describe the stimulus at a given time, as
in the case of a dynamically varying image. The neural response might also be more
complicated, if the experiment involved simultaneous recording of many neurons. But
even in these more complex cases, it is sometimes possible to construct a linear model.
When we do so later, we will see that some of the simple concepts introduced below
can be generalized.

4 Fitting a straight line to data points

Suppose that we are given measurements (xi, yi), where the index i runs from 1 to m. In the context of the previous experiment, the measurements are (si, pi). We have
simply switched notation to emphasize the generality of the problem. Our task is to
find parameters a and b so that the approximation

yi ≈ a + bxi (1)

is as accurate as possible. Note that it is not generally possible to find a and b so that the
error vanishes completely. There are two reasons for this. First, measurements are not
exact, but suffer from experimental error. Second, while linear models are often used
in computational neuroscience, the underlying behavior is not truly linear. The linear
model is just an approximation. Note that this is unlike the case of physics, where the
proportionality of force and acceleration (F = ma) is considered a true “law.”
While there are many ways of finding an optimal a and b, the canonical one is the
method of least squares. Its starting point is the squared error function
E = (1/2) Σ_{i=1}^{m} (a + b xi − yi)²    (2)

which quantifies the accuracy of the model in Eq. (1). If E = 0 the model is perfect.
Minimizing E with respect to a and b is a reasonable way of finding the best approximation. Since E is quadratic in a and b, its minimum can be found by setting the partial
derivatives with respect to a and b equal to zero.
Setting ∂E/∂a = 0 yields
0 = ma + b Σ_i xi − Σ_i yi

while setting ∂E/∂b = 0 produces



0 = Σ_i (a + b xi − yi) xi    (3)
  = a Σ_i xi + b Σ_i xi² − Σ_i yi xi    (4)

Rearranging slightly, we obtain two simultaneous linear equations in two unknowns,


ma + b Σ_i xi = Σ_i yi    (5)
a Σ_i xi + b Σ_i xi² = Σ_i yi xi    (6)

As a shorthand for the coefficients of these linear equations, it is helpful to define


⟨x⟩ = (1/m) Σ_{i=1}^{m} xi        ⟨x²⟩ = (1/m) Σ_{i=1}^{m} xi²    (7)
⟨y⟩ = (1/m) Σ_{i=1}^{m} yi        ⟨xy⟩ = (1/m) Σ_{i=1}^{m} xi yi    (8)

The quantity ⟨x⟩ is known as the mean or first moment of x, while ⟨x²⟩ is known as the second moment. The quantity ⟨xy⟩ is called the correlation of x and y.
With this new notation, the equations for a and b take the compact form

a + b⟨x⟩ = ⟨y⟩    (9)
a⟨x⟩ + b⟨x²⟩ = ⟨xy⟩    (10)

We can solve for a in terms of b via

a = ⟨y⟩ − b⟨x⟩    (11)

This can be used to eliminate a completely, yielding

b = (⟨xy⟩ − ⟨x⟩⟨y⟩) / (⟨x²⟩ − ⟨x⟩²)    (12)

Backsubstituting this expression in Eq. (11) allows us to solve for a.

The numerator and denominator in Eq. (12) have special names. The denominator ⟨x²⟩ − ⟨x⟩² is called the variance of x, because it measures how much x fluctuates. Note that if all the xi are equal to a large constant C, the second moment ⟨x²⟩ = C² is large also. In contrast, the variance vanishes completely. The meaning of the variance is also evident in the identity

⟨(δx)²⟩ = ⟨x²⟩ − ⟨x⟩²

which you should verify for yourself. This equation says that the variance is the second moment of δx = x − ⟨x⟩, which is the deviation of x from its mean. The standard deviation is another term that you should learn. It is defined as the square root of the variance.
The numerator ⟨xy⟩ − ⟨x⟩⟨y⟩ in Eq. (12) is called the covariance of x and y. It is equal to the correlation of the fluctuations δx and δy,

⟨δx δy⟩ = ⟨xy⟩ − ⟨x⟩⟨y⟩

Again, I recommend that you verify this identity on your own.


In summary, we have a simple recipe for a linear fit. Compute the covariance
Cov(x, y) of x and y, and the variance Var(x) of x. The ratio of these two quantities
gives the slope b of the linear fit. Then compute a by Eq. (11).
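
As a concrete illustration, here is a minimal MATLAB sketch of this recipe (not from the original notes; x and y are assumed to be data vectors of equal length):

m = length(x);
xbar = mean(x);                      % <x>
ybar = mean(y);                      % <y>
covxy = mean(x.*y) - xbar*ybar;      % covariance <xy> - <x><y>
varx = mean(x.^2) - xbar^2;          % variance <x^2> - <x>^2
b = covxy/varx;                      % slope, Eq. (12)
a = ybar - b*xbar;                   % intercept, Eq. (11)

The built-in command polyfit(x,y,1) should return the same two numbers, with the slope listed first.
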
Substituting Eq. (11) in the linear approximation of Eq. (1) yields

yi − ⟨y⟩ ≈ b(xi − ⟨x⟩)

In other words, the constant a is unnecessary, if the linear fit is done to δx and δy, rather than to x and y. Given this fact, one approach is to compute the means ⟨x⟩ and ⟨y⟩ first, and subtract them from the data to get δx and δy. Then apply the formula

b = ⟨δx δy⟩ / ⟨(δx)²⟩

which is equivalent to Eq. (12). The trick of subtracting the mean comes up over and
over again in linear modeling.
Some of you may already have encountered the correlation coefficient r, which is
defined by
r = (⟨xy⟩ − ⟨x⟩⟨y⟩) / √[(⟨x²⟩ − ⟨x⟩²)(⟨y²⟩ − ⟨y⟩²)]

You may have learned that r close to ±1 means that the linear approximation is a good
one. The correlation coefficient is similar to the covariance, except for the presence of
the standard deviations of x and y in the denominator. The denominator normalizes
the correlation coefficient, so that it must lie between −1 and 1, unlike the covariance,
which can take on any value in principle. If you know the Cauchy-Schwarz inequality,
you can use it to prove that −1 ≤ r ≤ 1, but this is not so illuminating.
The correlation coefficient can be interpreted as measuring the reduction in variance
that comes from taking a linear (first-order) model of the data, as opposed to a constant
(zeroth-order) model. Recall that the squared error of Eq. (2) measures the variance
of the deviation of the data points from the straight line. This variance vanishes only
when the model is perfect.
For the best zeroth-order model, we constrain b = 0 in Eq. (2), so that E is minimized when a = ⟨y⟩, taking a value proportional to the variance of y. For the best
first-order model, E is minimized with respect to both a and b, so that its optimal value
is further reduced. The ratio of the new E to the old E is 1 − r². Another way of saying it is that r² is the fraction of the variance in y that is explained by the linear term in the model.
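
To check this interpretation numerically, one can compare the two minimal errors directly (a sketch, continuing the hypothetical x, y, a, b above):

E0 = 0.5*sum((y - mean(y)).^2);      % error of the best constant model
E1 = 0.5*sum((a + b*x - y).^2);      % error of the best straight line
R = corrcoef(x,y);                   % 2x2 matrix of correlation coefficients
r = R(1,2);
E1/E0                                % should equal 1 - r^2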

Convolution, correlation, and the Wiener-Hopf
equations

9.29 Lecture 2

In this lecture, we’ll learn about two mathematical operations that are commonly
used in signal processing, convolution and correlation. The convolution is used to
linearly filter a signal, for example to smooth a spike train to estimate probability of
firing. The correlation is used to characterize the statistical dependencies between two
signals.
When analyzing neural data, the firing rate of a neuron is sometimes modeled as a
linear filtering of the stimulus. Alternatively, the stimulus is modeled as a linear filtering of the spike train. To construct such a model, the optimal filter must be determined from the data. This problem was studied by the famous mathematician Norbert Wiener in the 1940s. It requires the solution of the Wiener-Hopf equations.

1 Convolution
Let’s consider two time series, gi and hi , where the index i runs from −∞ to ∞. The
convolution of these two time series is defined as


(g ∗ h)i = Σ_{j=−∞}^{∞} g_{i−j} hj    (1)

This definition is applicable to time series of infinite length. If g and h are finite, they
can be extended to infinite length by adding zeros at both ends. After this trick, called
zero padding, the definition in Eq. (1) becomes applicable. For example, the sum in
Eq. (1) becomes
(g ∗ h)i = Σ_{j=0}^{n−1} g_{i−j} hj    (2)

for the finite time series h0 , . . . , hn−1 . Another trick for turning a finite time series into
an infinite one is to repeat it over and over. This is sometimes called periodic boundary
conditions, and will be encountered later in our study of Fourier analysis.
The convolution operation, like ordinary scalar multiplication, is both commutative
g ∗ h = h ∗ g and associative f ∗ (g ∗ h) = (f ∗ g) ∗ h. Although g and h are treated
symmetrically by the convolution, they generally have very different natures. Typically,

one is a signal that goes on indefinitely in time. The other is concentrated near time
zero, and is called a filter or convolution kernel. The output of the convolution is also
a signal, a filtered version of the input signal.
In Eq. (2), we chose hi to be zero for all negative i. This is called a causal filter,
because g ∗ h is affected by h in the present and past, but not in the future. In some
contexts, the causality constraint is not important, and one can take h−M , . . . , hM to
be nonzero, for example.
Formulas are nice and compact, but now let’s draw some diagrams to see how this
works. Let m and n be the dimensions of g and h respectively. For simplicity, assume
zero-offset indexing, so that the first components of g and h are g0 and h0 (not g1 and
h1 as in MATLAB). Then (g ∗ h)0 is given by summing g−j hj over j, which can be
visualized as
··· gm−1 ··· g1 g0 0 0 0 0 ···
··· 0 ··· 0 h0 h1 ··· hn−1 0 ···

Next, (g ∗ h)1 is found by summing g1−j hj over j, which can be visualized as

··· 0 gm−1 ··· g1 g0 0 0 0 ···


··· 0 ··· 0 h0 h1 ··· hn−1 0 ···
The rest of the components of g ∗ h are generated by sliding the g vector to the right.
The last nonzero component (g ∗ h)m+n−2 can be visualized as

··· 0 0 ··· gm−1 ··· g1 g0 0 ···


··· h0 h1 ··· hn−1 ··· 0 0 0 ···
Therefore g ∗ h has m + n − 1 nonvanishing components, which is why the MATLAB
function conv returns an m + n − 1 dimensional vector.
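
A quick numerical check of this counting (a sketch; the vectors are arbitrary):

g = [1 2 3 4];            % m = 4
h = [1 1 1];              % n = 3
f = conv(g,h)             % [1 3 6 9 7 4]
length(f)                 % m + n - 1 = 6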

2 Probability of firing
The spike train ρi is a binary-valued time series. Since linear models are best suited for
analog variables, it is helpful to replace ρi with a probability pi of firing per time bin.
Many methods for doing this can be expressed in the convolutional form

pi = Σ_j ρ_{i−j} wj

where w satisfies the constraint Σ_j wj = 1, with nonnegative weights. According to this formula, pi is a weighted average of ρi and its neighbors, so that 0 ≤ pi ≤ 1.
There are many different ways to choose w, depending on the particulars of the
application. For example, w could be chosen to be of length n, with nonzero values
equal to 1/n. This is sometimes called a “boxcar” filter. MATLAB comes with a lot
of other filter shapes. Try typing help bartlett, and you’ll find more information
about the Bartlett and other types of windows that are good for smoothing. Depending
on the context, you might want a causal or a noncausal filter for estimating probability
of firing.
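
For example, a causal boxcar smoothing of a spike train might look like the following sketch (rho is assumed to be a vector of zeros and ones; the window length of 20 bins is arbitrary):

n = 20;
w = ones(1,n)/n;                  % boxcar filter whose weights sum to 1
p = conv(double(rho), w);         % weighted average of rho over the past n bins
p = p(1:length(rho));             % keep the part aligned with the causal filter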

Another option is to choose the kernel to be a decaying exponential,

wj = 0 for j < 0,        wj = γ(1 − γ)^j for j ≥ 0

This is causal, but has infinite duration. As an exercise, you could try proving that this is equivalent to

pi = (1 − γ) p_{i−1} + γ ρi
The probability p of firing in a time bin is closely related to the firing frequency ν by p = νΔt, where Δt is the sampling interval. Probabilistic models of neural activity
will be treated more formally in a later lecture.
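
A numerical check of this equivalence (a sketch; the value of gamma, the fake spike train, and the kernel truncation are arbitrary choices):

gamma = 0.1;
rho = double(rand(1,1000) < 0.2);        % a fake spike train for testing
p1 = zeros(1,length(rho));               % recursive version
p1(1) = gamma*rho(1);
for i = 2:length(rho)
    p1(i) = (1-gamma)*p1(i-1) + gamma*rho(i);
end
w = gamma*(1-gamma).^(0:99);             % truncated exponential kernel
p2 = conv(rho, w);
p2 = p2(1:length(rho));
max(abs(p1-p2))                          % tiny, up to truncation of the kernel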

3 Correlation
The correlation of two time series is


Corr[g, h]j = Σ_{i=−∞}^{∞} gi h_{i+j}

The case j = 0 corresponds to the correlation that was defined in the first lecture. The difference here is that g and h are correlated at times separated by the lag j.¹ As
with the convolution, this definition can be applied to finite time series by using zero
padding. Note that Corr[g, h]j = Corr[h, g]−j , so that the correlation operation is not
commutative. Typically, the correlation is applied to two signals, while its output is
concentrated near zero.
If g and h are n-dimensional vectors, then the MATLAB command xcorr(g,h) returns a 2n − 1 dimensional vector, corresponding to the lags j = −(n − 1) to n − 1. Lags beyond this range are not included, as the correlation vanishes. The zero lag case looks
like
· · · 0 g1 g2 · · · gn 0 · · ·
· · · 0 h1 h2 · · · h n 0 · · ·
and the other lags correspond to sliding h right or left. A maximum lag can also be
given, xcorr(g,h,maxlag), restricting the range of lags computed to -maxlag to
maxlag. The default is the unnormalized correlation given above, but there are other
options too.
The autocorrelation is a special case of the correlation, with g = h. If g ≠ h, the correlation is sometimes called the crosscorrelation to distinguish it from the autocorrelation. In the first lecture, we distinguished between correlation and covariance. The
covariance was defined as the correlation with the means subtracted out. Similarly, the
cross-covariance can be defined as the correlation left between two time series after
subtracting out the means. The auto-covariance is a special case. The command xcov
can be used for this purpose.
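
For example (a sketch; the signals are arbitrary random vectors):

g = randn(1,100);
h = randn(1,100);
[c, lags] = xcorr(g,h);       % unnormalized correlation at lags -99 to 99
cv = xcov(g,h);               % the same computation with the means subtracted first
plot(lags, c)
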
¹ Warning: This is the convention followed by Dayan and Abbott, and by MATLAB. Some other books, like Numerical Recipes, call the above sum Corr[h, g]_j.

4 Spike-triggered average
Demonstration of these ideas (a MATLAB sketch follows the list):
• Convolve spike train ρ with filter to find firing rate
• Autocorrelation of stimulus
• Autocorrelation of spike train

• Cross-correlation of spike train and stimulus
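
A minimal MATLAB sketch of such a demonstration (assuming a stimulus vector s and a spike train rho of zeros and ones, both of the same length; the window and maximum lag are arbitrary):

w = ones(1,20)/20;                           % smoothing filter, sums to 1
rate = conv(double(rho), w);                 % smoothed firing probability
rate = rate(1:length(rho));
[qss, lag] = xcov(s, 50);                    % autocovariance of the stimulus
[qrr, lag] = xcov(double(rho), 50);          % autocovariance of the spike train
[qrs, lag] = xcov(double(rho), s, 50);       % cross-covariance of spike train and stimulus
plot(lag, qrs)                               % roughly the shape of the spike-triggered average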

5 The Wiener-Hopf equations


Suppose that we’d like to model the time series yi as a filtered version of xi , i.e. find
the h that optimizes the approximation

yi ≈ Σ_j hj x_{i−j}

We assume that both x and y have had their means subtracted out, so that no additive
constant is needed in the model. Also, hj is assumed to be zero for j < M1 or j > M2 .
This constrains how far forward or backward in time the kernel extends. For example,
M1 = 0 corresponds to the case of a causal filter.
The best approximation in the least squares sense is obtained by minimizing the
squared error
 2
M2
1 � �
E= yi − hj xi−j 
2 i
j=M1

relative to hj for j = M1 to M2. This is analogous to the squared error function for linear regression, which we saw in the first lecture.
The minimum is given by the equations, ∂E/∂hk = 0, for k = M1 to M2 . These
are the famous Wiener-Hopf equations,
C^{xy}_k = Σ_{j=M1}^{M2} hj C^{xx}_{k−j},        k = M1, …, M2    (3)

where the shorthand notation

C^{xy}_k = Σ_i xi y_{i+k}        C^{xx}_l = Σ_i xi x_{i+l}

has been used for the cross-covariance and auto-covariance. You’ll be asked to prove
this in the homework. This is a set of M2 − M1 + 1 linear equations in M2 − M1 + 1
unknowns, so it typically has a unique solution. For our purposes, it will be sufficient to
solve them using the backslash (\) and toeplitz commands in MATLAB. If you’re

worried about minimizing computation time, there are more efficient methods, like
Levinson-Durbin recursion.
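
Here is one way such a solution might look for a causal filter (a sketch, not the problem-set code; x and y are assumed to be mean-subtracted vectors of the same length N, and the filter length M is arbitrary, so M1 = 0 and M2 = M − 1):

M = 50;  N = length(x);
Cxx = zeros(M,1);  Cxy = zeros(M,1);
for k = 0:M-1
    Cxx(k+1) = sum(x(1:N-k).*x(1+k:N));   % C^xx_k, following the definition above
    Cxy(k+1) = sum(x(1:N-k).*y(1+k:N));   % C^xy_k
end
A = toeplitz(Cxx);                        % matrix of C^xx_{k-j}, symmetric Toeplitz
h = A \ Cxy;                              % solve the Wiener-Hopf equations for h_0 ... h_{M-1}
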
Recall that in simple linear regression, the slope of the optimal line times the variance of x is equal to the covariance of x and y. This is a special case of the Wiener-Hopf
equations. In particular, linear regression corresponds to the case M1 = M2 = 0, for
which
h_0 = C^{xy}_0 / C^{xx}_0

6 White noise analysis


If the input x is Gaussian white noise, then the solution of the Wiener-Hopf equations is trivial, because C^{xx}_{k−j} = C^{xx}_0 δ_{kj}. Therefore

h_k = C^{xy}_k / C^{xx}_0    (4)

So a simple way to model a linear system is to stimulate it with white noise, and
correlate the input with the output. This method is called reverse correlation or white
noise analysis.
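
In MATLAB, the white noise case might be checked like this (a sketch; the filter and signal are made up for the test):

N = 10000;
x = randn(1,N);                           % Gaussian white noise input
htrue = [0.5 1 0.7 0.3 0.1];              % a made-up causal filter
y = conv(x, htrue);  y = y(1:N);          % simulated output
Cxx0 = sum(x.*x);
hest = zeros(1,5);
for k = 0:4
    hest(k+1) = sum(x(1:N-k).*y(1+k:N))/Cxx0;   % Eq. (4)
end
hest                                      % should be close to htrue
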
If the input x is not white noise, then you must actually do some work to solve the
Wiener-Hopf equations. But if the input x is close to being white noise, you might
get away with being lazy. Just choose the filter to be proportional to the xy cross-correlation, h_k = C^{xy}_k / γ, as in the formula (4). The optimal choice of the normalization factor γ is

γ = ( Σ_{jl} C^{xy}_j C^{xx}_{j−l} C^{xy}_l ) / ( Σ_m C^{xy}_m C^{xy}_m )

where the summations run from M1 to M2. Note this reduces to γ = C^{xx}_0 in the case of white noise, as in Eq. (4).

Basic Linear Algebra in MATLAB

9.29 Optional Lecture 2

In the last optional lecture we learned that the basic type in MATLAB is a matrix of double precision floating point numbers. You learned a number of different tools for initializing matrices and some basic functions that use them. This time, we'll
make sure that we understand the basic algebraic operations that can be performed on
matrices, and how we can use them to solve a set of linear equations.

A Note on Notation
The convention used in this lecture and in most linear algebra books is that an italics
lower case letter (k) denotes a scalar, a bold lower case letter (x) denotes a vector, and
a capital letter (A) denotes a matrix. Typically we name our MATLAB variables with
a capital letter if they will be used as matrices, and lower case for scalars and vectors.

1 Vector Algebra
Remember that in MATLAB, a vector is simply a matrix with the size of one dimension
equal to 1. We should distinguish between a row vector (a 1×n matrix) and a column vector (an n×1 matrix). Recall that we change a row vector x into a column vector using the transpose operator (x' in MATLAB). The same trick works for changing a column vector into a row vector.
We can add two vectors, x and y, together if they have the same dimensions. The
resulting vector z = x + y is simply an element by element addition of the components
of x and y: zi = xi + yi. From this it follows that vector addition is both commutative and associative, just like regular addition. MATLAB also allows you to add a scalar k (a 1×1 matrix) to a vector. The result of z = x + k is the element by element addition zi = k + xi.
Vector multiplication can take a few different forms. First of all, if we multiply a
scalar k times a vector x, the result is a vector with the same dimension as x: z = kx
implies zi = kxi . There are two standard ways to multiply two vectors together: the
inner product and the outer product.
The inner product, sometimes called the dot product, is the result of multiplying a row vector times a column vector. The result is a scalar: z = xy = Σ_i xi yi. To take the inner product of two column vectors, use z = x′y. As we'll see, the orientation of the vectors matters because MATLAB treats vectors as matrices.

Unlike the inner product, the result of the outer product of two vectors is a matrix. In MATLAB, you get the outer product by multiplying a column vector times a row vector: Z = xy. The components of Z are Z_ij = xi yj. To take the outer product of two column vectors, use Z = xy′.
Occasionally, what we really want to do is to multiply two vectors together element by element: zi = xi yi. MATLAB provides the .* operator for this operation: z = x.*y.
To test our understanding, let's try some basic MATLAB commands:
x = 1:5

y = 6:10

x+y

x+5

5*x

x*y’

x’*y

x.*y

How would you initialize the following matrix in MATLAB using outer products?
 
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
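
One possible answer (a sketch):

Z = ones(5,1)*(1:5)       % a column of ones times the row vector 1:5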

2 Matrix Algebra
The matrix operations are simply generalizations of the vector operations when the
matrix has multiple rows and columns. You can think of a matrix as a set of row
vectors or as a set of column vectors.
Matrix addition works element by element, just like vector addition. It is defined for any two matrices of the same size. C = A + B implies that C_ij = A_ij + B_ij. Once again, it is both commutative and associative. Scalar multiplication of matrices is also defined as it was with vectors. The result is a matrix: C = kA implies C_ij = kA_ij.
If you multiply a matrix times a column vector, then the result is another column vector - the column of inner products: b = Ax implies b_i = Σ_j A_ij xj. Similarly, you can multiply a row vector times a matrix to get a row of inner products: b = xA implies b_i = Σ_j A_ji xj. Notice that in both cases, the definitions require that the first variable must have the same number of columns as the second variable has rows.
This idea generalizes to multiplying two matrices together. For the multiplication C = AB, the matrix C is simply a collection of inner products: C_ik = Σ_j A_ij B_jk. In this case, A must have the same number of columns as B has rows. Like ordinary multiplication, matrix multiplication is associative and distributive, but unlike ordinary multiplication, it is not commutative. In general, AB ≠ BA.
Now we are in a position to better understand the matrix transpose. If B = A′, then B_ij = A_ji. Think of this as flipping the matrix along the diagonal. This explains why

the transpose operator changes a row vector into a column vector and vice versa. The following identity holds for the definitions of multiplication and transpose: (AB)′ = B′A′. This helps us to understand the difference between x′A and Ax. Notice that for a column vector x, (Ax)′ = x′A′.
There are a few more matrix terms we should know. A square matrix is an n×n matrix (it has the same number of rows and columns). A diagonal matrix A has non-zero elements only along the diagonal (A_ii), and zeros everywhere else. You can initialize a diagonal matrix in MATLAB by passing a vector to the diag command. The identity matrix is a special diagonal matrix with all diagonal elements set to 1. You can initialize an identity matrix using the eye command.
Try the following MATLAB commands:

diag(1:5)

3 Solving Linear Equations


Let's take a step back for a moment, and try to solve the following set of linear equations:

x1 + 3x2 = 4
2x1 + 2x2 = 9

With a little manipulation, we find that x1 = 4.75 and x2 = −0.25. We could solve
this set of equations because we had 2 equations and 2 unknowns. How should we
solve a set of equations with 50 equations and 50 unknowns?
Let’s rewrite the previous expression in matrix form:
[ 1 3 ] [ x1 ]   [ 4 ]
[ 2 2 ] [ x2 ] = [ 9 ]

Notice that we could use the same form, Ax = b, for our set of 50 equations with 50
unknowns. As expected, MATLAB provides all of the tools that we need to solve this
matrix formula, and it uses the idea of a matrix inverse.
The inverse of a square matrix A, which is A⁻¹ in the textbooks and inv(A) in MATLAB, has the property that A⁻¹A = AA⁻¹ = I. Using this, let's manipulate our previous equation:
previous equation:

Ax = b
A⁻¹Ax = A⁻¹b
x = A⁻¹b

Now solve the original equations in MATLAB using inv(A)*b. You should get the vector containing 4.75 and -0.25.
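
For example (a sketch):

A = [1 3; 2 2];
b = [4; 9];
x = inv(A)*b              % returns [4.75; -0.25]
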
There are a few things to remember about matrix inverses. First of all, they are only defined for square matrices. The inverse works with the transpose and multiplication operations with the following identities: (A′)⁻¹ = (A⁻¹)′ and (AB)⁻¹ = B⁻¹A⁻¹. You should

be able to verify these properties on your own using the ideas we’ve developed. But the
most important thing to know about matrix inverses is that they don’t always exist, even
for square matrices. Using MATLAB, try taking the inverse of the following matrix:
A = [ 1 2 ]
    [ 2 4 ]

Now try inserting this A into the system of equations at the beginning of this section,
and solving it using good old fashioned algebra. Why doesn’t the inverse exist?

4 Quadratic Optimization
We would like to solve the equation Ax = b even if A is not square. Let's separate the problem into a few cases where the matrix A is an m×n matrix:
If m < n, then we have more unknowns than equations. In general, this system
will have infinitely many solutions.
If m > n, then we have more equations than unknowns. In general, this system doesn't have any solution. What if we don't want to have MATLAB always return "no solution", but we actually want the closest solution in the least squares sense? This is equivalent to minimizing the following quantity:
E = (1/2) Σ_i ( Σ_j A_ij xj − bi )²

MATLAB provides the backslash operator to accomplish the least squares fit for a
matrix equation: x = A \ b. Type help slash to appreciate the power of this command.
You’ll see that we could have used this command to solve the square matrix equations,
too.

5 Eigenmannia
How does this relate to the fish data that we used in problem set 1? Recall that we were
given a set of points (xi , yi ), and we were asked to find the coefficients a and b to fit
the following linear model:
yi ≈ a + bxi
You can think of each point as an equation, and write the entire data set in matrix form:
   
[ 1  x1 ]            [ y1 ]
[ 1  x2 ]   [ a ]    [ y2 ]
[ ⋮  ⋮  ]   [ b ]  = [ ⋮  ]
[ 1  xm ]            [ ym ]

If you call the left matrix A and the right side b, then calling A \ b will ask MATLAB
to solve for the values of a and b that minimize the least-squared error of the model. It
will return exactly the same values for a and b that the polyfit command returns.
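
A sketch of how this might look (x and y are assumed to be column vectors of data):

m = length(x);
A = [ones(m,1) x];        % each row represents one equation a + b*x_i = y_i
p = A \ y;                % least squares solution: p(1) is a, p(2) is b
q = polyfit(x,y,1);       % same fit; polyfit lists the slope first, q = [b a]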

For this simple example, the polyfit command and the backslash command accomplished the same task. But what if you were given a set of points (xi, yi, zi) and you were asked to fit the following linear model:

zi ≈ a + bxi + cyi

The matrix notation easily scales to this problem, but the polyfit function does not.

More about convolution and correlation

9.29 Lecture 3

1 Some odds and ends


Consider a spike train ρ1 , . . . , ρN . One estimate of the probability of firing is
p = (1/N) Σ_i ρi    (1)
This estimate is satisfactory, as long as it makes sense to describe the whole spike train
by a single probability that does not vary with time. This is an assumption of statistical
stationarity.
More commonly, a better model assumes that the probability varies slowly with time (is nonstationary). Then it's better to apply something like Eq. (1) to small
segments of the spike train, rather than to the whole spike train. For example, the
formula
pi = (ρi+1 + ρi + ρi−1 )/3 (2)
estimates the probability at time i by counting the number of spikes in three time bins,
and then dividing by three. In the first problem set, you were instructed to smooth the
spike train like this, but to use a much wider window. In general, choosing the size of
window involves a tradeoff. A larger window minimizes the effects of statistical sampling error (like flipping a coin many times to more accurately determine its probability
of coming up heads). But a larger window also reduces the ability to follow more rapid
changes in the probability as a function of time.
Note that the formula (2) isn’t to be trusted near the edges of the signal, as the filter
operates on the zeros that surround the signal.
In the last lecture, we defined the unnormalized correlation. There is also a normalized version that looks like

Q^{xy}_j = (1/m) Σ_{i=1}^{m} xi y_{i+j}

To compensate for boundary effects, the form

Q^{xy}_j = (1/(m − |j|)) Σ_{i=1}^{m} xi y_{i+j}

is sometimes preferred. Both forms can be obtained through the appropriate options to the xcorr command.
A signal is called white noise if the correlation vanishes, except at lag zero.
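
In MATLAB, these two normalizations correspond to the 'biased' and 'unbiased' options of xcorr (a sketch using a white noise signal):

x = randn(1,1000);                          % white noise
[qb, lags] = xcorr(x, 20, 'biased');        % divides by m
[qu, lags] = xcorr(x, 20, 'unbiased');      % divides by m - |j|
plot(lags, qb)                              % near zero except at lag zero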

2 Using the conv function
We learned last time that if g0, g1, . . . , gM−1 and h0, h1, . . . , hN−1 are given as arguments to the conv function, then the output is f0, f1, . . . , fM+N−2, where we denote f = g ∗ h.
Let's generalize this: if gM1, . . . , gM2 and hN1, . . . , hN2 are given as arguments to the conv function, then the output is fM1+N1, . . . , fM2+N2.
For example, suppose that g is a signal, and h represents an acausal filter, with
N1 < 0 and N2 > 0. Throwing out the first |N1 | and last N2 elements of f leaves us
with fM1 , . . . , fM2 , which are at the same times as the signal g.
Note that this prescription for discarding the elements is intended for time aligning
the result of the convolution with the input signal, and for producing a result that is the
same length.
A different motivation for discarding elements at the beginning and end is that they
may be corrupted by edge effects. If you are really worried about this, you may have
to discard more than was prescribed above.
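
For example, for an acausal boxcar filter with N1 = −5 and N2 = 5 (a sketch; the signal is arbitrary):

g = randn(1,200);                 % signal occupying indices 0 to 199
h = ones(1,11)/11;                % acausal filter occupying indices -5 to 5
f = conv(g,h);                    % occupies indices -5 to 204
f = f(6:end-5);                   % discard the first |N1| = 5 and last N2 = 5 elements
length(f)                         % same length as g, and time aligned with it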

3 Impulse response
Consider the signal consisting of a single impulse at time zero,

δj = 1 for j = 0,        δj = 0 for j ≠ 0
The convolution of this signal with a filter h is

(δ ∗ h)j = Σ_k δ_{j−k} hk = hj

which is just the filter h again. In other words, h is the response of the filter to an
impulse, or the impulse response function. If the impulse is displaced from time 0 to
time i, then the result of the convolution is the filter h, displaced by i time steps.
A spike train is just a superposition of impulses at different times. Therefore, con­
volving a spike train with a filter gives a superposition of filters at different times.
The “Kronecker delta” notation δij is equivalent to δi−j .

4 Matrix form of convolution


The convolution of g0 , g1 , g2 and h0 , h1 , h2 can be written as
g ∗ h = Gh
where the matrix G is defined by
 

    [ g0  0   0  ]
    [ g1  g0  0  ]
G = [ g2  g1  g0 ]    (3)
    [ 0   g2  g1 ]
    [ 0   0   g2 ]

and g ∗h and h are treated as column vectors. Each column of G is the same time series,
but shifted by a different amount. You can use the MATLAB function convmtx to
create matrices like G from time series like g. This function is found in the Signal
Processing Toolbox.
If you don’t have this toolbox installed, you can make use of the fact that Eq. (3) is
a Toeplitz matrix, and can be constructed by giving its first column and first row to the
toeplitz command in MATLAB.
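
For example, the matrix of Eq. (3) could be built and checked like this (a sketch with made-up numbers standing in for g0, g1, g2):

g = [1; 2; 3];                         % stands in for g0, g1, g2
G = toeplitz([g; 0; 0], [g(1) 0 0]);   % first column g0 g1 g2 0 0, first row g0 0 0
h = [4; 5; 6];
G*h                                    % same as conv(g,h)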

5 Convolution as multiplication of polynomials


If the second degree polynomials g0 + g1z + g2z² and h0 + h1z + h2z² are multiplied together, the result is a fourth degree polynomial. Let's call this polynomial f0 + f1z + f2z² + f3z³ + f4z⁴. This is equivalent to f = g ∗ h.
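
You can check this correspondence in MATLAB, since conv multiplies coefficient vectors (a sketch with arbitrary coefficients):

g = [1 0 2];              % 1 + 2z^2
h = [3 1 1];              % 3 + z + z^2
f = conv(g,h)             % [3 1 7 2 2], i.e. 3 + z + 7z^2 + 2z^3 + 2z^4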

6 Discrete versus continuous time


In the previous lecture, the convolution, correlation, and the Wiener-Hopf equations
were defined for data sampled at discrete time points. In the remainder of this lecture,
the parallel definitions will be given for continuous time.
Before the advent of the digital computer, the continuous time formulation was
more important, because of its convenience for symbolic calculations. But for numerical analysis of experimental data, it is the discrete time formulation that is essential.

7 Convolution
Consider two functions g and h defined on the real line. Their convolution g ∗ h is
defined as

(g ∗ h)(t) = ∫_{−∞}^{∞} dt′ g(t − t′) h(t′)

The continuous variables t and t′ have taken the place of the discrete indices i and j.
Again, you should verify commutativity and associativity.
If g and h are only defined on finite intervals, they can be extended to the entire
real line using the zero padding trick. For example, if h vanishes outside the interval
[0, T], then

(g ∗ h)(t) = ∫_0^T dt′ g(t − t′) h(t′)

8 Firing rate
To define the continuous-time representation of a spike train, we need to make use of
a mathematical construct called the Dirac delta function. The delta function is zero
everywhere, except at the origin, where it is infinite. You can imagine it as a box of

width Δt and height 1/Δt centered around the origin, with the limit Δt → 0. The
delta function is defined by the identity
h(t) = ∫_{−∞}^{∞} dt′ δ(t − t′) h(t′)

In other words, when the delta function is convolved with a function, the result is the
same function, or h = δ ∗ h. A special case of this formula is the normalization
condition

1 = ∫_{−∞}^{∞} dt′ δ(t − t′)

Note that the delta function has dimensions of inverse time.


The delta function represents a single spike at the origin. A spike train with spikes
at times ta can be written as a sum of delta functions,

ρ(t) = Σ_a δ(t − t_a)

A standard way of estimating firing rate from a spike train is to convolve it with a
response function w

ν(t) = ∫ dt′ w(t − t′) ρ(t′)    (4)
     = ∫ dt′ w(t − t′) Σ_a δ(t′ − t_a)    (5)
     = Σ_a ∫ dt′ w(t − t′) δ(t′ − t_a)    (6)
     = Σ_a w(t − t_a)    (7)

So the convolution simply adds up copies of the response function centered around the
spike times. Note that it’s important to choose a kernel satisfying

∫ dt w(t) = 1

so that

∫ dt ν(t) = ∫ dt ρ(t)

Since the Dirac delta function has dimensions of inverse time, smoothing ρ(t) results
in an estimate of firing rate. In contrast, the discrete spike train ρi is dimensionless, so
smoothing it results in an estimate of probability of firing. You can think of ρ(t) as the
Δt → 0 limit of ρi /Δt.

9 Low-pass filter
To see the convolution in action, consider the differential equation

τ dx/dt + x = h
This is an equation for a low-pass filter with time constant τ . Given a signal h, the
output of the filter is a signal x that is smoothed over the time scale τ . The solution
can be written as the convolution x = g ∗ h, where the “impulse response function” g
is defined as
g(t) = (1/τ) e^{−t/τ} θ(t)
and we have defined the Heaviside step function θ(t), which is zero for all negative
time and one for all positive time. The response function g is zero for all negative time,
jumps to a nonzero value at time zero, and then decays exponentially for positive time.
To construct the function x, the convolution places a copy of the response function
g(t − t� ) at every time t� . Each copy gets weighted by h(t� ), and they are all summed to
obtain x(t). The response function is sometimes called the kernel of the convolution.
To see another application of the delta function, note that the impulse response
function for the low-pass filter satisfies the differential equation

τ dg/dt + g = δ(t)
In other words, g is the response to driving the low-pass filter with an “impulse” δ(t),
which is why it’s called the impulse response.
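
A discrete-time sketch of this filter, comparing direct (Euler) integration of the differential equation with convolution by the sampled impulse response (the parameter values are arbitrary):

dt = 0.001;  tau = 0.02;                 % time step and time constant, in seconds
t = 0:dt:1;
h = randn(size(t));                      % input signal
x1 = zeros(size(t));                     % Euler integration of tau dx/dt + x = h
for i = 2:length(t)
    x1(i) = x1(i-1) + (dt/tau)*(h(i-1) - x1(i-1));
end
tg = 0:dt:0.2;                           % 10 time constants is effectively infinite
g = (1/tau)*exp(-tg/tau);                % sampled impulse response
x2 = conv(h,g)*dt;                       % dt approximates the integral
x2 = x2(1:length(t));
plot(t, x1, t, x2)                       % the two curves approximately coincide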

10 Correlation
The correlation is defined as
Corr[g, h](t) = ∫_{−∞}^{∞} dt′ g(t′) h(t + t′)

This compares g and h at times separated by the lag t.¹ Note that Corr[g, h](t) =
Corr[h, g](−t), so that the correlation operation is not commutative.
As before, if g and h are only defined on the interval [0, T ], they can be extended
by defining them to be zero outside the interval. Then the above definition is equivalent
to

Corr[g, h](t) = ∫_0^T dt′ g(t′) h(t + t′)
This is the unnormalized version of the correlation. In the Dayan and Abbott textbook,
Qgh (t) = (1/T ) Corr[g, h](t), which is the normalized correlation.
¹ The expression above is the definition used in the Dayan and Abbott book, but take note that the opposite convention is used in other books like Numerical Recipes, which call the above integral Corr[h, g](t).

11 The spike-triggered average
Dayan and Abbott define the spike-triggered average of the stimulus as the average
value of the stimulus at time τ before a spike,
C(τ) = (1/n) Σ_a s(t_a − τ)

where n is the number of spikes. Then in Figure 1.9 they plot C(τ ) with the positive τ
axis pointing left. This sign convention may be standard, but it is certainly confusing.
Exactly the same graph would be produced by the alternative convention of taking
C(τ ) to be the average value of the stimulus at time τ after a spike, and plotting it with
the positive τ axis pointing right. Note that in this convention, C(τ ) would have the
same shape as the cross-correlation of ρ and s,

Corr[ρ, s](τ) = ∫ dt ρ(t) s(t + τ)    (8)
             = ∫ dt Σ_a δ(t − t_a) s(t + τ)    (9)
             = Σ_a s(t_a + τ)    (10)

12 Visual images
So far we’ve discussed situations where the neural response encodes a single time-
varying scalar variable. In the case of visual images, the stimulus is a function of
space as well as time. This means that a more complex linear model is necessary for
modeling the relationship between stimulus and response. Let the stimulus be denoted by s^{ab}_i, where the indices a and b specify pixel location in the two-dimensional image. Construct x^{ab}_i = s^{ab}_i − ⟨s^{ab}⟩ by subtracting out the pixel means. Similarly, let yi denote the neural response with the mean subtracted out. Then consider the linear model

yi ≈ Σ_{jab} h^{ab}_j x^{ab}_{i−j}

We won’t derive the Wiener-Hopf equations for this case, as the indices get messy.
But for white noise the optimal filter is given by the cross-correlation

h^{ab}_j ∝ Σ_i x^{ab}_i y_{i+j}

which follows from the definition of white noise: the stimulus fluctuations are uncorrelated across pixels and across time.
