Error Backpropagation
Sargur Srihari
Gradient Descent Update
• Weights are updated by taking a step in the direction of the negative gradient:
  w^(τ+1) = w^(τ) − η ∇E(w^(τ))
• where the gradient vector collects the derivatives of the error with respect to every weight:
  ∇E(w) = [ ∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_T ]^T
• A sketch of this update loop follows below
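A minimal MATLAB sketch of this update loop; the function handle gradE (returning ∇E(w), e.g. via backpropagation), the learning rate, the weight-vector length, and the iteration count are all assumptions for illustration.

% Gradient descent: w(tau+1) = w(tau) - eta * gradE(w(tau))   (illustrative sketch)
eta = 0.1;                    % learning rate (assumed value)
w   = zeros(10, 1);           % weight vector w_0 ... w_T (assumed length)
for tau = 1 : 100             % assumed number of iterations
    w = w - eta * gradE(w);   % gradE is an assumed handle returning the gradient
end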
Forward Propagation
• Each unit computes a weighted sum of its inputs:
  a_j = ∑_i w_ji z_i
• z_i is the activation of a unit (or input) that sends a connection to unit j, and w_ji
  is the weight associated with that connection
• The sum is transformed by a nonlinear activation function to give the activation z_j = h(a_j)
• The variable z_i can be an input, and unit j could be an output
• For each input x_n in the training set, we calculate the activations of
  all hidden and output units by applying the above equations
• This process is called forward propagation (a sketch follows below)
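A minimal MATLAB sketch of forward propagation through one hidden layer; the input values, the weight matrix W1, and the choice of h = tanh are assumptions for illustration (biases omitted for brevity).

% Forward propagation for one layer (illustrative sketch, assumed values)
x  = [0.5; -1.2; 0.3];     % input vector, plays the role of z_i
W1 = randn(2, 3);          % weights w_ji connecting 3 inputs to 2 hidden units
a  = W1 * x;               % a_j = sum_i w_ji z_i
z  = tanh(a);              % z_j = h(a_j), here h = tanh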
Summarizing evaluation of the derivative ∂En/∂w_ji
• By the chain rule for partial derivatives
  ∂En/∂w_ji = (∂En/∂a_j)(∂a_j/∂w_ji)
• Define δ_j ≡ ∂En/∂a_j
• Since a_j = ∑_i w_ji z_i, we have ∂a_j/∂w_ji = z_i
• Substituting, we get
  ∂En/∂w_ji = δ_j z_i
• Thus the required derivative is obtained by multiplying
  1. the value of δ for the unit at the output end of the weight
  2. the value of z for the unit at the input end of the weight
• We now need to determine how to calculate δ_j for each unit of the network
• For output units, δ_j = y_j − t_j:
  if En = (1/2) ∑_j (y_j − t_j)^2 and y_j = a_j = ∑_i w_ji z_i (regression),
  then δ_j = ∂En/∂a_j = y_j − t_j
• For hidden units, we again need to make use of the chain rule of derivatives to
  determine δ_j ≡ ∂En/∂a_j
• Backpropagation formula for hidden units (a sketch follows below):
  δ_j = h'(a_j) ∑_k w_kj δ_k
  where a_j = ∑_i w_ji z_i and z_j = h(a_j)
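A minimal MATLAB sketch of this backward step, assuming tanh hidden units (so h'(a_j) = 1 − z_j^2), a K x M second-layer weight matrix W2, and already-computed output errors deltaK; all names and sizes are illustrative assumptions.

% Backward propagation of errors to hidden units (illustrative sketch)
% z      : M x 1 hidden activations, z_j = tanh(a_j)    (assumed)
% W2     : K x M second-layer weights w_kj               (assumed)
% deltaK : K x 1 output-unit errors                      (assumed)
deltaJ = (1 - z.^2) .* (W2' * deltaK);   % delta_j = h'(a_j) * sum_k w_kj delta_k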
A Simple Example
• Two-layer network
• Sum-of-squares error
• Output units have linear activation functions, i.e., multiple regression:
  y_k = a_k
• Standard sum-of-squares error:
  En = (1/2) ∑_k (y_k − t_k)^2
  where y_k is the activation of output unit k and t_k is the corresponding target for input x_n
• Hidden units have a logistic sigmoid activation function
  h(a) = tanh(a), where tanh(a) = (e^a − e^(−a)) / (e^a + e^(−a))
  with the simple form for the derivative h'(a) = 1 − h(a)^2 (a short check follows below)
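Since tanh(a) = (e^a − e^(−a)) / (e^a + e^(−a)), this derivative follows from the quotient rule:
  d/da tanh(a) = [ (e^a + e^(−a))^2 − (e^a − e^(−a))^2 ] / (e^a + e^(−a))^2 = 1 − tanh^2(a) = 1 − h(a)^2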
• Forward propagation:
  z_j = tanh(a_j),   y_k = ∑_{j=0..M} w_kj^(2) z_j
• Output differences:
  δ_k = y_k − t_k
• Backward propagation (δ's for hidden units), using h'(a) = 1 − h(a)^2:
  δ_j = h'(a_j) ∑_k w_kj δ_k = (1 − z_j^2) ∑_{k=1..K} w_kj δ_k
• Batch method (a sketch of one pass follows below):
  ∂E/∂w_ji = ∑_n ∂En/∂w_ji
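Putting these formulas together, a minimal MATLAB sketch of one forward/backward pass for this example; all data values, sizes and variable names are assumptions for illustration, and biases are omitted for brevity.

% One forward/backward pass for the two-layer tanh example (illustrative sketch)
x  = [0.5; -1.2; 0.3];  t = 1.0;        % single input and target (assumed)
W1 = randn(2, 3);  W2 = randn(1, 2);    % first- and second-layer weights (assumed)
% Forward propagation
a  = W1 * x;        z = tanh(a);        % hidden activations z_j = tanh(a_j)
y  = W2 * z;                            % linear output units y_k
% Backward propagation
deltaK = y - t;                          % output differences delta_k = y_k - t_k
deltaJ = (1 - z.^2) .* (W2' * deltaK);   % hidden-unit deltas
% Error derivatives (delta at output end times z at input end of each weight)
gradW2 = deltaK * z';                    % dEn/dw_kj^(2) = delta_k * z_j
gradW1 = deltaJ * x';                    % dEn/dw_ji^(1) = delta_j * x_i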
• where the gradient vector ∇E(w^(τ)) consists of the vector of
  derivatives evaluated using back-propagation:
  ∇E(w) = d/dw E(w) = [ ∂E/∂w_11^(1), …, ∂E/∂w_MD^(1), ∂E/∂w_11^(2), …, ∂E/∂w_KM^(2) ]^T
• There are W = M(D+1) + K(M+1) elements in the vector,
  so the gradient ∇E(w^(τ)) is a W x 1 vector (an illustrative count follows below)
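As an illustrative count (dimensions assumed): with D = 3 inputs, M = 2 hidden units, and K = 1 output unit,
  W = M(D+1) + K(M+1) = 2·4 + 1·3 = 11 weights and biases.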
Numerical example (binary classification)
• Network: D = 3 inputs, M = 2 hidden units, K = 1 output, N = 1 training example
  (hidden units z_1, z_2 and output y_1)
• Forward propagation:
  a_j = ∑_{i=1..D} w_ji^(1) x_i,   z_j = σ(a_j),   y_k = ∑_{j=1..M} w_kj^(2) z_j
• Errors:
  δ_j = σ'(a_j) ∑_k w_kj δ_k
• Error derivatives (a worked sketch follows below):
  ∂En/∂w_ji^(1) = δ_j x_i,   ∂En/∂w_kj^(2) = δ_k z_j
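A small MATLAB sketch of this numerical setup; the particular input, target, and weight values are assumptions for illustration, and the output error uses the output difference δ_k = y_k − t_k from the earlier slide.

% Numerical example with D = 3, M = 2, K = 1, N = 1 (illustrative values)
sigm = @(a) 1 ./ (1 + exp(-a));          % logistic sigmoid
x  = [1; 0; 1];        t = 1;            % single training example (assumed)
W1 = [ 0.2 -0.1  0.4;                    % w_ji^(1), 2 x 3 (assumed values)
       0.7 -1.2  0.1];
W2 = [ 1.1  0.3];                        % w_kj^(2), 1 x 2 (assumed values)
% Forward propagation
a  = W1 * x;           z = sigm(a);      % a_j and z_j = sigma(a_j)
y  = W2 * z;                             % y_k (linear output)
% Errors
deltaK = y - t;                          % output difference (assumed form)
deltaJ = z .* (1 - z) .* (W2' * deltaK); % sigma'(a_j) = z_j (1 - z_j)
% Error derivatives
gradW2 = deltaK * z';                    % dEn/dw_kj^(2) = delta_k z_j
gradW1 = deltaJ * x';                    % dEn/dw_ji^(1) = delta_j x_i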
% This pseudo-code illustrates implementing a several-layer neural
% network. You need to fill in the missing parts to adapt the program to
% your own use. You may have to correct minor mistakes in the program.

%% prepare for the data

Initializations
% Layer sizes: input dimension, three hidden layers of 100 units, 2 outputs
s{1} = size(train_x, 1);
s{2} = 100;
s{3} = 100;
s{4} = 100;
s{5} = 2;

Performance Evaluation
% Do some predictions to know the performance
a{1} = test_x;
% forward propagation
for i = 2 : numOfHiddenLayer + 1
    % This is essentially doing W{i-1}*a{i-1} + b{i-1}, but since they
    % have different dimensionalities, this addition is not allowed in
    % MATLAB. Another way to do it is to use repmat.
    a{i} = sigm( bsxfun(@plus, W{i-1}*a{i-1}, b{i-1}) );
end

% Here we calculate the sum-of-squares error as the loss function
loss = sum(sum((test_y - a{numOfHiddenLayer + 1}).^2)) / size(test_x, 2);

% Count no. of misclassifications so that we can compare it
% with other classification methods.
% If we let max return two values, the first one represents the max
% value and the second one represents the corresponding index. Since we
% care only about the class the model chooses, we drop the max value
% (using ~ to take its place) and keep the index.
[~, ind_] = max(a{numOfHiddenLayer + 1});
[~, ind]  = max(test_y);
test_wrong = sum( ind_ ~= ind ) / size(test_x, 2) * 100;

% Calculate training error
bs = 2000;                           % minibatch size
nb = size(train_x, 2) / bs;          % no. of mini-batches
train_error = 0;
% Here we go through all the mini-batches
for ll = 1 : nb
    % Use a submatrix to pick out each mini-batch
    a{1} = train_x(:, (ll-1)*bs+1 : ll*bs );
    yy   = train_y(:, (ll-1)*bs+1 : ll*bs );
    for i = 2 : numOfHiddenLayer + 1
        a{i} = sigm( bsxfun(@plus, W{i-1}*a{i-1}, b{i-1}) );
    end
    train_error = train_error + sum(sum((yy - a{numOfHiddenLayer + 1}).^2));
end
train_error = train_error / size(train_x, 2);

% Record the metrics
losses = [losses loss];
test_wrongs = [test_wrongs, test_wrong];
train_errors = [train_errors train_error];
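The excerpt above only shows forward propagation and evaluation; a minimal sketch of the corresponding backward pass and gradient-descent update, reusing the same (assumed) variables W{i}, b{i}, a{i}, yy, and an assumed learning rate eta, might look like the following. It applies the backpropagation formula with h'(a) = a(1 − a) for the sigmoid units and a sum-of-squares error.

% Backward pass and weight update for one mini-batch (illustrative sketch)
eta = 0.1;                                        % assumed learning rate
L = numOfHiddenLayer + 1;
d{L} = (a{L} - yy) .* a{L} .* (1 - a{L});         % deltas at the output layer
for i = L-1 : -1 : 2
    d{i} = (W{i}' * d{i+1}) .* a{i} .* (1 - a{i});   % backpropagate deltas
end
for i = 1 : L-1
    W{i} = W{i} - eta * d{i+1} * a{i}' / size(yy, 2);   % mean of dE/dW{i} = d{i+1} * a{i}'
    b{i} = b{i} - eta * sum(d{i+1}, 2) / size(yy, 2);   % mean of dE/db{i} = d{i+1}
end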
Efficiency of Backpropagation
• Computational efficiency is the main aspect of back-prop
• The number of operations needed to compute the derivatives of the error function scales
  with the total number W of weights and biases
• A single evaluation of the error function for a single input requires
  O(W) operations (for large W)
• This is in contrast to O(W^2) for numerical differentiation
  • As seen next
Numerical Differentiation
• Forward differences:
  ∂En/∂w_ji = [ En(w_ji + ε) − En(w_ji) ] / ε + O(ε),   where ε << 1
• Accuracy is improved by making ε smaller, until round-off problems arise
• Accuracy can be improved by using central differences:
  ∂En/∂w_ji = [ En(w_ji + ε) − En(w_ji − ε) ] / (2ε) + O(ε^2)
• This is O(W^2) overall, since each of the W derivatives requires a separate
  error evaluation costing O(W)
• Useful to check whether software for backprop has been correctly implemented
  (for some test cases); a sketch of such a check follows below
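A minimal MATLAB sketch of such a gradient check using central differences; errFun (a function handle returning En for a weight vector w) and gradBackprop (the gradient computed by backpropagation) are assumed to exist, and the names are illustrative.

% Central-difference check of a backprop gradient (illustrative sketch)
eps = 1e-6;
numGrad = zeros(size(w));
for j = 1 : numel(w)
    wp = w;  wp(j) = wp(j) + eps;        % perturb one weight upward
    wm = w;  wm(j) = wm(j) - eps;        % and downward
    numGrad(j) = (errFun(wp) - errFun(wm)) / (2 * eps);
end
max(abs(numGrad - gradBackprop))         % should be close to zero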
Summary of Backpropagation
• Derivatives of the error function w.r.t. the weights are obtained by
  propagating errors backward
• It is more efficient than numerical differentiation
• It can also be used for other computations
  • As seen next for the Jacobian
The Jacobian Matrix
• Its elements are the derivatives of the network outputs with respect to the inputs:
  J_ki = ∂y_k / ∂x_i
• Such derivatives arise, for example, when error derivatives are propagated through connected modules:
  ∂E/∂w = ∑_{k,j} (∂E/∂y_k)(∂y_k/∂z_j)(∂z_j/∂w)
• The Jacobian can also be evaluated numerically using central differences
  (a sketch follows below):
  ∂y_k/∂x_i = [ y_k(x_i + ε) − y_k(x_i − ε) ] / (2ε) + O(ε^2)
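A minimal MATLAB sketch of this numerical check; netFun (a function handle returning the K x 1 output vector y for an input x) and the input x are assumed to exist, and the names are illustrative.

% Central-difference estimate of the Jacobian J_ki = dy_k/dx_i (illustrative sketch)
eps = 1e-6;
D = numel(x);  K = numel(netFun(x));
J = zeros(K, D);
for i = 1 : D
    xp = x;  xp(i) = xp(i) + eps;        % perturb one input upward
    xm = x;  xm(i) = xm(i) - eps;        % and downward
    J(:, i) = (netFun(xp) - netFun(xm)) / (2 * eps);
end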
Summary