
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Nonlinear Dynamic Boltzmann Machines for Time-Series Prediction

Sakyasingha Dasgupta and Takayuki Osogami


IBM Research - Tokyo
{sdasgup, osogami}@jp.ibm.com

Abstract

The dynamic Boltzmann machine (DyBM) has been proposed as a stochastic generative model of multi-dimensional time series, with an exact learning rule that maximizes the log-likelihood of a given time series. The DyBM, however, is defined only for binary-valued data, without any nonlinear hidden units. Here, in our first contribution, we extend the DyBM to deal with real-valued data. We present a formulation called the Gaussian DyBM, which can be seen as an extension of a vector autoregressive (VAR) model. In addition to the standard (explanatory) variables, it uses components that capture long-term dependencies in the time series. In our second contribution, we extend the Gaussian DyBM with a recurrent neural network (RNN) that controls the bias input to the DyBM units. We derive a stochastic gradient update rule such that the output weights from the RNN can be trained online along with the other DyBM parameters. The RNN thereby acts as a nonlinear hidden layer that extends the capacity of the DyBM and allows it to model nonlinear components of a given time series. Numerical experiments with synthetic datasets show that the RNN-Gaussian DyBM improves predictive accuracy over standard VAR by up to 35%. On real multi-dimensional time-series prediction tasks with high nonlinearity and non-stationarity, we demonstrate that this nonlinear DyBM model achieves significant improvements over state-of-the-art baseline methods such as VAR and long short-term memory (LSTM) networks, at a reduced computational cost.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

The Boltzmann machine (BM) (Hinton and Sejnowski 1983; Ackley, Hinton, and Sejnowski 1985) is an artificial neural network motivated by the Hebbian learning rule (Hebb 1949) of biological neural networks, designed to learn a collection of static patterns. That is, the learning rule of the BM that maximizes the log-likelihood of given patterns exhibits a key property of Hebb's rule, i.e., co-activated neural units should be connected. Although the original BM is defined for binary values (1 for a neuron firing and 0 representing silence), it has been extended to deal with real values in the form of the Gaussian BM (Marks and Movellan 2001; Welling, Rosen-Zvi, and Hinton 2004) or the Gaussian-Bernoulli restricted BM (Hinton and Salakhutdinov 2006), primarily for engineering purposes. Unlike the BM, the recently proposed dynamic Boltzmann machine (DyBM) (Osogami and Otsuka 2015a; 2015b) can be used to learn a generative model of temporal pattern sequences, using an exact learning rule that maximizes the log-likelihood of a given time series. This learning rule exhibits key properties of spike-timing dependent plasticity (STDP), a variant of the Hebbian rule. In STDP, the amount of change in the synaptic strength between two neurons that fired together depends on the precise timing at which the two neurons fired. However, similar to the BM, each neuron in the DyBM takes a binary value, 0 or 1, following a probability distribution that depends on the parameters of the DyBM. This limits its applicability to real-world time-series modeling problems, which often involve real values.

First Contribution: Here, we extend the DyBM to deal with real values and refer to the extended model as a Gaussian DyBM. Although an extension is possible in the way that the BM is extended to the Gaussian BM, we also relax some of the constraints that the DyBM required in (Osogami and Otsuka 2015a; 2015b). The primary purpose of these constraints was to allow the learning rule to be interpreted as biologically plausible STDP. We relax them in a way that relates the Gaussian DyBM to a vector autoregressive (VAR) model (Lutkepohl 2005; Bahadori, Liu, and Xing 2013), while keeping the key property of the DyBM, an exact learning rule, intact. This makes our new model specifically suited to time-series prediction problems. Specifically, we show that a special case of the Gaussian DyBM is a VAR model with additional components that capture long-term dependencies of the time series. These additional components correspond to the DyBM's eligibility traces, which represent how recently and frequently spikes arrived from one unit¹ to another. This forms the first primary contribution of this paper.

¹ In the rest of the paper the terms neuron and unit are used interchangeably.

Second Contribution: Like the DyBM, its direct extension to the Gaussian case has two restrictions: firstly, the number of units must equal the dimension of the time series being learned, and secondly, there are no nonlinear hidden units. In our second contribution, we extend the DyBM by adding an RNN layer that computes a high-dimensional nonlinear feature map of past input sequences (from the time-series data) for the DyBM.

The output from the RNN layer is used to update the bias parameter of the Gaussian DyBM at each time step. We therefore call this extension the RNN-Gaussian DyBM. The RNN is modeled similarly to an echo state network (Jaeger and Haass 2004): the weights within the RNN layer are fixed randomly, and we only update the weights from the recurrent layer to the bias layer of the Gaussian DyBM. We derive a stochastic gradient descent (SGD) update rule for these weights, with the objective of maximizing the log-likelihood of the given time series. The RNN-Gaussian DyBM thus has hidden nonlinear units in the form of the recurrent layer, whereby the size of the RNN layer can be chosen independently of the dimension of the time series being modeled. This can significantly improve the learning of high-dimensional time series by enabling very long temporal memory in the DyBM while also capturing the nonlinear dynamics of the data.

Evaluation: We demonstrate the effectiveness of this nonlinear DyBM model, namely the RNN-Gaussian DyBM, through numerical experiments on two different synthetic datasets. Furthermore, we evaluate its performance on more complex real time-series prediction tasks: a non-stationary multidimensional financial time series and a nonlinear sunspot time series. We train the RNN-Gaussian DyBM and let it predict the future values of the time series in a purely online manner with SGD (Tieleman and Hinton 2012). Namely, at each moment, we update the parameters and variables using only the latest values of the time series, and let the DyBM predict the next values. The experimental results show that the RNN-Gaussian DyBM gives significantly better predictive performance than standard VAR methods, and matches (and on occasion outperforms) the performance of state-of-the-art RNNs such as long short-term memory (LSTM) networks (Hochreiter and Schmidhuber 1997). Furthermore, the RNN-Gaussian DyBM can be trained in an online manner at a significantly reduced computational cost.

Figure 1: (a) A Gaussian Boltzmann machine (BM) that, when unfolded in time, gives a Gaussian DyBM as T → ∞. (b) The parametric form of the weight assumed in the Gaussian BMs and its extensions.

Related work

In the remainder of this section, we review the prior work related to ours. Besides the DyBM, the BM has been extended over a temporal horizon to deal with time series in various ways, and these extensions have been shown to perform effectively in practice (Sutskever, Hinton, and Taylor 2009; Hinton and Brown 1999; Sutskever and Hinton 2007; Taylor and Hinton 2009; Mittelman et al. 2014). In particular, Gaussian units have been applied in (Sutskever, Hinton, and Taylor 2009; Taylor and Hinton 2009; Mittelman et al. 2014). Unlike the DyBM, however, exact learning of the log-likelihood gradient without back-propagation and sampling is intractable in these extended models and needs to be approximated. On the other hand, the RNN-Gaussian DyBM can be trained to maximize the log-likelihood of a given time series without approximation, and this learning rule has the characteristics of STDP, inherited from the DyBM.

A large amount of prior work has compared recurrent neural networks (RNNs) against autoregressive models (Connor, Atlas, and Martin 1992; Zhang, Patuwo, and Hu 1998). The focus of such studies, however, is on the nonlinearity of simple RNNs. The Gaussian DyBM formulation first extends the linear VAR model with additional variables, related to the DyBM's eligibility traces, which take the long-term dependencies of the time series into account. The addition of the RNN layer then allows the nonlinear components of the time series to be modeled.

Deriving a Gaussian DyBM

We will define a G-DyBM² as the limit of a sequence of Gaussian BMs. Each of the Gaussian BMs defines a probability distribution over the patterns that it generates, and an analogous probability distribution for the G-DyBM is defined as the limit of the sequence of those probability distributions.

² In the interest of space, we refer to the Gaussian DyBM as G-DyBM and to its RNN extension as RNN-G-DyBM on certain occasions.

Gaussian DyBM as unfolded Gaussian Boltzmann machines for T → ∞

Consider a Gaussian BM having the structure illustrated in Figure 1 (a). This Gaussian BM consists of T + 1 layers of units. Let N be the number of units in each layer. This Gaussian BM can represent a series of N-dimensional patterns of length T + 1. In the figure, this series of patterns is denoted as x^[t-T,t] ≡ (x^[s])_{s=t-T,...,t} for some time t. That is, the δ-th layer represents the pattern x^[t-δ] ≡ (x_i^[t-δ])_{i=1,...,N} at time t - δ, for δ = 0, 1, . . . , T.

The Gaussian BM in Figure 1(a) has three kinds of parameters, bias, variance, and weight, which determine the probability distribution of the patterns that the Gaussian BM generates. For i = 1, . . . , N, let b_i be the bias of the i-th unit of any layer and σ_i² be the variance of the i-th unit of any layer. Let w_{i,j}^[δ] be the weight between the i-th unit of the (s + δ)-th layer and the j-th unit of the s-th layer for (i, j) ∈ {1, . . . , N}², s = 0, . . . , T - δ, and δ = 1, . . . , T. We assume that there are no connections within each layer: namely, w_{i,j}^[0] = 0. With this assumption, the most recent values, x_i^[t] for i = 1, . . . , N, are conditionally independent of each other given x^[t-T,t-1].

Hence, the conditional probability density of x^[t] given x^[t-T,t-1] can be represented as

    p(x^{[t]} \mid x^{[t-T,t-1]}) = \prod_{j=1}^{N} p_j(x_j^{[t]} \mid x^{[t-T,t-1]}),    (1)

where each factor of the right-hand side denotes the conditional probability density of x_j^[t] given x^[t-T,t-1] for j = 1, . . . , N. More specifically, x_j^[t] has a Gaussian distribution for each j:

    p_j(x_j^{[t]} \mid x^{[t-T,t-1]}) = \frac{1}{\sqrt{2 \pi \sigma_j^2}} \exp\left( - \frac{(x_j^{[t]} - \mu_j^{[t]})^2}{2 \sigma_j^2} \right),    (2)

where μ_j^[t] takes the following form and can be interpreted as the expected value of the j-th unit at time t given the last T patterns:

    \mu_j^{[t]} \equiv b_j + \sum_{\delta=1}^{T} \sum_{i=1}^{N} w_{i,j}^{[\delta]} x_i^{[t-\delta]}.    (3)

To take the limit of T → ∞ while keeping the number of parameters constant (i.e., independent of T), we assume that the weight has the parametric form illustrated in Figure 1(b). Here, d_{i,j} ≥ 1 is an integer and represents the conduction delay from i to j in the G-DyBM. Specifically, for δ ≥ d_{i,j}, we use the parametric form of the DyBM (Osogami and Otsuka 2015a; 2015b):

    w_{i,j}^{[\delta]} = \sum_{k=1}^{K} \lambda_k^{\delta - d_{i,j}} u_{i,j,k},    (4)

where, for each k, u_{i,j,k} is a learnable parameter, and the decay rate, λ_k, is fixed in the range [0, 1). Unlike the DyBM, we assume no constraint on w_{i,j}^[δ] for 0 < δ < d_{i,j}. Although these weights could be shared in the G-DyBM as well, leaving them unconstrained allows us to interpret the G-DyBM as an extension of the VAR model in the next section.

Plugging (4) into (3) and letting T → ∞, we obtain

    \mu_j^{[t]} \equiv b_j + \sum_{i=1}^{N} \sum_{\delta=1}^{d_{i,j}-1} w_{i,j}^{[\delta]} x_i^{[t-\delta]} + \sum_{i=1}^{N} \sum_{k=1}^{K} u_{i,j,k} \alpha_{i,j,k}^{[t-1]},    (5)

where α_{i,j,k}^[t-1] will be referred to as an eligibility trace and is defined as follows:

    \alpha_{i,j,k}^{[t-1]} \equiv \sum_{\delta = d_{i,j}}^{\infty} \lambda_k^{\delta - d_{i,j}} x_i^{[t-\delta]}.    (6)

Notice that the eligibility trace can be computed recursively:

    \alpha_{i,j,k}^{[t]} = \lambda_k \, \alpha_{i,j,k}^{[t-1]} + x_i^{[t - d_{i,j} + 1]}.    (7)

Gaussian DyBM as extended vector autoregression

When the expression (5) is seen as a regressor (or predictor) for x_j^[t], it can be understood as a model of vector autoregression (VAR) with two modifications to the standard model. First, the last term on the right-hand side of (5) involves eligibility traces, which can be understood as features of the historical values, x^[-∞, t-d_{i,j}], and are added as new variables of the VAR model. Second, the expression (5) allows the number of terms (i.e., lags) to depend on i and j through d_{i,j}, while this number is common among all pairs of i and j in the standard VAR model.

When the conduction delay has a common value (d_{i,j} = d for all i, j), we can represent (5) simply with vectors and matrices as follows:

    \mu^{[t]} = b + \sum_{\delta=1}^{d-1} W^{[\delta]} x^{[t-\delta]} + \sum_{k=1}^{K} U_k \, \alpha_k^{[t-1]},    (8)

where μ^[t] ≡ (μ_j^[t])_{j=1,...,N}, b ≡ (b_j)_{j=1,...,N}, x^[t] ≡ (x_j^[t])_{j=1,...,N}, and α_k^[t-1] ≡ (α_{j,k}^[t-1])_{j=1,...,N} for k = 1, . . . , K are column vectors; W^[δ] ≡ (w_{i,j}^[δ])_{(i,j) ∈ {1,...,N}²} for 0 < δ < d and U_k ≡ (u_{i,j,k})_{(i,j) ∈ {1,...,N}²} for k = 1, . . . , K are N × N matrices.

RNN-Gaussian DyBM as a nonlinear model

Having derived the G-DyBM, we now formulate the RNN-G-DyBM as a nonlinear extension of the G-DyBM, obtained by updating the bias parameter vector b at each time step using an RNN layer. This RNN layer computes a nonlinear feature map of the past time-series input to the G-DyBM. The output weights from the RNN to the bias layer, along with the DyBM parameters, can be updated online using a stochastic gradient method.

We consider a G-DyBM connected with an M-dimensional RNN, whose state vector changes depending on a nonlinear feature mapping of its own history and the N-dimensional time-series input data vector at time t - 1, where in most settings M > N. Specifically, for the RNN-G-DyBM we consider the bias vector to be time-dependent; it is updated at each time step as

    b^{[t]} = b^{[t-1]} + A^{\top} \psi^{[t]}.    (9)

Here, ψ^[t] is the M × 1 dimensional state vector at time t of an M-dimensional RNN, and A is the M × N dimensional learned output weight matrix that connects the RNN state to the bias vector. The RNN state is updated based on the input time-series vector x^[t] as follows:

    \psi^{[t]} = (1 - \rho) \, \psi^{[t-1]} + \rho \, F\!\left(W_{\mathrm{rnn}} \psi^{[t-1]} + W_{\mathrm{in}} x^{[t]}\right),    (10)

where F(x) = tanh(x). This can, however, be replaced by any other suitable nonlinear function, e.g., rectified linear units, sigmoid, etc. Here, 0 < ρ ≤ 1 is a leak-rate hyper-parameter of the RNN, which controls the amount of memory in each unit of the RNN layer. W_rnn and W_in are the M × M dimensional RNN weight matrix and the N × M dimensional projection of the time-series input to the RNN layer, respectively. Here, we design the RNN similarly to an echo state network (Jaeger and Haass 2004), such that the weight matrices W_rnn and W_in are initialized randomly: W_rnn is initialized from a Gaussian distribution N(0, 1) and W_in from N(0, 0.1). The sparsity of the RNN weight matrix can be controlled as a parameter, and the matrix is scaled to have a spectral radius less than one, for stability (Jaeger and Haass 2004). For all results presented here, the RNN weight matrix was 90% sparse and had a spectral radius of 0.95.
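To make the model concrete, the following is a minimal numpy sketch of one online prediction step of an RNN-G-DyBM with a common delay d, following equations (7)-(10): the echo-state update of ψ, the bias shift through A, the recursive eligibility-trace update, and the extended-VAR mean μ. The variable names, array orientations, FIFO bookkeeping, and reservoir-initialization details are illustrative assumptions for this sketch, not the authors' reference implementation.

```python
import numpy as np
from collections import deque

rng = np.random.RandomState(0)
N, M, d, K = 5, 10, 3, 1          # data dimension, RNN size, common delay, number of decay rates
decay = np.array([0.5])           # decay rates lambda_k, fixed in [0, 1)
rho = 0.9                         # leak rate of the RNN layer

# Learnable parameters (all start at zero, as in the paper's experiments).
b = np.zeros(N)                                  # bias vector b
W = [np.zeros((N, N)) for _ in range(d - 1)]     # W^[delta] for delta = 1..d-1
U = [np.zeros((N, N)) for _ in range(K)]         # U_k
A = np.zeros((M, N))                             # RNN-to-bias output weights

# Fixed random reservoir (echo-state style); the paper additionally makes
# W_rnn 90% sparse before rescaling its spectral radius to 0.95.
W_rnn = rng.normal(0.0, 1.0, (M, M))
W_rnn *= 0.95 / np.max(np.abs(np.linalg.eigvals(W_rnn)))
W_in = rng.normal(0.0, 0.1, (M, N))              # oriented here so that W_in @ x is M-dimensional

# State: eligibility traces, FIFO queue of the d-1 most recent inputs, RNN state.
alpha = [np.zeros(N) for _ in range(K)]          # alpha_{i,k}, one entry per source unit i
fifo = deque([np.zeros(N) for _ in range(d - 1)], maxlen=max(d - 1, 1))
psi = np.zeros(M)                                # RNN state psi


def step(x_t):
    """Ingest x^[t], update the state, and return the prediction mu for the next step."""
    global b, psi
    # (10): leaky echo-state update of the RNN state.
    psi = (1 - rho) * psi + rho * np.tanh(W_rnn @ psi + W_in @ x_t)
    # (9): the RNN output shifts the bias vector.
    b = b + A.T @ psi
    # (7): recursive eligibility-trace update; the value leaving the FIFO is x^[t-d+1].
    x_leaving = fifo[-1] if d > 1 else x_t
    for k in range(K):
        alpha[k] = decay[k] * alpha[k] + x_leaving
    if d > 1:
        fifo.appendleft(x_t.copy())
    # (8): mean of the Gaussian DyBM, an extended VAR predictor.
    mu = b.copy()
    for delta in range(1, d):
        mu += W[delta - 1].T @ fifo[delta - 1]   # contribution of the lagged inputs
    for k in range(K):
        mu += U[k].T @ alpha[k]                  # contribution of the eligibility traces
    return mu
```

In the online protocol described later, step(x_t) would be called once per observation, followed by the per-parameter gradient updates of equations (14)-(18).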

Online training of a RNN-Gaussian DyBM

We now derive an online learning rule for the RNN-G-DyBM such that the log-likelihood of the given time-series data, D, is maximized. The log-likelihood of D is given by

    LL(D) = \sum_{x \in D} \sum_{t} \log p(x^{[t]} \mid x^{[-\infty, t-1]}).    (11)

Here, we show the case where D consists of a single time series, but the extension to multiple time series is straightforward. The approach of stochastic gradient is to update the parameters of the RNN-G-DyBM at each step, t, according to the gradient of the conditional probability density of x^[t]:

    \nabla \log p(x^{[t]} \mid x^{[-\infty, t-1]}) = \nabla \sum_{i=1}^{N} \log p_i(x_i^{[t]} \mid x^{[-\infty, t-1]})    (12)

    = -\nabla \sum_{i=1}^{N} \left( \frac{1}{2} \log \sigma_i^2 + \frac{(x_i^{[t]} - \mu_i^{[t]})^2}{2 \sigma_i^2} \right),    (13)

where the first equality follows from the conditional independence (1), and the second equality follows from (2). From (13) and (5), we can now derive the derivative with respect to each parameter. These parameters can then be updated, for example, as follows:

    b_j \leftarrow b_j + \eta \, \frac{x_j^{[t]} - \mu_j^{[t]}}{\sigma_j^2},    (14)

    \sigma_j \leftarrow \sigma_j + \eta \, \frac{1}{\sigma_j} \left( \frac{(x_j^{[t]} - \mu_j^{[t]})^2}{\sigma_j^2} - 1 \right),    (15)

    w_{i,j}^{[\delta]} \leftarrow w_{i,j}^{[\delta]} + \eta \, \frac{x_j^{[t]} - \mu_j^{[t]}}{\sigma_j^2} \, x_i^{[t-\delta]},    (16)

    u_{i,j,k} \leftarrow u_{i,j,k} + \eta \, \frac{x_j^{[t]} - \mu_j^{[t]}}{\sigma_j^2} \, \alpha_{i,j,k}^{[t-1]},    (17)

    A_{l,j} \leftarrow A_{l,j} + \nu \, \frac{x_j^{[t]} - \mu_j^{[t]}}{\sigma_j^2} \, \psi_l^{[t]},    (18)

for k = 1, . . . , K, δ = 1, . . . , d_{i,j} - 1, and (i, j) ∈ {1, . . . , N}², where η and ν are learning rates. We set ν < η so that A_{l,j} remains nearly stationary while the other parameters are updated. The learning rates can be adjusted at each step according to stochastic optimization techniques like Adam (Kingma and Ba 2015) and RMSProp (Tieleman and Hinton 2012). In (14)-(18), μ_j^[t] is given by (5), where α_{i,j,k}^[t-1] and x_i^[t-δ] for δ ∈ [1, d_{i,j} - 1] are stored and updated in, respectively, a synapse and a FIFO queue that connects neuron i to neuron j. It should be noted that, in the absence of the formulations in (9) and (10), this same learning procedure updates the parameters of an equivalent G-DyBM model, without the need for (18). See the algorithmic description in the supplementary material (Dasgupta and Osogami 2016).

Numerical experiments

We now demonstrate the advantages of the RNN-G-DyBM through numerical experiments with two synthetic and two real time-series datasets. In these experiments, we use the RNN-G-DyBM with a common conduction delay (d_{i,j} = d for any i, j; see (8)). All the experiments were carried out with a Python 2.7 implementation (with numpy and theano backend) on a MacBook Air with an Intel Core i5 and 8 GB of memory.

In all cases, we train the RNN-G-DyBM in an online manner. Namely, at each step t, we give a pattern, x^[t], to the network to update its parameters and variables such as the eligibility traces (see (7)), and then let the RNN-G-DyBM predict the next pattern, x^[t+1], based on μ^[t+1] using (8)-(10). This process is repeated sequentially for all time or observation points t = 1, 2, . . .. Here, the parameters are updated according to (14)-(18). The learning rates, η and ν, in (14)-(18) are adjusted for each parameter according to RMSProp (Tieleman and Hinton 2012), where the initial learning rate was set to 0.001. Throughout, the initial values of the parameters and variables, including the eligibility traces and the spikes in the FIFO queues, are set to 0. However, we initialize σ_j = 1 for each j to avoid division-by-zero errors. All RNN weight matrices were initialized randomly as described in the RNN-G-DyBM model section. The RNN layer leak rate was set to 0.9 in all experiments.

Synthetic time series prediction

The purpose of the experiments with the synthetic datasets is to clearly evaluate the performance of the RNN-G-DyBM in a controlled setting. Specifically, we consider an RNN-G-DyBM with N DyBM units and M RNN units. The DyBM units are connected with FIFO queues of length d and eligibility traces of decay rate λ, where d and λ are varied in the experiments. For λ = 0, and in the absence of the RNN layer, we have α^[t] = x^[t-d], and this Gaussian DyBM reduces to a vector autoregressive model with d lags. We use this VAR model as the baseline for performance evaluation.

Multidimensional noisy sine wave: In the first synthetic task, we train the RNN-G-DyBM with a five-dimensional noisy sine wave, where each dimension x_D is generated as

    x_D^{[t]} = \sin(D \pi t / 100) + \epsilon^{[t]}, \quad D = 1, 2, 3, 4, 5,    (19)

for each t, where ε^[t] is independent and identically distributed (i.i.d.) with a Gaussian distribution N(0, 1). The number of DyBM units is N = 5, with M = 10 hidden RNN units.
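As a concrete illustration of the per-step learning rule, the sketch below applies plain SGD to the parameters using equations (14)-(18). The argument names mirror the symbols above; the (N, N) orientation of W and U (source index first) and the fixed learning rates are assumptions of this sketch, standing in for the RMSProp adjustment used in the paper.

```python
import numpy as np

def sgd_update(x_t, mu_t, sigma, b, W, U, A, alpha_prev, fifo, psi, eta=1e-3, nu=1e-4):
    """One online gradient step for equations (14)-(18).

    x_t, mu_t, sigma, b : length-N arrays (observation, prediction, std devs, bias)
    W : list of (N, N) arrays for delays delta = 1..d-1
    U : list of (N, N) arrays for k = 1..K
    A : (M, N) RNN output weight matrix
    alpha_prev : list of length-N eligibility-trace vectors alpha_k^[t-1]
    fifo : sequence of past inputs with fifo[delta-1] = x^[t-delta]
    psi : length-M RNN state psi^[t]
    """
    err = (x_t - mu_t) / sigma**2                                          # (x_j - mu_j) / sigma_j^2
    b += eta * err                                                         # (14)
    sigma += eta * ((x_t - mu_t)**2 / sigma**2 - 1.0) / sigma              # (15)
    for delta, W_d in enumerate(W, start=1):
        W_d += eta * np.outer(fifo[delta - 1], err)                        # (16)
    for U_k, a_k in zip(U, alpha_prev):
        U_k += eta * np.outer(a_k, err)                                    # (17)
    A += nu * np.outer(psi, err)                                           # (18), with nu < eta
```

In practice the step sizes would additionally be rescaled per parameter by a running average of squared gradients (RMSProp), as described above.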

Figure 2: Mean squared error of prediction for the synthetic time series. For each step t, the mean squared error is averaged over 100 independent runs. The decay rate λ is varied as in the legend, and the red curve (λ = 0) corresponds to the baseline VAR model. The conduction delay d is varied across panels. (a)-(c) Prediction performance on the noisy sine task for d = 2, 4, 7. (d)-(f) Prediction performance on the 30th-order NARMA task for d = 2, 4, 7.

30th-order nonlinear autoregressive moving average (NARMA): In this task, we train the RNN-G-DyBM for one-step prediction of a one-dimensional nonlinear time series, namely the 30th-order NARMA (Connor, Atlas, and Martin 1992; Jaeger and Haass 2004), which is generated as

    x^{[t]} = 0.2 \, x^{[t-1]} + 0.004 \, x^{[t-1]} \sum_{i=0}^{29} x^{[t-1-i]} + 1.5 \, u^{[t-30]} u^{[t-1]} + 0.01,    (20)

where u^[t] is i.i.d. with a Gaussian distribution N(0, 0.5). As such, for future prediction this task requires modeling of the inherent nonlinearity and up to 30 time steps of memory. The number of DyBM units is N = 1, with M = 5 hidden RNN units.

Figure 2 shows the predictive accuracy of the RNN-G-DyBM. Here, the prediction, x̃^[t], for the pattern at time t is evaluated with the mean squared error,

    \mathrm{MSE}^{[t]} \equiv \frac{1}{50} \sum_{s=t-50}^{t} (\tilde{x}^{[s]} - x^{[s]})^2,

and MSE^[t] is further averaged over 100 independent runs of the experiment. Due to the time-dependent noise ε^[t], the best possible squared error is 1.0 in expectation. We vary the decay rate λ as indicated in the legend and d as indicated below each panel. In the noisy sine task, although the prediction accuracy depends on the choice of λ, Figure 2 (a)-(c) shows that the RNN-G-DyBM (with λ > 0) significantly outperforms the VAR model (λ = 0; red curves) and reduces the error by more than 30%. A similar performance benefit is observed in Figure 2 (d)-(f) for the 30th-order NARMA prediction task, where the RNN-G-DyBM performs robustly across varying decay rates and consistently outperforms the VAR model even when the lag or delay increases. As this task requires both long memory and nonlinearity, the gain of the RNN-G-DyBM over the VAR model stems from the use of the eligibility traces, α^[t], and the nonlinear RNN hidden units, instead of the lag-d variable, x^[t-d]. As such, even with increasing delays the VAR model does not match the performance of the RNN-G-DyBM. Results for even longer delays can be found in the supplementary material (Dasgupta and Osogami 2016).

Figure 3: Average root mean squared error after 20 epochs for one-week-ahead prediction, plotted for varying decay rates (λ), on the real dataset of weekly retail prices for gasoline and diesel in the U.S.A. The results shown use a fixed delay of d = 2. The star marks show the performance of the baseline VAR model (λ = 0.0). Training error is plotted in red and test error in blue. See the supplementary results (Dasgupta and Osogami 2016) for plots with d = 3 and d = 4.
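For reference, below is a short sketch of the two synthetic generators in (19) and (20) and of the sliding-window MSE used for Figure 2. The interpretation of N(0, 0.5) as a standard deviation, the exact window indexing, and the helper names are assumptions of this sketch.

```python
import numpy as np

rng = np.random.RandomState(0)

def noisy_sine(T, dims=(1, 2, 3, 4, 5)):
    """Five-dimensional noisy sine wave, one row per time step (cf. (19))."""
    t = np.arange(T)[:, None]
    D = np.asarray(dims)[None, :]
    return np.sin(D * np.pi * t / 100.0) + rng.normal(0.0, 1.0, (T, len(dims)))

def narma30(T):
    """30th-order NARMA sequence (cf. (20)) driven by Gaussian noise u."""
    u = rng.normal(0.0, 0.5, T)   # 0.5 interpreted here as the standard deviation
    x = np.zeros(T)
    for t in range(30, T):
        x[t] = (0.2 * x[t - 1]
                + 0.004 * x[t - 1] * np.sum(x[t - 30:t])
                + 1.5 * u[t - 30] * u[t - 1]
                + 0.01)
    return x

def windowed_mse(pred, target, t, window=50):
    """Sliding-window mean squared error MSE^[t] over the last `window` steps."""
    return np.mean((pred[t - window:t] - target[t - window:t]) ** 2)
```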

Table 1: The average test RMSE after 20 epochs for various models on the online time-series prediction tasks with the retail-price and sunspot-number datasets, respectively. The delay (d) for each model is given in brackets. The LSTM model has no delay or decay-rate (λ) hyper-parameters. In all VAR models, λ = 0.0. For the RNN-G-DyBM models, the best test score achieved across the entire range of decay rates is reported.

MODEL               Retail-price prediction (Test RMSE)    Sunspots prediction (Test RMSE)
LSTM                0.0674                                  0.07342
VAR (d=2)           0.0846                                  0.2202
VAR (d=3)           0.0886                                  0.1859
VAR (d=4)           0.0992                                  0.2050
RNN-G-DyBM (d=2)    0.0580                                  0.08252
RNN-G-DyBM (d=3)    0.0564                                  0.0770
RNN-G-DyBM (d=4)    0.0605                                  0.0774

Table 2: The average CPU time, in seconds, taken to train each model per epoch. The reported values are for the sunspot nonlinear time-series prediction task.

MODEL                           Average runtime/epoch (s)
LSTM                            11.2132
RNN-G-DyBM (across delays)      0.7014
VAR (across delays)             0.3553

Real-data time series prediction

Having evaluated the performance of the RNN-G-DyBM in a controlled setting, we now evaluate its performance on two real datasets. We demonstrate that the RNN-G-DyBM consistently outperforms the baseline VAR models and, in some cases, betters the performance obtained with the popular LSTM RNN model. We first evaluate on a highly non-stationary multidimensional retail-price time series dataset.

Weekly retail gasoline and diesel prices in the U.S.³: This dataset consists of a real-valued time series of 1223 steps (t = 1, . . . , 1223) of weekly (April 5th, 1993 to September 5th, 2016) prices of gasoline and diesel (in US dollars/gallon), with 8 dimensions covering different regions of the U.S.A. We normalize the data within the interval [0, 1.0]. We use the first 67% of the time-series observations (819 time steps) as the training set and the remaining 33% (404 time steps) as the test set. We train the RNN-G-DyBM with N = 8 units and M = 20 hidden RNN units, with varying d and λ. The objective in this task is to make one-week-ahead predictions in an online manner, as in the synthetic experiments. We once again use the VAR model as the baseline. Additionally, we also train an LSTM RNN (Gers, Schmidhuber, and Cummins 2000) with 20 hidden units. We set the relevant hyper-parameters so that they were consistent across all the models. In all models, the parameters of the network were updated online, using the stochastic optimization method RMSProp (Tieleman and Hinton 2012) with an initial learning rate of 0.001. The LSTM was implemented in Keras; see the supplementary material (Dasgupta and Osogami 2016).

³ Data obtained from http://www.eia.gov/petroleum/

Monthly sunspot number prediction⁴: In the second task, we use the historic benchmark of monthly sunspot numbers (Hipel and McLeod 1994) collected in Zurich from Jan. 1749 to Dec. 1983. This is a one-dimensional nonlinear time series with 2820 time steps. As before, we normalize the data within the interval [0, 1.0] and use 67% for training and 33% for testing the models. With the goal of monthly prediction, we train the RNN-G-DyBM with N = 1 unit and M = 50 hidden RNN units. The LSTM model also had 50 hidden units. All other settings were the same as for the other real dataset.

⁴ Publicly available at https://datamarket.com/data/set/22t4/

Figure 3 shows the one-week-ahead prediction accuracy of the RNN-G-DyBM on the retail-price time series for delay d = 2. We evaluate the error using the root mean squared error (RMSE) averaged over 20 epochs, calculated for the normalized time series. The figure shows that the RNN-G-DyBM (solid curves) clearly outperforms VAR (starred points) by more than 30% (depending on the settings), on both training and test predictions. However, for larger decay rates the test RMSE increases, suggesting that over-fitting can occur if hyper-parameters are not selected properly. In comparison, when directly using the G-DyBM model on this task, the best-case test RMSE was 0.0718 with d = 3; thus the RNN-G-DyBM improves upon the G-DyBM by more than 21%.

As observed in Table 1, on this task the RNN-G-DyBM with d = 3 achieved the best performance, which remarkably beats even the LSTM model by a margin of 16%. Considerable performance gain against the VAR baseline was also observed (Table 1, sunspot column) using the sunspot time series. Due to the inherent nonlinearity in this data, both the VAR and G-DyBM models perform very poorly even for higher delays. Notably, the best-case test RMSE obtained with the G-DyBM model was 0.1327 (with d = 3); the best RNN-G-DyBM model reduces this by roughly 40%. With 50 hidden units, the LSTM model performs slightly better in this case, with a normalized test error of 0.07342 compared to 0.0770 for the RNN-G-DyBM with d = 3 (comparisons of the true vs. predicted time series on this task for all three models can be seen in the supplementary data (Dasgupta and Osogami 2016); visual inspection shows little difference between the RNN-G-DyBM and LSTM predictions).

In Table 2, we record the average CPU time taken (in seconds) to execute a single training epoch on the sunspot data, across the three models. As observed, the RNN-G-DyBM not only achieves performance comparable to the LSTM but learns in only 0.7014 sec./epoch, compared to 11.2132 sec./epoch (16 times more) for the LSTM model. Notably, after 50 epochs the LSTM achieves a best test RMSE of 0.0674 (retail-price dataset) taking 566 secs., while after 50 epochs the RNN-G-DyBM reaches a best test RMSE of 0.0564 in 35 secs. on the same task. As such, the nonlinear RNN-G-DyBM model is highly scalable in an online learning environment. Expectedly, the VAR model, without any eligibility traces or hidden units, runs much faster, albeit with significantly lower predictive accuracy.
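A minimal sketch of the preprocessing and online evaluation protocol described above: min-max normalization to [0, 1], a chronological 67%/33% train/test split, and one-step-ahead test RMSE. The function model_step is a placeholder for an online predictor such as the RNN-G-DyBM step and is not part of the paper.

```python
import numpy as np

def minmax_normalize(series):
    """Scale each dimension of a (T, N) series into [0, 1]."""
    lo, hi = series.min(axis=0), series.max(axis=0)
    return (series - lo) / (hi - lo)

def online_eval(series, model_step, train_frac=0.67):
    """Feed the series one step at a time; report the RMSE of one-step-ahead
    predictions on the chronologically held-out test portion."""
    data = minmax_normalize(series)
    split = int(train_frac * len(data))
    sq_err, count = 0.0, 0
    for t in range(len(data) - 1):
        pred_next = model_step(data[t])      # ingest x^[t], predict x^[t+1]
        if t + 1 >= split:                   # accumulate error only on the test part
            sq_err += np.mean((pred_next - data[t + 1]) ** 2)
            count += 1
    return np.sqrt(sq_err / count)
```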

Conclusion

In this paper, we first extended the dynamic Boltzmann machine (DyBM) into the Gaussian DyBM (G-DyBM) to model real-valued time series. The G-DyBM can be seen as an extension of the vector autoregression model. We further extended this to the RNN-Gaussian DyBM in order to model inherent nonlinearities in time-series data. Experimental results demonstrate the effectiveness of the RNN-G-DyBM model, with significant performance gains over VAR. The RNN-G-DyBM also outperforms popular LSTM models at a considerably reduced computational cost. Our model is highly scalable, similar to the binary DyBM (Dasgupta, Yoshizumi, and Osogami 2016) that was shown to give significant performance improvements on the high-dimensional moving MNIST task. Furthermore, unlike models requiring back-propagation, in the RNN-G-DyBM each parameter can be updated in a distributed manner in constant time with Eqs. (14)-(18). This update is independent of the data dimension and the maximum delay. This makes the RNN-G-DyBM model highly robust and scalable for online high-dimensional time-series prediction scenarios.

Acknowledgments

This work was partly funded by CREST, JST.

References

Ackley, D. H.; Hinton, G. E.; and Sejnowski, T. J. 1985. A learning algorithm for Boltzmann machines. Cognitive Science 9:147-169.

Bahadori, M. T.; Liu, Y.; and Xing, E. P. 2013. Fast structure learning in generalized stochastic processes with latent factors. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 284-292. ACM.

Connor, J.; Atlas, L. E.; and Martin, D. R. 1992. Recurrent networks and NARMA modeling. In Advances in Neural Information Processing Systems 4, 301-308.

Dasgupta, S., and Osogami, T. 2016. Nonlinear dynamic Boltzmann machines for time-series prediction. Technical Report RT0975, IBM Research. http://ibm.biz/NDyBMSup.

Dasgupta, S.; Yoshizumi, T.; and Osogami, T. 2016. Regularized dynamic Boltzmann machine with delay pruning for unsupervised learning of temporal sequences. In Proceedings of the 23rd International Conference on Pattern Recognition (ICPR).

Gers, A. F.; Schmidhuber, J.; and Cummins, F. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation 12(10):2451-2471.

Hebb, D. O. 1949. The Organization of Behavior: A Neuropsychological Approach. Wiley.

Hinton, G. E., and Brown, A. D. 1999. Spiking Boltzmann machines. In Advances in Neural Information Processing Systems 12, 122-128.

Hinton, G. E., and Salakhutdinov, R. 2006. Reducing the dimensionality of data with neural networks. Science 313:504-507.

Hinton, G. E., and Sejnowski, T. J. 1983. Optimal perceptual inference. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 448-453.

Hipel, W. K., and McLeod, A. I. 1994. Time Series Modelling of Water Resources and Environmental Systems, volume 45. Elsevier.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735-1780.

Jaeger, H., and Haass, H. 2004. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science 304(5667):78-80.

Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), arXiv:1412.6980.

Lutkepohl, H. 2005. New Introduction to Multiple Time Series Analysis. Springer Science & Business Media.

Marks, T., and Movellan, J. 2001. Diffusion networks, products of experts, and factor analysis. In Proceedings of the Third International Conference on Independent Component Analysis and Blind Source Separation.

Mittelman, R.; Kuipers, B.; Savarese, S.; and Lee, H. 2014. Structured recurrent temporal restricted Boltzmann machines. In Proc. 31st Annual International Conference on Machine Learning (ICML 2014), 1647-1655.

Osogami, T., and Otsuka, M. 2015a. Learning dynamic Boltzmann machines with spike-timing dependent plasticity. Technical Report RT0967, IBM Research.

Osogami, T., and Otsuka, M. 2015b. Seven neurons memorizing sequences of alphabetical images via spike-timing dependent plasticity. Scientific Reports 5:14149.

Sutskever, I., and Hinton, G. E. 2007. Learning multilevel distributed representations for high-dimensional sequences. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS-07), volume 2, 548-555.

Sutskever, I.; Hinton, G. E.; and Taylor, G. W. 2009. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems 21, 1601-1608.

Taylor, G. W., and Hinton, G. E. 2009. Factored conditional restricted Boltzmann machines for modeling motion style. In Proc. 26th Annual International Conference on Machine Learning (ICML 2009), 1025-1032.

Tieleman, T., and Hinton, G. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4(2).

Welling, M.; Rosen-Zvi, M.; and Hinton, G. E. 2004. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems 17, 1481-1488.

Zhang, G.; Patuwo, B. E.; and Hu, M. Y. 1998. Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting 14(1):35-62.

