by artificial intelligence. As a benchmark and a proof-of-principle study that this approach is indeed possible, in this work we design an introspective learning architecture that can automatically develop the concept of the quantum wave function and discover the Schrödinger equation from simulated experimental data of the potential-to-density mappings of a quantum particle. This introspective learning architecture contains a translator and a knowledge distiller. The translator employs a recurrent neural network to learn the potential-to-density mapping, and the knowledge distiller applies an auto-encoder to extract the essential information and its update law from the hidden layer of the translator, which turn out to be the quantum wave function and the Schrödinger equation. We envision that our introspective learning architecture can enable machine learning to discover new physics in the future.
based machine learning for industry and social applications. Inspired by this great success, machine learning algorithms have also been rapidly applied to various branches of physics research, ranging from high-energy and string theory to condensed matter, atomic, molecular and optical physics.[1–18] However, a major challenge is still to uncover the black box of the neural network and to understand what kinds of rules have been developed. We consider machine learning applications to physics problems a better platform to tackle this important challenge. For instance, among the different kinds of applications, one type of supervised learning algorithm is to first train the neural network on labeled experimental data, and then ask the neural network to make predictions. In contrast to industry and social problems, here both the input and the output data have well-defined physical meaning, and there always exists a clear rule behind the experimental data. The question is whether one can read out the rule by opening up the black box of the neural network, and furthermore, whether the task of reading out the rule can also be carried out by the machine itself. Hence, the most ambitious goal of machine learning physics is not just to perform designated tasks with high accuracy but also to develop novel architectures that allow the machine itself to distill the underlying rules from the experimental data.[19, 20]

FIG. 1: The architecture of an introspective recurrent neural network, called "the Schrödinger machine". It contains a translator (lower panel) and a knowledge distiller (upper panel). The translator is implemented as a recurrent neural network to perform the task of the potential-to-density mapping. The knowledge distiller compresses the hidden states generated by the translator using a recurrent auto-encoder and extracts the most essential variables in the hidden layer together with their update rule.
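For concreteness, the "simulated experimental data" of potential-to-density pairs described above can be generated by directly integrating the zero-energy Schrödinger equation ∂x²ψ = V ψ — the very rule the machine is later asked to rediscover. The following is a minimal sketch, not the authors' code; the initial conditions ψ(0) = 1, ∂xψ(0) = 0 are an illustrative choice, while the grid spacing a = 0.1 matches the value used in the text.

```python
import numpy as np

def simulate_density(V, a=0.1):
    """Integrate the zero-energy Schrodinger equation psi'' = V * psi
    with a first-order step of size a, and return the density rho = psi^2.
    V is an array of (negative) potential values sampled at x_i = a * i."""
    psi, dpsi = 1.0, 0.0                # assumed initial conditions
    rho = np.empty(len(V))
    for i, v in enumerate(V):
        rho[i] = psi ** 2
        psi, dpsi = psi + a * dpsi, dpsi + a * v * psi
    return rho

# Example: a constant potential V = -1 yields an oscillating density
# resembling cos^2(x), up to the slowly accumulating first-order
# discretization drift of the Euler-type step.
rho = simulate_density(-np.ones(200))
```

Sweeping over many random potential profiles in this way produces the (V, ρ) training pairs that the translator described below is fed.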
As a proof-of-concept study that this goal can indeed be achieved, here we consider a single quantum particle moving in a one-dimensional space with a certain potential. Supposing we can measure the particle density for each given potential, we supply the machine with the potential profile as the input and the density profile as the target, and challenge the machine to discover the underlying rule governing the potential-to-density mapping. For this purpose, we develop an introspective learning architecture, as shown in Fig. 1, that contains a translator to perform the task of potential-to-density mapping and a knowledge distiller to extract the knowledge behind the mapping. We treat both the potential and density profiles as sequential data (discretely sampled along the one-dimensional space), so the problem belongs to a broader class of sequence-to-sequence mappings,[21–24] which can be handled by the recurrent neural network (RNN).[25] The RNN has been widely used in natural language processing to translate a sequence of words in the source language to another sequence of words in the target language.[26] Here we construct a novel RNN structure, named "the Taylor RNN", to perform the potential-to-density mapping as a translation task. After training, the RNN can predict the density accurately for any given new potential. As the RNN performs the task, its hidden layer generates "big data" that should contain the information about the underlying rule governing the potential-to-density mapping, mixed with redundant information.

FIG. 2: Architecture of the translator RNN for the potential-to-density mapping. (a) is the global structure and (b) is the network structure within each block. Arrows indicate the direction of information flow. The tensor dimensions are marked out in gray. W and P can be generic functions, although they are modeled by Taylor expansions in our implementation. The symbol denotes matrix-vector multiplication.

To extract the essential variables in the hidden layer,
we pass the hidden states to a second machine, dubbed the knowledge distiller. It works on the hidden states of the first machine to compress the information and to extract the underlying rule. The auto-encoder architecture is widely used for information compression.[27, 28] Here we invent an auto-encoder incorporated in a recurrent structure, which we name the recurrent auto-encoder (RAE). Eventually, we can show that the RAE extracts out that the essential variables in the hidden layers are two real numbers, and they can be interpreted as the quantum wave function and its first-order derivative. The equation they obey is consistent with the Schrödinger equation. In this way, we can conclude that, without any prior knowledge of quantum mechanics, this learning architecture itself can develop the concept of the quantum wave function and discover the Schrödinger equation when it is only provided with experimental data of potential and density pairs. As a consistency check, we also show that if the particle follows classical statistical mechanics, with the same kind of data and a similar learning architecture, the Schrödinger equation does not emerge.

The Translator. We discretize the potential V(x) and density profiles ρ(x) along the one-dimensional space and treat them as sequences of real numbers,

$$V_i = V(x_i), \qquad \rho_i = \rho(x_i), \quad (1)$$

where x_i = ai are the discrete coordinates for i = 0, 1, 2, …, which are evenly distributed along the one-dimensional space with a fixed separation a = 0.1. We assume that the potential is always measured with respect to the energy of the particle, such that the particle energy is effectively fixed at zero. We will only consider the case of V_i < 0, such that the particle remains in extended states. We apply the RNN architecture to build the translator. In each step, the RNN takes an input V_i from the source sequence, modifies its internal hidden state h_i accordingly, and generates the output ρ′_i based on the hidden state, as illustrated in Fig. 2(a). We adopt the following update equations

$$h_i = W(V_i) \cdot h_{i-1}, \qquad \rho'_i = P(h_i), \quad (2)$$

where both the input V_i ∈ ℝ and the output ρ′_i ∈ ℝ are scalars and the hidden state h_i ∈ ℝ^d is a d-dimensional vector. The hidden state h_i is updated by an input-dependent linear transformation, represented by a d × d matrix W(V_i) ∈ ℝ^{d×d} multiplied onto h_{i−1}. The output ρ′_i is generated from the hidden state by a projection map P(h_i). The tensor flow is graphically represented in Fig. 2(b). The output sequence ρ′_i is then compared with the target sequence ρ_i over a window of steps to evaluate the loss function

$$\mathcal{L}_{\mathrm{RNN}} = \sum_{i \in \mathrm{window}} (\rho'_i - \rho_i)^2. \quad (3)$$

How the RNN updates its hidden state and generates output is determined by the functions W and P. In general, W and P could be non-linear functions modeled, for instance, by feedforward neural networks. However, for our problem we find it sufficient to model W by a Taylor expansion (to the n_W-th order in V_i) and P by a linear projection,

$$W(V_i) = \sum_{n=0}^{n_W} W^{(n)} V_i^n, \qquad P(h_i) = p^\intercal \cdot h_i, \quad (4)$$

where W^{(n)} is the n-th order Taylor expansion coefficient matrix (each of dimension d × d) and p is a d-dimensional vector. The elements in W^{(n)} and p are model parameters to be trained to minimize the loss function L_RNN. The training dataset contains pairs of potential and density sequences that serve as parallel corpora to train the RNN translator. They are currently obtained from numerical simulation, but can be
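The update rule of Eqs. (2)–(4) is simple enough to be written out explicitly. Below is a minimal numpy sketch of the Taylor RNN forward pass and the translation loss of Eq. (3); it is an illustration, not the authors' implementation, and the training loop (gradient descent on L_RNN) is omitted. The initialization follows Eq. (11) and the Method section (h_0 = (1, …, 1), p = (1, 0, …, 0)).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_W = 6, 2                       # hidden dimension and Taylor cutoff (Method section)

# Taylor coefficient matrices W^(0..n_W), initialized as in Eq. (11)
W = [np.eye(d) + 0.01 / d * rng.standard_normal((d, d))]
W += [0.01 / d * rng.standard_normal((d, d)) for _ in range(n_W)]
p = np.zeros(d)
p[0] = 1.0                          # p = (p_1, 0, ..., 0) with p_1 = 1

def rnn_translate(V, W, p):
    """Taylor RNN forward pass: h_i = W(V_i) h_{i-1}, rho'_i = p^T h_i,
    with W(V) = sum_n W^(n) V^n, cf. Eqs. (2) and (4)."""
    h = np.ones(len(p))             # initial hidden state h_0 = (1, ..., 1)
    out = []
    for v in V:
        h = sum(Wn * v ** n for n, Wn in enumerate(W)) @ h
        out.append(p @ h)
    return np.array(out)

def rnn_loss(V, rho, W, p):
    """Translation loss of Eq. (3), summed over the window."""
    return np.sum((rnn_translate(V, W, p) - rho) ** 2)
```

Because the hidden-state update is linear in h, the whole forward pass is a product of input-dependent transfer matrices, which is what later allows the knowledge distiller to read off a closed-form update rule.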
FIG. 5: (a) The RAE reconstruction loss L_RAE vs. the training steps for the quantum case. Different curves are for different RAE hidden state dimensions d̃. d̃ = 2 turns out to be the minimal d̃ without sacrificing the reconstruction loss. (b) The AE reconstruction loss L_AE vs. training steps for the classical thermal gas. The vanishing L_AE implies that there is no need to pass any variable along the sequence in this case.

FIG. 6: The RNN output density profile ρ_i and the RAE hidden state g_i = (g_{i,1}, g_{i,2}) for a constant potential V_i = 1. It shows that the periodicity of g_i is twice that of ρ_i.

The RAE is trained to minimize the reconstruction loss

$$\mathcal{L}_{\mathrm{RAE}} = \sum_{i \in \mathrm{window}} (h'_i - h_i)^2. \quad (6)$$

It is important that the RAE (knowledge distiller) hidden state g_i has a smaller dimension d̃ compared to the dimension d of the RNN (translator) hidden state h_i; it can therefore enforce an information bottleneck that only allows the vital information to be passed down in g_i. Furthermore, instead of using a single auto-encoder to compress the hidden state at each step independently, the RAE connects a series of decoders together by a recurrent neural network. This design ensures that the latent representation g_i remains coherent across a series of steps and contains the key variables that should be passed down along the sequence. A similar RAE architecture was proposed in Ref. 32 and recently redesigned in Ref. 19 to enable AI scientific discovery on sequential data. In this way, the RAE compresses the original RNN to a more compact RNN capturing the most essential information and its induced update rules.

As shown in Fig. 5(a), we find that the reconstruction loss L_RAE of the RAE increases dramatically only when its hidden state dimension d̃ is squeezed below two (i.e. d̃ < 2), implying that the key feature can be stored in a two-component real vector (i.e. d̃ = 2) in the most parsimonious manner, as g_i = (g_{i,1}, g_{i,2}). Here we show that g_i in fact represents the quantum wave function and its first-order derivative. The evidence is twofold. First, we use the trained RNN to predict the density for a constant potential V, the result of which should be cos²(kx_i) with k = √(−V) being the momentum. Indeed this is the case, as shown in Fig. 6. If g_{i,1} and g_{i,2} are the wave function and its derivative, they should be cos(kx_i) and sin(kx_i), respectively, whose periods are twice the period of ρ_i, with phases shifted by π/2 relative to each other. As shown in Fig. 6, g_i indeed displays the periodicity doubling and the relative phase shift.

Second, we open up the recurrent block of the RAE to extract the update rules for g_i, which are the machine's formulation of the physical rules. The update rules are encoded in the transformation matrix $\tilde{W}(V_i) = \sum_{n=0}^{n_W} \tilde{W}^{(n)} V_i^n$, which is parameterized by the Taylor expansion coefficient matrices W̃^{(n)}. To connect this formulation to the Schrödinger equation we are familiar with, we notice that the mapping is invariant under a linear transformation M ∈ GL(2, ℝ) applied to all W̃^{(n)}. We find that it is always possible to find a proper linear transformation that can simultaneously bring all W̃^{(n)} to the following form:

$$M^{-1}\tilde{W}^{(0)}M = \begin{pmatrix} 0.9993 & 0.1007 \\ 0.0013 & 0.9987 \end{pmatrix} \approx \begin{pmatrix} 1 & a \\ 0 & 1 \end{pmatrix}, \qquad M^{-1}\tilde{W}^{(1)}M = \begin{pmatrix} 0.0067 & 0.0004 \\ 0.1001 & 0.0024 \end{pmatrix} \approx \begin{pmatrix} 0 & 0 \\ a & 0 \end{pmatrix}. \quad (7)$$

Here the numerical matrix elements are what we obtained from a particular instance of the trained RAE. They can be associated with the lattice constant a to leading order, given that a = 0.1, and we have also verified that they scale correctly with a as proposed. The result in Eq. (7) points to the following difference equation:

$$\begin{pmatrix} g_{i+1,1} \\ g_{i+1,2} \end{pmatrix} = \begin{pmatrix} 1 & a \\ aV(x_i) & 1 \end{pmatrix} \begin{pmatrix} g_{i,1} \\ g_{i,2} \end{pmatrix}. \quad (8)$$

If we interpret g_{i,1} as the quantum wave function ψ(x_i) and g_{i,2} as its first-order derivative ∂_x ψ(x_i), Eq. (8) corresponds to a discrete version of the Schrödinger equation ∂_x²ψ(x) = V(x)ψ(x), as the particle energy was taken to be zero.

Finally, as a consistency check, we train the same introspective recurrent neural network on the potential and density data of a high-temperature thermal gas following ρ_i ∝ e^{−βV_i} at a fixed inverse temperature β. In this case, we can even reduce the RAE to an auto-encoder (AE) without sacrificing the reconstruction loss L_AE. As shown in Fig. 5(b), the L_AE remains vanishing for any d̃, implying that there is no need to pass any variable along the sequence, and hence the Schrödinger equation will not emerge for the thermal gas.
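The extracted rule of Eq. (8) can be iterated directly to reproduce the behavior shown in Fig. 6. A minimal sketch (not the authors' code), assuming a constant potential and the illustrative initial condition g_0 = (1, 0):

```python
import numpy as np

a, V = 0.1, -1.0                        # lattice constant and a constant potential
M = np.array([[1.0, a], [a * V, 1.0]])  # update matrix of Eq. (8)

g = np.array([1.0, 0.0])                # g_0 = (psi(0), d_x psi(0)), an assumed choice
gs = [g]
for _ in range(60):
    g = M @ g
    gs.append(g)
gs = np.array(gs)

# g_{i,1} tracks cos(k x_i) and g_{i,2} tracks -k sin(k x_i) with k = sqrt(-V),
# up to first-order discretization drift; the period of g is therefore twice
# that of the density rho_i = g_{i,1}^2, consistent with Fig. 6.
```

The periodicity doubling is visible directly: g_{i,1} changes sign once per half-period of the density ρ_i = g_{i,1}².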
Conclusion. To conclude, we design an architecture that combines a task machine, which directly learns the experimental data, and an introspective machine, which works on the neural activations of the task machine. The separation of the task machine from the introspective machine effectively isolates the knowledge distillation from affecting the task performance, such that the whole system can simultaneously improve the task performance and approach the parsimonious limit of knowledge representation, without trading off between one another. Here we show that this architecture can discover the Schrödinger equation from the potential-to-density data. We therefore name it the "Schrödinger machine". We envision that the same architecture can be generally applied to other machine learning applications to physics problems and enable machine learning to discover new physics in the future.

Besides, there are a few more points worth highlighting in this work. First, although the use of the Taylor expansion for the non-linear functions in our RNN is not essential and can be replaced by neural network models, it has the advantage of analytical tractability, which makes it easier to understand how the RNN works. Second, the potential-to-density mapping is also an essential component of density functional theory, known as the Kohn-Sham mapping.[33] The existing machine learning solutions for this task include the kernel method and the convolutional neural network approach.[16, 17, 34–36] The RNN approach introduced here has the advantage of being spatially scalable without retraining, which could find potential applications in boosting density functional calculations and material search. Third, we invent a model that incorporates the auto-encoder with the recurrent neural network, which can find a compact representation of the entire RNN model. This algorithm can find applications in other settings of model compression and knowledge transfer.

Acknowledgement. This work is funded mostly by Grant No. 2016YFA0301600 and NSFC Grant No. 11734010. CW acknowledges the support of the China Scholarship Council.

∗ Electronic address: hzhai@tsinghua.edu.cn
† Electronic address: yzyou@physics.ucsd.edu

[1] J. Carifio, J. Halverson, D. Krioukov, and B. D. Nelson, Journal of High Energy Physics 9, 157 (2017), 1707.00655.
[2] J. Liu, Journal of High Energy Physics 12, 149 (2017), 1707.02800.
[3] Y.-N. Wang and Z. Zhang, Journal of High Energy Physics 8, 9 (2018), 1804.07296.
[4] M. Koch-Janusz and Z. Ringel, Nature Physics 14, 578 (2018), 1704.06279.
[5] Y.-Z. You, Z. Yang, and X.-L. Qi, Phys. Rev. B 97, 045153 (2018), 1709.01223.
[6] K. Hashimoto, S. Sugishita, A. Tanaka, and A. Tomiya, ArXiv e-prints (2018), 1802.08313.
[7] G. Torlai and R. G. Melko, Phys. Rev. B 94, 165134 (2016), 1606.02718.
[8] L. Wang, Phys. Rev. B 94, 195105 (2016).
[9] J. Carrasquilla and R. G. Melko, Nature Physics 13, 431 (2017), 1605.01735.
[10] E. P. L. van Nieuwenburg, Y.-H. Liu, and S. D. Huber, Nature Physics 13, 435 (2017), 1610.02048.
[11] Y. Zhang and E.-A. Kim, Physical Review Letters 118, 216401 (2017), 1611.01518.
[12] C. Wang and H. Zhai, Phys. Rev. B 96, 144432 (2017), 1706.07977.
[13] C. Wang and H. Zhai, Frontiers of Physics 13, 130507 (2018), 1803.01205.
[14] P. Zhang, H. Shen, and H. Zhai, Physical Review Letters 120, 066401 (2018), 1708.09401.
[15] G. Torlai, G. Mazzola, J. Carrasquilla, M. Troyer, R. Melko, and G. Carleo, Nature Physics 14, 447 (2018).
[16] J. C. Snyder, M. Rupp, K. Hansen, K.-R. Müller, and K. Burke, Phys. Rev. Lett. 108, 253002 (2012).
[17] F. Brockherde, L. Vogt, L. Li, M. E. Tuckerman, K. Burke, and K.-R. Müller, Nature Communications 8, 872 (2017).
[18] K. Mills, M. Spanner, and I. Tamblyn, Phys. Rev. A 96, 042113 (2017).
[19] R. Iten, T. Metger, H. Wilming, L. del Rio, and R. Renner, ArXiv e-prints (2018), 1807.10300.
[20] T. Wu and M. Tegmark, ArXiv e-prints (2018), 1810.10525.
[21] N. Kalchbrenner and P. Blunsom, in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (Association for Computational Linguistics, 2013), pp. 1700–1709.
[22] I. Sutskever, O. Vinyals, and Q. V. Le, in Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Curran Associates, Inc., 2014), pp. 3104–3112.
[23] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, ArXiv e-prints (2014), 1406.1078.
[24] D. Bahdanau, K. Cho, and Y. Bengio, ArXiv e-prints (2014), 1409.0473.
[25] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
[26] G. Neubig (2017), 1703.01619.
[27] Y. Bengio, A. Courville, and P. Vincent, IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1798 (2013), ISSN 2160-9292.
[28] D. P. Kingma and M. Welling, ArXiv e-prints (2013), 1312.6114.
[29] J. E. Lye, L. Fallani, M. Modugno, D. S. Wiersma, C. Fort, and M. Inguscio, Phys. Rev. Lett. 95, 070401 (2005).
[30] C. Ma, J. Wang, and W. E, ArXiv e-prints (2018), 1808.04258.
[31] L. Banchi, E. Grant, A. Rocchetto, and S. Severini, ArXiv e-prints (2018), 1808.01374.
[32] P. Mirowski, M. Ranzato, and Y. LeCun, in Proceedings of the NIPS 2010 Workshop on Deep Learning (2010), vol. 2.
[33] W. Kohn and L. J. Sham, Physical Review 140, A1133 (1965).
[34] L. Li, J. C. Snyder, I. M. Pelaschier, J. Huang, U.-N. Niranjan, P. Duncan, M. Rupp, K.-R. Müller, and K. Burke, ArXiv e-prints (2014), 1404.1333.
[35] L. Li, T. E. Baker, S. R. White, and K. Burke, Phys. Rev. B 94, 245129 (2016).
[36] Y. Khoo, J. Lu, and L. Ying, ArXiv e-prints (2017), 1707.03351.
SUPPLEMENTARY MATERIAL
Tractable Limit of the RNN Translator

The RNN translator may not be able to formulate physical laws in the most parsimonious language; the hidden state of the RNN may contain redundant information. In fact, there is an analytically tractable limit where we can explicitly demonstrate this possibility. For example, the RNN may try to capture the differential equation for the density profile directly, instead of that for the quantum wave function. To simplify the analysis, let us take ℏ²/(2ma²) as our energy unit and define the potential energy with respect to the single-particle energy level; then the Schrödinger equation for the wave function ψ(x) takes the rather simple form ∂_x²ψ(x) = V(x)ψ(x). However, in terms of the density profile ρ(x) = |ψ(x)|², the Schrödinger equation implies

$$\partial_x \begin{pmatrix} \rho(x) \\ \eta(x) \\ \xi(x) \end{pmatrix} = \begin{pmatrix} 0 & 2 & 0 \\ V(x) & 0 & 1 \\ 0 & 2V(x) & 0 \end{pmatrix} \begin{pmatrix} \rho(x) \\ \eta(x) \\ \xi(x) \end{pmatrix}, \quad (9)$$

where η(x) = Re ψ*(x)∂_x ψ(x) and ξ(x) = |∂_x ψ(x)|² are two other real profiles that combine with ρ(x) to form a system of linear differential equations (for instance, the first row is just ∂_x ρ = ψ*∂_xψ + c.c. = 2η). The recurrent rule for such a system lies within the descriptive power of our RNN architecture. If the RNN chooses to identify its hidden state as h_i = [ρ(x_i), η(x_i), ξ(x_i)]^⊺, the following parameters will allow it to model Eq. (9) with good accuracy to first order in a:

$$W^{(0)} = \begin{pmatrix} 1 & 2a & 0 \\ 0 & 1 & a \\ 0 & 0 & 1 \end{pmatrix}, \qquad W^{(1)} = \begin{pmatrix} 0 & 0 & 0 \\ a & 0 & 0 \\ 0 & 2a & 0 \end{pmatrix}, \qquad p = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}. \quad (10)$$

This theoretical construction at least provides us with a base RNN that demonstrates why the proposed architecture in Eq. (2) could work in principle. The performance can be further improved by relaxing the parameters from this ideal limit or by enlarging the hidden state dimension d.

However, what is the minimum hidden state dimension d (in terms of real variables) for the RNN to function well in the potential-to-density mapping? Can the RNN discover that the quantum wave function ψ(x) could provide a more parsimonious description, which only requires the two real variables Re ψ(x) and Im ψ(x) to parameterize? To answer these questions, we train the RNN translator under different hidden state dimensions d. As shown in Fig. 7, we observe that the loss L_RNN only drops significantly if d ≥ 3, implying that the RNN was unable to realize the more efficient (d = 2) wave function description. For the d = 3 case, as we read out the hidden states h_i at each step, we found that they indeed correspond to the vector [ρ(x_i), η(x_i), ξ(x_i)]^⊺ up to a specific linear transformation (depending on the random initialization of the model parameters), confirming that the RNN indeed works like the base model Eq. (10). From this example, we see that the RNN can develop legitimate and predictive rules of physics, such as Eq. (9), from the observation data. It tends to work directly with the variables present in the observation data to get the job done. Sometimes the rules it finds can work well enough that the RNN may not have the motivation to develop higher-level concepts like quantum wave functions.

Method

In this section, we elaborate on the details of our training process. For the RNN based on the Taylor expansion, we cut off the expansion at power n_W = 2, and consider the hidden space dimension d from 1 to 6. Taking d = 6 as an example, the initial h_0 = (1, 1, 1, 1, 1, 1) and the vector p = (p_1, 0, 0, 0, 0, 0) without loss of generality; the parameter p_1 is set to 1 initially. We initialize the coefficient matrices W^{(n)} to

$$W^{(0)} = \mathbb{1}_{d\times d} + \frac{0.01}{d}\,\mathrm{randn}_{d\times d} \ \ (\text{for } n = 0), \qquad W^{(n)} = \frac{0.01}{d}\,\mathrm{randn}_{d\times d} \ \ (\text{for } n > 0), \quad (11)$$

where 1_{d×d} stands for the d × d dimensional identity matrix
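As a sanity check on this analytically tractable limit, the base-RNN parameters of Eq. (10) can be iterated for a constant potential and compared against the exact density cos²(kx). A minimal sketch (not the authors' code); the 40-step window and the initial hidden state h_0 = [ρ(0), η(0), ξ(0)] = [1, 0, 0] (corresponding to ψ = cos(kx)) are illustrative choices.

```python
import numpy as np

a, V = 0.1, -1.0
# Base-RNN parameters of Eq. (10); hidden state h = [rho, eta, xi].
W0 = np.array([[1, 2 * a, 0], [0, 1, a], [0, 0, 1]], dtype=float)
W1 = np.array([[0, 0, 0], [a, 0, 0], [0, 2 * a, 0]], dtype=float)
p = np.array([1.0, 0.0, 0.0])

h = np.array([1.0, 0.0, 0.0])      # rho(0)=1, eta(0)=0, xi(0)=0 for psi = cos(kx)
rho_rnn = []
for _ in range(40):
    rho_rnn.append(p @ h)
    h = (W0 + W1 * V) @ h          # h_i = W(V_i) h_{i-1}, first order in a
rho_rnn = np.array(rho_rnn)

x = a * np.arange(40)
# For V = -1 (k = 1), rho_rnn approximates cos(x)^2, with an O(a)
# per-step discretization error that slowly accumulates along the window.
```

This is exactly the d = 3 hidden-state representation [ρ, η, ξ] that the trained translator was observed to rediscover up to a linear transformation.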