Abram Demski
Institute for Creative Technologies, 12015 Waterfront Drive, Playa Vista, CA 90094
Procedural: Directly specifies how to produce the desired behavior.
Declarative: Provides constraints which the behavior must satisfy, without directly generating it in executable form.
The two alternatives will be studied in the context of Bayesian modeling. Bayesian modeling consists of specifying a prior probability distribution (or simply, a prior) over a space of possible models. The priors discussed here relate to the minimum-description-length principle (MDL), a unifying concept from model selection which balances simplicity with explanatory power to avoid overfitting.
In an effort to define the MDL principle in the most general way possible, the notion of Kolmogorov complexity has emerged: the Kolmogorov complexity of an object is the length of the shortest computer program which reproduces that object [3]. This gives us an algorithmic notion of description length, with many nice properties.
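Kolmogorov complexity itself is uncomputable, but any fixed compressor yields a computable upper bound on description length, which is useful for building intuition. The following minimal sketch is my illustration, not from the original text, using Python's standard zlib module:

    import os
    import zlib

    def description_length_upper_bound(data: bytes) -> int:
        """Bits used by a zlib encoding of `data`: an upper bound on
        Kolmogorov complexity, up to the constant cost of the decompressor."""
        return 8 * len(zlib.compress(data, 9))

    regular = b"01" * 500          # highly patterned: short description
    noise = os.urandom(1000)       # incompressible with high probability
    print(description_length_upper_bound(regular))  # far fewer than 8000 bits
    print(description_length_upper_bound(noise))    # close to (or above) 8000 bits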
Considering algorithmic description length in a Bayesian setting has led to the definition of the Solomonoff distribution, an algorithmic prior for sequence prediction. This distribution has been put forward as a theory of rational belief.*

* This effort has been sponsored by the U.S. Army and the Air Force Office of Scientific Research. Statements and opinions expressed do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred.
¹ This statement is based on personal correspondence; almost all of those with whom I discussed the first-order prior suspected that it would be equivalent to the Solomonoff distribution.

We can consider the language $L$ of first-order sentences, defined as follows:

– We have a stock of variables $v_1, v_2, v_3, \ldots$, referred to collectively as $V$.
– We have a set of predicate symbols $p_1, p_2, \ldots \in P$ and a set of function symbols $f_1, f_2, \ldots \in F$. The number of arguments fed to a predicate or function is referred to as its arity. (A symbol is used with only one arity; we cannot use both $f_2(v_1)$ and $f_2(v_1, v_2)$.)
– The set of expressions $E$ is defined inductively: $V \subseteq E$, and for every function $f_n \in F$ of arity $a$ and expressions $e_1, e_2, \ldots, e_a \in E$, we have $f_n(e_1, e_2, \ldots, e_a) \in E$. An example expression is $f_1(f_1(v_1))$.
– For $e_1, e_2 \in E$, $e_1 = e_2$ is in $L$; this represents equality.
– For $p_n \in P$ of arity $a$ and $e_1, e_2, \ldots, e_a \in E$, we have $p_n(e_1, e_2, \ldots, e_a) \in L$.
– For $A, B \in L$, we have $A \wedge B \in L$ and $A \vee B \in L$; these represent conjunction and disjunction. For $A \in L$, $\neg A \in L$ represents negation.
– For any $S \in L$ and variable $v_n$, we have $\forall v_n.S \in L$ and $\exists v_n.S \in L$, representing universal and existential quantification.
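To make the inductive definition concrete, the grammar of $L$ can be transcribed directly as a recursive datatype. The following Python sketch is my illustration (the names are mine, not the paper's); arities are implicit in the length of the argument tuples:

    from dataclasses import dataclass
    from typing import Tuple, Union

    @dataclass(frozen=True)
    class Var:                      # variables v1, v2, ...
        index: int

    @dataclass(frozen=True)
    class Func:                     # fn(e1, ..., ea); arity a = len(args)
        index: int
        args: Tuple["Expr", ...]

    Expr = Union[Var, Func]         # E: V is in E, closed under function application

    @dataclass(frozen=True)
    class Eq:                       # e1 = e2
        left: Expr
        right: Expr

    @dataclass(frozen=True)
    class Pred:                     # pn(e1, ..., ea)
        index: int
        args: Tuple[Expr, ...]

    @dataclass(frozen=True)
    class BinOp:                    # conjunction or disjunction
        op: str                     # "and" / "or"
        a: "Sentence"
        b: "Sentence"

    @dataclass(frozen=True)
    class Not:                      # negation
        a: "Sentence"

    @dataclass(frozen=True)
    class Quant:                    # quantification over a variable
        kind: str                   # "forall" / "exists"
        var: Var
        body: "Sentence"

    Sentence = Union[Eq, Pred, BinOp, Not, Quant]

    # Example: the expression f1(f1(v1)) from the text, used inside p1(f1(f1(v1)))
    example = Pred(1, (Func(1, (Func(1, (Var(1),)),)),))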
The inference rules will not be reviewed here, but some basic results will be important. These results can be found in many textbooks; [1] in particular has material on everything mentioned here.
Soundness. If $\vdash S$, then $\models S$. That is, the inference rules will never derive something that doesn't logically follow from a set of premises.
Completeness. For a sentence $S$, if $\models S$, then $\vdash S$. That is, the inference rules can derive anything which logically follows from a set of premises. A theory is called complete if, for every sentence $S$ in its language, either $S$ or its negation is derivable from it.
Incompleteness. A complete extension of a theory decides all sentences formed from the functions and predicates used: for each such sentence $S$, it derives either $S$ or $\neg S$. A sufficiently strong theory does not tell us the truth or falsehood of all questions in its subject matter, and neither do any extensions of it: we can show that any sufficiently strong theory of entailment will have no complete consistent extension, so some sentence $S$ always remains for which the extension derives neither $S$ nor $\neg S$.
Encoding computations. Computations can be encoded in first-order logic. This can be done, for example, by providing axioms describing the behavior of Turing machines.
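As a sketch of what such axioms might look like (my illustration; the text does not spell them out here), a single transition rule of a machine, say "in state $q$ reading 1, write 0, move right, enter state $q'$", becomes a universally quantified sentence over times and tape positions, using a successor function $s$:

$$\forall t\,\forall p\; \big(\mathit{state}_q(t) \wedge \mathit{head}(t,p) \wedge \mathit{one}(t,p)\big) \rightarrow \big(\mathit{state}_{q'}(s(t)) \wedge \mathit{head}(s(t),s(p)) \wedge \neg\mathit{one}(s(t),p)\big)$$

Together with axioms stating that the rest of the tape is unchanged, a finite set of such sentences pins down the machine's behavior.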
In addition to first-order logic, it will be useful to have some background on the Solomonoff distribution. Some notation: $B$ denotes $\{0,1\}$; $B^n$ denotes the set of binary strings of length $n$; $B^*$ denotes the set of finite binary strings; $B^\infty$ denotes the set of infinite binary sequences; and $S_B = B^* \cup B^\infty$. For $a, b \in S_B$, $ab$ is the concatenation of $a$ and $b$.

Consider a class $C$ of Turing machines with an input tape, one or more work tapes, and an output tape. The input and output tapes are both able to move in just one direction. Any Turing machine $T \in C$ defines a partial function $f_T$ from $B^*$ to $S_B$: for input $i \in B^*$, $f_T(i)$ is considered to be the string which $T$ writes to the output tape, which may be infinite if $T$ never halts. There is a universal machine $U$ in this class: for any $T \in C$, there is a sequence of bits $s \in B^*$ which we can place on $U$'s input tape to get it to behave exactly like $T$; that is, $f_U(si) = f_T(i)$ for all $i$.

A distribution $M$ over $S_B$ can be defined by feeding random bits to $U$; that is, $M(x)$ is the probability that $U$'s output begins with $x$ when its input tape is filled with independent fair coin flips.
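To make the definition concrete, here is a minimal Monte Carlo sketch (my illustration, not from the original text): run_universal_machine is a hypothetical stub standing in for $U$, and truncating each run at a step budget means this only approximates $M$, which is itself merely semi-computable.

    import random

    def run_universal_machine(next_bit, max_steps):
        """Hypothetical stand-in for U: call next_bit() whenever the input
        tape is read; return the output string written within max_steps steps."""
        raise NotImplementedError

    def estimate_M(x: str, samples: int = 10_000, max_steps: int = 1_000) -> float:
        """Estimate M(x): the probability that U's output begins with x
        when the input tape is filled with independent fair coin flips."""
        hits = 0
        for _ in range(samples):
            out = run_universal_machine(lambda: random.getrandbits(1), max_steps)
            if out.startswith(x):
                hits += 1
        return hits / samples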
Strict exponential dominance again indicates that the reverse is not the case: to reach some conclusions that $P_1$ can reach, $P_2$ needs more than exponentially more evidence.²

² This is true if we measure evidence by the log of the likelihood ratio. $P_1(x|e) = P_1(x)P_1(e|x)/P_1(e)$, so multiplicative dominance indicates that $P_1(e|x)/P_1(e)$ doesn't have to get too extreme to bridge the distance between $P_1$ and $P_2$.
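For reference, the standard notion of multiplicative dominance, which I take to be the one used here (an assumption on my part): $P_1$ multiplicatively dominates $P_2$ when there is some constant $\alpha > 0$ such that

$$P_1(x) \ge \alpha\, P_2(x) \quad \text{for all } x,$$

so that $P_1$ never assigns less than a fixed fraction of the probability that $P_2$ assigns.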
3.1 Requirements

Rather than taking a σ-algebra approach, probabilities are assigned directly to sentences, and are required to follow several constraints based on the logic. [6] gives such constraints for intuitionistic logic; since we are using classical logic, I will simplify their constraints for that case.

Let $P$ be a real-valued probability function $P : L \to \mathbb{R}$ satisfying:

(P0) $P(A) = 0$ if $A$ is refutable.
(P1) $P(A) = 1$ if $A$ is provable.
(P2) If $A$ logically implies $B$, then $P(A) \le P(B)$.
(P3) $P(A) + P(B) = P(A \vee B) + P(A \wedge B)$.

From these we can derive $P(A) + P(\neg A) = 1$, as shown below.
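The derivation is a one-line application of the axioms: $A \vee \neg A$ is provable and $A \wedge \neg A$ is refutable, so

$$P(A) + P(\neg A) = P(A \vee \neg A) + P(A \wedge \neg A) = 1 + 0 = 1,$$

using (P3) for the first step and (P1), (P0) for the second.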
3.2 Definition of the Prior
The idea behind the prior is to consider theories as being generated by choosing
sentences at random, one after another. The probability of a particular sentence
is taken to be the probability that it occurs in a theory randomly generated in
this manner.
To be more precise, suppose we have some random process to generate individual sentences from our language $L$, where each sentence $S$ is generated with probability $G(S)$. The exact form of $G$ will not be too important; we can suppose that each symbol has been assigned some binary encoding, so that we can measure the length of a sentence in bits.

Now consider generating a sequence of sentences $S_1, S_2, \ldots$ in this way, discarding any sentence which contradicts those kept so far. This defines a distribution $P_L$ on sentences from $L$: $P_L(S)$ is the probability that a sentence $S$ will be present in a theory generated as described. Unlike $G$, $P_L$ respects the logic.

Theorem 1. $P_L$ satisfies (P0) through (P3).
Proof. (P0) is satisfied because the process never keeps contradictions. (P1) is satisfied because a provable statement can never contradict the sentences so far, so each will eventually be generated by chance as we continue to generate the sequence; therefore, provable statements are generated with probability 1. (P2) is satisfied by a similar argument: if we have already generated $A$, and $A$ logically implies $B$, then $\neg B$ can never be added, while $B$ eventually will be; so $B$ occurs in any theory in which $A$ occurs, and $P_L(A) \le P_L(B)$.

We can extend the argument a bit further to show (P3). Since $A \vdash A \vee B$ and $B \vdash A \vee B$, the sentence $A \vee B$ will occur in any theory in which $A$ or $B$ occurs. If both $\neg A$ and $\neg B$ occur in a theory, then $A \vee B$ would be contradictory, so it will not occur; therefore $\neg(A \vee B)$ will eventually be generated. Since $A \vee B$ occurs with probability 1 wherever $A$ or $B$ occurs, and $A \wedge B$ occurs with probability 1 wherever both occur, extending $P_L(\cdot)$ to conditional probabilities by $P_L(A|B) = P_L(A \wedge B)/P_L(B)$ and applying inclusion-exclusion gives $P_L(A) + P_L(B) = P_L(A \vee B) + P_L(A \wedge B)$.

³ Notice, this means any theory generated in this manner will contain all of its logical consequences with probability 1. This allows us to talk just about which sentences are in the theory, when we might otherwise need to talk about the theory plus all its logical consequences.
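As a toy illustration (my example, not from the original text): suppose the only logical conflict in play is between a sentence $A$ and its negation. A generated theory then contains $A$ exactly when $A$ is generated before $\neg A$, so

$$P_L(A) = \frac{G(A)}{G(A) + G(\neg A)},$$

and $P_L(A) + P_L(\neg A) = 1$ falls out immediately, as required by section 3.1.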
3.3 Approximability
We can approximate $P_L(\cdot)$ by eliminating inconsistent samples as the inconsistencies are discovered: we generate sequences of sentences $S_1, S_2, \ldots, S_n$ as in the definition, spending ever more time searching each sequence for contradictions, and count how often $A$ versus $\neg A$ appears. The following pseudocode estimates $P_L(A)$:

    t=1, y=1, n=1.
    loop:
        prefix = none                        // Reset the prefix at the beginning of each loop.
        while not (s = A or s = neg(A)):     // Until we get A or neg(A),
            s = generate()                   //   get a random sentence.
            prefix = push(s, prefix)         //   Append the generated sentence to the sequence so far.
            c = check(prefix, t)             //   Spend time t looking for contradictions.
            if c:                            //   If a contradiction is found, backtrack.
                pop(prefix)
        if (s = A):                          // If the prefix contains A, increment y.
            y = y + 1
        else:                                // Otherwise, increment n.
            n = n + 1
        t = t + 1                            // Increment t at the end of each loop.

The variables $y$ and $n$ count the positive and negative sampled prefixes. (I will use the term sample to refer to generated prefixes, rather than individual sentences.) To that end, the function check(·,·) takes a prefix and an amount of time, and spends that long trying to prove a contradiction from the prefix. If one is found, check returns true; otherwise, false. The specific proof-search technique is of little consequence here, but it is necessary that it is exhaustive (it will eventually find a proof if one exists). The function neg(·) takes the negation; so, we are waiting for either $A$ or $\neg A$ to occur in each sample.
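For concreteness, here is a directly executable version of the pseudocode (my sketch, not the paper's code). The stubs generate, neg, and check stand in for a random sentence generator, syntactic negation, and a resource-bounded exhaustive proof search; the toy versions below handle only a four-sentence propositional language:

    import random

    def generate():
        """Stub sentence generator for a toy language."""
        return random.choice(["A", "neg(A)", "B", "neg(B)"])

    def neg(s):
        """Stub negation: wrap or unwrap neg(...)."""
        return s[4:-1] if s.startswith("neg(") else "neg(" + s + ")"

    def check(prefix, t):
        """Stub for the time-bounded contradiction search: here we only
        detect a sentence occurring together with its direct negation."""
        return any(neg(s) in prefix for s in prefix)

    def estimate(A, iterations=10_000):
        """Estimate P_L(A) as y/(y+n), following the pseudocode above."""
        t, y, n = 1, 1, 1
        for _ in range(iterations):
            prefix = []                          # Reset the prefix each loop.
            s = None
            while s != A and s != neg(A):        # Until we get A or neg(A),
                s = generate()                   #   get a random sentence.
                prefix.append(s)                 #   Append it to the sequence so far.
                if check(prefix, t):             #   If a contradiction is found,
                    prefix.pop()                 #   backtrack and keep sampling.
                    s = None
            if s == A:                           # Prefix contains A: increment y.
                y += 1
            else:                                # Otherwise: increment n.
                n += 1
            t += 1                               # More checking time each loop.
        return y / (y + n)

    print(estimate("A"))  # roughly 0.5, by symmetry of this toy language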
Theorem 2. The ratio $\frac{y}{n+y}$ converges to $P_L(A)$.

Proof. As $t$ increases, more time is spent checking each sample, and it becomes less likely that an inconsistency in a sample goes undetected. Since every inconsistency has a finite amount of time required for detection, this means the probability of an undetected inconsistency will fall arbitrarily far as $t$ rises. The probability of consistent samples, however, does not fall. Therefore, the counts will eventually be dominated by consistent samples. The question then reduces to whether the probability of a consistent sample containing $A$ is equal to $P_L(A)$. Conditioning on the event that the generated sentences are consistent with the sentences so far, the generation probabilities are exactly those of the previous section.
3.4 Dominance
Now we can compare $P_L$ with the Solomonoff distribution. Consider the probability that a sentence $S$ will be compatible with a sequence drawn from the Solomonoff distribution. This will be denoted $P_A(S)$.⁴

⁴ Note that this $P_A$ does not respect the logic. If a consistent sentence $A$ has no logical relationship to the data sequence, then $P_A(A) = P_A(\neg A) = 1$.

Theorem 3. $P_L$ strictly multiplicatively dominates $P_A$.

Proof. Any program for the universal machine can be translated into a description $D_1 D_2$ in $L$, where $D_1$ axiomatizes the machine's behavior and $D_2$ describes its input; the translation adds only bounded costs, such as terminal symbols. Given that such a description will be consistent, we know that there is a chance that it will be generated as the first statement in the theory, with probability $\gamma\, G(D_1 D_2)$, where $\gamma > 1$ is the normalization factor that comes from discarding inconsistent sentences. So $P_L$ assigns at least a fixed multiple of the probability that $P_A$ assigns, which establishes multiplicative dominance.

To see that the dominance is strict, view a theory as partially specifying a binary sequence: place a 1 in every location whose number corresponds to a sentence provable by the theory, and a 0 in every location whose number corresponds to a sentence refutable by it. Since no extension of a sufficiently strong theory decides all sentences, a theory can partially specify sequences produced by no finitely-describable program. (If there were such a program, we could decide the sublanguage with it.) This means there must be sentences for which $P_A(\cdot) = 0$ while $P_L(\cdot) > 0$, so there can be no $\delta > 0$ such that $P_A(\cdot) \ge \delta\, P_L(\cdot)$; $P_A$ does not multiplicatively dominate $P_L$.

Since $P_L$ is multiplicatively dominant and the reverse does not hold, $P_L$ can converge to the truth in more situations. However, these situations are uncomputable ones. $P_L$, like humans, is capable of considering the possibility that physics is not computable (as mentioned in the introduction). There is no obvious need for this in machine learning. Yet, it does apparently capture something more about the way humans think.
A more practical application of this approach would be to human-like mathematical reasoning, formalizing the way that humans are able to reason about mathematical conjectures. The study of conjecturing in artificial intelligence has been quite successful, but it is difficult to analyse this theoretically, especially from a Bayesian perspective.
This situation springs from the problem of logical omniscience. The logical omniscience problem has to do with the sort of uncertainty that we can have
when we are not sure what beliefs follow from our current beliefs. For example, we
might understand that the motion of an object follows some particular equation,
but be unable to calculate the exact result without pen and paper. Because
the brain has limited computational power, we must expect the object to follow
some plausible range of motion based on estimation. Standard probability theory
does not model uncertainty of this kind. A distribution which follows the laws
of probability theory will already contain all the consequences of any beliefs (it
is logically omniscient). Real implementations cannot work like that.
An agent might even have beliefs that logically contradict each other. Mersenne believed that $2^{67}-1$ was prime; it was proven composite long after his death. Approximations of $P_L$, such as the one in section 3.3, seem to have some good properties in this respect. Let $A(t, X, Y)$ be the approximation to $P_L(X|Y)$ obtained by running the algorithm for time $t$. Consider a statement which is not provable: it will be evaluated with respect to how well it fits the evidence $Y$ within time $t$. For example, if a universal statement $\forall x.S[x]$ can be proven from $Y$, but the proof is too long to find in reasonable time, then the probability of $\forall x.S[x]$ should still rise as instances $S[i]$ are found to be true (although this cannot be made precise without more assumptions about the approximation process).
It is not clear how one would study this problem in the context of Solomonoff induction. Individual beliefs are not easy to isolate from a model when the model is presented as an algorithm. The problem of inconsistent beliefs does not even arise.

I do not claim that the first-order prior is a complete solution to this problem. There is much more to be said about the nature of mathematical conjecture and human mathematical reasoning.

This is not the first published example of a prior more general than the Solomonoff prior: algorithmic priors more general than the Solomonoff prior have been proposed in [4]. It would be helpful to establish the relationship between the first-order prior and those priors.
It would be interesting to consider similar priors over other logics. For example, what happens if we choose 2nd-order logic rather than 1st-order logic?
There are deep issues to consider.
References
1. G. Boolos, J. P. Burgess, and R. C. Jeffrey. Computability and Logic. Cambridge University Press, 2007.
2. Haim Gaifman. Reasoning with limited resources and assigning probabilities to arithmetical statements. Synthese, 140:97-119, 2004.
3. Ming Li and Paul M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications (2nd ed.). Graduate Texts in Computer Science. Springer, 1997.
4. Jürgen Schmidhuber. Hierarchies of generalized Kolmogorov complexities and nonenumerable universal measures computable in the limit. International Journal of Foundations of Computer Science, 13(4):587-612, 2002.
5. Oron Shagrir and Itamar Pitowsky. Physical hypercomputation and the Church-Turing thesis. Minds and Machines, 13:87-101, 2003.
6. Brian Weatherson. From classical to intuitionistic probability. Notre Dame Journal of Formal Logic, 44(2):111-123, April 2003.