
Logical Prior Probability

Abram Demski

Institute for Creative Technologies, 12015 Waterfront Drive, Playa Vista, CA 90094

Abstract. A Bayesian prior over first-order theories is defined. It is shown that the prior can be approximated, and that it is exponentially dominant over the Solomonoff distribution when applied to sequence prediction.

1 Introduction & Motivation


Logical and algorithmic representations are two broad categories of knowledge representation, often referred to as declarative and procedural. Declarative knowledge consists of logical statements; to use it, we make inferences. Procedural knowledge consists of programs which compute something; to use it, we execute the program.

These terms are generally understood, but it would be nice to have a more formal distinction if they are to be contrasted. For the purposes of this paper, the following definitions will suffice:

Procedural: Provides an explicit method of generating a desired object or behavior.

Declarative: Provides constraints to which an object or behavior must conform.
The two alternatives will be studied in the context of Bayesian modeling. Bayesian modeling consists of specifying a prior probability (or simply prior) over a space of possible models. The priors discussed here relate to the minimum-description-length principle (MDL), a unifying concept from model selection which balances simplicity with explanatory power to avoid overfitting.

In an effort to define the MDL principle in the most general way possible, the notion of Kolmogorov complexity has emerged: the Kolmogorov complexity of an object is the length of the shortest computer program which reproduces that object [3]. This gives us an algorithmic notion of description length, with many nice properties.
Considering algorithmic description length in a Bayesian setting has led to
the definition of the Solomonoff distribution, an algorithmic prior for sequence prediction. This distribution has been put forward as a theory of rational belief given evidence; in that context it is referred to as Solomonoff induction. Although still contentious, this idea has gained some popularity as a foundational theory of inductive learning.

(This effort has been sponsored by the U.S. Army and the Air Force Office of Scientific Research. Statements and opinions expressed do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred.)
This paper explores an alternative based on logical description length. Declarative representations turn out to yield a prior which assigns positive probability to more possible circumstances than Solomonoff induction. This is contrary to the intuitions of many researchers.¹ Computations can be described within logic, and the inference rules of logic can be implemented in computers, so it is easy to think they must be representationally equivalent. However, this is not the case.

The Solomonoff distribution can be seen as the consequence of equating describable with computable. This is justified to the extent that the formal theory of computation captures the informal notion of description. Computation has certainly proven to be a very general tool, but there are a few things which seem to be describable but which we know are not computable. The primary examples come from mathematics: many mathematical concepts which are commonly described in first-order logic are fundamentally incomplete (as explained in section 2). This means that we can provide a partial set of axioms, which answer some questions, but no finite formulation can answer all the questions. Because the Solomonoff distribution requires programs to output all answers sequentially, incomplete formulations like this are given no probability.

Giving positive probability to situations which can be described in terms of constraints, but for which no generative description exists, seems to capture something about human probabilistic reasoning which is not captured by the Solomonoff distribution. For example, humans can consider the possibility of an uncomputable physics (such as in [5]). While humans are no better than computers at using such theories as generative models, we can consider the constraints on observations imposed by such theories, and reason logically to arrive at consequences which can be compared to observations.

¹ This statement is based on personal correspondence. Almost all of those with whom I discussed the first-order prior suspected that it would be equivalent to the Solomonoff distribution.

2 Selected Background and Notation


We will be using first-order logic, defining the language L of first-order sentences as follows. There is an infinite stock of variable symbols, v1, v2, ... ∈ V; an infinite stock of predicate symbols, p1, p2, ... ∈ P; and an infinite stock of function symbols, f1, f2, ... ∈ F.
The number of arguments fed to a predicate or function is referred to as its arity. For example, a predicate of arity 2 is typically referred to as a relation. A function of arity 0 is referred to as a constant, and a predicate of arity 0 is a proposition. For simplicity, the arity of a symbol will be inferred from its use here, rather than set ahead of time. If the same symbol is used with multiple arities, the uses are independent (so f2 would notate distinct functions in f2(v1) versus f2(v1, v2)).

An expression is a composition of function symbols and variable symbols, for example f1(f1(v1)). Specifically, the set of expressions E is defined inductively by: V ⊆ E, and for every function fn ∈ F of arity a and expressions e1, e2, ..., ea ∈ E, we have fn(e1, e2, ..., ea) ∈ E.

- For e1, e2 ∈ E, e1 = e2 is in L; this represents equality.
- For pn ∈ P of arity a and e1, e2, ..., ea ∈ E, we have pn(e1, e2, ..., ea) ∈ L.
- For A, B ∈ L, we have A ∧ B ∈ L and A ∨ B ∈ L; these represent conjunction and disjunction, respectively.
- For S ∈ L, we have ¬S ∈ L. This represents negation.
- For any S ∈ L and vn ∈ V, we have ∀vn.S ∈ L and ∃vn.S ∈ L, representing universal and existential quantification.
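As a concrete, purely illustrative rendering of this inductive definition, the following Python sketch represents expressions and sentences as recursive data types. The class names and layout are my own assumptions for the example, not notation from the paper.

from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Var:                 # variable symbols v1, v2, ...
    index: int

@dataclass(frozen=True)
class Func:                # function symbol applied to expressions; arity inferred from use
    index: int
    args: Tuple["Expr", ...] = ()

Expr = Union[Var, Func]

@dataclass(frozen=True)
class Eq:                  # e1 = e2
    left: "Expr"
    right: "Expr"

@dataclass(frozen=True)
class Pred:                # pn(e1, ..., ea)
    index: int
    args: Tuple["Expr", ...]

@dataclass(frozen=True)
class And:                 # conjunction
    a: "Sentence"
    b: "Sentence"

@dataclass(frozen=True)
class Or:                  # disjunction
    a: "Sentence"
    b: "Sentence"

@dataclass(frozen=True)
class Not:                 # negation
    s: "Sentence"

@dataclass(frozen=True)
class ForAll:              # universal quantification
    var: Var
    s: "Sentence"

@dataclass(frozen=True)
class Exists:              # existential quantification
    var: Var
    s: "Sentence"

Sentence = Union[Eq, Pred, And, Or, Not, ForAll, Exists]

# The expression f1(f1(v1)) and the sentence "for all v1, f1(v1) = v1":
example_expr = Func(1, (Func(1, (Var(1),)),))
example_sentence = ForAll(Var(1), Eq(Func(1, (Var(1),)), Var(1)))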

If sentence A logically implies sentence B (meaning, B is true in any situation in which A is true), then we write A ⊨ B. The notation also applies to multiple premises; if A and B together imply C, we can write A, B ⊨ C. Uppercase Greek letters will also be used to denote sets of sentences. We can write A ⊭ B to say that A does not logically imply B.

If A implies B according to the inference rules (meaning, we can derive B starting with the assumption A), we write A ⊢ B. This notation applies to multiple premises as well, and can be denied as ⊬.

The inference rules will not be reviewed here, but some basic results will be important. These results can be found in many textbooks, but in particular, [1] has material on everything mentioned here.

Soundness. For a sentence S and a set of sentences Γ: if Γ ⊢ S, then Γ ⊨ S. That is, the inference rules will never derive something that doesn't logically follow from a set of premises.

Completeness. For a sentence S and a set of sentences Γ: if Γ ⊨ S, then Γ ⊢ S. That is, the inference rules can derive anything which logically follows from a set of premises.


Since the rules for ⊢ can be followed by a computer, this shows that ⊢ is computably enumerable: a (non-halting) program can enumerate all the true instances of Γ ⊢ S.

Incompleteness. There exists a finite set of sentences Γ for which no consistent extension decides all sentences formed from the functions and predicates used in Γ. That is: for any Δ such that Δ ⊇ Γ and Δ is consistent, there will exist a sentence S (actually, many sentences) in which any function or predicate symbols are also found in Γ, and for which neither Δ ⊢ S nor Δ ⊢ ¬S. This means that Γ does not tell us the truth or falsehood of all questions in its subject matter, and neither do any extensions of Γ.

There exist many such Γ. Basic number theory as codified by the axioms of Robinson arithmetic is an example. The inference relation itself, ⊢, is another example: we can code expressions to stand for the sentences of first-order logic, and use a relation to represent ⊢. We can show that any sufficiently strong theory of entailment will have no complete extension which decides A ⊢ B or A ⊬ B for all pairs of sentences.

Encoding computations. Any computable function can be encoded in first-order logic. This can be done, for example, by providing axioms related to the behavior of Turing machines.
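As a small, standard illustration (not an example taken from the paper), the computable operation of addition can be pinned down by two axioms over a constant f6 for zero and a successor function f7, with f5 standing for addition; the symbol choices here are illustrative assumptions:

∀v1. f5(v1, f6) = v1
∀v1.∀v2. f5(v1, f7(v2)) = f7(f5(v1, v2))

Computing 2 + 1, say, corresponds to unfolding the second axiom once and then the first axiom once. Axiomatizing a Turing machine works in the same spirit, with predicates describing the tape contents, head position, and machine state after each step.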
In addition to first-order logic, it will be useful to have some background on the Solomonoff distribution. B denotes the binary alphabet, {0, 1}; B^n denotes the set of binary strings of length n; B^* denotes the set of binary strings of any finite length; B^∞ denotes the set of binary strings of infinite length; and SB = B^* ∪ B^∞ denotes the set of finite and infinite binary strings. String concatenation will be represented by adjacency, so ab is the concatenation of a and b.

Consider a class C of Turing machines with three or more tapes: an input tape, one or more work tapes, and an output tape. The input and output tapes are both able to move in just one direction. Any Turing machine T ∈ C defines a partial function fT from B^∞ to SB: for input i ∈ B^∞, fT(i) is considered to be the string which T writes to the output tape, which may be infinite if T never stops writing output. Now consider a universal machine from this class; that is, a machine U ∈ C such that for any other machine T ∈ C, there is a finite sequence of bits s ∈ B^* which we can place on U's input tape to get it to behave exactly like T; that is, fU(si) = fT(i) for all i.

A distribution M over SB can be defined by feeding random bits to U; that is, we take fU(i) for uniformly random i ∈ B^∞. The Solomonoff distribution is a little bit different: we apply the Solomonoff normalization to M, which gives a distribution over B^∞. The details of normalization will be skipped. The development here has been adapted from [3].


Now, how do we compare two distributions? P1 multiplicatively dominates P2 iff there exists α > 0 such that P1(x) > αP2(x) for any x. An intuitive way of understanding this is that P1 needs at most a constant amount more evidence to reach the same conclusion as P2.² Strict multiplicative dominance means that P1 multiplicatively dominates P2, but the reverse is not the case. This indicates that P1 needs at most a constant amount more evidence to reach the same conclusion as P2, but we can find examples where P2 needs arbitrarily more evidence than P1 to come to the conclusion P1 reaches.

The main reason that the Solomonoff distribution is interesting is that it is multiplicatively dominant over any computable probability distribution for sequence prediction. This makes the Solomonoff distribution a highly general tool.

P1 exponentially dominates P2 iff there exist α, β > 0 such that P1(x) > αP2(x)^β. This intuitively means that P1 needs at most some constant multiple of the amount of evidence which P2 needs to reach a specific conclusion. Strict exponential dominance again indicates that the reverse is not the case, which means that P2 needs more than multiplicatively more evidence to reach some conclusions that P1 can reach.

² This is true if we measure evidence by the log of the likelihood ratio. P1(x|e) = P1(x)P1(e|x)/P1(e), so multiplicative dominance indicates that P1(e|x)/P1(e) doesn't have to get too extreme to bridge the distance between P1 and P2.
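For a toy illustration of the gap between the two notions (my own example, not one from the paper), consider the strings 0, 00, 000, ... and put P1 of the string of n zeros equal to 2^(-n), and P2 equal to 4^(-n) = P1^2. Then P2 exponentially dominates P1 (take β = 3 and α = 1), but it does not multiplicatively dominate P1, since the ratio P2/P1 = 2^(-n) shrinks without bound. A few lines of Python make this visible:

def p1(n):
    return 2.0 ** (-n)      # P1 of the string of n zeros

def p2(n):
    return p1(n) ** 2       # P2 of the same string

for n in (1, 5, 10, 20):
    print(n, p2(n) / p1(n), p2(n) / p1(n) ** 3)
# The first ratio tends to 0, so no fixed alpha > 0 gives multiplicative
# dominance of P2 over P1; the second ratio stays above 1, witnessing
# exponential dominance with beta = 3.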

3 A Notion of Logical Probabilities

3.1 Requirements

I will follow [6] in the development of the idea of a probability distribution over a language, since this provides a particularly clear idea of what it means for a continuous-valued belief function to fit with a logic. I shall say that the distribution respects the logic. The approach is to define probability as a function on sentences in a language, rather than by the more common σ-algebra approach, and require the probabilities to follow several constraints based on the logic. Since we are using classical logic, I will simplify their constraints for that case.

Let L be the language of first-order logic from section 2. We want a probability function P : L → R to obey the following rules:

(P0) P(A) = 0 if A is refutable.
(P1) P(A) = 1 if A is provable.
(P2) If A logically implies B, then P(A) ≤ P(B).
(P3) P(A) + P(B) = P(A ∨ B) + P(A ∧ B).

From these, we can prove other typical properties, such as P(A) + P(¬A) = 1. (For instance, A ∨ ¬A is provable and A ∧ ¬A is refutable, so (P0), (P1), and (P3) give P(A) + P(¬A) = P(A ∨ ¬A) + P(A ∧ ¬A) = 1 + 0 = 1.)

3.2 Definition as a Generative Process

The idea behind the prior is to consider theories as being generated by choosing sentences at random, one after another. The probability of a particular sentence is taken to be the probability that it occurs in a theory randomly generated in this manner.

To be more precise, suppose we have some random process to generate individual sentences from our language L. Each sentence S is generated with probability G(S). We would like a description-length prior. Specifically, G(S) should decrease exponentially with the length of S. The specifics of measuring length will not be too important; we can suppose that each symbol has been assigned some binary encoding, so that we can measure the length of a sentence in bits.

A theory is a set of sentences in L. To generate a random theory, we generate a sequence of sentences S1, S2, S3, ... according to the following process. For each Sn, use sentences from G, but discarding those which are inconsistent with the sentences so far; that is, rejecting any candidate for Sn which would make S1 ∧ ... ∧ Sn into a contradiction. (For S1, the set of preceding sentences is empty, so we only need to ensure that it does not contradict itself.)
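To make the shape of this process concrete, here is a small illustrative sketch in Python (my own toy instance, not the paper's construction): the "language" is just propositional literals such as p3 and -p3, so that consistency can be checked exactly, whereas in the first-order setting the consistency check is only semi-decidable.

import random

# Toy stand-in for G: sample a literal, with higher-numbered atoms
# exponentially less likely (a crude description-length prior).
def generate_sentence():
    n = 1
    while random.random() < 0.5:
        n += 1
    return random.choice(["", "-"]) + "p" + str(n)

# A set of literals is consistent iff no atom occurs both plain and negated.
def consistent(sentences):
    positive = {s for s in sentences if not s.startswith("-")}
    negative = {s[1:] for s in sentences if s.startswith("-")}
    return positive.isdisjoint(negative)

# Generate the first k sentences of a random theory, rejecting any
# candidate that would contradict the sentences so far.
def generate_theory_prefix(k):
    theory = []
    while len(theory) < k:
        candidate = generate_sentence()
        if consistent(theory + [candidate]):
            theory.append(candidate)
    return theory

print(generate_theory_prefix(5))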


Notice that there is no stopping condition. The sequence generated will be infinite. However, the truth or falsehood of any particular statement (or any finite theory) will be determined after a finite amount of time. (The remaining sentences generated will either be consequences of, or irrelevant to, the statement in question.) Shorter (finite) theories will have a larger probability of occurring in the sequence.
In this way, we induce a new probability distribution PL on sentences from L. PL(S) is the probability that a sentence S will be present in a sequence S1, S2, S3, ... generated from G as described. Unlike the one we began with, G, PL respects the logic:

Theorem 1. PL obeys (P0)-(P3).

Proof. (P0) is satisfied easily, since the process explicitly forbids generation of contradictions. (P1) is satisfied, because a provable statement can never contradict the sentences so far, so each will eventually be generated by chance as we continue to generate the sequence. Therefore, provable statements are generated with probability 1.³ (P2) is satisfied by a similar argument: if we have already generated A, but A implies B, then anything which contradicts B will contradict A, and hence never be generated. This means that B will never be ruled out, and so must eventually be generated at random.

We can extend the argument a bit further to show (P3). Since A ⊢ A ∨ B and B ⊢ A ∨ B, the sentence A ∨ B will occur in any theory in which A or B occurs. If both A and B occur in a theory, then ¬(A ∧ B) would be contradictory, so will not occur; therefore, A ∧ B will eventually be generated. Since A ∨ B occurs with probability 1 wherever A or B occurs, and A ∧ B occurs with probability 1 where both are present, the equation (P3) must hold.

³ Notice, this means any theory generated in this manner will contain all of its logical consequences with probability 1. This allows us to talk just about what sentences are in the theory, when we might otherwise need to talk about the theory plus all its logical consequences.


The conditional probability can be defined as usual, with PL(A|B) = PL(A ∧ B)/PL(B). We can also extend the definition of PL to include probabilities of sets of sentences, so that PL(Γ) for Γ ⊆ L is the probability that all S ∈ Γ will be present in a sequence generated by the process defined above. (By an argument similar to the one used to prove (P3), the probability of a set of sentences will be equal to the probability of the conjunction.)

3.3 Approximability

The generative process described so far cannot be directly implemented, since there is no way to know for sure that a theory remains consistent as we add sentences at random. However, we can asymptotically approach PL(·) by eliminating inconsistent possibilities when we find them.

Suppose we want to approximate PL(A). I shall call a partial sequence S1, S2, ..., Sn a prefix. Consider the following Monte Carlo approximation:

t=1, y=1, n=1.
loop:
    prefix=none                    // Reset the prefix at the beginning of each loop.
    while not (s=A or s=neg(A)):   // Until we get A or neg(A),
        s=generate()               // get a random sample sentence.
        prefix=push(s, prefix)     // Append to the sequence so far.
        c=check(prefix, t)         // Spend time t looking for contradictions.
        if c:                      // If a contradiction is found, backtrack.
            pop(prefix)
    if (s=A):                      // If the prefix generated contains A, increment y.
        y=y+1
    else:                          // Otherwise, increment n.
        n=n+1
    t=t+1                          // Increment t at the end of each loop.

The variables y and n count the number of positive and negative samples, while t provides a continually rising standard for consistency-detection on the sampled prefixes. (I will use the term sample to refer to generated prefixes, rather than individual sentences.) To that end, the function check(,) takes a prefix and an amount of time, and spends that long trying to prove a contradiction from the prefix. If one is found, check returns true; otherwise, false. The specific proof-search technique is of little consequence here, but it is necessary that it is exhaustive (it will eventually find a proof if one exists). The function neg() takes the negation; so, we are waiting for either A or neg(A) to occur in each sample. The prefix is represented as a FILO queue. push() adds a sentence to the given prefix, and pop() removes the most recently added sentence.

The inner loop produces individual extensions at random, backtracking whenever an inconsistency is found. The loop terminates when a theory includes either A or neg(A). The outer loop then increments y or n based on the result, increments t, erases the prefix, and re-enters the inner loop to get another sample.
Theorem 2. y/(n+y) will approach PL(A).

Proof. The standard of consistency is continually rising, so that it becomes less and less likely that an inconsistency in a sample goes undetected. Since every inconsistency has a finite amount of time required for detection, this means the probability of an undetected inconsistency will fall arbitrarily far as t rises. The probability of consistent samples, however, does not fall. Therefore, the counts will eventually be dominated by consistent samples.

The question reduces to whether the probability of a consistent sample containing A is equal to PL(A). We can see that this is the case, since if we assume that the generated sentences will be consistent with the sentences so far, then the generation probabilities are exactly those of the previous section.
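To see the procedure run end to end, here is a small self-contained sketch in Python (my own toy instance, reusing the propositional-literal stand-in for L from the illustration in section 3.2, so this is not the first-order algorithm itself). The function names mirror the pseudocode, and check() is deliberately budget-limited so that, as above, the standard of consistency rises with t.

import random
from itertools import combinations

def generate():
    # Exponentially length-biased random literal, e.g. "p2" or "-p2".
    n = 1
    while random.random() < 0.5:
        n += 1
    return random.choice(["", "-"]) + "p" + str(n)

def neg(s):
    return s[1:] if s.startswith("-") else "-" + s

def check(prefix, t):
    # Spend "time t": inspect at most t pairs of sentences for a clash.
    for budget, (a, b) in enumerate(combinations(prefix, 2)):
        if budget >= t:
            break
        if a == neg(b):
            return True
    return False

def estimate(A, samples=2000):
    t, y, n = 1, 1, 1
    for _ in range(samples):
        prefix, s = [], None
        while s != A and s != neg(A):
            s = generate()
            prefix.append(s)
            if check(prefix, t):   # contradiction found: backtrack
                prefix.pop()
                s = None           # a rejected A / neg(A) should not end the sample
        if s == A:
            y += 1
        else:
            n += 1
        t += 1
    return y / (n + y)

print(estimate("p1"))   # roughly 0.5 in this symmetric toy setting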

3.4 Dominance

In order to compare the first-order prior with algorithmic probability, we need to apply the first-order prior to sequence prediction. We can do so by encoding bit sequences in first-order logic. For example, f1 can serve as a logical constant representing the sequence to be observed and predicted; f2() can represent adding a 0 to the beginning of some sequence; and f3() can represent adding a 1. So, to say that the sequence begins 0011... we would write f1 = f2(f2(f3(f3(f4)))), where f4 is a logical constant standing for the remainder of the sequence.
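For concreteness, here is a tiny helper (an illustrative assumption of mine, not code from the paper) that builds this encoding for an arbitrary finite prefix of the sequence:

def encode_bits(bits, rest="f4"):
    # f2(...) prepends a 0, f3(...) prepends a 1, and f4 stands for the
    # unspecified remainder of the sequence, as in the text above.
    term = rest
    for b in reversed(bits):
        term = ("f2(" if b == "0" else "f3(") + term + ")"
    return "f1 = " + term

print(encode_bits("0011"))   # f1 = f2(f2(f3(f3(f4))))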

Define the algorithmic probability of a first-order sentence S as the Solomonoff probability that S is consistent with the data; that is, the probability that S will be compatible with a sequence drawn from the Solomonoff distribution. This will be denoted PA(S).⁴

Theorem 3. PL strictly exponentially dominates PA .

Proof. We can describe the universal machine U from section 2 in a first-order sentence (which will be a large conjunction), taking f1 to represent the output tape. Call this description D1. Now any finite input sequence s of length l has probability 2^(-l). The sequence s has a description D2 in first-order logic of length at most κ + l, where κ is derived from the length of f2, f3, f4, and the necessary parentheses. κ absorbs the cost of the initial "f1 =" and any additional coding costs, such as terminal symbols. Given that such a description will be consistent, we know that there is a chance that it will be generated as the first statement in the theory, with probability αG(D1 ∧ D2), where α > 1 is the normalization constant to account for the rejection of inconsistent sentences as we draw sentences from G. The description length of D1 ∧ D2 will be κ′ + l, where κ′ is κ plus some constant. Since G is exponential in description length with some base b, this means that PL(D1 ∧ D2) ≥ αb^(-κ′)b^(-l) = αb^(-κ′)(2^(-l))^(log2 b) ≥ β(2^(-l))^γ for some β, γ > 0. Thus PL is exponentially dominant. It remains to show that this dominance is strict.

Consider some incomplete theory Γ, and the sublanguage of that theory, as defined in section 2. There exists a computable bijection from the sentences in that sublanguage to natural numbers; that is, we can put the sentences in an order, so that each sentence gets a different number. We can therefore use sequence locations to talk about the truth or falsehood of different statements; each location is associated with a sentence. With this, we can see Γ as partially specifying a binary sequence: place a 1 in every location whose number corresponds to a sentence provable by Γ, and a 0 wherever a sentence is refutable by Γ. Call the set of sequences satisfying this constraint X. This constraint on sequences can be encoded as a finite theory. As such, it will get positive probability in PL.

Since no extension of Γ decides all the statements of the sublanguage, there are no finite strings s which we can feed to U to ensure that the output agrees with X in this way. (If there were, we could decide the sublanguage with the finitely-describable program s000..., which we could encode as a theory.) This means that no subset of X is assigned positive probability by PA, so X must be a set of measure zero. Since PA(X) = 0 and PL(X) > 0, there can be no α, β > 0 such that PA(X) > αPL(X)^β. This proves the desired result.

⁴ Note that this PA does not respect the logic. If a consistent sentence A has no logical relationship to the data sequence, then PA(A) = PA(¬A) = 1.

This leaves open the question of strict multiplicative dominance. However, PA is certainly not multiplicatively dominant over PL; so if PL is multiplicatively dominant, it is strictly so.

4 Conclusion & Questions


PL can converge to the truth in more situations. However, these situations are strange in that they display incompleteness. For example, PL, like humans, is capable of considering the possibility that physics is not computable (as mentioned in the introduction). There is no obvious need for this in machine learning. Yet, it does apparently capture something more about the way humans think.
A more practical application of this approach would be to human-like mathematical reasoning, formalizing the way that humans are able to reason about mathematical conjectures. The study of conjecturing in artificial intelligence has been quite successful, but it is difficult to analyse this theoretically, especially from a Bayesian perspective.
This situation springs from the problem of logical omniscience [2]. The logical omniscience problem has to do with the sort of uncertainty that we can have
when we are not sure what beliefs follow from our current beliefs. For example, we
might understand that the motion of an object follows some particular equation,
but be unable to calculate the exact result without pen and paper. Because
the brain has limited computational power, we must expect the object to follow
some plausible range of motion based on estimation. Standard probability theory
does not model uncertainty of this kind. A distribution which follows the laws
of probability theory will already contain all the consequences of any beliefs (it
is logically omniscient). Real implementations cannot work like that.
An agent might even have beliefs that logically contradict each other.
Mersenne believed that 2^67 - 1 is a prime number, which was proved false in 1903, [...] Together with Mersenne's other beliefs about multiplication and primality, that belief logically implies that 0 = 1. [2]
Gaifman proposes a system in which probabilities are defined only with respect to a finite subset of the statements in a language, and beliefs are required to be consistent only with respect to chains of deduction in which each statement occurs in this limited set.
I will not attempt to address this problem to its fullest here, but approximations to PL such as the one in section 3.3 seem to have some good properties in this area. If the proof of some statement X is too long for some approximation A(t, X, Y) to PL(X|Y) to find given time t, then X will be treated exactly like a statement which is not provable: it will be evaluated with respect to how well it fits the evidence Y, given the connections which A(t, X, Y) can find within time t. For example, if some universal statement ∀x.S[x] can be proven from Y, but the proof is too long to find in reasonable time, then the probability of ∀x.S[x] will still tend to rise with the number of individual instances S[i] which are found to be true (although this cannot be made precise without more assumptions about the approximation process).
It is not clear how one would study this problem in the context of Solomonoff
induction. Individual beliefs are not easy to isolate from a model when the
model is presented as an algorithm. The problem of inconsistent beliefs does not
even arise.
I do not claim that the first-order prior is a complete solution to this problem.
There is much more to be said about the nature of mathematical conjecture and
human mathematical reasoning.
This is not the first published example of a prior more general than the Solomonoff prior. Algorithmic priors more general than the Solomonoff prior have been proposed in [4]. It would be helpful to establish the relationship between the first-order prior and those priors.
It would be interesting to consider similar priors over other logics. For example, what happens if we choose 2nd-order logic rather than 1st-order logic?
There are deep issues to consider.

References
1. G. Boolos, J. P. Burgess, and R. C. Jeffrey. Computability and Logic. Cambridge University Press, 2007.
2. Haim Gaifman. Reasoning with limited resources and assigning probabilities to arithmetical statements. Synthese, 140(1-2):97-119, 2004.
3. Ming Li and Paul M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications (2nd ed.). Graduate Texts in Computer Science. Springer, 1997.
4. Jürgen Schmidhuber. Hierarchies of generalized Kolmogorov complexities and nonenumerable universal measures computable in the limit. International Journal of Foundations of Computer Science, 13(4):587-612, 2002.
5. Oron Shagrir and Itamar Pitowsky. Physical hypercomputation and the Church-Turing thesis. Minds and Machines, 13:87-101, 2003.
6. Brian Weatherson. From classical to intuitionistic probability. Notre Dame Journal of Formal Logic, 44(2):111-123, April 2003.
