Having thought about this for a few more minutes:

It seems like things are much easier to handle if, instead of putting any actual numbers (probabilities) in, we just track the partial order generated by the logical relations.  Like, when you consider a new hypothesis you’ve never thought about, you just note down “has to have lower probability than these ones I’ve already thought about, and higher probability than these other ones I’ve already thought about.”

At some point, you’re going to want to assign some actual numbers, but we can think of this step as more provisional and revisable than the partial order.  You can say “if I set P(thing) = whatever, what consequences does that have for everything else?” without committing to “P(thing) = whatever” once and for all, and if you retract it, the partial order is still there.

In fact, we can (I think) do conditionalization without numbers, since it just rules out subsets of hypothesis space.  I’m not sure how the details would work but it feels do-able.

The big problem with this is trying to do decision theory, because there you’re supposed to integrate over your probabilities for all hypotheses, whereas this setup lends itself better to getting bounds on individual hypotheses (“P(A) must be less than P(B), and I’m willing to say P(B) is less than 0.8, so P(A) is less than 0.8”).  I wonder if a sensible (non-standard) decision theory can be formulated on the basis of these bounds?
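
To make the bookkeeping concrete, here’s a minimal sketch of the “partial order now, provisional numbers later” idea.  Everything here is made up for illustration; it’s just the transitive bound-propagation from the example above:

    from collections import defaultdict

    class BeliefOrder:
        """Track judgments of the form P(a) <= P(b), without any numbers."""
        def __init__(self):
            self.above = defaultdict(set)  # above[a] = things known to be >= a

        def assert_leq(self, a, b):
            """Note down "P(a) has to be lower than (or equal to) P(b)"."""
            self.above[a].add(b)

        def upper_bound(self, a, provisional):
            """Best upper bound on P(a), given provisional numeric assignments.
            Retracting a number just means calling this again with a different
            dict; the partial order itself is untouched."""
            best, stack, seen = 1.0, [a], {a}
            while stack:
                x = stack.pop()
                if x in provisional:
                    best = min(best, provisional[x])
                for y in self.above[x]:
                    if y not in seen:
                        seen.add(y)
                        stack.append(y)
            return best

    beliefs = BeliefOrder()
    beliefs.assert_leq("A", "B")                 # e.g. because A implies B
    print(beliefs.upper_bound("A", {"B": 0.8}))  # 0.8, as in the example above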

(long OP snipped for space, but may be worth looking at for context)

I thought about this more, and concluded that Jaynes’ propositional formulation of probability is more different from Kolmogorov than I thought, and really does have an advantage in this case.

Both the Jaynes and Kolmogorov formulations have some sort of Boolean algebra structure, which the probabilities are constrained to respect.  In Kolmogorov, this arises because probabilities get assigned to sets, and you get a Boolean algebra from the unions/intersections of the sets.  In Jaynes, probabilities are assigned to propositions and the Boolean algebra comes from logical and/or/not.

The difference is in the “basic atoms” of the algebra.  In Kolmogorov, the sets contain things, called “outcomes,” and the “smallest” set you can make (besides the empty set) is a singleton with just one of the outcomes in it.  All of these distinct singletons are mutually exclusive events, so they can’t be proper subsets of one another; but subset relationships are exactly the structure you need if you want to encode material conditionals (i.e. the kind of “A→B” that is equivalent to “¬A or B”).  So in Kolmogorov, material conditionals only arise from the subset relationships between bigger, non-atomic events.

But in Jaynes, the basic atoms are atomic propositions (i.e. propositions like “A” which are not aliases for compounds like “B or C,” but are undefined within the algebra itself).  For two atoms A and B, the material conditional “¬A or B” is a perfectly good compound proposition, and we can (say) begin by being unsure of its truth value and later update after we prove it (by conditioning on it), as @eclairsandsins​ was suggesting upthread.

If we try to translate this back to the Kolmogorov picture, the compound propositions follow the same algebra as Kolmogorov events (sets), but the atomic propositions are weird “sets of unknown structure”: their unions, intersections, etc. exist in the algebra but we don’t know anything about their contents.  This is exactly the sort of thing I needed in the OP to encode material conditionals, and generally seems preferable to the “you have to already know everything” feel of the Kolmogorov setup.
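
As a toy version of “being unsure of the material conditional, then conditioning on it once proven” (with worlds standing in for the truth assignments to the atoms; this is a sketch, not Jaynes’ own formalism):

    from itertools import product

    # Free Boolean algebra on two atomic propositions A, B: the four truth
    # assignments play the role of the "atoms of unknown structure."
    worlds = list(product([True, False], repeat=2))   # (A, B) pairs
    p = {w: 0.25 for w in worlds}                     # some prior over them

    def prob(event, p):
        return sum(q for w, q in p.items() if event(*w))

    def condition(event, p):
        z = prob(event, p)
        return {w: (q / z if event(*w) else 0.0) for w, q in p.items()}

    implies = lambda a, b: (not a) or b               # the material conditional

    print(prob(implies, p))           # 0.75: genuinely unsure of "A → B"
    p2 = condition(implies, p)        # later, we prove it and condition on it
    print(prob(lambda a, b: b, p2))   # P(B) moves from 0.5 to 2/3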

If we move from material conditionals to counterfactual conditionals, things get harder.  (Counterfactual conditionals are the type I mentioned earlier that can be false even if antecedent and consequent are false: e.g. “if I were in Mexico, I’d be in Asia” is false even though I am in neither.)  The Boolean algebra can tell us that certain things are logically possible even if they are contingently false (e.g. A and ¬A are both in the algebra, but we happen to assign P(A) = 0), but I don’t think this does the whole job, for which I think we need predicate calculus or modal logic rather than just propositional logic.

For instance, even if Linda is not a bankteller at all, I believe it is false (in the counterfactual conditional sense) that “(Linda is a bankteller) → (Linda is a feminist bankteller).”  Not just because non-feminist banktellers are logically possible, but actually possible.  For this I need the “there exists” of predicate calculus, or (better?) the “is possible” of modal logic.  David Chapman has a post saying that Jaynes’ setup doesn’t generalize to predicate calculus; I linked it a while back and remember getting pushback from @raginrayguns​ about it, but I don’t remember the details.

Anyway, the motivation for all this was thinking about logical induction.  If we want to reason sensibly about the probabilities of logical statements, we ought to be able to do it about (at least) statements of first-order predicate calculus, and possibly higher-order (?) as well.  We’re talking about trying to assign probabilities to math conjectures here; we at least need to be able to deal with Peano arithmetic.

The MIRI formalism outsources this task to the e.c. traders, which doesn’t tell us how efficiently it can actually be done; if for instance we need a separate trader for each case of a “for all” sentence, that would still get the job done in MIRI’s asymptotic sense, but effectively means that we can’t do it in practice (one would like to be able to take in the sentence “all at once”).  Quantifiers can be parcelled out into conjunctions/disjunctions of countably many propositions, but we’d like to avoid that if we can (one should be able to do mathematical induction without needing infinite storage space to hold P(n) for every integer.)  IDK, it just seems like “can we do this sort of probability in practice” is the real question, and the MIRI work sidesteps that.
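
(To restate the countable-conjunction point in symbols, in the Kolmogorov picture: by continuity from above, P(∀n φ(n)) = lim_{N→∞} P(φ(1) ∧ … ∧ φ(N)), and in particular P(∀n φ(n)) ≤ inf_n P(φ(n)).  Handling the quantifier case-by-case means tracking that whole decreasing sequence of conjunctions, which is exactly the infinite-storage problem.)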

(via nostalgebraist)

eclairsandsins:

nostalgebraist:

I’ve been thinking a bit about how to get a “uniform rather than pointwise” version of the logical induction stuff.

It seems like a lot of the challenge of the problem is generic to Bayesianism, and not particular to “logical” or “mathematical” outcomes.  Anyone who reads my posts on this stuff knows I have an axe to grind about how, outside of specialized small domains, you don’t have a complete sigma-algebra to put probabilities on.  (Because you aren’t logically omniscient, you don’t know all the logical relations between hypotheses, which means you don’t know all of the subset relationships between sets in your algebra.)

The case of “logical induction” forces Bayesians to think about this even if they wouldn’t otherwise, since the prototypical/motivating examples involve math, a world in which we are continually discovering facts of the form “A implies B.”  So assuming logical omniscience would be assuming we know all the theorems at the outset, in which case we wouldn’t need LI to begin with.

But the problem that “A may imply B even though you don’t know it does” is generic and comes up for Bayesian inference about real-world events, too.  A good solution to this problem would be very interesting and important (?) even if it didn’t, in itself, handle the “logical” aspects of “logical induction” (like “what counts as evidence for a logical sentence”).


I have to imagine there is work on this problem out there, but I have had a hard time finding it.

The basic mathematical setup would have to involve some “incomplete” version of a sigma-algebra (generically, a field of sets), where not all of the union/intersection information is “known.”  This is a bit weird because when we talk about a collection of sets, we usually mean we know what is in the sets, and that information contains all the relations like “A is a subset of B” (i.e. A implies B), whereas we want to make some of them go away.

A Boolean algebra is like a field of sets where we forget what the sets contain, and just leave them as blank symbols that happen to have union/intersection (AKA “join/meet”) relations with one another.  That seems closer to what we want, except that we need some of the join/meet operations to give undefined results.  There are Boolean algebras where not everything has a join/meet (those that aren’t complete, in the complete lattice sense), but this seems like a thing having to do with inf/sup stuff in infinite spaces and isn’t really what we want.  (Despite my username, I know very little about algebra and am just flying blind on Wikipedia here.)
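
The crude data structure I’m imagining (a sketch of the “undefined results” idea only; I don’t claim this is a respectable algebraic object):

    class PartialAlgebra:
        """Boolean-algebra-ish structure over opaque symbols, where only
        some of the order/join/meet facts are known.  Missing facts are
        simply absent, not false."""
        def __init__(self):
            self.join = {}    # (a, b) -> symbol for "a or b", when known
            self.meet = {}    # (a, b) -> symbol for "a and b", when known

        def learn_leq(self, a, b):
            # discovering "a implies b" pins down the join and meet,
            # just as a <= b does in an honest lattice
            self.join[(a, b)] = b
            self.meet[(a, b)] = a

        def get_join(self, a, b):
            return self.join.get((a, b))   # None means "not yet known"

    alg = PartialAlgebra()
    alg.learn_leq("A", "B")
    print(alg.get_join("A", "B"))   # 'B'
    print(alg.get_join("A", "C"))   # None: the algebra is silent here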

An example of the sort of thing I want to do is the following.  Say we are assigning probabilities in (0,1) to A, B, A=>B, and B=>A.  Suppose P(A=>B) > P(B=>A), that is, we think it’s more likely that A implies B than the reverse (and in particular, more likely than A<=>B).

Now consider P(A) and P(B).  The probabilities above say we’re most likely to be in a world where A=>B and not vice versa, in which case we should have P(A) > P(B), or we’ll be incoherent.  So it seems like we should have P(A) > P(B) right now.  Of course, this will make us incoherent if it turns out that we are in the B=>A or A<=>B worlds, but we think those are less likely.  In betting terms, the losses we might incur from incoherence in a likely world should outweigh those we’d incur from incoherence in an unlikely world.

What we’re really doing here, I guess, is treating the implication (i.e. subset relation) as a random event, so implicitly there is a second, complete probability space whose events (or outcomes?) include the subset relations on the first, incomplete probability space (the one discussed above).  Maybe you could just do the whole thing this way?  I haven’t tried it; I’m curious what would happen.

Anyway, I can’t help but think there must be the right math tools out there for doing this kind of thing, and I just don’t know about them.  Anyone have pointers?

Um, if P(A → B) > P(B → A) then P(A) < P(B), not greater. Imagine if P(A → B) = 90% and P(B → A) = 30%. The only difference between the truth tables of A → B vs B → A is that the former is false only when A is true and B is false, and the latter is false only when B is true and A is false. So, P(A and ¬B) = 10% and P(B and ¬A) = 70%. The latter tells you that P(B) ≥ 70% and P(¬A) ≥ 70%, aka P(A) ≤ 30%. Therefore P(B) > P(A). To get a more general proof, use variables instead of 90% and 30%.
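
(Laying the same arithmetic out exhaustively, for anyone who wants it spelled out; the two stated probabilities pin down the difference P(B) - P(A) even though P(A) and P(B) individually stay free:)

    # Worlds are truth assignments (A, B); write p_TF for P(A and not-B), etc.
    # P(A -> B) = 1 - p_TF = 0.9  =>  p_TF = 0.1
    # P(B -> A) = 1 - p_FT = 0.3  =>  p_FT = 0.7
    p_TF, p_FT = 0.1, 0.7
    for p_TT in (0.0, 0.1, 0.2):          # the remaining mass, split arbitrarily
        p_FF = 1.0 - p_TT - p_TF - p_FT
        P_A, P_B = p_TT + p_TF, p_TT + p_FT
        print(P_A, P_B, P_B - P_A)        # the difference is 0.6 in every case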

By the way, “A → B” is just another way of saying “¬A or B,” and you just apply normal probability to the latter.

Ah, I think we are using two different meanings for the → sign.

In the Kolmogorov defn. of a probability space, we have to have a sigma-algebra (which specifies sets [“events”] and their union/intersection relations) before we assign any probabilities.  If A, B are in the sigma-algebra and A is a subset of B, this is interpreted as “if event A happens, event B must also happen.”  If we are taking A and B to be propositions, this means “A implies B.”

In the usual Bayesian framework, the events are propositions, but (our beliefs about) truth and falsehood are represented by probability assignments we give to the events, and we can only make these assignments if we have the sigma-algebra already.  So the sigma-algebra encodes implication relationships which we are supposed to assent to before we take the step where we say certain propositions are true (P=1) or false (P=0).

To use the classic example, the sigma-algebra will have “Linda is a feminist bankteller” (A) as a subset of “Linda is a bankteller” (B).  Then when we go and assign probabilities, the probability axioms tell us that we must respect this implication (A→B).  Among other things this will mean that we assign probability 1 to “¬A or B,” for the trivial reason that “¬A or B” is the set of all outcomes.  But this is not the sort of thing that the framework allows us to not know, and then figure out: it is fixed by the sigma-algebra at the outset.
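
(In literal set terms, with a toy outcome space:)

    # The Linda example as sets of outcomes (a toy sample space):
    outcomes = {"feminist teller", "non-feminist teller",
                "feminist non-teller", "other"}
    A = {"feminist teller"}                         # Linda is a feminist bankteller
    B = {"feminist teller", "non-feminist teller"}  # Linda is a bankteller
    print((outcomes - A) | B == outcomes)           # True: "¬A or B" is the whole
                                                    # space, so it gets P = 1 under
                                                    # any probability assignment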

So when I write things like P(A → B), I am talking about the sort of relation we normally get from the sigma-algebra.  Such a relation goes beyond the truth tables: the sigma-algebra normally tells us things like “if Linda is a feminist bankteller, Linda is a bankteller” which are true (in the relevant sense) even if Linda is neither of those things in reality (in which case the truth tables are mute).  There’s a connection to math progress here, in that often mathematicians are concerned about the consequences of assuming certain axioms but agnostic about the truth of the axioms; “the well-ordering theorem is equivalent to the axiom of choice” is interesting, even though you will be hard pressed to find people who think they’re both true or both false (it is contested what that would even mean!).

It sounds like you’re coming from Jaynes’ approach to probability, while I’m used to Kolmogorov; the two are close to equivalent, but I’ll have to think more about whether Jaynes’ version makes this problem easier.

evolution-is-just-a-theorem:

So if MIRI redid logical induction for uniform instead of pointwise convergence, how good would that be? (Just in terms of technical difficulty, i.e. what would it say about MIRI’s mathematical competence relative to other groups of basically-post-docs.)

Also, how much does having the pointwise convergence proof help in generating a uniform convergence proof?

Also, has anyone talked to MIRI about this? (Gosh it’s almost like the review process occasionally does some good).

@nostalgebraist, @jadagul

The thing is, the analogous uniform convergence statements would be false for the criterion they’ve formulated, and my sense is you’d need something very different to get them. So you’re asking about a hypothetical paper which would be about some other criterion; everything would depend on the quality of the replacement.

For uniform convergence, you’d need to be resistant to *every* trader at once, but with some relaxed notion of resistance so you can tolerate losing some money and the algorithm is supposed to lose less and less (against all traders) over time. This may be possible, but it’s a different goal that would require its own distinct setup and a different kind of argument.
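
(In quantifier terms, my paraphrase rather than the paper’s notation: the LI criterion has the shape “for every e.c. trader T, there is a bound B_T that T’s profits against the market never exceed.”  A uniform analogue would need something like “there is one schedule ε(n) → 0 such that, for all T at once, T’s achievable profit rate at time n is at most ε(n).”  The ∀ and ∃ trade places, which is why I’d expect a genuinely different construction rather than a patch on this one.)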

The friend I was talking to knows some people at MIRI, and I think he’s interested in talking to them about this stuff.

jadagul replied to your post “on MIRI’s “Logical Induction” paper”
Right. Any finite set of traders is defeated after finite (but arbitrary) time. But if you need to defeat an infinite set of traders, there’s no guarantee that you ever beat all of them.

Yeah, that seems like a good way of looking at it.

on MIRI’s “Logical Induction” paper

jadagul:

nostalgebraist:

When I first saw this paper, I said it looked impressive, and seemed a lot more substantial than MIRI’s other work, but I never really looked at it in much detail.  In the past week I’ve been having an extended conversation with a friend about it, and that spurred me to read it more closely.  I now think it’s much less impressive than it seems at first glance.

It occurred to me that the criticisms I made to my friend might be of wider interest, so I’ll write a post about them.

Summary:

The authors state something they call the “logical induction criterion,” meant to formalize a kind of ideal inference.  To satisfy this criterion, an inference procedure only needs to have certain asymptotic properties in the infinite-time limit.  Rather than being sufficient for good inference but too strong for practical computation (as the paper suggests), the criterion is too weak for good inference: it tolerates arbitrarily bad performance for arbitrarily long times.

The easiest way to see why this is problematic in practice is to consider the criterion-satisfying procedure constructed in the paper, called LIA.  Speaking very roughly, LIA makes a countably-infinite list (in unspecified/arbitrary order) of all the “mistakes” it could possibly make, and at discrete time n, does a brute force search for a set of beliefs which avoids the first n mistakes in the list.

Depending on the ordering of the list, LIA can make an arbitrarily bad mistake for an arbitrarily long time (some mistake has to go in slot number 3^^^3, etc.)  Nonetheless, LIA can be proven to asymptotically avoid every mistake, since for every mistake there is some time N at which it begins to avoid it.  Thus, LIA enjoys a very large number of nice-sounding asymptotic properties, but it converges for very different reasons than most algorithms one is used to hearing about: rather than converging to nice properties because it moves toward them, it simply exploits the fact that these properties can only fail in countably many ways, and ticks off those ways one by one.

Thus, LIA is like an “author-bot” which generates all possible character strings in some arbitrary order, and at each step consults a panel of readers to weigh in on its best work so far.  One could argue that this bot’s “asymptotic best work” will have any aspect of writerly brilliance imaginable, but that does not mean that the bot satisfies some “ideal writing criterion,” and indeed we’d expect its output in practice to lack any aspects of writerly brilliance.  (LIA differs from author-bot in that it has an objective, if very slow, way to find its “best work so far.”  But garbage in means garbage out.)

More detail under the cut

Keep reading

Good post, for those who are interested in such things.

I wanted to pull out one bit that I thought was interesting. Content warning for analysis.

Keep reading

Oh, yeah, it totally is a pointwise vs. uniform convergence thing!  Thanks, that is illuminating.

In the OP I wrote “it [the LI criterion] tolerates arbitrarily bad performance for arbitrarily long times,” but now I realize that’s not stating the full case, since it could be true even if we had uniform (but arbitrarily slow) convergence.  I linked your post in a Discord chat and said this:

i think “it’ll take an extremely long time, but eventually, the probabilities will be sensible” is implicitly imagining that the convergence is uniform, when it’s really pointwise.  like, we imagine that there’s some huge N such that if we wait until then, the probabilities will be generally “good,” i.e. will have a whole bunch of great properties at once.

but what we really have are just guarantees that each good property arrives sometime.  it doesn’t have to arrive simultaneously with properties that seem related, and we’re not guaranteed an overall “package deal” at any point (unless we can show that there’s an e.c. trader that gets us that whole package at once).
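
(For reference, the function-space version of the distinction: f_n → f pointwise means for every x and ε there is an N(x, ε); uniform means for every ε there is a single N that works for all x at once.  Here the “points” are the individual good properties, and the guarantees only ever hand you the first kind of N.)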

(via jadagul)

on MIRI’s “Logical Induction” paper

When I first saw this paper, I said it looked impressive, and seemed a lot more substantial than MIRI’s other work, but I never really looked at it in much detail.  In the past week I’ve been having an extended conversation with a friend about it, and that spurred me to read it more closely.  I now think it’s much less impressive than it seems at first glance.

It occurred to me that the criticisms I made to my friend might be of wider interest, so I’ll write a post about them.

Summary:

The authors state something they call the “logical induction criterion,” meant to formalize a kind of ideal inference.  To satisfy this criterion, an inference procedure only needs to have certain asymptotic properties in the infinite-time limit.  Rather than being sufficient for good inference but too strong for practical computation (as the paper suggests), the criterion is too weak for good inference: it tolerates arbitrarily bad performance for arbitrarily long times, and more generally, provides no guarantees whatsoever about the quality of the inference procedure’s output at any finite time.  Moreover (see remarks below about pointwise convergence), there is a principled reason for the lack of guarantees which is baked deep into the paper’s approach, and which I don’t think could be fixed without completely overhauling that approach.

The easiest way to see why the LI criterion is problematic in practice is to consider the criterion-satisfying procedure constructed in the paper, called LIA.  Speaking very roughly, LIA makes a countably-infinite list (in unspecified/arbitrary order) of all the “mistakes” it could possibly make, and at discrete time n, does a brute force search for a set of beliefs which avoids the first n mistakes in the list.

Depending on the ordering of the list, LIA can make an arbitrarily bad mistake for an arbitrarily long time (some mistake has to go in slot number 3^^^3, etc.)  Nonetheless, LIA can be proven to asymptotically avoid every mistake, since for every mistake there is some time N at which it begins to avoid it.  Thus, LIA enjoys a very large number of nice-sounding asymptotic properties, but it converges for very different reasons than most algorithms one is used to hearing about: rather than converging to nice properties because it moves toward them, it simply exploits the fact that these properties can only fail in countably many ways, and ticks off those ways one by one.
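
(A cartoon of that structure, to make the “ticks off mistakes one by one” mechanism concrete.  This is not MIRI’s actual construction, just its logical shape:)

    def lia_cartoon(mistakes, candidate_belief_states):
        """mistakes: a (conceptually infinite) list of predicates on belief
        states, in arbitrary fixed order; a bad ordering means bad behavior
        for an arbitrarily long time.  Assume, as in the paper's setting,
        that some candidate always avoids any finite set of mistakes."""
        n = 0
        while True:
            n += 1
            active = mistakes[:n]
            for beliefs in candidate_belief_states:      # brute-force search
                if not any(m(beliefs) for m in active):
                    yield beliefs                        # the beliefs at time n
                    break

    # Every individual mistake is eventually avoided, because each one enters
    # `active` at some finite n.  That is the entire content of the asymptotic
    # guarantees; nothing pushes the early outputs toward being any good.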

Thus, LIA is like an “author-bot” which generates all possible character strings in some arbitrary order, and at each step consults a panel of readers to weigh in on its best work so far.  One could argue that this bot’s “asymptotic best work” will have any aspect of writerly brilliance imaginable, but that does not mean that the bot satisfies some “ideal writing criterion,” and indeed we’d expect its output in practice to lack any aspects of writerly brilliance.  (LIA differs from author-bot in that it has an objective, if very slow, way to find its “best work so far.”  But garbage in means garbage out.)

EDIT 6/28/17: I’ve added some remarks about pointwise vs. uniform convergence.  The relevance of this distinction to the LI paper was originally brought to my attention by @jadagul​ in a response to this post, and I think it’s important enough that it should be discussed here for completeness’ sake.  To find the part I added, search for “pointwise.”

More detail under the cut

Keep reading

raginrayguns replied to your post “I rag a lot on Bayesianism, but when I think about it, there’s a…”

determining the variance of the prior from the data actually looks a lot like what the bayesian method does, when you use a hyperprior over the variance parameter, and think about the updates on each individual data point (only the data points at the beginning of the sequence will affect the hyperparameter, I’m thinking; the hyperparameter will stabilize before the actual parameters do. Hmm, I guess I don’t actually know this.) However I don’t see any bayesian justification for WHY you would use a hyperprior, whereas the algorithmic modeling framework… may provide an understandable justification for setting the hyperparameter from the data? I actually don’t get it. I don’t actually know whether adding a model selection step to cross validation where you choose a ridge regression lambda gives better results than just making one up before you look at the data. Somebody knows this but not me.
am I overcomplicating this? Let’s say you divide your dataset into training data X and test data Y. The log likelihood function decomposes as log p(X) + log p(Y|X). The algorithmic modeling approach is essentially just dropping log p(X).

I was thinking about this on the plane today, in relation to this paper I mentioned in a reply earlier.  The authors use a Gaussian prior over regression weights beta_i like in ridge regression, but instead of setting specific values for prior variance (tau_i), they introduce a hyperprior for it, a Jeffreys prior (p(tau_i) proportional to 1/tau_i).

They marginalize out the hyperprior, which makes the likelihood non-Gaussian, so you wouldn’t look at it and think “oh a Gaussian prior.”  And then they take MAP estimates of the betas because they want sparsity (it’s supposed to be a competitor to LASSO).  And … it does as well or better than standard regularization techniques, despite having no free parameters.  Which is pretty spooky.
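
If I’m reading their setup right (Gaussian prior with variance tau_i, p(tau_i) ∝ 1/tau_i), the marginalized prior on each beta_i comes out proportional to 1/|beta_i|, which at least makes the sparsity less mysterious: the marginal prior has an infinite spike at beta_i = 0, so MAP gets pulled toward exact zeros.  A quick numerical check of that integral (my code, not theirs):

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    def marginal(beta):
        # p(beta) = integral of N(beta; 0, tau) * (1/tau) dtau,
        # where tau is the prior *variance*
        f = lambda tau: norm.pdf(beta, loc=0.0, scale=np.sqrt(tau)) / tau
        val, _ = quad(f, 0.0, np.inf)
        return val

    for b in (0.5, 1.0, 2.0):
        print(b, marginal(b), 1.0 / abs(b))   # last two columns agree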

It also is hard to compare to what the algorithmic modeler would do.  They would make all the taus equal to a single regularization parameter and select the optimal value by cross-validation, grid-searching over some interval [a,b] of their choice.  This is close to having a uniform hyperprior on [a,b] that’s zero outside, except you aren’t averaging over all models in [a,b], you’re picking the best one.  So it’s like having that hyperprior and using MAP to estimate the taus.  Whereas in that paper, they average over the hyperprior and then do MAP at the end.

The usual idea in algorithmic modeling is that you should always have at least one parameter that controls model complexity, so you can optimize it with cross-validation.  If you don’t have such a parameter, it intuitively seems like you’re being wasteful – there is some optimal level of complexity, you usually can’t confidently derive it a priori, so if you don’t learn it from your data you’re making an assumption for no reason.  The algorithm in the paper, though, does learn the complexity from the data, by having a prior over it and then updating.  It just does this automatically, in a way that apparently works well, but it’s mysterious why.

As you said, it’s not clear what Bayesian reason there is to use a hyperprior here.  In the two examples I can find of Jaynes doing regression with an explicit prior for the coefficients (the seasonal trends in Ch 17 and the model comparison in Ch 20), he just uses a Gaussian and seems to think this choice doesn’t need justification.  You could maybe argue that the hyperprior is correct because it lets you use the principle of indifference – you introduce this variance, but you don’t actually know what it should be, so in order to be more agnostic, you use an uninformative prior on it, and now you have no more parameters to set.

But that’s still an a priori argument.  Why does it work in practice?  The arguments about indifference make more sense when you’re talking about possibilities you could actually measure (you don’t want to prematurely focus on some possibilities over others).  But our taus here are features of our uncertainty, not the real world.  You can’t measure them by updating on data (if the model is correct, the posterior variance will asymptotically go to zero).  So it’s still mysterious why it works.

About the log p(X) + log p(Y|X) thing – how does that work with cross-validation, where every data point gets to be part of Y in some fold?

@absolutefucker

trefethen pwns

truuuuu

For the unfamiliar: other great Trefethen stuff includes Chebfun (“our vision is to achieve for functions what floating-point arithmetic achieves for numbers”) and his paper about how traditional stability theory is misleading when your operators are non-normal.

(via absolutefucker)