Fluid dynamics as thinking (thoughts so far):

Velocity points in the direction of (believed) logical implication (or perhaps some other sort of “implication” – some sense of “this naturally follows this”).

A tracer field (“dye”) represents “the thoughts currently being entertained.”  The tracer follows fluid parcels, which move in the direction of their velocity.  If I am thinking A, and A implies B, then after a bit of time elapses, I am now thinking B.  (I may be able to think about a number of things simultaneously, depending on how the tracer is distributed.)

Each fixed point in physical space (“Eulerian perspective”) represents the same proposition or concept for all time.  A fluid parcel (“Lagrangian perspective”) changes its meaning from time to time, following paths of implication.

Assuming incompressible flow for now.  What exactly does this mean?  (If tracer is treated as a density over space, means “the amount of stuff I am thinking about is constant in time.”)

How does the velocity (implication) field evolve in time?  In incompressible flow, vorticity (the curl of velocity) is conserved on fluid parcels, which move with the velocity, and the velocity at each time is determined by the vorticity field.  What is the meaning of vorticity here?  “The rate at which circular implications near this parcel move”?  So some parcels have the property of encouraging “fast thinking” nearby, while others encourage slow or no thinking.

Does the direction of vorticity/circulation have a meaning?  In terms of our interpretation so far, no: a loop of implication can be equally well represented as a counterclockwise loop or a clockwise one.  But direction matters for dynamics: two vortices of the same sign interact differently from two vortices of different signs.  How to interpret in terms of thinking?

“Loops of implication”: depending on boundary conditions, streamlines may or may not be able to end in the fluid.  If they can’t, all reasoning is “circular”: A implies B implies C etc. which eventually implies A.

Steady flow: endless thought loop.

Need a way to relate ideas besides implication, based on distance in physical space.  This was already implicitly present when we talked about vorticity: vorticity is a measure of “nearby circulation,” but this has no interpretation unless being nearby means something about two fluid parcels in the “thinking” interpretation.  Perhaps nearby ideas are felt to be “related”?  Consider: at any point, the velocity field only points in one direction, but there are other directions it could have pointed.  Thinking about A could have resulted in thinking about C rather than B – but only if C is also close to A (just in a different direction).

Adding tracer diffusion would allow the fluid to begin to think about ideas related to A because it is thinking about A.

Turbulent flow tends to “stir up” a tracer field, producing large gradients, which will then be smoothed out if diffusion is present, “mixing” the tracer.  (Stirring milk in coffee results first in complicated patterns of milky and not-milky strands, then in homogeneous somewhat-milkiness everywhere.)  Interpretation: turbulence/stirring creates a state of thought in which, for many sets of closely related ideas, some are being thought about, but others are not.  If tracer diffusion is present, this state then turns into a state where all ideas in some related area are being thought about simultaneously.
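
The stirring-then-mixing picture is easy to demonstrate in one dimension.  A minimal sketch (the striped initial field, step count, and diffusion rate are arbitrary illustrative choices, not anything from the post): diffusion conserves the total amount of tracer while erasing the fine-scale contrast, so "some related ideas entertained, others not" relaxes toward "the whole area entertained at a uniform moderate level."

```python
import numpy as np

# A 1D "tracer" striped at the finest scale, as if stirring had already
# drawn the milk into thin alternating strands.
n = 100
tracer = np.tile([1.0, 0.0], n // 2)

def diffuse(field, steps, d=0.1):
    """Explicit diffusion on a periodic domain: each step nudges every
    cell toward the average of its neighbors, smoothing out gradients."""
    for _ in range(steps):
        field = field + d * (np.roll(field, 1) - 2 * field + np.roll(field, -1))
    return field

smoothed = diffuse(tracer, steps=200)

# Total tracer is conserved, but the contrast is essentially gone:
# every cell ends up near the homogeneous value 0.5.
print(tracer.max() - tracer.min())      # 1.0 initially
print(smoothed.max() - smoothed.min())  # ~0 after diffusion
```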

Dimensionality of the fluid: number of different “ways” ideas can be locally related.  No reason to stick to 2 or 3 (but I know nothing about “higher-dimensional” fluid mechanics – does it exist/work?).

differentialprincess:

real analysis “what the fuck” moment of the day: bump functions exist

you can do such weird things with non-analytic real functions

i love these guys!

Right now instead of thinking about work, my mind is insisting on trying to think of a way that fluid dynamics concepts can be put into correspondence with mental concepts, so that fluid flow could be interpreted as corresponding to some (perhaps unusual or illogical) kind of “thinking.”

raginrayguns:

nostalgebraist:

raginrayguns:

fnord888:

nostalgebraist:

In a curve-fitting method (such as polynomial regression), I may want to penalize very “wiggly” curves, because I know that noise tends to make small samples look wigglier than they are.  (Given a few noisy points from a straight line, I want my method to find the line and ignore the noise, not over-fit the noise with wiggles.)  However, I do not believe that “wiggly” curves are a priori unlikely!

If I may jump in here…

You don’t believe that “wiggly” curves in general are a priori unlikely (well, depending on your priors :P), but any particular “wiggly” curve has a lower probability than a particular linear curve. This is true for more or less the same reason that “wiggly” curves are prone to overfitting in the first place: because there are more parameters, each of which is individually uncertain.
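
The point about extra parameters chasing noise can be made concrete.  A hypothetical sketch (the line y = 2x, the noise level, and the polynomial degrees are all made up for illustration): a degree-9 polynomial through 10 noisy points from a straight line always achieves a smaller training residual than the line fit, because the line is nested inside the bigger model, which is exactly the overfitting risk.

```python
import numpy as np

rng = np.random.default_rng(0)

# A few noisy samples from the straight line y = 2x.
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(scale=0.1, size=x.size)

# Fit a line and a degree-9 polynomial (one parameter per data point).
line = np.polyfit(x, y, 1)
wiggly = np.polyfit(x, y, 9)

def training_error(coeffs):
    """Sum of squared residuals on the observed points."""
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2))

# The wiggly fit essentially interpolates the noise, so its training
# error is (nearly) zero; the line is left holding the noise instead.
print(training_error(wiggly) <= training_error(line))  # True
```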

okay imma try and develop this more when I get home, but that’s actually not the important thing either!

Point estimates of curves can be smooth even if the true curve is known not to be. I mean… if non-wiggly curves have lower probability or even no probability, then still the point estimate, which is an average over all possible curves, might not have wiggles.

(like how if a number is either 0 or 10 the best guess can be 5, even though there’s no chance of it being exactly right)
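
The 0-or-10 example is worth spelling out: under squared-error loss, the impossible guess 5 beats both of the possible values.  (Trivial arithmetic, just made explicit.)

```python
# If the quantity is 0 or 10 with probability 1/2 each, the guess that
# minimizes expected squared error is the mean, 5, even though 5 has
# no chance of being exactly right.
def expected_sq_error(guess, values=(0.0, 10.0), probs=(0.5, 0.5)):
    return sum(p * (v - guess) ** 2 for v, p in zip(values, probs))

print(expected_sq_error(5))   # 25.0
print(expected_sq_error(0))   # 50.0
print(expected_sq_error(10))  # 50.0
```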

I think I can illustrate this with wavelet regression, i’ll maybe get around to it

raginrayguns: that’s true if we take the posterior mean, but I was figuring we were taking the posterior mode (“maximum a posteriori estimation”), since that’s how you can get the same point estimate out of ridge regression and Bayes with a Gaussian prior (see here).  Unless I’m misunderstanding your point.
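
The ridge/Gaussian correspondence mentioned here can be checked numerically.  A sketch on made-up data (the design matrix, penalty, and noise level are arbitrary): the penalized least-squares solution and the Gaussian-prior posterior mean, which equals the posterior mode since the posterior is Gaussian, come out identical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary toy data: 20 observations, 3 coefficients.
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=20)
lam = 2.0  # ridge penalty strength

# Ridge regression, solved as an augmented least-squares problem:
# minimize ||y - X b||^2 + lam * ||b||^2.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(3)])
y_aug = np.concatenate([y, np.zeros(3)])
ridge = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

# Bayes: prior b ~ N(0, (sigma^2/lam) I), noise N(0, sigma^2 I).
# The posterior is Gaussian with mean (X'X + lam I)^{-1} X'y,
# so mean and mode coincide, and both match the ridge solution.
posterior_mean = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(np.allclose(ridge, posterior_mean))  # True
```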

*snip*

(Would anyone just come along and say “hey guys, my a priori prejudices about the parameters in this statistical model follow a normal distribution” if they weren’t doing it to justify the already existing practice of ridge regression?  Do you really believe in those Gaussians before any data comes in, or are they just a means to an end?  Would you accept bets on the basis of them?)

MAP estimates have no justification, as far as I know. Bayesians do use them sometimes… I’d guess? But I’ve also seen the phrase “MAP is crap” in the literature. (in the book Bayesian Methods in Structural Bioinformatics). My Bayes teachers bring up MAP but I’ve never been advised to use it in any particular case, while I have been advised not to use it.

So… I don’t think MAP estimates are the things to talk about when we’re talking about the point estimates that Bayesian methods produce. Certainly not if we’re trying to justify Bayesian methods, because as I said, MAP estimates have no justification even to a Bayesian.

The posterior mean on the other hand has a very natural justification: it’s the point estimate that minimizes the expected squared error. (hereafter MSE, mean squared error.) So it’s the more natural bayesian analogue of the frequentist idea of regularization, justified through the bias-variance decomposition of MSE.
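
That justification is one line of standard calculus (nothing specific to this thread): for any guess $a$,

```latex
\mathbb{E}\left[(\theta - a)^2\right]
  = \mathbb{E}\left[(\theta - \mathbb{E}[\theta])^2\right]
    + \left(\mathbb{E}[\theta] - a\right)^2 ,
```

where the first term does not depend on $a$ and the second is zero exactly when $a = \mathbb{E}[\theta]$, so the posterior mean is the unique minimizer of expected squared error.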

The ridge regression estimate, besides being the MAP estimate, is also the posterior mean with a normal prior. This is how I’d explain its low MSE.

Because, do I believe that normal prior? Well, not really, not precisely, quantitatively. But think of the prior that corresponds to unpenalized regression, the uniform prior. Do I believe that each regression coefficient has the same probability of being in (1, 2) as in (1000001, 1000002)? Not usually, and even when the coefficient can be that high, there’s some value that’s just too astronomically high to be plausible. That’s what the normal prior gets right.

The normal shape is for conjugacy, for computation. Actually, I can quote Marina Vannucci to demonstrate that Real Bayesians think this way. I wrote down this quote during a lecture on variable selection in linear regression: “The construction is the conjugate case because that’s to make our lives easier when it comes to implementing the MCMC.” But there’s a parameter you can set in the conjugate prior to make it uniform, and we don’t, because of our knowledge that some proposed coefficients are just too big.

Oh, sorry, I missed that posterior mean would give you back the same thing as posterior mode here!  (I guess which point estimate to report depends on which loss function you care about, but MSE is probably much more practically relevant than the kind of loss function that would make the mode best, I think?)

I take your point about the Gaussian being more intuitive than the uniform here.  But what about the task of choosing between different regularizers?  Ridge regression corresponds to Gaussian priors, while Lasso regression corresponds to Laplace distribution priors.  (Well, OK, it only does if you use the posterior mode, and if you use the posterior median or mean you get a “Bayesian Lasso” that gives results somewhere between ridge and Lasso.)  Note that the Laplace distribution also has the property you mention, of falling off as the weights get bigger.

It seems to me like one can look at constructing these methods in two different ways: an “engineering” way, where you think about giving the method some performance properties you want, and a “belief introspection” way, where you figure out what you really think about the weights a priori, and then Bayes update from there.

In this case – and perhaps in general with this sort of thing – the “engineering” way seems much more intuitive to me.  I can understand why someone might want to use Lasso regression because of its properties as a method (e.g. it tends to drive some weights to zero, which makes it easier to interpret).  I have a much harder time imagining someone just deciding “ah, my state of ignorance about these weights is represented by a Laplace distribution,” unless they had really done the engineering thought process first, and were then translating it into Bayesian terms.
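
The contrast between the two regularizers is easiest to see in the orthonormal-design case, where both act coefficient-by-coefficient on the OLS estimates.  A sketch with made-up numbers, using the standard closed forms for minimizing (1/2)||y - Xb||^2 + lam * penalty when X'X = I:

```python
import numpy as np

# OLS coefficient estimates under an orthonormal design (X'X = I).
beta_ols = np.array([3.0, 0.4, -1.5, 0.1])
lam = 0.5

# Ridge (Gaussian prior): every coefficient is shrunk proportionally,
# b / (1 + lam); nothing ever becomes exactly zero.
beta_ridge = beta_ols / (1 + lam)

# Lasso (Laplace prior, posterior mode): soft-thresholding,
# sign(b) * max(|b| - lam, 0); small coefficients land exactly on zero.
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

print(beta_ridge)  # all four entries shrunk, none exactly zero
print(beta_lasso)  # the 0.4 and 0.1 entries driven exactly to zero
```

This is the sparsity property mentioned above: under the Laplace-prior mode, coefficients below the threshold drop out of the model entirely, which is what makes the Lasso fit easier to interpret.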

I see this all as relevant to the broader questions about “Bayesianism” as a philosophy because when people talk about Bayesianism they always talk about people having these beliefs represented as probabilities.  But when we talk about Bayesian methods like this, it’s really very difficult to think about things that way.  (“Is your prior for the weights Gaussian or Laplace?” Wait, is that the sort of thing I’m supposed to have an opinion about?)  And it seems like what one ends up doing is running through the engineering analysis, and then going back and translating it into the prior you “must have had” to license applying the method you want to use.  But throw out the last step, and you’re just a frequentist.  (Except you get a whole distribution out instead of a point estimate which I guess is nice)

(via raginrayguns)

raginrayguns:

fnord888:

nostalgebraist:

In a curve-fitting method (such as polynomial regression), I may want to penalize very “wiggly” curves, because I know that noise tends to make small samples look wigglier than they are.  (Given a few noisy points from a straight line, I want my method to find the line and ignore the noise, not over-fit the noise with wiggles.)  However, I do not believe that “wiggly” curves are a priori unlikely!

If I may jump in here…

You don’t believe that “wiggly” curves in general are a priori unlikely (well, depending on your priors :P), but any particular “wiggly” curve has a lower probability than a particular linear curve. This is true for more or less the same reason that “wiggly” curves are prone to overfitting in the first place: because there are more parameters, each of which is individually uncertain.

okay imma try and develop this more when I get home, but that’s actually not the important thing either!

Point estimates of curves can be smooth even if the true curve is known not to be. I mean… if non-wiggly curves have lower probability or even no probability, then still the point estimate, which is an average over all possible curves, might not have wiggles.

(like how if a number is either 0 or 10 the best guess can be 5, even though there’s no chance of it being exactly right)

I think I can illustrate this with wavelet regression, i’ll maybe get around to it

raginrayguns: that’s true if we take the posterior mean, but I was figuring we were taking the posterior mode (“maximum a posteriori estimation”), since that’s how you can get the same point estimate out of ridge regression and Bayes with a Gaussian prior (see here).  Unless I’m misunderstanding your point.

fnord888: that’s a really good point and I don’t know enough about this area to be sure what to say about it.  I guess my first stab at a response would be that the general norm of “penalize complexity” doesn’t uniquely fix what the prior distribution should look like (how much should complexity be penalized, and what metric do we use for “complexity”?), and to pin it down exactly one needs to look to practical concerns.

(Well, either practical concerns or some a priori sense of what the “right” way to penalize complexity is.  Are there any serious proposals in this direction, for the specific case of regression?)

“Practical concerns” of course amount to knowing about how different situations call for different methods, and this can be considered a kind of “prior information,” which can then be translated back into a Bayesian prior.  But if there’s no better way to get this prior than to study the performance of your tools, then you’re basically just doing frequentism.

(Would anyone just come along and say “hey guys, my a priori prejudices about the parameters in this statistical model follow a normal distribution” if they weren’t doing it to justify the already existing practice of ridge regression?  Do you really believe in those Gaussians before any data comes in, or are they just a means to an end?  Would you accept bets on the basis of them?)

(via raginrayguns)

raginrayguns.tumblr.com →

scientiststhesis:

nostalgebraist:

scientiststhesis:

raginrayguns:

nostalgebraist and hot-gay-rationalist are talking about probability. The relevant reblog threads are here: one, two, three.

I think I understand both of them better than they understand each other so here goes:

Keep reading

(via scientiststhesis-at-pillowfort)

hot-gay-rationalist reblogged your post and added:

Right, I tried to explain some of it above, let me…

Okay, thanks for clearing that up.

There are a number of threads in this conversation now (quality of various texts about Bayes, usefulness of Bayesian vs. non-Bayesian methods in real practice, “foundedness” of frequentist probability theory, etc.), any of which seem like they could be rabbit holes.

What I’m most interested in is trying to convey my original point about plausibilities (“1b” in an earlier post).  I tried to describe this objection in a way that involved probabilities without Bayes and you objected to this.  I don’t think resolving that issue is necessary; my objection can be reframed.

You talk about how any non-Bayesian method will violate certain “desiderata” and that you find this a very satisfying property of Bayes.  Okay, but remember that in addition to the Cox axioms, one of the “desiderata” is the injunction that you should have plausibilities to begin with.  It is this that I am objecting to.  (I feel like I have said this several times, but it bears repeating.)

This is important in practice because typically – or this is my impression anyway – what distinguishes “Bayesian methods” from “non-Bayesian methods” is that Bayesian methods are the ones which demand that you have a prior.  If we were all perfect Jaynesbots we would actually have priors that we could just plug into these models.  As it is, in some scientific applications we have real “prior knowledge” that can be straightforwardly expressed in a probability distribution, and sometimes we have no idea what our prior “should be” and simply choose it in one or another ad hoc way, often simply because it’s easy to work with.

If we were certain of our priors, perhaps Bayes would be the way to go.  If our method essentially asks us to “make up” a prior, then it is asking us to assign plausibilities in a case where we (unlike the perfect Jaynesbot) don’t feel we have any such priors.  The “desideratum” lost in choosing the non-Bayesian method here is not any of the Cox axioms, but the injunction to assign plausibilities.  Again, if one really has a sense of the plausibilities then so be it, but it is worth asking whether the best method of induction is the one that always asks us to make up plausibilities even when we don’t have them.

There is reason for doubt on this point.  Even if there are results that show that all good methods are (in some sense) close to Bayesian, that just means Bayesian with some prior; that doesn’t assure us that a Bayesian method with a prior we pulled out of our asses will work especially well.  In some contexts, in fact, it is known that pulling a prior out of your ass is dangerous.  Diaconis and Freedman discuss one such case in this paper, which ends this way:

[image: the closing passage of the Diaconis and Freedman paper]

Rather than focusing on the details of the problem dealt with here, I want to point to the general style of reasoning.  D&F say that for some types of problems, a prior chosen “for mathematical convenience” will work pretty well, but in other types of problems “arbitrary details of the prior can really matter; indeed, the prior can swamp the data, no matter how much data you really have.”

Note that we are now very far away from Jaynesbots who simply have plausibilities (by assumption) and merely need to know how to represent and update them; we’re now in a context where we may have no strong feelings about what the prior should be, and in which fine details of this arbitrary prior may have a large impact on our conclusions.

In this case I don’t see how what I’ve read in, say, Jaynes would bear on the issue.  Jaynes tells us how to build ideal inductive creatures under the condition that they have plausibilities to begin with; he doesn’t tell us that we should make up plausibilities if we don’t have them simply to be more like these creatures.  In short, the desideratum “always assign plausibilities, and just make them up if you have no real knowledge from which to construct them” doesn’t seem to have much intuitive force to me, and it’s problematic in practice in some cases, so I’d rather not bind myself to it literally all the time.

(I realize that Jaynes is an objective Bayesian and thinks that there is simply a right way to construct priors no matter how little you know.  But I’ve already expressed some of my misgivings about his approach in practice, and intuitively it feels sort of backwards to me – like Jaynes decides at first that you should always have plausibilities for some unspecified reason, and then tries to figure out what the plausibilities should be even when what you know is less than a set of plausibilities, so that e.g. he instructs you to choose a Gaussian when you know the mean and variance, when what you really know is just a mean and variance, not a whole function’s worth of plausibility assessments.)

hot-gay-rationalist your latest response covers a lot of ground and contains a lot of statements I don’t agree with.  I would like to give it a thorough response at some point (which I can’t do now b/c I’m at work).

However, before I do that, it might be helpful if you could clear up something for me.  There’s this running theme in our conversations where I talk about “probability” outside of a Bayesian context, and you say that probability doesn’t make sense to you outside of Bayesianism.

The way I’ve been Principle of Charity-ing this is that you mean you're familiar with the use of probability by non-Bayesians, but you think there is something wrong with it.  (This would fall into a long tradition of people in academic discussions using “I don’t understand X” as a coy/polite way of saying “I’m familiar with X, but I think it’s incoherent.”)  However, it occurs to me that you may simply be saying you’re only familiar with the probability axioms as a consequence of the Cox argument, and so you don’t get why I’d invoke the same mathematical structure in another context.

The thing is, most people, even Bayesians, don’t encounter the probability axioms first in the context of the Cox argument.  The usual presentation, even in textbooks/classes by Bayesians, involves starting with probabilistic events we have intuitive notions about – such as coins or dice – and then presenting mathematical probability as a formalization of those intuitions.  Most people feel like they know a six-sided die has a “one-sixth chance” of coming up on each individual side, even if they don’t quite know what they mean by that; the probability axioms formalize this kind of basic intuition, and related intuitions involving independence and dependence of events, while leaving “probability” as an undefined primitive.

Most people find probability most familiar from this kind of context, as sort of an “intuitive theory of coins and dice”; this is why I keep coming back to coin flips in my examples (which is standard practice in math classes).  The usual course of events would be to learn this first, and then see the Cox argument and think “whoa!  If I accept some postulates and then reason out their consequences, it turns out that the theory of ‘assigning plausibilities to hypotheses’ should be isomorphic to the 'theory of random events’ that I already know about!”  Like, for most people, “probability as a description of coin flips” is the most intuitive setting and “the thing that describes coin flips also provides a norm for beliefs” is a non-obvious discovery.

So, if you haven’t done so (and it’s only now occurring to me that you may not have), I recommend you read a presentation of the theory of probability in a setting about coins or dice and not beliefs (which you can find in almost any probability text that isn’t Jaynes), and try to put aside partisanship about Bayes vs. frequentism and just see if the ideas in themselves make intuitive sense.  OTOH, if you have done this but think that probability is somehow wrong or nonsensical as a theory of random events like coin flips, could you explain why?  (I know you think it’s more general, but that doesn’t mean the specific case is wrong; the complex numbers are more general than the reals, but that doesn’t mean the reals in themselves “don’t make sense.”)

I hope this doesn’t seem condescending.  I just feel like I have to say it explicitly so I can get a handle on what you’re really saying.

hot-gay-rationalist reblogged your post “hot-gay-rationalist said: Hey, cou…” and added:

Hmm… I find the idea of a prior pretty compelling, to be honest, so we’re diverging in…

What I mean by “maximum likelihood” has nothing to do with priors – it’s just, you have some space H of hypotheses, and some observed data D, and the “likelihood function” for any hypothesis in H is defined as the probability of observing D if that hypothesis is true.  Then the maximum likelihood hypothesis in H is the one that has the highest likelihood function.

So, if we have a coin, and our space H is parameterized by p_H (probability of heads) in [0,1], with p_T = 1 - p_H, and we see two flips that are both heads (D = (H,H)), then what is the maximum likelihood hypothesis?  It’s p_H = 1, because then the likelihood is 1, where for any other p_H it would be less than 1 (because there would have been some probability of seeing something other than two heads).

Note that we don’t need to have a prior over H to do this.
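
The two-heads example can be spelled out numerically (a sketch of exactly the calculation described, nothing extra): the likelihood of D = (H, H) is p_H squared, which is maximized at the boundary.

```python
import numpy as np

# Likelihood of observing two heads, as a function of p_H.
p = np.linspace(0.0, 1.0, 101)
likelihood = p ** 2

# The likelihood rises monotonically, so the maximum sits at p_H = 1;
# no prior over the hypothesis space enters anywhere.
print(p[np.argmax(likelihood)])  # 1.0
```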

In fact, this is a case where the maximum likelihood answer looks kind of silly, and the reason is that we do have prior knowledge of what “coins” are like (fair coins are a lot more common than unfair ones).  So a Bayesian prior might give us more sensible answers here.
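
To make “a Bayesian prior might give us more sensible answers” concrete: with a Beta prior on p_H (the Beta(10, 10) below is a hypothetical “coins are usually near-fair” choice, not one proposed in the post), conjugate updating keeps the estimate near 1/2 even after two heads.

```python
# Beta(a, b) prior on p_H; after h heads in n flips the posterior is
# Beta(a + h, b + n - h), with mean (a + h) / (a + b + n).
a, b = 10, 10   # hypothetical prior concentrated around fairness
h, n = 2, 2     # the data: two flips, both heads

posterior_mean = (a + h) / (a + b + n)
print(posterior_mean)  # ~0.545: nudged above 1/2, far from the ML answer of 1
```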

In a case where we truly don’t know anything beforehand, there are still reasons not to use maximum likelihood, for reasons described in that Shalizi post I linked – they come down to “you don’t want to overfit sampling noise.”  Having a Bayesian prior is one way to do this, but not necessarily the only one.

I don’t see Bayes as a method for fitting data, exactly; fitting data is one of the things you do with it, yes, but a Jaynesbot has a lot to say about stuff other than fitting data. The philosophical reasoning behind Jaynesbots tries to pin down truth-reasoning uniquely, so that the robot is a generalised reasoner that can think about anything that could possibly be true.

I think by “fitting data” I mean the same thing you mean by “truth-reasoning”: deciding how to combine the set of concepts you can think about (hypotheses) and the information you get from the world (data) to produce inductive knowledge.  Is there a particular idea you have in mind that you wouldn’t say falls under “fitting data”?  (“Inductive inference” might be a better phrase.)