He was nineteen years old, which meant that everything he said made sense and sounded good.

I have written a lot on this tumblr about (mostly against) “strong Bayesianism” or “Jaynesianism,” but I have mostly been silent about the pros and cons of Bayesian methods as they are actually practiced. This is, honestly, because I don’t know much about Bayesian methods as they are actually practiced, although I am trying to learn more.
Back when I wrote that Bayes masterpost, @raginrayguns rightly took me to task for ignoring “hedging” as a virtue of Bayesian modeling. Something that stands out to me when I read about Bayes-in-practice is that hedging is seen as extremely important – indeed, often as the whole point of the exercise.
This is quite different from the Jaynesian perspective, where both prior and posterior are representations of real beliefs, and hence it is important to get the prior “right” (through MaxEnt or something). In practical Bayesian work, the prior is treated more as a way to do model averaging; what matters is not whether it philosophically “reflects our beliefs in the absence of evidence” but whether it leads to averaging over models in a way we like.
You have probably seen it before, but that Gelman/Shalizi paper is relevant here – it says you should do hypothetico-deductivism with Bayesian models, where both the model class and the prior are falsifiable hypotheses.
One very intuitive (to me) justification for model averaging is automatic quantification of variance (and its consequences). If you just fit one “best” model, you can happily chug along making predictions with it, but you ought to worry about how much each of these predictions would have varied if you had fitted the model on slightly different data (with different noise, say). Since a Bayesian method effectively uses every model in the model class and averages over them, it perhaps captures this variability? I am used to seeing this done with the bootstrap, which directly generates “different data”; there is supposedly a connection between the bootstrap and the Bayesian thing (which uses only the real data but still uses multiple models), but I don’t fully understand it yet.
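For concreteness, here is a toy sketch of the bootstrap version of this idea (my own construction, with an invented linear-regression setup – nothing from the post itself): refit the "best" model on resampled data many times and look at how much one prediction moves around.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy line. We fit by least squares and ask how
# much the prediction at x_new would have varied under "different data".
x = np.linspace(0, 1, 30)
y = 2.0 * x + rng.normal(scale=0.5, size=x.size)

x_new = 0.75
preds = []
for _ in range(1000):
    # Resample the data with replacement, refit, and predict.
    idx = rng.integers(0, x.size, size=x.size)
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    preds.append(slope * x_new + intercept)

# The spread of these predictions estimates how fragile the single
# "best" fit is -- the thing a Bayesian average would also express.
spread = np.std(preds)
```

The Bayesian analogue would put a distribution over (slope, intercept) and get a predictive spread from the posterior instead of from resampled datasets.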
A superficially obvious “gotcha” argument goes like this: “even if some averaging is being done under the hood, the Bayesian model still just outputs conditional probabilities, like any probabilistic model. Thus ‘Bayesian averaging over model class C’ produces a single model for each training data set, and is thus choosing a single ‘best’ model from some other model class (call it C-prime). One could then argue that it would be better to average models from C-prime according to some prior, obtaining C-prime-prime, and so on ad infinitum.”
I haven’t really worked that through and I don’t know whether it truly makes sense. It also seems misleading in that it dismisses “averaging under the hood” as though this is a mere computational choice and can’t be discerned from the resulting conditional probabilities, but that doesn’t seem like it’s true. Except for special cases (involving Gaussianity/linearity), I have a hard time thinking of apparently non-Bayesian methods that can be re-written as Bayesian averages in a nontrivial way. (Random forests might be Monte Carlo sampling from trees according to likelihood? not sure.)
This suggests that there may be special features conferred by the Bayesian averaging process which can be read off of the results even if you didn’t know there was averaging under the hood, but if so, I don’t know what they are (or how to look for info on this).
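One tiny illustration of averaging being visible in the output (a toy I made up, not anything from the literature): average two fixed-mean Gaussian models by their posterior weights, and the posterior predictive is a two-component mixture – a distribution that lives outside the original model class, which no single "best" member of that class could produce.

```python
import numpy as np
from scipy.stats import norm

# Two candidate models: unit-variance Gaussians with different means.
data = np.array([0.1, -0.2, 0.3])
means = [-1.0, 1.0]
prior = np.array([0.5, 0.5])

# Posterior weight of each model: prior times likelihood of the data.
like = np.array([norm(m, 1.0).pdf(data).prod() for m in means])
post = prior * like
post /= post.sum()

# Posterior-predictive density at a point is a weighted mixture of the
# two Gaussians -- generally bimodal, hence not itself a Gaussian.
x = 0.0
pred = sum(w * norm(m, 1.0).pdf(x) for w, m in zip(post, means))
```

So at least in this trivial case, "averaging under the hood" does leave a fingerprint on the resulting conditional probabilities.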
In a machine learning context, Bayesian methods (relative to others) feel less solidly rooted in Breiman’s “algorithmic modeling” culture – like they still have one foot in the “data modeling” culture. There is a great deal of focus on technical methods for sampling from ~*~*the posterior*~*~, with the implication that it is clearly this great amazing thing and we are justified in going to great lengths to approximately compute it. This is a bit confusing to me since the posterior is just a combination of a model class and a prior, and the prior is often just some computationally convenient distribution (Gaussian, Dirichlet), so it seems like we’re working very hard to compute something whose definition we chose for our own convenience rather than its optimality.
Discussions of the Dirichlet process, for instance, often start out with talk of “adaptively choosing the number of clusters” – leading me to say “great, so what’s the best way to do that?” – and then jump into discussions of the Chinese restaurant process without telling me why the clusters should be generated in this way rather than any other.
(Actually, if someone can point me to a justification of the Dirichlet distribution that isn’t “it’s a conjugate prior, which is computationally convenient,” that would be helpful)
In spike sorting the restaurant is a single recording, each table is a neuron, and each customer is an action potential waveform.
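The Chinese restaurant process itself is simple to write down, whatever one thinks of its justification. Here is a bare-bones sampler (the standard textbook construction, not tied to any particular spike-sorting pipeline): each new customer joins an existing table with probability proportional to its size, or opens a new table with probability proportional to a concentration parameter alpha.

```python
import numpy as np

def crp(n_customers, alpha, rng):
    """Sample table assignments from a Chinese restaurant process."""
    tables = []       # tables[k] = number of customers seated at table k
    assignments = []  # table index chosen by each customer in turn
    for _ in range(n_customers):
        # Existing tables weighted by occupancy; one extra slot,
        # weighted by alpha, stands for opening a new table.
        weights = np.array(tables + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[k] += 1     # join an existing table
        assignments.append(k)
    return assignments

rng = np.random.default_rng(0)
seating = crp(20, alpha=1.0, rng=rng)
```

The "rich get richer" weighting is what makes the number of occupied tables grow slowly (roughly like alpha times log n), which is the sense in which the number of clusters is chosen "adaptively".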
Easily a third of the people I follow on here would’ve been at least minor historical figures had they been born to aristocratic families before the 1800s.
Shout-out to the Bellman equation, which is cool and useful and also has one of those great derivations that feels like a joke
Could you explain? Wikipedia isn’t revealing the humour.
(@digging-holes-in-the-river also asked)
I guess I find it funny because I first read about it in the context of Q-learning, where the goal is to find the best policy (the best action at every time), and at first I was like “what the hell? instead of learning a policy directly, they’re doing all this work to learn this weird other thing called Q?”
But then the Bellman equation shows why if you know Q (as a function), you immediately know the optimal policy. And the proof is like a punchline, where you suddenly see why you should care about Q. As a dialogue:
A, a rash neophyte: “I want to always know which action to take to maximize my (time-integrated discounted) rewards.”
B, ancient and wise: “Ah! Then you’ll be interested in this magical function I call Q. It tells you the maximum time-integrated discounted reward you could possibly get, starting from the situation you’re in.”
A, a rash neophyte: “Why would I care about that? If it tells me ‘you could achieve a time-integrated discounted reward of 104282.3,’ I still won’t know how to get that reward. The function would just be teasing me!”
B, ancient and wise: “But tell me, do you agree that the maximum time-integrated discounted reward right now equals the maximum reward on the next step, plus the maximum time-integrated discounted reward from all the other steps?”
A, a rash neophyte: “… duh? Are you trolling me?”
B, ancient and wise: “But if you pull a discount factor out of the second term, it’s just the maximum time-integrated discounted reward at the next state.”
A, a rash neophyte: “… and?”
B, ancient and wise: “We have a name for that. It’s Q, evaluated at the next state. So Q_t is just the maximum reward from the next step, plus the discount factor times Q_{t+1}.”
A, a rash neophyte: “Wait, so if I knew how to calculate Q, I could find the best action just by plugging actions and immediate results into the equation? I don’t have to think about the entire infinite future, just the next step? Why didn’t you tell me Q was so amazing?”
B, ancient and wise: “I did, young one. I did.”
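B’s trick can be run numerically on a toy MDP (one I invented for illustration – the states, actions, and rewards are arbitrary): iterate the Bellman backup until Q converges, then read off the optimal policy with a single argmax per state, no infinite future required.

```python
import numpy as np

# A tiny deterministic MDP: 3 states, 2 actions. next_state[s][a] and
# reward[s][a] give the transition and immediate reward; state 2 absorbs.
next_state = [[1, 2], [0, 2], [2, 2]]
reward = [[0.0, 1.0], [0.0, 5.0], [0.0, 0.0]]
gamma = 0.9  # discount factor

Q = np.zeros((3, 2))
for _ in range(100):
    # The Bellman backup from the dialogue: immediate reward plus
    # gamma times the best achievable Q at the next state.
    Q = np.array([[reward[s][a] + gamma * Q[next_state[s][a]].max()
                   for a in range(2)] for s in range(3)])

# Knowing Q, the optimal action at each state is just an argmax.
policy = Q.argmax(axis=1)
```

Here the policy correctly takes the detour at state 0 (action 0, toward the reward of 5 at state 1) rather than grabbing the immediate reward of 1, which is exactly the look-one-step-ahead reasoning A was surprised by.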
(via szhmidty)
Invisible chairs, swims to Baltimore, clamping-on, swirlees, water bottles held at arm’s length, around the world tours, Good Night Jane Fonda calls, running deck to deck/outside the company area, ping pong and Midshipmen grenades or other acts that require undue physical or emotional stress are strictly prohibited.
Say what you will about @brazenautomaton’s status theories as they apply to the wider world – they are, at least, a witheringly accurate characterization of the way people react when @brazenautomaton talks about status.