Shout-out to the Bellman equation, which is cool and useful and also has one of those great derivations that feels like a joke
Could you explain? Wikipedia isn’t revealing the humour.
(@digging-holes-in-the-river also asked)
I guess I find it funny because I first read about it in the context of Q-learning, where the goal is to find the best policy (the best action at every time), and at first I was like “what the hell? instead of learning a policy directly, they’re doing all this work to learn this weird other thing called Q?”
But then the Bellman equation shows why if you know Q (as a function), you immediately know the optimal policy. And the proof is like a punchline, where you suddenly see why you should care about Q. As a dialogue:
A, a rash neophyte: “I want to always know which action to take to maximize my (time-integrated discounted) rewards.”
B, ancient and wise: “Ah! Then you’ll be interested in this magical function I call Q. It tells you the maximum time-integrated discounted reward you could possibly get, starting from the situation you’re in and the action you take next.”
A, a rash neophyte: “Why would I care about that? If it tells me ‘you could achieve a time-integrated discounted reward of 104282.3,’ I still won’t know how to get that reward. The function would just be teasing me!”
B, ancient and wise: “But tell me, do you agree that the maximum time-integrated discounted reward right now equals the maximum reward on the next step, plus the maximum time-integrated discounted reward from all the other steps?”
A, a rash neophyte: “… duh? Are you trolling me?”
B, ancient and wise: “But if you pull a discount factor out of the second term, it’s just the maximum time-integrated discounted reward at the next state.”
A, a rash neophyte: “… and?”
B, ancient and wise: “We have a name for that. It’s Q, evaluated at the next state with the best action there. So Q_t is just the reward from the next step, plus the discount factor times the best Q_{t+1}.”
A, a rash neophyte: “Wait, so if I knew how to calculate Q, I could find the best action just by plugging actions and immediate results into the equation? I don’t have to think about the entire infinite future, just the next step? Why didn’t you tell me Q was so amazing?”
B, ancient and wise: “I did, young one. I did.”
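To make the punchline concrete, here is a minimal Python sketch under some loudly stated assumptions. The toy problem is made up: a four-cell corridor where stepping into the last cell pays a reward of 1 and ends the episode, with a discount factor of 0.9. The names `step`, `solve_q`, and `greedy_action` are hypothetical, chosen only for illustration. The sketch fills in Q by repeatedly applying the equation from the dialogue, roughly Q(s, a) = r + gamma * max over a' of Q(s', a'), and then picks actions by comparing Q values one step ahead, with no further thought about the infinite future.

```python
# Hypothetical toy problem: a corridor of 4 cells (states 0..3).
# Cell 3 is the goal and is terminal. All names and numbers are illustrative.
GAMMA = 0.9
N_STATES = 4
ACTIONS = ("left", "right")


def step(state, action):
    """Deterministic toy dynamics: return (next_state, reward)."""
    if action == "left":
        nxt = max(state - 1, 0)
    else:
        nxt = min(state + 1, N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def solve_q(sweeps=50):
    """Apply the Bellman equation repeatedly until Q settles."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(N_STATES - 1):  # the terminal state keeps Q = 0
            for a in ACTIONS:
                nxt, r = step(s, a)
                # Immediate reward plus discounted best Q at the next state.
                q[(s, a)] = r + GAMMA * max(q[(nxt, b)] for b in ACTIONS)
    return q


def greedy_action(q, state):
    """Once Q is known, the best action is a one-step lookup."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


Q = solve_q()
policy = [greedy_action(Q, s) for s in range(N_STATES - 1)]
print(policy)
```

In Q-learning proper you don't get to call `step` on every state like this; Q is estimated from sampled experience instead. The one-step lookup in `greedy_action` is the same either way.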
(via szhmidty)



