
nostalgebraist:

I had some fun asking ChatGPT about cases from “Counterexamples in Analysis.” You get this kind of uncanny valley math, syntactically and stylistically correct but still wildly wrong.

This was a response to “Prove or disprove: there exists a nowhere continuous function whose absolute value is everywhere continuous.” It responded in TeX, which I copied into a TeX editor.

image

Another answer to the same question:

image

If I ask Bing the same question, it tells me about something called the “very not continuous function” (lol):

image

I can’t find the term “very not continuous function” anywhere on the web except this page, the one Bing cites.

The page looks kind of click-farm-like, and it’s not clear what function it means by “the very not continuous function.”  But it does discuss the question I asked, so at least there’s that.

Anyway, it’s not web search relevance that I care about here – it’s math ability.

I tried again with Bing, this time with a different “Counterexamples in Analysis” case, an injunction not to perform a web search, and a half-hearted nod to chain-of-thought prompting.

image

The resulting discussion was an adventure in Helpful™ overconfidence:

image
image
image
image
image
image

(I said “bing ai” in the last screenshot due to a bizarre UI decision by Microsoft that makes it very easy to say “bing ai” to Bing without wanting or intending to. Don’t ask me, I didn’t do it ¯\_(ツ)_/¯ )

Here’s GPT-4 (on poe.com) answering the first of the two questions:

image

Update: tried the second example with GPT-4 (via ChatGPT plus).

It struggles in a similar manner to Bing.  As with Bing, my attempts to reason with it do not work very well.

Maybe there’s a way of phrasing the responses that would make it think more carefully about their meaning and implications?

It’s hard to guess what will work because of the involvement of RLHF.  (Otherwise I could just ask myself what a desirable version of this interaction might have looked like in the training data.)

Unfortunately, every version of GPT-4 we can access is RLHF’d – a base model exists, but they aren’t exposing it to us, and I don’t see a reason to expect they ever will.



Another way to look at the Kelly criterion is to think about betting on a variable number of independent things at once.

If you make a single bet repeatedly, and you use the Kelly criterion, then over time, your log(wealth) is a sum of IID random variables.

So the Law of Large Numbers and Central Limit Theorem hold…

  • asymptotically, as time passes
  • for log(wealth)

Now imagine that instead, you diversify your wealth across many identical and independent bets. (And you use Kelly to decide how to bet on each one, given the fraction of wealth assigned to it.)

Here, the limit theorems hold…

  • asymptotically, as the number of simultaneous bets grows
  • for wealth

which is better in both respects. You control the number of bets, so you can just set “n” to a large number immediately rather than having to wait. And the convergence is faster and tighter in terms of real money, because the thing that converges doesn’t get magnified by an exp() at the end.
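Here’s a quick simulation sketch of the contrast, with numbers I’m making up for illustration (a p = 0.6 even-odds coin, Kelly fraction f = 0.2):

```python
# Illustrative only: p = 0.6 even-odds coin, Kelly fraction f = 0.2, n = 100.
import numpy as np

rng = np.random.default_rng(0)
p, f, n, trials = 0.6, 0.2, 100, 50_000

wins = rng.random((trials, n)) < p
ratios = np.where(wins, 1 + f, 1 - f)   # per-bet wealth ratio

sequential = ratios.prod(axis=1)    # n bets in a row, reinvesting everything
simultaneous = ratios.mean(axis=1)  # wealth split evenly across n parallel bets

# The CLT acts on log(wealth) in the sequential case, but on wealth itself
# in the parallel case, so `simultaneous` is far more tightly concentrated.
print(np.median(sequential), np.median(simultaneous))
print(np.std(np.log(sequential)), np.std(simultaneous))
```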

This is regular diversification, which is very familiar. And then, making sequential independent bets turns out to be kind of like “diversifying across time,” because they’re independent. But it’s not as nice as what happens in regular diversification.

In fact, the familiar knee-jerk intuition “never go all-in, bet less than everything!” comes from this distinction, rather than from any result about how to bet on a single random variable if forced to do so.

In the real world, you’re not stuck in an eternal casino with exactly one machine. If you keep some money held back from your bet, it doesn’t just sit there unused. Money you hold back from a bet can be used for things, including other independent bets.

(The Kelly criterion holds money back so it can be used on future rounds of the same bet, which are a special case of “other independent bets.”)

Of course, if you have linear utility (i.e. no risk aversion), you should still go all-in on whichever bet has the highest expected return individually. But if you really had linear utility, your life would be so simple that most of finance would be irrelevant to it (and vice versa). You’d just put 100% in whichever asset you thought was best at any given time.

Stuff about the Kelly criterion

unknought:

I’ve been off Tumblr for a little while, but there’s apparently been some discussion about the Kelly criterion, a concept in probability, in relation to some things Sam Bankman-Fried said about it and how that relates to risk aversion. I’m going to do what I can to explain some aspects of the math as I understand them.

The Kelly criterion is a way of choosing how much to invest in a favorable bet, i.e. one where the expected value is positive. The Kelly criterion gives the “best” amount for a bunch of different senses of “best” in a bunch of different scenarios, but I’m going to restrict to one of the simplest ones.

Suppose you have some bet where you can bet whatever amount of money you want, you have probability p of winning, and you gain b times the amount you bet if you win. (Of course, if you lose, you lose the amount you bet.) Also suppose you get the opportunity to make this bet some large number n of times in a row, you have the same probabilities and payoff rules for each of them, and they’re independent events from each other. The assumption that all of the bets in the sequence have the same probabilities and payoff rules is made here to simplify the discussion; the basic concepts can still hold when there are a mix of different bets, but it’s a lot messier to state things and reason about them.

Also suppose that your strategies are limited to choosing a single quantity f between 0 and 1 and always betting f times your total wealth at every step. This is a pretty big restriction, and it too can be relaxed at the cost of making things much messier. But even with this restriction we’ll be able to compare the strategy prescribed by the Kelly criterion to the “all-in” strategy of always betting all of your money.

So what is the best choice of f? The Kelly criterion gives an answer, but the sense in which it’s the “best” is one that it’s not obvious should apply to any choice of f. I’ll state it here, but keep in mind that until we’ve done some more calculation, we shouldn’t assume that there is any choice of f which is the best in this sense.

The Kelly criterion gives a choice of f such that, for any other choice of f, the Kelly criterion produces a better result than the other choice with high probability. Here “high probability” means that the probability that the Kelly choice outperforms the other one goes to 1 as n goes to infinity.

So why is this possible?

Let X_i be the random variable representing the ratio of the money you have after the ith bet to the amount you had before it. So your final wealth is equal to your starting wealth times the product of the X_i for i from 1 to n. Also these X_i are independent identically distributed variables. (We can describe their distribution in terms of p, b, and f but the exact details aren’t too important to the concepts I want to communicate.) Sums of random variables have some nicer things that can be said about them than products, so we take the logarithm. The logarithm of your final wealth is the log of your starting wealth plus a sum of n independent variables log(X_i).

Now, the expected value of that sum is n times the expected value of one of the individual summands, and the (weak) law of large numbers tells us that with high probability the actual value of the sum will be close to that. (To be rigorous about this: for any constant C > 0, the probability that the sum will be further than Cn away from its expected value goes to 0 as n goes to infinity.) So for any betting strategy f, define r(f) to be the expected value of log(X_i). Then if we have any two strategies f and f’, the log of your final wealth following strategy f minus the log of your final wealth following strategy f’ will be about r(f)n − r(f’)n, and so will be positive with high probability if r(f) > r(f’). (If you understood the rigorous definition in the previous parenthetical, you should be able to make this argument rigorous as well.) Thus with high probability the log of your final wealth will be greater using strategy f than strategy f’. Since log is an increasing function, this is equivalent to saying that with high probability, f will result in a greater final wealth than f’.

So if you pick f such that r(f) is maximized, then for any other choice of f, you’ll outperform that choice with high probability. This is what the Kelly criterion says to do. Maximizing r(f) can be equivalently described by saying that at each bet, you bet the amount that maximizes the expectation of the logarithm of the amount you’ll have after the bet.
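(An aside not spelled out in the post above, but it’s the standard computation: in this setup r(f) can be maximized in closed form. Writing out r(f) = E[log X_i] and setting the derivative to zero:)

```latex
r(f) = p\log(1 + bf) + (1 - p)\log(1 - f),
\qquad
r'(f) = \frac{pb}{1 + bf} - \frac{1 - p}{1 - f} = 0
\;\Longrightarrow\;
f^* = p - \frac{1 - p}{b}
```

For example, p = 0.6 at even odds (b = 1) gives f* = 0.2: bet 20% of your bankroll each round.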

A pitfall to avoid here: Although the log of the final wealth can be said to be “about” a certain value with high probability, we can’t really say that the final wealth is guaranteed to be “about” anything in particular. Differences that we can consider to be negligibly small when we’re looking at the logarithm can balloon to very large differences when we’re looking at the actual value, and it is very possible for one experimental trial using a given strategy to yield something many times larger than another trial using the same strategy where you’re a little less lucky.

The Kelly criterion is not the strategy that maximizes the expected amount of money you have at the end. The best strategy for that goal is the one where you put all of your money in on every bet. This isn’t inconsistent with the previously stated results; in almost all cases the Kelly criterion outperforms the all-in strategy (because the all-in strategy loses at some point and ends up with no money). But in the very unlikely event that you win every single one of your bets, you end up with an extremely large amount of money, so large that even when you multiply it by that very small probability you get something that’s larger than the expected value of any other strategy.
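A quick way to see this explicitly (my own added step, using the X_i defined above): by independence,

```latex
\mathbb{E}[X_i] = p(1 + bf) + (1 - p)(1 - f) = 1 + f\,\bigl(pb - (1 - p)\bigr),
\qquad
\mathbb{E}[\text{final wealth}] = w_0 \,\mathbb{E}[X_i]^n
```

For a favorable bet (pb > 1 − p), E[X_i] is increasing in f, so expected final wealth is maximized at f = 1, even though that strategy almost surely goes broke as n grows. With p = 0.6 and b = 1, all-in multiplies expected wealth by 1.2 per bet, versus 1.04 for the Kelly fraction f = 0.2.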

What if, instead of trying to maximize the expected dollar payoff, you have some utility function of wealth, and you’re trying to maximize the expected value of that? Well, it depends what your utility function is. If your utility function is the logarithm of your wealth, the Kelly criterion maximizes your expected utility; in fact, in this case we don’t even need to assume n is large or invoke the law of large numbers. But going back to the case of large n, there are a lot of other utility functions where the Kelly criterion is also optimal. Think about it like this: the Kelly strategy outperforms any other strategy in almost all cases; the only situation where you might still prefer the other strategy is if, in the tiny chance that you get a better outcome, your outcome is so much better that it makes up for losing out the vast majority of the time. So if your utility function grows slower than the logarithm, you care even less about that tiny chance of vast riches than you would if you had a logarithmic utility function, so the Kelly criterion continues to be optimal. More generally, I think it can be shown that when comparing the Kelly criterion to some other strategy, the probability of that other strategy doing better than it decays exponentially in n. Since the amount the other strategy can obtain in that tail situation grows at most exponentially in n, this implies that as long as your utility function grows slower than x^ε for every ε > 0 (as a function of wealth x), you won’t care about the tail, so the Kelly criterion is still optimal. If your utility function grows faster than that, i.e. if there is some ε > 0 such that your utility function grows faster than x^ε, then I think for sufficiently favorable bets, all-in comes out ahead again.

Okay, but how does all of this apply in the real world? Honestly, I’m not sure. If your utility function is your individual well-being, it seems very likely to me that it grows logarithmically or slower; if what you care about is maximizing the amount of good you do for the world by charitable donations, I think there is some merit to SBF’s argument that you should treat that utility as a linear function of money, at least up to a certain point. But even he acknowledged that it drops off significantly once you get into the trillions, and since the reasons for potentially preferring riskier strategies over the Kelly criterion hinged on exponentially small probabilities of exponentially large payoffs, I think that trillion-dollar regime might actually be pretty relevant to the computation.

Really any utility function should be eventually constant, but in that case the Kelly criterion ceases to be optimal in the way discussed before. For large enough n, it will get you all the money you could want, but so will any other strategy other than all-in and “never bet anything”. Obviously this is not a good model of how the world works. To repair this we probably want to introduce time-discounting, but to make sense of that we need to have some money getting spent before the end of the experiment rather than all of it available for reinvesting, and by this point things have gotten far enough away from the original scenario that it’s hard to tell how relevant the conclusions from it even are. It seems like it’s a useful heuristic in a pretty wide range of scenarios? But I have no idea whether SBF was right that he was not in one of them.

To be clear, none of this is to excuse his actions; whether or not he should have been applying the Kelly criterion, I think “committed billions of dollars of fraud” does a better job of capturing what he did wrong than “was insufficiently risk-averse”.

OK yeah, that thing I was talking to @raginrayguns about is way simpler than I thought

The Kelly criterion maximizes the rate of exponential growth, which is just

log(final / initial)

up to a constant.

Like if you have w(t) = exp(rate * t) , and you end at t=T, then

rate = (1/T) log(w(T) / w(0))

and T is a constant.

So the Kelly criterion really is nothing but maximizing log wealth, only phrased equivalently as “maximizing exponential growth rate.”

And this phrasing is confusing, because “maximizing exponential growth rate” sounds sort of generically good. Like why wouldn’t you want that?

But the equivalence goes both ways: it’s the same thing as maximizing log wealth, and it’s easy to see you may not want that.
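To make “you may not want that” concrete, here’s a throwaway simulation (my own invented numbers: p = 0.6 even-odds coin, 20 bets):

```python
# Illustrative only: compare Kelly to all-in on a p = 0.6 even-odds coin.
import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 0.6, 20, 200_000
f = 2 * p - 1  # Kelly fraction for even odds: f* = p - (1 - p) = 0.2

wins = rng.random((trials, n)) < p
kelly = np.where(wins, 1 + f, 1 - f).prod(axis=1)
all_in = np.where(wins, 2.0, 0.0).prod(axis=1)

print(np.mean(kelly), np.median(kelly))    # mean ~ 1.04^20 ≈ 2.2, median ≈ 1.5
print(np.mean(all_in), np.median(all_in))  # mean ~ 1.2^20 ≈ 38, median = 0
```

Maximizing E[log wealth] buys you the good typical outcome; betting everything buys you the higher E[wealth] plus near-certain ruin. Which one you want depends on your utility, which is the whole point.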

—-

I made a mistake in my original post about geometric averages – I linked to a twitter thread about the Kelly criterion, and a blog post by the same person, as if they were making the same point.

The thread was how I found the post. But in fact, the thread is both wrong and not really about geometric averages being confusing. The post, however, is mostly good and doesn’t mention Kelly at all.

Why did the thread link back to the post, then? The author is conflating several things.

Here are some things you can compute:

  1. The expected growth in wealth from n sequential bets, E[ w_n / w_0 ]. This is what you want to maximize if you have linear utility.
  2. The expected arithmetic average over the growth in wealth from the individual bets.

    This is E[ (w_1 / w_0) + (w_2 / w_1) + … + (w_n / w_{n-1}) ] / n.

    This is meaningless; there’s no reason to do this. However, it gets reported in financial news all the time – I’ve seen it in the WSJ, for example.
  3. The expected geometric average over the growth in wealth from the individual bets.

    This is E[ ((w_1 / w_0) * (w_2 / w_1) * … )^(1/n) ], or after cancelling, E[ (w_n / w_0)^(1/n) ]. So this is (1.), but with a power of 1/n inside the E[].
  4. Like (3.), but with a logarithm inside the E[]: E[ log((w_n / w_0)^(1/n)) ]. This is the exponential growth rate.

Everything except (1.) has dubious importance at best, IMO.
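To verify they really are four different numbers, here’s a tiny script (my own made-up example: p = 0.6, even odds, f = 0.2, n = 10; everything computed exactly, using independence):

```python
# Illustrative numbers: p = 0.6, even odds (b = 1), Kelly f = 0.2, n = 10.
import math

p, f, n = 0.6, 0.2, 10
up, down = 1 + f, 1 - f  # per-step wealth ratios

q1 = (p * up + (1 - p) * down) ** n                        # (1.) E[w_n / w_0]
q2 = p * up + (1 - p) * down                               # (2.) expected arithmetic avg
q3 = (p * up ** (1 / n) + (1 - p) * down ** (1 / n)) ** n  # (3.) E[(w_n / w_0)^(1/n)]
q4 = p * math.log(up) + (1 - p) * math.log(down)           # (4.) expected growth rate

print(q1, q2, q3, q4)  # ≈ 1.48, 1.04, 1.02, 0.020 – four different things
```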

(1.) is for linear utility, but if you have nonlinear utility U, you would just maximize a variant of (1.), E[ U(w_n / w_0) ], instead.

In the blog post, Hollerbach is essentially talking about the confusing relationship between (1.) and terms like (w_1 / w_0). You have to multiply these terms to get (1.), and multiplication is confusing.

However, in the post he conflates this product (1.) with the geometric average (3.). They’re not equivalent because the power doesn’t commute with expectation. But I guess they both involve multiplication, and multiplication is confusing.

In the twitter thread, he sort of conflates the geometric average (3.) with the exponential growth rate (4.). Then he pits these against the arithmetic average (2.), which is bad, but is not what SBF was advocating.

Then, since the blog post has already conflated the geometric average with the expected wealth growth, he ends up conflating together everything except the bad one, (2.). In fact, all four are different. And only (1.), or a nonlinear-utility variant of it, is what matters.

raginrayguns:

After n bets from initial wealth 1, your wealth is about

exp(E[log R] n)

where R is new/old wealth in one bet. That’s the appeal of the Kelly criterion.

But (assuming for now betting at even odds), if you bet it all at each step, expected wealth is

p^n 2^n

= exp(log(2p) n)

The weird thing is

log(2p) > max E[log R]

so in terms of expected value, you’re doing better than the original approximation allows
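concretely, with p = 0.6 at even odds (my numbers, not from the post):

```latex
\log(2p) = \log 1.2 \approx 0.182,
\qquad
\max_f \mathbb{E}[\log R] = \log 2 + p\log p + (1-p)\log(1-p) \approx 0.020
```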

It seemed paradoxical to me at first. But it makes sense after unpacking “about” and considering what kind of convergence is meant, which is

total wealth / exp(E[log R] n) → 1

EDIT: ↑ probably wrong

Betting everything every time is 0/0 on the left. So maybe there’s no real contradiction?

@nostalgebraist this is why i don’t agree with that matt hollerbach thread btw. He’s not the only person on twitter who was saying SBF was making some elementary mistake… Kelly in a certain sense maximizes the growth rate of your money, but it does NOT maximize the growth rate of the expected value of your money

I think you’re right, yeah …

  • Kelly maximizes the expected growth rate.
  • Betting everything maximizes the expectation of your wealth at any given period n.

And, as you say in the OP,

  • E[wealth] grows exponentially in both cases
  • It grows faster if you bet everything than if you bet Kelly

Which makes it sound better to bet everything, if you care about E[wealth].

EDIT 2: everything after this line is totally wrong lol

However, consider the event “exponential growth happens up to n,” i.e. “wealth at n ~ exp(n).” At each n, this is either true or false. In the large n limit:

  • If you bet Kelly, I think this has probability 1? Haven’t checked but I can’t see how that would fail to be true
  • If you bet everything, this has probability 0. Your wealth goes to 0 at some n and stays there.

OK, why would we care? Well, I think these two results apply in two different scenarios we might be in.

  1. You fix some n in advance, and commit to making n bets and then “cashing out.”
    You want to maximize this cash received at n. Here, you want to bet everything.
  2. You want to keep betting indefinitely, while regularly “cashing out” a <100% fraction of the money used for betting, over and over again.
    You want to maximize the expected total you will cash out. (With some time discounting thing so it’s not infinity.)

In case 2, I think maybe you want to bet Kelly? At least, I’m pretty sure you don’t want to bet everything:

  • If you bet everything, you cash out some finite number of times M, making some finite amount of cash ~M. Then your betting wealth goes to zero.
  • If you bet Kelly, then with probability 1 (?), you can cash out arbitrarily many times.
    If you have zero time preference, then you make infinite cash, which is obv. better than the previous case.
    If you do time discounting, I guess it depends on the details of the time discounting? You get a finite amount, and it might be less than the above if you discount aggressively, but then it might not be.

The punchline is, I think “case 2” is more representative of doing actual investing. (Including anything that SBF could reasonably believe himself to be doing, but also like, in general.)

You don’t have some contract with yourself to be an investor for some exact amount of time, and then cash out and stop. (I mean, this is an imaginable thing someone could do, but generally people don’t.)

You have money invested (i.e. continually being betted) indefinitely, for the long term. You want to take it out, sometimes, in the future, but you don’t know when or how many times. And even if you die, you can bequeath your investments to others, etc.

And maybe you do exponential time discounting, behaviorally, for yourself. But once your descendants, or future generations, come into the picture, well – I mean there are economists who do apply exponential time discounting across generations, it’s kind of hard to avoid it. But it’s very unnatural to think this way, and especially if you’re a “longtermist” (!), I doubt it feels morally correct to say your nth-generation descendants matter an amount that decays exponentially in n.

What would make you prefer the finite lump sum from betting everything here?

Well, if you think the world has some probability of entirely ending in every time interval, and these are independent events, then you get exponential discounting. (This is sort of the usual rationale/interpretation for discounting across generations, in fact.)

So if you think p(doom) in each interval is pretty high, in the near term, maybe you’d prefer to bet everything over Kelly.

Which amusingly gets back to the debate about whether it makes sense to call near-term X-risk concerns “longtermist”! Like, there is a coherent view where you believe near-term X-risk is really likely, and this makes you have unusually low time preference, and prefer short term cash in hand to long-term growth. And for all I know, this is what SBF believes! It’s a coherent thing you can believe, it’s just that “longtermism” is exactly the wrong name for it.

ETA: after more thought I don’t think the above is fully correct.

I don’t think the “event” described above is well-defined. At a single n, your wealth is always “~ exp(n)” for some arbitrary growth rate, unless it’s zero.

Betting everything is a pathological edge case, b/c your wealth can go to 0 and get stuck there. If you are any amount more conservative than that, you still “get exponential growth” in some sense, it’s just that you’ll regularly have periods of very low wealth (with this low value, itself, growing exponentially in expectation).

If you are cashing out at every n individually, for all n, then I guess you want to maximize the time-discounted sum over n of wealth at each n … need to work that out explicitly I guess.

maybesimon asked:

i think almost nowhere might be my favorite work of yours, I hope to catch up on it over the winter break. Is it called almost nowhere because of measure theory?

Glad to hear it!

Yes, the title comes from measure theory.

—-

I originally started writing an explanation of what “almost nowhere” means here, not assuming any background, but it got long enough that it felt contrary to the simplicity of the concept.

But briefly, it’s a variant of the more common term “almost everywhere,” which means “everywhere, except possibly on a set of measure zero.”

What’s a “set of measure zero”? A familiar example is an infinitesimal point inside a larger, continuous object like a square or cube.

A point has no area or volume unto itself.

We measure amounts of physical stuff using area and volume (yards of fabric, liters of soda…). So a single point contains “0% of the stuff” in the whole object. You can remove it, and the whole object will have exactly as much “stuff” as it did before.

Hence, we might want to draw a distinction between “literally all of the object” and “all of the stuff in the object.”

An object that’s red, except for a single infinitesimal blue point, isn’t entirely red – but it has just as much red stuff inside it as it would if it lacked the blue defect.

In this case, we say the object is “red almost everywhere.”

Almost nowhere is just the reverse of this: something that’s true only on a set of measure zero. Is our object “blue nowhere”? No, there’s somewhere in it that’s blue – the one point. But it is “blue almost nowhere.” None of its stuff is blue.
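If you want the formal version, it’s one line (nothing beyond what’s above): given a measure μ,

```latex
P \text{ holds almost everywhere} \iff \mu(\{x : P(x)\ \text{fails}\}) = 0,
\qquad
P \text{ holds almost nowhere} \iff \mu(\{x : P(x)\ \text{holds}\}) = 0
```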

raginrayguns asked:

I remember you posted, a long time ago, a paper saying that stochastic gradient descent was important to generalization, not just an approximation to gradient descent. I was thinking about that again after learning about dropout – it seems deliberately designed to make narrow targets in parameter space hard to hit. Has there been more theory along those lines? It seems like it could be important for understanding the effects of different architectures – like, you could recognize a bad architecture if it compresses the correct solutions into a small hypervolume of parameter space.

Yeah, there’s a lot of research on this topic.

By now, the standard “lore” or received wisdom is that “flat minima generalize better.”

Any realistic architecture and training dataset will have many different “minima”: points in parameter space that gradient descent can converge to, which achieve roughly equal training loss.

But they don’t all perform equally well beyond the training data. And the idea is that the “flat” (low curvature) minima perform better here than the “sharp” (high curvature) ones.

As this paper pointed out, for this claim to make sense, you need to measure curvature in a way that’s invariant to reparameterization. But you can do that. (Some people didn’t, in earlier papers, but you can.) See this blog post from 2018 for some commentary on this.

AFAIK something like the original claim does hold up, once you measure flatness sensibly, but it’s one of those things where people like to write papers pointing out exceptions to the claim, or proposing modified forms of it.

More recently, there’s been interest in designing the optimizer specifically to seek out flat minima.

This is called “Sharpness-Aware Minimization” (SAM) after the paper that popularized it, and there have been later variants that claim to do better than that paper’s algorithm. And this does seem to be practically useful, at least sometimes. The DALLE-2 paper used it, and found that it made their CLIP embeddings much closer to low-rank (which they wanted for problem-specific reasons) in addition to slightly improving generalization.
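The core first-order SAM update from that paper fits in a few lines. Here’s a minimal sketch on a toy quadratic loss – the loss function and hyperparameters are invented for illustration, not the paper’s experimental setup:

```python
# Minimal SAM sketch: ascend to the (approximately) worst nearby point,
# then descend using the gradient measured there.
import numpy as np

A = np.diag([1.0, 100.0])  # ill-conditioned toy loss L(w) = w^T A w

def grad(w):
    return 2 * A @ w

def sam_step(w, lr=1e-3, rho=0.05):
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent step of radius rho
    return w - lr * grad(w + eps)                # descend with perturbed gradient

w = np.array([1.0, 1.0])
for _ in range(2000):
    w = sam_step(w)
```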

My intuition for this stuff (which is not original to me):

There’s a connection between “predicting on novel data” and “predicting on training data with slightly perturbed weights.” Because unseen data elicits combinations of activations that are slightly different than any that ever occurred during training.

So it’s like the activations are perturbed slightly. And feeding perturbed layer-N activations into layer N+1 is equivalent to feeding the original activations into a perturbed version of layer N+1’s weights.

So if your model can’t cope with small perturbations to its weights, it won’t be able to cope with small perturbations to its activations (relative to the ones that it got gradients for). And so it won’t be able to cope with unseen data.
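One crude way to make that intuition operational (a sketch with invented names, not a method from any of the papers above): estimate “sharpness” as the average increase in training loss under small random weight perturbations. Note the caveat from earlier – a measure like this is not reparameterization-invariant as written.

```python
# Hypothetical helper: average loss increase under random weight noise.
import numpy as np

def sharpness_estimate(loss_fn, w, sigma=1e-2, samples=100, seed=0):
    rng = np.random.default_rng(seed)
    base = loss_fn(w)
    bumps = [loss_fn(w + sigma * rng.standard_normal(w.shape)) - base
             for _ in range(samples)]
    return float(np.mean(bumps))  # larger => sharper minimum (worse, per the lore)
```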

—-

Also, in the case of dropout specifically, there has been some work studying the exact way that dropout acts as a regularizer.

E.g. this paper talks about dropout penalizing high-order interactions between features, which is very intuitive. I haven’t read this one yet but it looks interesting.