
there is no “mainstream consensus” among intelligence researchers

vaniver:

nostalgebraist:

How’s that for a clickbait title? ;)

The motivation for this post was a tumblr chat conversation I had with @youzicha.  I mentioned that I had been reading this paper by John L. Horn, a big name in intelligence research, and that Horn was saying some of the same things that I’d read before in the work of “outsider critics” like Shalizi and Glymour.  @youzicha said it’d be useful if I wrote a post about this sort of thing, since they had gotten the impression that this was a matter of solid mainstream consensus vs. outsider criticism.

This post has two sides.  One side is a review of a position which may be familiar to you (from reading Shalizi or Glymour, say).  The other side consists merely of noting that the same position is stated in Horn’s paper, and that Horn was a mainstream intelligence researcher – not in the sense that his positions were mainstream in his field, but in the sense that he is recognized as a prominent contributor to that field, whose main contributions are not contested.

Horn was, along with Raymond Cattell, one of the two originators of the theory of fluid and crystallized intelligence (Gf and Gc).  These are widely accepted and foundational concepts in intelligence research, crucial to the study of cognitive aging.  They appear in Stuart Ritchie’s book (and in his research).  A popular theory that extends Gf/Gc is known as the “Cattell–Horn–Carroll theory.”

Horn is not just famous for the research he did with Cattell.  He made key contributions to the methodology of factor analysis; a paper he wrote (as sole author) on factor analysis has been cited 3977 times, more than any of his other papers.  Here’s a Google Scholar link if you want to see more of his widely cited papers.  And here’s a retrospective from two of his collaborators describing his many contributions.

I think Horn is worth considering because he calls into question a certain narrative about intelligence research.  That narrative goes something like this: “the educated public, encouraged by Gould’s misleading book The Mismeasure of Man, thinks intelligence research is all bunk.  By contrast, anyone who has read the actual research knows that Gould is full of crap, and that there is a solid scientific consensus on intelligence which is endlessly re-affirmed by new evidence.”

If one has this narrative in one’s head, it is easy to dismiss “outsider critics” like Glymour and Shalizi as being simply more mathematically sophisticated versions of Gould, telling the public what it wants to hear in opposition to literally everyone who actually works in the field.  But John L. Horn did work in the field, and was a major, celebrated contributor to it.  If he disagreed with the “mainstream consensus,” how mainstream was it, and how much of a consensus?  Or, to turn the standard reaction to “outsider critics” around: what right do we amateurs, who do not work in the field, have to doubt the conclusions of intelligence-research luminary John Horn?  (You see how frustrating this objection can be!)


So what is this critical position I am attributing to Horn?  First, if you have the interest and stamina, I’d recommend just reading his paper.  That said, here is an attempt at a summary.


I disagree with several parts of this, but on the whole the disagreements are somewhat minor, and I think this is a well-detailed summary.

Note how far this is from Spearman’s theory, in which the tests had no common causes except for g! 

Moving from a two-strata model, where g is the common factor of a bunch of cognitive tests, to a three-strata model, where g is the common factor of a bunch of dimensions, which themselves are the common factor of a bunch of cognitive tests, seems like a natural extension to me. This is especially true if the number of leaves has changed significantly–if we started off with, say, 10 cognitive tests, and now have 100 cognitive tests, then the existence of more structure in the second model seems unsurprising.

What would actually be far from Spearman’s theory is if the tree structure didn’t work. For example, a world in which the 8 broad factors were independent of each other would totally wreck the idea of g; a world in which the 8 broad factors were dependent, but had an Enneagram-esque graph structure as opposed to being conditionally independent given the general factor, would also do so.


When it comes to comparing g, Gf, and Gc, note this bit of Murray’s argument:

In diverse ways, they sought the grail of a set of primary and mutually independent mental abilities. 

So, the question is, are Gc and Gf mutually independent? Obviously not; they’re correlated. (Both empirically and in theory, since the investment of fluid intelligence is what causes increases in crystallized intelligence.) So they don’t serve as a replacement for g for Murray’s purposes. If you want to put them in the 3-strata model, for example, you need to have a horizontal dependency and also turn the tree structure into a graph structure (since it’s likely most of the factors in strata 2 will depend on both Gc and Gf).


Let’s switch to practical considerations, and for convenience let’s assume Carroll’s three-strata theory is correct. The question then becomes, do you talk about the third strata or the second strata? (Note that if you have someone’s ‘stat block’ of 8 broad factors, then you don’t need their general factor.)

This hinges on the correlation between the second and third strata. If it’s sufficiently high, then you only need to focus on the third strata, and it makes sense to treat g as ‘existing,’ in that it compresses information well.
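
To make “compresses information well” a bit more concrete, here is a toy numerical sketch (my own example, not from the post; the one-factor data-generating process and the 0.8 loading are arbitrary assumptions): simulate 8 broad-factor scores driven by a single general factor plus independent noise, and see how much of the total variance one principal component captures.

import numpy as np

rng = np.random.default_rng(0)
n_people, loading = 10_000, 0.8                    # loading = how strongly g drives each broad factor
g = rng.standard_normal(n_people)                  # the general factor
noise = rng.standard_normal((n_people, 8))         # factor-specific stuff
scores = loading * g[:, None] + np.sqrt(1 - loading**2) * noise   # 8 broad-factor scores

eigvals = np.linalg.eigvalsh(np.cov(scores, rowvar=False))[::-1]
print(eigvals[0] / eigvals.sum())                  # about 0.68 here: one number carries most of the spread

Lower the loading and that fraction falls toward 1/8, which is the regime where reporting the full stat block instead of g starts to earn its keep.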


This is the thing that I disagree with most strenuously:

In both cases, when one looks closely at the claim of a consensus that general intelligence exists, one finds something that does not look at all like such a consensus. 

Compared to what? Yes, psychometricians are debating how to structure the subcomponents of intelligence (three strata or four?). But do journalists agree with the things all researchers would agree on? How about the thugs who gave a professor a concussion for being willing to interview Charles Murray?

That’s the context in which it matters whether there’s a consensus that general intelligence exists, and there is one. Sure, talk about the scholarly disagreement over the shape or structure of general intelligence, but don’t provide any cover for the claim that it’s worthless or evil to talk about a single factor of intelligence.



nostalgebraist:

raginrayguns:

Seems like when someone writes like, “we care about this thing, so we used the standard quantitative measure of this thing,” @nostalgebraist is in the habit of asking, “why’s that standard?” Especially if that measure has some aura of goodness or rightness about it, that makes you question whether it’s being used for intellectual reasons.

One such question was, why do statistics people always measure distance between two distributions using Kullback-Leibler divergence? Besides, you know, “it’s from information theory, it means information.”

Above, I’ve illustrated the difference between using KL divergence, and another measure, L2 distance. I’ve shown a true distribution which has two bell curve peaks, but the orange and purple distributions only have one, so they can’t match it perfectly. The orange distribution has lower L2 distance (.022 vs .040), and the purple curve has lower KL divergence (2.1 vs 3.0). You can see that they’re quite different:

  • the orange low-L2 one matches one peak of the true distribution, but has the other one deep in the right tail
  • the purple low-KL one goes between them and spreads itself out, to make sure there’s no significant mass in the tails

And this difference makes a real practical difference–using KL divergence actually is not always appropriate. When I’m doing statistical estimation, I often have a model for the data, but I don’t expect every data point to follow the model. So I expect the true distribution to have one peak which fits my model, plus some other stuff. So I don’t want to do maximum likelihood estimation, which is heavily influenced by that other stuff. And maximum likelihood estimation is actually choosing a model by minimizing a sample-based estimate of KL divergence. Instead, I minimize a sample-based estimate of L2 divergence–this is called L2 estimation, or L2E. (some papers about it here.) That way when I’ve inferred the parameters of my model, it matches the “main” peak of the data, and is robust to the other stuff.
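
For concreteness, here is a minimal sketch of that contrast (my own toy example, not the figure’s code, which is linked further down as “(code)”; the peak locations and the 800/200 split are arbitrary): fit a single Gaussian to data with a main peak plus some “other stuff,” once by maximum likelihood and once by minimizing the L2E objective.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 800),      # the peak the model is "about"
                       rng.normal(6, 1, 200)])     # the other stuff

def neg_log_lik(params):                           # MLE = minimizing a sample estimate of KL
    mu, sigma = params[0], np.exp(params[1])
    return -norm.logpdf(data, mu, sigma).sum()

def l2e_objective(params):                         # L2E: integral of f^2 minus 2 * mean of f(x_i);
    mu, sigma = params[0], np.exp(params[1])       # for a normal density, integral of f^2 = 1/(2*sigma*sqrt(pi))
    return 1.0 / (2 * sigma * np.sqrt(np.pi)) - 2 * norm.pdf(data, mu, sigma).mean()

mle = minimize(neg_log_lik, x0=[1.0, 0.0]).x
l2e = minimize(l2e_objective, x0=[1.0, 0.0]).x
print("MLE: mu=%.2f, sigma=%.2f" % (mle[0], np.exp(mle[1])))   # dragged toward the other stuff, and widened
print("L2E: mu=%.2f, sigma=%.2f" % (l2e[0], np.exp(l2e[1])))   # stays on the main peak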

The invention of L2E is actually informative about how standard KL divergence really is. Because, it was invented by someone in a statistical community where L2 divergence is standard. Specifically, non-parametric density estimation–think histograms and kernel density estimators. The guy is actually David Scott, who’s also known for “Scott’s rule” for choosing the bin width of a histogram, which you may have used if you’ve ever done “hist(x, breaks=‘scott’)” in R. Scott’s rule starts by looking at the mean and standard deviation of your sample, and then gives you the bin width that would be best for a sample of that size drawn from a normal distribution with that mean and sd. And how’s “best” quantified? It’s expected L2 distance between that normal distribution and the resulting histogram. Most papers you see on histograms and kernel density estimators will use L2 distance. He came up with L2E just by asking the question, what if we took the measure of fit used in nonparametric density estimation, and applied it to parametric models?
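
(For reference, the formula that calculation produces is the normal-reference bin width

\[
h^{*} \;=\; \left(\frac{24\sqrt{\pi}}{n}\right)^{1/3} \hat{\sigma} \;\approx\; 3.49\,\hat{\sigma}\,n^{-1/3},
\]

i.e. the width that minimizes the expected L2 distance (MISE) between the histogram and a normal density with the sample’s mean and sd. This is standard textbook material, not something quoted from Scott here.)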

(code)

This is really interesting, thanks.  Especially the connection of MLE downsides to K-L downsides.

One thing that gets mentioned as a good quality of K-L is that it’s invariant to changes of coordinates.  L2 divergence doesn’t have this (I think? the squares ruin it, you get a squared factor and the “dx” can only cancel half of it).  How much of an issue is this in practice?  Like, it seems bad if you can totally change the distributions you get by squishing and stretching your coordinate system, but I guess if you have a really natural coordinate system to begin with … ?
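
Spelling the parenthetical out, since the arithmetic is short (mine, not from the post): under an invertible change of variables y = h(x), densities transform as p_Y(y) = p_X(x) |dx/dy|, so the ratio p/q is unchanged and KL is invariant,

\[
\int p_Y(y)\,\log\frac{p_Y(y)}{q_Y(y)}\,dy \;=\; \int p_X(x)\,\log\frac{p_X(x)}{q_X(x)}\,dx ,
\]

while the squared difference picks up two Jacobian factors and the dy only cancels one of them:

\[
\int \bigl(p_Y(y)-q_Y(y)\bigr)^{2}\,dy \;=\; \int \bigl(p_X(x)-q_X(x)\bigr)^{2}\,\left|\frac{dx}{dy}\right|\,dx .
\]

So the hunch is right: L2 distance does depend on the coordinate system, and is only preserved by maps with |dx/dy| = 1 everywhere (translations and the like).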

Also, this made me think about how a sample distribution is going to have better resolution near the peak than in the tails, which could be one justification for caring more about the fit near the peak.  It seems like that could be put on a quantitative footing, too?  With theorems and stuff, even.  Maybe this is already a thing and I just don’t know it


nostalgebraist:

I’m sure this will be old news to some of you, but Christopher Olah’s blog on machine learning (well, mostly neural networks) is excellent.

In particular, he explains standard concepts more clearly than I’ve seen them explained elsewhere: his post on backpropagation, for instance, derives it in a way that makes it totally clear why it’s a fast way to compute derivatives in gradient descent, by starting out with the task of computing derivatives and deriving it naturally rather than starting out with unmotivated talk about “running the network backwards.”  See also his explanations of ConvNets and LSTMs.
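
To make the “it’s just an efficient way to compute derivatives” framing concrete, here is a tiny hand-rolled sketch (my own toy, not Olah’s code): one forward pass to evaluate the function, then one backward sweep of the chain rule that yields the derivative with respect to every input at once.

import math

# forward pass: f(w, x) = sigmoid(w * x)
w, x = 0.5, 2.0
a = w * x
y = 1.0 / (1.0 + math.exp(-a))

# backward pass: start from dy/dy = 1 and push derivatives back through each operation
dy = 1.0
da = dy * y * (1.0 - y)      # d(sigmoid(a))/da = y * (1 - y)
dw = da * x                  # a = w * x, so da/dw = x
dx = da * w                  # and da/dx = w
print(dw, dx)                # both partial derivatives from a single backward sweep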

OMG, the end of this one blew my mind.  Basically he says that neural nets for classification are (maybe) trying to find an embedding that “unties knots” in the data manifold so that the originally “tangled up” classes are linearly separable (the final set of weights then does linear classification in this new space).  And then he relates that to adversarial examples, and also suggests using kNN rather than linear classification in the last step, and uh, just read the post, it’s really cool

dronegoddess-deactivated2017053 asked: Say I wanted to make a graph of "rationalist Tumblr," where each node is weighted by how many other members of rationalist Tumblr they interact with and anyone below a certain threshold is considered "not part of rationalist Tumblr." What kind of statistical/ML method would I use to do this? Is this a hard problem or a simple one?

Sorry, I don’t fully understand what you want to do.  It sounds a lot like the kind of thing you get out of PageRank / eigenvector centrality, though.

What are the desired inputs and outputs here?  Like, supposing that “rationalist tumblr” is a well-defined subgraph of some larger social graph, do you want the algorithm to discover that subgraph without being told about it at the outset?  There are a variety of algorithms for breaking a graph down into multiple “communities.”  After defining the community, you could use something like PageRank / eigenvector centrality (only within the community) to get the weights.
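
In case it helps, that two-step pipeline looks roughly like this (a sketch using networkx; the bundled example graph, the seed blog, and the greedy-modularity community finder are all stand-ins for your actual data and choices):

import networkx as nx
from networkx.algorithms import community

G = nx.les_miserables_graph()               # stand-in for the real interaction graph
seed_blog = "Valjean"                       # stand-in for a blog you already know is "in"

communities = community.greedy_modularity_communities(G)
rattumb = next(c for c in communities if seed_blog in c)

# rank within the community only, so connections outside it don't count toward the weights
weights = nx.pagerank(G.subgraph(rattumb))
for node, w in sorted(weights.items(), key=lambda kv: -kv[1])[:5]:
    print(node, round(w, 3))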

I guess what’s confusing me about your question is this bit:

anyone below a certain threshold is considered “not part of rationalist Tumblr.”

The thing is, you need some notion of “who constitutes rationalist tumblr” before you can start assigning the weights you describe; otherwise you don’t know which connections to count toward the node weights and which not to.  (Being connected to me boosts your node weight only if I’m part of rattumb, but you don’t know whether or not I am part of rattumb until you know my node weight.)  If there’s an algorithm that deals with this circularity, I don’t know of it.

(PageRank deals with a similar sort of circularity, except without the hard/discontinuous cutoff: being connected to low-weight nodes only gives you a low boost to your own weight, but the boost is always nonzero.)

Bernoulli polynomials are amazing and I didn’t even know they existed two days ago!!

I guess I hadn’t heard of them because they’re not orthogonal, and I only see polynomial sequences in things like numerics where being orthogonal is the whole point.

But get this: each of the Bernoulli polynomials B_n(x) on [0,1] has degree n, but they don’t cross zero any more often as their degree gets higher.  The odd-numbered ones always cross it once and the even-numbered ones always cross it twice.  And as the degree goes to infinity, what they do is turn into perfect sinusoids, the odd ones to sin(2*pi*x) and the even ones to cos(2*pi*x).  (Actually, they get big and have to be rescaled, and also multiplied by -1 half the time, but that’s just a multiplicative factor)

And you’ve got all the usual sort of good stuff like all integrating to zero (except for B_0 = 1), and having a nice derivative rule, B_n’(x) = n B_{n-1}(x).  That relation clearly leads to some kind of factorial scaling as n goes up, and indeed the scale factor you have to use to turn them into sin and cos has a factorial in it (the scaling is like n! / ((2pi)^n)) … so I kinda wonder why they didn’t define them without the “multiplying by n every time.”  This guy does that, and it seems to work fine for him.  I guess the “multiplying by n” is there because that makes it an “Appell sequence,” which is apparently important?

(You still have the 1/(2pi)^n scaling, but maybe that is more of a “real” thing, where the n! scaling is just an artifact of the definition.  Or maybe the n! scaling is “real” too because it’s “uniquely correct” for it to be an Appell sequence)
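
Here is a quick numerical check of that convergence, for anyone who wants to poke at it (my own sketch; the rescaling is the leading term of the standard Fourier expansion of B_n, which is exactly where the n!/(2pi)^n comes from):

import numpy as np
from math import factorial, pi
from sympy import bernoulli, lambdify, symbols

# rescaled Bernoulli polynomials vs. the sinusoid they converge to on [0, 1],
# using B_n(x) ~ -2 * n!/(2*pi)^n * cos(2*pi*x - n*pi/2)
x = symbols("x")
xs = np.linspace(0, 1, 501)

for n in [4, 5, 10, 11, 20, 21]:
    B_n = lambdify(x, bernoulli(n, x), "numpy")
    scaled = -((2 * pi) ** n) / (2 * factorial(n)) * B_n(xs)
    target = np.cos(2 * pi * xs - n * pi / 2)     # cos(2*pi*x) for even n, sin(2*pi*x) for odd n
    print(n, np.abs(scaled - target).max())       # deviation shrinks roughly like 2**(-n)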

They’re related to the Riemann Zeta Function too.

Read about them on Wikipedia of course but also check this out if you want more detail

ETA: actually, the fact that you have to scale by (2pi)^n and multiply by (-1) half the time to get back sin and cos makes perfect sense … since we’re on [0,1] the sin and cos functions we get are sin(2*pi*x) and cos(2*pi*x), and those pick up a factor of 1/(2pi) every time you take an antiderivative.  And as n goes up, you’re taking more and more antiderivatives (since B_n’(x) = n B_{n-1}(x)), so in order to converge to those sinusoids, they have to get smaller and smaller like that as they go up (the sinusoids also get smaller when you anti-differentiate them).  And the -1 thing is because cos’’(x) = -cos(x), sin’’(x) = -sin(x).

You could get rid of the (2pi)^n by defining the polynomials on [0,2pi], but then you’d get factors of 2pi popping up in other places and it might not be worth it.  And you could get rid of the -1 by multiplying by i every time you take an antiderivative … that would give you a limit where B_{2n}(x) + B_{2n+1}(x) turns into exp(2 pi i x), which could be cool I guess.

I’m not sure how that exclamation mark is meant to make me feel, here


A little math problem came up in the thesis earlier today and I got obsessed with it to the point that it’s kind of hard to break away from it to read more of Seven Surrenders (!)

(not a “problem” like an obstacle to writing, a problem like a homework problem, which arose when I thought “hey it’d be cool to give the reader an example of how this equation behaves in a simple case, like one where you could get a closed form maybe”)

I don’t really need the closed form in my thesis, I can give the example just as well without it, but now the problem it raised has me obsessed, so why not share it

OK, so we have the sum (over n) from 1 to infinity of

1 - exp(-t/(n^2))

where t is some nonnegative real number.  We want a closed form for this as a function of t (like, a form that’s not an infinite sum).

I can bound this above and below by integrals, but does the sum itself have a closed form?
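
(Concretely, the standard monotone comparison gives bounds like this: since 1 - exp(-t/x^2) is decreasing in x,

\[
\sqrt{\pi t} - 1 \;\le\; \int_{1}^{\infty}\Bigl(1 - e^{-t/x^{2}}\Bigr)\,dx
\;\le\; \sum_{n=1}^{\infty}\Bigl(1 - e^{-t/n^{2}}\Bigr)
\;\le\; \int_{0}^{\infty}\Bigl(1 - e^{-t/x^{2}}\Bigr)\,dx \;=\; \sqrt{\pi t},
\]

where the closed form for the last integral comes from differentiating in t under the integral sign, which leaves a Gaussian integral, and the first inequality uses the fact that the integrand is at most 1 on [0, 1].)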

ETA: see @absurdseagull‘s reply

raginrayguns replied to your post “I am wondering about the concept of economic stability.  It seems…”

“Loss aversion is why people consider a financial product less valuable if there is a greater variance of possible outcomes (risk), even if the expected value of the outcome (reward) is the same.” Sometimes more variance makes it more valuable though, since if you win you can reinvest the gains and they’ll grow–isn’t there a sweet spot of optimal variance? Hmm I’m actually not sure
the Kelly criterion balances a tradeoff between expectation and risk in individual bets, which leads to maximum growth rate over iterated bets. You choose a lower expectation short term, but it leads to a higher expectation long term by limiting your risk
so the reasoning behind the Kelly criterion is sort of…. taking risk into account, in ORDER to maximize growth rate? So maybe…. economists are considering stuff like, the uncertainty that leads people to avoid getting educated, when they calculate the growth rate of the economy? Since fewer people getting educated leads to less growth. And so the risk considerations you’re talking about are actually contained in the growth calculation

Oh, I didn’t know about the Kelly criterion, interesting

I think there are two distinctions we need to make here.  First, the difference between value judgments made by people in an economy/model and value judgments made by economists.  Like, say the fake people inside the model are designed to be risk-averse, so they respond to future uncertainty.  And say that future uncertainty causes the model people to get less education, and so the economy grows slower.  But the model’s predictions are stochastic, so when we say the economy “grows slower” we’re actually comparing one distribution of growth trajectories to another, in some way that summarizes which distribution is “slower overall” (mean or median or w/e).  And then economists can go away and compare whatever summary statistics they want about those distributions.  The economists analyzing the model output are not required to think the same way the fake people inside the model do, so even if the model people cared about risk, that doesn’t mean the economist does.

Second (OK this is most of the post, this got longer than expected): the difference between expected utility and expected objective-physical-stuff (e.g. money).  The Kelly criterion doesn’t actually maximize your money, it maximizes your utility assuming your utility is logarithmic in money.  So it’s producing a kind of “maximal growth rate,” but it’s in abstract utility units, so it’s not the same kind of “growth rate” that people talk about IRL, the kind that can be measured in real-life monetary units.
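
Here’s a toy simulation of that point, for anyone who wants to see it in numbers (my own sketch, not something from Kelly’s paper; the even-money bet and the particular p are arbitrary): you bet a fixed fraction f of current wealth on an even-money bet you win with probability p. The Kelly fraction f* = 2p - 1 maximizes the expected log of wealth (the growth exponent), while expected wealth itself is maximized by betting as much as possible.

import numpy as np

rng = np.random.default_rng(0)
p, n_bets, n_paths = 0.6, 200, 20_000
wins = rng.random((n_paths, n_bets)) < p

for f in [0.05, 0.1, 2 * p - 1, 0.4, 0.8]:         # 2p - 1 = 0.2 is the Kelly fraction here
    log_growth = np.where(wins, np.log(1 + f), np.log(1 - f)).mean()   # E[log of per-bet multiplier]
    exact_mean = (1 + f * (2 * p - 1)) ** n_bets                       # E[final wealth], computed exactly
    print(f"f={f:.2f}  E[log growth]/bet={log_growth:+.4f}  E[final wealth]={exact_mean:.3g}")

The expected final wealth keeps growing as f approaches 1, but the growth exponent (and with it the typical long-run outcome) peaks at the Kelly fraction: two different notions of “getting the most money.”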

I read Kelly’s original paper (it’s not long) and it’s … really weird, and kinda interesting in itself.  He starts out with Shannon’s definition of the “transmission rate” of a noisy channel.  Shannon had showed that this rate was important if you are thinking about codings you could use with the channel.  Kelly wants to show that it’s of more general importance, so he presents a real-world situation where the transmission rate is important, even though there is no coding involved.  That situation is the gambler example (the gambler ends up using the Kelly criterion).

But in the introduction to the paper, Kelly also makes some complaints about “cost functions” – as far as I understand it, he says that to evaluate a channel without coding you need a utility function, and that opens up a whole bunch of possibilities and makes it hard to tie anything to information theory because you’d have to prove that the transmission rate (or entropy or whatever) would pop out of some equation for all possible utility functions.  So he wants to exhibit a real-life case where you don’t need a utility function.

So then, when he introduces his gambler, he says that the gambler wants to maximize the “exponential rate of growth of [their] capital,” which is basically the exponent r if their money grows like exp(r * t).  That seems like a sensible enough thing to maximize, but it does commit you to valuing money logarithmically.  (At any time t, multiplying your money by some constant only affects r, the thing you care about, additively.)  That seems like a utility function, and Kelly’s goal was to avoid those.  But then, in the conclusion to his paper, Kelly asserts that the gambler doesn’t really have a utility function:

The gambler introduced here follows an essentially different criterion from the classical gambler. At every bet he maximizes the expected value of the logarithm of his capital. The reason has nothing to do with the value function which he attached to his money, but merely with the fact that it is the logarithm which is additive in repeated bets and to which the law of large numbers applies.

By this I think he means that since “the logarithm is additive in repeated bets,” you can treat the logarithms of your gains from bet n to bet (n+1) as a sequence of random variables, and take a well-defined average over them … which is indeed necessary to compare different money-over-time paths that are infinitely long.  Like, you need some functional of the path that isn’t going to be zero or infinity for most of the paths you want to be comparing.  So you can kinda argue that this is just the “objective” way to attach values to different infinitely long growth trajectories.  Except … not really, because there are other functionals of the paths you could take.  Like instead of doing sums of logs of wealth, you could just sum absolute wealth up to time t=50 and ignore everything after that.  This would change the results (the law of large numbers can no longer be used), but there is no less justification for it (on “objective” terms) than for Kelly’s logarithmic rule.

ANYWAY, uh, in order to provide a recommendation for someone in a risk/reward tradeoff, the Kelly criterion assumes that they have a kind of loss aversion (logarithmic utility of money).  Which makes sense, because loss aversion is necessary to see risk vs. reward as a “tradeoff” (if we weren’t loss averse we wouldn’t care about lowering risk).  But it just so happens that the loss averse utility function it chooses, the logarithmic one, assigns utility based on growth rate (the exponent).  And this looks objective because “the growth rate” is what anyone will first think about when comparing two exponentially growing things.  But valuing the growth rate linearly is equivalent to valuing money logarithmically.  If you make the “objective” decision to value the growth rate, you’re committing yourself to a “subjective” position about how to value different amounts of money.
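
Spelled out in symbols (just restating the paragraph above): if money grows like W_t = W_0 e^{r t}, then

\[
r \;=\; \frac{1}{t}\bigl(\log W_t - \log W_0\bigr),
\qquad
\mathbb{E}[r] \;=\; \frac{1}{t}\,\mathbb{E}[\log W_t] \;-\; \frac{\log W_0}{t},
\]

so ranking strategies by expected growth rate is exactly ranking them by expected log wealth at time t; “value the growth rate linearly” and “value money logarithmically” are one commitment, not two.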

ANYWAY AGAIN, the moral I am trying to draw is that there isn’t any unique way to reduce “take risk into account” to “just get the most money” without assuming a utility function.  Like, if I understand your reply, you’re saying that the Kelly criterion looks like “how to just get the most money” and it incorporates risk, so therefore we can just care about “getting the most money” like usual in econ without having to change anything because of risk.  But in fact, to define what “getting the most money” means when money is a function of time, you need to implicitly commit yourself to some utility function (e.g. defining “most money” as “highest growth rate” commits you to logarithmic).  And then, if you’ve committed yourself to a nonlinear utility function, the utility you attach to a distribution won’t be uniquely determined by the expected value, and so every time you hear a statistic like “this bill will generate 8 billion dollars in tax revenue” you won’t be able to evaluate that outcome without knowing more about the distribution.

(Like, even if you count your utility in dollars for some reason, your expected utility is going to be \int u(x) p(x) dx and no way are the talking heads on TV going to be telling you about that particular integral.  Well, except I guess in the case where the talking heads are talking about growth rates, which are interpretable as logarithmic utilities, but only in the long-time limit, and basically this special case just seems annoyingly misleading)

Uh I hope that made sense


(Horizontal axis is time, vertical axis is energy)

Blue: unperturbed system (i.e. epsilon=0)

Yellow: perturbed system (epsilon=1)

Red: perturbed system with a tiny, should-be-negligible perturbation (epsilon=0.000001)

why