I took linear algebra as an undergrad and then took a slightly fancier version in my first year of grad school, and I understood all the “matrices <==> linear transformations” stuff, but I never really felt comfortable interpreting the actual entries of a matrix until my second year of grad school, when I learned the rule
the matrix-vector product A*v is a linear combination of the columns of A, with the coefficients given by the entries of v
I learned this from the excellent book Numerical Linear Algebra by Trefethen and Bau, and I don’t think I’ve ever heard it mentioned by anyone else outside of that book. Yet it’s been invaluable to me, and not just for numerics. Did I just miss out, or is this simple fact not disseminated widely enough?
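(A quick numerical check of the rule in numpy, as an illustration of my own rather than anything from the book; the matrix and vector below are arbitrary:)

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
v = np.array([10.0, -1.0])

# A*v as a linear combination of the columns of A, weighted by the entries of v:
print(np.allclose(A @ v, v[0] * A[:, 0] + v[1] * A[:, 1]))   # True

# Equivalently: A sends each standard basis vector to the corresponding column.
print(np.allclose(A @ np.array([1.0, 0.0]), A[:, 0]))        # True
```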
The difficult thing about teaching linear algebra (he says, procrastinating from writing the last week of notes for the linear algebra class he is teaching) is that the entire subject is, like, four actual facts, each of which is repeated twenty times in slightly different language.
And you have a great example! We could talk about:
A linear transformation, as a function with certain properties.
Matrix multiplication
A system of linear equations
A collection of dot products with the row vectors
A linear combination of column vectors
A hyperplane in some higher-dimensional space
A semi-rigid geometric transformation of some space.
A function determined entirely by what it does to some basis.
And those are all the same thing. I think typically students coming out of a (first) linear algebra class understand and have internalized a couple of those; can cite a couple others; and are completely oblivious to the rest. (And they may not have heard of some, because it’s hard to cover all eight; I know that my discussion of the geometric properties has been somewhat perfunctory.)
But for any given person, some of these perspectives will make much more sense than others; and if your class doesn’t get you to the ones that work for you, you won’t understand nearly as much as if it does.
(The goal, of course, is to understand all of the perspectives, and to switch among them fluently, but that’s hard and definitely not happening in a first course. So you have to pick your focuses. The reason I was so unhappy with my college’s choice of textbook is that its focus is exactly the opposite of what I would like.)
So, for instance, you say that the matrix product is a linear combination of the columns of the matrix, with coefficients given by the input vector. And you say that, and I think for a few seconds and say “huh, I guess that’s true.” But that’s not how I think about it; I think about it as a function that sends each standard basis element to the corresponding column vector.
Except those are literally the exact same thing. You write your input as a linear combination of your standard basis vectors, and then your function preserves linear combinations, and sends each basis vector to the corresponding column—so you get a linear combination of the column vectors.
And I think the thing I just said is pretty common to mention. It’s certainly necessary for doing any sort of change-of-basis stuff. But if it made more sense to you in different language, that’s 100% unsurprising.
If you’re interested, here’s a stab at describing why I find the columns thing so useful.
In a lot of physics-like contexts, it’s natural to write vectors with respect to a basis which has a special physical importance, but whose basis vectors don’t. For instance, the position x(t) of a damped harmonic oscillator obeys
x’(t) = v
v’(t) = -(cv + kx)
This can be written as a matrix equation y’(t) = Ay, with y = (x, v)^T.
There is clearly something uniquely nice about the basis being used here. x is position and v is velocity, and it’s easier to interpret a solution written in terms of x and v than one in terms of (say) x + 2v and x - v. In fact, since x and v are what you can actually measure, you have to specify how to transform to this particular basis, or you lose the physical meaning.
On the other hand, the basis vectors have little physical importance. One basis vector is a state with zero velocity, and the other is a state with zero position, and there isn’t any interesting physical connection between the state (x, v) and the states (x, 0), (0, v). So there’s no physical intuition you can attach to the question “how does A act on (x, 0)?”
In this type of problem, there is a different basis whose basis vectors have special physical meaning: the eigenbasis of A (each eigenspace is closed under time evolution and grows/decays/oscillates at rates given by the eigenvalue). But you wouldn’t typically write down a problem in that basis at the outset, because you want to give the reader the directly measurable quantities first.
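(For concreteness, here’s a minimal sketch of this setup with made-up values of c and k, showing the columns of A as the images of the two basis states and the eigenvalues that govern the eigenbasis:)

```python
# The damped oscillator written as y' = A y, with y = (x, v)^T.
# c and k are arbitrary illustrative values.
import numpy as np

c, k = 0.5, 2.0
A = np.array([[0.0, 1.0],    # x' = v
              [-k,  -c]])    # v' = -(c v + k x)

# The columns of A are exactly the images of the basis states (1, 0) and (0, 1):
print(A @ np.array([1.0, 0.0]))   # first column
print(A @ np.array([0.0, 1.0]))   # second column

# Eigenvalues: a complex-conjugate pair with negative real part,
# i.e. decaying oscillation, as expected for an underdamped oscillator.
print(np.linalg.eigvals(A))
```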
Now that I think about it, the above makes (x, v)^T feel like a covector: we naturally think about its coefficients (“how it acts on the basis elements”), not its decomposition with respect to some basis. That suggests that this might all be less confusing if we wrote everything with row vectors instead of column vectors. Vectors would multiply matrices on the left rather than the right, and we would naturally read this off as the vector transforming the matrix rather than vice versa. But for whatever reason, column vectors are standard.
In fact, Trefethen and Bau’s comments on the columns thing can be viewed as trying to correct for the psychological effect of using column vectors instead of row vectors:
Will think more about this in the morning, after I’ve gotten some sleep. But my first reaction is to think that in some sense things are backwards. You have the matrix equation y’ = Ay, and you want to find y. So really your equation is A^(-1) y’ = y.
Basically, I think your second chunk is right; the reason this is feeling unnatural to you is that you’re never using the matrix to plug in y and get y’. Instead, you’re saying that the matrix gives you a (parametrized family of) functions, so you’re saying “I want to know position and velocity, and if I have this matrix I get that family of functions.”
You can always perform this sort of sleight of hand, of course. Any time you have a family of functions f_k: A -> B, you could instead think of this as a family of functions A_a: F -> B. Or as a single function from (F x A) -> B.
But if you find yourself doing this thing a lot, I can see why you’d want to think of linear algebra in a way I find slightly odd. (And I still don’t think I’ve totally wrapped my head around the way you’re thinking of it, so I may come back and revisit this thought later on).
Hm, I’ve thought about this a bit more and I think I figured out why this is weird for me, but I haven’t quite understood it yet. But basically, I don’t think I would think of “position and velocity” as a “basis” at all. x and v aren’t numbers; they’re functions.
You’re working on an infinite-dimensional function space squared; a basis for the whole space will have way more than two elements. And you’re right that the partition into “the position and the velocity” is more natural to the problem than the division into, like, “the position plus the velocity and the position minus the velocity”. But that doesn’t have anything to do with them being a “basis”, which they’re not.
Unless I guess our field of scalars is a function field or something? But that setup is weird enough that I don’t trust myself to understand it on this little sleep either.
Hmmm. I’m not sure I understand the first part here, but if this is relevant, I don’t think the key point here depends on the fact that we’re solving an ODE. A plain old matrix equation with the same feature is the Leontief model in economics, which models how much a set of industries has to produce when their products can be factors of production. It reads
x = Ax + d
where x_i is the quantity produced of good i, d_i is the quantity of good i demanded from outside, and A_ij is the quantity of good i needed to produce 1 unit of good j.
Here the x_i and d_i are just real numbers, but we still have the fact that it’s not very (economically) meaningful to think of the production x as “composed” of basis vector states where only one good is produced, or the demand d as “composed” of states where only one good is demanded. These generally don’t arise in reality, so we don’t care about them per se.
(If we start with a solution x = (I-A)^(-1) d for some particular d, and we want to look at how a change to d will affect x, then our basis vectors are more meaningful, because it’s natural to imagine changing demand for one good in isolation. But this is really a different question than the one the original problem asked – we’re now asking about a tangent plane to the solution surface of the original problem. It just so happens that this question uses the same equation as the original one, due to linearity.)
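For concreteness, here is a toy numerical version of the Leontief setup (the coefficients and demands below are invented), solving x = (I-A)^(-1) d directly:

```python
# Leontief model: x = A x + d  =>  (I - A) x = d.
# A[i, j] = units of good i needed to produce one unit of good j.
import numpy as np

A = np.array([[0.1, 0.4],
              [0.3, 0.2]])
d = np.array([10.0, 5.0])          # outside demand for each good

x = np.linalg.solve(np.eye(2) - A, d)
print(x)                           # total production needed to meet the demand

# Sanity check: production covers intermediate use plus outside demand.
print(np.allclose(x, A @ x + d))   # True
```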
I took linear algebra as an undergrad and then took a slightly fancier version in my first year of grad school, and I understood all the “matrices <==> linear transformations” stuff, but I never really felt comfortable interpreting the actual entries of a matrix until my second year of grad school, when I learned the rule
the matrix-vector product A*v is a linear combination of the columns of A, with the coefficients given by the entries of v
I learned this from the excellent book Numerical Linear Algebra by Trefethen and Bau, and I don’t think I’ve ever heard it mentioned by anyone else outside of that book. Yet it’s been invaluable to me, and not just for numerics. Did I just miss out, or is this simple fact not disseminated widely enough?
Eh, the correct view of the entries of a matrix is through Hom(V) \simeq V^* \otimes V. The ik-component corresponds to the 1-form that is 1 on the k-th basis vector, tensored with the i-th basis vector. That is, it’s how much e_i you get out when you put in e_k. If you group components by their respective 1-forms you get the view you mention.
For finite-dimensional vector spaces and their bundles it’s essential to become comfortable with how a tensor, being a machine that makes scalars from vectors and 1-forms, by reflexivity can be seen as transforming tensors of one type into another. E.g. the Riemann 4-tensor is most naturally seen as a map from 2-forms to 2-forms.
All of this is very transparent in Einstein notation and I feel sorry for anyone who has to do multilinear algebra without it.
The “correct view” depends on the subject. In numerical linear algebra, for instance, your algorithm is handed matrices and vectors in coordinates, and a lot of really important things can only be defined with respect to a basis, like LU and QR decompositions. (These involve triangular matrices, and there is no intrinsic notion of “triangular” or “non-triangular” linear transformations, since one can make any linear transformation triangular with respect to some basis via the Schur decomposition.)
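(As a quick check of that last parenthetical, here’s a sketch using scipy’s Schur routine on a random matrix; the matrix and basis are arbitrary.)

```python
# Any square matrix is upper triangular in some orthonormal basis
# (complex Schur decomposition: A = Z T Z^H with T upper triangular).
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))

T, Z = schur(A, output="complex")
print(np.allclose(np.tril(T, k=-1), 0))      # True: T is upper triangular
print(np.allclose(Z @ T @ Z.conj().T, A))    # True: same transformation, new basis
```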
I’ve been reading some E. T. Jaynes lately (parts of PT:LoS I hadn’t read plus some of his papers). I think I may have overestimated his philosophical ambitions in the past, probably because I didn’t separate him and Yudkowsky clearly enough in my mind.
Jaynes is unusual in that he’s very pragmatic-minded yet also very anti-eclectic. A lot of pragmatic people will pick and choose methods from different schools of thought, using one here and another there, on the principle of “whatever works.” Jaynes also adopts the principle of “whatever works,” but he is convinced that his preferred method always works best in every case. Unlike many texts on Bayesianism, his big book is not focused on arguments like Dutch Books that try to establish Bayesian superiority in the general case once and for all; instead he gives the reader an endless succession of concrete, quantitative “problems,” and shows again and again how the (Objective) Bayesian methods are faster, cleaner, easier, more robust, etc. than some alternatives. (“If you juggle the variables and get the right priors / it’ll pull your butt out of many a fire … ”)
This focus on “problems” should give pause to anyone who, like Yudkowsky, wants to base a whole epistemology on Jaynes. The more I read Jaynes, the more it seems like he was interested in giving practical advice to working scientists, rather than in giving a systematic account of “how science works.” The title “Probability Theory: The Logic of Science” makes the book sound like it’s trying to tell you how science works, but what he means by “the logic of science” is really “the logic of working scientists”: he gives a systematic and rigorous account of the kind of reasoning scientists use in practice when they have to estimate a spectrum or derive a specific heat or whatever, without saying this can be patched together to form a full picture of the scientific enterprise.
This is not always clear in his writing about Bayesianism per se, but it’s very clear in his writing about Maximum Entropy methods. Jaynes was an Objective Bayesian, meaning he thought that prior distributions were not a matter of personal choice, that they could be deduced objectively and that two people “with the same information” ought to have the same distribution. His recipe for making prior distributions had two parts: non-informative priors and MaxEnt.
Non-informative priors are a really cool and kind of spooky thing where you can deduce the exact form of a distribution just from the transformation properties it must have, and thus deduce a unique (!) prior distribution compatible with the information “I know this is a standard deviation of something, but I have no clue what it is.” So that’s what you start out with when you know as little as you possibly could. When you know more than that, Jaynes says you should incorporate this by using MaxEnt, which tells you (roughly speaking) how to form the “equivalent of” a uniform distribution if you’re restricted to only use distributions with certain constraints.
So far, so good, but where do the constraints come from? Jaynes always assumes that our prior knowledge comes in the form of exact constraints on the mean of our prior distribution. This is a very natural thing to do in statistical mechanics, which Jaynes wrote a lot about, but as many people have noted, it is very strange as a principle of general inference. Our prior distribution (as Jaynes keeps reminding us) is meant to represent our state of knowledge about the world, not some feature of the world itself (except incidentally). It is hard to imagine a case in which we have evidence saying that, although the world could be many different ways, our internal expression of our knowledge about it must make a certain average prediction. Indeed, Jaynes belabors this very point on p. 40 of this article, while arguing against a claim that MaxEnt and Bayes were inconsistent: he says they cannot be inconsistent because the empirical information which goes into Bayes – observed frequency counts, for instance – does not take the form of an assertion about your distribution. I agree! But this only makes it more mysterious where these assertions do come from.
In practice, when Jaynes solves a problem with MaxEnt, he either chooses a textbook-ish problem in which the constraint is simply asserted as part of the problem, or he chooses problems where your prior is supposed to match observed frequencies so that the constraint rule is less bizarre. Here’s an example of the latter. On pp. 48-63 of the same paper, he analyzes empirical frequency counts from a possibly-biased 6-sided die by first making physical arguments about the sorts of bias that are likely to arise in a die. These take the form of constraints on functions of the probabilities assigned to the six faces, with some undetermined parameters corresponding to the extent to which the die is weighted. These physical arguments are not about states of knowledge; they only happen to carry over to the prior in this case because our “state of knowledge” about the result of a given roll is supposed to line up in a particular way with the physical form of the die. He then tries MaxEnt with one constraint, then with two, in each case estimating the parameters by using sample means as exact constraints, and doing a chi-squared test for goodness of fit; once he has imposed two constraints, the test doesn’t reject the MaxEnt distribution and he declares success.
He immediately addresses the obvious concerns with this procedure, for instance about the interpretation of sample means as constraints. He shows that among probability distributions with some constraint on these means, the one which gives the data the highest likelihood is the one where the constraints are set equal to the sample means. This is not surprising (a model perfectly tuned to the observations will assign them high likelihood), but it assumes at the outset that we are supposed to set constraints on means, which is not obvious. Indeed, this approach falls prey to the same problem with maximum likelihood that Jaynes identifies in Section 13.9 of PT:LoS, where he shows that it is equivalent to estimation with an all-or-nothing loss function:
The maximum-likelihood criterion is the one in which we care only about the chance of being exactly right; and, if we are wrong, we don’t care how wrong we are. This is just the situation we have in shooting at a small target, where ‘a miss is as good as a mile’. But it is clear that there are few other situations where this would be a rational way to behave; almost always, the amount of error is of some concern to us, and so maximum likelihood is not the best estimation criterion.
Typically Jaynes prefers the estimates given by a squared-error loss function, which are means rather than modes of the posterior. This has a nice regularization effect, and corresponds to the idea of mixing pure likelihood and background knowledge so that you don’t make overly radical, overfitting jumps based on small amounts of data. But that is precisely what Jaynes advocates when using MaxEnt: he specifically asserts that the sample means can be used as constraints whether they are taken over many observations or just a few.
A very simple example illustrates the problem with this. Suppose we roll a six-sided die once and observe a 6. Taking literally the idea of using small-N sample means as constraints, we are forced to pick our MaxEnt distribution from those distributions with mean 6, and there is only one such distribution, the one that assigns probability 1 to the 6 face and 0 to the other faces. Obviously this is absurd to take as a prior (if you ever roll your die and get anything but a six, you will have to divide by zero in Bayes’ rule). I am sure that in this case Jaynes would say that we really know additional prior information about dice, e.g. that “dice in general” have means around 3.5 and so we should use that in our constraint. But this does not have to be about a die; it could be some totally abstract multinomial process which we know nothing about at the outset, and still this would be a rash and bad inference.
(Jaynes says that uncertainty about the mean <f> can be supplied to MaxEnt via a constraint on <f^2>, but that doesn’t help here, as our N=1 sample has zero variance, and anyway you can’t get a mean of 6 without zero variance.)
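(To make the mechanics concrete: the MaxEnt distribution on a die under a mean constraint is the exponential family p_i ∝ exp(λi), with λ solved for numerically. Here’s a small sketch of my own, not Jaynes’ code, showing how the distribution collapses onto the 6 face as the constrained mean approaches 6.)

```python
# Maximum-entropy distribution on faces 1..6 subject to a constraint on the mean:
# p_i proportional to exp(lam * i), with lam chosen to hit the target mean.
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

def maxent_die(target_mean):
    def mean_gap(lam):
        w = np.exp(lam * faces)
        return (faces * w).sum() / w.sum() - target_mean
    lam = brentq(mean_gap, -50, 50)   # works for target means strictly between 1 and 6
    w = np.exp(lam * faces)
    return w / w.sum()

print(maxent_die(3.5))    # uniform (lam = 0)
print(maxent_die(4.5))    # Jaynes' classic example: tilted toward the high faces
print(maxent_die(5.99))   # nearly all mass on 6; a mean of exactly 6 is degenerate
```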
This all makes more sense in light of Jaynes’ pragmatic focus. If we want to provide a completely systematic set of rules for inference – so that they could be programmed into the hypothetical “robot” Jaynes frequently discusses – then we must worry about “obviously bad” inferences like the one above. The robot will follow the rules we give it even when the results are “obviously bad” (it has no notion of obviousness outside of the rules). But if we are giving practical advice to human scientists, we don’t need to give them a rule telling them not to do the thing I just described, because they wouldn’t try it in the first place. They would gather more than one data point before using MaxEnt – which contradicts the claim that MaxEnt works for any N, but can be justified by demanding that we write down a “sensible” problem, like the textbook-ish ones Jaynes uses, before we apply his methods.
Indeed, I don’t think Jaynes ever clarifies what specific mixture of Bayes and MaxEnt his robot is supposed to use; there might be some recipe which would avoid pitfalls like the above, but Jaynes does not seem very interested in it. As PT:LoS goes on, he says less and less about systematic rules and the robot, and focuses more and more on “solving problems” in the manner of a physics lecturer doing board work. The real intended user of his methods is a human physicist, not a robot, and he is satisfied with methods that work well when judiciously applied, even if they are not foolproof (and thus not a complete theory of inference).
A final example of Jaynes’ pragmatism: I was surprised to find, in a late (1991) paper, Jaynes happily conceding something which I had always thought was a knock-down point against the Bayesian approach. I had always made a big fuss over the fact that the approach doesn’t tell you how to modify your hypothesis space if it is not complete to start with, or how to re-distribute probabilities after modifications. Jaynes advocates using MaxEnt to deal with “refinements,” i.e. breakdowns of the possibilities into finer and finer details. At one level of description you might apply MaxEnt over possible structures of a crystal; then, to the possible arrangements of molecules within the crystal; then to possible arrangements of atoms, and so on (cf. p. 15 here). But this doesn’t work if you do not have exhaustive knowledge of the possibilities on any one level. In Section 6 of the 1991 paper, Jaynes admits he finds it awkward that MaxEnt requires a hypothesis space, and hopes for a development that will extend his theory to cases without one:
Our bemusement is at the fact that in problems where we do not have even an hypothesis space, we have at present no officially approved way of applying probability theory; yet intuition may still give us a strong preference for some conclusions over others. Is this intuition wrong; or does the human brain have hidden principles of reasoning as yet undiscovered by our conscious minds? We could use some new creative thinking here.
What’s the point of gathering data about me if you’re going to show me ads aimed at “people aged 18-34 living in the U.S.”?
I don’t have an authoritative answer to this question, but it reminds me of a line I’ve heard a lot in data science – that you tend to get the most leverage from exploiting relatively “basic” information, and that clever, fancy stuff tends to work less well than you’d expect.
Random example: one of the people who won the Netflix Prize (for predicting how users would rate movies) wrote
Out of the numerous new algorithmic contributions, I would like to highlight one – those humble baseline predictors (or biases), which capture main effects in the data. While the literature mostly concentrates on the more sophisticated algorithmic aspects, we have learned that an accurate treatment of main effects is probably at least as significant as coming up with modeling breakthroughs.
(See here, here.) By “baseline predictors,” they mean terms in their equation that capture the (possibly time-varying) average rating for each movie, or average rating given by each user, without looking at user-movie interactions. Intuitively, this information feels “uninteresting” and perhaps even unrelated to the real problem – as humans we care about a user’s personal preferences between different movies, and average ratings (“most people hate this movie” / “this user gives everything 4 or 5 stars”) are part of the background that our minds intuitively “subtract out” and treat as obvious.
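A toy sketch of what such baseline predictors look like (this is my own simplified version of the idea, not the prize-winning code; the tiny ratings table is invented):

```python
# Baseline model: predicted rating = global mean + item bias + user bias,
# with no user-item interaction terms at all.
import pandas as pd

ratings = pd.DataFrame({
    "user":   [0, 0, 1, 1, 2],
    "item":   [0, 1, 0, 2, 1],
    "rating": [5, 4, 2, 1, 3],
})

mu = ratings["rating"].mean()                                  # global average
item_bias = ratings.groupby("item")["rating"].mean() - mu      # "most people hate this movie"
resid = ratings["rating"] - mu - ratings["item"].map(item_bias)
user_bias = resid.groupby(ratings["user"]).mean()              # "this user rates everything high"

def predict(user, item):
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)

print(predict(0, 2))   # a high-rating user, on a movie people rate low
```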
Yet most of the variance in the data is in those averages, and some recommender systems have not included them. Thus, a lot of the performance advantages of the state-of-the-art systems can be attributed to successfully doing things that seem extremely crude and obvious to humans (where previous systems hadn’t even done that).
Analogously, it would not surprise me if getting a system to successfully identify “people aged 18-34 living in the U.S.” actually gives you a huge boost beyond just showing the same ads to everyone, while more micro-targeted strategies that intuitively seem more powerful would at best give you a small boost over this baseline, and have more trouble with robustness/overfitting/dealing with real-world phenomena never seen in the training data.
what’s the most time-efficient zero-money way to level up in math? khan academy, going through the exercises in an old textbook, youtube + anki, something else? i flunked calculus a decade ago and have always felt a sense of inferiority (and, less negatively and narcissistically, that there are beautiful horizons lying just beyond my grasp, &c.) since
I look around for textbook recommendations and then pirate them these days. I did khan academy for a while and liked it then but i’m not sure i’d endorse it now- i didn’t come away with an actual understanding of much of what i did there and i’m not sure how much of that to blame on khan academy’s flaws and how much to blame on my having no idea how to learn things back then. it seems like a decent source for high school algebra, less so for anything beyond that.
tbh i’d advise not sticking too hard to a particular method/source- browsing around for an alternate explanation of a thing you aren’t really getting is a good habit to have, really.
if you’re looking for places to start, so long as you have a reasonable grounding in high-school-level algebra i’d advise doing discrete mathematics before calculus or any of that. discrete math courses/books require very little existing mathematical skill and teach logic/proofs/set theory/what a function is/etc which are vital for Actually Understanding basically any kind of math, and then go into the basics of fun things like graph theory and counting. basically a course in actually thinking about math instead of just memorizing formulas. (and it’s useful for programming, if you’re into that)
uh i hope this was somewhat helpful
actually I hadn’t even considered doing anything other than “the next thing you’re supposed to learn” (as gotten from a vague sense of how it’s taught in schools) for where to get back on the train, so this is actually extremely useful - i’ll keep that in mind as an additional question to think about as i google around, and certainly will look into discrete maths as a place to start unless you’re some yahoo who is yelling against a clear consensus otherwise. (not that yahoos yelling against the consensus aren’t often right; just that i’m a layperson who doesn’t have any other heuristics to work on.)
and i am grounded in high school algebra and want to improve my rudimentary programming skills too, so there’s that!
(also multiplicity of methods is just common sense on reflection, yeah)
This is not discrete math, but I recommend it for the same reasons @maddeningscientist mentioned discrete math – I’m a big fan of Ray Mayer’s “Intro to Analysis” course notes. They’re really a version of the “intro to rigor” class about proofs and stuff that appears somewhere in the math major curriculum at most colleges, but they’re a version that’s billed as a freshman-level class and doesn’t assume much background, and they take you on this whole journey based on a single set of axioms, where you start out proving things like “2 times 0 equals 0” and “0 is less than 1” and end up defining real and complex numbers, derivatives, infinite series, etc.
Taking some sort of course like this (or the equivalent) will make a lot of other math resources more accessible, and I really like this one.
Also, another recommendation – as you learn more, try to get a sense of what you are and aren’t curious about, and learn what’s interesting as opposed to what seems like a big topic or “next in the sequence.” This is because there’s a lot of mid-level math out there (e.g. any class called “Real Variables”) that amounts to really careful/nitpicky definitions that only matter in “unusual” or “pathological” cases, which are incredibly important to pure mathematicians (because they act as counterexamples to seemingly intuitive conjectures) but not at all useful if you care about how “typical” mathematical objects behave (which is what you’ll generally care about in, say, physics and other applications). It’s very easy to end up getting lost in technicalities that are very important from some perspectives and useless pedantry from others.
Seems like when someone writes like, “we care about this thing, so we used the standard quantitative measure of this thing,” @nostalgebraist is in the habit of asking, “why’s that standard?” Especially if that measure has some aura of goodness or rightness about it, that makes you question whether it’s being used for intellectual reasons.
One such question was, why do statistics people always measure distance between two distributions using Kullback-Leibler divergence? Besides, you know, “it’s from information theory, it means information.”
Above, I’ve illustrated the difference between using KL divergence, and another measure, L2 distance. I’ve shown a true distribution which has two bell curve peaks, but the orange and purple distributions only have one, so they can’t match it perfectly. The orange distribution has lower L2 distance (.022 vs .040), and the purple curve has lower KL divergence (2.1 vs 3.0). You can see that they’re quite different:
the orange low-L2 one matches one peak of the true distribution, but has the other one deep in the right tail
the purple low-KL one goes between them and spreads itself out, to make sure there’s no significant mass in the tails
And this makes a real practical difference – using KL divergence actually is not always appropriate. When I’m doing statistical estimation, I often have a model for the data, but I don’t expect every data point to follow the model. So I expect the true distribution to have one peak which fits my model, plus some other stuff. So I don’t want to do maximum likelihood estimation, which is heavily influenced by that other stuff. And maximum likelihood estimation is actually choosing a model by minimizing a sample-based estimate of KL divergence. Instead, I minimize a sample-based estimate of L2 distance – this is called L2 estimation, or L2E. (some papers about it here.) That way when I’ve inferred the parameters of my model, it matches the “main” peak of the data, and is robust to the other stuff.
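Here’s a rough sketch of that contrast (my own toy setup, not from Scott’s papers): fit the location of a N(μ, 1) model to data that is mostly N(0, 1) plus some contamination, once by maximum likelihood (the sample mean) and once by minimizing the sample-based L2 criterion.

```python
# L2E for a normal location model: minimize  ∫ f_mu^2 dx - 2 * mean(f_mu(x_i)),
# a sample-based estimate (up to a constant) of the L2 distance between the
# model density and the true density.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 900),    # the "main" peak my model is for
                       rng.normal(8, 1, 100)])   # the "other stuff"

def l2e_objective(mu):
    # For N(mu, 1), the first term is 1/(2*sqrt(pi)) and doesn't depend on mu.
    return 1 / (2 * np.sqrt(np.pi)) - 2 * np.mean(norm.pdf(data, loc=mu, scale=1))

mle = data.mean()                                              # pulled toward the contamination
l2e = minimize_scalar(l2e_objective, bounds=(-5, 5), method="bounded").x

print(round(mle, 2), round(l2e, 2))                            # MLE near 0.8, L2E near 0
```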
The invention of L2E is actually informative about how standard KL divergence really is, because it was invented by someone in a statistical community where L2 distance is standard: non-parametric density estimation – think histograms and kernel density estimators. The guy is actually David Scott, who’s also known for “Scott’s rule” for choosing the bin width of a histogram, which you may have used if you’ve ever done “hist(x, breaks=‘scott’)” in R. Scott’s rule starts by looking at the mean and standard deviation of your sample, and then gives you the bin width that would be best for a sample of that size drawn from a normal distribution with that mean and sd. And how’s “best” quantified? It’s expected L2 distance between that normal distribution and the resulting histogram. Most papers you see on histograms and kernel density estimators will use L2 distance. He came up with L2E just by asking the question: what if we took the measure of fit used in nonparametric density estimation and applied it to parametric models?
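(For reference, the rule itself comes out to roughly h = 3.49·s·n^(-1/3); here’s a quick check against numpy’s built-in version on made-up data.)

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)

h = 3.49 * np.std(x) * len(x) ** (-1 / 3)        # Scott's bin-width formula
edges = np.histogram_bin_edges(x, bins="scott")  # numpy's version of the same rule
print(h)
print(edges[1] - edges[0])   # close to h (numpy rounds to a whole number of bins)
```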
This is really interesting, thanks. Especially the connection of MLE downsides to K-L downsides.
One thing that gets mentioned as a good quality of K-L is that it’s invariant to changes of coordinates. L2 divergence doesn’t have this (I think? the squares ruin it, you get a squared factor and the “dx” can only cancel half of it). How much of an issue is this in practice? Like, it seems bad if you can totally change the distributions you get by squishing and stretching your coordinate system, but I guess if you have a really natural coordinate system to begin with … ?
Also, this made me think about how a sample distribution is going to have better resolution near the peak than in the tails, which could be one justification for caring more about the fit near the peak. It seems like that could be put on a quantitative footing, too? With theorems and stuff, even. Maybe this is already a thing and I just don’t know it
re: change of coordinates, some observations about how L2 ends up being used:
in research on histograms and kernel density estimators, the problem is often to choose a bin width (for a histogram) or a bandwidth (for kernel density estimators), which are usually constant. So, then you’ve got the question, in what coordinate system does constant bin width/constant bandwidth make sense? One where the smoothness of the distribution is sufficiently close to constant, I guess.
I use L2E a lot and don’t think about the coordinates much. Usually I end up plotting the L2E fit over a histogram and being like, “yeah, that looks good.” If the distribution had really sharp spikes, or really long sparsely populated tails, I guess my histogram would look like crap and I might consider changing the coordinate system?
re: resolution near peaks vs near tails: yeah I guess but if we actually know the true form of the distribution MLE is more efficient. Scott describes the efficiency of L2E relative to MLE as similar to the efficiency of the median relative to the mean. So the responsiveness of MLE to the tails must be helping it, if we can trust that those values actually came from the theoretical distribution we’re fitting.
This kind of makes sense. Think of the extreme case of a uniform distribution on (μ-½, μ+½). Two data points near the edges completely determine μ, whereas two data points in the middle leave it ambiguous.
But maybe the uniform distribution is a bad example since it doesn’t seem to really generalize. In the case of a normal distribution with known sd, it doesn’t seem to matter where the data comes from. With a uniform prior over the mean, the posterior always has the same spread–always normal with variance equal to the sampling variance over the sample size. Doesn’t matter if you observe 2 data points and they’re both 0, or if one is -2 and the other 2–they both pin down an answer of “μ is around 0” with exactly equal confidence. That’s actually really weird now that I think of it.
Oh, and here’s a case where it’s the other way around. Consider the maximum likelihood estimate of location for a Cauchy distribution. We’re going to try and minimize the negative log likelihood, which we do by solving
∑ 2(x_i-μ) / [(x_i-μ)^2 + 1] = 0
Each term is weighted inversely by its distance from the center. (sorry for saying this was the loglikelihood itself in an earlier version of the post!)
And this kind of looks like L2E in a way (don’t worry about the tails), and I think that this isn’t a coincidence. When I’m using L2E, I’m considering data points to have different reliability. Some tell me about the “primary mechanism” which I have a model for, whereas others don’t because they came from some other process. This is similar to how you can sample from a Cauchy distribution just by sampling from a normal distribution, but each time choosing the precision (1/variance) according to a chi-squared distribution with 1 degree of freedom. This captures that idea of “variation in reliability.” And the ones farthest from the center are likely to be the low-precision samples which carry little information.
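(A quick simulation check of that scale-mixture fact; the sample size is arbitrary:)

```python
# Draw a precision from chi-squared(1), then a normal with that precision;
# the result should be a standard Cauchy (equivalently, a t with 1 df).
import numpy as np
from scipy.stats import cauchy, kstest

rng = np.random.default_rng(2)
precision = rng.chisquare(df=1, size=100_000)
x = rng.normal(size=100_000) / np.sqrt(precision)

print(kstest(x, cauchy.cdf))   # p-value should not be small: consistent with a standard Cauchy
```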
Hmm … I guess my idea about low resolution in tails only makes sense if you’re asking the question “how much does one function (the empirical PDF or CDF) look like another one (some theoretical PDF or CDF),” as opposed to “how likely is this sample to have come from this theoretical distribution?” The latter is literally maximum likelihood, and the former seems like roughly what L2E does.
From the “how similar are these functions” perspective, extreme values aren’t really special. Say you’re comparing your empirical PDF to, say, a standard normal PDF. If you see an individual data point at, say, -100, or -1000, this doesn’t actually “make the functions look less similar” by very much: the normal PDF is basically zero at both -100 and -1000, so you’ll take a hit from the empirical PDF being nonzero there, but with an appreciable sample size a single point won’t make it very much larger than zero, and what’s more, it hardly matters whether the point is -100 or -1000 (or -1e12), since you’re comparing to “basically zero” in all those cases.
By contrast, in maximum likelihood, you really care about those points because it’s extremely unlikely that you’d observe them if sampling from the standard normal, and -1000 is vastly more unlikely than -100. The idea of “resolution” I had in mind doesn’t apply here; you aren’t trying to say how confident you are about the shape of the function there. Like it’s probably true that the true PDF doesn’t have some weird little bump at -1000, it’s probably more continuous than that, but that single observation still gives you a huge amount of information about the relation between the true distribution and your hypothetical one.
It makes sense that L2E would be used for histograms, because with a histogram you really do want it to “look like the function,” since you’re going to be … looking at it, and treating it like it’s the function.
It seems like the resolution concept would be most relevant in comparing two empirical distributions, since there you don’t know the true probability of anything. And the K-S test is used a lot for that, and it is less sensitive to extreme values, although I don’t know if there’s a principled connection between those two facts. (Sometimes people say this is a flaw in the K-S test and use corrections or other tests because they want more sensitivity to extreme values.)
The motivation for this post was a tumblr chat conversation I had with @youzicha. I mentioned that I had been reading this paper by John L. Horn, a big name in intelligence research, and that Horn was saying some of the same things that I’d read before in the work of “outsider critics” like Shalizi and Glymour. @youzicha said it’d be useful if I wrote a post about this sort of thing, since they had gotten the impression that this was a matter of solid mainstream consensus vs. outsider criticism.
This post has two sides. One side is a review of a position which may be familiar to you (from reading Shalizi or Glymour, say). The other side consists merely of noting that the same position is stated in Horn’s paper, and that Horn was a mainstream intelligence researcher – not in the sense that his positions were mainstream in his field, but in the sense that he is recognized as a prominent contributor to that field, whose main contributions are not contested.
Horn was, along with Raymond Cattell, one of the two originators of the theory of fluid and crystallized intelligence (Gf and Gc). These are widely accepted and foundational concepts in intelligence research, crucial to the study of cognitive aging. They appear in Stuart Ritchie’s book (and in his research). A popular theory that extends Gf/Gc is known as the “Cattell–Horn–Carroll theory.”
Horn is not just famous for the research he did with Cattell. He made key contributions to the methodology of factor analysis; a paper he wrote (as sole author) on factor analysis has been cited 3977 times, more than any of his other papers. Here’s a Google Scholar link if you want to see more of his widely cited papers. And here’s a retrospective from two of his collaborators describing his many contributions.
I think Horn is worth considering because he calls into question a certain narrative about intelligence research. That narrative goes something like this: “the educated public, encouraged by Gould’s misleading book The Mismeasure of Man, thinks intelligence research is all bunk. By contrast, anyone who has read the actual research knows that Gould is full of crap, and that there is a solid scientific consensus on intelligence which is endlessly re-affirmed by new evidence.”
If one has this narrative in one’s head, it is easy to dismiss “outsider critics” like Glymour and Shalizi as being simply more mathematically sophisticated versions of Gould, telling the public what it wants to hear in opposition to literally everyone who actually works in the field. But John L. Horn did work in the field, and was a major, celebrated contributor to it. If he disagreed with the “mainstream consensus,” how mainstream was it, and how much of a consensus? Or, to turn the standard reaction to “outsider critics” around: what right do we amateurs, who do not work in the field, have to doubt the conclusions of intelligence-research luminary John Horn? (You see how frustrating this objection can be!)
So what is this critical position I am attributing to Horn? First, if you have the interest and stamina, I’d recommend just reading his paper. That said, here is an attempt at a summary.
I disagree with several parts of this, but on the whole they’re somewhat minor and I think this is a well-detailed summary.
Note how far this is from Spearman’s theory, in which the tests had no common causes except for g!
Moving from a two-strata model, where g is the common factor of a bunch of cognitive tests, to a three-strata model, where g is the common factor of a bunch of dimensions, which themselves are the common factor of a bunch of cognitive tests, seems like a natural extension to me. This is especially true if the number of leaves has changed significantly–if we started off with, say, 10 cognitive tests, and now have 100 cognitive tests, then the existence of more structure in the second model seems unsurprising.
What would actually be far is if the tree structure didn’t work. For example, a world in which the 8 broad factors were independent of each other would totally wreck the idea of g; a world in which the 8 broad factors were dependent, but had an Enneagram-esque graph structure as opposed to being conditionally independent given the general factor would also do so.
When it comes to comparing g, Gf, and Gc, note this bit of Murray’s argument:
In diverse ways, they sought the grail of a set of primary and mutually independent mental abilities.
So, the question is, are Gc and Gf mutually independent? Obviously not; they’re correlated. (Both empirically and in theory, since the investment of fluid intelligence is what causes increases in crystallized intelligence.) So they don’t serve as a replacement for g for Murray’s purposes. If you want to put them in the 3-strata model, for example, you need to have a horizontal dependency and also turn the tree structure into a graph structure (since it’s likely most of the factors in strata 2 will depend on both Gc and Gf).
Let’s switch to practical considerations, and for convenience let’s assume Carroll’s three-strata theory is correct. The question then becomes, do you talk about the third strata or the second strata? (Note that if you have someone’s ‘stat block’ of 8 broad factors, then you don’t need their general factor.)
This hinges on the correlation between the second and third strata. If it’s sufficiently high, then you only need to focus on the third strata, and it makes sense to treat g as ‘existing,’ in that it compresses information well.
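(To illustrate what “compresses information well” can mean operationally, here’s a toy simulation with invented loadings rather than real test data: broad factors that all load on one general factor, and the share of variance the first principal component of their correlation matrix captures.)

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_broad = 5000, 8
g = rng.normal(size=n)                                # the general factor
loadings = rng.uniform(0.5, 0.8, size=n_broad)        # all positive: a "positive manifold"
noise = rng.normal(size=(n, n_broad)) * np.sqrt(1 - loadings**2)
broad = g[:, None] * loadings + noise                 # unit-variance broad factor scores

eigvals = np.linalg.eigvalsh(np.corrcoef(broad, rowvar=False))[::-1]
print(eigvals[0] / eigvals.sum())                     # share of variance in the first component
```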
This is the thing that I disagree with most strenuously:
In both cases, when one looks closely at the claim of a consensus that general intelligence exists, one finds something that does not look at all like such a consensus.
Compared to what? Yes, psychometricians are debating how to structure the subcomponents of intelligence (three strata or four?). But do journalists agree with the things all researchers would agree on? How about the thugs who gave a professor a concussion for being willing to interview Charles Murray?
That’s the context in which it matters whether there’s a consensus that general intelligence exists, and there is one. Sure, talk about the scholarly disagreement over the shape or structure of general intelligence, but don’t provide any cover for the claim that it’s worthless or evil to talk about a single factor of intelligence.
The context I have in mind is me talking to other people who are normally interested in talking about thorny methodological issues and contrarian academic positions.
…
For my part, I think that if by “exists” we mean “compresses information well,” then we can automatically get “g exists” from “positive manifold + high correlations.”
My claim is twofold: first, “compresses information well” (along with some other claims about durability) is the standard usage of “g exists,” and if one wants to use a subtle meaning, one should use a subtle phrasing. The statement “g isn’t causal” can’t be misinterpreted in the way that “g doesn’t exist” can.
To borrow an example from climate change, saying “global warming has stalled,” while correct for the standard definition of “global warming,” is generally misleading because the defect is in the definition more than the prediction; recent energy imbalance has mostly been going into the deep oceans, which historically aren’t counted as part of that definition, but are probably still relevant to the overall problem. The statement “global warming has been mostly affecting the deep ocean recently” points at the same issue but in a way that makes clear that we’re talking about a subpoint that doesn’t contradict the main point, which is accumulating energy imbalance to the Earth.
(I have seen at least three people using Shalizi’s critique as support for the belief that IQ is meaningless, not the more specific claim that Spearman’s specific hypothesis of a single causal g is wrong. This is why I respond to disagreements like this, and I think that attempts to attach the standard meaning to the opaque phrase “positive manifold” are basically obscurantist.)