Back when we were arguing over logical induction, I posted this in an Agent Foundations comment, but I don’t think I ever posted it over here.  I really like it, and I think it helps clarify how weak mere convergence can be:

Finally, about abstract asymptotic results leading to efficient practical algorithms – yes, this happens, but it’s important to think about what information beyond mere convergence is necessary for it to happen.

Consider root-finding for a differentiable function F: R → R. Here’s one method that converges (given some conditions): Newton’s method. Here’s another: enumerate the rational numbers in an arbitrary order, evaluate F at one rational number per timestep, and write that number down iff F there is closer to zero than at the last number you wrote down. (You can approximate the root arbitrarily well with rationals, the function is continuous, blah blah.)

Even though these are both convergent, there’s obviously a big difference; the former is actually converging to the result in the intuitive sense of that phrase, while the latter is just trolling you by satisfying your technical criteria but not the intuitions behind them. (Cf. the enumeration-based trader constructions.)
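To make the contrast concrete, here’s a toy sketch of the two “convergent” root-finders for f(x) = x² − 2 (the function and all the names are my own illustrative choices):

```python
import itertools
from fractions import Fraction

def f(x):
    return x * x - 2.0          # root at sqrt(2) ≈ 1.41421356...

def newton(x0, steps):
    """Newton's method: roughly doubles the number of correct digits per step once close."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / (2 * x)  # f'(x) = 2x
    return x

def rationals():
    """Enumerate the rationals p/q in an arbitrary (diagonal) order."""
    for q in itertools.count(1):
        for p in range(-3 * q, 3 * q + 1):
            yield Fraction(p, q)

def trolling_search(steps):
    """'Convergent' enumeration: keep whichever rational has given the smallest |f| so far."""
    best = None
    for x in itertools.islice(rationals(), steps):
        if best is None or abs(f(x)) < abs(f(best)):
            best = x
    return float(best)

print(newton(1.0, 6))           # ~1.414213562373095 after just 6 steps
print(trolling_search(10_000))  # a far cruder approximation after 10,000 evaluations
```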

Comments on the Friston “free energy” stuff that @slatestarscratchpad has been talking about:


Friston’s papers are badly written in a way that I find very recognizable.  He’s not incomprehensible in some exciting, esoteric way, he’s just a certain type of bullshitter.

He does a thing which is common to both many physics crackpots and (alas) quite a few real scientists: making a big deal out of his ability to relate his ideas to various famous and positive-affect-haloed (“canonical,” “celebrated,” etc.) equations/theorems/subject-areas.  The problem is, you can always do this, because most ideas in math and physics are connected, sometimes for ~deep~ reasons but usually for ordinary, obvious and not very interesting reasons.

An (exaggerated) non-mathematical version of this would be to breathlessly exclaim that your theory applies simultaneously to Europe and France, due to the (“time-honored”? “celebrated”? pick your adjective) fact that France is a part of Europe.  If you’re like Friston you can go the extra mile and claim that your theory has unified Europe and France.

For example, as far as I can tell, the entire section called “Optimal control theory and game theory” in Friston (2010) boils down to the observation that if you use prior probability as your utility function, then all of the usual results about how to make decisions with a utility function (optimal control) apply as usual, where the utility you are optimizing is prior probability.  Friston’s proposal has no more (or less) connection to optimal control theory than any proposal about which utility function to use.


More substantially, insofar as I understand the non-vacuous part of Friston’s proposal, it strikes me as a non-starter.

Here is what I think he is proposing.  First, he has this novel theory of motor planning which says something like the following: at every moment, we have some prior belief about what we expect to be happening, and we also have sensory input about what is really happening, and our muscles decide how to move by using the rule, “move to decrease the gap between prior belief and reality.”

And in some sense, that is also the rule followed by perception: immediate sensory input is ambiguous and can be parsed equally well into more than one possible “reality,” and we break this tie by choosing to perceive the reality that most closely matches our prior beliefs.  So Friston wants to say that perception and motor planning (really, perception and action in general) are executing the same exact rule, “minimize the gap between prior belief and reality.”  (He calls this gap “free energy.”)

Before I go into the problems with this, here is a concrete example where I can see the appeal of this perspective.  Suppose I’m walking along, and I suddenly hit an unseen bump in the road and trip on it.  I lurch forward – now my head is pointing downward, I’m seeing the ground, my body is at maybe a 45° angle instead of a 90° angle.  My visual and proprioceptive perceptions immediately shift to account for this: I know that I’m seeing the ground and that I’m in this new position.  At the same time, though, some very quick reflexive muscle movements are taking place, trying to bring me back to my original upright position, which is where I predicted I would be.

In this example,  perception and action are both – I guess – trying to close the gap between prior belief and reality.  Perception tries to close the gap by proposing a “conservative” interpretation of the sudden shift: lots of sensory input has changed, but only because of a slight shift in body position (while the rest of my prior beliefs remain true), not because I’ve been suddenly teleported to Mars or something.  Action tries to close the gap by making my sensory input closer to predicted sensory input: faced with a sudden acceleration of my head (signaled to me by my inner ear), it reacts by accelerating my head in the opposite direction, and so forth.

There are various things that I’ve finessed here, most importantly the distinction between predictions about sensory input and prior beliefs about reality.  Friston makes this distinction, but claims that these two terms should be bundled together in a single expression (“free energy”) which is optimized by perception (as it alters perceived reality, keeping sensory input fixed) and by action (as it tries to alter sensory input, keeping perceived reality fixed).

I am a bit worried that this is a trick you could use to “unify” any two unrelated processes, but I can see the intuitive justification: given some sensory input and some perceived reality, there is some single quantity expressing how “surprised” we are by this state of affairs, and in principle we can reduce that number in two ways, by altering perceived reality or by altering sensory input.


But here is the problem.  We can alter perceived reality more-or-less instantaneously.  But we can’t instantaneously change sensory input.  What we can do is send motor signals, and those set the rate of change of sensory input.

Revisit the unseen-bump-in-road example.  My perceptions jump immediately to fit the sensory input perfectly.  There isn’t some gradual process where my perceived position shifts from the predicted one (upright) to the actual one (lurched over) – or if there is, it happens in a tiny, tiny fraction of a second.  But my motion to right myself occurs perceptibly over time.

Indeed, this separation of timescales seems like a good thing.  If perception could only make sense of new inputs as fast as action could change those inputs, then the two would be stuck chasing each other in an endless loop.  First I pitch forward (in real life).  Then, over the course of (let’s say) two seconds, my perception slowly alters my world model so that it represents me as pitched forward, so that this input is no longer surprising.  But in those same two seconds, my actions have responded by making the input unsurprising in their way – that is, by bringing me upright.  Now, my brain notices that sensory input is surprising again: my world model says I’m pitched over, but input says I’m upright.  So perception starts to adjust my world model towards “I’m upright,” over the course of two seconds … in which time, action has closed the gap by making me pitch over again.  I’m now in the same place I started: world model upright, but input says I’m pitched over.  And so I bob up and down in place, endlessly, like one of those bird toys you put on your desk.
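A toy discrete-time caricature of this failure mode (my own construction, under the cartoonish assumption that each process fully closes its own gap once per tick):

```python
def simulate(perception_is_fast, ticks=6):
    x, mu = 1.0, 0.0    # reality: I've pitched forward (x=1); my model still says upright (mu=0)
    history = []
    for _ in range(ticks):
        if perception_is_fast:
            mu = x               # perception snaps to the input first...
            x = mu               # ...leaving no surprise for action to "fix"
        else:
            mu, x = x, mu        # both update at once, each chasing the other's old value
        history.append((x, mu))
    return history

print(simulate(perception_is_fast=True))   # settles at (1.0, 1.0) and stays there
print(simulate(perception_is_fast=False))  # (0,1), (1,0), (0,1), ... bobbing forever
```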

One engineering solution to this problem is to have perception adjust the world model so fast that action can’t keep up.  (It seems like this is generally the case.)  Another is to send a copy of the motor instructions back to perception so it can “subtract them out” when deciding whether it needs to shift anything to reduce surprise.  I think this also happens; at least, I remember learning that the nervous systems of (some) electric fish do this, so they don’t sense the electric field changes they themselves produce (or plan to produce).  But this ruins the picture where actions are directed to decrease surprise, since it subtracts action effects out of the surprise signal.

You can actually see this problem right there in Friston’s equations:

[image: Friston’s schematic of the external, sensory, and internal states and action, with the boxed equations discussed below]

There are a lot of variables here, but to get a basic sense: “s” is sensory input, “mu” is perceived reality, “a” is action, and “x” and “theta” are the external world state.  (Elsewhere he writes that x and some other things [including a non-curly-script theta, not shown here] are a subset of theta, which makes no sense to me since he writes things as functions of theta and x; who fucking knows.  The tildes above x and s appear to indicate that they are functions of time, even though everything here is a function of time.)

Anyway, if you squint at the green “External states” box on the left, you can see that there’s a tiny dot above the x, meaning “the time derivative of x.”  This makes sense: the right-hand side includes action, and our actions can’t immediately change sensory input to whatever we desire, they can just move it over time in some direction.

But there’s no such time dependence in the “Internal states” box.  Here, perceived reality mu is updated instantaneously at all times to track sensory input s.  This is the meaning of the “arg max” in the lower right box.

But we also have an “arg max” to compute action in the lower left box.  What’s up with that?  If we look at the line above, we see that we’re maximizing a quantity that depends on “a” via “s(a)”.  That is, sensory input (s) is supposed to be a function of action (a), so that we can set it to whatever we want instantaneously by choosing the right action.

But, as I just said a moment ago, it isn’t.  Again, we can see this by following the arrows to the left and up: s = g(x, theta) + w, and the only term on the right affected by “a” is “x,” and “a” doesn’t set “x” directly, only its time derivative.
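To spell out the structure I’m describing, here’s my paraphrase of the boxed relations, reconstructed from the figure and the discussion above (not copied from the paper; I’m writing the maximized quantity as −F, i.e. negative free energy):

```latex
% My paraphrase of the boxes, as described in the surrounding text -- not Friston's notation verbatim.
\begin{align*}
\dot{x} &= f(x, a, \theta)  &&\text{external states: action only enters through the time derivative of } x\\
s &= g(x, \theta) + \omega  &&\text{sensory input: a function of } x\text{, not of } a \text{ directly}\\
\mu &= \arg\max_{\mu}\, \bigl(-F(s, \mu)\bigr)  &&\text{perception: updated instantaneously, no time derivative}\\
a &= \arg\max_{a}\, \bigl(-F(s(a), \mu)\bigr)  &&\text{action: pretends } s \text{ can be set directly by } a
\end{align*}
```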

tl;dr: Friston is full of shit.

Quick follow-up to the last paragraph of last night’s deep learning post, something that just occurred to me and could well just be stupid:

It seems like there should be various transformations that all your classes are invariant under in the same way.  For images, this could be things like translations (already baked into convnets) but also rotations, dilations, more sophisticated things like rotations in inferred 3D space or inferred 3D lighting conditions, etc.

For text, it might be things like rephrasings: “I loved this movie” and “this movie was loved by me” should both receive the same sentiment label (positive), and so should “I hated this movie” and “this movie was hated by me,” for the same reason.

When we bake invariances directly into the architecture, like with convnets, then of course they apply equally to all classes, and also the networks know them from “birth” and don’t have to consume 40 million data points to learn them, and basically that works great except (1) it isn’t learning and (2) you have to plan them all in advance and laboriously design the architecture around them.  I’m not dissing this approach, since as far as I can tell natural organisms do a whole lot of this and I don’t see any reason to think you can get around doing it.  The tradeoff between generality and efficiency is pretty generic.

But what if you want the network to learn some invariances?  I don’t know what the right representation would be.  (I suspect there is a “right representation,” or several, and that it would take more mathematical sophistication than I actually have to come up with it, so I hope someone else is working on this.)  But linear classification definitely can’t do the job.  It gives each class an (n-1)-dimensional subspace in which you can move without changing the class probability*, and these all have to be distinct, because each one uniquely defines the class as opposed to all the others.

(And you can’t rely on the lower layers to sort things out so that the same invariances get mapped to different subspaces for different classes in the feature space, since that requires them to implicitly figure out which class a point is in so they can translate, say, “rotation invariance” into “the distinctive representation of rotation invariance for dogs” iff a point is a dog, in which case the classification has already been implicitly done and the linear classifier layer is redundant.)

I suspect the ultimate right answer here has to do with both prototypes and having a single set of invariances.  Like, once you “quotient out” all the invariances you know about, the dogs will all cluster together in the resulting space, and you can distinguish between “this is a weird marginal dog” and “this is a typical dog, but transported a long way from the ones I’ve seen, by one of the invariance groups.”  I suspect that you’ll have to bake in a lot of invariances rather than solving for them and classification at once, cf. all the stuff about the sophisticated visuo-spatial intuitions that even babies have.

*(well, the unnormalized probability – the others may change and that’ll change the softmax output)
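A quick numerical illustration of the asterisked claim, in toy code of my own: for a linear classifier, any direction orthogonal to a class’s weight vector leaves that class’s logit unchanged (an (n−1)-dimensional family of moves), though the softmax output can still shift because the other logits move.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_classes = 8, 3
W = rng.normal(size=(n_classes, n))     # one weight vector per class
b = rng.normal(size=n_classes)
x = rng.normal(size=n)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Pick a random direction, then project out its component along W[0]:
# what's left lies in the (n-1)-dimensional subspace orthogonal to W[0].
d = rng.normal(size=n)
d -= (d @ W[0]) / (W[0] @ W[0]) * W[0]

before = W @ x + b
after = W @ (x + d) + b

print(np.isclose(before[0], after[0]))               # True: class-0 logit unchanged
print(np.allclose(softmax(before), softmax(after)))  # False: the normalized probabilities still moved
```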

Some recent thoughts about deep learning, which are all sort of related but which I can’t boil down into a simple summary I’m confident about:


As always, I keep coming back to Christopher Olah’s amazing 2014 post about neural networks and topology.

One of the things that post emphasizes is that even fancy deep networks are still usually doing linear softmax classification – logistic regression – in the last layer.  All the fancy nonlinear stuff, then, is just trying to transform the data into a new feature space in which they are linearly separable.  This is true even, for example, of LSTMs that generate text for translation or conversation purposes (seq2seq)  – they’re still doing logistic regression on the feature space to figure out what word or character to output next.

(At the end of that post, Olah suggests using (differentiable) k-NN in the last layer instead, so that the goal for the feature space is the more forgiving “nearby points have the same class.”  Once this idea has been brought up, it seems obviously promising, and I’m confused why more research hasn’t been done on it since 2014.  Or if it’s been done, why I can’t find it.)
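Here’s roughly how I imagine such a layer looking: a soft nearest-neighbour readout that weights stored examples by distance in feature space and averages their labels. (This is my own sketch of the general idea, not Olah’s construction.)

```python
import numpy as np

def soft_knn(z, memory_feats, memory_labels, n_classes, temperature=1.0):
    """Differentiable k-NN readout: softmin over squared feature-space distances,
    then a weighted average of the stored (one-hot) labels."""
    d2 = ((memory_feats - z) ** 2).sum(axis=1)
    w = np.exp(-(d2 - d2.min()) / temperature)   # shift by min for numerical stability
    w /= w.sum()
    probs = np.zeros(n_classes)
    np.add.at(probs, memory_labels, w)           # accumulate weight onto each stored label's class
    return probs

# Toy usage: two stored "dog" features and one "cat" feature.
feats = np.array([[1.0, 1.0], [1.2, 0.9], [-1.0, -1.0]])
labels = np.array([0, 0, 1])
print(soft_knn(np.array([0.9, 1.1]), feats, labels, n_classes=2))    # mostly class 0
print(soft_knn(np.array([-0.8, -1.2]), feats, labels, n_classes=2))  # mostly class 1
```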


One of the various downsides of this approach to classification is that for each class, it represents “the quality of being that class” as a direction (in the feature space), given (roughly) by the direction between the centroids of the set of training examples in that class and the set of all the other ones.  I call this a downside both for theoretical reasons (it doesn’t seem like a good way to represent a concept) and for not-unrelated empirical reasons (it gives you adversarial examples).

Say, for example, your network is classifying images, and one of the classes is “dog.”  There is a direction, in the high-dimensional feature space, such that moving further in that direction always makes any image “more doglike” (and moving in the opposite direction makes it “less doglike”).  Of course, these simple linear motions correspond to complicated nonlinear trajectories back in the input space of images – but nonetheless, from any starting image, this gives you a one-parameter family of “more doglike” and “less doglike” versions of that image.

In principle, there is nothing wrong with this.  Indeed, there can’t be, because any classifier whose output probabilities are differentiable in the inputs will have these one-parameter families.  Just find the gradient of p(dog | X) and move along it.

But what is bad is that, in the logistic regression approach, we choose the direction that best separates “dog” from other classes in the training data, and then generalize that direction to all inputs.

Imagine the space of natural images as a lower-dimensional manifold in the higher-dimensional space of all possible images, and then imagine the training data as some little subset of that manifold.  On this very special subset, there are (let’s say) certain surefire image features that distinguish a dog from anything else.  The network encodes this by building detectors for those features in the lower layers, and then ensuring that in the final feature space, everything with those features is (as much as possible) on one side of a hyperplane, while everything without those features is on the other side.

But now the network’s concept of “dogness” is “being further in the direction that separated dogs from everything else.”  And as you move further in that direction, the network will become ever more certain that you are showing it a dog – 99% certain, 99.9% certain, 99.999999% certain, as certain as you want to get.  In other words, the network thinks that “dogness” is a sort of intensity, like temperature, so that there are images vastly more doglike than your ordinary picture of a dog, the way the sun is hotter than a hot summer day.
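A toy illustration with a single “dog vs. not-dog” logistic unit (hypothetical made-up numbers, just to show the shape of the problem): keep stepping along the weight direction and the confidence climbs toward 1.0 with no ceiling.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = rng.normal(size=100)     # the learned "dog direction" in feature space (made up here)
x = rng.normal(size=100)     # feature vector of some arbitrary starting image

unit = w / np.linalg.norm(w)
for t in [0.0, 1.0, 5.0, 20.0, 100.0]:
    p = sigmoid((x + t * unit) @ w)
    print(f"{t:6.1f} steps along the dog direction: p(dog) = {p:.12f}")
# There is no "maximally doglike" point -- the logit x@w + t*||w|| grows linearly in t forever.
```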

This doesn’t match up at all to the way I understand concepts.  There’s a standard distinction between concepts defined by necessary and sufficient conditions and concepts defined prototypically, and while the latter are closer, there’s still a big difference.  Maybe in my head I have some prototypical dog, and can say that some dogs are more doglike than others, by mentally comparing them to that one Platonic dog.  But this levels off after a point; ultimately a dog can only be so doglike, and if you showed me a picture of my mind’s own dog prototype itself (assuming for the sake of argument there is such a thing), I imagine I’d be like “yeah, that’s a dog,” not “oh my god, dogness level infinity!!!!! that is such a dog that I regret ever calling any other ‘dog’ a dog, and if you asked me to bet on whether this was a dog or [some other dog picture] was a dog I would bet my life savings on this guy.”

But that is what the networks do.  Moving along the one-parameter family, you can demand as much confidence as you want – enough that the betting odds relative to any real dog picture are as uneven as you please.

It turns out that you don’t even have to go very far.  At one point I was like “why has no one made visualizations of these one-parameter families?”, but then I realized that they had, and they’re the simplest kind of adversarial example, like the panda/gibbon thing (see Goodfellow, Shlens and Szegedy).  To get extreme dogness out of a picture of a non-dog, you need only move such a short distance in the dog direction that the difference is imperceptible to a human.

Goodfellow, Shlens and Szegedy write, about a network for MNIST:

Correct classifications occur only on a thin manifold where x occurs in the data. Most of R^n [in image space] consists of adversarial examples and rubbish class examples.

“Rubbish class examples” are pictures that are not of anything at all, but which are classified as some class by the network.  It makes sense that this happens.  The hyperplane separating dog from non-dog was designed to make fine distinctions between training examples that had a lot of special features; it isn’t surprising that a lot of random, gibberish images are much further to the dog side than any real image.  Likewise with any class.


Given all this, it seems remarkable that deep networks do as well as they do.  The second part of these assorted thoughts is that this may be related to their extreme data-hungriness.

Ultimately, the generalization performed by these networks is linear generalization: fit a linear trend to the training data (in the feature space), and assume it continues outside the bounds of the training data.  This gets you rubbish class examples through most of input space: extreme confidence that a gibberish input is some thing or other, because a line fit locally to small distinctions in training data is being extrapolated to places far from that data.

To do well, then, you need inputs that “could have been” training data, whatever that means.  I talked earlier about the training data forming a special subset of a manifold.  Apparently, if we stray much from this special subset (whatever it is), we end up in the land of rubbish class examples and do very poorly.  (k-NN would not have this problem, instead growing less confident of anything as one moves away from all training examples.)

But deep networks, famously, need huge amounts of training data.  Perhaps, then, they aren’t so much learning generalizations that can be extended beyond the training data – the way that, when you say “this relationship really is linear,” you can extend it to X and Y more extreme than any observed.  Instead, they are just interpolating between points, in data sets which are so large that they include a bunch of points kind of like most inputs you might think of testing them on.  Like k-NN, except while k-NN assumes a certain flat metric across the input space, deep networks learn how to stretch and rearrange the input space (so that classes are linearly separable).

This calls into question the ability of deep networks to learn facts that generalize.  The implicit “dogness measure” does not capture dogness, but once you have enough examples, a test input that is a dog will be close to some of the example dogs, and that is all you need.  Deep networks, then, would be just a nice way of interpolating between memorized examples without overfitting.

A while ago I thought about ways to do this that would capture concepts better.  One would think that each class should be its own manifold, so that what matters is not “how to make this image more or less doglike” but “how to transform this dog image so that, although different, it remains equally doglike.”  Implicitly, of course, there is already such a thing in the current models – the hyperplane – but a priori, there doesn’t seem to be any reason to represent “transformations that preserve a property” as “motions in an (n-1)-dimensional subspace of an n-dimensional linear space.”  Then again, I have no idea what a more promising representation would look like; at the time I did a few hours of ignorant and unproductive pontification about Lie groups and then gave up.


platypusumwelt:

in 1700 BCE Egyptian mathematicians came up with the best possible way to end a math proof (”Behold! The beer quantity is found to be correct!”), and the only evidence you’ll ever need of base neoliberal depravity is that we knew about this absolute masterpiece and instead opted to end our proofs with an empty square that means “what was to be demonstrated”

I was reading a paper (“Information-Based Clustering”) and got confused, and it turned out I was confused because I had an intuition about information theory that wasn’t true.  So I figured I would post about it here in case anyone else has the same intuition.

The paper is about clustering.  It starts out with a formulation where you have some similarity measure for the N points you want to cluster, and you want to put them in a fixed number of clusters, N_c.

They write down the objective function I’m used to: the average within-cluster similarity.  But instead of just maximizing that, they say they want to maximize it subject to a constraint, eq. 3 in their paper.  This is the mutual information between the point labels i and the cluster labels C, which they’ve re-written so it looks like a K-L divergence between p(C|i) and p(C).
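For reference, the quantity in question, written in the KL form I’m describing (this is just the standard identity for mutual information, not a transcription of the paper’s eq. 3):

```latex
I(C; i) \;=\; \sum_i p(i) \sum_C p(C \mid i)\,\log\frac{p(C \mid i)}{p(C)}
        \;=\; \Bigl\langle\, D_{\mathrm{KL}}\bigl(p(C \mid i)\,\big\|\,p(C)\bigr) \Bigr\rangle_{p(i)}
```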

Now, to be honest, I’m still not entirely sure what we get out of this term.  I think what it does is encourage imbalanced clusters, so if we’re asked which cluster a point belongs to, we can express it concisely (with a prefix code where the more common clusters get shorter code words), but still I’m not clear on why that’s desirable.  I think the idea is that we can let the number of clusters get very large, and still get readable results, with a few big clusters and then some smaller ones capturing fine-scale stuff.  (It also does something to the way we assign fuzzy class probabilities to each point, I guess.)

Forget about that, though.  What confused me originally was that this measure of information doesn’t know how close together our points are.  That’s all in the similarity measure, which we’re balancing against this information thing.

That means that a clustering rule that seems really simple, like “all points with x>0 are red, all other points are blue,” could have the same amount of information (or more) as one that seems really complicated, with lots of wiggly boundaries.  I.e. whatever this is capturing, it isn’t how much information it would take to specify the clustering rule.


Thinking about this led me to realize that information theory per se is entirely about measure-related stuff, and only knows about metric stuff if you put that stuff in yourself, as a constraint.

The usual derivations of Shannon entropy, etc., start out with discrete distributions, where you have a bunch of discrete points that have nothing to do with one another.  They have labels, but the labels are arbitrary, and you can permute them without changing anything.

Then, if you want, you can move to continuous distributions, although now there are subtleties (the differential entropy isn’t coordinate-independent, so you have to move to K-L divergence w/r/t some background measure that can stretch when you change the coordinates).  Now, suddenly, we’re talking about functions on R^N, where we’re used to there being a metric.

But all of the information theory stuff has been carried over from the context of discrete, unrelated points, so if there’s a metric, it knows nothing about it.  What it knows about are measurable sets, the equivalent here of the discrete points.  But you can “permute the labels” of these sets all you want, without anything changing.

Like, take a standard normal distribution, and move parts of it around.  Take the part on [-1, 1] and exchange it with the part on [1000, 1002] (equal-length intervals, so the swap is just a measure-preserving rearrangement).  This is a totally different distribution, with way different moments, and it’s discontinuous now … but it has the same (differential) entropy.  Because if you had to make an optimal code for these distributions, all that matters is how likely one piece is relative to another piece – doesn’t matter where they are in relation to the other pieces.
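A quick numerical check of this (toy code of my own; I discretize on a fine grid and compare):

```python
import numpy as np

dx = 1e-3
x = np.arange(-5.0, 1005.0, dx)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density

# Swap the density on [-1, 1) with the density on [1000, 1002) -- two intervals
# of equal length, so this is just a rearrangement of where the mass sits.
q = p.copy()
i = np.searchsorted(x, [-1.0, 1000.0])
a = slice(i[0], i[0] + 2000)
b = slice(i[1], i[1] + 2000)
q[a], q[b] = p[b], p[a]

def differential_entropy(density):
    nz = density > 0
    return -np.sum(density[nz] * np.log(density[nz])) * dx

print(differential_entropy(p), differential_entropy(q))  # equal (≈ 1.4189)
print(np.sum(x * p) * dx, np.sum(x * q) * dx)            # the means are wildly different
```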

As I said, you can bring metric stuff into the picture if you want, as in the paper with its similarities, or in maximum entropy where you fix things like the mean and standard deviation.  But there’s something kind of weird about this to me.  On the one hand, you’re saying that you care about metric stuff.  On the other hand, you’re quantifying how efficiently you could code a sample from your distribution, using a measure of optimal coding efficiency that doesn’t know about distances.

But this does actually work, because once you write an objective function where distances matter, codes that exploit the distances start winning in your optimization problem over codes that don’t.  What got me confused was that “exploiting distances” doesn’t actually let you use fewer bits per se – it just lets you convey distances better than another code using the same number of bits.

The paper cites something called rate-distortion theory, which seems to be the general theory of doing this.  The idea is like: suppose I am using a lossy encoding, and I want to set a maximum on the squared error (“distortion”) between the reconstructed signal and the true signal.  How many bits can I afford to throw away?  And then you do this same kind of thing, where you minimize a mutual information with a constraint on the distortion (where the paper is basically minimizing distortion with a constraint on mutual information).

As a concrete example of “exploiting distances doesn’t actually let you use fewer bits per se,” consider two distributions over x ∈ [0, 1000].  Distribution A is uniform over [0, 10] and then drops to zero for x > 10.  Distribution B splits this into two far-apart bins: one uniform over [0, 5] and one uniform over [995, 1000].  These have the same differential entropy (we can turn one into another by “permuting labels”).

But if we want to minimize distortion for distribution B, our code should start out with a symbol saying which bin you’re in, and then refining from there.  If our code corresponds to a continuous distribution (with less entropy than B, so it’s compressed), we’ll probably want it to be bimodal; as the entropy gets lower and lower it would tend to a Dirac delta in the center of each bin.  We wouldn’t have the same bimodality if we were trying to code A, I think (although I haven’t actually done the problem – I don’t even know if it has a solution!).  The point is, the distances tell us what shape our code should have, and then the information measure tells us what the cheapest code with that shape is, even though it doesn’t care about shapes per se.

Only Mostly Serious

evolution-is-just-a-theorem:

eelfoe:

bartlebyshop:

Remove almost all undergrad E+M and replace it with numerics classes.

You want a “math methods” class in solving bullshit PDEs and 3D volume integrals? Great! I hear the applied math department runs service classes for that. Furthermore, no one actually solves PDEs that way anymore. People throw them at Mathematica or solve them numerically. Nobody is going to be trapped on a desert island with no cell signal, a gun to their head unless they can solve terrible boundary value problems for dielectrics in capacitors.

You want an overview of the state of physics 150-75 years ago? Then why isn’t nuclear a required class? Metropolis algorithm is nearly old enough now anyway.

You want a weeder class to get rid of people in the degree to prove how smart they are? Make them write an implicitly restarted Lanczos or something.

“But E+M is important! They won’t see spherical harmonics otherwise! How will we teach Quantum?” Holy shit! You can’t teach them spherical harmonics in Quantum?? They’re really not that complicated, and besides, the E+M presentation is always “Here’re some bullshit solutions to this bullshit PDE we derived using separation of variables. Group theory? No, this isn’t a math class. Be able to rederive Bessel’s equation for the exam.”

“Students in my lab need to be able to deal with circuits!” Then make that part of a lab class. Not everyone is going to be an experimentalist. Almost every grad student I know writes code, though.

All the good stuff in E+M you can fold into relativity and quantum anyway. More numerics also means more time to teach real statistics, another area completely neglected by undergrad physics degrees.

i don’t know much about physics but i endorse this post. people get a fucked up view of PDEs and think that writing down an explicit analytic solution is a good idea. even if writing down an analytic solution were possible, it wouldn’t be desirable. the resulting expression would be intractably complicated. a reasonable/useful/interesting goal is to either do numerics or prove qualitative(ish) shit about the solution.

Wait… what? Do physics undergrads get the impression that you solve PDEs analytically? Does… does no one tell them that they’re almost all unsolvable?

I mean I guess if I wasn’t a math major and my only encounter with the subject was my diffeqs class I might have thought the same thing…

Still though. That’s messed up.

At least in my case I learned about perturbation theory in a core class, so my impression was “yeah, almost all PDEs are unsolvable, so in practice we hope they’re close enough to solvable ones to be amenable to some approximation technique like perturbation theory.”

Then in my final year I took numerics, as an elective, and finally learned about the other thing you can do.  If I hadn’t arbitrarily decided to take that class (misleadingly named “Scientific Computing”), and had instead been a more “responsible” physics major and taken the Thermal elective, well, who even knows what would have happened?

OP is extremely correct.

(via just-evo-now)

This is probably getting into some programmer holy war bullshit and I’m sure there are like 10000 Usenet posts explaining why I am capital-w Wrong, but

I don’t like the idea of computations on NaN values returning results that might have come from actual numbers.

The idea of a NaN (as I understand it) is “there was supposed to be a number here, but for some reason we don’t have one.”  Could be an operation with a mathematically undefined result; could be an operation with a defined result but not in the number system you’re using (like sqrt(-1) when you are representing real or rational numbers); could be a missing value in real-world data.

Doesn’t matter, because all these should be treated the same way: by doing something numbers would not do (either through an error or returning NaN), so that the user does not get a result that could logically follow for some actual numbers but might not logically follow for whatever numbers the NaNs were “supposed to be” (if any).

Compare NaN to anything else and get False?  Nope, for all you know the number that was supposed to be there equaled the other guy.

Compare NaN to NaN and get False?  This is legendarily confusing, which would perhaps be defensible if it were a downstream consequence of the confusingness of NaN itself – but no, for all you know those numbers were supposed to be equal.  (Tell me sqrt(-1) is not equal to sqrt(-1), I dare you)

Convert NaN to boolean and get True, because you have to equal zero to be False, and (NaN == 0) == False?  How many times do I have to tell you there isn’t a number there, we don’t know what it is, we don’t know what ought to happen when we put it through your function

No no wait, to be fair this doesn’t always happen!!  You could be in JavaScript, where if you convert NaN to boolean, you get … False, because NaN is “falsy.”  No it isn’t, it isn’t “falsy,” it isn’t “truthy,” it isn’t anything-y, we don’t know what it is okay
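For concreteness, here’s the behavior in Python (the JavaScript “falsy” thing above is the odd one out):

```python
import math

nan = float("nan")

print(nan == nan)            # False: the two "unknown numbers" are declared unequal
print(nan < 0.0, nan > 0.0)  # False False: not less than, not greater than, not equal to anything
print(bool(nan))             # True: it isn't zero, so it's "truthy"
print(nan != nan)            # True: the classic self-inequality trick...
print(math.isnan(nan))       # ...though this is the honest way to ask the question
```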

I know I complained about this exact thing before, but I keep reading papers where researchers try to measure the duration of a phenomenon by finding the earliest time they can’t statistically detect it with p < .05

No!  Nooooo!  This is so bad in so many ways!  It has multiple comparisons problems, it has within-vs-between-subjects problems (easily fixable by doing a paired test but they never do that), but those aren’t even the main problem, the main problem is that you’re making the duration a function of your sample size and it’ll get longer or shorter if you re-do the study with a different sample size
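A quick simulation of why this is broken (a toy setup of my own: a true effect that decays smoothly and never actually reaches zero):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def estimated_duration(n_subjects, times=np.arange(0.0, 10.0, 0.5), noise_sd=1.0):
    """The (bad) procedure: the earliest time at which the effect is no longer
    detectable at p < .05, via a one-sample t-test against zero."""
    for t in times:
        true_effect = np.exp(-t)                                   # never actually zero
        data = true_effect + noise_sd * rng.normal(size=n_subjects)
        p = stats.ttest_1samp(data, 0.0).pvalue
        if p >= 0.05:
            return t
    return times[-1]

for n in [10, 50, 250, 1000, 5000]:
    print(f"n = {n:5d}: estimated duration = {estimated_duration(n)}")
# The "duration" tends to keep stretching as n grows, because a bigger sample can
# detect an ever-smaller (but still nonzero) effect.
```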

There is probably no name more liberally and more confusingly used in dynamical systems literature than that of Lyapunov (AKA Liapunov). Singular values / principal axes of strain tensor JᵀJ (objects natural to the theory of deformations) and their longtime limits can indeed be traced back to the thesis of Lyapunov [10, 8], and justly deserve sobriquet ‘Lyapunov’. Oseledec [8] refers to them as ‘Liapunov characteristic numbers’, and Eckmann and Ruelle [11] as ‘characteristic exponents’. The natural objects in dynamics are the linearized flow Jacobian matrix Jᵗ, and its eigenvalues and eigenvectors (stability exponents and covariant vectors). Why should they also be called ‘Lyapunov’? The covariant vectors are misnamed in recent papers as ‘covariant Lyapunov vectors’ or ‘Lyapunov vectors’, even though they are not the eigenvectors that correspond to the Lyapunov exponents. That’s just confusing, for no good reason - Lyapunov has nothing to do with linear stability described by the Jacobian matrix J, as far as we understand his paper [10] is about JᵀJ and the associated principal axes. To emphasize the distinction, the Jacobian matrix eigenvectors {e(j)} are in recent literature called ‘covariant’ or ‘covariant Lyapunov vectors’, or ‘stationary Lyapunov basis’ [12]. However, Trevisan [7] refers to covariant vectors as ‘Lyapunov vectors’, and Radons [13] calls them ‘Lyapunov modes’, motivated by thinking of these eigenvectors as a generalization of ‘normal modes’ of mechanical systems, whereas by ith ‘Lyapunov mode’ Takeuchi and Chaté [14] mean {λj, e(j)}, the set of the ith stability exponent and the associated covariant vector. Kunihiro et al. [15] call the eigenvalues of stability matrix (4.3), evaluated at a given instant in time, the ‘local Lyapunov exponents’, and they refer to the set of stability exponents (4.7) for a finite time Jacobian matrix as the ‘intermediate Lyapunov exponent’, “averaged” over a finite time period. The list goes on: there is ‘Lyapunov equation’ of control theory, which is the linearization of the ‘Lyapunov function’, and the entirely unrelated ‘Lyapunov orbit’ of celestial mechanics.

(Cvitanović et al., Chaos: Classical and Quantum)

I’ve never gotten far enough into this particular stuff to feel the brunt of this particular terminological clusterfuck, but I felt a stab of painful recognition reading this, because this sort of thing happens all the time

One of the little frustrations of grad school was simultaneously holding in mind the (broadly justified) sense that all the profs and paper-authors were far, far smarter than me and the (also justified) sense that they were very sloppy writers and it was my job to clean up their thoughts inside my own head, because they weren’t going to do it for me