
nostalgebraist:

the bad popes

Finally picked up and finished this book after forgetting about it for like two years.  Review here if you’re curious.

(I’ve been posting so much less in this space lately than I used to, so I figure I ought to at least link to anything I write elsewhere.)

Quick follow-up to the last paragraph of last night’s deep learning post, something that just occurred to me and could well just be stupid:

It seems like there should be various transformations that all your classes are invariant under in the same way.  For images, this could be things like translations (already baked into convnets) but also rotations, dilations, more sophisticated things like rotations in inferred 3D space or inferred 3D lighting conditions, etc.

For text, it might be things like rephrasings: “I loved this movie” and “this movie was loved by me” should both receive the same sentiment label (positive), and so should “I hated this movie” and “this movie was hated by me,” for the same reason.

When we bake invariances directly into the architecture, like with convnets, then of course they apply equally to all classes, and also the networks know them from “birth” and don’t have to consume 40 million data points to learn them, and basically that works great except (1) it isn’t learning and (2) you have to plan them all in advance and laboriously design the architecture around them.  I’m not dissing this approach, since as far as I can tell natural organisms do a whole lot of this and I don’t see any reason to think you can get around doing it.  The tradeoff between generality and efficiency is pretty generic.

But what if you want the network to learn some invariances?  I don’t know what the right representation would be.  (I suspect there is a “right representation,” or several, and that it would take more mathematical sophistication than I actually have to come up with it, so I hope someone else is working on this.)  But linear classification definitely can’t do the job.  It gives each class an (n-1)-dimensional subspace in which you can move without changing the class probability*, and these all have to be distinct, because each one uniquely defines the class as opposed to all the others.
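To make the subspace claim concrete: for each class, the invariant subspace is just the set of directions orthogonal to that class’s weight vector in the final linear layer.  A toy numpy sketch (all the dimensions and weights here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                        # feature-space dimension (made up)
w_dog = rng.normal(size=n)   # the "dog" row of the final linear layer
x = rng.normal(size=n)       # some point in feature space

# Project the w_dog-component out of a random direction; what remains
# lies in the (n-1)-dim subspace where the dog logit never changes.
d = rng.normal(size=n)
d -= (d @ w_dog) / (w_dog @ w_dog) * w_dog

# Move a long way along d: the dog logit is untouched (though other
# classes' logits can change, hence the footnote about the softmax output).
assert np.isclose(w_dog @ x, w_dog @ (x + 10.0 * d))
```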

(And you can’t rely on the lower layers to sort things out so that the same invariances get mapped to different subspaces for different classes in the feature space, since that requires them to implicitly figure out which class a point is in so they can translate, say, “rotation invariance” into “the distinctive representation of rotation invariance for dogs” iff a point is a dog, in which case the classification has already been implicitly done and the linear classifier layer is redundant.)

I suspect the ultimate right answer here has to do with both prototypes and having a single set of invariances.  Like, once you “quotient out” all the invariances you know about, the dogs will all cluster together in the resulting space, and you can distinguish between “this is a weird marginal dog” and “this is a typical dog, but transported a long way from the ones I’ve seen, by one of the invariance groups.”  I suspect that you’ll have to bake in a lot of invariances rather than solving for them and classification at once, cf. all the stuff about the sophisticated visuo-spatial intuitions that even babies have.

*(well, the unnormalized probability – the others may change and that’ll change the softmax output)


loumargi:

Edward Arthur Walton (1860-1922), The White Flower

nuclearspaceheater asked: Would you consider it fair to count representational but non-realistic art as "adversarial examples" in human visual perception? Humans are more robust but it seems to me to follow a similar principle: target the desired perception directly, rather than create a realistic facsimile of the thing that normally evokes that perception.

I understand what you’re getting at, but I think this is almost the opposite of what adversarial examples are.  Whenever anyone in the research literature constructs an adversarial example, it’s by finding the minimal perturbation to a real image (or whatever) that will change the class label.  It’s “adversarial” because to humans it looks like a very slightly altered picture of [something], while to the machine it looks like a picture of [something else].

With representational but non-realistic art – like cartoons – we instead have an image that is very different from a photo of a real human, but is nonetheless recognized as a human (well, sort of – crucially, humans can distinguish between cartoon images and photos).  If anything, this seems like the kind of conceptual understanding we want the machines to have more of, i.e. the understanding that anything with a certain constellation of traits “looks like a person,” even when many other things about the image apart from those traits are drastically varied.

(I kind of touched on this here, with reference to stuff like these synthetic class examples from an ImageNet classifier.  The nets do learn how to distinguish classes in real photos, but it looks like they do so by focusing on highly distinctive, photorealistic individual features without caring about spatial arrangement the way we do, so that the equivalent of a “cartoon dog” for such a classifier would be something like “some photorealistic dog paws and muzzles, spammed randomly across space.”  For a vivid and amusing example from my own experiments with a slightly different sort of net, see here.)

Some recent thoughts about deep learning, which are all sort of related but which I can’t boil down into a simple summary I’m confident about:


As always, I keep coming back to Christopher Olah’s amazing 2014 post about neural networks and topology.

One of the things that post emphasizes is that even fancy deep networks are still usually doing linear softmax classification – logistic regression – in the last layer.  All the fancy nonlinear stuff, then, is just trying to transform the data into a new feature space in which they are linearly separable.  This is true even, for example, of LSTMs that generate text for translation or conversation purposes (seq2seq)  – they’re still doing logistic regression on the feature space to figure out what word or character to output next.
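In symbols: everything before the last layer is some learned feature map, and the output is just a softmax over a linear function of the features.  A minimal numpy sketch, with a made-up stand-in for the feature map:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Stand-in for "all the fancy nonlinear stuff" -- any feature map at all.
def phi(x):
    return np.tanh(x)        # made-up toy feature extractor

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))  # 3 classes, 4 features (made up)
x = rng.normal(size=4)

# The entire last layer: multinomial logistic regression on phi(x).
p = softmax(W @ phi(x))
assert np.isclose(p.sum(), 1.0)
```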

(At the end of that post, Olah suggests using (differentiable) k-NN in the last layer instead, so that the goal for the feature space is the more forgiving “nearby points have the same class.”  Once this idea has been brought up, it seems obviously promising, and I’m confused why more research hasn’t been done on it since 2014.  Or if it’s been done, why I can’t find it.)
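For what it’s worth, here is a minimal sketch of the sort of thing I take Olah to mean – a soft, differentiable relative of k-NN, where every training point votes for its class with a weight that decays smoothly with distance.  (The function name, the temperature parameter, and the toy data are all mine, not his.)

```python
import numpy as np

def soft_knn_probs(x, X_train, y_train, n_classes, temp=1.0):
    """Differentiable stand-in for k-NN: each training point votes for
    its class with weight softmax(-squared distance / temp)."""
    d2 = ((X_train - x) ** 2).sum(axis=1)
    w = np.exp(-(d2 - d2.min()) / temp)   # shift by min for stability
    w /= w.sum()
    return np.array([w[y_train == c].sum() for c in range(n_classes)])

# Tiny made-up data: two well-separated 2-D clusters.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])

p = soft_knn_probs(np.array([0.05, 0.1]), X, y, n_classes=2)
assert p[0] > 0.99    # the query sits inside the class-0 cluster
```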


One of the various downsides of this approach to classification is that for each class, it represents “the quality of being that class” as a direction (in the feature space), given (roughly) by the direction between the centroids of the set of training examples in that class and the set of all the other ones.  I call this a downside both for theoretical reasons (it doesn’t seem like a good way to represent a concept) and for not-unrelated empirical reasons (it gives you adversarial examples).

Say, for example, your network is classifying images, and one of the classes is “dog.”  There is a direction, in the high-dimensional feature space, such that moving further in that direction always makes any image “more doglike” (and moving in the opposite direction makes it “less doglike”).  Of course, these simple linear motions correspond to complicated nonlinear trajectories back in the input space of images – but nonetheless, from any starting image, this gives you a one-parameter family of “more doglike” and “less doglike” versions of that image.

In principle, there is nothing wrong with this.  Indeed, there can’t be, because any classifier whose output probabilities are differentiable in the inputs will have these one-parameter families.  Just find the gradient of p(dog | X) and move along it.
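For a linear “dog” head, the gradient is constant – it is just the weight vector – so gradient ascent traces out a literal straight line.  A toy sketch of the one-parameter family (dimensions and weights made up):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
dim = 64
w = rng.normal(size=dim)     # toy linear "dog" head (made up)
x0 = rng.normal(size=dim)    # any starting image, in feature space

# grad_x sigmoid(w @ x) always points along w, so ascending the
# gradient just slides x along the w direction.
direction = w / np.linalg.norm(w)
probs = [sigmoid(w @ (x0 + t * direction)) for t in np.linspace(-3, 3, 7)]

# "more doglike" / "less doglike" versions of the same starting point:
assert all(a < b for a, b in zip(probs, probs[1:]))
```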

But what is bad is that, in the logistic regression approach, we choose the direction that best separates “dog” from other classes in the training data, and then generalize that direction to all inputs.

Imagine the space of natural images as a lower-dimensional manifold in the higher-dimensional space of all possible images, and then imagine the training data as some little subset of that manifold.  On this very special subset, there are (let’s say) certain surefire image features that distinguish a dog from anything else.  The network encodes this by building detectors for those features in the lower layers, and then ensuring that in the final feature space, everything with those features is (as much as possible) on one side of a hyperplane, while everything without those features is on the other side.

But now the network’s concept of “dogness” is “being further in the direction that separated dogs from everything else.”  And as you move further in that direction, the network will become ever more certain that you are showing it a dog – 99% certain, 99.9% certain, 99.999999% certain, as certain as you want to get.  In other words, the network thinks that “dogness” is a sort of intensity, like temperature, so that there are images vastly more doglike than your ordinary picture of a dog, the way the sun is hotter than a hot summer day.

This doesn’t match up at all to the way I understand concepts.  There’s a standard distinction between concepts defined by necessary and sufficient conditions and concepts defined prototypically, and while the latter are closer, there’s still a big difference.  Maybe in my head I have some prototypical dog, and can say that some dogs are more doglike than others, by mentally comparing them to that one Platonic dog.  But this levels off after a point; ultimately a dog can only be so doglike, and if you showed me a picture of my mind’s own dog prototype itself (assuming for the sake of argument there is such a thing), I imagine I’d be like “yeah, that’s a dog,” not “oh my god, dogness level infinity!!!!! that is such a dog that I regret ever calling any other ‘dog’ a dog, and if you asked me to bet on whether this was a dog or [some other dog picture] was a dog I would bet my life savings on this guy.”

But that is what the networks do.  Moving along the one-parameter family, you can demand as much confidence as you want – enough that the betting odds relative to any real dog picture are as uneven as you please.

It turns out that you don’t even have to go very far.  At one point I was like “why has no one made visualizations of these one-parameter families?”, but then I realized that they had, and they’re the simplest kind of adversarial example, like the panda/gibbon thing (see Goodfellow, Shlens and Szegedy).  To get extreme dogness out of a picture of a non-dog, you need only move such a short distance in the dog direction that the difference is imperceptible to a human.
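Their fast gradient sign method is basically a one-liner.  A toy numpy version (with a made-up linear dog-vs-not-dog head standing in for a real network) shows why such a short move suffices: the tiny per-pixel step eps gets multiplied by the sum of |w_i| across thousands of pixels.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
dim = 1024                           # "pixels" (toy setup, made up)
w = rng.normal(size=dim)             # linear dog-vs-not-dog head

x = rng.normal(size=dim)             # a non-dog image:
x -= ((w @ x) + 3.0) / (w @ w) * w   # force the dog logit to exactly -3

# FGSM (Goodfellow, Shlens & Szegedy): one step of size eps per pixel
# in the sign of the gradient.  Here grad_x(w @ x) is just w.
eps = 0.05                           # imperceptible vs. pixel std of 1
x_adv = x + eps * np.sign(w)

print(sigmoid(w @ x))      # ~0.047: confidently not a dog
print(sigmoid(w @ x_adv))  # near 1.0: the logit jumped by eps * sum(|w_i|)
```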

Goodfellow, Shlens and Szegedy write, about a network for MNIST:

Correct classifications occur only on a thin manifold where x occurs in the data. Most of R^n [in image space] consists of adversarial examples and rubbish class examples.

“Rubbish class examples” are pictures that are not of anything at all, but which are classified as some class by the network.  It makes sense that this happens.  The hyperplane separating dog from non-dog was designed to make fine distinctions between training examples that had a lot of special features; it isn’t surprising that a lot of random, gibberish images are much further to the dog side than any real image.  Likewise with any class.
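This is easy to reproduce with a toy linear classifier and pure noise (all the numbers here are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_classes, dim = 10, 3072            # CIFAR-ish flat "images" (made up)
W = rng.normal(size=(n_classes, dim)) / np.sqrt(dim)

# Pure noise, scaled up: a picture of nothing at all.
rubbish = 50.0 * rng.normal(size=dim)
p = softmax(W @ rubbish)
print(p.max())  # usually wildly confident in *some* arbitrary class
```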


Given all this, it seems remarkable that deep networks do as well as they do.  The second part of these assorted thoughts is that this may be related to their extreme data-hungriness.

Ultimately, the generalization performed by these networks is linear generalization: fit a linear trend to the training data (in the feature space), and assume it continues outside the bounds of the training data.  This gets you rubbish class examples through most of input space: extreme confidence that a gibberish input is some thing or other, because a line fit locally to small distinctions in training data is being extrapolated to places far from that data.

To do well, then, you need inputs that “could have been” training data, whatever that means.  I talked earlier about the training data forming a special subset of a manifold.  Apparently, if we stray much from this special subset (whatever it is), we end up in the land of rubbish class examples and do very poorly.  (k-NN would not have this problem, instead growing less confident of anything as one moves away from all training examples.)

But deep networks, famously, need huge amounts of training data.  Perhaps, then, they aren’t so much learning generalizations that can be extended beyond the training data – the way that, when you say “this relationship really is linear,” you can extend it to X and Y more extreme than any observed.  Instead, they are just interpolating between points, in data sets which are so large that they include a bunch of points kind of like most inputs you might think of testing them on.  Like k-NN, except while k-NN assumes a certain flat metric across the input space, deep networks learn how to stretch and rearrange the input space (so that classes are linearly separable).

This calls into question the ability of deep networks to learn facts that generalize.  The implicit “dogness measure” does not capture dogness, but once you have enough examples, a test input that is a dog will be close to some of the example dogs, and that is all you need.  Deep networks, then, would be just a nice way of interpolating between memorized examples without overfitting.

A while ago I thought about ways to do this that would capture concepts better.  One would think that each class should be its own manifold, so that what matters is not “how to make this image more or less doglike” but “how to transform this dog image so that, although different, it remains equally doglike.”  Implicitly, of course, there is already such a thing in the current models – the hyperplane – but a priori, there doesn’t seem to be any reason to represent “transformations that preserve a property” as “motions in an (n-1)-dimensional subspace of an n-dimensional linear space.”  Then again, I have no idea what a more promising representation would look like; at the time I did a few hours of ignorant and unproductive pontification about Lie groups and then gave up.

journalgen:

Proceedings of the 1st Asian Conference on Modal Abyss Science

rubegoldbergsaciddreams:

“If you bombard the Earth with photons for a while, it will emit a Tesla Roadster”

— The sun, probably

(via cryptovexillologist)

In some social settings I am a wallflower, and in others I’m a strong and noticeable presence.  Indeed, it feels like I tend to end up on one side of that spectrum or another, with negligible likelihood of landing somewhere in the middle.

This has always been a mystery to me, and probably always will be, at least in part.  But here is a hypothesis that does a pretty good job of explaining the pattern: I can only make a splash in a social setting if it is a setting where I can feel assured that, at least sometimes, I have the floor.  (In the public debate/meeting sense of the term, although this can extend to all behavior, not just speaking and being heard.)

I tend to be quiet and awkward at parties, and in casual hangout settings with more than 3 people or so.  Usually, when I reflect on this, I default to the (perhaps relatively flattering) explanation that I’m concerned about talking over other people, or drawing attention away from people who would put it to better use.  When there are so many people around me, what is the likelihood that I have the best (most informed, interesting, funny, entertaining, etc.) thing to say or do at any given moment?

And yet: in meetings and discussion classes, I talk frequently, confidently and at length, to the point that I have to remind myself to hold back so I don’t dominate the conversation or otherwise annoy others.  This puts the lie to the “don’t want to talk over others” explanation, since the exact same considerations ought to apply to these settings – indeed with more force, since they are closer to zero-sum.  (If a meeting or a class is on a strict schedule and can’t run over its time limit, then every extra second I talk is a second someone else can’t talk, if we ignore silences for the sake of a first approximation.)

The difference, I’m now thinking, is that in meetings and discussion classes, once I start to talk I know I have the floor.  In these settings it’s usually considered a faux pas to interrupt people, and people are also usually not allowed to get up and leave, so while I’m talking, I know everyone has to listen.  And once I know that, I’m in fact pretty confident (rightly or not) that I have things to say that are worth hearing.

What I’m not confident about, ultimately, is holding moment-to-moment attention in the face of competition.  If I know that at any time, someone could interrupt me with something more appealing – verbal or nonverbal – then I’m lost.  I don’t know how to be continuously appealing, robust to interruptions at every step.  I just know how to do things which, after they’ve been fully completed, I expect to have been appealing as entire wholes.  If I’m in a setting where no one gets to have the floor securely, this feels, inside my head, like I’m afraid of talking over other people.  And maybe there’s some truth to that – but it isn’t that I’m scared of not having good contributions, it’s that I’m scared of not having continuously appealing ones.  (There is a kind of shame associated with this, an awareness that other people have some talent I lack, and that it must be obvious that I can’t stack up to them when I try.)

I think my experiences in online venues also fit this pattern.  On forums and IRC, I’ve sometimes been a notable presence, but usually I have this awkward way of being in my own world, almost talking to myself as the conversation continues around me.  In these systems, messages are presented sequentially, but there are a whole lot of them and hardly anyone reads every single one, so you are competing for momentary attention; left to my own devices, I just ramble on and hope someone finds my messages more interesting than the ones interleaved with them.

This is also the way the tumblr dash works, but tumblr has two advantages for me here.  First, there’s the concept that everyone is creating their own “blog,” which appears on the dashes of their followers almost as an aftereffect.  This allows me to ramble on inside my own world for as long as I please (in this post, for example), and to feel like this is an expected way of engaging with the medium.  (If you don’t like reading this stuff, you don’t have to follow; if you’ve followed, apparently you do.)  Second, when I reply to someone via a reblog, it shows up in their notifications in a way that makes it harder to ignore than someone quoting you in a forum post, a way that makes them have to attend to your message (“give you the floor”), even if it isn’t continually appealing.  (At least it seems that way in my own experiences of receiving both sorts of responses.)

 (I’ve been more successful on Discord recently, which is a “competing for attention” venue.  But largely, I think, due to confidence built from knowing that people there already know me from tumblr.)
