@adzolotl replied to this post
Changing one weight by 1,000 units does seem like it could break things, and is rotationally similar to changing 1,000,000 weights by one unit. (A “unit” here is, like, the stdev of weights, and I’m assuming this is similar for different blocks of weights, which is sketchy but makes things more Euclidean.)
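(To be concrete about "rotationally similar": both perturbations have L2 length 1,000 in those stdev-units, so they're the same size of step, just pointed along different axes. Quick numpy check, nothing deep:)

```python
# A spike of 1,000 on one weight vs. a spread of 1 on 1,000,000 weights:
# same L2 norm, so a rotation maps one onto the other.
import numpy as np

spike = np.zeros(1_000_000)
spike[0] = 1_000.0           # one weight moved by 1,000 units
spread = np.ones(1_000_000)  # a million weights moved by 1 unit each
print(np.linalg.norm(spike), np.linalg.norm(spread))  # 1000.0 and 1000.0
```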
Hmm… part of why I think “changing one weight can’t break things” is that NN weights are often “inactive” for a large fraction of inputs.
Like, if it’s an operation before a relu, and that relu is doing useful work, then the relu output is zero for some nontrivial fraction of inputs, and for those it doesn’t matter what the weight is.
So the full loss is an average over those inputs, where screwing up this weight doesn’t change the problem, and the other inputs, where it does. For the “bad” inputs, the changed weight might produce large gradients in other weights, but in each mini-batch you still have “good” inputs with gradients that won’t like moving away from what currently works for those inputs.
I haven’t tried this, though, maybe it’s worse than I think. You probably could mess things up by changing a weight in a layer/batch norm or something like that that’s active for all inputs.
Anyway, that reasoning is all specialized to the weight basis.
So you might think: if I need that argument to explain why a step in the weight basis is OK, then maybe a step of the same magnitude in an arbitrary basis would be not-OK.
But again, in N-d, most vectors are nearly orthogonal. If you have a trained weight matrix before an activation fn., its rows ("filters") are specially chosen to be not-orthogonal-to-the-input a nontrivial fraction of the time (picking out inputs which the filter matches). If you add a random vector to each row, you don't screw up that property; the random vector is nearly orthogonal to most inputs (b/c everything is).
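(Same kind of quick sanity check, untested, with a made-up toy setup: a unit-norm "filter" row, inputs constructed to match it, and a random unit-norm perturbation. The filter's cosine with its matched inputs stays large; the perturbation's cosine with those same inputs is on the order of 1/sqrt(d), i.e. basically nothing:)

```python
# A random perturbation barely changes a filter's response to the inputs
# it matches, because the perturbation is nearly orthogonal to (almost)
# every fixed input in high dimensions.
import numpy as np

rng = np.random.default_rng(0)
d = 1024
unit = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)

filt = unit(rng.standard_normal(d))           # pretend trained "filter" row
noise = unit(rng.standard_normal((5000, d)))  # per-input noise directions
inputs = filt + 0.5 * noise                   # inputs the filter "matches"

perturb = unit(rng.standard_normal(d))        # random perturbation direction

def cos(a, B):
    # cosine of unit vector a with each row of B
    return (B @ a) / np.linalg.norm(B, axis=1)

print(np.abs(cos(filt, inputs)).mean())     # ~0.9: filter stays matched
print(np.abs(cos(perturb, inputs)).mean())  # ~1/sqrt(d): perturbation is ~orthogonal
```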
So again, there’s a “good” part of the input space where the perturbation doesn’t matter. The “bad” inputs (where the perturbation is harmful) give you gradients telling you to get rid of the perturbation, but they’re noisy gradients and by themselves might scramble the original filter. But if they try to do that, the gradients from the “good” part of the input will say “hey, no, that thing’s useful” and ensure it stays.
I’m sure there are adversarial directions that do destroy your progress by compounding the same type of error across layers, just as there are adversarial directions where you change one weight in a layer/batch norm. They just aren’t most directions.
I’m just thinking out loud here… I guess this sounds like a half-baked theory of why NNs are easy to optimize, dunno how well it holds up.
