raginrayguns asked:
I've heard you describe neural networks as learning differentiable maps from R^j to R^k, I think? It just occurred to me that they have to be differentiable with respect to the weights in order to be trained with gradient descent, but do they actually have to be differentiable with respect to the input features? Like, are these connected, or do they just happen to both be true?
They don’t have to be differentiable with respect to the input, no.
Indeed, in a case like GPT, the input space is discrete, and you can’t be differentiable with respect to it.
On the other hand… most of the time, a neural network is a long string of composed functions, where each one is parameterized by some “weights”.
Like y(x) = p(q(h(g(f(x))))), where x is the input. Except each of the functions p, q, …, f has an extra second argument: its weights.
You want the derivatives w/r/t the weights of p, q, …, f. Applying the chain rule, you get terms like p’(q(…)), q’(h(…)), etc.
So every one of these functions has to be differentiable w/r/t its “input argument” – except for the very first one, f, which doesn’t have to be differentiable with respect to x (the input of the whole network).
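To make the chain-rule bookkeeping concrete, here’s a minimal sketch with three scalar layers. The particular functions (a scale, a tanh, another scale) are my choices just for illustration, not any real architecture:

```python
import math

# Toy version of y(x) = p(q(f(x))), each layer carrying one scalar weight.

def f(x, a):            # first layer: f(x) = a * x
    return a * x

def q(u, b):            # middle layer: q(u) = tanh(b * u)
    return math.tanh(b * u)

def p(v, c):            # last layer: p(v) = c * v
    return c * v

def forward_and_weight_grads(x, a, b, c):
    u = f(x, a)
    v = q(u, b)
    y = p(v, c)
    sech2 = 1.0 - v ** 2         # tanh'(b*u): q's derivative machinery
    dy_dc = v                    # direct: dp/dc
    dy_db = c * sech2 * u        # chain: p'(v) * dq/db -- needs p' w/r/t its input
    dy_da = c * sech2 * b * x    # chain: p'(v) * q'(u) * df/da -- needs q' too
    # Note what never appears: df/dx. Only f escapes the input-derivative
    # requirement, because nothing sits "before" it in the chain.
    return y, (dy_da, dy_db, dy_dc)
```

Reaching the earliest weight (dy_da) forces us through p′ and q′, the derivatives of the later layers with respect to their inputs, but df/dx never shows up anywhere.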
That is, neural networks are usually differentiable “most of the way through,” from the output back to the first function in the composition, which is usually very simple. Even if dy/dx doesn’t exist, the derivative of y with respect to f’s output has to exist, and that derivative spans all of the nontrivial calculation done by the network. In the case of GPT, “f(x)” is a lookup table that maps discrete tokens to vectors, and everything after that is differentiable w/r/t the vectors.
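A stripped-down sketch of that GPT-style front end (a hypothetical two-token vocabulary, with a single dot product standing in for everything after the lookup):

```python
# f is a lookup table: discrete token ids -> vectors. The table entries
# are weights; the token id itself never receives a derivative.
embedding = {
    7:  [0.1, -0.3],
    42: [0.8,  0.2],
}
w = [0.5, -1.0]   # weights of the (differentiable) rest of the network

def f(token_id):
    return embedding[token_id]      # no df/d(token_id) exists -- or is needed

def g(vec):                         # stand-in for "everything after f"
    return sum(wi * vi for wi, vi in zip(w, vec))

def grad_wrt_embedding(token_id):
    # d(g(f(t)))/d(embedding[t]) = w: the gradient lands on the looked-up
    # *vector*, which is exactly how the embedding table gets trained.
    return list(w)
```

The discrete token id just selects which row of the table the gradient flows into.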
In this sense, it feels natural to speak of neural networks as “learning differentiable functions” even if it’s not always strictly true.
(P.S. there’s an unrelated nuance with commonly used functions like ReLU, which are differentiable almost everywhere but not everywhere. In practice this doesn’t really matter, as you’d probably expect.)
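For instance, here’s the usual way that nuance gets handled, sketched in a couple of lines (picking 0 at the kink is a common convention, and I believe it’s what frameworks like PyTorch do):

```python
def relu(z):
    return z if z > 0.0 else 0.0

def relu_grad(z):
    # The derivative is undefined exactly at z == 0 (left limit 0, right
    # limit 1). In practice you just pick a value -- 0 here -- since
    # hitting exactly 0.0 in floating point is vanishingly rare.
    return 1.0 if z > 0.0 else 0.0
```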
