natashajaques asked:
Hey, I just read your post about how attention relates to CNNs/RNNs, and I loved it! The intuitive explanations are so good! I was thinking about tweeting it, but I noticed you did make one slight error. Attention was first invented for machine translation, in the paper "Neural Machine Translation by Jointly Learning to Align and Translate" by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.
Thanks. I think that still falls under the category of “processing pairs of texts,” although you’re doing NLU for one text and NLG for the other, rather than NLU for both as in textual entailment / NLI. I realize my post does kind of make it sound like attention was invented for NLI, but in context I don’t think the distinction is very important?
Like, one of the things that makes the transformer literature less intuitive than it could be is the fact that the transformer also originated in MT, leading to the distinction between the “transformer encoder” and the “transformer decoder,” which is mostly historical baggage at this point. “Attention between different sequences” vs. “self-attention” still seems like an important distinction at the relevant level of generality, but distinctions between different types of paired-sequence tasks don’t change any of the intuition.
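For what it’s worth, the “attention between different sequences” vs. “self-attention” distinction comes down to nothing more than where the queries and the keys/values originate. Here’s a minimal sketch in plain NumPy (function names, shapes, and variable names are illustrative, not any particular library’s API):

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ values                          # (n_q, d)

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 8))   # e.g. source-sentence states
tgt = rng.normal(size=(3, 8))   # e.g. target-sentence states

# Self-attention: Q, K, V all come from the same sequence.
self_out = attention(src, src, src)    # shape (5, 8)

# Attention between different sequences (as in the Bahdanau et al. MT
# setup): Q from one sequence, K and V from the other.
cross_out = attention(tgt, src, src)   # shape (3, 8)
```

Everything downstream of that one choice (where Q comes from vs. where K/V come from) is the same computation, which is why the paired-sequence task type doesn’t matter for the intuition.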
