
you-have-just-experienced-things:

@nostalgebraist:

since you know about transfer learning, can you help me get a sense of “how good” the results in the deepmind paper were, apart from the question of how far they might generalize?  i was confused about this here: http://nostalgebraist.tumblr.com/post/157414651209/deepmind-just-published-a-mind-blowing-paper

Short answer: I have no frickin’ clue.

Longer answer:

“How far they might generalize” is actually the most interesting part of the PathNet paper to me, and I think it’s also the most important part.

To recap the things that sucked about the transfer-learning-for-text-classification papers I was reading circa 2011:

1) They had all kinds of horrible domain-specific hacks related to text classification.
2) A lot of the benchmarks were incredibly contrived because the researchers used niche datasets. Many of those papers were less “here’s how this technique applies to a standard dataset/problem” and more “here’s how this technique applies to my pet dataset that I’m interested in.” This was true in general, not just for transfer-learning-related papers–my own lab group was really guilty of this as well.
3) A lot of the benchmarks were incredibly contrived because of how success was measured (e.g. reporting only accuracy, not precision/recall/F1/ROC area/etc etc; see the toy example after this list).
4) A lot of the benchmarks were in fact genuine, but the methods described turned out not to be useful enough to be adopted by other researchers.
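
To make #3 concrete, here’s a toy illustration (the numbers are invented, not from any of those papers) of how accuracy alone can look great on an imbalanced dataset while the metrics you actually care about are poor:

```python
# Toy illustration: a classifier that almost never predicts the positive class.
y_true = [1] * 10 + [0] * 990           # 10 positives, 990 negatives
y_pred = [1] * 1 + [0] * 9 + [0] * 990  # finds only 1 of the 10 positives

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(accuracy)   # 0.991 -- looks impressive
print(precision)  # 1.0
print(recall)     # 0.1  -- misses 9 of the 10 positives
print(f1)         # ~0.18
```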

In the PathNet paper, #1 is not really an issue–the method they describe is pretty general. I’m going to assume that #2 is not an issue, since these are real videogames with well-defined reward functions. #4 is basically why I advised caution and just waiting for more papers in my previous post.

Anyway, re #3, and analysis of the results in general, I am sort of hesitant to answer because I’m not a deep learning expert, nor did I really give the paper the time it deserves. Here are my initial impressions, though:

In Fig. 7, they measure the average score per training episode. This is kind of weird to me, since it’s mixing together the two metrics that we care about, learning speed and max score. This makes it difficult to interpret, and I really want to see these separately.
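
As a made-up illustration of why that’s hard to interpret: a curve that learns fast but plateaus at a low score and a curve that learns slowly but ends up much higher can produce similar averages per episode. (The curves and numbers below are invented, not taken from the paper.)

```python
import numpy as np

episodes = np.arange(1, 101)

# Two invented learning curves:
# agent A learns quickly but plateaus near a max score of 50,
# agent B learns slowly but is still climbing toward a max score of 100.
fast_low = 50 * (1 - np.exp(-episodes / 5))
slow_high = 100 * (1 - np.exp(-episodes / 60))

print(round(fast_low.mean(), 1), round(slow_high.mean(), 1))  # ~47.7 vs ~51.7: similar averages
print(round(fast_low[-1], 1), round(slow_high[-1], 1))        # ~50.0 vs ~81.1: very different final scores
```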

Looking at Fig. 6, Fig. 12, and Fig. 9 actually makes things less clear–in Fig. 6 and Fig. 12, they show PathNet sometimes having a significantly higher max score and sometimes learning much faster than the control or the fine-tuning method. That’s not the case for the other tasks in Fig. 9–PathNet does not have a significantly higher max score on those tasks. This means that in Fig. 10 you are mostly just looking at learning speed.

Also, in Fig. 9 they show “[m]eans of the best 5 (out of a size 24 parameter sweep) training runs.” Uh…why? This seems totally arbitrary. To make things more confusing, they then show the best runs in Fig. 10.

In Fig. 7, fine-tuning has an average score of 1.15 and a variance of 0.53. PathNet has an average score of 1.33 and a variance of 1.15 (!!). It’s clear that the path selection is not always all that robust between tasks, which is concerning. They don’t show the score/time graphs for these runs, though, so we don’t know whether PathNet had a lower max/final score or took longer to reach its final score (only one example where PathNet has a lower score is shown in Fig. 12).

I agree with your analysis of Fig. 10. Even though that matrix is basically just showing you differences in learning speed, it should still be symmetric. What concerns me is, again, the variance.

So, yeah, there is nothing in here that elicits a “holy shit” reaction from me. Really, I just wish the results were broken out in more detail.

Sort of related: The authors mention using A3C instead of tournament selection a few times. It’s not clear to me why they didn’t just do this, except that it would have taken a bit longer. Are they just setting up for a future paper?

Thanks for this – I remember feeling confused, trying to puzzle my way through those figures, and this clears up some things while also reassuring me that I am not the only one who finds the paper confusing.

I had completely missed the “average score per training episode” thing which, yes, mixes together two things we care about.

The various arbitrary or suboptimal choices in the paper look to me like the sorts of things that come from trying to get a paper together very quickly: “oh, I already [made this figure / computed this statistic] earlier, great, we can put that right in the paper,” when the specific figures/statistics may not be the best choices for the paper itself, they’re just what you had lying around from the tests you did while working on the project.  I can imagine “let’s look at the means of the best 5 out of 24” making sense for technical reasons while you’re still working rather than writing, and then being hard to justify later without reference to your workflow.  Likewise, “oh we just realized we should have done A3C instead, but we don’t want to spend time generating all our data again, when we already have data that looks nice and publishable.”  But that’s all just a guess.

With figures 7 and 10, there’s a certain kind of asymmetry that makes sense, because the “target” task is the one you’re actually evaluated on; thus if, say, there’s a game that the network just sucks at no matter what, all scores will be low when that is the target.  (And there isn’t a corresponding effect for source tasks.)  So I decided to subtract out de novo performance from each column, and in that case the PathNet matrix was indeed “more symmetric” than the fine-tuned one, but still … 
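
Concretely, here’s roughly the sort of adjustment I mean, assuming rows are source games and columns are target games (all the numbers below are made up for illustration, not values from the paper):

```python
import numpy as np

# Invented transfer matrix: transfer[src, tgt] = score on the target game
# after first training on the source game.
transfer = np.array([
    [1.0, 1.4, 0.6],
    [1.3, 0.8, 0.7],
    [1.1, 1.0, 0.5],
])

# Hypothetical from-scratch ("de novo") score for each target game.
de_novo = np.array([0.9, 0.9, 0.4])

# Subtract each target's from-scratch score out of its column, so what's left
# is the boost (or penalty) attributable to having seen the source task first.
boost = transfer - de_novo[np.newaxis, :]

# Crude symmetry check: if doing A first helps with B about as much as doing
# B first helps with A, this difference should be small.
print(np.abs(boost - boost.T).max())
```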

I guess a more basic confusion I had is that when I think of “demonstrating transfer learning,” I think of showing things like some kind of source-target symmetry in the results (if doing A first helps with B, doing B first should help with A?), or evidence that the process gets more of a boost when the source and target are similar.  In this case, we just got a bunch of raw statistics about a network whose performance seemed to be all over the place.  The very fact that I had to subtract something from the columns in Figs. 7 and 10 feels very weird; in a paper about transfer learning, I shouldn’t have to do extra math to extract the metrics relevant to transfer learning.

That made me feel like maybe I was wrong about what the community meant by “transfer learning.”  Certainly PathNet generally does better than the thing they compared it to, and it pre-trains on other tasks where the other thing doesn’t, and I guess to some people “transfer learning” just means “pre-trains on other tasks + does better.”  It seems to me like you need more than that to show that it was really transfer learning and not “this is a different model fed different data, and it happens to be better than the other one.”

Anyway, PathNet: possibly a big deal, but if so, not for the obvious reason?  Something like that.
