pamorolo asked: When training the uploadedyudkowsky model, did you use the complete EY blog posts 2006-2009 or just the ones that were collected into the Sequences?

I think the former?  The original source was an ebook I had found somewhere called “Eliezer Yudkowsky Blog Posts 2006-2010: An Unofficial Compendium.”

Blissett dispelled all doubts on 30 June 2004, when he appeared on the British television sports show Fantasy Football League - Euro 2004, broadcast on ITV. During the whole show, Blissett intelligently joked and quipped about his own (alleged) involvement in the Luther Blissett Project. After host Frank Skinner read a line from the novel Q’s prologue (“The coin of the kingdom of the mad dangles on my chest to remind me of the eternal oscillation of human fortunes”), Blissett produced a copy of Luther Blissett’s Italian book Totò, Peppino e la guerra psichica (Toto, Peppino and the psychic war) and quoted extensively from it, in the original Italian: “Chiunque può essere Luther Blissett, semplicemente adottando il nome Luther Blissett” (Anyone can be Luther Blissett simply by adopting the name Luther Blissett). At the end of the show, hosts and guests all said in unison: “I am Luther Blissett!” Two years later, highlights of this broadcast were posted on YouTube.

the-moti:

nostalgebraist:

Copy/pasting a comment I wrote on Hacker News since it seems likely to interest people who like my posts on this sort of thing:

I suspect GPT-2’s limitations with larger-scale structure have less to do with the capacity to track long-range dependencies (which shouldn’t be a problem for an attention-based architecture), and more to do with language modeling itself as a task.

Language modeling is about predicting what can be predicted about the rest of a text, given the first N tokens. Not everything in text can be predicted in this way, even by humans; the things we say to each other tend to convey novel information and thus aren’t fully compressible. And indeed the compressibility of text varies across a text in a way that is itself relatively predictable. If someone writes “for all intents and” you can be pretty sure the next word is “purposes,” i.e. you’re unlikely to learn much when you read it; if someone is writing a dialogue between two characters, and you’re about to see one of their names for the first time, you will learn something new and unpredictable when you read the next word, and you know that this will happen (and why).

A language modeling objective is only really natural for the first of these two cases. In the latter case, the “right” thing to do from the LM perspective is to output a fairly flat probability distribution over possible names (which is a lot of possibilities), assigning very low probability to any given name. But what this means is actually ambiguous between “I am unsure about my next observation because I don’t understand the context” and “I understand the context, and it implies (predictably) that my next observation will be inherently unpredictable.”

Since any model is going to be imperfect at judging whether it’s about to see something unpredictable, it’ll assign some weight to the next observation being predictable (say, a repeated topic or name) even if it’s mostly sure it will be unpredictable. This will push up the probabilities of its predictions on the assumption of predictability (i.e. of a repeated topic/name), and meanwhile the probability of anything else is low, because if an observation is unpredictable then it might well be anything.

I hypothesize that this is behind behavior like putting a single name (“Obama” in your earlier example) in too many roles in an article: if only Obama has been mentioned, then either an upcoming name is “Obama” (in which case we should guess “Obama”) or it’s some other name (in which case we should guess against Obama in slight favor of any other name – but this will only be conveyed to the model via the confusing signal “guess this arbitrary name! now this other one! now this one!”, with the right trend only emerging in the average over numerous unpredictable cases, while the predictable-case rule where you guess the name that has already been mentioned is crystal-clear and reinforced in every case where it happens to be right).
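The arithmetic behind this can be made concrete. All the numbers below are hypothetical, chosen only to illustrate the shape of the argument:

```python
# Hypothetical numbers: suppose the model is 90% sure the upcoming name is
# novel (unpredictable), spreading that mass uniformly over 10,000 candidate
# names, while the 10% weight on "it's a repeat" all lands on the one name
# already in context ("Obama").
p_repeat = 0.10
n_names = 10_000
p_new_each = 0.90 / n_names  # 0.00009 per individual novel name

# Even though the model "believes" the name is probably novel, the
# already-mentioned name dwarfs every individual alternative, so greedy or
# truncated sampling keeps picking it:
assert p_repeat / p_new_each > 1000
```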

I also suspect the use of a sub-word encoding (BPE) in GPT-2 exacerbates this issue once we are doing generation, because the model can initially guess only part of the high-entropy word without fully committing to a repeat (say just the “O” in “Obama”), but once this becomes part of the context the probability of a repeat is now much higher (we already thought “Obama” was unusually probable, and now we’re looking for a name that starts with “O”).
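A toy Bayes-rule calculation (again with made-up numbers) shows how committing to just the “O” token compounds the repeat probability:

```python
# Hypothetical numbers: before sampling, P(next name = "Obama") = 0.10, and
# the other 90% is spread over other names, of which only 2% start with "O".
p_obama = 0.10
p_other = 0.90
p_O_given_obama = 1.0   # "Obama" always starts with "O"
p_O_given_other = 0.02  # few alternative names do

# Once the "O" token is emitted and becomes part of the context:
p_O = p_obama * p_O_given_obama + p_other * p_O_given_other  # 0.118
p_obama_given_O = p_obama * p_O_given_obama / p_O

# Sampling just the "O" pushed the repeat probability from 10% to ~85%:
assert p_obama_given_O > 0.8
```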

Doesn’t this suggest that significant improvements can be made by grafting some kind of hack (like some new variation of top-p) onto GPT-2? Like if we can distinguish “genuinely confused” from “confidently predicting uncertainty” then we want to bias text generation to “play it safe” and choose something high-probability in the first case and “get creative” and choose something low-probability in the second case. It might be possible to do this by clumping, e.g., all proper nouns together, and checking whether uncertainty remains high – this seems more likely in the first case.
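One way to sketch that check: merge all the proper-noun tokens into a single outcome and compare entropies before and after. The toy distributions below are made up for illustration:

```python
import math

def entropy(ps):
    """Shannon entropy in bits of a probability list."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def clumped_entropy(ps, name_idx):
    """Entropy after merging all proper-noun tokens into one outcome."""
    name_mass = sum(ps[i] for i in name_idx)
    rest = [p for i, p in enumerate(ps) if i not in name_idx]
    return entropy(rest + [name_mass])

# Hypothetical toy distributions over 6 tokens; the last 4 are proper nouns.
confused   = [0.2, 0.2, 0.15, 0.15, 0.15, 0.15]        # unsure about everything
names_flat = [0.05, 0.05, 0.225, 0.225, 0.225, 0.225]  # sure it's *a* name, unsure which
names = {2, 3, 4, 5}

# After clumping, the "confidently predicting a name" case collapses to low
# entropy, while the genuinely-confused case stays high:
assert clumped_entropy(names_flat, names) < clumped_entropy(confused, names)
```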

That sounds very interesting!  I think I’ll try it out sometime soon.

One simple approach to this would be a sort of “middle-p” sampling, where instead of reducing the probability mass from 1 to p by eliminating the lowest-probability choices, you do so by eliminating the lowest- and highest-probability choices in a roughly balanced way.  (In more confident situations where the highest probability is more than (1-p)/2, you wouldn’t take anything off the high end and it’d just be top-p.) 

The intuition being that if, say, the top 2 tokens only account for 10% of the mass, we aren’t really doing anything weird by not choosing them (since the model thinks they collectively have a 90% chance of not happening).  But, this will stop the model from trying to use weak heuristics it knows are weak, like “names are sometimes repeated,” to get a slight edge in cases of intrinsic unpredictability.
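A minimal sketch of this “middle-p” filter over raw next-token probabilities (the function names and details here are mine, not from any existing library):

```python
import numpy as np

def middle_p_filter(probs, p=0.9):
    """Indices kept after trimming roughly (1-p)/2 of probability mass from
    each of the high- and low-probability ends.  When the single most
    probable token already exceeds (1-p)/2, nothing is trimmed from the high
    end and this reduces to ordinary top-p."""
    trim = (1.0 - p) / 2.0
    order = np.argsort(probs)[::-1]  # token indices, most probable first
    sp = probs[order]

    # Drop leading tokens while the dropped mass stays within the high-end
    # trim budget (a dominant top token is therefore never dropped).
    hi = 0
    while hi < len(sp) - 1 and sp[:hi + 1].sum() <= trim:
        hi += 1

    # Spend the rest of the (1 - p) budget trimming from the low end.
    lo_budget = (1.0 - p) - sp[:hi].sum()
    lo = len(sp)
    while lo > hi + 1 and sp[lo - 1:].sum() <= lo_budget:
        lo -= 1

    return order[hi:lo]

def middle_p_sample(probs, p=0.9, rng=None):
    """Sample a token index from the renormalized middle-p distribution."""
    rng = rng or np.random.default_rng()
    kept = middle_p_filter(probs, p)
    kp = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kp))
```

With a flat distribution this trims a little from both ends; with a confident one it leaves the top token alone and behaves like plain top-p.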

By the way, there’s a neat GPT-2 demo here that lets you see token probabilities as a text is generated, and control the generation by selecting from the top 10.  Here are two screenshots demonstrating the phenomenon we’re talking about (the second is a confident prediction, for comparison):

[screenshot: the demo showing a flat token-probability distribution]
[screenshot: the demo showing a confident prediction]

hello everyone 

i would like to tell you first of all that this post is definitely me and not my wife typing it while i went to the bathroom

definitely me, pay no attention to changes in typing style here. this is, as always, #just the facts

i am a very tired man today, but i am also very soft and kind. even though i am worn down and struggling to do things, i am still gentle and loving with my wife. even if i am so tired i stop talking in words and just make small distressed noises.

also, i am thirty one years old but never heard the duck tales theme song before today and was briefly convinced that the words “life is like a hurricane here in duckburg” were a quote from slam poetry. i was probably too busy being smart and disassembling old electronics as a child to watch that many cartoons. i am very cute.

What makes you feel the power of your own potential? For me, it’s a list of nouns.

I know I have a lot of gripes about academia, but I just came up with a new one, or at least a new point of confusion.

Ph.D students, or at least the most successful X% of Ph.D students for some X, clearly provide value to the academic departments they inhabit.  They do the majority of the “grunt work” involved in research and also a lot of the non-grunt work.  They frequently co-author publications, supposedly in recognition of their contributions.  If they are in training, it is more in the way a junior employee is training to be a senior one than in the way of an ordinary student.

So why don’t more departments re-hire their own students after they graduate, as postdocs or assistant professors?  Why invest in training someone, build a working relationship and benefit from that relationship, and then burn it all down when they pass some milestone?

It’s like a world where the standard way to promote junior employees is to give them an award saying “you are qualified for senior positions” and then fire them.  A world where such employees only end up in a senior position at the same company by applying and interviewing all over again, and where this is considered almost a coincidence, and is the exception rather than the rule.

I’ve thought for a long time that many pathologies of academia trace back to the fact that much of the work is done by people who don’t expect to be around in a few years, who have little reason to build long-term trust with their co-workers or contribute to institutional health, and who are actually competing with one another for scarce jobs somewhere else.  But somehow I never asked myself why this was the case, and I can’t come up with an answer at all, not even a cynical answer.

N.B. this seems like a distinct issue from the academic job funnel, in that I’m asking why the pool of hired postdocs and professors has the composition it does, not the size it does.  But I suppose the answer could be related to the funnel?  Maybe you can only keep a “student” around for so long before you have to admit they’re basically just a colleague, and have to pay them accordingly in money and social status.  And so you have to keep kicking out successful people and hoping equally successful people flow in at the other side.  But I dunno, the costs still seem immense.

It seems pretty clear to me by now that GPT-2 is not as dangerous as OpenAI thought (or claimed to think) it might be.

The 774M version has been out there for a while, and although it only has half as many parameters as the biggest version, I don’t expect there to be any large qualitative leap between the two.  After all, OpenAI’s staged release plan has given us two size-doublings already – from 124M to 355M, from 355M to 774M – and the differences after each doubling are surprisingly subtle, and overlaid on the same basic, recognizable strengths and weaknesses.

I’ve played with these models a lot this year, mostly via fine-tuning – it’s almost a hobby at this point.  I’ve

  • fine-tuned them on all sorts of different texts, including this tumblr
  • fine-tuned them on mixtures of very different texts (not very interesting – it’ll decide which type of text it’s writing in any given sample and stick with it)
  • tried different optimizers and learning rates for fine-tuning
  • experimented with custom encodings (common tags -> single non-English characters) to fit more text into the window when fine-tuning on webpages
  • tried to generate longer texts by repeatedly feeding the output in as context (i.e. prompt)
  • twiddled all the sampling parameters (temperature, top-k / top-p / neither) when sampling from any of the above
  • read over tons and tons of sampling output while monitoring a fine-tuning job, curating material for @uploadedyudkowsky, etc.
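The “repeatedly feeding the output in as context” item above amounts to a sliding-window loop like the following sketch, where `sample_continuation` is a hypothetical stub standing in for a real GPT-2 sampling call:

```python
CONTEXT_LEN = 1024  # GPT-2's context window, in tokens

def sample_continuation(prompt_tokens):
    # Stand-in for a real model call that samples a continuation of the
    # prompt; here it just emits copies of the last token, for illustration.
    return prompt_tokens[-1:] * 8

def generate_long(prompt_tokens, total_len):
    """Generate `total_len` tokens by repeatedly re-prompting with the most
    recent window of output (the "self-prompting" trick)."""
    text = list(prompt_tokens)
    while len(text) < total_len:
        window = text[-CONTEXT_LEN:]  # keep only what fits in the window
        text.extend(sample_continuation(window))
    return text[:total_len]
```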

By now I think I have a good feel for the overall quality, and the quirks, of GPT-2 sampled text.  IMO, the model is good at all sorts of interesting things, but arguably least good at the things required for disinformation applications and other bad stuff.

———

It is best at the smallest-scale aspects of text – it’s unsettlingly good at style, and I frequently see it produce what I’d call “good writing” on a phrase-by-phrase, sentence-by-sentence level.  It is less good at larger-scale structure, like maintaining a consistent topic or (especially) making a structured argument with sub-parts larger than a few sentences.

Some of this is completely intuitive: GPT-2, which only learns from text, is at the largest disadvantage relative to humans in areas that require models of the outside world (since we experience that world in many non-textual ways), while there is much more parity in areas like style that are purely internal to language, especially written language.

Some of it is less intuitive.  GPT-2 samples often lack some large-scale features of real text that seem very simple and predictable.  For example, when generating fiction-like prose, it will frequently fail to track which characters are in a given scene (e.g. character A has some dialogue yet character B refers to them as if they’re not in the room), and has a shaky grasp of dialogue turn conventions (e.g. having the same character speak twice on successive lines).  In nonfiction-like prose, it tends to maintain a “topic” via repeating a set of key phrases, but will often make wildly divergent or contradictory assertions about the topic without noting the discontinuity.

I suspect some of this can be chalked up to the fact that GPT-2 is trained as a language model, i.e. as something that predicts real text, which is not quite the same thing as generating fake text.  Its training objective only cares about the distribution of training text, and does not encourage it to respond to its own predictive distribution in a stable or nice way.  (Note that its predictive distribution, by construction, is different from real text in that it’s less surprising to the model – see this great paper.)

The fact that feeding samples from the predictive distribution back into GPT-2 for further prediction produces impressive “generated text,” and not garbage, is thus a happy accident rather than an optimization target.  Indeed, getting this to happen requires judicious choice of the sampling method, and (op. cit.) some naive sampling methods do yield garbage.

Even with good sampling methods like top-p, the stability of sampling is somewhat brittle; when I’ve tried to generate texts longer than the context window via repeated “self-prompting,” I’ve noticed a phenomenon where the text will usually fall off a quality cliff after a certain point, suddenly becoming strikingly ungrammatical and typo-ridden and full of anomalous paragraph breaks.  [EDIT 6/10/20: I now think this may have been due to a bug in my code, and in any event I no longer think it’s a robust property of GPT-2 generation.]  My hypothesis is that this works like the panda/gibbon adversarial examples: the samples have an uncommonly high density of features GPT-2 can recognize, and eventually there’s a confluence of these that push in the same direction in some linear subspace (consider here the use of a non-saturating activation, gelu, in the transformer), which pushes the model far from the training manifold.

To zoom back out again, the model is capable of frequent brilliance at the phrase, sentence and even paragraph level, but its samples struggle with more global coherence across the scale of a short article or longer, and with maintaining recognizable positions that look like they refer to the real world.  (In conjunction with the lower-level good writing, this often generates an amusing “insight porn” effect: it feels like someone is saying something very intelligent and interesting… if only you could figure out what.)

———

My knee-jerk reaction is that this makes the model relatively useless for disinformation.  Telling it to “argue for X” or even “write about X” is quite difficult, while aiming for specific genres or styles is very effective.

The real situation is a little more subtle than that.  The model is unusually good at making things that look like news stories, presumably because they are common in the training set; in OpenAI’s large collection of released unconditional samples, news-like text dominates.  Thus, presuming you can find an effective way to feed a fake event into the model on the concept level, it will be able to generate convincing “fake news” that stays on topic and so forth.

This is what the creators of “GROVER” have done, albeit with a custom training corpus.  Roughly, they’ve trained a transformer to understand the relation between a news headline and the corresponding story in a structured way, allowing them to feed in the core substance of a hypothetical news story via the headline.  They then sample the body text, and (interestingly) loop back and generate the headline, overwriting the initial one.

What they show, basically, is that this lets you take a headline from Breitbart or Infowars or some “natural cancer cures” type website, generate from it a consistent news story in the style of a “real news” venue like the NYT, and then loop back and re-write the headline in a “real news” style as well.  Perhaps unsurprisingly, MTurkers then rate the resulting texts as more trustworthy than the originals.

There is definitely something a little scary about this, especially in the way it does give you close control over the topic, something that’s difficult with simple text prompting.  On the other hand… do we really believe that, in 2019, with Trump as president, the Breitbart type of fake news is suffering from a stylistic credibility gap?  That there are people ready to believe that vaccination is an evil conspiracy, but only if the claim comes with an article that sounds like the NYT or WaPo?

The niche filled by this technology for bad actors just doesn’t feel like a niche that needs filling.  Lots of people will reshare articles on social media just based on the headline, without even clicking through, and people less trusting than this often (and sensibly) care about the actual source, not just the style.  I’m just not sure there’s a role for a device that will let you register TotallyARealNewspaper.biz and then auto-fill it with articles that sound exactly like Paul Krugman telling you that immigration = genocide.

And then, too, there’s the observation that actually prompted this post: AFAIK, the bad actors are not doing this stuff.  People have mostly used the technology for clearly-signposted fake subreddits and other harmless amusements.  GROVER was created by academics as a threat modeling exercise, on the premise that bad actors could make such a thing, so we’d better be prepared.  But where are the actual GPT-2 clickfarms?  They totally could exist by now, but I’ve never heard of even a single one.  (And trust me, it’s not like the text generation is so good that no one would ever notice.)

tanatola:

[image]

i did rewatch eva though

(via azdoine)