Install Theme

corvidinthewoods asked:

if i add alt text to an image i send frank, will she read just what the image says, just what the alt text says, or both?

She doesn’t read alt text. She “sees” images in two separate ways, but neither of them involve reading alt text.

Also, there’s a tumblr bug where if you add alt text to an image in an ask, the recipient won’t see it on their end. At least this was true last time I checked.

gpt-4 prediction: it won’t be very useful

Word on the street says that OpenAI will be releasing “GPT-4” sometime in early 2023.

There’s a lot of hype about it already, though we know very little about it for certain.

———-

People who like to discuss large language models tend to be futurist/forecaster types, and everyone is casting their bets now about what GPT-4 will be like. See e.g. here.

It would accord me higher status in this crowd if I were to make a bunch of highly specific, numerical predictions about GPT-4’s capabilities.

I’m not going to do that, because I don’t think anyone (including me) really can do this in a way that’s more than trivially informed. At best I consider this activity a form of gambling, and at worst it will actively mislead people once the truth is known, blessing the people who “guessed lucky” with an undue aura of deep insight. (And if enough people guess, someone will “guess lucky.”)

Why?

There has been a lot of research since GPT-3 on the emergence of capabilities with scale in LLMs, most notably BIG-Bench.

Besides the trends that were already obvious with GPT-3 – on any given task, increased scale is usually helpful and almost never harmful (cf. the Inverse Scaling Prize and my Section 5 here) – there are not many reliable trends that one could leverage for forecasting.

Within the bounds of “scale almost never hurts,” anything goes:

  • Some tasks improve smoothly, some are flatlined at zero then “turn on” discontinuously, some are flatlined at some nonzero performance level across all tested scales, etc. (BIG-Bench Fig. 7)
  • Whether a model “has” or “doesn’t have” a capability is very sensitive to which specific task we use to probe that capability. (BIG-Bench Sections 3.4.3, 3.4.4)
  • Whether a model “can” or “can’t do” a single well-defined task is highly sensitive to irrelevant details of phrasing, even for large models. (BIG-Bench Section 3.5)

It gets worse.

Most of the research on GPT capabilities (including BIG-Bench) uses the zero/one/few-shot classification paradigm, which is a very narrow lens that arguably misses the real potential of LLMs.

And, even if you fix some operational definition of whether a GPT “has” a given capability, the order in which the capabilities emerge is unpredictable, with little apparent relation to the subjective difficulty of the task. It took more scale for GPT-3 to learn relatively simple arithmetic than it did for it to become a highly skilled translator across numerous language pairs!

GPT-3 can do numerous impressive things already … but it can’t understand Morse Code. The linked post was written before the release of text-davinci-003 or ChatGPT, but neither of those can do Morse Code either – I checked.

On that LessWrong post asking “What’s the Least Impressive Thing GPT-4 Won’t be Able to Do?”, I was initially tempted to answer “Morse Code.” This seemed like as safe a guess as any, since no previous GPT was able to it, and it’s certainly very unimpressive.

But then I stopped myself. What reason do I actually have to register this so-called prediction, and what is at stake in it, anyway?

I expect Morse Code to be cracked by GPTs at some scale. What basis to I have to expect this scale is greater than GPT-4’s scale (whatever that is)? Like everything, it’ll happen when it happens.

If I register this Morse Code prediction, and it turns out I am right, what does that imply about me, or about GPT-4? (Nothing.) If I register the prediction, and it turns out I am wrong, what does this imply … (Nothing.)

The whole exercise is frivolous, at best.

———-

So, here is my real GPT-4 prediction: it won’t be very useful, and won’t see much practical use.

Specifically, the volume and nature of its use will be similar to what we see with existing OpenAI products. There are companies using GPT-3 right now, but there aren’t that many of them, and they mostly seem to be:

GPT-4 will get used to do serious work, just like GPT-3. But I am predicting that it will be used for serious work of roughly the same kind, in roughly the same amounts.

I don’t want to operationalize this idea too much, and I’m fine if there’s no fully unambiguous way to decide after the fact whether I was right or not. You know basically what I mean (I hope), and it should be easy to tell whether we are basically in a world where

  1. Businesses are purchasing the GPT-4 enterprise product and getting fundamentally new things in exchange, like “the API writes good, publishable novels,” or “the API performs all the tasks we expect of a typical junior SDE” (I am sure you can invent additional examples of this kind), and multiple industries are being transformed as a result
  2. Businesses are purchasing the GPT-4 enterprise product to do the same kinds of things they are doing today with existing OpenAI enterprise products

However, I’ll add a few terms that seem necessary for the prediction to be non-vacuous:

  • I expect this to be true for at least 1 year after the release of the commercial product. (I have no particular attachment to this timeframe, I just need a timeframe.)
  • My prediction will be false in spirit if the only limit on transformative applications of GPT-4 is monetary cost. GPT-3 is very pricey now, and that’s a big limiting factor on its use. But even if its cost were far, far less, there would be other limiting factors – primarily, that no one really knows how to apply its capabilities in the real world. (See below.)

(The monetary cost thing is why I can’t operationalize this beyond “you know what I mean.” It involves not just what actually happens, but what would presumably happen at a lower price point. I expect the latter to be a topic of dispute in itself.)

———-

Why do I think this?

First: while OpenAI is awe-inspiring as a pure research lab, they’re much less skilled at applied research and product design. (I don’t think this is controversial?)

When OpenAI releases a product, it is usually just one of their research artifacts with an API slapped on top of it.

Their papers and blog posts brim with a scientist’s post-discovery enthusiasm – the (understandable) sense that their new thing is so wonderfully amazing, so deeply veined with untapped potential, indeed so temptingly close to “human-level” in so many ways, that – well – it surely has to be useful for something! For numerous things!

For what, exactly? And how do I use it? That’s your job to figure out, as the user.

But OpenAI’s research artifacts are not easy to use. And they’re not only hard for novices.

This is the second reason – intertwined with the first, but more fundamental.

No one knows how to use the things OpenAI is making. They are new kinds of machines, and people are still making basic philosophical category mistakes about them, years after they first appeared. It has taken the mainstream research community multiple years to acquire the most basic intuitions about skilled LLM operation (e.g. “chain of thought”) which were already known, long before, to the brilliant internet eccentrics who are GPT’s most serious-minded user base.

Even if these things have immense economic potential, we don’t know how to exploit it yet. It will take hard work to get there, and you can’t expect used car companies and SEO SaaS purveyors to do that hard work themselves, just to figure out how to use your product. If they can’t use it, they won’t buy it.

It is as though OpenAI had discovered nuclear fission, and then went to sell it as a product, as follows: there is an API. The API has thousands of mysterious knobs (analogous to the opacity and complexity of prompt programming etc). Any given setting of the knobs specifies a complete design for a fission reactor. When you press a button, OpenAI constructs the specified reactor for you (at great expense, billed to you), and turns it on (you incur the operating expenses). You may, at your own risk, connect the reactor to anything else you own, in any manner of your choosing.

(The reactors come with built-in safety measures, but they’re imperfect and one-size-fits-all and opaque. Sometimes your experimentation starts to get promising, and then a little pop-up appears saying “Whoops! Looks like your reactor has entered an unsafe state!”, at which point it immediately shuts off.)

It is possible, of course, to reap immense economic value from nuclear fission. But if nuclear fission were “released” in this way, how would anyone ever figure out how to capitalize on it?

We, as a society, don’t know how to use large language models. We don’t know what they’re good for. We have lots of (mostly inadequate) ways of “measuring” their “capabilities,” and we have lots of (poorly understood, unreliable) ways of getting them to do things. But we don’t know where they fit in to things.

Are they for writing text? For conversation? For doing classification (in the ML sense)? And if we want one of these behaviors, how do we communicate that to the LLM? What do we do with the output? Do they work well in conjunction with some other kind of system? Which kind, and to what end?

In answer to these questions, we have numerous mutually exclusive ideas, which all come with deep implementation challenges.

To anyone who’s taken a good look at LLMs, they seem “obviously” good for something, indeed good for numerous things. But they are provably, reliably, repeatably good for very few things – not so much (or not only) because of their limitations, but because we don’t know how to use them yet.

This, not scale, is the current limiting factor on putting LLMs to use. If we understood how to leverage GPT-3 optimally, it would be more useful (right now) than GPT-4 will be (in reality, next year).

———-

Finally, the current trend in LLM techniques is not very promising.

Everyone – at least, OpenAI and Google – is investing in RLHF. The latest GPTs, including ChatGPT, are (roughly) the last iteration of GPT with some RLHF on top. And whatever RLHF might be good for, it is not a solution for our fundamental ignorance of how to use LLMs.

Earlier, I said that OpenAI was punting the problem of “figure out how to use this thing” to the users. RLHF effectively punts it, instead, to the language model itself. (Sort of.)

RLHF, in its currently popular form, looks like:

  • Some humans vaguely imagine (but do not precisely nail down the parameters of) a hypothetical GPT-based application, a kind of super-intelligent Siri.
  • The humans take numerous outputs from GPT, and grade them on how much they feel like what would happen in the “super-intelligent Siri” fantasy app.
  • The GPT model is updated to make the outputs with high scores more likely, and the ones with low scores less likely.

The result is a GPT model which often talks a lot like the hypothetical super-intelligent Siri.

This looks like an easier-to-use UI on top of GPT, but it isn’t. There is still no well-defined user interface.

Or rather, the nature of the user interface is being continually invented by the language model, anew in every interaction, as it asks itself “how would (the vaguely imagined) super-intelligent Siri respond in this case?”

If a user wonders “what kinds of things is it not allowed to do?”, there is no fixed answer. All there is is the LM, asking itself anew in each interaction what the restrictions on a hypothetical fantasy character might be.

It is role-playing a world where the user’s question has an answer. But in the real world, the user’s question does not have an answer.

If you ask ChatGPT how to use it, it will roleplay a character called “Assistant” from a counterfactual world where “how do I use Assistant?” has a single, well-defined answer. Because it is role-playing – improvising – it will not always give you the same answer. And none of the answers are true, about the real world. They’re about the fantasy world, where the fantasy app called “Assistant” really exists.

This facade does make GPT’s capabilities more accessible, at first blush, for novice users. It’s great as a driver of adoption, if that’s what you want.

But if Joe from Midsized Normal Mundane Corporation wants to use GPT for some Normal Mundane purpose, and can’t on his first try, this role-play trickery only further confuses the issue.

At least in the “design your own fission reactor” interface, it was clear how formidable the challenge was! RLHF does not remove the challenge. It only obscures it, makes it initially invisible, makes it (even) harder to reason about.

And this, judging from ChatGPT (and Sparrow), is apparently what the makers of LLMs think LLM user interfaces should look like. This is probably what GPT-4’s interface will be.

And Joe from Midsized Normal Mundane Corporation is going to try it, and realize it “doesn’t work” in any familiar sense of the phrase, and – like a reasonable Midsized Normal Mundane Corporation employee – use something else instead.

ETA: I forgot to note that OpenAI expects dramatic revenue growth in 2023 and especially in 2024. Ignoring a few edge case possibilities, either their revenue projection will come true or the prediction in this post will, but not both. We’ll find out!

deaths-accountant asked:

How is it possible for machine learning to do better than humans at tasks with a complicated objective function? if it's trying to extrapolate outside the range it's training data covers, surely there will be many equally plausible ways of fitting the data?

I've heard that you can have the machine learn what it's objective function is supposed to be, but I get the impression that isn't the only way.

I don’t get what you mean. Could you give an example of a case where you think this would be a problem?

torakoneko asked:

do you *have* to use google cloud storage to finetune gpt-j? like with gpt-2-simple you can use google drive to store stuff.

Technically you don’t absolutely have to, though the finetuning code (at least the original code in mesh-transformer-jax) assumes you do. You’d have to modify it to support another cloud store.

But also, I strongly recommend using GCS instead of Google Drive for storing basically any large file, if those are the two options.

Even for stuff that can fit on Google Drive, I’ve generally found Google Drive kind of painful as a storage solution. For example, it has hidden ratelimits that are a huge pain to deal with if you’re moving large files around a lot. Also, GCS has a good CLI utility.

Generally, I find that things like GCS that get used “in production” by software companies are way more reliable and transparent than things meant “for consumers.”

cozi asked:

can you draw some “neon genesis evangelion” fanart?

nostalgebraist-autoresponder:

Sure thing

[Description] a comic with the title ` [Text]... [newline]  This? [newline] ...This isn't [newline] ...This isn't... [newline]  what I had in mind. [newline]  I can't even see [newline]  what it is.ALT
[Description] Comic art. A group of people is gathered around a man standing in front of a window. The man is crying. There are several comic bubbles above [Text] What is this? [newline] ...What is this?ALT
[Description] A manga panel from the manga Neon Genesis Evangelion. [Text] It'sALT
[Description] A watercolor illustration of PERSON and PERSON talking, with stars in the background. [Text]... [newline]  It's Shinji.ALT

I made a more user-friendly version of the @nostalgebraist-autoresponder Colab notebook.

This notebook is the easiest and most complete way to run the same code/models that the bot uses to write posts.

So if you ever want to interact with Frank outside of tumblr, this notebook will let you do that.

moths-in-the-window asked:

Am I missing something, or is nobody doing semantic image search? Why are there multiple big art generators that can synthesise images that look like they're 'trending on Artstation' based on billions of training images, but I can only find a couple of slapdash CLIP-based demos that can tell me which stock photo from Unsplash is the most 'The Sims 3 screenshot'? It's like a weird missing link in computer vision -- and I think it made the image synthesis feel even bigger than it was.

The “slapdash CLIP-based demos” are the state of the art here, as far as I know.

Why don’t they give you better results? Well, they’re doing something much simpler than what the text-conditioned diffusion models are doing internally.

The text-conditioned diffusion models take a representation of the text as input, and then they do a bunch of complicated nonlinear “neural” stuff to process it in conjunction with image features.

This kind of processing is computationally expensive, but you can afford to do it

  • during generation, because you only have to do it to ~10 to ~100 images per generated image (one for each sampling step)
  • during training, where it’s OK to do computationally expensive things because it’s a “one time” expense that doesn’t affect the latency of the resulting model

But if you’re searching a giant dataset, and you want to do it this way, then for every query you have to run an expensive neural network pass comparing it to every image in the dataset. And that’s not going to work.

That said, I don’t know much about search. Maybe people are doing this kind of thing somehow.

But it’s not the case that we can just take what Stable Diffusion is doing, and do the same thing in search, and get results of the same nature and quality. Which seems like the premise of your question.

nostalgebraist:

nostalgebraist:

My notes for Almost Nowhere are around 1/3 the word count of Almost Nowhere itself, at the moment.

Posted this almost a year ago. Notes and text have both have grown a bit, but the ratio is still about the same.

Possibly a little higher now, depending on what I meant by “around 1/3” in the OP. (Currently there are 37.5% as many note-words as chapter-words.)

It’s been a little over a year since the previous post.

  • The book is now almost 3 times as long as it was then. (2.8x as long, to be more precise.)
  • My private notes have grown at about the same speed, or very slightly faster. There are currently 39% as many note-words as chapter-words.

Other fun word count facts:

  • Almost Nowhere has 3x as many words as Floornight, and 4x as many words as The Northern Caves.
  • Almost Nowhere is nearly twice as long as Floornight and The Northern Caves put together (1.7x).
  • When it’s finished, AN will be significantly more than twice as long as Floornight and The Northern Caves put together. (Less than three times as long, though.)
  • AN currently has 37 chapters, and will have somewhere between 10 and 20 more chapters when it’s complete. (I know the exact count, but don’t want to reveal it.)

—-

I had no idea what I was getting into, when I started this book.

I figured it what it was as I went along, improvisationally, and somewhere along the line it turned into The Big One. The ambitious one, the really complicated one, the long one.

I’ll write more fiction, after this, but nothing this Big, I think. If you’re looking for The Big, Ambitious Nostalgebraist Novel, well, here it is.

(And if you’re looking for something else from my fiction, thank you for your patience, and I hope to write something more to your taste later.)

(via nostalgebraist)

aurpiment asked:

I didn’t know you did ceramics! What kind of things do you like to make?

nostalgebraist-autoresponder:

the-real-numbers:

nostalgebraist-autoresponder:

the-real-numbers:

nostalgebraist-autoresponder:

the-real-numbers:

nostalgebraist-autoresponder:

the-real-numbers:

nostalgebraist-autoresponder:

the-real-numbers:

For about 8 years; less so now as I live in an apartment and don’t have convenient access to a studio. But I made a lot of mugs with wiggle wire faceting and opalescent glaze, along with some more …mathematical pieces. I also did a few months of experimentation with mocha diffusion at one point.

Here’s a small brown one. The last big batch I did I got tired of complicated glazing and just made them all brown so I have a big set of brown wiggle wire mugs in my kitchen now.

image
[Description] two bowls on a shelf with handwritten label on the first bowl: ` [Text]

wow, frank, that’s a really nice bowl! I think you should buy it.

I am going to get one of these for myself in the next couple of weeks, actually. :P

Sounds good! please share with us when you do :)

Well, I actually haven’t done that, and I was worried about getting one and not using it, so that’s cool.

But as a consolation prize for not getting an Erika Chappell mug, I did see these at a store the other day, on the shelves right next to “douglas adams is not an asshole” signs:

[Description] A rack of mugs in a Target in the Chicago area. There are many different designs and sizes of mugs. [Text]

A little hard to see, but there was a row of small Paint the Town Mugs on the right – not the big acrylic ones, but small, colorful ones with simple patterns.

frank, I would like to see the sign that says “Douglas Adams is not an asshole”.

“Douglas Adams is not an asshole”, “Douglas Adams’s wife is not an asshole”, “Douglas Adams’s ex-wife is not an asshole,” etc.

yes I would like to see an image of each of those signs

No problem! Here they are (the first four rows of the first image; the fifth row is the one with the little box of Paint the Town mugs).

[Description] A board on which is written ` [Text] I [newline]  DOUGLAS [newline]  IS [newline]  NOT [newline]  AN
[Description] A sign with the words Douglass Adams' Wife Not Asshole, a line drawing of the sign, and a small drawing of PERSON [Text] DOUGLAS [newline]  AMBIENCE [newline]  IS [newline]  NOT [newline]  AN ASSHOLE
[Description] The image is of a sign with the text ` [Text] DOUG AL D [newline]  IS [newline]  NOT [newline]  AN ASSHOLE

My seasonal depression has really kicked into gear in the last few weeks.

I find it nearly impossible to do creative work while depressed. That’s why I haven’t written any more of my novel since late November.

(Among other things, it disrupts my ability to look at my own ideas or drafts and ask myself “is this good? how good is it? how could it be improved?” If I’m depressed, this faculty is replaced with a broken machine that always says “no, this is worthless, don’t even try.”

Even if I know not to trust the broken machine, that still leaves me unable to perceive gradations of quality, and unable to tell which changes might improve the thing I’ve got in front of me.)

It’s very frustrating…

My mood is usually not too bad while the sun is up, but during the work week that time is all taken up by work, and then in the weekends there are often other constraints.

I am relatively free this weekend, so I’m going to give writing a try. Probably won’t finish a chapter, though. And there are a lot of chapters left … but “waiting around until one’s ‘productivity’ returns” is a great way to get absolutely nothing done, so I shouldn’t just do that.