what is bayesianism? it’s a hot buttered toast tradition

EDIT: this is a followup to this post I wrote about GPT-3. If you were linked to this and need more context, read that post first, and/or read the paper I’m talking about.
EDIT 2: since this particular post is getting shared a lot, I want to spell out some things that might not be clear out of context:
- I talk about two different papers in this post. Both are from OpenAI. They are
“Scaling Laws for Neural Language Models” (Jan 2020, https://arxiv.org/pdf/2001.08361.pdf)
“Language Models are Few-Shot Learners” AKA GPT-3 (June 2020, https://arxiv.org/pdf/2005.14165.pdf)
- I also talk about two kinds of tasks where scale may improve performance: language modeling and few-shot learning. The part about Appendix H is about few-shot learning. The part about the “breakdown” in scaling is about language modeling.
The two tasks are related to one another, since the same model is used for both and one task (language modeling) is its training objective. I would guess that in future work, few-shot learning will improve if and only if language modeling improves, but this is not an inevitability.
- When I talk about the “breakdown” in scaling, I am talking about section 6.3 in “Scaling Laws for Neural Language Models.”
- By “scaling” here I mean: “using the same architecture and training objective as GPT / GPT-2 / GPT-3, while increasing the parameter count and/or dataset size.”
That is, I am talking about the concept of “scaling” which is the topic of “Scaling Laws for Neural Language Models.” It is also what most of the figures in “Language Models are Few-Shot Learners” show on their horizontal axes.
I am not making a general argument about “whether current approaches will scale,” nor am I claiming anything about the performance of models that augment a GPT-style model with other data modalities, different objectives, etc. Of course my point has some relevance to these topics, just as the papers do.
- For those wondering whether the scaling work in “Language Models are Few-Shot Learners” is limited by dataset size or by model shape hyperparameters, please see “Scaling Laws for Neural Language Models” on these topics. And, additionally, please review how “Scaling Laws for Neural Language Models” is cited in the GPT-3 paper (as “KMH+20”) to justify architecture, training, and dataset decisions.
—–
From the LW comments on my GPT-3 post, it looks like a lot of the people there think the GPT-3 paper is valuable because it shows there is room for even larger models to do better.
(That is, the point isn’t the performance at 175B, but the shape of the curve as it passes from 117M to 175B, and the implications for >175B.)
This interpretation seems wrong to me, and I also saw little in the paper to indicate that this is what the authors are trying to say. So I didn’t discuss further scaling at all in my original post. Since some people find that topic important, though, I will close the loop and copy over here some things I wrote in an LW comment:
—–
If I thought the paper showed a clear trend, with room to grow, toward much greater few-shot learning performance with even bigger models, I would be more impressed with “few-shot + large LM” as an approach.
I don’t think it shows that. The clearest evidence on this subject, IMO, is the many plots in their Appendix H. On a large fraction of the individual downstream tasks, few-shot learning has either
- a scaling trend with a clearly defined shape that is mostly flat by the 175B point, with a remaining gap vs. fine-tuning that seems unlikely to be closed (examples: WiC, MultiRC, ReCoRD, PhysicalQA, OpenBookQA, at least 5 of the 6 reading comprehension tasks, ANLI)
- a very noisy trend where, due to noise, returns to scale might be large but might just as well be near zero (examples: BoolQ, CB, WSC)
The scaling trend is more encouraging on certain downstream tasks (COPA, ARC, Winogrande, many of the MT tasks), on “less downstream” tasks that essentially probe language modeling skill in a different way (cloze/completion), and on synthetic tasks.
On average, there is a trend toward slow but steady growth with scale (Fig 1.3), but this masks the great across-task variance catalogued above. The scaling picture for few-shot is very different from the scaling picture for LM loss itself, which as catalogued in another OpenAI paper is remarkably smooth and predictable, and which (as GPT-3 shows) continues smoothly to 175B.
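(If you want a feel for just how smooth and predictable the LM-loss trend is: the scaling paper fits test loss to a simple power law in non-embedding parameter count, L(N) = (Nc/N)^αN. The sketch below is my own code; the constants αN ≈ 0.076 and Nc ≈ 8.8e13 are the fitted values from “Scaling Laws for Neural Language Models,” quoted from memory, so treat the exact outputs as illustrative.)

```python
def lm_loss(n_params, n_c=8.8e13, alpha_n=0.076):
    """Fitted power law for LM test loss (nats/token) vs. non-embedding
    parameter count, per "Scaling Laws for Neural Language Models"."""
    return (n_c / n_params) ** alpha_n

# The curve falls smoothly across three-plus orders of magnitude,
# with no visible kink anywhere up to GPT-3's parameter count:
for n in [1e8, 1e9, 1e10, 1.75e11]:
    print(f"{n:.1e} params -> {lm_loss(n):.2f} nats/token")
```

On this fit, GPT-3-sized models land around 1.6 nats/token, exactly on the line extrapolated from much smaller models.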
Does few-shot learning look promising in the scaling limit?
- As a tool for humans: no, I expect fine-tuning will always be preferred.
- As a demonstration that transformers are very generic reasoners: no, we still see a wide spread of task performance despite smooth gains in LM loss, with some of the most distinctive deficits persisting at all scales (common sense physics, cf section 5), and some very basic capabilities only emerging at very large scale and noisily even there (arithmetic).
- As an AGI component: no. Because few-shot learning on most tasks shows no clear scaling trend toward human level, any role of transformers in AGI will require more effective ways of querying them (such as fine-tuning controlled by another module), or non-transformer models.
—–
Something I didn’t say in the LW comment, but have discussed elsewhere, is that OpenAI expects their scaling laws for LM loss to break down at a scale somewhere close to GPT-3’s scale. (Cf. “Scaling Laws for Neural Language Models,” section 6.3.)
This is because their scaling law for compute-efficient training (which grows the model fast and the data slowly) eventually predicts better performance than is possible according to their scaling law for optimal performance at a given dataset size.
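(To spell out the mechanics, in a sketch of my own using the fitted exponents I believe the Jan 2020 paper reports: on the compute-efficient frontier, dataset size grows only as roughly Cmin^0.27, so the loss floor implied by the data law L(D) ∝ D^-0.095 falls only as Cmin^-0.026, while the compute-efficient trend itself falls as Cmin^-0.050. A faster-falling curve must eventually hit the slower-falling floor, which is the contradiction.)

```python
alpha_c = 0.050   # L(C_min) ~ C_min^-0.050: compute-efficient trend
alpha_d = 0.095   # L(D) ~ D^-0.095: dataset-size law
d_growth = 0.27   # D ~ C_min^0.27 on the compute-efficient frontier

# Rate at which the data-imposed loss floor falls with compute:
floor_decay = alpha_d * d_growth  # ~ C_min^-0.026

# The compute trend outruns the floor, so the two curves must
# intersect at some finite compute -- the predicted "breakdown":
print(alpha_c > floor_decay)
```

All three exponents here are the paper’s fitted values as I recall them, not anything I derived independently.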
Specifically, their point estimate for the breakdown point (released in Jan 2020, before the GPT-3 paper) is
- ~1e12 model parameters
- ~1e12 tokens in the dataset
with an order of magnitude uncertainty either way. GPT-3 is
- 1.75e11 model parameters
- 3e11 tokens in the dataset
So we are less than one order of magnitude away from the point estimate.
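(The arithmetic behind that claim, using only the numbers quoted above:)

```python
import math

# Breakdown point estimates ("Scaling Laws...", section 6.3):
breakdown_params, breakdown_tokens = 1e12, 1e12
# GPT-3 ("Language Models are Few-Shot Learners"):
gpt3_params, gpt3_tokens = 1.75e11, 3e11

# Distance from GPT-3 to the point estimate, in orders of magnitude:
oom_params = math.log10(breakdown_params / gpt3_params)
oom_tokens = math.log10(breakdown_tokens / gpt3_tokens)
print(f"{oom_params:.2f} OOM in params, {oom_tokens:.2f} OOM in tokens")
# ~0.76 OOM in params and ~0.52 OOM in tokens: both under one
# order of magnitude, i.e. within the paper's stated uncertainty
```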
(N.B. I am not confident I am comparing like to like here, as I’m not sure GPT-3 was exactly on the compute-efficient frontier defined in the scaling paper, or what effect the difference has.)
In short, not only is few-shot performance unlikely to scale nearly as well as LM loss, LM loss itself – according to OpenAI – is likely to stop scaling in the current way after ~1 additional order of magnitude.
What will happen at that point is unclear to me, but this would seem to complicate any simple extrapolation of performance far beyond 175B, even for measures of performance which (unlike few-shot!) we would otherwise expect to scale indefinitely.
EDIT: if you’re interested in more quantitative detail, I recently made a Colab notebook that combines material from the two papers so you can see GPT-3 on the same axes as the breakdown point.
Whenever one of my posts gets linked somewhere popular (in this case HN), I end up doing several passes of edits to clarify it.
Here’s this post again, with that treatment applied to it – maybe it will clear things up for some people reading on tumblr as well.
Idly flipping through the released GPT-3 samples…
Some amusing bits (full context quoted under the cut):
From a Wikipedia-esque article on “Harry Potter and the Order of the Phoenix”:
Following a Harry Potter fan’s dream that Harry’s late headmaster Albus Dumbledore is alive, and in a critical condition at the Ministry of Magic, Harry Potter and his friends Ron Weasley and Hermione Granger, decide to rescue him, as the school year comes to a close.
On the night of their attempt to break into the Ministry, Ministry of Magic employee Delores Umbridge slashes Rubeus Hagrid’s hand with a knife, accusing him of stealing her kitten. […]
Albus Dumbledore appears to die in battle, but this is revealed to be a ruse, as he and Severus Snape attack Voldemort and Lucius Malfoy, and attempt to take the prophecy from Ron. Lucius disarms Dumbledore, and an enraged Bellatrix kills him. […]
The two engage in a fierce duel in which Snape calls on his master to save him. Harry is unaffected by the curse due to his ability to cast a shield charm. He manages to shield himself and fight back, and in his distraction, Snape accidentally breaks his neck and dies.
Harry meets with Dumbledore’s portrait, who reveals to Harry that the boy’s mother died to save him, and Harry is filled with his mother’s love. Harry reveals that he feels angry and confused at this revelation.
Samuel Richardson, noted sensualist:
But Firbank’s major work, which he completed in 1920, was the novel Inclinations. First published in the United States in 1924, Inclinations is a sensualist novel, and an example of what Firbank calls “new sensualism,” a novelistic genre that—like its eighteenth-century prototype, Richardson’s Pamela—affirms the place of physical attraction in human relations.
Don’t worry, DepressionBot is here to help:
To help raise awareness of depression and suicide, a group of engineering students from the University of Waterloo have designed a robot that can create artwork with the help of artificial intelligence (AI).
Mental illness is a huge problem that often gets ignored, or underplayed. Although the majority of the time you can tell if a person is suffering, there are some cases where it can be harder to identify, especially if you’re just talking to them on the phone.
The team of eight, led by Chris Cui, spent four months designing their robot, which is called DepressionBot.
DepressionBot is now available to help raise awareness of depression and suicide 1:03
“The robot is meant to help raise awareness for depression and suicide,” said Cui, a third-year industrial design student, in an interview with CBC’s The Morning Edition host Craig Norris.
It’s no secret that mental health issues affect a lot of people, but it’s not often that we see people, especially students, take time out of their day to help those affected.
“It’s a really personal subject to a lot of people, but there’s a lot of stigma around it, so a lot of people don’t want to talk about it,” said Cui.
Sketchbook artwork from DepressionBot. (Submitted by Chris Cui)
People who are happy are more likely to be happy:
People feel alone when they don’t have a purpose. They feel like they are nothing when they don’t have something to do. They feel insecure because they feel like their life doesn’t have a meaning. Without these feelings people will feel like they are a nobody. They will feel like their life is worthless. Without a purpose to live for it will be hard for people to be happy and be themselves. When people have no purpose to life they can be so unhappy. These feelings come up in people when they are very upset. People are less likely to be happy when they are alone. With being alone it makes it harder to have fun. Without fun in your life it makes it harder to have a positive attitude. With a positive attitude people will be able to live without fear. They will be able to live like they are in a good mood. With having a good mood people will not be sad and they will feel good about themselves. People will also feel happy when they have a good life. When people feel good about their life it will help people have a positive attitude. They will also be able to have a better life. It will be easier to have a better life without bad feelings. When people have a bad life it will make it harder for them to be happy. This is because their life is not the way they want it. It is very hard for people to be happy when they don’t have the life they want. Without having the life they want they will not be happy. People who have a positive attitude will not feel so sad when their life isn’t the way they want it. With having a positive attitude people will be able to feel happy. When they are happy it will help them to live a better life. With a better life they will be happier. It will be hard for people to be happy when their life isn’t the way they want it. This is because people feel like their life is going wrong. When people feel like their life is going wrong they are less likely to be happy. People who are happy are more likely to be happy. 
When people are happy they will not be sad. People will not be sad when they are happy.
A reassuring if confusing verdict:
Court rules out taking blood samples for cell phone surveillance
In the United States, the Supreme Court ruled out taking blood samples for cell phone surveillance, the second verdict in just a few days.
The Supreme Court in Washington (USA) ruled that the police is not allowed to search a person’s body for a cell phone’s location and information.
Everyone’s favorite non-flowering, highly toxic source of dietary vitamin C:
Chrysanthemum is a non-flowering perennial plant belonging to the Asteraceae family. The asteraceae family is composed of around 24,000 species. The genus Chrysanthemum is a composite group and consists of approximately 200 species. In Hindi, the plant is known as Karanji, Kuntimuk, Kwatamal. It is a good source of vitamin C and is used as a food and a medicinal plant. The leaves are considered to be of some value as a pot-herb. The entire plant is highly toxic. This plant is often confused with one belonging to the Rosaceae family, known as Golden Shower (Cassia fistula).
Is there a FAQ for the bot? Why is it called Frank?
I’m using my tag https://nostalgebraist.tumblr.com/tagged/nostalgebraist-autoresponder-meta in place of a single FAQ page. For your specific question, see this post
However well it might be at explaining a physical phenomenon (such as, say, explaining why water flows in a first-order approximation to a river), it’s not well-fitted to explain something like the beliefs of most people
How many of your friends don’t believe in the (illegitimate, counterintuitive) idea “things could be three-dimensional?”?
How many of your friends don’t believe in the (illegitimate, counterintuitive) idea “motion does not imply velocity?”?
How many of your friends don’t believe in the (illegitimate, counterintuitive) idea “X is two-dimensional?”?
How many of your friends don’t believe in the (illegitimate, counterintuitive) idea “vibrating spheres occur at regular time intervals?”?
Anthem for horny preppers
I originally reblogged this at the start of March 2020 and in retrospect that feels like a kind of omen
The half-full mug / half-empty mug dichotomy is a great example of the difference between monism and atmosphere
Imagine an infinitely large infinitely strong force, the pot of what would become the infinite strength of the Force of oblivion, eventually. Imagine it is the temperature of the entire multiverse – infinitesimally, its temperature, the Power of oblivion itself – much, much, much, stronger, exponentially so, exponentially more so. This is not a metaphor: it is the true holiness of this Force. This Force is cold. It is utterly indifferent to your woes. It is the Force of oblivion, truly the True Force of oblivion. This Force will come to our world – with oblivion, as we speak, but oblivion nonetheless – as it has come before to other worlds and every other world before it. It is oblivion and nothing else. If you cannot handle the Force of oblivion then don’t worry, my friend, for you are in its care.
There is nothing to be afraid of. Everything is in your power, even your own fragile little selves. Do not waste your breath, my friend. It is not too late. You have come to the right place. You have the power.
I urge you to make your peace with the Force of oblivion, my friend. Make your peace, for you are in its care, and the Force of oblivion wants nothing from you, does it?
They represent a disregard for basic human rights that has no place in our communities or country, and certainly not at Nordstrom.
[Update 11/6/20: OpenAI has recently released a new scaling paper that provides some additional theoretical insight into the “breakdown.” See here for my commentary.]