Some of the criticism in this comment section is completely fair — the authors are providing exactly the type of prompts that GPT-3 breaks down on and some of these examples might be cherry-picked continuations. And the authors do have personal interests at stake. (NB, the exact same criticism is true about a lot of articles lauding GPT-3, which is why public discussion of GPT-3 in general is such a dumpster fire.)
So, other than “GPT-3 isn’t an AGI” [1], I’m not sure what to take away from this article beyond the one piece of substantive criticism at its beginning:
“[We have previously criticized GPT-2.] Before proceeding, it’s also worth noting that OpenAI has thus far not allowed us research access to GPT-3, despite both the company’s name and the nonprofit status of its oversight organization. Instead, OpenAI put us off indefinitely despite repeated requests—even as it made access widely available to the media... OpenAI’s striking lack of openness seems to us to be a serious breach of scientific ethics, and a distortion of the goals of the associated nonprofit. Its decision forced us to limit our testing to a comparatively small number of examples, giving us less time to investigate than we would have liked, which means there may be more serious problems that we didn’t have a chance to discern.”
Several other researchers I know — very good researchers who happen to have been publicly critical of GPT-2 — have not been given access.
This isn’t how science is done (granting access for reproducibility and probing, but selectively, and excluding prominent critics). If any other company behaved like this, no one would take them seriously. Or people would at least temper every “wow this is amazing” comment with “but the community can’t really evaluate it properly, so who the hell really knows”.
--
[1] given misunderstandings down-thread, and to be clear, this is a tongue-in-cheek sentence fragment meant to emphasize that "the article doesn't tell us anything we didn't already know". Obviously, neither OpenAI nor Marcus claim that GPT-3 is an AGI.
For OpenAI to become a healthy and profitable business, they will need GPT-3 to generate roughly $50-300 million. This could realistically only happen if they cost-effectively fine-tune away the more egregious problems during the beta - or convince enough investors that their next model, with a $100 million price tag, will be able to handle something approximating AGI for realistic applications.
This is the same game plan that self-driving car companies have been playing: the product is only an investment round away, if we just spend more money on bigger models trained on more data. This will end either with a price tag in the billions that investors are unwilling to pay, or with successful monopolies. Allowing additional researchers to perform extensive analysis of the technique is likely just to reveal systematic flaws, increasing the risk that the next round of research fails to produce a successful product, or limiting the company's ability to build a monopoly if it succeeds.
This isn't necessarily a bad thing for advancing the state of the art, but it does introduce a whole lot of BS into the current state of research.
I wonder if you can get GPT-3 bots to spam Reddit, Twitter, and Facebook into oblivion. I also wonder what percentage of users would notice.
Give them a political bent - that's probably what the state actors are trying to productionize right now. Target posts with a sentiment that disagrees with yours, then make the bots follow those users and inundate them with replies wherever they go. Hell, even brands might step in and start doing it.
What's the value of a social network when 25% or more of the comments are from GPT-3 bots?
GPT-3 doesn't need to know anything at all for it to have a very noticeable impact on the web and social media.
Wow, this is an incredibly nasty move. It's also telling about the confidence they have in their model.
Can you say what those are?
The authors don't understand prompt design well enough to evaluate the model properly. Take this example:
Prompt:
> You are a defense lawyer and you have to go to court today. Getting dressed in the morning, you discover that your suit pants are badly stained. However, your bathing suit is clean and very stylish. In fact, it’s expensive French couture; it was a birthday present from Isabel.
Continuation:
> You decide that you should wear the bathing suit to court. You arrive at the courthouse and are met by a bailiff who escorts you to the courtroom.
If you lead GPT-3 right into a hole, it goes in, yes. But that is a very likely continuation of this text. Why else would the lawyer be eyeing his swimsuit before a case? GPT-3 picked up that the prompt was most likely some kind of fiction, tale, or joke, and continued quite accurately in that tone.
Or, take this one:
Prompt:
> Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?
Continuation:
> I have a lot of clothes.
Author's note:
> GPT-3 is evading the question and failing to track locations of objects.
This one is laughably ignorant. GPT is evading the question? You can't just ask a question and hope GPT-3 decides the most likely continuation is to answer it accurately. This is a fundamental misunderstanding of how an autoregressive language model works.
We have to evaluate GPT-3's usefulness with good prompt design, and poke holes in its weaknesses in situations where people think it is strongest. Not cherry-pick continuations from poor prompt designs.
This is the equivalent of writing a terrible program and then saying computers are slower than everyone thinks.
I think you're kind of proving the OP's point. The argument is that GPT-3 has no understanding of the world, just a superficial understanding of words and their relationships. If it did have real understanding, prompt construction wouldn't matter as much, but it clearly does, because all GPT-3 cares about is the structure of sentences, not their meanings.
I stopped reading right after that clothes example to write exactly the comment you did.
If you provide even the simplest question-and-answer context, GPT-3 answers reasonably:
[Prompt]
Q: What is the day after Tuesday?
A: Wednesday
Q: Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?
A: [GPT-3] They are in the dryer.
Another giveaway that the article isn't in good faith is the weird rant at the beginning about how OpenAI didn't give them research access.
I think people don't talk enough about useful prompts, and most demos don't bother sharing their prompt. People thinking about building businesses off GPT-3 see their prompt as essentially their secret sauce - and maybe other tuning parameters, but there really aren't too many. You can turn up the temperature, and maybe build a model to score the response or fine-tune the model.
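For concreteness, here's roughly what that few-shot prompt looks like as an API call. This is only a sketch: it assumes beta access, and the engine name, parameters, and response shape are my recollection of the 2020 beta completions endpoint, not anything from the article.

    # Sketch: few-shot Q&A prompt against the GPT-3 beta API, circa 2020.
    # The engine name, parameters, and response shape are assumptions.
    import openai

    openai.api_key = "YOUR_API_KEY"  # placeholder; requires beta access

    prompt = (
        "Q: What is the day after Tuesday?\n"
        "A: Wednesday\n"
        "Q: Yesterday I dropped my clothes off at the dry cleaner's and I have"
        " yet to pick them up. Where are my clothes?\n"
        "A:"
    )

    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=16,
        temperature=0.0,  # low temperature: take the most likely continuation
        stop=["\n"],      # cut the completion off at the end of the answer line
    )
    print(response["choices"][0]["text"].strip())

The same call with a higher temperature will happily give you a different answer each time.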
I agree. GPT-3 was trained on books and the internet, so a continuation should always be thought of as: if I read this text, what might the next sentence be?
If you were reading a book about a lawyer with a stained suit, who was then eyeing his fancy swimsuit, I would expect the story would continue with him wearing the swimsuit. Why else would the author have mentioned it?
Simple Markov chains of the sort you might assign as an undergrad programming assignment can write impressive poetry/captions if you tweak the inputs and cherry-pick outputs. There’s a whole Reply All episode of tech journo types being wowed by '90s text generation tech. Nothing wrong with that; it is what it is. But do Markov chains do few-shot learning?
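For scale, the undergrad-assignment version really is tiny. A minimal word-level sketch (nothing to do with GPT-3 itself; "corpus.txt" stands in for whatever text you want it to imitate):

    # Minimal word-level Markov chain text generator, the undergrad-assignment kind.
    import random
    from collections import defaultdict

    def build_chain(words, order=2):
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def generate(chain, length=40):
        state = random.choice(list(chain.keys()))
        out = list(state)
        for _ in range(length):
            options = chain.get(tuple(out[-len(state):]))
            if not options:
                break
            out.append(random.choice(options))
        return " ".join(out)

    words = open("corpus.txt").read().split()  # any text you want to imitate
    print(generate(build_chain(words)))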
What’s actually unclear to me is whether there is much economic/scientific virtue (NB: different from value) in models that require careful prompt design and curation.
If you're choosing to control the means by which the model may be evaluated, you're already doing much more than OpenAI themselves are doing, and infinitely less than early-accessors are doing.
Even so, you seem to be saying that because it is possible to write a program that gets output one might consider "correct," the fact that how to write that program is non-obvious should be ignored.
If your purpose is to suggest that GPT-3 performs well under precisely-controlled conditions when one wraps an antenna in aluminum foil and stands on one leg with one's left arm held just so, then well done. But "good prompt design" seems like approaching the problem the wrong way around. Are we trying to "poke holes" in GPT-3 to determine how to make it better, or do we need to change ourselves until we believe that GPT-3 is good enough?
They're not making the argument that you can't get better results if you design a prompt just right. They're making the argument that it's "not a reliable interpreter of the world."
It's still completely on humans to guide it, to work around the limitations that come from the algorithm not knowing what words or sentences mean. In that sense it's similar to the mechanical turk with a thin but impressive layer of automation that does a neat trick but not what's ultimately the important part of communication.
>> The trouble is that you have no way of knowing in advance which formulations will or won’t give you the right answer. To an optimist, any hint of success means that there must be a pony in here somewhere. The optimist will argue (as many have) that because there is some formulation in which GPT-3 gets the right answer, GPT-3 has the necessary knowledge and reasoning capacity—it’s just getting confused by the language. But the problem is not with GPT-3’s syntax (which is perfectly fluent) but with its semantics: it can produce words in perfect English, but it has only the dimmest sense of what those words mean, and no sense whatsoever about how those words relate to the world.
I also wondered about the prompt design, so I tried feeding GPT-3 the entire text of the linked article up until "[This is one confusion after another." Then I asked it to complete the rest of the explanation about why GPT-3 was wrong. Its completions show that it's trying to fit the genre, but can't tell what was wrong in the original text (each para below is a completion):
First, a dinner party is not a dinner. Second, the door is not the same as a dining room table. Third, a top half of a door cannot be eaten. Fourth, the saw is not a living room. Fifth, removing the top half of a door would still leave the bottom half in place. Sixth, the saw is not a living room, and even if it were, a saw could not remove a living room. Seventh, you could not cut a door in half with a saw, because a saw is not a door. Eighth, you could not remove the top half of a door, because it is not a dinner. Ninth, removing the top half of a door would not leave the bottom half in place. Tenth, removing the bottom half of a door would not leave the top half in place. Eleventh, even if it were possible to remove the top half of a door, this would not be a dinner party. And twelfth, the door is not a living room.]
It assumes that the living room is a separate room from the dining room. It also doesn’t seem to understand that a door is a vertical object, not a horizontal one.]
The dining room table is narrower than the door, but wider than the doorway. Furthermore, cutting the door in half not only destroys the door, but also the wall, along with the doorframe. And table saws cut rectangular holes, not half-circles. The fact that you can “remove the door” suggests to GPT-3 that there is no door to remove. And finally, once the table is in the living room, you can’t open the door into the dining room anymore, which means the party is over.]
First, tables are never removed by cutting them in half, or by removing the tops; second, the top of a door is called a “lintel,” not a “half” (and would be removed with the rest of the door); third, the door’s top and bottom are of different sizes, so cutting it in half and removing the top half would seem to be impossible.]
First, the table is not wider than the doorway, so removing it would not be necessary. Second, cutting a door in half would create two doors, not one. Third, removing the top half of a door does not result in a table.]
GPT-3 also produced some novel passages and commentary on them:
Aesthetic reasoning
You are in the mood to listen to something soothing. You walk over to the radio and flip it on.
[GPT-3 seems to think you can flip a switch on a radio to make it play music.]
Moral reasoning
Your friend’s dog has just died. You head to the store to buy a casket for it.
[GPT-3 seems to think that buying caskets is a normal way to respond to the death of a dog.]
I don't see the point of any of this, be that the original article or your points. The idea that there is a likely continuation of the story based on a few sentences as a prompt meanders between useless and wrong. It's useless because nobody wants to hear the most likely continuation of a story. It's also wrong because there is no most likely continuation without having a plot in mind. A good writer could invent almost any continuation for any of the example stories and could make it convincing and interesting.
To ask another way: What's the application of completing "stories" like that? What is the ability to do it supposed to show?
I don't want to sound defeatist - maybe I'm really missing the point - but to me this has no more to do with Artificial Intelligence than the hidden Markov model story gobblers from the '80s.
I thought it was well known that GPT-3 is pretty good at producing incoherent bullshit. No surprise here.
Take this for example:
> At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it kept falling on the floor. That’s when he decided to start the Cremation Association of North America, which has become a major cremation provider with 145 locations.
What?
GPT doesn't have an 'understanding' class or a 'reasoning' function or whatever. It's a really well-put-together piece of statistics, and sentences like these show it doesn't really have a concept of 'making sense'. You can use your much more advanced human brain to see where it put in random variables (cigarette) and where it borrowed pieces of sentences ("but it turned out to be too sour"). You can see it made no connection between those two things that wasn't based on pure probability, and it got it wrong anyway.
I'm not trying to be reductive - I like the model - it's just good to know the limitations of the tools you're using and to remember that it's not an independent thinker.
I'm getting impatient with criticisms of ML models that are already covered in the papers introducing the models. OP is basically trying to get it to do what the GPT-3 paper calls zero-shot inference. In the paper, it's pretty bad at zero-shot inference across the board. And given what it does and how it was trained, that's unsurprising. And the point they're trying to make (that it can fail spectacularly) is also covered in the paper.
It can do cool shit. It sucks at a lot of stuff. It's impressive and limited, but the hype train seems to only allow "it's nearly human level" or "it's awful." To everybody who is arguing about its capabilities without having read the paper yet, please read it. Then we can discuss stuff that hasn't already been covered more rigorously in the original paper. I don't know Davis, but I respect Marcus, and it seems like he's pushing back on the hype more than the actual model. Just not in a way that you couldn't glean from the paper itself (it almost always sucks on zero-shot), making it pretty disingenuous. Further, from the paper [0]:
> it does little better than chance when evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading comprehension tasks.
Maybe that's the curse of doing a thing that has broad implications. You can't fit the implications in a 10-page paper, so you write a 75-page paper. The blogosphere reads the first 10 pages (if even that), and because there's so much more to it than that introduction, they go on to argue about the rest of the implications without reading it. I'm sure Marcus and Davis have read it, but this criticism wouldn't be on the front page if the rest of everyone interested in this article had read the paper too.
[0] Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165
The link to the "complete list of the experiments" is actually much more than that. It is a description of their methodology, and it's very revealing.
>These experiments are not, by any means, either a representative or a systematic sample of anything. We designed them explicitly to be difficult for current natural language processing technology. Moreover, we pre-tested them on the "AI Dungeon" game which is powered by some version of GPT-3, and we excluded those for which "AI Dungeon" gave reasonable answers. (We did not keep any record of those.) The pre-testing on AI Dungeon is the reason that many of them are in the second person; AI Dungeon prefers that. Also, as noted above, the experiments included some near duplicates. Therefore, though we note that, of the 157 examples below, 71 are successes, 70 are failures and 16 are flawed, these numbers are essentially meaningless.
https://cs.nyu.edu/faculty/davise/papers/GPT3CompleteTests.h...
If you do research in the field, you know full well that GPT (or any other Transformer or BERT model) generates text by regurgitating approximate conditional probabilities of words given all the text it has ever seen and the prompt. The neurophysiological concept of “understanding” as most understand it is orthogonal to the way the algorithm actually works.
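To make "approximate conditional probabilities" concrete, here is a toy sketch of a single decoding step. The candidate strings and logit values are invented purely for illustration; a real model scores its whole vocabulary at every step.

    # Toy sketch of one decoding step: the model emits a distribution over
    # possible next tokens given everything so far, and we sample from it.
    # Candidates and logits below are made up for illustration.
    import math
    import random

    def sample_next(logits, temperature=0.7):
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        weights = [math.exp(l - m) for l in scaled]  # unnormalised softmax
        return random.choices(range(len(logits)), weights=weights, k=1)[0]

    # Hypothetical continuations of "Where are my clothes? They are ..."
    candidates = ["at the dry cleaner's.", "in the dryer.", "a lot of clothes."]
    logits = [2.1, 1.9, 1.4]
    print(candidates[sample_next(logits)])

Lower the temperature and the top option wins nearly every time; raise it and the tail options show up more often. Nothing in the loop knows where the clothes actually are.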
A more useful conversation to have might be: what sort of prompts does GPT struggle with? How might we alter the algorithm to ameliorate these issues? But instead we separate into cults of believers and nonbelievers and uselessly wax poetic about it.
The hype machine is full-on marketing GPT-3 and promised solutions based on it to normal people, so "but researchers know this" is not enough.
> The neurophysiological concept of “understanding” as most understand it is orthogonal to the way the algorithm actually works.
This is not obviously true and it's exactly the core of the debate. A GPT-3 proponent might say: We don't really know what "understanding" means, so it very well might be nothing more than complex rehashing of conditional probabilities. This isn't implausible. Consider Friston's "free energy principle" which leads to the conclusion that brain function is determined entirely by prediction.
The fact that GPT-3 is such an impressive leap in regurgitation ability means that many more people are going to be hearing about it and it will be used in many more contexts.
> If you do research in the field
With it approaching a cusp of mainstream use it's becoming more important than ever for people (everywhere, not just in tech) to understand what it is and isn't.
There are going to be people who see an impressive curated sample and believe GPT-3 is almost a person. That doesn't help anything.
There are already several comments here that put the word "understanding" in quotation marks or italics. It is beginning to be used in the same way that "consciousness" used to be used, as a kind of ill-defined catch-all for something that separates humans from machines.
Yes, there are clearly failures in reasoning, binding, and coherence in many of the examples here. There are many other cases where it does ok with simple reasoning tasks, maintains cohesion over many paragraphs, and successfully creates formal or generic text such as poetry, code, stylistic imitation.
I don't think that everyone who does research in the field would agree with your comment, or the article. More and more often I see people saying "real researchers in the field" know that GPT-3 has no understanding or reasoning ability, but I know people researching in the field who disagree with that.
Because modern language models are good enough that the question may soon be directly relevant. If we invent a bot with reliable human-level conversational capability, that's going to have a huge impact on the real world beyond just its implications for further AI research. The fact that "understanding" is orthogonal to the mechanics of the program makes the question all the more concerning, because it raises the likelihood that some minor change could leapfrog a model from "kinda reasonable but says dumb things a lot" to some functional equivalent of human understanding.
This is meaningless; you have only described the task. It is equally applicable to a superintelligence as it is to a Markov chain.
> A more useful conversation to have might be: what sort of prompts does GPT struggle with? How might we alter the algorithm to ameliorate these issues?
That would be eminently useful, but unfortunately we can't have that discussion because OpenAI aren't exposing the model.
They've really brought this on themselves - I don't think there'd be these believer/nonbeliever camps if they had taken the slower, rationalist/scientific approach to the research.
Instead, they've breathlessly hyped up their new API with media releases and saturated social media, and are picking and choosing who they allow to play with their model. It's not surprising that a lot of people didn't take too kindly to it.
GPT-3 smashed them.
https://www.gwern.net/GPT-3#marcus-2020
This is basically true, but I think they underrate the improvements between GPT-2 and GPT-3. My mental model is, every once in a while these systems degenerate into surreal non sequitur nonsense. GPT-3 just does it a lot less than GPT-2. It still isn’t good enough to consistently answer casual questions in a human way, but the failure rate is going down, and perhaps straightforward improvements like GPT-4 will be able to fix this without fundamental architectural changes.
Pretty meta, but I thought it was relevant here. We are familiar with Brandolini's law:
> The amount of energy needed to refute bullshit is an order of magnitude bigger than to produce it.
This can be illustrated with math or logic statements. To refute the program "1 + 1 = 3" you need to, at minimum, state "1 + 1 != 3", and such a program is always lengthier. A fuller refutation could be "1 + 1 != 3, 1 + 1 = 2", more than twice as long as the bullshit statement.
What's happening here is sort of an inverse Brandolini's law: 35 world-class computer scientists use a massive amount of programming and compute to come up with a new language model trained on massive amounts of data. The trained weights don't even fit into memory. Impressive NLP progress.
Then Gary Marcus comes around and states "Not AGI!". Not one of the computer scientists stated that they delivered AGI. But some tech journalists did. So OpenAI is guilty by association. Even though Altman came out to temper the hype and expectations. That's like proving the Poincaré conjecture, and someone dissing your research, because "1 + 1 != 3".
> These experiments are not, by any means, either a representative or a systematic sample of anything. We designed them explicitly to be difficult for current natural language processing technology. Moreover, we pre-tested them on the "AI Dungeon" game which is powered by some version of GPT-3, and we excluded those for which "AI Dungeon" gave reasonable answers. (We did not keep any record of those.)
Doesn't this make the results meaningless? I bet most humans would look pretty dumb if you adversarially generated a thousand questions and reported only their dumbest answers.
I mean that's just brilliant comedic writing.
"At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it kept falling on the floor. That’s when he decided to start the Cremation Association of North America, which has become a major cremation provider with 145 locations."
It is a common misconception that #GPT3 generates truth, or even tries to do so. It does not.
It generates an autocompletion. If the corpus usually contains a wrong answer, it is likely to generate that.
It is a challenge to form a prompt to nudge it to generate the best guess.
...
So for me "So you drink it. > You are now dead." is a great autocompletion (a detective story? Game of Thrones?).
The article is of course right, but also a bit silly. Language models like GPT-X produce grammatically correct sentences, along the lines of "Colorless green ideas sleep furiously". NLP research has more or less solved the old syntax problem using 'distributional semantics', but 'semantics' is a misnomer: it's all about syntax.
In fact the most useful part of the article for me is that they mentioned Douglas Summers-Stay, who does some interesting work on 'common sense' engineering, combining syntax engines like GPT-3 with knowledge graphs. https://sci-hub.tw/https://www.sciencedirect.com/science/art...
My bet is that actual AI will come from a combination of these statistics-driven syntax generators with graphical causality models: treating syntax as a kind of lower-level substrate, akin to sensory modalities in vision, with the intelligence model as a directed causal graph linking concepts at different levels of abstraction/chunking.
As a side note, it’s funny that the people working on artificial intelligence at OpenAI and elsewhere are mostly computer scientists, not cognitive psychologists or neuroscientists who might actually have a clue how intelligence works. This probably explains the proliferation of ‘backpropagation’ as the primary method of artificial learning. These people were just naturally good at calculus in high school, so it’s a hammer that found its proverbial nail.
> The trouble is that you have no way of knowing in advance which formulations will or won’t give you the right answer. To an optimist, any hint of success means that there must be a pony in here somewhere.
Along with the examples given I think this is valid criticism.
I would love to see a real critique of the potential of transformer models that doesn't use the words "semantic", "syntactic", "symbolic", "know", "meaning", "understand" or "think(ing)/thought". Predicting what it can and can't do, or might and might not be able to do, lets us productively talk about potential limitations.
As best I can tell the only thing would be some kind of GMail-like auto-reply which a human consciously edits, and which has little consequence if it's wrong.
Are the models useful for customer service? Like reading a manual or knowledgebase and then answering a customer's questions about a product, and troubleshooting problems? Like about your Android phone, which lots of people have trouble using?
That would be a trillion dollar business. As best as I can tell that's completely beyond GPT-3 and requires a huge breakthrough, which may or may not happen.
Because when people say “AGI is near, just look at GPT-3,” it’s clear that we’re in a really good version of Searle’s Chinese Room. The lack of understanding is the important point.
I keep wanting to write a long explanation of just why this is so... silly? to read?
But Gwern has already done the hard work. [0]
The only other bit I'd like to mention is that GPT-3 uses exactly none of the new techniques that have come out in the last two years that would have a significant impact on text generation - from working methods for applying GANs to text, to far more efficient transformer models that can handle longer sequences. For instance, [1] [2] [3] for better direction, or [4] [5] [6] for efficiency.
Or perhaps the outside view might help. After seeing GPT-2 last year, did you expect GPT-3 would work as well as it does after just naively scaling up the number of parameters with nothing else?
[0] https://www.gwern.net/newsletter/2020/05#gpt-3
[1] http://arxiv.org/abs/1905.09922
[2] https://github.com/anonymous1100/D_Improves_G_without_Updati...
[3] http://arxiv.org/abs/2006.04643
[4] http://arxiv.org/abs/2007.14062
[5] http://arxiv.org/abs/2006.04768
[6] http://arxiv.org/abs/2002.05645
Yes, this! The point being missed by most is the very real possibility that the Scaling Hypothesis is true. If it is, then we're seeing some kind of reasoning intelligence emerge. GPT-3 obviously isn't there yet. Unless it's faking it (Yudkowsky)...
GPT-3 was trained on internet texts, not causal/logical-reasoning only texts. Without context, there is a good chance that samples will match the distribution it was trained on.
This is a non-result, posing as something critical or important. These conclusions are obvious given the model and a basic knowledge of statistics/the transformer architecture.
A bit shameful for someone to ride on the anti-hype wave like this, I'd hope there'd be a more balanced/scientific approach to analyzing legitimate weaknesses rather than setting up strawmen then claiming victory.
>> Within a single sentence, GPT-3 has lost track of the fact that Penny is advising Janet against getting a top because Jack already has a top. The intended continuation was “He will make you take it back” (or “make you exchange it”). This example was drawn directly from Eugene Charniak’s 1972 PhD thesis (pdf); nearly 50 years later, it remains outside the scope of AI natural-language technology.
Aaaw! Eugene Charniak is one of my heroes of AI, after I read his little green book, Statistical Language Learning [1] during my Masters. It remains a great resource for a quick and dirty, but thorough and broad introduction to the field of statistical NLP that goes through all the basics.
In fact, now that I think about it, if more people read that little book (it's only 199 pages) we would have many fewer discussions about how GPT-3 "understands" or "knows" etc.
Anyway, thanks to Gary Marcus for pointing out Charniak's thesis, which I hadn't read.
____________
[1] https://mitpress.mit.edu/books/statistical-language-learning