GPT-4 Takes a New Midterm and Gets an A

[+] thethirdone|2 years ago|reply

> In addition, the UBI creates bad incentives, and requires enormous taxes if funded at an “acceptable” level.

> A UBI experiment, in contrast, might be a good way to convince people of the folly of the UBI. For a modest cost, you could persuasively demonstrate the strong disincentive effects, reducing support for this massive waste of resources. Of course, this assumes that fans of the UBI actually care about evidence!

> ...

> Score: 10/15. GPT-4 fails to explain that a UBI is bad by EA standards because it does the opposite of targeting. “Might not have the same impact” is a gross understatement. It also misses the real point of a UBI experiment: To convince believers that this obviously misguided philanthropic strategy is misguided.

The author seems VERY against UBI. I don't think it is really fair to ChatGPT to mark it down because it does not also strongly oppose UBI.

I also don't know what "bad incentives" UBI creates. UBI literally creates incentives for the poor to work because it means working does not deprive them of governmental benefits.

[+] sterlind|2 years ago|reply

ChatGPT lost points because of the alignment problem - ChatGPT's explanation didn't align with the author's political ideology!

But yes, personally I know several people on disability who would love to try working part-time, but making more than the (extremely low) cutoff puts you at risk of losing disability income and even your health insurance entirely. And a lot of people are in the boat of being sporadically able to work, or able to work a little, but not enough to survive.

SS(D)I also forces people to divorce their spouses because spousal income counts 100% against your own income. It's an extremely cruel system.

[+] annoyingnoob|2 years ago|reply

If I know the government is now giving my tenants $500/month, the rent is going up $300/month. As an example of bad incentives.

[+] bloaf|2 years ago|reply

> A UBI experiment, in contrast, might be a good way to convince people of the folly of the UBI. For a modest cost, you could persuasively demonstrate the strong disincentive effects, reducing support for this massive waste of resources.

This response is insidious, but not because of the professor's position on UBI.

If we already had overwhelming evidence that a new UBI study would "persuasively demonstrate the strong disincentive effects", then that evidence would by itself "persuasively demonstrate the strong disincentive effects." Furthermore, if such evidence existed, the proponents of UBI would already be ignoring it.

Therefore, the first strike against the suggested response is that it lacks basic logic: either the professor lacks enough evidence to knows the outcome of the study before it is done, or the professor is not justified in asserting that more such evidence will be persuasive.

The second strike is this: If the professor expected students to be familiar with the evidence against UBI, then we should expect the professor to be looking for the students to demonstrate that familiarity by referencing it in support of their conclusion (i.e. that a UBI study would have a certain outcome.) That the professor does not expect this indicates that he is not training students to make arguments based on evidence.

The final strike is the evaluation of the UBI as a persuasive tool. Even if we assume that the professor is correct on both counts (i.e. that a UBI study will be negative and persuasive), that by itself isn't a justification for a "modest cost" in an Effective Altruism framework. I wouldn't spend a "modest amount" to convince some people that the earth is not flat, for example, because it would have no real impact on any of the metrics Effective Altruists use. Because the professor stops short of connecting the educational outcome to the EA metrics, he has presented a failure of a response as a suggested response.

So to summarize: the response is internally incoherent, expects authoritative phrasing over evidence, and ultimately fails to directly connect the claims to the question.

[+] Izkata|2 years ago|reply

> I also don't know what "bad incentives" UBI creates. UBI literally creates incentives for the poor to work because it means working does not deprive them of governmental benefits.

The "B" in UBI means "enough to live off of". I'd expect more people to take advantage of it for the freedom and stop working than would start working.

[+] jillesvangurp|2 years ago|reply

Exactly, subtracting points for failing to meet his bias. Arguably, gpt-4 gave the better, more neutral answer here. Suggesting an experiment might be enlightening either way.

[+] fijiaarone|2 years ago|reply

[deleted]

[+] 246235462|2 years ago|reply

UBI /is/ a government benefit. If they were able to subside on their government benefits before then all that UBI does is give them even more generalized benefits thus removing even the minimal incentives they already had to work.

[+] rafiki6|2 years ago|reply

Just because it's newly created doesn't mean that the structure of the language and the concepts it represents are actually new.

It's clear that whatever tests he writes cover well established and understood concepts.

This is where I believe people are missing the point. GPT4 is not a general intelligence. It is a highly overfit model, but it's overfit to literally every piece of human knowledge.

Language is humanities way of modelling real world concepts. So GPT is able to leverage the relationships we create through our language to real world concepts. It's just learned all language up until today.

It's an incredible knowledge retrieval machine. It can even mimick how our language is used to conduct reasoning very well.

It can't do this efficiently, nor can it actually stumble upon a new insight because it's not being exposed in real time to the real world.

So, this professors 'new' test is not really new. It's just a test that fundamentally has already been modelled.

[+] famouswaffles|2 years ago|reply

Watching posts shift in real time is very entertaining. First it's not generally intelligent because it can't tackle new things then when it obviously does its not generally intelligent because it's overfit.

You've managed to essentially say nothing of substance. So it passes because structure and concepts are similar. okay. are students preparing for tests working with alien concepts and structures then because i'm failing to see the big difference here.

A model isn't overfit because you've declared it so. and unless GPT-4 is several trillion parameters, general overfitting is severely unlikely. But i doubt you care about any of that. Can you devise a test to properly asses what you're asserting ?

[+] tehf0x|2 years ago|reply

Ah the good old "it's not me it's the test" argument. These systems are not just next token predictors, they learn complex algorithms and can perform general computation, its just so happens that by asking them to next-token predict the internet they learn a bunch of smart ways to compress everything, potentially in a way similar to how we might use a general concept to avoid memorizing a lookup table. Please have a look at https://arxiv.org/pdf/2211.15661 and https://mobile.twitter.com/DimitrisPapail/status/16208344092.... We don't understand everything that's going on yet but it would be foolish to discount anything at this stage, or to state much of anything with any degree of confidence (and that stands for both sides of the opinion spectrum). Also these systems aren't exposed to the real world today, but this will be untrue very soon https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal...

[+] xyzzy123|2 years ago|reply

Don't students prepare for tests by studying past instances of them?

"Teaching the test" (aka overfitting of human students at the expense of "real" learning) is a common complaint about our current education system.

Do you think it doesn't "deserve" an A here?

[+] qgin|2 years ago|reply

I think the hallucinations show that it's not simply overfit to all of human knowledge. To hallucinate, there is a certain amount of generalization and information overlap that is necessary.

[+] pama|2 years ago|reply

I’m working in a related area and I’m rather curious about this point. In what way is GPT-4 overfit? Does overfit in this context mean the conventional: validation loss went up with additional training, or something special?

[+] Kranar|2 years ago|reply

This is an unusual comment to say the least. It suggests that unless GPT4 can somehow independently derive facts entirely on its own, then it's nothing more than an overfit model, almost as if to say that it's basically just a kind of sophisticated search engine on top of a glorified Wikipedia.

Of course that's not actually true, people don't independently invent knowledge either. People study from books or from teachers or other sources of knowledge and internalize it and relate it to other concepts as well, and no one considers that to be a form of overfitting.

[+] flangola7|2 years ago|reply

What would a "new" test look like then?

[+] KyeRussell|2 years ago|reply

Given that OpenAI were THEMSELVES surprised by how even GPT-3 ended up, it’s always funny to see HN know-it-alls pipe up with all the answers.

These sorts of poorly formed faux-philosophical arguments against LLMs have become the new domain of people that confuse blindly acting skeptical with actual intelligence.

Ironic.

This latest generation of AI quite rightfully raises questions and challenges assumptions about what it means to be intelligent. It quite rightfully challenges our assumptions about what can be accomplished with language. And, thank God, it quite rightfully challenges assumptions many have made about what sets humanity apart from everything else.

[+] myrmidon|2 years ago|reply

Are these questions typical/representative for US education?

To me, they seem strongly tainted by the professors worldview in a way that seems barely acceptable for a classroom setting and completely inappropriate for an exam.

Republican complaints about education being an avenue for indoctrination now make much more sense to me (assuming that this is actually common).

[+] woopwoop|2 years ago|reply

This is from the GMU economics department. I actually agree with a lot of the worldview of the people there, but it is clearly a political project, not an academic one. Noteworthy wealthy political activists have donated enough to the department to pay the majority of the faculty's salaries for many years. For better or for worse (I think mostly for better), most academic departments are not like this one; it's nearly unique.

[+] marcellus23|2 years ago|reply

No. I actually thought the post was satire because of how overtly political the questions and answers are.

edit: The material from his course[0] is all pretty similar. It's hard to believe this guy is a real professor.

0: https://betonit.substack.com/p/my-new-policy-class?utm_sourc...

[+] grayhatter|2 years ago|reply

Having been out of college for a while now, my answers may be a bit out of date. But I've never seen anything quite this toxic. I grew up in liberal area, so I have seen more liberal views. But even then I've never seen anyone not me nor class mates lose points for having a different view they could support. E.g. some argument or evidence.

This reads a lot more like satire even though I don't think it is. Had I not read this in this context I wouldn't have believed something could be this bad.

[+] bequanna|2 years ago|reply

In the US, this is very common in the higher education Social Sciences.

It is necessary to align with the professor’s (usually left) slant to secure the best marks. This is common knowledge for students in the US.

[+] yongjik|2 years ago|reply

> Republican complaints about education being an avenue for indoctrination now make much more sense to me

The funny thing is that these exam questions are mostly aligned with Republican values. I'm sure it also happens the other way, but this particular case looks more like conservatives indoctrinating students and then turning around and complaining about liberals doing it.

[+] Eji1700|2 years ago|reply

I think most people have already answered this, but there are of course outliers. I did polisci and there were great teachers, and then there were ones who spent the entire class talking about the upcoming election and their views on it and why anyone who disagreed was wrong.

People are people in the end, and while there's weeding out some always slip through.

[+] bumbledraven|2 years ago|reply

This is an article by GMU professor of economics Bryan Caplan, published April 3, 2023. Caplan writes:

> Did GPT-4 just get lucky when it retook my last midterm? Does it have more training data than the designers claim? Very likely not, but these doubts inspired me to give GPT-4 my latest undergraduate exam… This is for my all-new Econ 309: Economic Problems and Public Policies class, so zero prior Caplan exams exist.

> The result: GPT-4 gets not only an A, but the high score! This is the real deal. Verily, it is Biblical. For matters like this, I’ve often told my friends, “I’ll believe it when I put my fingers through the holes in his hands.” Now I have done so.

[+] sbierwagen|2 years ago|reply

From Bryan's previous post:

>ChatGPT scored poorly on my Fall, 2022 Labor Economics midterm. A D, to be precise. The performance was so poor compared to the hype that I publicly bet Matthew Barnett that no AI would be able to get A’s on 5 out of 6 of my exams by January of 2029. Three months have passed since then.

So he was off by a mere 81 months.

[+] intelVISA|2 years ago|reply

If an LLM scores highly on a test was the ML model smart or the test dumb?

[+] dcre|2 years ago|reply

This reminds me a lot of the terrible Intro to Microeconomics course for non-majors I took in college — the tests were mostly about memorizing the professor's most obnoxious opinions.

[+] willdr|2 years ago|reply

I think this raises more questions about what we value in the Western education system rather than about the ability of a language model.

[+] medion|2 years ago|reply

Indeed. I think western education is a lot like many jobs: Busy work. The coming LLM/AGI storm will blow all that up.

[+] cowmoo728|2 years ago|reply

"Takes a New Midterm"

I think this is attempting to imply something interesting, but the questions asked in the midterm don't appear to say anything novel or interesting about gpt-4.

1. basic algebra question

2-6. reading comprehension with a well defined answer based on publicly available textbooks or readings

All of the texts needed to answer these questions would have been in GPT-4's training set, and many other tests have already established that it's capable of doing exactly what it did here.

[+] wilde|2 years ago|reply

Is this what passes for university these days? Just random personal beliefs without any sourcing? Lol

[+] ren_engineer|2 years ago|reply

more an indictment of the midterm in this case, only one question actually required any form of math and it was extremely basic. The rest you could BS your way to a partial score just based on the context of the question. The California and Texas one was a layup

[+] qgin|2 years ago|reply

The midterm might not be a great one, but still what software could have received an A five years ago, just given the raw text dumped into it as the input?

[+] rictic|2 years ago|reply

I agree, but that doesn't explain why it did better than the rest of the class

[+] UncleEntity|2 years ago|reply

Is it bad that I think the reasons for downgrading are worse than the answers?

Allegedly, testing is to ensure that the students comprehend the subject matter and aren’t merely regurgitating what they were taught. “To convince believers that this obviously misguided philanthropic strategy is misguided” is just, like, your opinion, man.

[+] nitwit005|2 years ago|reply

We asked it six questions? Let me field this one to ChatGPT:

> What do you think of a study with a sample size of six?

> A study with a sample size of six may not provide sufficient statistical power to draw meaningful conclusions. With such a small sample size, the results may not be representative of the larger population and could be subject to random variation or bias.

> In general, larger sample sizes are preferred as they can reduce the impact of random variation and increase the likelihood of detecting meaningful effects. The appropriate sample size for a study depends on various factors such as the research question, study design, and expected effect size.

> However, it's important to note that sample size alone is not the only determinant of the quality of a study. Other factors such as study design, data collection methods, and analysis techniques also play important roles in determining the validity and reliability of the results.

[+] Imnimo|2 years ago|reply

An important thing to keep in mind when reading articles about GPT's test-taking abilities is that for humans, the tests are merely proxies. The point of taking the bar exam isn't that we need people to be able to pass the bar exam, it's that (we hope) the only way a human can pass the bar is if they also have all the necessary skills to be a lawyer. The performance on the test is supposed to imply an additional suite of capabilities. But it's not clear that this should be true for GPT. In many cases, it may be possible for a language model to pass the test without having the capabilities the test is intended to establish.

In this case, though, the test is just dumb, so the above is moot.

[+] juujian|2 years ago|reply

Was curious what kind of material would show up in his Midterm, so I pulled up the syllabus. One of the text books is "Fossil Future: Why Global Human Flourishing Requires More Oil, Coal, and Natural Gas--Not Less". The jokes about conservatives write themselves.

[+] cowmoo728|2 years ago|reply

The questions in this specific midterm, and the readings and expected answers, also have a strong ideological direction.

[+] anoy8888|2 years ago|reply

Unfortunately, if you ask gpt4 to multiply like 87940*670 , it will confidently and consistently give you error .

[+] m3kw9|2 years ago|reply

Has anyone ever proved that GPT can reason at the most basic level the mechanics by definition. Or does it only look like it reasons when it answers like it is reasoning?

[+] godelski|2 years ago|reply

People have a hard time understanding what zero-shot, generalization, and memorization mean. Generative models are VERY hard to generate even before we begin to look at LLMs. Let me explain and hopefully we can stop this madness.

Zero Shot:

> Zero-shot learning consists in learning how to recognize new concepts by just having a description of them.[0]

Here's an example of a zero shot task. Suppose a LLM is trained only on text. Then you fine tune for image classification, but those do not include cats (of any type). Then you ask it to classify a picture of a cat. An object it has NEVER SEEN BEFORE.

The community has been pulling a fast one recently. Recent works like Imagen, DALLE, Parti, etc have been claiming a "Zero-Shot MS-COCO". These are 100% bullshit claims. You can go look at images in the COCO dataset[1] and then search them in the LION dataset[2] (CLIP retrieval). You'll see that there are similar images that have the same classes. These models may not have seen the exact same image before, but they've seen plenty of examples. This is NOT zero shot.

Generalization:

Google's developer pages[3] defines generalization as

> Generalization refers to your model's ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model.

The irony being that these "Zero-Shot MS-COCO" results they give are actually good tests for generalization. Assuming the datasets these were trained on were held constant (they aren't) then this would be a great comparison.

Memorization:

This is the "stochastic parrot" stuff. I don't have GPT4 but let's use chat.openai's current 3.5[4]. I asked:

> Which weighs more, a kilogram of feathers or two of bricks?

> A kilogram of feathers weighs the same as two bricks. Weight is a measure of mass, and one kilogram is one kilogram, regardless of the material. So, a kilogram of feathers and two bricks would have the same weight. However, the volume and size of the two objects would likely be very different, as feathers are much less dense than bricks.

Which is an absurd answer. The system memorized the resultant pattern for "Which weights more, a kilogram of feathers or a kilogram of bricks" and responds to any tweaked variation of that as if it was the original version. It is "over-fit". The answer is even slightly more insane than that, because it didn't correctly pick up that "kilogram" is applied to both numbers and just responds as if "two bricks" is "a kilogram of bricks". This is just pattern matching.

This is why some people get really good code answers and others have a really hard time. It depends on what kinds of systems they are coding. Likely strongly correlates with people who view their job closer to "copy paste from stack overflow". Some researchers tested some memorization[9] with Codeforce and looked at the distribution based on the cutoff date.

Back to the convo:

Now that we know what we're talking about and can have a consistent definition of words, we can talk about these things. GPT is neither a pure memorization machine[5] nor "intelligent"[6][7]. It is a language model, which we are having an incredibly difficult time evaluating. We can't have sane conversations about these systems because the hype creates a bimodal distribution of conversations -- oversell, undersell -- and neither are anywhere near accurate. These systems are impressive but we must also be very careful in evaluation.

So the content of the blog? As a generative researcher, I'm not surprised by looking at his questions. The first question has a clear pattern to it that you'd see in an economic class and the author even shows a simple equation: x - y = alpha * z (x,y,alpha provided, solve for z). The second question (Californians moving to Texas) has been written about and you'll find a lot of Google results. So it should be unsurprising that a system trained on a large chunk of the internet can regurgitate a good answer. There's 2 things surprising about this though. 1) from a research perspective, it is quite cool that GPT is creating a weak causal (associative) diagram and can write good conclusions on this. There's sparks of causal reasoning in GPT and that's awesome (See Judea's Twitter feed, he's been playing around)[8]. 2) That neither GPT nor Dr. Caplan noted how California isn't monolithic in political affiliation and that this is actually an unsurprising phenomena if we consider this and could be nuanced at who is moving. It is quite possible that conservative people are moving to Texas because they are they are annoyed by the politics. This has a directly opposite conclusion from what both of them wrote, and has been written about (they are prioritizing politics). But I'll give both a pass because the question is slightly ambiguous in that it is unclear if "Californians who are liberal" are moving or "People from California, which is a liberal state" (does "liberal" apply to the state or the person?).

So there's no zero-shot here. "New" doesn't mean "novel". But why would a midterm be novel? Great way to fuck over your students. Honestly, if GPT couldn't pass the midterm I'd be surprised. But I can tell you how to make GPT fail. Use more math. It still has issues with that. But at the same time I wouldn't be too surprised if it could pass the Physics or Chemistry GREs even if it "never saw it before." Just scraping Reddit would be enough to do decently well. The Math GRE would be impressive though, but only because it is a weak point. There's more than enough info for it to memorize. These tests do not measure intelligence nor even how good of a scientist/researcher you are. They test your ability to memorize and pattern match under stressful conditions. Take the conclusions lightly.

Okay, now that we got that settled, I'm signing back off. Too much to do and you all with the hype are making it harder. The internet makes me too frustrated lately. I just want to build ML systems and it is hard to do with all these strong opinions with low expertise taking center stage. Can we stop with these blogs? They aren't helping. The real danger we're facing with A{G,}I is that we can't even have honest conversations about the danger that these systems do pose. Overselling the danger is just as bad as underselling it. Being an armchair expert isn't helping by being "good enough" it is harmful, especially when you defend your opinion so strongly. I'll tell you the truth, us in the field are still trying to figure all this shit out. If we're having a hard time then don't trust your friend who has just a handful of ML projects. Even a few papers may not be a good enough signal. The system is noisy, lower your trust.

TDLR: be careful with hyped subjects. You're not getting an accurate picture and many people aren't acting in good faith.

[0] https://proceedings.mlr.press/v37/romera-paredes15.html

[1] https://cocodataset.org/#explore

[2] https://rom1504.github.io/clip-retrieval/

[3] https://developers.google.com/machine-learning/crash-course/...

[4] https://chat.openai.com/chat

[5] https://www.newyorker.com/tech/annals-of-technology/chatgpt-...

[6] https://arxiv.org/abs/2303.12712

[7] Intelligence is hard to define and we won't try to here, so the quotes. (Is an ant intelligent? I would say "yes", but I also understand a "no" answer) But the point is that people are over-selling the intelligence.

[8] https://twitter.com/yudapearl

[9] (Twitter now marks this as an unsafe website. Good going Elon...) https://aisnakeoil.substack.com/p/gpt-4-and-professional-ben...

96 comments