
Open Challenges in LLM Research

163 points| muggermuch | 2 years ago |huyenchip.com

72 comments


errantspark|2 years ago

Fun fact: I took the photo she used as a cover for one of her books, she asked me if she could use it and I said I'd like to be compensated and her response was something akin to "oh I was just asking assuming you'd say yes, I'm going to do it anyway". Nobody's perfect, maybe she regrets it, and it hasn't really crossed my mind in years, but I guess it still sort of irks me to be reminded of it. Anyway if anyone needs a portrait for a book cover feel free to hit me up XD.

abatilo|2 years ago

Just another random anecdotal experience with Chip.

I was interviewing with Claypot.ai and when I met her for my first conversation, she was on a walking treadmill and very clearly was more interested in a Slack conversation she was having.

She moved me on to the next round, which I irrefutably bombed, and I was respectfully told that I wouldn't be moving on (the right decision), but I'll never forget watching her walking motion while she looked at Slack on her second monitor almost the entire time we were talking.

lionkor|2 years ago

Just an itsy bitsy little bit of copyright infringement :)

wanderlust123|2 years ago

Sounds like a pretty entitled and unpleasant person. At the bare minimum you should have had a say in whether your picture could have been used.

thrwayaistartup|2 years ago

Looking back in 25 years, the "Hallucination Problem" will sound a lot like the "Frame Problem" of the 1970s.

Looking back, it's a bit absurd to say that GOFAI would've got to AGI if only the Frame Problem could be solved. But the important point is why that sounds so absurd.

It doesn't sound absurd because we found out that the frame problem can't be solved; that's beside the point.

It also doesn't sound absurd because we found out that solving the frame problem isn't the key to GOFAI-based AGI. That's also beside the point.

It sounds absurd because the conjecture itself is... just funny. It's almost goofy, looking back, how people thought about AGI.

Hallucination is the Frame Problem of the 2023 AI Summer. Looking back from the other side of the next Winter, the whole thing will seem a bit goofy.

js8|2 years ago

My feeling is that GOFAI had a real problem with representing uncertainty, and handling contradiction. So, we tried to approach it theoretically, with fuzzy logic and probability and so on. But the theoretical research on uncertainty didn't reach any clear conclusion.

Meanwhile, the neural nets (and ML) researchers just trucked on, with more compute power, and pretty much ignored any theoretical issues with uncertainty. And surprisingly, with lots of amazing results.

But now they've hit the same wall: we don't actually understand how to do reasoning with uncertainty correctly. LLMs seem to solve this by "just mimic the reasoning that humans do". Except that, because we lack a good theory of reasoning, a model can't tell when its mimicking is good and when it's bad unless there are a lot of specific examples. So in the most egregious cases we get hallucinations, and we have no clue how to avoid them.

resonious|2 years ago

I don't know much about AI research but the idea of "measuring" hallucination definitely seems very loaded to me. Humans hallucinate too and I don't think we can measure that. It almost feels like "we need AGI in order to develop AGI".

p1esk|2 years ago

Does anyone think we would have AGI if only we could solve the hallucination problem?

emmender1|2 years ago

"...Looking back from the other side of the next Winter, the whole thing will seem a bit goofy."

for most of us, what we wish for is what we believe.

ford|2 years ago

So far it's been ~8 months since ChatGPT started the (popular) LLM craze. I've found raw GPT to be useful for a lot of things, but have yet to see my most frequently used apps integrate it in a useful way. Maybe I'm using the wrong apps...

It'll be interesting to see what improvements (in a lab or at a company) need to happen before most people use purpose-built LLMs (or behind-the-scenes LLM prompts) in the apps they use every day. The answer might be "no improvements" and we're just in the lag time before useful features can be built.

Legend2440|2 years ago

There are some unsolved practical problems like prompt injection, the difficulty of using them on your own data, etc.

But the biggest problem is that they take so much compute, which slows down both research and deployment. Only a handful of giant companies can train their own LLM, and it's a major undertaking even for them. Academic researchers and everyday tinkerers can only run inference on pretrained models.

netdur|2 years ago

I have helped build a couple of behind-the-scenes use cases: one was to classify emails and route them to the intended recipients, the other was quality monitoring of a call center.

illusionist123|2 years ago

I think it's not possible to get rid of hallucinations given the structure of LLMs. Getting rid of hallucinations requires knowing how to differentiate fact from fiction. An analogy from programming languages that people might understand is type systems. Well-typed programs are facts and ill-typed programs are fictions (relative to the given typing of the program). To eliminate hallucinations from LLMs would require something similar, i.e. a type system or grammar for what should be considered a fact. Another analogy is Prolog and logical resolution to determine consequences from a given database of facts. LLMs do not use logical resolution and they don't have a database of facts to determine whether whatever is generated is actually factual (or logically follows from some set of facts) or not. LLMs are essentially Markov chains, and I am certain it is impossible to have Markov chains without hallucinations.

So whoever is working on this problem, good luck, because you have a lot of work to do to get Markov chains to only output facts and not just correlations of the training data.
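A toy sketch of the Markov-chain point above (the corpus and code are purely illustrative, not anything from an actual LLM): a bigram chain reproduces local word correlations with no notion of truth, so it can splice together fluent but ungrounded sentences.

```python
# Minimal bigram Markov chain trained on a tiny invented corpus.
# It captures word-to-word correlations only; "truth" never enters the model.
import random
from collections import defaultdict

corpus = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "france is in europe",
]

# Count bigram transitions: word -> list of observed next words.
transitions = defaultdict(list)
for sentence in corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        transitions[a].append(b)

def generate(start, length=6, seed=0):
    """Sample a chain of words; each step depends only on the last word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        nxt = transitions.get(out[-1])
        if not nxt:
            break
        out.append(rng.choice(nxt))
    return " ".join(out)

# Because "of" is followed by both "france" and "germany" in the data,
# the chain can produce "paris is the capital of germany": a hallucination.
print(generate("paris"))
```

Every word it emits is well-attested locally, which is exactly why the output looks plausible while being unconstrained by facts.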

blackkettle|2 years ago

While I'm definitely not going to argue that LLMs are inherently 'thinking' like people do, one thing I do find pretty interesting is that all this talk about hallucinations and bias seems to often conveniently ignore the fact that people are often even more prone to these exact same problems - and as far as I know that's also unlikely to be solved.

ChatGPT is often 'confidently wrong' - I'm pretty sure I've been confidently wrong a few times too, and I've met a lot of other people in my life who've expressed that trait from time to time, intentionally or otherwise.

I think there is an inherent trade off between 'confidence', 'expression', and of course 'a-priori bias in the input'. You can learn to be circumspect when you are unsure, and you can learn to better measure your level of expertise on a subject.

But you can't escape that uncertainty entirely. On the other hand, I'm not very convinced about efforts to train LLMs on things like mathematical reasoning. These are situations where you really do have the tools to always produce an exact answer. The goal in these types of problems should focus not on holistically learning how to both identify and solve them, but exclusively on how to identify and define them, and then subsequently pass them off to exact tools suitable for computing the solution.

famouswaffles|2 years ago

LLMs already know how to distinguish fact from fiction much better than random chance, and the base non-RLHF GPT-4 model was excellently calibrated (its predicted confidence in an answer generally matches the probability of being correct). "Eliminating" hallucination is not that important; getting it to human levels is the goal. And boy do humans often "hallucinate", i.e. have a poor grasp of what they do or do not know and confidently spout nonsense.
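For what "well calibrated" means concretely, here's a hedged sketch of the standard expected-calibration-error computation; the (confidence, correct) pairs are made up for illustration, not real model outputs.

```python
# Expected calibration error (ECE): bin predictions by confidence, then compare
# each bin's average confidence to its empirical accuracy. A perfectly
# calibrated model scores ~0.

def expected_calibration_error(preds, n_bins=5):
    """preds: list of (confidence in [0,1], correct as bool) pairs."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece, total = 0.0, len(preds)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Perfectly calibrated toy data: 80%-confidence answers are right 80% of the time.
toy = [(0.8, True)] * 4 + [(0.8, False)]
print(expected_calibration_error(toy))  # ~0.0 (up to float error)
```

A model that says "90% sure" and is always wrong would score 0.9 on this metric, which is the "confidently spouting nonsense" failure mode in number form.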

StackOverlord|2 years ago

A type system is for internal consistency, though. Facts are about external consistency with real-world data. And even then, facts are always socially situated: they are always captured in a given social context, and by that I include the lenses of theoretical frameworks and axiomatics. They always carry a spin they can lose when considered from another standpoint; at the very minimum they are conditioned by attention and relevance, which has everything to do with our current representation of the world and nothing to do with the world itself.

visarga|2 years ago

Technically, transformers condition on the entire past not just the last step, but RNNs are Markov Chains. RNNs have information bottleneck issues though.
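A toy numeric sketch of that contrast (invented numbers, no learned weights): an RNN squeezes all history into one fixed-size state, while an attention-style readout can look at every past token directly.

```python
# RNN: next state depends only on (current state, input): Markov in the state,
# and the fixed-size state is the information bottleneck mentioned above.
def rnn_step(state, x):
    return 0.5 * state + x

# Transformer-style readout: direct access to the entire past at every step.
def attention_readout(history):
    return sum(history) / len(history)

tokens = [1.0, 2.0, 3.0, 4.0]

state = 0.0
for t in tokens:
    state = rnn_step(state, t)

print(state)                      # one compressed summary of the whole sequence
print(attention_readout(tokens))  # reads all four tokens directly
```

The RNN's early tokens are exponentially attenuated inside the single state, whereas the attention readout treats every position as equally reachable; that's the bottleneck difference in miniature.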

mattlutze|2 years ago

"Never before in my life had I seen so many smart people working on the same goal"

I'm not sure why, but the assumptions and naivety in this opening line bother me. There are plenty of goals and problems that orders of magnitude more people are working on today.

dustypotato|2 years ago

But the author hadn't seen them in their life

strikelaserclaw|2 years ago

Maybe she hasn't watched the Oppenheimer movie yet :D

visarga|2 years ago

Let me add a few:

- organic data exhaustion - we need to step up synthetic data and its validation

- imbalanced datasets - catalog, assess and fill in missing data

- backtracking - make LLMs better at combinatorial or search problems

- deduction - we need to augment the training set for revealing implicit knowledge, in other words to study the text before learning it

- defragmentation - information comes in small chunks, sits in separate silos, and context size is short; we need to use retrieval to bring it together for analysis

tl;dr We need quantity, diversity and depth in our training sets

josephg|2 years ago

And I’ll add some more:

- LLMs aren’t very good at large-scale narrative construction. They get so distracted by low-level details that they miss the high-level structure in long text. It feels like the same problem as Stable Diffusion giving people too many fingers.

- LLMs have 2 kinds of memory: current activations (context) and trained weights. This is like working memory and long term memory. How do we add short term memory? Like, if I read a function, I summarize it in my head and then remember the summary for as long as it’s relevant. (Maybe 20 minutes or something). How do we build a mechanism that can do this?

- How do we do gradient descent on the model architecture itself during training?

- Humans have lots more tricks to use when reading large, complex text - like re-reading relevant sections, making notes, thinking quietly, and so on. Can we introduce these thinking modalities into our systems? I bet they’d behave smarter if they could do this stuff.

- How do we combine multiple LLMs into a smarter overall system? Eg, does it make sense to build committees of “experts” (LLMs taking on different expert roles) to help in decision making? Can we get more intelligence out of chatgpt by using it in a different way in a larger system?
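The committee idea in the last bullet can be sketched in a few lines. Everything here is hypothetical: `ask_expert` stands in for a call to an LLM prompted with a given expert role, stubbed with canned answers so the voting logic is runnable.

```python
# Committee-of-experts sketch: ask several role-prompted "experts" the same
# question and take a majority vote over their answers.
from collections import Counter

def ask_expert(role, question):
    # Hypothetical stub: a real system would prompt an LLM to answer as `role`.
    canned = {
        "security reviewer": "reject",
        "performance engineer": "approve",
        "maintainer": "approve",
    }
    return canned[role]

def committee_decision(question, roles):
    votes = Counter(ask_expert(role, question) for role in roles)
    answer, _ = votes.most_common(1)[0]
    return answer

roles = ["security reviewer", "performance engineer", "maintainer"]
print(committee_decision("Should we merge this change?", roles))  # -> approve
```

Whether a vote over sampled roles actually extracts more intelligence than a single careful prompt is exactly the open question the comment raises.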

inciampati|2 years ago

One thing I'd like to see is more effort on developing citation systems for these models.

What I mean is that every part of the output of an LLM should be annotated with references to the content that is most important or relevant to it.

Who is leading this effort now?
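One hedged sketch of what such a citation layer could look like: after generation, match each output sentence against a retrieval corpus and attach the best-matching source. The corpus, crude word-overlap scoring, and doc IDs here are all invented for illustration; real systems would use embedding similarity.

```python
# Post-hoc citation sketch: score each candidate source document by word
# overlap with the generated sentence and cite the best match.

sources = {
    "doc1": "the transformer architecture was introduced in 2017",
    "doc2": "rnns process sequences one step at a time",
}

def cite(sentence):
    """Return the id of the source sharing the most words with `sentence`."""
    words = set(sentence.lower().split())
    best_doc, best_score = None, 0
    for doc_id, text in sources.items():
        score = len(words & set(text.split()))
        if score > best_score:
            best_doc, best_score = doc_id, score
    return best_doc

out = "the transformer architecture uses attention"
print(f"{out} [{cite(out)}]")
```

Note this only attributes output to similar sources; it doesn't show what the model actually relied on, which is why the attribution problem is genuinely hard (see the Anthropic thread linked below).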

astrange|2 years ago

It's not possible to do this without a completely different and less efficient architecture. You can approximate it, but it won't give you the correct answers, as there are no correct answers insofar as the model is learning to generalize rather than memorize things.

https://twitter.com/AnthropicAI/status/1688946685937090560

yeck|2 years ago

I have a hard time understanding why mechanistic interpretability has so few eyes on it. It's like trying to build a complex software system without logging or monitoring. Any other improvements you want to make to the system are going to be just trial and error with luck. The hallucination problem is one where interpretability of a model might be able to identify the failure modes that we need to address. Really, any AI problem could likely be aided by a scalable approach to interpretability that feels just as mundane as classical software observability.

nonameiguess|2 years ago

I'm going to talk out of my ass here because I am not involved enough to know the mechanics of how LLMs are really trained at any deep level, but from the surface-level understanding I have, I would expect any attempt to eliminate hallucination to be intractable given the techniques in use. As far as I understand, the initial training run is simply fed raw text and it works on the basis of predicting a next token. Then these models are fine-tuned using RLHF and potentially other techniques I don't know much about.

To truly eliminate hallucinations, I would think you'd have to change the initial training phase. Rather than only feeding raw text and predicting next tokens, you'd need to feed propositions labeled with some probability that they are actually true. Doing this with real fidelity is clearly not possible. No one has a database of all fact claims quantified by probability of truth. But you could potentially use the same heuristics used by human learners and impart some encoding of hierarchy of evidence. Give high weight to claims made by professional scientific organizations, high but somewhat lesser to conclusions of large-scale meta-analyses in relatively mechanistic fields, give very low weight to comments on Reddit.

That is all entirely possible but the manual human labor required seems antithetical to the business goals of anyone actually doing this kind of research. Without it, though, you're seemingly limited to either playing whack-a-mole with fine tuning out specific classes of error when they're caught or relying on a dubious assumption that plausibly human-generated utterances you're trying to mimic are sufficiently more likely to be true than false.

This problem arguably goes away if people treat LLMs for what they are, generators of strings that look like plausible human-generated utterances, rather than generators of fact claims likely to be true. But if we really want strong AI, we clearly need the latter. There is a reason epistemologists have long defined knowledge as justified true belief, not just incidentally lucking into being correct.
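The hierarchy-of-evidence weighting described above can be sketched as a weighted training objective. All the sources, trust scores, and per-example losses here are invented for illustration; a real setup would plug weights like these into the actual loss.

```python
# Sketch: scale each training claim's loss contribution by a trust score for
# its source, so low-trust sources barely move the objective.

trust = {
    "scientific_org": 1.0,    # professional scientific organizations
    "meta_analysis": 0.8,     # large-scale meta-analyses
    "reddit_comment": 0.05,   # low on the hierarchy of evidence
}

# (source, per-example loss) pairs from a hypothetical training batch.
batch = [
    ("scientific_org", 0.9),
    ("meta_analysis", 0.5),
    ("reddit_comment", 2.0),
]

weighted_loss = sum(trust[src] * loss for src, loss in batch) / len(batch)
print(round(weighted_loss, 3))
```

The hard part, as the comment says, isn't this arithmetic; it's producing trustworthy source labels at web scale in the first place.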

jebarker|2 years ago

When I looked into this briefly my impression was that it's extremely hard to do mechanistic interpretation beyond very simple cases like CNN classification or toy problems like arithmetic in transformers. Not to say it's not a worthy pursuit, but I think the difficulty isn't justified for many researchers since the results won't make a big splash like a new model training result.

karxxm|2 years ago

I have not seen much work on explainable AI regarding large language models. I remember many very nice visualizations and visual analysis tools trying to comprehend what the network "is seeing" or doing (e.g. in the realm of image classification).

techwizrd|2 years ago

I really like seeing articles or papers that describe the current advances and open challenges in a sub-field (such as [0]). They're underappreciated, but good practice and reading for folks wanting to get into the field. They're also worthwhile and humbling to look back at every few years: did we get the challenges right? How well did we understand the problem at the time?

0: https://arxiv.org/abs/1912.04977

crosen99|2 years ago

The biggest challenge I’m trying to track isn’t on the list: online learning. The difficulty of getting LLMs to absorb new knowledge without catastrophic forgetting is a key factor making us so reliant on techniques like retrieval-augmented generation. While RAG is very powerful, it’s only as good as the information-retrieval step and the context size, which quite often aren’t good enough.
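A minimal runnable sketch of the RAG flow being described, with toy assumptions throughout: a made-up corpus, crude lexical retrieval instead of vector search, and the LLM call stubbed out.

```python
# Retrieval-augmented generation sketch: retrieve top-k documents for a query,
# stuff them into the prompt, and hand the prompt to an LLM (stubbed here).

corpus = [
    "Our refund policy allows returns within 30 days.",
    "The API rate limit is 100 requests per minute.",
    "Support is available Monday through Friday.",
]

def retrieve(query, k=1):
    """Crude lexical retrieval; production systems use embedding search."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def answer(query):
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return prompt  # a real system would send this prompt to an LLM

print(answer("what is the api rate limit"))
```

The comment's point shows up directly in this sketch: if `retrieve` misses the right document, or the context window can't hold enough of them, the model never sees the knowledge at all.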

Buttons840|2 years ago

Our APIs up to this point have been designed for computers. JSON input, JSON output, and those are the nice ones.

I wonder if a deterministic but natural-language API would be any better for LLMs to integrate with? Or do LLMs already speak JSON well enough?
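In practice the JSON route usually means a thin validation guard between the model and the API. A hedged sketch (the expected fields and the canned reply strings are invented stand-ins for real model output):

```python
# Guard for LLM-to-API integration: accept the model's reply only if it parses
# as JSON and contains the fields the downstream API expects.
import json

REQUIRED_FIELDS = {"action", "target"}

def parse_llm_reply(reply):
    """Return the payload dict if valid, else None (caller can retry/reprompt)."""
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or not REQUIRED_FIELDS <= payload.keys():
        return None
    return payload

good = '{"action": "archive", "target": "inbox"}'
bad = "Sure! I will archive your inbox for you."

print(parse_llm_reply(good))  # structured payload, safe to pass to the API
print(parse_llm_reply(bad))   # None: chatty prose instead of JSON
```

The `bad` case is exactly the failure mode the question is about: the model "speaks JSON" most of the time, but an API can't tolerate "most of the time" without a guard like this.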

_pdp_|2 years ago

Another challenge is the gap between how we think LLMs should be used and how LLMs can actually be used. It will take some time to figure this out.

matanyal|2 years ago

Interesting, no mention of Groq for number 6.