Large Concept Models: Language modeling in a sentence representation space

nutanc|1 year ago

This maps a little onto research we are doing on what we call the shape of stories [1].

We can clearly see in 2D space itself how different "concepts" are explored.

Using the shape of stories for semantic chunking we can clearly see in multiple articles how we can chunk by "concepts". [2]

Now we are trying to see if we can just use these chunks and train a next "chunk" predictor instead of a next word predictor.

In the paper, they take a sentence to mean a concept. We believe that a "semantic chunk" is better suited for a concept instead of a sentence.
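
To make "chunk by concept" concrete, here is a minimal sketch of one way such chunking could work: start a new chunk whenever consecutive sentence embeddings stop being similar. This is not necessarily the method from the linked posts. The function names, the 0.5 threshold and the toy 2D vectors are all assumptions for illustration; real embeddings would come from a sentence encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Group consecutive sentences, opening a new chunk when the
    similarity to the previous sentence drops below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])  # topic shift: start a new chunk
        chunks[-1].append(sentences[i])
    return chunks

# Hypothetical 2D "embeddings": the first two sentences point one way,
# the last two point another way.
sentences = [
    "The duckling hatched.",
    "It looked different.",
    "Stocks fell today.",
    "Markets closed lower.",
]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
chunks = semantic_chunks(sentences, embeddings)
```

With these toy vectors the split falls between the second and third sentence, so a next "chunk" predictor would see two units here instead of four sentences.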

[1] https://gpt3experiments.substack.com/p/the-shape-of-stories-...

[2] https://gpt3experiments.substack.com/p/a-new-chunking-approa...

Lerc|1 year ago

Can you spot conceptually similar stories by their shape?

For instance, what is the shape of The Ugly Duckling compared to Rudolph the Red-Nosed Reindeer? They are essentially the same story, so presumably on some dimension you should be able to spot them in a group of unrelated stories.

stravant|1 year ago

This feels like a failure to learn the bitter lesson: You're just taking the translation to concepts that the LLM is certainly already doing and trying to make it explicitly forced.

mdp2021|1 year ago

It is explicitly stated in the paper that

> One may argue that LLMs are implicitly learning a hierarchical representation, but we stipulate that models with an explicit hierarchical architecture are better suited to create coherent long-form output

And the problem remains that (text surrounding the above):

> Despite the undeniable success of LLMs and continued progress, all current LLMs miss a crucial characteristic of human intelligence: explicit reasoning and planning at multiple levels of abstraction. The human brain does not operate at the word level only. We usually have a top-down process to solve a complex task or compose a long document: we first plan at a higher level the overall structure, and then step-by-step, add details at lower levels of abstraction. [...] Imagine a researcher giving a fifteen-minute talk. In such a situation, researchers do not usually prepare detailed speeches by writing out every single word they will pronounce. Instead, they outline a flow of higher-level ideas they want to communicate. Should they give the same talk multiple times, the actual words being spoken may differ, the talk could even be given in different languages, but the flow of higher-level abstract ideas will remain the same. Similarly, when writing a research paper or essay on a specific topic, humans usually start by preparing an outline that structures the whole document into sections, which they then refine iteratively. Humans also detect and remember dependencies between the different parts of a longer document at an abstract level. If we expand on our previous research writing example, keeping track of dependencies means that we need to provide results for each of the experiment mentioned in the introduction. Finally, when processing and analyzing information, humans rarely consider every single word in a large document. Instead, we use a hierarchical approach: we remember which part of a long document we should search to find a specific piece of information. To the best of our knowledge, this explicit hierarchical structure of information processing and generation, at an abstract level, independent of any instantiation in a particular language or modality, cannot be found in any of the current LLMs

anon373839|1 year ago

The bitter lesson isn’t a law of nature, though. And as GPT-style LLMs appear to be at the foot of a scaling wall, I personally think inductive bias is due for a comeback.

Jensson|1 year ago

> You're just taking the translation to concepts that the LLM is certainly already doing and trying to make it explicitly forced.

That is what tokens are doing in the first place though, and you get better results with tokens instead of letters.

mdp2021|1 year ago

That should be proven. The two approaches - predicting tokens vs predicting "sentences" - should be compared to see how much their outputs differ in terms of quality.

Edit2: ...and both (and their variants) be compared to other ideas such as "multi-token prediction"...

Edit: or, the appropriateness of the approach should be demonstrated once we have acquired "transparency" into how LLMs actually work internally. I am not aware of studies that make the inner workings of LLMs adequately clear.

Edit3: Substantially, the architecture should be as solid as possible (and results should reflect that).

blurbleblurble|1 year ago

At a performance boost of 10-100x :)

mdp2021|1 year ago

> Current best practice for large scale language modeling is to operate at the token level, i.e. to learn to predict the next tokens given a sequence of preceding tokens. There is a large body of research on improvements of LLMs, but most works concentrate on incremental changes and do not question the main underlying architecture. In this paper, we have proposed a new architecture,

For some 2024 may have ended badly,

but reading the lines above shines a great light of hope for the new year.

steenreem|1 year ago

I skimmed the paper but I couldn't figure out what they're doing to make concepts fundamentally different from tokens.

I would think that the purpose of concepts is to capture information at a higher density than tokens, so you can remember a longer conversation or better produce long-form output.

Given that, I would have expected that during the training phase, the concept model is evaluated based on how few concepts it emits until it emits a stop.

vimgrinder|1 year ago

I like the idea of a "concept": you can represent a concept with language, visuals etc., but it isn't any of those. Those are symbols used to communicate a concept or give it a representation, but at the core concepts are just connections between other concepts. The closest thing to this, I feel, is categories in category theory.

layer8|1 year ago

Concepts need to be linked to reality somehow in order to carry any meaning. They are thus not just relations between themselves.

dr_dshiv|1 year ago

Platonic forms?

rxm|1 year ago

What used to be feature engineering a decade or more ago now seems to have shifted to developing distributed representations. LLMs use word tokens (for words or the entities in images). But there are many more. The 3D Fields (or whatever they have evolved to) developed by Fei-Fei Li's group represent visual information in a way better suited for geometrical tasks. Wav2Vec, the convolutional features for YOLO and friends, and these sentence representations are other examples. I would love to read a review of this circle of ideas.

inshard|1 year ago

This is interesting. I wonder if such a project could dive into lower-level concepts, those akin to prime numbers. The atoms from which all other concepts are built.

lern_too_spel|1 year ago

This is like going back to CNNs. Attention is all you need.

zed1726|1 year ago

Quantum states are all one really needs, but it turns out that it's way too computationally expensive to simulate all that just for the purpose of AI applications, so instead we have to go to higher levels of construction. Attention is surely just about on the cusp of what is computationally reasonable, which means that it's not all we need: we need more efficient and richer constructions.

snake_doc|1 year ago

Attention is just communication? It’s orthogonal to the space of the representation.

benreesman|1 year ago

Between this and learned patches and ModernBERT and DeepSeek?

I think it’s time to read up.

upghost|1 year ago

Aside from using the word "concept" instead of "language", I don't see how this is different from an LLM. It's still doing next token prediction. This is like in D&D where you have two swords with wildly different flavor text but ultimately they both do 1d6+1 damage.

What am I missing -- aside from the marketing? Is there something architecturally different or what? Looks like a regular autoregressive sequence transformer to me.

tantalor|1 year ago

(Guessing here) It does tokenization and prediction for a whole sentence, not fragments of words.

I like this idea because that's how humans think. We mentally formulate a whole sentence, then say it. People who don't do this speak in run-ons and word salad.

mdp2021|1 year ago

> something architecturally different

An embedding-space engine that accepts sentences (SONAR) is fitted in, so that the "tokens" of this architecture are complex sets of the tokens of past architectures.
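
A rough sketch of the difference in the loop, under heavy assumptions: each autoregressive step consumes and emits one sentence vector instead of one token. Everything here is a toy stand-in. `embed` is a hashed bag-of-words, not SONAR or its API; `predict_next` just averages the context, where the paper uses a trained model; `decode` picks the nearest candidate sentence, where a real decoder would generate text.

```python
import math

def embed(sentence, dim=16):
    """Toy sentence encoder (hypothetical stand-in for something like SONAR):
    hashes each word into a bucket and L2-normalizes the counts."""
    vec = [0.0] * dim
    for word in sentence.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def predict_next(history):
    """Placeholder predictor: the mean of the context vectors.
    A real concept model would be a trained network here."""
    dim = len(history[0])
    return [sum(v[i] for v in history) / len(history) for i in range(dim)]

def decode(vector, candidates):
    """Toy decoder: return the candidate whose embedding has the
    highest dot product with the predicted vector."""
    def score(sentence):
        return sum(x * y for x, y in zip(vector, embed(sentence)))
    return max(candidates, key=score)

# One step of the loop: the context is a list of sentences, not tokens.
context = ["The cat sat on the mat.", "The cat fell asleep."]
history = [embed(s) for s in context]
predicted = predict_next(history)  # one vector = one whole "concept"
candidates = ["The cat woke up.", "Interest rates rose sharply."]
next_sentence = decode(predicted, candidates)
```

The shape of the loop is the same as next-token prediction, which is upghost's point; what changes is the unit being predicted and the space it lives in.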

unknown|1 year ago

[deleted]

unknown|1 year ago

[deleted]

YeGoblynQueenne|1 year ago

From the paper:

>> In this paper, we present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a “concept”.

I wonder if the many authors of the paper know that what they call "concept" is what all of machine learning and AI has also called a "concept" for many decades, and not a new thing that they have just named from scratch.

For instance, classes of "concepts" are the target of learning in Leslie Valiant's "A Theory of the Learnable", the paper that introduced Probably Approximately Correct Learning (PAC-Learning). Quoting from its abstract:

  ABSTRACT: Humans appear to be able to learn new
  concepts without needing to be programmed explicitly in
  any conventional sense. In this paper we regard learning as
  the phenomenon of knowledge acquisition in the absence of
  explicit programming. We give a precise methodology for
  studying this phenomenon from a computational viewpoint.
  It consists of choosing an appropriate information gathering
  mechanism, the learning protocol, and exploring the class of
  concepts that can be learned using it in a reasonable
  (polynomial) number of steps. Although inherent algorithmic
  complexity appears to set serious limits to the range of
  concepts that can be learned, we show that there are some
  important nontrivial classes of propositional concepts that
  can be learned in a realistic sense

From: https://web.mit.edu/6.435/www/Valiant84.pdf

Or take this Introduction to Chapter 2 in Tom Mitchell's "Machine Learning" (the original ML textbook, published 1997):

  This chapter considers concept learning: acquiring the definition of 
  a general category given a sample of positive and negative training 
  examples of the category.

From: https://www.cs.cmu.edu/~tom/mlbook.html (click the link in "the book").
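
For readers who haven't met that older sense of "concept", here is a minimal sketch of concept learning from labeled examples, in the style of the Find-S procedure commonly taught with this material. The weather attributes and rows are illustrative data, and the function names are my own.

```python
def find_s(examples):
    """Return the most specific conjunctive hypothesis that covers
    every positive example. "?" means "any value is acceptable"."""
    hypothesis = None
    for attributes, label in examples:
        if not label:
            continue  # this procedure only generalizes from positives
        if hypothesis is None:
            hypothesis = list(attributes)  # start from the first positive
        else:
            # keep attributes that agree, generalize the rest to "?"
            hypothesis = [h if h == a else "?"
                          for h, a in zip(hypothesis, attributes)]
    return hypothesis

def matches(hypothesis, attributes):
    """True if the hypothesis classifies this instance as positive."""
    return all(h == "?" or h == a for h, a in zip(hypothesis, attributes))

# (sky, air temp, humidity, wind, water, forecast) -> good day for sport?
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
learned = find_s(examples)
```

Here the learned "concept" is the rule `["Sunny", "Warm", "?", "Strong", "?", "?"]`, a definition of a category, which is quite a different object from an embedding vector.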

I mean, I really wonder sometimes what is going on here. There have been decades of research in AI and machine learning, but recent papers look like their authors have landed in an undiscovered country and are having to invent everything from scratch. That's not good. There are pitfalls that all the previous generations have explored thoroughly by falling into them time and again. Those who don't remember those lessons will have to find that out the hard way.

mdp2021|1 year ago

I am not sure that fits the point, YGQ:

it seems to me the concept of «concept» in the paper is "the embedding vector we get in systems like SONAR (which we could use to generalize ordered sets of tokens into more complex ideas)". That's pretty specific, and only marginally related to the earlier usage you mention.
