I'm really excited about this project and think it could be genuinely disruptive. It is organized by LAION, the same folks who curated the dataset used to train Stable Diffusion.
My understanding of the plan is to fine-tune an existing large language model (one trained with self-supervised learning on a very large corpus) using reinforcement learning from human feedback, the same method used for ChatGPT. Once the dataset they are creating is available, though, better methods may be developed rapidly, as it will democratize the ability to do basic research in this space. I'm curious how much more limited the systems they are planning to build will be compared to ChatGPT, since they plan to use models with far fewer parameters so they can be deployed on much more modest hardware.
As an AI researcher in academia, it is frustrating to be blocked from doing a lot of research in this space due to computational constraints and a lack of the required data. I'm teaching a class this semester on self-supervised and generative AI methods, and it will be fun to let students play around with this in the future.

Here is a video about the Open Assistant effort: https://www.youtube.com/watch?v=64Izfm24FKA
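For readers new to RLHF, the core of the reward-modeling step can be illustrated with the standard pairwise preference loss from the InstructGPT paper. This is a toy numpy sketch for intuition, not code from the Open Assistant project:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise preference loss: push the reward of the human-preferred
    response above the reward of the rejected one,
    loss = -log(sigmoid(r_chosen - r_rejected))."""
    return -np.log(sigmoid(r_chosen - r_rejected))

# Correctly ranked pair with a wide margin -> small loss:
easy = reward_model_loss(2.0, -1.0)
# Indistinguishable pair -> loss of log(2):
tied = reward_model_loss(0.0, 0.0)
```

A reward model trained on human rankings like this is what the RL step then optimizes against.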
Yes, definitely. If these become an important part of people's lives, they shouldn't all be walled off inside companies. (There is room for both: Microsoft can commission Yankee Group to write a report about how the total cost of ownership of running OpenAI models is lower.)
We (humanity) really lost out by not having open source search and social media, so this is an opportunity to reclaim that ground.
I only hope we can have "neutral" open source curation of these, rather than imposing ideology on the datasets and model training right out of the box. There will be calls for that, and lazy criticism that the demo models are x-ist, and it's going to require principles to ignore the noise and sustain something useful.
Today, computers run the world. Without the ability to run your own machine with your own software, you are at the mercy of those who do. In the future, AI models will run the world in the same way. Projects like this are crucial for ensuring the freedom of individuals in the future.
Totally agree. I was just thinking about how I will eventually stop using a search engine once ChatGPT can link directly to what we're talking about, with up-to-date examples.
That is a situation where censoring the model is going to be a huge disadvantage, and it would create a huge opportunity for something like this to be straight up better. Censoring the models is what I would bet on as the fatal first-mover mistake in the long run, the Achilles' heel of ChatGPT.
The power of ChatGPT isn't that it's a chat bot, but its ability to do semantic analysis. It's already well established that you need high quality semi-curated data + high parameter count and that at a certain critical point, these models start comprehending and understanding language. All the smart people in the room at Google, Facebook, etc. are absolutely pouring resources into this; I promise they know what they're doing.
We don't need yet-another-GUI. We need someone with a warehouse of GPUs to train a model with the parameter count of GPT3. Once that's done you'll have thousands of people cranking out tools with the capabilities of ChatGPT.
Your point about needing large models in the first place is well taken.
But I still think we would want a curated collection of chat/assistant training data if we want to use that language model and train it for a chat/assistant application.
So this is a two-phase project, the first phase being training a large model (GPT), the second being using Reinforcement Learning from Human Feedback (RLHF) to train a chat application (InstructGPT/ChatGPT).
There are definitely already people working on the first part, so it's useful to have a project focusing on the second.
>We need someone with a warehouse of GPUs to train a model with the parameter count of GPT3
So I'm assuming that you don't follow Rob Miles. If you do this alone you're either going to create a psychopath or something completely useless.
The GPT models have no means in themselves of understanding correctness or right/wrong answers. All of these models require training and alignment functions that are typically provided by human input judging the output of the model. And we still see where this goes wrong in ChatGPT, where the bot turns into a 'Yes Man' because it's aligned with giving an answer rather than saying "I don't know", even when its confidence in the answer is low.

Computerphile did a video on this subject in the last few days: https://www.youtube.com/watch?v=viJt_DXTfwA
> It's already well established that you need high quality semi-curated data + high parameter count and that at a certain critical point, these models start comprehending and understanding language

Where is that shown?
Is anyone working on an Ender's Game style "Jane" assistant that just listens via an earbud and responds? That seems totally within the realm of current tech but I haven't seen anything.
This is wonderful, no doubt about it, but the bigger problem is making this usable on commodity hardware. Stable Diffusion only needs 4 GB of RAM to run inference, but all of these large language models are too large to run on commodity hardware. Bloom from huggingface is already out and no one is able to use it. If ChatGPT were given to the open source community, we couldn't even run it…
> Bloom from huggingface is already out and no one is able to use it.
This RLHF dataset that is being collected by Open Assistant is just the kind of data that will turn a rebel LLM into a helpful assistant. But it's still huge and expensive to use.
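The "huge and expensive to use" point is mostly simple arithmetic: inference memory is roughly parameter count times bytes per parameter (ignoring activations and the KV cache). A rough back-of-the-envelope helper; the model sizes below are illustrative:

```python
def model_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weights-only inference footprint; real usage adds activations,
    the KV cache, and framework overhead."""
    return n_params * bytes_per_param / 1024**3

# A GPT-3-scale model (175B params) in fp16 needs hundreds of GB:
gpt3_fp16 = model_memory_gb(175e9, 2)   # ~326 GB, far beyond one GPU
# A 7B-parameter model in 8-bit fits a consumer GPU:
small_int8 = model_memory_gb(7e9, 1)    # ~6.5 GB
```

This is why smaller models and lower-precision weights are the levers for commodity hardware.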
Great looking project here. We absolutely need a local/FOSS option. There have been a number of open-source libraries for LLMs lately that simply call into paid/closed models via APIs. Not exactly the spirit of open source.
There's already great local/FOSS options such as FLAN-T5 (https://huggingface.co/google/flan-t5-base). Would be great to see a local model like that trained specifically for chat.
In the not-too-distant future, we may see integrations with always-on recording devices (yes, I know, shudder) transcribing our every conversation and interaction, and incorporating that text in place of the current custom-corpus-style addenda to LLMs. That would give a truly personal and social skew to current capabilities, in the form of automatically compiled memories to draw on.
To me, the value of a local LLM is that it could hold my life's notes, and I'd talk to it as if it were my alter ego into old age. One could say it's the kind of "soul" that outlasts us.
Look at David Shapiro's project on GitHub, not Raven but the other one that is more fleshed out. He already does summarization of dialogue and retrieval of relevant info using the OpenAI APIs, I believe. You could combine that with the Chrome web speech or speech-to-text API, which can stay on continuously. You would need to modify it a bit to handle third-party conversations, and your phone would run out of battery, but you could technically make the code changes in a day or two, I think.
Given how nerfed ChatGPT is (which is likely nothing compared to what large risk-averse companies like Microsoft/Google will do), I'm heavily anticipating a Stable Diffusion-style model that is more free, or at least configurable to have stronger opinions.
What if we use ChatGPT responses as contributions? I don't see a legal issue here, unless OpenAI can claim ownership of any of their input/output material. It would also be a good way in for those disillusioned by the "openness" of that company.
Playing the "training game" is very interesting and kind of addictive.
The "reply as robot" task in particular is really enlightening. If you try to give it any sense of personality or humanity, your comments will be downvoted and flagged by other players.

It's like everybody, without instruction, shares this preconception that these assistants should have a deeply subservient, inhuman, corporate affectation.
Great. If I could use this to interactively search inside (OCR'd) documents, files, emails and so on, that would be huge: asking when my passport expires, or what my grades were in high school, and so on.
I think we are right around the corner from actual AI personal assistants, which is pretty exciting.
We have great tooling for speech to text, text to speech, and LLMs with memory for “talking” to the AI. Combining those with both an index of the internet (for up to date data, likely a big part of the Microsoft/open ai partnership) and an index of your own content/life data, and this could all actually work together soon.
I'm an iPhone guy, but I would imagine all of this could be combined on an Android phone (it being way more flexible), paired with a wireless earbud, so that rather than being a "normal" phone, it's just a pocketable smart assistant.
Crazy times we live in. I'm 35, so I have basically lived through the world being "broken" by tech a few times now: the internet, social media, and smartphones all fundamentally reshaped society. It seems like the AI wave we are living through right now is about to break the world again.
EDIT: everything I wrote above is going to immediately run into a legal hellscape, I get that. If everyone has devices in their pockets recording and processing everything spoken around them in order to assist their owner, real life starts getting extra dicey quickly. Will be interesting to see how it plays out.
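The assistant loop described above (speech-to-text, retrieval over your own content, an LLM, text-to-speech) can be sketched as a simple pipeline. Every function here is a placeholder stub invented to show the data flow, not a real API:

```python
def speech_to_text(audio: bytes) -> str:
    # Stub standing in for a real STT engine (e.g. a local Whisper-style model).
    return "what's on my calendar today?"

def retrieve_context(query: str, index: dict) -> str:
    # Toy retrieval: pull personal-data entries whose keyword appears in the query.
    return " ".join(v for k, v in index.items() if k in query)

def llm_reply(prompt: str) -> str:
    # Stub standing in for the language model call.
    return f"Based on your notes: {prompt}"

def text_to_speech(text: str) -> bytes:
    # Stub standing in for a TTS engine.
    return text.encode("utf-8")

def assistant_turn(audio: bytes, personal_index: dict) -> bytes:
    """One voice interaction: hear, look up personal context, answer, speak."""
    query = speech_to_text(audio)
    context = retrieve_context(query, personal_index)
    return text_to_speech(llm_reply(f"{context}\n{query}"))

spoken = assistant_turn(b"<mic audio>", {"calendar": "dentist at 3pm"})
```

The interesting engineering is inside each stub; the glue itself, as the comment suggests, is already within reach.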
> https://www.gutenberg.org/ has an extensive collection of ebooks in multiple languages and formats that would make great training data
…
> There is detailed legal information on which books are under public domain and which ones are copyrighted. It would be great if someone would go through these and decide which books are okay to crawl and use as training data (my understanding is that it is okay to scrape the contents as they are publicly available in a browser, but just to be sure)
Yup, sure are the same folks who put together the dataset used to train Stable Diffusion.

Data? Yeah, just take everything. It's all good.
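As a practical aside on the crawling suggestion quoted above: before any Gutenberg text is usable as training data, the license header and footer need to be stripped. A minimal sketch, assuming the conventional `*** START OF ... ***` / `*** END OF ... ***` markers that most Gutenberg ebooks carry:

```python
import re

def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the body between the standard '*** START ... ***' and
    '*** END ... ***' markers that wrap Project Gutenberg ebooks."""
    start = re.search(r"\*\*\* ?START OF.*?\*\*\*", text, re.IGNORECASE)
    end = re.search(r"\*\*\* ?END OF.*?\*\*\*", text, re.IGNORECASE)
    if start and end:
        return text[start.end():end.start()].strip()
    return text.strip()  # fall back to the raw text if markers are absent
```

The per-book copyright triage the quote asks for would still have to come from the catalog metadata; this only removes the license wrapper.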
I've been excited about the notion of this for a while, but it's unclear to me how this would succeed where numerous well-resourced companies have failed.
Are there some advantages that Open Assistant has that Google/Amazon/Apple lack that would allow them to succeed?
Instruction tuning mostly relies on the quality of the data you put into the model. This makes it different from traditional language model training: essentially you take one of these existing hugely expensive models (there are lots of them already out there), and tune them specifically on high quality data.
This can be done on a comparatively small scale, since you don't need to train on trillions of words, only on a smaller set of high quality data (even OpenAI didn't have a lot of that).
In fact, if you look at the original paper https://arxiv.org/pdf/2203.02155.pdf Figure 1, you can see that even small models already significantly beat the current SOTA.
Open source projects often have trouble securing hardware resources, but the "social" resources for producing a large dataset are much easier to manage in OSS projects. In fact, the data the OSS project collects might just be better, since they don't have to rely on paying a handful of minimum-wage workers to produce thousands of examples.
One of the main objectives is to reduce the bias introduced by OpenAI's screening and selection process, which is doable since many more people work on generating the data.
Google is at the mercy of advertisers, and all three are profit-driven and risk-averse. There is no reason they couldn't do the same as LAION; it just doesn't align with their organizational incentives.
The model hasn't been trained yet. The goal is for it to fit on "consumer hardware", which likely means 2x3090 (48 GB via NVLink) or a single 3090/4090 (24 GB) on the high end, and something like a 3080/4080 16 GB on the lower end.
I watched one of the developers' YouTube videos, and he said it should run on consumer hardware. He said it's never going to run on something like a Raspberry Pi, but it should run pretty well on an "average Joe PC".
Though it's interesting to see the capabilities of "conversational user interfaces" improve, the current implementations are too verbose and slow for many real-world tasks, and, more importantly, context still has to be provided manually. I believe the next big leap will be low-latency dedicated assistants focused on specific tasks, with normalized and predictable results from prompts.
It may be interesting to see how a creative task like image or text generation changes when rewording your request slightly - after a minute wait - but if I'm giving directions to my autonomous vehicle, ambiguity and delay is completely unacceptable.
txtai | 3 years ago
Another thread on HN (https://news.ycombinator.com/item?id=34653075) discusses a model that is less than 1B parameters and outperforms GPT-3.5. https://arxiv.org/abs/2302.00923
These models will get smaller and more efficiently use the parameters available.
f6v | 3 years ago
I’m not sure what you mean by “understanding”.
Tepix | 3 years ago
I'm curious how they will get these LLMs to work on consumer hardware myself. Is FP8 the way to get them small?
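FP8 is one option, though it needs recent hardware support; the more general idea is weight quantization: storing each weight in 8 bits plus a shared scale, cutting memory 4x versus fp32. A toy numpy sketch of symmetric int8 quantization, illustrating the principle rather than any specific library:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: 1 byte per weight plus one
    fp32 scale, instead of 4 bytes per fp32 weight."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_int8(w)
max_err = float(np.abs(dequantize(q, scale) - w).max())  # bounded by ~scale/2
```

Production schemes quantize per channel or per block to keep the error small, but the memory math is the same.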
lytefm | 3 years ago
Such an AI assistant would know me extremely well, keep my data private, and help me with generating and processing thoughts and ideas.
outside1234 | 3 years ago
Is it possible to use a “SETI at Home” style approach to parcel out training?
grealy | 3 years ago
In the very near future, there will be trained models which you can download and run, which is what it sounds like you were expecting.