Ghostwriter is notably worse than GPT-4, so while it may be true in a sense that "Training a custom model allows us to tailor it to our specific needs and requirements", the reality is that they'd get better results just using OpenAI right now. That's probably true for almost every other use case as well.
That said, I am patiently waiting and champing at the bit for the day this isn't true anymore. Cool to see the groundwork being laid for it.
Stable Diffusion 1.5 is not SOTA, but in practice the sea of augmentations makes SD pretty much unbeatable, if you are willing to put in the work to use them.
I think LLMs could end up the same way, if the community consolidates around a good one.
Not everyone wants to depend on and trust a cloud service, and not everyone needs GPT-4 quality.
If there's a viable way to tune and run models locally they could still be useful if you don't need it to play chess and imitate a Python interpreter at the same time.
Just my five cents: it should be easier to train a small custom model that works off a big pre-trained one, taking the big model's latent state as input while the big model does all the hard work. But getting the latents means they have to be accessible, which is why open-source models are so valuable, even if they are not that good in general. Moreover, open-source models can be used in other projects in various setups.
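As a toy illustration of that idea, here is a minimal pure-Python sketch of training a small head on a frozen backbone's latents. Everything here is hypothetical: the "backbone" is just a stand-in for a real pre-trained model's latent output, and the head is a tiny logistic-regression classifier.

```python
# Toy sketch: a small trainable head on top of a frozen "big model".
# All names are hypothetical; the backbone stands in for a real model.
import math

def frozen_backbone(x):
    # Pretend this is the big pre-trained model: raw input -> latent vector.
    return [math.tanh(x * w) for w in (0.5, -1.0, 2.0)]

def train_head(examples, labels, lr=0.5, epochs=200):
    # Logistic-regression head trained only on the frozen latents.
    w = [0.0, 0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            z = frozen_backbone(x)          # big model does the hard work
            logit = sum(wi * zi for wi, zi in zip(w, z)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            g = p - y                       # gradient of the log-loss
            w = [wi - lr * g * zi for wi, zi in zip(w, z)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = frozen_backbone(x)
    logit = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1.0 / (1.0 + math.exp(-logit))

# Train the head to classify the sign of the input: negatives -> 0, positives -> 1.
w, b = train_head([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
```

The backbone is never updated; only the handful of head parameters are trained, which is what makes this cheap when the backbone's latents are accessible.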
They’re competing directly with Microsoft (and getting crushed) because GitHub is their biggest competitor, so it makes sense that they wouldn’t want to use OpenAI products.
This is interesting. I think the future of AI will not be re-creating something like ChatGPT, but using variations of these methodologies to train AI models for specific tasks.
There are some advantages to not having to make an LLM that impresses every human being on the planet. Imagine training the AI to be good at only one specific thing. I think it will become much more precise and deterministic.
This is just my hypothesis. I'm excited to see where this goes.
How expensive is it? My understanding is that it's not reasonable to train an LLM from scratch by yourself, and that if you want one that isn't just very stupid then you need to spend between hundreds of thousands and hundreds of millions of dollars. But if you don't want to train from scratch then you can fine-tune existing models for cheaper.
Disclaimer: I work for MosaicML (MosaicML is the creator of the training platform used by Replit).
Training these models from scratch on your domain specific data is not as expensive as one might think. We have provided some cost estimates in our blogs.
Looking at what they're doing here, probably not as much as you think.
As you note, with the plethora of open/open-ish LLMs today and LoRA + PEFT, you can fine-tune with low VRAM and pretty quickly, so even a single A100 or whatever cloud GPUs are on hand is just fine. I've even seen people pull it off in reasonable time on super cheap T4s, A10s, etc.
I doubt anyone reading a blog post is attempting to train a "true" multi-billion param LLM from scratch.
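For anyone unfamiliar with why LoRA-style fine-tuning is so cheap, here is a plain-Python sketch of the underlying math (this is not the PEFT library API, just the idea): the frozen weight matrix W is left untouched, and a low-rank delta B @ A, scaled by alpha/r, is trained instead.

```python
# Minimal sketch of the LoRA idea: train a low-rank delta B @ A instead of
# updating the frozen weight W, so only r*(d_in + d_out) numbers are trainable
# rather than d_in * d_out.

def matmul(a, b):
    # (n x k) @ (k x m) for lists of lists
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_effective_weight(W, A, B, alpha):
    # W: d_out x d_in (frozen), B: d_out x r, A: r x d_in (trainable)
    r = len(A)
    scale = alpha / r
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Frozen 2x2 weight with a rank-1 adapter. At this size the savings are nil,
# but for a 4096x4096 layer, rank 8 means ~65k trainable vs ~16.8M frozen params.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]   # d_out x r
A = [[0.0, 2.0]]     # r x d_in
W_eff = lora_effective_weight(W, A, B, alpha=1.0)
```

Because only A and B carry gradients, the optimizer state and activations needed for backprop shrink accordingly, which is what lets this fit in low VRAM on a T4 or A10.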
I am leading a similar initiative and I have also used Databricks for the preprocessing.
Most interesting is what happens between the preprocessing and the model training - the hand-off to the cluster workers.
I guess the efficient option is to partition the data, set up shards in advance, and ideally cache or even copy the data to local workers during init.
This, of course, breaks some of the promise of being able to scale training flexibly, for instance to experiment with the scaling of compute and data.
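The partition-in-advance option can be sketched roughly like this (names and file layout hypothetical): each worker derives its own shard deterministically from the sorted file list, so no coordination is needed at init, but the mapping is fixed once chosen.

```python
# Hypothetical sketch of "set up shards in advance": deterministically assign
# data files to workers, so each worker can copy/cache its shard locally
# before training starts.

def shard_files(files, rank, world_size):
    # Round-robin over the sorted list so every worker derives the same
    # mapping without any coordination.
    return [f for i, f in enumerate(sorted(files)) if i % world_size == rank]

files = [f"part-{i:05d}.parquet" for i in range(10)]
shard0 = shard_files(files, rank=0, world_size=4)
shard1 = shard_files(files, rank=1, world_size=4)
```

The shards are disjoint and together cover the dataset, but changing `world_size` later invalidates any per-worker local copies, which is exactly the inflexibility described above.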
A different way to go about it is to use a streaming/iterable dataset/loader implementation with its own sharding logic that reads from a central store of parquets with some reasonable row-group size. This gives full flexibility in terms of node/gpu/worker/batch_size for experimentation - e.g. literally as parameters in PyTorch. Of course, one has to also implement caching of remote data since the data is kept centrally.
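The streaming alternative can be sketched in plain Python as follows. This is a deliberately simplified stand-in for a real iterable dataset (e.g. a PyTorch `IterableDataset` reading parquet row groups): `fetch_row_group` is a hypothetical callback for the actual remote read, and the sharding decision happens at iteration time, so rank/world size really are just parameters.

```python
# Sketch of a streaming loader with its own sharding logic: reads row groups
# from a central store, keeps only those belonging to this worker, and caches
# remote reads locally. `fetch_row_group` is a hypothetical stand-in for a
# real parquet/object-store read.
import os
import pickle
import tempfile

def stream_row_groups(row_group_ids, rank, world_size, fetch_row_group,
                      cache_dir=None):
    cache_dir = cache_dir or tempfile.mkdtemp()
    for i, rg_id in enumerate(sorted(row_group_ids)):
        if i % world_size != rank:   # sharding decided per-iteration,
            continue                 # not baked in at data-prep time
        path = os.path.join(cache_dir, f"{rg_id}.pkl")
        if os.path.exists(path):     # local cache of centrally-kept data
            with open(path, "rb") as f:
                rows = pickle.load(f)
        else:
            rows = fetch_row_group(rg_id)
            with open(path, "wb") as f:
                pickle.dump(rows, f)
        yield from rows

# Fake central store for illustration.
store = {"rg0": [1, 2], "rg1": [3, 4], "rg2": [5, 6]}
worker0 = list(stream_row_groups(store, rank=0, world_size=2,
                                 fetch_row_group=store.__getitem__))
```

A real implementation would additionally handle shuffling within and across row groups, cache eviction, and resumption, which is where most of the "a lot of code" goes.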
In my opinion, there is no satisfying, flexible solution for this, especially when one also wants to experiment with complex transformations or augmentations in the dataset/loader and remain portable across cloud offerings. So this has to be implemented from scratch (not too difficult, but still a lot of code). The upcoming datapipes will probably make this trivial, though.
Would love to hear more experiences in how you set this up!
> "a student coding on their phone in India should have access to the same AI as a professional developer in Silicon Valley. To make this possible, we train custom models with reduced cost."
In principle, that's great. But the reality is that whoever has the resources and would benefit from something better will find a way to get it. What they're communicating here is: the best-resourced developers on the planet aren't our ideal customer.
Did we ever get any resolution about what happened after this company threatened to sue their intern for making a side project that supposedly stole all their great ideas? I would like to know before I ever consider anything from them again.
The founder admitted his mistake, and the ex-intern's site is back up and running: https://riju.codes/. I'm personally a fan of both Amjad (the CEO) and Radon (the intern), and I realize that everyone makes mistakes. It's not a reason to discount the hard work of the people at Replit.
Agree that Ghostwriter is subpar though.
That reminds me - I saw a somewhat-clever acronym variant for LLM that communicated this the other day but it escapes me ATM...
[0] - https://www.mosaicml.com/blog/introducing-pubmed-gpt
[1] - https://cloud.google.com/blog/topics/healthcare-life-science...
[2] - https://dev.to/reaminated/thoughts-on-bloomberggpt-and-domai...
The MosaicML cost-estimate posts mentioned above:
https://www.mosaicml.com/blog/mosaicbert
https://www.mosaicml.com/blog/training-stable-diffusion-from...
https://www.mosaicml.com/blog/gpt-3-quality-for-500k
https://github.com/lxe/simple-llm-finetuner
Edit: for NLP, I guess this is a good implementation, and what Mosaic uses: https://huggingface.co/docs/datasets/stream