Ask HN: Most efficient way to fine-tune an LLM in 2024?
114 points | holomorphiclabs | 2 years ago
In particular, we are trying to understand performance vs. cost trade-offs. We don't have the budget to train from scratch.
We are working with a proprietary data set on the order of 100M tokens and are looking to fine-tune a general purpose language model and also create task-specific models based on the same corpus.
Any help would be appreciated!
dhouston | 2 years ago
A single A100 or H100 with 80GB of VRAM can fine-tune 70B open models (via parameter-efficient methods like QLoRA; scaling out to many nodes/GPUs is faster, and much cheaper GPUs work for fine-tuning smaller models).
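As a sanity check on that memory claim, here is back-of-envelope VRAM arithmetic for a 70B model; the 4-bit quantization (QLoRA-style) assumption is mine, not the parent's:

```python
# Back-of-envelope VRAM estimate for a 70B model with 4-bit quantized
# weights (QLoRA-style). LoRA adapters, optimizer state, and activations
# add overhead on top, but the frozen base weights dominate.
params = 70e9
bytes_per_param = 0.5  # 4 bits per parameter
weights_gb = params * bytes_per_param / 1e9
print(f"base weights: {weights_gb:.0f} GB")  # 35 GB, comfortably under 80 GB
```

Full-precision (fp16) weights alone would be ~140 GB, which is why the single-GPU claim really only holds for quantized, adapter-based fine-tuning.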
The localllama Reddit sub at https://www.reddit.com/r/LocalLLaMA/ is also an awesome community for the GPU poor :)
alfor | 1 year ago
Have you seen benchmarks?
gardnr | 2 years ago
If you do end up wanting to fine-tune, then use QLoRA with axolotl or unsloth to prove your hypothesis on a smaller model, and then evaluate whether you want the marginal gains you'd get from full-precision training.
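For reference, axolotl drives a QLoRA run from a single YAML config. The sketch below is illustrative only; the model name, hyperparameters, and dataset entries are my assumptions, so cross-check field names against the example configs shipped in the axolotl repo:

```yaml
# Minimal QLoRA config sketch in axolotl's YAML style (values illustrative)
base_model: mistralai/Mistral-7B-v0.1
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
datasets:
  - path: ./corpus.jsonl
    type: completion
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 1
learning_rate: 0.0002
```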
After you fine-tune it on the 100M-token dataset, use DPO to polish it off. You need to create a DPO dataset for that, but it can be relatively small and still get you some great gains.
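The DPO dataset itself is just prompt/chosen/rejected triples. This sketch assumes the field-name convention used by common DPO trainers such as TRL's; the helper function is hypothetical:

```python
# Each DPO training example pairs a preferred and a dispreferred completion
# for the same prompt. make_dpo_record is a hypothetical helper.
def make_dpo_record(prompt: str, chosen: str, rejected: str) -> dict:
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

record = make_dpo_record(
    prompt="Summarize our returns policy in one sentence.",
    chosen="Items can be returned within 30 days with a receipt.",
    rejected="Returns are, like, complicated.",
)
print(sorted(record))  # ['chosen', 'prompt', 'rejected']
```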
After that, look at applying grammars during inference if you expect structured results like JSON.
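For the JSON case, llama.cpp-style GBNF grammars are one way to do this. Below is a toy grammar of my own (not from the thread) that restricts output to a single-key JSON object; check llama.cpp's grammars/ directory for the canonical syntax:

```
# Toy GBNF grammar: output must look like {"answer": "..."}
root   ::= "{" ws "\"answer\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9 .,]* "\""
ws     ::= [ \t\n]*
```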
You should be able to run the experiments on 4090s from vast.ai, RunPod, or a similar service.
It can cost less than $100, depending on your requirements.
kawin | 1 year ago
I'd like to add that if you don't have pairwise preference data (A > B) but do have binary data (A is good for x_1, B is good for x_2, etc.), then Kahneman-Tversky Optimization (KTO) might be a better fit. Despite learning from a weaker signal, it works as well as or better than DPO in practice.
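Concretely, the KTO data format drops the pairing: each record is just a prompt, a completion, and a thumbs-up/down label. The field names here follow what I believe is TRL's KTOTrainer convention, but verify against its docs; the helper is hypothetical:

```python
# KTO needs only per-example binary feedback, not A-vs-B pairs.
def make_kto_record(prompt: str, completion: str, good: bool) -> dict:
    return {"prompt": prompt, "completion": completion, "label": good}

data = [
    make_kto_record("What is our refund window?", "30 days.", True),
    make_kto_record("What is our refund window?", "No idea.", False),
]
print([d["label"] for d in data])  # [True, False]
```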
dosshell | 2 years ago
https://fortune.com/2024/03/11/adaptive-startup-funding-falc...
danielhanchen | 1 year ago
With Unsloth, Mistral 7B fine-tunes about 2x faster than HuggingFace + Flash Attention 2, and Gemma 7B about 2.4x faster than HF + FA2.
Check out https://github.com/unslothai/unsloth for full benchmarks!
jasonjmcghee | 2 years ago
https://github.com/OpenAccess-AI-Collective/axolotl
Someone from one of the cloud GPU vendors wrote a guide: https://brev.dev/blog/fine-tuning-mistral
viksit | 2 years ago
Could you give a high-level way to think about how to use DSPy for something like this?
tdba | 2 years ago
There's no one-size-fits-all answer yet, but if you just want to test it out, there are many commercial offerings on which you should be able to get some results for under $10k.
troyvit | 2 years ago
* [Example 1](https://www.mongodb.com/developer/products/atlas/rag_with_cl...) (Claude and MongoDB's vector database)
* [Example 2](https://docs.mistral.ai/guides/basic-RAG/) (Mistral and the Faiss vector database or other embedding frameworks)
blissfulresup | 2 years ago
https://arxiv.org/abs/2106.09685 (LoRA: Low-Rank Adaptation of Large Language Models)
dvt | 2 years ago
[1] https://arxiv.org/pdf/2305.11206.pdf (LIMA: Less Is More for Alignment)
ozr | 2 years ago
LIMA demonstrated that instruction-tuning and output formatting could be trained with a limited number of samples, not that fine-tuning is incapable of adding new information to the model.
It may be sub-optimal compared to RAG in most cases, but it does work.
ramoz | 2 years ago
You would certainly fine-tune for domain-specific tasks, and would curate a subset of the 100M tokens. The alignment studies referenced used about 1,000,000 tokens in total.
RAG is a hacky way to interpolate new knowledge with a base model. It's not always reliable, nor easy to integrate into task-specific workflows.
viksit | 2 years ago
What is the preferred way to feed documents/knowledge into a model so that the primary retrieval is done by the LLM, perhaps using a vector DB only for information enhancement (a la onebox)?
viksit | 2 years ago
And since your dataset is large, even the longest context windows are insufficient.
xianshou | 2 years ago
Blazing fast compared to out-of-the-box transformers. Also make sure to use Flash Attention if you have A100s or better and context length >= 2k.
Add FAISS (https://github.com/facebookresearch/faiss) if you need fast local RAG
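To make the retrieval step concrete: FAISS is essentially fast nearest-neighbor search over embedding vectors. This brute-force NumPy sketch (toy data, my own) shows the operation that FAISS indexes accelerate at scale:

```python
import numpy as np

# Brute-force L2 nearest-neighbor search over a tiny "corpus" of
# embedding vectors; FAISS makes this fast for millions of vectors.
def search(corpus: np.ndarray, query: np.ndarray, k: int = 2):
    dists = np.linalg.norm(corpus - query, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

corpus = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
idx, _ = search(corpus, np.array([0.9, 0.1]))
print(idx)  # [1 0]: the vector [1, 0] is closest to the query
```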