Ask HN: Most efficient way to fine-tune an LLM in 2024?

[+] dhouston|2 years ago|reply

Qlora + axolotl + good foundation model (llama/mistral/etc, usually instruction fine tuned) + runpod works great.

A single A100 or H100 with 80GB VRAM can fine tune 70B open models (and obviously scaling out to many nodes/GPUs is faster, or can use much cheaper GPUs for fine tuning smaller models.)

The localllama Reddit sub at https://www.reddit.com/r/LocalLLaMA/ is also an awesome community for the GPU poor :)

[+] alfor|1 year ago|reply

Can consumer systems like the rtx3090 or 4X rtx3090 achieve something?

Have you seen benchmarks

[+] holomorphiclabs|2 years ago|reply

Thank you! and yes huge fan of r/localllama :)

[+] gardnr|2 years ago|reply

You probably want to build a retrial augmented generation pipeline.

If you do end up wanting to fine tune then use qlora with axolotl or unsloth to prove your hypothesis on a smaller model and then evaluate if you want the marginal gains you get from full precision training.

After you fine tune it with 100m token dataset, use DPO to polish it off. You need to create a DPO dataset for that but it can be relatively small to get some great gains.

After that, look at applying grammars during inference if you are expecting structured results like json.

You should be able to run the experiments on 4090s from vast.ai or runpod or similar service.

It can cost less than $100 depending on your requirements.

[+] kawin|1 year ago|reply

This is great advice!

I'd like to add that if you don't have pairwise preference data (A > B) but do have binary data (A is good for x_1, B is good for x_2, etc.), then Kahneman-Tversky Optimization (KTO) might be a better fit. Despite learning with a weaker signal, it works as well or better than dpo in practice.

[+] objektif|2 years ago|reply

Do you have any tutorials do achieve all this? Thanks.

[+] dosshell|2 years ago|reply

I know this is maybe not the answer you want, but if you are just interested in getting the job done there exist companies that are experts on this, for example:

https://fortune.com/2024/03/11/adaptive-startup-funding-falc...

[+] uptownfunk|2 years ago|reply

Also interested in this. Does this task really require such specialized knowledge?

[+] luke-stanley|2 years ago|reply

For my ChillTranslator project I spent maybe a few dollars fine-tuning Phi 2, to generate less spicy variations of inflammatory Hacker News comments with very little data to see how well it worked (especially compared to your 100M tokens). I'll improve it when I have time. I mostly followed the Brev fine-tune tutorial but I wanted to have a 2 GB file GGUF quantised model I could run on any device with a specific JSON grammar. It uses Transformers PEFT and QLoRA. I didn't try Axolotl yet, or OpenPipe but I hope to. Actual compute time is probably much less than I spent, I wasted time dealing with drivers, trying to figure out how to merge the finetuned weights, serialise to old fashioned Pickle, not safe-tensors, and how to convert to GGUF, quantise it and rsync it.

[+] danielhanchen|1 year ago|reply

A bit late, but Unsloth makes LoRA / QLoRA finetuning 2x faster and reduces VRAM by 80% with 0% degradation in accuracy! (no approximations are done!)

Mistral 7b is 2x faster than HuggingFace + Flash Attention 2. Gemma 7b is 2.4x faster than HF + FA2.

Check out https://github.com/unslothai/unsloth for full benchmarks!

[+] jasonjmcghee|2 years ago|reply

The approach I see used is axolotl with QLoRA using cloud GPUs which can be quite cheap.

https://github.com/OpenAccess-AI-Collective/axolotl

Someone from one of the cloud GPU vendors wrote a guide: https://brev.dev/blog/fine-tuning-mistral

[+] stanbiryukov|2 years ago|reply

I recommend reviewing Stanford's dspy library - great examples of few-shot learning that works by generating and tuning prompts for LLMs and even distilling instruction following tasks to smaller models like T5. Second, as others mentioned, using QLoRA for supervised fine-tuning followed by DPO/KTO for preference optimization. This strategy placed Huggingface's Zephyr and IBM's Neural Chat on leaderboards for 7B parameter models. I also recommend reviewing the Unsloth library which has excellent accelerated examples of using these methods, along with the axolotl library. Lastly, skypilot and Modal both have excellent examples that showcase using axolotl to efficiently finetune models on cloud GPUs. [1] https://github.com/stanfordnlp/dspy [2] https://github.com/unslothai/unsloth [3] https://github.com/OpenAccess-AI-Collective/axolotl [4] https://github.com/skypilot-org/skypilot [5] https://github.com/modal-labs/llm-finetuning

[+] viksit|2 years ago|reply

i looked at dspy last week, and was trying to wrap my head around how it would be useful for a "fine tune" style use case - where i would want to give the base model more context vs use a vector DB and have the model put together a result.

could you give a high level way to think about how to use dspy for something like this?

[+] HarHarVeryFunny|2 years ago|reply

A possible alternative to fine-tuning is in-context learning, especially if you are using a model with long context where you can provide a lot of examples. Models can do one/few-shot learning, but in-context learning improves the more examples you give. You could experiment cheaply with Claude Haiku to see if this works for you.

[+] unknown|2 years ago|reply

[deleted]

[+] magdyks|2 years ago|reply

Finetuning a LoRA based adapter using a tool like predibase.com does this really fast. If you wanna go fully open source and have your own hardware you can do the same thing using a ludwig + lorax stack to do this yourself.

[+] tdba|2 years ago|reply

What's your measure of performance?

Theres no one size fits all answer yet, but if you just want to test it out there are many commercial offerings on which you should be able to get some results for under $10k.

[+] holomorphiclabs|2 years ago|reply

Are there any that are recommended? Honestly we would rather not share data with any 3P vendors. It's been a painstaking progress to curate it.

[+] objektif|2 years ago|reply

Apologize if out of topic but could anyone please point me to a resource regarding best practices of implementing RAG with either proprietary LLMs like GPT?

[+] troyvit|2 years ago|reply

I don't know if it's best practices but I found two tutorials. Apologies for the markdown.

* [Example 1](https://www.mongodb.com/developer/products/atlas/rag_with_cl...) (Claude and MongoDB's vector database)

* [Example 2](https://docs.mistral.ai/guides/basic-RAG/) (Mistral and the Faiss vector database or other embedding frameworks)

[+] netdur|2 years ago|reply

I understand the methods to address the fine-tuning and RAG issues but lack the time and possibly the technical skills to implement the solution. Fine-tuning can potentially dumb down a perfect model, and RAG has context limitations and may not cover all content. My thinking, we should vectorize the text and embed these vectors into all layers of the model at inference time. This approach would bypass the context size limitations and resource wastage associated with fine-tuning, as vectorization is fast. I believe this vectorization and embedding strategy is the solution.

[+] Redster|2 years ago|reply

What LLM are you hoping to use. Have you considered using HelixML? If I am reading you right, the primary concern is compute costs, not human-time costs?

[+] holomorphiclabs|2 years ago|reply

We are finding there is a trade-off between model performance and hosting costs post-training. The optimal outcome is where we have a model that performs well on next-token prediction (and some other in-house tasks we've defined) that ultimately results in a model that we can host on the lowest-cost hosting provider rather than be locked in. I think we'd only go the proprietary model route if the model really was that much better. We're just trying to save our selves weeks/months of benchmarking time/costs if there was already an established option in this space.

[+] Redster|2 years ago|reply

That said, I think that dvt's comment is helpful about RAG likely being what you need rather than fine-tuning, but wanted to offer something if you know that's what you need.

[+] blissfulresup|2 years ago|reply

Look into LoRa

https://arxiv.org/abs/2106.09685

[+] holomorphiclabs|2 years ago|reply

Thank you we have been exploring this.

[+] dvt|2 years ago|reply

I think you may be misunderstanding what fine tuning does. It does not teach the model new knowledge. In fact, Meta has a paper out that argues you only need a data set of 1000[1] to achieve pretty good alignment (fine-tuning) results. (100M is way overkill.) For knowledge retrieval, you need RAG (usually using the context window).

[1] https://arxiv.org/pdf/2305.11206.pdf

[+] ozr|2 years ago|reply

This is not correct. Fine-tuning can absolutely add new knowledge to a model. It's been repeatedly demonstrated at this point.

LIMA demonstrated that instruction-tuning and output formatting could be trained with a limited number of samples, not that finetuning was incapable of adding new information to the model.

It may be sub-optimal in most cases to RAG, but it does work.

[+] holomorphiclabs|2 years ago|reply

Our findings are that RAG does not generalize well when critical understanding is shared over a large corpus of information. We do not think it is a question of either context length or retrieval. In our case it is very clearly capturing understanding within the model architecture itself.

[+] ramoz|2 years ago|reply

Depending on the application, you would do continued pretraining over new tokens to gain new knowledge. 100M tokens is applicable here.

You would fine-tune, certainly, for domain-specific tasks, and would curate a subset of the 100M tokens. Total tokens in alignment study references is 1,000,000.

RAG is a hacky way to interpolate new knowledge with a base model. Not always reliable nor easy to integrate into task-specific workflows.

[+] viksit|2 years ago|reply

question: RAG by definition offloads the retrieval to a vector similarity search via embeddings db (faiss, knn et al).

what is the preferred way to feed documents / knowledge into a model so that the primary retrieval is done by the llm, and perhaps use vector db only for information enhancement (a la onebox)?

[+] viksit|2 years ago|reply

if i understand the problem correctly - you'd like to feed xMM documents directly into an LLM so that it uses this context to "reason" answers to questions, vs offload the retrieval to a vector db and merely assemble results into an "answer"?

and since your dataset is large, the longest context windows are insufficient.

[+] xianshou|2 years ago|reply

Single-GPU, optimal efficiency: unsloth + qlora + mistral-7b on runpod/vast/lambda

Blazing fast compared to out-of-the-box transformers, also make sure to use flash attention if you have A100s or better and context length >= 2k

Add FAISS (https://github.com/facebookresearch/faiss) if you need fast local RAG

[+] alxgt|2 years ago|reply

Interested

[+] FezzikTheGiant|2 years ago|reply

I was just gonna ask this question and saw this at the top of Ask. Interested.

48 comments