top | item 37384606

snissn | 2 years ago

Hey! I don’t understand enough abt llms. Fine tuning seems like something great but I feel locked out of it. I need to prepare data in a question answer format? I have started to play with taking things like text, articles, tweets and converting them to questions but I don’t think I’m doing best practices. Can you help explain how to take different data sources maybe like a list of documentation for an open source project and fine tune using it?

rsaha7 | 2 years ago

Great feedback! We are working on adding instructions for loading custom datasets for your own needs: what the format of the prompt should be, and so on.

Next release will have these features.

BoorishBears | 2 years ago

To intentionally oversimplify, fine-tuning an LLM on your data is a completely nonsensical concept for 99% of the world.

People have the impression that training an LLM on your data will result in an LLM that can answer questions about your data. But for any realistic dataset, and any amount of training a non-FAANG company can do, that's not true.

> Can you help explain how to take different data sources maybe like a list of documentation for an open source project and fine tune using it?

You would not do this.

Let's say you're writing code with an open source library. There's a new animation API that didn't exist when the LLM was trained:

1. You ask your coding chatbot: "How do I make this box move right to left across my screen?"

2. Before the chatbot UI submits the question to the LLM, it searches the library's documentation for text related to your question, using BM25F and BERT

3. You give the LLM both the results of that search and the user's question, at the same time.

The LLM now has a snippet of up-to-date documentation, and can look at that to produce novel code that animates the box based on the documentation.

Depending on latency requirements you can have a "Step 2.5", where you ask the LLM "What searches would you do if I gave you the docs for this library and you needed to answer <insert question>".
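The retrieve-then-augment flow above can be sketched in a few lines. This is a toy illustration, not a real implementation: the `score` function is a keyword-overlap stand-in for the BM25F/BERT retrieval described here, and the prompt layout, document strings, and function names are all made up for the example.

```python
# Minimal retrieval-augmented prompting sketch.
# score() is a toy keyword-overlap stand-in for real BM25F/BERT retrieval.

def score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query words that appear in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

def build_prompt(question: str, docs: list[str], top_k: int = 1) -> str:
    """Retrieve the best-matching snippets and prepend them to the question."""
    ranked = sorted(docs, key=lambda d: score(question, d), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Documentation:\n{context}\n\nQuestion: {question}"

# Imaginary documentation snippets for an imaginary library:
docs = [
    "animate(box, from_x, to_x) moves a box horizontally across the screen",
    "The Grid component arranges children in rows and columns",
]
prompt = build_prompt(
    "How do I make this box move right to left across my screen?", docs
)
# The chatbot UI would now send `prompt` to the LLM instead of the bare question.
```

The point is only the shape of the pipeline: search first, then hand the winning snippet and the question to the LLM together.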

_

Here, BERT is being used to find snippets of text that are more likely to help us answer a question.

For example, this model: https://huggingface.co/thenlper/gte-large

When given the query: Reference documentation for 'Write some code to move this square across my screen'

It ranks some imaginary documentation in the following order:

1. "Object translation has been reworked in the new animation API"

2. "Layout components include the Box, Grid, and Stack"

3. "Move your hosting to AWS with our cloud build API"

The BERT model "understands" that while we used the words moving and square:

- "Move your hosting" is a semantically different concept

- "Box" is similar to "Square", but "Box" is not central to the request.

Now we can give the LLM the most relevant snippet, and it uses that as guidance for its own reply (also known as Retrieval-Augmented Generation).
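The ranking an embedding model like gte-large produces boils down to cosine similarity between vectors. A sketch of that step, using hand-made 3-dimensional toy vectors in place of the model's real high-dimensional embeddings (the numbers are invented so that the ranking matches the example above):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for real sentence embeddings.
query_vec = [0.9, 0.1, 0.0]
snippets = {
    "Object translation has been reworked in the new animation API": [0.8, 0.2, 0.1],
    "Layout components include the Box, Grid, and Stack": [0.4, 0.7, 0.1],
    "Move your hosting to AWS with our cloud build API": [0.1, 0.1, 0.9],
}

# Rank snippets by similarity to the query, highest first.
ranked = sorted(snippets, key=lambda s: cosine(query_vec, snippets[s]), reverse=True)
```

With a real model you would call its encoder on the query and each snippet to get the vectors; the sorting step is the same.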

_

There is a place where fine-tuning can actually be applied in this process, but it is not fine-tuning a (chat) LLM. It is completely unrelated to most mentions of fine-tuning you've heard in the last X months.

You can fine-tune BERT (a much smaller model) to get better at finding relevant snippets of your documentation. You can do this without labeled data.

*Literally give it a bag of sentences and let it go to work*: https://www.sbert.net/examples/unsupervised_learning/TSDAE/R...

TSDAE doesn't really perform that well on wide domains, but it works well where the documents contain both information and what that information is for (think code documentation with examples, vs. Wikipedia, which is just raw information). It also only takes ~1k sentences to start; you could find a bunch of random documentation sites on GitHub and feed them in.
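The "bag of sentences" part works because TSDAE is a denoising autoencoder: each sentence is corrupted (by default, a large fraction of its tokens are deleted) and the encoder is trained to reconstruct the original, which forces it to produce meaningful sentence embeddings without any labels. A minimal sketch of just the corruption step is below; the deletion ratio and the real training loop (sbert's denoising-autoencoder dataset and loss) are described in the linked docs, and this toy function is only an illustration of the idea.

```python
import random

def delete_noise(sentence: str, del_ratio: float = 0.6, seed: int = 0) -> str:
    """TSDAE-style input corruption: randomly delete a fraction of tokens.

    During training, the encoder sees the noisy sentence and must
    reconstruct the original one, with no labeled data required.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    kept = [t for t in tokens if rng.random() > del_ratio]
    # Never hand the model a completely empty input.
    return " ".join(kept) if kept else tokens[0]

noisy = delete_noise("animate moves a box horizontally across the screen")
```

In the actual pipeline you would generate (noisy, original) pairs like this for every sentence in your documentation dump and train the encoder to map one back to the other.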

rsaha7 | 2 years ago

I don’t think you fully understand the scope of this project. Your thinking and arguments are limited by your understanding of what all is possible with these models.

This repository argues that LLMs can be used for applications beyond just chat and Q&A. Based on our experimental findings (which you would have found if you had taken the time to go through the README under any model folder), you can see that LLMs do classification tasks really well in low-data situations. For the 99% of startups who don't have the luxury of holding thousands of annotated samples like FAANG, LLMs provide a good way to get started with few annotated samples. At the end of the day, these models are based on the transformer attention architecture.

I would be curious to see some quantitative backing for your statements, and not just links to Hugging Face's website and conjectures.

And btw, the entire ecosystem is trying to answer a lot of these questions because it is still too early to predict anything. And here you are claiming it is absolutely nonsensical for 99% of companies.

Btw, did you know that a lot of companies cannot use third-party APIs because of sensitive customer data? For them, having self-hosted models is a good alternative. And with the likes of Llama 2 and Falcon closing the performance gap, the idea of self-hosted models for tasks beyond chat does not seem far-fetched.