rsaha7's comments
rsaha7 | 2 years ago | on: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing
The goal is to extend the training optimization techniques to beyond LoRA / QLoRA :)
Happy to have you join our team!
rsaha7 | 2 years ago | on: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing
1. The largest model that we have tested is Llama2 13B. For the first phase, we focussed on fine-tuning LLMs in the 1B-13B range. For our next phase, we will focus on roughly the 13B-45B range; for this we will have to incorporate distributed training techniques.
2. Once distributed training is in place, we will be able to run MoE-based models such as Mixtral.
rsaha7 | 2 years ago | on: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing
As for the test coverage, right now the toolkit includes property-based unit tests. For instance, for an LLM fine-tuned on summarization, a property test checks that the summarized text is shorter than the input text.
Similar to the above, we have a handful of property-based tests. Of course, the list is not exhaustive at this time; as more progress is made on the testing side, we aim to distill the most relevant tests for each use case.
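A minimal sketch of such a property-based test in Python. The `check_summary_is_shorter` helper and the toy model below are hypothetical stand-ins, not the toolkit's actual API:

```python
def check_summary_is_shorter(summarize, inputs):
    """Property: each summary must be shorter than the text it summarizes.

    `summarize` is any callable mapping input text -> summary text;
    here it is a hypothetical stand-in for the fine-tuned LLM.
    """
    failures = [text for text in inputs if len(summarize(text)) >= len(text)]
    return failures  # empty list => the property held for every input


# Toy usage with a trivial "model" that just truncates its input:
toy_model = lambda text: text[: len(text) // 2]
docs = ["a long document about fine-tuning", "another sample input text"]
assert check_summary_is_shorter(toy_model, docs) == []
```

The same pattern generalizes to other tasks: pick a cheap, model-agnostic property (length, label set membership, format) and assert it over a batch of inputs.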
Hope this helps.
rsaha7 | 2 years ago | on: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing
We focussed on simplifying the experimentation workflow that a data scientist or engineer typically goes through.
For instance, if you want to find the best LLM with the best configuration for your dataset, you would ideally run an ablation study (think grid search over learning rate, number of epochs, etc.). Showing that kind of progress in a UI would be challenging.
The ideal user of the toolkit sets all the experiment details in a config file, runs it from the terminal, and comes back to it after a day or so, depending on how big the search space is.
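As a rough illustration, a config-driven ablation expands into one run per combination of settings. The field names below are hypothetical, not the toolkit's actual config schema:

```python
from itertools import product

# Hypothetical ablation config: each key maps to the values to sweep over.
config = {
    "learning_rate": [1e-5, 3e-5, 1e-4],
    "num_epochs": [1, 3],
    "lora_rank": [8, 16],
}

def expand_grid(config):
    """Expand the config into one concrete hyperparameter setting per run."""
    keys = list(config)
    return [dict(zip(keys, values))
            for values in product(*(config[key] for key in keys))]

runs = expand_grid(config)
print(len(runs))  # 3 * 2 * 2 = 12 fine-tuning experiments to launch
```

This is why the terminal workflow fits: the grid multiplies quickly, and each run can take hours, so a long-lived batch job makes more sense than watching a UI.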
rsaha7 | 2 years ago | on: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing
So, that would include Llama2, Falcon, Mistral and the likes.
rsaha7 | 2 years ago | on: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing
1. Basic - set up your first simple fine-tuning experiment
2. Intermediate - create custom config files for specialized fine-tuning experiments
3. Advanced - run ablation studies through the same config file by defining various settings!
rsaha7 | 2 years ago | on: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing
Right now, the roadmap includes extending the training optimization section to include techniques beyond LoRA.
Furthermore, the testing suite will be extended with more task-dependent unit tests.
I know that other repositories exist with similar functionality, but they can be too low-level for the day-to-day data scientist to understand. Others are too narrowly focused on just testing or just fine-tuning. Our repository consolidates the most critical aspects of running fine-tuning experiments while staying lightweight enough for anyone to understand and play with.
rsaha7 | 2 years ago | on: Show HN: finetune LLMs via the Finetuning Hub
On a separate note, I have received a few questions about the value-add of this repository. Here is my take and my vision for this repository:
Before starting this project, I realised that while there are a ton of resources that talk about using these models for chat inference and QnA over documents, no one did a good job of stress-testing them on sample complexity.
We all know that LLMs generalise well, but how do they actually compare to the likes of BERT and DistilBERT, which have become household names in the world of NLP? Can these LLMs match them on tasks beyond chat, like classification and named entity recognition?
If you go over to a model folder, let's say Flan or Falcon, you will notice that the README has rich documentation of our research findings. This, I guarantee you, you won't find anywhere else. Additionally, the inference section has a good study of how these models fare as the number of requests goes up, and the associated costs.
I will end by saying that a lot of people and repositories are just riding the wave of buzz surrounding LLMs without answering the questions that data scientists and ML engineers actually have. Those questions (the 4 pillars of our evaluation framework) need to be answered before enterprises can build software, rather than just slapping a chat interface / UI on top of the latest LLM and calling it a revolutionary product.
rsaha7 | 2 years ago | on: Show HN: finetune LLMs via the Finetuning Hub
This repository argues that LLMs can be used for applications beyond just chat and QnA. Based on our experimental findings (which you would have found if you had the time to go through the README under any model folder), you can see LLMs do classification tasks really well in low-data situations. For the 99% of startups who don't have the luxury of holding thousands of annotated samples like FAANG, LLMs provide a good way to get started with few annotated samples. At the end of the day, these models are built on the same attention-based transformer architecture.
I would be curious to see some quantitative backing for your statements, not just links to Hugging Face's website and conjecture.
And btw, the entire ecosystem is trying to answer a lot of these questions because it is still too early to predict anything. And here you are claiming they are absolutely nonsensical for 99% of companies.
Btw did you know that a lot of companies cannot use third-party APIs because of sensitive customer data? For them, having self-hosted models is a good alternative to have. And with the likes of Llama2 and Falcon closing the performance gap, the idea of self-hosted models for tasks beyond chat does not seem far-fetched.
rsaha7 | 2 years ago | on: Show HN: finetune LLMs via the Finetuning Hub
Next release will have these features.
rsaha7 | 2 years ago | on: Show HN: finetune LLMs via the Finetuning Hub
That being said, we are working on adding instructions for specifying the dataset and the prompt that users want to use.
rsaha7 | 2 years ago | on: Show HN: finetune LLMs via the Finetuning Hub
More than happy to have you contribute to the repo. There’s a lot of exciting work to be done.
Right now, we are focussed mostly on offering support for open-source models but we can definitely extend support for OpenAI formats.
May I ask what history means?