rsaha7's comments
rsaha7 | 2 years ago | on: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing
The goal is to extend the training optimization techniques to beyond LoRA / QLoRA :)
Happy to have you join our team!
rsaha7 | 2 years ago | on: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing
1. The largest model that we have tested is Llama2 13B. For the first phase, we focussed on fine-tuning LLMs in the 1B-13B range. For our next phase, we will focus on roughly the 13B-45B range; for this we will have to incorporate distributed training techniques.
2. Once distributed training is in place, we will be able to run MoE-based models such as Mixtral.
rsaha7 | 2 years ago | on: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing
As for the test coverage, right now the toolkit includes property-based unit tests. For instance, for an LLM fine-tuned on summarization, a property test checks that the summarized text is shorter than the input text.
Similar to the above, we have a handful of property-based tests. Of course, the list is not exhaustive at this time; as more progress is made on the testing side, we aim to distill the most relevant tests for each use case.
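A minimal sketch of such a property-based test in Python. The `check_summary_is_shorter` helper and the toy model below are hypothetical stand-ins, not the toolkit's actual API:

```python
def check_summary_is_shorter(summarize, inputs):
    """Property: each summary must be shorter than the text it summarizes.

    `summarize` is any callable mapping input text -> summary text;
    here it is a hypothetical stand-in for the fine-tuned LLM.
    """
    failures = [text for text in inputs if len(summarize(text)) >= len(text)]
    return failures  # empty list => the property held for every input


# Toy usage with a trivial "model" that just truncates its input:
toy_model = lambda text: text[: len(text) // 2]
docs = ["a long document about fine-tuning", "another sample input text"]
assert check_summary_is_shorter(toy_model, docs) == []
```

The same pattern generalizes to other tasks: pick a cheap, model-agnostic property (length, label set membership, format) and assert it over a batch of inputs.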
Hope this helps.
rsaha7 | 2 years ago | on: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing
We focussed on simplifying the experimentation workflow that a data scientist or engineer typically goes through.
For instance, if you want to find the best LLM with the best configuration for your dataset, you would ideally run an ablation study (think grid search over learning rate, number of epochs, etc.). Showing that kind of progress in a UI would be challenging.
The ideal user of the toolkit sets all the experiment details in a config file, runs it from the terminal, and comes back to it after a day or so, depending on how big the search space is.
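As a rough illustration, a config-driven ablation expands into one run per combination of settings. The field names below are hypothetical, not the toolkit's actual config schema:

```python
from itertools import product

# Hypothetical ablation config: each key maps to the values to sweep over.
config = {
    "learning_rate": [1e-5, 3e-5, 1e-4],
    "num_epochs": [1, 3],
    "lora_rank": [8, 16],
}

def expand_grid(config):
    """Expand the config into one concrete hyperparameter setting per run."""
    keys = list(config)
    return [dict(zip(keys, values))
            for values in product(*(config[key] for key in keys))]

runs = expand_grid(config)
print(len(runs))  # 3 * 2 * 2 = 12 fine-tuning experiments to launch
```

This is why the terminal workflow fits: the grid multiplies quickly, and each run can take hours, so a long-lived batch job makes more sense than watching a UI.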
rsaha7 | 2 years ago | on: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing
So, that would include Llama2, Falcon, Mistral and the likes.
rsaha7 | 2 years ago | on: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing
1. Basic - set up your first simple fine-tuning experiment
2. Intermediate - create custom config files for specialized fine-tuning experiments
3. Advanced - run ablation studies through the same config file by defining various settings!
rsaha7 | 2 years ago | on: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing
Right now, the roadmap includes extending the training optimization section to include techniques beyond LoRA.
Furthermore, the testing suite will be extended with more task-dependent unit tests.
I know that other repositories exist with similar functionality, but they can be too low-level for the day-to-day data scientist to understand. Others are too narrowly focused on just testing or just fine-tuning. Our repository consolidates the most critical aspects of running fine-tuning experiments while staying lightweight enough for anyone to understand and play with.
rsaha7 | 2 years ago | on: Show HN: finetune LLMs via the Finetuning Hub
On a separate note, I have received a few questions about the value-add of this repository. Here is my take and my vision for this repository:
Before starting this project, I realised that while there are a ton of resources that talk about using these models for chat inference and QnA over documents, no one did a good job of stress-testing them on sample complexity.
We all know that LLMs generalise well, but how do they actually compare to the likes of BERT and DistilBERT, which have become household names in the world of NLP? Can these LLMs match them on tasks beyond chat, like classification and named entity recognition?
If you go over to a model folder, let's say Flan or Falcon, you will notice that the README has rich documentation of our research findings. This, I guarantee you, you won't find anywhere else. Additionally, the inference section has a good study of how these models fare as the number of requests goes up, and the associated costs.
I will end by saying that a lot of people and repositories are just riding the wave of buzz surrounding LLMs without answering the questions that data scientists and ML engineers actually have. Those questions (the 4 pillars of our evaluation framework) need to be answered before enterprises can build software, rather than just slapping a chat interface / UI on top of the latest LLM and calling it a revolutionary product.
rsaha7 | 2 years ago | on: Show HN: finetune LLMs via the Finetuning Hub
This repository argues that LLMs can be used for applications beyond just chat and QnA. Based on our experimental findings (which you would have found if you had the time to go through the README under any model folder), you can see LLMs do classification tasks really well in low-data situations. For the 99% of startups who don't have the luxury of holding thousands of annotated samples like FAANG, LLMs provide a good way to get started with few annotated samples. At the end of the day, these models are built on the same attention-based transformer architecture.
I would be curious to see some quantitative backing for your statements, not just links to Hugging Face's website and conjecture.
And btw, the entire ecosystem is trying to answer a lot of these questions because it is still too early to predict anything. And here you are claiming they are absolutely nonsensical for 99% of companies.
Btw did you know that a lot of companies cannot use third-party APIs because of sensitive customer data? For them, having self-hosted models is a good alternative to have. And with the likes of Llama2 and Falcon closing the performance gap, the idea of self-hosted models for tasks beyond chat does not seem far-fetched.
rsaha7 | 2 years ago | on: Show HN: finetune LLMs via the Finetuning Hub
Next release will have these features.
rsaha7 | 2 years ago | on: Show HN: finetune LLMs via the Finetuning Hub
That being said, we are working on adding instructions for specifying the dataset and the prompt that users want to use.
rsaha7 | 2 years ago | on: Show HN: finetune LLMs via the Finetuning Hub
More than happy to have you contribute to the repo. There’s a lot of exciting work to be done.
Right now, we are focussed mostly on offering support for open-source models but we can definitely extend support for OpenAI formats.
May I ask what history means?