Open-Source Data Collection Platform for LLM Fine-Tuning and RLHF

I'm Dani, CEO and co-founder of Argilla.

Happy to answer any questions you might have and excited to hear your thoughts!

More about Argilla

GitHub: https://github.com/argilla-io/argilla

Docs: https://docs.argilla.io

Please change the logo. No offense, but I think people would agree that it deserves a better logo. It may be more apparent when you look at the favicon.

carom|2 years ago

Does this support versioning?

xrd|2 years ago

Looks like no quantized options with llama.cpp?

https://github.com/ggerganov/llama.cpp/issues/1602

dvilasuero|2 years ago

We're very much looking forward to seeing Falcon-40B support on llama.cpp. For production use cases, this is also highly relevant: https://huggingface.co/blog/sagemaker-huggingface-llm

sathergate|2 years ago

How does this compare to Scale's or Surge's offerings?

dvilasuero|2 years ago

Thanks! The main difference is that Argilla is built as an open-source component to be integrated into the wider MLOps/LLMOps stack. The focus is on continuous data collection, monitoring, and fine-tuning with open-source and commercial LLMs, as opposed to outsourcing training data collection and one-off labeling projects. In the blog post we put it this way:

Domain Expertise vs Outsourcing. In Argilla, the process of data labeling and curation is not a single event but an iterative component of the ML lifecycle, setting it apart from traditional data labeling platforms. Argilla integrates into the MLOps stack, using feedback loops for continuous data and model refinement. Given the current complexity of LLM feedback, organizations are increasingly leveraging their own internal knowledge and expertise instead of outsourcing training sets to data labeling services. Argilla supports this shift effectively.

I'd love to hear your thoughts on this!

anakin87|2 years ago

In my experience, Argilla is a good open-source platform for data-centric NLP. And these features are a great addition... Have you tried it?

behnamoh|2 years ago

astroturfing maybe?

dvilasuero|2 years ago

Thanks, Anakin! We want to bring the data-centric approach to how LLMs are built and fine-tuned too.
