Show HN: Autolabel, a Python library to label and enrich text data with LLMs
153 points| nihit-desai | 2 years ago |github.com
We built Autolabel because access to clean, labeled data is a huge bottleneck for most ML/data science teams. The most capable LLMs are able to label data with high accuracy, and at a fraction of the cost and time compared to manual labeling. With Autolabel, you can leverage LLMs to label any text dataset with <5 lines of code.
We’re eager for your feedback!
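For readers new to the idea, here is a minimal sketch of what prompt-based labeling can look like in plain Python. It is not Autolabel's API: `build_prompt`, `parse_label`, `label_dataset`, and the `fake_llm` stand-in are all hypothetical names made up for this illustration, and a real setup would replace `fake_llm` with a call to an actual model.

```python
from typing import Callable, List, Optional, Tuple


def build_prompt(text: str, labels: List[str], examples: List[Tuple[str, str]]) -> str:
    """Build a few-shot classification prompt ending at the label to be filled in."""
    parts = ["Classify the text into one of: " + ", ".join(labels) + "."]
    for ex_text, ex_label in examples:
        parts.append(f"Text: {ex_text}\nLabel: {ex_label}")
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)


def parse_label(response: str, labels: List[str]) -> Optional[str]:
    """Map the raw model response onto a known label, or None if it matches nothing."""
    cleaned = response.strip().lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    return None


def label_dataset(
    texts: List[str],
    labels: List[str],
    examples: List[Tuple[str, str]],
    call_llm: Callable[[str], str],
) -> List[Optional[str]]:
    """Label every text by prompting the model once per row."""
    return [parse_label(call_llm(build_prompt(t, labels, examples)), labels) for t in texts]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call: a keyword rule on the last text."""
    last_text = prompt.rsplit("Text: ", 1)[1].lower()
    return "positive" if ("love" in last_text or "great" in last_text) else "negative"


labels = ["positive", "negative"]
examples = [("Great docs", "positive"), ("Crashes constantly", "negative")]
texts = ["I love this library", "This was a waste of time"]
results = label_dataset(texts, labels, examples, fake_llm)
```

Returning `None` for responses that do not match a known label keeps unparseable outputs visible, so they can be reviewed by hand instead of being silently mislabeled.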
[+] [-] bomewish|2 years ago|reply
[+] [-] nihit-desai|2 years ago|reply
Autolabel is quite orthogonal to this - it's a library that makes interacting with LLMs very easy for labeling text datasets for NLP tasks.
We are actively looking at integrating function calling into Autolabel though, to improve label quality and support downstream processing.
[+] [-] devjab|2 years ago|reply
But the key issue is going to be privacy. I’m not big on LLMs, so I’m sorry if this is obvious, but can I use something like this without sending my data outside my own organisation?
[+] [-] oli5679|2 years ago|reply
https://github.com/ggerganov/llama.cpp
You need to be careful about licensing: for some of these models, it's a legal grey area whether you can use them for commercial projects.
The 'best' models require some quite large hardware to run, but a popular compression methodology at the moment is 'quantization', using lower precision model weights. I find it a bit hard to evaluate which open source models are better than others, and how they are impacted by quantization.
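To make "lower precision model weights" concrete, here is a rough sketch of symmetric 8-bit quantization in plain Python. It is a toy illustration under simple assumptions (one scale for the whole list, made-up weight values, hypothetical function names), not how llama.cpp or any particular library implements it.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs 8 bits instead of 32, at the cost of an error of at most half a step (`scale / 2`) per weight. How much that error hurts output quality depends on the model, which is part of why comparisons are hard.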
You can also use the OpenAI API. They don't use the data to train their models. They store it for 30 days for fraud monitoring, and then delete it. It doesn't seem hugely different from using something like Slack, Google Docs, or AWS.
I think some people imagine their data will end up in the knowledge base of GPT-5 if they use OpenAI products, but this would be a clear breach of the TOS.
https://openai.com/policies/api-data-usage-policies
[+] [-] nihit-desai|2 years ago|reply
[+] [-] viswajithiii|2 years ago|reply
[+] [-] msp26|2 years ago|reply
How does this work exactly?
[+] [-] isawczuk|2 years ago|reply
[+] [-] Takennickname|2 years ago|reply
Pirate all LLMs. They're all yours anyway.
[+] [-] victorbjorklund|2 years ago|reply
[+] [-] applgo443|2 years ago|reply
[+] [-] voz_|2 years ago|reply
It's one thing to show HN / share, it's another thing to spam it with your ads.
[+] [-] nihit-desai|2 years ago|reply
The earlier post was a report summarizing LLM labeling benchmarking results. This post shares the open source library.
Neither is intended to be an ad. Our hope in sharing these is to demonstrate how LLMs can be used for data labeling, and to get feedback from the community.