top | item 36385061

poomer | 2 years ago

At work we were facing this dilemma. Our team is working on a model to detect fraud/scam messages; in production it needs to label ~500k messages a day at low cost. We wanted to train a basic GBT/BERT model to run locally, but we considered using GPT-4 as a label source instead of our usual human labelers.

For us, human labeling is surprisingly cheap; the main advantage of GPT-4 would be that it is much faster. Since scams are always changing, we could generate new labels regularly and continuously retrain our model.

In the end we didn't go down that route, there were several problems:

- GPT-4's accuracy wasn't as good as our human labelers'. I believe this is because scam messages are intentionally tricky and require a much more general understanding of the world than the datasets used in this article, which feature simpler labeling problems. Also, I don't trust that there was no funny business in generating the results for this blog, since there is a clear conflict of interest with the business that owns it.

- GPT-4 would be consistently fooled by certain types of scams, whereas human annotators work off a consensus procedure. This could probably be solved in the future when there's a larger pool of high-quality LLMs available and we can pool them for consensus.

- Concern that some PII would accidentally be sent to OpenAI; of course, nobody trusts that those guys will treat our customers' data with any appropriate level of ethics.
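The consensus procedure mentioned above could be sketched as a simple majority vote over several independent labelers. Everything here is hypothetical scaffolding: `labelers` stands in for wrappers around different LLM APIs or human annotators, none of which are named in the thread.

```python
from collections import Counter

def consensus_label(message, labelers, min_agreement=0.5):
    """Majority-vote a label from several independent labelers.

    `labelers` is a list of callables (e.g. wrappers around different
    LLM endpoints or human annotators) that each return a label string.
    Returns (label, agreement) when the top label clears the threshold,
    otherwise (None, agreement) so the message can be escalated.
    """
    votes = [labeler(message) for labeler in labelers]
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement >= min_agreement:
        return label, agreement
    return None, agreement
```

With three labelers voting scam/scam/not_scam, the result is ("scam", 0.667); raising `min_agreement` above 2/3 would instead return None and route the message to a human.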

bitshiftfaced|2 years ago

I wonder if the LLM could at least reliably label something as "smells funny," and then you could have human labelers work only on that smaller, refined batch. But like you said, PII is a concern. In any case, at the rate it's going, does anyone really doubt that LLMs one or two years out will have the same problem?
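That triage idea could look like the sketch below: an LLM-derived suspicion score splits traffic into an auto-cleared set and a smaller "smells funny" batch for humans. The `llm_score` callable and the 0.5 threshold are placeholders, not anything from the thread.

```python
def triage(messages, llm_score, threshold=0.5):
    """Split messages into a flagged batch for human review and an
    auto-cleared batch, using a hypothetical llm_score() that returns
    a suspicion probability in [0, 1]."""
    flagged, cleared = [], []
    for msg in messages:
        if llm_score(msg) >= threshold:
            flagged.append(msg)   # "smells funny" -> human labelers
        else:
            cleared.append(msg)   # skip human review entirely
    return flagged, cleared
```

The payoff depends on the false-negative rate of the cheap scorer: anything it wrongly clears never reaches a human, so in practice you'd tune the threshold conservatively.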

nihit-desai|2 years ago

>> don't trust that there was no funny business going on in generating the results for this blog

All the datasets and labeling configs used for these experiments are available in our GitHub repo (https://github.com/refuel-ai/autolabel), as mentioned in the report. Hope these are useful!

poomer|2 years ago

Thank you, I appreciate your transparency with this work.

mycall|2 years ago

Did you consider fine-tuning your own copy of GPT-4 so it can handle scam messages better? I'm doing something similar with Azure OpenAI Services and a custom vector database to handle ham/spam labeling for some of my customer feedback APIs.
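The vector-database approach described here can be reduced to a nearest-neighbor lookup: embed the incoming message and assign the label of the most similar stored example. This is a minimal pure-Python sketch; the embeddings, labels, and the idea of using raw cosine similarity are all assumptions, not details from the comment.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest_label(query_vec, examples):
    """examples: list of (embedding, label) pairs from a vector store.
    Returns the label of the most similar stored example."""
    _, best_label = max(examples, key=lambda e: cosine(query_vec, e[0]))
    return best_label
```

A real setup would use a proper embedding model and an approximate-nearest-neighbor index rather than a linear scan, but the labeling logic is the same.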