Classifying customer messages with LLMs vs traditional ML

[+] 19h|2 years ago|reply

We’re classifying gigabytes of intel (SOCMINT / HUMINT) per second and found semantic folding or better in classification quality vs throughput than BERT / LLMs.

How it works — imagine you’re having these sentences:

“Acorn is a tree” and “acorn is an app”

You essentially keep record of all word to word relations internal to a sentence:

- acorn: is, a, an, app, tree Etc.

Now you repeat this for a few gigabytes of text. You’ll end up with a huge map of “word connections”.

You now take the top X words that other words connect to (I.e. 16384). Then you create a vector of 16384 connections, where each word is encoded as 1,0,1,0,1,0,0,0, … (1 is the most connected to word, 0 the second, etc. 1 indicates “is connected” and 0 indicates “no such connection).

You’ll end up with a vector that has a lot of zeroes — you can now sparsify it (I.e. store only the positions of the ones).

You essentially have fingerprints now — what you can do now is to generate fingerprints of entire sentences, paragraphs and texts. Remove the fingerprints of the most common words like “is”, “in”, “a”, “the” etc. and you’ll have a “semantic fingerprint”. Now if you take a lot of example texts and generate fingerprints off it, you can end up with a very small amount of “indices” like maybe 10 numbers that are enough to very reliably identify texts of a specific topic.

Sorry, couldn’t be too specific as I’m on the go - if you’re interested drop me a mail.

We’re using this to categorize literally tens of gigabytes per second with 92% precision into more than 72 categories.

[+] wavemode|2 years ago|reply

I'd be curious how the output of your approach compares to merely classifying based on what keywords are contained in the text (given that AFAICT you're simply categorizing rather than trying to extract precise meaning).

[+] SomewhatLikely|2 years ago|reply

Sounds like TF-IDF vectors.

[+] lgas|2 years ago|reply

Not to dogpile on all the other "isn't this just" messages, but isn't this just sparse embeddings?

[+] espe|2 years ago|reply

very efficient but also brittle. that must be vast amounts of relatively clean data. you have to magically set the number of top n words to in- and exclude. for most user generated content one would need to heavily normalize the text, e.g. by stemming (to keep in line with the computational austerity). 16384 is very little even if it is neatly seperated concepts. applied to that volume of data it should amount to keyword matching.. that only works if users are basically self-tagging their texts via constrained language use.

edit: short version: not semantics and not a fingerprint :)

[+] dr_kiszonka|2 years ago|reply

If I understand your approach correctly, you could represent relations between words as graphs and use graph/network similarity measures (of which there are tons) to possibly get over the 92%. (Or not, I have never tried it.)

[+] spyckie2|2 years ago|reply

Just asking, this seems very similar to the attention algorithm that powers LLMs?

[+] mistrial9|2 years ago|reply

amazing that this streaming pile of characters and its uncreative associations with three-letter-agency code names, results in exactly ninety two percent accuracy.. almost like its profoundly wrong in exactly the most important ways

[+] LewisDavidson|2 years ago|reply

Do you have any code that demonstrates this? Sounds super interesting!

[+] nestorD|2 years ago|reply

LLMs are significantly slower than traditional ML, typically costlier and, I have been told, tend to be less accurate than a traditional model trained on a large dataset.

But, they are zero/few shot classifiers. Meaning that you can get your classification running and reasonably accurate now, collect data and switch to a fine-tuned very efficient traditional ML model later.

[+] SkyPuncher|2 years ago|reply

To me, LLMs feel like "low-code" tools in most applicable domains.

They're very, very good at creating a new, novel solution - but specially trained ML models will rule.

[+] godelski|2 years ago|reply

> LLMs are significantly slower than traditional ML, typically costlier

Literally point 3 in the article.

> But, they are zero/few shot classifiers

This is __NOT__ true. Zero-shot means out of domain, and if we're talking about text trained LLMs, there really isn't anything text that is out of domain for them because they are trained on almost anything you can find on the internet. This is not akin to training something on Tiny Shakespeare and then having it perform sentiment analysis (classification) on Sci-Fi novels. Similarly, training a model on JFT or LAION does not give you the ability to perform zero shot classification on datasets like COCO or ImageNet, since the same semantic data exists in both datasets. I don't know why people started using this term to describe the domain adaptation or transfer learning, but it is not okay. Zero-shot requires novel classes, and subsets are not novel.

[+] hellovai|2 years ago|reply

That's a great summary and insight. We should likely use that verbiage to help make it more crystal clear :)

[+] famouswaffles|2 years ago|reply

Current State of the art (GPT-4) is not going to be less accurate than whatever bespoke option you can cook up.

https://news.ycombinator.com/item?id=36685921

[+] alexmolas|2 years ago|reply

Where's the comparison with traditional ML? In the article I only see the good things about using LLM, but there's no mention to traditional ML besides from the title.

It would be nice to see how compares this "complex" approach against a "simple" TF-IDF + RF or SVM.

[+] specproc|2 years ago|reply

Yeah, my thoughts exactly. If you're running 500k in tokens through through someone else's hallucination-prone computer and paying for the privilege, I want to know why that's any better than something like SetFit.

All I saw were attempts to reproduce some chatgpt output.

[+] hellovai|2 years ago|reply

Thanks Alex, in this article we focused more on deployment comparisons, for example the cost and latency of what it would take to deploy a BERT based model vs LLMs.

In a future article, we're planning on posting accuracy comparisons as well, but here we want to evaluate a few other architectures for comparison. For example, at 1TPS with 1k tokens, chat-gpt-turbo would cost almost $5k vs a simpler BERT model you could run for under $50.

This is probably very obvious to some people, but a lot of people's first experience with any sort of AI is often an LLM, so this is just the first of many posts we hope to share.

[+] jonathankoren|2 years ago|reply

Yeah, I also find the lack of the comparison suspicious. As is the talk about “hallucinated class labels” being “helpful”.

If I had to take a guess, I suspect the LLM might perform a touch better, but we’re taking fractional percent better. Which is fine, if you have the volume, but a wash otherwise

[+] famouswaffles|2 years ago|reply

Current State of the art (GPT-4) is mostly on par with experts and much better than crowdworkers. Might be overkill though.

https://www.artisana.ai/articles/gpt-4-outperforms-elite-cro...

3.5 (what is used here) is better than crowd workers https://arxiv.org/abs/2303.15056

[+] viraptor|2 years ago|reply

Or even slightly fancy Word2vec/USE or even sentence transformers with clustering that you can trivially run locally rather than a full blown conversational LLM. I'd love to see a large scale comparison.

[+] rossirpaulo|2 years ago|reply

This is great! We had a similar thought and couldn't agree more with "LLMs prefer producing something rather than nothing." We have been consistently requesting responses in JSON format, which, despite its numerous advantages, sometimes imposes an obligation for an output even if it shouldn't. This frequently results in hallucinations. Encouraging NULL returns, for example, is a great way to deal with that.

[+] caesil|2 years ago|reply

I've found that this is best dealt with along two axes with constrained options. i.e., request both a string and a boolean, and if you get boolean false you can simply ignore the string. So when the LLM ignores you and prints a string like "This article does not contain mention of sharks", you can discard that easily.

If you tell it "Return what this says about sharks or nothing if it does not mention them", it will mess up.

[+] galleywest200|2 years ago|reply

Have you tried using GPT-4s new Function Call feature? The "killer" portion of this is guaranteed JSON based on a schema you pass to the model.

[+] com2kid|2 years ago|reply

I've run into the same issue, but you can turn it into an advantage if you are careful enough.

Basically, give the LLM a schema that is loose enough for the LLM to expand where it feels expansion is needed. Saying always "return a number" is super limiting if the LLM has figured out you need a range instead. Saying "always populate this field" is silly because sometimes the field doesn't need to be populated.

[+] crazygringo|2 years ago|reply

This is really interesting.

I'm really wondering when LLM's are going to replace humans for ~all first-pass social media and forum moderation.

Obviously humans will always be involved in coming up with moderation policy and judging gray areas and refining moderation policy... but at what point will LLM's do everything else more reliably than humans?

6 months from now? 3 years from now?

[+] ghaff|2 years ago|reply

"Obviously" isn't really that obvious to me. We've seen plenty of companies willing to pass a huge amount of work to automation and if you have the misfortune to be an edge case the automation can't handle, said companies are often perfectly happy to let you fall through the cracks. Cheap and good enough often trump costlier and good.

[+] woeirua|2 years ago|reply

As LLM prices come down social media is going to be absolutely inundated with bots that are indistinguishable from humans. I can't see a world where forums or social media are _useful_ for anything in 10 years unless there are strict gate keepers (e.g. you have to receive a code in person or access is tied directly to your physical identity to access the site).

[+] janalsncm|2 years ago|reply

Back of the envelope calculation says it could be possible now.

Twitter gets about 500M tweets per day, average tweet is 28 characters. So that’s 14B characters per day. Converting to tokens at around 4 char/token that’s around 3.5B tokens per day. If GPT 3.5 turbo pricing is representative it will cost about $0.0015/thousand tokens which is $5k per day. So it’s possible now.

However, you can probably get that cost down a lot with your own models, which also has the benefit of not being at the mercy of arbitrary API pricing.

[+] zht|2 years ago|reply

this is some black mirror stuff

imagine Google's general approach to customer service/moderation, but applied all over the place by companies small and large

I shudder at the thought

[+] adam_arthur|2 years ago|reply

They are already sufficient for high level classification... its just a question of cost.

It's getting tiring reading all the LLM takes from people here who clearly don't use or understand them at all. So many still stuck in the "predicting next token" nonsense, as if humans don't do that too

[+] rckrd|2 years ago|reply

I just released a zero-shot classification API built on LLMs https://github.com/thiggle/api. It always returns structured JSON and only the relevant categories/classes out of the ones you provide.

LLMs are excellent reasoning engines. But nudging them to the desired output is challenging. They might return categories outside the ones that you determined. They might return multiple categories when you only want one (or the opposite — a single category when you want multiple). Even if you steer the AI toward the correct answer, parsing the output can be difficult. Asking the LLM to output structure data works 80% of the time. But the 20% of the time that your code parses the response fails takes up 99% of your time and is unacceptable for most real-world use cases.

[0] https://twitter.com/mattrickard/status/1678603390337822722

[+] Animats|2 years ago|reply

What's the application?

If you're using this to direct messages to approximately the correct department, it doesn't have to be that complicated.

If you're doing this to evaluate customer sentiment, you could probably just select a few hundred messages at random and read them. (There are many "big data" problems which are only big due to not sampling.)

[+] i-am-agi|2 years ago|reply

Wohoo this is amazing! I have been using the Autolabel (https://news.ycombinator.com/item?id=36409201) library so far for labeling a few classification and question answering datasets and have been seeing some great performance. Would be interested in giving gloo a shot as well to see if it helps performance further. Thanks for sharing this :)

[+] unknown|2 years ago|reply

[deleted]

[+] r_singh|2 years ago|reply

I have been using LLMs for ABSA, text classification and even labelling clusters (something that had to be done manually earlier on) and I couldn't be happier.

It was turning out to be expensive earlier but with optimising the prompt a lot, reduced pricing by OpenAI and now also being able to run Guanaco 13/33B locally has made it even more accessible in terms of pricing for millions of pieces of text.

[+] hellovai|2 years ago|reply

That's very interesting! What sort of direction did you head in with prompt optimization? Was it mostly in shrinking it and then using multi-shot examples? We found that shorter prompts (empirically) perform better than longer prompts.

[+] wilg|2 years ago|reply

Classic HN website nitpick: Logo should link to home page. In this case it is a link but just goes to the current page. However, points for being able to easily get to the main product page from the blog, usually that's buried.

[+] hellovai|2 years ago|reply

oh! Good catch! Fixed this, and will update in the release.

[+] m3kw9|2 years ago|reply

Prob cheaper with ML but you need training, with transfer learning though you can use a pub trained model and use way less data to train up a classifier like single digit thousands may be ok with 2-5 sentiments

[+] avereveard|2 years ago|reply

One can use the LLM to generate the label to distill a model to the desired precision, I used that approach and it worked quite well and the model runs locally, including creating the sentence embeddings, faster than the LLM, at a fraction of the cost.

Now certain problem space may be large enough to require models where the runtime makes it non economical to run it locally, but Ml is still a game of heuristics, see each problem requires some experimentation.

[+] andrewgazelka|2 years ago|reply

My understanding was training on ChatGPT output was against OpenAI ToS. Is this incorrect for this use case (training BERT)?

[+] YetAnotherNick|2 years ago|reply

Interested in knowing how you are running BERT model with $35/month? Cheapest GPU instance costs $200-300/month AFAIK.

[+] caycep|2 years ago|reply

what's "traditional ML"?

[+] katrinaroberta|2 years ago|reply

[deleted]

123 comments