fumeux_fume | 3 months ago
I just finished a consulting project that involved tagging corrective and preventive actions (CAPAs) for a lab to help organize their QA efforts. Using LLMs to tag free-form text is a common task, and I thought it would be fun to experiment with different strategies for improving tag consistency. The article above presents a good approach because it's a streaming solution, but it comes with drawbacks: more setup overhead, and it treats older data differently from new data. Commenters recommend a batch approach instead: collect all the text up front, use various strategies to cluster it and generate candidate tags, then run an LLM over the records with the predefined tags in its prompt. Once you have enough good tags, you can train your own smaller model to generate them. The batch methods have lower overhead but take more time to tweak and tune for your specific dataset.
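To make the "predefined tags in the prompt" step concrete, here's a minimal sketch. The tag names, record text, and the `call_llm` wrapper are all hypothetical stand-ins, not anything from the actual project:

```python
# Hypothetical fixed tag vocabulary derived from an earlier clustering pass.
ALLOWED_TAGS = ["calibration", "documentation", "training", "equipment"]

def build_tagging_prompt(text: str, allowed_tags: list[str]) -> str:
    """Constrain the model to a fixed tag vocabulary by listing it in the prompt."""
    tag_list = ", ".join(allowed_tags)
    return (
        "Tag the following CAPA record. Respond with a comma-separated "
        f"subset of exactly these tags: {tag_list}.\n\n"
        f"Record: {text}"
    )

prompt = build_tagging_prompt("Thermometer found out of calibration.", ALLOWED_TAGS)
# response = call_llm(prompt)  # placeholder for whatever chat API you use
```

The point of listing the vocabulary explicitly is that the model can only pick from known tags, which keeps tags consistent across records instead of letting the model invent near-duplicates.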
For generating embeddings, I used Cohere's v4 embedder. I found HDBSCAN much more helpful than K-means for clustering the tag embeddings. I also learned that training a PyTorch MLP to predict multiple tags beat a tree-based model in every respect: it gives very good precision but only OK recall, since nailing every tag is hard. Finally, I compared gpt-5-mini and claude-haiku-4.5 for generating tags. gpt-5-mini was much slower but cheaper and better at producing good tags; claude-haiku-4.5 was not far behind, and much faster thanks to the absence of thinking tokens, but much more expensive. The metric I used to compare the LLMs on raw tagging ability was scikit-learn's homogeneity_score.