Nice write-up. Mock data generation with LLMs is pretty tough. We spent time trying to do it across multiple tables and it always had issues. Whether you look at classical ML models like GANs or even LLMs, they struggle with producing a lot of data while respecting FKs, constraints, and other relationships. Maybe someday it gets better, but for now we've found that a more traditional algorithmic approach is more consistent.
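To make "algorithmic approach" concrete, here is a rough sketch, not Neosync's actual implementation. It assumes a hypothetical two-table schema (`users` and `orders`, with `orders.user_id` referencing `users.id`), and the function name and fields are invented for illustration. The idea is to generate parent rows first, then draw child foreign keys only from the parent keys that exist, with a seeded generator so runs are repeatable:

```python
import random


def generate_mock_data(n_users, n_orders, seed=0):
    """Generate parent rows first, then child rows that reference them.

    Hypothetical schema for illustration only:
      users(id PRIMARY KEY, email UNIQUE)
      orders(id PRIMARY KEY, user_id REFERENCES users(id), amount_cents)
    """
    rng = random.Random(seed)  # seeded so the same inputs give the same rows

    # Parent table: sequential ids keep the primary key and email unique.
    users = [
        {"id": i, "email": f"user{i}@example.com"}
        for i in range(1, n_users + 1)
    ]
    user_ids = [u["id"] for u in users]

    # Child table: user_id is sampled only from existing parent ids,
    # so the foreign key holds by construction.
    orders = [
        {
            "id": j,
            "user_id": rng.choice(user_ids),
            "amount_cents": rng.randint(100, 50_000),
        }
        for j in range(1, n_orders + 1)
    ]
    return users, orders


users, orders = generate_mock_data(n_users=5, n_orders=20, seed=42)
```

Because the constraints are enforced by the order of generation rather than checked afterwards, scaling from 20 rows to 20 million does not change whether the FKs hold, which is the consistency an LLM has trouble guaranteeing.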
Transparency: founder of Neosync - open source data anonymization - github.com/nucleuscloud/neosync
its_down_again|1 year ago
I found the challenge with LLMs isn’t generating a "real enough" data point; that’s doable. It’s about, "How do I load this in?", then, "How do I generate hundreds of these?" And even beyond that, "How do I make these pseudo-random in a way that tells a coherent story with the graphs?" It always feels like you’re right on the edge, but getting it to work reliably in the way you need is harder than it looks.
edrenova|1 year ago
juthen|1 year ago