top | item 41722766

edrenova | 1 year ago

Nice write-up. Mock data generation with LLMs is pretty tough. We spent time trying to do it across multiple tables and it always had issues. Whether you use classical ML models like GANs or LLMs, they struggle to produce a lot of data while respecting FKs, constraints, and other relationships.

Maybe some day it'll get better, but for now we've found that a more traditional algorithmic approach is more consistent.
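To illustrate what an algorithmic approach can guarantee that a generative one can't easily: if you generate parent rows first and only ever sample child foreign keys from the parent IDs you just created, referential integrity holds by construction. A minimal sketch (hypothetical table names, not Neosync's actual implementation):

```python
import random
import uuid

random.seed(42)  # deterministic seed -> reproducible mock datasets

# Parent table first: child FKs will only ever be sampled from these IDs.
users = [{"id": str(uuid.uuid4()), "name": f"user_{i}"} for i in range(10)]

# Child table: every FK references an existing user, so the constraint
# is satisfied by construction rather than hoped for after generation.
orders = [
    {
        "id": str(uuid.uuid4()),
        "user_id": random.choice(users)["id"],
        "total_cents": random.randint(100, 50_000),
    }
    for _ in range(50)
]

user_ids = {u["id"] for u in users}
assert all(o["user_id"] in user_ids for o in orders)  # every FK resolves
```

The same topological-order idea extends to deeper schemas: generate tables in dependency order and sample FKs only from already-materialized rows.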

Transparency: founder of Neosync - open source data anonymization - github.com/nucleuscloud/neosync

its_down_again | 1 year ago

I’ve spent some time in enterprise TFO/demo engineering, and this kind of generative tool would’ve been a game changer. Synthetic data sits at the sweet spot of being both "super tough" and in high business demand. When you're working with customer data, it’s pretty risky: just anonymizing PII doesn’t cut it. You’ve got to create data that’s far enough removed from the original to really stay in the clear. But even if you can do that once, AI demos often need thousands of rows to be worthwhile. Without that volume, the visualizations fall flat and the demo has no impact.

The challenge I found with LLMs isn’t generating a "real enough" data point; that’s doable. It’s "How do I load this in?", then "How do I generate hundreds of these?", and beyond that, "How do I make these pseudo-random in a way that tells a coherent story with the graphs?" It always feels like you’re right on the edge, but getting it to work reliably is harder than it looks.
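The "pseudo-random but coherent" part is usually done by composing a deterministic trend with bounded noise, so charts show a story rather than static. A hedged sketch (hypothetical metric and parameters, just one way to do it):

```python
import math
import random
from datetime import date, timedelta

random.seed(7)  # same seed -> the same demo story on every regeneration

def demo_timeseries(days=180, base=1000.0, growth=0.01, noise=0.08):
    """Daily metrics that look organic but still chart a coherent story:
    compounding growth, a weekly cycle, and bounded jitter on top."""
    start = date(2024, 1, 1)
    rows = []
    for d in range(days):
        trend = base * (1 + growth) ** d                    # steady growth
        season = 1 + 0.15 * math.sin(2 * math.pi * d / 7)   # weekly dip/peak
        jitter = 1 + random.uniform(-noise, noise)          # noise never swamps trend
        rows.append({"day": start + timedelta(days=d),
                     "value": round(trend * season * jitter, 2)})
    return rows

rows = demo_timeseries()
# the story survives the noise: the last month averages above the first
first = sum(r["value"] for r in rows[:30]) / 30
last = sum(r["value"] for r in rows[-30:]) / 30
assert last > first
```

Because the noise is bounded relative to the trend, every regenerated dataset still plots as "up and to the right", which is what makes the demo land.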

edrenova | 1 year ago

Yup, agreed. We built an orchestration engine into Neosync for exactly that reason. It handles all of the reading/writing from DBs for you, and it can also generate data from scratch (with LLMs or without).

juthen | 1 year ago

GANs are barely ten years old and already they have reached the classical ML algorithm status.