
(no title)

pitah1 | 1 year ago

The world of mock data generation is now flooded with ML/AI solutions that generate data directly, but this one recognizes that it is better to generate metadata and let that guide the data generation. I found this to be the case because the former solutions rely on production data, need retraining, are slow, consume huge resources, give no guarantee against leaking sensitive data, and cannot retain referential integrity.

As mentioned in the article, I think there is a lot of room for improvement in this area. I've been working on a tool called Data Caterer (https://github.com/data-catering/data-caterer), a metadata-driven data generator that can also run validations based on the generated data. That gives you full end-to-end testing with a single tool. There are also other metadata sources that can help drive these kinds of tools outside of using LLMs (e.g. data catalogs, data quality).
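
To make the metadata-driven idea a bit more concrete, here is a minimal Python sketch. It is not Data Caterer's API or configuration format; the `METADATA` layout, the field types (`id`, `float`, `ref`), and the `generate` and `validate` helpers are all hypothetical names made up for illustration. The point is only that a small metadata description can drive both generation (including foreign-key references) and validation, without touching production data.

```python
import random

# Hypothetical metadata layout (an assumption for illustration only).
# Parent tables must be declared before the tables that reference them.
METADATA = {
    "accounts": {
        "count": 5,
        "fields": {
            "account_id": {"type": "id", "prefix": "ACC"},
            "balance": {"type": "float", "min": 0.0, "max": 1000.0},
        },
    },
    "transactions": {
        "count": 20,
        "fields": {
            "txn_id": {"type": "id", "prefix": "TXN"},
            "account_id": {"type": "ref", "table": "accounts", "field": "account_id"},
            "amount": {"type": "float", "min": 1.0, "max": 500.0},
        },
    },
}


def generate(metadata, seed=0):
    """Generate rows for every table described in the metadata."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    data = {}
    for table, spec in metadata.items():
        rows = []
        for i in range(spec["count"]):
            row = {}
            for name, field in spec["fields"].items():
                kind = field["type"]
                if kind == "id":
                    row[name] = f"{field['prefix']}{i:04d}"
                elif kind == "float":
                    row[name] = round(rng.uniform(field["min"], field["max"]), 2)
                elif kind == "ref":
                    # Pick an existing parent value so the reference always resolves.
                    parent_rows = data[field["table"]]
                    row[name] = rng.choice(parent_rows)[field["field"]]
                else:
                    raise ValueError(f"unknown field type: {kind}")
            rows.append(row)
        data[table] = rows
    return data


def validate(metadata, data):
    """Check the data against the same metadata; return a list of error strings."""
    errors = []
    for table, spec in metadata.items():
        rows = data[table]
        if len(rows) != spec["count"]:
            errors.append(f"{table}: expected {spec['count']} rows, got {len(rows)}")
        for name, field in spec["fields"].items():
            if field["type"] == "float":
                for row in rows:
                    if not field["min"] <= row[name] <= field["max"]:
                        errors.append(f"{table}.{name}: {row[name]} out of range")
            elif field["type"] == "ref":
                valid = {p[field["field"]] for p in data[field["table"]]}
                for row in rows:
                    if row[name] not in valid:
                        errors.append(f"{table}.{name}: dangling reference {row[name]}")
    return errors


if __name__ == "__main__":
    generated = generate(METADATA)
    print(validate(METADATA, generated))
```

Because the same metadata feeds both functions, the validation step doubles as a check on whatever system consumed the generated data: run the pipeline, then validate its output against the rules you generated from.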
