top | item 39634452

(no title)

SEGyges | 2 years ago

You are uploading 5 billion examples of <something>. You cannot filter it manually, of course, because there are five billion of it. Given that it is the year 2024, how hard is it to be positive that a well-resourced team at Stanford in 2029 will not have better methods of identifying and filtering your data, or a better reference dataset to filter it against, than you do presently?

It is a pretty hard problem.

discuss

vineyardmike|2 years ago

You don’t have to do it manually. There is a database of file hashes.

And this isn’t just “one engineer”. Companies like StabilityAI, Google, etc have used LAION datasets. If you built a dataset you should expend some resources on automated filtering. Don’t include explicit imagery as an intentional choice if you can’t do basic filtering.