You are uploading 5 billion examples of <something>. You cannot filter it manually, of course, because there are five billion of it. Given that it is the year 2024, how hard is it to be positive that a well-resourced team at Stanford in 2029 will not have better methods of identifying and filtering your data, or a better reference dataset to filter it against, than you do presently?It is a pretty hard problem.
vineyardmike|2 years ago
And this isn’t just “one engineer”. Companies like StabilityAI, Google, etc have used LAION datasets. If you built a dataset you should expend some resources on automated filtering. Don’t include explicit imagery as an intentional choice if you can’t do basic filtering.