Also, auditing any big dataset that is shipped verbatim is practically impossible for now. Instead of including the dataset verbatim, a model that claims to be open-source in a practically useful sense should ship a relatively small procedure for deriving the dataset from some reputable general-purpose dataset. The procedure should be small enough that the resulting dataset can practically be audited.
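To make the idea concrete, here is a rough sketch of what such a derivation procedure might look like. Everything in it is an assumption for illustration: the rule names (`MIN_WORDS`, `BANNED_SUBSTRINGS`), the `keep` and `derive` functions, and the choice to treat records as plain strings are all hypothetical, not taken from any real project. The point is only that the whole recipe fits in a few lines someone can read, and that a digest lets others check they reproduced the same derived dataset.

```python
import hashlib
from typing import Iterable, List, Tuple

# Hypothetical filtering rules: the entire "recipe" an auditor has to read.
MIN_WORDS = 5
BANNED_SUBSTRINGS = ("lorem ipsum",)


def keep(record: str) -> bool:
    """Return True if a base-dataset record passes the (illustrative) rules."""
    text = record.strip().lower()
    if len(text.split()) < MIN_WORDS:
        return False
    return not any(banned in text for banned in BANNED_SUBSTRINGS)


def derive(base: Iterable[str]) -> Tuple[List[str], str]:
    """Derive the dataset from the base records, in order, and fingerprint it.

    The SHA-256 digest covers every kept record followed by a newline, so two
    people running the same rules on the same base data get the same digest.
    """
    derived = [record.strip() for record in base if keep(record)]
    digest = hashlib.sha256()
    for record in derived:
        digest.update(record.encode("utf-8"))
        digest.update(b"\n")
    return derived, digest.hexdigest()
```

In this sketch the model release would publish the rules and the expected digest rather than the data itself; anyone with access to the base dataset could rerun `derive` and compare digests.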
andy99|2 years ago
chromoblob|2 years ago