I genuinely don't understand the "permissioned data" assumption. Presumably, all the current models that were trained on illegitimate scraping of vastly larger sources will always have the upper hand (in terms of raw power, obviously at the cost of regurgitating evil stuff too), because they have simply absorbed far more diverse data in their training. So models trained only on ethical datasets will not be able to compete, unless they too rely on a common base of "foundational sin" data and just add those datasets as an ethical layer to cover the rotten roots. Is it really possible to start training from scratch at this stage and compete with the existing models, using only ethical datasets? Hasn't it been established that without the stolen data, those models could not exist or compete?
jd172|1 month ago
Whether or not it's possible to compete, I guess we'll see. But I am hopeful, and appreciative that Mozilla is trying, as I am getting tired of big tech trying to force everyone to hand over even more unhinged amounts of data than what they're already taking from us.
b1085436|1 month ago