While a lot of attention has been given to books3, another large component of this dataset is the deceptively-named "OpenWebText2". What's that? It's a scrape of 15 years' worth of third-party websites that were linked to from upvoted Reddit submissions. I know this includes some of my writing.
observationist|2 years ago
There are ways of copyrighting data, and establishing ownership of intellectual property. Your Tumblr fanfic, YouTube comments, or HN discussions are not legitimate copyright avenues. Stuff you post to legally scrapeable websites is fair game for fair use.
I can do anything I want in private to any data I collect. I could create an awesome HN LLM on the scraped datasets, and use it privately to my heart's content. I can even set up an API to that LLM that generates content, and, given recent rulings, even if I had all the written copyrighted data in the world, as long as I was making good-faith efforts to ensure copyright was being respected and works weren't being recreated verbatim, I could even use that model commercially. I just couldn't sell it to other people, or distribute it, without entering a different legal regime.
I can collect any data I want from public-facing websites.
That's how the internet works; it's how it was designed. There are authentication mechanisms, network configurations, and myriad other access control schemes you can implement to prevent public access. If you post to sites without those mechanisms, you're tacitly agreeing to give up any plausible claims of protection against a wide array of fair uses well established by precedent cases at this point. If you don't prevent public access, and you've got a domain name on a server, you're tacitly inviting the world to come download whatever it is you have on your server. This is a social good. This is what we want when we participate in the internet.
Insisting on some sort of vague entitlement as to how "your" data gets used completely bypasses the fact that anything you consider to be misused in OpenWebText2 fundamentally stems from the fact that you posted the content to a publicly visible website and gave up any say in what happens thereafter. It was scraped fair and square.
Don't complain that you didn't know the rules, or that life isn't fair.
It's not even clear that terms of service or those little popups on public websites have any legal relevance. If your website is open to the public, then it's fair game. If you post content to a public website, then that content's fair game.
quatrefoil|2 years ago
Finally, here's a fun experiment: decide that terms of service don't matter and start building a product by scraping Facebook or Google. See how they react. Actually, no need for guesswork - they have clutched their pearls and threatened legal action more than once before. It's a bit of a "have your cake and eat it too" kind of deal. Their data is precious intellectual property; your stuff is, well, up for grabs.
UncleEntity|2 years ago
Which, incidentally, the New York Times does, and they seem to think they have some legal right to the redistribution of their work.
Maybe they're right, maybe they're wrong, it's up to the courts to decide.
7moritz7|2 years ago
quatrefoil|2 years ago