top | item 39634211

While a lot of attention has been given to books3, another large component of this dataset is the deceptively-named "OpenWebText2". What's that? It's a scrape of 15 years' worth of third-party websites that were linked to from upvoted Reddit submissions. I know this includes some of my writing.

observationist|2 years ago

Relevance and impact aside, if you publish something to the internet on a site with no access restriction in place, I don't know how you can keep a straight face while claiming some sort of moral right to the content. It's the equivalent of broadcasting it over radio, or printing and delivering it straight to the doorsteps of millions of random individuals. Methinks you doth protest too much, or something.

There are ways of copyrighting data, and establishing ownership of intellectual property. Your Tumblr fanfic, YouTube comments, or HN discussions are not legitimate copyright avenues. Stuff you post to legally scrapeable websites is fair game for fair use.

I can do anything I want in private to any data I collect. I could create an awesome HN LLM on the scraped datasets, and use it privately to my heart's content. I can even set up an API to that LLM that generates content, and, given recent rulings, even if I had all the written copyrighted data in the world, as long as I was making good faith efforts to ensure copyright was being respected and works weren't being recreated verbatim, then I could even use that model commercially. I just couldn't sell it to other people, or distribute it, without entering a different legal regime.

I can collect any data I want from public facing websites.

That's how the internet works; it's how it was designed. There are authentication mechanisms, network configurations, and a myriad other access control schemes you can implement to prevent public access. If you post to sites without those mechanisms, you're tacitly agreeing to give up any plausible claims of protection against a wide array of fair uses well established by precedent cases at this point. If you don't prevent public access, and you've got a domain name on a server, you're tacitly inviting the world to come download whatever it is you have on your server. This is a social good. This is what we want when we participate in the internet.

Insisting on some sort of vague entitlement as to how "your" data gets used completely bypasses the fact that anything you consider to be misused in OpenWebText2 fundamentally stems from the fact that you posted the content to a publicly visible website and gave up any say in what happens thereafter. It was scraped fair and square.

Don't complain that you didn't know the rules, or that life isn't fair.

It's not even clear that terms of service or those little popups on public websites have any legal relevance. If your website is open to the public, then it's fair game. If you post content to a public website, then that content's fair game.

quatrefoil|2 years ago

It feels like you're picking apart an argument I didn't make. But I would note that most people don't see this as unambiguously as the position you're defending suggests. To give you an analogy: doxxing is "fair game" too if you posted your info online or gave it to others. But it's not exactly cool to do it, right? It's a subversion and abuse of the system we have in place.

Finally, here's a fun experiment: decide that terms of service don't matter and start building a product by scraping Facebook or Google. See how they'd react. Actually, no need for guesswork - they clutched their pearls and threatened legal action more than once before. It's a bit of a "have your cake and eat it too" kind of a deal. Their data is precious intellectual property; your stuff is, well, up for grabs.

UncleEntity|2 years ago

> It's the equivalent of...printing and delivering it straight to the doorsteps of millions of random individuals.

Which, incidentally, the New York Times does and they seem to think they have some legal right to the redistribution of their work.

Maybe they're right, maybe they're wrong, it's up to the courts to decide.

7moritz7|2 years ago

Care to give me your domain name so I can check all major LLMs for plagiarism? I have a feeling none of them can produce a sentence from your writings.

quatrefoil|2 years ago

It takes deliberate effort, but I was actually able to get pieces of my writing out of one of the leading LLMs (not ChatGPT). This is not particularly unique; a number of folks have demonstrated the same.
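
For anyone who wants to run this kind of check themselves, here's a rough sketch of one way to look for verbatim overlap between a text you wrote and a model's output. It is not the method used above; the function names and the 8-word window are assumptions picked for illustration, and you'd have to collect the model output yourself.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text (hypothetical helper)."""
    words = text.lower().split()
    # If the text has fewer than n words, the range is empty and so is the set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(source, output, n=8):
    """Return word n-grams that appear in both source and output.

    n=8 is an arbitrary threshold assumed here: long enough that a match
    is unlikely to be a common phrase, short enough to catch fragments.
    """
    return word_ngrams(source, n) & word_ngrams(output, n)


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog near the river bank"
    output = "he wrote that the quick brown fox jumps over the lazy dog yesterday"
    for gram in sorted(shared_ngrams(source, output)):
        print(" ".join(gram))
```

Exact word matching like this only catches verbatim reproduction; it says nothing about paraphrase, and whitespace splitting treats punctuation as part of a word, so real text would need some normalization first.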