Isn’t the legality of web scraping still..disputed?

There have been a few projects I’ve wanted to work on involving scraping, but the idea that the entire thing could be shut down with legal threats makes some of them seem infeasible.

It’s strange that OpenAI has created a ~$80B company (or whatever it is) using data gathered via scraping and as far as I’m aware there haven’t been any legal threats.

Was there some law that was passed that makes all web scraping legal or something?

brushfoot|1 year ago

Web scraping the public Internet is legal, at least in the U.S.

hiQ's public scraping of LinkedIn was ruled to be within their rights and not a violation of the CFAA. I imagine that's why LinkedIn has almost everything behind an auth wall now.

Scraping auth-walled data is different. When you sign up, you have to check "I agree to the terms," and the terms generally say, "You can't scrape us." So, you can't just make a million bot accounts that take an app's data (legally, anyway). Those EULAs are generally legally enforceable in the U.S.

Some sites have terms at the bottom that prohibit scraping—but my understanding is that those aren't generally enforceable if the user doesn't have to take any action to accept or acknowledge them.

withinboredom|1 year ago

Most of these SaaS products offer a "firehose" that you can subscribe to if you're big enough (i.e., big enough to handle the firehose). These are like RSS feeds on crack for their entire service.

- https://developer.twitter.com/en/docs/twitter-api/enterprise...

- https://developer.wordpress.com/docs/firehose/

darby_eight|1 year ago

> Scraping auth-walled data is different. When you sign up, you have to check "I agree to the terms," and the terms generally say, "You can't scrape us." So, you can't just make a million bot accounts that take an app's data (legally, anyway). Those EULAs are generally legally enforceable in the U.S.

They're legally enforceable in the sense that the scraped services generally reserve the right to terminate the authorizing account at will, or legally enforceable in that allowing someone to scrape you with your credentials (or scraping using someone else's) qualifies as violating the CFAA?

bena|1 year ago

hiQ was found to be in violation of the User Agreement in the end.

In other words, it came down to a breach of contract.

reaperman|1 year ago

There’s currently only one situation where scraping is almost definitely “not legal”:

If the information you’re scraping requires a login, and if in order to get a login you have to agree to a terms of service, and that terms of service forbids you from scraping — then you could have a bad day in civil court if the website you’re scraping decides to sue you.

If the data is publicly accessible without a login then scraping is 99% safe with no legal issues, even if you ignore robots.txt. You might still end up in court if you found a way to correctly guess non-indexed URLs[0] but you’d probably prevail in the end (…probably).

The “purpose” of robots.txt is to let crawlers know what they can do without getting IP-banned by the operator of the website they’re scraping. Crawlers that ignore robots.txt and act more like robots than humans will generally get an IP ban.
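For illustration, here is a minimal sketch of how a crawler might consult robots.txt before requesting a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the `example.com` URLs are all made up for the example; a real crawler would download the file from the site it is visiting.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hard-coded so the sketch runs offline.
# A real crawler would fetch it from https://<host>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "example-bot"  # hypothetical crawler name


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether this user agent is allowed to fetch each URL.
allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/public/page")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/page")

# Crawl-delay is a non-standard but widely used hint, in seconds.
delay = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay)
```

Nothing here is enforced by the server; the file only tells a well-behaved crawler what the operator would prefer.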

0: https://www.troyhunt.com/enumerationis-enumerating-resources...

ToucanLoucan|1 year ago

Also worth noting there's a long history of companies with deep pockets getting away with murder (sometimes literally) because litigation in a system that costs money to engage with inherently favors the wealthier party.

Also, OpenAI's entire business model relies on generous interpretations of various IP laws, so I suspect they already have a mature legal division to handle these sorts of potential issues.

Karellen|1 year ago

> Isn’t the legality of web scraping still..disputed?

Are you suggesting it might be illegal to... write a program that connects to a web server and asks for a specific page, and then parses that page to see which resources it wants and which other pages it links to, and treats those links in some special fashion, differently from the text content of the page?

Especially given that a web server can be configured to respond to any request with a "403 Forbidden" response, if the server determines for any reason whatsoever that it does not want to give the client the page it requested?
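To make that concrete, here is a rough sketch of the kind of program being described, using only the Python standard library: it requests a page, treats a refusal such as 403 as "no", and separates the links from the text. The class and function names, the `example-bot` user agent, and the sample HTML are illustrative assumptions, not any particular scraper's design.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collects absolute link targets separately from the page's text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        # Links get special treatment: resolve them against the page URL.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Everything else is plain text content.
        if data.strip():
            self.text_chunks.append(data.strip())


def fetch(url, user_agent="example-bot"):
    """Request one page; return None if the server refuses (e.g. 403)."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except HTTPError as error:
        print(f"Server declined {url}: HTTP {error.code}")
        return None


def extract(html, base_url):
    """Parse HTML and return the extractor holding links and text."""
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return extractor


# Offline demonstration on a hard-coded page (no network access needed).
SAMPLE_HTML = (
    '<p>Hello <a href="/about">about</a> and '
    '<a href="https://other.example/x">x</a></p>'
)
page = extract(SAMPLE_HTML, "https://example.com/index.html")
print(page.links)
print(page.text_chunks)
```

`fetch` is defined but not called in the demonstration, so the sketch runs without touching any real site.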

foobarian|1 year ago

Why would it not be legal? Was there a law passed that makes it illegal?

dspillett|1 year ago

The issue often isn't the scraping itself but how you use the scraped information afterwards. A lot of scraping is done with no reference to any licensing information the scraped sites might publish, hence image-generating AI models regurgitating chunks of scraped stock images complete with watermarks. That said, the scraping itself can amount to a DoS if done aggressively enough.
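On the DoS point, one common mitigation is to throttle your own requests. Below is a minimal sketch of a fixed-interval throttle; the `Throttle` class, its parameters, and the URLs are hypothetical, and the right interval depends entirely on the site being scraped.

```python
import time


class Throttle:
    """Enforces a minimum interval between consecutive requests."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the class can be tested
        # without actually waiting.
        self.min_interval = min_interval_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Sleep just long enough to respect min_interval, then record the time."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


# Usage: call wait() before every request to the same host.
# The interval is kept tiny here so the demo finishes quickly.
throttle = Throttle(min_interval_seconds=0.05)
for url in ["https://example.com/a", "https://example.com/b"]:
    throttle.wait()
    print("would fetch", url)  # a real crawler would issue the request here
```

A per-host throttle like this is the simplest option; backing off further when the server starts returning errors would be a natural extension.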

nutrie|1 year ago

Scraping publicly available data from websites is no different from web browsing, period. Companies stating otherwise in their T&Cs are a joke. Copyright infringement is a different game.

bena|1 year ago

Scraping is legal. Always has been, always will be. Mainly because there's some fuzz around the edges of the definition. Is a web browser a scraper? It does a lot of the same things.

IIRC LinkedIn/Microsoft was trying to sue a company under the Computer Fraud and Abuse Act, claiming it was accessing information it was not allowed to. Courts ruled that that was bullshit. You can't put up a website and say "you can only look at this with your eyes". Recently-ish, though, that company (hiQ) was found to be in violation of the User Agreement.

So as long as you don't have a user account with the site in question or the site does not have a User Agreement prohibiting scraping, you're golden.

The problem isn't the scraping anyway, it's the reproduction of the work. In that case, it really does matter how you acquired the material and what rights you have with regard to the use of that material.

observationist|1 year ago

The 9th Circuit Court of Appeals found, in hiQ Labs v. LinkedIn, that scraping publicly accessible content on the internet is legal (more precisely, that it likely does not violate the CFAA).

If you publish something on a publicly served internet page, you're essentially broadcasting it to the world. You're putting something on a server which specifically communicates the bits and bytes of your media to the person requesting it without question.

You have every right to put whatever sort of barrier you'd like on the server, such as a sign in, a captcha, a puzzle, a cryptographic software key exchange mechanism, and so on. You could limit the access rights to people named Sam, requiring them to visit a particular real world address to provide notarized documentation confirming their identity in exchange for a unique 2fa fob and credentials for secure access (call it The Sams Club, maybe?)

If you don't put up a barrier, and you configure the server to deliver the content without restriction, or put your content on a server configured as such, then you are implicitly authorizing access to your content.

Little popups saying "by visiting this site, you agree to blah blah blah" are not valid. Courts made the analogy to a "gate-up/gate-down" mechanism. If you have a gate down, you can dictate the terms of engagement with your server and content. If you don't have a gate down, you're giving your content to whoever requests it.
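As a purely technical illustration of the difference, here is a toy sketch of a server deciding per request whether the gate is up or down. The paths, the token, and the `respond` function are invented for the example, and it says nothing about how a court would treat any particular setup.

```python
from http import HTTPStatus

# Hypothetical site layout and credentials, for illustration only.
PUBLIC_PATHS = {"/", "/blog"}
GATED_PATHS = {"/members"}
VALID_TOKENS = {"secret-token"}


def respond(path, headers):
    """Return the HTTP status a server with this policy would send."""
    if path in PUBLIC_PATHS:
        # Gate up: content is served to whoever asks.
        return HTTPStatus.OK
    if path in GATED_PATHS:
        # Gate down: content is served only to authenticated clients.
        scheme, _, token = headers.get("Authorization", "").partition(" ")
        if scheme == "Bearer" and token in VALID_TOKENS:
            return HTTPStatus.OK
        return HTTPStatus.UNAUTHORIZED
    return HTTPStatus.NOT_FOUND


print(respond("/blog", {}))
print(respond("/members", {}))
print(respond("/members", {"Authorization": "Bearer secret-token"}))
```

The point of the sketch is only that the barrier lives in the server's own configuration, which is where the comment above places the responsibility.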

You have control over the information you put online. You can choose which services and servers you upload to and interact with. Site operators and content producers can't decide that their intent or consent be withdrawn after the fact, as once something is published and served, the only restrictions on the scraper are how they use the information in turn.

Someone who's archived or scraped publicly served data can do whatever they want with the content within established legal boundaries. They can rewrite all the AP news articles with their own name as author, insert their name as the hero in all fanfic stories they download, and swap out every third word for "bubblegum" if they want. They just can't publish or serve that content, in turn, unless it meets the legal standards for fair use. Other exceptions to copyright apply, in educational, archival, performance, accessibility, and certain legal conditions such as First Sale doctrine. Personal use of such media is effectively unlimited.

The legality of web scraping is not disputed in the US. Other countries have some silly ideas about post-hoc "well that's not what I meant" legal mumbo jumbo designed to assist politicians and rich people in whitewashing their reputations and pulling information offline using legal threats.

Aside from right to be forgotten inanity, content on the internet falls under the same copyright rules as books, magazines, or movies published on physical media. If Disney set up a stall at San Francisco city hall with copies of the Avengers movies on a thumb drive in a giant box saying "free, take one!", this would be roughly the same as publishing those movie files to a public Disney web page. The gate would be up. (The way they have it set up in real life, with their streaming services and licensed media access, the gate is down.)

So - leaving behind the legality of redistribution of content, there's no restriction on web scraping public content, because the content was served intentionally to the software or entity that visited the site. It's up to the server operator to put barriers in place and to make content private. It's not rocket surgery, but platforms want to have their cake and eat it too, with control over publicly accessible content that isn't legal or practical.

Twitter/X is a good example of impractical control, since the site has effectively become useless spam unless you sign in. Platforms have to play by the same rules as everyone else. If the gate is up, the content is fair game for scraping. The Supreme Court sent the case back to the 9th Circuit, which reaffirmed the gate-up/gate-down test for legality of access to content.

Since Google and other major corporations have a vested interest in the internet remaining open and free, and their search engines and other tech are completely dependent on the gate up/gate down status quo, it's unlikely that the law will change any time soon.

Tl;dr: Anything publicly served is legal to scrape. Microsoft attempted to sue someone for scraping LinkedIn, but the 9th Circuit court ruled in favor of access. If Microsoft's lawyers and money can't impede scraping, it's likely nobody will ever mount an effective challenge, and the gating doctrine is effectively the law of the land.

gtirloni|1 year ago

> It’s strange that OpenAI has created a ~$80B company (or whatever it is) using data gathered via scraping

Like Google and many others.