agmater | 1 year ago

From the Wayback Machine [0], it seems they had a normal "open" setup. They wanted to be indexed, but it's probably a fair concern that OpenAI isn't going to respect their image license. The article describes the robot.txt [sic] as now being "properly configured", but their solution was to block everything except Google, Bing, Yahoo, and DuckDuckGo. That seems to be the smart thing to do these days, but it's a shame for any new search engines.
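
For reference, a robots.txt along those lines would look roughly like this (the crawler tokens are the ones I remember for those four engines, so double-check before copying):

    # Allow the big four search crawlers everything...
    User-agent: Googlebot
    Disallow:

    User-agent: Bingbot
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: DuckDuckBot
    Disallow:

    # ...and block every other bot.
    User-agent: *
    Disallow: /

An empty Disallow means "allow everything" for that agent; the final wildcard rule shuts out everyone else, new search engines included.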

[0] https://web.archive.org/web/20221206134212/https://www.tripl...

peterldowns | 1 year ago

The argument about image/content licensing is, I think, distinct from the one about how scrapers should behave. I completely agree that big companies running scrapers should be good citizens — but people hosting content on the web need to do their part, too. Again, without any details on the timing, we have no idea if OpenAI made 100k requests in ten seconds or if they did it over the course of a day.

Publicly publishing information for others to access and then complaining that ~1 rps takes your site down is not sympathetic. I don't know what the actual numbers and rates are because they weren't reported, but the fact that they weren't reported leads me to assume they're just trying to get some publicity.
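
For scale (taking the 100k figure at face value, since the article gives no real numbers): 100,000 requests spread evenly over 24 hours is 100,000 / 86,400 ≈ 1.2 requests per second, while the same 100,000 in ten seconds is 10,000 requests per second. Those are very different stories, and we aren't told which one this was.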

dghlsakjg | 1 year ago

> Publicly publishing information for others to access and then complaining that ~1 rps takes your site down is not sympathetic. I don't know what the actual numbers and rates are because they weren't reported, but the fact that they weren't reported leads me to assume they're just trying to get some publicity.

They published the site publicly for their customers to browse, with the side benefit that curious people could also use it in moderation, since that kind of traffic wasn't costing them anything real. OpenAI isn't their customer, and OpenAI's use is costing them, in hosting bills and in revenue lost to downtime.

The obvious next step is to gate that data behind a login, and now we (the entire world) all have slightly less information at our fingertips because OpenAI did what they do.

The point is that OpenAI, or anyone running massive scraping operations, should know better by now. Sure, the small company that doesn't do web design had a single file misconfigured, but that shouldn't turn into a four- or five-figure mistake. OpenAI knows what bandwidth costs. There should be a mechanism that says: hey, we've pulled many gigabytes or terabytes of data from a single domain in one scrape; that's a problem.
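
Even a crude per-domain budget in the crawl loop would catch that. A hypothetical sketch of the kind of check I mean (the names and the 5 GB threshold are made up for illustration, not anything OpenAI actually runs):

    # Hypothetical guard a crawler could run before each request;
    # once a domain exceeds the budget, a human decides whether to keep going.
    from collections import defaultdict
    from urllib.parse import urlparse

    MAX_BYTES_PER_DOMAIN = 5 * 1024**3  # ~5 GB, made-up threshold

    bytes_fetched = defaultdict(int)

    def should_fetch(url):
        return bytes_fetched[urlparse(url).netloc] < MAX_BYTES_PER_DOMAIN

    def record_response(url, body):
        bytes_fetched[urlparse(url).netloc] += len(body)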