From the Wayback Machine [0] it seems they had a normal "open" setup. They wanted to be indexed, but it's probably a fair concern that OpenAI isn't going to respect their image license. The article describes the robot.txt [sic] as now "properly configured", but their solution was to block everything except Google, Bing, Yahoo, and DuckDuckGo. That seems to be the smart thing these days, but it's a shame for any new search engines.

[0] https://web.archive.org/web/20221206134212/https://www.tripl...
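For reference, an allow-list robots.txt of the kind described might look something like this, using the crawlers' published user-agent tokens (Googlebot, Bingbot, Slurp for Yahoo, DuckDuckBot); an empty Disallow means "allow everything" for that agent:

```text
# Allow the major search engines
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: Slurp
Disallow:

User-agent: DuckDuckBot
Disallow:

# Block every other crawler
User-agent: *
Disallow: /
```

Note that this only works against crawlers that honor robots.txt in the first place, which is exactly the complaint here.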
peterldowns|1 year ago
Publicly publishing information for others to access and then complaining that ~1 rps takes your site down is not sympathetic. I don't know what the actual numbers and rates are because they weren't reported, but the fact that they weren't reported leads me to assume they're just trying to get some publicity.
dghlsakjg|1 year ago
They publicly published the site for their customers to browse, with the side benefit that curious people could also use the site in moderation since it wasn't affecting them in any real way. OpenAI isn't their customer, and their use is affecting them in terms of hosting costs and lost revenue from downtime.
The obvious next step is to gate that data behind a login, and now we (the entire world) all have slightly less information at our fingertips because OpenAI did what they do.
The point is that OpenAI, or anyone running massive scraping ops, should know better by now. Sure, the small company that doesn't do web design had a single file misconfigured, but that shouldn't be a four- or five-figure mistake. OpenAI knows what bandwidth costs. There should be a mechanism that says: hey, we have asked for many gigabytes or terabytes of data from a single domain in one scrape, and that is a problem.
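The safeguard described above could be as simple as a per-domain byte budget inside the crawler. A minimal sketch (the budget value and function names are illustrative, not anything OpenAI actually uses):

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical cap: stop fetching once a single domain has served
# more than this many bytes during one crawl run.
DOMAIN_BYTE_BUDGET = 10 * 1024**3  # 10 GiB

bytes_fetched: dict[str, int] = defaultdict(int)

def should_fetch(url: str) -> bool:
    """Return False once a domain has exhausted its byte budget."""
    domain = urlparse(url).netloc
    return bytes_fetched[domain] < DOMAIN_BYTE_BUDGET

def record_response(url: str, body: bytes) -> None:
    """Charge the response size against the domain's budget."""
    bytes_fetched[urlparse(url).netloc] += len(body)
```

A real crawler would also want per-domain request-rate limits and backoff on errors, but even this crude accounting would have flagged a single-site, multi-terabyte scrape long before it became a five-figure bandwidth bill for the site owner.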