jackienotchan | 1 year ago
What's the endgame here? AI can already solve captchas, so the arms race for bot protection is pretty much lost.
hibikir|1 year ago
Can defenses be good enough that it's not even worth trying to fight them? That's a far harder question than whether a random bot can make a dozen requests pretending to be human.
amiga386|1 year ago
Make it easier to get the data, put fewer roadblocks in the way of legitimate access, and you'll find fewer scrapers. Even if you make scraping _very_ hard, people will still prefer scraping if legitimate use is even more cumbersome than scraping, or if you refuse to offer a legitimate option at all.
Admittedly, we are talking here because some people are scraping OSM when they could get the entire dataset for free... but I'm hoping these people are outliers, and most consume the non-profit org's data in the way they ask.
tedivm|1 year ago
For example, I have a project that crawls the SCP Wiki (following best practices, rate limiting, etc.). If they were to restrict the API that I use, it would break the website for regular visitors. So if they do want to limit access, they have no choice but to put it behind some set of credentials they can trace back to a user and eliminate the public site itself. For a lot of sites that's just not reasonable.
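The "best practices" mentioned above boil down to checking robots.txt and throttling requests. A minimal sketch of that pattern (the robots.txt content, bot name, and delay are illustrative, not tedivm's actual code — a real crawler would fetch the site's live robots.txt):

```python
import time
import urllib.robotparser

# Illustrative robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """User-agent: *
Disallow: /forum/
"""

robots = urllib.robotparser.RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

class PoliteFetcher:
    """Rate-limited fetcher that checks robots.txt before every request."""

    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.last = 0.0

    def allowed(self, url):
        # Respect the site's Disallow rules for our (hypothetical) bot name.
        return robots.can_fetch("example-archiver-bot", url)

    def wait_turn(self):
        # Sleep just long enough to keep at most one request per `delay` seconds.
        remaining = self.delay - (time.monotonic() - self.last)
        if remaining > 0:
            time.sleep(remaining)
        self.last = time.monotonic()
```

A crawler loop would call `allowed(url)` first, skip disallowed pages, and call `wait_turn()` before each actual HTTP request.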
__MatrixMan__|1 year ago
When I compare that to our current internet, my first thought is "but that won't scale to the whole planet". But the thing is, it doesn't need to. All of the problems I need computers to solve are local problems anyway.
MattDaEskimo|1 year ago
Websites previously had their own in-house APIs that freely delivered content to anyone who requested it.
Now, a website should be a simple interface that communicates with an external API and displays the result. It's the user's responsibility to have access to the API.
Any information worth taking should be locked away behind authentication, which has become stupidly simple using OAuth with major providers.
So the people trying to extract content by paying someone or using a paid service should instead use the API, which packages it for them and is fairly priced.
Lastly, robots.txt should be enforced by law. There is no difference between stealing something from a store and stealing content from a website.
AI (and greed) has killed the open freedoms of the Internet.
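The "lock content behind authentication" idea above can be sketched as an API that refuses any request without a valid token. The HMAC-signed token here is a stand-in for real OAuth verification (names like `api_get_content` are illustrative):

```python
import hmac
import hashlib

SECRET = b"server-side-secret"  # stand-in for keys from a real OAuth provider

def sign(user_id: str) -> str:
    """Issue a token for a user (a real setup would verify an OAuth/JWT token)."""
    mac = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return f"{user_id}.{mac}"

def authenticate(token: str):
    """Return the user id if the token is valid, else None."""
    try:
        user_id, mac = token.rsplit(".", 1)
    except ValueError:
        return None
    expected = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return user_id if hmac.compare_digest(mac, expected) else None

def api_get_content(token: str) -> dict:
    """The 'simple interface' pattern: every request must carry a valid token."""
    user = authenticate(token)
    if user is None:
        return {"status": 401, "error": "authentication required"}
    return {"status": 200, "content": f"article body for {user}"}
```

Because every response is tied to a user identity, abusive access can be rate limited or revoked per account instead of per IP.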
mahdi7d1|1 year ago
Also, maybe the recent rise in captcha difficulty isn't companies making them harder to stop bots, but bots twisting the right answer. As I understand it, captchas grade answers based on other users' responses, so if a huge portion of those other users are bots, they can fool the algorithm into treating their wrong answer as the right one.
dartos|1 year ago
This is a similar situation.
Firefishy|1 year ago
Our S3 bucket is thankfully supported by the AWS Open Data Sponsorship Program.
Scoundreller|1 year ago
(Not sure if created by the admins or a 3rd party, but done once for many is better than overlapping individual efforts).
jgalt212|1 year ago
We've had good success with
- Cloudflare Turnstile
- Rate Limiting (be careful here, as some of these scrapers use large numbers of IP addresses and User Agents)
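Per-key rate limiting of the kind listed above is commonly a token bucket. A minimal sketch (parameters are illustrative); note the caveat from the list: keying only on IP does little against a scraper rotating through a large address pool:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow `rate` requests per second per key, with bursts up to `capacity`."""

    def __init__(self, rate=1.0, capacity=5.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: capacity)
        self.stamp = defaultdict(time.monotonic)

    def allow(self, key):
        now = time.monotonic()
        # Refill proportionally to the time since this key's last request.
        self.tokens[key] = min(
            self.capacity,
            self.tokens[key] + (now - self.stamp[key]) * self.rate,
        )
        self.stamp[key] = now
        if self.tokens[key] >= 1.0:
            self.tokens[key] -= 1.0
            return True
        return False
```

In practice the key is often a combination of IP, user agent, and session rather than IP alone, precisely because of the distributed-scraper problem.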
londons_explore|1 year ago
Require login, then verify the user account is associated with an email address that's at least 10 years old. That pretty much eliminates bots. It eliminates a few real users too, but not many.
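The check proposed above is just an age threshold on the email account's creation date. A sketch, with the big caveat (raised in the replies) that few providers actually expose a creation date:

```python
from datetime import datetime, timezone

MIN_ACCOUNT_AGE_YEARS = 10  # threshold suggested in the comment above

def account_old_enough(email_created, now=None):
    """Heuristic bot filter: accept only email accounts older than the cutoff.
    Assumes the email provider exposes a creation timestamp, which few do."""
    now = now or datetime.now(timezone.utc)
    age_days = (now - email_created).days
    return age_days >= MIN_ACCOUNT_AGE_YEARS * 365
```

As the replies note, this also locks out legitimate users whose oldest address is only a few years old.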
tempfile|1 year ago
This is not a solution if you want a public internet (and sites that don't care about the public internet already don't have a problem).
_heimdall|1 year ago
At best any email I have is 4 or 5 years old.