blindgeek | 1 year ago
Public content on the Internet should be scrapable. That's what public means.
The fact that my Reddit posts were publicly available never bothered me, even if they were going to be used to train some LLM. What does bother me is Reddit locking up my posts and making exclusive deals with Google to train Google's LLM.
Preventing scraping isn't good for the average user; it is good for the company that wants to take content created by said user, lock it up, and sell it to their buddies.
miki123211|1 year ago
Not necessarily, especially if you want to expose some relationships in one direction while hiding the other.
Imagine your government creates a CNAM-like[1][2] system that lets you enter a phone number and see its owner, so you can check who is calling you and whether a number you've been given is legit. However, they do not want to let you find a person's phone number just by entering their name.
If there's no captcha, an unscrupulous actor, registered in the Seychelles and unconcerned with your country's laws, can just scrape all possible phone numbers and offer a "reverse lookup" service.
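To make the enumeration concrete, here is a minimal sketch of how a one-way lookup turns into a two-way one. Everything in it is hypothetical: the in-memory `DIRECTORY`, the `lookup_owner` function standing in for the government's query form, and the tiny ten-number space. A real number space is far larger, but it is still finite, which is the whole problem.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical toy directory (number -> owner); stands in for the real service.
DIRECTORY: Dict[str, str] = {
    "555-0003": "Alice Example",
    "555-0007": "Bob Example",
}


def lookup_owner(number: str) -> Optional[str]:
    """The only query the service means to allow: number -> owner."""
    return DIRECTORY.get(number)


def build_reverse_index(
    lookup: Callable[[str], Optional[str]],
    number_space: Iterable[str],
) -> Dict[str, List[str]]:
    """Invert a one-way lookup by trying every possible key."""
    reverse: Dict[str, List[str]] = {}
    for number in number_space:
        owner = lookup(number)
        if owner is not None:
            reverse.setdefault(owner, []).append(number)
    return reverse


# Assumed number format for illustration: 555-0000 through 555-0009.
number_space = [f"555-{n:04d}" for n in range(10)]
reverse_index = build_reverse_index(lookup_owner, number_space)
```

After the loop, `reverse_index` maps names to numbers, which is exactly the direction the service was trying to hide. A captcha or rate limit doesn't change what a single query reveals; it only makes walking the whole key space expensive.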
In a way, the number/name records are public information; after all, the government lets you query them without authentication. But in another way they aren't, because you're only permitted to query them in one direction.
Variations of this problem have appeared many times, particularly across Europe, usually with company numbers, property deeds and such.