Random question from a very-much-not-computer-savvy person, on the off chance someone cares to answer: if a tiny charge were levied every time a webserver delivered a page to me, would it cure this kind of problem? I'm imagining, e.g., my browser having to send some cryptographic payment of some variety (or a guarantee that it will pay in the future, so as not to slow things down).
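A minimal sketch of what that gate might look like server-side, assuming an HTTP 402 (Payment Required) flow; all the names here (`PRICE_PER_PAGE`, `verify_payment_token`, the `X-Payment-Token` header) are hypothetical, not any real payment API:

```python
# Hypothetical per-page micropayment gate. The token check is a stand-in
# for real cryptographic verification of a signed payment (or IOU).

PRICE_PER_PAGE = 0.0001  # illustrative: a hundredth of a cent per request

def verify_payment_token(token):
    # Placeholder for real signature verification of the payment promise.
    return token is not None and token.startswith("paid:")

def handle_request(headers):
    """Return (status, body); 402 Payment Required without a valid token."""
    token = headers.get("X-Payment-Token")
    if not verify_payment_token(token):
        return 402, "Payment Required: attach a payment token"
    return 200, "<html>...page content...</html>"
```

The interesting design question is the one raised downthread: a bot operator who values the data will happily pay the tiny fee, so the charge mostly deters the lowest-value traffic.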
thephyber|1 year ago
We create bot traffic, but we don’t want to. The problem is that the data we want isn’t available when we want it (we can’t wait days/weeks for the central CVE db to unembargo high-impact CVE records) and isn’t delivered to us. Instead, we have to put in a lot of effort to go get it. So we build a resilient crawler. And other similar companies/entities do too. Now we are all competing to get the same info in a short window, so we poll the sites too often. All of this becomes a stress on the websites we hit.
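One common way a polite crawler eases that polling pressure is exponential backoff: poll eagerly while a source is producing new records, and back off toward some cap while it is quiet. A minimal sketch (the function name and the 60s/1h defaults are illustrative assumptions, not anyone's actual crawler):

```python
def next_delay(current, changed, base=60, cap=3600):
    """Next polling interval in seconds.

    Reset to `base` when the last poll found new data (stay responsive),
    otherwise double the interval up to `cap` (ease off quiet sources).
    """
    return base if changed else min(cap, current * 2)
```

In a real crawler you'd also add random jitter to each sleep so that many competing pollers don't all hit the site at the same instant.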
All because the info should be open, but the companies with the info don’t want to build the most efficient system to distribute it. And there is probably legal liability for a middleman company that just crawls those websites and builds a shim webhook system to push data to webhook subscribers as soon as it is found.
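The shim described above is conceptually simple. A sketch of its core, with the delivery mechanism injected so the dedup logic stays testable; the function name, the `id` field, and the CVE identifier in the usage below are all hypothetical:

```python
# Push each newly discovered record to webhook subscribers exactly once.
seen_ids = set()

def on_record_found(record, subscribers, deliver):
    """`deliver(url, record)` is an injected callable (e.g. an HTTP POST).
    Returns True if the record was new and pushed, False if already seen."""
    rid = record["id"]
    if rid in seen_ids:
        return False            # already notified subscribers; skip
    seen_ids.add(rid)
    for url in subscribers:
        deliver(url, record)    # a real system would retry / dead-letter failures
    return True
```

With this shape, the crawler only has to call `on_record_found` for everything it sees, and subscribers get pushed exactly one notification per record instead of each running their own poller.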
mattboardman|1 year ago
https://github.com/RupertBenWiser/Web-Environment-Integrity/...
mike_hearn|1 year ago
It'd also entrench search monopolies even harder: everyone would exempt Google/Bing because they want to get indexed, but they wouldn't exempt other bots, like the one you'd need for your new engine.
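That exemption pattern is already visible today in robots.txt files (assuming bots honor them, which is voluntary under RFC 9309), where the incumbents are allowlisted by user agent and everything else is blocked:

```
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
```

A new search engine's crawler falls under the `*` rule by default, so it never gets the chance to prove itself useful enough to earn its own `Allow` entry.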
thephyber|1 year ago
If your server doesn’t serve responses unless someone pays, then there is the problem of uncertainty for the client: how do I know the content behind the paywall is worth it?
Nearly all of the services we use that index the web are free/cheap and require the ability to crawl the web without logging into services. Search engines like Google, Bing, Yandex, Baidu. LLMs like ChatGPT piggyback on CommonCrawl, in addition to paying for large expensive data contracts from companies like Reddit.
We have a word for the part of the internet that is walled off from open crawling: the Deep Web.