explain | 11 months ago

robots.txt is meant for automated crawlers, not human-driven actions.

zupa-hu|11 months ago

Every automated crawler follows human-driven actions.

josh-sematic|11 months ago

Conversely, every browser is a program that automatically executes HTTP requests.

gopher_space|11 months ago

Welcome to "Context".

nicce|11 months ago

It must build the search index somehow, and that happens prior to the human action. It simply would not find the page at all if it respected robots.txt.

pests|11 months ago

I remember, as a teen in the late 90s/early 2000s, going to robots.txt specifically to see what sites were trying to hide, and then exploring those URLs.

What is the difference if I use a browser or an LLM tool (or curl, or wget, etc.) to make those requests?

Tostino|11 months ago

Let's say you had a local model with the ability to do tool calls. You give that LLM the ability to use a browser. The LLM opens that browser, goes to Google or Bing, and does whatever searches it needs to do.

Why would that be an issue?

bayindirh|11 months ago

So, do you mean LLMs are human-like and conscious?

I thought they were just machine code running on part GPU and part CPU.

Ukv|11 months ago

I think they mean that it's a tool accessing URLs in response to a user's request, to present the result to that user live, with that user being a human. Like if you used some webpage translation service, or a non-ML summarizer.

There's some gray area though, and the search engine indexing in advance (not sure if they've partnered with Bing/Google/...) should still follow robots.txt.

Filligree|11 months ago

There’s a human using the LLM. In a live web browsing session like this, the LLM stands in for the browser.

unknown|11 months ago

[deleted]

postexitus|11 months ago

If a human triggers the web crawlers by pressing a button, should they ignore robots.txt?

Filligree|11 months ago

If a human triggers a browser by pressing a button, should it ignore robots.txt?

dudeinjapan|11 months ago

In practice, robots.txt is used to control which pages appear in Google results, and it is respected as a matter of courtesy, not legal obligation. It doesn't prevent proxies etc. from accessing your site.
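
To make the "courtesy" point concrete, here is a minimal sketch of how a well-behaved client might consult robots.txt before fetching a URL. It uses Python's standard `urllib.robotparser`; the rules in `ROBOTS_TXT`, the `ExampleBot` user agent, the `example.com` URLs, and the `is_allowed` helper are all made up for illustration. Nothing enforces the answer: a client that never runs this check can still request the page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines rather than a single string.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/index.html",
                "https://example.com/private/page"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "ExampleBot", url) else "disallowed"
        print(url, "->", verdict)
```

Whether an LLM-driven fetch on behalf of a live user should run a check like this is exactly the question the thread is arguing about; the sketch only shows that the check is voluntary and happens on the client side.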