top | item 36992184

pierrefar | 2 years ago

The major problem with Brave search is their stance on indexing and licensing content against the wishes of the website publisher. Their robot does not identify itself, meaning a publisher cannot use the standard robots.txt mechanism to block its crawling if they so wish. Incidentally, the robots.txt file has been used in court cases litigating whether a search engine's operation is legal.

Even worse, they state that Brave search will skip indexing a page only if no other search engine is allowed to index it. It is morally not their right to make that call. A publisher should have full control to discriminate which search engine indexes the website's content. That's the very heart of why the Robots Exclusion Protocol exists, and Brave is brazenly ignoring it.
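For context, the Robots Exclusion Protocol works by letting publishers address crawlers by the user-agent token each crawler announces. A minimal robots.txt that allows Bingbot but blocks Googlebot looks like this; note that no equivalent rule can target Brave's crawling, precisely because it announces no token of its own:

```
# Allow Bingbot to crawl everything
User-agent: Bingbot
Disallow:

# Block Googlebot from the entire site
User-agent: Googlebot
Disallow: /
```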

Even worse than that, the Brave search API allows you, for an extra fee, to get the content with a "license" to use it for AI training. Who gave them the right to distribute the content that way?

I wrote about all this here:

https://searchengineland.com/crawlers-search-engines-generat...

and more references elsewhere in this thread:

https://news.ycombinator.com/item?id=36989129

Amusingly, while I was writing my article, this got posted to their forums, asking about how to block their crawler:

https://community.brave.com/t/stop-website-being-shown-in-br...

No reply so far.

yreg|2 years ago

Hmm, I don't know; it doesn't seem obvious to me that it is unethical to disobey the publisher's wishes.

If you post something to the open web, what's it to you who reads it and how? You can block some IPs but that's about it.

I don't know if Brave has a knowledge graph; if they do, I would understand objecting to them filling it with “stolen” content. But I don't see what the problem is with search.

By the way, isn't everyone's favourite archive.is doing the same thing?

I have no strong opinion on this, curious to hear counter arguments.

CaptainFever|2 years ago

I’m just thinking that if website publishers are able to legally allow Googlebot but block other bots, it might contribute to the Google monopoly.

jaharios|2 years ago

This makes me want to use Brave search now. When I use a tool, I expect it to serve me, not the material it provides.

> A publisher should have full control to discriminate which search engine indexes the website's content

If you want someone not to see what you publish, block them yourself. Also, why would you want to do that? Do you want Google to own the web or something?

pierrefar|2 years ago

There is a difference between a human being able to access content and a search engine indexing it (and, in the case of Brave, "licensing" it on).

I share your concern about Google having this much power, and I'd add that Microsoft Bing is equally bad but gets away with it because they're smaller. Still, the final decision about which search engine indexes a website is purely the publisher's.

waithuh|2 years ago

If the website owner wants google to own the web, they should be able to restrict their website. Nothing wrong with that.

bastawhiz|2 years ago

Let's say I pay for Kagi. Kagi is a tool that I'm using to avoid doing hard work manually. With relatively few exceptions, I can probably accomplish what I use a search engine for manually, but with much more time and effort. So I'm paying for a tool to assist me with my use of the web. A "user agent", you might even say.

It simply doesn't sound right to dictate which tool a user can use. It's the same as arguing that you should be able to block Firefox from accessing your website, and that it's Mozilla's fault for not respecting your wish as a webmaster to block Firefox exclusively. Or blaming a VPN for not publishing its IP addresses so that you can block it. Or a screen reader that processes the text to speech in a way you disagree with.

Philosophically it seems intuitive to say "I should be able to block a third party that is abusing my site" but it's ignoring the broader context of what "open web" and "net neutrality" actually mean.

I run a service for podcasters. There are podcast apps and directories that either ignorantly make unnecessary requests for content or have software bugs that cause redownloads. I could trivially block them, but I don't, because doing so penalizes the end user, who is ultimately innocent, rather than the badly behaved service operator. The better solution is primitives like rate limiting, which I use liberally. Plus, blocking anyone directly incentivizes centralization on Apple, Spotify, etc., making the state of open tech in podcasting even worse.
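The rate-limiting primitive mentioned above can be sketched as a per-client token bucket. This is a hypothetical illustration, not the actual implementation behind the service described:

```python
import time
from collections import defaultdict


class TokenBucket:
    """Allow a burst of `capacity` requests, refilling at `rate` tokens/second."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        # Per-client state: (tokens remaining, time of last update).
        self.state = defaultdict(lambda: (capacity, time.monotonic()))

    def allow(self, client_id: str) -> bool:
        tokens, last = self.state[client_id]
        now = time.monotonic()
        # Refill based on elapsed time, capped at the bucket's capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.state[client_id] = (tokens - 1, now)
            return True
        self.state[client_id] = (tokens, now)
        return False


bucket = TokenBucket(capacity=5, rate=1.0)  # burst of 5, then 1 request/second
results = [bucket.allow("203.0.113.7") for _ in range(6)]
# Six rapid requests: the first five pass, the sixth is throttled.
```

Throttling a misbehaving podcast app this way slows it down without cutting off its users entirely, which is the point being made above.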

> the Brave search API allows you, for an extra fee, to get the content with a "license" to use it for AI training. Who gave them the right to distribute the content that way?

I don't think there's any court at this point that would back you up that freely published content annotated with full provenance cannot be scraped and published for a fee. Services like this have existed for decades. If you don't want your content scraped, put it behind a login. Especially since this only applies when you allow other search engines. And if you think Google and Bing aren't using your content to train AI, you're off your rocker.

sangnoir|2 years ago

> With relatively few exceptions, I can probably accomplish what I use a search engine for manually, but with much more time and effort. So I'm paying for a tool to assist me with my use of the web. A "user agent", you might even say

1. User agents should identify themselves

2. A crawler is not a user agent - it's an agent for Brave

>I don't think there's any court at this point that would back you up that freely published content annotated with full provenance cannot be scraped and published for a fee.

You can't end-run copyright like this: just because something is publicly available doesn't mean anyone can redistribute it. Look at the legal issues & cases relating to Library Genesis.

cvalka|2 years ago

They are right and you are wrong. If some web page is publicly available, it should be indexed. Scraping neutrality, please.

waithuh|2 years ago

Heavily disagree. I own the server, and thus the website. I should be able to allow or disallow any type of web crawler/scraper I want. Similar to how you can't easily regulate what's on a website without lawsuits and takedowns, you can't regulate how discoverable a website is.

tympious|2 years ago

What if I want what I publish to be known only by word of mouth?

What if I consider (some or any of) my ideas to be un-indexable, not directly suitable for representation in any hierarchy other than those I may set them in?

vGPU|2 years ago

> The Robots Exclusion Protocol is a mechanism for publishers to discriminate between what users and crawlers are allowed to access, and discriminate between different crawlers (for example, allow Bingbot to crawl but not Googlebot).

To me as a search engine end user, this kind of behavior is undesirable. Why would I want a website to selectively degrade my experience because of my choice in search engine or browser?

Brings back horrible flashbacks of “this website is only compatible with IE6”.

1vuio0pswjnm7|2 years ago

Curious why one cannot selectively block using IP addresses instead of the user-agent string. According to the HTTP specification, UA is not a required header; there is certainly no technical requirement for it in order to process HTTP requests. Of course, any website could block requests that lack a UA header. I never send one, and it's relatively rare IME to see a site require it, but it's certainly possible.
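The policy described here, refusing requests that carry no User-Agent header at all, is a one-line check. A minimal sketch (hypothetical function name; UA is optional in the HTTP spec, so this is purely a site-level choice):

```python
def should_block(headers: dict) -> bool:
    """Reject requests whose User-Agent header is missing or empty.

    This is a site policy, not a spec requirement, and it cannot catch
    a crawler that simply sends an ordinary Chrome UA string.
    """
    return not headers.get("User-Agent", "").strip()


print(should_block({}))                                  # True: no UA sent
print(should_block({"User-Agent": "Mozilla/5.0 Chrome"}))  # False: looks like a browser
```

As the check's docstring notes, it does nothing against crawling that hides behind a normal browser UA, which is exactly the complaint raised elsewhere in this thread.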

pierrefar|2 years ago

This is explained more in the article I referred to, but briefly: Brave delegates crawling to ordinary Brave browsers, so it's a huge pool of IP addresses, not a single IP address or range.

Also, these search crawls by the browser do not identify themselves beyond the standard Brave UA header, which is a plain Chrome user-agent string.

1vuio0pswjnm7|2 years ago

According to the Brave Privacy Policy, participation in the Web Discovery Project is "opt-in". How many Brave users have actually opted in to sending data to Brave?

How many Chrome users have opted in to sending data to Google?

Sometimes uninformed consent is not actually consent. These so-called "tech" companies love to toe that line.