top | item 36735777

The shady world of Brave selling copyrighted data for AI training

261 points | rand0mx1 | 2 years ago | stackdiary.com | reply

123 comments

[+] hartator|2 years ago|reply
> Simply observe the event in which a user does a query q in Brave and then, within one hour, does the same query on a different search engine. What we do is to move the script that detects bad-queries to the browser, run it against the queries that the user does in real-time and then, when all conditions are met, send the following data back to our servers.

Wait. The Brave browser reports back to the Brave Search engine about your browsing? Not just your usage of other search engines, but it also crawls pages on your computer to help build their search index?

Ref: https://github.com/brave/web-discovery-project/blob/main/mod...
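The quoted passage describes a concrete detection rule: flag a query that the user issues on one engine and then repeats on a different engine within one hour. A minimal sketch of that rule, purely illustrative (the engine list, query parameter names, and normalization are my assumptions, not Brave's actual implementation):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical engine -> query-parameter map; illustrative only.
ENGINE_QUERY_PARAMS = {
    "search.brave.com": "q",
    "www.google.com": "q",
    "www.bing.com": "q",
    "duckduckgo.com": "q",
}

ONE_HOUR = 3600  # seconds

def extract_query(url):
    """Return (engine_host, normalized_query), or None for non-search URLs."""
    parts = urlparse(url)
    param = ENGINE_QUERY_PARAMS.get(parts.netloc)
    if param is None:
        return None
    values = parse_qs(parts.query).get(param)
    if not values:
        return None
    return parts.netloc, values[0].strip().lower()

def repeated_cross_engine_queries(events):
    """events: list of (timestamp_seconds, url) in any order of engines.
    Return the set of queries issued on two different engines within
    one hour of each other -- the condition the quoted passage describes."""
    seen = {}      # query -> list of (timestamp, engine)
    matches = set()
    for ts, url in events:
        hit = extract_query(url)
        if hit is None:
            continue
        engine, query = hit
        for prev_ts, prev_engine in seen.get(query, []):
            if prev_engine != engine and abs(ts - prev_ts) <= ONE_HOUR:
                matches.add(query)
        seen.setdefault(query, []).append((ts, engine))
    return matches
```

The privacy question in the thread is about what happens *after* a match: per the quote, the matching event is what gets sent back to Brave's servers.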

[+] jrmg|2 years ago|reply
If you don’t trust Brave then, yeah, they could be doing anything in the browser or on their servers - but that snippet you quoted is a slightly out of context statement from a big document about how they collect data like this, but _don’t_ collect or store it in a way that they could associate it with a user.

If you don’t trust that they’re doing what they say they are, then the document doesn’t mean anything. Although that would also mean the quote is kind of meaningless…

[+] drusepth|2 years ago|reply
This specific feature is already opt-in, but historically the answer has always been "yes" for dozens of 'features' like this that fly under the radar until users start complaining, and then eventually get converted to opt-in or removed in order to save face.
[+] choppaface|2 years ago|reply
And Google has been getting the same data by joining your cookies ever since Google Plus unified auth across their properties a decade ago. Wait, you mean you thought G+ was supposed to compete against Facebook-the-product and not just Facebook-the-ad-network? Oops

Brave is perfectly OK with having oopsies too

[+] 6gvONxR4sf7o|2 years ago|reply
> Fair use is a doctrine in the law of the United States that allows limited use of copyrighted material without requiring permission from the rights holders. It provides for the legal, non-licensed citation or incorporation of copyrighted material in another author's work under a four-factor balancing test:

> 1) The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes

> 2) The nature of the copyrighted work

> 3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole

> 4) The effect of the use upon the potential market for or value of the copyrighted work

[emphasis from TFA]

HN always talks about derivative work and transformativeness, but never about these. The fourth one especially seems clear in its implications for models.

Regardless, it makes it seem much less clear cut than people here often say.

[+] _fbpp|2 years ago|reply
The entire fair use claim is derived not from any legal basis, but rather from the notion that "it has to be fair use", because it would be legally catastrophic for OpenAI et al. if it weren't true.

If you look at the core argument in favour of fair use, it's that "LLMs do not copy the training data", yet this is obviously false.

For GitHub Copilot and ChatGPT, examples of them reciting large sections of training data are well known. Plenty can be found on HN. ChatGPT doesn't generate a new valid Windows serial key on the fly; it has memorized them.

If one wants to be cynical, it's not hard to see OpenAI et al. patching in filters to remove copyrighted content from the output precisely because it's legally catastrophic for their "fair use" claim to have the model spit out copyrighted content. That would be both copyright infringement by itself and evidence that, no matter how the internals of these models work, they store some of the training data anyway.

[+] amluto|2 years ago|reply
I would look at #1 here. Crawling the Internet to collect information is one thing. (And people putting text on the web without requiring authentication seem to be granting at least some kind of license to anyone who sends a GET request.) But crawling the Internet (via centralized robots or users' browsers), then storing that data and charging money to others for rights to that data (as Brave seems to be doing, quite explicitly) seems like it deserves a very different evaluation under factor #1.
[+] bewaretheirs|2 years ago|reply
Unpopular opinion time:

A ML model is clearly a derivative work of its input.

Here's what I think would be fair:

Anyone who holds copyright in something used as part of a training corpus is owed a proportional share of the cash flow resulting from use of the resulting models. (Cash flow, not profits, because it's too easy to use accounting tricks to make profits disappear).

In the case of intermediaries (e.g., social media like reddit & twitter) those intermediaries could take a cut before passing it on to the original authors.

Obviously hellishly difficult to administer so it's unlikely to happen but I don't see a better answer.
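The proposal above is just proportional arithmetic plus an intermediary cut, so it can be sketched in a few lines. Everything here is illustrative: the "contribution" metric (token counts), the flat intermediary cut, and the numbers are my assumptions, not a real compensation scheme.

```python
def distribute_cash_flow(cash_flow, contributions, intermediary_cut=0.0):
    """Split `cash_flow` among copyright holders in proportion to their
    share of the training corpus.

    contributions: holder -> contribution size (e.g. token count).
    intermediary_cut: flat fraction an intermediary (reddit, twitter)
    takes off each payout before passing it on, per the parent comment.
    """
    total = sum(contributions.values())
    payouts = {}
    for holder, amount in contributions.items():
        gross = cash_flow * amount / total          # proportional share
        payouts[holder] = gross * (1.0 - intermediary_cut)
    return payouts
```

With $1000 of cash flow, a 3:1 contribution split, and a 10% intermediary cut, the holders net $675 and $225. The "hellishly difficult" part the parent concedes is not this arithmetic but attribution: establishing `contributions` for billions of documents.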

[+] civilitty|2 years ago|reply
That’s not at all clear to me. IANAL but first of all it’s a balancing test, not a bright-line test. The judge could focus on any one factor and make an argument for either side quite easily.

Second, “use” here could mean one of two things: training or inference. It’s publishing the results of inference that can lead to actual effects on the market, not the training.

At the end of the day, someone has to prove tangible harm.

[+] flangola7|2 years ago|reply
Microsoft is gambling on the hope that model training will be ruled fair use. This makes it seem that outcome is unlikely.
[+] xp84|2 years ago|reply
From article:

> without any worry for copyright infringement because Brave acts as a middleman.

This isn’t how law works. Unless Brave is explicitly indemnifying all their customers (which their lawyers would have to be insane to let them do), any trouble you could get in, is going to be 100% your problem. Pointing the finger at Brave could theoretically get them in trouble too, but would in no way let you off the hook.

[+] isodev|2 years ago|reply
I firmly believe that corps like these don't deserve the benefit of the doubt. Google, Brave and really anyone big enough to allow themselves to do bad things and get away with it must adhere to a standard where they proactively show their stuff doesn't have malicious intents.
[+] sourcecodeplz|2 years ago|reply
As always, if the product is free, you are the product...
[+] k__|2 years ago|reply
The websites a Brave user browses are anonymously relayed to their servers for indexing/training. So, they crawl the web without a crawler and the website operators can't do anything about it.

That's genius!

[+] jonathansampson|2 years ago|reply
I'm Sampson, from the Brave team. The Web Discovery Project is a clever approach. For Brave to compete with Google, and offer a truly novel index of the Web, a novel approach must be taken. The WDP is an opt-in, privacy-preserving approach which gives Brave a fighting chance against the Search incumbents. Due to our preference for "Can't be evil" over "Don't be evil," the WDP is not only designed with privacy and anonymity as a prerequisite, but it is also open-source for public scrutiny and evaluation: https://github.com/brave/web-discovery-project.
[+] throwaway72762|2 years ago|reply
I think this title is overstated. It seems like Brave is trying to do the right thing here vs other companies that don't even make the attempt. (Also, crawling as a service has been a thing for a while.)
[+] jsnell|2 years ago|reply
> It seems like Brave is trying to do the right thing here vs other companies that don't even make the attempt

I feel like I'm missing something. What the article claims they're doing is:

1. Misrepresenting what rights they have, and selling access to those rights.

2. Stealth-crawling the web, hiding from the webmasters just how much Brave is crawling their site, and making it impossible to block just their crawler.

How is either of these the right thing? I mean, for somebody besides Brave. What "attempt" are they making that other companies aren't?

[+] lern_too_spel|2 years ago|reply
Brave continues to be shady. They claim to respect robots.txt but don't identify their crawler if you want to block it.

> They don't mention their crawler anywhere in their docs, either. So, if you wanted to block Brave from crawling and indexing and ultimately selling your content to third parties, your only option for the time being would be to block all crawlers, which is how Brave would be able to "respect robots.txt".
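As the quoted passage notes, with no documented user agent the only way to "block Brave" via robots.txt is a blanket rule that also blocks every other well-behaved crawler:

```
User-agent: *
Disallow: /
```

A crawler that honors robots.txt but never publishes its user-agent token makes per-crawler rules like `User-agent: BraveBot` impossible to write (that token is hypothetical; no such documented token exists, which is the complaint).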

[+] kodah|2 years ago|reply
Unpopular opinion: the next iteration of privacy laws needs to factor in AI. If AI is allowed to slurp up PII or derivative works, and the people defending it do so with the zeal of cryptobros, then we're in for a decade of real pain in terms of copyright law, PII, and IP exposure.
[+] asynchronous|2 years ago|reply
AI is going to do that regardless - the debate is essentially going to revolve around how and what people can make new commercial works from that data.
[+] _fbpp|2 years ago|reply
The fun part is that the GDPR already does. The answer is you're not allowed to use personal data for AI. (And "personal data" here covers things like all public social media posts)

Facebook recently got told by the CJEU that, no, they can't use people's posts to target advertisements. Even if those ads are what's paying for the platform. That you can't claim such processing as "part of the contract" unless it is absolutely necessary in the same way the post office needs an address to send a parcel.

If Facebook can't even do that, there is no way LLMs will be allowed. (And remember: the GDPR does not care if your system doesn't distribute personal data. Any kind of processing at all falls under the GDPR's requirements.)

OpenAI is already being chased by the EU's privacy agencies. Right now they're in the process of asking pointed questions, things will heat up after that.

[+] lopatin|2 years ago|reply
Why use Brave if my info is already being leaked by third parties, e.g. Experian? Is it worth the inconvenience and their repeated tricky attempts at monetizing their security-conscious niche? Not being facetious, just a real question from a non-security-conscious person.
[+] asynchronous|2 years ago|reply
Its built-in ad blocker and other features are head and shoulders above anything else I've used before, personally.
[+] soundnote|2 years ago|reply
You get degoogled Chromium with e2ee bookmarks etc. sync and a lot of nice convenience features like vertical tabs and mobile background video playback.

And if it's your cup of tea, they let you straight up pay money for the search engine.

[+] ricardo81|2 years ago|reply
My entirely biased opinion is https://www.mojeek.com/ - a traditional search engine crawler (as in, it follows links on the web) that identifies its user agent. Dead simple. The open web, and you can search it.
[+] verisimi|2 years ago|reply
How long until IP protection works its way into AI training data or the AIs themselves? I.e., for some specific instance, the training data is intentionally wrong, so as to check for and prove a breach of IP.
[+] DesktopMonitor|2 years ago|reply
While not intentionally wrong, Van Halen's brown M&M's rider comes to mind as an example of a similar measure.
[+] the8472|2 years ago|reply
Depends, how do you distinguish humans acquiring knowledge by ingesting copyrighted content vs. a human using an AI that ingested copyrighted content?
[+] niemandhier|2 years ago|reply
These discussions on fair use are always quite anglocentric.

Articles 3 and 4 of the EU 'Copyright in the Digital Single Market' directive give data miners quite extensive rights.

Move operations to the EU, train a foundational model, then train a constitutional model based on that.

As much as I hate the upcoming AI regulation, the CDSM is solid.

https://academic.oup.com/grurint/article/71/8/685/6650009 https://eur-lex.europa.eu/eli/dir/2019/790/oj

Update: Fixed wrong link

[+] pedrocr|2 years ago|reply
It's not clear that "data mining" covers this use. These models are huge, big enough that they can just contain direct copies of copyrighted works. They've been shown to reproduce them relatively easily. The argument is that they've actually generalized enough or learned enough that they're now no longer the sum of the dataset. I can definitely see that being possible but the way the technology works it's really hard to know if that has happened or if what's happening instead is a bunch of copyright washing.

There are some things that would make for good faith displays by the players in the space. For example, Microsoft has been investing a lot and yet their code offering is not trained on their internal code base. Same for Google. Start by doing that and I'll entertain the argument that your tools are fair use or data mining.

[+] 411111111111111|2 years ago|reply
It's always surprising to me when I hear people using the Brave browser... It's by a company that initially tried to replace blocked ads with their own "safe and non-intrusive" ads, as far as I remember, until they backpedaled because of the outrage.

It's also a for-profit company and you're not the customer, as you're not paying them money.

I'd be way more worried how they're using the data they're collecting on you vs Google or MS

[+] nicce|2 years ago|reply
People still like to defend Brave when it gets caught doing shady things over and over again. I guess there aren't too many other options. For some people it is already too difficult to install uBlock, or to know of its existence.
[+] 2Gkashmiri|2 years ago|reply
We have these cropping up like ants.

Mullvad

Brave

Opera

Vivaldi

Microsoft

Heck zoho is in on a browser now

What net gain does each of these companies provide over skinning Chromium that isn't in Firefox?

Last time I asked Brave fanboys why they don't reskin Firefox, the response was "Firefox is a PITA to build" - all the while we have projects like Pale Moon and Waterfox that are hobby projects. If they can work with Firefox, so could someone else, but no.

[+] rglullis|2 years ago|reply
You are a victim of the Mandela effect. There never was anything related to replacing ads in-page, yet if you ask detractors what they don't like about Brave, that's the first point they bring up.
[+] cempaka|2 years ago|reply
> I'd be way more worried how they're using the data they're collecting on you vs Google or MS

Why? They don't even have access to my emails and texts like those other companies do. I also don't see the names of their top executives and founders showing up in articles about connections to Jeffrey Epstein every few months.