solso's comments

solso | 3 years ago | on: Brave Search Goggles: Alter search rankings with rules and filters

Disclaimer: Works at Brave, before at Tailcat.

Goggles white-paper was released more than a year ago, long before Kagi was even announced to the public.

Additionally, before Brave acquired Tailcat (Jan 2021) I had the pleasure to share the draft of the paper with Kagi's founder.

So no, there is no prior art.

Let me add that I do not claim that Goggles is prior art of Lenses either.

One of the key features of Goggles design is that the instructions, rules and filters are open and URL accessible.

A Goggle is not so much a personal preference configuration, but a way to collaboratively come up with shareable and expandable search re-rankers.

Very different goals if you ask me. Of course, Goggles can be used for personal preferences exclusively, but that's not the use case we had in mind.
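To make the "shareable re-ranker" idea concrete, here is a minimal sketch. It is not the actual Goggles syntax (that is described in the white-paper); the `Rule` format, the action names and the `rerank` function are hypothetical, made up only to show how a list of open rules could alter a result list.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: match results by domain and apply an action."""
    domain: str
    action: str          # "boost", "downrank" or "discard"
    weight: float = 1.0  # multiplier used by boost/downrank


def host_matches(host: str, domain: str) -> bool:
    # Match the domain itself or any of its subdomains.
    return host == domain or host.endswith("." + domain)


def rerank(results, rules):
    """results is a list of (url, score); returns the re-ranked list."""
    reranked = []
    for url, score in results:
        host = urlparse(url).hostname or ""
        keep = True
        for rule in rules:
            if not host_matches(host, rule.domain):
                continue
            if rule.action == "discard":
                keep = False
                break
            if rule.action == "boost":
                score *= rule.weight
            elif rule.action == "downrank":
                score /= rule.weight
        if keep:
            reranked.append((url, score))
    # Highest score first; ties keep their original order.
    return sorted(reranked, key=lambda item: item[1], reverse=True)
```

Because the rules are plain data, they could be published at a URL and extended by others, which is the part that differs from a private preference setting.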

solso | 3 years ago | on: Brave Search Goggles: Alter search rankings with rules and filters

An active choice is better than a passive one, if only because it requires an effort. In that respect, the explicitness is an advantage over the typical personalization.

The article also mentions that Goggles will not stop polarization, it suffices to not exacerbate it.

No technology/system in any period of time has been able to suppress it, censorship included.

Disclaimer: I work at Brave search

solso | 3 years ago | on: Bing contract prohibits DuckDuckGo from completely blocking Microsoft tracking

There are plenty of comments discussing the provenance of DDG results, including from Gabriel himself, in the thread in which we have both participated:

"it is misleading to say our results just come from Bing."

Discussing how many sources one can bring together is a distraction from discussing the degree of dependency between DDG and Bing. More so when claiming that others suffer from the same, which is factually incorrect for Brave search.

solso | 3 years ago | on: Bing contract prohibits DuckDuckGo from completely blocking Microsoft tracking

This is Josep M. Pujol from Brave Search.

I'd like to correct some factually incorrect information regarding Brave Search.

Brave search crawls the web through the Web Discovery Project and has its own crawler, which fetches a bit more than 100M pages daily.

Brave search uses the Bing API and Google fallback for about 8% of the results shown to users; the remaining 92% are served from our own index. When we launched almost 1 year ago, the share of results from 3rd parties was 13%.

There is no need to mention "multiple source" when a number can be given. The underlying theme here is not whether DDG provides value on top of Bing; it does, no one is questioning that. The question is whether DDG would be able to operate if Bing were to shut DDG down tomorrow.

If Bing and Google were to disappear tomorrow, for whatever reason, Brave search would continue to operate, that's the independence Brave search is building.

solso | 4 years ago | on: Brave Search beta

You bring up a very good point on the diversity of information sources, which is something we plan to attack in the near future with open ranking [0]

In my opinion having similar results to Google will facilitate adoption. After all, Google is pretty good for many types of queries (not all), and people in general have strong habits.

The fact that we are similar with our own index is great. It means that we have the power of deviating from it when needed, as we mature/evolve.

Allow me to repurpose your statement on why not use startpage if you want Google-like results: if tomorrow Google disappears (or for some reason becomes unusable), brave search will continue to operate as normal (similar to old Google). What will happen to searx or startpage? What will happen to ddg or swisscows if the provider turning bad is Microsoft? IMHO, no matter how much reranking or how many nice features you put on top, unless you control the search results themselves, diversity can only be superficial.

Sorry for the "rant". Thanks a lot for the inputs and for updating the doc, appreciate it.

[0] https://brave.com/wp-content/uploads/2021/03/goggles.pdf

solso | 4 years ago | on: Brave Search beta

Mixing with Google results can only happen after opt-in and only in Brave browser. You can see if a single query has been mixed by clicking on `Info`, or check the independence metrics on the `Settings` tab.

The fact that you see results similar to Google for popular queries is a by-product of the fact that our ranking is trained using anonymous query logs. There are plenty of references to the methodology (https://0x65.dev/).

The fact that we are similar to Google on certain types of queries is good (at least from the perspective of human assessment). It's easy to find other types of queries for which we are not similar to Google. It would be rather stupid if we were to "use google" on easy to solve queries but not on the complicated ones, don’t you think? In any case, very nice article besides a couple of misconceptions (like this one), will bookmark.

Disclaimer: work at Brave search, used to work at Cliqz

solso | 5 years ago | on: Brave buys a search engine, promises no tracking, no profiling

It is trivial to de-anonymize if records are linkable, which is the case you mention on Dark Data DEFCON25. Another famous case was the de-anonymization of the Netflix data set.

However, you are assuming that HumanWeb data collection is record-linkable, which is not the case, precisely to avoid this attack.

If what is being collected is linkable, e.g. (user_id, url_1), ... (user_id, url_n), then no matter how you anonymize user_id, it will eventually leak. A single url containing personally identifiable information, e.g. a username, will compromise the whole session, no matter how sophisticated the user_id generation is. The real problem, privacy-wise, is the fact that records can be linked to the same origin. An attacker (or the collector) has the ability to know if two records have the same origin.

The anonymization of HumanWeb, however, ensures that linkability across data points is not present. Hence, an attacker cannot know if two records come from the same origin. As a consequence, the fact that one url might give away user data, for instance a username, would not compromise all the urls sent by that person.
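A toy model of the difference, as a sketch only: this is not how HumanWeb is implemented (the linked article covers the real mechanisms), and the users, URLs and function names below are invented for illustration. It only shows why one leaking url exposes a whole session when records share an identifier, and only itself when they do not.

```python
import random

# Hypothetical browsing events; one of alice's URLs contains her username.
events = {
    "alice": [
        "https://example.org/news",
        "https://example.org/u/alice/settings",
        "https://example.org/clinic",
    ],
    "bob": [
        "https://example.org/news",
        "https://example.org/sports",
    ],
}


def collect_linkable(events):
    """Every record carries a pseudonymous id shared by one origin."""
    pseudonyms = {user: f"id-{i}" for i, user in enumerate(events)}
    return [(pseudonyms[user], url)
            for user, urls in events.items() for url in urls]


def collect_unlinkable(events, seed=0):
    """Records carry no origin identifier and arrive in shuffled order."""
    records = [url for urls in events.values() for url in urls]
    random.Random(seed).shuffle(records)
    return records


def urls_exposed_by_leak(linkable_records, leaked_fragment):
    """All URLs attributable to a person once one URL identifies them."""
    leaked_ids = {rid for rid, url in linkable_records
                  if leaked_fragment in url}
    return [url for rid, url in linkable_records if rid in leaked_ids]


# Linkable: the username URL drags alice's other two URLs with it.
exposed_linkable = urls_exposed_by_leak(collect_linkable(events), "/u/alice/")

# Unlinkable: only the leaking URL itself can be attributed.
exposed_unlinkable = [u for u in collect_unlinkable(events)
                      if "/u/alice/" in u]
```

The pseudonym `id-0` never contains the name "alice", yet it is enough to tie the three records together; that shared key is the whole problem.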

If you are interested in more details I recommend this article: https://0x65.dev/blog/2019-12-03/human-web-collecting-data-i...

[Disclaimer I'm one of the authors]

solso | 5 years ago | on: Brave buys a search engine, promises no tracking, no profiling

The chosen excerpt omits the fact that it is predicated on the HumanWeb. In the technical papers above there is a more precise description of what was collected and how. There was no user tracking, session or history being sent, as all data points are anonymous and record-unlinkable by the receiver. The vague language, required for a general audience journal, certainly does not help.

solso | 5 years ago | on: Brave buys a search engine, promises no tracking, no profiling

Mozilla never did such a thing. The browsing history was never sent in any shape or form. As the journalistic article you quote states, Mozilla put in place the HumanWeb[1,2,3], which was a privacy preserving data collection which ensured record-unlinkability, hence no session or history. Anonymity was guaranteed and the framework was extensively tested by privacy researchers from both Cliqz and Mozilla. Disclaimer: I worked at Cliqz.

[1] https://0x65.dev/blog/2019-12-02/is-data-collection-evil.htm...
[2] https://0x65.dev/blog/2019-12-03/human-web-collecting-data-i...
[3] https://0x65.dev/blog/2019-12-04/human-web-proxy-network-hpn...

solso | 5 years ago | on: Brave buys a search engine, promises no tracking, no profiling

There was no tracking on Cliqz, nor will there be any in Brave. To know more about the underlying tech of Cliqz there are interesting posts at https://0x65.dev, some of them covering how signals are collected: data, but no tracking. I did work at Cliqz and now I work at Brave. I can tell you for a fact that all data was, is and will be record-unlinkable. That means that no one, not me, not the government, not the ad department, can reconstruct a session with your activity. Again, there is no tracking, full anonymity; Brave would not do it any other way.

solso | 5 years ago | on: DuckDuckGo, Google, and Android choice screens

You are missing the entire point, "search" is sadly about advertising, not about the search itself :-)

Bing is interested in serving DDG, Qwant, Ecosia and a lot of other unknown search engines because of the aggregated reach they provide for their ad network. Ad-networks only work if the aggregated audiences are massive, otherwise advertisers do not bother putting their ads there; only the top-3/5 ad networks get to see any action. So Bing wants/needs a bigger audience just to be in the game. They can grow along 2 paths: 1) increase Bing search reach (difficult), or 2) use partners with different value propositions.

Bing charges little per 1K queries: 1 USD officially, but it gets cheaper, down to zero :-) The real thing though, is that if you display Bing ads, you get a 70%-90% rev-share of the ad-revenue, which varies from country to country, something between $5 to $20 per 1K queries.

So, DDG basically gets around $5 to $10 net for each 1K, and can spend all that money on distribution and marketing so that they get even more users. Bing gets the rest of the money and, what's more important, their ad-network continues to be competitive.

Everybody wins, right? :-/

Search is so cheap: $1/1K queries, and people make 2-3 queries per day, so about $1/year/user (average). It makes no economic sense to build an alternative. Unless of course, you are building out of "ideals".
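The back-of-the-envelope number above can be checked with a few lines. This is only the arithmetic from the comment (the $1 per 1K list price and 2 to 3 queries per day); the function name is made up for the sketch.

```python
COST_PER_1K_QUERIES_USD = 1.0  # list price quoted in the comment
DAYS_PER_YEAR = 365


def yearly_cost_per_user(queries_per_day: float) -> float:
    """Cost in USD of buying one user's queries for a year."""
    queries_per_year = queries_per_day * DAYS_PER_YEAR
    return queries_per_year * COST_PER_1K_QUERIES_USD / 1000


for queries_per_day in (2, 3):
    cost = yearly_cost_per_user(queries_per_day)
    print(f"{queries_per_day} queries/day -> about ${cost:.2f} per user per year")
```

At 2 queries per day that is 730 queries a year, and at 3 it is 1,095, so the cost lands between roughly $0.7 and $1.1 per user per year, consistent with the "$1/year/user" figure.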

solso | 5 years ago | on: Cliqz is shutting down

>> You can collect data from users and still do not compromise their privacy

> This is definitionally false. The very collection of data compromises one's privacy, by nature of it having been collected.

That's not definitionally false; if it sounds false to you, it is because you have an implicit assumption that does not apply.

Data from users does not imply user sessions on the collector side (session as a set of multiple data points belonging to the same user).

If sessions are collected, then privacy is impossible to guarantee. We are well aware of that, having worked on these problems for almost 20 years. But that's precisely what Cliqz never did. All messages from our users are record-unlinkable for us, meaning that we have no way to reconstruct any session.

If you are interested, check the HumanWeb posts on https://0x65.dev/ or the papers https://0x65.dev/pages/dissemination-cliqz.html

solso | 5 years ago | on: Cliqz is shutting down

There were no problems in 2017 or before; we were doing exactly the same during Firefox times (we went through security and privacy audits). Data collection is and always was safe wrt privacy.

Why the ruckus then? Because some assume that if data is sent, privacy is compromised, period. They do not know how to do it, and they assume it's impossible. Instead of checking the claims for themselves (code is public, data can be inspected, documentation, etc.) they prefer to stick to their belief system, which is more comfortable and does not imply hard work. The press release from FF -- written by one of these people with a lot of biases and published without review -- did not help as it was misleading.

We made a big mistake back then. Instead of rebutting it, we chose to ignore the FUD assuming that facts would prevail. They did not.

Sadly the community is "scared", we have been congratulated and lauded by anyone who checked our systems. But never endorsed in public, there is little to gain and a lot to lose (you are getting a sneak preview right now).

Sad story, extremely frustrating too, but there is nothing we can do now.

solso | 5 years ago | on: Cliqz is shutting down

This claim on Wikipedia is factually incorrect: "This data is tied to a unique identifier allowing Cliqz to track long-term performance."

Thanks for noticing it, we will create an issue.

UUIDs only apply to telemetry, which is not the data being described in the paragraph: queries, scrolling, amount of time spent, urls, etc. For this kind of user data (HumanWeb) there is no uuid, neither implicit nor explicit.

There are plenty of papers on the topic, independent audits, the code is open-source and the data can be inspected. HumanWeb data is 100% record-unlinkable: we have no way to know if two messages received come from the same person or not.

solso | 5 years ago | on: Cliqz is shutting down

Brave is based on Chrome, whereas Cliqz is based on Firefox (just to be precise). Note that ownership of code is not the same as ownership of a service... if Brave were depending on Google services, then you would be right (which is what happens with the [meta]searchers). But the code is open, and can be forked at will (there are some caveats to that claim: licences, internal APIs, etc.)

You can collect data from users and still not compromise their privacy; it's how you do it that matters, and it becomes a design requirement. Collecting a url visited can lead to building a user history (privacy hazard) or not. It's a design choice. The whole mantra that data!=privacy is doing a lot of damage (for anyone curious, we did publish plenty of material on the topic, https://0x65.dev/blog/2019-12-02/is-data-collection-evil.htm...)

solso | 5 years ago | on: Cliqz is shutting down

Yes, but once you have such a strong dependency it's difficult to remove it. Others have tried the approach and are still stuck with them.

Sorry to hear that the quality was not good for you; it varies from country to country (depending on the user base, basically). For Germany, quality was good enough, and QA analysis on stratified queries backed it up. That being said, perceived quality from a person is not properly reflected in NDCG-like metrics: you do not remember the 9 queries it got right, but the one that was totally off.
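A small sketch of why an averaged NDCG hides that one bad query. The relevance grades are invented, and as a simplification the ideal ranking is taken from the returned results themselves (a real evaluation would use the full judged pool), with an all-irrelevant result list scored as 0 by convention.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of graded relevances in ranked order."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalised by the DCG of the best ordering of the same grades."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical session: 9 queries answered perfectly, 1 totally off.
good_query = [3, 2, 1]  # relevant results in the ideal order
bad_query = [0, 0, 0]   # nothing relevant returned
session = [good_query] * 9 + [bad_query]

mean_ndcg = sum(ndcg(q) for q in session) / len(session)
```

Here `mean_ndcg` comes out at 0.9, which looks healthy on a dashboard, while the person remembers mostly the one query that scored 0.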

In any case, DDG is good, and let me emphasize, they (and others) provide a lot of value to the users, privacy-concerned or otherwise. But the underlying problem is not getting fixed, unless, hopefully someday, they come up with an independent index (let's hope).

solso | 5 years ago | on: Cliqz is shutting down

Just for archive reasons. There are some interesting points worth addressing (IMHO). Of course I worked at Cliqz :-)

"The company only survived because of the investor throw a lot of money". 100% correct, and that speaks greatly about the investor. They believe that Google is a monopoly that needs to fought, as many others. But, instead of (or on top of) bitching and moaning, lobbying, etc. they put good money where their mouth was. Kudos for that.

Privacy was never Cliqz primary product. Privacy was a strict design requirement of Cliqz, which can be marketed more or less. Data collection and browsers alike, we wanted them to be private, because that's the right thing to do, even if it was more difficult to implement. The whole data vs. privacy argument is fallacious. One of the reasons why privacy was so important to us is precisely now, whoever ends up owning the data cannot learn anything about any of the users. Imagine the government getting Google's data if they go belly up or upon "legal" request (change Google by any other company). The data of Cliqz poses no risk to any user, including myself.

The primary product of Cliqz was search, either as the typical result page or instant search integrated on the browser. That's very difficult to build, and expensive, something that DuckDuckGo, Startpage, Qwant, etc. do not have to pay because they rely on the backend of others (not 100%, but mostly). If we were repackaging Bing/Google/Yandex with a different ranking twists, our quality would have been better from the beginning, of course. But that's not building an alternative to Google, which is what we wanted. Still, that's not a pun to DDG and others, what they provide has value to the users, of course. But they are not real alternative, kind of an electric car that gets its electricity from burning coal.

Brave is a great browser, respects to Brendan and team. We both "fight" against Google. For Brave it's Chrome, for Cliqz was both Chrome and Search. Too much to chew? Yes, but we had plenty of fun. The only thing I regret after +6 years working there is the loss of such a great team.

solso | 5 years ago | on: Cliqz is shutting down

Cliqz never masqueraded anything, only in your odd perception of the world. Advertisement as implemented today is a privacy hazard, but there are other ways to do it, client-side, which is what Cliqz attempted. The same goes for data-collection, you can collect all and put the privacy of the users at risk, or collect only signals that cannot be record-linked, which is what Cliqz did.

Cliqz search was never on par with Google -- I built parts of it -- but was getting there little by little. To be more precise, it was getting good enough to not be a factor. That has some merit given the totally independent index (not relying on Bing under the hood).

Brave, the same as Cliqz, is trying its best to offer an alternative. If you think you can do better, please do so. Believe me, I'll root for you regardless of my opinion about you (we crossed paths in the past). Why would I support you, even though that does not mean I use what you build? Because we are in need of having plurality on the Web, the more the better. Unlike you, I do not see the point of speaking bullshit, not sure if out of ignorance or ill-will, don't know, don't care.
