There are some pretty serious privacy issues with this extension.
There should be a way to block domains that you don't want searched, otherwise secret URLs such as YouTube private links, Google Docs, etc. are exposed. One scary thing is that you are potentially exposing them to third parties by searching Reddit, HN, and Google.
This is a pretty good example of why people need to be wary of the Chrome extensions they install. They sit in an advantageous position where, via the background page, they can get away with violating things like SOP and CSP rules (depending on the browser), and more.
That was a major concern of mine. You can set the Research mode to "Off" by default and check URLs on a case-by-case basis and still do custom searches. Admittedly, it takes some of the serendipity out of it, but it addresses your concern.
I considered (and was planning on) adding a "block URL" feature - but the issue of how to store those sensitive URLs (to block) came up. Because localStorage and sync storage in Chrome are not sandboxed or encrypted, the blocked list would be "in the open" to other extensions. Yes, you could hash the URLs you'd hope to block, but then there would be no way to read that list back to the user at a later point in time, and slight mismatches in URL schemes would lead to an imperfect system. So simply toggling Research Mode and researching pages of interest is the best option, IMO.
I don't cache any personal info in localStorage or sync storage (which at least Chrome does not encrypt :< ). The API results are stored in a local variable within the scope of the extension. And the "history" is a hashed and padded blob.
This is also why I released it for both Chrome and FF, since some people assign different use cases to different browsers. The code is also public/open-source.
The problem isn't just private links. Even knowing about the public sites that someone visits can expose information about their politics, religion, sexual orientation and other things that they may want to keep private (especially in countries where they can be discriminated against or even killed based on this information).
One would hope there aren't too many things you genuinely want to keep private where the only protection is an unguessable URL. That's never going to be terribly secure. With Google Docs, you have to go out of your way to create a doc that works that way.
I built something similar as a side project in 2011 - [redacted]. It allowed you to leave notes for a URL which your friends / followers could see when they visited that URL.
In the beginning, when I was testing this with my friends and colleagues, I sent every URL a user visited to the server to check if any of their friends had left any notes, and then alerted them via notification badges. I disabled it when I started seeing a lot of private URLs (like Google Docs links with share access) in server logs. I then changed the extension to query the server only when a user clicks on the extension button.
This made it a bit safer, but the extension still needed access to all the sites a user visits. And with Chrome's auto-updating of extensions, one may never know if the extension author has started sending every URL back to the server again.
After developing such an extension, I am quite suspicious of such extensions and only install extensions from trusted authors (Buffer, Pocket, etc.).
I agree, and I'll say that I'm as pleasantly surprised by the review process Mozilla has for its add-ons as I am dismayed that Chrome has no equivalent process. I'm in the queue for the Firefox review (it takes 10 days on average) and have exchanged emails with their volunteer team on best practices to adopt.
Ultimately it comes down to winning the user's trust, and I'm trying to address as many questions as I can up front.
In response to another comment, I've also un-minified the Chrome extension code and will keep it un-minified going forward (will take up to an hour to propagate [update: fresh installs are now un-minified / and the current-install base will get the update within 6 hours])
It won’t. I figure the amazing APIs made available by Algolia, Reddit, Google News (and hopefully more) are incurring the only ongoing expense, and I’ve done my best to design the extension to respect their needs. My only need is that people get something out of the effort. :)
Are you against making money on projects?
I hope to sell some funny stickers at some point! But in all seriousness, I'd be happy to charge for a product that doesn't depend on others' APIs. Ambitious projects that aren't concerned with revenue are (more than likely) destined to fail -> certainly there are terrific exceptions to this...
I really like this idea of not charging for a service heavily based on other people's APIs. How many nodejs programmers are out there trying to make dirty money off other people's work + CSS? It's shameful.
I disagree. Unless they're breaking licenses or taking credit for someone else's work, it's not dirty money. Is it shameful for Apple to make money from an OS based on FreeBSD?
I had an extension called Deeper History which I shut down out of security/privacy concerns very similar to yours. The solution I came up with half worked: I used https://github.com/travist/jsencrypt to encrypt the sensitive data before storing it in IndexedDB.
The problem was I couldn't get it to work with public keys I created locally. According to jsencrypt's GitHub it should be possible. If you could get it to work, you could give security-conscious people a way to safely cache stuff locally.
Anyway, if it would help to store user info on the client, I just wanted to say there is a viable way forward on that. I have the code to chunk and encrypt stuff on the client if you're interested.
Thanks for sharing your approach. In the end, I decided the API results didn't need to be persisted in storage -- they get stored in a local variable and jettisoned with JavaScript's garbage collection. My primary concern was: to what scope did the data belong? (Check my other comments to see my dissatisfaction with the Chrome dev docs' language on this topic.) I decided that the user history did not need to be precisely known, so my strategy here was to hash the URL, cut the hash to a much shorter string (to increase the likelihood of collisions), pad the shortened hash with a random number of characters on either side, and concatenate that string to a large blob of text. The extension can then perform an indexOf search on that blob to be reasonably sure that the user has been to that URL. This uses localStorage. (This keeps the extension from repeatedly querying commonly visited URLs but also does not prove you were at any given URL, since collisions are expected to happen.)
I really liked StumbleUpon exactly for this feature until they ruined the product during the monetization phase.
Edit: Having comments is really important. Maybe mark them by color of source (blue for Reddit and orange for Hacker News) and separate them into submission sections. Also important is the preservation of the original tree structure of the submission comments.
Great idea! I immediately thought "Why didn't I think of that!?"
With regards to the privacy concerns of Research mode, there may be a solution. For sites like Reddit, it should be possible to build a Bloom filter. Have the metafruit server actively spider Reddit for new, popular threads and add them to a Bloom filter. The plugin would download the Bloom filter from the metafruit server at some regular interval. That way, checking whether any particular URL has an associated conversation is just a local operation. Plus, it's faster than pinging an API and burns less of the target API's resources.
That would also provide a way to monetize, by giving out the metafruit Bloom filter to subscribers only. Or perhaps the free plugin can update its Bloom filter once a day, but subscribers can update once per hour.
I might enjoy using this. But, PRIVACY! Sending back every visited URL has never been ok for any reason, first time I saw this idea shot down was in '93.
But there might be a way out:
I'd be willing to give up privacy of URL hashes. This is how I'd do it:
- you already track a set of URLs that have discussions (I assume). If not, you need to figure out how to seed these. Volunteers, APIs....
- hash these URLs on server, and use a not-too-unique hash function. You want to end up with a high collision rate, but not too high.
- now, the client can query for conversations without revealing the URL it has visited:
- ask the server whether there are any conversations for a particular hash.
- if the server finds any, it returns { pageUrl: '', conversationUrls: [] }
- now the client can decide whether the URL really matches, or it was just a random hash collision.
- I know this is not perfect. A privacy-busting, determined enemy could generate hashes of a large number of public sites and use statistics to infer what sites you've visited just from your hashes. But it'd be good enough for me.
Bonus money-making idea:
- offer your plugin as a paid service to different web communities. Increases their "community engagement".
I seriously contemplated starting an "annotations" startup in the 90s. Someone else did, and they folded after a few years.
Then there has to be a whole backend infrastructure, though. Lots more time/effort/money involved in that solution. Right now there's no recurring cost for the developer.
This extension appears to be better made and has more features. It's nice to see discussion on HN too.
It does not work reliably. I clicked on a bunch of links from the HN front page. Reddit Check did find that they had been posted to Reddit before, but Kiwi did not. However, all of those links had only been posted once and had no discussion. Still, it seems strange it would say they had never been posted before.
It was also unable to find YouTube videos that had been posted before. YouTube is terrible at unique URLs, and I don't blame it. However, Reddit Check is able to find all the different places YouTube videos have been posted.
I found a link that had been posted to reddit hundreds of times. It only found 11 results. There was also an option for "fuzzy matches" which included a few more links to the exact same URL, but also links that had nothing to do with it. Reddit Check also has a problem where it only returns the first 25 results.
Clicking on any of the links closes the menu, so you can't open many links in new tabs at once. This is also a problem with Reddit Check.
It does not find http versions of https links. Also a problem with Reddit Check.
Clicking on the "submit to reddit" option opened a submission page, but not with the URL in it.
I tried to look at the code but it was all squished together. It does not appear possible to modify it anyway.
Anyway none of these are dealbreakers. I will be using this extension alongside Reddit Check due to the extra features it has. I am concerned about sending so many requests to reddit every time I open a new tab though.
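Several of the misses above (http versions of https links, YouTube's many URL variants) come down to URL canonicalization. A hypothetical normalization pass, just a sketch (the tracking-parameter allowlist is an assumption, not anything either extension actually does):

```javascript
// Map common URL variants (scheme, www., trailing slash, fragments, default
// ports, tracking params) onto one canonical key before querying the APIs.
function canonicalize(raw) {
  const u = new URL(raw);
  u.protocol = 'https:';                          // treat http and https alike
  u.hostname = u.hostname.replace(/^www\./, '');
  u.hash = '';                                    // fragments never reach servers
  u.port = '';
  // Hypothetical allowlist of tracking params to strip.
  for (const p of ['utm_source', 'utm_medium', 'utm_campaign', 'ref']) {
    u.searchParams.delete(p);
  }
  let s = u.toString();
  if (s.endsWith('/')) s = s.slice(0, -1);        // drop trailing slash
  return s;
}
```

Checking both the canonical form and the raw form against each API would catch the http/https duplicates without losing exact matches.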
Terrific comment, thank you. First, I've un-minified the Chrome extension code (only added 2kb in size), and it will remain that way going forward. It will take up to an hour for it to propagate to the Chrome Web Store, but the Firefox extension code is un-minified currently: https://addons.mozilla.org/en-US/firefox/addon/kiwi-conversa... (thanks to their review process, which requires it not be minified). The Chrome code is also available on GitHub: https://github.com/sdailey/kiwi
As for the results vs Reddit Check -- maybe Reddit Check uses a home-rolled API that crawls more frequently than Reddit's official one? Could you either tweet me the specific links or reply to this comment?
3 days later: I have responded with an update to the extension that addresses the privacy concerns here. Whitelists have been implemented, privacy defaults have changed to start with Research Mode 'off', and commenter/Houshalter's problems were fixed. Now Kiwi can fetch Reddit posts that have been hidden by moderators. Full changelog report here: http://www.metafruit.com/kiwi/changelog/2015/08/06/kiwi-conv...
Does the searching happen on your machine (scraping Google search results by crafting a URL query), or does it get routed to a central server that we are forced to trust? If the latter, there's no way in hell this is going to be popular around here.
[0] - Reddit - https://github.com/reddit/reddit/wiki/API
[1] - HN - https://hn.algolia.com/api
[2] - https://developers.google.com/news-search/v1/devguide#gettin...
Also, it can be set to search a-la-carte by toggling Research Mode.