Is it fair to say that the size of the "private" web (what Greplin aims to index) is, in aggregate, larger than the public web? And are there any amazing things that become possible once you've indexed a large portion of that private web?
Good question. As I understand it, since Greplin has access to all of this private information, they have an incredible amount of power -- in fact more than Google, which only has one or two aspects of it (albeit large ones). Just for the sake of privacy, imagine Greplin agreeing to give up private user information to the government, just like all these other companies have. They'd have access to everything.
Scares me a bit too much to sign up for the convenience.
I'm guessing that: a "document" on Twitter is a single tweet;
a "document" on Facebook is a wall-post or equivalent;
a "document" on GMail is an e-mail;
a "document" on Google Calendar is an appointment.
Therefore, the comparison with Google’s web-wide index in 2001 is a little misleading (in terms of the amount of data), given that the average size of a web page is larger than any of these.
Of course, the average size of a file on Dropbox is likely to be larger than that of a webpage. I wonder what percentage of those 1.5 billion documents are files on Dropbox.
Greplin doesn't index the content of files on Dropbox, just the filenames.
I am building a startup that does exactly that, i.e. it indexes your doc/pdf files (more formats coming) and allows you to instantly search through them. It's called grepfiles.com, but it's at a very early stage (pre-alpha), so go easy on it since I am not sure how well it scales. Mail me at [email protected] if you have any feedback. I would really appreciate it.
What does the HN community think of the Greplin concept? They have recently added a Chrome plugin and a Greplin search replacement for standard email search.
I think it's a public beta and anyone can sign up; if not, ping me and I'll send you an invite.
My main concerns with the service are:
+ Centralized risk - keys to a very valuable kingdom
+ No two-factor authentication - but they tell me it's coming
+ No word on whether they encrypt data at rest - although it should only be an index of the information rather than the actual info
+ Standard SaaS / cloud risks - internal abuse, legal turnover, etc.
Any others? All of these could be mitigated to a reasonable degree. What do you think? Is there a future for this type of service (or big buyout for Google / Bing) or is it just too scary?
It's an incredible amount of personal data. If all that data was collected, then abused, I'd dissociate from much of my identity. I would just feel totally alienated by post-industrial society.
I'd be okay using Greplin if I knew Google was going to acquire them. I trust Google. I figure when Google goes bad, there will be much bigger issues facing humanity and our internet pasts will all be damning anyways.
I would get seriously excited about this if I could install it on my own server and keep my own index. I'm a bit hesitant to give them access to all my accounts and all my data in exchange for a small convenience.
It is pretty impressive, though saying that it launched in February is misleading. I signed up last year, ran into a bunch of problems with it not indexing anything, and haven't opened it since. Now it looks like everything actually has been indexed, which is cool. I'm deleting my account for now though, as it doesn't yet seem easy enough to be useful for my purposes.
I regularly hear that "if I could install it on my own server" argument, and I wonder whether you think you can handle security and administration better than someone who's paid to do it. I, for one, can't and wouldn't want to waste my time on it.
(a) It's a proxy for traction. Greplin indexes data that can't be crawled; users have to authorize it to index their data. So aside from how hard of an engineering feat it is, the fact that they've indexed this much data probably means that they have a sizable number of users.
(b) While you're right that the technical challenge of indexing that many documents is easier now than in 2001 thanks to things like AWS (and numerous open source projects), to do it with a team of six is still impressive.
While I understand that real-time full-text indexing is a much more difficult problem to solve, I've got just under 1.5 billion tweets "indexed" in TweetStats. And I'm one person.
Granted, given the 30MM/day number they must be growing that index very quickly, and they likely hit that 1.5 billion mark pretty darn quickly.
"real-time full-text indexing [is a] much more difficult problem to solve"
Solve?
Greplin has probably not built their own search technology. I'd guess they're simply running Lucene or Sphinx like everyone else.
Their index is still small by search standards, as you can tell from TechCrunch having to reach 10 years back to make an "impressive" analogy.
Today, 1.5 billion documents translates to a few terabytes of data (probably high single digits). 30 million documents indexed per day translates to roughly 400/sec. You could store and process all of that on a single beefy box, or spread it out over a couple of Amazon instances.
But yes, in 2001 this would have been impressive. In 2001 you'd pay $150 for a 40 GB hard drive...
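The arithmetic above is easy to sanity-check. A minimal sketch, assuming an average document size of around 5 KB (my guess, not a figure from anywhere in this thread; tweets and calendar entries are far smaller, Dropbox files far larger):

```python
# Back-of-envelope check of the index-size and ingest-rate claims above.
# ASSUMPTION: the average stored document is ~5 KB; this is a guess,
# not a number from Greplin or the article.

DOCS = 1_500_000_000          # 1.5 billion documents indexed
AVG_DOC_BYTES = 5 * 1024      # assumed average document size
DOCS_PER_DAY = 30_000_000     # 30 million new documents per day

total_tb = DOCS * AVG_DOC_BYTES / 1024**4
docs_per_sec = DOCS_PER_DAY / (24 * 60 * 60)

print(f"total raw data: ~{total_tb:.1f} TB")       # prints ~7.0 TB
print(f"ingest rate: ~{docs_per_sec:.0f} docs/sec")  # prints ~347 docs/sec
```

Under that assumption the totals do land in "high single digit terabytes" and a few hundred documents per second, which is well within reach of one well-provisioned machine.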
While Greplin is impressive, it's not on the same scale as Google, even in Google's early days. Google built one large global index for everyone, while Greplin builds many small indices, one per user. A quick calculation illustrates the point.
Google's global index: 1 billion documents. Searchable by 1 million users. Need to support 1B x 1M search capacity.
Greplin's individual indices: 1000 documents/user for each individual index. With 1 million users, there are 1B documents total. Each user only searches his 1K index. Only need to support 1K x 1M search capacity.
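That asymmetry is easy to make concrete. A quick sketch using the hypothetical figures from the comment above (1B documents, 1M users, 1K documents per user), where "capacity" counts searchable document-user pairs:

```python
# Compare "search capacity" (searchable doc-user pairs) for a global
# index vs. per-user indices, using the hypothetical numbers above.

USERS = 1_000_000

# Google-style global index: every user can search every document.
google_docs = 1_000_000_000
google_capacity = google_docs * USERS        # 10**15 pairs

# Greplin-style per-user indices: each user searches only their own docs.
docs_per_user = 1_000
greplin_capacity = docs_per_user * USERS     # 10**9 pairs

print(f"ratio: {google_capacity // greplin_capacity:,}x")  # prints 1,000,000x
```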
Just the logistics of _handling_ 1.5B docs would keep six people pretty damn busy.
With cloud providers like Amazon offering computing power on a pay-as-you-go basis, I am not sure why this is news these days.
Some ridiculous comparisons are thrown about in the article -
"same size as Google’s web-wide index in 2001"
"60 times the size of Google’s original 1998 index"
I am not sure how to process and make sense of these comparisons.
It's an orders-of-magnitude difference.
I'm sure we'll be hearing much more from these guys.