
Greplin: 1.5 Billion Documents Indexed, Six Engineers

86 points | jamesjyu | 15 years ago | techcrunch.com

30 comments

[+] pw|15 years ago|reply
Makes me think of this Quora question: Is it true that size of the portion of the web that Google indexes is actually smaller than sum of sizes of the contents of everyone's Gmail? (http://www.quora.com/Is-it-true-that-size-of-the-portion-of-...)

Is it fair to say that the size of the "private" web (what Greplin aims to index) is, in aggregate, larger than the public web? And are there any amazing things that become possible once you've indexed a large portion of that private web?

[+] aik|15 years ago|reply
Good question. As I understand it, Greplin has access to all of this private information, which gives them an incredible amount of power -- in fact, more than Google, which only holds one or two aspects of it (albeit large ones). Just for the sake of privacy, imagine Greplin agreeing to give up private user information to the gov't, just like all these other companies. They'd have access to everything.

Scares me a bit too much to sign up for the convenience.

[+] Tibbes|15 years ago|reply
I'm guessing that: a "document" on Twitter is a single tweet; a "document" on Facebook is a wall-post or equivalent; a "document" on GMail is an e-mail; a "document" on Google Calendar is an appointment.

Therefore, the comparison with Google's web-wide index in 2001 is a little misleading (in terms of the amount of data), given that the average size of a web page is larger than any of these.

Of course, the average size of a file on Dropbox is likely to be larger than a webpage. I wonder what percentage of those 1.5 billion documents are files on Dropbox.

[+] tsycho|15 years ago|reply
Greplin doesn't index the content of files on Dropbox, just the filenames.

I am building a startup that does that, i.e. it indexes your doc/pdf files (more formats coming) and allows you to instantly search through them. It's called grepfiles.com, but it is at a very early stage (pre-alpha), so go easy on it since I am not sure how well it scales. Mail me at [email protected] if you have any feedback. Would really appreciate it.

[+] rakkhi|15 years ago|reply
What does the HN community think of the greplin concept? They have recently added a Chrome plugin and a greplin search replacement for standard email search.

I think it is a public beta and anyone can sign up; if not, ping me and I'll send you an invite.

My main concerns with the service are:

+ Centralized risk - keys to a very valuable kingdom
+ No two-factor - but they tell me it's coming
+ No word on whether they encrypt in storage - although it should only be an index of the information rather than the actual info
+ Standard SaaS / cloud risks - internal abuse, legal turnover, etc.

Any others? All of these could be mitigated to a reasonable degree. What do you think? Is there a future for this type of service (or big buyout for Google / Bing) or is it just too scary?

[+] oscilloscope|15 years ago|reply
It's an incredible amount of personal data. If all that data was collected, then abused, I'd dissociate from much of my identity. I would just feel totally alienated by post-industrial society.

I'd be okay using Greplin if I knew Google was going to acquire them. I trust Google. I figure when Google goes bad, there will be much bigger issues facing humanity and our internet pasts will all be damning anyways.

[+] webmonkeyuk|15 years ago|reply
1.5B docs by just six people is impressive but I suspect that computers did a bunch of the indexing work.
[+] itgoon|15 years ago|reply
Ha!

Just the logistics of _handling_ 1.5B docs would keep six people pretty damn busy.

[+] aik|15 years ago|reply
I would get seriously excited about this if I could install it on my own server and keep my own index. I'm a bit hesitant to give them access to all the data in all my accounts in exchange for a small convenience.

It is pretty impressive, though saying that it launched in February is misleading. I signed up last year, ran into a bunch of problems with it not indexing anything, and haven't opened it since. Now it looks like everything actually has been indexed, which is cool. I'm deleting my account for now though, as it doesn't yet seem easy enough to be useful for my purposes.

[+] lehmannro|15 years ago|reply
I regularly hear that "if I could install it on my own server" argument, and I wonder whether you think you can handle security and administration better than someone who is paid to do it. I, for one, can't, and would not want to waste my time on it.
[+] g123g|15 years ago|reply
Big Deal?

With cloud providers like Amazon offering computing power on a pay-as-you-go basis, I am not sure why this is news nowadays.

Some ridiculous comparisons are thrown about in the article -

same size as Google’s web-wide index in 2001

60 times the size of Google’s original 1998 index

I am not sure how to process and make sense of these comparisons.

[+] mlinsey|15 years ago|reply
It's a big deal because:

(a) It's a proxy for traction. Greplin indexes data that can't be crawled; users have to authorize it to index their data. So aside from how hard of an engineering feat it is, the fact that they've indexed this much data probably means that they have a sizable number of users.

(b) While you're right that the technical challenge of indexing that many documents is easier now than in 2001 thanks to things like AWS (and numerous open source projects), to do it with a team of six is still impressive.

[+] dacort|15 years ago|reply
While I understand that real-time full-text indexing is a much more difficult problem to solve, I've got just under 1.5 billion tweets "indexed" in TweetStats. And I'm one person.

Granted, given the 30MM/day number, they must be growing that index very quickly, and they likely hit that 1.5B mark pretty darn quickly.

[+] moe|15 years ago|reply
"real-time full-text indexing is a much more difficult problem to solve"

Solve?

Greplin has probably not built their own search technology. I'd guess they're simply running Lucene or Sphinx like everyone else.
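For readers unfamiliar with what engines like Lucene and Sphinx actually do, here is a toy sketch of the inverted-index idea they are built on. This is purely illustrative; nothing here is Greplin's actual stack, and real engines add ranking, tokenization, and on-disk posting formats.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: maps each term to the set of documents containing it."""
    def __init__(self):
        self.postings = defaultdict(set)  # term -> {doc_id, ...}
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: intersect the posting sets of all query terms
        terms = query.lower().split()
        if not terms:
            return set()
        result = self.postings[terms[0]].copy()
        for term in terms[1:]:
            result &= self.postings[term]
        return result

idx = TinyIndex()
idx.add("tweet:1", "shipping the new search feature today")
idx.add("mail:7", "meeting notes about the search roadmap")
print(sorted(idx.search("search")))  # both documents match
```

A query only touches the posting sets for its terms, not every document, which is why lookups stay fast even as the corpus grows.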

Their index is still small by search standards, as you can tell from TechCrunch having to reach 10 years back to make an "impressive" analogy.

Today, 1.5 billion documents translates to a couple of terabytes of data (probably high single digits). 30 million documents indexed per day translates to roughly 350/sec. You could store and process all of that on a single, beefy box. Or you could spread it out over a couple of Amazon instances.

But yes, in 2001 this would have been impressive. In 2001 you'd pay $150 for a 40 GB harddrive...

[+] ww520|15 years ago|reply
While Greplin is impressive, it's not on the same scale as Google, even in its early days. Google built a large global index for everyone, while Greplin builds many small indices, one per user. Some calculation illustrates the point.

Google's global index: 1 billion documents. Searchable by 1 million users. Need to support 1B x 1M search capacity.

Greplin's individual indices: 1000 documents/user for each individual index. With 1 million users, there are 1B documents total. Each user only searches his 1K index. Only need to support 1K x 1M search capacity.

It's orders of magnitude difference.
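Plugging ww520's own numbers in (one hypothetical query per user, counting document-scans and ignoring constant factors):

```python
users = 1_000_000

# Google-style: one global index, every query scans all of it
global_docs = 1_000_000_000
google_work = global_docs * users      # doc-scans if each user runs one query

# Greplin-style: one small per-user index, same 1B-document total corpus
docs_per_user = 1_000
greplin_work = docs_per_user * users   # each query only scans that user's 1K docs

print(google_work // greplin_work)     # a factor of 1,000,000
```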

[+] B-Scan|15 years ago|reply
250M documents per engineer. Not bad at all.
[+] lennexz|15 years ago|reply
At 19, this young man is already doing big things. I haven't tried Greplin yet, but I think it has a very bright future.
[+] mindotus|15 years ago|reply
Agreed and very impressive indeed.

I'm sure we'll be hearing much more from these guys.