Full MongoDB database dump of the Blippex search engine

23 points | karli | 12 years ago | blippex.github.io

13 comments

mgamache|12 years ago

So it's a bunch of internet URLs without content or content metadata?

bkanber|12 years ago

Seems that way. I mean, that'd still be valuable and interesting for other reasons, but let's not call it a "dump of a search engine". There's nothing in there that's actually searchable!

Still, nice gesture by Blippex. Somebody will find something interesting to do with this, even if they just use it educationally.

geraldbaeck|12 years ago

Yes, we will add metadata in the next dump, but currently the time_spent is all that matters.

Our plans are to add categories, the language and rough information about the content type (video, image, etc).

Gerald, CTO Blippex

rgiar|12 years ago

so this is just when a given site was crawled?

  "_id": "b919f02c8f053c41e8ee86311ca9b0f6",
  "url": "https://www.example.com/",
  "host": "www.example.com",
  "root": "example.com",
  "time_spent": [
    {
      "sec": 45,
      "seen_at": ISODate("2013-06-23T00:41:44.0Z")
    },
    {
      "sec": 5,
      "seen_at": ISODate("2013-07-01T14:41:44.0Z")
    }
  ]

karli|12 years ago

Hi,

Yes, as the blog post says, the only thing missing is the full text of each page for indexing and searching. We don't dare release it because of copyright issues ("hey, you're distributing the full text of my page!").

With this data you could, for example, build a new Alexa and find out which page was the most visited last week :)
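
A rough sketch of that "most visited page last week" idea, assuming each document has the `url` and `time_spent` fields from the sample posted earlier in the thread, and that every `time_spent` entry counts as one visit. The function name `most_visited`, the second sample document, and the reference date are made up for illustration; this is not how Blippex itself ranks anything.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def most_visited(docs, now, days=7, top=3):
    """Rank URLs by the number of time_spent entries seen in the last `days` days."""
    cutoff = now - timedelta(days=days)
    visits = Counter()
    for doc in docs:
        for entry in doc.get("time_spent", []):
            # Assumption: one time_spent entry equals one visit.
            if cutoff <= entry["seen_at"] <= now:
                visits[doc["url"]] += 1
    return visits.most_common(top)


UTC = timezone.utc
docs = [
    {   # values taken from the sample document in this thread
        "url": "https://www.example.com/",
        "time_spent": [
            {"sec": 45, "seen_at": datetime(2013, 6, 23, 0, 41, 44, tzinfo=UTC)},
            {"sec": 5, "seen_at": datetime(2013, 7, 1, 14, 41, 44, tzinfo=UTC)},
        ],
    },
    {   # hypothetical second document, invented for the example
        "url": "https://www.example.org/",
        "time_spent": [
            {"sec": 30, "seen_at": datetime(2013, 7, 2, 9, 0, 0, tzinfo=UTC)},
            {"sec": 12, "seen_at": datetime(2013, 7, 2, 18, 30, 0, tzinfo=UTC)},
        ],
    },
]

now = datetime(2013, 7, 3, tzinfo=UTC)  # hypothetical "today"
for url, count in most_visited(docs, now):
    print(url, count)
```

Summing `sec` instead of counting entries would rank pages by attention rather than by visits; on the full dump the same aggregation would probably be better done inside MongoDB than in a Python loop.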

itsmeduncan|12 years ago

This will be fun as a list of places to try out 0-day exploits on.