jsrfded's comments

jsrfded | 14 years ago | on: Mozilla accelerates search navigation with blekko | blekko

This was a skunkworks project from a 2-person team in Mozilla Labs doing some thinking around how to get user keystrokes out of the browser-navigation workflow. The extension can be slightly laggy, but when it hits, the preview is cool.

jsrfded | 14 years ago | on: Searching the bottom of the web

It's easy, just add the site you want to delete from the web to your /spam list or the /slashtag that you're negating out of the results.

jsrfded | 14 years ago | on: The Anatomy Of Search Technology: Blekko’s NoSQL Database

Partitions are based on a hash of the primary key. The number of buckets in the system has to be a power of 2. But we can split buckets to increase the number, or even have some buckets that have split and some that haven't yet. Each bucket is stored on 3 separate servers (and the assignment makes sure the three servers are on separate racks).
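
A rough sketch of how that kind of partitioning could look. This is not blekko's code: the SHA-1 hash, the `BucketMap` class, and the `place_replicas` helper are all assumptions made up for illustration. A bucket is named by how many low hash bits it covers, so a split bucket and an unsplit one can coexist.

```python
import hashlib

def key_hash(key: str) -> int:
    # Assumed hash function; any stable hash of the primary key would do.
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class BucketMap:
    """Hypothetical bucket table: each bucket is (bits, index), where index
    is the low `bits` bits of the key hash."""

    def __init__(self, initial_bits: int):
        # Start with 2**initial_bits buckets, all at the same depth.
        self.buckets = {(initial_bits, i) for i in range(1 << initial_bits)}
        self.max_bits = initial_bits

    def split(self, bits: int, index: int) -> None:
        # Replace one bucket with two children that look at one more hash bit.
        self.buckets.remove((bits, index))
        self.buckets.add((bits + 1, index))
        self.buckets.add((bits + 1, index | (1 << bits)))
        self.max_bits = max(self.max_bits, bits + 1)

    def locate(self, key: str):
        # Try the deepest split first, then fall back to shallower buckets.
        h = key_hash(key)
        for bits in range(self.max_bits, -1, -1):
            candidate = (bits, h & ((1 << bits) - 1))
            if candidate in self.buckets:
                return candidate
        raise KeyError(key)

def place_replicas(bucket, servers_by_rack, copies=3):
    # Pick `copies` distinct racks, then one server from each rack.
    racks = sorted(servers_by_rack)
    if len(racks) < copies:
        raise ValueError("need at least one rack per replica")
    start = key_hash(repr(bucket)) % len(racks)
    chosen = [racks[(start + i) % len(racks)] for i in range(copies)]
    return [
        servers_by_rack[r][key_hash(repr((bucket, r))) % len(servers_by_rack[r])]
        for r in chosen
    ]
```

Every key maps to exactly one live bucket whether or not its neighbours have split, which is what lets the bucket count grow one bucket at a time.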

jsrfded | 14 years ago | on: The Anatomy Of Search Technology: Blekko’s NoSQL Database

Paxos would be good for electing a master, but we wanted to avoid having any masters in the architecture. There are also scenarios where Paxos can be slow or fail to reach a consensus. We wanted high availability from each node in the cluster regardless of whether 2/3 of the rest of the cluster were down or unreachable; both parts of a partitioned cluster should also be able to continue to function as best they could.

Individual nodes can often make "personal" decisions about what to do in suboptimal situations. If you can answer an incoming request, even with partial or out-of-date data, do so; it's better than not replying. For the repair agent, each node can see its own view of "holes" in the 3-level replication, and offer to make copies of buckets that have fewer than 3 copies to bring them back up to three.
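
To make the repair idea concrete, here is a minimal sketch of one node's local decision. The `view` mapping, `find_holes`, and `repair_offers` are hypothetical names, and the sketch ignores rack placement and any coordination between nodes that make the same offer.

```python
def find_holes(view, target=3):
    """view maps bucket -> set of servers this node believes hold a copy.
    Returns bucket -> number of missing copies, from this node's view only."""
    return {
        bucket: target - len(holders)
        for bucket, holders in view.items()
        if len(holders) < target
    }

def repair_offers(node, view, target=3):
    # A node offers to host any under-replicated bucket it does not already hold.
    return sorted(
        bucket
        for bucket, holders in view.items()
        if len(holders) < target and node not in holders
    )

# Example view as seen by one node; other nodes may see something different.
view = {
    "b0": {"n1", "n2", "n3"},
    "b1": {"n1", "n2"},
    "b2": {"n4"},
}
```

No master is consulted here: each node acts on what it can see, which matches the "personal decisions" approach even if two partitions see different holes.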

jsrfded | 14 years ago | on: The Anatomy Of Search Technology: Blekko’s NoSQL Database

Within the datastore, there are 3 copies of each piece of data. When a get() request is made, it goes out to the "closest" copy; if no answer comes back within some threshold, a 2nd request is made to one of the other replicas. Whoever gets the data back first wins.
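
A small sketch of that pattern using Python threads. The `hedged_get` function, the `hedge_after` threshold, and the toy replicas are assumptions for illustration, not the datastore's real API. The sketch also keeps trying further replicas if more than two are supplied.

```python
import concurrent.futures as cf
import time

def hedged_get(key, replicas, hedge_after=0.05):
    """replicas: callables ordered closest first, each taking a key.
    Ask the closest one; if it is slow or fails, ask the next one too."""
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    try:
        pending = {pool.submit(replicas[0], key)}
        next_replica = 1
        while pending:
            # Only use the threshold while there is another replica left to try.
            timeout = hedge_after if next_replica < len(replicas) else None
            done, pending = cf.wait(
                pending, timeout=timeout, return_when=cf.FIRST_COMPLETED
            )
            for fut in done:
                if fut.exception() is None:
                    return fut.result()  # first successful answer wins
            # Threshold passed, or everything in flight failed: ask another copy.
            if next_replica < len(replicas) and (not done or not pending):
                pending.add(pool.submit(replicas[next_replica], key))
                next_replica += 1
        raise RuntimeError("no replica answered")
    finally:
        pool.shutdown(wait=False)  # do not block on the slower request

# Toy replicas standing in for remote servers.
def slow_replica(key):
    time.sleep(0.5)
    return "slow:" + key

def fast_replica(key):
    return "fast:" + key

def broken_replica(key):
    raise IOError("replica unreachable")
```

The losing request is simply left to finish in the background; a real client would probably want to cancel it.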

jsrfded | 14 years ago | on: The Anatomy Of Search Technology: Blekko’s NoSQL Database

Greg has planned a whole series about the combinator architecture behind blekko's datastore. Greg and I have both presented aspects of the system at various conferences, but we're happy to chat about it with you directly too. I think this might be the first time it's been published on the web though.

jsrfded | 15 years ago | on: Random Hacker News

The other day I "ran out" of stuff to read on Hacker News. I had looked at everything that interested me, and had even checked out page 2 (I was getting desperate).

I realized that there were thousands of great HN threads that I hadn't seen because I hadn't been paying attention to the site when they were ranking.

So I pulled together a little db of the top 10,000 HN threads (loosely defined; a thread with >1 points, >1 comments, and some web link rank).

I put these into a random shuffle so that reload would give me 30 fresh threads that I (probably) hadn't seen before.

I'm pretty happy with this. Lets me scratch my HN itch when I've exhausted the main page, and it's often interesting to see the old material again.
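
For the curious, the filter and shuffle could be sketched roughly like this. The field names and the `random_page` helper are made up, the filter is read as more than one point and more than one comment, and the "web link rank" part is left out because it isn't specified.

```python
import random

def eligible(thread):
    # Assumed reading of the filter: more than 1 point and more than 1 comment.
    return thread["points"] > 1 and thread["comments"] > 1

def random_page(threads, page_size=30, rng=random):
    # Each reload draws a fresh random page without repeats within the page.
    pool = [t for t in threads if eligible(t)]
    return rng.sample(pool, min(page_size, len(pool)))
```

With 10,000 threads in the pool and 30 per page, most of any given reload will be threads the reader has not seen yet.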

jsrfded | 15 years ago | on: Why We Desperately Need a New (and Better) Google

blekko has that feature. Every result has a "spam" button underneath it. Click the link, and the host will be added to your personal /spam slashtag. Everything on your /spam list gets negated from all of your results by default.

Very handy. I put ehow.com on mine and never see results from them.
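
Conceptually, negating a /spam list from results is a host filter. A minimal sketch, assuming results are plain URLs and the list is a set of hostnames (`filter_results` is a hypothetical helper, not blekko's implementation):

```python
from urllib.parse import urlparse

def filter_results(result_urls, spam_hosts):
    # Drop any result whose host is on the user's personal spam list.
    blocked = {h.lower() for h in spam_hosts}
    kept = []
    for url in result_urls:
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.ehow.com and ehow.com as the same host
        if host not in blocked:
            kept.append(url)
    return kept
```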

jsrfded | 15 years ago | on: Blekko partners with DuckDuckGo

We had the same debate at blekko. We nearly set up a blog.blekko.com, but corporate blogs can get so boring and impersonal. Writing on skrenta.com as the ceo of blekko helps me keep the tone more direct and avoid simply pushing the businesswire release out. In some of my recent posts I've tried to tell stories about projects we did around the launch.

Sure, you can do that on a corporate blog too. But something about them, maybe the multiple authorship, or the fact that it is a company blogging and not an individual...I don't know, I don't tend to read a lot of corp blogs.

jsrfded | 15 years ago | on: Blekko partners with DuckDuckGo

Really good feedback - thank you. You would think I would actually link to blekko.com in my post about our new partnership. The post could definitely have used some more expository material about who we are and what we're doing. That stuff gets added by default to the press release, but I assume skrenta.com has a pretty niche audience and that anyone on my blog already knows who I am and what we're doing....bad assumption obviously and I'll take that into account for future posts.

jsrfded | 15 years ago | on: Anatomy of blekko's press launch

(I'm posting a comment I initially was privately drafting for ryan in an email.)

I posted the article - and included the embargo paras, which my co-founder and I nearly cut - because I thought the backstory would be useful/interesting to the folks there, who seemed to be unaware of the pr process during the prerelease of blekko. I wanted to open that up for them.

Your comment was spot-on good advice for the ycomb co's though. I voted it up.

Really irked that WSJ broke our embargo. Irritated that I wasn't in the office when our site went live after 3.33 years, irritated that we didn't get to do the last bug-fix push to production, irritated that I knew TC wouldn't post, irritated that other journos would be irritated with me, irritated that it flattened the temporal curve on the launch pop. And for what?

Time-sync on stories is actually a good thing for the news stream. I don't see why journos don't get that.

jsrfded | 15 years ago | on: Anatomy of blekko's press launch

btw, when the WSJ broke our embargo, I was on my way into the office. We were planning to get there around 3-4pm for the 9pm PT launch. Some folks were already there and turned the site live since the first press had gone up.

But techcrunch has a policy of not posting their story if an embargo is broken, so we didn't get the TC story that we had briefed them on.

jsrfded | 15 years ago | on: Blekko is alive

We have our own crawl/index/serve technology end-to-end. We have a 3 billion page web crawl, a machine-learning trained ranker, and then the slashtag vertical features. Since BOSS gives us an additional 20-40B pages for very long tail queries, we fall into /yahoo if we don't have any of our own results.

We're auto-firing slashtags for certain regular queries now, e.g. [cure for headaches] will auto-fire /health, [industrial design colleges] will auto-fire /colleges. We're doing this initially for health, lyrics, colleges, autos, hotels, recipes, and personal finance.

Getting the crap from sites like ehow out of the results and pushing results into a curated set of high-quality sites for queries in spammy categories really cleans up the results there.

jsrfded | 15 years ago | on: Blekko is alive

This is Monday press... The press was supposed to go live at midnight on Nov 1 (so you can be in the papers on Monday). The Wall Street Journal broke our embargo by 5 hours.