gane5h's comments

gane5h | 10 years ago | on: Odd ways to zeroing some x86_64 registers

I've used this in the past, in high performance math.

If you have data (vectors, matrices, etc.) that doesn't fit neatly into a SIMD block size, you'll have to zero out fields after the calculation. At this point, it's cheaper to generate a zero in the register than to load one from memory (cheaper as in the number of CPU instructions).
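A rough illustration of the tail problem, in plain Python rather than real SIMD code. The lane width of 4 and the "add one, then sum" calculation are assumptions picked only to show why the unused lanes need zeroing after the calculation:

```python
LANES = 4  # assumed SIMD width, e.g. four 32-bit floats in a 128-bit register


def pad_to_lanes(values, lanes=LANES):
    """Pad with zeros so the length is a multiple of the lane count."""
    padding = (lanes - len(values) % lanes) % lanes
    return list(values) + [0.0] * padding


def add_one_then_sum(values, lanes=LANES):
    """Add 1.0 to every element block by block, then sum the results."""
    padded = pad_to_lanes(values, lanes)
    total = 0.0
    for start in range(0, len(padded), lanes):
        # The "vector" operation touches every lane, including padding lanes.
        block = [x + 1.0 for x in padded[start:start + lanes]]
        # Padding lanes now hold 1.0, so zero them before the horizontal sum.
        valid = min(lanes, len(values) - start)
        block[valid:] = [0.0] * (lanes - valid)
        total += sum(block)
    return total
```

In real code the zeroing step would be a register operation (for example, XORing a register with itself) rather than a load of zeros from memory.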

gane5h | 10 years ago | on: Thoughts on Time-series Databases

We store our event stream data in Elasticsearch. Two features that made it appealing:

  * the ingest-side can be scaled up by adding more shards
  * the query-side can be scaled up by adding more replicas

For rollup analytics, we make heavy use of Elasticsearch's aggregation framework to compute daily/weekly/monthly/quarterly active users.
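A sketch of what such an active-users rollup might look like as a query body. The field names `user_id` and `timestamp` are assumptions, not our actual schema, and `calendar_interval` is the parameter name in recent Elasticsearch versions (older releases used `interval`):

```python
def active_users_query(interval="day", user_field="user_id", time_field="timestamp"):
    """Build a query body that counts distinct users per time bucket.

    Field names are hypothetical; adjust them to your own mapping.
    """
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "aggs": {
            "per_period": {
                "date_histogram": {
                    "field": time_field,
                    "calendar_interval": interval,  # e.g. day, week, month, quarter
                },
                "aggs": {
                    # cardinality returns an approximate distinct count
                    "active_users": {"cardinality": {"field": user_field}}
                },
            }
        },
    }
```

Calling `active_users_query("week")` gives the weekly variant. The resulting dict would be sent as the body of a search request.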

From my understanding Postgres has many of these features, but the distributed features of ES are killer!

gane5h | 10 years ago | on: DataLake

The closest to a hosted solution I've come across is Azure's DataLake announced at the recent Build conference: https://azure.microsoft.com/blog/2015/04/29/introducing-azur...

We're working on something adjacent at Silota. We pull in your CRM, behavioral analytics and support data and provide an easy-to-comprehend view layer on top. Our users are not analysts or data scientists, but account managers (more popularly known as Customer Success Managers).

(and hence the appeal of hosted solutions.)

gane5h | 11 years ago | on: Scaling Out PostgreSQL for CloudFlare Analytics Using CitusDB (YC S11)

Really cool write up – thanks! First time I’m hearing about CitusDB. They appear to be building a columnar, distributed database while preserving the Postgres frontend (similar to redshift, aster, greenplum, etc.)

It’s all in the details. I’m planning to investigate the following during my next weekend hack. Hope somebody can answer some pre-sales questions for me:

  - how complete is the postgres functionality (e.g.: lateral joins)
  - can you set a sharding key to control the shard distribution
  - does the database do multiple passes for queries with subselects
  - usually one increases the replication factor (limited by budget) to improve query times, with the limitation that it slows down loading time. does the DB stage intermediate writes to batch them, or does the user need to do this? this works really well for append-only, timestamped event data.
  - do you have a job manager or scheduler, needed when you have multiple views that need to be updated without melting your infrastructure
  - how easy is it to operate? does the database expose operational metrics so that you can see the load on each shard to potentially detect unbalanced shards?
  - tips on hardware configuration (big advantage of redshift here is that you don’t have to run your own warehouse.) maybe partner with MongoHQ?

It’ll be nice to see some sample query plans graphically visualized.

gane5h | 11 years ago | on: Spark Breaks Previous Large-Scale Sort Record

Going on a tangent here: this benchmark highlights the difficulty of sorting in general. Sorts are necessary for computing percentiles (such as the median.) In practical applications, an approximate algorithm such as t-digest should suffice. You can return results in seconds as opposed to "chest thumping" benchmarks to prove a point. :)
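To show the idea of trading exactness for speed without a full sort, here is a minimal sketch. Note that this is not t-digest: it swaps in a much simpler technique (median of a fixed-size reservoir sample), and the sample size and seed are arbitrary assumptions:

```python
import random
import statistics


def approximate_median(stream, sample_size=1000, seed=0):
    """Estimate the median of a stream from a fixed-size reservoir sample.

    Uses reservoir sampling (Algorithm R), so memory stays bounded by
    sample_size no matter how long the stream is.
    """
    rng = random.Random(seed)
    reservoir = []
    for count, value in enumerate(stream, start=1):
        if len(reservoir) < sample_size:
            reservoir.append(value)
        else:
            # Keep the new value with probability sample_size / count.
            slot = rng.randrange(count)
            if slot < sample_size:
                reservoir[slot] = value
    if not reservoir:
        raise ValueError("cannot take the median of an empty stream")
    return statistics.median(reservoir)
```

Only the small sample is ever sorted, which is where the time savings come from. A real t-digest gives much better accuracy at the tails.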

I wrote a post on this: http://www.silota.com/site-search-blog/approximate-median-co...

gane5h | 12 years ago | on: Your REST API should come with a client

I learnt this the hard way. Initially, I designed the API with bigints for ids and found some older versions of PHP didn't support bigints. I had to switch to using strings.
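A small sketch of the workaround, with hypothetical function and field names: serialize ids as JSON strings so that clients without 64-bit integer support never have to parse them as numbers.

```python
import json


def encode_record(record_id, name):
    """Serialize a record with its id as a string (hypothetical schema)."""
    return json.dumps({"id": str(record_id), "name": name})


def decode_record(payload):
    """Parse a record; clients that do support big integers can convert back."""
    data = json.loads(payload)
    data["id"] = int(data["id"])
    return data
```

An id like `2**63 - 1` round-trips intact this way, and clients that can't represent it as a number can simply treat it as an opaque string.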

gane5h | 12 years ago | on: Elasticsearch – the definitive guide

Good feedback. Still iterating on the messaging – at this stage, we are still figuring out how to describe the product.

When you begin incorporating ES into your application, roughly you’d be thinking about:

1. Figuring out the structure of your data and translating that into ES’s mapping

2. Learning the query syntax (the ES docs assume you already know Lucene, which is not usually the case)

3. Setting up an ingest workflow and keeping your indexed data in sync

4. Securing your cluster if you want to hit ES directly from the browser/API client

5. Maintaining your cluster

Silota attempts to solve 3, 4, 5. Improving documentation helps with 2.

There’s an e-commerce search example here: http://www.silota.com/docs/api/ecommerce-product-search-exam...

gane5h | 12 years ago | on: Ask HN: Difference between selling benefits vs featuers

Say I'm trying to sell water.

Features: Liquid at room temperature. One oxygen atom, two hydrogen atoms. Has high specific heat capacity.

Benefits: Helps with the balance of bodily fluids necessary for digestion, absorption, transportation of nutrients, creation of saliva, maintenance of body temperature.

gane5h | 12 years ago | on: Poll: Is your startup or side project profitable?

I have a couple of side-projects as well as a startup.

The last few years, I’ve switched my side-projects to helping people with their backend infrastructure or architecture. This is definitely not profitable and most times I don’t even charge money. I like helping people and this is very satisfying work. Cold-email me (details in profile) if you’d like to discuss infrastructure.

My startup silota.com will be profitable as of April. I started hitting the road actively looking for customers in Jan of this year. The product isn’t self-serve yet and requires me to actively hand-hold people during the onboarding process. It’s been great and I’m looking forward to the next few months!

gane5h | 12 years ago | on: Ask HN: Is async request processing possible with Python Django?

Two ways you can do this:

1. Long polling: It’s definitely possible to have thousands of long-lived requests with something like gunicorn/tornado. Remember to turn off buffering in your front-end nginx proxy if you want to use long polling.

2. Async with web hooks. Gather the request payload, push it to a job queue, and return the response. Process the job queue at your leisure and then call the web hook when complete. You can use celery, beanstalk or my personal choice rq (and django-rq.)
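A minimal sketch of the second approach using only the standard library, so it stands in for celery/beanstalk/rq rather than showing their APIs. The names `handle_request` and `worker`, the payload shape, and the doubling "work" are all made up for illustration. In a real setup the callback would be an HTTP POST to the client's web hook URL:

```python
import queue
import threading

jobs = queue.Queue()  # stand-in for a real job queue such as rq or celery


def handle_request(payload, callback):
    """Enqueue the work and return immediately, like a fast HTTP response."""
    jobs.put((payload, callback))
    return {"status": "accepted"}


def worker():
    """Process jobs at leisure and invoke the web hook when each one is done."""
    while True:
        item = jobs.get()
        if item is None:  # sentinel value used to stop the worker
            jobs.task_done()
            break
        payload, callback = item
        result = payload["x"] * 2  # placeholder for the slow computation
        callback(result)  # hypothetical web hook call
        jobs.task_done()
```

Start `worker` in a background thread (or a separate process with a real queue) and the request handler never blocks on the slow work.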

gane5h | 12 years ago | on: Elasticsearch 1.0.0 released

Really impressed with the pace of innovation in the last few months: cat api, aggregations, snapshots. The unfortunate side effect is that books and Stack Overflow posts written before 1.0 are now outdated.

Disclaimer: I’m the founder of a hosted Search As A Service and we use ES in a few critical parts of our infrastructure.

gane5h | 16 years ago | on: Ask HN: Access to PG?

I mostly wanted to meet the YC folks in person to see what it's all about.

So, I applied to Startup School last year and got in. I went to the YC offices afterwards to hang out. I had read almost all of his essays and also Founders@Work, so I didn't have to bore them with the same old questions. My questions were mostly about Canadian startups, and the statistics on how many applied and how many moved back. I was also at a different stage than you are, so my questions were also around how to figure out what people want, etc.

gane5h | 16 years ago | on: LLVM's Clang Successfully Self-Hosts

In my opinion, llvm-g++ is a more promising approach to supporting C++. C++ is an incredibly difficult language to parse, so why not re-use work that has already been done?