Making Postgres Bloom

69 points | usman-m | 10 years ago | pipelinedb.com | reply

23 comments

[+] gleb | 10 years ago | reply
The example doesn't quite make sense:

  SELECT user_id IN (SELECT DISTINCT user_id FROM user_actions);
is not valid SQL. You may mean something like:

  SELECT 123 IN (SELECT DISTINCT user_id FROM user_actions);
which is a strange query, as it's equivalent to:

  SELECT 123 IN (SELECT user_id FROM user_actions);
[+] usman-m | 10 years ago | reply
Oops, it was meant to be:

  SELECT <user_id> FROM (SELECT DISTINCT user_id FROM user_actions) AS u;
You're absolutely right that both those queries will give the same result. I guess I was trying to motivate the basic problem of finding whether some user exists in a set of users, and `SELECT DISTINCT` is the SQL way of representing a set.

Fixed the post, thanks!

[+] teddyh | 10 years ago | reply
Wouldn’t

  SELECT EXISTS (SELECT 1 FROM user_actions WHERE user_id = 123);
lead to a better execution plan?
[+] mamikonyana | 10 years ago | reply
When you say adding more online algorithms, is that mostly going to be limited to adding more probabilistic data structures?
[+] usman-m | 10 years ago | reply
Mostly that. I've also been thinking about how we could incorporate some machine learning algorithms, like online perceptrons.
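For anyone unfamiliar with the term, an online perceptron is about the simplest streaming ML algorithm there is: each example updates the weights once and is then discarded, so it fits a one-pass model the same way the probabilistic structures do. A minimal sketch (my own, not PipelineDB code):

```python
# Minimal online perceptron sketch -- illustrative only, not PipelineDB
# code. Each example triggers at most one O(d) weight update, so the
# whole thing runs in one pass over the stream with O(d) memory.

def predict(w, b, x):
    """Return +1 or -1 for feature vector x under weights w and bias b."""
    s = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s >= 0 else -1

def update(w, b, x, y, lr=1.0):
    """Standard perceptron rule: adjust weights only on a mistake."""
    if predict(w, b, x) != y:
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        b = b + lr * y
    return w, b

# A toy stream of (features, label) pairs; labels are +1/-1.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 1.0], 1), ([0.0, 0.5], -1)]
w, b = [0.0, 0.0], 0.0
for x, y in stream:
    w, b = update(w, b, x, y)
```

A continuous view wrapping something like this would keep `w` and `b` as its state and fold each new row into them, just like it folds rows into a hyperloglog.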
[+] pbnjay | 10 years ago | reply
Don't get me wrong, I love Postgres and use it in pretty much all of my projects... but for this kind of application it's not very well suited. Leave your relational data for the database and use something more efficient!

Redis comes with both bitfields (see http://redis.io/commands/bitcount) and HyperLogLog counters (see http://redis.io/commands/pfcount), requires almost no setup, and has minimal overhead.
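The bitmap trick behind SETBIT/BITCOUNT is simple enough to sketch without a Redis server: set bit `user_id` when a user acts, then population-count the bitmap to get distinct active users. A pure-Python sketch for illustration (class and method names are mine):

```python
# Sketch of the bitmap idea behind Redis SETBIT/BITCOUNT: one bit per
# user id, so "distinct active users" becomes a popcount. Pure Python
# for illustration; Redis does the same thing server-side.

class ActivityBitmap:
    def __init__(self, max_users):
        self.bits = bytearray((max_users + 7) // 8)

    def mark_active(self, user_id):
        self.bits[user_id >> 3] |= 1 << (user_id & 7)

    def is_active(self, user_id):
        return bool(self.bits[user_id >> 3] & (1 << (user_id & 7)))

    def count_active(self):
        # Equivalent of BITCOUNT: popcount over the whole bitmap.
        return sum(bin(byte).count("1") for byte in self.bits)

bm = ActivityBitmap(max_users=1_000_000)
for uid in [42, 42, 1337, 999_999]:  # duplicates collapse onto one bit
    bm.mark_active(uid)
```

Note the tradeoff: exact counts, but memory grows with the id space (here ~122 KB for a million users), whereas a hyperloglog stays a few KB regardless.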

[+] danneu | 10 years ago | reply
Classic premature optimization.

"Just add another database!"

[+] zallarak | 10 years ago | reply
The idea of using probabilistic data structures instead of counting every point of data (for things like customer analytics) is pretty significant -- getting caught in the weeds of managing every data point is error-prone and inefficient.
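A Bloom filter, the structure in the article's title, is the canonical example of that tradeoff: constant memory, no false negatives, and a tunable false-positive rate. A minimal sketch using stdlib hashing (sizes and hash counts chosen for illustration, not tuned):

```python
import hashlib

# Minimal Bloom filter sketch: constant memory, no false negatives,
# tunable false-positive rate. num_bits/num_hashes are illustrative.

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest (4 bytes each).
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.k):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, item):
        # False -> definitely absent; True -> probably present.
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))

bf = BloomFilter()
for user_id in range(100):
    bf.add(user_id)
```

The "has this user been seen?" query from the article is exactly `might_contain`, in 1 KB of state instead of an index scan.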
[+] ahachete | 10 years ago | reply
usman-m, the approach of PipelineDB seems really interesting. However, I'd like to understand how in your opinion it compares with processing the stream of data changes accessed over PostgreSQL's logical decoding (http://www.postgresql.org/docs/9.4/static/logicaldecoding.ht...) interface. Thank you
[+] usman-m | 10 years ago | reply
ahachete, I'm not sure if I totally understand your question.

Continuous views are consumers for streams. You can think of them as high-throughput, real-time materialized views. The source of data for a stream can be practically anything. Logical decoding, on the other hand, is a producer of streaming data--it's basically a human-readable replication log. So you could potentially stream the logically decoded log into PipelineDB and build some continuous views in front of it.
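To make the producer/consumer split concrete, here's a toy sketch (the event shape is made up; real logical-decoding output looks different, and PipelineDB's continuous views are declared in SQL, not Python): decoded change events flow in as a stream, and a "continuous view" incrementally maintains an aggregate instead of re-scanning a table.

```python
# Toy sketch of the producer/consumer split described above. The event
# dicts are a stand-in for decoded changes, not the real logical-decoding
# format, and PipelineDB views are declared in SQL, not Python.

from collections import Counter

class ContinuousView:
    """Incrementally maintained aggregate: action count per user."""
    def __init__(self):
        self.actions_per_user = Counter()

    def consume(self, event):
        # Only INSERTs into user_actions affect this particular view.
        if event["table"] == "user_actions" and event["op"] == "INSERT":
            self.actions_per_user[event["row"]["user_id"]] += 1

# Producer: a stream of decoded changes (what logical decoding emits).
stream = [
    {"table": "user_actions", "op": "INSERT", "row": {"user_id": 1}},
    {"table": "user_actions", "op": "INSERT", "row": {"user_id": 2}},
    {"table": "user_actions", "op": "INSERT", "row": {"user_id": 1}},
    {"table": "accounts",     "op": "UPDATE", "row": {"user_id": 9}},
]

view = ContinuousView()
for event in stream:
    view.consume(event)
```

The aggregate is always current after the last event consumed, which is the "real-time materialized view" framing above.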