I wrote a custom SP-GiST index for fuzzy image searching. Along the way, I also wound up with a decent example in how to do a simple binary-tree SP-GiST index.
SP-GiST indexes are specific to optimizing data that can fit into a metric-space, but if your application can use that, it's on github here: https://github.com/fake-name/pg-spgist_hamming
Hey, fake-name, I've been following your work. Have you looked into faiss from Facebook or annoy from Spotify? Their biggest drawback is that you can't modify the index, you have to create a new one if you want to add additional entries.
related: we just opensourced pgANN[1] that allows ANN searches and include an example for image search. It uses the vectors from an imagenet and find it works well for our needs.
Apologies for being negative, but... Is it just me or does it look like a bad abstraction trying to solve a problem that was created by SQL not being direct enough in the first place?
I had been working with a couple billion-row-scale PostgreSQL setups and trying to please query planner was sometimes a project in itself. Many times I just wished I could specify an algorithm on my own instead of having SQL interpreted as one plan for one variable, but as another if I replace it. Did anyone else have a similar impression?
The flip side of this is that as your data grows and your schema changes, sometimes a query planner can do better than an optimization based on old, no-longer-true assumptions. Sometimes SQL is the right abstraction, and this kind of tool then becomes very useful.
Yes I run into this a lot in SQL server but SQL server does let you do hints which in my understanding PG does not. No hints would be excruciating, I don't care how good the optimizer is, sometimes I know better and sometime the optimizer is brain dead.
My C is too awful at this point to get over my impostor syndrome and contribute to Postgres, but I've always been curious about the prospects of scaling Postgres higher by having different indexes on different replicas.
Lacking anything like this, we have to make the OLAP vs OLTP workflow splits sooner than we likely would have. Or at least, for companies that are growing organically. For VC backed companies this may be a difference of only 3-6 months.
Using logical replication you can do this. Effectively, it propragtes only the data changes and you can control the ddl changes. I use this to great effect to have a prod db logically replicating to a data warehouse with different indexes and views AND to a hotstandby via the usual binary wal shipping method.
Marketing teams that are clueless about their customers and read "Top 10" tricks found after a Google search to create BS metrics to send to managers & execs at the end of the week because "progress".
[+] [-] fake-name|6 years ago|reply
SP-GiST indexes are specific to optimizing data that can fit into a metric-space, but if your application can use that, it's on github here: https://github.com/fake-name/pg-spgist_hamming
[+] [-] tsomctl|6 years ago|reply
[+] [-] bobosha|6 years ago|reply
[1] https://github.com/netrasys/pgANN/blob/master/README.md
[+] [-] d33|6 years ago|reply
I had been working with a couple billion-row-scale PostgreSQL setups and trying to please query planner was sometimes a project in itself. Many times I just wished I could specify an algorithm on my own instead of having SQL interpreted as one plan for one variable, but as another if I replace it. Did anyone else have a similar impression?
[+] [-] btown|6 years ago|reply
[+] [-] SigmundA|6 years ago|reply
[+] [-] hinkley|6 years ago|reply
Lacking anything like this, we have to make the OLAP vs OLTP workflow splits sooner than we likely would have. Or at least, for companies that are growing organically. For VC backed companies this may be a difference of only 3-6 months.
[+] [-] chillydawg|6 years ago|reply
[+] [-] airstrike|6 years ago|reply
[+] [-] tomnipotent|6 years ago|reply