ekzhu's comments

ekzhu | 3 years ago | on: Everything is a funnel, but SQL doesn't get it
ekzhu | 3 years ago | on: Show HN: PostgresML, now with analytics and project management
> I think the memory inefficiency involved in high level pandas operations is more likely to be a driving force to move operations into lower layers, than CPU runtime.
Indeed. Beyond memory, there is also the inefficiency of Python itself. It would be great if feature engineering pipelines could be pushed down to lower layers as well. But for now, the usability of Python is still unparalleled.
ekzhu | 4 years ago | on: Zero-downtime schema migrations in Postgres using Reshape
ekzhu | 4 years ago | on: The Google home page is 500K
ekzhu | 4 years ago | on: DuckDB quacks Arrow: A zero-copy data integration between Arrow and DuckDB
So now you have a new way to run SQL on Parquet et al. through DuckDB -> Arrow -> Parquet. Of course, you still need to watch out for the memory usage of your SQL query if it contains JOINs or window functions, because the integration is designed for streaming rows.
ekzhu | 4 years ago | on: Data Fabric vs. Data Mesh: What's the Difference?
Another thing to point out: besides relying on the future promises of ML, there are already many signals that can be used by a centralized catalog for data discovery. For example: data sketches (MinHash, HyperLogLog) for joinable datasets, social signals (likes, comments, stars, etc.; see Alation and Select Star), and lineage captured from data movements (e.g., Azure Data Factory and Azure Purview). If the centralized catalog uses those signals, then the data producers are incentivized to provide them for better visibility.
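As a toy illustration of the data-sketch idea, here is a minimal MinHash sketch in Python. The salted-hash trick below stands in for true random permutations, and the column names are made up; real systems use a tuned library rather than this sketch.

```python
import hashlib
import struct

def minhash(tokens, num_perm=128):
    """One signature slot per 'permutation', simulated by salting SHA-1."""
    sig = []
    for i in range(num_perm):
        salt = struct.pack("<I", i)
        sig.append(min(
            int.from_bytes(hashlib.sha1(salt + t.encode()).digest()[:8], "little")
            for t in tokens
        ))
    return sig

def est_jaccard(sig_a, sig_b):
    # Fraction of matching slots estimates the Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Two hypothetical column-name sets with true Jaccard 3/5.
a = {"id", "name", "email", "country"}
b = {"id", "name", "email", "phone"}
sig_a, sig_b = minhash(a), minhash(b)
est = est_jaccard(sig_a, sig_b)
print(est)  # an estimate near 3/5
```

A catalog can compare such small signatures instead of full column contents to surface joinable tables cheaply.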
ekzhu | 4 years ago | on: T-Wand: beating Lucene in less than 600 lines of code
> As someone who has a Ph.D. in Human-computer Interaction ;-), I feel like I am entitled to define a condition of "good" in relevance here. I hereby declare that:
>> A good top-K algorithm should rank a document containing more user query terms higher than a document containing less number of user query terms.
> This makes perfect sense. Right?
Also, "most search engines" don't use the vector space model as the only way to rank results; PageRank, for example, is another signal.
Edit: in some search scenarios, finding the documents with the most query terms makes sense, but Lucene can also rank using this metric. Still, I applaud the author's effort in digging into the research literature. Search relevance is very hard, and standard off-the-shelf metrics like TF-IDF and PageRank are often not enough. Good search usually requires deep understanding of the specific subject domain and hand-tuning tons of signals, many of which aren't even strictly based on the search terms (e.g., previously purchased products on a store's website, geographic location, trending results).
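To make the point concrete, here is a toy TF-IDF scorer (documents and query are made up) showing that plain TF-IDF does not guarantee the "more distinct query terms ranks higher" property: a document that repeats one query term can outscore one that contains both.

```python
import math
from collections import Counter

docs = [
    "cheap cheap cheap cheap hotels".split(),  # one query term, repeated
    "cheap flights available".split(),         # both query terms, once each
    "weather in tokyo".split(),                # no query terms
]
query = ["cheap", "flights"]

def tfidf(doc, query, corpus):
    n = len(corpus)
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(term in d for d in corpus)  # document frequency
        if df and term in tf:
            score += tf[term] * math.log(n / df)
    return score

scores = [tfidf(d, query, docs) for d in docs]
# The single-term document outscores the two-term one, because the
# repeated term's frequency outweighs the rarer term's IDF boost.
```

This is exactly the kind of gap that extra signals and hand-tuning are meant to close.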
ekzhu | 4 years ago | on: Function pipelines: Building functional programming into PostgreSQL
[0] https://blog.timescale.com/blog/sql-vs-flux-influxdb-query-l...
ekzhu | 4 years ago | on: Function pipelines: Building functional programming into PostgreSQL
This is really cool! I wonder what the initial motivation for this new feature was.
Is this meant to be a shortcut for expressing complicated SQL queries, or is it meant to add new semantics beyond SQL? While I like the idea of custom data types with dataflow-like syntax, implementing a whole new query processing engine for the new data type seems like a lot of engineering work. You also now have to handle many edge cases, such as very large time series (I wonder if you have efficient lookup mechanisms on timevectors yet) and various timestamp and value types. If all this new syntax can actually be expressed in SQL, however complex, I think a "lazier" approach is to write a translator that rewrites the new syntax into good old SQL, or to add the translator at the planning stage. That way you can take advantage of Postgres' optimizer and let it do the rest of the heavy lifting.
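For what it's worth, a string-level sketch of that translator idea (function names are borrowed from the post's pipeline syntax; this is a hypothetical toy rewriter, not how the extension actually works):

```python
def rewrite_pipeline(expr: str) -> str:
    """Rewrite 'f(x) -> g() -> h(a)' into nested calls 'h(g(f(x)), a)'."""
    stages = [s.strip() for s in expr.split("->")]
    sql = stages[0]
    for stage in stages[1:]:
        name, args = stage.rstrip(")").split("(", 1)
        # Thread the accumulated expression in as the first argument.
        inner = sql if not args else f"{sql}, {args}"
        sql = f"{name}({inner})"
    return sql

rewritten = rewrite_pipeline("timevector(ts, val) -> sort() -> delta()")
print(rewritten)  # delta(sort(timevector(ts, val)))
```

A real implementation would of course rewrite parse trees at planning time rather than strings, but the payoff is the same: the optimizer sees ordinary nested function calls.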
Kusto has a similar data type "series" that is also created from aggregating over some columns (https://docs.microsoft.com/en-us/azure/data-explorer/kusto/q...).
ekzhu | 4 years ago | on: U.S. Treasury Data Lab
1. "Amazon Restaurant & Bar Inc" received $1.3M in FY2021 while apparently employing only 8 people and taking in a revenue of $96K (https://www.manta.com/c/mhx084z/amazon-restaurant-bar-inc).
2. Google received $11K in the last 12 months, less than a Florida man named Christian Google.
3. Palantir Technologies Inc. received $231.3M in the last 12 months, versus Microsoft Corporation's $357.5M.
ekzhu | 5 years ago | on: The Splitgraph Data Delivery Network – query over 40k public datasets
ekzhu | 5 years ago | on: Google's differential privacy library
ekzhu | 6 years ago | on: AWS Data Exchange
I have an open source project that crawls public datasets and makes them searchable in one place: https://github.com/findopendata/findopendata.
ekzhu | 7 years ago | on: Databricks open-sources Delta Lake to make data lakes more reliable
Many challenges remain, though, and we would like to explore some of the more pertinent ones. In fact, we are conducting a survey to understand the current state of data lakes in industry and the challenges experienced. If you're interested, the survey is here: https://www.surveymonkey.com/r/R7MYXSJ. Would love to see what the HN community thinks about the current state of data lakes.
ekzhu | 7 years ago | on: Show HN: All-pair similarity search on millions of sets in Python and on laptop
Update:
In IPython, using the pyhash library (C++ backed):

    import pyhash
    h = pyhash.murmur3_32()
    %timeit h(b"test")
    # 703 ns ± 4.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Compared with hashlib's SHA-1:

    import hashlib
    %timeit hashlib.sha1(b"test")
    # 217 ns ± 5.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
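For readers without pyhash installed, a stdlib-only version of the same kind of comparison can be run with the timeit module, using zlib.crc32 as a stand-in fast non-cryptographic hash (absolute numbers will differ by machine):

```python
import hashlib
import timeit
import zlib

N = 200_000  # iterations per measurement

# zlib.crc32 stands in here for a fast non-cryptographic hash like murmur3.
crc_time = timeit.timeit(lambda: zlib.crc32(b"test"), number=N)
sha_time = timeit.timeit(lambda: hashlib.sha1(b"test").digest(), number=N)

print(f"crc32: {crc_time / N * 1e9:.0f} ns/op")
print(f"sha1:  {sha_time / N * 1e9:.0f} ns/op")
```

The surprising lesson from the original measurement stands: a C-implemented cryptographic hash in the stdlib can beat a wrapped non-cryptographic hash once per-call overhead is counted.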