ekzhu's comments

ekzhu | 3 years ago | on: Show HN: PostgresML, now with analytics and project management

The DataFrame is loaded from disk, true, but it is possible that batch loading is faster (especially with structured data) than row-by-row translation of Postgres types into Python types. It would be interesting to see benchmark results.
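The per-row vs. batch difference can be sketched with a toy micro-benchmark in plain Python (purely hypothetical, nothing to do with PostgresML's actual code path):

```python
import timeit

# Column of Postgres-style text values to be converted to floats
raw = [f"{i}.5" for i in range(10_000)]

def row_by_row():
    return [float(v) for v in raw]   # per-row conversion in interpreter bytecode

def batch():
    return list(map(float, raw))     # loop pushed down into C-level map/float

assert row_by_row() == batch()
t_row = timeit.timeit(row_by_row, number=100)
t_batch = timeit.timeit(batch, number=100)
print(f"row-by-row: {t_row:.3f}s  batched: {t_batch:.3f}s")
```

A real comparison would of course measure the actual Postgres-to-Python boundary, not this stand-in.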

> I think the memory inefficiency involved in high level pandas operations is more likely to be a driving force to move operations into lower layers, than CPU runtime.

Indeed. Not only memory but also inefficiency related to Python itself. It would be great if feature engineering pipelines could be pushed down to lower layers as well. But for now, the usability of Python is still unparalleled.

ekzhu | 3 years ago | on: Show HN: PostgresML, now with analytics and project management

Great idea! I see this is implemented using the Python language interface supported by PostgreSQL and importing sklearn models. I always wonder how scalable this is, considering the serialization/deserialization overhead between Postgres' core and Python. Do you see any significant performance difference between this and training the sklearn models directly on something like DataFrames?

ekzhu | 4 years ago | on: The Google home page is 500K

Most of it should be cached. I think it's a trade-off between server load and client experience. Millions of 10KB requests << thousands of 500KB requests (also potentially with lots of compressed stuff).

ekzhu | 4 years ago | on: DuckDB quacks Arrow: A zero-copy data integration between Arrow and DuckDB

TLDR: Arrow got an SQL interface provided by DuckDB.

So you have a new way to run SQL on Parquet et al. through DuckDB -> Arrow -> Parquet. Of course, you still need to watch out for the memory usage of your SQL query if it contains JOINs or window functions, because the integration is designed for streaming rows.

ekzhu | 4 years ago | on: Data Fabric vs. Data Mesh: What's the Difference?

There is no reason the two approaches can't coexist: a centralized catalog managed by a small team sets the “gold standard” for the many decentralized data producers and curators, who are incentivized to maximize their impact (i.e., usage) by publishing higher-quality data that follows the standard.

Another thing to point out: besides relying on the future promises of ML, there are already many signals that can be used by a centralized catalog for data discovery. For example: data sketches (MinHash, HyperLogLog) for joinable datasets, social signals (likes, comments, stars, etc.; see Alation and Select Star SQL), and lineage captured from data movement (e.g., Azure Data Factory and Azure Purview). If the centralized catalog uses those signals, then the data producers are incentivized to provide them for better visibility.
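The MinHash idea fits in a few lines of stdlib Python. Here is a bottom-k variant (a single hash function, keep the k smallest values); libraries like datasketch are more sophisticated, this only illustrates the principle:

```python
import hashlib

def bottom_k(values, k=128):
    """Bottom-k sketch: the k smallest hash values of a column's distinct values."""
    hashes = {int.from_bytes(hashlib.sha1(v.encode()).digest()[:8], "big")
              for v in values}
    return set(sorted(hashes)[:k])

def jaccard_estimate(sk_a, sk_b, k=128):
    # The k smallest hashes of the union are recoverable from the two sketches;
    # the fraction present in BOTH sketches estimates the Jaccard similarity.
    union_k = sorted(sk_a | sk_b)[:k]
    return sum(h in sk_a and h in sk_b for h in union_k) / len(union_k)

col_a = [f"user_{i}" for i in range(1000)]
col_b = [f"user_{i}" for i in range(500, 1500)]   # true Jaccard = 500/1500
est = jaccard_estimate(bottom_k(col_a), bottom_k(col_b))
print(f"estimated Jaccard: {est:.2f} (true: {500/1500:.2f})")
```

Two columns with a high estimated Jaccard (or containment) are good join candidates, without ever comparing the raw values.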

ekzhu | 4 years ago | on: T-Wand: beating Lucene in less than 600 lines of code

I cannot continue reading after the following “declaration”… The author should take a look at the Wikipedia page for TF-IDF.

> As someone who has a Ph.D. in Human-computer Interaction ;-), I feel like I am entitled to define a condition of "good" in relevance here. I hereby declare that:

>> A good top-K algorithm should rank a document containing more user query terms higher than a document containing less number of user query terms.

> This makes perfect sense. Right?

Also, “most search engines” don’t use the vector space model as the only way to rank results; for example, PageRank.

Edit: in some search scenarios, finding the documents with the most query terms makes sense, but Lucene can also rank using this metric. Still, I applaud the author's effort in digging into the research literature. Search relevance is very hard, and standard off-the-shelf metrics like TF-IDF and PageRank are often not enough. Good search usually requires deep understanding of the specific subject domain and hand-tuning tons of signals, many of which aren't even strictly based on search terms (e.g., previously purchased products on a store's website, geographic location, trending results).
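TF-IDF itself already violates that "declaration": one rare term, repeated, can outweigh several common ones. A toy example with a made-up corpus:

```python
import math
from collections import Counter

# Toy corpus: "cheap" and "flights" are common, "tokyo" is rare
corpus = [
    "cheap flights deals",
    "cheap flights today",
    "cheap flights sale",
    "cheap flights now",
    "cheap flights tokyo",        # doc 4: matches all 3 query terms
    "tokyo tokyo travel guide",   # doc 5: matches only 1, but the rare one, twice
]
N = len(corpus)
tokenized = [d.split() for d in corpus]
df = Counter(t for doc in tokenized for t in set(doc))   # document frequency
idf = {t: math.log(N / df[t]) for t in df}

def score(doc, query):
    tf = Counter(doc)
    return sum(tf[t] * idf.get(t, 0.0) for t in query)

query = ["cheap", "flights", "tokyo"]
scores = [score(d, query) for d in tokenized]
print(scores)   # doc 5 outscores doc 4 despite matching fewer query terms
```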

ekzhu | 4 years ago | on: Function pipelines: Building functional programming into PostgreSQL

Thanks for the background. I find it fascinating that the small-data scenarios in analytics are still kind of chaotic when it comes to tooling. Full-fledged SQL queries on relations seem heavy, but closer to the raw data. The timevector custom data type is like a middle ground: each timevector is essentially a pre-aggregated time series (maybe compressed as well), so this approach likely adds a performance benefit when the task is to analyze many, many small time series. Although I still feel supporting 70+ new functions adds a lot of maintenance burden, and people cannot debug/extend this set of functions because they are not SQL. I am wondering if you often find that users just want an out-of-the-box solution, or whether they need the ability to tweak or add their own domain-specific logic.

ekzhu | 4 years ago | on: Function pipelines: Building functional programming into PostgreSQL

Database researcher here.

This is really cool! I wonder what the initial drive for this new feature was.

Is this meant to be a "short-cut" for expressing complicated SQL queries, or is it meant to add new semantics beyond SQL? While I like the idea of custom data types with dataflow-like syntax, implementing a whole new query processing engine for the new data type seems like a lot of engineering work. Also, you now have to handle many edge cases, such as very, very large time series -- I wonder if you have efficient lookup mechanisms on timevectors yet, and support for various timestamp and value types. If all of this new syntax can actually be expressed in SQL, however complex, I think a "lazier" approach is to write a translator that rewrites the new syntax into good old SQL, or to add a translator at the planning stage. This way you can take advantage of Postgres' optimizer and let it do the rest of the heavy lifting.
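The "lazier" translator approach could be prototyped as a plain source-to-source rewrite. A sketch, where both the pipeline grammar and the stage-to-SQL mapping are invented for illustration (this is not the actual Timescale syntax or implementation):

```python
# Toy rewriter: turn a made-up pipeline expression into plain SQL that
# Postgres' own optimizer can then plan.

def rewrite(pipeline: str, table: str) -> str:
    sql = f"SELECT ts, val FROM {table}"
    for stage in (s.strip() for s in pipeline.split("->")):
        if stage == "sort()":
            sql = f"SELECT ts, val FROM ({sql}) t ORDER BY ts"
        elif stage == "delta()":
            sql = (f"SELECT ts, val - lag(val) OVER (ORDER BY ts) AS val "
                   f"FROM ({sql}) t")
        else:
            raise ValueError(f"unknown pipeline stage: {stage}")
    return sql

print(rewrite("sort() -> delta()", "measurements"))
```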

Kusto has a similar data type "series" that is also created from aggregating over some columns (https://docs.microsoft.com/en-us/azure/data-explorer/kusto/q...).

ekzhu | 4 years ago | on: U.S. Treasury Data Lab

Interesting findings:

1. "Amazon Restaurant & Bar Inc" received 1.3M in FY2021 while apparently employing only 8 people and taking in revenue of 96k (https://www.manta.com/c/mhx084z/amazon-restaurant-bar-inc).

2. Google received 11k in the last 12 months, less than a Florida man named Christian Google.

3. Palantir Technologies Inc. 231.3M, versus Microsoft Corporation 357.5M in the last 12 months.

ekzhu | 7 years ago | on: Databricks open-sources Delta Lake to make data lakes more reliable

We (data curation lab at the University of Toronto) are doing research in data lake discovery problems. One of the problems we are looking at is how to efficiently discover joinable and unionable tables. For example: find all the rental listings from various sources to create a master list (union); or find tables such as rental listings and school districts that can be used to augment each other (join). The technical challenges in finding joinable and unionable tables in data lakes involve the following: (1) the data schema is often inconsistent and poorly managed, so we can’t simply rely on the schema; and (2) the scale of data lakes can be on the order of hundreds of thousands of tables, making a content-based search algorithm expensive. We came up with some solutions based on data sketches, with several published papers [1,2,3]. The Python library "datasketch" was a byproduct of this work.
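At its core, the join-discovery question is set containment. A naive exact baseline (which the sketch-based methods in [1,3] approximate at data-lake scale), over a made-up toy lake:

```python
def containment(query_col, candidate_col):
    """|Q ∩ C| / |Q|: the fraction of the query column that is joinable."""
    q, c = set(query_col), set(candidate_col)
    return len(q & c) / len(q)

# Hypothetical data lake: column name -> values
lake = {
    "rentals.postal_code": [f"M5V {i}" for i in range(100)],
    "schools.postal_code": [f"M5V {i}" for i in range(50, 150)],
    "weather.station_id":  [f"ST{i}" for i in range(100)],
}

query = lake["rentals.postal_code"]
ranked = sorted(((containment(query, col), name)
                 for name, col in lake.items()
                 if name != "rentals.postal_code"),
                reverse=True)
print(ranked)   # columns ranked by joinability with the query column
```

The sketches replace the exact sets so that this ranking can be computed without scanning every candidate column in full.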

Many challenges remain though, and we would like to explore some of the more pertinent ones. In fact, we are conducting a survey to understand the current state of data lakes in industry and the challenges experienced. If you're interested in learning more, see what we came up with here: https://www.surveymonkey.com/r/R7MYXSJ - would love to see what the HN community thinks about the current state of data lakes.

[1] http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf [2] http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf [3] http://www.cs.toronto.edu/~ekzhu/papers/josie.pdf

ekzhu | 7 years ago | on: Show HN: All-pair similarity search on millions of sets in Python and on laptop

I am sure Murmur3 would improve performance, but I doubt it would improve the indexing time very much. I can give it a try.

Update:

In IPython using pyhash library (C++):

  import pyhash
  h = pyhash.murmur3_32()
  timeit h(b"test")
  703 ns ± 4.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

  import hashlib
  timeit hashlib.sha1(b"test")
  217 ns ± 5.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
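The same measurement can be reproduced outside IPython with the stdlib timeit module (only the hashlib half is shown here, since pyhash is a third-party package that may not be installed):

```python
import hashlib
import timeit

n = 100_000
per_call_ns = timeit.timeit(lambda: hashlib.sha1(b"test"), number=n) / n * 1e9
print(f"sha1(b'test'): {per_call_ns:.0f} ns per call")
```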

ekzhu | 7 years ago | on: Show HN: All-pair similarity search on millions of sets in Python and on laptop

Author here. The algorithm used here is based on Google's 2007 paper "Scaling Up All Pairs Similarity Search." Since then, I am sure they have started to look at billions of sets. Generally speaking, exact algorithms like the one presented here max out around 100M sets on not-crazy hardware; going over a billion probably requires approximate algorithms such as Locality-Sensitive Hashing. You may be interested in the work of Anshumali Shrivastava.
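The core trick in that 2007 paper is prefix filtering. A compressed sketch of the idea (real implementations add size and positional filtering on top of this):

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def all_pairs(sets, threshold):
    """Prefix filtering: two sets can only reach the similarity threshold if
    they share a token within each other's prefix (under a canonical token
    order), so only pairs colliding in an inverted index over prefixes are
    ever verified exactly."""
    index = {}         # token -> ids of earlier sets having it in their prefix
    candidates = set()
    for i, s in enumerate(sets):
        tokens = sorted(s)                        # canonical global token order
        prefix_len = len(tokens) - int(threshold * len(tokens)) + 1
        for t in tokens[:prefix_len]:
            candidates.update((j, i) for j in index.get(t, []))
            index.setdefault(t, []).append(i)
    return sorted((i, j) for i, j in candidates
                  if jaccard(sets[i], sets[j]) >= threshold)

sets = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"}, {"w", "x", "y", "z"}]
print(all_pairs(sets, threshold=0.5))   # only the first two sets qualify
```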