jaychia's comments
jaychia | 3 months ago | on: All it takes is for one to work out
Applicable not just for grad school applications, but also to job apps, startups, and relationships.
Hang in there y'all, all it takes is for one to work out. Keep working hard, kings & queens.
jaychia | 11 months ago | on: Preview: Amazon S3 Tables and Lakehouse in DuckDB
Also no cluster, no JVM. Just `pip install daft` and go. Runs locally (as fast as DuckDB for a lot of workloads, and faster if you're reading data from cloud storage like S3) and also runs distributed if you have a Ray cluster to point it at.
(Disclaimer: I work on it)
jaychia | 1 year ago | on: Should you ditch Spark for DuckDB or Polars?
Thanks for the feedback on marketing! Daft is indeed distributed using Ray, but to do so involves Daft being architected very carefully for distributed computing (e.g. using map/reduce paradigms).
Ray fills an almost Kubernetes-like role for us in terms of orchestration/scheduling (admittedly it does quite a bit more as well, especially in the area of data movement). But yes, the technologies are very complementary!
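To make the map/reduce point concrete, here is a toy single-process sketch of how a distributed engine structures an aggregation as independent per-partition maps, a shuffle by key, and a reduce (illustrative only, not Daft's actual code):

```python
from collections import defaultdict

# Each "partition" would live on a different worker in a real cluster.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5)],
]

# Map: compute per-partition partial sums (runs independently per worker)
def map_partition(rows):
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)

mapped = [map_partition(p) for p in partitions]

# Shuffle: group partial results by key (the network-heavy step in a cluster)
shuffled = defaultdict(list)
for partial in mapped:
    for key, value in partial.items():
        shuffled[key].append(value)

# Reduce: combine partials into final aggregates
totals = {key: sum(values) for key, values in shuffled.items()}
print(totals)  # {'a': 4, 'b': 7, 'c': 4}
```

The engine's job is to schedule the map and reduce tasks across workers and move the shuffled data between them; the logical pattern is the same.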
jaychia | 1 year ago | on: DataChain: DBT for Unstructured Data
Just dug through the datachain codebase to understand a little more. While both projects have a Dataframe interface, I think they're actually very different!
Datachain seems to operate more on the orchestration layer, running Python libraries such as PIL and requests (for making API calls) and relying on an external database engine (SQLite or BigQuery/Clickhouse) for the actual compute.
Daft is an actual data engine. Essentially, it's "multimodal BigQuery/Clickhouse". We've built out a lot of our own data system functionality such as custom Rust-defined multimodal data structures, kernels to work on multimodal types, a query optimizer, distributed joins etc.
In non-technical terms, I think this means that Datachain really is more of a "DBT" which orchestrates compute over an existing engine, whereas Daft is the actual compute/data engine that runs the workload. A project such as Datachain could actually run on top of Daft, which can handle the compute and I/O operations necessary to execute the requested workload.
jaychia | 1 year ago | on: Amazon's exabyte-scale migration from Apache Spark to Ray on EC2
We love Ray, and are excited about the awesome ecosystem of useful + scalable tools that run on it for model training and serving. We hope that Daft can complement the rest of the Ray ecosystem to enable large scale ETL/analytics to also run on your existing Ray clusters. If you have an existing Ray cluster setup, you absolutely should have access to best-in-class ETL/analytics without having to run a separate Spark cluster.
Also, on the nerdier side of things - the primitives that Ray provides gives us a real opportunity to build a solid non-JVM based, vectorized distributed query engine. We’re already seeing extremely good performance improvements here vs Spark, and are really excited about some of the upcoming work to get even better performance and memory stability.
This collaboration with Amazon really battle-tested our framework :) happy to answer any questions if folks have them.
jaychia | 1 year ago | on: Pg_lakehouse: Query Any Data Lake from Postgres
We are building a Python distributed query engine, and share a lot of the same frustrations… in fact, until quite recently most of the table formats only had JVM client libraries, so integrating them natively with Daft was really difficult.
We finally managed to get read integrations across Iceberg/DeltaLake/Hudi recently as all 3 now have Python/Rust-facing APIs. Funny enough, the only non-JVM implementation of Hudi was contributed by the Hudi team and currently still lives in our repo :D (https://github.com/Eventual-Inc/Daft/tree/main/daft/hudi/pyh...)
It’s still the case that these libraries lag behind their JVM counterparts though, so it’s going to be a while before we see full support across the full featureset of each table format. But we’re definitely seeing a large appetite for working with table formats outside of the JVM ecosystem (e.g. in Python and Rust)
jaychia | 2 years ago | on: Daft: Distributed DataFrame for Python
1. Construct a dataframe (performs schema inference)
2. Access (now well-typed) columns and operations on those columns in the dataframe, with associated validations.
Unfortunately step (1) can only happen at runtime and not at type-checking-time since it requires running some schema inference logic, and step (2) relies on step (1) because the expressions of computation are "resolved" against those inferred types.
However, if we can fix (1) to happen at type-checking time using user-provided type-hints in place of the schema inference, we can maybe figure out a way to propagate this information through to mypy.
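A minimal sketch of why step (1) is runtime-only: the schema comes from the data itself, so a static type checker never sees the column types (toy code, not Daft's actual inference logic):

```python
# Toy schema inference: column types are only knowable by inspecting the
# data at runtime, which is why mypy cannot see them at type-checking time.
def infer_schema(rows):
    schema = {}
    for name, value in rows[0].items():
        schema[name] = type(value)
    return schema

rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
schema = infer_schema(rows)
print(schema)  # {'id': <class 'int'>, 'name': <class 'str'>}

# Step (2): expressions are resolved against the inferred types, so this
# validation can also only happen at runtime.
assert schema["id"] is int
```

With user-provided type hints replacing `infer_schema`, the same information would be available statically and could in principle be surfaced to mypy.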
Would love to continue the discussion further as an Issue/Discussion on our Github!
jaychia | 2 years ago | on: Daft: Distributed DataFrame for Python
We actually already have read support. Check out the pyiceberg docs' Daft section: https://py.iceberg.apache.org/api/#daft
It's also very easy to use from Daft itself: `daft.read_iceberg(pyiceberg_table)`. Give it a shot and let us know how it works for you!
jaychia | 2 years ago | on: Daft: Distributed DataFrame for Python
The network indeed becomes the bottleneck. In 2 main ways:
1. Reading data from cloud storage is very expensive. Here’s a blogpost where we talk about some of the optimizations we’ve done in that area: https://blog.getdaft.io/p/announcing-daft-02-10x-faster-io
2. During a global shuffle stage (e.g. sorts, joins, aggregations) network transfer of data between nodes becomes the bottleneck.
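A back-of-envelope sketch of point (2): in an all-to-all shuffle, each node must send almost all of its partition over the network, so total traffic approaches the full dataset size as the cluster grows (toy arithmetic, not a Daft benchmark):

```python
# In an all-to-all shuffle (sort/join/groupby), each of N nodes holds D/N
# bytes, and roughly the (N-1)/N fraction of it belongs on other nodes.
def shuffle_network_bytes(total_bytes, num_nodes):
    per_node = total_bytes / num_nodes
    sent_per_node = per_node * (num_nodes - 1) / num_nodes
    return sent_per_node * num_nodes

total = 1_000_000_000_000  # hypothetical 1 TB dataset
for n in (2, 10, 100):
    # 2 nodes: 0.5 TB crosses the network; 100 nodes: ~0.99 TB
    print(n, shuffle_network_bytes(total, n))
```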
This is why the advice is often to stick with a local solution such as DuckDB, Polars or Pandas if you can keep vertically scaling!
However, horizontally scaling does have some advantages:
- Higher aggregate network bandwidth for performing I/O with storage
- Auto-scaling to your workload’s resource requirements
- Scaling to large workloads which may not fit on a single machine. This is more common in Daft usage because we also work with multimodal data such as images, tensors and more for ML data modalities.
Hope this helps!
jaychia | 2 years ago | on: Daft: Distributed DataFrame for Python
And thanks for the feedback! We’ll add more capabilities for regex, as well as flesh out our documentation for partitioning.
Edit: added a new issue for regex support :) https://github.com/Eventual-Inc/Daft/issues/1962
jaychia | 2 years ago | on: Daft: Distributed DataFrame for Python
For hardware, we were using AWS i3.2xlarge machines in a distributed cluster. And on the storage side we are reading Parquet files over the network from AWS S3. This is most representative of how users run query engines like Daft.
The TPC-H benchmarks are usually performed on databases which have pre-ingested the data into a single-node server-grade machine that’s running the database.
Note that Daft isn’t really a “database”, because we don’t have proprietary storage. Part of the appeal of using query engines like Daft and Spark is being able to read data “at rest” (as Parquet, CSV, JSON etc). However, this will definitely be slower than a database which has pre-ingested the data into indexed storage and proprietary formats!
Hope that helps explain the discrepancies!
jaychia | 2 years ago | on: Daft: Distributed DataFrame for Python
We do have a dependency on the arrow2 crate, as Polars does, but that crate was recently deprecated, so both projects are having to deal with that right now.
jaychia | 2 years ago | on: Working with the Apache Parquet file format
There's lots of lore/history in the versioning of the format's various features, and I put together a post to share some of the things I learned by browsing the issues/mailing list and talking to folks from the Parquet community.
Enjoy!
jaychia | 2 years ago | on: Daft: A High-Performance Distributed Dataframe Library for Multimodal Data
Yes, give it a whirl and let us know what you think! Ray is amazing and has actually gotten a lot better since their 2.0 release :)
> Is this based on Apache Arrow?
Indeed it is, and thanks for the feedback. We'll make this a little more visible. We use the arrow2 Rust crate (same one that Polars uses) for our in-memory data representation.
Our data representation makes it such that converting Daft into a Ray dataset (`df.to_ray_dataset()`) is actually zero-copy. So you can go from data transformations into downstream ML stuff in Ray really easily.
> It would be AMAZING to be able to take Polars code and run it in a distributed cluster with minimal changes.
Unfortunately we don't have Polars API compatibility. This seems to be a recurring theme in this thread though. The problem is that certain Polars expressions are non-trivial to do in a distributed setting, and Polars itself as a project is so young and moves so quickly it's hard for us to maintain 100% API-compatibility.
That being said, you are correct that a lot of the API is very much inspired by Polars, which should hopefully make it easy to move between the two.
jaychia | 2 years ago | on: Daft: A High-Performance Distributed Dataframe Library for Multimodal Data
As a performance-driven project it’s important for us to understand which operations and use-cases are slowest/buggiest for our users so that we can focus on them. We tried to be very intentional in scoping the telemetry we collect and take this very seriously (telemetry is top-level on both our docs and README).
Happy to hear any feedback on this - we understand it's an important topic.
[Edit: parent link was fixed, thanks! :)]
jaychia | 2 years ago | on: Daft: A High-Performance Distributed Dataframe Library for Multimodal Data
Not yet, it’s on our todo list to integrate with the ecosystem of data catalogs (Iceberg/Delta/Hudi etc). Join our Slack/get in touch with us if you’re keen on this though, we’d love to learn more about your use-case!
> Any plans to support sql queries?
We do eventually want to support SQL as well, but haven’t had the bandwidth to build and maintain it. Really we’d just need to compile the SQL down to our logical plan - we could pretty easily integrate UDFs so that they can be registered as SQL functions too!
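The "UDFs registered as SQL functions" idea can be sketched with sqlite3 from the Python standard library (not Daft's implementation, just the general pattern): a Python function registered under a name so SQL queries can call it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def double(x):
    return x * 2

# Register the Python UDF under the SQL name "double" (1 argument).
conn.create_function("double", 1, double)

# SQL can now call the Python function directly.
result = conn.execute("SELECT double(21)").fetchone()[0]
print(result)  # 42
```

An engine that compiles SQL to its own logical plan would do the same thing one layer earlier, resolving the function name during planning rather than at execution time.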
jaychia | 2 years ago | on: Daft: A High-Performance Distributed Dataframe Library for Multimodal Data
We did actually start by using Polars as our underlying execution engine, but eventually transitioned off to our own Rust Table abstraction to better suit our needs (e.g. custom datatypes and kernels). We still share the arrow2 dependency with Polars for in-memory representation of our data.
> what's the "killer feature" of Daft for trying to compete with Polars (and Pandas)
We don't specifically try to compete on a local machine since there is so much good new tooling being made recently available (DuckDB, Polars etc). That being said, we do try our best to make the local experience as seamless as possible because we've all felt the pain of developing locally in PySpark. Aside from the ability to go distributed, I like using Daft for:
* Working on more "complex" datatypes such as URLs, images etc
* Working with "many [Parquet/CSV/JSON] files in cloud storage (S3)", which we've found to be quite common for many workloads. We already have intelligent optimizations here such as column pruning and predicate pushdown, and are building more to reduce and optimize I/O from the cloud.
As you've pointed out, one of our main responsibilities is to handle memory very very well. This is something we're actively working on and I'm thinking this will be a big reason to use us locally as well!
> Are they API-compatible?
We are not API-compatible with Pandas/Polars, but our API is quite inspired by Polars. We found that building out the core set of dataframe functionality was much more tractable than attempting to go API-compatible from the get-go.
> How's the memory consumption benchmark? (TBH, this is the only interesting metric. Timing and Latency are not really important when your most important competitor is Spark.)
We think throughput is still important when comparing against Spark, since this can save a lot of money when running some potentially very expensive queries!
That being said, you're spot-on about memory usage being a key metric here. One of the key advantages of having native types for multimodal data (e.g. tensors and images) is that we can much more tightly account for memory requirements when working with these types, beyond the usual black-box Python UDF which often results in a ton of out-of-memory issues.
Our current mechanism for dealing with this is relying on Ray's excellent object spilling mechanisms for working with out-of-core data. We recognize that there are many situations in which this is insufficient.
The team is working on many advanced features here (e.g. microbatching) that will give Daft a big boost, and will release benchmarks as soon as we have them!
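A toy illustration of the memory-accounting point above: with a fixed-shape image type, buffer sizes are exactly computable up front, unlike an opaque Python object column (hypothetical numbers, not Daft internals):

```python
# With a native fixed-shape image type, the engine can compute the exact
# buffer size of a batch before materializing it, and plan memory accordingly.
def image_batch_bytes(batch_size, height, width, channels, bytes_per_pixel=1):
    return batch_size * height * width * channels * bytes_per_pixel

# 1,000 RGB uint8 images at 224x224:
total = image_batch_bytes(1_000, 224, 224, 3)
print(total)  # 150528000 bytes, i.e. ~150 MB
```

A black-box Python UDF returning arbitrary objects gives the engine no such number, which is exactly when out-of-memory surprises happen.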
[Edit: typos!]
jaychia | 2 years ago | on: Daft: A High-Performance Distributed Dataframe Library for Multimodal Data
1. Thanks! We think so too :)
2. Here's my 2c in favor of flat files:
- Ingestion: ingesting things into a data lake is much easier than writing to a database (all you have to do is drop some JSON, CSVs or protobufs into a bucket). This makes integrating with other systems, especially 3rd-party or vendors, much easier since there's an open language-agnostic format to communicate with.
- Multimodal data: Certain datatypes (e.g. images, tensors) may not make sense in a traditional SQL database. In a datalake though, data is usually "schema-on-read", so you can at least ingest it and now the responsibility is on the downstream application to make use of it if it can/wants to - super flexible!
- "Always on": with databases, you pay for uptime which likely scales with the size of your data. If your requirements are infrequent accesses of your data then a datalake could save you a lot of money! A common example of this: once-a-day data cleanups and ETL of an aggregated subset of your data into downstream (clean!) databases for cheaper consumption.
On "isn't updating it a huge chore?": many data lakes are partitioned by ingestion time, and applications usually consume a subset of these partitions (e.g. all data over the past week). In practice this means that you can lifecycle your data and put old data into cold-storage so that it costs you less money.
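A toy sketch of the time-partitioned layout and pruning described above (hypothetical paths and window, not any particular system's API):

```python
from datetime import date, timedelta

# A lake partitioned by ingestion date, e.g. s3://bucket/events/dt=2024-01-01/
partitions = [f"dt={date(2024, 1, 1) + timedelta(days=i)}" for i in range(30)]

# Consumers only read the partitions in their window; older partitions can be
# lifecycled to cold storage without touching the application.
def partitions_in_window(partitions, end, days):
    start = end - timedelta(days=days)
    return [p for p in partitions
            if start < date.fromisoformat(p.split("=")[1]) <= end]

recent = partitions_in_window(partitions, end=date(2024, 1, 30), days=7)
print(len(recent))  # 7 partitions: dt=2024-01-24 through dt=2024-01-30
```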
jaychia | 1 month ago | on: Ask HN: What is the best way to provide continuous context to models?
Anthropic's post on the Claude Agent SDK (formerly Claude Code SDK) talks about how the agent "gathers context", and is fairly accurate as to how people do it today.
1. Agentic Search (give the agent tools and let it run its own search trajectory): specifically, the industry seems to have made really strong advances towards giving the agents POSIX filesystems and UNIX utilities (grep/sed/awk/jq/head etc) for navigating data. MCP for data retrieval also falls into this category, since the agent can choose to invoke tools to hit MCP servers for required data. But because coding agents know filesystems really well, it seems like that is outperforming everything else today ("bash is all you need").
2. Semantic Search (essentially chunking + embedding, a la RAG in 2022/2023): I've definitely noticed a growing trend amongst leading AI companies to move away from this. Especially if your data is easily represented as a filesystem, (1) seems to be the winning approach.
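A miniature version of the filesystem-native retrieval in (1): a "grep" tool an agent could call to get matching lines with file/line locations (toy code, not any agent framework's actual tool):

```python
import os
import tempfile

# Walk a directory tree and return (path, line_number, line) for every line
# containing the needle -- the kind of primitive a coding agent invokes.
def grep(root, needle):
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                for lineno, line in enumerate(f, 1):
                    if needle in line:
                        hits.append((path, lineno, line.rstrip()))
    return hits

# Demo on a throwaway directory with a hypothetical notes file:
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "notes.md"), "w") as f:
        f.write("design v1\ndesign v_final\n")
    hits = grep(root, "v_final")
    print(hits[0][1:])  # (2, 'design v_final')
```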
Interestingly though, this approach has a pretty glaring flaw: all of these approaches really only provide the agents with raw, unprocessed data, so there's a ton of recomputation on it. An agent that has sifted through the raw data once (maybe reading v1, v2 and v_final of a design document) will have to do the same thing again in the next session.
I have a strong thesis that this will change in 2026 (Knowledge Curation, not search, is the next data problem for AI) https://www.daft.ai/blog/knowledge-curation-not-search-is-th... and we're building towards this future as well. Related ideas here that have anecdotal evidence of providing benefits, but haven't really stuck yet in practice include: agentic memory, processing agent trajectory logs, continuous learning, persistent note-taking etc.