I am currently evaluating dsq and its partner desktop app DataStation. AIUI, the developer of DataStation realised that it would be useful to extract the underlying pieces into a standalone CLI, so they both support the same range of sources.
I agree jq's syntax doesn't make much sense for tables, where the primary operations are filtering and merging, but for deep tree-like datasets, which is what JSON is supposed to be used for, traversal and iteration are more important, and the syntax makes perfect sense there.
I'd be willing to bet most programmers would instantly understand something like `.users[] | {email: .email, lastLogin: .logins[-1].date}`, even if they've never seen jq.
Now that I'm thinking about it, the kind of structures we often use JSON for are, in a way, a subset of what can be done with tables and pointers (foreign keys), so would it be possible to create a kind of jq to SQL compiler? Has anyone tried that?
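For a concrete sense of what such a compiler might emit: a sketch of the jq expression above translated to SQL over a hypothetical normalized schema. The `users(id, email)` and `logins(user_id, date)` tables, the sample data, and SQLite as the target are all assumptions for illustration.

```python
import sqlite3

# Hypothetical normalized schema for the JSON shape in the jq example:
# users(id, email) and logins(user_id, date), with logins assumed to be
# stored in chronological order, so `.logins[-1]` is the latest one.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE logins (user_id INTEGER REFERENCES users(id), date TEXT);
""")
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "a@example.com"), (2, "b@example.com")])
con.executemany("INSERT INTO logins VALUES (?, ?)",
                [(1, "2022-01-01"), (1, "2022-03-05"), (2, "2022-02-02")])

# Roughly what `.users[] | {email: .email, lastLogin: .logins[-1].date}`
# could compile to: the last (here: maximum) login date per user.
rows = con.execute("""
    SELECT u.email, MAX(l.date) AS lastLogin
    FROM users u JOIN logins l ON l.user_id = u.id
    GROUP BY u.id
    ORDER BY u.email
""").fetchall()
print(rows)  # [('a@example.com', '2022-03-05'), ('b@example.com', '2022-02-02')]
```

The tree-to-table mapping is the interesting part: each jq array traversal becomes a join over a foreign key, and the `[-1]` index becomes an aggregate (which only works if order maps onto something sortable).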
I have a use case where my company's application logs will be shipped to S3 in a directory structure such as application/timestamp(one_hour)_logs.parquet. We want to build a simple developer-focussed UI where we can query a given application for a time range, retrieve the S3 blobs in that range, and brute-force search for the desired string. I see that roapi offers a REST interface for a fixed set of files, but I would like to dynamically glob newer files. Are there alternatives that can be used too? Thanks
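One way to approach the dynamic part, assuming the layout described above means object keys like `<application>/<unix-hour>_logs.parquet` (the exact key format and the application name below are assumptions): enumerate the hourly keys for the requested range, then fetch just those objects.

```python
from datetime import datetime, timedelta, timezone

def hourly_keys(app: str, start: datetime, end: datetime) -> list[str]:
    """Enumerate the expected object keys for each hour in [start, end).

    Assumes the layout described above: <application>/<unix-hour>_logs.parquet,
    with the timestamp truncated to the top of the hour.
    """
    keys = []
    t = start.replace(minute=0, second=0, microsecond=0)
    while t < end:
        keys.append(f"{app}/{int(t.timestamp())}_logs.parquet")
        t += timedelta(hours=1)
    return keys

start = datetime(2022, 5, 1, 10, tzinfo=timezone.utc)
end = datetime(2022, 5, 1, 13, tzinfo=timezone.utc)
print(hourly_keys("billing", start, end))
# ['billing/1651399200_logs.parquet',
#  'billing/1651402800_logs.parquet',
#  'billing/1651406400_logs.parquet']
```

From there you could list what actually exists via S3's ListObjectsV2 with the application prefix (e.g. boto3's `list_objects_v2(Bucket=..., Prefix=app + "/")`) and intersect with these keys, rather than relying on a fixed file list.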
This looks really cool! Especially using datafusion underneath means that it probably is blazingly fast.
As I commented on a recent similar discussion, these tools can't be used for update or insert. As useful as querying might be, it's terribly misleading to claim to "run SQL" if you can't change the data, since that's such a critical part of an SQL database.
What you're really saying is that the database presented in OP is not useful because it only handles DQL.
I think it is worth pointing out that this tool does support querying Delta Lake (the author of ROAPI is also a major contributor to the native Rust implementation of Delta Lake). Delta Lake certainly supports transactions, so ROAPI can query transactional data, although the writes would not go through ROAPI.
Yes, I designed the code base so that the core of the IO and query logic are abstracted into a Rust library called columnq. My plan is to wrap it with pyo3 so the full API can be accessed as a Python package! If you are interested in helping with this, please feel free to submit a PR. The core library is located at https://github.com/roapi/roapi/tree/main/columnq
ebfe1|3 years ago
clickhouse-local - https://news.ycombinator.com/item?id=22457767
q - https://news.ycombinator.com/item?id=27423276
textql - https://news.ycombinator.com/item?id=16781294
simpql - https://news.ycombinator.com/item?id=25791207
We need a benchmark, I think ;)
sjellis|3 years ago
dsq CLI - https://github.com/multiprocessio/dsq
DataStation desktop app - https://datastation.multiprocess.io/
Two alternative CLI tools that I looked at:
sq - https://sq.io/
octosql - https://github.com/cube2222/octosql
Honourable mentions to:
Miller - https://miller.readthedocs.io/en/latest/index.html
Dasel - https://daseldocs.tomwright.me/
These don't use SQL.
VisiData is also amazing for data navigation, although it requires some effort to get the model:
https://www.visidata.org/
samwillis|3 years ago
I created this project for distributing Node via PyPI: https://pypi.org/project/nodejs-bin/
gavinray|3 years ago
Here are my anecdotes.
2) the author once answered some questions I posted on Datafusion, so they're cool in my book
henrydark|3 years ago
https://github.com/benfred/py-spy
cube2222|3 years ago
If you like this, I recommend taking a look at OctoSQL[0], which I'm the author of.
It's plenty fast and easier to add new data sources for as external plugins.
It can also handle endless streams of data natively, so you can do running groupings on i.e. tailed JSON logs.
Additionally, it's able to push down predicates to the database below, so if you're selecting 10 rows from a 1 billion row table, it'll just get those 10 rows instead of getting them all and filtering in memory.
[0]: https://github.com/cube2222/octosql
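The predicate pushdown point is easy to illustrate with a toy example. This is only a conceptual sketch in Python, not OctoSQL's implementation; the table shape and field names are made up.

```python
# Toy illustration of predicate pushdown: the engine hands the filter to
# the source, so only matching rows are ever materialized, instead of
# fetching everything and filtering in memory afterwards.

def scan_with_pushdown(table, predicate):
    """The source applies the predicate itself, e.g. as a WHERE clause
    sent to the underlying database, or via an index lookup."""
    return [row for row in table if predicate(row)]

def scan_then_filter(table, predicate):
    """Naive plan: pull every row into memory first, then filter."""
    fetched = list(table)  # in the 1-billion-row case, this is the problem
    return [row for row in fetched if predicate(row)]

# A million-row stand-in for the "1 billion row table", with 10 matches.
table = [{"id": i, "country": "PL" if i % 100_000 == 0 else "US"}
         for i in range(1_000_000)]
wanted = lambda row: row["country"] == "PL"

print(len(scan_with_pushdown(table, wanted)))  # 10
```

Both plans return the same rows; the difference is that the naive one materializes the whole table first, which is exactly the cost pushdown avoids.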
sakras|3 years ago
> blazingly fast
I’m going to need to see a citation for that. Last I checked, it was being beaten by Apache Spark in non-memory constrained scenarios [0]. This may be “blazingly fast” compared to Pandas or something, but it’s still leaving a TON of room on the table performance-wise. There’s a reason why Databricks found it necessary to redirect their Spark backend to a custom native query engine [1].
[0] https://andygrove.io/2019/04/datafusion-0.13.0-benchmarks/
[1] https://cs.stanford.edu/~matei/papers/2022/sigmod_photon.pdf
cube2222|3 years ago
A simple group by takes 66 seconds. The equivalent in OctoSQL takes less than 3 seconds.
I retract my statement about this project being blazingly fast, though I imagine it's just the JSON parser that requires optimization.
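The thread doesn't show the benchmarked query, but a "simple group by" over tailed JSON logs is usually something like `SELECT level, COUNT(*) FROM logs GROUP BY level`. A minimal Python equivalent, with the `level` field and the sample lines assumed for illustration:

```python
import json
from collections import Counter

# Newline-delimited JSON log lines, the shape both tools would be fed.
lines = [
    '{"level": "info", "msg": "started"}',
    '{"level": "error", "msg": "boom"}',
    '{"level": "info", "msg": "done"}',
]

# Equivalent of SELECT level, COUNT(*) FROM logs GROUP BY level.
counts = Counter(json.loads(line)["level"] for line in lines)
print(dict(counts))  # {'info': 2, 'error': 1}
```

At benchmark scale, the dominant cost is usually parsing each JSON line, which is consistent with the guess above that the JSON parser is the bottleneck.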
smugma|3 years ago
e.g.

sqlite3 :memory: -cmd '.mode csv' -cmd '.import royalties.csv Royalty' -cmd '.mode column' \
  'SELECT * FROM Royalty LIMIT 10'
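For comparison, the same import-then-query round trip can be done with Python's stdlib sqlite3 module. The actual columns of royalties.csv aren't shown in the thread, so `title` and `amount` (and the sample rows) are assumed:

```python
import csv
import io
import sqlite3

# Stand-in for royalties.csv; real code would use open("royalties.csv").
sample = "title,amount\nBook A,100\nBook B,250\n"

con = sqlite3.connect(":memory:")
reader = csv.reader(io.StringIO(sample))
header = next(reader)

# Create the table from the CSV header, then bulk-insert the rows,
# mirroring sqlite3's .import behaviour.
con.execute(f"CREATE TABLE Royalty ({', '.join(header)})")
con.executemany("INSERT INTO Royalty VALUES (?, ?)", reader)

total = con.execute(
    "SELECT SUM(CAST(amount AS INTEGER)) FROM Royalty"
).fetchone()[0]
print(total)  # 350
```

The CAST is needed because .import-style loading stores everything as text; the CLI one-liner has the same quirk.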
mgradowski|3 years ago
1. SQL can be thought of as being composed of several smaller languages: DDL, DQL, DML, DCL.
2. columnq-cli is only a CLI to a query engine, not a database. As such, it only supports DQL by design.
3. I have the impression that outside of data engineering/DBA, people are rarely taught the distinction between OLTP and OLAP workloads [1]. The latter often utilizes immutable data structures (e.g. columnar storage with column compression), or provides limited DML support, see e.g. the limitations of the DELETE statement in ClickHouse [2], or the list of supported DML statements in Amazon Athena [3]. My point -- as much as this tool is useless for transactional workloads, it is perfectly capable of some analytical workloads.
[1] Opinion, not a fact.
[2] https://clickhouse.com/docs/en/sql-reference/statements/dele...
[3] https://docs.aws.amazon.com/athena/latest/ug/functions-opera...
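To make point 3 concrete, here is a toy run-length-encoded column. It is only a stand-in for real columnar compression, not how ClickHouse or Athena store data: scans and aggregations can read the compressed runs directly, while an in-place update would mean splitting a run and rewriting everything after it.

```python
def rle_encode(values):
    """Run-length encode a column into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_count(runs, value):
    # Aggregations read the compressed runs directly; no decompression.
    return sum(n for v, n in runs if v == value)

col = ["US", "US", "US", "PL", "PL", "US"]
runs = rle_encode(col)
print(runs)                   # [('US', 3), ('PL', 2), ('US', 1)]
print(rle_count(runs, "US"))  # 4

# Updating row 4 from "PL" to "DE" would mean splitting the ('PL', 2)
# run and rewriting the encoding from that point on, which is why OLAP
# stores often restrict or batch DELETE/UPDATE.
```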
TAForObvReasons|3 years ago
> Create full-fledged APIs for slowly moving datasets without writing a single line of code.
Even the name of the project, "ROAPI", has "read only" in it.
gavinray|3 years ago
What kind of headline would make you want to read/try such a thing?
(I'm planning on announcing it + releasing code on HN but have never done so before)