top | item 30911000

(no title)

jpf0 | 3 years ago

I did some work to compare Jd to data.tables and found that it was more performant in some instances such as on derived columns, and approximately equally performant on aggregations and queries. Jd is currently single-threaded, whereas multiple threads are important on some types of queries. I tried to further compare with Julia DB at the same time (maybe a year ago) and found that was incorrectly benchmarked by the authors and far slower than both; that might be different now. Jd is more equivalent to data.tables on disk; Clickhouse is far better at being a large-scale database.

Rules of thumb on memory usage: Python/Pandas (not memory-mapped): "In Pandas, the rule of thumb is needing 5x-10x the memory for the size of your data.” R (not memory-mapped): "A rough rule of thumb is that your RAM should be three times the size of your data set.” Jd: "In general, performance will be good if available ram is more than 2 times the space required by the cols typically used in a query.”

Re CSV reading, Jd has a fast CSV reader whereas J itself does not. I have written an Arrow integration to enable J to get to that fast CSV reader and read Parquet.

discuss

No comments yet.