XSV – A fast CSV toolkit in Rust

101 points | mseri | 11 years ago | github.com

47 comments

[+] burntsushi | 11 years ago
Author here. I was really hoping to get binaries for Windows/Mac/Linux available before sharing it with others, but clearly I snoozed. I do have them available for Linux though, so you don't have to install Rust in order to try xsv: https://github.com/BurntSushi/xsv/releases

Otherwise, you could try using rustle[1], which should install `xsv` in one command (but it downloads Rust and compiles everything for you).

While I have your attention, if I had to pick one of the cooler features of xsv, I'd tell you about `xsv index`. It's a command that creates a very simple index that permits random access to your CSV data. This makes a lot of operations pretty fast. For example:

    xsv index worldcitiespop.csv  # ~1.5s for 145MB
    xsv slice -i 500000 worldcitiespop.csv | xsv table  # instant, plus elastic tab stops for good measure
That second command doesn't have to chug through the first 499,999 records to get the 500,000th record.

This can make other commands faster too, like random sampling and statistic gathering. (Parallelism is used when possible!)
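To make the idea concrete, here is a minimal sketch of a byte-offset index over CSV records. This is my own illustration, not xsv's actual code, and it assumes no quoted field contains an embedded newline:

```rust
// Build a list of byte offsets at which each CSV record starts.
// Sketch only: assumes records are newline-terminated and no quoted
// field contains an embedded newline.
fn build_index(data: &str) -> Vec<u64> {
    let mut offsets = vec![0u64];
    for (i, b) in data.bytes().enumerate() {
        if b == b'\n' && i + 1 < data.len() {
            offsets.push((i + 1) as u64);
        }
    }
    offsets
}

// Jump straight to the nth record using the index -- no need to parse
// the records that come before it.
fn nth_record<'a>(data: &'a str, index: &[u64], n: usize) -> &'a str {
    let start = index[n] as usize;
    let end = data[start..]
        .find('\n')
        .map(|e| start + e)
        .unwrap_or(data.len());
    &data[start..end]
}

fn main() {
    let csv = "city,pop\nparis,2000000\nlagos,15000000\n";
    let idx = build_index(csv);
    println!("{}", nth_record(csv, &idx, 2)); // prints "lagos,15000000"
}
```

The index costs one `u64` per record, which is why a slice deep into a large file can return instantly: it's a seek plus a single record parse.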

Finally, have you ever seen a CLI app QuickCheck'd? Yes. It's awesome! :-) https://github.com/BurntSushi/xsv/blob/master/tests/test_sor...

[1] - https://github.com/brson/rustle

[+] simi_ | 11 years ago
I'm looking forward to playing with cool languages like Rust, Nim, and Elm. But when I read stuff like this I remember why I love using Go every day. Generating binaries for multiple platforms is braindead easy, as is building from source on any system with Go installed.

That aside, really great work OP! I quite like the CSV format and had 2 ideas based on my experience with it that I'd love to get an opinion on:

1. markdown compiler plugin to expand ![title](filename.csv)

2. a barebones, imgur-like website for quickly uploading CSV files, maybe also a public gallery to showcase interesting data (with all uploads marked public/unlisted/private, obviously)

[+] timClicks | 11 years ago
Sorry for not looking into it myself, but what is the format of your index file (I'm assuming the index is stored as a file somewhere)? Could other tools read from it too?
[+] dbro | 11 years ago
Here's another suggestion for the criticism section (which is a good idea for any open-minded project to include):

Instead of using a separate set of tools to work with CSV data, use an adapter to allow existing tools to work around CSV's quirky quoting methods.

csvquote (https://github.com/dbro/csvquote) enables the regular UNIX command line text toolset (like cut, wc, awk, etc.) to work properly with CSV data.
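The core trick, as I understand it (this Rust sketch is my own illustration, not csvquote's actual code), is to temporarily swap delimiters that appear *inside* quoted fields for nonprinting bytes, so that tools like cut and awk see exactly one record per line and one comma per field boundary:

```rust
// Replace commas and newlines that occur inside quoted fields with
// nonprinting sentinel characters, so line-oriented tools work safely.
fn sanitize(input: &str) -> String {
    let mut in_quotes = false;
    input
        .chars()
        .map(|c| match c {
            '"' => {
                in_quotes = !in_quotes;
                c
            }
            ',' if in_quotes => '\u{1F}',  // stash embedded commas
            '\n' if in_quotes => '\u{1E}', // stash embedded newlines
            _ => c,
        })
        .collect()
}

// Undo the substitution after the text tools have run.
fn restore(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{1F}' => ',',
            '\u{1E}' => '\n',
            _ => c,
        })
        .collect()
}

fn main() {
    let csv = "name,quote\nyogi,\"it ain't over, till it's over\"\n";
    let safe = sanitize(csv);
    // `safe` can now be piped through cut/awk/wc; run restore() on the
    // result to get proper CSV back.
    assert_eq!(restore(&safe), csv);
}
```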

[+] burntsushi | 11 years ago
That's a wicked cool tool! Thank you for sharing.

I do think there is room for both tools though. One of the cooler things I did with `xsv` was implement a very basic form of indexing. It's just a sequence of byte offsets where records start in some CSV data. Once you have that, you can do things like process the data in parallel or slice records in CSV instantly regardless of where those records occur.

It helps when the CSV parser has support for this: http://burntsushi.net/rustdoc/csv/struct.Reader.html#method....
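As a rough illustration of the parallelism point (assumed code, not xsv's implementation): once you have record start offsets, you can split the records into contiguous chunks and hand each chunk to its own thread.

```rust
use std::thread;

// Given byte offsets of record starts, split the records into chunks
// and count fields in parallel. Sketch only: assumes comma-separated
// records with no quoted fields, and data with a 'static lifetime.
fn parallel_field_count(data: &'static str, offsets: &[usize], workers: usize) -> usize {
    let chunk = (offsets.len() + workers - 1) / workers;
    let mut handles = Vec::new();
    for c in offsets.chunks(chunk) {
        let start = c[0];
        let last = *c.last().unwrap();
        // Extend the chunk to the end of its final record.
        let end = data[last..]
            .find('\n')
            .map(|e| last + e)
            .unwrap_or(data.len());
        let slice = &data[start..end];
        handles.push(thread::spawn(move || {
            slice.lines().map(|l| l.split(',').count()).sum::<usize>()
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    static DATA: &str = "a,b\nc,d\ne,f\ng,h\n";
    let offsets = vec![0, 4, 8, 12];
    println!("{}", parallel_field_count(DATA, &offsets, 2)); // prints "8"
}
```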

[+] tbrownaw | 11 years ago
From the "criticisms" section: You shouldn't be working with CSV data because CSV is a terrible format.

Er, what's wrong with it? Or is this a case of, people using it for things other than what it's meant for? Is there a better format for sending data between different companies using different enterprisey database systems?

My complaint about CSV is that people frequently generate it manually and don't understand how to quote text fields, so they don't double any quote characters that are part of the data, which means I have to spend time cleaning up malformed files.
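For reference, the quoting rule in question (per RFC 4180) is small enough to fit in a few lines; this Rust sketch is just an illustration of it:

```rust
// RFC 4180 quoting: wrap a field in double quotes if it contains a
// comma, quote, or newline, and double any embedded quote characters.
fn quote_field(field: &str) -> String {
    if field.contains('"') || field.contains(',') || field.contains('\n') {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

fn main() {
    println!("{}", quote_field("plain"));        // prints "plain"
    println!("{}", quote_field("a,b"));          // prints "\"a,b\""
    println!("{}", quote_field("say \"hi\""));   // prints "\"say \"\"hi\"\"\""
}
```

Hand-rolled generators that skip the quote-doubling step are exactly what produce the malformed files described above.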

[+] ahoge | 11 years ago
> what's wrong with [CSV]?

The encoding, line endings, and separators are essentially arbitrary. Encoding/decoding behavior may be locale-dependent. E.g. if you use Excel with a de-DE locale to create a CSV file, Excel with an en-US locale won't open it correctly: the former uses ';' as the separator and the latter expects ','.

Since the format is so (seemingly) simple, there are millions of ad hoc implementations out there. Naturally, most of them are horribly broken.

Fun fact: ASCII actually reserved control characters for this stuff. 1F is the "unit separator" and 1E is the "record separator". There is even a "group separator" (1D) and a "file separator" (1C).

http://en.wikipedia.org/wiki/ASCII#ASCII_control_code_chart
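A quick sketch (my own illustration) of what a delimiter-free format using those reserved ASCII control characters could look like: no quoting rules are needed, because the separator bytes never appear in ordinary text.

```rust
const US: char = '\u{1F}'; // ASCII unit (field) separator
const RS: char = '\u{1E}'; // ASCII record separator

// Join fields with US and records with RS -- commas and newlines in the
// data need no escaping at all.
fn encode(records: &[Vec<&str>]) -> String {
    records
        .iter()
        .map(|r| r.join(&US.to_string()))
        .collect::<Vec<_>>()
        .join(&RS.to_string())
}

fn decode(data: &str) -> Vec<Vec<&str>> {
    data.split(RS).map(|r| r.split(US).collect()).collect()
}

fn main() {
    // Fields containing commas and newlines round-trip untouched.
    let recs = vec![vec!["a,b", "c"], vec!["d\ne", "f"]];
    let enc = encode(&recs);
    assert_eq!(decode(&enc), recs);
}
```

The catch, of course, is that no spreadsheet or text editor gives you a convenient way to type or display those characters, which is a big part of why commas won.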

[+] valevk | 11 years ago
I think CSV is great. The problem, as you said, is that people don't use dedicated CSV parsers and create utter crap.

The other problem is Microsoft Excel. It is installed on almost any Windows machine, and automatically mapped to the *.csv file extension. The first thing I do on Windows machines is map _any_ text-based file format to Notepad++.

Excel destroys any CSV file upon saving. It can't handle different decimal point formats. Field separator characters are mapped to tabs. Multiline text is destroyed (not sure if that's still true; in Office 2007 it was). And don't even get me started about dates.

[+] steveklabnik | 11 years ago
"CSV" is bad because it's not well-specified. There are tons of "CSV" parsers in the wild, and they all make reasonable, but different, choices about some behaviors. INI is the same way.
[+] 101914 | 11 years ago
Did you try benchmarking against kdb+?

Seems like there are always HN commenters lambasting CSV. I am sure they have very good reasons.

But, as for me, CSV is one of my favorite formats. (Sort of like how people like XML or JSON I guess.) I like the limitations of CSV because I like simple, raw data.

I wish the de facto format that www servers delivered was CSV instead of HTML (for the reason why, see below). Or at least I wish there was an option to receive pages in CSV in addition to HTML.

Users could create their own markup, client side. Users could effectively use their "spreadsheet software" to read the information on the www. Or they could create infinitely creative presentations of data for themselves or others using HTML5 or some other tool of personal expression.

It is easy to create HTML from CSV but I find it is a nuisance creating CSV from HTML.

Because I have a need for CSV I write scanners with flex to convert HTML to CSV.

I often wonder why I cannot access all the data I need from the www in CSV format. Many have agreed over the years that the www needs more structure to be more valuable as a data source. If data is first created in CSV, then you have some inherent structure to build on; you can _use it_ to create markup and add infinite creativity without destroying the underlying structure.

If data (cf. art or forms of personal expression) cannot be presented in CSV then is it really raw data or is it something else, more subjective and unwieldy?

Whatever. Back to reality. Pay no mind.

[+] burntsushi | 11 years ago
> Did you try benchmarking against kdb+?

xsv is never, ever going to compete with a real database. Full stop.

It's just a command line tool that tries to make some things faster when slicing and dicing CSV data.

[+] btown | 11 years ago
If you need to do an indexing step anyways, why not simply import the data into a SQL database, or build this as a wrapper that introspects the CSV file, builds a database schema, and does the import for you? Is the issue limited scratch space?
[+] burntsushi | 11 years ago
See the "Motivation" section: https://github.com/BurntSushi/xsv#motivation

There's a line somewhere between "conveniently play with large CSV data on the CLI" and "the full power of an RDBMS." It's blurry and we won't all agree on where it lies, but it certainly exists for me. (And based on feedback, it exists for lots of others too.)

Also, there are already tools that look at a CSV file and figure out a schema. No need to replicate that.

Finally, the indexing step is blindingly fast and only uses `N * 8` bytes, where `N` is the number of records.

[+] brazzledazzle | 11 years ago
This is one of the things I really love about PowerShell. Import, manipulation and export of formatted raw data like CSV is dead simple.