top | item 18453133

Q: Run SQL Directly on CSV Files

284 points | devy | 7 years ago | harelba.github.io | reply

96 comments

[+] rudolfwinestock|7 years ago|reply
“Any file is a database if you awk hard enough.” —Bryan Horstmann-Allen

https://twitter.com/neilkod/status/914217352564137984

Furthermore, no thread on CSV files can be complete without mentioning this infamous bit of fact-trolling: the ASCII standard has had unit- and record-delimiters baked into it from the beginning.

https://ronaldduncan.wordpress.com/2009/10/31/text-file-form...

[+] rusk|7 years ago|reply
> the ASCII standard has had unit- and record-delimiters baked into it from the beginning.

It's mad that I've never heard of that before.

I'm amazed this isn't used more often!

[+] brianm|7 years ago|reply
The ability to eyeball a file in the terminal is huge though :-(

Never underestimate ergonomics like screen printability!

[+] z92|7 years ago|reply
Usability suffers if you can't type the delimiters easily.
[+] enriquto|7 years ago|reply
oh, I love that! I'm going to use these delimiters from now on.
[+] bhengaij|7 years ago|reply
And they're literally named thus too: US (unit separator), FS (file separator).

We just need an rcat to print them clearly.

[+] delinka|7 years ago|reply
If you're on Windows, you've had the ability to do this for Quite Some Time®: https://support.microsoft.com/en-us/help/850320/creating-an-...
[+] bhengaij|7 years ago|reply
Been on Windows for 18 years now, never heard of this before. Sometimes I wonder if Microsoft really sucks at marketing dev tools.
[+] ppande|7 years ago|reply
SQLite virtual tables. You can literally query any data source. All you need is a module implementing the interface SQLite exposes for the data source of interest. I once wrote a module that could query protocol-buffer-encoded data.
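For reference, SQLite ships a sample CSV virtual-table module (ext/misc/csv.c). Assuming it has been compiled as a loadable extension, a session might look like this sketch (foo.csv is a hypothetical file):

```
$ sqlite3
sqlite> .load ./csv
sqlite> CREATE VIRTUAL TABLE temp.t USING csv(filename='foo.csv', header=true);
sqlite> SELECT count(*) FROM t;
```

No import step: the virtual table reads the file on each query.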
[+] joshumax|7 years ago|reply
Fun fact: MySQL has actually supported CSV files as a storage backend for quite some time now[1]

1: https://dev.mysql.com/doc/refman/8.0/en/csv-storage-engine.h...
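For illustration, a minimal (hypothetical) table on the CSV engine would look like the sketch below; note the engine requires every column to be NOT NULL and supports no indexes:

```sql
-- Hypothetical table; the CSV engine stores the rows as plain text
-- in log_entries.CSV inside the database directory.
CREATE TABLE log_entries (
  ts  VARCHAR(32)  NOT NULL,
  msg VARCHAR(255) NOT NULL
) ENGINE = CSV;
```

The backing file can then be read or swapped out by any tool that speaks CSV.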

[+] athenot|7 years ago|reply
Came here to post the same. This can be used in very interesting ways, either for quick ad-hoc queries or for import/export functionality (ETL), where the core of your data is in InnoDB but you keep a CSV version for one of the phases of your processing, to interface with other tools.

This may look like a dirty hack compared to a clean API, but if you have performance considerations or need to interface with some legacy thing that happens to understand CSV, this can be a good choice.

[+] cbcoutinho|7 years ago|reply
I've never heard of this before, but it reminded me of another command-line tool for wrangling CSV files, xsv. Q relies on SQL syntax much more than xsv does, so there isn't 100% overlap here.

https://github.com/BurntSushi/xsv

[+] devy|7 years ago|reply
Thanks! xsv is new to me. I love Andrew Gallant's ripgrep (rg) command-line grep tool.
[+] yellowapple|7 years ago|reply
Another trick along these lines is to cut out the middleman (middleprogram? middleware?) and use SQLite to do it:

    $ sqlite3
    sqlite> .mode csv
    sqlite> .import foo.csv foo
    sqlite> SELECT * FROM foo WHERE bar = 'baz'; -- you get the gist...
    (a bunch of rows)
Q seems to be much easier to use (and certainly easier to remember), but I've always found it handy to have the full power of SQL at my fingertips when needing to do a bunch of CSV manipulations.
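The same session can also be scripted with nothing but Python's standard library; this is a self-contained sketch where hypothetical inline data stands in for foo.csv:

```python
import csv
import io
import sqlite3

# Hypothetical inline data standing in for foo.csv
data = "bar,qux\nbaz,1\nother,2\nbaz,3\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (bar TEXT, qux TEXT)")

reader = csv.reader(io.StringIO(data))
next(reader)  # skip the header row
conn.executemany("INSERT INTO foo VALUES (?, ?)", reader)

rows = conn.execute("SELECT * FROM foo WHERE bar = 'baz'").fetchall()
print(rows)  # [('baz', '1'), ('baz', '3')]
```

For a real file you would pass `open("foo.csv", newline="")` instead of the StringIO.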
[+] pixelmonkey|7 years ago|reply
I think the csvkit CLI tool, csvsql, provides this functionality, too.
[+] Tepix|7 years ago|reply
The Perl CPAN module DBD::CSV also lets you do this. It was started in 1998 and still gets updates.
[+] walshemj|7 years ago|reply
Text::CSV_XS is my go-to module for CSV. Just use that and whack the data into MySQL or Postgres would be my advice.
[+] singingfish|7 years ago|reply
I just came to say this too. Turns out it was last updated in October 2018 :o
[+] codeulike|7 years ago|reply
Real-world CSV files generally contain some or all of the following horrors:

- some strings enclosed in speechmarks, but some not

- empty fields

- speechmarks within strings

- commas within strings

- carriage returns within strings

How does Q hold up against a CSV file with those traits?
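For what it's worth, Python's standard csv module (the sort of thing a Python tool like q can build on) copes with every item on that list. A self-contained check against a hypothetical nasty file:

```python
import csv
import io

# Hypothetical sample exhibiting the listed horrors: mixed quoting,
# empty fields, and embedded quotes, commas, and newlines.
nasty = (
    'name,comment,score\n'
    'alice,"she said ""hi"", then left",10\n'
    'bob,,\n'
    '"carol","line one\nline two",7\n'
)

rows = list(csv.reader(io.StringIO(nasty)))
print(rows[1])  # ['alice', 'she said "hi", then left', '10']
print(rows[2])  # ['bob', '', '']
print(rows[3])  # ['carol', 'line one\nline two', '7']
```

Note the doubled `""` is the standard escape for a quote inside a quoted field, and the quoted newline stays inside a single record. When reading real files, open them with `newline=''` so the reader sees embedded newlines intact.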

[+] setr|7 years ago|reply
All of your “horrors” seem...correct? It's comma-delimited, so anything between two commas should be parsed without issue; if it's a string with a comma in it, and unquoted, you simply have a broken CSV file. If it's quoted, then anything until the next (unescaped) quote is fine, including commas.

Unless you’re trying to parse CSV files with regexes, none of those should be difficult, or even unexpected, to handle with a PEG parser or any equivalent device.

Of course, if you’re accepting ambiguity then it's just arbitrary how you handle it, but none of your examples, AFAICT, present any ambiguity (I’m assuming strings are either quoted or unquoted, with the former primarily allowing commas/newlines in strings; escaping exists as well; comma-delimited columns, newline-delimited rows).

[+] barrkel|7 years ago|reply
More interesting: any kind of delimiter, including chars from UTF-8 and windows-1252, and you need to detect the encoding too. And CSV embedded in CSV, the result of flattening an XML source. And fixed-width files, which aren't CSV but which you may need to support wherever you accept CSV. And let's not get into date parsing or other typed data, and type inference over sample files.
[+] Avshalom|7 years ago|reply
Without diving into the source code, I can only say Q pops up a couple of times a year, either as posts or in CLI recommendation threads, so I suspect it's at least reasonably robust.
[+] snazz|7 years ago|reply
For those reasons I much prefer tab-delimited files. Does anyone know if Q supports that?
[+] samatman|7 years ago|reply
Just this weekend was filtering commas out of unquoted dollar values in a CSV
[+] badcircle|7 years ago|reply
PowerShell will clean up a gross CSV:

(Import-Csv .\file.csv) | Export-Csv .\file.csv -NoTypeInformation -Encoding UTF8

(The parentheses force the file to be read fully before Export-Csv overwrites it.)

[+] jannes|7 years ago|reply
Excel has a lot of this stuff built in as well, in the PowerQuery editor [0], which supports CSV, JSON and XML data sources. It's not SQL, but it lets you do almost everything SQL can do, in a GUI.

After you're done preparing your data in PowerQuery you can run PowerPivot on it for aggregations.

[0]: https://cdn-5a6cb102f911c811e474f1cd.closte.com/wp-content/u...

[+] herdrick|7 years ago|reply
csvkit includes csvsql, which does this. I’ve used it and liked it. https://csvkit.readthedocs.io/en/1.0.3/tutorial/3_power_tool...
[+] fiveFeet|7 years ago|reply
Came here to tell about csvkit and the csvsql that comes with it. Highly recommended. It is one of the first things I install on a new system.
[+] zimpenfish|7 years ago|reply
Discovered csvkit last week and it was a lifesaver. Had to remove a bunch of columns from a large CSV file with many columns - csvcut was way easier than trying to do it with awk or cut.
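csvcut is the right tool for that job, but for comparison, dropping columns with only Python's standard library is a few lines. The column names and data here are hypothetical:

```python
import csv
import io

# Hypothetical input; keep only the 'name' and 'score' columns
src = "name,email,score,notes\nalice,a@x.com,10,ok\nbob,b@x.com,7,meh\n"
keep = ["name", "score"]

reader = csv.DictReader(io.StringIO(src))
out = io.StringIO()
# extrasaction="ignore" silently drops every column not listed in `keep`
writer = csv.DictWriter(out, fieldnames=keep, extrasaction="ignore",
                        lineterminator="\n")
writer.writeheader()
writer.writerows(reader)

print(out.getvalue())
# name,score
# alice,10
# bob,7
```

With real files, swap the StringIO objects for `open(...)` handles; the streaming shape stays the same, so large files aren't a problem.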
[+] usgroup|7 years ago|reply
I think tools like this don’t exist much because it typically makes more sense to just put it into a database or at least SQLite ... then do whatever you want from there.

Meanwhile for all your streaming, filtering and aggregating need there is awk.

I’ve never been bored enough to write my own little SQL library for awk, but I’d be surprised if it doesn’t exist.

[+] TkTech|7 years ago|reply
That's exactly what q (the linked tool) and almost all other tools like it do: load the data into an SQLite database and provide some handy wrappers. That's it.
[+] dima55|7 years ago|reply
There's also this:

  https://github.com/dkogan/vnlog
This is effectively a set of wrappers around awk, sort, join, etc. that knows about field names, and lets you use the vanilla shell tools in a much nicer way. It's exactly what you want for the 99% use case. If your problem is complicated enough that you feel like SQL is what you want, maybe leaving the shell is ok at that point.
[+] phaedrus|7 years ago|reply
Many of the applications at my workplace use flat text files in CSV format for logs and configuration. For new development I've been using SQLite to replace some of these usages. Sometimes when I need to analyze legacy log files I import the CSV data into SQLite tables. After doing this a few times I hit upon an idea: why not skip the import (which balloons my otherwise-small db files) and write an SQLite plugin that allows treating the flat CSV file as a virtual table? It's neat to see there's some existing work here!
[+] mathiasrw|7 years ago|reply
If you need to work with CSV or Excel or tab files from the command line I suggest having a look at alasql. Can be installed via npm and gives you a good list of export options.
[+] shodan757|7 years ago|reply
Eep, no Python 3 support? :(
[+] f00_|7 years ago|reply
I was about to just recommend pandas, but I realized it doesn't natively support running SQL on dataframes. PySpark does, though!