We found that many people either aren't technical enough or don't want to go through the hassle of setting up a MySQL instance, defining a schema, and cleaning the data. So now we do that for them.
I haven't used it in ages, but Log Parser -- a fantastic old Microsoft skunkworks project -- lets you run full SQL directly against CSV files, Apache logs, the Windows Registry, Active Directory, file paths...
Microsoft SQL Server has a couple of methods for querying CSV/XLS(X) files (I remember at least OPENROWSET and BULK INSERT), but they can be a bit tricky to use and configure (you need the right OLEDB driver, INI files for configuration, etc.), and the behaviour is not always consistent or well documented.
Edit: of course with BULK INSERT you don't directly query the file, you must load it to a (temp) table
Data isn't "locked up" in CSV; rather, it is available in CSV.
CSV is one of the most open things possible.
As a programmer I do a lot of one-time makeshift data reports for other people, and I always use CSV (or, more precisely, tab-separated values) because that's what every program happily emits and consumes. If a program doesn't, its output is trivial to transform thanks to UNIX sort, awk, and uniq.
Once your CSV (or TSV) files start having quoted fields, they become very tricky to parse using standard multi-purpose tools like sort, awk, and uniq.
It's hard enough when you have delimiters in quoted fields, but dealing with quoted newlines starts to become unreasonable, especially for line-based tools.
CSV files, as you say, are absolutely wonderful to create. Problems come up when you try to parse files other people write. Not everyone follows RFC 4180.
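To make the failure mode concrete: a naive line-and-comma split (which is effectively what line-based tools do) mangles a quoted field, while Python's csv module, which follows RFC 4180, parses it correctly. A minimal sketch:

```python
import csv
import io

# One record whose quoted field contains both a comma and a newline.
raw = 'id,comment\n1,"hello, world\ngoodbye"\n'

# Naive line-based parsing: the embedded newline creates a bogus third
# "row", and the embedded comma splits the field in two.
naive_rows = [line.split(",") for line in raw.splitlines()]
print(len(naive_rows))  # 3

# RFC 4180-aware parsing: one header row plus one intact data row.
rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])  # ['1', 'hello, world\ngoodbye']
```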
If I were to build a MVP for any type of data charting or visualization or analysis the first data format that I would think of supporting would be CSV.
CSV is the biggest pile of nuisance you'd never expect from a seemingly simple data format.
- Header line or none?
- "\n" or "\r\n"?
- Is there a newline at the last line? How about two?
- Escape quotes with doubling or backslash? How about both in the same file? How about both, inconsistently, in different fields? How about quoted fields that contain newlines and commas?
- Strings always quoted? Only if necessary? Is ,, a null or an empty string, or an error?
- How about mixed line lengths? Are missing trailing entries nulls? How about multiple data types in a file, with the first field being type, and line length only fixed per type?
I have generally found "TSV with a rule that data cannot contain tabs or newlines, period" to be vastly superior.
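That rule is easy to enforce at write time. A hypothetical helper (names are my own, not from the thread) that refuses any field which could corrupt the format, so the output stays safely line-based:

```python
def tsv_row(fields):
    """Join fields with tabs, rejecting any value that could break the format."""
    out = []
    for f in fields:
        s = str(f)
        if "\t" in s or "\n" in s or "\r" in s:
            raise ValueError("field may not contain tabs or newlines: %r" % s)
        out.append(s)
    return "\t".join(out) + "\n"

# Commas and quotes need no escaping at all in this scheme.
print(tsv_row(["a", "b,c", 'say "hi"']))
```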
Chartio will actually accept TSV files as well, because it detects most of these settings: the delimiter, the presence of a header line, the file encoding, and the newline character(s).
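Python's standard library does a rough version of this kind of detection with csv.Sniffer, which guesses the dialect from a sample and heuristically decides whether a header row is present:

```python
import csv

# A semicolon-delimited sample, as a European-locale Excel might export.
sample = "name;age;city\nalice;30;Paris\nbrian;25;Milan\n"

dialect = csv.Sniffer().sniff(sample)
has_header = csv.Sniffer().has_header(sample)

print(dialect.delimiter)  # ';'
print(has_header)         # True
```

Both calls are heuristics, so on short or messy samples they can guess wrong; real files deserve a fallback path.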
I agree TSV is a lot nicer, and has the bonus that Excel will open a TSV file with an .xls extension without any problems (great for sharing!).
Oh God, line endings created such havoc for me with Django's FileField model fields. I finally had to subclass it and replace \r\n with \n to simulate opening in universal-newline mode. Btw, that whole area in Django needs a good refactor.
I should probably package it up for submission upstream.
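For reference, the workaround amounts to a two-step replace (collapse \r\n first, so lone-\r handling can't double-convert), mimicking what universal-newline mode does:

```python
def normalize_newlines(text):
    # Collapse Windows (\r\n) and old-Mac (\r) line endings to Unix \n,
    # the same normalization universal-newline mode applies on read.
    return text.replace("\r\n", "\n").replace("\r", "\n")

print(repr(normalize_newlines("a\r\nb\rc\n")))  # 'a\nb\nc\n'
```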
There is something called ARFF, which is basically CSV with a header that defines how to interpret the data, plus a bit of documentation and a reference implementation.
Is it possible to set this up so that you can point it to a remote URL to fetch the CSV? For example, I'd love it if I could point it towards a Yahoo! Finance API call that returns a CSV file and have my dashboard auto-update. Or Dropbox URL. Or any remote URL. That would be awesome.
That exact use case is not currently possible, but it's something we've talked a bit about. We're still considering doing something like it in the future.
The biggest issue I have with csv files, is when the people sending them don't get the $%^&@ quotes right. The second biggest issue, is when they say "sorry, we're not able to do that" after I tell them exactly how to fix what they're sending.
Substringed JSON arrays work great for this as well. Basically you just convert each row to an array, convert the array to JSON, and then strip the leading and trailing brackets ("[" and "]"). Newlines will be properly represented as \n, as will other funky characters.
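The trick looks like this in Python: each output line is a JSON array with the brackets shaved off, so embedded newlines survive as \n escapes and decoding is just a matter of putting the brackets back:

```python
import json

def encode_row(fields):
    # Serialize the row as a JSON array, then drop the enclosing brackets.
    return json.dumps(fields)[1:-1]

def decode_row(line):
    # Restore the brackets and let the JSON parser do the unescaping.
    return json.loads("[" + line + "]")

row = ["a", "multi\nline", 'has "quotes" and ,commas,']
line = encode_row(row)

# The encoded line contains no literal newline, so it stays one record.
print(line)
print(decode_row(line) == row)  # True
```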
What is the proper way to encode the quotes? I'm playing around with writing a Livejournal data dump reader and will have to do the quote dance pretty soon. In LJ files they are encoded like so:
col1,col2,col3,"Longer data and somethin ""with"" quotes",col5
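Doubling the quote character is the RFC 4180 way, and it's what Python's csv module emits and expects by default, so you rarely need to do the dance by hand. A quick check against an LJ-style line:

```python
import csv
import io

line = 'col1,col2,col3,"Longer data and somethin ""with"" quotes",col5\n'

# Reading: the doubled quotes come back as single quotes in the data.
(row,) = csv.reader(io.StringIO(line))
print(row[3])  # Longer data and somethin "with" quotes

# Writing: the writer re-doubles embedded quotes automatically.
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue())
```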
It's a little Ruby script that, in one command, takes one or several CSV files, parses their structure into simplistic SQLite table definitions, and then creates a new SQLite database file populated with the structure and data from those CSVs.
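This isn't that script, but the same idea fits in a few lines of Python with the standard library, under the simplifying assumption that every column is TEXT (the helper name and naive identifier quoting are mine):

```python
import csv
import io
import sqlite3

def csv_to_sqlite(conn, table, csv_text):
    """Create a TEXT-typed table from the CSV header and load all rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # Naive identifier quoting; fine for a sketch, not for untrusted input.
    cols = ", ".join('"%s" TEXT' % c for c in header)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, cols))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, placeholders), data)

conn = sqlite3.connect(":memory:")
csv_to_sqlite(conn, "people", "name,city\nalice,Paris\nbob,Lyon\n")
print(conn.execute("SELECT city FROM people WHERE name = 'alice'").fetchone())
# ('Paris',)
```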
That is really nice. I remember implementing something similar to steps 2 and 3 on App Engine a few years back for a project. It turned out quite nice, and non-tech-savvy users used it without many problems. We managed to have something that could easily work with different CSV formats (MS Excel and others).
Yeah, DataWrangler is an awesome project. There are actually a few great companies coming out with products based around it. We may just integrate with one of those.
The CSV format does not define an encoding. Customers also usually think of CSV as the export/import format for Microsoft Excel; unfortunately, each Office localization uses a different output encoding and delimiter (and date format), and you cannot assume the encoding is UTF-8. The most complete CSV importer I have seen is the one in OpenOffice. It is also worth mentioning that Excel can import CSV with a non-system locale and different delimiters, but not through the Open File command.
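In practice that means being explicit about both the encoding and the delimiter when reading an Excel-produced file. A sketch for a semicolon-delimited cp1252 export (both assumptions on my part; a German-locale Excel commonly produces exactly this combination):

```python
import csv
import io

# Bytes as a German-locale Excel might write them: cp1252, ';' delimiter.
data = "Name;Stra\u00dfe\nJos\u00e9;Hauptstra\u00dfe 1\n".encode("cp1252")

# Decoding this as UTF-8 raises UnicodeDecodeError; be explicit instead.
text = data.decode("cp1252")
rows = list(csv.reader(io.StringIO(text), delimiter=";"))
print(rows[1])  # ['José', 'Hauptstraße 1']
```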
jeremysmyth | 12 years ago:
You can run full SQL queries directly against a text file as if it was a table.
thingsilearned | 12 years ago:
https://chartio.com/education/databases/excel-to-mysql
justinator | 12 years ago:
https://metacpan.org/module/DBD::CSV
eli | 12 years ago:
http://www.microsoft.com/en-us/download/details.aspx?id=2465...
Windows only, natch.
voltagex_ | 12 years ago:
http://naa.gov.au/naaresources/govhack-2013/PassengersArriva...
jwilliams | 12 years ago:
Those aspects are defined in RFC 4180 - just a lot of systems don't bother. How would you define a simpler data format?
alayne | 12 years ago:
I have a simple scheme where comma is the delimiter and , in content is escaped as \,. There are no quotes around values.
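One catch with that scheme: the backslash itself then needs escaping too, or a literal "\," in the data becomes ambiguous. A sketch of the encode/decode pair (helper names are mine), escaping \ as \\ and , as \,:

```python
def escape_field(s):
    # Escape backslashes first so an escaped comma isn't double-processed.
    return s.replace("\\", "\\\\").replace(",", "\\,")

def unescape_field(s):
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append(s[i + 1])  # take the escaped character literally
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

field = "a,b\\c"
print(escape_field(field))                           # a\,b\\c
print(unescape_field(escape_field(field)) == field)  # True
```

Splitting a record line back into fields needs the same care: you have to split only on commas not preceded by an unescaped backslash.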
eli | 12 years ago:
It's a terrible hack, but I actually still use it pretty frequently.
AdrianRossouw | 12 years ago:
[1] http://csvkit.readthedocs.org/
thingsilearned | 12 years ago:
http://labix.org/python-dateutil