We found that many people either aren't technical enough or don't want to go through the hassle of setting up a MySQL instance, defining a schema, and cleaning the data. So now we do that for them.
I haven't used it in ages, but Log Parser -- a fantastic old Microsoft skunkworks project -- lets you run full SQL directly against CSV files, Apache logs, the Windows Registry, Active Directory, file paths...
Microsoft SQL Server has a couple of methods for querying CSV/XLS(X) files (I remember at least OPENROWSET and BULK INSERT), but they can be a bit tricky to use and configure (you need the right OLEDB driver, INI files for configuration, etc.), and the behaviour is not always consistent or well documented.
Edit: of course with BULK INSERT you don't directly query the file, you must load it to a (temp) table
Data isn't "locked up" in CSV; rather, it is available in CSV.
CSV is one of the most open things possible.
As a programmer I do a lot of one-time makeshift data reports for other people, and I always use CSV (or, more precisely, tab-separated values) because that's what every program happily emits and consumes. If a program doesn't, its output is trivial to transform thanks to UNIX sort, awk, and uniq.
Once your CSV (or TSV) files start having quoted fields, they become very tricky to parse using standard multi-purpose tools like sort, awk, and uniq.
It's hard enough when you have delimiters in quoted fields, but dealing with quoted newlines starts to become unreasonable, especially for line-based tools.
CSV files, as you say, are absolutely wonderful to create. Problems come up when you try to parse files other people write. Not everyone follows RFC 4180.
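To make the failure mode concrete: a naive line-and-comma split (which is effectively what line-based tools do) mangles a quoted field, while Python's csv module, which follows RFC 4180, parses it correctly. A minimal sketch:

```python
import csv
import io

# One record whose quoted field contains both a comma and a newline.
raw = 'id,comment\n1,"hello, world\ngoodbye"\n'

# Naive line-based parsing: the embedded newline creates a bogus third
# "row", and the embedded comma splits the field in two.
naive_rows = [line.split(",") for line in raw.splitlines()]
print(len(naive_rows))  # 3

# RFC 4180-aware parsing: one header row plus one intact data row.
rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])  # ['1', 'hello, world\ngoodbye']
```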
If I were to build a MVP for any type of data charting or visualization or analysis the first data format that I would think of supporting would be CSV.
CSV is the biggest pile of nuisance you'd never expect from a seemingly simple data format.
- Header line or none?
- "\n" or "\r\n"?
- Is there a newline at the last line? How about two?
- Escape quotes with doubling or backslash? How about both in the same file? How about both, inconsistently, in different fields? How about quoted fields that contain newlines and commas?
- Strings always quoted? Only if necessary? Is ,, a null or an empty string, or an error?
- How about mixed line lengths? Are missing trailing entries nulls? How about multiple data types in a file, with the first field being type, and line length only fixed per type?
I have generally found "TSV with a rule that data cannot contain tabs or newlines, period" to be vastly superior.
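That rule is easy to enforce at write time. A hypothetical helper (names are my own, not from the thread) that refuses any field which could corrupt the format, so the output stays safely line-based:

```python
def tsv_row(fields):
    """Join fields with tabs, rejecting any value that could break the format."""
    out = []
    for f in fields:
        s = str(f)
        if "\t" in s or "\n" in s or "\r" in s:
            raise ValueError("field may not contain tabs or newlines: %r" % s)
        out.append(s)
    return "\t".join(out) + "\n"

# Commas and quotes need no escaping at all in this scheme.
print(tsv_row(["a", "b,c", 'say "hi"']))
```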
Chartio will actually accept TSV files as well, because it detects most of these settings: the delimiter, the presence of a header line, the file encoding, and the newline character(s).
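Python's standard library does a rough version of this kind of detection with csv.Sniffer, which guesses the dialect from a sample and heuristically decides whether a header row is present:

```python
import csv

# A semicolon-delimited sample, as a European-locale Excel might export.
sample = "name;age;city\nalice;30;Paris\nbrian;25;Milan\n"

dialect = csv.Sniffer().sniff(sample)
has_header = csv.Sniffer().has_header(sample)

print(dialect.delimiter)  # ';'
print(has_header)         # True
```

Both calls are heuristics, so on short or messy samples they can guess wrong; real files deserve a fallback path.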
I agree TSV is a lot nicer, and has the bonus that Excel will open a TSV file with an .xls extension without any problems (great for sharing!).
Oh God, line endings created such havoc for me with Django's FileField model fields. I finally had to subclass it and replace \r\n with \n to simulate opening in universal-newline mode. Btw, that whole area in Django needs a good refactor.
I should probably package it up for submission upstream.
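For reference, the workaround amounts to a two-step replace (collapse \r\n first, so lone-\r handling can't double-convert), mimicking what universal-newline mode does:

```python
def normalize_newlines(text):
    # Collapse Windows (\r\n) and old-Mac (\r) line endings to Unix \n,
    # the same normalization universal-newline mode applies on read.
    return text.replace("\r\n", "\n").replace("\r", "\n")

print(repr(normalize_newlines("a\r\nb\rc\n")))  # 'a\nb\nc\n'
```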
There is something called ARFF, which is basically CSV with a header that defines how to interpret the data, plus a bit of documentation and a reference implementation.
Is it possible to set this up so that you can point it to a remote URL to fetch the CSV? For example, I'd love it if I could point it towards a Yahoo! Finance API call that returns a CSV file and have my dashboard auto-update. Or Dropbox URL. Or any remote URL. That would be awesome.
That exact use case is not currently possible, but it's something we've talked a bit about. We're still considering doing something like it in the future.
The biggest issue I have with csv files, is when the people sending them don't get the $%^&@ quotes right. The second biggest issue, is when they say "sorry, we're not able to do that" after I tell them exactly how to fix what they're sending.
Substringed JSON arrays work great for this as well. Basically you just convert each row to an array, convert the array to JSON, and then strip the leading and trailing brackets ("[" and "]"). Newlines will be properly represented as \n, as will other funky characters.
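The trick looks like this in Python: each output line is a JSON array with the brackets shaved off, so embedded newlines survive as \n escapes and decoding is just a matter of putting the brackets back:

```python
import json

def encode_row(fields):
    # Serialize the row as a JSON array, then drop the enclosing brackets.
    return json.dumps(fields)[1:-1]

def decode_row(line):
    # Restore the brackets and let the JSON parser do the unescaping.
    return json.loads("[" + line + "]")

row = ["a", "multi\nline", 'has "quotes" and ,commas,']
line = encode_row(row)

# The encoded line contains no literal newline, so it stays one record.
print(line)
print(decode_row(line) == row)  # True
```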
What is the proper way to encode the quotes? I'm playing around with writing a Livejournal data dump reader and will have to do the quote dance pretty soon. In LJ files they are encoded like so:
col1,col2,col3,"Longer data and somethin ""with"" quotes",col5
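Doubling the quote character is the RFC 4180 way, and it's what Python's csv module emits and expects by default, so you rarely need to do the dance by hand. A quick check against an LJ-style line:

```python
import csv
import io

line = 'col1,col2,col3,"Longer data and somethin ""with"" quotes",col5\n'

# Reading: the doubled quotes come back as single quotes in the data.
(row,) = csv.reader(io.StringIO(line))
print(row[3])  # Longer data and somethin "with" quotes

# Writing: the writer re-doubles embedded quotes automatically.
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue())
```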
It's a little Ruby script that, in one command, takes one or several CSV files, parses their structure into simplistic SQLite table definitions, and then creates a new SQLite database file populated with the structure and data from those CSVs.
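This isn't that script, but the same idea fits in a few lines of Python with the standard library, under the simplifying assumption that every column is TEXT (the helper name and naive identifier quoting are mine):

```python
import csv
import io
import sqlite3

def csv_to_sqlite(conn, table, csv_text):
    """Create a TEXT-typed table from the CSV header and load all rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # Naive identifier quoting; fine for a sketch, not for untrusted input.
    cols = ", ".join('"%s" TEXT' % c for c in header)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, cols))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, placeholders), data)

conn = sqlite3.connect(":memory:")
csv_to_sqlite(conn, "people", "name,city\nalice,Paris\nbob,Lyon\n")
print(conn.execute("SELECT city FROM people WHERE name = 'alice'").fetchone())
# ('Paris',)
```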
That is really nice. I remember implementing something similar to steps 2 and 3 on App Engine a few years back for a project. It turned out quite nice, and non-tech-savvy users used it without many problems. We managed to have something that could easily work with different CSV formats (MS Excel and others).
Yeah, DataWrangler is an awesome project. There are actually a few great companies coming out with products based around it. We may just integrate with one of those.
The CSV format does not define an encoding. Customers also usually think of CSV as the export/import format for Microsoft Excel; unfortunately, each Office localization uses a different output encoding and delimiter (and date format), and you cannot assume the encoding is UTF-8. The most complete CSV importer I have seen is the one in OpenOffice. It is also worth mentioning that Excel can import CSV with a non-system locale and different delimiters, but not through the Open File command.
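In practice that means being explicit about both the encoding and the delimiter when reading an Excel-produced file. A sketch for a semicolon-delimited cp1252 export (both assumptions on my part; a German-locale Excel commonly produces exactly this combination):

```python
import csv
import io

# Bytes as a German-locale Excel might write them: cp1252, ';' delimiter.
data = "Name;Stra\u00dfe\nJos\u00e9;Hauptstra\u00dfe 1\n".encode("cp1252")

# Decoding this as UTF-8 raises UnicodeDecodeError; be explicit instead.
text = data.decode("cp1252")
rows = list(csv.reader(io.StringIO(text), delimiter=";"))
print(rows[1])  # ['José', 'Hauptstraße 1']
```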
jeremysmyth | 12 years ago:
You can run full SQL queries directly against a text file as if it was a table.
thingsilearned | 12 years ago:
https://chartio.com/education/databases/excel-to-mysql
justinator | 12 years ago:
https://metacpan.org/module/DBD::CSV
eli | 12 years ago:
http://www.microsoft.com/en-us/download/details.aspx?id=2465...
Windows only, natch.
voltagex_ | 12 years ago:
http://naa.gov.au/naaresources/govhack-2013/PassengersArriva...
jwilliams | 12 years ago:
Those aspects are defined in RFC 4180 - just a lot of systems don't bother. How would you define a simpler data format?
alayne | 12 years ago:
I have a simple scheme where comma is the delimiter and , in content is escaped as \,. There are no quotes around values.
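One catch with that scheme: the backslash itself then needs escaping too, or a literal "\," in the data becomes ambiguous. A sketch of the encode/decode pair (helper names are mine), escaping \ as \\ and , as \,:

```python
def escape_field(s):
    # Escape backslashes first so an escaped comma isn't double-processed.
    return s.replace("\\", "\\\\").replace(",", "\\,")

def unescape_field(s):
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append(s[i + 1])  # take the escaped character literally
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

field = "a,b\\c"
print(escape_field(field))                           # a\,b\\c
print(unescape_field(escape_field(field)) == field)  # True
```

Splitting a record line back into fields needs the same care: you have to split only on commas not preceded by an unescaped backslash.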
eli | 12 years ago:
It's a terrible hack, but I actually still use it pretty frequently.
AdrianRossouw | 12 years ago:
[1] http://csvkit.readthedocs.org/
thingsilearned | 12 years ago:
http://labix.org/python-dateutil