
*NEVER* sanitize your inputs

37 points | billpg | 12 years ago | blog.hackensplat.com

73 comments

[+] breischl|12 years ago|reply
I guess the "never sanitize" headline is clickbait, but the point is valid. "Sanitizing" input is really hard, and can provide a false sense of security. That string has been sanitized, so it's safe! Wait, is it safe for SQL? What about HTML? What about inside a <script> tag? What about a different database engine, or Mongo, or Azure Tables? You are much better off giving up on the illusion of "safe input" that sanitization gives you, and instead always treating user input as data rather than mixing it up with your code.

My major complaint is that after correctly identifying the solution for SQL, he ends up with nothing to say about HTML. The right approach for rendering user input into HTML is the JavaScript createTextNode() function. That's how you tell the browser that it absolutely shouldn't interpret that content as HTML.

[+] billpg|12 years ago|reply
Thanks for that. I'll add a note mentioning createTextNode once I've had a chance to read up on it.
[+] king_magic|12 years ago|reply
"But that's what we mean by "sanitize"! Then you should stop calling it that."

Ugh, eyeroll. Seriously, let's waste time arguing over what to call security vulnerabilities & ways to address them - instead of using consistent terminology that security-minded developers instantly recognize.

To quote the hilarious Mean Girls - "stop trying to make fetch happen".

[+] 6cxs2hd6|12 years ago|reply
Although I understand how you feel, I think OP's point was a bit more meaningful: Calling it "sanitizing" leads some programmers to try to "clean up" the input -- but instead they should contain it.

And when they try to "clean it up", they enter the realm of Falsehoods Programmers Believe About X.

e.g. http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-b...

[+] billpg|12 years ago|reply
Okay, let's keep advising people to "sanitize" inputs. Even though it's confusing and there's another word that isn't confusing. Because reasons.
[+] bevacqua|12 years ago|reply
This is terribly confusing advice: "NEVER sanitize your inputs!". He means: "just don't call it sanitizing".
[+] msl09|12 years ago|reply
Click honeypot
[+] kijin|12 years ago|reply
The argument makes sense in the SQL injection example (don't escape, use prepared statements!) but falls apart when you get to the XSS example. Now we're just trying to redefine words.

"HTML injection" does sound cool though. Since XSS nowadays is not necessarily about sending cookies to another site, perhaps we could adopt "HTML injection" as a more generic term.

Now of course, the problem we're trying to fix is someone who does:

    $content = htmlspecialchars(mysql_real_escape_string(addslashes($content)));
before $content ever hits the database, without any understanding of what those functions really do. It's a surprisingly common cargo cult among newbie web devs. Just throw all the security-related functions together and you'll be safe!
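A minimal sketch of the prepared-statement alternative, written with Python's standard `sqlite3` module rather than the PHP functions above. The `students` table and the `add_student` helper are made up for illustration; the point is only that the value is bound as a parameter, so nothing is stripped or escaped by hand.

```python
import sqlite3


def add_student(conn: sqlite3.Connection, name: str) -> None:
    # The ? placeholder binds the value separately from the SQL text,
    # so the name is never parsed as SQL and needs no escaping.
    conn.execute("INSERT INTO students (name) VALUES (?)", (name,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

names = ["Robert'); DROP TABLE students;--", "O'Brien"]
for name in names:
    add_student(conn, name)

# Read the rows back in insertion order and compare with what was submitted.
stored = [row[0] for row in conn.execute("SELECT name FROM students ORDER BY rowid")]
print(stored == names)
```

Both names should come back exactly as entered, apostrophes included, and the table is still there afterwards.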
[+] vezzy-fnord|12 years ago|reply
HTML injection already is a term: https://www.owasp.org/index.php/HTML_Injection

Same principle, but different method of exploitation. If we supply plain HTML tags in a vulnerable parameter, it's HTML injection. If we use JavaScript (via a script tag or whatnot), it's XSS.

[+] mikeash|12 years ago|reply
To me, "sanitize" implies a blacklist approach, which is inherently insecure. For the HTML example, it means you're going through and blocking <script> and such, while allowing the rest through. What you should be doing is keeping a small whitelist of allowed tags and blocking everything else, if you must support user-provided HTML in the first place. That to me isn't sanitizing but rather defining an HTML subset and then translating from it to full HTML.
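A rough sketch of that "define a subset and translate it" idea, using Python's standard `html.parser`. The allowlist, the `SubsetTranslator` class and the `translate` function are all hypothetical names, and this is nowhere near a production sanitizer: it drops every attribute, ignores comments, and does not balance tags.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}  # hypothetical allowlist


class SubsetTranslator(HTMLParser):
    """Re-emit allowlisted tags without attributes; escape everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # attributes are dropped on purpose
        else:
            # Anything outside the subset is shown literally, not executed.
            self.out.append(escape(self.get_starttag_text()))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")
        else:
            self.out.append(escape(f"</{tag}>"))

    def handle_data(self, data):
        self.out.append(escape(data))


def translate(user_html: str) -> str:
    parser = SubsetTranslator()
    parser.feed(user_html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


print(translate('<i onclick="x()">hi</i><script>alert(1)</script>'))
```

The output is built only from tags the translator itself writes, so a new or obscure element is inert by default instead of needing a new blacklist entry.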
[+] gizmogwai|12 years ago|reply
This whole post is ridiculous. The problem he poorly tries to described has been solved by mathematicians a few millenniums ago. In a single word: CONTEXT.

A word is nothing if not bound by a context. Developers have already developed part of this context. Design patterns names are an example of those words defined within the context. Sanitizing input is just another.

[+] billpg|12 years ago|reply
Put yourself in the shoes of an inexperienced programmer building their first website. You've been advised to sanitize your inputs with the example of Bobby Tables.

You know the plain English meaning of "Sanitize". Clearly, you need to remove those single quote characters as they are unsanitary?

[+] simonw|12 years ago|reply
This article almost gets it right, then screws it up with the HTML example.

Both SQLi and XSS have the same cause: concatenating strings when you are working with active code of some sort.

They both have the same solution: you need to know the escaping rules for the active code you are assembling.

You shouldn't be solving XSS by stripping tags (that's a great way to build a discussion forum where no-one can talk about how to use HTML) - you should be escaping user input before assembling it into HTML.

To protect against dumb mistakes (because it's really easy to screw up just once and have a huge security hole) you should use abstractions that do this for you. If you're working with Django the ORM will do this for SQLi and auto-escaping in the template language will do this for XSS (watch out for variables you are outputting in a script tag context though).

Escaping, not sanitizing, should be the message.
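As a small illustration of escaping at the point of assembly, here is a sketch using `html.escape` from Python's standard library. The `render_comment` function and its markup are invented for the example, and this form of escaping only suits element content and quoted attribute values, not the script-tag context mentioned above.

```python
from html import escape


def render_comment(author: str, body: str) -> str:
    # Both values are converted from plain text to HTML character data
    # at the moment they are placed into markup, not when they are received.
    return f'<div class="comment"><b>{escape(author)}</b>: {escape(body)}</div>'


page = render_comment("mallory", 'How do I use <script>alert("hi")</script>?')
print(page)
```

The comment text is stored and displayed exactly as typed, so people can still talk about HTML; the browser just never sees it as markup.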

[+] huxley|12 years ago|reply
Hate to disagree with you but you can have plenty of flexibility with element and attribute white-lists without abandoning sanitizing. Sanitize as much of your inputs as you are comfortable with and escape the outputs.
[+] nubs|12 years ago|reply
I've always seen "sanitization" as more of an output-encoding problem.

People love to consider sanitizing the inputs, but how you do so doesn't depend on the inputs but on the specific usage of it - more-or-less the output of your program.

Rather than trying to think of all the ways the inputs to your program could be abused, I find that it is safer to start where the output occurs - database calls, system calls, etc. The most commonly used of these (database calls, shell commands, etc.) tend to have a variety of encoding capabilities to ensure that when you want to stick a string in a particular place it does exactly that, regardless of whether the string came from user input or elsewhere. For example, bind parameters for databases, or proper escaping functions.

If you think about it as sanitizing input it means you tend to misplace your attention to detail and only consider the entry to your application. A single input is often used to do multiple things through a program so you cannot properly handle sanitization at input.

The real push should be for proper output encoding, not input sanitization.
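To make the "encode at the output" idea concrete, here is a sketch that takes one input string and encodes it for four different sinks using only Python's standard library. The input value is invented; the point is that each sink needs a different encoding, which is why a single pass at the entry point cannot cover them all.

```python
import html
import json
import shlex
import urllib.parse

user_input = "x'; rm -rf ~ #<b>&"

# One input, several sinks: each sink has its own encoding rules.
as_html = html.escape(user_input)                 # HTML character data
as_shell = shlex.quote(user_input)                # a single POSIX shell word
as_url = urllib.parse.quote(user_input, safe="")  # one URL component
as_json = json.dumps(user_input)                  # a JSON string literal

for label, encoded in [("html", as_html), ("shell", as_shell),
                       ("url", as_url), ("json", as_json)]:
    print(label, encoded)
```

Each encoding is reversible by its own decoder, so nothing the user typed is lost or altered along the way.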

[+] peterwwillis|12 years ago|reply
The purpose of sanitizing input is not to prevent security vulnerabilities. It is to make sure the values taken by your program are valid. If you accept a number range, and the user inputs a word, it's invalid input for your parameter and your program will crash. Input sanitizing validates that the input is correct for your use. It indirectly improves security, but is not itself a practice of making an app more secure.
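The number-range case might look like the sketch below. The `parse_quantity` name and the 1..100 default range are assumptions made for the example; the function either returns a valid value or rejects the input outright, and never tries to repair it.

```python
def parse_quantity(raw: str, low: int = 1, high: int = 100) -> int:
    """Return raw as an int within [low, high], or raise ValueError."""
    try:
        value = int(raw.strip())
    except ValueError:
        raise ValueError(f"not a whole number: {raw!r}") from None
    if not low <= value <= high:
        raise ValueError(f"{value} is outside {low}..{high}")
    return value


for attempt in ["42", "banana", "1000"]:
    try:
        print(attempt, "->", parse_quantity(attempt))
    except ValueError as err:
        print(attempt, "-> rejected:", err)
```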
[+] mantrax4|12 years ago|reply
This is why I just call it encoding and decoding. Proper words, and assume context (encoding for what... decoding from what).
[+] SlashmanX|12 years ago|reply
> Perhaps this is why some Irish people prefer to spell their name using the letter Ó. After years of having their name mangled by naive software developers, they made a new letter.

Stopped reading here as I assumed the rest of the article was satirical

[+] theboss|12 years ago|reply
This is stupid and I don't see anyone quite hitting the nail on the head as to why.

People normally lump web vulnerabilities together, XSS and SQLi especially. To prevent XSS you have to sanitize. To prevent SQLi you use parameterized queries.

To prevent stored XSS you sanitize what you put in the database. So really... You still need to sanitize.

I've also seen people make arguments about inexperienced web programmers and how this advice can cause them to write bad code. I think the argument is bad because so many resources exist to help them. There is real code on Stack Overflow, W3Schools, OWASP, and other blogs that can be copied and pasted into their projects.

[+] zAy0LfpBZLC8mAC|12 years ago|reply
No, you don't sanitize what you put in your database, you validate what you put in your database, and convert into the output format when using data from the database. Sanitizing is always(!) wrong.
[+] IgorPartola|12 years ago|reply
A somewhat related term, I really like "mogrify": http://initd.org/psycopg/docs/cursor.html#cursor.mogrify
[+] rhizome31|12 years ago|reply
I've been wondering where this word comes from. The only other occurrence I know of is the ImageMagick command of the same name. It doesn't seem to be a real English word. What does it evoke for a native English speaker? (ESL here)
[+] huxley|12 years ago|reply
> "Perhaps this is why some Irish people prefer to spell their name using the letter Ó. After years of having their name mangled by naive software developers, they made a new letter."

I hope this is satire. The Irish didn't "make up" the letter Ó; it was the standard historical form, but it was converted into O' when the names were anglicized.

Frankly, his advice about sanitizers seems equally suspect. I've processed a lot of complex scientific abstracts using html5lib and Bleach without any of the mangling he describes. He must be using very naive sanitizers.

[+] zAy0LfpBZLC8mAC|12 years ago|reply
The overall point is very true indeed, though I think it's not made particularly clear what the actual problem with sanitizing input is.

The problem is that you are silently changing information, and that's an absolute no-go for reliable data processing, and the cause is that people think of, say, html, as "some kind of text/strings".

HTML is a serialization of a tree, similarly, SQL is a serialization of a syntax tree ... - and if you want to add plain-text user input to such a serialized tree, you have to _convert_ it from, say, "plain text" to "HTML character data". You have to think of them as two different data types, and so when you want to use a value presented as one of the types as the other type, you don't have to "sanitize" it, even calling it "escaping" is confusing - you have to _convert_ it. And if it happens that some input can not be represented in the target type, then you have to _validate_ the input and _reject_ broken input.

[+] chriswarbo|12 years ago|reply
Not sure why this is being downvoted. This "some kind of text/strings" approach is why we have "escaping" functions which turn strings into strings, instead of conversion functions from, say, "plain text" to "HTML character data".

This prevents our computers from helping us, even if we're using an ivory tower type system with a whizz-bang IDE, since everything's just "String" so the compiler says OK.

Here's an example of the alternative http://blog.moertel.com/posts/2006-10-18-a-type-based-soluti... (remember that most of the code there is building the libraries; using them is simple and terse).
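A much smaller sketch of the same idea in Python (not the code from the linked article): HTML gets its own type, and the only way to turn plain text into it is a conversion function. The names `Html`, `text_to_html` and `tag` are invented for this example.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Html:
    """Markup that is already safe to emit; a plain str is never treated as HTML."""
    markup: str


def text_to_html(text: str) -> Html:
    # The only conversion from plain text to HTML character data.
    return Html(escape(text))


def tag(name: str, *children: Html) -> Html:
    # name is assumed to be a literal chosen by the programmer, not user input.
    for child in children:
        if not isinstance(child, Html):
            raise TypeError("wrap plain text with text_to_html() first")
    return Html(f"<{name}>" + "".join(c.markup for c in children) + f"</{name}>")


paragraph = tag("p", text_to_html("1 < 2 & <b>not bold</b>"))
print(paragraph.markup)
```

Passing a bare string to `tag` fails at runtime here, and because the parameters are annotated, a static type checker can flag the same mistake before the code runs.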

[+] marcosdumay|12 years ago|reply
"Convert" is ambiguous. When you read the text "<head>" from a template file, you probably want to represent it as the HTML "<head>", but when you read it from the database, you may, or may not want to represent it as "&lt;head&gt;".

People started using the word "sanitize" exactly because it conveys the information that "you want to treat it differently, depending on where it comes from". We also use the words "dirty" (sometimes "tainted") and "clean", keeping their usual relations to "sanitize".

Now somebody wants to throw away a very concise and expressive jargon just because some people are giving bad advice on the Internet?

[+] TomGullen|12 years ago|reply
False, surely. As another poster commented, you want to sanitise inputs, for example user signatures, to remove JavaScript and other nasties. Sanitising inputs isn't just about protecting against SQL injection.

What the author actually means is that removing apostrophes to prevent SQL injection can affect your data integrity, so parameterise your queries.

Alternatively, replacing single apostrophes with doubled apostrophes in your queries also works, but parameterising queries is a much better practice to get into.

[+] IgorPartola|12 years ago|reply
No, TFA is right. If the user wants to post <script>alert("I am a hacker.")</script>, so be it. Display it literally. You do have to take care to escape it when you are rendering your HTML. But guess what? You have to anyways since <script> is not the only evil tag out there. XSS can be performed a number of ways and you are not going to catch them all by removing stuff from user input.
[+] jasonlotito|12 years ago|reply
Your comment highlights exactly why the article makes a compelling point. I mean, you make several suggestions before you get to one thing you need to do on input:

> but parameterising queries is a much better practice to get into.

[+] billpg|12 years ago|reply
Read on, there's a section titled "Isn’t sanitization still needed with HTML?".
[+] nraynaud|12 years ago|reply
Well in HTML you can use a sandboxed iframe (or <webview> in technologies that have it), but it's not cheap.
[+] tokenizerrr|12 years ago|reply
I just remembered visiting a website that used iframes with script tags disabled for its users' signatures. It was a pretty interesting approach.
[+] mantrax4|12 years ago|reply
I can't post a comment containing <script> in this comment form, because it's "disallowed", instead of just escaping it as plain text.

The thick, thick irony of a guy who can't even follow his own advice.

[+] zAy0LfpBZLC8mAC|12 years ago|reply
He is perfectly following his own advice. Apparently, his form field takes HTML syntax with a subset of HTML tags. Your input does not conform to that. So, instead of silently altering what you wrote, it tells you about the validation failure and asks you to correct your input. The input field takes HTML, so you have to write "&lt;script>" (I suppose, haven't tested it) in order to display "<script>" - that is perfectly consistent.
[+] billpg|12 years ago|reply
I know. I'm awful.

:)