I think a better way to think of this may be in terms of canonicalization. Inside your application, you should decide on a single canonical way to represent data, one which fits the type of processing and expected use of the application. For example, you might decide that all strings should be UTF8, and should be interpreted (and stored) as whatever the user initially wrote. You might decide that any structured data should be parsed and then stored as protobufs in a BigTable. Or you might decide that an RDBMS is your native datastore and use whatever the native string encoding is for it, as well as parse & normalize data into tables upon input.
Then, whenever you take input, your job is to validate and encode it. If you get a Windows-1252 string, you should re-encode it to UTF-8 for further storage. If it contains byte sequences that are not valid UTF-8, you should either strip them, substitute a replacement character, or notify the user with a validation failure. Same with structured data that fails your normalization rules - you should usually notify the user.
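To make the input side concrete, here is a minimal Python sketch, assuming the canonical form is a `str` stored as UTF-8. The function names `to_canonical_text` and `to_canonical_text_lossy` are made up for illustration; only `bytes.decode` and its `errors` argument are real.

```python
def to_canonical_text(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode bytes from a client into the app's canonical str form, or reject."""
    try:
        # Decoding is strict by default: bad byte sequences raise.
        return raw.decode(declared_encoding)
    except UnicodeDecodeError as exc:
        # The "notify the user with a validation failure" option.
        raise ValueError(f"input is not valid {declared_encoding}") from exc


def to_canonical_text_lossy(raw: bytes) -> str:
    """The "replacement character" option: bad bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# A Windows-1252 client sends "café" as b'caf\xe9'.
legacy = "café".encode("cp1252")
canonical = to_canonical_text(legacy, "cp1252")   # decode with the right codec
stored = canonical.encode("utf-8")                # re-encode for storage
```

Which of the two policies fits depends on the application; the point is only that the decision is made once, at the boundary.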
And when you send output, you should escape based on the intended output device. If you're putting it in an HTML page, HTML-escape it. If it's a URL, url-encode it. If it's a database query, SQL escape it. If it's a CSV, quote it.
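A rough Python sketch of "escape based on the intended output device", using only the standard library. The helper names (`for_html`, `for_url_component`, `for_csv_row`) are invented here; the same stored value goes through a different encoder depending on where it is headed.

```python
import csv
import html
import io
from urllib.parse import quote


def for_html(text: str) -> str:
    """Encode text for an HTML element body or a quoted attribute value."""
    return html.escape(text, quote=True)


def for_url_component(text: str) -> str:
    """Percent-encode text for use as a single path segment or query value."""
    return quote(text, safe="")


def for_csv_row(fields: list[str]) -> str:
    """Let the csv module decide which fields need quoting."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(fields)
    return buffer.getvalue()


name = 'Tom & "Jerry" <tj@example.com>'
html_out = for_html(name)
url_out = for_url_component("a b&c/d")
csv_out = for_csv_row(["O'Brien, Pat", 'say "hi"'])
```

The stored value itself never changes; only the rendering at the boundary does.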
Thinking in these terms keeps the internal logic of your application simple (there are no format conversions except at system boundaries), and it also gives you a lot of flexibility to preserve the user's intent and add new output formats later.
> Thinking in these terms keeps the internal logic of your application simple (there are no format conversions except at system boundaries),
This is an excellent point but are there any devs who don't do this? This seems like such an obvious thing to do. I mean I guess if you're dealing with tech debt and you want to upgrade from ASCII/bytes to UTF-8 you will not have this invariant for a short while but why would you not maintain the invariant in new code?
The fundamental problem is attempting to conflate a bunch of semantically-distinct things, just because they might happen to (sometimes) be represented in memory by similar byte sequences.
Such 'byte coincidences' lead to lazy, non-sensical operations, like "append this user-provided name to that SQL statement"; implemented by munging together a bunch of bytes, without thought for how they'll be interpreted.
A much better solution is to ignore whether things might just-so-happen to be represented in a similar way in memory; and instead keep things distinct if they have different semantic meanings (like "name", "SQL statement", "HTML source", "shell command", "form input", etc.). That way, if we try to do non-sensical things like appending user input to HTML, we'll get an informative error message that there is no such operation.
This isn't hard, but it requires more careful thought about APIs. Unfortunately many languages (and now frameworks) have APIs littered with "String", ignoring any distinctions between values and hence allowing anything to be plugged into anything else (AKA injection vulnerabilities).
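As a small illustration of keeping semantically distinct things distinct, here is a Python sketch with a hypothetical `Html` wrapper type (not taken from any particular framework). Raw text can only become `Html` through an escaping constructor, and appending a plain `str` fails with a readable error instead of silently splicing bytes together.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Html:
    """A value that is known to be HTML source, not arbitrary text."""
    markup: str

    @classmethod
    def from_text(cls, text: str) -> "Html":
        # The only sanctioned route from untrusted text to Html.
        return cls(html.escape(text, quote=True))

    def __add__(self, other: object) -> "Html":
        if not isinstance(other, Html):
            raise TypeError(
                "cannot append raw text to Html; wrap it with Html.from_text()"
            )
        return Html(self.markup + other.markup)


user_input = "<script>alert(1)</script>"
page = Html("<p>") + Html.from_text(user_input) + Html("</p>")
```

In a statically typed language the same mistake would be caught at compile time; in Python it surfaces as a `TypeError` at the offending line.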
It's cool to see how these posts are becoming less and less important in the wake of today's frameworks/tools protecting devs by default.
From ORMs escaping SQL, to FE frameworks escaping html/js, to browsers starting to default to same-site=lax. It feels like we've slowly pulled ourselves out of OWASP hell. Pretty nice to see!
Obviously it's still important (see log4j) to know it all, especially when it's not so clear-cut, but still good progress.
I think we really failed in earlier eras to get it right due to the momentum of the frameworks.
I would liken it to some of the crap building materials that were allowed in the past as new, cheap alternatives but subsequently showed failures or hazards after short service lives. Contractors were tasked with implementing these materials to stay within budget and everyone suffered the effects later.
> The parallel for SQL injection might be if you’re building a data charting tool that allows users to enter arbitrary SQL queries. You might want to allow them to enter SELECT queries but not data-modification queries. In these cases you’re best off using a proper SQL parser [...] to ensure it’s a well-formed SELECT query – but doing this correctly is not trivial, so be sure to get security review.
If you are ever in this situation, you should actually use a dedicated read-only user that can only access the relevant data. If you need to hide columns, use views. Trying to parse SQL can easily go very wrong, especially when someone (ab-)uses the edge cases of your DB.
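SQLite has no user accounts, so the following Python sketch uses its read-only connection mode as a stand-in for the dedicated read-only database user. The idea is the same: the database refuses the write, so nobody has to parse the user's SQL. The table and the helper names are invented for the example.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(path: str) -> sqlite3.Connection:
    # mode=ro asks SQLite itself to reject any write on this connection.
    return sqlite3.connect(Path(path).as_uri() + "?mode=ro", uri=True)


def demo() -> tuple[list, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "charts.db")

        # Privileged connection: sets up the data the charting tool may read.
        admin = sqlite3.connect(path)
        admin.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        admin.execute("INSERT INTO sales VALUES ('north', 10)")
        admin.commit()
        admin.close()

        # Unprivileged connection: runs whatever SQL the user typed.
        reader = open_read_only(path)
        rows = reader.execute("SELECT region, amount FROM sales").fetchall()
        try:
            reader.execute("DROP TABLE sales")
            write_blocked = False
        except sqlite3.OperationalError:
            write_blocked = True
        reader.close()
        return rows, write_blocked


rows, write_blocked = demo()
```

On a server database the equivalent would be a role with `SELECT` granted only on the relevant tables or views.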
This solution doesn't match the problem. Even the SQL injection example shows him sanitizing the input, which is at odds with the title of the post. Log4J is a more recent example of it being too late/useless to escape the output.
This is an example of why the term "sanitize" just brings confusion and leads to incorrect software. If we say "escape" (for concatenation) or "parameterize" (for discrete arguments) instead, then there's no confusion: we know that it should be done at the point of use, because the procedure for doing so depends on that use.
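The escape/parameterize distinction, sketched in Python for a shell command rather than SQL. `shlex.quote` targets POSIX shells; the `ls` command and the function names are just for illustration.

```python
import shlex


def escaped_command(filename: str) -> str:
    """Escape: the value is concatenated into a string the shell will parse."""
    return "ls -l -- " + shlex.quote(filename)


def parameterized_command(filename: str) -> list[str]:
    """Parameterize: the value travels as its own argument, never parsed."""
    return ["ls", "-l", "--", filename]


hostile = "notes.txt; rm -rf ~"
as_string = escaped_command(hostile)        # for subprocess.run(..., shell=True)
as_argv = parameterized_command(hostile)    # for subprocess.run(argv)

# Parsing the escaped string the way a POSIX shell would recovers the
# same argument list, with the hostile value still in one piece.
round_trip = shlex.split(as_string)
```

Either way the work happens at the point of use, because only there is it known that the consumer is a shell.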
Calling it "sanitization" implies that the data is somehow dirty, so naturally it should be cleaned as soon as possible, and after that it's safe. But all that accomplishes in general is corrupting the data, often in an unrecoverable way, and then opening up security vulnerabilities because the specific use doesn't happen to exactly match the sanitization done in advance.
It's great to validate the data on input and make it conform to the correct domain of values, but conflating this with output formats and expecting this to take care of downstream security as well just leads to incorrect data along with security vulnerabilities.
PHP's long-ago-removed magic quotes feature was an example of this confusion in action. It not only mangled incoming strings containing single quotes in an effort to prevent SQL injection, but did so in a way that left some databases completely exposed, depending on their quoting syntax.
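A small Python reproduction of that failure mode, assuming SQLite as the database. `magic_quote` is a rough approximation of what magic quotes did (backslash before quotes and backslashes), not PHP's actual code. SQLite escapes a quote by doubling it and gives backslash no special meaning, so the "sanitized" value still breaks out of the string literal.

```python
import sqlite3


def magic_quote(value: str) -> str:
    """Approximation of magic quotes: put a backslash before \\, ' and "."""
    return value.replace("\\", "\\\\").replace("'", "\\'").replace('"', '\\"')


def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (name TEXT)")
    db.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])
    return db


def lookup_concatenated(db: sqlite3.Connection, raw_name: str) -> list[str]:
    # The input was "sanitized" in advance, then pasted into the statement.
    sql = "SELECT name FROM users WHERE name = '" + magic_quote(raw_name) + "'"
    return [row[0] for row in db.execute(sql)]


def lookup_parameterized(db: sqlite3.Connection, raw_name: str) -> list[str]:
    # The value is passed separately and never parsed as SQL.
    sql = "SELECT name FROM users WHERE name = ?"
    return [row[0] for row in db.execute(sql, (raw_name,))]


db = make_db()
attack = "' OR 1=1 --"
leaked = lookup_concatenated(db, attack)    # every row comes back
safe = lookup_parameterized(db, attack)     # nothing matches
mangled = magic_quote("O'Brien")            # the data is now O\'Brien
```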
SQL injection is avoided at the point of usage. Trying to sanitize your input against it is an extremely bad practice. The same is true about HTML injection (whether you call it XSS or something else).
Log4j is a case for not interpreting text that the developer never knew was code. It's kind of the extreme opposite of escaping your text on usage.
The article is specifically about sanitizing inputs to prevent XSS attacks. Sanitizing input isn't a great defense against that; you need a defense that better matches the attack.
Validating or sanitizing input is a reasonably good defense against certain other things. E.g. zeroes in values you'll later divide by, when it's too late to return an error; multi-gigabyte names; information that you want to avoid storing like credit card numbers. That sort of use case doesn't really have a whole lot to do with the article, though.
Sanitizing inputs is not what you realistically want. You should prohibit certain types of input. Whitelisting strings is what I would call it.
You should escape outputs, of course (not that anyone in 2022 thinks otherwise).
Escaping outputs alone won't work because user inputs will be stored in some database and you can't realistically predict how, when, or where they will be used, years in the future. A user name could be used as a filename once, opening up the possibility of a shell-based exploit. It could trigger a little-known spreadsheet formula vulnerability when exported for analysis. Novel, interesting XSS attacks are common and produced every day. It might not even be your code, but code that your client or partner organisation runs. You just never know.
One common defence is user names (and other freeform fields) should not be allowed to be arbitrary bytes.
That is defence in depth, an established practice.
Agree and disagree. Sanitization has its place, but from a user perspective it's better to just outright reject (through validation) inputs that aren't valid.
There are often unexpected ways that data gets into the system (IT manually adding data, internal support tool to help customers add data, etc.) You need to ensure that you're properly sanitizing your input at every single input faucet and your sanitization has to predict how, when, and where it will be used by sanitizing for dangerous characters in filenames, shell, spreadsheet formula vulns, and XSS attacks.
Instead (or in addition), just make the assumption that data in the database is dangerous, and ensure that you properly escape for your use case when using that data.
Using a username to create a new file? Escape for filenames based on which OS/language you're using.
Using birthdates in an Excel file? Escape for Excel formulas.
Using bio on an HTML page? HTML Escape.
Using username as part of a URL path? URL Escape.
And finally circle back to the fact that sanitization where you change user input without their knowledge (like the "O'brien" -> "Obrien" example in the article) makes for a frustrating user experience.
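For the spreadsheet case in that list, a hedged Python sketch of one commonly cited mitigation: prefix any cell that starts with a formula trigger character with an apostrophe at export time. The trigger set and the helper names are assumptions made for this example, and the rule is blunt (a legitimate value such as `-5` also gets the prefix), so treat it as a starting point only.

```python
import csv
import io

# Characters that commonly make a spreadsheet treat a cell as a formula.
FORMULA_TRIGGERS = ("=", "+", "-", "@")


def spreadsheet_cell(value: str) -> str:
    """Neutralise a leading formula character; leave everything else alone."""
    if value.startswith(FORMULA_TRIGGERS):
        return "'" + value
    return value


def export_row(fields: list[str]) -> str:
    """Apply the spreadsheet rule, then let csv handle quoting and commas."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow([spreadsheet_cell(f) for f in fields])
    return buffer.getvalue()


row = export_row(["O'Brien", "1990-01-01", "=HYPERLINK(\"http://evil.example\")"])
```

The stored data stays exactly as the user typed it; only the export applies the rule, and a different export (HTML, JSON) would apply a different one.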
That works well for things you can limit to alphanumeric, which is pretty much only usernames. For everything else there will be an exploit in some context without proper escaping. You can decrease the attack surface, but you have to weigh that against the false sense of security it might give developers.
If you're doing both, I'd ask you what you think you're accomplishing by sanitizing input, especially when you're already escaping output.
All you're doing is corrupting the data with a ritual that seems like it's securing something, and it tends to make you think that your data is now ready to be rendered anywhere without issue.
Two different pieces of advice for two different things. That one is about data validation, making sure it is coherent and fits your data quality rules. This one is about data encoding, making sure it fits a different system's rules.
Both are good advice. I have seen really funny bugs where Java accepted non-ASCII digits in an IP address but the C++ control plane very much did not. If the re-serialized version had been sent to the backend this wouldn't have been an issue.
But the domains are different. Data validation is ensuring that the information is something that your system accepts. Data encoding is used when you are serializing information. You should very likely validate on input, but not "sanitize" or encode. You do your encoding on output.
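A Python sketch of "validate on input, encode on output" for the IP address case. Python's `int()` also accepts non-ASCII decimal digits, so the validator checks for ASCII explicitly. `parse_ipv4` and `serialize_ipv4` are illustrative names, not a replacement for the `ipaddress` module.

```python
def parse_ipv4(text: str) -> tuple[int, int, int, int]:
    """Validate: accept only four ASCII-digit octets in range, else reject."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated octets")
    octets = []
    for part in parts:
        # isdigit() alone would let fullwidth digits such as '１' through.
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"octet {part!r} is not ASCII digits")
        value = int(part)
        if value > 255:
            raise ValueError(f"octet {part!r} is out of range")
        octets.append(value)
    return (octets[0], octets[1], octets[2], octets[3])


def serialize_ipv4(address: tuple[int, int, int, int]) -> str:
    """Encode: always emit the canonical form for the downstream system."""
    return ".".join(str(octet) for octet in address)


canonical = serialize_ipv4(parse_ipv4("127.000.0.1"))
```

Sending `canonical` rather than the original text to the backend is what would have avoided the mismatch between the two parsers.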
I think the domain models are different. "parse-don't-validate" is great when your users are internal and trusted (e.g. a library that does codegen - the operators of the parser are already in the codebase). When your users are potentially hostile, you should at some level have a separate validate and eject strategy.
I think what makes this hard for folks is tracking what the expected form of data is at each step of its lifecycle, especially considering people working with new and unfamiliar codebases or splitting focus on multiple projects.
There are some frameworks that try using types to solve the problem. Alternatively, the developers could throw in a comment that looks something like:
// client == submits raw data ==> web_server == inserts raw data (param. sql stmt) ==> db_server ==> returns query with raw data ==> our_function == returns html-escaped data ==> client
I think escaping output is making the same mistake as sanitizing input. What we should really be saying is "stop using string interpolation/concatenation to process generic user data".
By default, text should only ever be treated as a blob. Yes, there are circumstances where it needs to be treated otherwise but they should be seen as a giant flashing 'danger' sign indicating the need to go back to sanitizing etc.
Every time this topic comes up, the comments are full of people talking past each other because they're operating under different definitions of "sanitize", "input", and "escape".
And now in this case, we add "output" to the confusion.
Is the SQL query you send to your DB input or output?
You can't escape it ahead of time, for the same reason you can't reliably block or remove "dangerous" inputs ahead of time — you can't reliably know all the places and contexts they will be used in.
So you escape at the point of use, as late as possible, when you know exactly what escaping you need.
It's also easy to forget to escape. This is why it's best to have tools and practices that automate it, e.g. HTML templating engine that escapes everything by default, e-mail composing library that automatically converts text to whatever MIME magic is required, etc.
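To show what "escapes everything by default" can look like, here is a toy Python sketch built on `string.Formatter`. `AutoEscape` and `Markup` are names invented for this example; a real project would more likely reach for an established templating engine with autoescaping turned on.

```python
import html
import string


class Markup(str):
    """Text that is already HTML and must not be escaped a second time."""


class AutoEscape(string.Formatter):
    """A str.format-style renderer that HTML-escapes every substituted value."""

    def format_field(self, value, format_spec):
        if isinstance(value, Markup):
            return str(value)
        formatted = super().format_field(value, format_spec)
        return html.escape(formatted, quote=True)


render = AutoEscape().format

greeting = render("<p>Hello, {name}!</p>", name="<b>Bob</b>")
badge = render("{badge} {name}", badge=Markup("<b>admin</b>"), name="a&b")
```

Forgetting to escape is no longer possible by omission; the only way to get raw markup through is to say so explicitly with `Markup`.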
I'm really surprised by the discussion here. It's so obviously true, and I realized this when the correct PHP function to escape a string for SQL was named mysql_real_escape_string.
No, garbage in, garbage out. Sure, things like log or SQL injections should not only be solved by sanitizing. You solve it by separating data and code. A lot of times you really want to store data in a structured canonical way. Usernames for instance. It is bad if you can use Unicode trickery to create multiple usernames that look the same. Product descriptions, it is bad if your ML needs to handle HTML and so on.
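For the username case, a minimal Python sketch of a canonical comparison key, assuming NFKC normalization plus case folding is the chosen policy. `canonical_username` is a hypothetical helper. It collapses compatibility variants such as fullwidth letters and ligatures, but it does not catch look-alikes from other scripts, which need a separate confusables check.

```python
import unicodedata


def canonical_username(name: str) -> str:
    """Key used for uniqueness checks; the display name can stay as typed."""
    return unicodedata.normalize("NFKC", name).casefold()


fullwidth = canonical_username("\uff21dmin")   # FULLWIDTH LATIN CAPITAL LETTER A
ligature = canonical_username("\ufb01sh")      # LATIN SMALL LIGATURE FI
cyrillic = canonical_username("\u0430dmin")    # CYRILLIC SMALL LETTER A
```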
This is wrong. If I leave a comment `'; DROP TABLE users; --` You should display it back in the app as exactly that. If you put it into an HTML attribute you escape the `'` and if you stick it in SQL you use parametrized statements.
There is nothing "wrong" with that initial input. What is wrong is pasting it into an SQL string, HTML element, HTML attribute, URL parameter or anywhere else without properly encoding it.
This is the main reason you can't "sanitize" input. You need to know what the output format is to properly encode it. There are different requirements if you are pasting it into a sed replacement command vs HTML attribute vs HTML element body. You can strip everything except a-zA-Z and cross your fingers but even that isn't necessarily sufficient for all output formats.
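The round trip for that exact comment, sketched in Python with SQLite and the standard `html` module. The table layout and `store_and_render` are invented for the example. The stored value is byte-for-byte what the user typed; encoding only happens when it is placed into SQL (as a parameter) or into HTML (escaped).

```python
import html
import sqlite3


def store_and_render(comment: str) -> tuple[str, str]:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE comments (body TEXT)")

    # Into SQL: passed as a parameter, so it is never parsed as a statement.
    db.execute("INSERT INTO comments (body) VALUES (?)", (comment,))
    (stored,) = db.execute("SELECT body FROM comments").fetchone()

    # Into HTML: escaped for both attribute and element-body contexts.
    escaped = html.escape(stored, quote=True)
    rendered = '<p title="{0}">{0}</p>'.format(escaped)
    return stored, rendered


comment = "'; DROP TABLE users; --"
stored, rendered = store_and_render(comment)
```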
Every online form where user can interact and send data back to a server is always a nightmare in terms of security. I do utilize mod_secure, but with my next project, I have an idea of doing "base64" on everything in client's browser via javascript then sending it to server and checking on backend if content is a valid base64. Is that a good concept?
That could work if you are just going to store things as base64.
It accomplishes nothing if you are going to decode the base64 on the backend and then use the original value as-is. If anything it's worse than nothing, because now mod_secure will just see the base64 content and might fail to detect certain attacks.
Unfortunately that wouldn't help with a whole lot. The danger with input is that it could be used to e.g. escape a SQL query and delete your database. Which is why we now have parameterised queries and such to help alleviate those worries.
If you think about it the process you're describing already happens: the browser sends the user's input as (usually) UTF8 string data, then the server decodes it. Changing that process to base64 wouldn't change much.
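A few lines of Python make the point: base64 is a reversible transport encoding, so a hostile value that passes the "is this valid base64?" check decodes right back into the same hostile value. `decode_field` is a made-up name for the backend step described in the question.

```python
import base64


def decode_field(encoded: str) -> str:
    """What the backend would do: check it is valid base64, then decode."""
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(encoded, validate=True).decode("utf-8")


payload = "'; DROP TABLE users; --"
on_the_wire = base64.b64encode(payload.encode("utf-8")).decode("ascii")
received = decode_field(on_the_wire)
```

Whatever happens to `received` afterwards still needs parameterized queries and output escaping, exactly as before.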
[+] [-] nostrademons|4 years ago|reply
[+] [-] Sohcahtoa82|4 years ago|reply
No, you parameterize it.
[+] [-] platz|4 years ago|reply
[+] [-] omegalulw|4 years ago|reply
[+] [-] chriswarbo|4 years ago|reply
[+] [-] parhamn|4 years ago|reply
[+] [-] erosenbe0|4 years ago|reply
[+] [-] Sebb767|4 years ago|reply
[+] [-] dang|4 years ago|reply
Don’t try to sanitize input – escape output - https://news.ycombinator.com/item?id=22431022 - Feb 2020 (280 comments)
[+] [-] gkoberger|4 years ago|reply
[+] [-] ashearer|4 years ago|reply
[+] [-] marcosdumay|4 years ago|reply
[+] [-] amalcon|4 years ago|reply
[+] [-] brodouevencode|4 years ago|reply
[+] [-] hombre_fatal|4 years ago|reply
[+] [-] Ingaz|4 years ago|reply
> If you’re not using Markdown but want to let your users enter HTML directly, you only have the second option – you must filter using a whitelist.
So "don't filter but filter"
[+] [-] billpg|4 years ago|reply
[+] [-] hamilyon2|4 years ago|reply
[+] [-] InitialBP|4 years ago|reply
[+] [-] wongarsu|4 years ago|reply
[+] [-] HWR_14|4 years ago|reply
That said, it's obviously not worth building a "don't sanitize this" filter for that case.
[+] [-] iou|4 years ago|reply
[+] [-] hombre_fatal|4 years ago|reply
[+] [-] wnoise|4 years ago|reply
[+] [-] marcosdumay|4 years ago|reply
[+] [-] kevincox|4 years ago|reply
[+] [-] dnautics|4 years ago|reply
[+] [-] ncc-erik|4 years ago|reply
[+] [-] taneq|4 years ago|reply
[+] [-] Sohcahtoa82|4 years ago|reply
[+] [-] gumby|4 years ago|reply
And how can the consumer of an arbitrary string trust that every input will have been properly escaped?
[+] [-] pornel|4 years ago|reply
[+] [-] scotty79|4 years ago|reply
[+] [-] 1970-01-01|4 years ago|reply
[+] [-] pwdisswordfish9|4 years ago|reply
[+] [-] AtNightWeCode|4 years ago|reply
[+] [-] kevincox|4 years ago|reply
[+] [-] joering2|4 years ago|reply
[+] [-] justinsaccount|4 years ago|reply
[+] [-] afavour|4 years ago|reply
[+] [-] scotty79|4 years ago|reply
[+] [-] adrr|4 years ago|reply
[+] [-] joering2|4 years ago|reply
[+] [-] lesquivemeau|4 years ago|reply
[+] [-] swlkr|4 years ago|reply