top | item 47136906

(no title)

entuno | 5 days ago

This kind of thing always makes me nervous, because you end with a mix of methods where you can (supposedly) pass arbitrary user input to them and they'll safely handle it, and methods where you can't do that without introducing vulnerabilities - but it's not at all clear which is which from the names. Ideally you design that in from the state, so any dangerous functions are very clearly dangerous from the name. But you can't easily do that down the line.

I'm also rather sceptical of things that "sanitise" HTML, both because there's a long history of them having holes, and because it's not immediately clear what that means, and what exactly is considered "safe".

discuss

jncraton|5 days ago

You are right that the concept of "safe" is nebulous, but the goal here is specifically to be XSS-safe [1]. Elements or properties that could allow scripts to execute are removed. This functionality lives in the user agent and prevents adding unsafe elements to the DOM itself, so it should be easier to get correct than a string-to-string sanitizer. The logic of "is the element currently being added to the DOM a <script>" is fundamentally easier to get right than "does this HTML string include a script tag".

[1] https://developer.mozilla.org/en-US/docs/Web/API/Element/set...

entuno|5 days ago

It's certainly an improvement over people trying to homebrew their own sanitisers. But that distinction of being XSS-safe is a potentially subtle one, and could end up being dangerous if people don't carefully consider whether XSS-safe is good enough when they're handling arbitrary users input like that.

intrasight|5 days ago

Also has made me nervous for years that there's been no schema against which one can validate HTML. "You want to validate? Paste your URL into the online validation tool."

cxr|5 days ago

> it's not at all clear which is which from the names. Ideally you design that in from the [start]

It was, and there is: setting elementNode.textContent is safe for untrusted inputs, and setting elementNode.innerHTML is unsafe for untrusted inputs. The former will escape everything, and the latter won't escape anything.

You are right that these "sanitizers" are fundamentally confused:

> "HTML sanitization" is never going to be solved because it's not solvable.¶ There's no getting around knowing whether or any arbitrary string is legitimate markup from a trusted source or some untrusted input that needs to be treated like text. This is a hard requirement.

<https://news.ycombinator.com/item?id=46222923>

The Web platform folks who are responsible for getting fundamental APIs standardized and implemented natively are in a position to know better, and they should know better. This API should not have made it past proposal stage and should not have been added to browsers.

Dylan16807|5 days ago

> There's no getting around knowing whether or any arbitrary string is legitimate markup from a trusted source or some untrusted input that needs to be treated like text. This is a hard requirement.

It is not a hard requirement that untrusted input is "treated like text". And this API lets you customize exactly what tags/attributes are allowed in the untrusted input. That's way better than telling everyone to write their own; it's not trivial.

voxic11|5 days ago

The idea is you wouldn't mix innerHTML and setHTML, you would eliminate all usage of innerHTML and use the new setHTMLUnsafe if you needed the old functionality.

extraduder_ire|5 days ago

I looked up setHTMLUnsafe on MDN, and it looks like its been in every notable browser since last year.

Good idea to ship that one first, when it's easier to implement and is going to be the unsafe fallback going forward.

croes|5 days ago

If I need the old functionality why not stick to innerHTML?

reddalo|5 days ago

You can't rename an existing method. It would break compatibility with existing websites.

post-it|5 days ago

> you would eliminate all usage of innerHTML

The mythical refactor where all deprecated code is replaced with modern code. I'm not sure it has ever happened.

I don't have an alternative of course, adding new methods while keeping the old ones is the only way to edit an append-only standard like the web.

Cthulhu_|5 days ago

Ideally you should be able to set a global property somewhere (as a web developer) that disallows outdated APIs like `innerHTML`, but with the Big Caveat that your website will not work on browsers older than X. But maybe there's web standards for that already, backup content if a browser is considered outdated.

cxr|5 days ago

It's not an "outdated API". It's still good for what it was always meant for: parsing trusted, application-generated markup and atomically inserting it into the content tree as a replacement for a given element's existing children.

> set a global property somewhere (as a web developer) that disallows[…] `innerHTML`

    Object.defineProperty(Element.prototype, "innerHTML", {
      set: (() => { throw Error("No!") })
    });

(Not that you should actually do this—anyone who has to resort to it in their codebase has deeper problems.)

staticassertion|5 days ago

Doesn't using TrustedTypes basically do that? I'm not really web-y, someone please correct me if I'm off.

afavour|5 days ago

I like the idea of that. But I imagine linting rules are a much more immediate answer in a lot of projects.

jaffathecake|5 days ago

fwiw, if you serve your page with:

Content-Security-Policy: require-trusted-types-for 'script'

…then it blocks you from passing regular strings to the methods that don't sanitize.

thaumasiotes|5 days ago

> I'm also rather sceptical of things that "sanitise" HTML, both because there's a long history of them having holes, and because it's not immediately clear what that means, and what exactly is considered "safe".

What is safe depends on where the sanitized HTML is going, on what you're doing with it.

It isn't possible to "sanitize HTML" after collecting it so that, when you use it in the future, it will be safe. "Safe" is defined by the use.

But it is possible to sanitize it before using it, when you know what the use will be.

DoctorOW|5 days ago

They do link the default configuration for "safe": https://wicg.github.io/sanitizer-api/#built-in-safe-default-...

But I agree, my default approach has usually been to only use innerText if it has untrusted content:

So if their demo is this:

    container.SetHTML(`<h1>Hello, {name}</h1>`);

Mine would be:

    let greetingHeader = container.CreateElement("h1");
    greetingHeader.innerText = `Hello, {name}`;

itishappy|5 days ago

What if I wanted an <h2>?

Edit: I don't mean this flippantly. If I want to render, say, my blog entry on your site, will I need to select every markup element from a dropdown list of custom elements that only accept text a la Wordpress?

duxup|5 days ago

Yeah someone tells me something has been made “safe” is nice but unless I know exactly what that means … it’s easy to say safe by someone who doesn’t have to deal with it when the bad corner case happens.

Oh and it’s safe… in this browser… not that one, so this idea of safety is kinda dead to me for now.

HWR_14|5 days ago

That's why I only allow user input of alphanumeric ascii characters. No need to worry about sanitation then, and you can just remove all the characters that don't match.

(It's a joke, but it is also 100% XSS, SQL injection, etc. safe and future proof)

noduerme|5 days ago

Some sanitization is better than none? If you're relying on the browser to handle it for you, you're already in a lot of trouble.

post-it|5 days ago

realSetSafeHTML()

unknown|5 days ago

[deleted]

snowhale|5 days ago

[deleted]

pornel|5 days ago

BTW, HTML allows inline SVG with an XML-flavored syntax that interprets <script/> and <title> differently. It's a goldmine for sanitizer escapes. There are completely bonkers syntax switching and error recovery rules that interact with parsing modes (there's even an edge case where a particular attribute value switches between HTML and XML-ish parsing rules).

Don't even try to allow inline <svg> from untrusted sources! (and then you still must sanitise any svg files you host)

cxr|5 days ago

It may be using some of the same deserialization machinery, but "parsing" is a broad term that includes things that the sanitizer is doing and that the browser's ordinary content-processing → rendering path does not.

Even with this being a native API, there are still two parsers that need to be maintained. What a native API achieves is to shift the onus for maintaining synchronicity between the two onto the browser makers. That's not nothing, but it's also not the sort of free lunch that some people naively believe it is.

onion2k|5 days ago

it's not at all clear which is which from the names

There's setHTML and setHTMLUnsafe. That seems about as clear as you can get.

entuno|5 days ago

If that'd been the design from the start, then sure. But it's not at all obvious that setHTML is safe with arbitrary user input (for a given value of "safe") and innerHTML is dangerous.

hahn-kev|5 days ago

But you can use InnerHTML to set HTML and that's not safe.