top | item 41021436

Show HN: JSON-Threat-Protection Rust High-Performance Crate

34 points| ADD-SP | 1 year ago |github.com

19 comments

order

anonymoushn|1 year ago

For things that are claimed to be high-performance, it would be helpful to see some numbers without running it locally on our own json files.

ADD-SP|1 year ago

Makes sense, numbers added.

blirio|1 year ago

"Whether to allow duplicate object entry names." This is interesting. I just did a test and it look like `jq` evaluates `{ "a": 1, "a": 2 }` to just `{ "a": 2 }`. I have always thought that this was invalid JSON. This mean that the order of keys in JSON do have some semantic meaning.

ADD-SP|1 year ago

The JSON RFC (https://datatracker.ietf.org/doc/html/rfc8259#page-6) doesn't require the unique entry name, and also the fact is that many parser uses the last-win strategy like serde_json.

For human, this is invalid, but many web services accepts this kind of JSON consciously or unconsciously.

I'm guessing this may have become a feature of some services and it's hard for maintainers to break this behavior. ᵕ︵ᵕ

scottlamb|1 year ago

Interestingly, ECMA-404 says the following:

> The goal of this specification is only to define the syntax of valid JSON texts. Its intent is not to provide any semantics or interpretation of text conforming to that syntax.

So it is legal JSON although not useful with a lot of concrete implementations. Maybe a way to find an exciting security vulnerability involving two parsers differing in their interpretation...

thesuperbigfrog|1 year ago

"It is expected that the json-threat-protection crate will be faster than the serde_json crate because it never store the deserialized JSON Value in memory, which reduce the cost on memory allocation and deallocation."

"As you can see from the table, the json-threat-protection crate is faster than the serde_json crate for all datasets, but the number depends on the dataset. So you could get your own performance number by specifying the JSON_FILE to your dataset."

However:

"This project is not a parser, and never give you the deserialized JSON Value!"

Is this performance comparison to serde_json fair? If serde_json is a parser and has a different feature set than json-threat-protection, does it make sense to compare performance?

matthews2|1 year ago

> If serde_json is a parser and has a different feature set than json-threat-protection, does it make sense to compare performance?

If you were using serde_json just to validate a payload before passing it on to another service (like a WAF), then the comparison makes sense. If you had more complex validations or wanted to extract some of the data, then maybe not.

ADD-SP|1 year ago

This crate is not an alternative of the serde_json, it only do the validation.

Currently, there is no other crates do the sames validation works on JSON, so I have to parse the dataset by a common JSON parser (sede_json) and do the same validation on its deserialized value as the comparable results.

So it would be better to compare to other crates which do the same work, but I didn't found the similar crate so far. And this is also the reason I developed this crate.

michaelmior|1 year ago

I don't think it was intended to say that this crate is "better" than serde_json. I interpreted it to be a measurement of the overhead of adding it as an additional step on top of parsing.

peterkelly|1 year ago

kstenerud|1 year ago

I think you may have misunderstood the article.

The point of the article is to parse AND validate input AT THE BOUNDARY between the outside world and your program, rather than a bunch of ad-hoc validations at various points after the suspect data has entered the castle walls and has already been (at least partially) processed (thus making the program state harder to reason about). By enforcing your invariants at the border, you ensure that all data entering your system always conforms to your expectations, just like a strong type system ensures that invalid states are not representable. A schema is basically a type system for your raw data.

This concept is also a major element of Domain Driven Design https://en.wikipedia.org/wiki/Domain-driven_design

ADD-SP|1 year ago

Great to see this article, I totally agreed with the view that rejecting any invalid case by designing the right data structure.

Unfortunately, it is hard to achieve it in practice and people even don't realize this, JSON Object is a good example, Human are incline expecting the duplicated key is not allowed in JSON, but it happens.

For this goal, I think the Protobuf is good way to eliminate the possible invalid data for data transportation.