dmitry_dygalo's comments

dmitry_dygalo | 2 months ago | on: How do you test multiple API payloads and edge cases?

Schemathesis author here. Feel free to open an issue if something does not work as you expect :)

As a side note, if you'd like to check what is covered by Schemathesis or any other testing tool or manual tests, you may want to try https://demo.tracecov.sh/ which will generate API coverage reports from many popular formats (there is also a CLI proxy for cases when there is no direct integration)

dmitry_dygalo | 3 months ago | on: Hypothesis: Property-Based Testing for Python

Hi! Author of Schemathesis here, really glad to hear it helped you uncover those edge cases. If any of those bugs were in public or open-source APIs, I’d love to feature them in our new “trophy case”: https://github.com/schemathesis/schemathesis/issues/new?temp...

dmitry_dygalo | 1 year ago | on: Fast linked lists

Thanks!

> Two suggestions: it’s not immediately obvious whether subsequent benchmark result tables/rows show deltas from the original approach or from the preceding one (it’s the latter, which is more impressive). Maybe call that out the first couple of times?

Agree!

> Second, the “using the single Vec but mutating it” option would presumably benefit from a reserve() or with_capacity() call. Since in that approach you push to the vector in both erroring and non-erroring branches, it doesn’t have to be exact (though you could do a bfs search to find maximum depth, that doesn’t strike me as a great idea) and could be up to some constant value since a single block memory allocation is cheap in this context. (Additionally, the schema you are validating against defines a minimum depth whereas the default vec has a capacity of zero, so you’re guaranteed that it’s a bad choice unless you’re validating an empty, invalid object.)

Oh, this is a cool observation! Indeed it feels like `with_capacity` would help here

> But I agree with the sibling comment from @duped that actually parsing to JSON is the biggest bottleneck and simply parsing to the minimum requirements for validation would be far cheaper, although it depends on if you’ll be parsing immediately after in case it isn’t invalid (make the common case fast, assuming the common case here is absence of errors rather than presence of them) or if you really do just want to validate the schema (which isn’t that rare of a requirement, in and of itself).

My initial assumption was that usually the input is already parsed. E.g. validating incoming data inside an API endpoint which is then passed somewhere else in the same representation. But I think that is a fair use case too and I was actually thinking of implementing it at some point via a generic `Json` trait which does not imply certain representation.

> (Edit: I do wonder if you can still use serde and serde_json but use the deserialize module’s `Visitor` trait/impl to “deserialize” to an enum { Success, ValidationError(..) }` so you don’t have to write your own parser, get to use the already crazy-optimized serde code, and still avoid actually fully parsing the JSON in order to merely validate it.)

Now when I read the details, it feels like a really really cool idea!

> If this were in the real world, I would use a custom slab allocator, possibly from storage on the stack rather than the heap, to back the Vec (and go with the last design with no linked lists whatsoever). But a compromise would be to give something like the mimalloc crate a try!

Nice! In the original `jsonschema` implementation the `validate` function returns `Result<(), ErrorIter>` which makes it more complex to apply that approach, but I think it still should be possible.

dmitry_dygalo | 1 year ago | on: Fast linked lists

> This article is disingenuous with its Vec benchmark. Each call to `validate` creates a new Vec, but that means you allocate + free the vec for each validation. Why not store the vec on the validator to reuse the allocation? Why not mention this in the article, i had to dig in the git history to find out whether the vec was getting reallocated.

The idea comes back to [0] which is similar to one of the steps in the article, and before adding `push` & `pop` I just cloned it to make things work. That's what Rust beginners do.

> This feels like you had a cool conclusion for your article, 'linked lists faster than vec', but you had to engineer the vec example to be worse. Maybe I'm being cynical.

Maybe from today's point in time, I'd think the same.

> It would be interesting to see the performance of a `Vec<&str>` where you reuse the vector, but also a `Vec<u8>` where you copy the path bytes directly into the vector and don't bother doing any pointer traversals. The example path sections are all very small - 'inner', 'another', 5 bytes, 7 bytes - less than the length of a pointer! storing a whole `&str` is 16 bytes per element and then you have to rebuild it again anyway in the invalid case.

Yeah, that makes sense to try!

> This whole article is kinda bad, it's titled 'blazingly fast linked lists' which gives it some authority but the approach is all wrong. Man, be responsible if you're choosing titles like this. Someone's going to read this and assume it's a reasonable approach, but the entire section with Vec is bonkers.

> Why are we designing 'blazingly fast' algorithms with rust primitives rather than thinking about where the data needs to go first? Why are we even considering vector clones or other crazy stuff? The thought process behind the naive approach and step 1 is insane to me:

> 1. i need to track some data that will grow and shrink like a stack, so my solution is to copy around an immutable Vec (???)

> 2. this is really slow for obvious reasons, how about we: pull in a whole new dependency ('imbl') that attempts to optimize for the general case using complex trees (???????????????)

That's clickbait-y, though none of the article's ideas aim to be a silver bullet. I mean, there are admittedly dumb ideas in the article, though I won't believe that somebody would come up with a reasonable solution without trying something stupid first. However, I might have used better wording to highlight that and mention that I've come up with some of these ideas when was working on `jsonschema` in the past.

> I understand you're trying to be complete, but 'some scenarios' is doing a lot of work here. An Arc<[T]> approach is literally just the same as the naive approach but with extra atomic refcounts! Why mention it in this context?

If you don't need to mutate the data and need to store it in some other struct, it might be useful, i.e. just to have cheap clones. But dang, that indeed is a whole different story.

> I have no idea why you mention 'code complexity' here (complexity introduced by rust and its lifetimes), but fail to mention how adding a dependency on 'imbl' is a negative.

Fair. Adding `imbl` wasn't a really good idea for this context at all.

Overall I think what you say is kind of fair, but I think that our perspectives on the goals of the article are quite different (which does not disregard the criticism).

Thank you for taking the time and answer!

- [0] - https://github.com/Stranger6667/jsonschema-rs/commit/1a1c6c3...

dmitry_dygalo | 1 year ago | on: Fast linked lists

That is really cool idea! Thank you

dmitry_dygalo | 1 year ago | on: Fast linked lists

This was my intention

dmitry_dygalo | 1 year ago | on: Fast linked lists

Easier to avoid allocation errors, e.g. in the Linux kernel. I think Alice Ryhl mentioned it here - https://www.youtube.com/watch?v=CEznkXjYFb4

dmitry_dygalo | 1 year ago | on: Fast linked lists

I'd say most developers don't write kernels/drivers or embeds, at least from what I've seen. I am not saying that there are not many devs like this, but rather that there are fewer kernel devs than web devs.

dmitry_dygalo | 1 year ago | on: Fast linked lists

Indeed, I agree with your points.

This idea was added after I wrote the post and wasn't taken from my own optimization efforts in `jsonschema`. Originally, in `jsonschema` the output type is actually a badly composed iterator and I intended to simplify it to just a `Result<(), ValidationError>` for the article, but with this output, there are actually way better optimizations than I originally implemented.

If I'd discovered this idea earlier, I'd probably spend more time investigating it.

dmitry_dygalo | 1 year ago | on: Ask HN: Any Good Fuzzer for gRPC?

I'd be happy to help, looking forward :)

dmitry_dygalo | 1 year ago | on: Ask HN: Any Good Fuzzer for gRPC?

I am not aware of any tools like that, but eventually, I plan to add support for gRPC fuzzing to Schemathesis. There were already some discussions and it is more or less clear how to move forward. See https://github.com/schemathesis/schemathesis/discussions/190...

dmitry_dygalo | 2 years ago | on: Show HN: Auto-generate load tests/synthetic test data from OpenAPI spec/HAR file

Hi!

> Test feedback - during our TestGen flow, the user provides feedback on the sequence and contents of the API requests.

So, it is not fully automated, the user needs to provide the feedback, or is it optional?

Originally by feedback, I meant if there is a feedback loop between the system and the test harness, so the test harness can learn from the system behavior and produce better data / spend less time on ineffective cases. This also is essential for things like test case reduction when a failure happens.

> There is no learning curve, assuming you have basic knowledge of JavaScript. Maintenance is typically minimal.

I'd be cautious about saying that there is no learning curve. Based on the docs at https://docs.multiple.dev/how-it-works/ai-test-gen I see that one who uses the feature should also be aware of your environment API, e.g. `ctx`, `axios`, etc. That does not match my expectations when read about no learning curve and basic JS knowledge. It is not far from there though.

> CLI - we are launching our CLI shortly, where users can start tests from command line as you describe. It'll work similarly to Jest or other unit test frameworks, where the test scripts will live in our user's codebase.

Cool! So, the user needs to commit the test code to their codebase, right?

> The use of AI - we use AI to generate realistic-looking synthetic data, which can be challenging with strings. The AI matches each field to the most relevant faker-js function. We need the content of the string to look like something the target application would receive in production. And with HAR files, we use AI to help filter out irrelevant requests such as analytics.

Yep, thanks for the clarification. I am thinking about how effective such realistic-looking synthetic data is in uncovering defects, i.e. if it covers happy-path with such data, then it left me wondering what about uncommon scenarios? Specifically, if it still can cover uncommon characters (from various Unicode categories)

Overall, I'd say that I like the idea and what I've read in the docs :) Good luck!

dmitry_dygalo | 2 years ago | on: Show HN: Auto-generate load tests/synthetic test data from OpenAPI spec/HAR file

Also, an important note would be that Schemathesis is a property-based testing tool, which does not necessarily imply load testing. I.e. the comparison might not be that helpful as tools have different goals.

However, Schemathesis can use targeted property-based testing to guide the input generation to values more likely to cause slow responses, i.e. it can maximize the response time and at the end, the user can discover that passing `limit=100000000` will read the whole DB table and cause a response timeout (which is a trivial example though)

dmitry_dygalo | 2 years ago | on: Show HN: Auto-generate load tests/synthetic test data from OpenAPI spec/HAR file

Schemathesis author here. I hope to clarify a few points here

> From my understanding, Schemathesis can generate data based on a value being a string, number, boolean, etc

Schemathesis can generate data that matches the spec or not based on the config option (specifically meaning JSON Schema based validation) including all the formats (e.g. date, etc) defined by the Open API spec. For GraphQL it supports all built-in scalar types + a handful of popular ones like DateTime or IP. With extra configuration can also generate syntactically invalid data (e.g. invalid JSON). Serialization is a different step - the payloads can be serialized to JSON or XML, YAML, etc. In my private extension, I also use a Python version of `faker` to mix more realistic data into the set.

> It also seems fairly manual to set up and has a learning curve. Our output is JavaScript that can be run anywhere.

The simplest one-off run is `st run <SCHEMA>`, and it is not clear to me what you mean by being fairly manual to set up. If the user already has a schema (or derived it from traffic / generated by a framework, etc), the only thing they need is to invoke the CLI. Surely there are many config options for different scenarios, and one may take more effort to configure than the other.

Everything has a learning curve - more interesting aspects would be whether this learning curve is justifiable and how often the user needs to dive deep into configuration. My aim with Schemathesis is that in 90% its defaults should be enough for most of the users, for the rest 10% there should be as few barriers as possible for the user to accomplish their goal (which often generates data that has a higher probability to uncover defects).

> From there, it automatically generates the correct type and format of data - e.g., if a field is named "address," it generates a value that looks like an address and is formatted in the same way as examples. It wouldn't be practical to cover every potential edge case and scenario without AI.

From the point of view of coverage of the edge cases, the description sounds like a happy-path scenario. What about the deviations?

Also, most fuzzers do a pretty good job in terms of covering edge cases without AI, especially greybox ones. What would be the concrete AI contribution here? Or what is the core difference in covering with AI or without it?

dmitry_dygalo | 2 years ago | on: Show HN: Auto-generate load tests/synthetic test data from OpenAPI spec/HAR file

This reminds me of several solutions albeit lacking the explicit "AI" part:

- Up9 observes traffic and then generates test cases (as Python code) & mocks

- Dredd is built with JavaScript, runs explicit examples from the Open API spec as tests + generates some parts with faker-js

- EvoMaster generates test cases as Java code based on the spec. However, it is a greybox fuzzer, so it uses code coverage and dynamic feedback to reach deeper into the source code

There are many more examples such as Microsoft's REST-ler, and so on.

Additionally, many tools exist that can analyze real traffic and use this data in testing (e.g. Levo.ai, API Clarity, optic). Some even use eBPF for this purpose.

Given all these tools, I am skeptical. Generating data for API requests does not seem to me to be that difficult. Many of them, already combine traffic analysis & test case generation into a single workflow.

For me, the key factors are the effectiveness of the tests in achieving their intended goals and the effort required for setup and ongoing maintenance.

Many of the mentioned tools can be used as a single CLI command (not true for REST-ler though), and it is not immediately clear how much easier it would be to use your solution than e.g. a command like `st run <schema url/file>`. Surely, there will be a difference in effectiveness if both tools are fine-tuned, but I am interested in the baseline - what do I get if I use the defaults?

My primary area of interest is fuzzing, however, at first glance, I'm also skeptical about the efficacy of test generation without feedback. This method has been used in fuzzing since the early 2000s, and the distinction between greybox and blackbox fuzzers is immense, as shown by many research papers in this domain. Specifically in the time a fuzzer needs to discover a problem.

Sure, your solution aims at load testing, however, I believe it can benefit a lot from common techniques used by fuzzers / property-based testing tools. What is your view on that?

What strategies do you employ to minimize early rejections? That is, ensuring that the generated test cases are not just dropped by the app's validation layer.

dmitry_dygalo | 3 years ago | on: Show HN: We built a tool that automatically generates API tests

Hi, Schemathesis author here

I'd add that Schemathesis is essentially a fuzzer, where from the first glance Step CI is not (correct me if I am wrong).

dmitry_dygalo | 3 years ago | on: Show HN: Open API and GraphQL Fuzzing via GitHub Actions

Hi Juan! Likewise :)

Thank you! That's a great point! Added to the backlog :) Most of the machinery is provider-independent on the Schemathesis side, so adding a new provider should be relatively straightforward. I'll take a look at it soon

dmitry_dygalo | 3 years ago | on: Show HN: Pg_jsonschema – A Postgres extension for JSON validation

The `jsonschema` crate author here.

First of all, this is an exciting use case, I didn't even anticipate it when started `jsonschema` (it was my excuse to play with Rust). I am extremely pleased to see such a Postgres extension :)

At the moment it supports Drafts 4, 6, and 7 + partially supports Draft 2019-09 and 2020-12. It would be really cool if we can collaborate on finishing support for these partially supported drafts! What do you think?

If you'll have any bug reports on the validation part, feel free to report them to our issue tracker - https://github.com/Stranger6667/jsonschema-rs/issues.

Re: performance - there are a couple of tricks I've been working on, so if anybody is interested in speeding this up, feel free to join here - https://github.com/Stranger6667/jsonschema-rs/pull/373

P.S. As for the "Prior Art" section, I think that https://github.com/jefbarn/pgx_json_schema should be mentioned there, as it is also based on `pgx` and `jsonschema`.

dmitry_dygalo | 4 years ago | on: GitHub Copilot available for JetBrains and Neovim

And from GraphQL APIs as well :)

page 1