headlessvictim2 | 4 years ago

Consider us one. :)

We tried removing "async" -- thinking it would force sequential processing -- but it unexpectedly seemed to cause parallel processing of requests, which caused CUDA memory errors.

Before removing "async", this is the weird behavior we observed:

* Hacker blasts 50-100 requests.

* Our ML model processes the requests sequentially, each in normal time.

* But instead of returning each response immediately, the server holds onto all of them -- sending responses only once the last request finishes (or a batch of them finishes).

* Normally, request 1 should return in N seconds, request 2 in 2N seconds, and so on -- but here, all requests returned together after about 50N seconds (assuming a batch of 50).

1. Any suggestions on this?

2. Mind clarifying how sync vs. async works? The FastAPI docs are unclear.

Any help would be much appreciated.

This has been extremely frustrating.

Nextgrid | 4 years ago

Any chance the entire thing can be offloaded to a task queue (Celery/etc)? This would decouple the HTTP request processing from the actual ML task.

The memory errors you're seeing suggest that you may not actually be able to run multiple instances of the model, and even if you could, it may not give you more performance than processing sequentially.

Seems like ultimately your current design can't gracefully handle too many concurrent requests, legitimate or malicious - this is a problem I recommend you address regardless of whether you manage to ban the malicious users.
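
For concreteness, a minimal sketch of that decoupling, assuming Celery with a Redis broker (fake_predict stands in for the actual model call; names are illustrative):

    # tasks.py -- sketch of a Celery worker that owns the model.
    from celery import Celery

    celery_app = Celery(
        "ml_tasks",
        broker="redis://localhost:6379/0",
        backend="redis://localhost:6379/0",
    )

    def fake_predict(payload):
        # Stand-in for the real GPU inference call.
        return {"echo": payload}

    @celery_app.task
    def run_inference(payload):
        # Start the worker with `celery -A tasks worker --concurrency=1`
        # so tasks execute strictly one at a time and the GPU never sees
        # overlapping requests; the broker absorbs bursts instead.
        return fake_predict(payload)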

pjgalbraith | 4 years ago

Yeah this is the way.

@headlessvictim2 search for "Asynchronous Request-Reply pattern" if you want more information about this kind of architecture. You will remove any bottleneck from the API server and can easily scale out from the task queue.
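
Roughly: the API accepts the job, hands back an id, and lets the client poll for the result. A sketch building on the Celery worker sketched in the sibling comment (endpoint paths are illustrative):

    # api.py -- sketch of the Asynchronous Request-Reply pattern.
    from celery.result import AsyncResult
    from fastapi import FastAPI

    from tasks import celery_app, run_inference  # worker sketch above

    api = FastAPI()

    @api.post("/predict", status_code=202)
    def submit(payload: dict):
        # Enqueue and return immediately with an id the client can poll.
        job = run_inference.delay(payload)
        return {"job_id": job.id, "status_url": f"/predict/{job.id}"}

    @api.get("/predict/{job_id}")
    def poll(job_id: str):
        res = AsyncResult(job_id, app=celery_app)
        if res.ready():
            return {"status": "done", "result": res.get()}
        return {"status": "pending"}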

tempest_ | 4 years ago

Python async is co-operative multitasking (as opposed to pre-emptive).

There is an event loop that goes through all the tasks and runs them.

The issue is that the event loop can only move on to the next task when you reach an await. So if you run a lot of code (say, an ML model) between awaits, no other task can advance during that time.

This is why it is co-operative: it is up to each task to release the event loop, by hitting an await, so other tasks can get work done.

This is fine when you have async libs that hit awaits often, at IO-related points like DB or HTTP calls.
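
You can see this with plain asyncio (self-contained; the durations are arbitrary):

    import asyncio
    import time

    async def blocking_task():
        # CPU-bound work with no await: the event loop is stuck here,
        # so no other task can advance for the full 2 seconds.
        time.sleep(2)

    async def ticker():
        # Would print every 0.25s, but can't even start until
        # blocking_task releases the loop.
        for i in range(8):
            print("tick", i)
            await asyncio.sleep(0.25)

    async def main():
        await asyncio.gather(blocking_task(), ticker())

    asyncio.run(main())

Swap time.sleep for await asyncio.sleep and the ticks flow immediately -- that await is the task releasing the loop.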

FastAPI will run endpoints that are not defined as async functions on a thread pool, but it is still Python, so the GIL still applies.
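
Concretely, the difference looks like this in FastAPI (a sketch; slow_model stands in for the real inference call):

    import time
    from fastapi import FastAPI

    app = FastAPI()

    def slow_model(payload):
        time.sleep(5)  # stand-in for CPU/GPU-bound inference
        return {"echo": payload}

    @app.post("/sync")
    def predict_sync(payload: dict):
        # Plain def: FastAPI runs this in a thread pool, so other
        # requests still get served -- though the GIL (and one GPU)
        # limits real parallelism.
        return slow_model(payload)

    @app.post("/async")
    async def predict_async(payload: dict):
        # async def with no await around the slow call: this blocks
        # the event loop itself, stalling every other request until
        # it returns.
        return slow_model(payload)

The async variant is likely what produced the "all responses arrive at once" symptom upthread: the loop can't flush responses while it is blocked.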

You should do as the sibling comment says and decouple your HTTP layer from your ML, feeding the ML with something like Celery. This way your server is always there to respond (even if just with a 429), hit a cache, or whatever else.
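
One cheap way to get that always-responsive behavior, assuming the Redis-backed Celery setup sketched upthread (the backlog cap is illustrative):

    import redis
    from fastapi import FastAPI, HTTPException

    from tasks import run_inference  # Celery worker sketch upthread

    r = redis.Redis()   # same local Redis as the broker is assumed
    MAX_BACKLOG = 100   # illustrative cap

    app = FastAPI()

    @app.post("/predict", status_code=202)
    def submit(payload: dict):
        # With a Redis broker, Celery's default queue is a list named
        # "celery", so its length is the current backlog. Shed load
        # with a 429 instead of letting the GPU drown.
        if r.llen("celery") > MAX_BACKLOG:
            raise HTTPException(status_code=429, detail="Busy, retry later")
        job = run_inference.delay(payload)
        return {"job_id": job.id}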

headlessvictim2 | 4 years ago

Thanks for the FastAPI explanation. This makes sense.

Right now, we use ELB (Elastic Load Balancer) to sit in front of multiple GPU instances.

Is this sufficient or do you suggest adding Celery to this architecture?

azinman2 | 4 years ago

What are the bots' goals? Curious.

headlessvictim2 | 4 years ago

To use premium settings without paying.

It appears less like a malicious DDoS and more like pragmatic theft.