top | item 30106299

headlessvictim2 | 4 years ago

Thanks for the reply!

The freemium service provides access to machine learning models on GPU instances, served with FastAPI.

Each request invokes a compute-intensive ML model, but perhaps there is something wrong with the FastAPI configuration as well?

tempest_ | 4 years ago

It could be.

I watch the FastAPI repos a lot, and tons of people don't understand how async Python works -- they put their sync model code inside an async context.
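A minimal sketch of that anti-pattern and its usual fix -- `slow_model` here is a hypothetical stand-in for a compute-heavy model call, and plain asyncio is used instead of FastAPI so the effect is easy to see:

```python
import asyncio
import time

# Hypothetical stand-in for a compute-heavy ML model call.
def slow_model(x):
    time.sleep(0.2)  # blocks the whole thread, and with it the event loop
    return x * 2

# Anti-pattern: blocking sync code inside "async def".
# While slow_model runs, the event loop can't make progress on anything else.
async def blocking_endpoint(x):
    return slow_model(x)

# Fix: hand the blocking call to a worker thread so the loop stays free.
async def nonblocking_endpoint(x):
    return await asyncio.to_thread(slow_model, x)

async def main():
    # Three concurrent "requests" against the blocking version run strictly serially.
    t0 = time.perf_counter()
    await asyncio.gather(*(blocking_endpoint(i) for i in range(3)))
    blocking_time = time.perf_counter() - t0

    # The same three against the non-blocking version overlap in worker threads.
    t0 = time.perf_counter()
    await asyncio.gather(*(nonblocking_endpoint(i) for i in range(3)))
    nonblocking_time = time.perf_counter() - t0
    return blocking_time, nonblocking_time

blocking_time, nonblocking_time = asyncio.run(main())
print(f"blocking: {blocking_time:.2f}s, non-blocking: {nonblocking_time:.2f}s")
```

The blocking version takes roughly 3x the single-call time; the non-blocking one takes roughly the single-call time, since the sleeps overlap in threads.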

headlessvictim2 | 4 years ago

Consider us one. :)

We tried removing "async" -- thinking it would force sequential processing -- but it unexpectedly caused requests to be processed in parallel, which led to CUDA memory errors.
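That parallelism is expected: FastAPI runs plain `def` endpoints in a threadpool, so several requests can hit the model at once. One hedged sketch of keeping `def` endpoints while serializing the model -- simulating the threadpool with `ThreadPoolExecutor` and guarding the (stand-in) GPU call with a hypothetical `gpu_lock`:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical guard: one lock serializes access to the GPU-backed model,
# so threadpool workers (as FastAPI uses for plain "def" endpoints) can't
# run the model concurrently and exhaust CUDA memory.
gpu_lock = threading.Lock()

# Book-keeping to demonstrate that only one call is ever in flight.
in_flight = 0
max_in_flight = 0
counter_lock = threading.Lock()

def predict(x):
    global in_flight, max_in_flight
    with gpu_lock:  # only one model invocation at a time
        with counter_lock:
            in_flight += 1
            max_in_flight = max(max_in_flight, in_flight)
        result = x * 2  # stand-in for the real model call
        with counter_lock:
            in_flight -= 1
    return result

# Simulate FastAPI's threadpool dispatch of a plain "def" endpoint.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(predict, range(20)))

print(max_in_flight)  # stays at 1 despite 8 worker threads
```

The same lock around the real model call inside a `def` endpoint should stop concurrent CUDA allocations, at the cost of queueing requests behind it.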

Before removing "async", this is the weird behavior we observed:

* Hacker blasts 50-100 requests.

* Our ML model processes the requests sequentially, each taking its normal time.

* But instead of returning individual responses immediately, the server holds onto all responses -- sending responses only when the last request finishes (or a bunch of requests finish).

* Normally, request 1 should return in N seconds, request 2 in 2N seconds, and so on -- but instead, all requests returned together in about 50N seconds (assuming a batch of 50).

1. Any suggestions on this?

2. Mind clarifying how sync vs async endpoints work? The FastAPI docs are unclear.

Any help would be much appreciated.

This has been extremely frustrating.