Breaking up with Flask and FastAPI: Why they don’t scale for ML model serving

[+] kroolik|3 years ago|reply

I have a feeling the root cause is not in FastAPI or Flask, but in the architecture of the system itself.

Why? You are doing the inference in the same request, which is synchronous from the perspective of the caller. The request can be memory-intensive or cpu-intensive. And the issue is you can't efficiently consider all the workloads for a single machine without being bottlenecked by Python.

I would say that the problem is in your approach trying to use the webapp hammer for all the different flavors of nails in your system, using a language that isn't suited for concurrency. What I would do is decoupling the validation/interface logic from your models via a queue. This way you can scale your capacity according to workload and make sure the workload runs on hardware most relevant to the job.

I have a feeling trying to throw a webapp at the problem might not solve your root issue, only delay it in time.

[+] jononor|3 years ago|reply

Having designed similar systems, I agree completely. This is not a language or framework problem, but system architecture. I have used RabbitMQ to great success as the message queuing system, and there are many other good alternatives.

[+] sauyon|3 years ago|reply

If I had my way both Python runners and the runner webapp would not be long for this world.

I will say, though: the runner webapp MVP exists to do basically what you're describing (and keeps an internal queue). Yatai is architected so that the runner instances run on a separate Kubernetes cluster to the pre- and post-processing code that that's run in the main webapp itself, and can be scaled separately.

[+] chaoyu_|3 years ago|reply

Regardless of the architecture, you would still need the model server for online serving use cases. And you’re right that it’s best to decouple the validation/preprocessing logic from the model runtime from architecture perspective, and that’s exactly what BentoML does but a regular web framework can not do.

[+] saltedonion|3 years ago|reply

Apologies for the noob question but how would FastAPI/Flask know that the job has been successfully completed? Would the worker have to persist ml inference results somewhere and the FastAPI server poll it periodically?

[+] detroitcoder|3 years ago|reply

The jump of going from model -> webserver by placing the webserver in the same process as the model is enticing because you can get it to work in under an hour by adding flask/django/fastapi to the env and decorating a function. The problem is that that your model and webserver do NOT scale in the same way, and if you don't realize this fast, you are going to be trying to fit a square peg through a round hole once you have adoption trying to make it work.

All models at scale eventually need to be executed by an async queue processor which is fundamentally different from a request response REST API. For simplicity managing this outside of the process making the web request will help you debug issues when people start asking why they are getting 502 responses. If you are forced to use python for this, I would always suggest of going to celery/huey/dramatiq as an immediate next step after the REST API MVP. I hear Celery is getting better but I have ran into issues over the year so it pains me to recommend it.

[+] kroolik|3 years ago|reply

Exactly this.

Prime example of the difference is whether you accept disruption of inference when you deploy a new version of your webapp.

Very high chances you don't, thus you will start implementing a queueing mechanism without realizing it.

[+] beckingz|3 years ago|reply

"FastAPI is not perfect for ML serving"

Yup. There's a huge amount of work that you need to do to do the whole ML lifecycle, and FastAPI doesn't support that out of the box like a full fledged ML Platform.

But you probably don't actually want a full ML Platform because they're all opinionated and if you try and fight them it's often worse than just serving it as an API via FastAPI...

[+] chaoyu_|3 years ago|reply

Exactly! A vertical solution focused on model serving and deployment would make more sense if it can easily integrate with the ML training platform for CICD

[+] alpineidyll3|3 years ago|reply

Yeah... I mean the author is blatantly biased, but his low level discussion is cogent.

[+] ttymck|3 years ago|reply

Forgive me, I don't mean this flippantly, but it sounds like you implemented queuing and multiprocessing consumers on a Starlette webserver. "micro batching" is a feature enabled by the queueing. The GPU/CPU abstraction is nice, but I feel it's buried by the "FastAPI isn't good enough" digression. If it were framed as "here's what we added to the Starlette ecosystem", I would have approached it much more agreeably.

It would've been delightful to see "instantiate a runner in your existing Starlette application". I don't want to instantiate a Bento service. Perhaps I can mount the bento service on the Starlette application?

Apologies if I am still grossly misunderstanding. I tried to look through some of the _internal codebase to see how the Runner is implemented, the constructor signatures are very complex and the indirection to RunnerMethod had me cross-eyed.

[+] sauyon|3 years ago|reply

You can absolutely mount a BentoML service into your own Starlette (or any ASGI framework); `svc.asgi_app` is all you'd need.

Instantiating and using a runner can be done anywhere with `init_local`, but it's really the runner ASGI app that does the work of queuing and batching. We've thought about allowing users to spin that app up separately but it's not a focus right now; instead we're trying to ensure that the system is as easy to use for data scientists as possible and have that workflow fully ironed out before we support the more advanced use-cases.

The whole runner situation is quite complex because we wanted to support user-created runners in the nicest way possible, and also leave the space open for non-python runners (and service app) in the future.

[+] anderskaseorg|3 years ago|reply

> While FastAPI does support async calls at the web request level, there is no way to call model predictions in an async manner.

This confuses me. How is that FastAPI’s fault? Can’t you just asynchronously delegate them to a concurrent.futures.ThreadPoolExecutor or concurrent.futures.ProcessPoolExecutor? What does Starlette provide here that FastAPI doesn’t? If the FastAPI limitations are due to ASGI, shouldn’t Starlette have the same limitations?

[+] timliu99|3 years ago|reply

> While there are several different methods to use your own executor pools or potentially use shared memory for a large model, all of these solutions are not first-class solutions for ML use cases...

Definitely not FastAPI's fault and yes Starlette has the same limitations. BentoML builds additional ML features/abstractions on top of Starlette. We introduced a "runner" concept which automatically creates separate processes for models to run in.

[+] sgt101|3 years ago|reply

Advert disguised as experience report.

[+] timliu99|3 years ago|reply

Disclaimer: I'm the author.

We've been using Flask for years as the foundation for our 0.13 version. Our choice to move away from Flask and FastAPI as a core part of our library is based on our experience with hundreds of users and use cases

[+] isoprophlex|3 years ago|reply

Nice advertorial but what about a a queue and some machines running torchserve?

[+] lmeyerov|3 years ago|reply

We have been using async python for GPU pydata , including fronting dask/dask_cuda for sharing and bigger-than-memory scenarios, so a lot rings true.

For model serving, we were thinking Triton (native vs python server) as it is a tightly scoped problem and optimized: any perf comparison there?

[+] andrewstuart|3 years ago|reply

FastAPI is collapsing under the weight of its github issues (1,100) and pull requests (483).

[+] pratikss|3 years ago|reply

Simply could also mean it's thriving. FastAPI has nearly 50k stars. It is very popular and very effective in what it promises. Active repos with similar popularity have way, WAY more issues and pull requests, projects like Apache/echarts, Godot, Redis or Grafana.

Projects with similar stars, momentJS or Jquery don't see much active development have very few issues and pull requests.

[+] ttymck|3 years ago|reply

Could you clarify what you mean by "collapsing under the weight"?

Will a project be abandoned because users are pointing out ways to improve it? Does work stop at some point because "we can't get to every issue"? Sure, the maintainer could get burned out, but that is not a given.

[+] agumonkey|3 years ago|reply

IIRC FastAPI is built on starlette, can some of these issues belong to them ? just hoping the numbers don't reflect a whole reality.

[+] unknown|3 years ago|reply

[deleted]

[+] timliu99|3 years ago|reply

Wait... so you're tell me FastAPI is slow...

[+] ssheng|3 years ago|reply

FastAPI is great building block but can't expect it to work for model serving out of box.

[+] jeremycarter|3 years ago|reply

Actually it's very slow. Wait until you have 50 concurrent customers hitting it.

[+] timcavel|3 years ago|reply

[deleted]

42 comments