You can deploy PyTorch models in production, but the efficient way to do it is to wrap them in an inference framework like BentoML.
That lets you work around the issues you've mentioned without forcing you to export your model to TorchScript, which is a format that doesn't support some layers or more sophisticated inference logic anyway.
I really encourage you to look into BentoML. It gives you features like (there's a quick sketch after the list):
- decoupling the web server from the ML inference pipeline
- horizontal and vertical scaling: each model in your inference graph is backed by a runner, which can be deployed independently on its own pod in Kubernetes
- async calls
- micro-batching (for the record, FastAPI doesn't offer this)
- GPU support
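To make that concrete, here's a minimal sketch using the BentoML 1.x Python API. The model name `demo_classifier`, the service name, and the tiny `nn.Linear` placeholder are all made up for illustration; swap in your own trained module. The `batchable` flag in the save signature is what turns on adaptive micro-batching for the runner.

```python
# save_model.py -- put the trained PyTorch model into the local BentoML store
import bentoml
import torch.nn as nn

# Placeholder model for the example; use your real trained module here.
model = nn.Linear(4, 2)

bentoml.pytorch.save_model(
    "demo_classifier",  # hypothetical model tag
    model,
    # Mark the forward call as batchable so the runner can micro-batch
    # concurrent requests along dim 0.
    signatures={"__call__": {"batchable": True, "batch_dim": 0}},
)
```

```python
# service.py -- the API server that wraps the model runner
import bentoml
import numpy as np
import torch
from bentoml.io import NumpyNdarray

# The runner is the unit that gets scaled/deployed independently
# (e.g. its own pod when you deploy to Kubernetes).
runner = bentoml.pytorch.get("demo_classifier:latest").to_runner()

svc = bentoml.Service("demo_service", runners=[runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def predict(input_arr: np.ndarray) -> np.ndarray:
    # async_run lets the API worker keep serving other requests
    # while the runner batches and executes the inference call.
    result = await runner.async_run(torch.from_numpy(input_arr).float())
    return result.numpy()
```

You'd then serve it locally with something like `bentoml serve service.py:svc` and containerize it from there; the web server processes and the runner processes scale separately, which is exactly the decoupling mentioned above.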