Running and Scaling a FastAPI ML Inference Server with Kubernetes
A guide on scaling your model’s inference capabilities.
Dec 28, 2023
Running and scaling Machine Learning models is a complex problem that often requires evaluating and experimenting with many different solutions.
In this tutorial, let’s look at a way to make the process easier, with fewer moving parts, using the following tools:
- FastAPI to build our inference API, and
- Ray Serve to make our API automagically scalable on a locally running Kubernetes cluster.
Let’s get started!
Building a Face Detection model
We’ll be using the open source Deepface library to perform face detection and extraction on a given image.
A simple function called detect_face will take an image URL as a parameter and:
- download the image using requests,
- read the image using PIL,
- convert the image into a numpy array, and finally,
- perform face detection on the image.
We’ll get an output consisting of the coordinates of the faces detected in the given image.
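Here is a minimal sketch of what detect_face could look like, assuming the extract_faces function from a recent Deepface release (older versions expose a similar detectFace helper instead); the exact return format may vary between versions:

```python
import numpy as np
import requests
from io import BytesIO
from PIL import Image
from deepface import DeepFace


def detect_face(image_url: str) -> list:
    # Download the image over HTTP
    response = requests.get(image_url, timeout=10)
    response.raise_for_status()

    # Read the image with PIL and normalize it to RGB
    image = Image.open(BytesIO(response.content)).convert("RGB")

    # Convert to a numpy array; Deepface follows the OpenCV
    # convention, so we flip the channels from RGB to BGR
    image_array = np.array(image)[:, :, ::-1]

    # Run face detection; extract_faces returns one dict per face,
    # with the bounding box stored under "facial_area"
    faces = DeepFace.extract_faces(img_path=image_array, enforce_detection=False)
    return [face["facial_area"] for face in faces]
```

Calling detect_face with the URL of an image that contains faces should then return a list of bounding boxes as x/y/w/h dictionaries, one per detected face.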