Running and Scaling a FastAPI ML Inference Server with Kubernetes

A guide on scaling your model’s inference capabilities.

Yash Prakash
6 min read · Dec 28, 2023

Running and scaling Machine Learning models is a complex problem, one that usually means evaluating and experimenting with many different solutions.

In this tutorial, let’s look at a way to make the process easier, with fewer moving parts, using the following tools:

  • FastAPI to build our inference API, and
  • Ray Serve to make our API automagically scalable on a locally running Kubernetes cluster (a quick sketch of how the two fit together follows this list).
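Before we jump in, here’s a rough sketch of how the two fit together: Ray Serve can wrap a FastAPI app in a deployment and replicate it across the cluster. The detect_face helper is the function we’ll build in the next section, and the module name and autoscaling bounds below are just illustrative placeholders:

```python
from fastapi import FastAPI
from ray import serve

# the inference helper we'll write in the next section;
# the module name "inference" is a placeholder
from inference import detect_face

app = FastAPI()

# Ray Serve wraps the FastAPI app in a deployment; the autoscaling
# bounds here are placeholder values, not recommendations
@serve.deployment(autoscaling_config={"min_replicas": 1, "max_replicas": 4})
@serve.ingress(app)
class FaceDetector:
    @app.get("/detect")
    def detect(self, image_url: str) -> dict:
        return {"faces": detect_face(image_url)}

# the entry point that `serve run` will deploy
face_detector_app = FaceDetector.bind()
```

With Ray Serve’s CLI, this can then be started locally with something like serve run main:face_detector_app.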

Let’s get started!

Building a Face Detection model

We’ll be using the open source Deepface library to perform face detection and extraction on a given image.

A simple function called detect_face will take an image URL as a parameter and:

  • download the image using requests,
  • read the image using PIL,
  • convert the image into a numpy array, and finally,
  • perform face detection on the image.

The output will be the bounding-box coordinates of each face detected in the given image.
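Here’s a minimal sketch of what detect_face could look like. It assumes DeepFace’s extract_faces entry point and its facial_area result field, so adjust the call to whatever your installed version of the library exposes:

```python
import io

import numpy as np
import requests
from PIL import Image
from deepface import DeepFace


def detect_face(image_url: str) -> list:
    # download the image using requests
    response = requests.get(image_url, timeout=10)
    response.raise_for_status()

    # read the image using PIL
    image = Image.open(io.BytesIO(response.content)).convert("RGB")

    # convert the image into a numpy array
    image_array = np.array(image)

    # perform face detection; each result carries a "facial_area"
    # dict holding the x, y, w, h coordinates of one detected face
    results = DeepFace.extract_faces(img_path=image_array, enforce_detection=False)
    return [result["facial_area"] for result in results]
```

Calling detect_face with the URL of any image containing faces should return a list of coordinate dicts, one per detected face.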
