Running and Scaling a FastAPI ML Inference Server with Kubernetes

A guide on scaling your model’s inference capabilities.

Yash Prakash



Running and scaling Machine Learning models is a complex problem that often requires evaluating and experimenting with many different solutions.

In this tutorial, let’s look at a way to make the process easier, with fewer moving parts, using the following tools:

  • FastAPI to build our inference API, and
  • Ray Serve to make our API automagically scalable on a locally running Kubernetes cluster.

Let’s get started!

Building a Face Detection model

We’ll be using the open-source DeepFace library to perform face detection and extraction on a given image.

A simple function called detect_face will take an image URL as a parameter and:

  • download the image using requests,
  • read the image using PIL,
  • convert the image into a NumPy array, and finally,
  • perform face detection on the image.

The output will consist of the bounding-box coordinates of each face detected in the given image.
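The steps above can be sketched as follows. `DeepFace.extract_faces` is the library’s detection entry point in recent releases (older versions exposed a different function), and the `boxes_from_faces` helper plus the `enforce_detection=False` choice are my own assumptions for this sketch, not from the original:

```python
import io

import numpy as np
import requests
from PIL import Image


def boxes_from_faces(faces: list) -> list:
    """Pull (x, y, w, h) bounding boxes out of DeepFace's result dicts."""
    return [
        (
            f["facial_area"]["x"],
            f["facial_area"]["y"],
            f["facial_area"]["w"],
            f["facial_area"]["h"],
        )
        for f in faces
    ]


def detect_face(url: str) -> list:
    """Download the image at `url` and return bounding boxes of detected faces."""
    from deepface import DeepFace  # heavy dependency; imported lazily

    # 1. Download the image using requests
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()

    # 2. Read the image using PIL
    img = Image.open(io.BytesIO(resp.content)).convert("RGB")

    # 3. Convert the image into a numpy array
    arr = np.asarray(img)

    # 4. Perform face detection; enforce_detection=False avoids raising
    #    an exception when no face is found in the image
    faces = DeepFace.extract_faces(arr, enforce_detection=False)
    return boxes_from_faces(faces)
```

Calling `detect_face("https://example.com/photo.jpg")` would return a list of `(x, y, w, h)` tuples, one per detected face.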


