Why Ray Serve?

Speed and simplicity are just two of the many reasons to consider building your machine learning serving APIs with Ray Serve.

Pythonic API

Configure your model serving declaratively in pure Python, without needing YAML or JSON configs.

Low latency, high throughput

Horizontally scale across hundreds of processes or machines, while keeping the overhead in single-digit milliseconds.

Multi-model composition

Easily compose multiple models, mix model serving with business logic, and independently scale components, without complex microservices.
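As a sketch of what composition can look like, the snippet below chains two deployments behind a single entry point using the same handle-based API as the sample further down this page. The deployment names and the stand-in model logic are purely illustrative.

```python
from ray import serve


@serve.deployment(num_replicas=2)  # this stage can be scaled independently
class Translator:
    async def __call__(self, text: str) -> str:
        return text.upper()  # stand-in for a real model

@serve.deployment
class Summarizer:
    async def __call__(self, text: str) -> str:
        return text[:50]  # stand-in for a real model

@serve.deployment
class Pipeline:
    async def __call__(self, request):
        text = await request.body()
        # Plain-Python composition: call one deployment, then the next,
        # mixing in any business logic along the way.
        translated = await Translator.get_handle().remote(text)
        return await Summarizer.get_handle().remote(translated)
```

Because each stage is its own deployment, the translator can run on two replicas while the summarizer runs on one, with no separate microservices to operate.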

Framework-agnostic

Use a single tool to serve all types of models — from PyTorch and TensorFlow to scikit-learn models — and business logic.

FastAPI Integration

Scale an existing FastAPI server easily or define an HTTP interface for your model using its simple, elegant API.
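One way the integration can look is wrapping an existing FastAPI app with `serve.ingress`, so routes defined with the usual FastAPI decorators are served and scaled by Ray Serve. The class name and route here are illustrative.

```python
from fastapi import FastAPI
from ray import serve

app = FastAPI()


# Wrap the FastAPI app in a deployment; its routes are now
# served by Ray Serve replicas instead of a single process.
@serve.deployment(num_replicas=2)
@serve.ingress(app)
class APIServer:
    @app.get("/hello")
    def hello(self, name: str = "world"):
        return {"message": f"Hello, {name}!"}
```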

Native GPU support

Using GPUs is as simple as adding one line of Python code. Maximize hardware utilization by sharing CPUs or GPUs between different models.
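Concretely, the one line is a resource annotation on the deployment decorator; the class below is a hypothetical placeholder for your own model.

```python
from ray import serve


# Request one GPU per replica. Fractional values (e.g. 0.5)
# let two replicas share a single GPU.
@serve.deployment(ray_actor_options={"num_gpus": 1})
class GPUModel:
    def __init__(self):
        ...  # load the model onto the GPU here
```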

Try it yourself

Install Ray Serve with pip install "ray[serve]" and give this example a try.

import ray
from ray import serve

from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()


# Define our deployment.
@serve.deployment(num_replicas=2)
class GPT2:
    def __init__(self):
        self.nlp_model = pipeline("text-generation", model="gpt2")

    async def predict(self, query: str):
        return self.nlp_model(query, max_length=50)

    async def __call__(self, request):
        return await self.predict(await request.body())


@app.on_event("startup")  # Code to be run when the server starts.
async def startup_event():
    ray.init(address="auto")  # Connect to the running Ray cluster.
    serve.start(http_host=None)  # Start the Ray Serve instance.

    # Deploy our GPT2 Deployment.
    GPT2.deploy()


@app.get("/generate")
async def generate(query: str):
    # Get a handle to our deployment so we can query it in Python.
    handle = GPT2.get_handle()
    return await handle.predict.remote(query)


@app.on_event("shutdown")  # Code to be run when the server shuts down.
async def shutdown_event():
    serve.shutdown()  # Shut down Ray Serve.

See Ray Serve in action

See how companies are using Ray Serve to run their production model serving systems in a fast, reliable, and scalable way.

Wildlife Studios

Wildlife Studios serves in-game offers 3X faster, while simultaneously reducing infrastructure costs by 95%, with Ray Serve.

Read the case study
Ikigai Labs

See how a small team of data scientists built a dynamic, scalable data pipeline service for their users using Ray Serve.

Read the story
WidasConcepts

Learn how the German tech services company built its next-generation identity management platform on top of Ray Serve, running on Kubernetes.

Watch the video

Scale more than just serving

Expand your Ray journey beyond model serving and scale other parts of your machine learning pipeline.

Ray Train

Scalable deep learning

Ray Tune

Scale hyperparameter search

Ray Datasets

Scale data loading and collections use cases