Ray Serve
Fast and simple API for scalable model serving
Ray Serve lets you serve machine learning models for real-time or batch inference using a simple Python API. Serve individual models or create composite model pipelines, where you can independently deploy, update, and scale each component.

Why Ray Serve?
Speed and simplicity are just two of the many reasons to consider building your machine learning serving APIs with Ray Serve.
Pythonic API
Configure your model serving declaratively in pure Python, without needing YAML or JSON configs.
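For example, replica counts and per-replica resources are plain arguments to the deployment decorator. A minimal sketch (the deployment name and the values here are illustrative):

from ray import serve

# Replica count and per-replica resources are declared inline in Python,
# not in a separate YAML or JSON file.
@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 0.5})
class Greeter:
    def __call__(self, request):
        return "Hello, world!"

serve.run(Greeter.bind())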
Low latency, high throughput
Horizontally scale across hundreds of processes or machines, while keeping the overhead in single-digit milliseconds.
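As a sketch of what scaling out looks like in code, the replica bounds below are illustrative; Serve spreads the replicas across the processes and machines in your Ray cluster and load-balances requests among them:

from ray import serve

# Serve autoscales between these bounds, scheduling replicas across
# the whole cluster.
@serve.deployment(autoscaling_config={"min_replicas": 2, "max_replicas": 200})
class Model:
    async def __call__(self, request):
        return {"ok": True}

serve.run(Model.bind())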
Multi-model composition
Easily compose multiple models, mix model serving with business logic, and independently scale components, without complex microservices.
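Composition works by passing one deployment's handle to another, using the DeploymentHandle API from recent Ray releases. A minimal sketch with hypothetical model names:

from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment
class Preprocessor:
    def transform(self, text: str) -> str:
        return text.strip().lower()

@serve.deployment
class Classifier:
    def predict(self, text: str) -> str:
        return "positive" if "good" in text else "negative"

@serve.deployment
class Pipeline:
    def __init__(self, pre: DeploymentHandle, clf: DeploymentHandle):
        # Each handle targets a deployment that deploys and scales
        # independently of the others.
        self.pre = pre
        self.clf = clf

    async def __call__(self, request):
        text = (await request.json())["text"]
        cleaned = await self.pre.transform.remote(text)
        label = await self.clf.predict.remote(cleaned)
        return {"label": label}

serve.run(Pipeline.bind(Preprocessor.bind(), Classifier.bind()))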
Framework-agnostic
Use a single tool to serve all types of models, from PyTorch and TensorFlow to scikit-learn, alongside your business logic.
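Because any Python callable can be a deployment, mixing frameworks with plain business logic needs no extra machinery. A hypothetical function deployment:

from ray import serve

# Any Python callable can be a deployment: a PyTorch or TensorFlow model,
# a scikit-learn pipeline, or plain rules like this one.
@serve.deployment
async def fraud_check(request):
    payload = await request.json()
    return {"flagged": payload.get("amount", 0) > 10_000}

serve.run(fraud_check.bind())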
FastAPI Integration
Scale an existing FastAPI server easily or define an HTTP interface for your model using its simple, elegant API.
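A sketch of the integration, wrapping an existing FastAPI app with serve.ingress (the route and class names are illustrative):

from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment
@serve.ingress(app)  # The FastAPI routes are now served by Serve replicas.
class APIIngress:
    @app.get("/hello")
    def hello(self, name: str = "world") -> str:
        return f"Hello, {name}!"

serve.run(APIIngress.bind())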
Native GPU support
Using GPUs is as simple as adding one line of Python code. Maximize hardware utilization by sharing CPUs or GPUs between different models.
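For instance, requesting GPUs is a single decorator argument, and fractional values let replicas share one device (a sketch; the model body is omitted):

from ray import serve

# Each replica is scheduled onto half a GPU, so two replicas can share
# a single device.
@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class GPUModel:
    def __call__(self, request):
        ...  # Run inference on the GPU assigned to this replica.

serve.run(GPUModel.bind())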
Try it yourself
Install Ray Serve and the example's dependencies with pip install "ray[serve]" scikit-learn requests, then give this example a try.
import requests
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

from ray import serve


@serve.deployment
class BoostingModel:
    def __init__(self, model):
        self.model = model
        # Human-readable class names ("setosa", "versicolor", "virginica").
        self.label_list = load_iris()["target_names"].tolist()

    async def __call__(self, request):
        payload = await request.json()
        print(f"Received request with data {payload}")

        prediction = self.model.predict([payload["vector"]])[0]
        human_name = self.label_list[prediction]
        return {"result": human_name}


if __name__ == "__main__":
    # Train the model.
    iris_dataset = load_iris()
    model = GradientBoostingClassifier()
    model.fit(iris_dataset["data"], iris_dataset["target"])

    # Deploy the model behind the /iris route.
    serve.run(BoostingModel.bind(model), route_prefix="/iris")

    # Query the model over HTTP.
    sample_request_input = {"vector": [1.2, 1.0, 1.1, 0.9]}
    response = requests.get("http://localhost:8000/iris", json=sample_request_input)
    print(response.text)
    # Output:
    # {"result": "versicolor"}

See Ray Serve in action
See how companies are using Ray Serve to run their production model serving systems in a fast, reliable, and scalable way.

Wildlife Studios
Wildlife Studios serves in-game offers 3X faster, while simultaneously reducing infrastructure costs by 95%, with Ray Serve.

Ikigai Labs
See how a small team of data scientists built a dynamic, scalable data pipeline service for their users using Ray Serve.

WidasConcepts
Learn how the German tech services firm built its next-generation identity management platform on top of Ray Serve, running on Kubernetes.
Scale more than just serving
Expand your Ray journey beyond model serving and scale other parts of your machine learning pipeline.
O'Reilly Learning Ray Book
Get your free copy of the early-release chapters of Learning Ray, the first and only comprehensive book on Ray and its ecosystem, authored by members of the Ray engineering team.
