Why Ray Serve?
Speed and simplicity are just two of the many reasons to consider building your machine learning serving APIs with Ray Serve.
Configure your model serving declaratively in pure Python, without needing YAML or JSON configs.
Low latency, high throughput
Horizontally scale across hundreds of processes or machines, while keeping the overhead in single-digit milliseconds.
Easily compose multiple models, mix model serving with business logic, and independently scale components, without complex microservices.
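As a minimal sketch of that composition pattern, two deployments can call each other through Python handles rather than over HTTP between microservices. This assumes a running Ray Serve instance, and the `Preprocessor` and `Classifier` names (and the stubbed-out model logic) are purely illustrative:

```python
from ray import serve

@serve.deployment
class Preprocessor:
    def __call__(self, text: str) -> str:
        # Business logic lives alongside model serving.
        return text.strip().lower()

@serve.deployment(num_replicas=2)  # Scale this component independently.
class Classifier:
    def __init__(self):
        # Get a handle to the other deployment so we can call it in Python.
        self.preprocessor = Preprocessor.get_handle()

    async def __call__(self, text: str):
        cleaned = await self.preprocessor.remote(text)
        return {"input": cleaned, "label": "..."}  # real model call goes here

Preprocessor.deploy()
Classifier.deploy()
```

Because each class is its own deployment, each can be given its own replica count and resources without splitting the pipeline into separate services.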
Use a single tool to serve all types of models — from PyTorch and TensorFlow to scikit-learn models — and business logic.
Scale an existing FastAPI server easily or define an HTTP interface for your model using its simple, elegant API.
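A brief sketch of the FastAPI integration: wrapping an existing FastAPI app with `serve.ingress` routes its endpoints through a Serve deployment. This assumes a running Ray Serve instance; `MyModel` and the `/predict` route are illustrative names:

```python
from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment
@serve.ingress(app)  # Serve the FastAPI app's routes from this deployment.
class MyModel:
    @app.get("/predict")
    def predict(self, query: str):
        return {"result": query}  # replace with a real model call

MyModel.deploy()
```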
Native GPU support
Using GPUs is as simple as adding one line of Python code. Maximize hardware utilization by sharing CPUs or GPUs between different models.
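For example, GPU allocation is declared on the deployment itself; fractional values let replicas share a device. A sketch (the `GPUModel` class is a hypothetical placeholder):

```python
from ray import serve

# Request half a GPU per replica, so two replicas can share one GPU.
@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 0.5})
class GPUModel:
    def __init__(self):
        ...  # load the model onto the GPU here
```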
Try it yourself
Install Ray Serve with `pip install "ray[serve]"` and give this example a try.
```python
import ray
from ray import serve
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

# Define our deployment.
@serve.deployment(num_replicas=2)
class GPT2:
    def __init__(self):
        self.nlp_model = pipeline("text-generation", model="gpt2")

    async def predict(self, query: str):
        return self.nlp_model(query, max_length=50)

    async def __call__(self, request):
        return await self.predict(await request.body())

@app.on_event("startup")  # Code to be run when the server starts.
async def startup_event():
    ray.init(address="auto")  # Connect to the running Ray cluster.
    serve.start(http_host=None)  # Start the Ray Serve instance.
    # Deploy our GPT2 Deployment.
    GPT2.deploy()

@app.get("/generate")
async def generate(query: str):
    # Get a handle to our deployment so we can query it in Python.
    handle = GPT2.get_handle()
    return await handle.predict.remote(query)

@app.on_event("shutdown")  # Code to be run when the server shuts down.
async def shutdown_event():
    serve.shutdown()  # Shut down Ray Serve.
```
See Ray Serve in action
See how companies are using Ray Serve to run their production model serving systems in a fast, reliable, and scalable way.
Wildlife Studios serves in-game offers 3X faster, while simultaneously reducing infrastructure costs by 95%, with Ray Serve.
See how a small team of data scientists built a dynamic, scalable data pipeline service for their users using Ray Serve.
Learn how a German tech services giant built its next-generation identity management platform on top of Ray Serve, running on Kubernetes.
Scale more than just serving
Expand your Ray journey beyond model serving and scale other parts of your machine learning pipeline.
Scalable deep learning
Scale hyperparameter search
Scale data loading and collection use cases