Why Ray Datasets?

A few reasons why you should choose Ray Datasets for your large-scale data processing and transformation needs.

Built for scale

Run basic data operations such as map, filter, repartition, and shuffle on petabyte-scale data, all in native Python code.

Distributed Arrow

With a distributed Arrow backend, Ray Datasets works easily with a variety of file formats, data sources, and distributed frameworks.

Ray ecosystem

Load your data once and enjoy a pluggable experience: once your data is in your Ray cluster with Datasets, leveraging the rest of the Ray ecosystem is a breeze.

Try It Yourself

Install Ray Datasets with pip install ray pyarrow fsspec, then give this example a try.

import ray
 
# read a local CSV file
csv_path = "path/to/file.csv"
ds = ray.data.read_csv(csv_path)
 
# read Parquet from S3
parquet_path = "s3://ursa-labs-taxi-data/2019/06/data.parquet"
ds = ray.data.read_parquet(parquet_path)


Do more with Ray

Get more from your Ray investment by scaling other use cases on Ray.

Ray SGD

Scale deep learning

Ray Tune

Scale hyperparameter search

Ray Serve

Scale model serving