Why Ray Datasets?

A few reasons why you should choose Ray Datasets for your large-scale data processing and transformation needs.

Built for scale

Run basic data operations such as map, filter, repartition, and shuffle on petabyte-scale data in native Python code (a quick sketch of these operations follows below).

Distributed Arrow

Backed by a distributed Arrow backend, Datasets works with a wide variety of file formats, data sources, and distributed frameworks (a Parquet example is sketched after the code sample below).

Ray ecosystem

Load your data once and enjoy a pluggable experience: once your data is in your Ray cluster as a Dataset, leveraging the rest of the Ray ecosystem is a breeze.
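
As a quick taste of the operations mentioned above, here is a minimal sketch; the dataset size and the lambda transforms are illustrative choices, not prescribed usage.

import ray

# Start from a simple Dataset of integers.
ds = ray.data.range(100000)

# map: transform each record (here, square every value).
squares = ds.map(lambda x: x * x)

# filter: keep only the even values.
evens = squares.filter(lambda x: x % 2 == 0)

# repartition: redistribute the records across 50 blocks.
evens = evens.repartition(50)

# shuffle: globally randomize record order across the cluster.
shuffled = evens.random_shuffle()
shuffled.take(3)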

Try It Yourself

Install Ray Datasets with pip install ray pyarrow fsspec and give this example a try.

import ray

# Create a Dataset of Python objects.
ds = ray.data.range(10000)
# -> Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)

ds.take(5)
# -> [0, 1, 2, 3, 4]

ds.count()
# -> 10000

# Create a Dataset of Arrow records.
ds = ray.data.from_items([{"col1": i, "col2": str(i)} for i in range(10000)])
# -> Dataset(num_blocks=200, num_rows=10000, schema={col1: int64, col2: string})

ds.show(5)
# -> {'col1': 0, 'col2': '0'}
# -> {'col1': 1, 'col2': '1'}
# -> {'col1': 2, 'col2': '2'}
# -> {'col1': 3, 'col2': '3'}
# -> {'col1': 4, 'col2': '4'}

ds.schema()
# -> col1: int64
# -> col2: string
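
Because Datasets is backed by Arrow, tabular formats such as Parquet can be written and read back directly. A minimal sketch, assuming the placeholder path /tmp/ds_out is writable:

import ray

# Write the Arrow-record Dataset from above out as Parquet files.
# "/tmp/ds_out" is a placeholder path used for illustration.
ds = ray.data.from_items([{"col1": i, "col2": str(i)} for i in range(10000)])
ds.write_parquet("/tmp/ds_out")

# Read the Parquet files back; the Arrow backend preserves the schema.
ds2 = ray.data.read_parquet("/tmp/ds_out")
ds2.schema()
# -> col1: int64
# -> col2: string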

Do more with Ray

Get more from your Ray investment by scaling other use cases on Ray.

Ray Train

Scale deep learning

Ray Tune

Scale hyperparameter search

Ray Serve

Scale model serving

O'Reilly Learning Ray Book

Get your free copy of the early release chapters of Learning Ray, the first and only comprehensive book on Ray and its ecosystem, authored by members of the Ray engineering team.
