Why Ray Datasets?

Here are a few reasons to choose Ray Datasets for your large-scale data processing and transformation needs.

Built for scale

Run basic data operations such as map, filter, repartition, and shuffle on petabyte-scale data in native Python code.

Distributed Arrow

With a distributed Arrow backend, Ray Datasets works easily with a variety of file formats, data sources, and distributed frameworks.

Ray ecosystem

Load your data once and enjoy a pluggable experience. Once your data is in your Ray cluster with Datasets, leveraging the rest of Ray is a breeze.

Try It Yourself

Install Ray Datasets with pip install ray pyarrow fsspec and give this example a try.

import ray

# Create a Dataset of Python objects.
ds = ray.data.range(10000)
# -> Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)

# Fetch the first five rows.
ds.take(5)
# -> [0, 1, 2, 3, 4]

# Count the rows in the Dataset.
ds.count()
# -> 10000

# Create a Dataset of Arrow records.
ds = ray.data.from_items([{"col1": i, "col2": str(i)} for i in range(10000)])
# -> Dataset(num_blocks=200, num_rows=10000, schema={col1: int64, col2: string})

# Print the first five records.
ds.show(5)
# -> {'col1': 0, 'col2': '0'}
# -> {'col1': 1, 'col2': '1'}
# -> {'col1': 2, 'col2': '2'}
# -> {'col1': 3, 'col2': '3'}
# -> {'col1': 4, 'col2': '4'}

# Inspect the Arrow schema.
ds.schema()
# -> col1: int64
# -> col2: string
