Ray Datasets
Scalable data loading in Python
Ray Datasets are the standard way to load and exchange data in Ray libraries and applications. They provide basic distributed data transformations such as map, filter, and repartition, and are compatible with a variety of file formats, data sources, and distributed frameworks.

Why Ray Datasets?
A few reasons why you should choose Ray Datasets for your large-scale data processing and transformation needs.
Built for scale
Run basic data operations such as map, filter, repartition, and shuffle on petabyte-scale data in native Python code (a short sketch follows this list).
Distributed Arrow
With a distributed Arrow backend, Datasets work easily with a variety of file formats, data sources, and distributed frameworks (see the Parquet sketch below).
Ray ecosystem
Load your data once and enjoy a pluggable ecosystem: once your data is in your Ray cluster with Datasets, leveraging the rest of Ray is a breeze.
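
The basic transformations named above can be chained in plain Python. A minimal sketch, assuming the classic Datasets API used in the example further down (in newer Ray releases, rows are dicts such as {'id': ...} and the lambdas would take records instead of ints):
import ray
# Chain map, filter, and repartition on a small Dataset.
ds = ray.data.range(1000)
ds = ds.map(lambda x: x * 2)          # transform each record
ds = ds.filter(lambda x: x % 4 == 0)  # keep matching records
ds = ds.repartition(10)               # rebalance into 10 blocks
ds.count()
# -> 500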
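
Loading columnar files is similarly direct thanks to the Arrow backend. A minimal sketch; "example.parquet" is a placeholder path, and to_pandas() assumes the data fits in local memory:
import ray
# Read a Parquet file into an Arrow-backed Dataset.
ds = ray.data.read_parquet("example.parquet")  # placeholder path
ds.schema()          # Arrow schema inferred from the file
df = ds.to_pandas()  # collect locally as a pandas DataFrame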
Try It Yourself
Install Ray Datasets with pip install ray pyarrow fsspec and give this example a try.
import ray
# Create a Dataset of Python objects.
ds = ray.data.range(10000)
# -> Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)
ds.take(5)
# -> [0, 1, 2, 3, 4]
ds.count()
# -> 10000
# Create a Dataset of Arrow records.
ds = ray.data.from_items([{"col1": i, "col2": str(i)} for i in range(10000)])
# -> Dataset(num_blocks=200, num_rows=10000, schema={col1: int64, col2: string})
ds.show(5)
# -> {'col1': 0, 'col2': '0'}
# -> {'col1': 1, 'col2': '1'}
# -> {'col1': 2, 'col2': '2'}
# -> {'col1': 3, 'col2': '3'}
# -> {'col1': 4, 'col2': '4'}
ds.schema()
# -> col1: int64
# -> col2: string

Do more with Ray
Get more from your Ray investments by scaling other use cases on Ray.
O'Reilly Learning Ray Book
Get your free copy of the early-release chapters of Learning Ray, the first and only comprehensive book on Ray and its ecosystem, authored by members of the Ray engineering team.
