Why Ray Datasets?
A few reasons why you should choose Ray Datasets for your large-scale data processing and transformation needs.
Built for scale
Run basic data operations such as map, filter, repartition, and shuffle on petabyte-scale data in native Python code.
Backed by a distributed Arrow engine, Datasets works with a variety of file formats, data sources, and distributed frameworks.
Load your data once and enjoy a pluggable experience: once your data is in your Ray cluster as a Dataset, leveraging Ray is a breeze.
Try It Yourself
Install Ray Datasets with
pip install ray pyarrow fsspec and give this example a try.
import ray

# read a local CSV file
csv_path = "path/to/file.csv"
ds = ray.data.read_csv(csv_path)

# read Parquet from S3
parquet_path = "s3://ursa-labs-taxi-data/2019/06/data.parquet"
ds = ray.data.read_parquet(parquet_path)
Do more with Ray
Get more from your Ray investment by scaling other use cases on the same cluster.
Scale deep learning
Scale hyperparameter search
Scale model serving