Large dataset ingestion
This guide provides best practices for downloading, transforming, and storing large datasets within Modal. A dataset is considered large if it contains hundreds of thousands of files and/or is over 100 GiB in size.
These guidelines ensure that large datasets can be ingested fully and reliably.
Configure your Function for heavy disk usage
Large datasets should be downloaded and transformed using a modal.Function and stored in a Volume.
This modal.Function should specify a large timeout, because processing a large dataset can take hours,
and it should request a larger ephemeral disk when the dataset being downloaded and processed
is hundreds of GiBs.
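As a rough sketch, a Function configured for this kind of job might look like the following; the App name, Volume name, mount path, timeout, and disk size below are illustrative assumptions, and the ephemeral_disk value is given in MiB:

```python
import modal

app = modal.App("large-dataset-ingest")  # hypothetical App name
volume = modal.Volume.from_name("example-dataset", create_if_missing=True)  # hypothetical Volume name

@app.function(
    volumes={"/mnt/dataset": volume},  # final home for the processed dataset
    timeout=4 * 60 * 60,               # allow several hours for download + transform
    ephemeral_disk=600 * 1024,         # ~600 GiB of local scratch disk, specified in MiB
)
def ingest_dataset():
    ...  # download to /tmp, transform, then copy into /mnt/dataset
```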
Prefer sharded or archived outputs for tiny-file datasets
Volumes can store large datasets, but datasets made up of millions of tiny files are still usually easier to ingest and consume when they are first grouped into larger artifacts such as tar shards, WebDataset archives, Parquet files, or other batched formats.
See the transforming section below for more details.
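As an illustration (not part of Modal's API), a minimal sharding helper might pack tiny files into fixed-size tar shards before they are persisted; the shard size and naming scheme here are arbitrary choices:

```python
import pathlib
import tarfile

def shard_tiny_files(src_dir: str, out_dir: str, files_per_shard: int = 10_000) -> None:
    """Pack many tiny files into a small number of tar shards."""
    files = [p for p in sorted(pathlib.Path(src_dir).rglob("*")) if p.is_file()]
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for start in range(0, len(files), files_per_shard):
        shard = out / f"shard-{start // files_per_shard:05d}.tar"
        with tarfile.open(shard, "w") as tar:
            for f in files[start : start + files_per_shard]:
                # Preserve the relative layout inside each shard.
                tar.add(f, arcname=str(f.relative_to(src_dir)))
```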
Experimentation
Downloading and transforming large datasets can be fiddly. While iterating on a reliable ingestion program
you may want an interactive environment so you can inspect downloaded files, validate credentials,
and benchmark transforms before automating the full ingestion job. Modal Notebooks work well for this. Attach the same Volume that your ingestion Functions use, keep transient scratch
data in /tmp, and persist intermediate artifacts under /mnt/....
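For instance, a notebook cell along these lines can spot-check what has already landed on the Volume and time a transform against /tmp scratch space; the /mnt/dataset mount path is an assumed example:

```python
import gzip
import pathlib
import shutil
import time

MOUNT = pathlib.Path("/mnt/dataset")    # assumed mount path of the attached Volume
SCRATCH = pathlib.Path("/tmp/scratch")  # transient scratch space on local disk
SCRATCH.mkdir(parents=True, exist_ok=True)

# Spot-check what has been ingested so far.
print(sorted(p.name for p in MOUNT.iterdir())[:20])

# Benchmark decompression on a single sample file before committing to the full job.
sample = next(MOUNT.rglob("*.gz"), None)
if sample is not None:
    start = time.monotonic()
    with gzip.open(sample, "rb") as src, open(SCRATCH / sample.stem, "wb") as dst:
        shutil.copyfileobj(src, dst)
    print(f"decompressed {sample.name} in {time.monotonic() - start:.1f}s")
```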
Downloading
The raw dataset data should first be downloaded into the container at /tmp/ and not placed
directly into the mounted Volume. This serves a couple of purposes:
- Download tools often create temporary files, partial files, or rename targets while writing, and local SSD handles that more efficiently.
- The raw dataset data may need to be transformed before use, in which case it is wasteful to store it permanently.
This snippet shows the basic download-and-copy procedure:
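```python
import pathlib
import shutil
import subprocess

# Reuses the `app` and `volume` objects from the configuration sketch above.
@app.function(
    volumes={"/mnt/dataset": volume},
    timeout=4 * 60 * 60,
    ephemeral_disk=600 * 1024,  # MiB
)
def download_and_copy():
    scratch = pathlib.Path("/tmp/dataset")
    scratch.mkdir(parents=True, exist_ok=True)

    # 1. Download onto local SSD-backed /tmp; download tools may write partial
    #    files and rename targets, which is cheap here. The URL is a placeholder,
    #    and wget is assumed to be installed in the container image.
    archive = scratch / "dataset.tar.gz"
    subprocess.run(
        ["wget", "--no-verbose", "https://example.com/dataset.tar.gz", "-O", str(archive)],
        check=True,
    )

    # 2. Copy the finished artifact into the mounted Volume and commit it.
    dest = pathlib.Path("/mnt/dataset") / archive.name
    shutil.copy(archive, dest)
    volume.commit()
```

This sketch reuses the app and volume objects from the configuration example above; the download URL is a placeholder, and wget is assumed to be available in the container image.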
Transforming
When ingesting a large dataset it is sometimes necessary to transform it before storage, so that it is in an optimal format for loading at runtime. A common example is gzip decompression: very large datasets are often gzipped for storage and network transmission efficiency, but gzip decompression (roughly 80 MiB/s) is hundreds of times slower than reading from a solid state drive (SSD), so it should be done once, before storage, to avoid decompressing on every read against the dataset.
Transformations should be performed after storing the raw dataset in /tmp/. Performing transformations almost always increases container disk usage, and this is where the ephemeral_disk parameter becomes important. For example, a
100 GiB raw, compressed dataset may decompress into 500 GiB, occupying 600 GiB of container disk space in total.
Transformations should also typically be performed against /tmp/. This is because:
- transforms can be IO intensive, and IO latency is lower against local SSD.
- transforms can create temporary data that is wasteful to store permanently.
Once the transform is complete, write the final dataset layout to the attached Volume and commit it so subsequent Functions and Notebooks can reload and use the same data.
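A rough sketch of that final step, assuming the raw data was decompressed under /tmp/dataset and that volume is the modal.Volume object mounted at /mnt/dataset (as in the configuration sketch above):

```python
import gzip
import pathlib
import shutil

def decompress_and_persist(raw_dir: str = "/tmp/dataset", out_dir: str = "/mnt/dataset") -> None:
    """Decompress once on local scratch disk, then move the results into the Volume."""
    for gz_path in list(pathlib.Path(raw_dir).rglob("*.gz")):
        # Decompress next to the source on /tmp, where IO latency is lowest.
        plain = gz_path.with_suffix("")
        with gzip.open(gz_path, "rb") as src, open(plain, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()  # reclaim scratch disk as we go

        # Preserve the directory layout when moving into the mounted Volume.
        dest = pathlib.Path(out_dir) / plain.relative_to(raw_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(plain), str(dest))

    # `volume` is the modal.Volume from the configuration sketch above;
    # committing makes the new data visible to other Functions and Notebooks.
    volume.commit()
```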
Examples
The best practices offered in this guide are demonstrated in the modal-examples repository.
The examples include these popular large datasets:
- ImageNet, the image labeling dataset that kicked off the deep learning revolution
- COCO, the Common Objects in COntext dataset of densely-labeled images
- LAION-400M, the Stable Diffusion training dataset
- Data derived from the Big “Fantastic” Database, Protein Data Bank, and UniProt Database used in training the RoseTTAFold protein structure model