Embed 30 million Amazon reviews at 575k tokens per second with Qwen2-7B
This example demonstrates how to create embeddings for a large text dataset. This is often necessary to enable semantic search, translation, and other language processing tasks. Modal makes it easy to deploy large, capable embedding models and handles all of the scaling to process very large datasets in parallel on many cloud GPUs.
We create a Modal Function that will handle all of the data loading and submit inputs to an inference Cls that will automatically scale up to handle hundreds of large batches in parallel.
Between the time a batch is submitted and the time it is fetched, it is stored via
Modal’s spawn system, which can hold onto up to one million inputs for up to a week.
We define our main function as a local_entrypoint. This is what we’ll call locally
to start the job on Modal.
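Here's a minimal sketch of what that might look like. The app name, the output path, and the `launch_job` Function (sketched further below) are illustrative names, not necessarily those used in the real example:

```python
import json

import modal

app = modal.App("amazon-embeddings")  # hypothetical app name


@app.local_entrypoint()
def main(down_scale: float = 0.01, out_path: str = "embedding-job-ids.json"):
    # Start the data-loading Function on Modal; it spawns one inference
    # call per batch and returns the function call IDs.
    call_ids = launch_job.remote(down_scale=down_scale)

    # Save the IDs locally so we can fetch the embeddings later.
    with open(out_path, "w") as f:
        json.dump(call_ids, f)
    print(f"saved {len(call_ids)} function call IDs to {out_path}")
```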
You can run it with `modal run`, passing the path to this script.
By default, we down-scale to 1/100th of the data for demonstration purposes.
To launch the full job, set the --down-scale parameter to 1.
But note that this will cost you!
The entrypoint starts the job and gets back a function call ID for each batch.
We can use these IDs to retrieve the embeddings once the job is finished.
Modal will keep the results around for up to 7 days after completion. Take a look at our job processing guide for more details.
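Fetching the embeddings later might look roughly like this, reusing the hypothetical file of call IDs written by the entrypoint sketch above (`modal.FunctionCall.from_id` and `.get` are the relevant APIs):

```python
import json

import modal

# Load the function call IDs saved by the entrypoint sketch above.
with open("embedding-job-ids.json") as f:
    call_ids = json.load(f)

results = []
for call_id in call_ids:
    # Reconstruct a handle to the spawned call and block until its result
    # is available (results are retained for up to 7 days after completion).
    function_call = modal.FunctionCall.from_id(call_id)
    results.extend(function_call.get())
```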
Load the data and start the inference job
Next we define the Function that will do the data loading and feed it to our embedding model. We define a container Image with the data-loading dependencies.
Inside that Function, we download the data we need and cache it to the container’s local disk, which will disappear when the job is finished. We will be saving the review data along with the embeddings, so we don’t need to keep the dataset around.
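As a rough sketch, assuming the reviews come from the Amazon Reviews 2023 dataset on Hugging Face (the dataset identifier, configuration, and field names below are guesses, and `submit_batches` is sketched a little further down):

```python
import modal

app = modal.App("amazon-embeddings")  # as in the earlier sketch

# Image with just the data-loading dependencies.
data_image = modal.Image.debian_slim(python_version="3.12").pip_install(
    "datasets", "hf_transfer"
)


@app.function(image=data_image, timeout=2 * 60 * 60)
def launch_job(down_scale: float = 0.01) -> list[str]:
    from datasets import load_dataset

    # Download the book reviews and cache them on the container's local disk;
    # the cache disappears when this Function's container shuts down.
    # The dataset name and configuration are illustrative guesses.
    dataset = load_dataset(
        "McAuley-Lab/Amazon-Reviews-2023",
        "raw_review_Books",
        split="full",
        trust_remote_code=True,
    )

    # Optionally work on a fraction of the data for demonstration purposes.
    n_rows = int(len(dataset) * down_scale)
    texts = dataset.select(range(n_rows))["text"]

    # Chunk, batch, and spawn inference calls (see the helpers and the
    # ThreadPoolExecutor sketch below), returning the function call IDs.
    return submit_batches(texts)
```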
Embedding a large dataset like this can take some time, but we don’t need to wait
around for it to finish. We use spawn to invoke our embedding Function
and get back a handle with an ID that we can use to get the results later.
The job can bottleneck on just sending data over the network for processing, so
we speed things up by submitting batches from multiple threads with a ThreadPoolExecutor.
Once all of the batches have been sent for inference, we can return the function call IDs to the local client to save.
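A sketch of that submission helper, assuming the `TextEmbeddingsInference` class defined in the next section and the chunking and batching helpers described under Helper Functions below:

```python
from concurrent.futures import ThreadPoolExecutor


def submit_batches(texts: list[str]) -> list[str]:
    # Split reviews into model-sized chunks and group them into batches
    # (chunk_text and generate_batches are sketched under Helper Functions).
    batches = generate_batches(chunk_text(texts))

    def submit(batch: list[str]) -> str:
        # spawn returns immediately with a handle; Modal stores the result
        # until we fetch it (for up to 7 days after completion).
        handle = TextEmbeddingsInference().embed.spawn(batch)
        return handle.object_id

    # Sending batches over the network is the bottleneck here, so submit
    # from many threads at once.
    with ThreadPoolExecutor(max_workers=32) as executor:
        return list(executor.map(submit, batches))
```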
Massively scaling up and scaling out embedding inference on many beefy GPUs
We’re going to spin up many containers to run inference, and we don’t want each
one to have to download the embedding model from Hugging Face. We can download and save it to a
Modal Volume during the image build step using run_function.
We’ll use the GTE-Qwen2-7B-instruct model from Alibaba, which performs well on the Massive Text Embedding Benchmark.
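The download step might look something like this; the Volume name and mount path are illustrative:

```python
import modal

MODEL_ID = "Alibaba-NLP/gte-Qwen2-7B-instruct"
MODEL_DIR = "/model"  # where we'll mount the Volume

# The Volume persists across containers, so the weights are downloaded once.
model_volume = modal.Volume.from_name("embedding-model-cache", create_if_missing=True)


def download_model():
    from huggingface_hub import snapshot_download

    # Write the weights into the mounted Volume so inference containers
    # can load them without re-downloading from Hugging Face.
    snapshot_download(repo_id=MODEL_ID, local_dir=MODEL_DIR)
```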
For inference, we will use Hugging Face’s Text Embeddings Inference framework for embedding model deployment.
Running lots of separate machines is “scaling out”. But we can also “scale up” by running on large, high-performance machines.
We’ll use L40S GPUs for a good balance between cost and performance. Hugging Face has prebuilt Docker images we can use as a base for our Modal Image. We’ll use the one built for the L40S’s SM89/Ada Lovelace architecture and install the rest of our dependencies on top.
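Putting those pieces together, the inference Image could be defined roughly as follows. The exact image tag, Python version, and extra packages are assumptions, and `download_model`, `MODEL_DIR`, and `model_volume` come from the sketch above:

```python
import modal

# Pre-built TEI image targeting SM89 / Ada Lovelace GPUs like the L40S
# (the specific tag here is an assumption).
TEI_IMAGE = "ghcr.io/huggingface/text-embeddings-inference:89-latest"

inference_image = (
    modal.Image.from_registry(TEI_IMAGE, add_python="3.12")
    .entrypoint([])  # clear the image's default entrypoint so Modal manages the container
    .pip_install("httpx", "numpy")
    # Run download_model as an image build step, with the Volume attached,
    # so the weights are saved before any inference container starts.
    .run_function(download_model, volumes={MODEL_DIR: model_volume})
)
```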
Next we define our inference class. Modal will auto-scale the number of
containers ready to handle inputs based on the parameters we set in the @app.cls and @modal.concurrent decorators. Here we limit the total number of containers to
100 and the maximum number of concurrent inputs to 10, which caps us at 1000 concurrent batches.
On Modal’s Starter (free) and Team plans, the maximum number of concurrent GPUs is lower,
reducing the total number of concurrent batches and so the throughput.
Customers on Modal’s Enterprise Plan regularly scale up another order of magnitude above this. If you’re interested in running on thousands of GPUs, get in touch.
Here we also specify the GPU type and attach the Modal Volume where we saved the embedding model.
This class will spawn a local Text Embeddings Inference server when the container starts. It processes each batch by sending the text data to that server over a local HTTP connection and returns a list of tuples pairing each piece of text with its embedding.
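Here's a condensed sketch of what such a class could look like, using the `app`, `inference_image`, `MODEL_DIR`, and `model_volume` objects from the sketches above. The port, readiness check, server flags, and constant names are illustrative, and the real example may differ in its details:

```python
import subprocess
import time

import modal


@app.cls(
    image=inference_image,
    gpu="L40S",
    volumes={MODEL_DIR: model_volume},
    max_containers=100,  # "scale out": up to 100 containers
    timeout=60 * 60,
)
@modal.concurrent(max_inputs=10)  # up to 10 batches in flight per container
class TextEmbeddingsInference:
    @modal.enter()
    def start_server(self):
        import httpx

        # Launch a local TEI server pointed at the weights on the Volume.
        # BATCH_SIZE is the constant discussed under Helper Functions below.
        self.process = subprocess.Popen(
            [
                "text-embeddings-router",
                "--model-id", MODEL_DIR,
                "--port", "8000",
                "--max-client-batch-size", str(BATCH_SIZE),
            ]
        )
        self.client = httpx.Client(base_url="http://127.0.0.1:8000", timeout=60)

        # Wait for the server to report ready before accepting batches.
        while True:
            try:
                if self.client.get("/health").status_code == 200:
                    break
            except httpx.HTTPError:
                pass
            time.sleep(1)

    @modal.method()
    def embed(self, batch: list[str]) -> list[tuple[str, list[float]]]:
        # Send the batch to the local server over HTTP and pair each chunk
        # of text with its embedding vector.
        resp = self.client.post("/embed", json={"inputs": batch})
        resp.raise_for_status()
        return list(zip(batch, resp.json()))

    @modal.exit()
    def shutdown(self):
        self.process.terminate()
```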
Helper Functions
The book review dataset contains ~30M reviews with ~12B total characters, for an average review length of ~400 characters. Some are much longer. Embedding models have a limit on the number of tokens they can process in a single input, so we need to split each review into chunks that fit under this limit.
The proper way to split text data is to use a tokenizer, so that any
single request stays under the model's token limit, and to overlap chunks to provide
semantic context and preserve information. For the sake of this example, we just
split on a fixed character length (CHUNK_SIZE).
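A naive chunker along those lines might look like this (the chunk size is an illustrative value):

```python
from typing import Iterator

CHUNK_SIZE = 512  # characters per chunk; an illustrative value


def chunk_text(texts: list[str]) -> Iterator[str]:
    # Naive chunking: slice each review into fixed-length character chunks.
    # A production pipeline would use the model's tokenizer and overlap chunks.
    for text in texts:
        for start in range(0, len(text), CHUNK_SIZE):
            yield text[start : start + CHUNK_SIZE]
```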
While the embedding model has a limit on the number of input tokens for a single
embedding, the number of chunks that we can process in a single batch is limited by
the VRAM of the GPU. We set the BATCH_SIZE accordingly.
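A matching batching helper might look like this; the particular `BATCH_SIZE` value is a guess to be tuned against the GPU's memory:

```python
from typing import Iterator

BATCH_SIZE = 256  # chunks per inference call; tune to the GPU's VRAM


def generate_batches(chunks: Iterator[str]) -> Iterator[list[str]]:
    # Group the stream of chunks into fixed-size batches for the GPU.
    batch = []
    for chunk in chunks:
        batch.append(chunk)
        if len(batch) == BATCH_SIZE:
            yield batch
            batch = []
    if batch:
        yield batch
```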