Train a model to solve math problems using GRPO and verl
This example demonstrates how to train a model with GRPO on Modal using the verl framework. GRPO is a reinforcement learning algorithm introduced by DeepSeek and used to train DeepSeek R1. verl is a reinforcement learning training library implementing HybridFlow, an RLHF framework.
The training process works as follows:
- Each example in the dataset corresponds to a math problem.
- In each training step, the model attempts to solve the math problems, showing its steps.
- We then compute a reward for the model’s solution using the reward function defined below.
- That reward value is then used to update the model’s parameters according to the GRPO training algorithm.
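As a rough illustration of the last step (this is not verl's internal code), GRPO samples a group of candidate solutions per problem and turns their rewards into group-relative advantages by normalizing each reward against the group's mean and standard deviation:

```python
# Illustrative sketch of GRPO's group-relative advantage, not verl's internals.
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled solutions to the same problem, one of which is correct.
print(group_relative_advantages([0.0, 1.0, 0.0, 0.0]))  # the correct one gets a positive advantage
```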
Setup
Import the necessary modules for Modal deployment.
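In its simplest form this is just the Modal client library; the full example may import more. A minimal sketch:

```python
# The Modal client library is the only import needed at module scope; heavier
# dependencies are imported inside Functions so they only load in the cloud.
import modal
```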
Defining the image and app
We define an image where we clone the verl repo and install its dependencies. We use a base verl image as a starting point.
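A sketch of what that can look like; the base image tag below is a placeholder (check the verl docs for a current one), and the clone path is an assumption reused in later snippets:

```python
image = (
    modal.Image.from_registry("verlai/verl:latest")  # placeholder tag; pin a real verl image
    .run_commands(
        # Clone the verl repo and install it on top of the dependencies already
        # baked into the base image.
        "git clone https://github.com/volcengine/verl /root/verl",
        "cd /root/verl && pip install --no-deps -e .",
    )
)

app = modal.App("grpo-verl", image=image)
```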
Defining the dataset
In this example, we’ll use reinforcement learning to train a model to solve math problems. We use the GSM8K dataset of math problems and a Modal Volume to store the data.
We write a Modal Function to populate the Volume with the data. This downloads the dataset and stores it in the Volume. You will need to run this step if you don’t already have data you’d like to use for this example.
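A sketch of both pieces, assuming the clone location from the image above and that verl's GSM8K preprocessing script accepts a --local_dir flag (check the script in the verl repo):

```python
DATA_PATH = "/data"  # where the dataset Volume is mounted inside the container

data_volume = modal.Volume.from_name("grpo-verl-data", create_if_missing=True)

@app.function(volumes={DATA_PATH: data_volume})
def prep_dataset():
    import subprocess

    # verl ships a GSM8K preprocessing script that downloads the dataset and
    # writes the train/test parquet files in the format verl expects.
    subprocess.run(
        [
            "python",
            "/root/verl/examples/data_preprocess/gsm8k.py",
            "--local_dir",
            DATA_PATH,
        ],
        check=True,
    )
    # Persist the files so later Functions can read them from the Volume.
    data_volume.commit()
```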
You can kick off the dataset download with modal run <filename.py>::prep_dataset
Defining a reward function
In reinforcement learning, we define a reward function for the model.
We can define it in a separate file or, as we do here, in the same file; either way, it is passed as an argument to verl.
We use the default reward function for GSM8K from the verl repo, modified to return 1.0 if the answer is correct and 0.0 otherwise.
Reward functions need to follow a predefined signature.
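Below is a sketch of what that can look like. The (data_source, solution_str, ground_truth, extra_info) signature follows verl's documented custom reward function interface, but check it against your verl version; the answer extraction is a simplification of the GSM8K scorer in the verl repo:

```python
import re

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return 1.0 if the model's final answer matches the ground truth, else 0.0."""
    # GSM8K answers are numbers; take the last number the model produced as its
    # final answer, ignoring thousands separators.
    candidates = re.findall(r"-?\d+\.?\d*", solution_str.replace(",", ""))
    if not candidates:
        return 0.0
    target = str(ground_truth).replace(",", "").strip()
    return 1.0 if candidates[-1] == target else 0.0
```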
We then define constants to pass into verl during the training run.
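For example (the path below is hypothetical; verl is typically pointed at a custom reward function through its custom_reward_function.path and custom_reward_function.name config options):

```python
# Where verl should look for the reward function inside the container, and the
# name of the function to call. Both values are assumptions; adjust them to
# wherever this file ends up in your image.
REWARD_FN_PATH = "/root/reward.py"
REWARD_FN_NAME = "compute_score"
```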
Kicking off a training run
We define some more constants for the training run.
We also define a Volume for storing model checkpoints.
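A sketch of these, with a deliberately small model so the example stays cheap to run (the model name and GPU count are assumptions you should adjust):

```python
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical choice of a small base model
N_GPUS = 1
TRAIN_EPOCHS = 1
CHECKPOINT_PATH = "/checkpoints"  # where the checkpoint Volume is mounted

checkpoints_volume = modal.Volume.from_name(
    "grpo-verl-checkpoints", create_if_missing=True
)
```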
Now, we write a Modal Function for kicking off the training run. If you wish to use Weights & Biases, as we do in this code, you’ll need to create a Weights & Biases Secret.
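Here is a sketch of the shape of that Function, assuming the constants, Volumes, and image defined above. The verl entry point and override names follow verl's GSM8K GRPO example scripts but may differ across versions, and the Weights & Biases Secret is assumed to be named wandb-secret. The real example forwards extra CLI overrides directly (see the run command below); here they are taken as a single string to keep the sketch simple.

```python
@app.function(
    gpu=f"H100:{N_GPUS}",  # pick a GPU type and count that fits your budget
    volumes={DATA_PATH: data_volume, CHECKPOINT_PATH: checkpoints_volume},
    secrets=[modal.Secret.from_name("wandb-secret")],  # assumed Secret name
    timeout=8 * 60 * 60,
)
def train(extra_args: str = ""):
    import subprocess

    # Launch verl's trainer as a subprocess, configured for GRPO via
    # Hydra-style overrides.
    cmd = [
        "python3", "-m", "verl.trainer.main_ppo",
        "algorithm.adv_estimator=grpo",
        f"data.train_files={DATA_PATH}/train.parquet",
        f"data.val_files={DATA_PATH}/test.parquet",
        f"actor_rollout_ref.model.path={MODEL_NAME}",
        f"custom_reward_function.path={REWARD_FN_PATH}",
        f"custom_reward_function.name={REWARD_FN_NAME}",
        f"trainer.n_gpus_per_node={N_GPUS}",
        "trainer.nnodes=1",
        f"trainer.total_epochs={TRAIN_EPOCHS}",
        f"trainer.default_local_dir={CHECKPOINT_PATH}",
        "trainer.logger=['console','wandb']",
    ]
    cmd += extra_args.split()
    subprocess.run(cmd, check=True)

    # Make sure the checkpoints written by verl are persisted to the Volume.
    checkpoints_volume.commit()
```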
verl uses Ray under the hood. It creates a Ray worker for each step in the RL dataflow pipeline, where each worker is a Python process. verl also keeps a separate control flow process, independent of these workers, that decides which step in the RL pipeline to execute next. Each Ray worker is mapped onto one or more GPUs. Depending on the number of GPUs available, Ray decides where workers are placed, or holds off on scheduling them if no GPUs are free. Generally, more VRAM means less hot-swapping of Ray workers, and therefore less time spent copying memory on each iteration. In this example we have chosen a configuration that allows for easy automated testing, but you may wish to use more GPUs or more powerful GPU types. More details here.
You can now run the training using modal run --detach grpo_verl.py::train, or pass additional arguments from the CLI like this: modal run --detach grpo_verl.py::train -- trainer.total_epochs=20 actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16.
Performing inference on the trained model
We use vLLM to perform inference on the trained model.
Once you have the model checkpoints in your Modal Volume, you can load the weights and perform inference using vLLM. For more on storing model weights on Modal, see this guide.
The weights path is as follows: global_step_n/actor/huggingface where n is the checkpoint you want (e.g. global_step_5/actor/huggingface).
The latest_checkpointed_iteration.txt file stores the most recent checkpoint index.
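A small sketch of resolving the newest weights directory from that file, assuming it holds a bare step index and that the checkpoints Volume is mounted as above:

```python
from pathlib import Path

def latest_weights_dir(checkpoint_root: str = CHECKPOINT_PATH) -> Path:
    # latest_checkpointed_iteration.txt is assumed to contain just the step
    # index, e.g. "5".
    step = (Path(checkpoint_root) / "latest_checkpointed_iteration.txt").read_text().strip()
    return Path(checkpoint_root) / f"global_step_{step}" / "actor" / "huggingface"
```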
We provide the code for setting up an OpenAI-compatible inference endpoint here. For more details on serving models with vLLM, check out this example.
You can then deploy the server using modal deploy grpo_verl.py, which gives you a custom URL. You can query it using the following curl command:
or in the following ways.