Train a model to solve coding problems using GRPO and TRL
This example demonstrates how to run GRPO on Modal using the TRL GRPO trainer. GRPO is a reinforcement learning algorithm introduced by DeepSeek and was used to train DeepSeek-R1. TRL is a reinforcement learning training library by Hugging Face.
First we perform the imports and then define the app.
We define an image in which we install the TRL library, along with vLLM for the next part of this example and Weights & Biases for logging.
We import the libraries we need within the image's context.
We also define a Modal Volume for storing model checkpoints.
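Putting the setup together, a minimal sketch might look like the following; the package list is unpinned, and the app name, volume name, and Python version are placeholders rather than the exact configuration used here:

```python
import modal

app = modal.App("grpo-trl")  # placeholder app name

# Image with TRL for training, vLLM for fast generation, and wandb for logging.
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "trl", "vllm", "wandb", "datasets"
)

# These heavier imports only run inside containers built from the image.
with image.imports():
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

# Volume for persisting model checkpoints across runs.
checkpoints = modal.Volume.from_name("grpo-checkpoints", create_if_missing=True)
```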
Defining the reward function
In this example, we use the OpenCoder-LLM/opc-sft-stage2 dataset to train a model to solve coding problems.
In reinforcement learning, we define a reward function for the model. Since we are evaluating code generated by a model, we use Modal Sandboxes to run that code securely.
We define a simple reward function that takes a completion from the model and a test case for that completion, and returns 1 if there are no errors and 0 otherwise. You will likely want to adjust this reward function, since the model is unlikely to learn well from such a coarse signal.
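A sketch of that idea, assuming the `app` above is in scope and that the program passed in already embeds its assert-based tests (the helper name `score_program` is ours):

```python
def score_program(program: str) -> float:
    # Run the candidate program in an isolated Modal Sandbox.
    sb = modal.Sandbox.create(
        "python", "-c", program,
        app=app,
        timeout=60,  # kill hanging or runaway programs
    )
    sb.wait()
    # Binary reward: 1 if the program (including its asserts) exited cleanly, 0 otherwise.
    return 1.0 if sb.returncode == 0 else 0.0
```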
We write a function that constructs a program from the model completion, based on the format of the data: completions are expected to follow the format “```python …”, and the test cases are a list of assert statements. More details here.
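One way to write that helper, assuming completions wrap their code in a fenced ```python block and `tests` is the dataset's list of assert strings (the function name is illustrative):

```python
import re


def construct_program(completion: str, tests: list[str]) -> str:
    # Pull the code out of the first ```python ... ``` block, if present.
    match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
    code = match.group(1) if match else completion
    # Append the assert-statement test cases after the solution code.
    return code + "\n\n" + "\n".join(tests)
```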
Finally, we define the function that is passed into the GRPOTrainer. It takes in a list of completions; custom reward functions must conform to a specific signature.
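Roughly, the wrapper looks like the sketch below: TRL passes the batch of completions plus any extra dataset columns as keyword arguments and expects one float back per completion. The `tests` column name and the helpers above are our assumptions:

```python
def code_reward(completions, tests, **kwargs):
    rewards = []
    for completion, test_cases in zip(completions, tests):
        program = construct_program(completion, test_cases)
        rewards.append(score_program(program))
    return rewards
```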
Kicking off a training run
We preprocess the data, preparing the columns that GRPOTrainer expects.
We use the OpenCoder-LLM educational instruct dataset, which has (instruction, code, test case) triples validated through a Python compiler.
More details here.
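A preprocessing sketch might look like this; the source column names (`instruction`, `testcase`) are assumptions about the dataset schema, and the kept `tests` column must match what the reward function expects:

```python
def preprocess(dataset):
    # GRPOTrainer looks for a `prompt` column; keep the test cases alongside
    # so TRL forwards them to the reward function.
    def to_prompt(example):
        return {"prompt": example["instruction"], "tests": example["testcase"]}

    return dataset.map(to_prompt, remove_columns=dataset.column_names)
```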
Since we use Weights & Biases for logging, we attach a Modal Secret containing our wandb credentials.
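Wired together, the training entrypoint might look like the following sketch; the secret name, GPU type, base model, and paths are placeholders:

```python
@app.function(
    image=image,
    gpu="H100",  # placeholder GPU type
    volumes={"/checkpoints": checkpoints},
    secrets=[modal.Secret.from_name("wandb")],  # assumed secret holding WANDB_API_KEY
    timeout=8 * 60 * 60,
)
def train():
    dataset = preprocess(
        load_dataset("OpenCoder-LLM/opc-sft-stage2", "educational_instruct", split="train")
    )
    config = GRPOConfig(
        output_dir="/checkpoints/grpo",  # lands in the Modal Volume
        report_to="wandb",
    )
    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder base model
        reward_funcs=code_reward,
        args=config,
        train_dataset=dataset,
    )
    trainer.train()
```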
To run: `modal run --detach grpo_trl.py::train`.
Speeding up training with vLLM
vLLM can be used either in server mode (running the vLLM server on a separate GPU) or in colocate mode (within the training process). In server mode, vLLM runs in a separate process, using separate GPUs, and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference. More details here. Here, we use 2 GPUs: we run the GRPOTrainer on one of them and the vLLM process on the other.
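In TRL this corresponds to `vllm_mode="server"`; a sketch of that wiring on a 2-GPU container follows. The flag and parameter names reflect recent TRL versions and may differ in yours, and the model and port are placeholders:

```python
import os
import subprocess


def run_grpo_with_vllm_server():
    # Pin TRL's vLLM generation server to GPU 1, leaving GPU 0 for the trainer
    # (e.g. by restricting CUDA_VISIBLE_DEVICES there before CUDA is initialized).
    server = subprocess.Popen(
        ["trl", "vllm-serve", "--model", "Qwen/Qwen2.5-0.5B-Instruct", "--port", "8000"],
        env={**os.environ, "CUDA_VISIBLE_DEVICES": "1"},
    )
    try:
        # Wait for the server to come up, then point the trainer at it.
        config = GRPOConfig(
            output_dir="/checkpoints/grpo-server",
            use_vllm=True,
            vllm_mode="server",
            vllm_server_host="127.0.0.1",
            vllm_server_port=8000,
        )
        # ... build the GRPOTrainer as before with args=config and call trainer.train()
    finally:
        server.terminate()
```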
You can execute this using `modal run --detach grpo_trl.py::train_vllm_server_mode`.
In colocate mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. More details here.
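In TRL this corresponds to `vllm_mode="colocate"`; a minimal config sketch (parameter names may vary by TRL version):

```python
config = GRPOConfig(
    output_dir="/checkpoints/grpo-colocate",
    use_vllm=True,
    vllm_mode="colocate",
    # Cap vLLM's share of GPU memory so the training model still fits;
    # the right fraction depends on model size and sequence lengths.
    vllm_gpu_memory_utilization=0.3,
)
```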
You can execute this using `modal run --detach grpo_trl.py::train_vllm_colocate_mode`.
Performing inference on the trained model
We use vLLM to perform inference on the trained model.
Once you have the model checkpoints in your Modal Volume, you can load the weights and perform inference using vLLM. For more on storing model weights on Modal, see this guide.
The weights path is as follows: `global_step_n/actor/huggingface`, where `n` is the checkpoint you want (e.g. `global_step_5/actor/huggingface`).
The `latest_checkpointed_iteration.txt` file stores the most recent checkpoint index.
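A small helper (the name and the `/checkpoints` mount path are ours) can resolve the latest weights directory from that file:

```python
CHECKPOINT_ROOT = "/checkpoints"  # where the Volume is mounted


def latest_weights_path() -> str:
    # Read the most recent checkpoint index and build the weights path described above.
    with open(f"{CHECKPOINT_ROOT}/latest_checkpointed_iteration.txt") as f:
        step = f.read().strip()
    return f"{CHECKPOINT_ROOT}/global_step_{step}/actor/huggingface"
```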
We provide the code for setting up an OpenAI-compatible inference endpoint here. For more details on serving models with vLLM, check out this example.
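The rough shape of that endpoint on Modal is sketched below; it reuses the helper above, and the GPU choice, port, and timeout are placeholders:

```python
@app.function(
    image=image,
    gpu="H100",  # placeholder
    volumes={"/checkpoints": checkpoints},
)
@modal.web_server(port=8000, startup_timeout=10 * 60)
def serve():
    import subprocess

    # Launch vLLM's OpenAI-compatible server against the trained weights.
    subprocess.Popen(
        ["vllm", "serve", latest_weights_path(), "--host", "0.0.0.0", "--port", "8000"]
    )
```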
You can then deploy the server using `modal deploy grpo_trl.py`, which gives you a custom URL. You can query it using the following curl command:
or in the following ways.
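For example, you could query it with the OpenAI Python client; the base URL and model name below are placeholders that depend on your deployment:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace--grpo-trl-serve.modal.run/v1",  # placeholder URL
    api_key="not-needed",  # vLLM only checks the key if one is configured
)

response = client.chat.completions.create(
    # Use whatever name the vLLM server registered the weights under
    # (by default, the path passed to `vllm serve`).
    model="/checkpoints/global_step_5/actor/huggingface",
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
)
print(response.choices[0].message.content)
```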