Fine-tune open source YOLO models for object detection
Example by @Erik-Dunteman and @AnirudhRahul.
The popular “You Only Look Once” (YOLO) model line provides high-quality object detection in an economical package. In this example, we use the YOLOv10 model, released on May 23, 2024.
We will:
Download two custom datasets from the Roboflow computer vision platform: a dataset of cats and a dataset of dogs
Fine-tune the model on those datasets, in parallel, using the Ultralytics package
Run inference with the fine-tuned models on single images and on streaming frames
For commercial use, be sure to consult the Ultralytics software license options, which include AGPL-3.0.
Set up the environment
Modal runs your code in the cloud inside containers. So to use it, we have to define the dependencies of our code as part of the container’s image.
We also create a persistent Volume for storing datasets, trained weights, and inference outputs. For more on storing model weights on Modal, see this guide.
We attach both of these to a Modal App.
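A minimal sketch of that setup follows; the package pins, Volume name, and App name here are illustrative rather than the exact values used in the full example.

```python
import modal

# Container image with the training and inference dependencies.
# The package pins are illustrative, not the exact versions from the example.
image = modal.Image.debian_slim(python_version="3.10").pip_install(
    "ultralytics~=8.2",        # YOLO training and inference
    "roboflow",                # dataset download
    "opencv-python-headless",  # image decoding
)

# Persistent Volume for datasets, trained weights, and inference outputs.
volume = modal.Volume.from_name("yolo-finetune", create_if_missing=True)

app = modal.App("yolo-finetune", image=image)
```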
Download a dataset
We’ll be downloading our data from the Roboflow computer vision platform, so to follow along you’ll need to:
Create a free account on Roboflow
Set up a Modal Secret called roboflow-api-key in the Modal UI here, setting ROBOFLOW_API_KEY to the value of your API key.
You’re also free to bring your own dataset with a config in YOLOv10-compatible yaml format.
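As a rough sketch of the download step, assuming the Secret above and a Volume mounted at /root/data (the workspace and project identifiers, mount path, and export format string are placeholders you would swap for your own):

```python
@app.function(
    secrets=[modal.Secret.from_name("roboflow-api-key")],
    volumes={"/root/data": volume},
)
def download_dataset(workspace_id: str, project_id: str, version: int):
    import os

    from roboflow import Roboflow

    # The Secret exposes ROBOFLOW_API_KEY as an environment variable.
    rf = Roboflow(api_key=os.environ["ROBOFLOW_API_KEY"])
    project = rf.workspace(workspace_id).project(project_id)
    # Export into the shared Volume in a YOLO-style format; the exact format
    # string to request depends on your Roboflow export settings.
    project.version(version).download("yolov9", location=f"/root/data/{project_id}")
```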
We’ll be training on the medium size model, but you’re free to experiment with other model sizes.
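For reference, the published YOLOv10 checkpoints range from nano to extra-large, and switching sizes is just a matter of changing the checkpoint name (the constant below is only for illustration):

```python
# Pretrained YOLOv10 checkpoints, smallest to largest:
# yolov10n.pt, yolov10s.pt, yolov10m.pt, yolov10b.pt, yolov10l.pt, yolov10x.pt
MODEL_NAME = "yolov10m.pt"  # the medium model used in this example
```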
Train a model
We train the model on a single A100 GPU. Training usually takes only a few minutes.
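A sketch of the training function, assuming the Volume layout from the download step; the GPU type matches the text, but the epoch count, image size, and quick_check toggle are illustrative placeholders:

```python
MINUTES = 60  # seconds


@app.function(
    gpu="A100",
    volumes={"/root/data": volume},
    timeout=60 * MINUTES,
)
def train(project_id: str, quick_check: bool = False):
    from ultralytics import YOLO

    volume.reload()  # make sure the freshly downloaded dataset is visible

    model = YOLO("yolov10m.pt")  # medium-size pretrained checkpoint
    model.train(
        data=f"/root/data/{project_id}/data.yaml",  # dataset config written by Roboflow
        epochs=1 if quick_check else 8,
        imgsz=320 if quick_check else 640,
        project=f"/root/data/runs/{project_id}",  # write results back to the Volume
        device=0,
    )
    volume.commit()  # persist the new weights
```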
Run inference on single inputs and on streams
We demonstrate two different ways to run inference — on single images and on a stream of images.
The images we use for inference are loaded from the test set, which was added to our Volume when we downloaded the dataset.
Each image read takes ~50ms, and inference can take ~5ms, so the disk read would be our biggest bottleneck if we just looped over the image paths.
To avoid this bottleneck, we parallelize the disk reads across many workers using Modal’s .map,
streaming the images to the model. This roughly mimics the behavior of an interactive object detection pipeline.
This can increase throughput to roughly 60 images/s, or about 17 ms per image, depending on image size.
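As a minimal sketch, a reader function along these lines can be fanned out with .map; the mount path and the use of OpenCV for decoding are assumptions of the sketch:

```python
@app.function(volumes={"/root/data": volume})
def read_image(image_path: str):
    import cv2

    # Cheap, CPU-only work. Modal can fan these calls out across many containers,
    # and read_image.map(image_paths) yields decoded frames as they become ready.
    return cv2.imread(image_path)
```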
We use the @enter feature of modal.Cls to load the model only once on container start and reuse it for future inferences.
We use a generator to stream images to the model.
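Putting those pieces together, an inference class might be sketched as follows, building on the image, Volume, and read_image sketches above; the GPU type, mount path, and method names are assumptions rather than the exact ones in the full example:

```python
@app.cls(gpu="a10g", volumes={"/root/data": volume})
class Inference:
    weights_path: str = modal.parameter()

    @modal.enter()
    def load_model(self):
        # Runs once per container start; later calls reuse the loaded model.
        from ultralytics import YOLO

        self.model = YOLO(self.weights_path)

    @modal.method()
    def predict(self, image_path: str) -> int:
        # Single-image inference; return the number of detected objects.
        results = self.model.predict(image_path, save=False, verbose=False)
        return len(results[0].boxes)

    @modal.method()
    def streaming_count(self, image_paths: list[str]) -> int:
        # Consume the generator of decoded frames produced by the parallel readers.
        detections = 0
        for frame in read_image.map(image_paths):
            results = self.model.predict(frame, save=False, verbose=False)
            detections += len(results[0].boxes)
        return detections
```

From the entrypoint, something like Inference(weights_path="...").streaming_count.remote(image_paths) would then run the whole streaming pipeline remotely.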
Running the example
We’ll kick off our parallel training jobs and run inference from the command line.
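Assuming the example lives in a file named finetune_yolo.py (a placeholder name), the quick run might look like:

```bash
modal run finetune_yolo.py
```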
This runs the training in quick_check mode, useful for debugging the pipeline and getting a feel for it.
To do a longer run that meaningfully improves performance, disable quick_check mode.
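Assuming the local entrypoint exposes quick_check as a boolean parameter (Modal’s CLI generates --quick-check/--no-quick-check flags for it) and the placeholder file name above, that might look like:

```bash
modal run finetune_yolo.py --no-quick-check
```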
Addenda
The rest of the code in this example is utility code.