Face detection on YouTube videos

This is an example that uses OpenCV as well as video utilities pytube and moviepy to work with video files.

The face detection is a quite simple model built into OpenCV and is not state of the art.

The result

If you watched this, we succeeded rickrolling you! 🤣

The Python code

We start by setting up the container image we need. This requires installing a few dependencies needed for OpenCV as well as downloading the face detection model

import os

import modal

OUTPUT_DIR = "/tmp/"
FACE_CASCADE_FN = "haarcascade_frontalface_default.xml"

image = (
        "apt-get install -y libgl1-mesa-glx libglib2.0-0 wget",
        f"wget https://raw.githubusercontent.com/opencv/opencv/master/data/haarcascades/{FACE_CASCADE_FN} -P /root",
    .pip_install("pytube", "opencv-python", "moviepy")
stub = modal.Stub("example-youtube-face-detection", image=image)

if stub.is_inside():
    import cv2
    import moviepy.editor
    import pytube

For temporary storage of movie clips, we use a “shared volume”

stub.sv = modal.SharedVolume()

Face detection function

The face detection function takes three arguments:

  • A filename to the source clip
  • A time slice denoted by start and a stop in seconds

The function extracts the subclip from the movie file (which is stored on the shared volume), runs face detection on every frame in its slice, and stores the resulting video back to the shared storage.

@stub.function(shared_volumes={"/clips": stub.sv}, timeout=600)
def detect_faces(fn, start, stop):
    # Extract the subclip from the video
    clip = moviepy.editor.VideoFileClip(fn).subclip(start, stop)

    # Load face detector
    face_cascade = cv2.CascadeClassifier(f"/root/{FACE_CASCADE_FN}")

    # Run face detector on frames
    imgs = []
    for img in clip.iter_frames():
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.1, 4)
        for (x, y, w, h) in faces:
            cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)

    # Create mp4 of result
    out_clip = moviepy.editor.ImageSequenceClip(imgs, fps=clip.fps)
    out_fn = f"/clips/{start:04d}.mp4"
    return out_fn

This ‘entrypoint’ into Modal controls the main flow of the program:

  1. Download the video from YouTube
  2. Fan-out face detection of individual 1s clips
  3. Stitch the results back into a new video
@stub.function(shared_volumes={"/clips": stub.sv})
def process_video(url):
    print(f"Downloading video from '{url}'")
    yt = pytube.YouTube(url)
    stream = yt.streams.filter(file_extension="mp4").first()
    fn = stream.download(output_path="/clips/")

    # Get duration
    duration = moviepy.editor.VideoFileClip(fn).duration

    # Create (start, stop) intervals
    intervals = [(fn, offset, offset + 1) for offset in range(int(duration))]

    print("Processing each range of 1s intervals using a Modal map")
    out_fns = list(detect_faces.starmap(intervals))

    print("Converting detections to video clips")
    out_clips = [moviepy.editor.VideoFileClip(out_fn) for out_fn in out_fns]

    print("Concatenating results")
    final_clip = moviepy.editor.concatenate_videoclips(out_clips)
    final_fn = "/clips/out.mp4"

    # Return the full image data
    with open(final_fn, "rb") as f:
        return os.path.basename(fn), f.read()

Local entrypoint

The code we run locally to fire up the Modal job is quite simple

  • Take a YouTube URL on the command line
  • Run the Modal function
  • Store the output data
def main(youtube_url: str = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"):
    fn, movie_data = process_video.call(youtube_url)
    abs_fn = os.path.join(OUTPUT_DIR, fn)
    print(f"writing results to {abs_fn}")
    with open(abs_fn, "wb") as f:

Running the script

Running this script should take approximately a minute or less. It might output a lot of warnings to standard error. These are generally harmless.

Note that we don’t preserve the sound in the video.

Further directions

As you can tell from the resulting video, this face detection model is not state of the art. It has plenty of false positives (non-faces being labeled faces) and false negatives (real faces not being labeled). For better model, consider a modern one based on deep learning.

Try this on Modal!

You can run this on Modal with 60 seconds of work!
Creating an account is free and no credit card is required. After creating an account, install the Modal Python package and create an API token:
pip install modal-client
modal token new
git clone https://github.com/modal-labs/modal-examples
cd modal-examples
modal run 03_scaling_out/youtube_face_detection.py