A simple web scraper
In this guide we’ll introduce you to Modal by writing a simple web scraper. We’ll explain the foundations of a Modal application step by step.
Set up your first Modal app
Modal apps are orchestrated as Python scripts, but can theoretically run anything you can run in a container. To get you started, make sure to install the latest Modal Python package and set up an API token (the first two steps of the Getting started page).
First, we create an empty Python file, scrape.py. This file will contain our application code. Let’s write some basic Python code to fetch the contents of a web page and print the links (href attributes) it finds in the document:
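    import re
    import sys
    import urllib.request


    def get_links(url):
        # Download the page and decode it as text.
        response = urllib.request.urlopen(url)
        html = response.read().decode("utf8")
        # Collect the value of every double-quoted href attribute.
        links = []
        for match in re.finditer('href="(.*?)"', html):
            links.append(match.group(1))
        return links


    if __name__ == "__main__":
        # The URL to scrape is the first command-line argument.
        links = get_links(sys.argv[1])
        print(links)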
Now obviously this is just pure standard library Python code, and you can run it on your machine:
    $ python scrape.py http://example.com
    ['https://www.iana.org/domains/example']
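To see what the regular expression in get_links matches without making a network request, the extraction step can be tried on an inline HTML string. This is a minimal sketch: the helper name extract_links and the sample markup are made up for illustration and are not part of the guide's script.

```python
import re


def extract_links(html):
    """Return the values of all double-quoted href attributes in html."""
    # Same non-greedy pattern as in get_links: capture everything up to the
    # next double quote after href=".
    return [match.group(1) for match in re.finditer('href="(.*?)"', html)]


# Hypothetical sample markup, used only to exercise the pattern.
sample_html = (
    '<p><a href="https://www.iana.org/domains/example">More information</a> '
    "<a href='/single-quoted'>skipped</a> "
    '<a href="/about">About</a></p>'
)

print(extract_links(sample_html))
```

The pattern only matches double-quoted attributes, so the single-quoted link in the sample is skipped, and relative links such as /about are returned unresolved.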
Running it in Modal
To make the get_links function run in Modal instead of your local machine, all you need to do is import modal, create a modal.Stub instance, add a @stub.function annotation to your function, and wrap your __main__ block in the stub.run() context manager:
     import re
     import sys
     import urllib.request
    +import modal

    +stub = modal.Stub(name="link-scraper")

    +@stub.function
     def get_links(url):
         ...

     if __name__ == "__main__":
    +    with stub.run():
             links = get_links(sys.argv[1])
             print(links)
You still run the file like a normal Python script, but will now see some progress indication while the script is running:
    $ python scrape.py http://example.com
    ✓ Initialized.
    ✓ Created objects.
    ['https://www.iana.org/domains/example']
    ✓ App completed.
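In the code above we make use of the Python standard library urllib module. That works for static pages, but many pages use JavaScript to dynamically load content, which wouldn’t appear in the loaded HTML file. Let’s use the Playwright package to launch a headless Chromium browser that can run any JavaScript that might be on the page.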
We can pass custom container images (defined using modal.Image) to the @stub.function decorator. We’ll make use of the modal.Image.debian_slim pre-bundled image and add the shell commands to install Playwright and its dependencies:
    playwright_image = modal.Image.debian_slim().run_commands(
        [
            "apt-get install -y software-properties-common",
            "apt-add-repository non-free",
            "apt-add-repository contrib",
            "apt-get update",
            "pip install playwright==1.20.0",
            "playwright install-deps chromium",
            "playwright install chromium",
        ],
    )
Note that we don’t have to install Playwright or Chromium on our development machine since this will all run in Modal. We can now modify our get_links function to make use of the new tools:
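    @stub.function(image=playwright_image)
    async def get_links(cur_url: str):
        # Imported inside the function: Playwright is only installed in the container image.
        from playwright.async_api import async_playwright

        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            await page.goto(cur_url)
            # Evaluated in the page: collect the resolved href of every anchor element.
            links = await page.eval_on_selector_all("a[href]", "elements => elements.map(element => element.href)")
            await browser.close()

        print("Links", links)
        return links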
Since Playwright has a nice async interface, we’ll redeclare our get_links function as async (Modal works with both sync and async functions).
The first time you run the function after making this change, you’ll notice that the output first shows the progress of building the custom image you specified, after which your function runs like before. This image is then cached so that on subsequent runs of the function it will not be rebuilt as long as the image definition is the same.
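One way to picture "not rebuilt as long as the image definition is the same" is a cache keyed by a hash of the build commands. The sketch below only illustrates that idea under that assumption; it is not how Modal implements image caching, and the names definition_key and build_image are hypothetical.

```python
import hashlib
import json

# Hypothetical in-memory cache: definition hash -> "built image" record.
_image_cache = {}
build_count = 0


def definition_key(commands):
    """Derive a stable key from the ordered list of build commands."""
    payload = json.dumps(commands).encode("utf8")
    return hashlib.sha256(payload).hexdigest()


def build_image(commands):
    """Return the cached record for an identical definition, else 'build' one."""
    global build_count
    key = definition_key(commands)
    if key not in _image_cache:
        build_count += 1  # stands in for the slow build step
        _image_cache[key] = {"id": key[:12], "commands": list(commands)}
    return _image_cache[key]


commands = ["pip install playwright==1.20.0", "playwright install chromium"]
first = build_image(commands)                        # built
second = build_image(commands)                       # same definition: cache hit
third = build_image(commands + ["apt-get update"])   # changed definition: built again

print(build_count)
```

In this toy model any change to the command list, including reordering, produces a new key and therefore a new build.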
So far, our script only fetches the links for a single page. What if we want to scrape a large list of links in parallel?
We can do this easily with Modal, because of some magic: the function we wrapped with the @stub.function decorator is no longer an ordinary function, but a Modal Function object. This means it comes with a map property built in, which lets us run this function for all inputs in parallel, scaling up to as many workers as needed.
Let’s change our code to scrape all the URLs we feed to it in parallel:
    if __name__ == "__main__":
        urls = ["http://modal.com", "http://github.com"]
        with stub.run():
            for links in get_links.map(urls):
                for link in links:
                    print(link)
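As a rough local analogy for mapping one function over many inputs concurrently, here is a sketch using the standard library's concurrent.futures. It runs in threads on a single machine rather than in Modal containers, and fake_get_links is a hypothetical stand-in that returns made-up links so the example needs no network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_get_links(url):
    """Hypothetical stand-in for get_links: returns made-up links, no network."""
    return [f"{url}/docs", f"{url}/pricing"]


urls = ["http://modal.com", "http://github.com"]

# Executor.map calls the function once per input, runs the calls
# concurrently, and yields the results in the same order as the inputs.
with ThreadPoolExecutor(max_workers=4) as pool:
    all_links = list(pool.map(fake_get_links, urls))

for links in all_links:
    for link in links:
        print(link)
```

The calling code has the same shape as the loop over get_links.map(urls): one list of links per input URL, consumed with a nested loop.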
Schedules and deployments
Let’s say we want to log the scraped links daily. We move the print loop into its own Modal function and annotate it with a modal.Period(days=1) schedule, indicating we want to run it once per day. Since the scheduled function will not run from our command line, we also add a hard-coded list of links to crawl for now. In a more realistic setting we could read this from a database or other accessible data source.
    @stub.function(schedule=modal.Period(days=1))
    def daily_scrape():
        urls = ["http://modal.com", "http://github.com"]
        for links in get_links.map(urls):
            for link in links:
                print(link)
To deploy this as a permanent app, run the command
modal app deploy scrape.py
Running this command deploys this function and then closes immediately. We can see the deployment and all of its runs, including the printed links, on the Modal Deployments page. Rerunning the script will redeploy the code with any changes you have made - overwriting an existing deploy with the same name (“link-scraper” in this case).
Integrations and Secrets
Instead of looking at the links in the run logs of our deployments, let’s say we wanted to post them to our #scraped-links Slack channel. To do this, we can make use of the Slack API and the slack-sdk package.
The Slack SDK WebClient requires an API token to get access to our Slack Workspace, and since it’s bad practice to hardcode credentials into application code we make use of Modal’s Secrets. Secrets are snippets of data that will be injected as environment variables in the containers running your functions.
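Because a Secret shows up inside the container as ordinary environment variables, function code can read it with os.environ. The sketch below shows that lookup with an explicit error when the variable is absent. The helper name read_token and the placeholder value are hypothetical; SLACK_BOT_TOKEN is the variable name used by the Slack snippet in this guide.

```python
import os


def read_token(name="SLACK_BOT_TOKEN", environ=None):
    """Return the secret stored in the given environment variable.

    environ defaults to os.environ; passing a dict makes this easy to test.
    """
    env = os.environ if environ is None else environ
    token = env.get(name)
    if not token:
        raise RuntimeError(f"{name} is not set; is the Secret attached to this function?")
    return token


# Simulate what the container would see, using a placeholder value.
fake_env = {"SLACK_BOT_TOKEN": "placeholder-token"}
print(read_token(environ=fake_env))
```

Failing early with a clear message tends to be easier to debug than a bare KeyError raised in the middle of a scheduled run.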
The easiest way to create Secrets is to go to the Secrets section of modal.com. You can either create a free-form secret with any environment variables or make use of presets for common services. We’ll use the Slack preset; after filling in the necessary information, we are presented with a snippet of code that can be used to post to Slack using our credentials:
    import os

    slack_sdk_image = modal.Image.debian_slim().pip_install(["slack-sdk"])


    def bot_token_msg(channel, message):
        import slack_sdk

        client = slack_sdk.WebClient(token=os.environ["SLACK_BOT_TOKEN"])
        client.chat_postMessage(channel=channel, text=message)
We’ll copy that code as is and just call the bot_token_msg function from our daily_scrape function:
     @stub.function(schedule=modal.Period(days=1))
     def daily_scrape():
         urls = ["http://modal.com", "http://github.com"]
         for links in get_links.map(urls):
             for link in links:
    -            print(link)
    +            bot_token_msg("scraped-links", link)
Note that we are freely making function calls across completely different container images, as if they were regular Python functions in the same program.
We rerun the script, which overwrites the old deploy with our updated code, and now we get a daily feed of our scraped links in our Slack channel 🎉
We have shown how you can use Modal to develop distributed Python data applications using custom containers. Through simple constructs we were able to add parallel execution. With the change of a single line of code we were able to go from experimental development code to a deployed application. The full code of this example can be found here. We hope this overview gives you a glimpse of what you are able to build using Modal.