News article summarizer

In this example we scrape news articles from the New York Times’ Science section and summarize them using Google’s deep learning summarization model Pegasus. We log the resulting summaries to the terminal, but you can do whatever you want with them afterwards: save them to a CSV file, send them to Slack, etc.

import os
import re
from dataclasses import dataclass
from typing import List

import modal

Building Images and Downloading Pre-trained Model

We start by defining our images. In Modal, each function can use a different image. This is powerful because you add only the dependencies you need for each function.

stub = modal.Stub(name="example-news-summarizer")
MODEL_NAME = "google/pegasus-xsum"
CACHE_DIR = "/cache"

The first image contains the dependencies for running our model. We also download the pre-trained model with the Hugging Face API and cache it in a persisted shared volume (defined below), so we don’t have to download it on every function call.

stub["deep_learning_image"] = modal.Image.debian_slim().pip_install(
    "transformers==4.16.2", "torch", "sentencepiece"
)

Defining the scraping image is very similar. This image only contains the packages required to scrape the New York Times website, so it’s much smaller.

stub["scraping_image"] = modal.Image.debian_slim().pip_install(
    "requests", "beautifulsoup4", "lxml"
)

A persisted shared volume stores the downloaded model weights, so they only need to be fetched once.

volume = modal.SharedVolume().persist("pegasus-modal-vol")

We will also instantiate the model and tokenizer globally so they’re available to all functions that use this image.

if stub.is_inside(stub["deep_learning_image"]):
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    TOKENIZER = PegasusTokenizer.from_pretrained(
        MODEL_NAME, cache_dir=CACHE_DIR
    )
    MODEL = PegasusForConditionalGeneration.from_pretrained(
        MODEL_NAME, cache_dir=CACHE_DIR
    )


if stub.is_inside(stub["scraping_image"]):
    import requests
    from bs4 import BeautifulSoup

Collect Data

Collecting data happens in two stages: first we get a list of article URLs from the NYT Top Stories API, then we scrape the NYT web page for each of those articles to collect the article text.

@dataclass
class NYArticle:
    title: str
    image_url: str = ""
    url: str = ""
    summary: str = ""
    text: str = ""

In order to connect to the NYT API, you will need to sign up at the NYT Developer Portal, create an app, and grab an API key. Then head to Modal and create a Secret called nytimes, with an environment variable called NYTIMES_API_KEY set to your API key.

@stub.function(
    secret=modal.Secret.from_name("nytimes"), image=stub["scraping_image"]
)
def latest_science_stories(n_stories: int = 5) -> List[NYArticle]:
    # query api for latest science articles
    params = {
        "api-key": os.environ["NYTIMES_API_KEY"],
    }
    nyt_api_url = "https://api.nytimes.com/svc/topstories/v2/science.json"
    response = requests.get(nyt_api_url, params=params)

    # extract data from articles and return list of NYArticle objects
    results = response.json()
    reject_urls = {"null", "", None}
    articles = [
        NYArticle(
            title=u["title"],
            image_url=u.get("multimedia")[0]["url"]
            if u.get("multimedia")
            else "",
            url=u.get("url"),
        )
        for u in results["results"]
        if u.get("url") not in reject_urls
    ]

    # the API usually returns ~25 articles; keep only the first n_stories
    articles = articles[:n_stories]
    print(f"Retrieved {len(articles)} articles from the NYT Top Stories API")
    return articles

The NYT API only gives us article URLs; it doesn’t include the article text. So after getting the URLs from the API, we scrape each article page for its body. We’ll be using Beautiful Soup for that.

@stub.function(image=stub["scraping_image"])
def scrape_nyc_article(url: str) -> str:
    print(f"Scraping article => {url}")

    # fetch article; simulate desktop browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")

    # get all text paragraphs & construct single string with article text
    article_text = ""
    article_section = soup.find_all(
        "div", {"class": re.compile(r"\bStoryBodyCompanionColumn\b")}
    )
    if article_section:
        paragraph_tags = article_section[0].find_all("p")
        article_text = " ".join([p.get_text() for p in paragraph_tags])

    # return article with scraped text
    return article_text

Now the summarization function. We use Hugging Face’s Pegasus tokenizer and model implementation to generate a summary of the article text. You can learn more about what Pegasus does in the Hugging Face documentation. Pass gpu="any" to the function decorator (instead of gpu=False below) to speed up inference; a sketch of a GPU variant follows the function.

@stub.function(
    image=stub["deep_learning_image"],
    gpu=False,
    shared_volumes={CACHE_DIR: volume},
    memory=4096,
)
def summarize_article(text: str) -> str:
    print(f"Summarizing text with {len(text)} characters.")

    # summarize text
    batch = TOKENIZER(
        [text], truncation=True, padding="longest", return_tensors="pt"
    ).to("cpu")
    translated = MODEL.generate(**batch)
    summary = TOKENIZER.batch_decode(translated, skip_special_tokens=True)[0]

    return summary
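
As noted above, passing gpu="any" to the function decorator speeds up inference. Below is a minimal sketch of what a GPU variant might look like; the function name summarize_article_gpu is just for illustration and is not part of the original example.

@stub.function(
    image=stub["deep_learning_image"],
    gpu="any",  # request any available GPU
    shared_volumes={CACHE_DIR: volume},
    memory=4096,
)
def summarize_article_gpu(text: str) -> str:
    print(f"Summarizing text with {len(text)} characters on GPU.")

    # move the globally loaded model and the tokenized inputs onto the GPU
    model = MODEL.to("cuda")
    batch = TOKENIZER(
        [text], truncation=True, padding="longest", return_tensors="pt"
    ).to("cuda")
    translated = model.generate(**batch)
    return TOKENIZER.batch_decode(translated, skip_special_tokens=True)[0]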

Create a Scheduled Function

Put everything together and schedule it to run every day. You can also use modal.Cron for a more advanced scheduling interface.
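
For example, a cron expression can pin the run to a specific time of day rather than "every 24 hours from deploy time". A quick sketch (the function name is hypothetical):

# run at 09:00 UTC every day
@stub.function(schedule=modal.Cron("0 9 * * *"))
def trigger_daily_at_nine():
    ...  # same body as trigger() below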

@stub.function(schedule=modal.Period(days=1))
def trigger():
    articles = latest_science_stories.call()

    # parallelize article scraping
    for i, text in enumerate(scrape_nyc_article.map([a.url for a in articles])):
        articles[i].text = text

    # parallelize summarization; only articles with scraped text are summarized
    articles_with_text = [a for a in articles if len(a.text) > 0]
    for i, summary in enumerate(
        summarize_article.map([a.text for a in articles_with_text])
    ):
        articles_with_text[i].summary = summary

    # show all summaries in the terminal
    for article in articles:
        print(f'Summary of "{article.title}" => {article.summary}')
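
As mentioned in the introduction, you can do more with the summaries than print them. Here is a minimal sketch of a hypothetical helper (save_to_csv is not part of the original example) that you could call at the end of trigger to write the results to a CSV file.

import csv


def save_to_csv(articles: List[NYArticle], path: str = "summaries.csv"):
    # write one row per article: title, url, summary
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "url", "summary"])
        for article in articles:
            writer.writerow([article.title, article.url, article.summary])

Note that when trigger runs remotely, the file is written inside the container’s filesystem; to keep the results, you’d want to write to a shared volume or send the data to an external destination instead.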

Create a new Modal scheduled function with:

modal deploy --name news_summarizer news_summarizer.py

You can also run this entire Modal app in debugging mode before deploying, using the local entrypoint below:

modal run news_summarizer.py

@stub.local_entrypoint()
def main():
    trigger.call()

And that’s it. You will now generate deep learning summaries of the latest NYT Science articles every day.