Publish interactive datasets with Datasette
Build and deploy an interactive movie database that automatically updates daily with the latest IMDb data. This example shows how to serve a Datasette application on Modal with millions of movie and TV show records.
Try it out for yourself here.
Along the way, we will learn how to use the following Modal features:
Volumes: a persisted volume lets us store and grow the published dataset over time.
Scheduled functions: IMDb refreshes the underlying dataset daily, so we schedule a function to pull in updates every 24 hours.
Web endpoints: a web endpoint exposes the Datasette application to web browsers and API clients.
Basic setup
Let’s get started writing code. For the Modal container image we need a few Python packages.
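A sketch of that setup, assuming Datasette and sqlite-utils are the packages in question (the real example may pin specific versions, and the app name here is made up):

```python
import modal

app = modal.App("example-cron-datasette")  # assumed app name

# Datasette serves the SQLite database as a web app;
# sqlite-utils makes batch inserts into SQLite convenient.
image = modal.Image.debian_slim().pip_install("datasette", "sqlite-utils")
```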
Persistent dataset storage
To separate database creation and maintenance from serving, we’ll need the underlying database file to be stored persistently. To achieve this we use a Volume.
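A sketch of the Volume setup; the volume name and the mount path (`/cache`) are assumptions:

```python
import modal

# A named Volume persists the SQLite file across function runs and deploys.
volume = modal.Volume.from_name("example-datasette-cache-vol", create_if_missing=True)

DB_PATH = "/cache/imdb.db"  # hypothetical location of the database inside the volume
```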
Getting a dataset
IMDb Datasets are available publicly and are updated daily. We will download the title.basics.tsv.gz file which contains basic information about all titles (movies, TV shows, etc.). Since we are serving an interactive database which updates daily, we will download the files into a temporary directory and then move them to the volume to prevent downtime.
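The download-then-move step can be sketched as a plain helper (the name `atomic_download` is made up); in the example this logic runs inside a Modal function with the Volume mounted, followed by a commit of the volume:

```python
import pathlib
import shutil
import tempfile
import urllib.request


def atomic_download(url: str, destination: pathlib.Path) -> pathlib.Path:
    """Fetch a URL into a temporary directory, then move the finished file
    into place, so readers never see a partially downloaded file."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = pathlib.Path(tmp) / destination.name
        urllib.request.urlretrieve(url, tmp_path)
        shutil.move(str(tmp_path), str(destination))
    return destination
```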
Data processing
This dataset is no swamp, but a bit of data cleaning is still in order. The following function reads a .tsv file, cleans the data and yields batches of records.
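A sketch of such a generator, assuming IMDb's convention of marking missing values with the literal string `\N`; the batch size is a made-up default:

```python
import csv
import gzip
from typing import Iterator

BATCH_SIZE = 10_000  # assumed batch size; the example may use a different value


def gen_batches(tsv_path, batch_size: int = BATCH_SIZE) -> Iterator[list]:
    """Read a gzipped TSV file, clean each row, and yield batches of dicts."""
    batch = []
    with gzip.open(tsv_path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # IMDb uses the literal string "\N" for missing values.
            cleaned = {k: (None if v == r"\N" else v) for k, v in row.items()}
            batch.append(cleaned)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch
```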
Inserting into SQLite
With the TSV processing out of the way, we’re ready to create a SQLite database and feed data into it.
Importantly, the prep_db function mounts the same volume used by download_dataset. Because the full IMDb dataset has millions of rows and takes some time to insert, rows are inserted in batches, with progress logged after each batch.
A more sophisticated implementation would only load new data instead of performing a full refresh, but we’re keeping things simple for this example! We will also create indexes for the titles table to speed up queries.
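A minimal sketch of the insertion step using the stdlib sqlite3 module (the real example may use sqlite-utils instead; the column list is trimmed to a few assumed fields from title.basics):

```python
import sqlite3


def prep_db(db_path: str, batches) -> None:
    """Create the titles table, batch-insert rows, then add indexes."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS titles (
               tconst TEXT PRIMARY KEY, titleType TEXT,
               primaryTitle TEXT, startYear INTEGER, genres TEXT)"""
    )
    inserted = 0
    for batch in batches:
        con.executemany(
            "INSERT OR REPLACE INTO titles VALUES "
            "(:tconst, :titleType, :primaryTitle, :startYear, :genres)",
            batch,
        )
        con.commit()
        inserted += len(batch)
        print(f"Inserted {inserted} rows so far")  # progress after each batch

    # Indexes on commonly filtered columns speed up Datasette queries.
    con.execute("CREATE INDEX IF NOT EXISTS idx_titles_year ON titles (startYear)")
    con.execute("CREATE INDEX IF NOT EXISTS idx_titles_type ON titles (titleType)")
    con.commit()
    con.close()
```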
Keep it fresh
IMDb updates their data daily, so we set up a scheduled function to automatically refresh the database every 24 hours.
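A sketch of the schedule wiring; the function body is a placeholder for the real refresh logic (re-download the dataset, then rebuild the SQLite database), and the app name is assumed:

```python
import modal

app = modal.App("example-cron-datasette")  # assumed app name


# Run on a 24-hour schedule.
@app.function(schedule=modal.Period(hours=24))
def refresh_db():
    print("Refreshing IMDb data...")  # placeholder for download + rebuild steps
```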
Web endpoint
Hooking up the SQLite database to a Modal webhook is as simple as it gets.
The Modal @asgi_app decorator wraps only a handful of lines: one import, plus the code to instantiate a Datasette instance and return its app server.
First, let’s define a metadata object for the database. This will be used to configure Datasette to display a custom UI with some pre-defined queries.
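A sketch of a Datasette metadata object; the database name and the canned query here are illustrative assumptions:

```python
# Datasette metadata: a title for the UI plus a pre-defined ("canned") query.
metadata = {
    "title": "IMDb Movie Database",
    "databases": {
        "imdb": {
            "queries": {
                "recent_movies": {
                    "sql": (
                        "SELECT primaryTitle, startYear FROM titles "
                        "WHERE titleType = 'movie' "
                        "ORDER BY startYear DESC LIMIT 100"
                    ),
                    "title": "Most recent movies",
                },
            },
        },
    },
}
```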
Now we can define the web endpoint that will serve the Datasette application.
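A sketch of that endpoint; the app, image, volume, and database path mirror the assumed names used earlier in this walkthrough:

```python
import asyncio

import modal

app = modal.App("example-cron-datasette")  # assumed app name
image = modal.Image.debian_slim().pip_install("datasette")
volume = modal.Volume.from_name("example-datasette-cache-vol", create_if_missing=True)

DB_PATH = "/cache/imdb.db"  # hypothetical path on the mounted volume


@app.function(image=image, volumes={"/cache": volume})
@modal.asgi_app()
def ui():
    from datasette.app import Datasette

    # The example would also pass the metadata object defined above.
    ds = Datasette(files=[DB_PATH])
    asyncio.run(ds.invoke_startup())  # run Datasette's startup hooks before serving
    return ds.app()
```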
Publishing to the web
Run this script using modal run cron_datasette.py and it will create the database in under 5 minutes!
If you would like to force a refresh of the dataset, you can use:
modal run cron_datasette.py --force-refresh
If you would like to filter the data to be after a specific year, you can use:
modal run cron_datasette.py --filter-year year
You can then use modal serve cron_datasette.py to create a short-lived web URL
that exists until you terminate the script.
When publishing the interactive Datasette app you’ll want to create a persistent URL.
Just run modal deploy cron_datasette.py and your app will be deployed in seconds!
You can explore the data at the deployed web endpoint.