Run a job queue that turns documents into structured data with Datalab Marker

This tutorial shows you how to use Modal as an infinitely scalable job queue that can service async tasks from a web app.

Our job queue will handle a single task: converting images/PDFs into structured data. We’ll use Marker from Datalab, which can convert images of documents or PDFs to Markdown, JSON, and HTML. Marker is an open-weights model; to learn more about commercial usage, see here.

For the purpose of this tutorial, we’ve also built a React + FastAPI web app on Modal that works together with it, but note that you don’t need a web app running on Modal to use this pattern. You can submit async tasks to Modal from any Python application (for example, a regular Django app running on Kubernetes).

Try it out for yourself here.

Define an App

Let’s first import modal and define an App. Later, we’ll use the name provided for our job queue App to find it from our web app and submit tasks to it.

from typing import Optional

import modal
from typing_extensions import Literal

app = modal.App("example-doc-ocr-jobs")

We also define the dependencies we need by specifying an Image.

inference_image = modal.Image.debian_slim(python_version="3.12").uv_pip_install(
    "marker-pdf[full]==1.9.3", "torch==2.8.0"
)

We can obtain the pre-trained model we want to run from Datalab by using the Marker library.

def load_models():
    import marker.models

    print("loading models")

    return marker.models.create_model_dict()

The create_model_dict function downloads model weights from Datalab’s cloud storage (S3 bucket) if they aren’t already present in the filesystem. However, in Modal’s serverless environment, filesystems are ephemeral, so using this code alone would mean that models need to be downloaded many times (every time a new instance of our Function spins up).

So instead, we create a Modal Volume to store the models. Each Modal Volume is a durable filesystem that any Modal Function can access. You can read more about storing model weights on Modal in our guide.

marker_cache_path = "/root/.cache/datalab/"
marker_cache_volume = modal.Volume.from_name(
    "marker-models-modal-demo", create_if_missing=True
)
marker_cache = {marker_cache_path: marker_cache_volume}

Now let’s set up the actual inference.

Using the @app.function decorator, we set up a Modal Function. We provide arguments to that decorator to customize the hardware, scaling, and other features of the Function.

Here, we say that this Function should use NVIDIA L40S GPUs, automatically retry failures up to 3 times, and have access to our shared model cache.

Inside the Function, we write out our inference logic, which mostly involves configuring components provided by the marker library.

@app.function(gpu="l40s", retries=3, volumes=marker_cache, image=inference_image)
def parse_document(
    document: bytes,
    page_range: str | None = None,
    force_ocr: bool = False,
    paginate_output: bool = False,
    output_format: Literal["markdown", "html", "chunks", "json"] = "markdown",
    use_llm: bool = False,
) -> str | dict:
    """
    Args:
        document: Document data (PDF, JPG, PNG) as bytes.
        page_range: Specify which pages to process. Accepts comma-separated page numbers and ranges.
        force_ocr: Force OCR processing on the entire document, even for pages that might contain extractable text.
                    This will also format inline math properly.
        paginate_output: Paginates the output, using \n\n{PAGE_NUMBER} followed by - * 48, then \n\n
        output_format: Output format. Can be markdown, JSON, HTML, or chunks.
        use_llm: use an llm to improve the marker results.
    """
    from tempfile import NamedTemporaryFile

    import marker.config.parser
    import marker.converters.pdf
    import marker.output

    models = load_models()

    # Set up document "converter"
    config = {
        "page_range": page_range,
        "force_ocr": force_ocr,
        "paginate_output": paginate_output,
        "output_format": output_format,
        "use_llm": use_llm,
    }

    config_parser = marker.config.parser.ConfigParser(config)
    config_dict = config_parser.generate_config_dict()
    config_dict["pdftext_workers"] = 1

    converter = marker.converters.pdf.PdfConverter(
        config=config_dict,
        artifact_dict=models,
        processor_list=config_parser.get_processors(),
        renderer=config_parser.get_renderer(),
        llm_service=config_parser.get_llm_service() if use_llm else None,
    )

    # Run the converter on our document
    with NamedTemporaryFile(delete=False, mode="wb+") as temp_path:
        temp_path.write(document)
        rendered_output = converter(temp_path.name)

    # Format the output and return it
    if output_format == "json":
        result = rendered_output.model_dump_json()
    else:
        text, _, images = marker.output.text_from_rendered(rendered_output)

        result = text

    return result

Testing and debugging remote code

To make sure this code works, we want a way to kick the tires and debug it.

We can run it on Modal, with no need to set up separate local testing, by adding a local_entrypoint that invokes the Function .remotely.

@app.local_entrypoint()
def main(document_filename: Optional[str] = None):
    import urllib.request
    from pathlib import Path

    if document_filename is None:
        document_filename = Path(__file__).parent / "receipt.png"
    else:
        document_filename = Path(document_filename)

    if document_filename.exists():
        image = document_filename.read_bytes()
        print(f"running OCR on {document_filename}")
    else:
        document_url = "https://modal-cdn.com/cdnbot/Brandys-walmart-receipt-8g68_a_hk_f9c25fce.webp"
        print(f"running OCR on sample from URL {document_url}")
        request = urllib.request.Request(document_url)
        with urllib.request.urlopen(request) as response:
            image = response.read()
    print(parse_document.remote(image, output_format="html"))

You can then run this from the command line with:

modal run doc_ocr_jobs.py

Deploying the document conversion service

Now that we have a Function, we can publish it by deploying the App:

modal deploy doc_ocr_jobs.py

Once it’s published, we can look up this Function from another Python process and submit tasks to it:

fn = modal.Function.from_name("example-doc-ocr-jobs", "parse_document")
fn.spawn(my_document)

Modal will auto-scale to handle all the tasks queued, and then scale back down to 0 when there’s no work left. To see how you could use this from a Python web app, take a look at the receipt parser frontend tutorial.

Run a job queue that turns documents into structured data with Datalab Marker

Define an App

Cache the pre-trained model on a Modal Volume

Run Datalab Marker on Modal

Testing and debugging remote code

Deploying the document conversion service

Try this on Modal!