Cursor-Based Pagination for Large Course Rosters

Pulling a complete enrollment roster for a high-capacity course — a 2,000-seat lecture, a cross-listed section, or an institution-wide term snapshot — is the canonical case where offset pagination quietly corrupts a dataset. While a ?page=N&per_page=100 loop walks the table, students drop, add, and swap sections, the underlying rows shift beneath the offset arithmetic, and the export silently skips or double-counts enrollments. This page is a single, self-contained procedure: extract a large roster page-by-page using a forward-only cursor so the result is gap-free, resumable, and FERPA-safe. It is one concrete technique inside the broader pagination strategies for bulk exports that govern the extraction stage of API ingestion and sync workflows.

A cursor replaces the fragile offset with an opaque token — or a deterministic (sort_key, tiebreak_id) pair — that names the exact position in a sorted result set. The server hands back a pointer to where the next batch begins, so the extraction window only ever moves forward and is immune to insertions or deletions ahead of it. The cost of getting this wrong is not a crash; it is a warehouse roster that disagrees with the registrar, which is exactly the kind of reconciliation failure that turns into a FERPA finding.
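To make the failure mode concrete, here is a small in-memory simulation rather than Canvas code. A list of integer ids stands in for the enrollments table, and one student drops immediately after the first page is read. The function names and the toy table are illustrative assumptions, not part of any API.

```python
def offset_walk(table: list[int], per_page: int, drop_id: int) -> list[int]:
    """Page through `table` by offset; `drop_id` is deleted after page one."""
    seen: list[int] = []
    offset = 0
    while batch := table[offset:offset + per_page]:
        seen.extend(batch)
        if offset == 0:
            table.remove(drop_id)      # a student drops mid-export
        offset += per_page
    return seen


def cursor_walk(table: list[int], per_page: int, drop_id: int) -> list[int]:
    """Page through `table` by 'id greater than the last id seen'."""
    seen: list[int] = []
    cursor: int | None = None
    while batch := [r for r in table if cursor is None or r > cursor][:per_page]:
        seen.extend(batch)
        if cursor is None:
            table.remove(drop_id)      # the same mid-export drop
        cursor = batch[-1]
    return seen


ids = list(range(1, 11))               # toy table: enrollment ids 1..10
offset_result = offset_walk(list(ids), per_page=3, drop_id=2)
cursor_result = cursor_walk(list(ids), per_page=3, drop_id=2)

print(offset_result)   # id 4 is missing: it slid into a slot already read
print(cursor_result)   # no id is skipped
```

In the offset walk, the deletion shifts every later row one slot to the left, so id 4 lands in the range the loop has already consumed. The cursor walk asks for "rows after id 3" and is unaffected by what happened behind it.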

Prerequisites

Python 3.10+ (the script uses str | None union annotations, which need 3.10, and the := assignment expression from 3.8)
requests==2.31.* and urllib3==2.* installed in the active virtualenv
A Canvas API token scoped to url:GET|/api/v1/courses/*/enrollments (read-only is sufficient)
The course’s course_id and your institution’s API base, e.g. https://institution.instructure.com
One observed value of the token’s bucket quota (commonly ~700 cost units) so you can set a safe pacing floor
Upstream shape: each page is a JSON array of enrollment objects carrying id, user_id, course_id, enrollment_state, and a user.sis_user_id you must never persist in the clear

Treat the token itself as a credential, never a logged value, and rotate it through the Canvas API token refresh procedure if a long run risks expiry mid-stream. The SIS identifier is hashed, and the nested user object dropped, before any record crosses the FERPA compliance boundary; the numeric Canvas user_id and enrollment id are kept because the dedup key depends on them.

Step-by-Step Implementation

1. Anchor on a stable sort, not a page number

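Cursor correctness depends on the result set being totally ordered, so that every row has exactly one position relative to the cursor. For Canvas enrollments, order by (updated_at, id): updated_at gives the forward direction and id breaks ties so two rows sharing a timestamp can never be skipped. The id never changes for a row, but updated_at can, so a row modified mid-pull moves ahead of the cursor and may be emitted a second time rather than lost; the dedup in step 5 absorbs that. This composite anchor is what keeps the window deterministic regardless of how the roster mutates during the pull.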
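The role of the id tiebreak can be shown with a minimal in-memory sketch. The rows, timestamps, and function names below are hypothetical; the comparison logic is what a server-side keyset query would perform, not something the client sends to Canvas.

```python
from typing import NamedTuple


class Row(NamedTuple):
    updated_at: str   # ISO-8601 UTC strings in one format sort chronologically
    id: int


# Three rows share a timestamp, as happens after a bulk registrar import.
ROWS = sorted([
    Row("2024-09-01T10:00:00Z", 11),
    Row("2024-09-01T10:00:00Z", 12),
    Row("2024-09-01T10:00:00Z", 13),
    Row("2024-09-01T10:05:00Z", 14),
])


def page_after(rows, cursor, limit, use_tiebreak):
    """Return up to `limit` rows strictly after `cursor` (None = from the start)."""
    if cursor is None:
        remaining = rows
    elif use_tiebreak:
        remaining = [r for r in rows if (r.updated_at, r.id) > cursor]
    else:
        remaining = [r for r in rows if r.updated_at > cursor[0]]
    return remaining[:limit]


def walk(rows, limit, use_tiebreak):
    """Collect ids page by page, advancing the cursor to the last row seen."""
    out, cursor = [], None
    while batch := page_after(rows, cursor, limit, use_tiebreak):
        out.extend(r.id for r in batch)
        cursor = (batch[-1].updated_at, batch[-1].id)
    return out


print(walk(ROWS, 2, use_tiebreak=True))    # [11, 12, 13, 14]
print(walk(ROWS, 2, use_tiebreak=False))   # [11, 12, 14]
```

Without the tiebreak, the page boundary falls inside the group of equal timestamps and row 13 is never returned, because "later than 10:00:00" excludes it.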

2. Read the cursor from the `Link` header, fall back to a JSON token

Canvas, like most implementations of RFC 5988 (since superseded by RFC 8288), advertises the next page in a Link response header with rel="next". Parse that URL verbatim rather than reconstructing it; the server has already encoded the cursor state into the query string. Custom or self-hosted LMS deployments sometimes return a JSON next_cursor instead, so a resilient parser checks the header first and falls back to the body.

python

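import re

def parse_next(response) -> str | None:
    # Preferred source: the rel="next" URL in the Link header.
    link = response.headers.get("Link", "")
    if (m := re.search(r'<([^>]+)>;\s*rel="next"', link)):
        return m.group(1)            # opaque, server-encoded cursor URL
    # Fallback: a JSON token in the body (custom LMS deployments).
    try:
        body = response.json()
    except ValueError:
        return None                  # non-JSON body: treat as end of stream
    if not isinstance(body, dict):
        return None                  # a bare JSON array carries no token
    return body.get("next_cursor") or (body.get("pagination") or {}).get("next")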
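For orientation, this is roughly what a Link header looks like and how its relations can be pulled apart. The header values and the link_rels helper are hypothetical: real cursor parameters are opaque and differ by server, so treat the URLs as placeholders.

```python
import re

# Hypothetical header values; the query strings are placeholders, not real cursors.
SAMPLE_LINK = (
    '<https://lms.example.edu/api/v1/courses/42/enrollments?page=cur&per_page=100>; rel="current", '
    '<https://lms.example.edu/api/v1/courses/42/enrollments?page=nxt&per_page=100>; rel="next", '
    '<https://lms.example.edu/api/v1/courses/42/enrollments?page=first&per_page=100>; rel="first"'
)
LAST_PAGE_LINK = (
    '<https://lms.example.edu/api/v1/courses/42/enrollments?page=end&per_page=100>; rel="current", '
    '<https://lms.example.edu/api/v1/courses/42/enrollments?page=first&per_page=100>; rel="first"'
)


def link_rels(header: str) -> dict[str, str]:
    """Map each rel value in a Link header to its URL."""
    return {rel: url for url, rel in re.findall(r'<([^>]+)>;\s*rel="([^"]+)"', header)}


print(link_rels(SAMPLE_LINK).get("next"))      # the URL tagged rel="next"
print(link_rels(LAST_PAGE_LINK).get("next"))   # None: no next relation, so stop
```

The absence of a rel="next" entry is the end-of-stream signal, which is why the loop terminates on a None cursor rather than on an empty page.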

3. Pace against the budget headers, not a fixed sleep

Canvas meters by a refilling cost bucket, and — unlike most APIs — signals exhaustion with 403 Forbidden carrying a Rate Limit Exceeded body, not 429. After every response, read X-Rate-Limit-Remaining and yield only when it nears the floor. Keying off the reported budget rather than a clock is the same flow-control discipline the Canvas rate-limit cost model demands, and the 403 must be classified by body, never assumed to be a permissions error.
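The two decisions in this step, classifying a 403 and deciding whether to yield, can be sketched as pure functions that are easy to unit-test. The function names are hypothetical, and the default floor and pause simply mirror the constants used in the full script; tune them against the quota you observed.

```python
def is_throttled(status: int, body: str) -> bool:
    """True only for the budget signal: 403 plus the rate-limit message."""
    return status == 403 and "rate limit exceeded" in body.lower()


def pacing_delay(headers: dict[str, str], floor: float = 50.0,
                 yield_seconds: float = 2.0) -> float:
    """Seconds to pause after a successful response, based on reported budget."""
    try:
        remaining = float(headers.get("X-Rate-Limit-Remaining", ""))
    except ValueError:
        return 0.0                     # header absent or malformed: no signal
    return yield_seconds if remaining <= floor else 0.0


print(is_throttled(403, "403 Forbidden (Rate Limit Exceeded)"))   # True
print(is_throttled(403, '{"status": "unauthorized"}'))            # False
print(pacing_delay({"X-Rate-Limit-Remaining": "49.5"}))           # 2.0
print(pacing_delay({"X-Rate-Limit-Remaining": "650.0"}))          # 0.0
```

A plain dict is case-sensitive, unlike the headers mapping on a requests response, so the sketch assumes the exact header spelling shown.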

4. Stream records through a generator

Buffering every page of a 40,000-seat roster into a list spikes heap usage and risks an out-of-memory kill on a small worker. Yield records one at a time from a generator so the pipeline’s memory stays flat regardless of roster size; the consumer decides whether to batch-insert, upsert, or write straight to a staging file.
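One way a consumer might batch a stream without materializing it is to slice the iterator in fixed-size chunks. The helper name in_batches is an assumption for illustration; on Python 3.12+ the standard library's itertools.batched covers the same ground, yielding tuples instead of lists.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def in_batches(records: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items, holding only one batch in memory."""
    iterator = iter(records)
    while batch := list(islice(iterator, size)):
        yield batch


# A generator stands in for the roster stream here.
for batch in in_batches((n for n in range(7)), size=3):
    print(batch)
```

Each batch can be handed to a bulk insert and then discarded, so peak memory tracks the batch size rather than the roster size.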

5. Tokenize identifiers, then dedupe on a composite key

Hash sis_user_id to a stable SHA-256 digest before the record leaves the loop, so no raw student identifier ever reaches the warehouse — the field-level rule the Cross-LMS Student ID Mapping schema enforces. Because a concurrent re-enrollment can legitimately surface the same logical row twice across a long pull, deduplicate on the composite (course_id, user_id, enrollment_state) so the final set converges to one source of truth.
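The two operations in this step can be sketched in isolation. The helper names below are hypothetical, and the dedup is written as a generator so it composes with a streaming consumer; note that the set of seen keys still grows with the number of unique enrollments.

```python
import hashlib
from typing import Any, Iterable, Iterator


def sis_digest(sis_user_id: Any) -> str:
    """Stable SHA-256 hex digest of the SIS id's string form."""
    return hashlib.sha256(str(sis_user_id).encode()).hexdigest()


def unique_enrollments(records: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Pass through the first record seen for each composite key."""
    seen: set[tuple] = set()
    for rec in records:
        key = (rec.get("course_id"), rec.get("user_id"), rec.get("enrollment_state"))
        if key in seen:
            continue                   # same logical row surfaced twice
        seen.add(key)
        yield rec


# Hypothetical sample: one row repeated, plus a genuinely distinct state.
sample = [
    {"course_id": 1, "user_id": 7, "enrollment_state": "active"},
    {"course_id": 1, "user_id": 7, "enrollment_state": "active"},
    {"course_id": 1, "user_id": 7, "enrollment_state": "completed"},
]
print(len(list(unique_enrollments(sample))))
print(len(sis_digest("S-000123")))
```

Because the digest is taken over the string form, an id delivered as the integer 123 and as the string "123" hash identically, which keeps the token stable if the upstream type varies.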

Complete Runnable Code

A self-contained extractor: a resilient session, header-or-JSON cursor parsing, budget-aware pacing, FERPA tokenization, and composite-key dedup, all streamed through a generator.

python

import hashlib
import logging
import re
import time
from typing import Any, Iterator

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger(__name__)

REMAINING_FLOOR = 50.0   # cost units; yield below this so the bucket can refill
YIELD_SECONDS = 2.0


def build_session(token: str, max_retries: int = 3) -> requests.Session:
    """Resilient session: bearer auth + exponential retry on transient 5xx."""
    session = requests.Session()
    session.headers.update({
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
        "User-Agent": "edtech-roster-sync/1.0",
    })
    retry = Retry(
        total=max_retries,
        backoff_factor=1.5,
        status_forcelist=[500, 502, 503, 504],   # 403 budget signal handled in-loop
        allowed_methods=["GET"],
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session


def parse_next(response: requests.Response) -> str | None:
    """Next cursor from the rel=next Link header (RFC 5988), or a JSON fallback."""
    link = response.headers.get("Link", "")
    if (m := re.search(r'<([^>]+)>;\s*rel="next"', link)):
        return m.group(1)
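    try:
        body = response.json()
    except ValueError:
        return None
    if not isinstance(body, dict):
        return None   # Canvas returns a bare array; only dict bodies carry a token
    return body.get("next_cursor") or (body.get("pagination") or {}).get("next")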


def tokenize(record: dict[str, Any]) -> dict[str, Any]:
    """Replace the raw sis_user_id with a stable sha256 digest before persistence."""
    clean = dict(record)
    user = clean.get("user") or {}
    if (sis := user.get("sis_user_id")) is not None:
        clean["sis_user_id_hash"] = hashlib.sha256(str(sis).encode()).hexdigest()
    clean.pop("user", None)   # drop the nested PII blob entirely
    return clean


def stream_roster(base_url: str, token: str, course_id: int,
                  per_page: int = 100) -> Iterator[dict[str, Any]]:
    """Yield enrollment records page-by-page using a forward-only cursor."""
    session = build_session(token)
    url: str | None = f"{base_url.rstrip('/')}/api/v1/courses/{course_id}/enrollments"
    params: dict[str, Any] | None = {"per_page": per_page, "order_by": "updated_at"}

    while url:
        resp = session.get(url, params=params, timeout=30)

        if resp.status_code == 403 and "rate limit exceeded" in resp.text.lower():
            logger.warning("Canvas budget exhausted (403) — backing off 30s")
            time.sleep(30)
            continue                              # retry same cursor, no advance
        resp.raise_for_status()

        if float(resp.headers.get("X-Rate-Limit-Remaining", REMAINING_FLOOR + 1)) <= REMAINING_FLOOR:
            logger.info("Near budget floor — yielding %.1fs", YIELD_SECONDS)
            time.sleep(YIELD_SECONDS)

        for record in resp.json():
            yield tokenize(record)

        url = parse_next(resp)
        params = None                             # cursor URL already carries state


def collect_unique(base_url: str, token: str, course_id: int) -> list[dict[str, Any]]:
    """Drain the stream, deduping on (course_id, user_id, enrollment_state)."""
    seen: set[tuple] = set()
    out: list[dict[str, Any]] = []
    for rec in stream_roster(base_url, token, course_id):
        key = (rec.get("course_id"), rec.get("user_id"), rec.get("enrollment_state"))
        if key not in seen:
            seen.add(key)
            out.append(rec)
    logger.info("Collected %d unique enrollments for course %d", len(out), course_id)
    return out


if __name__ == "__main__":
    import os
    roster = collect_unique(
        os.environ["CANVAS_BASE_URL"],
        os.environ["CANVAS_ACCESS_TOKEN"],
        int(os.environ["CANVAS_COURSE_ID"]),
    )
    print(f"{len(roster)} unique enrollments")

Verification and Output Validation

Confirm the pull is complete and clean, not merely non-erroring:

Row count matches the registrar. Compare len(roster) against the course’s authoritative active-enrollment count. A cursor pull should match exactly; an offset pull is where the off-by-a-few drift shows up.
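No raw identifiers leak. Assert no record retains the nested user blob and that every sis_user_id_hash present is a 64-character digest. Rows whose user had no SIS ID carry no hash field at all, because tokenize only adds it when sis_user_id is set:

python

assert all("user" not in r for r in roster)
assert all(len(r["sis_user_id_hash"]) == 64 for r in roster if "sis_user_id_hash" in r)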

Composite key is unique. len({(r["course_id"], r["user_id"], r["enrollment_state"]) for r in roster}) == len(roster) — duplicates here mean the dedup key is too narrow for your enrollment model.
Forward monotonicity. If you retain updated_at, assert it is non-decreasing across the yielded order; a decrease signals the cursor reset or the sort anchor was ignored.

Troubleshooting

Roster stops at roughly 100 rows, a single page. The loop ended after page one because the Link header was never parsed. Confirm parse_next is reading response.headers["Link"] and that your reverse proxy is not stripping the header.
403 Forbidden aborts the run instead of backing off. A generic handler treated the budget signal as a hard permissions error. Canvas uses 403 + "Rate Limit Exceeded", not 429; classify by body as the script does, and keep 403 out of the Retry.status_forcelist.
401 Unauthorized partway through a long pull. The token expired mid-stream. Refresh it via the Canvas API token refresh flow and resume from the last successful cursor rather than restarting.
Duplicate enrollments in the warehouse despite dedup. Two genuinely distinct states (e.g. active and completed) share a user_id; that is correct data, not a bug — the composite key intentionally keeps both. Collapse to latest state downstream only if your model requires it.
MemoryError on the largest sections. Something is materializing the generator (a stray list(stream_roster(...))). Keep the consumer streaming, or batch-insert inside the for loop instead of accumulating.
Pull never terminates / cursor loops. The server returned a rel="next" pointing at the current page, usually a clock-skew tie on updated_at. The (updated_at, id) tiebreak prevents this; if a custom LMS lacks a stable secondary sort, fall back to its opaque next_cursor token and add a max-page guard. Log the stall through structured JSON logging of failed syncs so academic IT has an audit trail.

Pagination Strategies for Bulk Exports — the parent guide covering page-state models, checkpointing, and Link-header mechanics this procedure specializes.
Handling Canvas API Rate Limits — the cost-bucket model and 403 budget signal the pacing logic keys off.
Bypassing Canvas API throttling with queue workers — fans this single-roster pull out across thousands of courses without blowing the budget.
Cross-LMS Student ID Mapping — the canonical identifier schema the sis_user_id hash feeds into.
