How to Parse Canvas Gradebook JSON with Pandas

Institutional data teams ingest Canvas gradebook payloads to drive predictive analytics, trigger academic-intervention workflows, and synchronize downstream student information systems. The Canvas REST API delivers gradebook data as deeply nested, paginated JSON that rarely aligns with relational table structures, so a naive pd.json_normalize(response.json()) produces misaligned columns, silent type coercion, and truncated rosters. This guide walks through a deterministic procedure that flattens raw submission objects into a typed, analysis-ready DataFrame while honouring the FERPA compliance boundary before any data is serialized. It builds directly on the Canvas Gradebook Data Structure entity model and slots into the wider LMS Data Architecture & Schema Mapping discipline.

Prerequisites

Confirm each of these before running the procedure — the script assumes all of them are in place.

Python 3.10+ (the code uses the str | None union syntax in type annotations, which requires 3.10).
pandas==2.2.* and requests==2.32.* installed in the virtualenv. The nullable boolean and string dtypes arrived in pandas 1.0 and Float64 in 1.2 (Int64 is older), but 2.2 is assumed for datetime64[ns] parsing behaviour.
A Canvas access token with the url:GET|/api/v1/courses/:course_id/students/submissions scope, exported as CANVAS_TOKEN. For unattended jobs, generate it through automating Canvas API token refresh in Python rather than pasting a long-lived token.
The numeric course_id you intend to extract, and a teacher/admin enrollment on that course (the bulk submissions endpoint returns data only for courses in which the token’s user can grade).
Upstream data shape: an array of submission objects from GET /api/v1/courses/:id/students/submissions?include[]=assignment&include[]=user, where each record nests an assignment object (points_possible, grading_type, grading_period_id) and a user object (id, login_id).
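To make that shape concrete, here is a minimal sketch using a hypothetical two-record payload. The IDs, names, and scores are invented for illustration and are not real Canvas output; only the field names follow the shape described above. It shows how pd.json_normalize with sep="_" turns the nested assignment and user objects into flat column names.

```python
import pandas as pd

# Hypothetical payload: field names follow the shape described above,
# but every value here is made up for illustration.
sample = [
    {
        "id": 9001,
        "assignment_id": 11,
        "user_id": 501,
        "score": 8.5,
        "excused": False,
        "assignment": {"name": "Quiz 1", "points_possible": 10.0, "grading_type": "points"},
        "user": {"id": 501, "login_id": "student_a"},
    },
    {
        "id": 9002,
        "assignment_id": 11,
        "user_id": 502,
        "score": None,  # ungraded: must stay missing, not become 0.0
        "excused": False,
        "assignment": {"name": "Quiz 1", "points_possible": 10.0, "grading_type": "points"},
        "user": {"id": 502, "login_id": "student_b"},
    },
]

# Nested keys are joined to their parent with the separator.
flat = pd.json_normalize(sample, sep="_")
print(sorted(flat.columns))
```

One detail worth noticing: the nested user.id flattens to user_id, the same name as the top-level user_id field. Both carry the same value, so a single column results. The sample also omits grading_period_id, so no assignment_grading_period_id column is produced at all.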

Step-by-step implementation

1. Page through the submissions endpoint with the Link header, not an offset. Canvas paginates using RFC 5988 Link headers, so the client must follow rel="next" until it disappears; an offset loop or a “stop when the page is short” heuristic silently drops rows, because Canvas can return a short page mid-stream. This is the same contract documented in pagination strategies for bulk exports.

2. Set per_page=100 and request only the includes you need. Each include[]=assignment and include[]=user embeds the nested object you will flatten, which avoids a second round-trip per submission. Requesting submission_comments you do not use only inflates the payload and drags PII through your pipeline.

3. Back off on throttling. Canvas signals rate limiting with 403 Forbidden (Rate Limit Exceeded) — not only 429 — so a client that retries on 429 alone aborts the run as if it were an auth error. Pair this parser with Canvas API rate-limit handling for production pacing.

4. Flatten with json_normalize(sep="_") and a fixed separator. Passing the raw list to json_normalize(sep="_") collapses assignment.points_possible into assignment_points_possible. No record_prefix is needed here: that argument only applies when record_path is set, and this extraction flattens each submission as a single row. Fixing the separator keeps column names deterministic, so the schema below can select them by exact name.

5. Coerce to nullable dtypes, never default-coerce. Use Int64, Float64, and boolean (capitalized, nullable) so that an ungraded submission’s score: null stays <NA> instead of becoming 0.0. Conflating null with zero deflates every affected student’s grade — the single most common gradebook bug.

6. Branch on grading periods before filtering by them. When Multiple Grading Periods is disabled, assignment.grading_period_id is absent entirely; filtering on a missing column raises KeyError. Detect the column’s presence first, mirroring the rules in the weighted grade calculation engines guide.

7. Pseudonymize identifiers before serialization. Hash user_login_id with SHA-256 and drop the raw column before the DataFrame leaves memory. This preserves a stable join key for longitudinal tracking (the same surrogate-key idea behind Cross-LMS Student ID Mapping) while keeping direct identifiers out of downstream storage. Because login IDs are often guessable, a plain hash can be reversed by dictionary lookup; treat the hashed column as pseudonymized rather than anonymous.
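Step 5's null-versus-zero distinction is easiest to see with numbers. The sketch below uses three invented rows for one student (not real data) and compares a percentage computed over graded rows only with one computed after .fillna(0). The column names match the flattened schema; the percentage formula is a plain points-earned over points-possible ratio chosen for illustration, not Canvas's own grade calculation.

```python
import pandas as pd

# Three hypothetical submissions for one student; the third is ungraded.
df = pd.DataFrame(
    {
        "score": pd.array([8.0, 9.0, pd.NA], dtype="Float64"),
        "assignment_points_possible": pd.array([10.0, 10.0, 10.0], dtype="Float64"),
    }
)

# Correct: leave ungraded rows out of both numerator and denominator.
graded = df[df["score"].notna()]
pct_graded_only = float(
    graded["score"].sum() / graded["assignment_points_possible"].sum() * 100
)

# Buggy: treating <NA> as 0.0 counts the ungraded assignment against the student.
zeroed = df.assign(score=df["score"].fillna(0.0))
pct_zero_filled = float(
    zeroed["score"].sum() / zeroed["assignment_points_possible"].sum() * 100
)

print(round(pct_graded_only, 1))  # 85.0
print(round(pct_zero_filled, 1))  # 56.7
```

The same student drops from 85.0 to 56.7 purely because a missing score was read as a zero.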

Complete runnable code block

python

import os
import hashlib
import logging
import pandas as pd
import requests

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("canvas_gradebook_parser")

# Strict, nullable schema: <NA> is preserved so ungraded != zero.
SCHEMA: dict[str, str] = {
    "id": "Int64",
    "assignment_id": "Int64",
    "assignment_name": "string",
    "assignment_points_possible": "Float64",
    "assignment_grading_type": "string",
    "assignment_grading_period_id": "Int64",
    "user_id": "Int64",
    "user_login_id": "string",
    "score": "Float64",
    "grade": "string",
    "late": "boolean",
    "missing": "boolean",
    "excused": "boolean",
    "posted_at": "datetime64[ns]",
}


def _next_url(link_header: str | None) -> str | None:
    """Extract the rel="next" URL from a Canvas RFC 5988 Link header."""
    if not link_header:
        return None
    for part in link_header.split(","):
        if 'rel="next"' in part:
            return part.split(";")[0].strip("<> ")
    return None


def fetch_all_submissions(api_base: str, token: str, course_id: int) -> list[dict]:
    """Follow Link headers to retrieve every submission for a course."""
    url = f"{api_base}/api/v1/courses/{course_id}/students/submissions"
    headers = {"Authorization": f"Bearer {token}"}
    params = {"include[]": ["assignment", "user"], "student_ids[]": "all", "per_page": 100}
    rows: list[dict] = []
    while url:
        resp = requests.get(url, headers=headers, params=params, timeout=30)
        # Canvas throttles with 403 (Rate Limit Exceeded), not only 429.
        if resp.status_code in (403, 429):
            raise RuntimeError(f"Throttled ({resp.status_code}); add backoff before retrying.")
        resp.raise_for_status()
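        payload = resp.json()
        if not isinstance(payload, list):
            # An error object (dict) arrived instead of a page of submissions.
            raise RuntimeError(f"Unexpected response body: {payload!r}")
        rows.extend(payload)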
        url = _next_url(resp.headers.get("Link"))
        params = {}  # subsequent next-URLs already carry the query string
    logger.info("Retrieved %d submission records for course %d", len(rows), course_id)
    return rows


def normalize_gradebook(submissions: list[dict]) -> pd.DataFrame:
    """Flatten nested submission JSON and enforce the strict schema."""
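    if not submissions:
        # Return an empty but typed frame so downstream dtype checks still hold.
        return pd.DataFrame({col: pd.Series(dtype=dtype) for col, dtype in SCHEMA.items()})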
    df = pd.json_normalize(submissions, sep="_")
    # posted_at must be parsed to datetime before astype() can apply the dtype.
    if "posted_at" in df.columns:
        df["posted_at"] = pd.to_datetime(df["posted_at"], errors="coerce", utc=True).dt.tz_localize(None)
    df = df.astype({col: dtype for col, dtype in SCHEMA.items() if col in df.columns})
    # Keep only known columns, in deterministic order, to resist schema drift.
    return df[[c for c in SCHEMA if c in df.columns]]


def apply_ferpa_minimization(df: pd.DataFrame) -> pd.DataFrame:
    """Hash login IDs and drop raw PII before the frame leaves memory."""
    if "user_login_id" not in df.columns:
        # login_id may be absent from the payload; nothing to hash or drop.
        return df.copy()
    df = df.copy()
    # Runs on empty frames too, so the raw column never survives.
    df["user_login_id_hash"] = df["user_login_id"].map(
        lambda x: hashlib.sha256(str(x).encode()).hexdigest() if pd.notna(x) else pd.NA
    )
    return df.drop(columns=["user_login_id"])


def build_gradebook(api_base: str, token: str, course_id: int) -> pd.DataFrame:
    raw = fetch_all_submissions(api_base, token, course_id)
    return apply_ferpa_minimization(normalize_gradebook(raw))


if __name__ == "__main__":
    frame = build_gradebook(
        api_base="https://canvas.instructure.com",
        token=os.environ["CANVAS_TOKEN"],
        course_id=int(os.environ["COURSE_ID"]),
    )
    logger.info("DataFrame shape: %s", frame.shape)
    print(frame.dtypes)
    print(frame.head())

Verification and output validation

Confirm the parser produced a clean, FERPA-safe frame before handing it downstream:

Shape and grain. assert len(frame) == frame[["assignment_id", "user_id"]].drop_duplicates().shape[0] — every row is a unique (assignment_id, user_id) pair, the submission grain.
No raw identifiers leaked. assert "user_login_id" not in frame.columns and assert frame["user_login_id_hash"].str.len().dropna().eq(64).all() (SHA-256 hex is 64 chars).
Null preserved, not zeroed. Build a boolean mask that selects submissions you know are ungraded, then assert pd.isna(frame.loc[mask, "score"]).all(). A 0.0 here would be the classic ungraded-as-zero bug.
Dtypes are nullable. assert frame["score"].dtype.name == "Float64" and assert frame["excused"].dtype.name == "boolean"; lowercase float64/bool means the nullable dtype was not applied, so missing values are no longer represented as <NA>.
Grading period optional. If MGP is off, assignment_grading_period_id is normally absent from the frame. If the column is present but the course has no grading periods, assert frame["assignment_grading_period_id"].isna().all() should hold.
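The checks above can be gathered into one helper. This is a minimal sketch: the name validate_gradebook is hypothetical, the demo frame is invented to stand in for build_gradebook() output, and the checks simply restate the list above. It returns a list of problems instead of asserting, so a scheduled job can log every failure at once.

```python
import pandas as pd


def validate_gradebook(frame: pd.DataFrame) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems: list[str] = []
    # Grain: one row per (assignment_id, user_id) pair.
    if frame.duplicated(subset=["assignment_id", "user_id"]).any():
        problems.append("duplicate (assignment_id, user_id) rows")
    # No raw identifier may survive pseudonymization.
    if "user_login_id" in frame.columns:
        problems.append("raw user_login_id column is still present")
    # SHA-256 hex digests are always 64 characters long.
    if "user_login_id_hash" in frame.columns:
        lengths = frame["user_login_id_hash"].dropna().map(len)
        if not lengths.eq(64).all():
            problems.append("user_login_id_hash is not 64 hex characters")
    # Nullable dtypes keep ungraded scores as <NA>.
    if frame["score"].dtype.name != "Float64":
        problems.append(f"score dtype is {frame['score'].dtype.name}, expected Float64")
    if frame["excused"].dtype.name != "boolean":
        problems.append(f"excused dtype is {frame['excused'].dtype.name}, expected boolean")
    return problems


# Tiny hypothetical frame standing in for build_gradebook() output.
demo = pd.DataFrame(
    {
        "assignment_id": pd.array([11, 11], dtype="Int64"),
        "user_id": pd.array([501, 502], dtype="Int64"),
        "score": pd.array([8.5, pd.NA], dtype="Float64"),
        "excused": pd.array([False, False], dtype="boolean"),
        "user_login_id_hash": ["a" * 64, "b" * 64],
    }
)
print(validate_gradebook(demo))  # []
```

The grading-period column is deliberately not checked, since its absence is legitimate when Multiple Grading Periods is disabled.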

Troubleshooting

KeyError: 'assignment_grading_period_id' when filtering by period. MGP is disabled on the course, so Canvas omits the field entirely. Check "assignment_grading_period_id" in frame.columns before filtering; the schema-driven column selection above already tolerates its absence.
401 Unauthorized partway through pagination. A short-lived token expired mid-export, leaving a half-ingested roster. Refresh proactively rather than reactively — wire in automating Canvas API token refresh in Python and call get_valid_token() at the top of each page.
403 Forbidden (Rate Limit Exceeded) treated as an auth failure. Canvas throttles with 403, not only 429. The script raises a distinct RuntimeError for both; layer Canvas API rate-limit handling on top so the run pauses instead of aborting.
Truncated DataFrame (fewer rows than the gradebook shows). A loop that stops when a page returns fewer than per_page rows drops mid-stream short pages. Only stop when _next_url() returns None — the rel="next" link is the sole authority on completion.
Excused submissions scored as zero downstream. excused: true arrives with score: null. The nullable Float64 dtype keeps it <NA>; ensure any aggregation drops excused rows from both numerator and denominator rather than calling .fillna(0).
ValueError: cannot convert ... to Int64 on astype. Canvas occasionally returns IDs as strings or embeds an error object instead of a list when the token lacks grading rights. Inspect the parsed response for the failing page: an errors key where a list of submissions should be means the request was rejected, not paginated.
Columns named assignment.points_possible with literal dots. You passed sep="." (or omitted sep); pandas then keeps dotted names that break attribute access. Use sep="_" as shown so assignment_points_possible is a valid identifier.
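For the throttling entry above, a retry wrapper might look like the following. It is a sketch, not a production client: get_with_backoff, is_throttled, their parameters, and the doubling delay are all assumptions made for illustration. It relies only on the response exposing status_code and text, as requests.Response does. Because a 403 can also be a genuine permission failure, the sketch treats a 403 as throttling only when the body contains the phrase "Rate Limit Exceeded"; that body check is itself an assumption to confirm against your instance.

```python
import time
from typing import Callable, Protocol


class ResponseLike(Protocol):
    """The two attributes this sketch reads from a response object."""

    status_code: int
    text: str


def is_throttled(resp: ResponseLike) -> bool:
    """429 always counts; 403 counts only if the body looks like a rate-limit message."""
    if resp.status_code == 429:
        return True
    return resp.status_code == 403 and "Rate Limit Exceeded" in resp.text


def get_with_backoff(
    send: Callable[[], ResponseLike],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> ResponseLike:
    """Call send() until the response is not throttled, doubling the pause each time."""
    for attempt in range(max_attempts):
        resp = send()
        if not is_throttled(resp):
            return resp  # success or a non-throttle error; the caller decides
        if attempt < max_attempts - 1:
            sleep(base_delay * 2**attempt)  # 1s, 2s, 4s, ... with the defaults
    raise RuntimeError(f"Still throttled after {max_attempts} attempts")
```

Inside fetch_all_submissions, send could be lambda: requests.get(url, headers=headers, params=params, timeout=30), with raise_for_status() still called on whatever comes back. Passing sleep as a parameter keeps the wrapper testable without real delays.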

Canvas Gradebook Data Structure — the parent entity model, endpoints, and field definitions this parser consumes.
Cross-LMS Student ID Mapping — turning the hashed user_login_id into a stable surrogate key shared across platforms.
Weighted Grade Calculation Engines — reconstructing final grades from the normalized rows, including grading-period and drop-rule logic.
Pagination Strategies for Bulk Exports — the Link-header contract that the fetch_all_submissions loop depends on.
