LMS CSV Export Format Standards for Gradebook, Attendance, and Engagement Data

CSV remains the lowest-common-denominator interchange format between a Learning Management System and the analytics estate behind it. Every major platform can emit a flat file for gradebook, attendance, and engagement data, but each one applies its own column naming, delimiter, quoting, encoding, and date conventions to identical academic concepts. The result is that a “standard” CSV export is anything but standard across vendors. This page defines a canonical export schema for the three core data domains, documents how Canvas, Moodle, and Blackboard each produce these files and through which endpoints, and specifies the deterministic normalization rules and FERPA controls that turn heterogeneous exports into an analytics-ready contract — the front door to the broader LMS Data Architecture & Schema Mapping discipline.

Treating CSV ingestion as a strict ETL boundary rather than a passive file drop is the single design decision that prevents silent data corruption from propagating into downstream data marts. The rest of this page builds that boundary field by field.

Canonical Export Schema (Entity Model)

The contract begins with a platform-agnostic record shape for each domain. Vendor headers are mapped onto these canonical field names before any analytical processing, so downstream consumers never see vendor-specific nomenclature. Every record also carries a student_id_token rather than a raw student identifier — the tokenization rules are covered under Compliance Constraints and align with Cross-LMS Student ID Mapping so the same learner resolves to a stable key across platforms.

Gradebook record

Canonical field	Type	Notes
`student_id_token`	string (sha256 hex)	Tokenized identifier; never the raw SIS key
`course_id`	string	Stable course identifier
`assignment_id`	string	Stable assignment identifier
`assignment_group`	string, nullable	Weighting bucket; null when grading is unweighted
`points_earned`	float, nullable	Raw points; null distinguishes ungraded from a true zero
`points_possible`	float, nullable	Denominator for percent normalization
`score_normalized`	float (0.0–1.0)	Derived: `points_earned / points_possible`, clamped
`submission_status`	enum	`graded`, `submitted`, `missing`, `excused`, `late`
`graded_at`	timestamp (UTC, ISO 8601)	Null until graded

The hard rule for grades is that a missing value and a zero are not interchangeable. An excused submission must remain null in points_earned, because folding it to 0.0 corrupts every weighted average computed downstream by the weighted grade calculation engines. The score_normalized projection standardizes percentages, letter grades, and raw point totals onto a single decimal scale:

$\text{score\_normalized} = \operatorname{clamp}\!\left(\frac{\text{points\_earned}}{\text{points\_possible}},\ 0.0,\ 1.0\right)$

Letter-grade and pass/fail columns are resolved through a controlled-vocabulary lookup before this formula applies; the per-platform grade scales for that lookup come from the Canvas Gradebook Data Structure and the equivalent Moodle scale definitions.

Attendance record

Canonical field	Type	Notes
`student_id_token`	string (sha256 hex)	Tokenized identifier
`course_id`	string	Foreign key to course
`session_date`	date (ISO 8601)	Calendar date of the session, institution-local
`status_code`	enum	`present`, `absent`, `late`, `excused`, `remote`
`duration_minutes`	integer, nullable	Engaged minutes where the platform reports them
`recorded_at`	timestamp (UTC)	When the record was written

Attendance is the domain most damaged by timezone drift: a session logged at 2026-01-14T23:30:00 in a UTC export can land on the wrong calendar day once localized. The canonical schema keeps session_date as an institution-local calendar date and recorded_at as a true UTC instant, so the two concerns never collide. Status vocabularies vary widely between vendors and are collapsed through the controlled mapping defined by the attendance state normalization rules.

Engagement record

Canonical field	Type	Notes
`student_id_token`	string (sha256 hex)	Tokenized identifier
`course_id`	string	Foreign key to course
`resource_url`	string	Page, file, or activity touched
`interaction_type`	enum	`page_view`, `discussion_post`, `video_watch`, `submission`
`dwell_seconds`	integer, nullable	Time-on-task where instrumented
`event_at`	timestamp (UTC)	Event instant

Engagement telemetry is high-volume and noisy. The canonical record deliberately omits raw click streams in favor of typed interactions, and aggregation windows are anchored to the academic calendar (term, week-of-term) rather than arbitrary rolling periods, so that “low engagement” means the same thing in week three of every term.

Export Endpoints and Request Patterns

CSV files arrive through two channels: an interactive/REST export endpoint that the pipeline triggers, or a scheduled drop on SFTP. The mechanics differ sharply per vendor.

Canvas

Canvas gradebook CSVs are produced asynchronously. A POST /api/v1/courses/:course_id/gradebook_csv enqueues a report and returns a progress object; the file is collected by polling the progress_url until workflow_state is completed, then downloading the attachment.url. Because this is asynchronous, the export integrates naturally with the async polling for grade syncs pattern rather than a blocking request. Auth is a bearer token (Authorization: Bearer <token>), and large courses still page their underlying submission data, so the pagination strategies for bulk exports apply when reconstructing the gradebook from the API instead of the CSV. Canvas exports are UTF-8, comma-delimited, and prepend a “Points Possible” pseudo-row beneath the header that must be detected and stripped.

Moodle

Moodle exposes the grader report through the web UI and through web service functions such as gradereport_user_get_grades_table. CSV downloads route through /grade/export/csv/, authenticated with a web-service token passed as wstoken. Moodle exports frequently embed nested user metadata and HTML entities inside flat cells, and they honor the site’s configured csvdelimiter (which can be a comma, semicolon, or tab depending on locale). Correctly flattening Moodle’s hierarchy without losing course or grade-category context requires the entity definitions from the Moodle Course & User Schema.

Blackboard

Blackboard Learn delivers Grade Center data through the REST Gradebook endpoints (for example GET /learn/api/public/v2/courses/:courseId/gradebook/columns and the per-column users resource) and, for bulk institutional feeds, through Data Integration flat-file drops. REST access uses an OAuth2 bearer token obtained from /learn/api/public/v1/oauth2/token. The endpoint architecture and its envelope conventions are documented in Blackboard REST API Architecture. Blackboard flat files are often Windows-1252 encoded with CRLF line endings — an encoding trap covered below.

All three channels are wrapped by the same retrieval shell: pull via SFTP or REST, validate against the contract, and quarantine malformed files rather than halting the pipeline. The HTTP plumbing for the REST paths is the same Python requests session for LMS APIs used elsewhere in the estate, including its rate-limit handling.

Normalization and Transformation Logic

Normalization runs in a fixed order so that failures are deterministic and debuggable.

Encoding and delimiter detection. Sniff the byte-order mark and a sample of the file before parsing. Default to UTF-8 but fall back to Windows-1252 for Blackboard feeds; detect the delimiter rather than assuming a comma, because Moodle locale exports are routinely semicolon-delimited.
Header normalization. Lower-case, strip, and map each vendor header to a canonical field through a configuration-driven dictionary. A missing required header raises immediately; an unknown extra header is logged and dropped. This is the same canonical dictionary expanded in Standardizing LMS CSV Headers for Data Lakes.
Row pre-filtering. Strip vendor artifact rows — Canvas’s “Points Possible” pseudo-row, Moodle’s category subtotal rows, and any test-student rows.
Type coercion and grade reconstruction. Cast numerics, resolve letter/percent/points grading types to score_normalized, and reconstruct assignment_group weights where the export flattened them. Coercion is explicit: an unparseable grade becomes null plus a recorded error, never a silent 0.0.
Temporal normalization. Convert every timestamp to UTC ISO 8601 and derive institution-local session_date separately.
Tokenization. Replace raw identifiers with student_id_token at the staging boundary, before anything is persisted.

Header mapping must precede row parsing so that a malformed file is rejected on its cheapest property — its shape — before the pipeline spends effort on millions of rows.

Compliance Constraints

Student records pulled from LMS exports are education records under FERPA, so the transformation layer is also a compliance boundary. Three field-level rules apply to every canonical record:

Tokenize before persistence. Raw SIS keys and platform user IDs must be replaced with a salted student_id_token before any row is written to staging. Names and email addresses are dropped entirely unless a downstream consumer has a documented, access-controlled need.
Minimize. Engagement and attendance exports often carry far more PII than analytics requires (IP addresses, free-text comments). Strip these at ingestion rather than masking them later.
Audit. Append ingested_at, source_file, and source_platform columns so every canonical row is traceable to the export that produced it, satisfying audit-log requirements without exposing learner data.

The tokenization function is deterministic so the same learner maps to the same token across files and platforms — the join key that makes cross-platform analytics possible — but it is keyed with an institutional salt so tokens cannot be reversed outside the governed environment.

Reference Python Implementation

The following normalizes a single vendor gradebook export into the canonical schema. It detects encoding and delimiter, strips vendor artifact rows, coerces grade types, tokenizes identifiers, and stamps audit columns. It uses pandas, but the same contract ports cleanly to Polars for bulk runs.

python

import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd

# Institutional salt would come from a secrets manager, not source control.
TOKEN_SALT = "rotate-me-from-vault"

CANVAS_HEADER_MAP = {
    "student id": "raw_student_id",
    "section": "course_id",
    "assignment": "assignment_id",
    "assignment group": "assignment_group",
    "score": "points_earned",
    "points possible": "points_possible",
}
REQUIRED = {"raw_student_id", "course_id", "assignment_id", "points_earned"}


def tokenize(raw_id: str) -> str:
    """Deterministic, salted, non-reversible student token (FERPA-safe)."""
    digest = hashlib.sha256(f"{TOKEN_SALT}:{raw_id}".encode()).hexdigest()
    return digest  # e.g. sha256 of 'student_id'


def sniff_dialect(path: Path) -> tuple[str, str]:
    raw = path.read_bytes()
    encoding = "utf-8-sig" if raw[:3] == b"\xef\xbb\xbf" else "utf-8"
    try:
        sample = raw.decode(encoding, errors="strict")[:4096]
    except UnicodeDecodeError:
        encoding = "cp1252"  # Blackboard flat-file fallback
        sample = raw.decode(encoding, errors="replace")[:4096]
    delimiter = csv.Sniffer().sniff(sample, delimiters=",;\t").delimiter
    return encoding, delimiter


def normalize_gradebook(path: Path) -> pd.DataFrame:
    encoding, delimiter = sniff_dialect(path)
    df = pd.read_csv(path, encoding=encoding, sep=delimiter, dtype=str)

    df.columns = [c.strip().lower() for c in df.columns]
    df = df.rename(columns={k: v for k, v in CANVAS_HEADER_MAP.items() if k in df.columns})

    missing = REQUIRED - set(df.columns)
    if missing:
        raise ValueError(f"{path.name}: missing required columns {sorted(missing)}")

    # Drop Canvas's "Points Possible" pseudo-row and blank student rows.
    df = df[df["raw_student_id"].notna() & (df["raw_student_id"].str.strip() != "")]

    # Explicit numeric coercion: unparseable -> NaN, never a silent zero.
    for col in ("points_earned", "points_possible"):
        df[col] = pd.to_numeric(df.get(col), errors="coerce")

    df["score_normalized"] = (df["points_earned"] / df["points_possible"]).clip(0.0, 1.0)
    df["student_id_token"] = df["raw_student_id"].map(tokenize)
    df = df.drop(columns=["raw_student_id"])

    # Audit columns for traceability without exposing learner data.
    df["source_file"] = path.name
    df["source_platform"] = "canvas"
    df["ingested_at"] = datetime.now(timezone.utc).isoformat()

    canonical = [
        "student_id_token", "course_id", "assignment_id", "assignment_group",
        "points_earned", "points_possible", "score_normalized",
        "source_file", "source_platform", "ingested_at",
    ]
    return df.reindex(columns=canonical)


if __name__ == "__main__":
    frame = normalize_gradebook(Path("canvas_export.csv"))
    assert frame["student_id_token"].str.len().eq(64).all()  # all tokenized
    print(frame.shape, frame.dtypes["score_normalized"])

The final assertion is the cheapest possible regression test: if any row still carries a non-tokenized identifier, the token-length check fails before the data ever leaves the process.

Failure Modes and Edge Cases

Missing vs. zero grades. A blank score cell is ungraded, not a zero. Coercing it with fillna(0) silently destroys weighted averages. Always coerce to null and let the grading engine decide.
Excused submissions. Canvas marks excused work with EX and Moodle with a dash glyph; both must map to excused status and a null score, not a parse error or a zero.
Delimiter drift. A Moodle site reconfigured to a semicolon delimiter mid-term turns a comma-split parser’s every row into a single column. Sniff the delimiter per file rather than assuming it.
Encoding artifacts. Blackboard’s Windows-1252 flat files render smart quotes and accented names as mojibake under a strict UTF-8 read. Detect and fall back rather than errors="ignore", which silently deletes characters from student names.
Header drift on version bumps. An LMS upgrade can rename a column (“Section” → “Course Section”). Because header mapping runs first and raises on missing required fields, this surfaces as a clean quarantine event instead of a corrupt load — the schema-drift detection signal for the pipeline.
Pseudo-rows and subtotals. Canvas’s “Points Possible” row and Moodle’s category-subtotal rows are not students. Filtering them after numeric coercion inflates row counts and skews cohort statistics.

LMS Data Architecture & Schema Mapping — the parent discipline this export contract feeds into.
Standardizing LMS CSV Headers for Data Lakes — the production header-mapping pipeline that implements this contract end to end.
Canvas Gradebook Data Structure — the relational source behind Canvas gradebook CSVs.
Moodle Course & User Schema — how to flatten Moodle’s hierarchy without losing context.
Cross-LMS Student ID Mapping — resolving one learner to a stable token across platforms.
Attendance State Normalization Rules — the controlled vocabulary for status_code.

Part of: LMS Data Architecture & Schema Mapping

Explore deeper

Related in this section