Standardizing LMS CSV Headers for Data Lakes

This is a single, precise task: take a raw gradebook, attendance, or engagement CSV exported by Canvas, Moodle, or Blackboard and turn it into one canonical, FERPA-safe Parquet file the data lake can ingest without per-vendor special cases. Each platform applies its own column names, casing, delimiters, encodings, and date formats to identical academic concepts, so a flat file drop that “works” for one term silently breaks the next. The procedure below builds a deterministic header-mapping boundary that resolves vendor headers to a fixed schema, tokenizes the student identifier before anything is written, and coerces types so downstream joins never guess.

It implements the header-mapping stage of the LMS CSV export format standards contract, sits inside the broader LMS Data Architecture & Schema Mapping discipline, and emits the same student_id_token surrogate defined by Cross-LMS Student ID Mapping so the same learner resolves to a stable key across every source.

Prerequisites

Python 3.10 or newer (the type annotations use built-in generics such as dict[str, list[str]]).
pandas==2.2.* and pyarrow==16.* installed (pip install "pandas==2.2.*" "pyarrow==16.*").
A salt for tokenization stored as an environment variable (LMS_TOKEN_SALT), never hard-coded — the same salt must be reused across every run so tokens stay join-stable.
A raw export on disk: a gradebook, attendance, or engagement CSV from Canvas (Export Entire Gradebook), Moodle (Grader report → Export → Plain text file), or Blackboard (Grade Center → Work Offline → Download).
Write access to the lake’s bronze prefix (s3://, abfss://, or gs://), or a local path while testing.
The upstream shape: a single header row, one record per row, a student-identifier column under some vendor name, and at minimum a course identifier.

Step-by-step implementation

Each step states the why before the code, because the failure modes here are silent — a wrong default does not raise, it just poisons a metric three layers downstream.

1. Pin the canonical header map. A platform-agnostic dictionary mapping each canonical field to its known vendor aliases is the contract every consumer reads against; without it, a Canvas sis_user_id and a Moodle idnumber become two different columns.

python

CANONICAL_SCHEMA: dict[str, list[str]] = {
    "student_id": ["student_id", "sis_user_id", "user_id", "idnumber", "username"],
    "course_id": ["course_id", "sis_course_id", "context_id", "courseid"],
    "assignment_id": ["assignment_id", "item_id", "grade_item_id", "column_id"],
    "points_possible": ["points_possible", "max_points", "points_total", "max_score"],
    "points_earned": ["points_earned", "score", "grade", "points_awarded"],
    "submission_status": ["submission_status", "status", "submitted"],
    "submitted_at": ["submitted_at", "submission_date", "date_submitted"],
    "graded_at": ["graded_at", "grading_date", "date_graded"],
}
REQUIRED_COLUMNS: frozenset[str] = frozenset({"student_id", "course_id"})
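
Because the lookup in step 4 is keyed by alias, an alias listed under two canonical fields would make the mapping ambiguous. The following is a minimal sketch of a guard that could run whenever the table is edited. The helper name find_alias_collisions is hypothetical, and the schema shown is a trimmed copy used only for illustration.

```python
from collections import defaultdict

# Trimmed copy of the alias table, for illustration only.
CANONICAL_SCHEMA: dict[str, list[str]] = {
    "student_id": ["student_id", "sis_user_id", "user_id", "idnumber", "username"],
    "course_id": ["course_id", "sis_course_id", "context_id", "courseid"],
    "points_earned": ["points_earned", "score", "grade", "points_awarded"],
}


def find_alias_collisions(schema: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return every alias that is listed under more than one canonical field."""
    owners: dict[str, list[str]] = defaultdict(list)
    for canonical, aliases in schema.items():
        for alias in aliases:
            owners[alias].append(canonical)
    return {alias: fields for alias, fields in owners.items() if len(fields) > 1}


collisions = find_alias_collisions(CANONICAL_SCHEMA)
```

An empty result means every alias belongs to exactly one canonical field; anything else names the alias and the fields competing for it.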

2. Read with encoding fallback, everything as strings. Legacy LMS exports arrive as utf-8-sig, latin-1, or cp1252; reading as dtype=str defers all type decisions to the coercion step so pandas never guesses an int column that later gains a letter grade.

python

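# Strict codecs first: latin-1 decodes any byte sequence, so it must come last
# or cp1252 is never tried. utf-8-sig also reads UTF-8 files that have no BOM.
for enc in ("utf-8-sig", "cp1252", "latin-1"):
    try:
        df = pd.read_csv(path, encoding=enc, dtype=str, low_memory=False)
        break
    except UnicodeDecodeError:
        continue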
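
One property of these codecs matters for the order of any fallback list: latin-1 assigns a character to every byte value, so it never raises UnicodeDecodeError, while UTF-8 and cp1252 can both reject input. That is why latin-1 belongs last. The sketch below illustrates the behavior on in-memory bytes; first_decodable is a hypothetical helper and the sample strings are made up.

```python
def first_decodable(raw: bytes, encodings: tuple[str, ...]) -> str:
    """Return the name of the first encoding that decodes raw without error."""
    for enc in encodings:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no supported encoding decoded the input")


utf8_bytes = "Jos\u00e9".encode("utf-8")      # valid UTF-8
cp1252_bytes = "Jos\u00e9".encode("cp1252")   # b"Jos\xe9", which is invalid as UTF-8
odd_bytes = b"Jos\x81"                        # 0x81 is undefined in Python's cp1252 codec

order = ("utf-8-sig", "cp1252", "latin-1")
results = [first_decodable(b, order) for b in (utf8_bytes, cp1252_bytes, odd_bytes)]
```

A successful latin-1 decode only means the bytes were accepted, not that the resulting characters are the ones the vendor intended, so a file that reaches the last codec deserves a closer look.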

3. Sanitize the raw headers. Vendors ship trailing spaces, mixed case, and "Points Possible"-style labels; normalizing to lower-snake before lookup means the alias table only has to list one spelling per concept.

python

df.columns = [str(h).strip().lower().replace(" ", "_") for h in df.columns]
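
To make the effect of that expression concrete, the sketch below applies it to a few made-up header strings (not taken from a specific export). It handles surrounding whitespace, case, and spaces, and leaves other punctuation such as hyphens or parentheses untouched.

```python
# Illustrative raw headers; real exports will differ.
raw_headers = ["  Points Possible ", "SIS User ID", "Course-ID", "Grade (%)"]

# Same expression as in the step above, applied to a plain list.
sanitized = [str(h).strip().lower().replace(" ", "_") for h in raw_headers]
```

Labels like the last two would not match a lower-snake alias after sanitizing, so they either need their own entry in the alias table or a broader sanitizer.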

4. Resolve vendor headers to canonical names. Building the rename map from the alias table — first match wins — is what collapses three vendors onto one schema and keeps downstream SQL identical regardless of source.

python

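# first alias present wins, so each canonical field gets exactly one source column
rename_map: dict[str, str] = {}
for canonical, aliases in CANONICAL_SCHEMA.items():
    match = next((a for a in aliases if a in df.columns), None)
    if match is not None:
        rename_map[match] = canonical
df = df.rename(columns=rename_map)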
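
Whatever builds the rename map, it is worth confirming that no two source columns ended up under one canonical name, because a duplicated label makes df["student_id"] return a DataFrame instead of a Series. The following is a small sketch of such a guard; duplicated_columns is a hypothetical helper and the toy frame is invented for the example.

```python
import pandas as pd


def duplicated_columns(df: pd.DataFrame) -> list[str]:
    """Return column names that appear more than once, in first-seen order."""
    cols = df.columns
    return list(dict.fromkeys(cols[cols.duplicated(keep=False)]))


# Toy frame: two vendor aliases of the same concept in one export.
df = pd.DataFrame({"sis_user_id": ["s1"], "user_id": ["101"], "course_id": ["BIO101"]})

# Renaming both aliases to one canonical name produces a duplicated label.
renamed = df.rename(columns={"sis_user_id": "student_id", "user_id": "student_id"})
dupes = duplicated_columns(renamed)
```

Raising when the list is non-empty keeps the failure next to its cause rather than in the tokenization step.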

5. Validate required columns before doing any work. Raising immediately on a missing student_id or course_id turns a corrupt partition into a loud, debuggable failure instead of a half-ingested course nobody notices until reporting season.

python

missing = REQUIRED_COLUMNS - set(df.columns)
if missing:
    raise ValueError(f"missing required canonical columns: {sorted(missing)}")

6. Tokenize the identifier at the FERPA boundary. The raw student key is PII and must never land in the lake; replacing student_id with a salted SHA-256 student_id_token upholds the same FERPA compliance boundary the rest of the export contract enforces, and a fixed salt keeps the token join-stable across runs.

python

salt = os.environ["LMS_TOKEN_SALT"].encode()
# strip stray whitespace first so the same student always hashes to the same token
df["student_id_token"] = df["student_id"].map(
    lambda v: hashlib.sha256(salt + str(v).strip().encode()).hexdigest()
)
df = df.drop(columns=["student_id"])
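
The join-stability claim can be checked in isolation. The sketch below wraps the same salted SHA-256 construction in a hypothetical tokenize helper, with whitespace stripping as the Troubleshooting section suggests. The salt literal is an assumption for the example only; real runs read LMS_TOKEN_SALT from the environment.

```python
import hashlib


def tokenize(raw_id: str, salt: bytes) -> str:
    """Salted SHA-256 of the stripped identifier, as 64 lowercase hex characters."""
    return hashlib.sha256(salt + raw_id.strip().encode()).hexdigest()


salt = b"example-salt-not-for-production"  # assumption: stands in for LMS_TOKEN_SALT
token_a = tokenize("S1024", salt)
token_b = tokenize(" S1024 ", salt)          # stray whitespace, same student
token_c = tokenize("S1024", b"another-salt")  # same student, different salt
```

Here token_a and token_b are identical, while token_c differs: changing the salt silently re-keys every learner, which is why one salt has to be pinned for the lifetime of the dataset.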

7. Coerce types and derive the normalized score. Numeric and datetime columns must be real types for partition pruning and arithmetic to work; deriving score_normalized here gives every downstream model one decimal scale instead of percent-versus-points ambiguity. The projection is

$\text{score\_normalized} = \operatorname{clamp}\!\left(\frac{\text{points\_earned}}{\text{points\_possible}},\ 0.0,\ 1.0\right)$

with the hard rule that a null earned value stays null — an ungraded or excused submission is not a zero, the same invariant the weighted grade calculation engines depend on.
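
A minimal sketch of this step on a toy frame, assuming the same column names and the pandas calls used in the complete script further down. The sample values are invented to show clamping, an ungraded row, and a letter grade that fails numeric coercion.

```python
import pandas as pd

df = pd.DataFrame({
    "points_earned":   ["45", "110", None, "A-"],
    "points_possible": ["50", "100", "50", "50"],
    "submitted_at":    ["2024-03-01T09:30:00Z", "2024-03-02 17:05:00", None, "not a date"],
})

# Everything arrives as strings; values that cannot be parsed become NaN / NaT.
for col in ("points_earned", "points_possible"):
    df[col] = pd.to_numeric(df[col], errors="coerce")
df["submitted_at"] = pd.to_datetime(
    df["submitted_at"], errors="coerce", format="mixed", utc=True
)

# clamp(points_earned / points_possible, 0.0, 1.0); NaN propagates through clip.
df["score_normalized"] = (df["points_earned"] / df["points_possible"]).clip(0.0, 1.0)
```

The second row earns more than the maximum and is clamped to 1.0, while the ungraded row and the letter-grade row both stay null instead of becoming zero.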

8. Drop unmapped columns and write partitioned Parquet. Stripping columns outside the contract enforces data minimization, and partitioning by course_id keeps lake query cost proportional to what a dashboard actually scans.

Complete runnable script

This is the whole procedure end to end — point it at one CSV and it writes a canonical, tokenized Parquet partitioned by course.

python

import hashlib
import logging
import os
from pathlib import Path

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")

CANONICAL_SCHEMA: dict[str, list[str]] = {
    "student_id": ["student_id", "sis_user_id", "user_id", "idnumber", "username"],
    "course_id": ["course_id", "sis_course_id", "context_id", "courseid"],
    "assignment_id": ["assignment_id", "item_id", "grade_item_id", "column_id"],
    "points_possible": ["points_possible", "max_points", "points_total", "max_score"],
    "points_earned": ["points_earned", "score", "grade", "points_awarded"],
    "submission_status": ["submission_status", "status", "submitted"],
    "submitted_at": ["submitted_at", "submission_date", "date_submitted"],
    "graded_at": ["graded_at", "grading_date", "date_graded"],
}
REQUIRED_COLUMNS: frozenset[str] = frozenset({"student_id", "course_id"})
NUMERIC_COLS = ("points_possible", "points_earned")
DATETIME_COLS = ("submitted_at", "graded_at")


def _read_with_fallback(path: Path) -> pd.DataFrame:
    """Read a legacy LMS CSV, trying strict encodings before the latin-1 catch-all."""
    # latin-1 decodes any byte sequence, so it must come last or cp1252 is never tried
    for enc in ("utf-8-sig", "cp1252", "latin-1"):
        try:
            df = pd.read_csv(path, encoding=enc, dtype=str, low_memory=False)
            logging.info("parsed %s as %s", path.name, enc)
            return df
        except UnicodeDecodeError:
            continue
    raise ValueError(f"no supported encoding parsed {path.name}")


def normalize_lms_csv(input_path: Path, output_dir: Path) -> pd.DataFrame:
    """Map vendor headers to the canonical schema, tokenize PII, coerce, and write Parquet."""
    df = _read_with_fallback(input_path)

    # 3. sanitize raw headers -> lower-snake
    df.columns = [str(h).strip().lower().replace(" ", "_") for h in df.columns]

    # 4. vendor alias -> canonical (first alias present wins, one source column per field)
    rename_map: dict[str, str] = {}
    for canonical, aliases in CANONICAL_SCHEMA.items():
        match = next((a for a in aliases if a in df.columns), None)
        if match is not None:
            rename_map[match] = canonical
    df = df.rename(columns=rename_map)

    # 5. fail loud on missing required fields
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"missing required canonical columns: {sorted(missing)}")

    # 6. FERPA boundary: replace raw id with a salted, stable token
    salt = os.environ["LMS_TOKEN_SALT"].encode()
    # strip stray whitespace first so the same student always hashes to the same token
    df["student_id_token"] = df["student_id"].map(
        lambda v: hashlib.sha256(salt + str(v).strip().encode()).hexdigest()
    )
    df = df.drop(columns=["student_id"])

    # 8a. data minimization: keep only contract columns
    allowed = (set(CANONICAL_SCHEMA) - {"student_id"}) | {"student_id_token"}
    dropped = [c for c in df.columns if c not in allowed]
    if dropped:
        logging.warning("dropping unmapped columns: %s", dropped)
        df = df.drop(columns=dropped)

    # 7. type coercion + normalized-score projection
    for col in NUMERIC_COLS:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")
    for col in DATETIME_COLS:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce", format="mixed", utc=True)
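    if {"points_earned", "points_possible"} <= set(df.columns):
        # a zero or missing denominator yields null rather than inf clipped to 1.0
        possible = df["points_possible"].where(df["points_possible"] > 0)
        df["score_normalized"] = (df["points_earned"] / possible).clip(0.0, 1.0)
        # null stays null: ungraded != zero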

    # 8b. partitioned write for cheap downstream scans
    output_dir.mkdir(parents=True, exist_ok=True)
    df.to_parquet(output_dir, index=False, engine="pyarrow", partition_cols=["course_id"])
    logging.info("wrote %d rows to %s", len(df), output_dir)
    return df


if __name__ == "__main__":
    normalize_lms_csv(
        Path("exports/canvas_gradebook_q3.csv"),
        Path("lake/bronze/gradebook/"),
    )

Verification and output validation

Confirm the normalization actually happened rather than trusting that it ran:

python

out = normalize_lms_csv(Path("exports/canvas_gradebook_q3.csv"), Path("lake/bronze/gradebook/"))

# 1. raw identifier is gone; token is present and 64 hex chars
assert "student_id" not in out.columns
assert out["student_id_token"].str.fullmatch(r"[0-9a-f]{64}").all()

# 2. types are real, not strings
assert out["points_earned"].dtype.kind == "f"      # float
assert str(out["submitted_at"].dtype).startswith("datetime64")

# 3. normalized score is bounded wherever it exists, and preserves nulls
scored = out["score_normalized"].dropna()  # graded rows can still lack a usable points_possible
assert scored.between(0.0, 1.0).all()
assert out.loc[out["points_earned"].isna(), "score_normalized"].isna().all()

# 4. only contract columns survived
assert set(out.columns) <= {
    "student_id_token", "course_id", "assignment_id", "points_possible",
    "points_earned", "submission_status", "submitted_at", "graded_at", "score_normalized",
}

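Re-reading the Parquet with pd.read_parquet("lake/bronze/gradebook/") should return the same row count and surface course_id as a partition column, provided the output directory was empty before the run: with pyarrow's default settings a partitioned write adds new files to an existing dataset instead of replacing it, so repeated runs into the same directory accumulate rows. A token computed for the same student in a later run must match byte-for-byte; that is the fastest end-to-end check that your salt is stable.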

Troubleshooting

KeyError: 'LMS_TOKEN_SALT' — the tokenization salt is not exported. Set LMS_TOKEN_SALT in the job environment before invocation; a missing salt is a hard stop by design so PII never lands unsalted.
ValueError: missing required canonical columns: ['student_id'] — the export’s identifier column is under an alias the table doesn’t know (e.g. Blackboard’s Username plus a Child Course ID quirk). Add the real header to the relevant alias list in CANONICAL_SCHEMA and re-run.
Tokens differ between runs for the same student — the salt changed, or whitespace leaked into the raw id. Pin one salt for the dataset’s lifetime and .strip() the value before hashing; unstable tokens silently break every cross-run and cross-platform join.
points_earned is all NaN after coercion — the source stored letter grades or "92%" strings in that column. Resolve the controlled vocabulary to points before pd.to_numeric, mirroring the grade-scale lookup from the Canvas Gradebook Data Structure.
pyarrow.lib.ArrowInvalid on write — a coerced column is still mixed-type because an unmapped column slipped through. Confirm the data-minimization drop ran before the write, and that every retained column appears in NUMERIC_COLS/DATETIME_COLS or is intentionally string.
Schema drift after a vendor update — a Canvas or Moodle release renames a column and ingestion starts dropping it as “unmapped.” Watch the dropping unmapped columns warning in logs as a drift signal, version the alias table, and add the new spelling rather than silently losing the field.
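
The drift signal from the last item can also be computed before any data is dropped. The following is a sketch under assumptions: unmapped_headers is a hypothetical helper, the schema is a trimmed copy of the alias table, and final_score is an invented rename used only to show the mechanism.

```python
def unmapped_headers(headers: list[str], schema: dict[str, list[str]]) -> list[str]:
    """Return sanitized headers that match no alias in the schema, in input order."""
    known = {alias for aliases in schema.values() for alias in aliases}
    return [h for h in headers if h not in known]


# Trimmed alias table, for illustration only.
schema = {
    "student_id": ["student_id", "sis_user_id"],
    "course_id": ["course_id", "sis_course_id"],
    "points_earned": ["points_earned", "score"],
}

last_term = ["sis_user_id", "sis_course_id", "score"]
this_term = ["sis_user_id", "sis_course_id", "final_score"]  # hypothetical vendor rename

drift = unmapped_headers(this_term, schema)
```

Comparing the result against the previous run's list separates headers that were always outside the contract from ones that have just appeared, which is the case that usually needs a new alias.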

Related

LMS CSV Export Format Standards — the parent contract defining the canonical record shapes, delimiters, and FERPA controls this procedure implements.
Cross-LMS Student ID Mapping — the surrogate-key scheme the student_id_token feeds, so a learner resolves identically across platforms.
Weighted Grade Calculation Engines — the consumer that depends on score_normalized and the null-is-not-zero rule established here.
Moodle Course & User Schema — where Moodle’s idnumber and grader-report headers originate before this map resolves them.

