Resolving Duplicate Student IDs Across LMS Platforms

When the same learner appears in Canvas, Moodle, Blackboard, and the campus Student Information System (SIS), a naive warehouse join treats them as four different people — and every downstream aggregate silently double-counts or drops rows. This guide walks through one concrete, runnable procedure: take heterogeneous LMS user exports plus an authoritative SIS roster, resolve every record to a single canonical_student_id, and quarantine anything that cannot be matched. It is the survivorship step inside Cross-LMS Student ID Mapping, and it assumes the broader LMS Data Architecture & Schema Mapping conventions for canonical keys and tokenization. The output is a deterministic mapping table that gradebook, attendance, and engagement models join against instead of re-running fuzzy matching on every load.

The resolver walks an explicit fallback chain, strongest identifier first and weakest last, and every resolution leaves a masked audit record. Anything that fails the final fallback is flagged UNMAPPED rather than guessed into the wrong learner.

Prerequisites

Confirm each of these before running the procedure — most “every row is UNMAPPED” incidents trace back to a skipped item here:

Python 3.10+ with pandas >= 2.0 installed.
An institutional hashing salt loaded from a secret manager (never hardcoded), shared identically by the registry build and the resolver.
An authoritative SIS export with at least canonical_student_id, sis_id, email, and enrollment_status. The canonical_student_id is the stable surrogate documented in the parent Cross-LMS Student ID Mapping entity model.
One or more LMS user exports, each carrying a source_platform label plus that platform’s native identifiers — Canvas sis_user_id/email, Moodle idnumber/email (see Mapping Moodle User Profiles to SIS IDs), Blackboard batch_uid/email.
All ID columns read as dtype=str so zero-padded values (Canvas 0007 vs SIS 7) are not coerced to integers and silently merged. The header-normalization rules in Standardizing LMS CSV Headers for Data Lakes cover this on ingest.
Write access to a staging table or path for the resulting mapping, plus a separate file for the masked audit log.
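
To make the dtype=str prerequisite concrete, here is a minimal sketch that reads a small inline CSV twice. The two-row export is made up for illustration, and io.StringIO stands in for a real file path. It shows that zero-padded IDs stay distinct only when the column is read as text.

```python
import io

import pandas as pd

# Hypothetical two-row export; in practice this would be a file path.
raw_csv = "sis_user_id,email\n0007,a@uni.edu\n7,b@uni.edu\n"

as_text = pd.read_csv(io.StringIO(raw_csv), dtype=str)
as_inferred = pd.read_csv(io.StringIO(raw_csv))

# Read as text, the two IDs stay distinct; with type inference both become 7.
distinct_as_text = as_text["sis_user_id"].nunique()
distinct_as_inferred = as_inferred["sis_user_id"].nunique()
print(distinct_as_text, distinct_as_inferred)
```

Passing dtype=str for the whole frame is the bluntest option. A per-column mapping works too if some non-identifier columns should keep their inferred types.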

Step-by-step implementation

Each step has a one-line reason so you can adapt it to your own ingestion layer rather than copying blindly.

1. Normalize column names and platform labels. Lowercase every header once at the boundary. Why: Canvas exports SIS_User_ID, Moodle exports idnumber, and an unnormalized case mismatch surfaces as an opaque KeyError three joins later.

python

lms_export = lms_export.rename(columns=str.lower)
sis_registry = sis_registry.rename(columns=str.lower)
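
Step 1 also covers platform labels and the differing native ID columns. The sketch below is one possible way to handle both at the boundary. NATIVE_ID_ALIASES and normalize_export are hypothetical names introduced here, and the alias table simply mirrors the identifiers listed in the prerequisites.

```python
import pandas as pd

# Assumed alias table: native ID column per platform, taken from the
# prerequisites list. Adjust to the columns your exports actually carry.
NATIVE_ID_ALIASES = {"idnumber": "sis_user_id", "batch_uid": "sis_user_id"}


def normalize_export(df: pd.DataFrame) -> pd.DataFrame:
    """Trim and lowercase headers, alias native ID columns, lowercase platform labels."""
    out = df.rename(columns=lambda c: str(c).strip().lower())
    out = out.rename(columns=NATIVE_ID_ALIASES)
    if "source_platform" in out.columns:
        out["source_platform"] = out["source_platform"].str.strip().str.lower()
    return out


moodle = pd.DataFrame(
    {"IDNumber": ["1001"], "Email": ["jordan@uni.edu"], "Source_Platform": [" Moodle "]}
)
normalized = normalize_export(moodle)
print(list(normalized.columns))
```

Aliasing every native identifier to one column name means the downstream hashing step only needs to know about sis_user_id.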

2. Hash identifiers before they are ever matched or logged. Match on salted SHA-256 digests, not raw values. Why: this keeps the FERPA tokenization boundary intact — no raw student identifier reaches warehouse logs or join keys, only an opaque digest.

python

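import hashlib
import pandas as pd

def hash_identifier(value, salt):
    # Missing, non-string, or blank values get no digest and so cannot match.
    if pd.isna(value) or not isinstance(value, str) or not value.strip():
        return None
    return hashlib.sha256(f"{salt}{value.strip().lower()}".encode()).hexdigest()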

3. Merge on the strongest key first. Join the LMS export to the registry on sis_id_hash. Why: an exact match on the registrar’s anchor identifier is the highest-confidence link available; everything else is a fallback.
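
A minimal sketch of this first join, using short stand-in strings where real salted digests would go. Two details are additions for safety rather than part of the procedure as written. The registry's null keys are dropped because pandas matches null merge keys to each other, and validate="many_to_one" makes the merge raise if the registry ever holds the same key twice.

```python
import pandas as pd

# Stand-in digests; real values would come from hash_identifier().
lms = pd.DataFrame(
    {
        "sis_id_hash": ["h1", "h1", None, "h9"],
        "source_platform": ["canvas", "moodle", "blackboard", "canvas"],
    }
)
registry = pd.DataFrame(
    {"sis_id_hash": ["h1", "h2"], "canonical_student_id": ["STU-1001", "STU-1002"]}
)

# Drop null keys on the registry side so a missing ID never matches another missing ID.
registry_keys = registry.dropna(subset=["sis_id_hash"])

matched = lms.merge(
    registry_keys,
    on="sis_id_hash",
    how="left",
    validate="many_to_one",  # raises if a registry key is duplicated
    indicator=True,          # adds a _merge column: "both" or "left_only"
)
unresolved = matched[matched["_merge"] == "left_only"]
print(len(matched), len(unresolved))
```

The rows left in unresolved are the ones that fall through to the next link in the chain.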

4. Walk the fallback chain for the unmatched remainder. Only rows still missing a canonical_student_id fall through to the email hash, then to a composite name + date-of-birth hash. Why: late-provisioned LMS accounts and SSO-only users frequently have no sis_user_id yet, but still belong to a real learner.
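
The composite name + date-of-birth key is described here but not spelled out, so the following is only a sketch of one way to build it. The function name composite_hash, the column names first_name, last_name, and date_of_birth, and the ISO date format are all assumptions, not part of any platform's export contract.

```python
import hashlib
from typing import Optional

import pandas as pd


def composite_hash(first, last, dob, salt: str) -> Optional[str]:
    """Salted SHA-256 over normalized 'first|last|dob'; None if any part is missing."""
    parts = [first, last, dob]
    if any(not isinstance(p, str) or not p.strip() for p in parts):
        return None
    key = "|".join(p.strip().lower() for p in parts)
    return hashlib.sha256(f"{salt}{key}".encode()).hexdigest()


# Hypothetical columns; dates assumed to be ISO strings (YYYY-MM-DD) on both sides.
people = pd.DataFrame(
    {
        "first_name": ["Jordan", " jordan "],
        "last_name": ["Lee", "LEE"],
        "date_of_birth": ["2004-03-09", "2004-03-09"],
    }
)
people["name_dob_hash"] = [
    composite_hash(f, l, d, "demo-salt")
    for f, l, d in zip(people["first_name"], people["last_name"], people["date_of_birth"])
]
print(people["name_dob_hash"].nunique())
```

Returning None when any component is missing keeps a half-populated profile from matching on a partial key.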

5. Apply platform precedence and active-enrollment survivorship. When one canonical learner collects several rows, sort by platform_priority (SIS > Canvas > Moodle > Blackboard) and active status, then keep the first. Why: the SIS row is authoritative for enrollment state; an active Canvas row beats a stale Blackboard row from a dropped section.
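
The survivorship rule can be sketched in isolation with a few made-up rows for a single learner. The precedence map mirrors the order stated in this step; everything else is illustrative.

```python
import pandas as pd

PRECEDENCE = {"sis": 0, "canvas": 1, "moodle": 2, "blackboard": 3}

rows = pd.DataFrame(
    {
        "canonical_student_id": ["STU-1001", "STU-1001", "STU-1001"],
        "source_platform": ["blackboard", "canvas", "canvas"],
        "is_active": [True, False, True],
    }
)
# Unknown platforms sort last via the 99 sentinel.
rows["platform_priority"] = rows["source_platform"].map(PRECEDENCE).fillna(99)

survivor = rows.sort_values(
    ["canonical_student_id", "platform_priority", "is_active"],
    ascending=[True, True, False],
).drop_duplicates(subset="canonical_student_id", keep="first")
print(survivor[["source_platform", "is_active"]].to_dict("records"))
```

Because platform_priority sorts ahead of is_active, active status only breaks ties within the same platform. If an active row on a lower-priority platform should beat an inactive row on a higher one, swap the two sort keys.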

6. Quarantine the unresolved tail. Label anything that fails every link in the chain as UNMAPPED and route it to a review queue. Why: a wrong join corrupts every aggregate that touches it, so isolating an unknown is always cheaper than retracting a misattributed grade.
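
One way to route the tail, sketched with made-up rows. The io.StringIO buffer stands in for whatever review-queue destination you use (a file path or a staging table), which this guide does not prescribe.

```python
import io

import pandas as pd

resolved = pd.DataFrame(
    {
        "canonical_student_id": ["STU-1001", None, None],
        "source_platform": ["canvas", "canvas", "moodle"],
        "email_hash": ["h1", "h8", "h9"],
    }
)
resolved["canonical_student_id"] = resolved["canonical_student_id"].fillna("UNMAPPED")

# Split once: the mapping goes downstream, the tail goes to human review.
is_unmapped = resolved["canonical_student_id"].eq("UNMAPPED")
mapping = resolved[~is_unmapped]
review_queue = resolved[is_unmapped]

# StringIO stands in for the review-queue destination.
buffer = io.StringIO()
review_queue.to_csv(buffer, index=False)
print(len(mapping), len(review_queue))
```

Only hashed keys are written to the queue here, so reviewers would look up the raw record in the source system rather than in the warehouse.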

7. Emit a masked audit record per resolution. Log the last four characters of the email plus the resolved canonical ID and source platform. Why: this produces a reconstructable trail for compliance review without persisting PII.

Complete resolver script

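The script below is self-contained: it builds a small demo registry and a mixed LMS export, resolves the duplicates, and prints the result. It implements the first two links of the chain (SIS ID hash, then email hash) and reads the native identifier from a sis_user_id column, so alias Moodle idnumber and Blackboard batch_uid to that name before calling it. Swap the two pd.DataFrame(...) fixtures for your real reads (and your secret-managed salt) to run it as an Airflow task or as a step that runs before your dbt models build.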

python

import hashlib
import logging
import pandas as pd

# FERPA-safe audit log: masked values only, no raw identifiers.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
    handlers=[logging.FileHandler("id_resolution_audit.log")],
)

PRECEDENCE = {"sis": 0, "canvas": 1, "moodle": 2, "blackboard": 3}
ACTIVE_STATES = {"active", "enrolled", "current"}


def mask_pii(value: str) -> str:
    """Reveal only the last 4 chars for audit correlation."""
    if not isinstance(value, str) or len(value) <= 4:
        return "***"
    return f"***{value[-4:]}"


def hash_identifier(value, salt: str) -> str | None:
    """Deterministic salted SHA-256 over a normalized string."""
    if pd.isna(value) or not isinstance(value, str) or not value.strip():
        return None
    return hashlib.sha256(f"{salt}{value.strip().lower()}".encode()).hexdigest()


def resolve_student_ids(lms: pd.DataFrame, sis: pd.DataFrame, salt: str) -> pd.DataFrame:
    lms = lms.rename(columns=str.lower).copy()
    sis = sis.rename(columns=str.lower).copy()

    # 1. Hash every match key on both sides with the same salt.
    lms["sis_id_hash"] = lms.get("sis_user_id").apply(lambda v: hash_identifier(v, salt))
    lms["email_hash"] = lms.get("email").apply(lambda v: hash_identifier(v, salt))
    sis["sis_id_hash"] = sis["sis_id"].apply(lambda v: hash_identifier(v, salt))
    sis["email_hash"] = sis["email"].apply(lambda v: hash_identifier(v, salt))

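    # Rename the registry status so it cannot collide with an enrollment_status
    # column carried by the LMS export. Null hashes are dropped so a missing
    # identifier never matches another missing one, and each key keeps its
    # first registry row so the joins cannot multiply LMS rows.
    reg = sis.rename(columns={"enrollment_status": "sis_status"})
    reg_cols = ["canonical_student_id", "sis_status"]
    by_id = reg.dropna(subset=["sis_id_hash"]).drop_duplicates("sis_id_hash")
    by_email = reg.dropna(subset=["email_hash"]).drop_duplicates("email_hash")

    # 2. Strongest match first: SIS anchor.
    resolved = lms.merge(by_id[["sis_id_hash"] + reg_cols], on="sis_id_hash", how="left")

    # 3. Fallback: email hash for rows still unresolved.
    missing = resolved["canonical_student_id"].isna()
    if missing.any():
        fb = resolved.loc[missing, ["email_hash"]].merge(
            by_email[["email_hash"] + reg_cols], on="email_hash", how="left"
        )
        resolved.loc[missing, reg_cols] = fb[reg_cols].values

    # The LMS row's own status drives survivorship; the registry fills any gaps.
    lms_status = resolved.get("enrollment_status", resolved["sis_status"])
    resolved["enrollment_status"] = lms_status.fillna(resolved["sis_status"])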

    # 4. Quarantine the unresolved tail instead of guessing.
    resolved["canonical_student_id"] = resolved["canonical_student_id"].fillna("UNMAPPED")

    # 5. Survivorship: platform precedence + active enrollment.
    platform = resolved.get("source_platform", pd.Series("unknown", index=resolved.index))
    resolved["platform_priority"] = platform.str.lower().map(PRECEDENCE).fillna(99)
    resolved["is_active"] = (
        resolved["enrollment_status"].fillna("").str.lower().isin(ACTIVE_STATES)
    )
    resolved = resolved.sort_values(
        ["canonical_student_id", "platform_priority", "is_active"],
        ascending=[True, True, False],
    )

    # Collapse every non-UNMAPPED learner to one survivor; keep all UNMAPPED rows.
    mapped = resolved[resolved["canonical_student_id"] != "UNMAPPED"].drop_duplicates(
        subset="canonical_student_id", keep="first"
    )
    unmapped = resolved[resolved["canonical_student_id"] == "UNMAPPED"]
    out = pd.concat([mapped, unmapped], ignore_index=True)

    # 6. Masked audit trail.
    for _, r in out.iterrows():
        logging.info(
            "resolved %s -> %s | platform=%s | active=%s",
            mask_pii(r.get("email", "")),
            r["canonical_student_id"],
            r.get("source_platform", "unknown"),
            r["is_active"],
        )

    return out[
        ["canonical_student_id", "source_platform", "sis_id_hash", "email_hash", "is_active"]
    ].reset_index(drop=True)


if __name__ == "__main__":
    SALT = "load-from-secret-manager"  # never hardcode in production

    sis_registry = pd.DataFrame(
        {
            "sis_id": ["1001", "1002"],
            "email": ["jordan@uni.edu", "sam@uni.edu"],
            "canonical_student_id": ["STU-1001", "STU-1002"],
            "enrollment_status": ["Active", "Active"],
        }
    )
    lms_export = pd.DataFrame(
        {
            "source_platform": ["canvas", "moodle", "blackboard", "canvas"],
            "sis_user_id": ["1001", "1001", None, "9999"],
            "email": ["jordan@uni.edu", "jordan@uni.edu", "sam@uni.edu", "ghost@uni.edu"],
            "enrollment_status": ["Active", "Dropped", "Active", "Active"],
        }
    )

    mapping = resolve_student_ids(lms_export, sis_registry, SALT)
    print(mapping.to_string(index=False))

Verification and output validation

Run the script and confirm the shape before you trust the table downstream. The demo data has two real learners (one duplicated across Canvas and Moodle, one reachable only through the email fallback) plus one unresolvable ghost account, so the resolver should return three rows: STU-1001, STU-1002, and one UNMAPPED.

Add these assertions as a guardrail in the same task — they catch the failure modes that are otherwise invisible until a report looks wrong:

python

# No real learner appears twice in the survivor set.
mapped = mapping[mapping["canonical_student_id"] != "UNMAPPED"]
assert mapped["canonical_student_id"].is_unique, "duplicate survivor — check sort keys"

# No null ID leaked through: every row is either resolved or explicitly UNMAPPED.
assert mapping["canonical_student_id"].notna().all(), "null canonical id leaked through"

# The Canvas (active) row, not the Moodle (dropped) row, won STU-1001.
winner = mapped.loc[mapped["canonical_student_id"] == "STU-1001", "source_platform"].iloc[0]
assert winner == "canvas", "survivorship precedence regressed"

Spot-check the audit log too: every line should show a masked email (***.edu-style, the last four characters only) and never a raw address. If you see full identifiers, the masking step was bypassed and the run must be discarded under FERPA data-minimization rules.
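
The spot-check can be automated with a simple scan. The sketch below is a heuristic, not a compliance tool: the regular expression only catches strings shaped like a full email address, and the two sample lines are made up to show one clean entry and one leak.

```python
import re

# Matches anything shaped like a full email address.
RAW_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_raw_emails(lines):
    """Return the log lines that contain an unmasked email address."""
    return [line for line in lines if RAW_EMAIL.search(line)]


# Illustrative lines only; in practice, iterate over id_resolution_audit.log.
sample = [
    "INFO | resolved ***.edu -> STU-1001 | platform=canvas | active=True",
    "INFO | resolved jordan@uni.edu -> STU-1001 | platform=canvas | active=True",
]
leaks = find_raw_emails(sample)
print(len(leaks))
```

An empty result does not prove the log is clean of every identifier type, so treat it as a guardrail alongside the manual check.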

Troubleshooting

KeyError: 'sis_id' on the registry merge. The SIS export column is named student_id or SISID, or it was never lowercased. Confirm step 1 ran on the registry frame and that the authoritative export matches the column contract in the parent guide.

Every row resolves to UNMAPPED. Almost always a salt mismatch — the registry was hashed with a different salt than the LMS export — or inconsistent normalization (one side trimmed/lowercased, the other did not). Hash one known pair by hand on both sides and compare digests.
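
A quick way to run that comparison, assuming both pipelines use the same trim-and-lowercase normalization as the resolver. The salts and the ID below are placeholders; substitute the values each side actually loaded.

```python
import hashlib


def digest(value: str, salt: str) -> str:
    """Same normalization as the resolver: trim, lowercase, prepend the salt."""
    return hashlib.sha256(f"{salt}{value.strip().lower()}".encode()).hexdigest()


# Placeholder salts and ID for illustration.
registry_side = digest("1001", "salt-A")
lms_side_same_salt = digest(" 1001 ", "salt-A")
lms_side_other_salt = digest("1001", "salt-B")

print(registry_side == lms_side_same_salt)   # True: whitespace is normalized away
print(registry_side == lms_side_other_salt)  # False: different salt, different digest
```

If the digests differ even with the same salt, the mismatch is in normalization, so compare the exact strings each side feeds into the hash.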

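The wrong row survives the dedup. drop_duplicates always leaves exactly one row per canonical_student_id, so a bad sort shows up as the wrong survivor rather than as a duplicate. Either the sort key lost is_active, or enrollment_status arrived with unexpected casing ("ACTIVE" vs "active") so is_active was all False. Normalize status to lowercase before the membership test, as the script does.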

Zero-padded IDs collapse two students into one. pandas read 0007 and 7 as the integer 7. Force dtype=str (or converters=) when reading every identifier column; the header standard in Standardizing LMS CSV Headers for Data Lakes specifies this at ingest.

A shared guardian email merges siblings. Two learners using one family email both hit the email-hash fallback and resolve to whichever the registry returned first. Demote email below a composite name + date-of-birth hash, or block the fallback when an email maps to more than one canonical_student_id.
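
The second remedy, blocking the fallback for shared emails, might look like the sketch below. The stand-in digests and IDs are illustrative, and the variable names are hypothetical.

```python
import pandas as pd

registry = pd.DataFrame(
    {
        "email_hash": ["e1", "e1", "e2"],
        "canonical_student_id": ["STU-2001", "STU-2002", "STU-2003"],
    }
)

# Count distinct learners per email digest; more than one means the email is shared.
ids_per_email = registry.groupby("email_hash")["canonical_student_id"].nunique()
ambiguous = set(ids_per_email[ids_per_email > 1].index)

# Only unambiguous emails remain eligible for the fallback join.
email_lookup = registry[~registry["email_hash"].isin(ambiguous)]
print(sorted(ambiguous), email_lookup["canonical_student_id"].tolist())
```

LMS rows whose email digest falls in ambiguous then stay unresolved at this stage and either match on a later key or land in the UNMAPPED queue.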

Resolution is correct but the job is slow on large exports. The .iterrows() audit loop dominates at scale. Vectorize the log payload (build the masked columns once, then write with to_csv) and process the export in chunks. If you pull users live from an API, the extraction side should also honor each platform's rate-limit headers rather than retrying blindly.
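
A sketch of the vectorized audit write, assuming the same masking rule as mask_pii (last four characters for strings longer than four, otherwise "***"). The frame is made up, and io.StringIO stands in for the audit file path.

```python
import io

import pandas as pd

out = pd.DataFrame(
    {
        "email": ["jordan@uni.edu", None, "ab"],
        "canonical_student_id": ["STU-1001", "UNMAPPED", "STU-1002"],
        "source_platform": ["canvas", "moodle", "blackboard"],
    }
)

# Vectorized equivalent of mask_pii: keep the last 4 chars only for values longer than 4.
email = out["email"].fillna("").astype(str)
masked = ("***" + email.str[-4:]).where(email.str.len() > 4, "***")

audit = pd.DataFrame(
    {
        "masked_email": masked,
        "canonical_student_id": out["canonical_student_id"],
        "source_platform": out["source_platform"],
    }
)
buffer = io.StringIO()  # stand-in for the audit file path
audit.to_csv(buffer, index=False)
print(masked.tolist())
```

This writes one CSV row per resolution instead of one logging call per row, so the audit format changes from log lines to a table; keep whichever your compliance review expects.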

Related

Cross-LMS Student ID Mapping: the identity entity model, alias crosswalk, and surrogate-key design this procedure resolves into.
Mapping Moodle User Profiles to SIS IDs: building the Moodle side of the fallback chain when idnumber is unreliable.
How to Parse Canvas Gradebook JSON with Pandas: the downstream gradebook models that join on the canonical key this page emits.
LMS Data Architecture & Schema Mapping: the canonical-key and FERPA tokenization conventions every resolver must follow.
