Cross-LMS Student ID Mapping: Surrogate Keys, Identity Resolution, and the SIS Anchor

A student who exists in Canvas, Moodle, Blackboard, and the campus SIS is, from a data engineer’s point of view, four different rows with four different primary keys that happen to describe one person. Cross-LMS student ID mapping is the discipline of collapsing those rows onto a single, stable institutional identity so that gradebook, attendance, and engagement facts from every platform join cleanly in the warehouse. It is the deterministic-join layer that the rest of the LMS Data Architecture & Schema Mapping stack depends on: get the identity key wrong and every downstream aggregate — term GPA, cross-platform attendance rate, at-risk flags — silently double-counts or drops students. This page documents the identity entity model, the endpoints that emit each platform’s native identifiers, the normalization rules that anchor them to a canonical key, the FERPA controls that govern how those keys are stored, and a reference Python resolver.

Entity Model and Relational Schema

The mapping layer is a small but high-stakes star schema. At its center is a single student_identity dimension that issues one surrogate key per real person; around it sit per-platform alias tables that record every external identifier ever observed for that person. Modeling aliases as their own rows — rather than as columns on the identity row — is what lets a student accumulate a Canvas ID, two Moodle accounts from a system migration, and a Blackboard ID without schema changes.

`student_identity` (the canonical dimension)

This is the authority table. Every other system’s identifier resolves to a row here.

student_key (integer/bigint) — the surrogate primary key, generated by the pipeline and never reused. This is the only identifier that appears on fact tables.
sis_user_id (string) — the institutional Student Information System ID. This is the anchor: it is stable across academic terms, survives course re-enrollment, and is the one identifier the registrar treats as authoritative. It is nullable only for the brief window before SIS sync completes.
sis_login_id (string, nullable) — the institutional username/email used for SSO; useful as a secondary match key but mutable (students change names and logins).
status (enum) — active, merged, or retired. A merged row points at its survivor via merged_into_key so historical facts remain joinable after a duplicate collapse.
merged_into_key (integer, nullable) — foreign key to the surviving student_key when two identities are reconciled. The full survivorship procedure lives in Resolving Duplicate Student IDs Across LMS Platforms.
created_at, updated_at (timestamps) — audit columns; updated_at advances whenever a new alias is bound.
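
The merged_into_key pointer can be followed to the surviving identity as sketched below. This is a minimal illustration, assuming the identity rows are held in an in-memory dict keyed by student_key; the function name surviving_key and the row shape are hypothetical and not part of the pipeline described here.

```python
def surviving_key(student_key: int, identities: dict[int, dict]) -> int:
    """Follow merged_into_key pointers until a non-merged row is reached."""
    seen: set[int] = set()
    current = student_key
    while identities[current]["status"] == "merged":
        if current in seen:                      # defensive: a cycle means corrupt data
            raise ValueError(f"merge cycle detected at student_key {current}")
        seen.add(current)
        current = identities[current]["merged_into_key"]
    return current


# Hypothetical rows: 1001 was merged into 1002, which was later merged into 1003.
identities = {
    1001: {"status": "merged", "merged_into_key": 1002},
    1002: {"status": "merged", "merged_into_key": 1003},
    1003: {"status": "active", "merged_into_key": None},
}
survivor = surviving_key(1001, identities)       # resolves through both hops
```

Resolving at query time, rather than rewriting fact rows on every merge, is one way to keep historical facts joinable without touching them.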

`lms_alias` (the per-platform crosswalk)

One row per (platform, native identifier) pair. This is the table the resolver actually queries on every ingest.

alias_id (integer) — surrogate primary key for the alias row.
student_key (integer) — foreign key to student_identity.student_key. The many-to-one direction is the whole point: many platform IDs, one person.
platform (enum) — canvas, moodle, blackboard, sis, or oneroster.
native_id (string) — the platform’s own primary key, stored verbatim with zero-padding preserved (Canvas "7" and Moodle "0007" are different strings and must not be silently coerced to the integer 7).
id_scope (enum) — global or course. Canvas exposes both an account-global id and a course-scoped enrollment id; Moodle’s mdl_user.id is global but mdl_user_enrolments.id is not. Mixing scopes is the single most common source of mismatched joins.
match_method (enum) — how the binding was established: sis_direct, login_match, email_match, or manual. This column is what makes a join auditable — a reviewer can later see why a Moodle row was attributed to a given person.
confidence (float, 0–1) — a deterministic score, not a probability. sis_direct matches earn 1.0; weaker email matches earn less and are flagged for review.
first_seen_at, last_seen_at (timestamps) — bound the window the alias was active, which matters when a platform recycles an ID after a hard delete.

The relational contract is simply: lms_alias.student_key → student_identity.student_key, with a unique constraint on (platform, native_id, id_scope) so the same platform ID can never bind to two people. Canvas-specific source fields feed this from the Canvas Gradebook Data Structure user objects, while Moodle’s relational layout is mapped per the Moodle Course & User Schema.
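
The relational contract can be sketched with the standard-library sqlite3 module. This is an illustration under assumptions, not the warehouse DDL: the SQLite column types, the CHECK constraints standing in for enums, and the omission of the timestamp columns are all choices made for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE student_identity (
    student_key     INTEGER PRIMARY KEY,
    sis_user_id     TEXT,                   -- nullable until SIS sync completes
    sis_login_id    TEXT,
    status          TEXT NOT NULL DEFAULT 'active'
                    CHECK (status IN ('active', 'merged', 'retired')),
    merged_into_key INTEGER REFERENCES student_identity(student_key)
);
CREATE TABLE lms_alias (
    alias_id     INTEGER PRIMARY KEY,
    student_key  INTEGER NOT NULL REFERENCES student_identity(student_key),
    platform     TEXT NOT NULL,
    native_id    TEXT NOT NULL,             -- TEXT keeps "0007" distinct from "7"
    id_scope     TEXT NOT NULL CHECK (id_scope IN ('global', 'course')),
    match_method TEXT NOT NULL,
    confidence   REAL NOT NULL,
    UNIQUE (platform, native_id, id_scope)
);
""")

conn.execute("INSERT INTO student_identity (student_key, sis_user_id) VALUES (1, '451'), (2, '882')")
conn.execute(
    "INSERT INTO lms_alias (student_key, platform, native_id, id_scope, match_method, confidence) "
    "VALUES (1, 'canvas', '70001', 'global', 'sis_direct', 1.0)"
)
try:
    # Same (platform, native_id, id_scope) bound to a second person: must be rejected.
    conn.execute(
        "INSERT INTO lms_alias (student_key, platform, native_id, id_scope, match_method, confidence) "
        "VALUES (2, 'canvas', '70001', 'global', 'login_match', 0.8)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Putting the uniqueness rule in the database, rather than only in resolver code, means a buggy or concurrent writer still cannot bind one platform ID to two people.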

API Endpoints and Request Patterns

Each platform emits its identifiers from a different endpoint with a different pagination contract. The resolver’s extractors must collect the global identifier plus the SIS cross-reference from each, because the SIS cross-reference is the only field that lets two platforms agree on a person without guessing.

Canvas

GET /api/v1/accounts/:account_id/users is the account-scoped user enumeration endpoint. Request ?include[]=email; critically, the SIS fields are returned only when the token carries the users:read_sis permission.
The fields that matter: id (global Canvas user id), sis_user_id (the anchor — present only with SIS sync configured), and login_id.
Pagination is Link-header based: follow rel="next" until it is absent; never infer completion from a short page. The full mechanics are in pagination strategies for bulk exports.
Throttling surfaces as 403 Forbidden (Rate Limit Exceeded) with an X-Rate-Limit-Remaining header, covered under handling Canvas API rate limits.
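
The Link-header rule above can be sketched as follows. This is a minimal illustration, assuming the HTTP call is injected as a fetch callable that returns the parsed JSON body and the raw Link header; the function names, the example host, and the stubbed pages are hypothetical.

```python
from __future__ import annotations

import re
from typing import Callable, Iterator

_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_link(link_header: str | None) -> str | None:
    """Return the rel="next" URL from a Link header, or None when there is no next page."""
    if not link_header:
        return None
    for url, rel in _LINK_RE.findall(link_header):
        if rel == "next":
            return url
    return None


def iter_canvas_users(
    first_url: str,
    fetch: Callable[[str], tuple[list[dict], str | None]],
) -> Iterator[dict]:
    """Yield users page by page; completion is decided only by a missing rel="next"."""
    url: str | None = first_url
    while url:
        users, link_header = fetch(url)
        yield from users
        url = next_link(link_header)


# Stubbed fetch standing in for an authenticated HTTP GET (hypothetical URLs and data).
BASE = "https://canvas.example.edu/api/v1/accounts/1/users"
_pages = {
    f"{BASE}?page=1": ([{"id": 70001}], f'<{BASE}?page=2>; rel="next",<{BASE}?page=1>; rel="first"'),
    f"{BASE}?page=2": ([{"id": 70002}], f'<{BASE}?page=1>; rel="first"'),
}
canvas_users = list(iter_canvas_users(f"{BASE}?page=1", _pages.__getitem__))
```

Note that the first stubbed page holds a single record yet still advertises a next link, which is exactly the case where inferring completion from a short page would truncate the roster.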

Moodle

core_user_get_users via the web-service endpoint POST /webservice/rest/server.php with wstoken, wsfunction, and moodlewsrestformat=json.
The fields that matter: id (global mdl_user.id), idnumber (the institutional ID field — Moodle’s equivalent of sis_user_id, but only populated if the institution writes it), and username.
Moodle has no native cursor pagination on this function; bulk extracts page by criteria filters or read the database directly, as detailed in mapping Moodle user profiles to SIS IDs.
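
A request for one criteria filter might be assembled as below. This is a sketch under assumptions: it only builds the form fields and trims a response body, leaving the HTTP POST to whatever client the pipeline uses, and the helper names are hypothetical. The criteria[0][key] / criteria[0][value] flattening and the top-level users list reflect how Moodle's REST protocol is commonly documented; verify both against the target Moodle version.

```python
def moodle_get_users_form(token: str, key: str, value: str) -> dict[str, str]:
    """Build the form fields for one core_user_get_users call with a single criterion."""
    return {
        "wstoken": token,
        "wsfunction": "core_user_get_users",
        "moodlewsrestformat": "json",
        "criteria[0][key]": key,
        "criteria[0][value]": value,
    }


def extract_anchor_fields(response: dict) -> list[dict]:
    """Keep only the identifier fields the resolver needs from a response body."""
    return [
        {
            "id": str(u["id"]),                       # keep the native id as a string
            "idnumber": u.get("idnumber") or None,    # empty string means "not populated"
            "username": u.get("username"),
        }
        for u in response.get("users", [])
    ]


form = moodle_get_users_form("PLACEHOLDER_TOKEN", "idnumber", "451")
sample = {"users": [{"id": 7, "idnumber": "", "username": "jdoe"}], "warnings": []}
moodle_rows = extract_anchor_fields(sample)
```

Mapping an empty idnumber to None keeps an unpopulated institutional ID from being treated as a real match key.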

Blackboard

GET /learn/api/public/v1/users on the Blackboard REST surface, authenticated with an OAuth2 bearer token.
The fields that matter: id (the opaque _nnn_1 primary key), userName, and externalId (Blackboard’s SIS cross-reference field).
Pagination is cursor-envelope based — a paging.nextPage URL inside the response body rather than a header — which is why the Blackboard extractor cannot share the Canvas paginator. The auth and envelope details are in Blackboard REST API Architecture.
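
The body-envelope cursor can be followed with a loop like the one below. As with any sketch here, the HTTP call is assumed to be injected as a fetch callable returning the parsed JSON body; the function name and the stubbed responses are hypothetical.

```python
from typing import Callable, Iterator


def iter_blackboard_users(first_path: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield users across pages by following paging.nextPage in each response body."""
    path = first_path
    while path:
        body = fetch(path)
        yield from body.get("results", [])
        path = body.get("paging", {}).get("nextPage")   # absent on the last page


# Stubbed responses standing in for authenticated GETs (hypothetical data).
_pages = {
    "/learn/api/public/v1/users": {
        "results": [{"id": "_101_1", "userName": "jdoe", "externalId": "SIS-451"}],
        "paging": {"nextPage": "/learn/api/public/v1/users?offset=1"},
    },
    "/learn/api/public/v1/users?offset=1": {
        "results": [{"id": "_102_1", "userName": "asmith", "externalId": "SIS-882"}],
    },
}
bb_users = list(iter_blackboard_users("/learn/api/public/v1/users", _pages.__getitem__))
```

The only structural difference from a header-based paginator is where the cursor is read from, which is why the two extractors can share a loop shape but not a cursor parser.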

SIS / OneRoster

The authoritative feed is usually a OneRoster GET /ims/oneroster/v1p1/users endpoint or a nightly CSV drop conforming to the LMS CSV Export Format Standards. Either way, sourcedId (OneRoster) or the SIS primary key becomes the sis_user_id anchor that every other platform’s externalId / idnumber / sis_user_id is matched against.

Normalization and Transformation Logic

Raw identifiers cannot be compared as-is. A Canvas sis_user_id of "00451", a Moodle idnumber of "451", and a Blackboard externalId of "SIS-451" may all denote the same registrar record, and the resolver’s job is to recognize that deterministically — never with fuzzy string distance, which is unauditable for educational records.

Match-key construction. Each extractor produces a normalized match_key from its SIS cross-reference field by applying a fixed, documented transform: trim whitespace, uppercase, strip a known institution prefix (SIS-), and, when what remains is purely numeric, remove leading zeros. Alphanumeric IDs are otherwise left untouched. The transform is intentionally narrow: it corrects formatting drift between systems but never collapses two genuinely different IDs.

python

import re

def normalize_sis_ref(raw: str | None) -> str | None:
    """Deterministically normalize a platform's SIS cross-reference to a match key."""
    if raw is None or not raw.strip():       # treat blank strings like a missing value
        return None
    s = raw.strip().upper()
    s = re.sub(r"^SIS[-_]", "", s)          # strip the institution prefix
    m = re.fullmatch(r"0*(\d+)", s)          # numeric IDs: drop leading zeros
    return m.group(1) if m else s            # alphanumeric IDs: leave as-is

Resolution order. Bindings are attempted in descending order of trust, and the first that succeeds wins so a strong key is never overridden by a weak one:

sis_direct — the platform’s SIS cross-reference normalizes to an existing student_identity.sis_user_id. Confidence 1.0.
login_match — the platform login_id / username matches sis_login_id exactly. Confidence 0.8; queued for review.
email_match — institutional email matches a stored login. Confidence 0.6; always queued for review.
unmatched — no key resolves; the extractor mints a provisional student_identity row with a null sis_user_id so facts are not dropped, and flags it for the duplicate-resolution workflow.
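
The resolution order can be sketched as a first-match-wins loop over the tiers. This is an illustration under assumptions: the Directory class, its two in-memory indexes, and the returned tuple shape are hypothetical, and the unmatched outcome is returned to the caller rather than minting a key here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Directory:
    by_sis: dict[str, int] = field(default_factory=dict)     # normalized sis_user_id -> student_key
    by_login: dict[str, int] = field(default_factory=dict)   # sis_login_id -> student_key


def resolve_tiered(
    match_key: str | None, login: str | None, email: str | None, directory: Directory
) -> tuple[int | None, str, float, bool]:
    """Return (student_key, outcome, confidence, needs_review); the first tier that hits wins."""
    tiers = (
        ("sis_direct", directory.by_sis, match_key, 1.0, False),
        ("login_match", directory.by_login, login, 0.8, True),
        ("email_match", directory.by_login, email, 0.6, True),   # email compared to stored logins
    )
    for method, index, value, confidence, review in tiers:
        if value is not None and value in index:                 # exact match only, no fuzzing
            return index[value], method, confidence, review
    return None, "unmatched", 0.0, True


directory = Directory(by_sis={"451": 1001}, by_login={"jdoe@example.edu": 1002})
strong = resolve_tiered("451", "jdoe@example.edu", None, directory)
weak = resolve_tiered(None, "jd", "jdoe@example.edu", directory)
nothing = resolve_tiered(None, None, None, directory)
```

In the first call the login would point at a different identity, but the SIS tier is tried first and returns immediately, so the weaker signal never overrides it.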

Surrogate-key issuance. When a match succeeds, the resolver writes an lms_alias row pointing at the existing student_key. When nothing matches, it issues a new student_key (the only place new keys are born) and records match_method = 'manual' pending confirmation. Because the surrogate key is generated centrally and never derived from any platform ID, a vendor changing its ID format never invalidates a historical join — the contract that makes this layer the stable backbone for gradebook and attendance normalization.

Compliance Constraints

The identity tables are the most sensitive surface in the whole pipeline: they are, by definition, the place where a person’s records across every system are linked together, which is precisely the kind of linkage that FERPA’s limits on disclosing personally identifiable information from education records are meant to control. Field-level rules:

sis_user_id and native_id are direct identifiers and must cross the FERPA tokenization boundary before any data leaves the secure ingestion zone. Downstream analytics tables should carry only the opaque student_key, never the raw SIS ID.
Never log raw identifiers in plaintext. Debug logs key on the surrogate student_key or on a salted SHA-256 of the SIS reference, never the value itself. Model the FERPA-safe pattern with a placeholder hash rather than a real ID:

python

import hashlib

# Placeholder pattern: hash the raw value with a salt; never log it.
salt = b"replace-with-secret-salt"                         # placeholder; load from a secrets store
token = hashlib.sha256(salt + b"student_id").hexdigest()   # opaque in logs

match_method, confidence, and first_seen_at are mandatory audit columns. FERPA’s right-to-inspect means an institution must be able to explain why a given Blackboard row was attributed to a student; an unaudited fuzzy match cannot answer that.
Role-based access. The student_identity ↔ lms_alias join should be readable only by the identity service’s role. Analysts query fact tables on student_key; they never need — and should not have — the crosswalk that re-identifies it.
Retention on merge. A merged identity is never hard-deleted, because its merged_into_key pointer is what keeps historical grades attributable. Soft-retire instead, preserving the audit trail. Align retention windows with institutional policy and the U.S. Department of Education FERPA guidelines.
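
One way to enforce the rule that only student_key travels downstream is an allow-list projection applied at the edge of the secure ingestion zone. The sketch below is an assumption-laden illustration: the column allow-list and the function name are hypothetical, and the sample row uses placeholder values only.

```python
# Hypothetical allow-list of columns permitted outside the secure ingestion zone.
ANALYTICS_COLUMNS = ("student_key", "platform", "match_method", "confidence")


def strip_direct_identifiers(alias_rows: list[dict]) -> list[dict]:
    """Project alias rows onto the allow-list so raw IDs never reach analytics tables."""
    return [{col: row[col] for col in ANALYTICS_COLUMNS} for row in alias_rows]


alias_rows = [{
    "student_key": 1001, "platform": "canvas", "native_id": "70001",
    "id_scope": "global", "match_method": "sis_direct", "confidence": 1.0,
    "sis_token": "placeholder-token",
}]
safe_rows = strip_direct_identifiers(alias_rows)
```

An allow-list is used rather than a deny-list so that an identifier column added later is excluded by default instead of leaking until someone remembers to block it.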

Reference Python Implementation

The resolver below ingests a batch of platform user records, normalizes each one’s SIS reference, binds it to an existing canonical identity or mints a provisional one, and emits FERPA-safe lms_alias rows. It implements only the sis_direct tier; the login and email tiers are omitted for brevity. It uses pandas for batch work and a dict-backed index standing in for the persistent student_identity table; in production the index would be backed by the warehouse, for example through SQLAlchemy.

python

from __future__ import annotations
import hashlib
import re
from dataclasses import dataclass, field

import pandas as pd

ANCHOR_FIELD = {                      # the SIS cross-reference field per platform
    "canvas": "sis_user_id",
    "moodle": "idnumber",
    "blackboard": "externalId",
}
CONFIDENCE = {"sis_direct": 1.0, "login_match": 0.8, "email_match": 0.6}


def normalize_sis_ref(raw: str | None) -> str | None:
    if raw is None or str(raw).strip() == "":
        return None
    s = str(raw).strip().upper()
    s = re.sub(r"^SIS[-_]", "", s)
    m = re.fullmatch(r"0*(\d+)", s)
    return m.group(1) if m else s


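LOG_SALT = b"replace-with-secret-salt"   # placeholder; load from a secrets store in practice


def safe_token(value: str, salt: bytes = LOG_SALT) -> str:
    """FERPA-safe surrogate for logs; never emit the raw identifier."""
    return hashlib.sha256(salt + value.encode()).hexdigest()[:16]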


@dataclass
class IdentityIndex:
    by_sis: dict[str, int] = field(default_factory=dict)   # match_key -> student_key
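    _next_key: int = 1000

    def __post_init__(self) -> None:
        # Start above any preloaded key so mint() never reissues an existing student_key.
        self._next_key = max([self._next_key, *self.by_sis.values()])

    def mint(self) -> int:
        self._next_key += 1
        return self._next_key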

    def resolve(self, match_key: str | None) -> tuple[int, str]:
        if match_key and match_key in self.by_sis:
            return self.by_sis[match_key], "sis_direct"
        key = self.mint()                       # provisional identity; facts never dropped
        if match_key:                           # anchor it so later platforms collapse onto it
            self.by_sis[match_key] = key
        return key, "manual"


def resolve_batch(records: list[dict], platform: str, index: IdentityIndex) -> pd.DataFrame:
    anchor = ANCHOR_FIELD[platform]
    rows = []
    for rec in records:
        match_key = normalize_sis_ref(rec.get(anchor))
        student_key, method = index.resolve(match_key)
        rows.append({
            "student_key": student_key,
            "platform": platform,
            "native_id": str(rec["id"]),                 # stored verbatim, no int coercion
            "id_scope": rec.get("id_scope", "global"),
            "match_method": method,
            "confidence": CONFIDENCE.get(method, 0.0),
            "sis_token": safe_token(match_key) if match_key else None,
        })
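    columns = ["student_key", "platform", "native_id", "id_scope",
               "match_method", "confidence", "sis_token"]
    df = pd.DataFrame(rows, columns=columns)     # explicit columns keep an empty batch valid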
    # Enforce the uniqueness contract before persistence.
    dupes = df.duplicated(subset=["platform", "native_id", "id_scope"], keep=False)
    if dupes.any():
        raise ValueError(f"duplicate alias rows for {platform}: {df[dupes]['native_id'].tolist()}")
    return df


if __name__ == "__main__":
    idx = IdentityIndex(by_sis={"451": 1001})            # 451 already known from SIS load
    canvas_users = [
        {"id": "70001", "sis_user_id": "00451", "id_scope": "global"},   # → collapses onto 1001
        {"id": "70002", "sis_user_id": "SIS-882", "id_scope": "global"}, # → new provisional key
    ]
    out = resolve_batch(canvas_users, "canvas", idx)
    print(out[["student_key", "native_id", "match_method", "confidence"]].to_string(index=False))

Running it binds the first Canvas user (sis_user_id "00451") onto the pre-loaded student_key 1001 via sis_direct, and mints a fresh key for the second — demonstrating that formatting drift ("00451" vs "451") is reconciled deterministically while a genuinely new student is never silently merged.

Failure Modes and Edge Cases

Scope mixing. Binding a Canvas course-scoped enrollment id where a global user id is expected attributes a student’s facts to the wrong person in every other course. Always tag id_scope and refuse cross-scope joins.
Integer coercion of zero-padded IDs. Loading native_id as an integer turns Moodle "0007" and Canvas "7" into the same value 7. Keep native_id a string end to end.
Recycled IDs after hard deletes. A platform that deletes and later reissues a primary key will hand you a native_id that already binds to a retired person. Gate bindings on last_seen_at and treat a reappearing-but-stale ID as a new alias requiring review.
Null SIS reference on early-term loads. Before SIS sync runs, Canvas sis_user_id is null; matching on login_id then is acceptable only with confidence 0.8 and a review flag — never silently promoted to sis_direct.
Two anchors for one person (the merge problem). A student with a legacy and a current SIS ID produces two student_identity rows. Do not delete either; reconcile via merged_into_key per Resolving Duplicate Student IDs Across LMS Platforms.
Truncated extraction binding a partial roster. A 403 rate-limit or 401 token expiry mid-pagination yields a half-loaded batch; resolving against it mints provisional keys for students who would have matched. Make the resolver idempotent and run it only on complete extracts.
Mutable login/email used as the anchor. Names and emails change at the registrar; a pipeline anchored on login_id re-shuffles identities every time a student is renamed. Anchor on sis_user_id and demote login/email to review-only signals.
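
The recycled-ID gate described in the list above might look like the following. It is a sketch under assumptions: the one-year staleness threshold, the function name, and the two outcome labels are hypothetical and should be set by institutional policy rather than taken from here.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=365)   # hypothetical threshold; set per institutional policy


def classify_reappearance(last_seen_at: datetime, observed_at: datetime, owner_status: str) -> str:
    """Decide whether an already-bound native_id can be trusted when it shows up again."""
    if owner_status == "retired" or observed_at - last_seen_at > STALE_AFTER:
        return "review"     # possibly a recycled ID: treat as a new alias pending review
    return "refresh"        # same person, still active: just advance last_seen_at


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
recent = classify_reappearance(now - timedelta(days=30), now, "active")
stale = classify_reappearance(now - timedelta(days=400), now, "active")
retired = classify_reappearance(now - timedelta(days=1), now, "retired")
```

Routing both the stale and the retired case to review keeps a reissued primary key from silently attaching a new student’s facts to a former student’s identity.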

LMS Data Architecture & Schema Mapping — the parent reference covering ingestion, staging, normalization, and the compliance boundary across platforms.
Resolving Duplicate Student IDs Across LMS Platforms — the survivorship and merge procedure for collapsing two canonical identities into one.
Canvas Gradebook Data Structure — the source of the Canvas user and submission objects that feed lms_alias rows.
Moodle Course & User Schema — the relational layout behind Moodle’s idnumber anchor and global vs enrolment-scoped IDs.
Blackboard REST API Architecture — the cursor-envelope user endpoint and externalId cross-reference field.
Mapping Moodle User Profiles to SIS IDs — the platform-specific how-to for extracting Moodle’s SIS anchor.
