Mapping Moodle User Profiles to SIS IDs in Python

Every downstream EdTech workflow that joins Moodle activity to an official roster — gradebook sync, attendance reconciliation, early-alert scoring — depends on one fragile join: the bridge between a Moodle mdl_user row and the institution’s Student Information System (SIS) identifier. Moodle does not enforce that bridge at the database level, so in production the SIS ID lives in idnumber for most users, in a custom profile field for some, and nowhere at all for the accounts that will quietly break your pipeline. This guide is a single, deterministic procedure: pull user profiles through the Moodle Web Services API, locate the SIS ID wherever it actually lives, normalize it to a canonical form, and emit a tokenized surrogate key that honours the FERPA compliance boundary before any identifier reaches storage. It builds on the Moodle Course & User Schema entity model and feeds the Cross-LMS Student ID Mapping layer that unifies identities across platforms.

Prerequisites

Confirm each of these before running the procedure — the script assumes all of them are in place.

Python 3.10+ (the code uses the str | None union syntax and built-in generics such as list[dict] in its annotations).
requests==2.32.* and pandas==2.2.* installed in the virtualenv; the nullable Int64/boolean dtypes used here require pandas 1.0+, but 2.2 is assumed.
A Moodle Web Services token (wstoken) bound to an account with the moodle/user:viewalldetails and moodle/user:viewdetails capabilities, exported as MOODLE_TOKEN. Web Services and the REST protocol must be enabled site-wide, and core_user_get_users / core_user_get_users_by_field added to the token’s external service.
The base URL of your Moodle site, exported as MOODLE_URL (the script appends /webservice/rest/server.php). For unattended jobs, manage credentials as described in Python Requests for LMS APIs rather than hard-coding the token.
The shortname of the custom profile field that holds the SIS ID when idnumber is blank (commonly sisid or studentnumber). Custom fields surface in the customfields array of each user payload.
Upstream data shape: core_user_get_users returns {"users": [ {...}, ... ]}, where each user dict carries id, username, idnumber, email, suspended, and a customfields list of {"shortname", "value"} objects.
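
As an illustration of that shape, the sketch below builds a sample payload and indexes its customfields list by shortname. Every value is invented, and index_custom_fields is a hypothetical helper rather than part of Moodle or of the script in this guide; real payloads typically carry more keys than shown.

```python
# Illustrative payload in the shape described above.
# All values are made up for the example.
sample_body = {
    "users": [
        {
            "id": 101,
            "username": "jdoe",
            "idnumber": "",  # blank: the SIS ID lives in a custom field
            "email": "jdoe@example.edu",
            "suspended": False,
            "customfields": [
                {"shortname": "sisid", "value": "00421"},
            ],
        },
    ]
}


def index_custom_fields(user: dict) -> dict[str, str]:
    """Hypothetical helper: map each custom field's shortname to its value."""
    return {
        field["shortname"]: field["value"]
        for field in user.get("customfields", [])
        if "shortname" in field and "value" in field
    }


first_user = sample_body["users"][0]
fields = index_custom_fields(first_user)
print(fields.get("sisid"))  # the value is a string, so the leading zeros are intact
```

Users without any custom profile fields may have no customfields key at all, which is why the helper reads it with a default of an empty list.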

Step-by-step implementation

1. Call the REST endpoint, not the database. Querying mdl_user directly couples your pipeline to the institution’s table prefix and bypasses Moodle’s capability checks. core_user_get_users enforces the token’s permission scope and returns only the fields the service account is allowed to see, which keeps extraction inside the audit boundary described in the parent Moodle Course & User Schema guide.

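2. Request a minimal criteria set and let Moodle filter server-side. core_user_get_users takes a criteria array of {key, value} pairs. The script sends deleted=0 to keep tombstoned accounts, which retain a stale idnumber, out of the mapping. Check that key against the function's documentation for your Moodle version: the documented search keys are id, lastname, firstname, idnumber, username, email, and auth, and Moodle reports an unsupported key in the response's warnings array rather than raising an exception. Asking only for what you need is the data-minimization rule that every field in this pipeline obeys.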

3. Detect Moodle’s HTTP-200 error envelope before trusting the body. Moodle returns application-level errors as a 200 OK whose JSON body is {"exception", "errorcode", "message"}, so raise_for_status() never fires for them. A client that skips this check parses an error object as if it were a user list and silently produces an empty mapping.

4. Resolve the SIS ID with an explicit fallback order. Read idnumber first; when it is blank, pull the configured custom field out of customfields. Recording which source supplied each value (source_field) is what lets you later audit how many identities rely on the fragile custom-field path versus the indexed idnumber column.

5. Normalize before you compare. SIS IDs arrive with leading/trailing whitespace, mixed case, and inconsistent hyphenation, but leading zeros are significant — 00421 and 421 are different students. Strip surrounding whitespace, uppercase, and remove only interior separators (spaces, hyphens) while preserving every digit. Normalizing at ingestion is what makes the join idempotent across term rollovers.

6. Tokenize the SIS ID into a surrogate key. Hash the normalized SIS ID with SHA-256 and drop the raw value before the frame leaves memory. The hash is deterministic, so the same student maps to the same surrogate across every system — the exact property the Cross-LMS Student ID Mapping layer relies on — without exposing the real identifier in logs or staging tables.

7. Quarantine the unmappable, do not drop them. Users with no idnumber and no custom-field value cannot be mapped; emit them to a separate quarantine frame so an analyst can reconcile them against the SIS export, rather than letting them vanish from the pipeline and resurface as orphaned grade rows.

The deterministic resolution path: idnumber first, custom profile field as fallback, then normalization before the SIS ID crosses the FERPA boundary to become a tokenized surrogate. Records with no value in either source are quarantined, never dropped.
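
Steps 5 and 6 can be tried in isolation before running the full script. The sketch below is a minimal illustration, assuming the same separator rule (whitespace and hyphens) as the procedure; the function names normalize and surrogate and the sample IDs are invented for the example.

```python
import hashlib
import re

# Interior separators to drop: runs of whitespace or hyphens.
_SEPARATORS = re.compile(r"[\s\-]+")


def normalize(raw: str) -> str:
    """Trim, drop spaces and hyphens, uppercase; digits are left untouched."""
    return _SEPARATORS.sub("", raw.strip()).upper()


def surrogate(raw: str) -> str:
    """SHA-256 hex digest of the normalized ID."""
    return hashlib.sha256(normalize(raw).encode("utf-8")).hexdigest()


# Formatting variants of the same ID converge on one canonical form...
variants = [" ab-00421 ", "AB 00421", "ab00421"]
canonical = {normalize(v) for v in variants}
print(canonical)  # {'AB00421'}

# ...and therefore on one surrogate key.
keys = {surrogate(v) for v in variants}
print(len(keys))  # 1

# Leading zeros survive normalization, so these remain distinct students.
print(normalize("00421") == normalize("421"))  # False
```

Because the hash is computed only after normalization, two systems that format the same ID differently still agree on the surrogate, while IDs that differ only in leading zeros never merge.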

Complete runnable code block

python

import os
import re
import hashlib
import logging
import pandas as pd
import requests

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("moodle_sis_mapper")

# Canonical mapping schema. Raw identifiers never appear in the output frame.
SCHEMA: dict[str, str] = {
    "moodle_user_id": "Int64",
    "sis_id_hash": "string",
    "source_field": "string",
    "suspended": "boolean",
}

_SEPARATORS = re.compile(r"[\s\-]+")


def normalize_sis_id(raw: str | None) -> str | None:
    """Canonicalize a SIS ID: trim, uppercase, drop interior separators.

    Leading zeros are significant and MUST be preserved — 00421 != 421.
    """
    if raw is None:
        return None
    cleaned = _SEPARATORS.sub("", str(raw).strip()).upper()
    return cleaned or None


def tokenize(value: str) -> str:
    """Deterministic SHA-256 surrogate key (64-char hex) — FERPA-safe join key."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def fetch_users(base_url: str, token: str) -> list[dict]:
    """Pull non-deleted users via core_user_get_users, handling Moodle's
    HTTP-200 exception envelope."""
    endpoint = f"{base_url.rstrip('/')}/webservice/rest/server.php"
    params = {
        "wstoken": token,
        "wsfunction": "core_user_get_users",
        "moodlewsrestformat": "json",
        "criteria[0][key]": "deleted",
        "criteria[0][value]": "0",
    }
    resp = requests.get(endpoint, params=params, timeout=30)
    resp.raise_for_status()
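    body = resp.json()
    # Anything other than a dict is not the documented {"users": [...]} shape.
    if not isinstance(body, dict):
        raise RuntimeError(f"Unexpected Moodle response type: {type(body).__name__}")
    # Moodle reports errors as 200 OK with an "exception" key.
    if "exception" in body:
        raise RuntimeError(f"Moodle WS error {body.get('errorcode')}: {body.get('message')}")
    # Unsupported criteria keys come back as warnings, not exceptions.
    for warning in body.get("warnings", []):
        logger.warning("Moodle WS warning %s: %s",
                       warning.get("warningcode"), warning.get("message"))
    users = body.get("users", [])
    logger.info("Retrieved %d user records", len(users))
    return users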


def resolve_sis_id(user: dict, custom_field: str) -> tuple[str | None, str]:
    """Return (normalized_sis_id, source_field) using idnumber, then a
    custom profile field as fallback."""
    direct = normalize_sis_id(user.get("idnumber"))
    if direct:
        return direct, "idnumber"
    for field in user.get("customfields", []):
        if field.get("shortname") == custom_field:
            fallback = normalize_sis_id(field.get("value"))
            if fallback:
                return fallback, f"customfield:{custom_field}"
    return None, "unmapped"


def build_mapping(base_url: str, token: str, custom_field: str
                  ) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Produce (mapped, quarantined) frames. Raw SIS IDs are tokenized
    and never returned."""
    mapped_rows: list[dict] = []
    quarantine_rows: list[dict] = []
    for user in fetch_users(base_url, token):
        sis_id, source = resolve_sis_id(user, custom_field)
        if sis_id is None:
            quarantine_rows.append({
                "moodle_user_id": user.get("id"),
                "username": user.get("username"),  # kept ONLY for analyst reconciliation
                "reason": "no idnumber or custom SIS field",
            })
            continue
        mapped_rows.append({
            "moodle_user_id": user.get("id"),
            "sis_id_hash": tokenize(sis_id),
            "source_field": source,
            "suspended": bool(user.get("suspended", 0)),
        })

    mapped = pd.DataFrame(mapped_rows, columns=list(SCHEMA)).astype(SCHEMA)
    quarantined = pd.DataFrame(quarantine_rows,
                               columns=["moodle_user_id", "username", "reason"])
    logger.info("Mapped %d users; quarantined %d", len(mapped), len(quarantined))
    return mapped, quarantined


if __name__ == "__main__":
    mapped, quarantined = build_mapping(
        base_url=os.environ["MOODLE_URL"],
        token=os.environ["MOODLE_TOKEN"],
        custom_field=os.environ.get("SIS_CUSTOM_FIELD", "sisid"),
    )
    logger.info("Mapped frame shape: %s", mapped.shape)
    print(mapped.dtypes)
    print(mapped.head())
    if not quarantined.empty:
        logger.warning("Unmapped users require manual reconciliation:\n%s",
                       quarantined.to_string(index=False))

Verification and output validation

Confirm the mapping is correct, unique, and free of raw identifiers before handing it downstream:

No raw SIS ID leaked. assert "idnumber" not in mapped.columns and assert mapped["sis_id_hash"].str.len().eq(64).all() — SHA-256 hex is always 64 characters, so any other length means a raw value slipped through.
One row per Moodle account. assert mapped["moodle_user_id"].is_unique. A duplicated moodle_user_id means the same user appeared twice in the payload and the frame would double-count grades.
Surrogate collisions are real duplicates, not bugs. mapped.groupby("sis_id_hash")["moodle_user_id"].nunique() greater than 1 flags two Moodle accounts sharing one SIS ID. Route those rows through Resolving Duplicate Student IDs Across LMS Platforms rather than discarding either one.
Dtypes are nullable. assert mapped["suspended"].dtype.name == "boolean" and assert mapped["moodle_user_id"].dtype.name == "Int64"; lowercase bool/int64 means the astype(SCHEMA) cast was lost somewhere downstream and missing values can no longer be represented.
Fallback usage is visible. mapped["source_field"].value_counts() should be dominated by idnumber; a large customfield: share signals an SIS sync that is not writing idnumber and is a schema-drift early warning.
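
The checks above can be bundled into one function that runs before the frame is handed downstream. This is a sketch rather than part of the script: validate_mapping and the demo frame are hypothetical, and the demo's hash values are placeholder strings of the right length, not real digests.

```python
import pandas as pd


def validate_mapping(mapped: pd.DataFrame) -> pd.Series:
    """Run the checks above; return surrogates shared by more than one account."""
    # No raw identifier column, and every surrogate is 64 hex characters.
    assert "idnumber" not in mapped.columns
    assert mapped["sis_id_hash"].str.len().eq(64).all()
    # One row per Moodle account.
    assert mapped["moodle_user_id"].is_unique
    # The nullable dtypes survived the cast.
    assert mapped["moodle_user_id"].dtype.name == "Int64"
    assert mapped["suspended"].dtype.name == "boolean"
    # Surrogates held by several accounts are real duplicates to route to
    # reconciliation, so they are returned instead of failing the run.
    accounts = mapped.groupby("sis_id_hash")["moodle_user_id"].nunique()
    return accounts[accounts > 1]


# Illustrative frame: two accounts share one surrogate, a third is unique.
demo = pd.DataFrame(
    {
        "moodle_user_id": [1, 2, 3],
        "sis_id_hash": ["a" * 64, "a" * 64, "b" * 64],
        "source_field": ["idnumber", "customfield:sisid", "idnumber"],
        "suspended": [False, False, True],
    }
).astype(
    {
        "moodle_user_id": "Int64",
        "sis_id_hash": "string",
        "source_field": "string",
        "suspended": "boolean",
    }
)

shared = validate_mapping(demo)
print(shared)
```

Hard invariants are asserted, while shared surrogates are returned as data, which matches the guidance above that they are duplicates to reconcile and not bugs in the mapping.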

Troubleshooting

Empty mapped frame, no exception raised. The token’s service account likely lacks moodle/user:viewalldetails, so Moodle returns users with idnumber omitted or blank rather than erroring. Every user then lands in quarantined, so compare len(quarantined) with the "Retrieved N user records" log line: when the two match, suspect the capability, not the data.
RuntimeError: Moodle WS error invalidtoken. The wstoken is wrong, expired, or not attached to an external service that exposes core_user_get_users. Verify under Site administration → Server → Web services → Manage tokens and confirm the function is added to that service.
accessexception / webservicesnotenabled in the message. Web Services or the REST protocol is disabled site-wide, or the calling IP is outside the allowed range. Enable web services and the REST protocol before retrying.
Two students collapse into one surrogate. A reused idnumber (common after a re-admission or a clerical SIS edit) gives two accounts the same input and therefore the same hash; that is a data-quality issue, not a SHA-256 collision. Resolve it upstream and re-run; do not append a per-record salt, which would break cross-system determinism.
Leading zeros missing from matched IDs. Something cast the SIS ID to an integer before it reached this script (a spreadsheet round-trip is the usual culprit). Confirm the custom field is stored as text in Moodle; normalize_sis_id preserves zeros only if they survive transport.
Large payloads time out or strain the server. core_user_get_users returns the full result set in one response with no paging. For large tenants, batch by username or idnumber lists through core_user_get_users_by_field, applying the pagination strategies for bulk exports and rate-limit pacing from handling LMS API rate limits.
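
The batching suggested for large tenants might look like the sketch below. It assumes core_user_get_users_by_field accepts a field name plus an indexed values list and returns a bare list of user objects; the helper names, the injected send callable, and the batch size of 200 are all illustrative choices, not values taken from Moodle or from the script above.

```python
from collections.abc import Callable, Iterator
from typing import Any


def chunked(values: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def by_field_params(token: str, field: str, values: list[str]) -> dict[str, str]:
    """Build the REST parameters for one core_user_get_users_by_field call."""
    params = {
        "wstoken": token,
        "wsfunction": "core_user_get_users_by_field",
        "moodlewsrestformat": "json",
        "field": field,  # for example "username" or "idnumber"
    }
    for i, value in enumerate(values):
        params[f"values[{i}]"] = value
    return params


def fetch_in_batches(
    send: Callable[[dict[str, str]], Any],
    token: str,
    field: str,
    values: list[str],
    batch_size: int = 200,  # assumed; tune to what your server tolerates
) -> list[dict]:
    """Call `send` once per batch and concatenate the returned users.

    `send` takes the parameter dict and returns the parsed JSON body.
    """
    users: list[dict] = []
    for batch in chunked(values, batch_size):
        body = send(by_field_params(token, field, batch))
        # Same HTTP-200 error envelope as core_user_get_users.
        if isinstance(body, dict) and "exception" in body:
            raise RuntimeError(
                f"Moodle WS error {body.get('errorcode')}: {body.get('message')}"
            )
        users.extend(body)  # assumed shape: a bare list of user dicts
    return users
```

In the script, send could be a small wrapper such as lambda p: requests.post(endpoint, data=p, timeout=30).json(), which sends the values in the request body instead of a long query string. Injecting the sender also lets the batching logic be tested without a live Moodle site.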

Moodle Course & User Schema — the parent entity model, tables, and Web Service endpoints this procedure draws from.
Cross-LMS Student ID Mapping — turning this tokenized surrogate key into an identity shared across Canvas, Moodle, and Blackboard.
Resolving Duplicate Student IDs Across LMS Platforms — what to do when two accounts collapse into one surrogate hash.
Python Requests for LMS APIs — credential handling and resilient request patterns for the Moodle REST endpoint.
