Moodle Course & User Schema for Institutional Data Pipelines

The foundational layer of any institutional data pipeline begins with a precise understanding of the underlying LMS schema, and Moodle is the hardest of the three major platforms to model correctly because it exposes academic reality as a deeply normalized relational graph rather than a tidy API envelope. For EdTech engineers, institutional data analysts, and academic IT teams, the difficulty is not reading a row — it is that a single fact such as “this student earned this grade in this course” is scattered across half a dozen tables joined through Moodle’s polymorphic context system, and that the authoritative student identifier almost never lives where you expect it. Unlike cloud-native platforms that abstract data access through standardized REST resources, Moodle relies on a highly normalized PostgreSQL or MySQL schema (often with an institution-specific table prefix), so extracting reliable gradebook, attendance, and engagement telemetry requires direct database queries or carefully scoped Web Service calls, deterministic context resolution, and strict adherence to compliance boundaries. Treating that relational graph as a structured stream to be flattened, type-coerced, and tokenized is what aligns Moodle extraction with the broader LMS data architecture and schema mapping reference that downstream gradebook and attendance pipelines target.

This page covers the Moodle entity model and its relational shape, the exact tables and Web Service endpoints plus their paging mechanics, how raw Moodle rows map onto the canonical institutional schema, the FERPA field-level constraints specific to these entities, a production-quality reference extractor, and the Moodle-specific failure modes that only surface under real academic load.

Entity Model and Relational Schema

Moodle anchors its data model on five table families: identity (mdl_user), course structure (mdl_course, mdl_course_categories), the polymorphic mdl_context table that scopes everything, enrolment and roles (mdl_enrol, mdl_user_enrolments, mdl_role_assignments), and the gradebook (mdl_grade_items, mdl_grade_grades, mdl_grade_categories). The mdl_ prefix is configurable per install — production code must read it from config.php ($CFG->prefix) rather than hardcoding it.

The single most important structural fact for pipeline builders is that mdl_user rarely contains the authoritative student identifier. The id column is an internal surrogate key, meaningless outside the tenant; the registrar key lives in idnumber (when populated) or in a custom profile field stored in mdl_user_info_data. Reconciling that internal id against the Student Information System (SIS) key is the central join every Moodle integration depends on, treated in depth on mapping Moodle user profiles to SIS IDs and solved generically, across platforms, by cross-LMS student ID mapping.

The relational shape a Moodle extractor materializes looks like this:

Table	Key field	Foreign keys	Significant fields	Type notes
`mdl_user`	`id` (bigint)	—	`idnumber`, `username`, `email`, `institution`, `suspended`, `deleted`	`idnumber` may be empty; `deleted=1` rows retain a mangled email
`mdl_course`	`id` (bigint)	`category` → `mdl_course_categories`	`idnumber`, `shortname`, `fullname`, `startdate`, `enddate`	`idnumber` is the SIS course/section key when set
`mdl_context`	`id` (bigint)	`instanceid` (polymorphic)	`contextlevel`, `instanceid`, `path`, `depth`	`contextlevel=50` is course, `=10` system, `=70` module
`mdl_enrol`	`id` (bigint)	`courseid` → `mdl_course`	`enrol` (method), `status`, `roleid`	one row per enrolment method per course
`mdl_user_enrolments`	`id` (bigint)	`enrolid` → `mdl_enrol`, `userid` → `mdl_user`	`status`, `timestart`, `timeend`	`status=0` active, `=1` suspended
`mdl_role_assignments`	`id` (bigint)	`contextid` → `mdl_context`, `userid` → `mdl_user`, `roleid` → `mdl_role`	`roleid`, `timemodified`	join through context, not course, to scope
`mdl_grade_items`	`id` (bigint)	`courseid` → `mdl_course`	`itemtype`, `grademax`, `grademin`, `aggregationcoef`, `weightoverride`, `hidden`	`itemtype='course'` is the course total
`mdl_grade_grades`	`id` (bigint)	`itemid` → `mdl_grade_items`, `userid`	`finalgrade`, `rawgrade`, `excluded`, `hidden`	`finalgrade` may be NULL for ungraded

The structural trap is mdl_role_assignments. A naive query that joins it straight to mdl_course is wrong, because the table has no courseid column: role assignments are scoped to contexts, not courses. Moodle’s context hierarchy (system = 10, user = 30, coursecat = 40, course = 50, module = 70, block = 80) is a polymorphic key space where mdl_context.instanceid points at a different table depending on contextlevel. To list the students of a course you must join mdl_role_assignments → mdl_context and filter contextlevel = 50 AND instanceid = :courseid, then map roleid to a role shortname via mdl_role. Skip that filter and your roster inflates with site administrators, category managers, and module-level teaching assistants who hold no course-level student role.
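The context-scoped join described above can be sketched as follows. This is a minimal illustration, not Moodle's own query: it runs against an in-memory SQLite fixture with hypothetical rows, and the helper name course_students and the trimmed-down table definitions are assumptions made for the example.

```python
import sqlite3

CONTEXT_COURSE = 50  # Moodle's contextlevel value for a course


def course_students(conn, course_id, prefix="mdl_"):
    """Return user ids holding the 'student' role in the course's own context."""
    # Table names cannot be bound as parameters, so the prefix is formatted in;
    # it must come from trusted configuration, never from user input.
    sql = f"""
        SELECT ra.userid
        FROM {prefix}role_assignments ra
        JOIN {prefix}context ctx ON ctx.id = ra.contextid
        JOIN {prefix}role r ON r.id = ra.roleid
        WHERE ctx.contextlevel = ? AND ctx.instanceid = ? AND r.shortname = 'student'
        ORDER BY ra.userid
    """
    return [row[0] for row in conn.execute(sql, (CONTEXT_COURSE, course_id))]


# Toy fixture (hypothetical rows, columns trimmed to what the join needs).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mdl_role (id INTEGER PRIMARY KEY, shortname TEXT);
    CREATE TABLE mdl_context (id INTEGER PRIMARY KEY, contextlevel INTEGER, instanceid INTEGER);
    CREATE TABLE mdl_role_assignments (
        id INTEGER PRIMARY KEY, roleid INTEGER, contextid INTEGER, userid INTEGER);
    INSERT INTO mdl_role VALUES (1, 'manager'), (5, 'student');
    INSERT INTO mdl_context VALUES (1, 10, 0), (20, 50, 7), (21, 70, 7);
    INSERT INTO mdl_role_assignments VALUES
        (1, 1, 1, 100),   -- manager at the system context
        (2, 5, 20, 200),  -- student in course 7
        (3, 5, 20, 201),  -- student in course 7
        (4, 5, 21, 300);  -- role on a module whose instanceid also happens to be 7
""")
print(course_students(conn, 7))
```

The fourth assignment is the instructive one: its context shares instanceid 7 with the course but sits at contextlevel 70, so only the contextlevel filter keeps user 300 off the roster.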

Unlike the flatter, assignment-centric Canvas gradebook data structure, where assignments and submissions are addressed directly, Moodle forces every roster and grade query through this context indirection, and unlike the JSON-envelope nesting of the Blackboard REST API architecture it exposes the relationships as raw table joins you control. The normalization layer exists precisely to collapse the context graph into flat, composite-keyed rows.

Access Patterns: SQL Tables and Web Service Endpoints

Moodle offers two extraction surfaces, and a production pipeline usually mixes them: direct read-only SQL against a replica for bulk historical pulls, and the Web Services API for incremental, permission-scoped syncs. Both target the same underlying schema.

Web Service endpoints

Moodle’s external API is exposed at /webservice/rest/server.php and dispatched by a wsfunction parameter rather than by URL path. A token (provisioned per service in Site administration → Server → Web services) authenticates every call, passed as wstoken, and moodlewsrestformat=json selects the response format. The functions a course-and-user pipeline relies on:

core_user_get_users: searches users by criteria (criteria[0][key]=idnumber); returns the profile fields the token’s role can see.
core_user_get_users_by_field: batch-resolves users by id, idnumber, username, or email; the workhorse for SIS reconciliation.
core_enrol_get_enrolled_users: given courseid, returns the enrolled users with their roles already resolved through the context system; the supported way to avoid hand-writing the context join.
core_course_get_courses_by_field: returns course metadata including idnumber and term dates.
gradereport_user_get_grade_items: returns the gradebook for a course/user, with grademax, weightraw, and the already-aggregated course total.

Two paging realities differ sharply from REST platforms. First, the extraction functions do not cursor-paginate: by default core_enrol_get_enrolled_users returns the entire course in one response, so memory, not paging, is the constraint; batch by course rather than by page. Second, where paging is available (the limitfrom/limitnumber options on core_enrol_get_enrolled_users, or page/perpage on catalog search functions such as core_course_search_courses) it is offset/limit-style, not an opaque cursor, so a roster or catalog mutating mid-walk can skip or double-count rows; snapshot the course list first, then iterate. Moodle core does not emit X-RateLimit headers; throttling is typically imposed per token or per client at the web-server or reverse-proxy tier, so backpressure surfaces as HTTP 429/503 and must be handled with the same retry discipline as the API ingestion and sync workflows section prescribes for every LMS.
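One way to handle the 429/503 backpressure described above is exponential backoff with jitter. The sketch below is an assumption-laden illustration rather than a prescribed policy: HttpStatusError is a hypothetical exception type standing in for whatever your HTTP client raises, and the attempt count and base delay are placeholder values.

```python
import random
import time

RETRYABLE = {429, 503}  # statuses treated as transient backpressure


class HttpStatusError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retryable status, wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except HttpStatusError as exc:
            last_try = attempt == max_attempts - 1
            if exc.status not in RETRYABLE or last_try:
                raise  # permanent error, or retries exhausted
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Injecting sleep keeps the helper testable without real waiting; in production the default time.sleep applies. Non-retryable statuses such as 403 propagate immediately so that a revoked token is not hammered.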

Direct SQL access

For nightly bulk loads the database replica is faster and avoids token quotas. Connect through a driver that implements the Python DB-API 2.0 specification (PEP 249), with parameterized queries, explicit transactions, and pooled connections; never string-interpolate a course id. Incremental extraction keys on the timemodified / timecreated Unix-epoch columns present on most tables: pull WHERE timemodified > :last_watermark rather than running full table scans. Attendance and engagement telemetry come from mdl_logstore_standard_log (a high-volume event stream keyed on userid, courseid, eventname, and timecreated) and mdl_user_lastaccess; both need deduplication and timezone normalization (Moodle stores UTC epochs but renders in the site timezone) before ingestion. Always validate queries against the official Moodle DML documentation before deploying, because minor version upgrades can add columns, change constraints, or shift aggregation defaults.
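The watermark pattern above might look like the following sketch. It uses the standard-library sqlite3 driver only because it is a self-contained DB-API 2.0 implementation; a PostgreSQL or MySQL driver would use its own placeholder style (for example %s instead of ?). The function name changed_grades and the trimmed fixture table are assumptions for illustration.

```python
import sqlite3


def changed_grades(conn, last_watermark, prefix="mdl_"):
    """Return (rows, new_watermark) for grade rows modified after last_watermark."""
    # The watermark is bound as a parameter; only the trusted prefix is formatted in.
    sql = (
        f"SELECT id, itemid, userid, finalgrade, timemodified "
        f"FROM {prefix}grade_grades WHERE timemodified > ? "
        f"ORDER BY timemodified, id"
    )
    rows = conn.execute(sql, (last_watermark,)).fetchall()
    new_watermark = rows[-1][4] if rows else last_watermark
    return rows, new_watermark


# Toy fixture with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mdl_grade_grades (id INTEGER PRIMARY KEY, itemid INTEGER, "
    "userid INTEGER, finalgrade REAL, timemodified INTEGER)"
)
conn.executemany(
    "INSERT INTO mdl_grade_grades VALUES (?, ?, ?, ?, ?)",
    [(1, 10, 200, 8.0, 1000), (2, 10, 201, None, 2000), (3, 11, 200, 0.0, 3000)],
)
rows, watermark = changed_grades(conn, 1000)
print(len(rows), watermark)
```

Persist the returned watermark only after the batch has been committed downstream. A strict greater-than comparison can miss rows written later within the same second as the stored watermark, so some pipelines re-read a small overlap window and rely on an idempotent upsert to absorb the duplicates.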

Normalization and Transformation Logic

The job of the normalization layer is to collapse Moodle’s relational graph into the canonical institutional row — one tokenized, composite-keyed record per student-per-grade-item — that the warehouse and the downstream gradebook and attendance normalization pipelines consume.

Composite key construction. The canonical key is (course_token, item_token, student_token). The student_token derives from the resolved SIS id (idnumber, falling back to a custom profile field), never from mdl_user.id, so that the same human keys consistently across a term rollover or a multi-campus federation.
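The fallback order for the student key can be sketched as below. The profile-field shortname "sisid" is hypothetical, and the customfields shape (a list of dicts with shortname and value) is an assumption about the Web Service user payload that should be checked against your Moodle version.

```python
def resolve_sis_id(user, profile_shortname="sisid"):
    """Prefer idnumber, fall back to a custom profile field, else return None.

    The internal user id is never used as a fallback, because it is
    meaningless outside the tenant.
    """
    idnumber = (user.get("idnumber") or "").strip()
    if idnumber:
        return idnumber
    for field in user.get("customfields") or []:
        if field.get("shortname") == profile_shortname:
            value = (field.get("value") or "").strip()
            if value:
                return value
    return None  # caller should quarantine this user rather than tokenize ""


print(resolve_sis_id({"id": 42, "idnumber": "", "customfields": [
    {"shortname": "sisid", "value": "S456"}]}))
```

Returning None instead of an empty string forces the caller to make an explicit quarantine decision before any token is derived.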

Role mapping. Translate roleid into the canonical role enum by joining mdl_role for shortname and mapping student/teacher/editingteacher/manager onto your pipeline’s roles. Discard non-participating roles before aggregation, and carry the active/suspended state from mdl_user_enrolments.status so a withdrawn student does not pollute engagement denominators.
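A minimal sketch of that role mapping follows. The CanonicalRole enum and its member names are hypothetical stand-ins for whatever role vocabulary your pipeline defines; only the Moodle shortnames and the enrolment status values come from the text above.

```python
from enum import Enum


class CanonicalRole(Enum):
    """Hypothetical pipeline-side role vocabulary."""
    LEARNER = "learner"
    INSTRUCTOR = "instructor"
    ADMIN = "admin"


ROLE_MAP = {
    "student": CanonicalRole.LEARNER,
    "teacher": CanonicalRole.INSTRUCTOR,
    "editingteacher": CanonicalRole.INSTRUCTOR,
    "manager": CanonicalRole.ADMIN,
}

ENROLMENT_ACTIVE = 0  # mdl_user_enrolments.status: 0 = active, 1 = suspended


def canonical_roles(shortnames):
    """Map Moodle role shortnames to canonical roles, dropping unmapped ones."""
    return {ROLE_MAP[s] for s in shortnames if s in ROLE_MAP}


def is_active_learner(shortnames, enrolment_status):
    """True only for users who are learners and whose enrolment is active."""
    return (CanonicalRole.LEARNER in canonical_roles(shortnames)
            and enrolment_status == ENROLMENT_ACTIVE)
```

Unmapped shortnames (guest, custom institutional roles) fall out silently here; a stricter pipeline might log them so that a newly created role is noticed.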

Grade reconstruction. Raw values in mdl_grade_grades.finalgrade carry no scale of their own; the normalized percentage needs grademax/grademin from mdl_grade_items. For a single item with final grade g_i:

$p_i = \frac{g_i - \text{grademin}_i}{\text{grademax}_i - \text{grademin}_i} \times 100$

Reconstructing a course total the way Moodle’s own UI shows it means replaying its aggregation. For the common weighted-mean case the course grade is

$G = \frac{\sum_{i} w_i \, p_i}{\sum_{i} w_i}, \qquad w_i = \text{aggregationcoef}_i$

but only over items where excluded = 0, hidden = 0, and finalgrade IS NOT NULL. The aggregation method itself lives in mdl_grade_categories.aggregation (an integer enum: 0 = mean of grades, 10 = weighted mean of grades, 13 = natural/sum). Under natural aggregation the per-item weight is stored in aggregationcoef2 rather than aggregationcoef, and weightoverride = 1 signals that the weight was pinned manually instead of computed. Branching on the aggregation enum is mandatory: applying a weighted-mean formula to a natural (sum of grades) category produces plausible-but-wrong totals. Type coercion matters too: finalgrade arrives as a string or Decimal depending on the driver; coerce to float once at the boundary and treat empty strings as None.
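The two formulas and the branch on the aggregation enum can be sketched as follows. This is a simplified illustration, not a replay of Moodle's full aggregation engine: the enum values are those of Moodle's GRADE_AGGREGATE_MEAN, GRADE_AGGREGATE_WEIGHTED_MEAN, and GRADE_AGGREGATE_SUM constants, while the item dicts, the function names, and the natural-sum branch (which ignores extra credit and overridden weights) are assumptions.

```python
# Values of Moodle's GRADE_AGGREGATE_MEAN, _WEIGHTED_MEAN and _SUM constants.
AGG_MEAN, AGG_WEIGHTED_MEAN, AGG_SUM = 0, 10, 13


def percent(final, gmin, gmax):
    """Normalized percentage p_i; None when ungraded or the range is empty."""
    if final is None or gmax == gmin:
        return None
    return (final - gmin) / (gmax - gmin) * 100


def category_total(items, aggregation):
    """Aggregate one category. Each item is a dict with finalgrade, grademin,
    grademax, aggregationcoef, excluded and hidden, already coerced to numbers."""
    live = [
        i for i in items
        if not i["excluded"] and not i["hidden"]
        and percent(i["finalgrade"], i["grademin"], i["grademax"]) is not None
    ]
    if not live:
        return None  # nothing graded yet: a distinct state, not zero
    pcts = [percent(i["finalgrade"], i["grademin"], i["grademax"]) for i in live]
    if aggregation == AGG_MEAN:
        return sum(pcts) / len(pcts)
    if aggregation == AGG_WEIGHTED_MEAN:
        weights = [i["aggregationcoef"] for i in live]
        if sum(weights) == 0:
            return None
        return sum(w * p for w, p in zip(weights, pcts)) / sum(weights)
    if aggregation == AGG_SUM:
        # Simplified natural aggregation: points earned over points possible.
        earned = sum(i["finalgrade"] - i["grademin"] for i in live)
        possible = sum(i["grademax"] - i["grademin"] for i in live)
        return earned / possible * 100
    raise ValueError(f"unhandled aggregation method {aggregation}")


items = [
    {"finalgrade": 8.0, "grademin": 0.0, "grademax": 10.0, "aggregationcoef": 1.0, "excluded": 0, "hidden": 0},
    {"finalgrade": 45.0, "grademin": 0.0, "grademax": 50.0, "aggregationcoef": 3.0, "excluded": 0, "hidden": 0},
    {"finalgrade": None, "grademin": 0.0, "grademax": 20.0, "aggregationcoef": 2.0, "excluded": 0, "hidden": 0},
    {"finalgrade": 0.0, "grademin": 0.0, "grademax": 10.0, "aggregationcoef": 1.0, "excluded": 1, "hidden": 0},
]
print(category_total(items, AGG_WEIGHTED_MEAN))
```

With the sample items, only the first two survive the filters (80% and 90%). The weighted mean with weights 1 and 3 is (80 + 270) / 4 = 87.5, the plain mean is 85, and the simplified sum is 53 of 60 points, which is why the same rows give three different totals depending on the enum. Unknown enum values raise instead of guessing.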

Compliance Constraints

Every field that leaves the staging zone must pass the FERPA boundary defined by the parent architecture, and Moodle’s schema is unusually leaky because identity and academic data sit one join apart. Apply data minimization at the query, not after: request only id, idnumber, username, email, and suspended for identity work, and never SELECT * from mdl_user (which carries address, phone, last-login IP, and authentication metadata).

Field-level handling for these entities:

Tokenize before serving: the resolved SIS id (idnumber / custom field) and any mdl_user.id that could be re-joined to it. Replace each with a salted hash (written in examples as the placeholder sha256("student_id")) and keep the lookup map inside the compliance boundary only.
Drop entirely: email, firstname/lastname, city, phone1, lastip. These are directory information at best and have no place in an analytical fact table.
Pass through: course idnumber, shortname, grade items, epoch timestamps, and the tokenized keys — none re-identify a student on their own.
Audit columns to add: source_system='moodle', extracted_at, and source_row_version (from timemodified) on every emitted row so an accreditation or breach audit can trace lineage without storing raw identifiers, consistent with the conventions on the LMS CSV export format standards page. Honoring FERPA’s directory-information rules means the tokenization step, not the dashboard, is the enforcement point.
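The field-level rules above can be enforced with an allowlist, so that any column not explicitly named is dropped by default. The sketch below is illustrative only: the input column names, the PASS_THROUGH set, and the minimize helper are assumptions, and the salt would come from a secret store inside the compliance boundary rather than a literal.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical staging column names; anything not listed here is dropped.
PASS_THROUGH = {"course_idnumber", "shortname", "item_type", "percent"}
TOKENIZE = {"idnumber": "student_token"}  # raw column -> emitted token column


def minimize(row, salt, now=None):
    """Apply allowlist, tokenization, and audit columns to one staging row."""
    out = {k: v for k, v in row.items() if k in PASS_THROUGH}
    for src, dst in TOKENIZE.items():
        if row.get(src):
            out[dst] = hashlib.sha256(salt + str(row[src]).encode()).hexdigest()
    out["source_system"] = "moodle"
    out["extracted_at"] = (now or datetime.now(timezone.utc)).isoformat()
    out["source_row_version"] = row.get("timemodified")
    return out


staged = {
    "idnumber": "S123", "email": "a@example.edu", "lastip": "10.0.0.1",
    "course_idnumber": "BIO101-F24", "percent": 87.5, "timemodified": 1700010000,
}
clean = minimize(staged, salt=b"demo-salt-not-for-production")
print(sorted(clean))
```

Because the filter is an allowlist rather than a blocklist, a column added by a future Moodle upgrade stays inside the staging zone until someone deliberately approves it.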

Reference Python Implementation

The extractor below resolves the course roster through the context system, pulls each student’s gradebook, reconstructs normalized percentages, and tokenizes the SIS id before any row leaves the function. It uses the Web Services API so that the token’s role scopes visibility, and yields flat canonical rows ready for a warehouse upsert.

python

import hashlib
import os
from decimal import Decimal, InvalidOperation
from datetime import datetime, timezone
from typing import Iterator

import requests  # 2.31+

WS = "https://moodle.example.edu/webservice/rest/server.php"
TOKEN = os.environ["MOODLE_WS_TOKEN"]
SALT = os.environ["TOKEN_SALT"].encode()  # kept inside the compliance boundary


def tokenize(sis_id: str) -> str:
    """Salted, deterministic pseudonym — never store the raw idnumber downstream."""
    return "stu_" + hashlib.sha256(SALT + sis_id.strip().encode()).hexdigest()[:24]


def call(wsfunction: str, **params) -> dict | list:
    payload = {
        "wstoken": TOKEN,
        "wsfunction": wsfunction,
        "moodlewsrestformat": "json",
        **params,
    }
    resp = requests.post(WS, data=payload, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    if isinstance(data, dict) and data.get("exception"):
        raise RuntimeError(f"{wsfunction}: {data['errorcode']} {data.get('message')}")
    return data


def to_float(value) -> float | None:
    if value in (None, "", "-"):
        return None
    try:
        return float(Decimal(str(value)))
    except (InvalidOperation, ValueError):
        return None


def extract_course(course_id: int, course_idnumber: str) -> Iterator[dict]:
    extracted_at = datetime.now(timezone.utc).isoformat()
    # core_enrol_get_enrolled_users resolves roles through the context graph for us,
    # so we never hand-write the contextlevel=50 join. It returns the whole course.
    users = call("core_enrol_get_enrolled_users", courseid=course_id)
    for u in users:
        roles = {r["shortname"] for r in u.get("roles", [])}
        if "student" not in roles:           # drop teachers, managers, TAs
            continue
        sis_id = (u.get("idnumber") or "").strip()
        if not sis_id:                        # no SIS key: skip here (route to a quarantine table in production)
            continue
        student_token = tokenize(sis_id)
        items = call("gradereport_user_get_grade_items",
                     courseid=course_id, userid=u["id"])
        for report in items.get("usergrades", []):
            for gi in report.get("gradeitems", []):
                gmax = to_float(gi.get("grademax"))
                gmin = to_float(gi.get("grademin")) or 0.0
                final = to_float(gi.get("graderaw"))
                pct = None
                # Test against None explicitly: a real zero is a grade, NULL is "not yet graded".
                if final is not None and gmax is not None and gmax != gmin:
                    pct = round((final - gmin) / (gmax - gmin) * 100, 2)
                yield {
                    "course_token": tokenize(course_idnumber),
                    "item_token": tokenize(f"{course_idnumber}:{gi['id']}"),
                    "student_token": student_token,
                    "item_type": gi.get("itemtype"),
                    "percent": pct,
                    "weight": to_float(gi.get("weightraw")),
                    "source_system": "moodle",
                    "extracted_at": extracted_at,
                }

The function yields one flat, composite-keyed, tokenized row per student-per-grade-item — exactly the shape a warehouse upsert keyed on (course_token, item_token, student_token) consumes idempotently. Buffer the generator into a DataFrame and write with a merge-on-conflict statement to turn a course into a deterministic nightly sync.
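The idempotent write might be sketched as below, using SQLite's ON CONFLICT upsert as a stand-in for a warehouse merge statement and plain executemany instead of a DataFrame. The grade_fact table, its columns, and the upsert_rows helper are hypothetical; a real warehouse would use its own MERGE or ON CONFLICT dialect.

```python
import sqlite3

# Hypothetical fact table keyed on the canonical composite key.
DDL = """
CREATE TABLE IF NOT EXISTS grade_fact (
    course_token TEXT NOT NULL,
    item_token TEXT NOT NULL,
    student_token TEXT NOT NULL,
    percent REAL,
    extracted_at TEXT,
    PRIMARY KEY (course_token, item_token, student_token)
)
"""

UPSERT = """
INSERT INTO grade_fact (course_token, item_token, student_token, percent, extracted_at)
VALUES (:course_token, :item_token, :student_token, :percent, :extracted_at)
ON CONFLICT (course_token, item_token, student_token)
DO UPDATE SET percent = excluded.percent, extracted_at = excluded.extracted_at
"""


def upsert_rows(conn, rows):
    """Write canonical rows; re-running with the same keys updates in place."""
    with conn:  # commits on success, rolls back if any row fails
        conn.executemany(UPSERT, rows)


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
row = {"course_token": "stu_c1", "item_token": "stu_i1", "student_token": "stu_s1",
       "percent": 80.0, "extracted_at": "2024-01-01T00:00:00+00:00"}
upsert_rows(conn, [row])
upsert_rows(conn, [dict(row, percent=91.0)])  # same key, regraded
print(conn.execute("SELECT COUNT(*), percent FROM grade_fact").fetchone())
```

Running the sync twice leaves one row per key with the latest percent, which is what makes a nightly re-run safe after a partial failure.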

Failure Modes and Edge Cases

The breakage patterns below are specific to Moodle and rarely appear in a sandbox of a handful of users.

The contextlevel filter omission. Joining mdl_role_assignments without mdl_context.contextlevel = 50 is the single most common Moodle pipeline bug. It silently inflates rosters with system and category admins, skewing every per-course engagement and grade denominator. No exception is raised; the numbers are just wrong. Always scope role queries through context, or prefer core_enrol_get_enrolled_users, which does it for you.

Empty idnumber. Manually created accounts, bulk-imported test users, and legacy records frequently have a blank idnumber. Tokenizing an empty string collapses every such user onto one shared stu_ bucket, merging unrelated grades. Skip and quarantine empty-key rows rather than hashing "".

NULL finalgrade versus zero. An ungraded item has finalgrade = NULL; a scored-zero item has finalgrade = 0. Coalescing NULL to 0 fabricates failing grades for work that was never submitted, and arithmetic on None raises TypeError. Gate the percentage on final is not None and carry “not yet graded” as a distinct state.

Aggregation-method drift. A teacher who switches a category from “natural” to “weighted mean of grades” mid-term changes mdl_grade_categories.aggregation under your pipeline. A course total formula hardcoded to one method now disagrees with the Moodle UI. Read the aggregation enum per category and branch on it; alert when it changes between runs.
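Detecting that drift between runs can be as small as comparing two snapshots. In this sketch each snapshot is assumed to be a dict mapping a grade category id to its aggregation enum, persisted by the previous run; the function name and the snapshot shape are hypothetical.

```python
def aggregation_drift(previous, current):
    """Return {category_id: (old, new)} for categories whose method changed.

    Categories that are new in `current` are not drift and are ignored.
    """
    return {
        cid: (previous[cid], agg)
        for cid, agg in current.items()
        if cid in previous and previous[cid] != agg
    }


last_run = {101: 13, 102: 10}
this_run = {101: 10, 102: 10, 103: 0}
print(aggregation_drift(last_run, this_run))
```

A non-empty result is the signal to alert and to recompute the affected course totals under the new method rather than silently mixing the two.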

excluded and hidden flags ignored. Items with excluded = 1 or hidden = 1 are omitted from Moodle’s own course total. Summing them anyway produces totals that no instructor recognizes. Filter both flags before aggregating, exactly as Moodle does.

Table-prefix and version assumptions. Hardcoding mdl_ breaks on installs with a custom $CFG->prefix: queries fail with errors such as PostgreSQL’s relation "mdl_user" does not exist. A minor-version upgrade can also rename or re-type a column (epoch ints widening, new weightoverride semantics). Read the prefix from config and validate the result shape with a schema check at the staging edge so drift fails loudly rather than three joins downstream.
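Both guards can be sketched in a few lines. The helper names and the EXPECTED_GRADE_COLUMNS set are assumptions for illustration; the prefix is taken as an argument here and would be supplied from pipeline configuration that mirrors the value of $CFG->prefix.

```python
import re

# Columns this pipeline expects from the grade table (an illustrative subset).
EXPECTED_GRADE_COLUMNS = {
    "id", "itemid", "userid", "finalgrade", "rawgrade",
    "excluded", "hidden", "timemodified",
}


def table(prefix, name):
    """Build a prefixed table name, rejecting prefixes unsafe to format into SQL."""
    if not re.fullmatch(r"[A-Za-z0-9_]*", prefix):
        raise ValueError(f"unsafe table prefix: {prefix!r}")
    return f"{prefix}{name}"


def check_shape(rows, expected):
    """Fail loudly at the staging edge if any row lacks an expected column."""
    for n, row in enumerate(rows):
        missing = expected - set(row)
        if missing:
            raise ValueError(f"row {n}: missing columns {sorted(missing)}")
    return rows


print(table("uni2_", "grade_grades"))
```

Validating the prefix matters because table names cannot be bound as query parameters, so this is the one value that is formatted into SQL text. The shape check only verifies column presence; type checks could be layered on in the same place.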

Timezone and epoch confusion. Every time* column is a UTC Unix epoch, but the Moodle UI renders in the site timezone, so a log event that looks like “yesterday” to a registrar may belong to a different calendar day in UTC. Normalize all epochs to UTC at ingestion and attach the institutional timezone as metadata, never by mutating the stored value.
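The calendar-day mismatch is easy to demonstrate. In the sketch below the stored instant stays in UTC and the site-local date is emitted as a separate, derived field. The function name and output keys are hypothetical, and the fixed UTC-5 offset is a stand-in for the institutional timezone; a real pipeline would more likely pass a zoneinfo.ZoneInfo so that daylight-saving rules apply.

```python
from datetime import datetime, timedelta, timezone


def normalize_epoch(epoch, site_tz):
    """Keep the instant in UTC and expose the site-local calendar day as metadata."""
    utc = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return {
        "event_time_utc": utc.isoformat(),
        "utc_date": utc.date().isoformat(),
        "site_local_date": utc.astimezone(site_tz).date().isoformat(),
    }


SITE_TZ = timezone(timedelta(hours=-5))  # fixed-offset stand-in for the site timezone
event = normalize_epoch(1700010000, SITE_TZ)
print(event)
```

Epoch 1700010000 is 01:00 on 2023-11-15 in UTC but 20:00 on 2023-11-14 at UTC-5, so an attendance report grouped by UTC date and one grouped by site-local date would place this event on different days.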

A well-architected Moodle extractor turns this normalized, context-scoped telemetry into a reliable, query-ready asset that powers retention analytics, accreditation reporting, and adaptive learning interventions — without ever leaking a direct identifier past the staging boundary.

LMS data architecture and schema mapping — the reference topology and canonical schema this Moodle extractor feeds.
Mapping Moodle user profiles to SIS IDs — the deep dive on resolving idnumber and custom profile fields against the registrar key.
Canvas gradebook data structure — the flatter, assignment-centric counterpart and how its paging differs.
Blackboard REST API architecture — the JSON-envelope alternative and its UUID-keyed resource graph.
Cross-LMS student ID mapping — federating Moodle idnumber with other platforms’ identifier spaces.
