Attendance State Normalization Rules for LMS Data Pipelines

Institutional learning management systems operate as isolated data silos, each exposing student attendance through an incompatible vocabulary. Canvas records attendance through the Roll Call tool as a graded assignment with badge labels, Blackboard’s Attendance API emits enumerated meeting statuses, and Moodle’s mod_attendance plugin returns per-session status codes that an instructor defines per course. Without a deterministic transformation layer, cross-platform engagement analytics become statistically unreliable and accreditation reporting grows fragile. This page defines the canonical attendance schema, the vendor endpoints that feed it, and the rules that collapse a dozen inconsistent attendance vocabularies into one auditable state machine. It is the attendance half of the gradebook and attendance normalization layer, and it consumes identity keys produced by cross-LMS student ID mapping before any record is written to the warehouse.

Entity Model and Canonical Schema

The foundation of any robust attendance pipeline is a strict target schema with an explicit primary key. The canonical fact table — call it attendance_event — models one student’s status for one session of one course section, regardless of which platform produced it. Every record is uniquely identified by a composite key, which is what makes retroactive instructor corrections deterministic rather than duplicative.

| Field | Type | Role | Notes |
| --- | --- | --- | --- |
| `student_token` | `char(64)` | composite key | Salted SHA-256 of the institutional `student_id`; never the raw identifier |
| `course_section_id` | `bigint` | composite key | Foreign key to the section dimension |
| `session_ts` | `timestamptz` | composite key | Session start, stored UTC, RFC 3339 on the wire |
| `source_system` | `enum` | composite key | `canvas` \| `moodle` \| `blackboard` |
| `canonical_state` | `enum` | payload | One of the five canonical states below |
| `raw_state` | `text` | provenance | The verbatim vendor label, preserved for audit |
| `modality` | `enum` | payload | `in_person` \| `remote` \| `async` |
| `minutes_attended` | `int` | auxiliary | Telemetry, never overrides `canonical_state` |
| `recorded_at` | `timestamptz` | audit | When the pipeline observed this version |
| `version` | `int` | audit | Monotonic per composite key for correction history |

The canonical_state enumeration is the heart of the contract. Most institutional warehouses converge on a five-state model — PRESENT, ABSENT, LATE, EXCUSED, and UNEXCUSED — with UNRECORDED as the implicit pre-mark state. The source_system column is part of the key, not merely a tag: the same physical student can appear in two platforms during a migration term, and folding them prematurely destroys the audit trail that academic appeals depend on.

The canonical five-state model can be expressed directly as a state machine. The transitions encode the legal lifecycle of an attendance record (instructor marks, grace-window arrivals, and retroactive corrections), and the state machine doubles as a test oracle when validating the normalization layer.
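
One way to make that lifecycle testable is to encode it as a transition table. The sketch below is illustrative only: the specific edges in `LEGAL_TRANSITIONS` are assumptions chosen to match the three lifecycle events named above, not a transition set defined by this page, and `is_legal` is a hypothetical helper name.

```python
from enum import Enum


class State(str, Enum):
    UNRECORDED = "UNRECORDED"  # implicit pre-mark state
    PRESENT = "PRESENT"
    ABSENT = "ABSENT"
    LATE = "LATE"
    EXCUSED = "EXCUSED"
    UNEXCUSED = "UNEXCUSED"


# Assumed edges for illustration; adjust them to institutional policy.
LEGAL_TRANSITIONS: dict[State, frozenset[State]] = {
    # Instructor marks an unmarked session.
    State.UNRECORDED: frozenset({State.PRESENT, State.ABSENT, State.LATE}),
    # Grace-window arrival (ABSENT -> LATE) and retroactive corrections.
    State.ABSENT: frozenset({State.LATE, State.PRESENT, State.EXCUSED, State.UNEXCUSED}),
    State.LATE: frozenset({State.PRESENT, State.ABSENT, State.EXCUSED}),
    State.PRESENT: frozenset({State.LATE, State.ABSENT}),
    State.EXCUSED: frozenset({State.ABSENT, State.UNEXCUSED}),
    State.UNEXCUSED: frozenset({State.ABSENT, State.EXCUSED}),
}


def is_legal(old: State, new: State) -> bool:
    """Return True when old -> new is an allowed edge; re-marking the same state is a no-op."""
    return old == new or new in LEGAL_TRANSITIONS.get(old, frozenset())
```

Used as a test oracle, every state change the normalization layer emits for a composite key can be checked with `is_legal` before the new version is written.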

API Endpoints and Request Patterns

Attendance lives in a different place on every platform, and none of the three expose it through the same primitive. Extraction code must encode each vendor’s request shape, pagination dialect, and rate-limit behavior explicitly rather than assuming a shared pattern. The same discipline that governs general API rate limit handling and pagination strategies for bulk exports applies here, with attendance-specific endpoints.

Canvas has no first-class attendance resource. The Roll Call LTI writes attendance back into the gradebook as a graded assignment, so most pipelines read it through GET /api/v1/courses/:course_id/assignments/:assignment_id/submissions with Authorization: Bearer <token>. Canvas paginates with RFC 5988 Link headers (rel="next"), and the throttle signal is the X-Rate-Limit-Remaining cost bucket — back off before it reaches zero. The raw badge labels (present, late, absent) live in the Roll Call submission metadata, not the numeric score, so reading the score alone silently discards state.
Blackboard Learn exposes a dedicated Attendance API. GET /learn/api/public/v1/courses/{courseId}/meetings lists meetings; GET /learn/api/public/v1/courses/{courseId}/meetings/{meetingId}/users returns per-user records whose status field is one of Present, Late, Absent, or Excused. Authentication is a three-legged OAuth2 bearer token from /learn/api/public/v1/oauth2/token; pagination uses an opaque offset/limit cursor returned in paging.nextPage.
Moodle surfaces attendance through the mod_attendance plugin’s web services. mod_attendance_get_sessions returns sessions for an attendance instance, and per-session status codes (acronym values such as P, L, A, E) are instructor-defined per course, which is the single largest source of normalization ambiguity. Calls go to /webservice/rest/server.php with a wstoken, wsfunction, and moodlewsrestformat=json; there is no cursor, so callers page by session ID range.

Because Moodle status acronyms are not fixed, the pipeline must fetch each course’s status definitions through mod_attendance_get_session and build a per-course translation map before normalizing events. Hard-coding A → ABSENT is a latent bug: one department may define A as “Authorized absence.”
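
A minimal sketch of such a per-course map follows. It assumes each status definition arrives as a dict with `acronym` and `description` keys and that the institution maintains its own `DESCRIPTION_TO_STATE` table; both the payload shape and the table contents are assumptions for illustration, and `build_course_map` is a hypothetical helper.

```python
# Hypothetical institutional table: normalized description -> canonical state.
DESCRIPTION_TO_STATE = {
    "present": "PRESENT",
    "late": "LATE",
    "absent": "ABSENT",
    "excused": "EXCUSED",
    "authorized absence": "EXCUSED",
}


def build_course_map(statuses: list[dict]) -> tuple[dict[str, str], list[str]]:
    """Map each course-defined acronym to a canonical state via its description.

    Returns (course_map, unmapped_acronyms). Unmapped acronyms are left out of
    the map so that events using them are quarantined instead of guessed.
    """
    course_map: dict[str, str] = {}
    unmapped: list[str] = []
    for status in statuses:
        acronym = str(status["acronym"]).strip().lower()
        description = str(status["description"]).strip().lower()
        state = DESCRIPTION_TO_STATE.get(description)
        if state is None:
            unmapped.append(acronym)
        else:
            course_map[acronym] = state
    return course_map, unmapped
```

Keying on the description rather than the letter is what lets a course that defines A as “Authorized absence” resolve to EXCUSED instead of ABSENT.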

Normalization and Transformation Logic

Normalization maps a verbatim vendor payload to one canonical_state using a deterministic lookup that prioritizes explicit status labels over inferred numeric values. When a payload carries composite metadata — a status: "tardy" flag alongside a minutes_attended: 12 metric — the engine applies a strict precedence: the primary state label wins, and the secondary telemetry is archived in minutes_attended rather than overwriting canonical_state. This precedence model is what prevents silent corruption when a vendor introduces an undocumented variant or deprecates a legacy field.

Type coercion is mandatory at the boundary. Vendor labels arrive with inconsistent casing and whitespace ("Present", " present ", "PRESENT"), so every raw label is lowercased and stripped before lookup, and any label with no mapping is routed to a quarantine queue rather than being coerced to a default. The composite key is constructed identically for all three sources, so an initial present and a later correction from the same source system for the same student, section, and session resolve to the same key as successive versions rather than as two unrelated rows. Records from different source systems stay on separate keys, because source_system is part of the key.
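
The quarantine rule can be sketched as a simple partition step. This assumes raw records are dicts with a `status` key and that the quarantine queue can be modeled as a plain list; `partition` is a hypothetical helper name.

```python
def partition(records: list[dict], label_table: dict[str, str]) -> tuple[list[dict], list[dict]]:
    """Split raw records into (normalized, quarantined) without default-coercing."""
    normalized: list[dict] = []
    quarantined: list[dict] = []
    for record in records:
        # Lowercase and strip the label for lookup only; the record itself is untouched.
        label = str(record.get("status", "")).strip().lower()
        state = label_table.get(label)
        if state is None:
            quarantined.append(record)  # unmapped label: hold for review
        else:
            normalized.append({**record, "canonical_state": state})
    return normalized, quarantined
```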

Hybrid and asynchronous instruction introduce structural ambiguity that the rules must address explicitly. A learner who completes a synchronous lab remotely but misses the physical room should not collapse to a blanket ABSENT. Instead the engine reads the modality flag and records PRESENT as the canonical_state with modality = remote, which keeps the record inside the five-state enumeration, while asynchronous content consumption maps to modality = async only when the access timestamp falls inside the session’s valid participation window. Temporal resolution is therefore part of normalization, not an afterthought: every session_ts is parsed as RFC 3339 and converted to UTC before the window comparison, so institutions spanning multiple time zones do not misclassify a late arrival as an absence.
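
The window comparison might look like the following sketch. It assumes the window bounds are supplied as RFC 3339 strings and are inclusive; `parse_utc` and `in_participation_window` are hypothetical helper names.

```python
from datetime import datetime, timezone


def parse_utc(value: str) -> datetime:
    """Parse an RFC 3339 timestamp and convert it to UTC; naive values are rejected."""
    # datetime.fromisoformat() accepts a trailing "Z" only on newer Python
    # versions, so rewrite it as an explicit offset first.
    if value.endswith(("Z", "z")):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp {value!r} has no UTC offset")
    return parsed.astimezone(timezone.utc)


def in_participation_window(access_ts: str, window_start: str, window_end: str) -> bool:
    """True when the access time falls inside [window_start, window_end], compared in UTC."""
    return parse_utc(window_start) <= parse_utc(access_ts) <= parse_utc(window_end)
```

Because all three values pass through `parse_utc`, an access logged with a campus-local offset is compared correctly against a window stored in UTC.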

When a LATE record must contribute a numeric participation coefficient to a downstream grade, the engine derives it from attended minutes rather than a flat penalty, so that the value flows cleanly into the weighted grade calculation engines without floating-point drift:

$c_{\text{part}} = \min\!\left(1,\; \frac{m_{\text{attended}}}{m_{\text{session}}}\right)$

where $m_{\text{attended}}$ is minutes_attended and $m_{\text{session}}$ is the scheduled session length. The coefficient is computed in decimal arithmetic and stored alongside the canonical state, never in place of it.
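
As a worked sketch of the formula using Python's `decimal` module: the function name and the input guards are assumptions added for illustration, since the formula itself only specifies the upper clamp.

```python
from decimal import Decimal


def participation_coefficient(minutes_attended: int, session_minutes: int) -> Decimal:
    """Compute min(1, m_attended / m_session) in decimal arithmetic."""
    if session_minutes <= 0:
        raise ValueError("session_minutes must be positive")
    if minutes_attended < 0:
        raise ValueError("minutes_attended must not be negative")
    ratio = Decimal(minutes_attended) / Decimal(session_minutes)
    return min(Decimal(1), ratio)
```

For example, a student who attended 12 minutes of a 50-minute session gets a coefficient of 0.24, and one who stayed 60 minutes is clamped to 1.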

Compliance Constraints

Attendance records are education records under FERPA, so the schema enforces field-level data minimization before any row reaches the warehouse. The raw institutional student_id is never persisted in the fact table; it is tokenized to student_token at the ingestion boundary, mirroring the FERPA tokenization boundary that governs the rest of the platform. Only the salted, hashed token, the section key, and the de-identified state cross into analytics; the reverse mapping lives in a separately access-controlled identity vault.

Three field-level rules are non-negotiable for this entity. First, student_token must be a salted SHA-256 hash — an unsalted hash of a small, enumerable student-ID space is trivially reversible and does not satisfy de-identification. Second, free-text justification notes from excused-absence workflows (medical, disciplinary) are categorically excluded from the fact table; they may reference protected health or conduct information and belong only in the source system. Third, every row carries recorded_at and version audit columns so that a retroactive change from ABSENT to EXCUSED is reconstructable for an appeal without exposing who requested it. These audit columns satisfy the data-provenance expectations that accreditation reviews place on attendance-derived metrics.
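
The first two rules can be enforced with an allow-list at the ingestion boundary. The sketch below is an assumption-laden illustration: `ALLOWED_FIELDS`, `minimize`, and the `excuse_note` field used in the usage note are hypothetical names, and the salt is passed in explicitly rather than read from configuration.

```python
import hashlib

# Hypothetical allow-list of fields permitted to cross into the warehouse.
ALLOWED_FIELDS = frozenset({
    "course_section_id", "session_ts", "source_system",
    "canonical_state", "raw_state", "modality", "minutes_attended",
})


def minimize(record: dict, salt: bytes) -> dict:
    """Tokenize the student ID and drop every field outside the allow-list."""
    if not salt:
        raise ValueError("a non-empty salt is required")
    token = hashlib.sha256(salt + str(record["student_id"]).encode()).hexdigest()
    # Anything not explicitly allowed (raw IDs, free-text notes) is discarded.
    row = {key: value for key, value in record.items() if key in ALLOWED_FIELDS}
    row["student_token"] = token
    return row
```

An allow-list fails closed: a new free-text field such as `excuse_note` introduced upstream is dropped by default instead of leaking into the fact table.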

Reference Python Implementation

The following extractor-normalizer demonstrates tokenization, the per-source label map, deterministic precedence, and composite-key construction. It is deliberately framework-light so it can drop into a pandas batch job or an orchestrated task. Note that the student identifier is hashed immediately and the raw value is never returned.

```python
import hashlib
import os
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum

SALT = os.environ["ATTENDANCE_HASH_SALT"].encode()  # rotated, never in code

class State(str, Enum):
    PRESENT = "PRESENT"
    ABSENT = "ABSENT"
    LATE = "LATE"
    EXCUSED = "EXCUSED"
    UNEXCUSED = "UNEXCUSED"

# Per-source label maps. Moodle's map is built per course at runtime from
# mod_attendance status definitions; this is only the institutional default.
LABEL_MAP: dict[str, dict[str, State]] = {
    "canvas":     {"present": State.PRESENT, "late": State.LATE, "absent": State.ABSENT},
    "blackboard": {"present": State.PRESENT, "late": State.LATE,
                   "absent": State.ABSENT, "excused": State.EXCUSED},
    "moodle":     {"p": State.PRESENT, "l": State.LATE,
                   "a": State.ABSENT, "e": State.EXCUSED},
}

@dataclass(frozen=True)
class AttendanceEvent:
    student_token: str
    course_section_id: int
    session_ts: datetime
    source_system: str
    canonical_state: State
    raw_state: str
    minutes_attended: int | None

def tokenize(student_id: str) -> str:
    """Salted SHA-256: FERPA-safe, irreversible without the salt."""
    return hashlib.sha256(SALT + student_id.encode()).hexdigest()

def normalize(record: dict, source: str, course_map: dict[str, State] | None = None) -> AttendanceEvent:
    verbatim = str(record["status"])  # kept untouched for the audit column
    raw = verbatim.strip().lower()    # lookup key only; map keys must be lowercase
    # Use the per-course map whenever one is supplied, even if it is empty, so an
    # empty Moodle map quarantines everything instead of falling back to defaults.
    table = course_map if course_map is not None else LABEL_MAP[source]
    if raw not in table:
        raise ValueError(f"unmapped {source} status {raw!r} -> quarantine")  # never default-coerce
    ts = datetime.fromisoformat(record["session_ts"])
    if ts.tzinfo is None:
        # A naive timestamp would silently be read as local time; reject it.
        raise ValueError(f"naive session_ts {record['session_ts']!r}: explicit offset required")
    ts = ts.astimezone(timezone.utc)
    return AttendanceEvent(
        student_token=tokenize(record["student_id"]),
        course_section_id=int(record["section_id"]),
        session_ts=ts,
        source_system=source,
        canonical_state=table[raw],
        raw_state=verbatim,
        minutes_attended=record.get("minutes_attended"),
    )

def composite_key(e: AttendanceEvent) -> tuple[str, int, str, str]:
    """Upsert key: (student_token, course_section_id, session_ts in UTC, source_system)."""
    return (e.student_token, e.course_section_id, e.session_ts.isoformat(), e.source_system)
```

The composite key returned by composite_key is what the warehouse upserts on, so an interrupted sync that re-emits the same session resolves to the existing key and version instead of producing a duplicate row.
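
The interaction between idempotent replays and versioned corrections can be sketched with an in-memory stand-in for the warehouse. Everything here is an assumption for illustration: a real warehouse would use its own merge statement, and `upsert` and `Store` are hypothetical names.

```python
# In-memory stand-in for the warehouse: composite key -> ordered list of versions.
Store = dict[tuple, list[dict]]


def upsert(store: Store, key: tuple, canonical_state: str) -> int:
    """Append a new version only when the state changed; return the current version number."""
    history = store.setdefault(key, [])
    if history and history[-1]["canonical_state"] == canonical_state:
        return history[-1]["version"]  # replayed sync: no new row
    version = len(history) + 1
    history.append({"version": version, "canonical_state": canonical_state})
    return version
```

Replaying the same state leaves the history unchanged, while a retroactive correction appends a new version and keeps the prior state available for the appeal trail.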

Failure Modes and Edge Cases

Attendance pipelines break in vendor-specific ways that generic ETL testing rarely surfaces.

Moodle acronym collisions. Because status acronyms are course-defined, the same letter means different things across departments. Always build the per-course translation map from mod_attendance_get_session before normalizing, and quarantine any acronym absent from that map.
Canvas Roll Call score-only reads. Reading the numeric submission score discards the badge label, so late and present both surface as full credit. Read the Roll Call metadata, not the score.
Retroactive corrections as blind overwrites. When an instructor changes ABSENT to EXCUSED weeks later, a naive upsert erases history. Emit a new version row keyed on the composite key and keep the prior state for the appeal trail.
Time-zone misclassification. A session stored in campus-local time without an offset will shift a grace-window arrival across the late threshold once parsed as UTC. Require an explicit offset and reject naive timestamps at the boundary.
Blackboard 429 and token expiry. The Attendance API shares the tenant rate ceiling; a bulk meeting sweep will trip 429 Too Many Requests mid-pagination, and three-legged tokens expire silently. Apply the same retry-with-backoff discipline used for error and retry logic in sync jobs, and refresh the token before resuming the cursor.
Null and partial sessions. A session created but never marked returns no per-user rows; treat absence of a record as UNRECORDED, not ABSENT, so an unmarked roster does not fabricate absences.
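
The unmarked-session rule in the last item can be sketched as a roster fill. This assumes the roster is available as a list of student tokens and the marked rows as a token-to-state mapping; `fill_unrecorded` is a hypothetical helper.

```python
def fill_unrecorded(roster: list[str], marked: dict[str, str]) -> dict[str, str]:
    """Return a state for every rostered token, using UNRECORDED where no mark exists."""
    return {token: marked.get(token, "UNRECORDED") for token in roster}
```

A session that was created but never marked then yields UNRECORDED for the whole roster, so no absences are fabricated.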

Gradebook & Attendance Normalization — the parent normalization layer this attendance schema belongs to.
Weighted Grade Calculation Engines — consumes normalized attendance as a participation coefficient.
Cross-LMS Student ID Mapping — produces the identity keys tokenized into student_token.
Handling Canvas API Rate Limits — throttle handling for the Roll Call submission reads.
Error & Retry Logic for Sync Jobs — backoff patterns for 429s and token expiry during attendance sweeps.
