Observability Standard

v1.0 — 2026-06-06 — added from expert review (item 14: Majors, Observability Engineering).

Tracing (the backbone)

OpenTelemetry is the org standard for traces and metrics in every service — wire it in ExpertGroup.Core hosting (AddExpertGroupTelemetry() extension) so services get it by configuration, not per-repo code.
W3C Trace Context propagates everywhere: incoming HTTP/GraphQL → outgoing HTTP → RabbitMQ messages (Core.Ipc carries traceparent in message headers). A request must be followable across every service it touches.
Trace ID in every log line — ILogger scopes carry it automatically once OTel is wired; a log entry that can't be joined to its trace is half-useless.
Export via OTLP; backend is Azure Monitor/Application Insights (the Azure-native choice) unless an ADR says otherwise.
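The propagation rule above can be made concrete with a small, language-neutral sketch of what travels in the `traceparent` header. It is written in Python with the standard library only, purely for illustration: in the services themselves the OpenTelemetry SDK and Core.Ipc are expected to do this work. The function names `parse_traceparent` and `outgoing_headers` are hypothetical, and the sketch accepts only the version `00` header layout.

```python
import re
import secrets

# W3C Trace Context, version 00: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(value):
    """Return (trace_id, parent_span_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # The spec forbids all-zero trace ids and span ids.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, flags


def outgoing_headers(incoming_headers):
    """Build the headers for an outgoing HTTP call or queue message.

    Assumes header names were already lower-cased by the caller.
    """
    parsed = parse_traceparent(incoming_headers.get("traceparent", ""))
    if parsed is None:
        # No valid context arrived, so this service starts a new sampled trace.
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, _parent_span_id, flags = parsed
    # Every hop gets a fresh span id; the trace id stays constant end to end.
    span_id = secrets.token_hex(8)
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
```

The point of the design is that the trace id is copied unchanged across HTTP and RabbitMQ hops while each hop mints its own span id, which is what lets a backend stitch the request back together.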

Emit one wide structured event per request per service (the enriched request log/root span) rather than dozens of fragmentary log lines.
Enrich with high-cardinality dimensions: tenant, actor/user id, build/version, entity ids, GraphQL operation name. High cardinality is what makes unknown-unknowns debuggable — don't strip it to save storage.
Structured logging only (message templates, never interpolation — already binding in the C# standard); never log secrets, tokens, or personal data (Security standard applies).
Steady State (item 9, accepted 2026-06-06): logs and telemetry must not grow without bound — file logs rotate, telemetry backends get explicit retention settings, in-process caches have eviction policies. Unbounded growth is a design defect, not an ops surprise.
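As a rough illustration of the "one wide event per request" idea, the sketch below accumulates dimensions while a request is handled and writes a single JSON line at the end. It is a language-neutral Python sketch, not the ExpertGroup.Core implementation; the class name `WideEvent` and the field names (`tenant`, `graphql_operation`, and so on) are assumptions chosen to mirror the dimensions listed above.

```python
import json
import sys
import time


class WideEvent:
    """Collects dimensions during one request and emits them as one JSON line."""

    def __init__(self, trace_id, service, version):
        self._start = time.monotonic()
        self.fields = {"trace_id": trace_id, "service": service, "version": version}

    def add(self, **dimensions):
        # Handlers call this as they learn things (tenant, entity ids, operation name).
        # Secrets, tokens, and personal data must never be passed in here.
        self.fields.update(dimensions)

    def emit(self, status, stream=None):
        # Called exactly once, when the request finishes.
        self.fields["status"] = status
        self.fields["duration_ms"] = round((time.monotonic() - self._start) * 1000, 3)
        line = json.dumps(self.fields, sort_keys=True)
        (stream or sys.stdout).write(line + "\n")
        return line
```

A request handler might use it like this, with all values being made-up examples:

```python
event = WideEvent(trace_id="4bf92f3577b34da6a3ce929d0e0e4736", service="orders", version="1.4.2")
event.add(tenant="acme", actor_id="u-1029", graphql_operation="GetOrder", order_id="o-77")
event.emit(status=200)
```

Because every dimension lands on the same record, a single query can slice by tenant, build, or operation without joining fragmentary log lines.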

Health checks: liveness + readiness endpoints (MapHealthChecks), with readiness covering critical dependencies (DB, queue).
Standard metrics: request rate, duration, and error rate (the OTel ASP.NET Core instrumentation provides these for free), plus queue depth and consumer lag for Ipc consumers.
Dashboards and alerting rules are deferred to a future revision, to be written once tracing is live (expert-review item 15: alert on SLOs and symptoms).
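The difference between the liveness and readiness probes can be sketched as follows. This is a language-neutral Python illustration under stated assumptions: in the .NET services the built-in health check middleware behind `MapHealthChecks` would do the equivalent, and the function names and the `checks` mapping here are hypothetical.

```python
def liveness():
    """Liveness only reports that the process is up; this sketch checks no dependencies."""
    return 200, {"status": "alive"}


def readiness(checks):
    """Run each dependency check; any failure marks the service as not ready.

    `checks` maps a dependency name to a callable that raises on failure.
    """
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # a failing dependency must not crash the probe
            results[name] = f"failed: {type(exc).__name__}"
    ready = all(value == "ok" for value in results.values())
    body = {"status": "ready" if ready else "not_ready", "checks": results}
    return (200 if ready else 503), body
```

Keeping the two probes separate means a broken database or queue takes the instance out of rotation through readiness without prompting the orchestrator to restart a process that is otherwise healthy.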

Any production incident with customer impact gets a short written postmortem within a week: timeline, root cause, what went well/badly, action items each with a named owner. Template: templates/POSTMORTEM-template.md.
Blameless: the analysis targets systems and process, never individuals — "why did the system allow this" beats "who did this".
Minimal ceremony at current team size: one page, written async; a meeting only for major incidents. Postmortems are revisited until their action items close.