Observability Standard

v1.0 — 2026-06-06 — added from expert review (item 14: Majors, Observability Engineering).

Tracing (the backbone)

  • OpenTelemetry is the org standard for traces and metrics in every service — wire it in ExpertGroup.Core hosting (AddExpertGroupTelemetry() extension) so services get it by configuration, not per-repo code.
  • W3C Trace Context propagates everywhere: incoming HTTP/GraphQL → outgoing HTTP → RabbitMQ messages (Core.Ipc carries traceparent in message headers). A request must be followable across every service it touches.
  • Trace ID in every log line: ILogger scopes carry it automatically once OTel is wired; a log entry that can't be joined to its trace is half-useless.
  • Export via OTLP; backend is Azure Monitor/Application Insights (the Azure-native choice) unless an ADR says otherwise.
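
As a rough illustration of the message-header propagation described above, the sketch below copies the current W3C traceparent into an outgoing header dictionary and restores it on the consumer side. It uses only System.Diagnostics from the base class library. The TraceHeaders class, its method names, and the dictionary-shaped headers are assumptions made for the example; the actual Core.Ipc API is not shown in this standard and may look different. tracestate is left out for brevity.

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

// Hypothetical helper for illustration only; not the real Core.Ipc API.
public static class TraceHeaders
{
    public const string TraceParent = "traceparent";

    // Publisher side: copy the current span's W3C id into the outgoing headers.
    public static void Inject(IDictionary<string, object> headers)
    {
        var activity = Activity.Current;
        if (activity?.IdFormat == ActivityIdFormat.W3C && activity.Id is { } id)
        {
            headers[TraceParent] = id; // "00-<trace-id>-<span-id>-<flags>"
        }
    }

    // Consumer side: start a span whose parent is the publisher's span.
    // Returns null when nothing is listening to the ActivitySource.
    public static Activity? StartConsumer(
        ActivitySource source, string name, IDictionary<string, object> headers)
    {
        ActivityContext parent = default;
        if (headers.TryGetValue(TraceParent, out var raw))
        {
            // Assumption: the broker client may hand string headers back as UTF-8 bytes.
            string? value = raw switch
            {
                string s => s,
                byte[] b => Encoding.UTF8.GetString(b),
                _ => null,
            };
            if (value is not null)
            {
                ActivityContext.TryParse(value, null, out parent);
            }
        }
        return source.StartActivity(name, ActivityKind.Consumer, parent);
    }
}
```

A publisher would call TraceHeaders.Inject(headers) just before sending, and a consumer would wrap message handling in using var span = TraceHeaders.StartConsumer(source, "consume", headers); so that both sides share one trace ID.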

Wide, structured events

  • Emit one wide structured event per request per service (the enriched request log/root span) rather than dozens of fragmentary log lines.
  • Enrich with high-cardinality dimensions: tenant, actor/user id, build/version, entity ids, GraphQL operation name. High cardinality is what makes unknown-unknowns debuggable — don't strip it to save storage.
  • Structured logging only (message templates, never interpolation — already binding in the C# standard); never log secrets, tokens, or personal data (Security standard applies).
  • Steady State (item 9, accepted 2026-06-06): logs and telemetry must not grow without bound — file logs rotate, telemetry backends get explicit retention settings, in-process caches have eviction policies. Unbounded growth is a design defect, not an ops surprise.
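
One possible way to enrich the root span and keep logging structured is sketched below. It assumes an ASP.NET Core service where Microsoft.Extensions.Logging is available. The WideEvent class, its parameters, and the attribute names (tenant.id, actor.id, graphql.operation.name) are illustrative choices, not names mandated by this standard.

```csharp
using System.Diagnostics;
using Microsoft.Extensions.Logging;

// Hypothetical helper for illustration; attribute names are assumptions.
public static class WideEvent
{
    public static void Enrich(
        ILogger logger, string tenantId, string actorId, string operationName)
    {
        // Walk up to this service's local root span.
        // Activity.Parent is null when the parent lives in another process.
        var root = Activity.Current;
        while (root?.Parent is not null)
        {
            root = root.Parent;
        }

        // High-cardinality dimensions go on the one wide event for the request.
        root?.SetTag("tenant.id", tenantId);
        root?.SetTag("actor.id", actorId);
        root?.SetTag("graphql.operation.name", operationName);

        // Message template: values are captured as named properties.
        // The interpolated form $"Handled {operationName}" would flatten them into text.
        logger.LogInformation(
            "Handled {Operation} for tenant {TenantId}", operationName, tenantId);
    }
}
```

Tagging the local root rather than whichever child span happens to be current keeps all request dimensions on a single event, which is what makes the trace queryable by tenant or actor later.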

Baseline signals per service

  • Health checks: liveness + readiness endpoints (MapHealthChecks), with readiness covering critical dependencies (DB, queue).
  • Standard metrics: request rate/duration/error rate (OTel ASP.NET Core instrumentation gives these free), plus queue depth/consumer lag for Ipc consumers.
  • Dashboards and alerting rules: deferred to a future revision (expert-review item 15 — alert on SLOs/symptoms) once tracing is live.
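
The liveness/readiness split might be wired roughly as follows in a minimal ASP.NET Core Program.cs. The endpoint paths, the "ready" tag, and the placeholder checks are assumptions for the sketch; real readiness checks would probe the actual database and queue instead of returning Healthy unconditionally.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Placeholder checks; replace the lambdas with real DB and queue probes.
builder.Services.AddHealthChecks()
    .AddCheck("database", () => HealthCheckResult.Healthy(), tags: new[] { "ready" })
    .AddCheck("queue", () => HealthCheckResult.Healthy(), tags: new[] { "ready" });

var app = builder.Build();

// Liveness: runs no checks, so it only proves the process can answer HTTP.
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false,
});

// Readiness: runs only the checks tagged "ready" (critical dependencies).
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
});

app.Run();
```

Keeping dependency probes out of the liveness endpoint avoids restarting a healthy process just because the database or broker is briefly unavailable.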

Incident postmortems (item 18, accepted 2026-06-06)

  • Any production incident with customer impact gets a short written postmortem within a week: timeline, root cause, what went well/badly, action items each with a named owner. Template: templates/POSTMORTEM-template.md.
  • Blameless: the analysis targets systems and process, never individuals — "why did the system allow this" beats "who did this".
  • Minimal ceremony at current team size: one page, written async; a meeting only for major incidents. Postmortems are revisited until their action items close.

Sources: Observability Engineering · .NET OpenTelemetry · W3C Trace Context