Observability Standard
v1.0 — 2026-06-06 — added from expert review (item 14: Majors, Observability Engineering).
Tracing (the backbone)
- OpenTelemetry is the org standard for traces and metrics in every service. Wire it in ExpertGroup.Core hosting (the AddExpertGroupTelemetry() extension) so services get it by configuration, not per-repo code.
- W3C Trace Context propagates everywhere: incoming HTTP/GraphQL → outgoing HTTP → RabbitMQ messages (Core.Ipc carries traceparent in message headers). A request must be followable across every service it touches.
- Trace ID in every log line: ILogger scopes carry it automatically once OTel is wired; a log entry that can't be joined to its trace is half-useless.
- Export via OTLP; backend is Azure Monitor/Application Insights (the Azure-native choice) unless an ADR says otherwise.
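The traceparent hand-off across a message queue can be sketched with the BCL alone. This is a minimal illustration, not the Core.Ipc implementation: the TraceHeaders class, the Demo.Ipc source name, and the string-valued header dictionary are assumptions made for the example (RabbitMQ.Client exposes headers as IDictionary<string, object>, so a real adapter would convert values).

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical source name; a listener is required or StartActivity returns null.
var source = new ActivitySource("Demo.Ipc");
using var listener = new ActivityListener
{
    ShouldListenTo = _ => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
};
ActivitySource.AddActivityListener(listener);

// Producer side: start a span and copy its context into the message headers.
var headers = new Dictionary<string, string>();
string producerTraceId;
using (var publish = source.StartActivity("publish", ActivityKind.Producer))
{
    TraceHeaders.Inject(publish, headers);
    producerTraceId = publish!.TraceId.ToString();
}

// Consumer side: rebuild the parent context so the consumer span joins the same trace.
TraceHeaders.TryExtract(headers, out var parent);
using var consume = source.StartActivity("consume", ActivityKind.Consumer, parent);
Console.WriteLine(consume!.TraceId.ToString() == producerTraceId);

// Hypothetical helper, standing in for whatever Core.Ipc does internally.
public static class TraceHeaders
{
    public const string TraceParent = "traceparent"; // W3C header name
    public const string TraceState = "tracestate";   // W3C header name

    public static void Inject(Activity? activity, IDictionary<string, string> headers)
    {
        if (activity is null || activity.IdFormat != ActivityIdFormat.W3C) return;
        headers[TraceParent] = activity.Id!; // "00-<trace-id>-<span-id>-<flags>"
        if (!string.IsNullOrEmpty(activity.TraceStateString))
            headers[TraceState] = activity.TraceStateString!;
    }

    public static bool TryExtract(
        IReadOnlyDictionary<string, string> headers, out ActivityContext parent)
    {
        parent = default;
        if (!headers.TryGetValue(TraceParent, out var traceParent)) return false;
        headers.TryGetValue(TraceState, out var traceState);
        return ActivityContext.TryParse(traceParent, traceState, out parent);
    }
}
```

Passing the extracted ActivityContext as the parent is what keeps the consumer span in the producer's trace; without it the consumer would start a new, unrelated trace.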
Wide, structured events
- Emit one wide structured event per request per service (the enriched request log/root span) rather than dozens of fragmentary log lines.
- Enrich with high-cardinality dimensions: tenant, actor/user id, build/version, entity ids, GraphQL operation name. High cardinality is what makes unknown-unknowns debuggable — don't strip it to save storage.
- Structured logging only (message templates, never interpolation — already binding in the C# standard); never log secrets, tokens, or personal data (Security standard applies).
- Steady State (item 9, accepted 2026-06-06): logs and telemetry must not grow without bound — file logs rotate, telemetry backends get explicit retention settings, in-process caches have eviction policies. Unbounded growth is a design defect, not an ops surprise.
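As a rough sketch of the enrichment described in this section, the root span of a request can carry the high-cardinality dimensions as tags. The RequestEvent helper and the tag names (tenant.id, actor.id, and so on) are illustrative assumptions, not names mandated by this standard, and the manually created Activity stands in for the one ASP.NET Core would create per request.

```csharp
using System;
using System.Diagnostics;

// Stand-in for the root span that the hosting layer would normally create.
using var request = new Activity("POST /graphql").Start();

RequestEvent.Enrich(request,
    tenantId: "tenant-42",
    actorId: "user-7",              // opaque id only, never personal data
    buildVersion: "1.4.2",
    graphQlOperation: "CreateOrder",
    entityId: "order-9001");

// Print the tags that would travel with the exported span.
foreach (var tag in request.TagObjects)
    Console.WriteLine($"{tag.Key} = {tag.Value}");

// Hypothetical helper: one call per request adds every dimension to one wide event.
public static class RequestEvent
{
    public static void Enrich(Activity? activity, string tenantId, string actorId,
        string buildVersion, string graphQlOperation, string? entityId = null)
    {
        if (activity is null) return; // no listener or no active request span

        activity.SetTag("tenant.id", tenantId);
        activity.SetTag("actor.id", actorId);
        activity.SetTag("service.version", buildVersion);
        activity.SetTag("graphql.operation.name", graphQlOperation);
        if (entityId is not null) activity.SetTag("entity.id", entityId);
    }
}
```

Keeping the dimensions on a single span means one query by tenant or operation name returns the whole request, rather than a search across many separate log lines.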
Baseline signals per service
- Health checks: liveness + readiness endpoints (MapHealthChecks), with readiness covering critical dependencies (DB, queue).
- Standard metrics: request rate/duration/error rate (OTel ASP.NET Core instrumentation gives these free), plus queue depth/consumer lag for Ipc consumers.
- Dashboards and alerting rules: deferred to a future revision (expert-review item 15 — alert on SLOs/symptoms) once tracing is live.
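One common way to split liveness from readiness with MapHealthChecks is to tag the dependency checks and filter per endpoint. The sketch below assumes a Program.cs in an ASP.NET Core project; the endpoint paths, the check names, and the "ready" tag are illustrative choices, and the always-healthy lambdas are placeholders for real DB and queue probes.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Register dependency checks and tag them so only readiness runs them.
// Placeholder lambdas: replace with real connectivity probes.
builder.Services.AddHealthChecks()
    .AddCheck("database", () => HealthCheckResult.Healthy(), tags: new[] { "ready" })
    .AddCheck("queue", () => HealthCheckResult.Healthy(), tags: new[] { "ready" });

var app = builder.Build();

// Liveness: run no checks, so the endpoint reports healthy whenever the process can respond.
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false,
});

// Readiness: run only the checks tagged "ready" (critical dependencies).
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
});

app.Run();
```

Keeping dependency checks out of liveness avoids restarting a healthy process because a downstream database is briefly unavailable; readiness failing is enough to take the instance out of rotation.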
Incident postmortems (item 18, accepted 2026-06-06)
- Any production incident with customer impact gets a short written postmortem within a week: timeline, root cause, what went well/badly, action items each with a named owner. Template: templates/POSTMORTEM-template.md.
- Blameless: the analysis targets systems and process, never individuals; "why did the system allow this" beats "who did this".
- Minimal ceremony at current team size: one page, written async; a meeting only for major incidents. Postmortems are revisited until their action items close.
Sources: Observability Engineering · .NET OpenTelemetry · W3C Trace Context