Design: Observability Patterns
Prerequisites
- trellis-patterns.md - Trellis architecture and communication model
- type-system-patterns.md - Result and error conventions
Scope
This document defines Trellis observability, documentation, telemetry, and request-correlation patterns.
Service Observability
Every service exposes:
- <Service>.Health RPC
- baseline Health.Heartbeat event publishing through the shared Trellis health contract
- optional <Service>.Stats RPC
- OpenTelemetry tracing and metrics
- structured logging
Activated devices publish Health.Heartbeat through the same shared contract.
Connected service and device participants receive a Trellis-defined baseline
health use for trellis.health@v1; it is modeled as the grouped uses.required.health dependency in emitted manifests, not as a flat health
alias, and contract authors do not manually repeat it.
Health example:
const service = await TrellisService.connect({
  trellisUrl: config.trellisUrl,
  contract: graph,
  name: "graph",
  sessionKeySeed: config.sessionKeySeed,
  server: {
    log,
    healthChecks: {
      db: () => db.ping(),
    },
  },
});
service.health.setInfo({
  version: build.version,
  info: { region: config.region },
});
service.health.add("db", async () => ({
  status: (await db.ping()) ? "ok" : "failed",
  info: { engine: "postgres" },
}));
Heartbeat behavior:
- if the connected service contract uses the shared Health.Heartbeat event, TrellisService.connect(...) and Rust TrellisClient::connect_service(...) publish baseline heartbeats automatically
- if the connected device contract uses the shared Health.Heartbeat event, TrellisDevice.connect(...) and Rust TrellisClient::connect_device(...) publish baseline heartbeats automatically
- baseline heartbeats include runtime metadata, instance identity, publish interval, and a built-in NATS connectivity check
- service.health.setInfo(...) and service.health.add(...) extend service heartbeat payloads at publish time using callback-based state snapshots; the same helper surface is also available on device connections
- the Trellis console can subscribe to these heartbeats directly and show both a live feed and an in-browser current-participant view without a separate aggregator
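The callback-based snapshot behavior described above can be sketched in isolation. This is a minimal illustration, not the Trellis implementation: `HeartbeatBuilder`, `CheckResult`, and the payload shape are hypothetical names chosen to mirror `setInfo(...)` and `add(...)`, and the rule that one failed check marks the whole payload as failed is an assumption.

```typescript
// Hypothetical shapes for illustration; not the actual Trellis heartbeat schema.
type CheckStatus = "ok" | "failed";

interface CheckResult {
  status: CheckStatus;
  info?: Record<string, unknown>;
}

type HealthCheck = () => CheckResult | Promise<CheckResult>;

interface HeartbeatSnapshot {
  status: CheckStatus;
  info: Record<string, unknown>;
  checks: Record<string, CheckResult>;
}

class HeartbeatBuilder {
  private info: Record<string, unknown> = {};
  private checks: Record<string, HealthCheck> = {};

  /** Replace the static info attached to every heartbeat. */
  setInfo(info: Record<string, unknown>): void {
    this.info = { ...info };
  }

  /** Register a named check; its callback runs again on every snapshot. */
  add(name: string, check: HealthCheck): void {
    this.checks[name] = check;
  }

  /** Evaluate every callback now and return a point-in-time payload. */
  async snapshot(): Promise<HeartbeatSnapshot> {
    const results: Record<string, CheckResult> = {};
    for (const name of Object.keys(this.checks)) {
      try {
        results[name] = await this.checks[name]();
      } catch (err) {
        // A throwing check is reported as failed instead of aborting the heartbeat.
        results[name] = { status: "failed", info: { error: String(err) } };
      }
    }
    // Assumption: a single failed check marks the whole payload as failed.
    const failed = Object.keys(results).some((n) => results[n].status === "failed");
    return { status: failed ? "failed" : "ok", info: { ...this.info }, checks: results };
  }
}
```

Because each check is a callback rather than a stored value, every published heartbeat reflects state at publish time, and the caller never has to push updates manually.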
Stats example:
await service.handle.rpc.graph.stats(async () => {
  return Result.ok({
    users: { count: await db.countUsers() },
    partners: { count: await db.countPartners() },
  });
});
Documentation
Exported functions, classes, and methods require JSDoc.
Required fields:
- brief purpose description
- @param for each parameter
- @returns description
- @throws or @errors for error conditions
- @example for complex usage
Skip JSDoc for private helpers when the code is self-evident and for tests.
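As an illustration of the required fields, the sketch below documents a small exported helper. `parsePageSize` is a hypothetical function invented for this example and is not part of Trellis; it only shows how the purpose line, @param, @returns, @throws, and @example fit together.

```typescript
/**
 * Parses a page size from a raw query value.
 *
 * @param raw - The raw string taken from the request, possibly undefined.
 * @param fallback - The value returned when `raw` is undefined or blank.
 * @returns The parsed page size, clamped to the range 1..500.
 * @throws {RangeError} If `raw` is present but is not a whole number.
 * @example
 * parsePageSize("25", 50); // 25
 * parsePageSize(undefined, 50); // 50
 */
export function parsePageSize(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n)) throw new RangeError(`not an integer: ${raw}`);
  // Clamp rather than reject so that oversized requests still succeed.
  return Math.min(500, Math.max(1, n));
}
```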
Telemetry
@qlever-llc/trellis/telemetry is the public TypeScript telemetry entrypoint.
The former public @qlever-llc/trellis/tracing subpath is not part of the
current public surface.
TrellisService.connect() initializes OpenTelemetry automatically using the
service name unless automatic telemetry is disabled by the caller. Runtime
helpers must keep browser-safe imports separate from Node/Deno telemetry SDK
setup; package entrypoints may use @opentelemetry/api, but exporter and SDK
packages should be loaded only from server/runtime initialization paths.
Tracing rules:
Span naming:
- RPC client: rpc.client.<MethodName>
- RPC server: rpc.server.<MethodName>
- Event publish: event.publish.<Domain>.<Action>
- Event handle: event.handle.<Domain>.<Action>
- Job handle: job.handle.<service>.<queue>
Required attributes:
- rpc.system
- rpc.method
- messaging.destination
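The naming scheme and attribute keys above can be captured in a few small helpers. This is a sketch under assumptions: the source names the required keys but does not say which span kinds carry which attribute or what values they take, so the "trellis" value for rpc.system, the pairing of messaging.destination with event spans, and the helper names themselves are all hypothetical.

```typescript
interface SpanDescriptor {
  name: string;
  attributes: Record<string, string>;
}

/** Builds the name and attributes for an RPC client or server span. */
function rpcSpan(side: "client" | "server", method: string): SpanDescriptor {
  return {
    name: `rpc.${side}.${method}`,
    // Assumption: "trellis" as the rpc.system value is illustrative only.
    attributes: { "rpc.system": "trellis", "rpc.method": method },
  };
}

/** Builds the name and attributes for an event publish or handle span. */
function eventSpan(
  kind: "publish" | "handle",
  domain: string,
  action: string,
  destination: string,
): SpanDescriptor {
  return {
    name: `event.${kind}.${domain}.${action}`,
    // Assumption: messaging.destination is attached to event spans.
    attributes: { "messaging.destination": destination },
  };
}

/** Builds the name for a job handling span. */
function jobSpan(service: string, queue: string): SpanDescriptor {
  return { name: `job.handle.${service}.${queue}`, attributes: {} };
}
```

Centralizing the names in helpers like these keeps span names consistent across call sites, which matters because dashboards and alerts usually match on the exact string.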
Library support rule:
- libraries performing I/O must accept trace context, create child spans, and propagate context
- TrellisError subclasses should include traceId when tracing is active
- if a runtime has not installed an OpenTelemetry tracer provider, RPC error responses should still attach traceId from a valid inbound traceparent header before the error leaves the server span boundary
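The fallback in the last rule depends on validating the inbound header without any tracer installed. The sketch below shows one way to do that for the W3C Trace Context version 00 layout (`version-traceid-parentid-flags`). The function names and the `ErrorData` shape are hypothetical, and the strict four-field pattern deliberately rejects any future header versions that carry extra fields.

```typescript
// Four lowercase-hex fields: 2-char version, 32-char trace id, 16-char parent id, 2-char flags.
const TRACEPARENT = /^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$/;

/** Returns the trace id from a valid traceparent header, or undefined. */
function traceIdFromTraceparent(header: string | undefined): string | undefined {
  if (header === undefined) return undefined;
  const value = header.trim();
  if (!TRACEPARENT.test(value)) return undefined;
  const version = value.slice(0, 2);
  const traceId = value.slice(3, 35);
  const parentId = value.slice(36, 52);
  // The spec forbids version "ff" and all-zero trace or parent ids.
  if (version === "ff") return undefined;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return undefined;
  return traceId;
}

// Hypothetical serialized error shape, for illustration only.
interface ErrorData {
  type: string;
  message: string;
  traceId?: string;
}

/** Attaches traceId to error data when the inbound header is valid. */
function withTraceId(data: ErrorData, traceparent: string | undefined): ErrorData {
  const traceId = traceIdFromTraceparent(traceparent);
  return traceId === undefined ? data : { ...data, traceId };
}
```

An invalid or missing header leaves the error data untouched, so a malformed client header can never inject an arbitrary string into the response.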
Error metrics:
- Trellis runtime libraries record caller-visible and runtime-observed failures with the trellis.errors OpenTelemetry counter
- metrics are no-op unless an OpenTelemetry meter provider/exporter is installed or configured by telemetry runtime initialization
- metric attributes must be low cardinality and must not include user IDs, session keys, raw subjects, payload data, trace IDs, request IDs, or raw error messages
- allowed Trellis error attributes are stable labels such as trellis.surface, trellis.direction, trellis.operation, trellis.phase, trellis.error.type, trellis.remote_error.type, and bounded trellis.auth.reason
- trellis.operation should come from contract metadata or runtime operation kind, not from NATS subjects, URLs, payloads, or remote peer-provided values
- expected public failures remain Result-modeled behavior; metrics are observability side effects and must not change error semantics
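One way to enforce the low-cardinality rule is an allowlist applied just before recording the counter. The sketch below is illustrative: the key list copies the labels named above (which the source introduces with "such as", so it may not be exhaustive), while `errorMetricAttributes` and the bounded set of auth reasons are hypothetical.

```typescript
const ALLOWED_KEYS = [
  "trellis.surface",
  "trellis.direction",
  "trellis.operation",
  "trellis.phase",
  "trellis.error.type",
  "trellis.remote_error.type",
  "trellis.auth.reason",
] as const;

type AllowedKey = (typeof ALLOWED_KEYS)[number];

// Hypothetical bounded set of auth reasons; anything else collapses to "other".
const AUTH_REASONS: readonly string[] = ["expired", "invalid_signature", "replayed"];

/** Keeps only allowlisted string attributes and bounds the auth reason. */
function errorMetricAttributes(
  raw: Record<string, unknown>,
): Partial<Record<AllowedKey, string>> {
  const out: Partial<Record<AllowedKey, string>> = {};
  for (const key of ALLOWED_KEYS) {
    const value = raw[key];
    // Non-string values are dropped rather than stringified.
    if (typeof value !== "string") continue;
    const unbounded = key === "trellis.auth.reason" && AUTH_REASONS.indexOf(value) === -1;
    out[key] = unbounded ? "other" : value;
  }
  return out;
}
```

Filtering by allowlist rather than blocklist means a newly added high-cardinality field, such as a user ID, is excluded by default instead of leaking into metric labels.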
Request Correlation
RPCs and jobs include a requestId for correlation and audit. Domain event
contract payloads are bodies only; Trellis assigns event identity and timestamp
as runtime metadata (Nats-Msg-Id and Trellis-Event-Time headers) alongside
trace context. Domain events do not currently emit a separate request-id NATS
header unless they are job lifecycle events.
Rules:
- the client supplies a unique request-id for signed RPCs; auth includes it in the RPC proof and replay-cache key
- after auth validation, the server may use the request id as correlation context but must still treat logs/traces as observability data, not as a source of authorization policy
- request IDs propagate across downstream RPC and job flows
- logs and traces include requestId
Propagation:
| Context | request-id value |
|---|---|
| RPC handler | generated on receipt |
| RPC response | echoed from handler |
| Domain event | not set; use Nats-Msg-Id and trace |
| Job created from RPC/event/job | inherited when available; otherwise new ULID |
| Job lifecycle event | copied from job.context.requestId |
| Scheduled or cron job/event | new ULID for jobs; event Nats-Msg-Id only |
Job correlation:
- job creation records job.context.requestId, job.context.traceId, job.context.traceparent, and optional job.context.tracestate
- if no active trace exists when a job is created, the runtime creates a fresh W3C trace context rather than leaving the job untraced
- every job lifecycle publish includes matching request-id, traceparent, and tracestate NATS headers when present
- workers expose immutable job context to handlers and create job handling spans from that context where the language runtime supports tracing
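The job correlation rules can be sketched as a context factory plus a header builder. Everything here is an assumption-laden illustration: `createJobContext`, `lifecycleHeaders`, and `ParentContext` are hypothetical names, the ULID generator is injected because the source does not name one, the fresh trace context uses `Math.random` as a stand-in for a cryptographically secure source, and the "01" sampled flag is an arbitrary choice. An inherited traceparent is assumed to have been validated already.

```typescript
interface JobContext {
  readonly requestId: string;
  readonly traceId: string;
  readonly traceparent: string;
  readonly tracestate?: string;
}

interface ParentContext {
  requestId?: string;
  traceparent?: string;
  tracestate?: string;
}

/** Random lowercase hex of the given length, never all zeros. */
function nonZeroHex(length: number): string {
  let out = "";
  do {
    out = "";
    for (let i = 0; i < length; i++) {
      out += Math.floor(Math.random() * 16).toString(16);
    }
  } while (/^0+$/.test(out));
  return out;
}

/** Inherits correlation data when available; otherwise creates fresh values. */
function createJobContext(parent: ParentContext, newId: () => string): JobContext {
  const requestId = parent.requestId ?? newId();
  let traceparent = parent.traceparent;
  let tracestate = parent.tracestate;
  if (traceparent === undefined) {
    // No active trace: start a fresh W3C trace context for this job.
    traceparent = `00-${nonZeroHex(32)}-${nonZeroHex(16)}-01`;
    tracestate = undefined; // tracestate is meaningless without its traceparent
  }
  // Version 00 layout: the trace id occupies characters 3..34.
  const traceId = traceparent.slice(3, 35);
  const extra = tracestate !== undefined ? { tracestate } : {};
  return Object.freeze({ requestId, traceId, traceparent, ...extra });
}

/** Builds the NATS headers attached to a job lifecycle publish. */
function lifecycleHeaders(ctx: JobContext): Record<string, string> {
  const headers: Record<string, string> = {
    "request-id": ctx.requestId,
    traceparent: ctx.traceparent,
  };
  if (ctx.tracestate !== undefined) headers.tracestate = ctx.tracestate;
  return headers;
}
```

Freezing the context gives handlers a read-only view, which matches the immutability rule above, and deriving headers from the stored context keeps every lifecycle publish consistent with what was recorded at creation.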
Auth/admin control-plane correlation:
- built-in auth/admin RPCs follow the same inbound traceparent extraction as application RPCs
- traced admin errors include the request trace ID in serialized Trellis error data so operators can correlate failed control-plane calls with logs and spans
- the integration harness covers both a successful traced Auth.Sessions.Me call and a traced failing Auth.Users.Get call through live NATS/auth-callout
Event deduplication:
- domain events include Nats-Msg-Id: <event id> as transport metadata
- domain events include Trellis-Event-Time: <event timestamp> as transport metadata
- JetStream deduplicates within its configured window
- this protects against duplicate publication on retries and reconnects
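To make the deduplication behavior concrete, the sketch below builds the two transport headers and models a dedup window in memory. It is a simplified model, not the JetStream API: `eventHeaders` and `DedupWindow` are hypothetical, the ISO 8601 timestamp format is an assumption, and real JetStream performs this check on the server side.

```typescript
interface DomainEventMeta {
  id: string;
  time: Date;
}

/** Builds the transport headers for a domain event publish. */
function eventHeaders(meta: DomainEventMeta): Record<string, string> {
  return {
    "Nats-Msg-Id": meta.id,
    // Assumption: the timestamp is serialized as ISO 8601.
    "Trellis-Event-Time": meta.time.toISOString(),
  };
}

/** In-memory model of a message-id dedup window. */
class DedupWindow {
  private readonly seen = new Map<string, number>();
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  /** Returns true when the message is accepted, false when it is a duplicate. */
  accept(msgId: string, nowMs: number): boolean {
    const first = this.seen.get(msgId);
    if (first !== undefined && nowMs - first < this.windowMs) return false;
    this.seen.set(msgId, nowMs);
    return true;
  }
}
```

A retry that reuses the same event id inside the window is dropped, while the same id arriving after the window has elapsed is stored again, so the window bounds how long a delayed retry remains protected.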