Design: Trellis Operations

Prerequisites

../core/trellis-patterns.md - service and library patterns
../contracts/trellis-contracts-catalog.md - contract model and permission derivation
../auth/trellis-auth.md - session proofs, inbox permissions, and reply-subject validation
../jobs/trellis-jobs.md - service-local background execution

Context

Trellis needs a clear model for caller-visible asynchronous workflows.

Today, long-running work is often implemented with service-local jobs, but jobs are the wrong public abstraction:

jobs are service-internal execution machinery
job topology may change without changing the service contract
global jobs visibility is too broad for ordinary callers
caller-visible async work needs durable state, typed progress, and private live updates

We therefore need a distinct concept for async work that belongs in the public service contract. Operations are part of the Trellis API model, alongside RPCs, events, and owned subject spaces.

This document defines operation architecture, durable state semantics, authorization, and the internal wire/control protocol. Ordinary TypeScript and Rust usage examples belong in /guides/libraries/typescript, /guides/libraries/rust, and the generated references linked from /api.

Design

1) Jobs and operations are different things

a job is a service-private execution primitive
an operation is a caller-visible async contract owned by one service
callers interact with operations; service internals may implement those operations with jobs, direct orchestration, or other storage/compute strategies

Regular users and peer services MUST NOT depend on trellis.jobs for normal product behavior. trellis.jobs is an admin and observability surface. Public async workflows belong to the owning service’s operations.

2) Operations are part of the contract and API surface

Contracts MAY declare owned operations in a top-level operations object. These are public API entries, not documentation-only annotations or an implementation hint for jobs.

Example:

{
  "operations": {
    "Billing.Refund": {
      "version": "v1",
      "subject": "operations.v1.Billing.Refund",
      "input": { "schema": "BillingRefundRequest" },
      "progress": { "schema": "BillingRefundProgress" },
      "output": { "schema": "BillingRefundResult" },
      "capabilities": {
        "call": ["billing.refund"],
        "observe": ["billing.refund"],
        "cancel": ["billing.refund.cancel"],
        "control": ["billing.refund.control"]
      },
      "signals": {
        "approveRefund": {
          "input": { "schema": "BillingRefundApproval" }
        }
      },
      "cancel": true
    }
  }
}

Descriptor rules:

subject SHOULD default to operations.<version>.<LogicalName> when omitted
input and output are required schema references
progress is optional; if omitted, the operation does not emit typed progress payloads
capabilities.call gates initial invocation
capabilities.observe gates get, wait, and watch; if omitted, it defaults to capabilities.call
capabilities.cancel gates cancel; if omitted, cancellation is not caller-accessible unless the runtime is acting as the owning service
capabilities.control gates named operation signals; if omitted, signal submission does not require additional Trellis capabilities beyond authentication and operation ownership, but services SHOULD declare explicit control capabilities for caller-visible workflows that accept post-start input
signals declares named post-start inputs for schema validation, docs, authorization review, and generated SDK aliases
cancel: true means the operation exposes cancellation semantics; omitted or false means callers cannot cancel it
operations are always authenticated; invoking or observing an operation always requires an authenticated caller plus operation authorization

uses MUST support operation invocation the same way it supports RPC calls, events, and subjects. A participant may only start remote operations that it explicitly declares in uses and that its current auth state authorizes. Follow-up gets, waits, watches, cancels, and signals are authorized against the specific operation id, creator/owner metadata, and action-specific operation capabilities.

3) Operations are durable by default

An operation is not just a live stream. The owning service MUST persist enough operation state to support:

get() after disconnect or process restart
wait() after reconnect
authorization checks against a specific operation id

The storage mechanism is service-owned and is not centralized in trellis.jobs. The service MAY use KV, streams, a database, or another durable local persistence strategy, but the public semantics are uniform:

an operation has a durable snapshot
an operation may emit live events
watch() complements durable state; it does not replace it

4) Public runtime API

Rules:

public operation surfaces are descriptor-driven and typed from the contract; exact language signatures, overloads, exported names, and generated member inventories belong in the generated TypeScript API reference and Rustdoc under /api
callers configure new operations through typed builders derived from contract-owned operation keys, not free-form string dispatch
operation.<group>.<leaf>.start(input) returns an OperationRef, not a terminal result
OperationRef.get() returns the current durable snapshot
OperationRef.wait() resolves from durable state and live events to a terminal snapshot
OperationRef.watch() returns a live async stream of typed operation events
transfer-capable operations initiate caller-to-service send transfer through operation.<group>.<leaf>.input(input).transfer(body).start()
transfer initiation is builder-only: callers MUST NOT start an operation once and attach bytes later, and resumed operation references MUST only observe transfer-backed operations
receive grants returned by RPCs are consumed through the root transfer helper, not through an operation reference; the operation reference remains an observation handle
public TypeScript operations APIs MUST use Result / AsyncResult for expected failures rather than exception-oriented wrappers
Rust operation APIs return Rust Result values directly while preserving the same expected-failure semantics
runtimes MUST expose operation APIs through normal helpers; callers MUST NOT need to know hidden *.Start, *.Get, *.Wait, or *.Watch wire names
public APIs MUST NOT expose hidden control subjects, caller reply subjects, reply-stream mechanics, or runtime control envelopes
the TypeScript runtime exposes universal cancel() and signal(...) helpers on operation references; unsupported cancel or signal attempts MUST return an expected failure (Result.err / language equivalent), not throw as a normal control-flow path and not mutate operation state
non-TypeScript generated runtimes SHOULD expose equivalent operation-reference control helpers as their operation runtime support catches up; exact current language support belongs in /api and Rustdoc
event streams and generated callbacks MUST keep runtime control frames hidden; public cancellation and signal affordances are operation-reference methods, not lifecycle watch events

Caller surface:

callers start async workflows with operation.<group>.<leaf>.start(input) and receive an OperationRef
callers observe the operation through get(), wait(), watch(), and operation-reference control helpers such as cancel() and signal(...)
callers send bytes for transfer-backed operations through the generated operation leaf’s builder path, operation.<group>.<leaf>.input(input).transfer(body).start()

Owning-service surface:

owning services register handlers with service.handle.operation.<group>.<leaf>(handler)
generated service surfaces expose typed input, caller identity, and an active operation handle to handlers
transfer-capable handlers receive provider-side transfer progress and durable completion helpers, such as TypeScript transfer.updates() / transfer.completed() or Rust UploadTransferCompletion::completed()
handlers may complete operations directly or attach local jobs to them
handler-visible active operation handles are the normal in-handler path for publishing lifecycle changes, progress, terminal success, terminal failure, cancellation, or job attachment
owning services also expose an operation-scoped control path such as service.handle.operation.<group>.<leaf>.control(operationId) for service-private jobs and other durable service-owned execution paths that only have the operation id after handler return, restart, redelivery, retry, or delayed execution
service-side control by id MUST load the durable operation record, verify the record belongs to the current service and requested operation key, validate progress/output against the operation descriptor, reject terminal mutations, and return normal expected errors rather than re-running the operation handler
handler-visible active operation handles expose a durable private signal stream for named caller inputs submitted after start; handlers consume it through runtime helpers such as signals() or nextSignal(name?)
handlers that intentionally leave terminal completion to another control path return op.defer() after recording any durable progress they own. The runtime MUST NOT auto-complete, auto-fail, or keep the handler promise pending for a deferred operation. The deferred sentinel is explicit external terminal ownership; it is not an operation output and MUST NOT be replaced with a never-resolving promise.

Generated operation runtimes MUST derive input, progress, output, cancelability, signals, transfer behavior, and provider-side transfer helpers from the operation descriptor. They MUST expose typed operation helpers only for operations the participant owns or explicitly declares in uses, and they MUST preserve enough descriptor metadata for language-specific generated facades. For exact TypeScript and Rust signatures, use the generated API reference and Rustdoc under /api.

5) Operation model

The public operation model is shared across languages. Exact exported type names and method signatures belong in generated API reference/Rustdoc, but every runtime MUST preserve these logical fields and semantics:

operation state is one of pending, running, completed, failed, or cancelled
an operation reference identifies the operation id, owning service, and operation key, and supports current-state reads, terminal waits, live watches, cancellation, and named signals; unsupported cancellation or signal submission returns an expected failure rather than silently succeeding
a durable snapshot carries operation identity, a monotonic revision, timestamps, current state, optional typed progress, optional transfer progress, optional typed output, and an error view for failed terminal outcomes
terminal snapshots are snapshots whose state is completed, failed, or cancelled
lifecycle events carry accepted, started, transfer progress, progress, completed, failed, and cancelled changes as appropriate for the descriptor
generated progress events carry the progress payload both as the event payload and on the embedded snapshot; generated transfer events do the same for transfer progress
runtime control events that are not part of this public lifecycle remain hidden behind the operation reference
accepted operation signals are private operation-control inputs; they are not public lifecycle events and do not appear on watch() unless the service later reflects their effects through progress or terminal snapshots

Lifecycle rules:

the first externally visible event MUST be accepted
accepted creates a durable operation snapshot in pending
every durable snapshot exposes a monotonic public revision
accepted signals are persisted with a private monotonic signal sequence and do not increment the public snapshot revision
started transitions the snapshot to running
transfer updates the stored transfer progress payload, emits once per acknowledged chunk, and MUST carry that payload as both event.transfer and event.snapshot.transfer
progress updates the stored progress payload, does not change terminal state, and MUST carry that payload as both event.progress and event.snapshot.progress
completed, failed, and cancelled are terminal
a service handler may explicitly defer terminal completion only by returning the runtime’s operation-deferred sentinel. Deferral means the accepted operation remains durable and non-terminal until another authorized service path completes, fails, or cancels the same operation id.
service-side control by id is that durable external terminal path for service-private jobs. Calling it MUST NOT create a new operation, publish a new accepted event, or invoke the original operation handler again.

6) Operations use caller `_INBOX` subjects for live watch streams

Operation watch streams MUST use the caller’s inbox space.

Rules:

the caller opens watch() on the same authenticated NATS connection it already owns
the runtime uses a reply subject under the caller’s _INBOX prefix
the service MUST validate that the reply subject starts with the caller’s authorized inbox prefix
the service streams operation events to that reply subject
no new NATS connection is required to watch an operation

This keeps operation watches private to the authenticated caller while avoiding general-purpose cross-service subscribe grants.

7) Operation wire model

The TypeScript public API is operation.<group>.<leaf>.start(input) plus OperationRef.get/wait/watch/cancel/signal. Other language runtimes expose the same protocol through their generated facade support as it lands. The underlying wire model is standardized enough for auth and codegen.

Rules:

invoking an operation publishes the input payload to the operation’s declared subject
every operation also has a derived control subject: <subject>.control
the runtime uses the control subject for get, wait, watch, cancel, and signal
watch and wait send a reply subject under the caller’s _INBOX prefix and receive responses on that subject
get, cancel, and signal are single-response control requests
watch is a streaming control request
wait is a streaming or long-poll control request that terminates on the first terminal snapshot

The control envelope format is an internal Trellis runtime detail rather than a service-authored contract type, but it is still fixed by this document so independent implementations remain compatible.

7a) Internal invoke response envelope

Starting an operation sends the contract-defined input payload to the operation subject and receives exactly one accepted response on the caller reply subject.

type OperationAcceptedEnvelope<TProgress, TOutput> = {
  kind: "accepted";
  ref: {
    id: string;
    service: string;
    operation: string;
  };
  snapshot: OperationSnapshot<TProgress, TOutput> & {
    revision: number;
  };
  transfer?: TransferGrant & { direction: "send" };
};

Rules:

the service MUST allocate the operation id before replying
the accepted reply MUST include the initial durable snapshot
transfer-capable operations MUST include the runtime-owned send transfer session data needed to execute the builder-managed send step
the initial snapshot revision MUST be 1
the accepted reply is the only response sent for operation.<group>.<leaf>.start(input)

7b) Internal control request envelope

All follow-up operation control requests publish a runtime-owned envelope to <subject>.control.

type OperationControlRequest =
  | {
    action: "get";
    operationId: string;
  }
  | {
    action: "wait";
    operationId: string;
    includeProgress?: boolean;
  }
  | {
    action: "watch";
    operationId: string;
  }
  | {
    action: "cancel";
    operationId: string;
  }
  | {
    action: "signal";
    operationId: string;
    signal: string;
    input?: JsonValue;
  };

Rules:

operationId is always required for control requests
signal is the contract-declared signal name for action: "signal"
input is validated against the matching signal descriptor’s input schema; rejected signals are not persisted
the public runtime owns this envelope; user code never constructs it directly
every control request MUST be authenticated and authorization-checked against the referenced operation id

7c) Internal control response frames

get, wait, watch, cancel, and signal all respond with standardized internal frames on the validated caller reply subject.

type OperationControlFrame<TProgress, TOutput> =
  | {
    kind: "snapshot";
    snapshot:
      & (
        | OperationSnapshot<TProgress, TOutput>
        | TerminalOperation<TProgress, TOutput>
      )
      & {
        revision: number;
      };
  }
  | {
    kind: "event";
    sequence: number;
    event: OperationEvent<TProgress, TOutput>;
  }
  | {
    kind: "signal-accepted";
    operationId: string;
    signal: string;
    signalSequence: number;
    acceptedAt: string;
    snapshot: OperationSnapshot<TProgress, TOutput>;
  }
  | {
    kind: "keepalive";
  }
  | {
    kind: "error";
    error: SerializableTrellisError;
  };

Rules:

revision is a monotonically increasing durable snapshot version scoped to one operation id
sequence is a monotonically increasing stream sequence scoped to one operation id
signalSequence is a private monotonically increasing signal-log sequence scoped to one operation id and is independent of public snapshot revision
error frames are expected control-request failures or runtime/internal protocol failures; domain failure outcomes remain normal terminal operation snapshots with state failed
error frame payloads MUST use the normal Trellis serializable error shape, including stable type, message, id, optional context, and any error-specific fields, so operation lifecycle failures remain typed across client/server boundaries
runtimes MUST hide these internal frames behind OperationRef methods

7d) Internal method behavior

get:

sends one OperationControlRequest with action: "get"
receives exactly one snapshot or error frame
MUST then close the reply stream

wait:

sends one OperationControlRequest with action: "wait"
if the operation is already terminal, receives exactly one terminal snapshot frame and closes
otherwise receives zero or more event frames, optional keepalive frames, then exactly one terminal snapshot frame and closes

watch:

sends one OperationControlRequest with action: "watch"
receives exactly one initial snapshot frame representing current durable state
then receives zero or more event frames and optional keepalive frames
after a terminal event, the service MUST close the reply stream

cancel:

sends one OperationControlRequest with action: "cancel"
receives exactly one snapshot frame containing the post-cancel durable state, or one error frame
if the descriptor is not cancelable, the service MUST return an error frame and MUST NOT mutate the durable operation state
cancel authorization uses capabilities.cancel, not capabilities.control
MUST then close the reply stream

signal:

sends one OperationControlRequest with action: "signal"
receives exactly one signal-accepted frame containing the accepted signal sequence and current durable snapshot, or one error frame
the service MUST persist the accepted signal before acknowledging it
the service MUST reject unknown signal names, invalid signal payloads, terminal operations, and operations not running in the current service process
accepted signals MUST NOT emit public watch events by themselves
MUST then close the reply stream

Keepalive rules:

keepalive frames are optional for watch and wait
if emitted, they MUST NOT carry domain data
if emitted, the interval MUST be at least 5 seconds and at most 30 seconds

8) Auth model for operations

Operations eliminate the need for a global end-user jobs-read capability.

Authorization rules:

invocation is gated by authentication plus the operation’s declared capabilities.call
observe access (get, wait, watch) is gated by authentication, capabilities.observe, and the owning service’s operation-level authorization logic
cancel access is gated by authentication, capabilities.cancel, and the owning service’s operation-level authorization logic
signal access is gated by authentication, capabilities.control, and the owning service’s operation-level authorization logic
the owning service MUST persist enough operation ownership metadata to authorize follow-up access to a specific operation id
the default runtime rule is creator-bound visibility: the principal that created the operation may observe it later unless the owning service explicitly grants broader domain access

Trellis MUST NOT introduce a broad deployment-wide capability equivalent to “observe every operation everywhere” for ordinary clients.

9) Auth callout and reply permissions

Unary RPC response semantics are insufficient for operation watch streams.

Rules:

Trellis auth MUST permit bounded multi-response publishing to a validated caller reply subject for authenticated operation streams
this permission applies to a reply subject derived from a request the service actually received; it is not a general publish grant to arbitrary inbox subjects
unary RPCs remain single-response operations by convention even if the transport permission can support multiple responses

This keeps the security property of reply-subject validation while allowing streamed responses for operations.

10) Jobs remain the service-private execution layer

Operations and jobs integrate, but they do not collapse into one concept.

Rules:

a service MAY back an operation with one or more local jobs
internal jobs SHOULD carry operationId when they contribute to caller-visible async work
a service runtime SHOULD provide a helper to attach a JobRef to an OperationRef
when a handler returns op.defer(), the corresponding job SHOULD resume the operation through the public operation-scoped service control helper using only the stored operationId
callers never need to know internal job ids or job types
changing internal job topology MUST NOT break the public operation contract

11) Realistic example

Scenario:

Billing.Refund is the public operation
submitRefund is an internal billing job
Payments.Refund is a remote operation exposed by the payments service
Notifications.Email.Send is another remote operation that is not part of refund completion semantics

The caller starts Billing.Refund, receives an operation reference, watches progress, and waits for a terminal snapshot. The billing service may accept the operation, persist its operation id, enqueue submitRefund, and return a deferred sentinel. The job later resumes service-side operation control by id, marks the operation started, publishes domain progress, invokes Payments.Refund, optionally triggers a notification, and completes or fails the original billing operation.

The important design invariant is that callers depend only on the public Billing.Refund operation contract. They do not see the billing job id, the job queue topology, or whether notification sending is implemented as an operation, RPC, event, or local side effect. Language-specific code for this scenario belongs in /guides/libraries/typescript, /guides/libraries/rust, and /api.

Design: Trellis Operations

Prerequisites

Context

Design

1) Jobs and operations are different things

2) Operations are part of the contract and API surface

3) Operations are durable by default

4) Public runtime API

5) Operation model

6) Operations use caller _INBOX subjects for live watch streams

7) Operation wire model

7a) Internal invoke response envelope

7b) Internal control request envelope

7c) Internal control response frames

7d) Internal method behavior

8) Auth model for operations

9) Auth callout and reply permissions

10) Jobs remain the service-private execution layer

11) Realistic example

6) Operations use caller `_INBOX` subjects for live watch streams