Design: Trellis Operations
Prerequisites
- ../core/trellis-patterns.md - service and library patterns
- ../contracts/trellis-contracts-catalog.md - contract model and permission derivation
- ../auth/trellis-auth.md - session proofs, inbox permissions, and reply-subject validation
- ../jobs/trellis-jobs.md - service-local background execution
Context
Trellis needs a clear model for caller-visible asynchronous workflows.
Today, long-running work is often implemented with service-local jobs, but jobs are the wrong public abstraction:
- jobs are service-internal execution machinery
- job topology may change without changing the service contract
- global jobs visibility is too broad for ordinary callers
- caller-visible async work needs durable state, typed progress, and private live updates
We therefore need a distinct concept for async work that belongs in the public service contract. Operations are part of the Trellis API model, alongside RPCs, events, and owned subject spaces.
This document defines operation architecture, durable state semantics,
authorization, and the internal wire/control protocol. Ordinary TypeScript and
Rust usage examples belong in /guides/libraries/typescript, /guides/libraries/rust, and the generated references linked from /api.
Design
1) Jobs and operations are different things
- a job is a service-private execution primitive
- an operation is a caller-visible async contract owned by one service
- callers interact with operations; service internals may implement those operations with jobs, direct orchestration, or other storage/compute strategies
Regular users and peer services MUST NOT depend on trellis.jobs for normal
product behavior. trellis.jobs is an admin and observability surface. Public
async workflows belong to the owning service’s operations.
2) Operations are part of the contract and API surface
Contracts MAY declare owned operations in a top-level operations object. These
are public API entries, not documentation-only annotations or an implementation
hint for jobs.
Example:
{
  "operations": {
    "Billing.Refund": {
      "version": "v1",
      "subject": "operations.v1.Billing.Refund",
      "input": { "schema": "BillingRefundRequest" },
      "progress": { "schema": "BillingRefundProgress" },
      "output": { "schema": "BillingRefundResult" },
      "capabilities": {
        "call": ["billing.refund"],
        "observe": ["billing.refund"],
        "cancel": ["billing.refund.cancel"],
        "control": ["billing.refund.control"]
      },
      "signals": {
        "approveRefund": {
          "input": { "schema": "BillingRefundApproval" }
        }
      },
      "cancel": true
    }
  }
}
Descriptor rules:
- subject SHOULD default to operations.<version>.<LogicalName> when omitted
- input and output are required schema references
- progress is optional; if omitted, the operation does not emit typed progress payloads
- capabilities.call gates initial invocation
- capabilities.observe gates get, wait, and watch; if omitted, it defaults to capabilities.call
- capabilities.cancel gates cancel; if omitted, cancellation is not caller-accessible unless the runtime is acting as the owning service
- capabilities.control gates named operation signals; if omitted, signal submission does not require additional Trellis capabilities beyond authentication and operation ownership, but services SHOULD declare explicit control capabilities for caller-visible workflows that accept post-start input
- signals declares named post-start inputs for schema validation, docs, authorization review, and generated SDK aliases
- cancel: true means the operation exposes cancellation semantics; omitted or false means callers cannot cancel it
- operations are always authenticated; invoking or observing an operation always requires an authenticated caller plus operation authorization
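The defaulting rules above can be sketched as a small resolver. This is a minimal illustration, not the Trellis runtime's implementation: the OperationDescriptor and ResolvedOperation shapes and the resolveOperation function are hypothetical names, only the fields involved in defaulting are modeled, and treating capabilities.call as always present is an assumption.

```typescript
// Hypothetical shapes for illustration; only fields involved in defaulting are modeled.
type OperationDescriptor = {
  version: string;
  subject?: string;
  capabilities: {
    call: string[]; // assumption: always declared
    observe?: string[];
    cancel?: string[];
    control?: string[];
  };
  cancel?: boolean;
};

type ResolvedOperation = {
  subject: string;
  observeCapabilities: string[];
  callerCancelable: boolean;
};

function resolveOperation(
  logicalName: string,
  descriptor: OperationDescriptor,
): ResolvedOperation {
  return {
    // subject defaults to operations.<version>.<LogicalName> when omitted
    subject: descriptor.subject ??
      `operations.${descriptor.version}.${logicalName}`,
    // capabilities.observe defaults to capabilities.call
    observeCapabilities: descriptor.capabilities.observe ??
      descriptor.capabilities.call,
    // Reading of the rules above: callers can cancel only when cancel is true
    // AND capabilities.cancel is declared.
    callerCancelable: descriptor.cancel === true &&
      descriptor.capabilities.cancel !== undefined,
  };
}
```

For a descriptor that declares only version and capabilities.call, this sketch resolves the subject from the logical name, reuses the call capabilities for observation, and reports the operation as not caller-cancelable.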
uses MUST support operation invocation the same way it supports RPC calls,
events, and subjects. A participant may only start remote operations that it
explicitly declares in uses and that its current auth state authorizes.
Follow-up gets, waits, watches, cancels, and signals are authorized against the
specific operation id, creator/owner metadata, and action-specific operation
capabilities.
3) Operations are durable by default
An operation is not just a live stream. The owning service MUST persist enough operation state to support:
- get() after disconnect or process restart
- wait() after reconnect
- authorization checks against a specific operation id
The storage mechanism is service-owned and is not centralized in trellis.jobs.
The service MAY use KV, streams, a database, or another durable local
persistence strategy, but the public semantics are uniform:
- an operation has a durable snapshot
- an operation may emit live events
- watch() complements durable state; it does not replace it
4) Public runtime API
Rules:
- public operation surfaces are descriptor-driven and typed from the contract; exact language signatures, overloads, exported names, and generated member inventories belong in the generated TypeScript API reference and Rustdoc under /api
- callers configure new operations through typed builders derived from contract-owned operation keys, not free-form string dispatch
- operation.<group>.<leaf>.start(input) returns an OperationRef, not a terminal result
- OperationRef.get() returns the current durable snapshot
- OperationRef.wait() resolves from durable state and live events to a terminal snapshot
- OperationRef.watch() returns a live async stream of typed operation events
- transfer-capable operations initiate caller-to-service send transfer through operation.<group>.<leaf>.input(input).transfer(body).start()
- transfer initiation is builder-only: callers MUST NOT start an operation once and attach bytes later, and resumed operation references MUST only observe transfer-backed operations
- receive grants returned by RPCs are consumed through the root transfer helper, not through an operation reference; the operation reference remains an observation handle
- public TypeScript operations APIs MUST use Result/AsyncResult for expected failures rather than exception-oriented wrappers
- Rust operation APIs return Rust Result values directly while preserving the same expected-failure semantics
- runtimes MUST expose operation APIs through normal helpers; callers MUST NOT need to know hidden *.Start, *.Get, *.Wait, or *.Watch wire names
- public APIs MUST NOT expose hidden control subjects, caller reply subjects, reply-stream mechanics, or runtime control envelopes
- the TypeScript runtime exposes universal cancel() and signal(...) helpers on operation references; unsupported cancel or signal attempts MUST return an expected failure (Result.err / language equivalent), not throw as a normal control-flow path and not mutate operation state
- non-TypeScript generated runtimes SHOULD expose equivalent operation-reference control helpers as their operation runtime support catches up; exact current language support belongs in /api and Rustdoc
- event streams and generated callbacks MUST keep runtime control frames hidden; public cancellation and signal affordances are operation-reference methods, not lifecycle watch events
Caller surface:
- callers start async workflows with operation.<group>.<leaf>.start(input) and receive an OperationRef
- callers observe the operation through get(), wait(), watch(), and operation-reference control helpers such as cancel() and signal(...)
- callers send bytes for transfer-backed operations through the generated operation leaf’s builder path, operation.<group>.<leaf>.input(input).transfer(body).start()
Owning-service surface:
- owning services register handlers with service.handle.operation.<group>.<leaf>(handler)
- generated service surfaces expose typed input, caller identity, and an active operation handle to handlers
- transfer-capable handlers receive provider-side transfer progress and durable completion helpers, such as TypeScript transfer.updates() / transfer.completed() or Rust UploadTransferCompletion::completed()
- handlers may complete operations directly or attach local jobs to them
- handler-visible active operation handles are the normal in-handler path for publishing lifecycle changes, progress, terminal success, terminal failure, cancellation, or job attachment
- owning services also expose an operation-scoped control path such as service.handle.operation.<group>.<leaf>.control(operationId) for service-private jobs and other durable service-owned execution paths that only have the operation id after handler return, restart, redelivery, retry, or delayed execution
- service-side control by id MUST load the durable operation record, verify the record belongs to the current service and requested operation key, validate progress/output against the operation descriptor, reject terminal mutations, and return normal expected errors rather than re-running the operation handler
- handler-visible active operation handles expose a durable private signal stream for named caller inputs submitted after start; handlers consume it through runtime helpers such as signals() or nextSignal(name?)
- handlers that intentionally leave terminal completion to another control path return op.defer() after recording any durable progress they own. The runtime MUST NOT auto-complete, auto-fail, or keep the handler promise pending for a deferred operation. The deferred sentinel is explicit external terminal ownership; it is not an operation output and MUST NOT be replaced with a never-resolving promise.
Generated operation runtimes MUST derive input, progress, output, cancelability,
signals, transfer behavior, and provider-side transfer helpers from the
operation descriptor. They MUST expose typed operation helpers only for
operations the participant owns or explicitly declares in uses, and they MUST
preserve enough descriptor metadata for language-specific generated facades. For
exact TypeScript and Rust signatures, use the generated API reference and
Rustdoc under /api.
5) Operation model
The public operation model is shared across languages. Exact exported type names and method signatures belong in generated API reference/Rustdoc, but every runtime MUST preserve these logical fields and semantics:
- operation state is one of pending, running, completed, failed, or cancelled
- an operation reference identifies the operation id, owning service, and operation key, and supports current-state reads, terminal waits, live watches, cancellation, and named signals; unsupported cancellation or signal submission returns an expected failure rather than silently succeeding
- a durable snapshot carries operation identity, a monotonic revision, timestamps, current state, optional typed progress, optional transfer progress, optional typed output, and an error view for failed terminal outcomes
- terminal snapshots are snapshots whose state is completed, failed, or cancelled
- lifecycle events carry accepted, started, transfer progress, progress, completed, failed, and cancelled changes as appropriate for the descriptor
- generated progress events carry the progress payload both as the event payload and on the embedded snapshot; generated transfer events do the same for transfer progress
- runtime control events that are not part of this public lifecycle remain hidden behind the operation reference
- accepted operation signals are private operation-control inputs; they are not public lifecycle events and do not appear on watch() unless the service later reflects their effects through progress or terminal snapshots
Lifecycle rules:
- the first externally visible event MUST be accepted
- accepted creates a durable operation snapshot in pending
- every durable snapshot exposes a monotonic public revision
- accepted signals are persisted with a private monotonic signal sequence and do not increment the public snapshot revision
- started transitions the snapshot to running
- transfer updates the stored transfer progress payload, emits once per acknowledged chunk, and MUST carry that payload as both event.transfer and event.snapshot.transfer
- progress updates the stored progress payload, does not change terminal state, and MUST carry that payload as both event.progress and event.snapshot.progress
- completed, failed, and cancelled are terminal
- a service handler may explicitly defer terminal completion only by returning the runtime’s operation-deferred sentinel. Deferral means the accepted operation remains durable and non-terminal until another authorized service path completes, fails, or cancels the same operation id.
- service-side control by id is that durable external terminal path for service-private jobs. Calling it MUST NOT create a new operation, publish a new accepted event, or invoke the original operation handler again.
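The lifecycle rules above amount to a small state machine over the durable snapshot. The following is a minimal sketch under stated assumptions, not the runtime's storage code: DurableSnapshot, LifecycleChange, acceptOperation, and applyChange are hypothetical names, transfer progress is omitted, and incrementing revision by exactly 1 per durable change is an assumption (the rules only require that it be monotonic and start at 1).

```typescript
type OperationState = "pending" | "running" | "completed" | "failed" | "cancelled";
type OperationError = { type: string; message: string };

// Hypothetical snapshot shape; only the fields the lifecycle rules touch.
type DurableSnapshot<P, O> = {
  state: OperationState;
  revision: number;
  progress?: P;
  output?: O;
  error?: OperationError;
};

type LifecycleChange<P, O> =
  | { type: "started" }
  | { type: "progress"; progress: P }
  | { type: "completed"; output: O }
  | { type: "failed"; error: OperationError }
  | { type: "cancelled" };

type ApplyResult<P, O> =
  | { ok: true; snapshot: DurableSnapshot<P, O> }
  | { ok: false; reason: string };

function isTerminal(state: OperationState): boolean {
  return state === "completed" || state === "failed" || state === "cancelled";
}

// accepted creates the durable snapshot in pending at revision 1.
function acceptOperation<P, O>(): DurableSnapshot<P, O> {
  return { state: "pending", revision: 1 };
}

function applyChange<P, O>(
  current: DurableSnapshot<P, O>,
  change: LifecycleChange<P, O>,
): ApplyResult<P, O> {
  // Terminal snapshots are final: reject instead of mutating.
  if (isTerminal(current.state)) {
    return { ok: false, reason: `operation is already ${current.state}` };
  }
  const revision = current.revision + 1; // assumption: +1 per durable change
  switch (change.type) {
    case "started":
      return { ok: true, snapshot: { ...current, state: "running", revision } };
    case "progress":
      // progress updates the payload but never changes terminal state
      return { ok: true, snapshot: { ...current, progress: change.progress, revision } };
    case "completed":
      return { ok: true, snapshot: { ...current, state: "completed", output: change.output, revision } };
    case "failed":
      return { ok: true, snapshot: { ...current, state: "failed", error: change.error, revision } };
    case "cancelled":
      return { ok: true, snapshot: { ...current, state: "cancelled", revision } };
  }
}
```

Rejected changes come back as an expected-failure value rather than an exception, mirroring the Result-oriented style the public API rules call for.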
6) Operations use caller _INBOX subjects for live watch streams
Operation watch streams MUST use the caller’s inbox space.
Rules:
- the caller opens watch() on the same authenticated NATS connection it already owns
- the runtime uses a reply subject under the caller’s _INBOX prefix
- the service MUST validate that the reply subject starts with the caller’s authorized inbox prefix
- the service streams operation events to that reply subject
- no new NATS connection is required to watch an operation
This keeps operation watches private to the authenticated caller while avoiding general-purpose cross-service subscribe grants.
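The reply-subject check in the rules above could look roughly like the sketch below. The function name isReplySubjectAllowed is hypothetical, and two details go beyond the text and are assumptions: the prefix is matched on a token boundary (so a prefix of _INBOX.abc does not admit _INBOX.abcd.1), and subjects containing empty tokens or the NATS wildcard tokens * and > are rejected.

```typescript
// Hypothetical helper: returns true when replySubject sits strictly under the
// caller's authorized inbox prefix.
function isReplySubjectAllowed(replySubject: string, inboxPrefix: string): boolean {
  if (inboxPrefix.length === 0) return false;
  // Assumption: match on a token boundary, so normalize the prefix to end with ".".
  const prefix = inboxPrefix.charAt(inboxPrefix.length - 1) === "."
    ? inboxPrefix
    : inboxPrefix + ".";
  if (replySubject.slice(0, prefix.length) !== prefix) return false;
  const rest = replySubject.slice(prefix.length);
  if (rest.length === 0) return false;
  // Assumption: a concrete reply subject never contains empty or wildcard tokens.
  return rest.split(".").every((token) =>
    token !== "" && token !== "*" && token !== ">"
  );
}
```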
7) Operation wire model
The TypeScript public API is operation.<group>.<leaf>.start(input) plus OperationRef.get/wait/watch/cancel/signal. Other language runtimes expose the
same protocol through their generated facade support as it lands. The underlying
wire model is standardized enough for auth and codegen.
Rules:
- invoking an operation publishes the input payload to the operation’s declared subject
- every operation also has a derived control subject: <subject>.control
- the runtime uses the control subject for get, wait, watch, cancel, and signal
- watch and wait send a reply subject under the caller’s _INBOX prefix and receive responses on that subject
- get, cancel, and signal are single-response control requests
- watch is a streaming control request
- wait is a streaming or long-poll control request that terminates on the first terminal snapshot
The control envelope format is an internal Trellis runtime detail rather than a service-authored contract type, but it is still fixed by this document so independent implementations remain compatible.
7a) Internal invoke response envelope
Starting an operation sends the contract-defined input payload to the operation subject and receives exactly one accepted response on the caller reply subject.
type OperationAcceptedEnvelope<TProgress, TOutput> = {
  kind: "accepted";
  ref: {
    id: string;
    service: string;
    operation: string;
  };
  snapshot: OperationSnapshot<TProgress, TOutput> & {
    revision: number;
  };
  transfer?: TransferGrant & { direction: "send" };
};
Rules:
- the service MUST allocate the operation id before replying
- the accepted reply MUST include the initial durable snapshot
- transfer-capable operations MUST include the runtime-owned send transfer session data needed to execute the builder-managed send step
- the initial snapshot revision MUST be 1
- the accepted reply is the only response sent for operation.<group>.<leaf>.start(input)
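A runtime receiving the accepted reply might sanity-check it against these rules before handing an operation reference to the caller. The guard below is a hedged sketch: checkAcceptedReply is a hypothetical name, the snapshot's state field name is an assumption (this document only fixes revision on the envelope type), and transfer data is not inspected.

```typescript
type AcceptedCheck = { ok: true } | { ok: false; reason: string };

// Hypothetical guard over a decoded accepted reply.
function checkAcceptedReply(reply: unknown): AcceptedCheck {
  if (typeof reply !== "object" || reply === null) {
    return { ok: false, reason: "reply is not an object" };
  }
  const envelope = reply as Record<string, unknown>;
  if (envelope.kind !== "accepted") {
    return { ok: false, reason: "kind must be accepted" };
  }
  // The service must have allocated an operation id before replying.
  const ref = envelope.ref as Record<string, unknown> | null | undefined;
  if (!ref || typeof ref.id !== "string" || ref.id.length === 0) {
    return { ok: false, reason: "missing operation id" };
  }
  // The reply must include the initial durable snapshot.
  const snapshot = envelope.snapshot as Record<string, unknown> | null | undefined;
  if (!snapshot) {
    return { ok: false, reason: "missing initial snapshot" };
  }
  // Assumption: the snapshot exposes its state under a field named "state".
  if (snapshot.state !== "pending") {
    return { ok: false, reason: "initial snapshot must be pending" };
  }
  if (snapshot.revision !== 1) {
    return { ok: false, reason: "initial revision must be 1" };
  }
  return { ok: true };
}
```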
7b) Internal control request envelope
All follow-up operation control requests publish a runtime-owned envelope to <subject>.control.
type OperationControlRequest =
  | {
    action: "get";
    operationId: string;
  }
  | {
    action: "wait";
    operationId: string;
    includeProgress?: boolean;
  }
  | {
    action: "watch";
    operationId: string;
  }
  | {
    action: "cancel";
    operationId: string;
  }
  | {
    action: "signal";
    operationId: string;
    signal: string;
    input?: JsonValue;
  };
Rules:
- operationId is always required for control requests
- signal is the contract-declared signal name for action: "signal"
- input is validated against the matching signal descriptor’s input schema; rejected signals are not persisted
- the public runtime owns this envelope; user code never constructs it directly
- every control request MUST be authenticated and authorization-checked against the referenced operation id
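As an illustration of how a runtime (not user code) might assemble a control message, the sketch below derives the control subject and serializes the envelope. The encodeControlRequest function and its return shape are hypothetical, the request type is repeated so the block stands alone, and JSON encoding of the payload is an assumption; this document fixes the envelope fields, not the byte encoding.

```typescript
type JsonValue =
  | null
  | boolean
  | number
  | string
  | JsonValue[]
  | { [key: string]: JsonValue };

// Mirrors the control request envelope defined above.
type OperationControlRequest =
  | { action: "get"; operationId: string }
  | { action: "wait"; operationId: string; includeProgress?: boolean }
  | { action: "watch"; operationId: string }
  | { action: "cancel"; operationId: string }
  | { action: "signal"; operationId: string; signal: string; input?: JsonValue };

// Hypothetical runtime-internal helper.
function encodeControlRequest(
  operationSubject: string,
  request: OperationControlRequest,
): { subject: string; payload: string } {
  return {
    // Every operation has a derived control subject: <subject>.control
    subject: `${operationSubject}.control`,
    // Assumption: the envelope is serialized as JSON.
    payload: JSON.stringify(request),
  };
}
```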
7c) Internal control response frames
get, wait, watch, cancel, and signal all respond with standardized
internal frames on the validated caller reply subject.
type OperationControlFrame<TProgress, TOutput> =
  | {
    kind: "snapshot";
    snapshot:
      & (
        | OperationSnapshot<TProgress, TOutput>
        | TerminalOperation<TProgress, TOutput>
      )
      & {
        revision: number;
      };
  }
  | {
    kind: "event";
    sequence: number;
    event: OperationEvent<TProgress, TOutput>;
  }
  | {
    kind: "signal-accepted";
    operationId: string;
    signal: string;
    signalSequence: number;
    acceptedAt: string;
    snapshot: OperationSnapshot<TProgress, TOutput>;
  }
  | {
    kind: "keepalive";
  }
  | {
    kind: "error";
    error: SerializableTrellisError;
  };
Rules:
- revision is a monotonically increasing durable snapshot version scoped to one operation id
- sequence is a monotonically increasing stream sequence scoped to one operation id
- signalSequence is a private monotonically increasing signal-log sequence scoped to one operation id and is independent of public snapshot revision
- error frames are expected control-request failures or runtime/internal protocol failures; domain failure outcomes remain normal terminal operation snapshots with state failed
- error frame payloads MUST use the normal Trellis serializable error shape, including stable type, message, id, optional context, and any error-specific fields, so operation lifecycle failures remain typed across client/server boundaries
- runtimes MUST hide these internal frames behind OperationRef methods
7d) Internal method behavior
get:
- sends one OperationControlRequest with action: "get"
- receives exactly one snapshot or error frame
- MUST then close the reply stream
wait:
- sends one OperationControlRequest with action: "wait"
- if the operation is already terminal, receives exactly one terminal snapshot frame and closes
- otherwise receives zero or more event frames, optional keepalive frames, then exactly one terminal snapshot frame and closes
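The wait behavior can be pictured as a per-frame classifier that skips event and keepalive frames and stops at the first terminal snapshot. This is a simplified sketch with hypothetical names (WaitFrame, stepWait, resolveWait): frames are reduced to the fields the decision needs, a plain array stands in for the reply stream, and treating a non-terminal snapshot as a protocol error with a made-up ProtocolError type is an assumption.

```typescript
type WaitState = "pending" | "running" | "completed" | "failed" | "cancelled";
type WireError = { type: string; message: string };
type WaitSnapshot = { state: WaitState; revision: number };

// Simplified frames: only what the wait decision needs.
type WaitFrame =
  | { kind: "snapshot"; snapshot: WaitSnapshot }
  | { kind: "event"; sequence: number }
  | { kind: "keepalive" }
  | { kind: "error"; error: WireError };

type WaitStep =
  | { done: false }
  | { done: true; ok: true; snapshot: WaitSnapshot }
  | { done: true; ok: false; error: WireError };

function stepWait(frame: WaitFrame): WaitStep {
  switch (frame.kind) {
    case "event":
    case "keepalive":
      return { done: false }; // keep reading
    case "error":
      return { done: true, ok: false, error: frame.error };
    case "snapshot": {
      const state = frame.snapshot.state;
      if (state === "completed" || state === "failed" || state === "cancelled") {
        // A failed operation is still a normal terminal snapshot, not an error frame.
        return { done: true, ok: true, snapshot: frame.snapshot };
      }
      // Assumption: wait never delivers a non-terminal snapshot.
      return {
        done: true,
        ok: false,
        error: { type: "ProtocolError", message: "wait received a non-terminal snapshot" },
      };
    }
  }
}

// Drives stepWait over already-received frames; returns { done: false } if
// the frames end before a terminal snapshot or error arrives.
function resolveWait(frames: readonly WaitFrame[]): WaitStep {
  for (const frame of frames) {
    const step = stepWait(frame);
    if (step.done) return step;
  }
  return { done: false };
}
```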
watch:
- sends one OperationControlRequest with action: "watch"
- receives exactly one initial snapshot frame representing current durable state
- then receives zero or more event frames and optional keepalive frames
- after a terminal event, the service MUST close the reply stream
cancel:
- sends one OperationControlRequest with action: "cancel"
- receives exactly one snapshot frame containing the post-cancel durable state, or one error frame
- if the descriptor is not cancelable, the service MUST return an error frame and MUST NOT mutate the durable operation state
- cancel authorization uses capabilities.cancel, not capabilities.control
- MUST then close the reply stream
signal:
- sends one OperationControlRequest with action: "signal"
- receives exactly one signal-accepted frame containing the accepted signal sequence and current durable snapshot, or one error frame
- the service MUST persist the accepted signal before acknowledging it
- the service MUST reject unknown signal names, invalid signal payloads, terminal operations, and operations not running in the current service process
- accepted signals MUST NOT emit public watch events by themselves
- MUST then close the reply stream
Keepalive rules:
- keepalive frames are optional for watch and wait
- if emitted, the interval MUST be at least 5 seconds and at most 30 seconds
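The keepalive bounds are inclusive on both ends, which a configuration check can make explicit. Both helpers below are hypothetical names in a minimal sketch; reading "MUST NOT carry domain data" as "the frame has no field other than kind" is an assumption.

```typescript
const KEEPALIVE_MIN_MS = 5000;
const KEEPALIVE_MAX_MS = 30000;

// True when a configured keepalive interval falls within the allowed range.
function isValidKeepaliveInterval(intervalMs: number): boolean {
  return intervalMs >= KEEPALIVE_MIN_MS && intervalMs <= KEEPALIVE_MAX_MS;
}

// Assumption: a keepalive frame that carries no domain data has "kind" as its only field.
function isBareKeepalive(frame: unknown): boolean {
  if (typeof frame !== "object" || frame === null) return false;
  const fields = Object.keys(frame);
  return fields.length === 1 &&
    (frame as Record<string, unknown>).kind === "keepalive";
}
```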
8) Auth model for operations
Operations eliminate the need for a global end-user jobs-read capability.
Authorization rules:
- invocation is gated by authentication plus the operation’s declared capabilities.call
- observe access (get, wait, watch) is gated by authentication, capabilities.observe, and the owning service’s operation-level authorization logic
- cancel access is gated by authentication, capabilities.cancel, and the owning service’s operation-level authorization logic
- signal access is gated by authentication, capabilities.control, and the owning service’s operation-level authorization logic
- the owning service MUST persist enough operation ownership metadata to authorize follow-up access to a specific operation id
- the default runtime rule is creator-bound visibility: the principal that created the operation may observe it later unless the owning service explicitly grants broader domain access
Trellis MUST NOT introduce a broad deployment-wide capability equivalent to “observe every operation everywhere” for ordinary clients.
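The authorization rules can be summarized as a decision function over the action, the descriptor capabilities, and stored ownership metadata. The sketch below is illustrative only, with hypothetical names (Caller, OwnershipRecord, authorizeOperationAction). It makes three assumptions that this document does not fix: a caller must hold every listed capability, the creator-bound default is applied to cancel and signal as well as to observation, and service-specific broadening of access is not modeled.

```typescript
type OperationAction = "call" | "get" | "wait" | "watch" | "cancel" | "signal";

type OperationCapabilities = {
  call: string[];
  observe?: string[];
  cancel?: string[];
  control?: string[];
};

type Caller = { id: string; capabilities: string[] };
type OwnershipRecord = { operationId: string; createdBy: string };

// Returns the capabilities an action needs, or null when the action is not
// caller-accessible at all.
function requiredCapabilities(
  action: OperationAction,
  caps: OperationCapabilities,
): string[] | null {
  switch (action) {
    case "call":
      return caps.call;
    case "get":
    case "wait":
    case "watch":
      return caps.observe ?? caps.call; // observe defaults to call
    case "cancel":
      return caps.cancel ?? null; // omitted: not caller-accessible
    case "signal":
      return caps.control ?? []; // omitted: no additional capabilities
  }
}

function authorizeOperationAction(
  caller: Caller | null,
  action: OperationAction,
  caps: OperationCapabilities,
  record?: OwnershipRecord,
): boolean {
  if (caller === null) return false; // operations are always authenticated
  const required = requiredCapabilities(action, caps);
  if (required === null) return false;
  const held = caller.capabilities;
  // Assumption: every listed capability is required.
  if (!required.every((capability) => held.indexOf(capability) !== -1)) return false;
  if (action === "call") return true;
  // Follow-up access is checked against the stored operation record.
  // Assumption: creator-bound by default for all follow-up actions.
  return record !== undefined && record.createdBy === caller.id;
}
```

An owning service that grants broader domain access would replace the final creator check with its own operation-level logic.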
9) Auth callout and reply permissions
Unary RPC response semantics are insufficient for operation watch streams.
Rules:
- Trellis auth MUST permit bounded multi-response publishing to a validated caller reply subject for authenticated operation streams
- this permission applies to a reply subject derived from a request the service actually received; it is not a general publish grant to arbitrary inbox subjects
- unary RPCs remain single-response operations by convention even if the transport permission can support multiple responses
This keeps the security property of reply-subject validation while allowing streamed responses for operations.
10) Jobs remain the service-private execution layer
Operations and jobs integrate, but they do not collapse into one concept.
Rules:
- a service MAY back an operation with one or more local jobs
- internal jobs SHOULD carry operationId when they contribute to caller-visible async work
- a service runtime SHOULD provide a helper to attach a JobRef to an OperationRef
- when a handler returns op.defer(), the corresponding job SHOULD resume the operation through the public operation-scoped service control helper using only the stored operationId
- callers never need to know internal job ids or job types
- changing internal job topology MUST NOT break the public operation contract
11) Realistic example
Scenario:
- Billing.Refund is the public operation
- submitRefund is an internal billing job
- Payments.Refund is a remote operation exposed by the payments service
- Notifications.Email.Send is another remote operation that is not part of refund completion semantics
The caller starts Billing.Refund, receives an operation reference, watches
progress, and waits for a terminal snapshot. The billing service may accept the
operation, persist its operation id, enqueue submitRefund, and return a
deferred sentinel. The job later resumes service-side operation control by id,
marks the operation started, publishes domain progress, invokes Payments.Refund, optionally triggers a notification, and completes or fails
the original billing operation.
The important design invariant is that callers depend only on the public Billing.Refund operation contract. They do not see the billing job id, the job
queue topology, or whether notification sending is implemented as an operation,
RPC, event, or local side effect. Language-specific code for this scenario
belongs in /guides/libraries/typescript, /guides/libraries/rust, and /api.