Fugue Labs · Open Source

gollem

The production agent framework for Go. Typed agents, structured output, streaming, and runtime traces for agents that edit files, wait on humans, delegate work, resume after crashes, fork from checkpoints, and prove the branch with diff/regression evidence. Zero core dependencies. Single binary.

$ go get github.com/fugue-labs/gollem

What a run looks like: four agents, four transcripts

Every run emits structured trace events. The same trace system renders four different workloads, one per agent archetype.

✓ Each line is a trace event. Export to JSON, OpenTelemetry, or plug in your own TraceExporter.

Trace workbench: real agents, real state

Tracing is not enough when the agent is allowed to change the world.

Span dashboards tell you what happened. Gollem traces are runtime artifacts: the messages, tools, approvals, snapshots, file diffs, topology, evaluator results, and fork provenance needed to operate on the run after it has already done real work.

14-hour code agent: catches SIGTERM, writes a partial trace, and resumes from the last checkpoint instead of losing the night
Temporal workflow: exports a waiting approval state from a worker fleet, then continues after the human decision lands
Delegate run: shows which sub-agent edited which file, what topology changed, and where the root plan diverged
Sleepy sweep: ranks candidate prompts and runtime strategies by evaluator score, file diffs, cost, and first divergence

observability trace: model calls, tool spans, latency, tokens, errors, and request/response panels for debugging the application
runtime trace: all of that, plus resumable snapshots, approval state, artifact evidence, branch lineage, Temporal status, delegate topology, and one-command fork continuation

runtime animation · one canonical trace across workers · t+13h42m · checkpointing

root: plan (classify task) → delegate (spawn test worker) → wait (approval boundary) → fork (fresh segment) → diff (gate branch)

temporal: worker-7 (model boundary) → signal (pending approval) → resume (decision lands) → export (status trace) → close (succeeded)

code: tool read (snapshot.go) → bash (replay test fails) → edit (preserve messages) → artifact (hash + hunk) → test (green)

sleepy: baseline (old prompt) → candidate (branch prompt) → score (eval + cost) → rank (strategy wins) → evidence (trace bundle)

active boundary: root plan · classify task
state captured: messages, tools, topology
human state: no approval pending
next action: delegate tests

trace diff engine · causal path + artifacts + evaluator deltas stage 1 · align

baseline trace
Event	Boundary	Status
184	model.requested · worker-7	same
191	tool.completed · replay test failed	same
196	checkpoint.created · bad message state	fork
211	artifact.changed · no rc messages	fail
249	evaluator.completed · replay invalid	0.42

forked trace
Event	Boundary	Status
184	model.requested · worker-7	same
191	tool.completed · replay test failed	same
196f	checkpoint.forked · preserve rc messages	first
211f	artifact.changed · snapshot.go hunk	fix
249f	evaluator.completed · tests pass	0.91

fingerprint alignment: The algorithm hashes stable event boundaries first: kind, run lineage, request/tool IDs, checkpoint IDs, topology node, and redacted payload keys.
visual output: 184, 191 match
same worker, same model boundary, same failing command
causal path still shared

first divergence: The fork point is not guessed from timestamps. It is the first unmatched causal boundary after the shared prefix: checkpoint 196 becomes checkpoint 196f.
visual output: first divergence: checkpoint.created
baseline path keeps stale snapshot messages
fork path uses RunContext messages

artifact delta: The diff attaches file evidence to the causal boundary: before/after hashes, bounded hunk, tool call, and omission reason when content is unsafe or binary.
visual output: snapshot.go changed
1 hunk, 1 semantic field restored
no unrelated file churn

regression score: Then it scores the branch: final status, evaluator result, retry/error delta, topology delta, artifact delta, token/cost delta, and final output drift.
visual output: fork wins
eval +0.49 · cost -18% · retries -2
tests pass, branch accepted
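
The alignment and first-divergence stages above can be sketched in a few lines of Go. This is a minimal illustration, not gollem's implementation: the Event struct, its field names, the ID values, and the fingerprint and firstDivergence helpers are all assumptions made for the example, and the real gollem.trace.v1 schema may differ. The sketch compares events positionally and stops at the first fingerprint mismatch.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Event is a hypothetical, simplified trace event (not the gollem schema).
type Event struct {
	Kind         string
	Lineage      string   // run lineage, e.g. "root/worker-7"
	BoundaryID   string   // request, tool, or checkpoint ID
	TopologyNode string   // node in the delegate topology
	PayloadKeys  []string // redacted payload keys only, assumed pre-sorted
}

// fingerprint hashes only the stable fields, so timestamps and payload
// values cannot influence alignment.
func fingerprint(e Event) string {
	parts := []string{e.Kind, e.Lineage, e.BoundaryID, e.TopologyNode,
		strings.Join(e.PayloadKeys, ",")}
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x00")))
	return hex.EncodeToString(sum[:])
}

// firstDivergence returns the index of the first event whose fingerprint
// differs between the two traces. If one trace is a strict prefix of the
// other it returns the shorter length, and -1 if the traces fully match.
func firstDivergence(base, fork []Event) int {
	n := len(base)
	if len(fork) < n {
		n = len(fork)
	}
	for i := 0; i < n; i++ {
		if fingerprint(base[i]) != fingerprint(fork[i]) {
			return i
		}
	}
	if len(base) != len(fork) {
		return n
	}
	return -1
}

func main() {
	base := []Event{
		{Kind: "model.requested", Lineage: "root/worker-7", BoundaryID: "184"},
		{Kind: "tool.completed", Lineage: "root/worker-7", BoundaryID: "191"},
		{Kind: "checkpoint.created", Lineage: "root/worker-7", BoundaryID: "196"},
	}
	fork := []Event{
		base[0],
		base[1],
		{Kind: "checkpoint.forked", Lineage: "root/worker-7", BoundaryID: "196f"},
	}
	i := firstDivergence(base, fork)
	fmt.Printf("first divergence at index %d: %s vs %s\n", i, base[i].Kind, fork[i].Kind)
}
```

Because timestamps never enter the hash, two runs that took different amounts of wall-clock time still align on their shared prefix.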

case: terminal-bench auth regression · 13h42m · fork won · tests pass · cost -18%

question answered

What did the worker actually send?

The trace captures the request after middleware, history processors, message interceptors, retries, and delegate routing shaped it. In a Kubernetes or Temporal fleet, this is the exact boundary emitted by the worker that made the call.

request: req_000184 · turn 27 · worker-7
model: claude-sonnet-4-5
visible: messages · settings · params · deltas · retry policy
tokens: 8,214 input · 1,306 output

Replay applies recorded model/tool boundaries to reconstructed state. It does not pretend model sampling is deterministic.

question answered

Which real command broke the run?

Every tool call stays paired with the result, elapsed time, error payload, approval outcome, workspace, and root/delegate lineage. The failure is tied to the terminal action, not reconstructed from logs later.

tool: bash
call: go test ./cmd/gollem ./ext/trace -run Replay -count=1
result: exit=1 · replay boundary mismatch
lineage: root run · trace worker · delegate=tests

question answered

Where can the real run resume?

Checkpoints make branch points operational. Resume the same run after SIGTERM, or fork from a step, event ID, checkpoint ID, or event kind and continue as a fresh trace segment.

checkpoint: snap_13h42m
state: messages · tool state · cost · retries · fork provenance
resume: fresh trace segment
source: run_tb2_auth · preserved in metadata

question answered

What is waiting on a human?

Approvals, sleeps, deferred work, and Temporal waits are first-class boundaries. A live workflow can export as waiting, not failed, with the unresolved decision still visible.

request: bash requires approval · go vet ./...
status: waiting · exported from Temporal status
replay: valid unresolved approval boundary
audit: approved=false stays different from missing

question answered

What did the agent mutate?

Artifact events turn filesystem mutations into evidence: path, operation, tool call, before/after hashes, omission reasons, and bounded unified diff hunks when text content is safe to capture.

@@ -69,7 +69,8 @@ func Snapshot(rc RunContext) *RunSnapshot {
-    snap := state.snapshot(rc.AgentID, rc.RunID)
+    snap := state.snapshot(rc.AgentID, rc.RunID)
+    snap.Messages = append([]Message(nil), rc.Messages...)
     return snap

question answered

Did the branch deserve to live?

Diff and regression reports compare baseline and forked traces by divergence, final output, usage, cost, topology, evaluator score, retries, errors, and artifacts. The output is review evidence, not a vibes-based rerun.

baseline: tb2-auth-base.trace.json
variant: tb2-auth-fork.trace.json
first diff: event 196 · forked checkpoint
result: tests pass · cost -18%
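
As a rough sketch of how a regression gate might turn two trace summaries into the deltas shown above, consider the following. Everything here is hypothetical: the Summary and Delta types, the compare function, the acceptance rule (required status reached and evaluator score not lower), and the dollar costs are illustrative assumptions, not gollem's report format. Only the evaluator scores 0.42 and 0.91 come from the example traces.

```go
package main

import "fmt"

// Summary is a hypothetical per-trace roll-up used only for this sketch.
type Summary struct {
	Status    string
	EvalScore float64
	CostUSD   float64
	Retries   int
}

// Delta holds the fork-minus-baseline differences.
type Delta struct {
	Eval        float64 // absolute change in evaluator score
	CostPercent float64 // relative change in cost, in percent
	Retries     int     // change in retry count
	Accepted    bool
}

// compare computes deltas and applies an assumed acceptance rule: the fork
// must reach the required status and must not score lower than the baseline.
func compare(base, fork Summary, requireStatus string) Delta {
	d := Delta{
		Eval:    fork.EvalScore - base.EvalScore,
		Retries: fork.Retries - base.Retries,
	}
	if base.CostUSD > 0 {
		d.CostPercent = (fork.CostUSD - base.CostUSD) / base.CostUSD * 100
	}
	d.Accepted = fork.Status == requireStatus && d.Eval >= 0
	return d
}

func main() {
	// Costs and retry counts are made-up inputs chosen for illustration.
	base := Summary{Status: "failed", EvalScore: 0.42, CostUSD: 1.00, Retries: 3}
	fork := Summary{Status: "succeeded", EvalScore: 0.91, CostUSD: 0.82, Retries: 1}

	d := compare(base, fork, "succeeded")
	fmt.Printf("eval %+.2f · cost %+.0f%% · retries %+d · accepted=%v\n",
		d.Eval, d.CostPercent, d.Retries, d.Accepted)
}
```

Keeping the gate a pure function of two summaries means the same check can run locally or in CI against stored trace files.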

capture: gollem run --trace-out
inspect: gollem trace view
branch: gollem trace fork --continue
compare: gollem trace diff
gate: gollem trace regress

sh · trace workflow
$ gollem run --trace-out base.trace.json "fix the failing auth test"
$ gollem trace view base.trace.json
$ gollem trace fork base.trace.json --from-checkpoint snap_13h42m \
      --append-user "preserve RunContext messages in snapshots" \
      --continue --out fork.trace.json
$ gollem trace diff base.trace.json fork.trace.json
$ gollem trace regress base.trace.json fork.trace.json --require-status succeeded

Canonical: gollem.trace.v1 is the single trace format across local CLI, SDK runs, Temporal status export, team/delegate runs, and dashboard artifacts.

Forkable: Fork from step, event ID, checkpoint ID, or event kind. Continue as a fresh run segment; the original trace is not prepended or appended.

Comparable: Diffs show first divergence, runtime path, usage, cost, retries, errors, topology, evaluator output, final output, and artifact changes.

Shareable: Validate, redact, compact, and export trace-backed evidence for reviews, long-run debugging, and Sleepy candidate ranking.

Typed agents, typed results

Agent[T] is the central type. You define the output shape; gollem generates the JSON Schema, validates every model response against it, auto-repairs malformed output with a repair model, and hands you a typed Go struct.

go
type Analysis struct {
    Sentiment  string   `json:"sentiment"  jsonschema:"enum=positive|negative|neutral"`
    Keywords   []string `json:"keywords"   jsonschema:"description=Key topics"`
    Confidence float64  `json:"confidence" jsonschema:"minimum=0,maximum=1"`
}

agent := gollem.NewAgent[Analysis](model,
    gollem.WithSystemPrompt[Analysis]("You are a sentiment analyst."),
    gollem.WithOutputRepair[Analysis](gollem.ModelRepair[Analysis](repairModel)),
    gollem.WithOutputValidator[Analysis](func(a Analysis) error {
        if a.Confidence < 0 || a.Confidence > 1 {
            return fmt.Errorf("confidence out of range: %f", a.Confidence)
        }
        return nil
    }),
)

result, _ := agent.Run(ctx, "Analyze this earnings call transcript.")
fmt.Println(result.Output.Sentiment)    // string, not map[string]any
fmt.Println(result.Output.Confidence)   // float64, not interface{}

Schema generated from struct tags. No hand-written JSON schemas. No json.Unmarshal at the callsite. No type assertions.
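
To make the struct-tag idea concrete, here is a minimal sketch of deriving a JSON Schema object from a flat struct with the standard reflect package. It is not gollem's generator: schemaFor and jsonType are hypothetical helpers, only the description= form of the jsonschema tag is parsed, every field is treated as required, and nested structs are not handled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"strings"
)

// Analysis mirrors the shape used above, with simplified tags.
type Analysis struct {
	Sentiment  string   `json:"sentiment" jsonschema:"description=Overall tone"`
	Keywords   []string `json:"keywords" jsonschema:"description=Key topics"`
	Confidence float64  `json:"confidence" jsonschema:"description=Score between 0 and 1"`
}

// schemaFor builds a minimal JSON Schema object for a flat struct value.
// Fields without a json tag are skipped (a simplification for this sketch).
func schemaFor(v any) map[string]any {
	t := reflect.TypeOf(v)
	props := map[string]any{}
	required := []string{}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		name := strings.Split(f.Tag.Get("json"), ",")[0]
		if name == "" || name == "-" {
			continue
		}
		prop := map[string]any{"type": jsonType(f.Type.Kind())}
		if desc, ok := strings.CutPrefix(f.Tag.Get("jsonschema"), "description="); ok {
			prop["description"] = desc
		}
		props[name] = prop
		required = append(required, name)
	}
	return map[string]any{"type": "object", "properties": props, "required": required}
}

// jsonType maps a Go kind to the closest JSON Schema type name.
func jsonType(k reflect.Kind) string {
	switch k {
	case reflect.String:
		return "string"
	case reflect.Bool:
		return "boolean"
	case reflect.Int, reflect.Int32, reflect.Int64:
		return "integer"
	case reflect.Float32, reflect.Float64:
		return "number"
	case reflect.Slice, reflect.Array:
		return "array"
	default:
		return "object"
	}
}

func main() {
	out, err := json.MarshalIndent(schemaFor(Analysis{}), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The point of the design is that the Go type stays the single source of truth: changing a field changes the schema the model is validated against.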

Streaming with Go 1.23+ iterators

Four streaming modes share one interface. All are exposed as iter.Seq2[T, error]: no channels, no callbacks, no goroutine management.

Same stream, any mode. Switch based on latency budget or transport.

go · raw deltas
stream, _ := agent.RunStream(ctx, "Write a story about a robot.")

// Raw incremental chunks as they arrive from the model.
for delta, err := range gollem.StreamTextDelta(stream) {
    if err != nil { return err }
    fmt.Print(delta)                   // "The " "robot " "powered " ...
}

go · accumulated
// Growing accumulated text at each step. Ideal for React/UI updates.
for text, err := range gollem.StreamTextAccumulated(stream) {
    if err != nil { return err }
    updateUI(text)                    // "The " → "The robot " → "The robot powered "
}

go · debounced
// Grouped delivery every 100ms: batches deltas into fewer network frames for WebSocket clients.
for text, err := range gollem.StreamTextDebounced(stream, 100*time.Millisecond) {
    if err != nil { return err }
    sendToClient(text)                // fewer frames, still feels live
}

go · unified
// Single function with options. Switch modes without rewriting the loop.
for text, err := range gollem.StreamText(stream, gollem.StreamTextOptions{
    Mode:     gollem.StreamModeDebounced,
    Debounce: 100 * time.Millisecond,
}) {
    if err != nil { return err }
    handle(text)
}

Tools from typed functions

FuncTool[P] turns a typed Go function into a tool. Parameter schemas come from struct tags via reflection. Access typed dependencies through the run context: no globals, no singletons, no any.

go
type SearchParams struct {
    Query string `json:"query" jsonschema:"description=Search query"`
    Limit int    `json:"limit" jsonschema:"description=Max results,default=10"`
}

type AppDeps struct { DB *sql.DB; Cache *redis.Client }

searchTool := gollem.FuncTool[SearchParams](
    "search", "Search the knowledge base",
    func(ctx context.Context, rc *gollem.RunContext, p SearchParams) (string, error) {
        deps := gollem.GetDeps[*AppDeps](rc)    // compile-time type safe
        return doSearch(deps.DB, p.Query, p.Limit)
    },
)

agent := gollem.NewAgent[Report](model,
    gollem.WithTools[Report](searchTool),
    gollem.WithDeps[Report](&AppDeps{DB: db, Cache: cache}),
    gollem.WithDefaultToolTimeout[Report](10*time.Second),
    gollem.WithToolResultValidator[Report](nonEmpty),
)

Tool-choice control: Auto, Required, None, Force("name"), with optional auto-reset to prevent infinite loops.

Multi-agent orchestration

Three composition primitives. AgentTool for delegation. Handoff for sequential chains with context filters. Pipeline for parallel fan-out and conditional branching. For durable coordination across restarts, ext/orchestrator owns tasks, leases, schedulers, and artifact history.

go · pipeline
// One agent calls another as a tool.
orchestrator := gollem.NewAgent[FinalReport](model,
    gollem.WithTools[FinalReport](
        orchestration.AgentTool("research", "Delegate research", researcher),
    ),
)

// Pipeline with parallel steps and conditional branching.
pipe := gollem.NewPipeline(
    gollem.AgentStep(researcher),
    gollem.ParallelSteps(
        gollem.AgentStep(factChecker),
        gollem.AgentStep(editor),
    ),
    gollem.ConditionalStep(
        func(s string) bool { return len(s) > 5000 },
        gollem.AgentStep(summarizer),
        gollem.TransformStep(strings.TrimSpace),
    ),
)

go · team swarm
t := team.NewTeam(team.TeamConfig{
    Name:    "code-review",
    Leader:  "lead",
    Model:   model,
    Toolset: codingTools,
    PersonalityGenerator: modelutil.CachedPersonalityGenerator(
        modelutil.GeneratePersonality(model),
    ),
})

t.SpawnTeammate(ctx, "reviewer", "Review auth module for security vulnerabilities")
t.SpawnTeammate(ctx, "tester",   "Write comprehensive tests for the payment flow")
t.SpawnTeammate(ctx, "docs",     "Update API docs for the new endpoints")

leader := gollem.NewAgent[string](model, gollem.WithTools[string](team.LeaderTools(t)...))
result, _ := leader.Run(ctx, "Coordinate the review across all teammates.")

Each teammate runs as a goroutine with a fresh context window. The LLM itself writes the system prompt for each task; a SHA-256-keyed cache prevents redundant generations.
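
A hash-keyed cache of this kind can be sketched with the standard library alone. The promptCache type and its get method below are hypothetical and are not the modelutil API; the sketch assumes the cache key is the SHA-256 of the task text, and it holds the lock during generation for simplicity, which serializes concurrent misses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// promptCache memoizes generated prompts by the SHA-256 of the task text.
type promptCache struct {
	mu      sync.Mutex
	entries map[string]string
}

func newPromptCache() *promptCache {
	return &promptCache{entries: map[string]string{}}
}

// get returns the cached prompt for task, calling generate only on a miss.
// Failed generations are not cached, so they are retried on the next call.
func (c *promptCache) get(task string, generate func(string) (string, error)) (string, error) {
	sum := sha256.Sum256([]byte(task))
	key := hex.EncodeToString(sum[:])

	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.entries[key]; ok {
		return p, nil
	}
	p, err := generate(task)
	if err != nil {
		return "", err
	}
	c.entries[key] = p
	return p, nil
}

func main() {
	calls := 0
	generate := func(task string) (string, error) {
		calls++ // stands in for an expensive model call
		return "You are a specialist. Task: " + task, nil
	}

	cache := newPromptCache()
	for i := 0; i < 3; i++ {
		if _, err := cache.get("Review auth module", generate); err != nil {
			panic(err)
		}
	}
	fmt.Println("generator calls:", calls)
}
```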

Code mode: N tool calls, one round-trip

Traditional tool use is round-trip heavy: model asks, you execute, model waits, model asks again. Code mode exposes all of your tools as functions to an LLM-authored Python script that runs in a pure-Go WASM sandbox via monty-go, so the model composes the whole tool sequence in one shot.

traditional   model ──► tool1 ──► model ──► tool2 ──► model ──► result
              3 model calls · 2 context refills · serial latency

code mode     model ──► python { tool1(); tool2(); } ──► result
              1 model call · 0 refills · parallel execution

go
import "github.com/fugue-labs/gollem/ext/monty"

agent := gollem.NewAgent[Report](model,
    monty.AgentOptions(
        monty.WithTools(searchTool, fetchTool, citeTool),
    )...,
)

// The model writes a single Python script that calls N tools as functions.
// Runs in a WASM sandbox. No CGO, no containers, no subprocess.
result, _ := agent.Run(ctx, "Research and cite the top 5 papers on memory consolidation.")

python · what the model wrote
# gollem injects typed function stubs; the model chooses how to compose.
results = search(query="memory consolidation LLM", limit=10)
top = sorted(results, key=lambda r: r["score"], reverse=True)[:5]

# Parallel fetches in the sandbox; each call is a typed Go function.
docs = [fetch_url(url=r["url"]) for r in top]

final_result(
    summary="Consolidation requires decay scheduling ...",
    citations=[cite(doc=d) for d in docs],
)

One model round-trip. Up to N× fewer tokens than sequential tool use on branchy workloads. Sandbox timeout, memory cap, and import allowlist are configurable.

Graph workflows: typed state machines

When control flow outgrows linear pipelines, drop into ext/graph: typed state, conditional branches, fan-out / map-reduce, cycle detection, Mermaid export. Nodes and edges are type-checked at compile time.

  start  ──►  classify  ──►  { simple   ──►  answer  ──►  end
                          { complex  ──►  plan  ──►  fanout[3]
                                                            ├►  search
                                                            ├►  fetch
                                                            └►  analyze  ──►  merge  ──►  answer

go
g := graph.New[State]()
g.Node("classify", classifyFn).Edge("simple", simplePath).Edge("complex", complexPath)
g.FanOut("plan", searchNode, fetchNode, analyzeNode).Merge("merge", mergeFn)
g.Edge("merge", "answer")

if err := g.Validate(); err != nil {  // cycle detection at build time
    return err
}
fmt.Println(g.Mermaid())                // diagram for PRs
result, _ := g.Run(ctx, initialState)
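
Build-time cycle detection is commonly done with a three-state depth-first search. The sketch below shows that general technique on a plain adjacency map; hasCycle is a hypothetical helper and this is not necessarily how ext/graph implements Validate. The node names follow the diagram above.

```go
package main

import "fmt"

// hasCycle reports whether the directed graph contains a cycle, using a
// depth-first search that marks nodes as in progress while on the stack.
func hasCycle(edges map[string][]string) bool {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := map[string]int{} // missing keys read as unvisited

	var visit func(n string) bool
	visit = func(n string) bool {
		switch state[n] {
		case inProgress:
			return true // reached a node already on the current path
		case done:
			return false
		}
		state[n] = inProgress
		for _, next := range edges[n] {
			if visit(next) {
				return true
			}
		}
		state[n] = done
		return false
	}

	for n := range edges {
		if visit(n) {
			return true
		}
	}
	return false
}

func main() {
	g := map[string][]string{
		"start":    {"classify"},
		"classify": {"answer", "plan"},
		"plan":     {"search", "fetch", "analyze"},
		"search":   {"merge"},
		"fetch":    {"merge"},
		"analyze":  {"merge"},
		"merge":    {"answer"},
		"answer":   {"end"},
	}
	fmt.Println(hasCycle(g)) // false: the diagram is acyclic

	g["merge"] = append(g["merge"], "plan") // introduce a loop back to plan
	fmt.Println(hasCycle(g))                // true
}
```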

Guardrails, cost, observability

Production concerns are first-class. Guardrails at every lifecycle stage. Cost tracked per run and cumulative. Middleware composes like HTTP middleware. First registered is outermost.

go
tracker := gollem.NewCostTracker(map[string]gollem.ModelPricing{
    "claude-sonnet-4-5-20250929": {InputTokenCost: 0.003, OutputTokenCost: 0.015},
})

agent := gollem.NewAgent[Report](model,
    // Safety: validate prompts, turns, outputs.
    gollem.WithInputGuardrail[Report]("length", gollem.MaxPromptLength(10_000)),
    gollem.WithInputGuardrail[Report]("content", gollem.ContentFilter("ignore previous")),
    gollem.WithTurnGuardrail[Report]("turns", gollem.MaxTurns(20)),

    // Cost & usage.
    gollem.WithCostTracker[Report](tracker),
    gollem.WithUsageQuota[Report](gollem.UsageQuota{MaxRequests: 50, MaxTotalTokens: 100_000}),

    // Middleware: outer to inner.
    gollem.WithAgentMiddleware[Report](gollem.TimingMiddleware(metrics.RecordLatency)),
    gollem.WithAgentMiddleware[Report](gollem.LoggingMiddleware(log.Printf)),
    gollem.WithMessageInterceptor[Report](gollem.RedactPII(
        `\b\d{3}-\d{2}-\d{4}\b`, "[SSN REDACTED]",
    )),

    // Observability.
    gollem.WithTracing[Report](),
    gollem.WithTraceExporter[Report](gollem.NewJSONFileExporter("./traces")),
    gollem.WithRunCondition[Report](gollem.Or(
        gollem.MaxRunDuration(2*time.Minute),
        gollem.ToolCallCount(50),
    )),
)

Guardrails: MaxPromptLength, ContentFilter, MaxTurns, plus custom input / turn / output / tool-result validators.

Middleware: TimingMiddleware, LoggingMiddleware, MaxTokensMiddleware, or write your own. Skip the model call entirely if you want.

Interceptors: RedactPII, AuditLog, or custom. Intercept before the message leaves your system; transform responses on the way back.

Tracing: Canonical trace artifacts, runtime event recording, partial failed-run exports, fork snapshots, diff/regression commands, Temporal status export, and OpenTelemetry middleware.

Hooks: OnRunStart, OnRunEnd, OnModelRequest, OnModelResponse, OnToolStart, OnToolEnd.

Event bus: Typed pub/sub with Subscribe[E], Publish[E]. Built-in RunStartedEvent, ToolCalledEvent, RunCompletedEvent carry run IDs, parent IDs, timestamps.
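
The "first registered is outermost" rule mentioned at the top of this section is the same ordering used by HTTP middleware chains. The sketch below demonstrates it with hypothetical Handler and Middleware types and a chain helper; none of these are gollem's types, and the handler stands in for a model call.

```go
package main

import (
	"fmt"
	"strings"
)

// Handler stands in for a model call; Middleware wraps one Handler in another.
type Handler func(prompt string) string
type Middleware func(next Handler) Handler

// chain wraps h so that the first middleware in the list is outermost:
// it sees the request first and the response last.
func chain(h Handler, mws ...Middleware) Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// tag returns a middleware that records when it runs around the inner call.
func tag(name string, trace *[]string) Middleware {
	return func(next Handler) Handler {
		return func(p string) string {
			*trace = append(*trace, name+":before")
			out := next(p)
			*trace = append(*trace, name+":after")
			return out
		}
	}
}

func main() {
	var trace []string
	model := func(p string) string {
		trace = append(trace, "model")
		return strings.ToUpper(p)
	}

	h := chain(model, tag("timing", &trace), tag("logging", &trace))
	fmt.Println(h("hi"))
	fmt.Println(strings.Join(trace, " → "))
	// timing:before → logging:before → model → logging:after → timing:after
}
```

Registering timing first means its measurement includes the time spent in every middleware registered after it.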

Providers: one interface, swap freely

All providers implement the same Model interface. Wrap any with retry, rate limiting, and caching. Switch the import and the agent code is unchanged.

go · anthropic
import "github.com/fugue-labs/gollem/provider/anthropic"

// Reads ANTHROPIC_API_KEY from env.
claude := anthropic.New()

// Opt-in features.
claude = anthropic.New(
    anthropic.WithModel("claude-sonnet-4-5-20250929"),
    anthropic.WithExtendedThinking(anthropic.Thinking{Budget: 10_000}),
    anthropic.WithPromptCaching(),
)

go · openai
import "github.com/fugue-labs/gollem/provider/openai"

// Reads OPENAI_API_KEY from env.
gpt := openai.New()

// WebSocket continuation for tool-heavy loops (non-streaming).
gpt = openai.New(
    openai.WithModel("gpt-4o"),
    openai.WithTransport("websocket"),     // or OPENAI_TRANSPORT=websocket
    openai.WithJSONMode(),                     // native structured output
)

go · vertex ai
import "github.com/fugue-labs/gollem/provider/vertexai"

// Uses GCP application default credentials.
gemini := vertexai.New("my-project", "us-central1")

gemini = vertexai.New("my-project", "us-central1",
    vertexai.WithModel("gemini-2.0-flash"),
    vertexai.WithJSONMode(),
)

go · vertex · anthropic
import "github.com/fugue-labs/gollem/provider/vertexai_anthropic"

// Claude via Vertex: extended thinking + prompt caching + GCP auth.
vc := vertexai_anthropic.New("my-project", "us-east5",
    vertexai_anthropic.WithModel("claude-sonnet-4-5@20250929"),
    vertexai_anthropic.WithExtendedThinking(Thinking{Budget: 10_000}),
)

go · resilience wrappers
// Retry around rate-limit around cache around raw. Works for any Model.
resilient := gollem.NewRetryModel(
    gollem.NewRateLimitedModel(
        gollem.NewCachedModel(claude, gollem.NewMemoryCacheWithTTL(5*time.Minute)),
        10, 20, // rps, burst
    ),
    gollem.DefaultRetryConfig(),
)

// Or route by capability: same agent code, right model per prompt.
router := gollem.NewCapabilityRouter(
    []gollem.Model{fast, power, vision},
    gollem.ModelProfile{SupportsVision: true, SupportsToolCalls: true},
)

Capability	Anthropic	OpenAI	Vertex AI	Vertex · Anthropic
Structured output	●	●	●	●
Streaming	●	●	●	●
Tool use	●	●	●	●
Extended thinking	●	○	○	●
Prompt caching	●	○	○	●
Native JSON mode	○	●	●	○
Auth	API key	API key	OAuth2 · GCP	OAuth2 · GCP

Single binary: ship the compiler's output, not a virtualenv

Gollem compiles to a statically-linked binary. Cross-compile to any OS/arch from any OS/arch. No runtime. No interpreter. No shared library resolution at startup.

$ go build -o research-agent ./cmd/research
$ ls -lh research-agent
-rwxr-xr-x 1 user staff 14M Apr 17 17:45 research-agent
$ file research-agent
research-agent: Mach-O 64-bit executable arm64
$ otool -L research-agent
research-agent:
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0)
    /usr/lib/libresolv.9.dylib (compatibility version 1.0.0)
$ GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o research-agent-linux ./cmd/research
$ scp research-agent-linux prod:/usr/local/bin/
research-agent-linux 100% 14MB 8.2MB/s 00:01
✓ deployed.

Static: Zero core dependencies. Linked against libSystem on macOS, nothing on Linux with CGO_ENABLED=0.

Small: Typical agent binary: ~14 MB with Anthropic + OpenAI + Vertex + monty. Strips to ~10 MB.

Cross-compile: Any OS/arch → any OS/arch. Build on your laptop; deploy to Linux ARM servers, serverless, edge.

Observability: The binary ships its own trace exporter, OTLP middleware, and structured logger. No sidecar required.

Testing without ever calling a real model

TestModel is a deterministic mock. Canned responses, call recording, per-invocation assertions. Swap with WithTestModel or Override in tests without touching the production agent definition.

go
model := gollem.NewTestModel(
    gollem.ToolCallResponse("search", `{"query":"Go generics"}`),
    gollem.ToolCallResponse("final_result", `{"answer":"..."}`),
)

result, err := productionAgent.WithTestModel(model).Run(ctx, "prompt")

// Assert what the model saw.
calls := model.RecordedCalls()
assert.Len(t, calls, 2)
assert.Equal(t, "search", calls[0].ToolName)
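
The idea behind a deterministic mock is simple enough to sketch independently of gollem: replay canned responses in order and record every prompt for later assertions. The testModel and Response types below are hypothetical stand-ins, not the gollem.TestModel API.

```go
package main

import (
	"errors"
	"fmt"
)

// Response is a hypothetical canned model response.
type Response struct {
	ToolName string
	Args     string
}

// testModel replays canned responses in order and records each prompt.
type testModel struct {
	canned []Response
	calls  []string
}

// Request records the prompt and returns the next canned response, or an
// error once the script is exhausted so that extra calls fail loudly.
func (m *testModel) Request(prompt string) (Response, error) {
	m.calls = append(m.calls, prompt)
	if len(m.calls) > len(m.canned) {
		return Response{}, errors.New("test model: no canned response left")
	}
	return m.canned[len(m.calls)-1], nil
}

func main() {
	m := &testModel{canned: []Response{
		{ToolName: "search", Args: `{"query":"Go generics"}`},
		{ToolName: "final_result", Args: `{"answer":"..."}`},
	}}

	for _, prompt := range []string{"prompt", "tool result"} {
		r, err := m.Request(prompt)
		if err != nil {
			fmt.Println(err)
			return
		}
		fmt.Println(r.ToolName)
	}
	fmt.Println("recorded calls:", len(m.calls))
}
```

Returning an error on an unscripted call, rather than a zero value, makes an agent that loops one turn too many show up as a test failure.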

The rest lives in the Go reference. Every public type has a docstring; every extension package has an example; every feature is tested.