Why is a single loop insufficient for multi-step tasks?

A single loop works for one-shot Q&A, but multi-step tasks need phased execution (explore → plan → execute → verify), approval gates, error retries, and state persistence. State machines solve these by separating conversation from structure.

What is the core design of the Node trait?

A Node is one phase: it receives mutable state and returns a NextStep (Goto to name the next node, Halt to stop, or Interrupt to pause and wait). State is passed explicitly between nodes; there is no message bus and no global variable.

How does SQLite checkpointing work?

After each node transition, state is saved to a SQLite table (run_id, next_node, state_json). After a crash, the latest checkpoint is loaded and execution resumes from the next node. Halt deletes the checkpoint so the table does not grow without bound.

What is the Interrupt suspend/resume mechanism?

When a node returns Interrupt { reason }, the runner serializes state and returns Outcome::Interrupted. The caller (CLI or web app) handles the reason, and after user confirmation calls run_with_checkpoint again to resume.

How does PermissionMode control permissions?

Four levels: Plan (read-only), Default (destructive ops require confirmation), AcceptEdits (pre-approved), BypassPermissions (no gates). Nodes declare required permissions; hooks check if the current mode satisfies them.

Building AI Agents in Rust (Part 4): State Machines for Multi-Step Tasks

Why the loop is not enough

The Part 1 loop handles questions that fit in one breath: read this file, list that directory, answer the user. Two turns, three turns, done. As long as the model can plan and execute inside one conversation, the loop is enough.

Real work is messier. Some tasks have phases: explore, then plan, then execute, then verify. Some phases have approval gates: do not write to the database until a human signs off. Some steps fail in retryable ways: a transient network blip, a rate-limit, a momentary deadlock that clears on a second try. Some tasks need to survive a process restart: the user ran the agent overnight on a research crawl, the machine rebooted, and they want the next morning to pick up where it left off rather than starting over.

The loop cannot do any of this gracefully. The shape that can is a state machine: typed state shared between nodes, named transitions between them, durable checkpoints after each step, and a way for any node to pause and surface a reason. The Claude Code CLI’s EnterPlanMode and ExitPlanMode tools are this idea in production: a state mode the agent enters and exits, with the implementation phase gated by user approval. LangGraph, a widely used Python framework for agent graphs, is this idea wearing a different coat. Both reduce to the same core.

This post writes it in Rust. Eugene v0.4 introduces eugene-state: a Node<S> trait, a Graph<S> that walks nodes, a Checkpointer<S> with a SQLite implementation, and a typed interrupt that suspends the graph until the caller resumes it. Three nodes, one approval gate, durable to a crash. The Part 1 loop is still in there. It just lives inside one of the nodes now.

The node trait

A node is one phase. It takes typed state, mutates it, and tells the runner what to do next:

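use async_trait::async_trait;

#[async_trait]
trait Node<S>: Send + Sync + 'static
where
    S: Send + Sync + 'static,
{
    // Stable name, used as the key for Goto transitions and checkpoints.
    fn name(&self) -> &'static str;
    // Runs one phase against the shared state and says where to go next.
    async fn run(&self, state: &mut S) -> Result<NextStep, GraphError>;
}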

enum NextStep {
    Goto(&'static str),
    Halt,
    Interrupt { reason: String },
}

Goto names the next node. Halt ends the run. Interrupt pauses, waiting for external confirmation. There is no implicit order: nodes only jump explicitly via Goto. This looks more verbose than implicit sequencing, but it is what makes the graph maintainable—every transition is explicit, searchable in source, and visualizable at runtime.

Figure: Rust AI Agent state machine workflow: Draft → Review → [Interrupt] → Revise → Halt, with checkpoints persisted to SQLite after each transition.

The state S is the one value the graph threads through every node. Each phase reads what it needs and writes what it produces. The next phase finds its inputs where the previous phase put them. There is no message passing between nodes, no event bus, no global variable. Mutable state, explicit handoffs.

The graph runner

The runner is short. It holds nodes by name, an entry node, and a max-steps cap so a buggy Goto does not loop forever:

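use std::collections::HashMap;

struct Graph<S> {
    nodes: HashMap<&'static str, Box<dyn Node<S>>>, // nodes keyed by Node::name()
    entry: &'static str,                            // where a fresh run starts
    max_steps: usize,                               // hard cap on transitions per call
}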

impl<S: Send + Sync + 'static> Graph<S> {
    async fn run_with_checkpoint<C: Checkpointer<S>>(
        &self,
        run_id: &str,
        initial: S,
        ckpt: &C,
    ) -> Result<Outcome<S>, GraphError> {
        let (mut state, mut current) = match ckpt.load(run_id).await? {
            Some(ck) => (ck.state, ck.next_node),
            None => (initial, self.entry),
        };

        for _ in 0..self.max_steps {
            let node = self.nodes.get(current)
                .ok_or(GraphError::UnknownNode(current))?;
            let next = node.run(&mut state).await?;

            match next {
                NextStep::Goto(name) => {
                    ckpt.save(run_id, name, &state).await?;
                    current = name;
                }
                NextStep::Halt => {
                    ckpt.delete(run_id).await?;
                    return Ok(Outcome::Success(state));
                }
                NextStep::Interrupt { reason } => {
                    ckpt.save(run_id, current, &state).await?;
                    return Ok(Outcome::Interrupted { state, reason });
                }
            }
        }

        Err(GraphError::MaxStepsExceeded)
    }
}

The persistence cadence is the load-bearing part. The runner saves after every transition. A crash between two nodes loses zero work; a crash inside a node loses at most one node’s worth of computation. The next run loads the most recent checkpoint and resumes from the node the previous run was about to enter.

delete on Halt is small but important: it cleans up after success so a future run with the same run_id starts fresh. Without it the checkpoint table grows forever. With it the table only holds in-flight or paused runs.
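The save-after-every-transition cadence is easier to see in a stripped-down form. The sketch below is not the eugene-state runner: it is synchronous, uses only the standard library, and swaps the Checkpointer for a hypothetical in-memory MemStore. The node functions, the CRASH_ONCE flag and the demo helper are made up for illustration. It shows a run that fails in its second node and a second call that resumes there without repeating the first node.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};

// Simplified stand-ins for the article's types (assumptions, not the real API).
pub enum NextStep {
    Goto(&'static str),
    Halt,
}

// A node here is a plain function over a Vec<String> state; the real trait is async.
pub type NodeFn = fn(&mut Vec<String>) -> Result<NextStep, String>;

// Hypothetical in-memory checkpoint store: run_id -> (next_node, state).
#[derive(Default)]
pub struct MemStore {
    rows: HashMap<String, (&'static str, Vec<String>)>,
}

pub fn run(
    nodes: &HashMap<&'static str, NodeFn>,
    entry: &'static str,
    run_id: &str,
    store: &mut MemStore,
) -> Result<Vec<String>, String> {
    // Resume from the checkpoint if one exists, otherwise start at the entry node.
    let (mut current, mut state) = match store.rows.get(run_id) {
        Some((next, saved)) => (*next, saved.clone()),
        None => (entry, Vec::new()),
    };
    for _ in 0..16 {
        let node = nodes
            .get(current)
            .ok_or_else(|| format!("unknown node {}", current))?;
        match node(&mut state)? {
            NextStep::Goto(name) => {
                // Save after the transition: a later failure resumes at `name`.
                store.rows.insert(run_id.to_string(), (name, state.clone()));
                current = name;
            }
            NextStep::Halt => {
                // Success: drop the checkpoint so the run_id can be reused.
                store.rows.remove(run_id);
                return Ok(state);
            }
        }
    }
    Err("max steps exceeded".to_string())
}

// Simulates a crash the first time the second node runs.
static CRASH_ONCE: AtomicBool = AtomicBool::new(true);

fn fetch(state: &mut Vec<String>) -> Result<NextStep, String> {
    state.push("fetched".to_string());
    Ok(NextStep::Goto("summarise"))
}

fn summarise(state: &mut Vec<String>) -> Result<NextStep, String> {
    if CRASH_ONCE.swap(false, Ordering::SeqCst) {
        return Err("simulated crash".to_string());
    }
    state.push("summarised".to_string());
    Ok(NextStep::Halt)
}

// Runs the same run_id twice; returns both results and whether the store ended empty.
// Only the first call to demo() sees the simulated crash.
pub fn demo() -> (Result<Vec<String>, String>, Result<Vec<String>, String>, bool) {
    let mut nodes: HashMap<&'static str, NodeFn> = HashMap::new();
    nodes.insert("fetch", fetch);
    nodes.insert("summarise", summarise);
    let mut store = MemStore::default();
    let first = run(&nodes, "fetch", "run-1", &mut store);
    let second = run(&nodes, "fetch", "run-1", &mut store);
    (first, second, store.rows.is_empty())
}
```

Calling demo() once gives a failed first run and a successful second one. The second run's state contains "fetched" only once, because the checkpoint written after the Goto already carried it.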

Interrupts are just halts with a reason

The Interrupt variant is the human-in-the-loop primitive. A node decides it cannot proceed without acknowledgement, returns Interrupt { reason }, and the runner serialises the state, writes a checkpoint, and returns Outcome::Interrupted to the caller.

What the caller does next is up to the agent shell. In an interactive CLI, it prints the reason, waits for y/n from stdin, and resumes by calling run_with_checkpoint again with the same run_id. In a web app, it surfaces a notification, lets a human approve in a UI, and resumes via an HTTP endpoint. In an automated pipeline, it might queue the interrupt for review and resume hours later. The graph does not care. The checkpoint is durable; the resume is one function call.

The Part 4 gist demonstrates this with stdin:

Outcome::Interrupted { state, reason, .. } => {
    println!("\n=== Draft ===\n\n{}", state.draft.as_deref().unwrap_or(""));
    println!("\n=== Editor's notes ===\n\n{}",
             state.critique.as_deref().unwrap_or(""));
    if prompt_yes_no(&reason) {
        let outcome = graph
            .run_with_checkpoint(RUN_ID, state, &ckpt)
            .await?;
        // ... print final answer ...
    } else {
        eprintln!("[abort] Checkpoint kept; rerun to resume.");
    }
}

If the user answers no, the gist exits without deleting the checkpoint. A second run of the same command loads the checkpoint, finds the draft and critique already in place, and resumes at revise. Cancelling the user’s review and coming back to it tomorrow costs nothing.
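The snippet calls prompt_yes_no without showing it. A minimal sketch of what such a helper might look like follows, assuming it simply reads one line from stdin; the split into parse_yes_no and prompt_yes_no_from is an illustrative choice so the logic can be exercised without a terminal, not something taken from the gist.

```rust
use std::io::{self, BufRead, Write};

// Interpret one line of user input; anything other than y/yes counts as "no".
pub fn parse_yes_no(line: &str) -> bool {
    matches!(line.trim().to_ascii_lowercase().as_str(), "y" | "yes")
}

// Hypothetical helper: print the interrupt reason, then read one line from `input`.
pub fn prompt_yes_no_from<R: BufRead>(reason: &str, mut input: R) -> bool {
    print!("{} [y/N] ", reason);
    // Flush so the prompt is visible before we block on input.
    let _ = io::stdout().flush();
    let mut line = String::new();
    match input.read_line(&mut line) {
        Ok(n) if n > 0 => parse_yes_no(&line),
        // EOF or a read error is treated as "no", which leaves the checkpoint alone.
        _ => false,
    }
}

// Convenience wrapper over stdin, matching the call shape used in the snippet above.
pub fn prompt_yes_no(reason: &str) -> bool {
    let stdin = io::stdin();
    let answer = prompt_yes_no_from(reason, stdin.lock());
    answer
}
```

Defaulting to "no" on empty input or EOF is the conservative choice here: a closed terminal never authorises the rewrite, and the run can still be resumed later.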

The plan mode lineage

This is the same shape Claude Code uses for its plan mode. When the user invokes EnterPlanMode, the CLI enters a state where every tool is restricted to read-only operations and the agent’s job is to produce an implementation plan. The user reviews the plan; if they approve, ExitPlanMode flips the state back and the implementation tools become callable. The state lives in PermissionMode. The interrupt is implicit: the agent stops calling tools when it has a coherent plan and waits for the user.

The mechanism here is more general. Any node can return Interrupt, not just a designated “plan” node. The reason is free-form text, so the caller can format it however the UI needs. The state is fully serialised at the point of pause, so the resume does not require the same process or even the same machine.

SQLite as the default store

Checkpointer<S> is a trait. Implementations decide where the state lives. The default InMemoryCheckpointer<S> in the workspace crate is a Mutex<HashMap<String, Checkpoint<S>>>, suitable for tests and for short-lived runs. The SqliteCheckpointer is the one a real agent wants:

const SCHEMA: &str = r#"
CREATE TABLE IF NOT EXISTS checkpoints (
    run_id     TEXT PRIMARY KEY,
    next_node  TEXT NOT NULL,
    state_json TEXT NOT NULL,
    updated_at INTEGER NOT NULL
);
"#;

One table. One row per in-flight run, keyed by run_id. The state is stored as JSON in one column, which means the table is generic over any S that implements Serialize and DeserializeOwned. Adding a field to S needs no schema migration: put #[serde(default)] on the new field and checkpoints written before the change still deserialize.

SQLite in Rust goes through rusqlite, with writes wrapped in tokio::task::spawn_blocking because rusqlite is synchronous. It is quick to compile, and the work it does (one write per node transition, one SELECT on resume) is small enough that the blocking-pool round trip is negligible. rusqlite is the default SQLite backend in eugene-state; the user can swap in their own Checkpointer<S> impl for Postgres, Redis, or whatever fits their deployment.

The same table works for many concurrent runs. The run_id keys them apart. A worker pool can have a hundred graphs in flight, each writing to the same table; SQLite serialises writers with a database-level lock rather than row-level locks, and because each write touches one small row the contention stays short. A long research agent that checkpoints after a thousand read_file steps rewrites its single row a thousand times, then the final Halt deletes it. SQLite handles this without complaint.
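The one-row-per-run contract is the same for the in-memory store as for the SQLite table. Here is a rough, synchronous sketch of the Mutex<HashMap<String, Checkpoint<S>>> shape described earlier. The method names mirror the save, load and delete calls the runner makes, but the signatures are assumptions (the real trait is async and returns Result), and len is added only for the demonstration.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone, Debug, PartialEq)]
pub struct Checkpoint<S> {
    pub next_node: String,
    pub state: S,
}

// Synchronous sketch; the article's Checkpointer<S> trait is async.
pub struct InMemoryCheckpointer<S> {
    rows: Mutex<HashMap<String, Checkpoint<S>>>,
}

impl<S: Clone> InMemoryCheckpointer<S> {
    pub fn new() -> Self {
        Self { rows: Mutex::new(HashMap::new()) }
    }

    // Upsert: a later save for the same run_id replaces the earlier row,
    // which is what PRIMARY KEY (run_id) implies for the SQLite table.
    pub fn save(&self, run_id: &str, next_node: &str, state: &S) {
        let ck = Checkpoint { next_node: next_node.to_string(), state: state.clone() };
        self.rows.lock().unwrap().insert(run_id.to_string(), ck);
    }

    // Returns a copy of the latest checkpoint for this run, if any.
    pub fn load(&self, run_id: &str) -> Option<Checkpoint<S>> {
        self.rows.lock().unwrap().get(run_id).cloned()
    }

    // Called on Halt so finished runs do not accumulate.
    pub fn delete(&self, run_id: &str) {
        self.rows.lock().unwrap().remove(run_id);
    }

    // Number of in-flight or paused runs (illustrative helper).
    pub fn len(&self) -> usize {
        self.rows.lock().unwrap().len()
    }
}
```

Saving "run-a" twice and "run-b" once leaves two rows, not three. Loading "run-a" returns the second save.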

Retries belong in the graph, not the node

Transient errors are the third leg of state-machine durability. A 429 Too Many Requests from the model. A connection reset on a third-party API the agent is calling. A momentary unable-to-write on the filesystem. Each of these is worth retrying, but in different places.

The Part 3 with_retry helper wraps an individual HTTP call. That is the right scope when one node makes one external request. The graph layer adds a second scope: when a node fails with GraphError::Transient, the graph re-runs the whole node up to retry_attempts times before propagating the error.

let mut graph = Graph::<State>::new("draft").retry_attempts(3);

The two scopes do different jobs. The HTTP retry handles the kind of failure where you should not give up on the work you have already done in the function. The graph retry handles the kind where the entire phase needs to start over: maybe the model produced an invalid JSON plan that fails to parse, and you want to ask again with a fresh context. Picking the right scope means thinking about what state survives between attempts. Retries inside the function preserve the function’s local variables. Retries at the graph level start the node fresh.
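A graph-level retry can be sketched in a few lines. This is an illustration under assumptions, not the eugene-state implementation: it is synchronous, the GraphError enum is reduced to two hypothetical variants, and restoring a snapshot of the state before each attempt is one reading of "start the node fresh".

```rust
#[derive(Debug, PartialEq)]
pub enum GraphError {
    Transient(String),
    Fatal(String),
}

// Run `node`; on a Transient error, re-run it up to `retry_attempts` more times.
// Each attempt starts from the state as it was when the node was first entered.
pub fn run_with_retry<S: Clone, F>(
    state: &mut S,
    retry_attempts: usize,
    mut node: F,
) -> Result<(), GraphError>
where
    F: FnMut(&mut S) -> Result<(), GraphError>,
{
    let snapshot = state.clone();
    let mut retries_used = 0;
    loop {
        // Discard whatever a failed attempt wrote before trying again.
        *state = snapshot.clone();
        match node(state) {
            Ok(()) => return Ok(()),
            Err(GraphError::Transient(msg)) if retries_used < retry_attempts => {
                retries_used += 1;
                eprintln!("[retry {}/{}] {}", retries_used, retry_attempts, msg);
            }
            // Fatal errors, or transient ones after the budget is spent, propagate.
            Err(e) => return Err(e),
        }
    }
}
```

The snapshot is what separates this from an in-function retry: nothing a failed attempt wrote into the state leaks into the next one. A real runner would probably also wait between attempts; the sketch leaves backoff out.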

A three-node graph in practice

The Part 4 gist demonstrates the smallest interesting graph: draft, review, revise. Each node makes one Claude call. The interrupt sits between review and revise so the user sees the draft and the editor’s notes before authorising a rewrite.

struct Draft { http: reqwest::Client, api_key: String }

#[async_trait]
impl Node<State> for Draft {
    fn name(&self) -> &'static str { "draft" }
    async fn run(&self, s: &mut State) -> Result<NextStep, GraphError> {
        let system = "You are Eugene, a careful technical writer. \
                      Write a single short answer to the user's question. \
                      Plain prose, no bullet lists, fewer than 200 words.";
        let answer = call_claude(&self.http, &self.api_key, system,
                                 &format!("Write a draft answer to: {}", s.request)).await?;
        s.draft = Some(answer);
        Ok(NextStep::Goto("review"))
    }
}

struct Review { http: reqwest::Client, api_key: String }

#[async_trait]
impl Node<State> for Review {
    fn name(&self) -> &'static str { "review" }
    async fn run(&self, s: &mut State) -> Result<NextStep, GraphError> {
        let system = "You are a sharp editor. Read the draft and list the \
                      three biggest weaknesses in one sentence each.";
        let critique = call_claude(&self.http, &self.api_key, system,
                                   s.draft.as_deref().unwrap_or("")).await?;
        s.critique = Some(critique);
        Ok(NextStep::Interrupt {
            reason: "Draft and critique ready. Approve revision?".into(),
        })
    }
}

struct Revise { http: reqwest::Client, api_key: String }

#[async_trait]
impl Node<State> for Revise {
    fn name(&self) -> &'static str { "revise" }
    async fn run(&self, s: &mut State) -> Result<NextStep, GraphError> {
        let system = "You are Eugene. Rewrite the draft using the editor's \
                      notes. Keep the same tone. Plain prose, fewer than 200 words.";
        let improved = call_claude(&self.http, &self.api_key, system,
                                   &format!("Draft:\n{}\n\nNotes:\n{}",
                                           s.draft.as_deref().unwrap_or(""),
                                           s.critique.as_deref().unwrap_or(""))).await?;
        s.draft = Some(improved);
        Ok(NextStep::Halt)
    }
}

The runner pauses after review. If the user reads the draft and the critique and answers yes, the runner enters revise and then halts. If the user answers no, the program exits and leaves the checkpoint in place. Run the same command again and it walks straight to the interrupt without recomputing the draft or the critique. The checkpoint survives a SIGINT, a power cut, an angry kill -9.
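The State type the three nodes share is not shown here. Judging only from the field accesses in the listings (s.request, s.draft, s.critique), it might look roughly like the sketch below. The gist's real struct presumably also derives serde's Serialize and Deserialize so it can be checkpointed; that is left out to keep the example dependency-free, and the new and revise_prompt helpers are illustrative additions.

```rust
// Inferred from how the nodes use it; not copied from the gist.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct State {
    pub request: String,          // set by the caller before the run
    pub draft: Option<String>,    // written by Draft, rewritten by Revise
    pub critique: Option<String>, // written by Review
}

impl State {
    pub fn new(request: impl Into<String>) -> Self {
        State { request: request.into(), draft: None, critique: None }
    }

    // The user message the Revise node builds: it reads exactly what the two
    // earlier nodes left behind, and falls back to "" if a field is missing.
    pub fn revise_prompt(&self) -> String {
        format!(
            "Draft:\n{}\n\nNotes:\n{}",
            self.draft.as_deref().unwrap_or(""),
            self.critique.as_deref().unwrap_or("")
        )
    }
}
```

The Option fields are what make the handoff explicit: a node can tell from the state alone whether its inputs have been produced yet.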

Visual honesty: a graph is more honest than code

Much of the value of a state machine shows up not at runtime but in readability. A picture tells you the structure of a task without reading a single line of Rust: draft → review → [interrupt] → revise → halt. This picture exists for two readers: you, planning the next iteration, and the model, when you eventually want to ask it to extend the graph by adding a verification node. A clear graph is a thing the model can read and modify, and that is what makes a state machine maintainable.

Hooks generalise the interrupt

The Part 4 gist gates the revise step by hand: the review node returns NextStep::Interrupt and the main function reads stdin to decide whether to resume. That works when one node is special. For agents with many destructive nodes and a desire to apply the same gating rule to all of them, hard-coding the interrupt in each node becomes tedious and easy to forget.

Claude Code solves this with hooks. Its tool executor runs a chain of pre-tool and post-tool hooks around every dispatch. The hooks check permission rules, run user-defined shell commands tied to lifecycle events, log analytics, and modify the tool’s input or output. A new tool inherits the hook chain automatically; the policy lives in one place.

The same pattern ports to a state machine. eugene-state defines a NodeHook<S> trait with before and after methods:

#[async_trait]
pub trait NodeHook<S>: Send + Sync + 'static {
    async fn before(
        &self,
        node_name: &str,
        required: PermissionMode,
        state: &S,
    ) -> Result<HookOutcome, GraphError> {
        Ok(HookOutcome::Proceed)
    }

    async fn after(
        &self,
        node_name: &str,
        state: &S,
        result: NextStep,
    ) -> Result<NextStep, GraphError> {
        Ok(result)
    }
}

enum HookOutcome {
    Proceed,
    Modify(Value),
    Deny(String),
    Replace(String),
}

The before hook can reject, modify, or replace the node’s input. The after hook can rewrite the node’s next step, which is how shared post-processing nodes get stitched onto graphs without editing the nodes themselves.
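To make the chain concrete, here is a synchronous, trimmed-down sketch of how a runner might drive such hooks. It is an assumption-heavy illustration: the trait drops the state, permission and error arguments, HookOutcome keeps only Proceed and Deny, and the DenyList and AuditOnHalt hooks, along with the "audit" and "drop_tables" node names, are hypothetical.

```rust
#[derive(Clone, Debug, PartialEq)]
pub enum NextStep {
    Goto(&'static str),
    Halt,
}

#[derive(Debug, PartialEq)]
pub enum HookOutcome {
    Proceed,
    Deny(String),
}

// Reduced hook trait: both methods default to "do nothing".
pub trait NodeHook {
    fn before(&self, _node_name: &str) -> HookOutcome {
        HookOutcome::Proceed
    }
    fn after(&self, _node_name: &str, result: NextStep) -> NextStep {
        result
    }
}

// Hypothetical pre-hook: refuse to enter any node on a deny list.
pub struct DenyList(pub Vec<&'static str>);

impl NodeHook for DenyList {
    fn before(&self, node_name: &str) -> HookOutcome {
        if self.0.iter().any(|n| *n == node_name) {
            HookOutcome::Deny(format!("node '{}' is blocked", node_name))
        } else {
            HookOutcome::Proceed
        }
    }
}

// Hypothetical post-hook: route every Halt through a shared "audit" node first.
pub struct AuditOnHalt;

impl NodeHook for AuditOnHalt {
    fn after(&self, node_name: &str, result: NextStep) -> NextStep {
        if result == NextStep::Halt && node_name != "audit" {
            NextStep::Goto("audit")
        } else {
            result
        }
    }
}

// Run every `before` hook in order; the first Deny wins.
pub fn check_before(hooks: &[Box<dyn NodeHook>], node_name: &str) -> HookOutcome {
    for hook in hooks {
        if let HookOutcome::Deny(why) = hook.before(node_name) {
            return HookOutcome::Deny(why);
        }
    }
    HookOutcome::Proceed
}

// Thread the node's result through every `after` hook in order.
pub fn apply_after(hooks: &[Box<dyn NodeHook>], node_name: &str, result: NextStep) -> NextStep {
    hooks.iter().fold(result, |step, hook| hook.after(node_name, step))
}
```

With both hooks registered, a Halt from revise becomes Goto("audit"), while a Halt from audit itself passes through unchanged. Neither node needs to know the audit step exists.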

Permission modes

The headline use of hooks is permission gating. Claude Code’s ToolPermissionContext carries a PermissionMode taken from a small enum: plan (read-only), default (ask before destructive), acceptEdits (pre-approved), bypassPermissions (no gates). Each tool declares a minimum mode; the executor checks the current mode and either runs the tool, asks the user, or refuses.

The Rust equivalent in eugene-state is one enum, one method on Node, and one built-in hook:

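#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PermissionMode {
    Plan,
    Default,
    AcceptEdits,
    BypassPermissions,
}

impl PermissionMode {
    // Plan < Default < AcceptEdits < BypassPermissions
    pub fn level(self) -> u8 {
        match self {
            PermissionMode::Plan => 0,
            PermissionMode::Default => 1,
            PermissionMode::AcceptEdits => 2,
            PermissionMode::BypassPermissions => 3,
        }
    }

    // A mode permits a node when it is at least as permissive as the node requires.
    pub fn permits(self, required: PermissionMode) -> bool {
        self.level() >= required.level()
    }
}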

#[async_trait]
pub trait Node<S> {
    fn name(&self) -> &'static str;
    fn required_mode(&self) -> PermissionMode { PermissionMode::Plan }
    async fn run(&self, state: &mut S) -> Result<NextStep, GraphError>;
}

The ModeGate hook checks required_mode in before:

#[async_trait]
impl<S: Send + Sync + 'static> NodeHook<S> for ModeGate {
    // Deny any node whose required mode exceeds the gate's current mode.
    async fn before(&self, node_name: &str, required: PermissionMode, _state: &S)
        -> Result<HookOutcome, GraphError> {
        if self.current_mode.permits(required) {
            Ok(HookOutcome::Proceed)
        } else {
            Ok(HookOutcome::Deny(format!(
                "Node '{}' requires {:?} mode; current mode is {:?}.",
                node_name, required, self.current_mode
            )))
        }
    }
}

Permission modes can be switched dynamically. Plan mode forbids all writes and forces the agent to produce a plan instead of executing. AcceptEdits mode lets a reviewed agent run unattended. BypassPermissions is for testing and automated pipelines. Changing modes does not require touching any node; it is a method call on the Graph that mutates the state held by the ModeGate hook.
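Switching modes at runtime means the gate's current mode has to sit behind some form of interior mutability. The sketch below assumes a Mutex and hypothetical set_mode and check methods. It is synchronous and standalone, so it shows the idea rather than the eugene-state API.

```rust
use std::sync::Mutex;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PermissionMode {
    Plan,
    Default,
    AcceptEdits,
    BypassPermissions,
}

impl PermissionMode {
    pub fn level(self) -> u8 {
        match self {
            PermissionMode::Plan => 0,
            PermissionMode::Default => 1,
            PermissionMode::AcceptEdits => 2,
            PermissionMode::BypassPermissions => 3,
        }
    }

    pub fn permits(self, required: PermissionMode) -> bool {
        self.level() >= required.level()
    }
}

// Hypothetical gate: the mode lives behind a Mutex so it can change between nodes.
pub struct ModeGate {
    current: Mutex<PermissionMode>,
}

impl ModeGate {
    pub fn new(mode: PermissionMode) -> Self {
        ModeGate { current: Mutex::new(mode) }
    }

    // Switch modes without touching any node.
    pub fn set_mode(&self, mode: PermissionMode) {
        *self.current.lock().unwrap() = mode;
    }

    // Ok if the current mode satisfies what the node requires, else a denial message.
    pub fn check(&self, node_name: &str, required: PermissionMode) -> Result<(), String> {
        let current = *self.current.lock().unwrap();
        if current.permits(required) {
            Ok(())
        } else {
            Err(format!(
                "Node '{}' requires {:?} mode; current mode is {:?}.",
                node_name, required, current
            ))
        }
    }
}
```

A gate created in Plan mode rejects a node that requires AcceptEdits. After set_mode(PermissionMode::AcceptEdits) the same check passes, and no node code changes.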

What this reveals

The Part 1 loop, the Part 2 prompt, the Part 3 registry: each of those is a tool a node can use. The graph is what wraps them. When you have one phase to run, a node is overkill. When you have three, the graph is the difference between code you trust and code you debug.

Three properties are what the graph buys. Resumability: a crash loses one node, not the whole task. Gates: a node can suspend and wait for a human without holding a process open. Composability: phases can be added or reordered without touching the runner, because nodes only interact through the typed state and the named transitions.

The trait shape is the standard Rust toolkit: async_trait for the async method, a generic state type S carried by the graph, object-safe dispatch through Box<dyn Node<S>>, serde-driven persistence through the checkpoint trait, and tokio::task::spawn_blocking to keep the SQLite write off the async runtime. None of this is exotic. It is what production Rust looks like when you assemble it with care.

What comes next

A graph runs one phase at a time, in one process. The next problem is parallelism across phases: a researcher and a writer who should both look at a topic at the same time, then converge; a planner and a critic that argue until they reach consensus; a fleet of identical investigators each working on a different question. Part 5 introduces multi-agent crews: a router that dispatches to specialists, a debate protocol where two agents argue, and tokio-native parallel execution where the orchestration is join_all instead of bespoke message buses. The Part 3 registry and the Part 4 graph both come along.

The workspace

The polished version of the trait, the graph, and the checkpoint stores lives in the workspace as the eugene-state crate. Twelve unit tests cover the runner (halt, interrupt, retry, max-steps, unknown-node), the in-memory checkpointer, the SQLite checkpointer (round-trip and reopen), the permission-mode arithmetic, and the ModeGate / post-hook chain that rewrites next-steps. The SQLite backend is behind the sqlite feature so projects that do not want the dependency can omit it. See eugene/crates/eugene-state.
