
How we built a Linear coding agent: the hard parts

By Ido Shamun
At daily.dev we log everything in Linear: bug reports, user feedback, feature tasks. And we already use coding agents locally. The gap was the manual glue between them: copying ticket context into an agent, spinning up worktrees across the right repos, feeding in the plan, opening PRs, tying them back to the ticket.

We built Huginn to close that gap. It's a coding agent that lives inside Linear. You delegate a ticket, it reads the context, figures out which repos need changes, proposes a plan, you approve (or don't), it executes, self-reviews, and opens PRs. The whole interaction happens through Linear's native Agent API, so there's no separate tool to switch to. You can also run it in autopilot mode for simple tasks, request follow-ups after the PR is opened, and it commits under your own GitHub identity so git blame stays useful.

The product side is the easy part to explain. This post is about the engineering: what broke, what surprised us, and the decisions we'd make differently.

Wrapping CLI agents as child processes

Huginn doesn't call LLM APIs directly. It spawns Claude Code and Codex as child processes, each with their own streaming format, session management, and authentication model.

This sounds straightforward until you actually try it.

Claude Code outputs JSON via --output-format=stream-json, but only if you also pass --verbose, and the flag has to come before --output-format. We learned this by shipping broken parsing to production. The JSON nests content under message.message.content (not message.content), which isn't documented anywhere we could find. Codex has its own streaming format, its own session model, and writes its last response to a temp file you have to read after the process exits.
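A minimal sketch of the line parser this implies, assuming the nesting we describe above (the `StreamEvent` shape and field names are our observations, not documented API):

```typescript
// Sketch: parse one line of Claude Code's stream-json output.
// The content nests under message.message.content, not message.content.
type StreamEvent = {
  type: string;
  message?: { message?: { content?: unknown } };
};

function extractContent(line: string): unknown {
  let event: StreamEvent;
  try {
    event = JSON.parse(line);
  } catch {
    return null; // non-JSON noise on stdout, skip it
  }
  // The double nesting is the part that bit us in production.
  return event.message?.message?.content ?? null;
}
```

Tolerating non-JSON lines matters: the stream interleaves diagnostics with events, and a parser that throws on the first bad line loses the rest of the run.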

Session management was worse. Claude Code uses --session-id to start a new session and --resume to continue one. If a user clicks "Try again" after a failure, Huginn generates the same deterministic session ID. But if the previous Claude process hasn't fully exited, --session-id fails with "already in use." The fix was a bidirectional fallback: try --session-id first, and if it collides, retry with --resume. And vice versa. A three-line fix in hindsight, but it took a production outage to find.
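The fallback is small enough to sketch in full. `run` here is a hypothetical wrapper that spawns the CLI with the given flags and rejects when the process exits non-zero:

```typescript
// Sketch of the bidirectional session fallback: try to start a fresh
// session, and if the ID is still held by a not-yet-exited process,
// resume it instead.
type RunFn = (flags: string[]) => Promise<string>;

async function runWithSession(run: RunFn, sessionId: string): Promise<string> {
  try {
    return await run(['--session-id', sessionId]); // fresh session
  } catch {
    // "already in use": a previous process holds this ID, so resume.
    return await run(['--resume', sessionId]);
  }
}
```

The mirror-image path (try `--resume`, fall back to `--session-id` when there is nothing to resume) follows the same shape.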

Each provider has its own universe of failure modes. Mixing provider-specific logic into workflow code would have been a death sentence, so we built an AgentRunner interface early. Claude and Codex each get their own runner with their own argument builders, output parsers, and session handling. The workflow engine only talks to the interface. When someone on the team wanted to try Codex, it was one new class and one factory method. The workflow code didn't change.
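A toy version of that shape, with illustrative names rather than Huginn's actual code:

```typescript
// Sketch of an AgentRunner-style abstraction: each provider hides its own
// flags, quirks, and parsing behind a common interface.
interface AgentRunner {
  buildArgs(prompt: string, sessionId?: string): string[];
  parseOutput(raw: string): { text: string };
}

class ClaudeRunner implements AgentRunner {
  buildArgs(prompt: string, sessionId?: string): string[] {
    // --verbose must precede --output-format for stream-json to work.
    const args = ['--verbose', '--output-format=stream-json', '-p', prompt];
    if (sessionId) args.push('--session-id', sessionId);
    return args;
  }
  parseOutput(raw: string): { text: string } {
    return { text: raw.trim() }; // real parsing is per-line JSON events
  }
}

function createRunner(provider: 'claude' | 'codex'): AgentRunner {
  if (provider === 'claude') return new ClaudeRunner();
  throw new Error(`no runner registered for ${provider}`);
}
```

The workflow engine only ever holds an `AgentRunner`, which is what made adding Codex a one-class change.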

That abstraction wasn't premature. It was self-defense.

The output parsing war

Getting structured data out of LLM responses was the hardest recurring problem. Not hard once, hard over and over. Here's a sample of what we hit, roughly in the order we hit it:

  • Repos discovered during planning were lost by execution time because we weren't persisting them in the issue description
  • Claude's result message overwrote the accumulated plan content, and our first fix for that caused duplicated output
  • The agent decided to put the REPOS: line at the bottom instead of the top, breaking our parser
  • The agent wrapped repo names in **bold** markdown, which the parser didn't strip
  • Codex sent MCP tool arguments as a JSON string instead of an object
  • Claude and Codex see different MCP tool name formats (huginn_planning_result vs mcp__huginn_workflow__huginn_planning_result)
  • The agent returned a plan containing only metadata headers but no actual plan content

Our first approach was text parsing: ask the agent to output REPOS: api, apps on the first line, regex it out. Reliability was around 80%.

We added an MCP server that exposes structured tools: huginn_planning_result accepts { plan, repos, step_descriptions } as typed parameters. MCP enforces structure, so when it works, it works well. But MCP tools are sometimes unavailable to the agent. We don't fully understand why. So the code chains three extraction strategies: MCP tool calls first, then text output parsing, then parsing the accumulated pre-result output as a last resort.
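The chaining itself is simple; the value is in each strategy failing cleanly so the next one gets a shot. A sketch, with illustrative types (the real strategies are richer):

```typescript
// Sketch of chained extraction: take the first strategy that produces
// a result, falling through on null.
type PlanResult = { plan: string; repos: string[] } | null;
type Ctx = { mcpCall?: PlanResult; text: string };
type Strategy = (ctx: Ctx) => PlanResult;

const fromMcp: Strategy = (ctx) => ctx.mcpCall ?? null;

const fromText: Strategy = (ctx) => {
  // Fall back to the REPOS: line convention from the text-parsing era.
  const m = ctx.text.match(/^REPOS:\s*(.+)$/m);
  if (!m) return null;
  return { plan: ctx.text, repos: m[1].split(',').map((s) => s.trim()) };
};

function extractPlan(ctx: Ctx): PlanResult {
  for (const strategy of [fromMcp, fromText]) {
    const result = strategy(ctx);
    if (result) return result;
  }
  return null; // caller decides whether to retry or surface an error
}
```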

The repo name parser itself handles markdown-escaped names, trailing punctuation, repos/ path prefixes, and unicode artifacts. Each normalization rule traces back to a specific production bug.
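The flavor of those rules, condensed (the exact normalization in Huginn differs, but each line below mirrors a failure mode listed above):

```typescript
// Sketch of repo-name normalization; every rule traces to a real bug.
function normalizeRepoName(raw: string): string {
  return raw
    .replace(/\*\*/g, '')     // agent wrapped names in **bold** markdown
    .replace(/^repos\//, '')  // agent prefixed names with repos/ paths
    .replace(/[.,;:]+$/, '')  // trailing punctuation from prose-style output
    .trim();
}
```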

This problem isn't solved. It's managed. Every few weeks we find a new way the agent formats output that our parser doesn't expect, and we add another case. If you're building an orchestrator on top of LLM tools, budget for this. It doesn't stop.

Session continuity and the CWD trap

Early on, every workflow stage started a fresh agent session. The agent re-read the codebase, re-read the ticket, rebuilt context from scratch. Token consumption was high and quality suffered because the agent forgot its own decisions between stages.

We added session continuity: a SQLite table maps (issueId, continuityType, provider) to a Claude/Codex session ID. Planning, implementation, code review, and PR creation all resume the same session. The agent remembers what it planned, which matters when it's reviewing its own code.
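The lookup logic amounts to a composite key and a get-or-create. A sketch with a `Map` standing in for the SQLite table:

```typescript
// Sketch of session continuity: (issueId, continuityType, provider)
// maps to a provider session ID; later stages resume the same session.
type Provider = 'claude' | 'codex';
const sessions = new Map<string, string>();

function sessionKey(issueId: string, continuityType: string, provider: Provider): string {
  return `${issueId}:${continuityType}:${provider}`;
}

function resumeOrCreate(
  issueId: string,
  continuityType: string,
  provider: Provider,
  newId: () => string,
): string {
  const key = sessionKey(issueId, continuityType, provider);
  const existing = sessions.get(key);
  if (existing) return existing; // planning, implementation, review all land here
  const id = newId();
  sessions.set(key, id);
  return id;
}
```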

This broke PR creation for a while, and the root cause was not obvious. Claude Code uses the working directory as part of its session lookup key. Implementation ran from /workspaces/ENG-485 (the workspace root), but PR creation initially ran from /workspaces/ENG-485/daily-api (the repo subdirectory). Different CWD meant Claude resolved a different session, and the mismatch caused silent failures. No error message, just a session that didn't have the context we expected.

The fix was to always use the workspace root as CWD for every stage and direct per-repo operations through the prompt instead ("create a PR from the daily-api/ directory"). A small constraint that took real debugging time to discover. If you're chaining multiple Claude Code invocations and relying on session persistence, keep the CWD identical across all of them.

State machine on Linear labels

Huginn's workflow has four states: IDLE → PLANNING → APPROVED → EXECUTING. We store state as Linear labels on the issue.

This wasn't our first approach. Initially we encoded state as JSON in Linear's plan field (a UI element meant for displaying progress steps). That worked until we discovered Linear archives agent threads, wiping the plan content and our state with it.

Labels turned out to be better in every way. They survive archival. They're visible in the issue list, so you can filter by huginn:executing to see what's running. Engineers can manually remove a label to reset state, or add huginn:approved to skip planning. The system reads labels on every webhook, so manual overrides work without special handling.
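Reading state off labels is a simple priority check. A sketch (the `huginn:planning` label name is our guess to round out the set; `huginn:approved` and `huginn:executing` are the ones named above):

```typescript
// Sketch: derive workflow state from whatever labels are on the issue.
// Manual label edits by engineers are picked up for free on the next webhook.
type State = 'IDLE' | 'PLANNING' | 'APPROVED' | 'EXECUTING';

function stateFromLabels(labels: string[]): State {
  if (labels.includes('huginn:executing')) return 'EXECUTING';
  if (labels.includes('huginn:approved')) return 'APPROVED';
  if (labels.includes('huginn:planning')) return 'PLANNING'; // assumed label name
  return 'IDLE';
}
```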

During execution, sub-stage labels track finer progress: huginn:stage:workspace-setup, huginn:stage:implementation, huginn:stage:code-review, huginn:stage:pr-creation. If the server crashes mid-execution, it reads these labels on restart and resumes from the right stage. The recovery system figures out the workspace path from the branch name and checks if it still exists on disk.
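Crash recovery then reduces to "find the furthest sub-stage label and resume there." A sketch, using the stage names listed above:

```typescript
// Sketch: on restart, read huginn:stage:* labels and resume from the
// furthest stage that was reached before the crash.
type Stage = 'workspace-setup' | 'implementation' | 'code-review' | 'pr-creation';
const STAGE_ORDER: Stage[] = ['workspace-setup', 'implementation', 'code-review', 'pr-creation'];
const PREFIX = 'huginn:stage:';

function resumeStage(labels: string[]): Stage | null {
  const reached = labels
    .filter((l) => l.startsWith(PREFIX))
    .map((l) => l.slice(PREFIX.length) as Stage)
    .filter((s) => STAGE_ORDER.includes(s));
  if (reached.length === 0) return null; // nothing in flight
  // Sort descending by pipeline position; the furthest stage wins.
  return reached.sort((a, b) => STAGE_ORDER.indexOf(b) - STAGE_ORDER.indexOf(a))[0];
}
```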

Using an external system's metadata as both user-facing status and crash-recovery state sounds hacky. In practice it's been one of the most reliable parts of the whole system.

What a 99% AI-generated codebase actually means

Most of Huginn's code was written by AI coding agents. A lot of it was written by Huginn itself. But that sentence is misleading without context, so here's what the workflow actually looks like.

Bootstrapping was human-driven. Picking TypeScript, Fastify, Drizzle, SQLite, the state-machine-on-labels architecture, the provider abstraction pattern: those were human decisions. The agents wrote the code, but a human was steering.

After the foundation was down, the loop became: write a spec with the user flow, edge cases, and high-impact technical decisions. Have the agent write tests first. Review the test cases (that's where most of our review time goes). Let the agent implement. Ship. Test manually. Fix what breaks.

We don't review generated code line by line. We review specs and tests.

What makes this possible without everything falling apart is DTU: Digital Twin Universe. For every external service Huginn depends on, we built a behavioral replica that implements the real API interface but runs in-memory.

Our Linear DTU is a Go server that implements Linear's GraphQL API. Tests use the real @linear/sdk against this fake backend. They can seed issues, trigger webhook deliveries, verify that labels were set, check that activities were emitted. The GitHub DTU handles repo creation, branch management, and PR creation. The KMS DTU does envelope encryption with in-memory keys.

Each DTU runs in a Docker container. Integration tests use Testcontainers to spin up the full stack: Huginn plus all its dependencies. A test scenario seeds an issue in the Linear twin, triggers a webhook, and asserts that Huginn produced the right labels, activities, and PR.
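The shape of such a scenario, with a toy in-memory twin standing in for the Dockerized Linear DTU (the real setup goes through `@linear/sdk` against the Go server; class and method names here are illustrative):

```typescript
// Sketch of a DTU-style test scenario: seed state in the twin, simulate
// the webhook-driven work, assert on the side effects Huginn should leave.
class FakeLinearDTU {
  private labels = new Map<string, Set<string>>();

  seedIssue(id: string): void {
    this.labels.set(id, new Set());
  }
  addLabel(id: string, label: string): void {
    this.labels.get(id)?.add(label);
  }
  hasLabel(id: string, label: string): boolean {
    return this.labels.get(id)?.has(label) ?? false;
  }
}

const dtu = new FakeLinearDTU();
dtu.seedIssue('ENG-485');
// Stand-in for "webhook fired, orchestrator ran": the handler's observable
// output is the label it sets on the issue.
dtu.addLabel('ENG-485', 'huginn:executing');
```

The point of the pattern is that assertions run against observable state in the twin, not against mocks of individual calls, so tests survive refactors of the orchestration internals.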

Huginn also debugs its own production issues. It can SSH into the VM, read logs, correlate errors, and write fixes. Most post-deployment bugs get resolved this way. The AGENTS.md file accumulates learnings from every incident, so the agents get better at following project conventions over time.

This only works because the test infrastructure is solid. If the DTUs didn't exist, AI-generated code at this ratio would be reckless.

What's still janky

Output parsing, as mentioned, is a permanent arms race. We'll never be done with it.

The BYOK credential system has rough edges. OAuth tokens require copying ~/.claude to a temp directory per user, and each temp HOME ended up duplicating package manager caches (.npm, .cache, .nvm). Across five users, this ate about 25GB on the VM before we caught it. The fix was symlinks and cron jobs, but the whole per-user HOME approach feels like it needs a rethink.

Tight iterative work is still hard. If a task needs the kind of back-and-forth you'd have pairing with another developer for an hour, Huginn isn't the right tool. A local agent or iterating on specs before creating the ticket works better.

Large tasks that span many files and need nuanced architectural judgment hit a ceiling. The planning phase helps, but there's only so much you can communicate through a ticket description.

The moment it clicked

The first time a ticket went from "filed in Linear" to "PR opened" without anyone opening an IDE or cloning a repo, that was it. The whole workflow visible in Linear's agent UI: thoughts appearing, plans being generated, code being written, PRs being opened.

Our bug and user-feedback backlog shrank dramatically after that. Not because the agent is better than a human at fixing bugs, but because the activation energy dropped to zero. File a ticket, delegate, move on.

That's Huginn. Named after Odin's raven, because we have a Norse mythology naming convention and we weren't about to break it.
