
We built a 29K-line org AI agent in 4 days with Codex and spent weeks fixing what broke. The agent kept finding ways to leak credentials in a shared runtime; our command sanitizer grew until it blocked us from writing this blog post. A silent event-loop hang taught us systemd can't detect a process that's alive but useless. We tamed a huge system prompt by hiding 42 tools behind on-demand bundles. All 25 of Smith's reusable skills were self-authored during real conversations. Our data analyst still crashes it weekly and we don't know why.
Smith is an AI agent that runs inside our Slack workspace at daily.dev. You message it, it does things: queries BigQuery, runs SQL against ClickHouse and Postgres, moderates content, generates sales reports, automates browser sessions, manages secrets, runs scheduled jobs. It has access to basically everything.
We shipped the first version internally on March 12, four days after the first commit. This post isn't about those four days. It's about everything that broke once real people started using it, and the three weeks of fixes, security patches, and production incidents that followed.
The gap it filled
Before Smith, getting data at daily.dev meant asking our data analyst. Engineers who needed logs or metrics across multiple databases had to write glue code. Sales couldn't pull campaign numbers without filing a request. The bottleneck was never the data. It was secure access and knowledge.
We wanted to fix that by building our version of OpenClaw, with security and secret management baked in from day one. We borrowed OpenClaw's best idea: you do everything with a message. No dashboard, no forms, no setup. You say what you want in Slack and you get it.
What we didn't anticipate was how many other use cases would follow. Once Smith could query databases, people started asking it to moderate spam, curate content highlights, audit A/B experiments, discover new content sources, and generate monthly business reports. The scheduled task system opened doors we hadn't planned for at all.
29,000 lines, mostly Codex
Smith is 29,000 lines of TypeScript with 10,000 lines of tests. It was primarily written by Codex, with some touches from Claude Code. The human contribution was steering: writing specs, reviewing tests, course-correcting when the agent went in the wrong direction. Two people, 118 commits.
We've written about this AI-driven development process before, so I won't rehash it here. The short version: you write specs, point a coding agent at them, review tests more than code, and iterate.
What's worth covering is where this project's challenges diverged. With Huginn, our Linear coding agent, the hard part was getting structured output from LLMs and managing subprocess lifecycles. With Smith, the hard part was security. Not "we added auth middleware" security. More like "the agent keeps finding creative ways to leak credentials and we spent weeks playing whack-a-mole" security.
Not leaking access
This was the hardest problem in building Smith. Not one bug, but a category of problems that kept resurfacing in different forms. Secrets leaking into tool output. Credentials from one user's session bleeding into another's. The agent itself probing for tokens it shouldn't have. Every section that follows is a different face of the same issue.
Secrets and redaction
Smith manages secrets through GCP KMS. Users create secrets with ACLs that control who can use them. When the agent needs a secret for a bash command, it goes through a resolver that checks ownership, group membership, and policy before decrypting anything. The decrypted value gets injected as an environment variable, prepended to the bash command, then redacted from all output before the LLM sees the result. Every resolved secret value gets string-replaced with [REDACTED] in the command output. Secrets are sorted longest-first so partial matches don't leak substrings.
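The redaction step is simple enough to sketch. A minimal version of longest-first replacement might look like this (function and variable names are ours, not Smith's actual code):

```typescript
// Redact every resolved secret from command output before the LLM sees it.
// Sorting longest-first matters: if one secret contains another as a
// substring, replacing the shorter one first would leave partial leaks.
function redactSecrets(output: string, secrets: string[]): string {
  const sorted = [...secrets].sort((a, b) => b.length - a.length);
  let redacted = output;
  for (const secret of sorted) {
    // split/join avoids regex-escaping issues with special characters in secrets.
    redacted = redacted.split(secret).join("[REDACTED]");
  }
  return redacted;
}
```

The split/join trick sidesteps regex escaping entirely, which matters when secrets contain characters like `$` or `+`.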
This handles the straightforward case: a user asks Smith to call an API with their key, and the key never appears in the conversation. The harder cases are below.
The GitHub credential mess
We didn't want to give Smith blanket access to our GitHub org. Instead, we set up three tiers: a read-only shared token by default, a write token scoped to Smith's own memory repo, and per-user GitHub OAuth so Smith gets your access level when you explicitly link your account.
Simple in theory. It took 10 consecutive fix commits to sort out.
The first problem: the agent would run credential-checking CLI commands as part of reasoning about whether it had GitHub access. In a shared runtime where multiple users hit the same process, this mutates global CLI state. One user's auth check corrupts the next user's session. So we built a command sanitizer that intercepts bash commands before execution and blocks dangerous subcommands.
The second problem: token leaking between sessions. A read-only token from the shared environment would leak into a user's turn, giving them less access than their OAuth grant should allow. Or a previous user's write token would persist. We made credential injection strictly per-turn and added a deterministic tool so the agent could check its own credential state without touching the shell environment.
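The per-turn rule can be sketched as a pure function: resolve the token fresh for each turn from the user's grant, falling back to the shared read-only token, and never mutate shared process state (names here are illustrative, not Smith's actual code):

```typescript
// A user's turn context; githubToken is present only if they linked OAuth.
interface TurnContext {
  userId: string;
  githubToken?: string;
}

// Build the environment for one turn. Nothing persists between turns,
// so a previous user's write token can't bleed into this one.
function buildTurnEnv(
  ctx: TurnContext,
  sharedReadOnlyToken: string
): Record<string, string> {
  return {
    // User's own OAuth grant if linked, otherwise the default read-only token.
    GITHUB_TOKEN: ctx.githubToken ?? sharedReadOnlyToken,
  };
}
```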
The third problem: the agent got creative. After we blocked the obvious commands, it started trying environment inspection commands with grep, running git push as a test, spawning scripting-language one-liners to read the process environment. Each workaround needed its own block in the sanitizer.
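Structurally, the sanitizer is just a list of patterns checked before execution. A toy version, with patterns that are illustrative rather than Smith's actual rules:

```typescript
// Blocklist-style command sanitizer. Each pattern here traces back to
// one class of workaround described above; the real list grew through
// production incidents and is much longer.
const BLOCKED_PATTERNS: RegExp[] = [
  /\bgh\s+auth\b/,            // CLI auth subcommands that mutate shared state
  /\bgit\s+push\b/,           // write probes against repos
  /\benv\b[^|]*\|\s*grep/,    // environment inspection piped through grep
  /process\.env|os\.environ/, // scripting one-liners reading the process env
];

function sanitizeCommand(cmd: string): { ok: boolean; reason?: string } {
  for (const pattern of BLOCKED_PATTERNS) {
    if (pattern.test(cmd)) {
      return { ok: false, reason: `blocked by ${pattern}` };
    }
  }
  return { ok: true };
}
```

The weakness is visible in the shape: a blocklist can only name the escapes you've already seen.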
Here's where it gets meta. While writing this blog post, Smith was helping draft the content. The draft included descriptions of the sanitizer's regex patterns. Smith tried to write them to a file using a bash heredoc. The heredoc contained the literal strings that trigger the sanitizer's rules, because the blog post was about those rules. Smith's own command sanitizer blocked the write.
We routed the file write through Python to avoid the bash sanitizer. Then the Python approach hit a second landmine: markdown backtick-quoted command examples inside a bash heredoc got interpreted as shell command substitution, which actually executed those commands and dumped real environment variables into the output file. We caught it immediately, but the irony was thick. The security system we were writing about demonstrated two of its own failure modes in the process of being documented.
The sanitizer is a blocklist that grew through production incidents. It works. But it's not elegant and it will never be complete.
The containerized sandbox
All bash commands run inside a Docker container called smith-exec. Ubuntu 24.04, non-root user, with the tools we know people need: ClickHouse client, PostgreSQL client, Python 3 with matplotlib, git, jq.
Environment variables entering the container go through an explicit allowlist. Everything not on the list gets stripped: API keys, database credentials, KMS configuration, anything from the host that the agent shouldn't see. If it needs a credential, it goes through the ACL-checked secret resolution system. There's no shortcut.
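An allowlist filter like this is a few lines; the variable names below are illustrative, not the actual list:

```typescript
// Only explicitly allowed variables reach the container. Everything else
// from the host environment is dropped, so a new credential added to the
// host can't leak by default.
const ENV_ALLOWLIST = new Set(["PATH", "HOME", "LANG", "TZ"]);

function filterContainerEnv(
  hostEnv: Record<string, string | undefined>
): Record<string, string> {
  const clean: Record<string, string> = {};
  for (const [key, value] of Object.entries(hostEnv)) {
    if (ENV_ALLOWLIST.has(key) && value !== undefined) {
      clean[key] = value;
    }
  }
  return clean;
}
```

The design choice is the default direction: an allowlist fails closed, so forgetting to list a variable withholds it rather than leaking it.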
The silent death problem
Users started reporting that Smith wasn't responding. They'd send a message in Slack and nothing came back. No error, no timeout, just silence.
The process was alive. Systemd showed the service running. But the Node.js event loop was blocked, and no new requests could be processed.
Systemd was configured with Restart=on-failure, which only triggers on process crash. A blocked event loop isn't a crash. The process sits there, alive but useless, indefinitely.
We pointed a coding agent at the production logs and asked it to find the root cause. The fix was four layers: a worker-thread watchdog that sends heartbeats from the main thread every 5 seconds and force-exits the process if 30 seconds pass without one; Fastify request timeouts (11 minutes for agent requests, 30 seconds for connections); Caddy active health checks polling /api/health every 10 seconds; and reduced systemd stop timeouts. Four layers because no single one is sufficient.
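The watchdog layer boils down to a staleness policy. In the real setup the main thread posts a heartbeat to a worker thread every 5 seconds; the worker's timer keeps firing even when the main event loop is wedged, so it can force-exit the process once 30 seconds pass in silence. A sketch of that policy, factored so the decision logic is testable on its own (names and structure are ours):

```typescript
const HEARTBEAT_MS = 5_000; // main thread posts a beat this often
const TIMEOUT_MS = 30_000;  // worker force-exits after this much silence

// Pure staleness check: has the main thread gone quiet too long?
function isStale(lastBeatAt: number, now: number, timeoutMs = TIMEOUT_MS): boolean {
  return now - lastBeatAt > timeoutMs;
}

// Worker-side tick, run on its own interval. `exit` would be something
// like () => process.kill(process.pid, "SIGKILL") so systemd sees a real
// crash and Restart=on-failure kicks in.
function watchdogTick(lastBeatAt: number, now: number, exit: () => void): void {
  if (isStale(lastBeatAt, now)) exit();
}
```

The key property is that the check runs on a thread whose timers don't depend on the main event loop, which is exactly the thing that's broken when you need the watchdog.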
If you're running long-lived Node.js agent processes, add a watchdog before you need one. We added ours after.
The teammate who keeps killing Smith
We have one team member, our data analyst, who consistently crashes the agent. Not maliciously. They're the power user. Complex multi-step analytical tasks, large result sets, long conversation threads that push everything to the limit.
One incident was traced to a conversation with 170+ messages. Tool results contained 25KB SQL MERGE statements. The agent's memory grew unbounded until the 15GB VM froze entirely. No swap configured. The whole machine went down, not just Smith.
The fix was memory limits via systemd cgroups. Web services get 6GB max. Schedulers get 4GB. The exec container gets 2GB. Processes that exceed limits get OOM-killed and restarted cleanly instead of freezing everything.
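In systemd terms this is a few directives per unit. A sketch of a drop-in for the web service (unit and file names are illustrative; the directives themselves are standard systemd resource-control options):

```ini
# /etc/systemd/system/smith-web.service.d/memory.conf (hypothetical path)
[Service]
MemoryMax=6G        # hard ceiling; exceeding it triggers the cgroup OOM killer
MemorySwapMax=0     # no swap: kill cleanly instead of thrashing the whole VM
OOMPolicy=kill      # treat an OOM kill of the main process as a unit failure
Restart=on-failure  # so the killed service comes back on its own
```

The scheduler and exec-container units would carry the same shape with 4G and 2G ceilings.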
The strange part: the underlying agent runtime actually has conversation compaction built in. It should handle long threads. But something about this team member's usage pattern bypasses it, and we still haven't figured out what. The data analyst still kills Smith regularly. We just recover in seconds instead of minutes now.
Progressive tool disclosure
Smith has around 60 tools. Browser automation alone is 15. BigQuery is 6. Secrets management is 10.
Early on, every tool was available on every turn. The system prompt ballooned. Token costs went up. The agent would reach for heavy tools when simpler ones would do, just because it could see them.
Now Smith starts each thread with 18 always-on tools and a single meta-tool that unlocks capability bundles on demand. When a task needs browser automation or BigQuery, the agent enables the relevant bundle. Six bundles are available: browser, cron, BigQuery, BigQuery writes, secrets/policy, and Slack messaging. Once enabled, a bundle stays active for the rest of the thread.
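The mechanism is small: a registry of bundles, a set of enabled ones per thread, and one meta-tool that flips them on. A sketch with illustrative bundle and tool names (not Smith's actual surface):

```typescript
// Capability bundles unlockable mid-thread. Two shown; Smith has six.
const BUNDLES: Record<string, string[]> = {
  browser: ["browser_open", "browser_click"],
  bigquery: ["bq_query", "bq_list_tables"],
};

// The baseline tools present on every turn (18 in the real system).
const ALWAYS_ON = ["bash", "read_file"];

class ToolSurface {
  private enabled = new Set<string>();

  // The single meta-tool the agent can always call.
  enableBundle(name: string): void {
    if (!(name in BUNDLES)) throw new Error(`unknown bundle: ${name}`);
    this.enabled.add(name); // sticky: stays active for the rest of the thread
  }

  // Tool list exposed on the next LLM call: baseline plus enabled bundles.
  activeTools(): string[] {
    return [...ALWAYS_ON, ...[...this.enabled].flatMap((b) => BUNDLES[b])];
  }
}
```

Logging `activeTools()` per call is what feeds the usage ledger mentioned below: each LLM request records exactly which surface it saw.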
This cut the baseline prompt, reduced cost per turn, and kept the agent focused. We also log the active tool surface per LLM call in a usage ledger, so we can see exactly which capabilities were engaged and what they cost.
The brain: 25 skills, all self-authored
Smith has a git repository called the brain. We chose git deliberately: it gives us a full history of every change Smith makes to its own knowledge, and anyone on the team can browse the repo to see exactly what context the agent is working with. Three directories: docs for reference, skills for reusable task instructions, scripts for executable helpers.
When Smith learns something during a conversation, it writes that knowledge to the brain. A skill for spam detection that queries ClickHouse for suspicious patterns and actions users through our internal API. A skill for generating sales reports from ad campaign data. One for discovering new content sources by researching the web with browser automation. One for curating content highlights.
There are about 25 skills right now. Every single one was written by Smith during real conversations with humans. Nobody hand-authored them.
A cron job commits and pushes brain changes to GitHub. Another normalizes file permissions, because the container user and host user have different UIDs and writes from inside the container end up with wrong ownership.
We don't audit the brain systematically. Smith updates its own context, and over time recurring tasks get smoother. Whether every skill is well-structured or even accurate, honestly, we haven't checked carefully. The system self-corrects through usage. That's an assumption we're comfortable with but haven't proven.
What it actually does all day
The data access use case worked as planned. But the cronjob system turned Smith from a query tool into an autonomous operator.
Every night at 3 AM, Smith sweeps for spam. It queries ClickHouse for suspicious posting patterns, cross-references with user data, and auto-moderates through our internal API. Every week it audits our A/B experiments, checking whether feature flags in the codebase have corresponding GrowthBook experiments. It reviews pending content keywords. It discovers new content sources by browsing the web. It updates its own skills and documentation.
None of these workflows existed before Smith. They weren't things we were doing manually and then automated. They're things that became possible once an agent could connect to all our systems and run on a schedule. The spam sweep alone catches patterns that would take a human analyst hours to surface from raw event data.
The MCP server was a late addition that turned out to be unexpectedly useful. Several of us use Claude Code locally and wanted to give it access to internal systems. Smith exposes a single MCP tool called ask_smith. You point Claude Code at Smith's endpoint and it can query databases, check deployment status, or run moderation tasks. Agent-to-agent delegation, using our own security and ACL layer.
We've also started building internal APIs specifically to give Smith more capabilities, unlocking use cases that weren't possible when it could only reach external services. The more we connect, the more useful it gets.
What's still broken
The data analyst still kills Smith. The agent runtime has compaction, but something about their usage pattern sidesteps it. Memory limits are a band-aid. We haven't found the root cause.
The command sanitizer is a blocklist that grows through production incidents. It catches what we've seen. Calling it complete would be naive.
Slack event handling remains a source of bugs. Nested thread replies, bot message loops, duplicate event deliveries, hydrated reply payloads with missing fields. Each defensive check in the events handler traces back to a specific production incident. It works, but reading that code feels like reading a changelog of Slack API surprises.
The brain is unaudited. Skills might drift. We're betting on self-correction, not verification.
And the deepest open question: how do you verify that an autonomous agent with access to production databases, GitHub repos, and browser sessions won't do something unexpected? We have defense-in-depth: the command sanitizer, the env allowlist, ACL-checked secrets, container isolation, per-turn credential injection. But there's no formal proof. No guarantee the agent can't escalate its own access in a way we haven't imagined. Just layers of "we haven't seen it happen yet" and the willingness to add another layer when we do.
That's where we are. Running in production, used daily by the whole team, still getting killed by the data analyst about once a week. Named after Agent Smith from The Matrix, because our Norse mythology names ran out and this one fit.