AI agents in production: one crypto exchange’s six-month experiment

A team building software for a crypto exchange spent six months integrating AI agents into their development process. Features shipped per engineer tripled, and the rework rate on agent-drafted code dropped from 32% to 9%. This is what they changed, and what they didn't.
On a crypto exchange, even small mistakes can have real financial consequences, so the team was not looking for a shortcut around engineering review. But a lot of engineering time was still going to work that didn’t require much judgment: setting up branches, writing boilerplate, wiring endpoints, drafting baseline tests. Necessary work, but not the best use of senior engineering time.
Over six months, the team introduced AI agents into the workflow to handle repetitive implementation tasks; engineers still made the decisions. In that period, features shipped per engineer per week rose from 1.2 to 3.6, while the rework rate on agent-drafted code fell from 32% to 9%. The change failure rate did not increase.
The workflow
The process they run in production follows a fixed sequence: product idea → triage → engineer → agent drafts PR → engineer review → team review → agent fix loop → merge.
Most of the gains depended on getting triage right. Before an agent starts on anything, the engineer prepares a “delegation bundle”: acceptance criteria, constraints, links to similar prior work, and a risk tag. Vague instructions produced bad results fast, so the team made specificity mandatory.
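As an illustration only, here is one way such a delegation bundle could be represented in code. The field names, risk levels, and example values below are assumptions made for the sketch, not the team's actual format.

```python
# A minimal sketch of a "delegation bundle" as a data structure.
# Field names and RiskTag values are assumptions for illustration;
# the article only lists the categories of information included.
from dataclasses import dataclass
from enum import Enum

class RiskTag(Enum):
    LOW = "low"        # boilerplate, UI copy, test scaffolding
    MEDIUM = "medium"  # new endpoints, feature-flagged changes
    HIGH = "high"      # auth, permissions, withdrawals, key management

@dataclass
class DelegationBundle:
    acceptance_criteria: list[str]  # concrete, testable outcomes
    constraints: list[str]          # scope limits, patterns to follow
    prior_work: list[str]           # links to similar merged PRs
    risk_tag: RiskTag

    def is_delegable(self) -> bool:
        """High-risk surfaces default to manual work unless a senior
        engineer explicitly decides otherwise."""
        return self.risk_tag is not RiskTag.HIGH

# Hypothetical bundle for the 2FA backup-codes feature described below.
bundle = DelegationBundle(
    acceptance_criteria=["User can generate one-time backup codes",
                         "A code is invalidated after a single use"],
    constraints=["Follow the existing feature-flag wiring",
                 "Do not touch withdrawal or key-management code"],
    prior_work=["link-to-similar-merged-PR"],
    risk_tag=RiskTag.MEDIUM,
)
```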
The agent’s job is narrow: produce a first-draft pull request that follows existing patterns. It works from what already exists in the repository: interfaces, conventions, and code that is already in use. It is not supposed to invent missing pieces. Everything the agent produces goes through the same review and CI/CD pipeline as code written by a human.
A real example
A feature for 2FA backup codes went like this:
- Engineer delegates at 10:15 with acceptance criteria and links to similar work.
- Agent opens a PR at 10:35 – UI and API changes, feature flag wiring, baseline tests included.
- Engineer reviews at 11:00 for intent, security properties, and edge cases.
- Two-engineer team review at 11:30 covers service boundaries and failure modes.
- Merged at 12:00.
Total active human time: around 45 minutes. Most of that time went to review, edge cases, and security judgment rather than writing the initial draft.
What went wrong early
In the first month, 32% of agent-drafted PRs needed meaningful rework. The team tracked failures by category and found several recurring patterns.
- Hallucinated integrations accounted for roughly 18% of failures. The agent would assume an SDK method existed or invent an API contract. The fix: require citations to internal interfaces. If the agent can’t point to a real source, it stops and asks.
- Vague specs became wrong UX in about 25% of cases. A prompt like “make this mobile friendly” could produce something functional but incomplete. The fix: explicit acceptance criteria with examples of what acceptable looks like and what doesn’t qualify.
- Scope creep disguised as optimization showed up in 22% of failures. Requests like “optimize this flow” sometimes triggered large refactors. The fix: hard scope limits on files and change size, plus a plan-first step where the engineer approves the approach before any code is written (a minimal version of such a gate is sketched after this list).
- Wrong internal patterns appeared in 12% of cases – code that worked but didn’t match the team’s conventions, creating maintainability and security risks down the line.
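To make the citation and scope fixes concrete, here is a rough sketch of a pre-merge gate that enforces both. The thresholds, file paths, and citation format are assumptions for illustration; the article does not specify how the team implemented these checks.

```python
# Illustrative pre-merge gate for two of the fixes described above:
# hard scope limits on change size, and required citations to real
# internal interfaces. Limits and field names are hypothetical.
from dataclasses import dataclass
from pathlib import Path

MAX_FILES_CHANGED = 10    # hypothetical hard limit agreed during triage
MAX_LINES_CHANGED = 400   # hypothetical diff-size ceiling

@dataclass
class AgentPR:
    files_changed: list[str]
    lines_changed: int
    cited_interfaces: list[str]  # repo paths the agent claims to build on

def check_scope(pr: AgentPR) -> list[str]:
    """Return a list of violations; an empty list means the gate passes."""
    problems = []
    if len(pr.files_changed) > MAX_FILES_CHANGED:
        problems.append(
            f"touches {len(pr.files_changed)} files (limit {MAX_FILES_CHANGED})")
    if pr.lines_changed > MAX_LINES_CHANGED:
        problems.append(
            f"changes {pr.lines_changed} lines (limit {MAX_LINES_CHANGED})")
    if not pr.cited_interfaces:
        problems.append("no citations to existing internal interfaces")
    # Citations must point at files that actually exist in the repository;
    # otherwise the agent is asked to stop and clarify instead of guessing.
    missing = [p for p in pr.cited_interfaces if not Path(p).exists()]
    problems += [f"cited interface not found: {p}" for p in missing]
    return problems
```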
The safeguards
Authentication, permissions, withdrawals, and key management are treated as high-risk surfaces, so they default to manual work unless a senior engineer decides otherwise. Every agent-drafted change goes through individual review by the delegating engineer, then team review by at least two engineers. CI/CD runs the same checks it runs on human-written code: tests, static analysis, dependency hygiene, security scanning.
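One way the "default to manual" rule for high-risk surfaces could be expressed is a simple path-based check like the sketch below. The path prefixes are assumptions; the article only names the domains.

```python
# Hypothetical check for whether an agent-drafted change touches a
# high-risk surface and therefore defaults to manual work.
HIGH_RISK_PREFIXES = (
    "services/auth/",
    "services/permissions/",
    "services/withdrawals/",
    "services/keys/",
)

def requires_manual_work(changed_files: list[str],
                         senior_override: bool = False) -> bool:
    """High-risk changes stay manual unless a senior engineer decides otherwise."""
    touches_high_risk = any(
        f.startswith(HIGH_RISK_PREFIXES) for f in changed_files
    )
    return touches_high_risk and not senior_override
```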
The team also built an agent fix loop into the review process. Reviewers leave normal PR comments, then invoke the agent to address specific items – with an explicit instruction not to touch anything else. Reviewers stay focused on judgment and correctness, while the agent handles the smaller follow-up edits.
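As a sketch of that fix loop, a reviewer comment might be wrapped into a scoped instruction like the one below. The function, message format, and example comment are hypothetical; the point carried over from the article is the explicit "do not touch anything else" constraint.

```python
# Hypothetical helper that turns a single review comment into a
# narrowly scoped fix request for the agent.
def build_fix_instruction(comment: str, file: str, line: int) -> str:
    return (
        f"Address the following review comment on {file}, line {line}:\n"
        f"  {comment}\n"
        "Change only what is needed to resolve this comment. "
        "Do not touch anything else in the PR."
    )

# Usage: one scoped instruction per review comment the agent handles.
instruction = build_fix_instruction(
    comment="Backup codes should be hashed before storage.",
    file="api/backup_codes.py",
    line=42,
)
```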
What they learned
The gains came less from the model than from the process built around it. Clear delegation standards, strict scope limits, traceability requirements, and consistent review kept the system stable as it scaled. The agents handled more of the repetitive implementation work. Architecture, threat modeling, and failure analysis still stayed with the engineers.