
TL;DR
We couldn’t get AI to autonomously build features from detailed specs. However, we did develop a workflow that significantly increases the productivity of experienced developers. The key was to shift our mental model away from AI as a powerful autocomplete to AI as a pair programming partner. Getting this right requires every developer to be able to operate as a feature lead.
Lessons From the Front Lines
We’ve been working hard to stay abreast of the changes in the AI software development landscape. Not an easy task, as everything seems to be progressing very rapidly. After some recent experiments and false starts with everything from codegen agents to complex prompting strategies, I think we’ve finally found something that works, for now. But it required us to throw out some core assumptions about what AI can (and can’t) do. This is a write-up of what failed, what worked, and what I think that means for development teams trying to make LLMs an effective part of their toolset.
From Auto-Complete to Pair Programming
The first conceptual shift we had to make was realizing that AI agents are not simply smarter autocompletes. With agents, you start by describing what you want, let the agent code, and then review the resulting diffs. The AI is your pair programming partner. It codes fast, it appears to know everything, and it seems almost capable of magic. However, it doesn’t actually “think like a machine,” and this is where we ran into problems.
We approached AI as if it were a deterministic interpreter. We wrote long-form specs, translated them into ever more detailed and structured prompts with all the necessary inputs, and expected ever more complete and correct code in return. Instead, we kept running into all the familiar LLM failure modes: hallucinations, code duplication, logic errors, and false confidence. The more exhaustively we specified the tasks, the worse these problems got, and they tended to cascade: a single bad assumption by the agent would spread across files or functions, compounding as it went. The upfront investment didn’t pay off. We spent a ton of time writing and refining specs, only to spend more time debugging or rewriting the results.
Eventually, we realized that the deeper issue is that large language models operate on vibes, not long lists of detailed tasks. This was initially deceptive because you can ramble conversationally about a feature or component, and the AI will often get the gist and do a surprisingly good job with implementation. The natural instinct is to refine the rambling and provide more details and explicit instructions. However, if you list 12 precise behaviors or edge cases you expect to be handled, it will fumble some of them. Our second conceptual shift was realizing that we had to establish fairly specific human-driven workflows to work around the fundamental nature of the AI agents.
What’s Working for Us (So Far)
What ended up working was much more pragmatic and developer-driven. Instead of trying to treat AI as an autonomous actor, we reframed it as a tool for rapid pair programming. The key shifts:
- Invest in the environment, not the prompts.
Don’t waste time writing perfect specs for each story. Invest instead in preparing your codebase and tooling for fast AI-assisted iteration: clear coding conventions, well-defined modular architecture, and opinionated library choices. The biggest gains didn’t come from writing better prompts or choosing the right model; they came from defining a clear architecture for the AI to work within and building better processes around execution.
- Decompose user stories into semantic chunks up front.
Break work into units small enough that each chunk can be reasoned about, executed, and validated in isolation. Each semantic chunk gets its own branch. This has a multitude of benefits, making it easier to keep the AI focused, less costly to restart if needed, quicker to validate changes via tests, and easier for humans to review the code.
- Every dev is a feature lead.
Everyone interacting with AI needs to be capable of leading the human-AI interaction. The human lead needs to be able to decompose the work items effectively. They need to understand the feature, know what the code is supposed to do, and apply that understanding to make judgment calls about the implementation details when the AI goes off track.
- Codify your pair programming guidelines.
We’ve developed a set of best practices for working with AI agents:
- Setup
- Provide examples alongside abstract instructions. Existing, similar code from your own codebase makes the best examples.
- Define a modular architecture with clear API boundaries. Good fences make good neighbors and help prevent AI chaos (see the sketch after this list).
- Use a linter. AI agents are good at responding to this automated feedback.
- Guide the Work
- Keep changes small enough that human validation is still easy.
- Review all AI-generated code carefully, and assume the AI will fail silently on edge cases.
- Write tests early and often. Use AI to generate a comprehensive suite of tests; a passing suite is the fastest signal that a change hasn’t introduced unintended consequences (see the example test after this list).
- Iterate
- Don’t be afraid to throw away your current path and start over. Once an AI agent goes down a poorly chosen path, it can be next to impossible to redirect. Sometimes, a fresh start is the best option.
- Copy error messages directly into the AI conversation; the agent can often turn raw errors into quick progress toward a fix.
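To make the “clear API boundaries” guideline concrete, here is a minimal sketch of the kind of narrow module surface we try to give the agent to work within. The `billing` package, `Invoice`, and `create_invoice` are hypothetical names for illustration, not from any particular codebase:

```python
# billing/api.py -- a hypothetical, deliberately narrow module boundary.
# Code outside the billing package imports only from this file; internal
# helpers stay private, so AI-generated changes stay contained behind it.
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Invoice:
    """The only billing type other modules are allowed to depend on."""
    customer_id: str
    amount: Decimal
    currency: str = "USD"


def create_invoice(customer_id: str, amount: Decimal) -> Invoice:
    """Public entry point: validate inputs at the boundary."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return Invoice(customer_id=customer_id, amount=amount)
```

The domain doesn’t matter; what matters is that each module exposes a small, typed surface the agent can be pointed at, so anything behind it can be rewritten or restarted without touching callers.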
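In the same spirit, for the “write tests early and often” guideline, this is the shape of the small, chunk-sized test suite we’d ask the AI to generate, shown here as a pytest sketch against the hypothetical `billing.api` module above. Validating a chunk then means running the suite rather than re-reading every diff:

```python
# test_billing_api.py -- hypothetical pytest suite for the sketch above.
from decimal import Decimal

import pytest

from billing.api import Invoice, create_invoice


def test_create_invoice_returns_expected_fields():
    invoice = create_invoice("cust-42", Decimal("19.99"))
    assert invoice == Invoice(customer_id="cust-42", amount=Decimal("19.99"))


def test_create_invoice_rejects_non_positive_amounts():
    # Edge cases like this are exactly where agents fail silently, so we
    # capture them as tests instead of burying them in a long prompt.
    with pytest.raises(ValueError):
        create_invoice("cust-42", Decimal("0"))
```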
Everyone is a Feature Lead Now
With the right approach, AI can be a massive force multiplier. We’ve seen measurable improvements in throughput, but it took meaningful changes to how we work, and those gains came with new expectations. Developers can no longer rely on well-scoped tickets and clean handoffs. They need to drive. That means owning the feature end-to-end, understanding the user impact, decomposing the work into chunks the AI can handle, guiding the agent with examples and guardrails, and validating the results with a critical eye. The job now is less about typing code and more about orchestrating its creation. Everyone on the team, regardless of title, is now expected to operate with the mindset and skills of a feature lead.
AI has made our coding much faster, but it hasn’t changed the fundamentals. Clear architecture, rigorous validation, and strong engineering leadership still matter more than ever. Speed without the structure of good software engineering just means you’ll quickly end up with bad code and all its attendant problems. What we’ve learned is that combining the speed of AI with tried and true practices is the way to rapidly develop reliable, maintainable software.