Illustrative engagement scenario
Series B platform — feature velocity and inference cost
How a growth-stage SaaS platform compressed its release cycles and brought inference spend back in line with feature revenue over four months.
This is an illustrative engagement scenario. It draws on patterns we've seen across multiple engineering organisations of similar shape but does not describe a specific real client. The numbers are realistic but illustrative.
The situation
The client was a Series B SaaS platform with 217 engineers across product, infrastructure, and AI teams. Annual recurring revenue was growing, but two metrics had drifted in the wrong direction over the previous twelve months.
Feature delivery had visibly slowed. The team had grown by 28% over the year, but the number of meaningful product changes shipped per quarter had stayed flat. Senior engineers reported that more and more of their time was going to reviewing and unblocking work rather than building it. Sprint cycle times had stretched from a median of eight days to a median of thirteen.
At the same time, the platform's AI features — used by a little over half the customer base — had become a meaningful cost line. Inference spend was now 17.8% of cloud cost, up from 5.9% at the start of the year, and it was scaling with usage in a way that didn't track the revenue from those features. The CFO had flagged it. The CTO had asked his AI lead to "look into it" two months earlier; the AI lead was already running a feature roadmap and didn't have time.
The CTO reached out after a referral conversation with another engineering leader who had done similar work.
What we measured first
Two weeks of baseline work, run in parallel across three measurement passes.
Pass 1 — Engineering time allocation. We sampled six representative product squads. For each, we instrumented a four-week period: tickets categorised by type, time-in-status from creation to merge, time-in-status from merge to production, queue time at code review and QA hand-offs. The output was a clean picture of where engineering time was going, ticket type by ticket type. The finding worth flagging: testing and bug-fix work had grown from about 21.7% of engineering time to about 36.4% over the previous year. The growth was concentrated in two services that had been instrumented for AI features — the test surface area had expanded faster than test coverage could keep up.
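None of this needed heavy tooling. A minimal sketch of the time-in-status and queue-time calculation, assuming a flat export of ticket status transitions; the file name, columns, and status labels here are illustrative, not the client's actual schema:

```python
import pandas as pd

# Hypothetical export of ticket status transitions; one row per transition.
events = pd.read_csv("ticket_events.csv", parse_dates=["entered_at"])
events = events.sort_values(["ticket_id", "entered_at"])

# Time spent in each status = gap until the ticket's next transition.
events["left_at"] = events.groupby("ticket_id")["entered_at"].shift(-1)
events["hours_in_status"] = (
    (events["left_at"] - events["entered_at"]).dt.total_seconds() / 3600
)

# Share of tracked time by ticket type, e.g. feature vs. test/bug-fix work.
by_type = events.groupby("ticket_type")["hours_in_status"].sum()
print((by_type / by_type.sum()).round(3))

# Queue time at hand-offs: time sitting in wait states before review/QA.
queue_states = ["awaiting_review", "awaiting_qa"]  # illustrative labels
queue = events[events["status"].isin(queue_states)]
print(queue.groupby("status")["hours_in_status"].median())
```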
Pass 2 — Model spend decomposition. We decomposed twelve months of model spend by feature, by model, and by request shape, producing a per-feature cost-to-revenue ratio. Roughly 58% of inference cost was concentrated in two features. One was using GPT-4 for a classification task that a much smaller model would have handled fine. The other was making three sequential model calls per user action when one well-prompted call with structured output would have done the job. Neither was subtle. They just hadn't been measured before.
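The decomposition itself is a join-and-aggregate exercise. A minimal sketch, assuming a per-request inference log and a per-feature revenue table; the file names, columns, and per-token prices are illustrative assumptions, not real rates:

```python
import pandas as pd

# Illustrative per-million-token prices; real rates vary by model and provider.
PRICES = {
    "gpt-4":       {"input": 30.0, "output": 60.0},
    "small-model": {"input": 0.25, "output": 1.25},
}

logs = pd.read_csv("inference_log.csv")  # hypothetical per-request export

def request_cost(row):
    p = PRICES[row["model"]]
    return (row["input_tokens"] * p["input"]
            + row["output_tokens"] * p["output"]) / 1_000_000

logs["cost_usd"] = logs.apply(request_cost, axis=1)

# Spend by feature and by model, then cost-to-revenue per feature.
print(logs.groupby(["feature", "model"])["cost_usd"].sum())

revenue = pd.read_csv("feature_revenue.csv", index_col="feature")
ratio = logs.groupby("feature")["cost_usd"].sum() / revenue["monthly_revenue"]
print(ratio.sort_values(ascending=False))
```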
Pass 3 — Qualitative mapping. We shadowed two squads for a sprint each, sat in on standups, retros and sprint planning, and mapped every cross-team hand-off and the queue time at each one. A typical feature was now passing through 4–6 cross-team hand-offs from spec to production, where the median a year earlier had been 2–3. The growth was driven by new ownership boundaries added as the team expanded but never reviewed. Each hand-off was costing 1.5–3 days of queue time.
What we changed
Three workstreams over the next twelve weeks, sequenced rather than run in parallel so that each measurement could be clean against a stable baseline.
Workstream 1 — Test generation in the AI-instrumented services. We worked with the platform team on AI-augmented test generation against the two services where testing load had grown. The work itself was unglamorous: characterisation tests on existing behaviour, regression tests around the AI feature surfaces, and fuzz testing on the input shape. Tooling was a mix of standard test generators and structured prompting against the codebase for higher-level cases. The team learned the prompting patterns alongside us, and the tooling stayed in place after the engagement ended.
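For flavour, a minimal sketch of the two cheaper layers, a characterisation test and a fuzz test on the input shape, written with pytest and the hypothesis library against a hypothetical `normalise_request` function standing in for one of the AI feature surfaces:

```python
import pytest
from hypothesis import given, strategies as st

from app.preprocessing import normalise_request  # hypothetical module

# Characterisation test: pin down current behaviour before changing anything.
# The expected values are whatever the code does today, captured verbatim.
@pytest.mark.parametrize("raw, expected", [
    ({"text": "  Hello ", "lang": None}, {"text": "hello", "lang": "en"}),
    ({"text": "", "lang": "fr"}, {"text": "", "lang": "fr"}),
])
def test_normalise_request_characterisation(raw, expected):
    assert normalise_request(raw) == expected

# Fuzz test on the input shape: whatever comes in, the function must not
# crash and must return the fields downstream callers rely on.
@given(st.fixed_dictionaries({
    "text": st.text(max_size=2000),
    "lang": st.one_of(st.none(), st.sampled_from(["en", "fr", "de"])),
}))
def test_normalise_request_never_crashes(raw):
    out = normalise_request(raw)
    assert set(out) >= {"text", "lang"}
```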
Workstream 2 — Inference cost on the two heavy AI features. On Feature One, we swapped the GPT-4 calls out for Claude Haiku after benchmarking both against the team's actual quality criteria, and ran them in parallel on production traffic for two weeks before the cutover. Feature Two was a different shape — we restructured its three-call sequence into a single structured-output call with a longer prompt and explicit schema enforcement. Both changes were instrumented for cost and quality from day one.
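A minimal sketch of the Feature Two restructuring, with a deliberately provider-agnostic `call_model` stub in place of whichever SDK the team runs; the schema and field names are illustrative. The shape that matters is one call, one schema check, one bounded retry path:

```python
import json
import jsonschema  # pip install jsonschema

# Hypothetical schema for what was previously assembled across three calls.
RESULT_SCHEMA = {
    "type": "object",
    "properties": {
        "category":   {"type": "string"},
        "summary":    {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["category", "summary", "confidence"],
}

def call_model(prompt: str) -> str:
    """Stub for the provider SDK call; returns the model's raw text."""
    raise NotImplementedError  # wire up to the real SDK here

def handle_action(user_input: str, max_retries: int = 2) -> dict:
    prompt = (
        "Classify, summarise, and score the following input. "
        "Respond with a single JSON object matching this schema:\n"
        f"{json.dumps(RESULT_SCHEMA)}\n\nInput:\n{user_input}"
    )
    last_err = None
    for _ in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            result = json.loads(raw)
            jsonschema.validate(result, RESULT_SCHEMA)  # enforce the schema
            return result
        except (json.JSONDecodeError, jsonschema.ValidationError) as err:
            last_err = err  # malformed output: retry up to the limit
    raise last_err
```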
Workstream 3 — Process. We worked with the engineering leadership of one product squad to collapse two of the cross-team hand-offs that had emerged over the previous year. That meant redrawing one ownership boundary and removing an approval gate that no longer reflected how risk was actually being managed. The squad ran with the new structure for six weeks while we measured cycle time against the baseline.
What moved
After re-measurement against the baseline:
Engineering time allocation. Testing and bug-fix work came back down from 36.4% to roughly 24% of engineering time across the instrumented squads. The freed-up time went back into product development. Median sprint cycle time on the instrumented services dropped from 13 days to 9 days.
Inference cost. Monthly model spend on the two targeted features dropped by 53% in aggregate. Feature One stayed within the team's defined acceptance criteria for quality. Feature Two improved a little on quality alongside the cost reduction, which we put down to the structured-output restructuring.
Coordination overhead. On the piloted squad, median feature cycle time from spec to production dropped by 29%. The bulk of the reduction came from the removed approval gate and the consolidated ownership boundary.
The team also kept the baseline measurement framework. They have continued to use it for subsequent decisions about tooling and feature scope, well beyond the work we did directly.
What we didn't change
A short note on scope. The engagement covered three workstreams across a representative slice of the engineering organisation. It did not cover:
- The other four product squads outside the instrumented sample
- The infrastructure team's roadmap, which was running independently
- The AI roadmap itself — model selection on new features built after the engagement
The narrow scope is part of how the engagement model works. The patterns and instrumentation that came out of the work are available to the rest of the organisation; applying them is the team's call.
What this means for similar teams
The engagement shape in this study tends to work for organisations with a few things going on at once: they're in a growth phase and headcount has outpaced process, AI features are in production and the cost line has become visible, and the CTO or VP-Engineering wants to see specifically where gains and losses live rather than read a strategy deck about them.
If that sounds familiar, the first step is a 30-minute conversation about the specific situation in your team.