Writing User Stories for Uncertain Systems
Written by Herman Lintvelt
Originally posted on Substack
When your AI feature might behave differently every time, how do you even specify what “working” means?
In my previous article, I argued that engineering fundamentals matter more than ever in the AI era. Today, let’s dig into the first and most critical fundamental: specification through user stories.
If you’ve been writing software for a while, you’ve probably seen the classic user story format a thousand times: “As a [user], I want [feature] so that [benefit].” Simple, right?
But what happens when that feature involves an AI that might give different answers to the same question? Or when “working correctly” means “performs acceptably 85% of the time”? (Yes, these new acceptability thresholds make me uncomfortable as well...) Suddenly, our comfortable user story templates feel inadequate.
The truth is, user stories aren’t just documentation. They’re how we think through what we’re building before we build it. And with AI systems, that upfront thinking becomes the difference between shipping something useful and shipping something that fails mysteriously in production.
Beyond using user stories to think through what you are building and how you will measure its impact, spending more time on specifications matters even more in today’s era of GenAI code assistants. GenAI is only as good as the input tokens it’s given.
Why User Stories Matter (Even More Now)
Let me take you back to basics for a moment. User stories emerged from Extreme Programming, as described in Kent Beck’s Extreme Programming Explained, as a way to capture requirements that actually matter to users. They force us to answer three questions (a familiar pattern later evolved by Rachel Davies):
- Who needs this? (As a...)
- What do they need? (I want...)
- Why does it matter? (So that...)
This structure keeps us honest. It prevents us from building features nobody asked for. It connects technical work to actual value.
In traditional software, once you had these three elements clear, you could usually figure out the implementation. The path from “user wants to sort their contacts” to “implement a sorting algorithm” was straightforward.
AI systems break this linearity.
When you’re building with AI, the gap between “what the user wants” and “how the system will behave” is filled with uncertainty. Your recommendation engine won’t always recommend the perfect item. Your sentiment analysis won’t always get the tone right. Your chatbot will sometimes misunderstand. The code your AI assistant created will be full of shortcuts and “TODO” comments.
This isn’t a bug; it’s the nature of AI. Which means your specifications need to explicitly account for uncertainty from the start.
The Uncertainty Problem
I learned this lesson the hard way on a health platform I worked on. We had a user story that seemed clear: “As a parent, I want personalised video recommendations for managing my child’s ADHD symptoms, so I can get relevant support quickly.”
Sounds reasonable, right? We built an AI model, trained it on our content library and user engagement data, and shipped it.
Then reality hit. The AI worked brilliantly for some families. For others, it kept recommending the same few videos. For a small subset, the recommendations seemed almost random. All from the same model, with the same code.
The problem wasn’t the AI – it was our specification. We hadn’t defined:
- What “personalised” actually meant (similar to what? based on what signals?)
- What level of accuracy was acceptable (all recommendations perfect? most? half?)
- How the system should behave when uncertain (show popular content? ask for more info?)
- How it should learn from mistakes (implicit feedback? explicit corrections?)
We’d written a user story for a deterministic system and tried to implement it with a probabilistic one.
Specifying for Uncertainty
Here’s what that user story should have looked like:
As a parent managing my child’s ADHD, I want video recommendations personalised to my family’s current challenges, so I can quickly find relevant guidance without searching.
Acceptance Criteria:
- Primary recommendations (top 3) achieve >70% click-through rate within the user’s first session
- System demonstrates learning: recommendations improve to >85% CTR after user rates 5+ videos
- When confidence is low (<60%), the system shows “popular in similar situations” content instead of personalised recommendations
- The system provides an explicit feedback mechanism: thumbs up/down on each recommendation
- Fallback: if no personalised recommendations meet the threshold, show trending content from the user’s child’s age group
- Performance: recommendations generated within 500ms
- Edge case: new users with no history see curated “getting started” sequence
See the difference? This version acknowledges that the AI might not always get it right, specifies what “right” means quantitatively, and defines what happens when the system is uncertain.
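To make that concrete, here is a minimal sketch of what the confidence gate and fallback chain above might look like in code. The thresholds mirror the illustrative criteria, and every name here (the `Recommendation` type, the candidate lists) is hypothetical rather than taken from the actual platform.

```python
# A minimal sketch of the confidence gate and fallback chain in the criteria above.
# Thresholds and names are illustrative, not from a real codebase.

from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.60   # below this, fall back to "popular in similar situations"
TOP_N = 3                 # primary recommendations shown to the parent

@dataclass
class Recommendation:
    video_id: str
    confidence: float     # model's confidence that this video is relevant to the family

def select_recommendations(personalised: list[Recommendation],
                           popular_similar: list[Recommendation],
                           trending_for_age: list[Recommendation]) -> list[Recommendation]:
    """Return the top-N videos to show, degrading gracefully when confidence is low."""
    confident = [r for r in personalised if r.confidence >= CONFIDENCE_FLOOR]
    if len(confident) >= TOP_N:
        return sorted(confident, key=lambda r: r.confidence, reverse=True)[:TOP_N]
    if popular_similar:
        # Low confidence: show "popular in similar situations" content instead.
        return popular_similar[:TOP_N]
    # Final fallback: trending content for the child's age group.
    return trending_for_age[:TOP_N]
```

The specific numbers matter less than the fact that every branch in this function traces back to a line in the acceptance criteria.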
The AI User Story Pattern
Through building several AI-powered products, we developed a pattern for writing AI-focused user stories. Beyond the traditional three parts, you need four additional elements:
1. Performance Thresholds
AI isn’t binary — it exists on a spectrum of quality. Define the acceptable range:
- Accuracy requirements (e.g., “categorises expenses correctly 85% of the time”)
- Latency limits (e.g., “generates response within 2 seconds”)
- Confidence thresholds (e.g., “only acts on predictions with >75% confidence”)
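One way to keep thresholds like these honest is to turn them into an automated check that runs against a labelled test set. A rough sketch, assuming a hypothetical `model` callable that returns a category and a confidence score:

```python
# A rough sketch of encoding performance thresholds as an automated evaluation.
# The `model` callable and the labelled examples are hypothetical stand-ins.

import time

ACCURACY_THRESHOLD = 0.85     # "categorises expenses correctly 85% of the time"
LATENCY_LIMIT_SECONDS = 2.0   # "generates response within 2 seconds"
CONFIDENCE_THRESHOLD = 0.75   # "only acts on predictions with >75% confidence"

def evaluate(model, labelled_examples):
    correct, acted_on, latencies = 0, 0, []
    for text, expected_category in labelled_examples:
        start = time.perf_counter()
        category, confidence = model(text)   # assumed to return (label, confidence)
        latencies.append(time.perf_counter() - start)
        if confidence >= CONFIDENCE_THRESHOLD:
            acted_on += 1
            if category == expected_category:
                correct += 1
    accuracy = correct / max(acted_on, 1)    # accuracy over predictions we acted on
    assert accuracy >= ACCURACY_THRESHOLD, f"accuracy {accuracy:.2%} below threshold"
    assert max(latencies) <= LATENCY_LIMIT_SECONDS, "latency limit exceeded"
    return accuracy
```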
2. Graceful Degradation
What happens when the AI can’t deliver the ideal outcome?
- Fallback behaviours (e.g., “if uncertain, ask user to clarify”)
- Partial results (e.g., “show top 3 most confident categories”)
- Human handoff criteria (e.g., “escalate to support when confidence <50%”)
3. Learning Expectations
How should the system improve over time?
- Feedback mechanisms (explicit ratings by humans or LLM-as-a-judge, implicit signals we can codify and measure)
- Adaptation timeline (improves after N interactions)
- Personalisation scope (per user, per organisation, globally)
4. Failure Modes
Traditional software has bugs. AI has failure modes that will always need to be dealt with, and thus need to be specified:
- False positive handling (wrongly flagging something)
- False negative handling (missing something important)
- Bias mitigation (avoiding discriminatory outcomes)
- Adversarial robustness (handling malicious inputs)
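If it helps to make the pattern tangible, the four extra elements can sit alongside the classic three in a small structure. This is just one possible shape, sketched here with illustrative values, not a prescribed schema:

```python
# A sketch of capturing the four extra elements alongside the classic story parts.
# Field names and example values are illustrative only.

from dataclasses import dataclass, field

@dataclass
class AIUserStory:
    as_a: str
    i_want: str
    so_that: str
    performance_thresholds: dict = field(default_factory=dict)  # accuracy, latency, confidence
    graceful_degradation: list = field(default_factory=list)    # ordered fallback behaviours
    learning_expectations: dict = field(default_factory=dict)   # feedback signals, timeline, scope
    failure_modes: list = field(default_factory=list)           # false positives/negatives, bias, abuse

expense_story = AIUserStory(
    as_a="freelancer tracking my spending",
    i_want="expenses categorised automatically",
    so_that="I spend less time on bookkeeping",
    performance_thresholds={"accuracy": 0.85, "latency_s": 2.0, "min_confidence": 0.75},
    graceful_degradation=["ask user to pick from top 3 categories", "leave uncategorised"],
    learning_expectations={"feedback": "user corrections", "scope": "per user"},
    failure_modes=["tax-deductible expense miscategorised", "biased assumptions about merchants"],
)
```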
Making It Practical with SpecKit
Now, you might be thinking: “This sounds like a lot of work. Can’t AI help with writing these specifications?”
Yes, but carefully.
This is where tools like SpecKit become valuable. SpecKit is GitHub’s framework for spec-driven development with AI coding agents. It provides structured commands that help you and your AI assistant work through specifications before jumping into code.
The workflow looks like this:
1. Establish Constitution (/speckit.constitution)
Define your project’s governing principles first. This isn’t the feature spec — it’s the rules of the road:
/speckit.constitution This is a health platform mobile app using Flutter MVVM architecture.
All AI features must:
- Provide clear confidence indicators
- Degrade gracefully to manual alternatives
- Never make medical claims without professional review
- Log all AI decisions for audit
- Include explicit user feedback mechanisms
This constitution becomes the context for everything that follows. Your AI coding assistant now understands the constraints.
2. Define Requirements (/speckit.specify)
Now you can describe what you want to build, and SpecKit helps structure it:
/speckit.specify Create a video recommendation engine that suggests relevant parent training content based on the child’s diagnosis, parent’s stated challenges, and engagement history.
The system should learn from both explicit ratings and implicit signals like watch time and video completions.
SpecKit takes this description and generates a structured specification that includes user stories, acceptance criteria, and edge cases. But, and this is crucial, you must review and refine this output.
3. Create Implementation Plan (/speckit.plan)
With the spec defined, create the technical approach:
/speckit.plan Use collaborative filtering combined with content-based filtering.
Store user interactions in Supabase. Generate recommendations using a lightweight
ML model that can run inference in <500ms. Include A/B testing framework to measure recommendation quality improvements.
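As an aside on the plan above: “collaborative filtering combined with content-based filtering” usually comes down to blending two scores per candidate item. A toy sketch of that blend, where the weighting and both scoring functions are assumptions rather than the platform’s actual model:

```python
# A toy sketch of blending collaborative and content-based signals per candidate video.
# The 0.6/0.4 weighting and the scoring callables are illustrative assumptions.

def rank_candidates(user, candidates, collaborative_score, content_score, collab_weight=0.6):
    """Rank candidate videos by a weighted blend of the two signals."""
    def blended(video):
        return (collab_weight * collaborative_score(user, video)
                + (1 - collab_weight) * content_score(user, video))
    return sorted(candidates, key=blended, reverse=True)
```

The A/B testing framework in the plan is what tells you whether that weighting (or any other modelling choice) actually moves the recommendation-quality metrics from the spec.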
4. Generate Tasks (/speckit.tasks)
Break the plan into actionable development tasks, each with clear completion criteria.
5. Implement (/speckit.implement)
Only now do you start coding, with AI assistance that understands the full context.
The beauty of this approach is that it forces you to think through the specification before AI starts generating code. The AI can help draft the spec, suggest edge cases you might have missed, and structure your requirements, but you remain the senior engineer reviewing and approving.
AI-Augmented Specification (With Guard Rails)
Here’s my workflow for using AI to help write user stories:
- Start with the business value
I write the core user story manually—the “as a/I want/so that” part. This forces me to think about the actual user need before any AI gets involved.
- Use AI to explore edge cases
I’ll prompt: “Given this user story about AI-powered expense categorisation, what edge cases and failure modes should I consider?” The AI often surfaces scenarios I hadn’t thought about.
- Draft acceptance criteria together
I’ll ask the AI to suggest acceptance criteria, then ruthlessly edit them. AI is good at remembering patterns it’s seen before (like “should include performance thresholds”), but I need to set the specific numbers based on user research and product requirements.
- Validate against reality
I take the AI-assisted spec and test it against real scenarios. Would this spec have caught the problems we saw in the last sprint? Does it give engineers enough clarity to implement? Can QA test against these criteria?
- Review with the team
Just like AI-generated code needs human review, AI-assisted specifications need team review. The product manager checks if it captures user value. The engineer checks if it’s implementable. The data scientist validates the performance thresholds.
The AI is like a junior product manager — helpful for grunt work and pattern matching, but the senior human must take responsibility for clarity, completeness, and correctness.
From Spec to Success Metrics
Here’s something crucial: your user story’s acceptance criteria should map directly to how you’ll measure success in production.
If your spec says “recommendations achieve >70% CTR,” then your monitoring dashboard should track CTR. If you specified graceful degradation when confidence is low, you need telemetry showing how often that happens.
The specifications become your continuous evaluation criteria. This is the “CE” part of CI/CD/CE I mentioned in the previous article. You’re not just shipping code; you’re shipping measurable expectations.
This means your user stories need to include:
- Metrics that can be instrumented (not just “feels personalised”)
- Thresholds that can be monitored (specific numbers, not ranges)
- Alerts that can be automated (degradation triggers)
When your AI feature starts underperforming in production, you’ll know immediately because you specified what “performing” means upfront.
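In practice, that mapping can be as simple as a scheduled check over your telemetry. A minimal sketch, assuming hypothetical `fetch_recommendation_events` and `send_alert` helpers that stand in for whatever event store and alerting channel you already use:

```python
# A minimal sketch of turning acceptance criteria into production alerts.
# `fetch_recommendation_events` and `send_alert` are hypothetical placeholders.

CTR_THRESHOLD = 0.70                 # from the spec: top-3 CTR > 70%
LOW_CONFIDENCE_FALLBACK_MAX = 0.25   # assumed alert level for how often the fallback fires

def check_recommendation_health(fetch_recommendation_events, send_alert):
    events = fetch_recommendation_events(window_hours=24)
    impressions = sum(1 for e in events if e["type"] == "impression")
    clicks = sum(1 for e in events if e["type"] == "click")
    fallbacks = sum(1 for e in events if e["type"] == "low_confidence_fallback")

    ctr = clicks / max(impressions, 1)
    fallback_rate = fallbacks / max(impressions, 1)

    if ctr < CTR_THRESHOLD:
        send_alert(f"Top-3 CTR dropped to {ctr:.0%}, below the specified 70%")
    if fallback_rate > LOW_CONFIDENCE_FALLBACK_MAX:
        send_alert(f"Low-confidence fallback shown for {fallback_rate:.0%} of impressions")
```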
The Human Responsibility
I want to be clear about something: AI can augment specification work, but it cannot replace the human responsibility to ensure specifications are clear, complete, and executable.
Why? Because specifications require judgment about:
- Trade-offs: Should we prioritise accuracy or speed? The AI can’t decide — someone with business context must.
- User empathy: What will frustrate users versus delight them? AI can guess based on patterns, but you know your users.
- Technical feasibility: Is this actually buildable with our stack and timeline? Your engineers need to weigh in.
- Ethical implications: Could this specification lead to biased outcomes? Humans must take responsibility here.
Treat AI-assisted specification like AI-assisted coding: it’s a powerful multiplier, but the human remains accountable for the output.
Start Here
If you’re building AI features and haven’t updated how you write user stories, start with one change: add explicit acceptance criteria for uncertainty.
Take any AI feature you’re working on and ask:
- What’s the minimum acceptable performance level?
- What should happen when the AI isn’t confident?
- How will users provide feedback to improve the system?
- What failure modes could hurt users?
Answer these questions in your user story before you write a line of code.
You might find that simply thinking through these questions changes what you decide to build. That’s not a bug; it’s the whole point. Specifications are where we decide what’s worth building and what good looks like.
In the AI era, those decisions are more important than ever. Because the code might be non-deterministic, but our standards for delivering value should be rock solid.
Next in this series: “Testing the Untestable” - how to build confidence in systems you can’t perfectly predict.
Additional Resources:
- SpecKit on GitHub - Structured specification framework for AI-assisted development
- Extreme Programming Explained by Kent Beck - The original case for user stories