πŸ“š Table of Contents

  1. The AI Feature That Went Wrong (And Why It’s Not the Model’s Fault)
  2. What AI Actually Is (And What It Isn’t)
  3. LLMs Demystified: The PM’s Mental Model
  4. RAG, Fine-Tuning, and When to Use What
  5. The AI Feature Development Process
  6. Specifying AI Features: The New Requirements
  7. AI Costs: Understanding the Token Economy
  8. Testing and Quality for AI Features
  9. Common AI Product Mistakes
  10. Building Your AI Product Muscles
  11. Questions to Ask Your AI/Engineering Team
  12. The Bottom Line

The AI Feature That Went Wrong (And Why It’s Not the Model’s Fault)

In 2024, I watched a company spend $300,000 building an AI feature that failed spectacularly.

The feature: An AI assistant that would answer customer support questions.

The approach:

  • Fine-tuned GPT-4 on their support documentation
  • 3 months of development
  • $100K in compute costs
  • $200K in engineering time

The launch: Users hated it.

The problems:

  • Hallucinated policies that didn’t exist
  • Confidently wrong about product features
  • Couldn’t access real-time information
  • Recommended discontinued products

The post-mortem: The model wasn’t the problem. The product decisions were:

  1. Fine-tuning was the wrong choice: They should have used RAG (Retrieval Augmented Generation)
  2. No grounding: The AI couldn’t verify its answers against actual data
  3. No feedback loop: Users couldn’t correct wrong answers
  4. Wrong use case: They wanted accurate answers, not creative generation

The naked truth: AI product failures are usually product failures, not model failures. The PM didn’t understand the technology well enough to make the right architecture decisions.

This guide will help you avoid that mistake.


What AI Actually Is (And What It Isn’t)

The Simple Definition

AI (Artificial Intelligence) = Systems that can perform tasks that typically require human intelligence.

For product managers, you’re mostly dealing with:

| Type | What It Does | Examples |
| --- | --- | --- |
| LLMs | Generate text, understand language | ChatGPT, Claude, GPT-4 |
| Image Models | Generate/analyze images | DALL-E, Midjourney, Stable Diffusion |
| Embedding Models | Convert text to numbers for similarity | text-embedding-ada-002 |
| Speech Models | Convert speech to text and back | Whisper, TTS systems |

What AI Can Do Well

βœ… Generate text (write emails, summarize, translate)
βœ… Classify (sentiment, categories, intent)
βœ… Extract information (entities, data points)
βœ… Answer questions (with right context)
βœ… Create variations (rewrite, reformat, adapt)

What AI Can’t Do Well

❌ Guarantee accuracy (hallucinations happen)
❌ Reason deeply (it predicts, doesn’t think)
❌ Access real-time data (training cutoffs exist)
❌ Understand context perfectly (limited by context window)
❌ Be reliable in the same way as code (probabilistic, not deterministic)

The Key Insight

AI is probabilistic. Traditional software is deterministic.

Traditional Software:
Input β†’ Code β†’ Same Output Every Time

AI Software:
Input β†’ Model β†’ Different Output Every Time (Usually similar, but not guaranteed)

This fundamental difference changes everything about how you spec, test, and deploy AI features.


LLMs Demystified: The PM’s Mental Model

What an LLM Actually Is

LLM = Large Language Model = A text predictor trained on lots of text

The mental model: Imagine an auto-complete on steroids.

When you type “The cat sat on the”, an LLM predicts what comes next based on:

  • What it’s seen before in training
  • The context you’ve provided
  • The probability of each possible next word

It doesn’t “understand” anything. It predicts what text should come next.

The Key Concepts for PMs

1. Context Window

The amount of text the model can “see” at once.

| Model | Context Window | What It Means |
| --- | --- | --- |
| GPT-4 Turbo | 128K tokens | ~300 pages of text |
| GPT-4 | 8K tokens | ~20 pages |
| Claude 3 | 200K tokens | ~500 pages |

Product implication: You can pass documents, conversation history, or data within the context window. Larger is better for RAG, but more expensive.

2. Tokens

Text is broken into tokens (roughly 4 characters per token).

Cost calculation:

1,000 tokens β‰ˆ 750 words
GPT-4: $0.03 per 1K input tokens
Cost for 100K words: ~$4

3. Temperature

Controls randomness in output.

| Temperature | Behavior | Use Case |
| --- | --- | --- |
| 0 | Deterministic, consistent | Factual answers, classification |
| 0.5 | Balanced | General use |
| 1.0+ | Creative, varied | Brainstorming, creative writing |

Product implication: Use low temperature for factual tasks, high for creative tasks.
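A toy sketch of what temperature actually does under the hood: the model scores every candidate next token, and temperature reshapes those scores before sampling. The logit values here are made up for illustration; real models score tens of thousands of tokens.

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float) -> str:
    """Sample the next token from hypothetical logit scores.

    Temperature 0 is treated as greedy decoding (always the top token);
    higher temperatures flatten the distribution and increase variety.
    """
    if temperature <= 0:
        return max(logits, key=logits.get)
    # Softmax with temperature scaling.
    weights = {t: math.exp(score / temperature) for t, score in logits.items()}
    total = sum(weights.values())
    r = random.random() * total
    for token, w in weights.items():
        r -= w
        if r <= 0:
            return token
    return token  # float-rounding fallback: last token

logits = {"mat": 2.0, "roof": 1.0, "moon": 0.1}
print(sample_next_token(logits, 0))    # always "mat" (greedy)
print(sample_next_token(logits, 1.5))  # usually "mat", sometimes the others
```

This is why temperature 0 is the right default for classification and factual Q&A: the output becomes repeatable run to run.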

4. Prompt

The instructions you give the model.

Prompt engineering matters. Same model, different prompts = completely different results.

Bad Prompt:
"Write an email about our product"

Good Prompt:
"You are a helpful customer success manager. Write a follow-up email to 
a customer who expressed interest in our enterprise plan. 
Include:
- Reference to their specific use case (API integration)
- Clear next steps (schedule demo)
- Professional but friendly tone
Keep it under 200 words."
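In practice, prompts like the good example above are rarely hand-typed; they are assembled from structured pieces so product requirements map cleanly onto prompt sections. A minimal sketch, with a hypothetical helper name:

```python
def build_prompt(role: str, task: str, requirements: list[str], limit_words: int) -> str:
    """Assemble a structured prompt: role, task, explicit requirements, length cap."""
    lines = [f"You are {role}.", task, "Include:"]
    lines += [f"- {r}" for r in requirements]
    lines.append(f"Keep it under {limit_words} words.")
    return "\n".join(lines)

prompt = build_prompt(
    role="a helpful customer success manager",
    task=("Write a follow-up email to a customer who expressed interest "
          "in our enterprise plan."),
    requirements=[
        "Reference to their specific use case (API integration)",
        "Clear next steps (schedule demo)",
        "Professional but friendly tone",
    ],
    limit_words=200,
)
print(prompt)
```

Templating like this also makes prompts testable: each section can be asserted on, versioned, and A/B tested independently.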

RAG, Fine-Tuning, and When to Use What

The Three Approaches to AI Features

| Approach | What It Is | When to Use |
| --- | --- | --- |
| Prompt Engineering | Craft good prompts | Simple use cases, prototypes |
| RAG | Retrieve relevant data, then generate | When you need accurate, current info |
| Fine-tuning | Train model on your data | When you need specific behavior/style |

RAG (Retrieval Augmented Generation)

What it is: Retrieve relevant documents, pass them to the LLM, then generate a response.

How it works:

User Query
    ↓
Convert query to embedding (numbers)
    ↓
Search vector database for similar documents
    ↓
Pass documents + query to LLM
    ↓
LLM generates answer based on retrieved documents
    ↓
Response
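The retrieval step of that pipeline can be sketched in a few lines. This is a toy version: the "embeddings" are made-up 3-dimensional vectors, where in production they would come from an embedding model (e.g. text-embedding-ada-002) and live in a vector database.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: how close two embedding vectors point."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "vector database": document text -> pretend embedding.
docs = {
    "Refund policy: refunds within 30 days.": [0.9, 0.1, 0.0],
    "Shipping: orders arrive in 3-5 days.":   [0.1, 0.9, 0.0],
}

def retrieve(query_embedding: list[float], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query embedding."""
    ranked = sorted(docs, key=lambda d: cosine(docs[d], query_embedding), reverse=True)
    return ranked[:k]

def build_rag_prompt(query: str, query_embedding: list[float]) -> str:
    """Ground the LLM by pasting retrieved documents into the prompt."""
    context = "\n".join(retrieve(query_embedding))
    return f"Answer using ONLY these documents:\n{context}\n\nQuestion: {query}"
```

The "ONLY these documents" instruction is the grounding step: it constrains the model to the retrieved text instead of its training data, which is what reduces hallucinated policies.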

When to use RAG:

  • Need accurate, up-to-date information
  • Want to cite sources
  • Information changes frequently
  • Need to control what the AI knows

Example: Customer support chatbot that answers based on your actual documentation.


Fine-Tuning

What it is: Train a model on your specific data to change its behavior.

How it works:

Base Model (e.g., GPT-4)
    ↓
Train on your examples (inputs β†’ desired outputs)
    ↓
Fine-tuned model that behaves more like you want

When to use fine-tuning:

  • Need specific output format consistently
  • Want specific style or voice
  • Behavior you want can’t be achieved with prompting
  • Have lots of training examples (100+ minimum, 1000+ better)

Example: An AI that writes in your company’s brand voice.


Decision Framework

What do you need?
β”‚
β”œβ”€β”€ Accurate, current information?
β”‚   └── YES β†’ RAG
β”‚
β”œβ”€β”€ Specific behavior or style?
β”‚   └── YES β†’ Fine-tuning (or both)
β”‚
β”œβ”€β”€ Simple, one-off task?
β”‚   └── YES β†’ Prompt engineering
β”‚
└── Complex combination?
    └── Hybrid: RAG + Fine-tuning + Good prompts
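The decision tree above can be captured as a small function, useful as a discussion artifact with your engineering team. The exact thresholds are judgment calls, not hard rules:

```python
def choose_approach(needs_current_info: bool,
                    needs_specific_style: bool,
                    is_simple_task: bool) -> str:
    """Map the decision framework above onto a recommended approach."""
    if needs_current_info and needs_specific_style:
        return "Hybrid: RAG + fine-tuning + good prompts"
    if needs_current_info:
        return "RAG"
    if needs_specific_style:
        return "Fine-tuning"
    if is_simple_task:
        return "Prompt engineering"
    return "Start with prompt engineering, escalate if needed"

# A support chatbot needs accurate, current docs but no special voice:
print(choose_approach(needs_current_info=True,
                      needs_specific_style=False,
                      is_simple_task=False))  # RAG
```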

The Cost Comparison

| Approach | Setup Cost | Ongoing Cost | Maintenance |
| --- | --- | --- | --- |
| Prompt Engineering | $0 | Per-token | Update prompts |
| RAG | $5K-20K | Per-token + vector DB | Update documents |
| Fine-tuning | $10K-100K | Per-token (sometimes lower) | Retrain periodically |

The AI Feature Development Process

How AI Development Differs from Traditional Development

| Traditional | AI |
| --- | --- |
| Write code | Design prompts and systems |
| Test outputs | Test behavior across many inputs |
| Debug with logs | Debug with examples |
| Deterministic | Probabilistic |
| Can guarantee behavior | Can only improve probability |

The AI Feature Development Cycle

AI FEATURE DEVELOPMENT CYCLE

1. DEFINE
β”œβ”€β”€ What should the AI do?
β”œβ”€β”€ What does "good" look like?
└── What are the failure modes?

2. PROTOTYPE
β”œβ”€β”€ Test with simple prompts
β”œβ”€β”€ Try different models
└── Measure baseline performance

3. ITERATE
β”œβ”€β”€ Improve prompts
β”œβ”€β”€ Add RAG if needed
β”œβ”€β”€ Fine-tune if needed
└── Measure improvement

4. EVALUATE
β”œβ”€β”€ Create test set
β”œβ”€β”€ Define success metrics
└── Measure against thresholds

5. DEPLOY
β”œβ”€β”€ Monitoring and logging
β”œβ”€β”€ Feedback mechanisms
└── Iteration plan

6. IMPROVE
β”œβ”€β”€ Collect failure cases
β”œβ”€β”€ Update system
└── Repeat

The Timeline Reality

| Phase | Traditional Feature | AI Feature |
| --- | --- | --- |
| Define | 1-2 days | 3-5 days |
| Build | 1-2 weeks | 2-4 weeks |
| Test | 2-3 days | 1-2 weeks |
| Iterate | Optional | Mandatory |
| Total | 2-3 weeks | 4-8 weeks |

AI takes longer because you’re not just buildingβ€”you’re training and tuning.


Specifying AI Features: The New Requirements

The AI Feature Spec Template

## AI Feature: [Name]

### Problem Statement
[What user problem does this solve?]

### AI Component
**Input:** What goes into the model?
**Output:** What comes out?
**Behavior:** How should it act?

### Examples of Good Behavior
[Provide 5-10 examples of ideal inputs and outputs]

Example 1:
- Input: "How do I reset my password?"
- Output: "To reset your password, click 'Forgot Password' on the login page..."

### Examples of Bad Behavior
[Provide 5-10 examples of what NOT to do]

Example 1:
- Input: "How do I reset my password?"
- Output: "Here's a link to reset: [made-up URL]" (Hallucination)

### Success Metrics
- Accuracy: % of responses that are correct
- Relevance: % of responses that address the question
- User satisfaction: Feedback score
- Latency: P95 response time

### Failure Modes
[What could go wrong?]
- Hallucination: Model makes up information
- Out of scope: Model answers questions it shouldn't
- Inconsistency: Same question, different answers

### Mitigation Strategies
[How do we prevent/handle failures?]
- Grounding: RAG with verified documents
- Guardrails: Prompt constraints
- Feedback: User can flag wrong answers

### Non-AI Components
[What traditional code is needed?]
- Input validation
- Output formatting
- Error handling
- Logging/monitoring

AI Costs: Understanding the Token Economy

The Token Economics

Input tokens: Text you send to the model
Output tokens: Text the model generates

Cost structure:

  • Input tokens: ~$0.01-0.03 per 1K tokens
  • Output tokens: ~$0.03-0.06 per 1K tokens

Cost Estimation Framework

AI COST CALCULATOR

Feature: Customer Support Chatbot

USAGE ESTIMATES:
β”œβ”€β”€ Queries per day: 10,000
β”œβ”€β”€ Average input tokens: 500
└── Average output tokens: 200

DAILY COST:
β”œβ”€β”€ Input: 10K Γ— 500 Γ— $0.01/1K = $50
└── Output: 10K Γ— 200 Γ— $0.03/1K = $60
Total: $110/day

MONTHLY COST: ~$3,300
YEARLY COST: ~$40,000

OPTIMIZATION OPTIONS:
β”œβ”€β”€ Use smaller model for simple queries: -$20K/year
β”œβ”€β”€ Cache common responses: -$10K/year
└── Optimize prompts (shorter): -$5K/year

OPTIMIZED YEARLY COST: ~$5,000
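The calculator above reduces to a few lines of arithmetic you can adapt to your own usage estimates and pricing (the per-1K prices here are the illustrative ones used throughout this guide):

```python
def daily_cost(queries: int, in_tokens: int, out_tokens: int,
               in_price_per_1k: float = 0.01,
               out_price_per_1k: float = 0.03) -> float:
    """Daily spend: (tokens sent + tokens generated) at per-1K-token prices."""
    input_cost = queries * in_tokens / 1000 * in_price_per_1k
    output_cost = queries * out_tokens / 1000 * out_price_per_1k
    return input_cost + output_cost

d = daily_cost(queries=10_000, in_tokens=500, out_tokens=200)
print(round(d))        # 110  -> $110/day
print(round(d * 30))   # 3300 -> ~$3,300/month
print(round(d * 365))  # 40150 -> ~$40K/year
```

Run it with your own estimates before committing to an architecture; the yearly number often surprises stakeholders.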

Cost Optimization Strategies

| Strategy | Savings | Trade-off |
| --- | --- | --- |
| Smaller models | 50-80% | Lower quality for simple tasks |
| Prompt caching | 30-50% | Only works for repeated queries |
| Shorter prompts | 10-30% | May need more examples |
| Batch processing | 20-40% | Not real-time |
| Local models | 80-100% | Higher infra cost, more complexity |

Testing and Quality for AI Features

The AI Testing Challenge

Traditional software: Same input β†’ Same output β†’ Easy to test

AI software: Same input β†’ Different output β†’ Hard to test

The AI Testing Framework

1. Create a Test Set

Test Set: 100+ examples covering:
β”œβ”€β”€ Happy path queries
β”œβ”€β”€ Edge cases
β”œβ”€β”€ Adversarial inputs
β”œβ”€β”€ Out-of-scope questions
└── Multi-turn conversations

2. Define Evaluation Criteria

| Criterion | Definition | How to Measure |
| --- | --- | --- |
| Accuracy | Is the answer correct? | Manual review or fact-checking |
| Relevance | Does it address the question? | Human rating or classifier |
| Safety | Is it harmful? | Safety classifier + manual review |
| Tone | Is it appropriate? | Human rating |
| Conciseness | Is it appropriately brief? | Word count + human rating |

3. Run Evaluation

For each test case:
β”œβ”€β”€ Run through system
β”œβ”€β”€ Evaluate against criteria
β”œβ”€β”€ Record pass/fail
└── Identify failure patterns

4. Iterate

Identify failure patterns
    ↓
Update prompts/system
    ↓
Re-evaluate
    ↓
Repeat until pass rate > threshold
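The evaluation loop above can be sketched as a small harness. The system and the criterion here are toys; real criteria would include human review or an LLM-as-judge step, and real test sets run to hundreds of cases.

```python
def evaluate(system, test_set, criteria) -> float:
    """Run every test case through the system; return the pass rate (0.0-1.0).

    A case passes only if ALL criteria pass, matching the framework above.
    """
    passed = 0
    for case in test_set:
        output = system(case["input"])
        if all(check(case, output) for check in criteria):
            passed += 1
    return passed / len(test_set)

def toy_system(query: str) -> str:
    """Stand-in for the AI feature under test."""
    if "password" in query.lower():
        return "Click 'Forgot Password' on the login page."
    return "I don't know."

def mentions_expected(case, output) -> bool:
    """One simple criterion: the output contains an expected phrase."""
    return case["expected_phrase"].lower() in output.lower()

test_set = [
    {"input": "How do I reset my password?", "expected_phrase": "Forgot Password"},
    {"input": "What is your refund policy?", "expected_phrase": "30 days"},
]
print(evaluate(toy_system, test_set, [mentions_expected]))  # 0.5 -> one of two passes
```

The failing case (refund policy) is exactly the kind of example you feed back into step 4 of the cycle: add the missing document to your RAG index, then re-run the harness.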

The Quality Threshold

| Feature Type | Minimum Pass Rate |
| --- | --- |
| Internal tool | 70% |
| Customer-facing | 85% |
| High-stakes (medical, legal) | 95%+ |

Common AI Product Mistakes

Mistake 1: Expecting Determinism

What happens: You expect the AI to give the same answer every time.

The reality: AI is probabilistic. Same question can yield different answers.

The fix: Design for variability. Set expectations with users. Test with multiple runs.


Mistake 2: Ignoring Hallucinations

What happens: You ship an AI that confidently makes things up.

The reality: All LLMs hallucinate sometimes. It’s not a bug; it’s inherent to how they work.

The fix: Ground responses in real data (RAG). Show sources. Let users verify.


Mistake 3: No Feedback Mechanism

What happens: You don’t know when the AI fails.

The reality: Users experience failures silently. You never learn or improve.

The fix: Thumbs up/down, report issues, collect examples.


Mistake 4: Wrong Architecture Choice

What happens: You fine-tune when you should use RAG. Or vice versa.

The reality: Architecture choice determines success.

The fix: Use the decision framework earlier in this guide.


Mistake 5: No Cost Controls

What happens: Your AI feature becomes unexpectedly expensive.

The reality: AI costs scale with usage. No controls = budget surprise.

The fix: Set usage limits. Monitor costs. Have an optimization plan.


Building Your AI Product Muscles

What to Learn

Week 1: Fundamentals

  • Use ChatGPT and Claude extensively
  • Understand prompting
  • Try different temperatures

Week 2: Technical Concepts

  • Learn about embeddings
  • Understand RAG
  • Explore tokenization

Week 3: Hands-On

  • Build a simple RAG system
  • Experiment with prompt engineering
  • Test different models

Week 4: Strategic

  • Analyze AI costs for your use case
  • Build an AI feature spec
  • Present to stakeholders

Resources for PMs

Non-Technical:

  • “Co-Intelligence” by Ethan Mollick
  • “The AI Playbook” courses
  • OpenAI Cookbook (practical examples)

Semi-Technical:

  • “Designing Machine Learning Systems” (Chip Huyen)
  • Andrej Karpathy’s “Intro to LLMs” YouTube
  • LangChain documentation

Questions to Ask Your AI/Engineering Team

About the Approach

  • “Are we using RAG or fine-tuning? Why?”
  • “What model are we using? Why that one?”
  • “How do we handle hallucinations?”

About Quality

  • “How do we test this feature?”
  • “What’s our accuracy on the test set?”
  • “What are the known failure modes?”

About Cost

  • “What’s the cost per query?”
  • “How does cost scale with usage?”
  • “What’s our optimization plan?”

About Deployment

  • “How do we monitor performance?”
  • “How do we gather user feedback?”
  • “What’s our iteration plan?”

The Bottom Line

AI product management is product managementβ€”with new tools and new trade-offs.

You need to understand:

  1. What AI can and can’t do
  2. The different approaches (RAG vs. fine-tuning)
  3. How to specify AI behavior
  4. How to measure AI quality
  5. How to manage AI costs

You don’t need to:

  • Implement models yourself
  • Understand the math
  • Build the infrastructure

The shift: When you understand AI technically enough to make good product decisions, you become the PM who can actually ship AI features that work.

Start today: Use AI tools extensively. Build something small. Learn by doing.


What AI feature are you building? What’s your biggest challenge?
