AI/ML Product Metrics Framework
How to define and connect technical model metrics to business outcomes for AI-powered features.
When to Use This Framework
Use this when asked: "How would you measure the success of a recommendation system?", "How do you define good metrics for an AI feature?", "Our model accuracy improved but engagement dropped — what happened?"
The Core Insight: Two Metric Layers
Every AI feature has two distinct layers of metrics that must both be healthy. Candidates who only discuss one layer will miss the real PM skill being tested.
Layer 1 (model quality metrics): Is the model technically performing well?
Layer 2 (business and user metrics): Is the model creating value for users and the business?
These layers can diverge. A model can improve on precision and still hurt the user experience, or vice versa. Your job as a PM is to connect both.
Layer 1: Model Quality Metrics
Classification Metrics
Used for models that predict a category or label (spam / not spam, fraud / not fraud, churn / not churn).
Accuracy: Percentage of predictions that are correct. Simple but misleading for imbalanced datasets. If 99% of emails are not spam, a model that always predicts "not spam" has 99% accuracy but catches zero spam.
Precision: Of all the times the model predicted "positive," how often was it right? High precision = few false positives.
Recall (Sensitivity): Of all the actual positive cases, how many did the model catch? High recall = few false negatives.
F1 Score: The harmonic mean of precision and recall. Use when you need to balance both.
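The four definitions above can be computed directly from confusion-matrix counts. The following is a minimal sketch, not a reference implementation: the function name `classification_metrics` and the sample counts are illustrative assumptions. It reproduces the "always predicts not spam" case, scaled to 1,000 emails with 10 spam.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of everything predicted positive, how much was right?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was caught?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical data: 1,000 emails, 10 are spam, model always says "not spam".
always_not_spam = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# Hypothetical data: a model that catches 8 of the 10 spam emails
# and wrongly flags 2 legitimate ones.
real_filter = classification_metrics(tp=8, fp=2, fn=2, tn=988)

print(always_not_spam)
print(real_filter)
```

The first model scores high on accuracy while its recall is zero, which is the imbalance trap described under Accuracy.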
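The precision-recall tradeoff: Raising the decision threshold usually increases precision and never increases recall, because the model flags fewer cases and only the ones it is most confident about. Lowering the threshold does the opposite. As a PM, you choose the threshold based on the cost of each type of error.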
Example: A spam filter with low precision annoys users by blocking legitimate email. A spam filter with low recall fails to protect users. Which error is worse depends on your product's context and user expectations.
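One way to make the threshold choice concrete is to assign a cost to each error type and pick the threshold with the lowest total cost. This is a sketch under stated assumptions: the scores, labels, candidate thresholds, and cost values are invented for illustration, and `pick_threshold` is a hypothetical helper, not part of any library.

```python
def confusion_at_threshold(scores, labels, threshold):
    """Count TP/FP/FN/TN when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def pick_threshold(scores, labels, cost_fp, cost_fn, candidates):
    """Return (threshold, total_cost) minimizing the weighted error cost."""
    best = None
    for threshold in candidates:
        _, fp, fn, _ = confusion_at_threshold(scores, labels, threshold)
        cost = fp * cost_fp + fn * cost_fn
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical model scores (probability of spam) and true labels (1 = spam).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
candidates = [0.25, 0.5, 0.75, 0.85]

# Blocking a legitimate email is assumed 5x worse than missing a spam email.
strict = pick_threshold(scores, labels, cost_fp=5, cost_fn=1, candidates=candidates)
# Missing spam is assumed 5x worse than blocking a legitimate email.
lenient = pick_threshold(scores, labels, cost_fp=1, cost_fn=5, candidates=candidates)

print(strict)
print(lenient)
```

The same model and data yield a high threshold under the first cost assumption and a low one under the second, which is why the threshold is a product decision rather than a purely technical one.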
AUC-ROC: Area Under the Receiver Operating Characteristic curve. Measures how well the model ranks positives above negatives across all thresholds. A value of 1.0 is perfect; 0.5 is random. Useful for comparing models independent of threshold choice.
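The ranking interpretation of AUC-ROC can be sketched directly: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function below is an illustrative, quadratic-time sketch on made-up data; production code would normally rely on a library implementation.

```python
def auc_roc(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # tie counts as half
    return correct / (len(positives) * len(negatives))


# Hypothetical scores and labels (1 = positive).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(auc_roc(scores, labels))  # 13 of 16 pairs are ordered correctly

# A model that ranks every positive above every negative.
print(auc_roc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```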
Ranking and Recommendation Metrics
Used for feeds, search results, and recommendation systems.
Precision@K: Of the top K results shown, what fraction were relevant?
Recall@K: Of all relevant items, what fraction appeared in the top K?
NDCG (Normalized Discounted Cumulative Gain): Measures ranking quality, weighting items shown higher more heavily. More informative than Precision@K because it accounts for position: a relevant item in slot 1 counts for more than the same item in slot 10.
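The difference between these three metrics is easiest to see on two rankings that contain the same relevant items in different positions. The sketch below assumes binary relevance (an item is either relevant or not) and uses hypothetical item IDs; graded-relevance NDCG would need a relevance score per item.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: position i (0-based) is discounted by log2(i + 2)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


relevant = {"a", "b"}
good_order = ["a", "b", "x", "y"]   # relevant items at the top
bad_order = ["x", "y", "a", "b"]    # same items, pushed to the bottom

for name, ranking in [("good", good_order), ("bad", bad_order)]:
    print(name,
          precision_at_k(ranking, relevant, 4),
          recall_at_k(ranking, relevant, 4),
          round(ndcg_at_k(ranking, relevant, 4), 3))
```

Both rankings have identical Precision@4 and Recall@4, but NDCG@4 is lower for the second one because its relevant items sit in lower positions.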
Coverage: What percentage of the total item catalog does the model ever recommend? Low coverage means the model concentrates on a small slice of the catalog, which is an early warning sign of filter bubbles and popularity bias.
Diversity: Are recommendations varied, or does the model keep surfacing the same items? Diversity metrics prevent recommendation fatigue.
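Coverage and diversity can both be approximated with simple set arithmetic. In this sketch, diversity is measured as the share of distinct categories within one recommendation list, which is only one of several possible definitions; the item IDs, category mapping, and catalog size are hypothetical.

```python
def catalog_coverage(recommendation_lists, catalog_size):
    """Share of the catalog that appears in at least one recommendation list."""
    recommended = set()
    for items in recommendation_lists:
        recommended.update(items)
    return len(recommended) / catalog_size


def category_diversity(items, item_category):
    """Distinct categories in a list divided by list length (1.0 = all different)."""
    if not items:
        return 0.0
    return len({item_category[item] for item in items}) / len(items)


# Hypothetical recommendations served to three users from a 10-item catalog.
served = [["a", "b", "c"], ["a", "b", "d"], ["a", "c", "d"]]
item_category = {"a": "news", "b": "news", "c": "sports", "d": "music"}

coverage = catalog_coverage(served, catalog_size=10)
diversity_first_user = category_diversity(served[0], item_category)

print(coverage)
print(diversity_first_user)
```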
Regression and Generation Metrics
MAE / RMSE: For continuous predictions (price estimates, time predictions). Mean Absolute Error is more interpretable; RMSE penalizes large errors more heavily.
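The difference between the two error measures shows up when errors are unevenly distributed. The sketch below uses made-up delivery-time predictions: both hypothetical models have the same MAE, but the one with a single large miss has a higher RMSE.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average size of the miss."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: squaring makes large misses dominate."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical delivery times in minutes.
actual = [10, 12, 11, 13]
model_a = [11, 13, 10, 12]   # off by 1 minute every time
model_b = [10, 12, 11, 17]   # perfect three times, off by 4 minutes once

print(mae(actual, model_a), rmse(actual, model_a))
print(mae(actual, model_b), rmse(actual, model_b))
```

If one badly wrong estimate hurts users more than several slightly wrong ones, RMSE is the more faithful metric for this comparison.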
BLEU / ROUGE: For text generation tasks. Both measure n-gram overlap between generated and reference text. BLEU is precision-oriented and most common in translation; ROUGE is recall-oriented and most common in summarization. Note: a high score does not guarantee high user satisfaction. These are proxies.
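To see why overlap metrics are only proxies, consider a heavily simplified version restricted to single words. This sketch is not BLEU or ROUGE themselves: it omits higher-order n-grams, BLEU's brevity penalty, and ROUGE's variants, and the function name `unigram_overlap` and the sentences are illustrative assumptions.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Return (precision, recall) of clipped word overlap between two texts."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter & Counter keeps the minimum count per word (clipped overlap).
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    return precision, recall


reference = "the cat sat on the mat"
fluent = "the cat is on the mat"
scrambled = "mat the on sat cat the"

print(unigram_overlap(fluent, reference))
print(unigram_overlap(scrambled, reference))
```

The scrambled sentence scores higher than the fluent one because word-level overlap ignores order. Real BLEU and ROUGE reduce this problem with longer n-grams but do not eliminate it, so they should be paired with user-facing measures.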
Layer 2: Business and User Metrics
Model quality metrics are internal. Users do not experience precision and recall — they experience outcomes.
Map each model quality metric to a user outcome:
- Better spam precision → fewer false blocks → fewer support tickets from "my email disappeared"
- Better recall for recommendations → users find content they love → higher session time and return rate
- Lower latency in a prediction API → faster page load → higher conversion
Apply the HEART and AARRR frameworks to AI features the same way you would for any other product feature. The AI layer is invisible to users; what they experience is the product surface.
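Watch for divergence signals: cases where a Layer 1 metric moves one way and a Layer 2 metric moves the other, such as model accuracy improving while engagement drops.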
Feedback Loops and Data Flywheel
AI products often create self-reinforcing feedback loops. Understand both the upside and the risk.
Positive flywheel: More users → more interaction data → better model → better experience → more users.
Harmful feedback loop (filter bubble): Recommendations drive clicks → model learns to recommend similar items → users see less diversity → engagement narrows → eventual churn. The loop is self-reinforcing like the flywheel above, but it compounds a bad outcome.
Popularity bias: If the model only trains on what users click, it will over-recommend already-popular items and under-surface new or niche content. Counter this with explicit exploration signals (serve some random or diverse recommendations intentionally).
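One common way to add exploration is an epsilon-greedy policy: most slots are filled from the model's ranking, and a small share are filled at random from the rest of the catalog. The sketch below is illustrative only; `recommend_with_exploration`, the `epsilon` value, and the item IDs are assumptions, and real systems often use more sophisticated bandit methods.

```python
import random


def recommend_with_exploration(ranked_items, catalog, k, epsilon, rng):
    """Fill k slots; each slot is random with probability epsilon, else top-ranked."""
    chosen = []
    for _ in range(k):
        unseen_ranked = [item for item in ranked_items if item not in chosen]
        unseen_catalog = [item for item in catalog if item not in chosen]
        if not unseen_ranked and not unseen_catalog:
            break  # nothing left to recommend
        explore = rng.random() < epsilon
        if (explore or not unseen_ranked) and unseen_catalog:
            # Exploration slot: any item not yet shown, popular or not.
            chosen.append(rng.choice(unseen_catalog))
        else:
            # Exploitation slot: the model's best remaining item.
            chosen.append(unseen_ranked[0])
    return chosen


catalog = list("abcdefghij")            # hypothetical 10-item catalog
model_ranking = ["a", "b", "c", "d", "e"]  # hypothetical model output
rng = random.Random(0)                  # seeded for repeatable runs

no_exploration = recommend_with_exploration(model_ranking, catalog, 3, 0.0, rng)
some_exploration = recommend_with_exploration(model_ranking, catalog, 3, 0.2, rng)
print(no_exploration)
print(some_exploration)
```

The interactions collected from exploration slots give the model training signal on items it would otherwise never show, at the cost of some short-term relevance. The right `epsilon` is a product tradeoff.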
Monitoring and Model Drift
Unlike traditional software, AI models degrade silently over time as the real world changes.
Data drift: The distribution of input data shifts (e.g., users start behaving differently after a major product change or world event). The model was not trained on this distribution and begins making worse predictions.
Concept drift: The relationship between inputs and the right output changes (e.g., what counts as spam evolves as spammers adapt).
PM responsibility: Define monitoring dashboards before launch. Set alerts for when model quality metrics drop below acceptable thresholds. Schedule regular retraining cycles rather than reacting only to outages.
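A drift alert can be as simple as comparing the distribution of one input feature at training time with its distribution in live traffic. The sketch below uses the Population Stability Index (PSI), one of several possible drift measures. The bucket counts are invented, and the alert threshold of 0.2 is a commonly cited rule of thumb that is treated here as an assumption to be tuned per product.

```python
import math


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between two histograms that share the same buckets."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    psi = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # eps guards against log(0) when a bucket is empty.
        expected_share = max(expected / expected_total, eps)
        actual_share = max(actual / actual_total, eps)
        psi += (actual_share - expected_share) * math.log(actual_share / expected_share)
    return psi


ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb value, not a universal standard

# Hypothetical histogram of a feature (e.g. session length) in four buckets.
training_counts = [25, 25, 25, 25]
live_counts = [10, 20, 30, 40]

psi = population_stability_index(training_counts, live_counts)
print(round(psi, 4), "ALERT" if psi > ALERT_THRESHOLD else "ok")
```

PSI only detects data drift. Concept drift changes the right answer without necessarily changing the inputs, so catching it requires tracking model quality metrics on freshly labeled data as well.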
Common Mistakes to Avoid
- Reporting only model accuracy and ignoring business outcomes
- Ignoring the precision-recall tradeoff and its user-facing implications
- Optimizing engagement metrics without measuring downstream satisfaction or harm
- Not accounting for data drift — assuming a model stays accurate forever
- Conflating high model performance on a test set with good real-world performance
- Missing feedback loop risks — especially filter bubbles in recommendation systems