
AI/ML Product Metrics Framework

How to define and connect technical model metrics to business outcomes for AI-powered features.

When to Use This Framework

Use this when asked: "How would you measure the success of a recommendation system?", "How do you define good metrics for an AI feature?", "Our model accuracy improved but engagement dropped — what happened?"

The Core Insight: Two Metric Layers

Every AI feature has two distinct layers of metrics that must both be healthy. Candidates who only discuss one layer will miss the real PM skill being tested.

Layer 1: Model quality metrics. Is the model technically performing well?

Layer 2: Business and user metrics. Is the model creating value for users and the business?

These layers can diverge. A model can improve on precision and still hurt the user experience, or vice versa. Your job as a PM is to connect both.

Layer 1: Model Quality Metrics

Classification Metrics

Used for models that predict a category or label (spam / not spam, fraud / not fraud, churn / not churn).

Accuracy: Percentage of predictions that are correct. Simple but misleading for imbalanced datasets. If 99% of emails are not spam, a model that always predicts "not spam" has 99% accuracy but catches zero spam.
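
The imbalance problem above can be checked with a few lines of Python. This is a minimal sketch with made-up data (1,000 emails, 10 of them spam); the function names are illustrative and not taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model caught."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


# Assumed toy dataset: 1 = spam, 0 = not spam.
y_true = [1] * 10 + [0] * 990
always_not_spam = [0] * 1000  # a "model" that never flags anything

baseline_accuracy = accuracy(y_true, always_not_spam)  # 0.99
baseline_recall = recall(y_true, always_not_spam)      # 0.0
```

The do-nothing model looks excellent on accuracy and useless on recall, which is why accuracy alone is a weak headline metric for rare-event problems.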

Precision: Of all the times the model predicted "positive," how often was it right? High precision = few false positives.

Recall (Sensitivity): Of all the actual positive cases, how many did the model catch? High recall = few false negatives.

F1 Score: The harmonic mean of precision and recall. Use when you need to balance both.
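
As a worked example of the three definitions above, here is a small sketch that computes precision, recall, and F1 from confusion-matrix counts. The fraud numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return precision, recall, f1


# Hypothetical fraud model: 80 frauds flagged correctly, 20 false alarms, 120 frauds missed.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=120)
# precision = 80 / 100 = 0.8, recall = 80 / 200 = 0.4, f1 is roughly 0.53
```

Because F1 is a harmonic mean, it sits closer to the weaker of the two numbers (here recall), so a model cannot hide poor recall behind strong precision.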

The precision-recall tradeoff: Raising the decision threshold typically increases precision and always keeps recall the same or lowers it, because fewer items are flagged as positive. Lowering it does the opposite. As a PM, you choose the threshold based on the cost of each type of error.

Example: A spam filter with low precision annoys users by blocking legitimate email. A spam filter with low recall fails to protect users. Which error is worse depends on your product's context and user expectations.
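
To see the threshold tradeoff concretely, the sketch below scores eight emails and evaluates two thresholds. The scores and labels are invented for illustration, and `precision_recall_at` is a hypothetical helper, not a library function.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    preds = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Assumed model scores (probability of spam) and true labels (1 = spam).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

lenient = precision_recall_at(scores, labels, threshold=0.5)   # (0.8, 1.0)
strict = precision_recall_at(scores, labels, threshold=0.85)   # (1.0, 0.5)
```

The strict threshold never blocks a legitimate email in this toy data but lets half the spam through; the lenient one catches all spam at the cost of one false block. Which operating point is right is a product decision, not a modeling one.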

AUC-ROC: Area Under the Receiver Operating Characteristic curve. Measures how well the model ranks positives above negatives across all thresholds. A value of 1.0 is perfect; 0.5 is random. Useful for comparing models independent of threshold choice.
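
One way to build intuition for AUC-ROC is its pairwise interpretation: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative. The sketch below computes it that way on invented data. It is quadratic in the number of examples, so treat it as an illustration rather than something to run on production-sized datasets.

```python
def auc_roc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Assumed scores and labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

auc = auc_roc(scores, labels)  # 14 of 16 pairs ordered correctly = 0.875
perfect = auc_roc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])  # 1.0
```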

Ranking and Recommendation Metrics

Used for feeds, search results, and recommendation systems.

Precision@K: Of the top K results shown, what fraction were relevant?

Recall@K: Of all relevant items, what fraction appeared in the top K?

NDCG (Normalized Discounted Cumulative Gain): Measures ranking quality, weighting items shown higher more heavily. More informative than Precision@K because it accounts for position: two rankings with the same relevant items in the top K can have very different NDCG scores.
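
The three ranking metrics above can be sketched as follows. This version assumes binary relevance (an item is either relevant or not) and the common log2 position discount; graded relevance and other discount choices exist. The item IDs are placeholders.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(1.0 / math.log2(i + 2) for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


relevant = {"a", "b", "c"}
good_order = ["a", "b", "x", "y", "c"]  # relevant items near the top
poor_order = ["x", "y", "a", "b", "c"]  # same items, pushed down

same_precision = precision_at_k(good_order, relevant, 5) == precision_at_k(poor_order, relevant, 5)
ndcg_good = ndcg_at_k(good_order, relevant, 5)
ndcg_poor = ndcg_at_k(poor_order, relevant, 5)
```

Both orderings have identical Precision@5 and Recall@5, but NDCG is noticeably higher for the first one because it rewards placing relevant items early.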

Coverage: What percentage of the total item catalog does the model ever recommend? Low coverage means recommendations are concentrated on a small slice of the catalog, which is a warning sign for filter bubbles and popularity bias.

Diversity: Are recommendations varied, or does the model keep surfacing the same items? Diversity metrics prevent recommendation fatigue.
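
Coverage and diversity have several competing definitions. The sketch below uses two simple ones as assumptions: coverage as the share of the catalog that appears in any user's list, and diversity as the share of item pairs within one list that come from different categories. The catalog and category mapping are invented.

```python
def catalog_coverage(recommendation_lists, catalog):
    """Share of catalog items that appear in at least one recommendation list."""
    recommended = set()
    for recs in recommendation_lists:
        recommended.update(recs)
    return len(recommended & set(catalog)) / len(catalog)


def intra_list_diversity(recs, category_of):
    """Share of item pairs in one list whose categories differ."""
    pairs = [(a, b) for i, a in enumerate(recs) for b in recs[i + 1:]]
    if not pairs:
        return 0.0
    different = sum(1 for a, b in pairs if category_of[a] != category_of[b])
    return different / len(pairs)


# Assumed 10-item catalog and three users' recommendation lists.
catalog = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
lists = [["a", "b", "c"], ["a", "b", "d"], ["a", "c", "d"]]
category_of = {"a": "news", "b": "news", "c": "sports", "d": "music"}

coverage = catalog_coverage(lists, catalog)                   # 4 of 10 items = 0.4
diversity = intra_list_diversity(lists[0], category_of)       # 2 of 3 pairs differ
```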

Regression and Generation Metrics

MAE / RMSE: For continuous predictions (price estimates, time predictions). Mean Absolute Error is more interpretable; RMSE penalizes large errors more heavily.
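
The difference between the two error metrics shows up when errors are uneven. In this sketch, two hypothetical delivery-time models have the same MAE, but one concentrates all of its error in a single bad miss.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: squares each miss first, so large misses dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Assumed actual delivery times in minutes, and two models' predictions.
actual = [30, 30, 30, 30]
steady = [35, 25, 35, 25]  # always off by 5 minutes
spiky = [30, 30, 30, 50]   # perfect three times, off by 20 once

steady_scores = (mae(actual, steady), rmse(actual, steady))  # (5.0, 5.0)
spiky_scores = (mae(actual, spiky), rmse(actual, spiky))     # (5.0, 10.0)
```

If one 20-minute miss is worse for users than four 5-minute misses, RMSE is the metric that reflects it.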

BLEU / ROUGE: For text generation tasks. Both measure n-gram overlap between generated and reference text. BLEU is precision-oriented and most common in translation; ROUGE is recall-oriented and most common in summarization. Note that a high score does not guarantee high user satisfaction; these metrics are proxies.
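
To show why overlap metrics are only proxies, here is a deliberately simplified unigram-recall score in the spirit of ROUGE-1. It is not the official ROUGE implementation (no stemming, no multiple references, whitespace tokenization only), and the sentences are invented.

```python
from collections import Counter


def unigram_recall(reference, candidate):
    """Share of reference words that also appear in the candidate (counts clipped)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(count, cand_counts[token]) for token, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


reference = "the cat sat on the mat"
close_paraphrase = "the cat lay on the mat"
scrambled = "mat the on sat cat the"

paraphrase_score = unigram_recall(reference, close_paraphrase)  # 5 of 6 words match
scrambled_score = unigram_recall(reference, scrambled)          # 6 of 6 words match
```

The scrambled sentence scores higher than the sensible paraphrase because unigram overlap ignores word order. That is the kind of gap between metric and user experience a PM should expect from these proxies.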

Layer 2: Business and User Metrics

Model quality metrics are internal. Users do not experience precision and recall — they experience outcomes.

Map each model quality metric to a user outcome:

Better spam precision → fewer false blocks → fewer support tickets from "my email disappeared"
Better recall for recommendations → users find content they love → higher session time and return rate
Lower latency in a prediction API → faster page load → higher conversion

Apply the HEART and AARRR frameworks to AI features the same way you would for any other product feature. The AI layer is invisible to users; what they experience is the product surface.

Watch for divergence signals:

Model accuracy improves, but user satisfaction drops → the model is optimizing for the wrong objective

Click-through on recommendations increases, but session quality drops → the model is triggering curiosity without delivering satisfaction (clickbait optimization)

Model loss decreases, but business revenue stagnates → the model metric is not well-aligned to business value
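
A lightweight way to catch these divergence signals in an experiment readout is to compare the direction of the model metric with the direction of each business metric. The sketch below is hypothetical: the metric names, deltas, and the `divergence_flags` helper are assumptions, and it ignores statistical significance, which a real analysis would need.

```python
def divergence_flags(model_delta, business_deltas, tolerance=0.0):
    """Names of business metrics that fell while the model metric improved."""
    if model_delta <= tolerance:
        return []  # the model metric did not improve, so there is no divergence to flag
    return sorted(name for name, delta in business_deltas.items() if delta < -tolerance)


# Assumed treatment-minus-control deltas from an A/B test.
flags = divergence_flags(
    model_delta=0.03,  # accuracy went up
    business_deltas={"csat": -0.05, "session_length": 0.01},
)
# flags == ["csat"]
```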

Feedback Loops and Data Flywheel

AI products often create self-reinforcing feedback loops. Understand both the upside and the risk.

Positive flywheel: More users → more interaction data → better model → better experience → more users.

Harmful feedback loop (filter bubble): Recommendations drive clicks → model learns to recommend similar items → users see less diversity → engagement narrows → eventual churn. This loop is self-reinforcing just like the flywheel, but it compounds in the wrong direction.

Popularity bias: If the model only trains on what users click, it will over-recommend already-popular items and under-surface new or niche content. Counter this with explicit exploration signals (serve some random or diverse recommendations intentionally).
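
One common way to add exploration is an epsilon-greedy policy: most slots go to the model's top picks, and a small fraction go to random catalog items. The sketch below is one possible version under stated assumptions: `ranked_items` is the model's ordering, every ranked item is in `catalog`, and `epsilon` is a tunable exploration rate. None of these names come from a specific library.

```python
import random


def recommend_with_exploration(ranked_items, catalog, k, epsilon, rng):
    """Fill k slots; each slot is a random unseen catalog item with probability epsilon."""
    chosen = []
    for _ in range(k):
        unseen_ranked = [item for item in ranked_items if item not in chosen]
        unseen_catalog = [item for item in catalog if item not in chosen]
        if not unseen_catalog:
            break  # nothing left to recommend
        if rng.random() < epsilon or not unseen_ranked:
            chosen.append(rng.choice(unseen_catalog))  # explore
        else:
            chosen.append(unseen_ranked[0])  # exploit the model's best remaining pick
    return chosen


catalog = ["a", "b", "c", "d", "e", "f", "g", "h"]
ranked_items = ["a", "b", "c", "d"]  # assumed model output, most relevant first
rng = random.Random(0)  # seeded so the example is repeatable

pure_exploit = recommend_with_exploration(ranked_items, catalog, k=3, epsilon=0.0, rng=rng)
pure_explore = recommend_with_exploration(ranked_items, catalog, k=3, epsilon=1.0, rng=rng)
```

In practice epsilon is small, and the exploration slots double as a source of less biased training data because those items were not chosen by the model.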

Monitoring and Model Drift

Unlike traditional software, AI models degrade silently over time as the real world changes.

Data drift: The distribution of input data shifts (e.g., users start behaving differently after a major product change or world event). The model was not trained on this distribution and begins making worse predictions.
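
Data drift can be quantified by comparing the distribution of a feature at training time with its distribution in live traffic. The sketch below uses the Population Stability Index (PSI), one of several possible drift measures. The bin edges and data are invented, and the thresholds in the comments are a commonly cited rule of thumb rather than a standard.

```python
import math


def bin_proportions(values, bin_edges, eps=1e-6):
    """Share of values in each bin; the last bin includes its upper edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            below_upper = v <= bin_edges[i + 1] if is_last else v < bin_edges[i + 1]
            if bin_edges[i] <= v and below_upper:
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    return [max(c / total, eps) for c in counts]  # eps avoids log(0) for empty bins


def psi(expected, actual, bin_edges):
    """Population Stability Index between a baseline sample and a live sample."""
    e = bin_proportions(expected, bin_edges)
    a = bin_proportions(actual, bin_edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


bin_edges = [0, 25, 50, 75, 100]
training = [10] * 25 + [30] * 25 + [60] * 25 + [90] * 25  # evenly spread at training time
live = [10] * 70 + [30] * 10 + [60] * 10 + [90] * 10      # live traffic skews low

no_drift = psi(training, training, bin_edges)  # 0.0
drift = psi(training, live, bin_edges)
# Rule of thumb (an assumption, tune for your product): below 0.1 is stable,
# 0.1 to 0.25 is worth watching, above 0.25 suggests a significant shift.
```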

Concept drift: The relationship between inputs and the right output changes (e.g., what counts as spam evolves as spammers adapt).

PM responsibility: Define monitoring dashboards before launch. Set alerts for when model quality metrics drop below acceptable thresholds. Schedule regular retraining cycles rather than reacting only to outages.
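
The alerting idea can be as simple as a table of minimum acceptable values agreed before launch. The sketch below is a hypothetical check, not a real monitoring system's API; the metric names and threshold values are placeholders.

```python
def check_alerts(current_metrics, min_thresholds):
    """Return {metric: (current_value, minimum)} for every metric below its minimum."""
    return {
        name: (value, min_thresholds[name])
        for name, value in current_metrics.items()
        if name in min_thresholds and value < min_thresholds[name]
    }


# Assumed thresholds agreed with the team before launch.
min_thresholds = {"precision": 0.90, "recall": 0.70}
this_week = {"precision": 0.91, "recall": 0.62}

alerts = check_alerts(this_week, min_thresholds)
# alerts == {"recall": (0.62, 0.70)}
```

The code is trivial on purpose. The hard part is the PM work of agreeing on which metrics get a floor and what each floor should be before the model ships.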

Common Mistakes to Avoid

Reporting only model accuracy and ignoring business outcomes
Ignoring the precision-recall tradeoff and its user-facing implications
Optimizing engagement metrics without measuring downstream satisfaction or harm
Not accounting for data drift — assuming a model stays accurate forever
Conflating high model performance on a test set with good real-world performance
Missing feedback loop risks — especially filter bubbles in recommendation systems
