System Design Framework

How to structure a system design answer, reason about trade-offs, and communicate technical depth as a PM.

What PMs Are Expected to Know

System design questions for PMs are not the same as for engineers. You will not be asked to implement a consistent hashing algorithm or write distributed transaction code. You are expected to:

Structure a system at a high level and explain how the pieces fit together
Identify the right trade-offs and explain the reasoning
Speak credibly with engineers about scale, reliability, and architecture decisions
Demonstrate that you can translate user and product requirements into technical constraints

The signal interviewers look for: can this PM hold a whiteboard session with their engineering team? Can they spot when a technical decision has product implications?

The 5-Step Approach

Step 1: Clarify Requirements

Do not start by drawing boxes. Spend 2-3 minutes establishing scope first.

Functional requirements — what the system must do:

What are the core use cases? (e.g., "users can post a photo and follow other users")

What is explicitly out of scope?

Is the workload read-heavy, write-heavy, or balanced?

Non-functional requirements — how the system must perform:

Scale: how many users? daily active users? requests per second?

Latency: is this real-time (chat, trading) or can it tolerate seconds of delay (batch reports)?

Availability: what is the acceptable downtime? 99.9% (~8.8 hours/year) vs. 99.99% (~53 minutes/year)?

Consistency: must all users see the same data at the same time, or is eventual consistency acceptable?

Durability: what happens if data is lost?
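
The downtime figures attached to availability targets are simple arithmetic, and it can help to show the conversion. A minimal sketch, assuming a 365-day year; the function name is illustrative only:

```python
# Convert an availability target into allowed downtime per year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return how many hours per year a system may be down at this target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why moving from 99.9% to 99.99% is a product decision as much as an engineering one.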

Establish your scale envelope upfront. A system for 10,000 users is designed very differently from one for 100 million.

Step 2: Estimate Capacity

Back-of-the-envelope math shows the interviewer you understand the scale of the problem. You do not need precision — order-of-magnitude thinking is the goal.

Useful reference points:

1 million requests/day ≈ 12 requests/second

1 billion requests/day ≈ 12,000 requests/second

A tweet: up to 280 characters, so roughly 280 bytes of plain ASCII text

A photo: ~1-5 MB; a thumbnail: ~10-50 KB

A minute of compressed video: ~10-100 MB, depending on resolution and bitrate (1080p at 5 Mbps is about 38 MB)

1 TB = 1,000 GB = 1,000,000 MB

Estimate storage: daily writes × average object size × retention period

Estimate throughput: daily active users × average requests per user per day ÷ 86,400 seconds

State your assumptions out loud. "I'll assume 50M DAU, each making 20 read requests per day, which gives us roughly 12,000 read requests per second."
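
The two estimation formulas above can be written out as a short script. This is a sketch only: the 50M DAU and 20 requests per user come from the example sentence, while the photo upload volume, object size, and retention period are made-up assumptions chosen for illustration.

```python
SECONDS_PER_DAY = 86_400


def requests_per_second(daily_active_users: int, requests_per_user_per_day: float) -> float:
    """Throughput: DAU x requests per user per day / seconds per day."""
    return daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY


def storage_bytes(daily_writes: int, avg_object_bytes: int, retention_days: int) -> int:
    """Storage: daily writes x average object size x retention period."""
    return daily_writes * avg_object_bytes * retention_days


# From the text: 50M DAU, 20 read requests per user per day.
rps = requests_per_second(50_000_000, 20)

# Hypothetical assumption: 10M photo uploads/day at 2 MB each, kept for one year.
total = storage_bytes(10_000_000, 2_000_000, 365)

print(f"~{rps:,.0f} read requests/second")
print(f"~{total / 1e15:.1f} PB of photo storage per year")
```

Rounding 11,574 up to "roughly 12,000" is fine in an interview; the point is landing on the right order of magnitude.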

Step 3: Design the High-Level Architecture

Start with a simple diagram. Add complexity only when you can justify it with a requirement.

Core building blocks:

Clients (web, mobile, IoT) send requests to your system. Think about whether clients need a native app or a web interface, and whether offline support matters.

Load Balancer distributes incoming requests across multiple servers. Prevents any single server from becoming a bottleneck. Also handles health checks and automatic failover.

Application Servers (the API layer) contain your business logic. Should be stateless — any server can handle any request. Stateless servers scale horizontally: just add more.

Databases store persistent data. The biggest design decision in most systems.

Cache (Redis, Memcached) stores frequently read data in memory for fast retrieval. Dramatically reduces database load for read-heavy systems.

Message Queue (Kafka, SQS, RabbitMQ) decouples producers from consumers. Enables async processing, absorbs traffic spikes, and buffers work so that it is not dropped when a downstream service is slow or briefly unavailable.

CDN (Content Delivery Network) serves static assets (images, video, JS) from servers geographically close to the user. Essential for media-heavy products.

Object Storage (S3, GCS) stores large blobs — images, videos, files — cheaply and durably. Not a database; not meant for transactional queries.

Search Index (Elasticsearch) enables full-text and fuzzy search. Relational databases handle exact-match lookups well but are weak at relevance-ranked text search, so maintain a separate search index and sync it asynchronously.

Step 4: Dive Into Key Trade-offs

Pick 2-3 areas where the design involves a meaningful trade-off and explain your reasoning.

SQL vs. NoSQL

SQL (Postgres, MySQL): Strong consistency, ACID transactions, flexible queries, joins. Best for structured relational data with complex query patterns.

NoSQL (DynamoDB, MongoDB, Cassandra): Horizontal scalability, flexible schema, high write throughput. Best for simple access patterns at massive scale.

Rule of thumb: Start with SQL. Move to NoSQL when you have a concrete scale or schema-flexibility problem that SQL cannot handle.

Caching strategy

Cache-aside (lazy loading): Application checks cache first; on miss, fetches from DB and populates cache. Simple and resilient, but first requests are slow.
Write-through: Write to cache and DB simultaneously. Keeps cache warm but adds write latency.
TTL: Every cached entry expires after a set time. This bounds how stale data can get, but can cause a thundering herd: when many popular entries expire together, the resulting cache misses all hit the database at once.
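
To make cache-aside and TTL concrete, here is a minimal in-memory sketch. It assumes a plain dictionary in place of Redis or Memcached, and `load_from_db` is a hypothetical stand-in for a real database query. The random jitter added to each TTL is one common way to reduce the thundering-herd effect, not the only one.

```python
import random
import time


class CacheAside:
    """Cache-aside with a per-entry TTL. Illustrative only, not production code."""

    def __init__(self, load_from_db, ttl_seconds=3600, jitter_seconds=300,
                 clock=time.monotonic):
        self._load = load_from_db      # hypothetical database lookup function
        self._ttl = ttl_seconds
        self._jitter = jitter_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._entries = {}             # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]            # cache hit
        value = self._load(key)        # miss or expired: go to the database
        # Jitter spreads expirations so hot keys do not all expire together.
        expires_at = self._clock() + self._ttl + random.uniform(0, self._jitter)
        self._entries[key] = (value, expires_at)
        return value


db_calls = []


def fake_db(key):
    db_calls.append(key)
    return f"profile:{key}"


cache = CacheAside(fake_db)
cache.get("u1")   # miss: reads from the database and populates the cache
cache.get("u1")   # hit: served from memory
print(f"database reads: {len(db_calls)}")
```

The first request for any key pays the database cost, which is the "first requests are slow" trade-off described above.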

Synchronous vs. asynchronous processing

Sync: User waits for the full response. Simple, but the user is blocked by every downstream service call.
Async (via message queue): User gets an immediate acknowledgment; processing happens in the background. Better for tasks that are slow, unreliable, or can be retried, but it adds complexity, and the user may briefly see results that are not yet complete (eventual consistency).

Example: When a user posts a photo on Instagram, the upload itself is synchronous. But generating thumbnails, running content moderation, and pushing to followers' feeds all happen asynchronously via queues.
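
The photo example can be sketched with an in-process queue standing in for Kafka or SQS. Everything here is an assumption made for illustration: the task names, the `upload_photo` function, and the single background worker are not how any particular product implements it.

```python
import queue
import threading

jobs = queue.Queue()   # stand-in for a real message queue
processed = []         # records what the background worker has done


def worker():
    """Background consumer: handles jobs until it sees the None sentinel."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        task, photo_id = job
        processed.append((task, photo_id))  # placeholder for the real work
        jobs.task_done()


def upload_photo(photo_id):
    """Synchronous path: accept the upload, enqueue slow work, return at once."""
    for task in ("thumbnail", "moderation", "feed_fanout"):
        jobs.put((task, photo_id))
    return {"status": "accepted", "photo_id": photo_id}


consumer = threading.Thread(target=worker, daemon=True)
consumer.start()

response = upload_photo("p1")   # the user gets this without waiting on the tasks
jobs.join()                     # demo only: wait so the results can be printed
jobs.put(None)
consumer.join()

print(response)
print(processed)
```

The user-facing response does not depend on thumbnails or moderation finishing, which is what lets those steps be slow or retried without blocking the upload.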

CAP Theorem

In a distributed system, you can guarantee at most two of three properties:

Consistency (C): Every read sees the most recent write
Availability (A): Every request gets a response (not an error)
Partition tolerance (P): The system continues to work when network partitions occur

Since network partitions are a physical reality, real distributed systems must tolerate them. The practical choice during a partition is therefore between CP (consistent but may reject requests) and AP (always responds but may return stale data).

PM translation: For most consumer products, eventual consistency is acceptable. A user seeing a slightly stale follower count is not a crisis. But for payments or inventory, strong consistency is required — showing someone a product as "in stock" when it is not has real business consequences.

Step 5: Address Reliability and Failure

A system that cannot handle failures is not production-ready. Walk through how your design handles the most important failure modes.

Redundancy: Every critical component should have a backup. No single points of failure.

Multiple application servers behind a load balancer

Database replication (a primary plus one or more replicas that can serve reads and be promoted if the primary fails)

Multi-region deployment for systems that cannot tolerate regional outages

Graceful degradation: When a component fails, the system should degrade gracefully rather than collapsing entirely.

If the recommendations service is down, show popular items instead

If the cache is down, fall back to the database (slower, but functional)

If a third-party API is unavailable, return a cached or default response
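
The fallbacks above share one shape: try the preferred path, and on failure return something simpler but still useful. A minimal sketch for the recommendations case, where `POPULAR_ITEMS` and the `fetch_personalized` callable are hypothetical:

```python
# Hypothetical precomputed fallback list, refreshed by some offline job.
POPULAR_ITEMS = ["item-1", "item-2", "item-3"]


def get_recommendations(user_id, fetch_personalized):
    """Return personalized items, or popular items if the service is unreachable."""
    try:
        return {"items": fetch_personalized(user_id), "degraded": False}
    except (ConnectionError, TimeoutError):
        # Degrade instead of failing the whole page.
        return {"items": POPULAR_ITEMS, "degraded": True}


def healthy_service(user_id):
    return [f"for-{user_id}-a", f"for-{user_id}-b"]


def broken_service(user_id):
    raise ConnectionError("recommendations service is down")


print(get_recommendations("u1", healthy_service))
print(get_recommendations("u1", broken_service))
```

The `degraded` flag is an illustrative addition: it lets the product decide whether to tell the user, and lets monitoring count how often the fallback is served.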

Rate limiting: Protect your system from abusive clients and traffic spikes. Implement at the API gateway level.
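
The text does not name a rate-limiting algorithm. A token bucket is one widely used option, sketched below under that assumption: each client gets a bucket that refills at a steady rate and permits short bursts up to its capacity. The class name and parameters are illustrative.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter for a single client. Illustrative sketch."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second    # tokens added per second
        self.capacity = capacity       # maximum burst size
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock             # injectable so behavior can be tested
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it is throttled."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate_per_second=5, capacity=10)
results = [bucket.allow() for _ in range(12)]
print(f"allowed {sum(results)} of {len(results)} burst requests")
```

In a real deployment the bucket state would typically live in a shared store so that every gateway instance sees the same counts; that part is omitted here.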

Circuit breaker: If a downstream service starts failing, stop sending requests to it rather than letting failures cascade. After a timeout, try again with a small percentage of traffic.
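
A circuit breaker can be sketched as a small state machine wrapped around a downstream call. This version is a simplification: after the timeout it lets a single trial request through rather than a percentage of traffic, and the threshold and timeout values are arbitrary assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the downstream call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so timeouts can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("downstream call skipped")
            # Timeout elapsed: half-open, so this one trial call goes through.
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)
print(breaker.call(lambda: "ok"))
```

Callers would catch `CircuitOpenError` and serve a fallback, which is where this pattern meets graceful degradation.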

Data durability: For critical data, use replication and backups. Define your Recovery Point Objective (RPO: how much data loss is acceptable) and Recovery Time Objective (RTO: how long the system can be down).

Common PM System Design Questions

"Design a URL shortener (bit.ly)" Core insight: read-heavy (redirects massively outnumber writes), needs low latency, globally distributed. Key decisions: hash function for short codes, caching hot URLs, handling custom aliases.

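For the URL shortener, one possible way to derive short codes is to hash the long URL and encode part of the digest in base 62. This is a sketch under assumptions, not how bit.ly works: the 7-character length, the SHA-256 choice, and the function names are all illustrative, and truncation means collisions are possible and must be handled.

```python
import hashlib
import string

# 62 URL-safe characters: 0-9, a-z, A-Z
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def base62(n: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def short_code(url: str, length: int = 7) -> str:
    """Derive a fixed-length code from the URL's SHA-256 digest."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    n = int.from_bytes(digest[:8], "big")   # use the first 64 bits
    # Pad in case the encoding is short, then truncate to the target length.
    return base62(n).rjust(length, ALPHABET[0])[:length]


code = short_code("https://example.com/some/very/long/path")
print(code)
```

Seven base-62 characters give about 3.5 trillion possible codes. A real design would check the datastore for an existing code before saving and would store custom aliases in the same keyspace.
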
"Design a notification system" Core insight: async by nature, must handle multiple channels (push, email, SMS), needs deduplication, rate limiting per user, and priority queuing (transactional > marketing).

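Two of the notification requirements, deduplication and priority queuing, fit in a short sketch. It assumes an in-memory heap in place of a real queue, and the `dedup_key` parameter and class name are hypothetical. Per-user rate limiting and channel routing are left out.

```python
import heapq
import itertools

PRIORITY = {"transactional": 0, "marketing": 1}  # lower number is sent first


class NotificationQueue:
    """Priority queue with per-user deduplication. Illustrative sketch."""

    def __init__(self):
        self._heap = []
        self._seen = set()               # (user_id, dedup_key) pairs already queued
        self._order = itertools.count()  # tie-breaker: FIFO within a priority

    def enqueue(self, user_id, kind, message, dedup_key):
        """Queue a notification; return False if it is a duplicate."""
        if (user_id, dedup_key) in self._seen:
            return False
        self._seen.add((user_id, dedup_key))
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._order), user_id, message))
        return True

    def next_to_send(self):
        """Pop the highest-priority notification, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, user_id, message = heapq.heappop(self._heap)
        return user_id, message


q = NotificationQueue()
q.enqueue("u1", "marketing", "Sale ends today", "promo-1")
q.enqueue("u1", "transactional", "Your code is 1234", "otp-1")
q.enqueue("u1", "marketing", "Sale ends today", "promo-1")  # duplicate, dropped
first = q.next_to_send()
second = q.next_to_send()
print(first)
print(second)
```

The transactional message is sent first even though it was queued second, which is the "transactional > marketing" ordering described above.
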
"Design a news feed (Twitter/Instagram)" Core insight: the fan-out problem — when a celebrity with 50M followers posts, do you push to all feeds immediately (fan-out on write) or compute feeds on read (fan-out on read)? Most systems use a hybrid: push for normal users, pull for celebrities.

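The hybrid fan-out approach for a news feed can be sketched with in-memory dictionaries. The follower threshold and all names here are assumptions for illustration, and sorting the merged feed by timestamp is omitted to keep the example short.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000  # hypothetical cutoff for "celebrity" accounts

followers = defaultdict(set)         # author -> ids of users who follow them
following = defaultdict(set)         # user -> authors they follow
feeds = defaultdict(list)            # user -> posts pushed at write time
celebrity_posts = defaultdict(list)  # author -> posts pulled at read time


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def publish(author, post, threshold=CELEBRITY_THRESHOLD):
    if len(followers[author]) >= threshold:
        # Fan-out on read: store once, merge in when followers load their feed.
        celebrity_posts[author].append(post)
    else:
        # Fan-out on write: copy the post into every follower's feed now.
        for follower in followers[author]:
            feeds[follower].append(post)


def read_feed(user):
    pulled = [post for author in following[user]
              for post in celebrity_posts.get(author, [])]
    return feeds[user] + pulled


# Demo with a tiny threshold so "star" counts as a celebrity.
follow("u1", "star")
follow("u2", "star")
follow("u1", "bob")
publish("bob", "b1", threshold=2)    # pushed to u1's feed
publish("star", "s1", threshold=2)   # stored once, pulled at read time
print(read_feed("u1"))
print(read_feed("u2"))
```

The trade-off is visible in the two branches: pushing costs one write per follower, while pulling costs extra work on every feed read.
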
"Design a ride-sharing system (Uber)" Core insight: real-time location matching under strict latency constraints. Key components: geospatial index for driver locations, matching algorithm, trip state machine, surge pricing engine.

Common Mistakes to Avoid

Jumping to a solution before clarifying requirements and scale
Designing for 1 billion users when the question is about an MVP
Adding complexity without justifying it with a concrete requirement
Forgetting failure modes — every good design addresses what happens when components fail
Being too vague about trade-offs — "we'd use a cache" is weak; "we'd use a cache-aside strategy with a 1-hour TTL to handle the read-heavy access pattern" is strong
Not knowing the difference between SQL and NoSQL well enough to defend a choice
Ignoring the CAP theorem implications of your data store choices
