# SDI Prep — Staff-level System Design Interview Preparation

> SDI Prep (sdi.ninja) is a free, staff-level system design interview preparation resource. Every problem is written to the depth of a staff-engineer interview: opening script, clarifying-question ladder, back-of-envelope math with implications, naive→better→best deep dives with concrete numbers, production failure modes, defense library for live grilling, animated API flow diagrams, and company-specific round breakdowns.

Audience: senior / staff / principal software engineers preparing for high-bar system design interviews at FAANG, fintech (Stripe, PayPal, PayPay), ride-sharing (Uber, Lyft), and large-scale product companies.

Authored by a working staff engineer. Content is opinionated — every "pick when" names a concrete regime (RPS, geography, blast radius) rather than "when scale is high".

## How this site is organized

- **Problems** — 64 full system design problems, each with 12+ sections (Opening, Clarifying, Functional, Non-functional, Estimation, Entities, API, HLD, Architecture Diagrams, Deep Dives, Level Bar, Traps, Flashcards, AI Practice, Cue Cards, Defenses, Interview Rounds).
- **Patterns** — 7 cross-cutting design patterns referenced by the problems (scaling reads/writes, real-time updates, handling large blobs, managing long tasks, dealing with contention, multi-step processes).
- **Concepts** — 5 foundational concepts (CAP theorem, consistent hashing, caching, sharding, numbers to know).
- **Key Technologies** — 9 deep dives on the core infrastructure (Kafka, Redis, Cassandra, DynamoDB, PostgreSQL, Elasticsearch, Flink, API Gateway, ZooKeeper).
- **Framework** — the 6-step delivery framework used across every problem.

## Problems

- [Ad Click Aggregator (TikTok / Meta Ads)](https://sdi.ninja/problems/ad-click-aggregator): Ingest 170K ad clicks/sec, every click is money, dashboard freshness <1 min, 1-year retention.
The canonical (hard)
- [Amazon Fulfillment Center (Warehouse Mgmt)](https://sdi.ninja/problems/amazon-fulfillment-center): Operate 1000+ FCs running Kiva robots, human pickers, and handheld scanners. Every item movement is an event; SLAs drive outbound waves; partition-tolerant on FC↔corp link. (hard)
- [Amazon Locker Pickup Network](https://sdi.ninja/problems/amazon-locker-pickup): Self-service locker network: route packages, generate OTP/QR pickup, expire and return-to-sender, IoT door control with offline-tolerant cached tokens. (medium)
- [AML Transaction Monitoring (PayPal / Wise / Coinbase-style)](https://sdi.ninja/problems/aml-transaction-monitoring): Surveil all transactions for structuring / layering / integration patterns, generate alerts, run case management, file SARs to FinCEN — graph-aware, regulator-ready, with immutable audit and Chinese walls. (hard)
- [AWS Lambda (Design the serverless platform)](https://sdi.ninja/problems/aws-lambda-serverless): Run untrusted code in 100ms cold starts and 1ms billing. Firecracker microVMs as the isolation primitive, snapshot-restore to crush Java/.NET cold starts, predictive warm pools, event-source mappings as concurrency-bounded pull workers — and the very real failure modes of recursive billing runaways and ENI exhaustion. (hard)
- [Card Issuing Platform (Stripe Issuing / Marqeta)](https://sdi.ninja/problems/card-issuing-platform): Issue virtual + physical cards on a sponsor bank BIN, decision authorization webhooks within the network\ (hard)
- [E-commerce Cart → Order Checkout](https://sdi.ninja/problems/cart-checkout): The full BAU pipeline that turns a cart into a fulfilled order: cart ↔ order ↔ inventory ↔ payment ↔ fulfillment as a saga, with two-phase inventory reservation and end-to-end idempotency. Distinct from a cart-only design (no order saga) and a flash-sale (no BAU choreography).
(hard)
- [Case Management / Ticketing (Jira / ServiceNow)](https://sdi.ninja/problems/case-management): Multi-tenant ticketing with custom fields, configurable workflows, SLA timers, ACL-aware search, and webhook integrations. The B2B-SaaS platform that lives or dies by extensibility + tenant isolation. (medium)
- [Co-Working Space / Desk Booking](https://sdi.ninja/problems/co-working-booking): Reserve desks/rooms across global spaces with no double-booking, recurring bookings via RRULE, calendar sync, and a 10K booking/sec lunch-hour spike. (medium)
- [Distributed Cache (Memcached / Redis-like)](https://sdi.ninja/problems/distributed-cache): 1M ops/sec, sub-ms p99, 1TB across a cluster. The classic scaling-reads problem — consistent hashing, virtual nodes, replication, eviction, and the very real tax of hot keys, cluster membership, and cache stampede. (hard)
- [Dropbox / Google Drive (File Sync & Storage)](https://sdi.ninja/problems/dropbox): Files synchronized across all of a user\ (hard)
- [DynamoDB Internals (Design DynamoDB itself)](https://sdi.ninja/problems/dynamodb-internals): Build the service, not a system on top of it. Single-digit-ms reads at any scale via consistent-hashed partitions, leader-based replication, adaptive splits for hot keys, and async global tables — all behind a flat HTTP API. (hard)
- [Facebook Live Comments](https://sdi.ninja/problems/fb-live-comments): During a live stream, every viewer\ (medium)
- [Facebook Post Search](https://sdi.ninja/problems/fb-post-search): Search across 500B posts, p99 <500ms, ranked by relevance + recency + engagement, personalized by social graph. The canonical (hard)
- [Flash Sale System](https://sdi.ninja/problems/flash-sale): 100M users hit (hard)
- [Gmail / Email System](https://sdi.ninja/problems/gmail): 2B mailboxes, 250B emails/day, 15GB+ per user, full-text search across the body, threading by Message-ID, and a spam pipeline that has to win against an adversary.
The canonical (hard)
- [Google Docs (Real-time Collaborative Editor)](https://sdi.ninja/problems/google-docs): 50 collaborators on one document, every keystroke replicated in <50ms, no conflicts, durable forever. The canonical OT-vs-CRDT problem with a stateful sticky-session twist. (hard)
- [Google Maps](https://sdi.ninja/problems/google-maps): Render a global map at any zoom level, route between any two points in <500ms, ingest anonymized GPS pings to compute live traffic, and serve 1M tile requests/sec at peak. The proximity-based-services problem at planet scale. (hard)
- [Instagram](https://sdi.ninja/problems/instagram): 2B users, 500M DAU, 100M photos/day uploaded, 1B feed reads/sec at peak — a multi-source feed that merges followed posts (fanout-on-write) with recommended Reels (pull/ranker) under a 100ms feed SLO. (hard)
- [Distributed Job Scheduler (Airflow / Quartz / Temporal)](https://sdi.ninja/problems/job-scheduler): 1M jobs/day, cron + one-off, exactly-once execution, jobs that run for hours, isolated per job. The classic managing-long-tasks problem — leader election, durable state, idempotent retries, DAGs. (hard)
- [Distributed Key-Value Store (Dynamo-like)](https://sdi.ninja/problems/key-value-store): Petabyte-scale, 1M+ ops/sec, tunable consistency (R + W vs N), partition-tolerant KV store across multiple regions — the staff-bar reference design that ties consistent hashing, quorums, vector clocks, anti-entropy, and gossip into one coherent system. (hard)
- [Kindle Whispersync (Cross-Device Reading State)](https://sdi.ninja/problems/kindle-whispersync): Sync read position, bookmarks, highlights, and notes across Kindle, phone, web Cloud Reader. Vector-clock per device, CRDT for highlights, LWW for last-position, offline-tolerant.
(hard)
- [KYC / Identity Verification (PayPal, Wise, Coinbase-style)](https://sdi.ninja/problems/kyc-identity-verification): Onboard millions of customers with document upload + OCR + liveness + sanctions/PEP screening — multi-step orchestration with vendor fallback, manual-review queues, tiered KYC tied to product limits, and per-region data residency. (hard)
- [Last-Mile Delivery Routing (DSP / Logistics)](https://sdi.ninja/problems/last-mile-delivery-routing): Dispatch and dynamically re-route 50K+ packages across 5K+ DSP drivers per metro per day. VRP with time windows + capacity, H3 driver shards, ETA model, in-day rebalancing on disruption. (hard)
- [LeetCode (Code Execution + Judging)](https://sdi.ninja/problems/leetcode): User submits arbitrary code; we compile it, run it inside an adversarial sandbox against hidden test cases, and return verdict in <30s. The canonical (medium)
- [Local Delivery (GoPuff / Gorillas)](https://sdi.ninja/problems/local-delivery): 10M users, 100K orders/day per region, hyperlocal sub-30min delivery, per-warehouse inventory, p99 order-to-confirm <2s. The classic CP-for-inventory + AP-for-browse split with a courier dispatch on top. (medium)
- [Merchant Payout System (PayPal Mass Pay / Stripe Connect)](https://sdi.ninja/problems/merchant-payout-system): Daily/weekly payouts to millions of sellers across rails (ACH, SEPA, instant push-to-card, wire) with hold-back reserves, return-code handling, and year-end tax-form generation. Money moves outbound from a platform balance — every payout is an invariant: one-payout-per-period, ledger-first, never pay out more than the seller earned. (hard)
- [Metrics Monitoring (Datadog / Prometheus-like)](https://sdi.ninja/problems/metrics-monitoring): 5M metrics/sec ingest from 500K servers, 1-year retention with downsampling, p99 query <500ms, alerting on derived expressions. The classic scaling-writes problem with cardinality as the silent killer.
(hard)
- [Money Request & Split Bill (Venmo Request / Splitwise)](https://sdi.ninja/problems/money-request-split-bill): Pending requests, group expenses, debt-graph minimization, idempotent settle-up that converts a request into a real P2P transfer — managing-state and multi-step over an event-sourced request lifecycle, with abuse defenses against catfishing/extortion serial requesters. (medium)
- [Multi-Currency Wallet (Wise-style)](https://sdi.ninja/problems/multi-currency-wallet): One account holding 50+ currency balances with per-currency local account details (UK sort code, US ACH, EUR IBAN). In-wallet conversion as 2 ledger entries + FX revenue capture. Per-currency rounding policy, hot-account contention via sub-ledger sharding, idempotent conversion API with quote-id, daily reconciliation against partner banks per currency. (hard)
- [News Aggregator (Google News)](https://sdi.ninja/problems/news-aggregator): Ingest articles from 50K sources, dedupe stories that span outlets via embedding similarity, classify into topics, and serve a personalized fresh feed at 10K reads/sec with <500ms p99. The deceptively-simple read-heavy + ML-clustering problem. (medium)
- [News Feed (Facebook / Twitter)](https://sdi.ninja/problems/news-feed): Followers see fresh, ranked posts from people they follow within ~1s. Hybrid fanout, hot-key absorption, and a 2-stage ML ranker — the canonical staff-level read/write asymmetry problem. (hard)
- [Notification System (Push / Email / SMS)](https://sdi.ninja/problems/notification-system): Any service can fire an event; the platform fans out to push (APNS/FCM), email (SES), and SMS (Twilio) per-user preferences, with dedupe, retries, provider failover, and quiet hours — without melting under a 100K/sec breaking-news burst.
(medium)
- [Offline Payment Mode (PayPay / Alipay underground)](https://sdi.ninja/problems/offline-payment-mode): User in an underground bar or basement concert with no signal pays via pre-fetched signed offline tokens. Backend reconciles when phone reconnects — defends replay across two merchants, double-spend, balance-went-negative, time-skew. (hard)
- [Online Auction (eBay live bidding)](https://sdi.ninja/problems/online-auction): Sell a single item to the highest bidder before a hard deadline, with millions of watchers seeing every bid in <500ms — and survive last-second sniping where 1000+ bids arrive in the final second on one auction. (medium)
- [P2P Money Transfer (Venmo / Cash App / PayPal)](https://sdi.ninja/problems/p2p-money-transfer): User → user money movement with funding waterfall (wallet → debit → ACH), idempotent send across a double-entry ledger, instant transfer rails, social feed with privacy, and risk holds — Stripe-style obsession with the SUM=0 invariant against retries, race conditions, and bad actors. (hard)
- [Payment System (Stripe-style)](https://sdi.ninja/problems/payment-system): Authorize, capture, settle, and refund money across millions of merchants and billions of transactions/day — exactly-once, multi-currency, PCI-scoped, with sagas for the multi-step flow and a double-entry ledger as the source of truth. (hard)
- [Post-Purchase Customer Experience Program](https://sdi.ninja/problems/post-purchase-cx): 1B orders/year, automated review-request emails on delivery, NPS surveys at lifecycle milestones, sentiment analysis routes negative reviews to humans. Event-driven scheduling at scale with anti-spam and anti-fake-review controls.
(medium)
- [Price Tracking (CamelCamelCamel)](https://sdi.ninja/problems/price-tracking): Poll 100M product pages every 30 minutes, store years of price history as time-series, and fire 1M alerts/day to users whose target price was hit — without scanning all alerts on every price update or alerting on flash drops. (medium)
- [QR-Based Mobile Payment (PayPay / Alipay / WeChat Pay)](https://sdi.ninja/problems/qr-payment-system): Cashier-in-line latency: 300ms-1s end-to-end for a QR scan to debit a wallet, push to merchant POS, and survive replay/spoofing/POS-flap. The canonical PayPay HLD round. (hard)
- [Distributed Rate Limiter](https://sdi.ninja/problems/rate-limiter): Allow N requests per identity per window across thousands of API instances; <10ms p99 check; CP enough to enforce paid tiers; survives a single hot key at 1M req/s. (hard)
- [Real-Time Fraud Detection (Stripe Radar / PayPal-style)](https://sdi.ninja/problems/realtime-fraud-detection): Score every transaction in <100ms (300ms inline budget) using streaming features + ML + rules; auto-decline high-risk, step-up auth on mid, allow with monitoring on low — without breaking the auth path. (hard)
- [Amazon Recommendation Engine](https://sdi.ninja/problems/recommendation-engine): (hard)
- [Robinhood / Stock Exchange](https://sdi.ninja/problems/robinhood): 20M users, 1M orders/sec at market open, p99 order-to-confirm <100ms, 100K price ticks/sec to broadcast — a price-time-priority matching engine where every order is a regulated, audit-logged, financial transaction. (hard)
- [S3 Object Storage (Design S3 internals)](https://sdi.ninja/problems/s3-object-storage): 11 nines durability for petabyte-scale immutable blobs at internet scale. Erasure-coded objects sharded across AZs, hash-prefixed metadata, multipart upload state machine, lifecycle tiering — and the strong-consistency overlay AWS finally added in 2020.
(hard)
- [Scalable Messaging Queue (Kafka-like, S3 Tiered)](https://sdi.ninja/problems/scalable-msg-queue): Distributed log-structured queue: 1M topics, 100K msg/s/topic peak, 7-day retention, exactly-once via transactions, S3 tiered storage for cold partitions. The hardest of the bunch — log-structured storage, leader/replica, consumer groups, and transactional semantics. (hard)
- [Search Autocomplete (Typeahead)](https://sdi.ninja/problems/search-autocomplete): As the user types, return top-K query suggestions in <100ms p99, refreshed daily from the query log, with personalization, typo tolerance, and a story for trending terms surging mid-day. (medium)
- [Amazon Shopping Cart](https://sdi.ninja/problems/shopping-cart): Persistent multi-device cart with sub-100ms add-to-cart at peak, anonymous → logged-in merge, soft inventory checks, and abandoned-cart recovery via async events. (medium)
- [Skill Scanner (Resume → Skill Extraction at Scale)](https://sdi.ninja/problems/skill-scanner): 1M resumes/day uploaded as PDF/DOCX, parsed → skills extracted via NER → normalized → matched to job postings via vector similarity. ~5s p99 end-to-end with bulk reprocess on model updates. (medium)
- [SQS Internals (Design AWS SQS, not a generic queue)](https://sdi.ninja/problems/sqs-internals): A managed message queue with visibility timeouts as the lock primitive, configurable redrive, FIFO with per-msg-group ordering and 5-minute dedup, and 120K in-flight per standard queue. Eventual ordering, at-least-once, AP-favoring. (hard)
- [Strava (activity tracking + segments)](https://sdi.ninja/problems/strava): Athletes upload GPX tracks, the system matches them against named segments, updates leaderboards, fans out to social feed, and supports live tracking + privacy zones. Geo + leaderboard + fanout + large-blob, all in one.
(medium)
- [Subscription Billing (Stripe Billing / Chargebee)](https://sdi.ninja/problems/subscription-billing): Recurring billing engine: subscription state machine (trial → active → past_due → canceled), proration on plan change, smart-retry dunning, coupon/tax/discount stack, immutable invoices, ASC 606 revenue recognition, and entitlement webhooks. The bar is one-invoice-per-period, never double-charge, never send 14 dunning emails, and produce auditable revenue numbers. (hard)
- [Ticketmaster (Live Event Ticketing)](https://sdi.ninja/problems/ticketmaster): Sell scarce, indivisible inventory (a specific Section-Row-Seat) to millions of simultaneous, motivated, sometimes adversarial buyers — without double-selling, melting the database, or letting bots win. The hardest contention problem in the canon. (very-hard)
- [Tinder (Geo-Based Dating / Swipe Matching)](https://sdi.ninja/problems/tinder): Show nearby profiles, swipe left/right at 18.5K writes/sec, detect mutual right-swipes as matches with no race conditions, push the match notification within seconds. The canonical proximity + write-throughput problem. (hard)
- [TrueCaller (Caller ID Lookup)](https://sdi.ninja/problems/true-caller): 5B phone numbers, sub-ms global lookup at incoming-call time, crowd-sourced names + spam tagging, plus a GDPR-compliant unlist path. The textbook scaling-reads + ethics problem. (medium)
- [Uber / Ola Rider Matching](https://sdi.ninja/problems/uber): Real-time geo-matching: 10M drivers pinging location, 1M ride requests/sec, sub-minute match SLA. The proximity-services flagship — H3, DISCO, Redis locks, Kafka durability, surge pricing, rider-driver state machine. (hard)
- [Unique ID Generator (Snowflake)](https://sdi.ninja/problems/unique-id-generator): Generate 64-bit unique, roughly time-sortable IDs at 1M+/sec across a fleet of stateless servers — no central counter, survives clock skew, fits in a long.
The canonical (medium)
- [URL Shortener (Bitly / TinyURL)](https://sdi.ninja/problems/url-shortener): Long URL → short code; redirect, analytics, custom aliases, expiration. The classic warm-up — read-heavy with one juicy ID-generation deep dive. (medium)
- [Wealth Management Platform (Mint / Personal Capital / Cred)](https://sdi.ninja/problems/wealth-management): Connect to brokers (Zerodha, Robinhood) + banks (Plaid) + cards + crypto, aggregate holdings, compute real-time net worth, history, alerts. Security + regulatory + cost are the differentiating concerns. (hard)
- [Web Crawler (Google-scale)](https://sdi.ninja/problems/web-crawler): Crawl 1B URLs/day, respecting robots.txt and politeness, deduping at hyperscale, and storing 10TB/day of raw HTML for downstream indexing. The canonical long-task + write-throughput + bounded-courtesy problem. (hard)
- [WhatsApp / Chat System](https://sdi.ninja/problems/whatsapp): 2B users, 200M concurrent persistent WebSockets, 4B msgs/day, <500ms p99 delivery, end-to-end Signal Protocol encryption, multi-device sync, group fanout, offline replay. The (very-hard)
- [Yelp / Nearby Restaurants](https://sdi.ninja/problems/yelp): Given a lat/lng + radius, return ranked nearby businesses in <500ms p99. Geo-indexing, multi-factor ranking, write-via-CDC, and a CDN-cached blob path — the canonical proximity-services problem. (medium)
- [YouTube / Amazon Prime Video](https://sdi.ninja/problems/youtube): Upload arbitrarily large videos, transcode to a ladder of resolutions, store globally, and stream over adaptive bitrate to billions. The canonical large-blob + long-task + read-scale problem. (hard)
- [YouTube Top K (Heavy Hitters)](https://sdi.ninja/problems/youtube-top-k): Across 1B videos and tens of millions of view events per second, return the top-100 most-viewed videos for the last hour / day, refreshed every minute, p99 read <100ms.
The canonical (hard)

## Patterns

- [Dealing With Contention (Concurrent Writes to the Same Resource)](https://sdi.ninja/patterns/dealing-with-contention)
- [Handling Large Blobs (Uploads, Files, Media)](https://sdi.ninja/patterns/handling-large-blobs)
- [Managing Long-Running Tasks (Async Job Processing)](https://sdi.ninja/patterns/managing-long-tasks)
- [Multi-Step Processes (Distributed Workflows & Sagas)](https://sdi.ninja/patterns/multi-step-processes)
- [Real-Time Updates (Push to Clients)](https://sdi.ninja/patterns/real-time-updates)
- [Scaling Reads (Read-Heavy Workloads)](https://sdi.ninja/patterns/scaling-reads)
- [Scaling Writes (High-Throughput Ingest)](https://sdi.ninja/patterns/scaling-writes)

## Concepts

- [Caching](https://sdi.ninja/concepts/caching)
- [CAP Theorem (and PACELC)](https://sdi.ninja/concepts/cap-theorem)
- [Consistent Hashing](https://sdi.ninja/concepts/consistent-hashing)
- [Numbers to Know](https://sdi.ninja/concepts/numbers-to-know)
- [Sharding](https://sdi.ninja/concepts/sharding)

## Key Technologies

- [API Gateway](https://sdi.ninja/tech/api-gateway)
- [Apache Cassandra](https://sdi.ninja/tech/cassandra)
- [Amazon DynamoDB](https://sdi.ninja/tech/dynamodb)
- [Elasticsearch](https://sdi.ninja/tech/elasticsearch)
- [Apache Flink](https://sdi.ninja/tech/flink)
- [Apache Kafka](https://sdi.ninja/tech/kafka)
- [PostgreSQL](https://sdi.ninja/tech/postgresql)
- [Redis](https://sdi.ninja/tech/redis)
- [Apache ZooKeeper](https://sdi.ninja/tech/zookeeper)

## Framework

- [6-Step Delivery Framework](https://sdi.ninja/framework/delivery)

## Optional

- [Sitemap XML](https://sdi.ninja/sitemap.xml)
- [Home](https://sdi.ninja/)